Durable Workflows for Agents (pause / resume / retry)
The path
Read. Master the vocabulary. Fire two hot-takes. Then write the pitch and draw the system. End-state: you speak this like it's native.
The brief
Agents that run for minutes or hours need the same guarantees backend jobs do: they must survive crashes, retry transient failures, resume from the last completed step, and stay observable. Durable workflow engines (Temporal, Inngest, Vercel Workflow) turn brittle long-running agents into crash-safe pipelines of idempotent steps.
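To make "resume from the last completed step" concrete, here is a toy sketch in Python. It is not how Temporal, Inngest, or Vercel Workflow are implemented. The `Workflow` class, the `TransientError` marker, and the in-memory `store` dict are all hypothetical stand-ins chosen for illustration, and a real engine would persist checkpoints to durable storage.

```python
import time
from typing import Any, Callable, Dict


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


class Workflow:
    """Toy step runner: finished steps are checkpointed and skipped on resume."""

    def __init__(self, checkpoints: Dict[str, Any]) -> None:
        # Assumption: a plain dict stands in for durable storage.
        self.checkpoints = checkpoints

    def step(self, name: str, fn: Callable[[], Any], max_attempts: int = 3) -> Any:
        if name in self.checkpoints:
            # Step already completed in an earlier run: reuse the saved result.
            return self.checkpoints[name]
        attempt = 0
        while True:
            attempt += 1
            try:
                result = fn()
                break
            except TransientError:
                if attempt >= max_attempts:
                    raise
                time.sleep(0.01 * 2 ** attempt)  # short exponential backoff
        self.checkpoints[name] = result
        return result


calls = {"search": 0, "summarize": 0}


def search() -> str:
    calls["search"] += 1
    return "3 sources"


def flaky_summarize() -> str:
    calls["summarize"] += 1
    if calls["summarize"] == 1:
        raise TransientError("rate limited")  # fails once, then succeeds
    return "summary of 3 sources"


def run_agent(store: Dict[str, Any]) -> str:
    wf = Workflow(store)
    wf.step("search", search)
    return wf.step("summarize", flaky_summarize)


store: Dict[str, Any] = {}
first = run_agent(store)   # retries the rate-limited step, then checkpoints it
second = run_agent(store)  # simulated restart: both steps replay from checkpoints
```

On the simulated restart neither function is called again, because both results are read back from `store`. The retry only ever re-runs the one step that failed, which is the property the brief asks for.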
- Durable engines add infra and a learning curve; overkill for sub-second agents.
- Forcing idempotency constrains how you write tool calls; not all APIs cooperate.
- Human-in-the-loop waits can hold state for days; storage and auth matter.
- Vendor lock-in risk with managed offerings; self-hosted Temporal is powerful but ops-heavy.
“Treat every long-running agent as a distributed system — because under load, it is.”
The system
Vocabulary gym
Idempotent step
A unit of work safe to re-run; identified by a deterministic key.
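One possible way to build a deterministic key, sketched in Python under stated assumptions: the key is a hash of the run ID, step name, and arguments, and a ledger records which keys have already produced a side effect. The names `step_key`, `run_once`, `ledger`, and `send_email` are hypothetical, and the dict stands in for a durable table.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List


def step_key(run_id: str, step_name: str, args: Dict[str, Any]) -> str:
    """Same run, step, and arguments always hash to the same key."""
    payload = json.dumps(
        {"run": run_id, "step": step_name, "args": args}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_once(ledger: Dict[str, Any], key: str, effect: Callable[[], Any]) -> Any:
    """Run the side effect only if this key has not been recorded yet."""
    if key not in ledger:
        ledger[key] = effect()
    return ledger[key]


sent: List[str] = []  # stands in for the outside world


def send_email() -> str:
    sent.append("report@example.com")
    return "sent"


ledger: Dict[str, Any] = {}
key_a = step_key("run-42", "notify", {"to": "report@example.com"})
key_b = step_key("run-42", "notify", {"to": "report@example.com"})

run_once(ledger, key_a, send_email)
run_once(ledger, key_b, send_email)  # a retry of the same step: no second email
```

Both calls compute the same key, so the email goes out once. This sketch does not close the gap between performing the effect and recording it: a crash in between would still re-run the effect, which is why an API that accepts an idempotency key itself is the safer option when one is available.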
Hot-takes
Two hot-takes. One sentence each. No hedging, no lists — just the sharpest answer you can land. The coach replies in seconds with a score and a tighter rewrite.
What breaks if you run a 20-step agent on a regular serverless function?
How do you make a tool call idempotent when the underlying API isn't?
The drill
Explain in 400–600 words why a reliable multi-step research agent needs a durable workflow runtime. Use one concrete failure story (tool timeout, crashed function, rate limit) and show how each layer — retry, checkpoint, human gate — saves the run.