The Human in the Loop: Durable Workflows for Long-Running Systems

I spent the last few years in FinTech, and the one thing nobody warns you about is how a "simple" process refuses to stay simple. The long waits, the human approvals, the back-and-forth, the retries: from the outside it's a button, and from the inside it's a multi-week negotiation between systems and people.

The lesson took a while to land: some systems are not applications. They're business processes that happen to use software. These are processes that run for a long time, pause on humans, lean on flaky external systems, and absolutely cannot end up half-finished. We reach for the tools we know by reflex (a status column, a queue, a couple of cron jobs) and then spend the next two years patching the gaps between them.

I'll use one running example to keep this concrete: invoice factoring, because I worked on it and know exactly where the bodies are buried. But the example is almost incidental. What matters is the shape of the problem, and once you can see the shape, you start spotting it everywhere.

The problem

Factoring, from the outside, is a button: a business uploads an invoice, money shows up. From the inside it's a KYB check, a credit pull, an underwriter who's at lunch, a back-and-forth over a missing document, an advance, then 45 days of waiting on a debtor who may or may not pay on time.

So you build it. One tidy function. Then the bureau starts timing out, so you add a retry. The underwriter takes three days, so the request can't block, so you add a status column and a queue. You need to find stuck applications, so you add a cron job to poll that column. The debtor takes 45 days, so that's another cron, another service, another table. Six months later that one function is five services, a pile of crons, a retry table, an eleven-value status enum, and a Slack channel called #whats-happening. Nobody can say where a given application is without running three queries.

That was roughly us. The real cost was never the code; it was the gaps between those five services, which nothing owned. Underwriters waited on applications and applications waited on underwriters, because no system told either side what needed attention next. We had no shared view of the work, so we pulled status with cron-driven reports and closed the gaps with manual follow-ups. Applications lost days sitting in handoffs nobody could see.

Plenty of tools promise to help: Zapier, n8n, Workato. They're great at automation, where a trigger fires, apps connect, and you're done. But this isn't automation. It's a transaction that runs for weeks, where a half-finished state means real money in the wrong place. For that you want durable workflows, and the one I reach for now is Temporal.

The claim that sounds fake

Here it is: your workflow code runs as if the process never crashes.

You write a normal function that calls an API, waits 30 days, calls another API, waits for a human to approve, and then continues. If the server dies on day 14, a different server picks it up and resumes exactly where it left off, every variable intact. You didn't write a state machine. You didn't persist anything. You wrote a function.

Temporal pulls this off by recording every step into an event history and replaying it to rebuild state whenever a worker resumes. Your code becomes the source of truth for the process. The status column, the polling cron, the retry table: gone, because the platform does all of it underneath.
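
If replay sounds hand-wavy, here's a toy model of the idea. To be clear, this is not Temporal's implementation, just a minimal sketch of record-and-replay under my own assumptions; ReplayContext, step, factorInvoice, and the hard-coded credit score are all names and values I made up for the illustration.

```typescript
// Toy model of record-and-replay. NOT Temporal's implementation;
// every name and value here is hypothetical.
type HistoryEvent = { step: string; result: unknown };

class ReplayContext {
  private cursor = 0;
  private readonly eventLog: HistoryEvent[];

  constructor(eventLog: HistoryEvent[]) {
    this.eventLog = eventLog;
  }

  // Run a side effect once. If the log already holds a result for this
  // position, return the recorded result instead of running it again.
  step<T>(label: string, effect: () => T): T {
    const recorded = this.eventLog[this.cursor];
    this.cursor++;
    if (recorded !== undefined) {
      if (recorded.step !== label) {
        throw new Error(`non-deterministic replay: expected ${recorded.step}, got ${label}`);
      }
      return recorded.result as T;
    }
    const result = effect();
    this.eventLog.push({ step: label, result });
    return result;
  }
}

// Counters so we can see how often the "outside world" was really touched.
let creditPulls = 0;
let advances = 0;

function factorInvoice(ctx: ReplayContext, crashAfterCredit: boolean): string {
  const score = ctx.step('pullCreditReport', () => { creditPulls++; return 712; });
  if (crashAfterCredit) throw new Error('worker died');
  const advanced = ctx.step('advanceFunds', () => { advances++; return score > 650; });
  return advanced ? 'advanced' : 'declined';
}

// First run crashes after the credit pull; the log survives the crash.
const sharedLog: HistoryEvent[] = [];
let firstRunCrashed = false;
try {
  factorInvoice(new ReplayContext(sharedLog), true);
} catch {
  firstRunCrashed = true;
}

// A "different server" re-runs the same function against the same log.
const replayOutcome = factorInvoice(new ReplayContext(sharedLog), false);
```

The second run walks through the same function from the top, but the credit pull comes out of the log rather than hitting the bureau again. That's the whole trick in miniature: the function looks like it resumed, when really it was re-executed against recorded results.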

Two ideas carry everything

A Workflow is the orchestration, the script of the process. An Activity is anything that touches the outside world: API calls, database writes, charging a card. The split is the whole trick. The Workflow stays predictable so it can be replayed; Activities are where the messy real world is allowed to be messy, and Temporal retries them for you.

Activities are just functions, called through a proxy that adds retries and timeouts:

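import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities'; // assumed location of the activity implementations

// Every call made through this proxy gets the timeout and retry policy below.
const { runKybCheck, pullCreditReport, advanceFunds, notify } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
  retry: { maximumAttempts: 5 },
});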

That retry block is the retry table you were about to build. The bureau can fail four times and your workflow never notices.
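
It helps to see what that policy means in wall-clock terms. The sketch below computes the waits between attempts for an exponential backoff policy. The helper retryDelaysMs and the RetrySketch shape are hypothetical (the platform does this arithmetic internally; this is not an SDK function), and I'm assuming a one-second initial interval and a coefficient of 2, which I believe match Temporal's documented defaults.

```typescript
// Hypothetical helper for illustration only; not part of any SDK.
interface RetrySketch {
  initialIntervalMs: number;  // wait before the first retry
  backoffCoefficient: number; // multiplier applied after each failure
  maximumIntervalMs: number;  // cap on any single wait
  maximumAttempts: number;    // total attempts, including the first
}

// Returns the wait before each retry. N attempts means N - 1 waits.
function retryDelaysMs(policy: RetrySketch): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maximumAttempts - 1; retry++) {
    const uncapped = policy.initialIntervalMs * policy.backoffCoefficient ** retry;
    delays.push(Math.min(uncapped, policy.maximumIntervalMs));
  }
  return delays;
}

// Five attempts, as in the bureau example: four waits, doubling each time.
const bureauDelays = retryDelaysMs({
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumIntervalMs: 100000,
  maximumAttempts: 5,
});

// A tighter cap flattens the tail of the schedule.
const cappedDelays = retryDelaysMs({
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumIntervalMs: 5000,
  maximumAttempts: 6,
});
```

Under those assumptions the bureau gets roughly fifteen seconds of cumulative backoff before the fifth attempt, which is worth knowing when you pick maximumAttempts for a dependency that tends to stay down for minutes rather than seconds.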

The human in the loop

The hardest part of these systems is rarely the code. It's the waiting, especially waiting on a person. Here's the underwriting step written the way durable workflows let you write it:

import { defineSignal, setHandler, condition } from '@temporalio/workflow';

// An underwriter's approval arrives as a Signal carrying a single boolean.
export const decision = defineSignal<[boolean]>('underwriterDecision');

export async function onboardBusiness(id: string): Promise<void> {
  // runKybCheck, pullCreditReport, and notify are Activities obtained from proxyActivities.
  if (!(await runKybCheck(id))) return notify(id, 'rejected');
  await pullCreditReport(id);

  // Pause here until a human decides, or 7 days pass. No polling, no cron.
  let approved: boolean | undefined;
  setHandler(decision, (d) => { approved = d; });
  await condition(() => approved !== undefined, '7 days');

  // On timeout, approved is still undefined, so an unanswered application is declined.
  await notify(id, approved ? 'approved' : 'declined');
}

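condition() pauses the workflow until the signal arrives or seven days pass, and it resolves to false if the timer wins. While it waits the workflow is genuinely asleep: it uses no CPU, and a worker can drop it from memory because its state lives in the event history. The approval button just sends a signal: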

// Assumes the workflow was started with its workflowId set to the business id.
await client.workflow.getHandle(id).signal(decision, true);

A Signal is how the outside world pushes data into a running workflow: a human decision, a fraud flag, a cancellation. This is the human in the loop modeled honestly: not a status column you poll and pray over, but a workflow that pauses, waits for a person, and resumes the moment they act. The long debtor wait works the same way: condition(() => paid >= amount, '60 days') races a payment signal against a timer that survives every deploy and restart in between.
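
To make the debtor race concrete, here's a small model of what it resolves to, with time reduced to day numbers so it runs anywhere. This is a sketch of the semantics as I understand them, not workflow code: debtorWait and PaymentSignal are hypothetical names, and treating a payment that lands exactly on the deadline as on time is my own assumption.

```typescript
// Hypothetical model of "payment signals vs. a timer". Not Temporal code.
interface PaymentSignal {
  day: number;   // days since the advance was paid out
  cents: number; // amount received in this payment
}

interface WaitResult {
  settled: boolean;  // true if the invoice was fully paid before the timer fired
  day: number;       // day on which the wait ended
  paidCents: number; // total received by then
}

function debtorWait(amountCents: number, timeoutDays: number, signals: PaymentSignal[]): WaitResult {
  let paid = 0;
  // Signals are handled in arrival order.
  const ordered = [...signals].sort((a, b) => a.day - b.day);
  for (const signal of ordered) {
    // Assumption: anything after the deadline loses the race to the timer.
    if (signal.day > timeoutDays) break;
    paid += signal.cents;
    if (paid >= amountCents) {
      return { settled: true, day: signal.day, paidCents: paid };
    }
  }
  return { settled: false, day: timeoutDays, paidCents: paid };
}

// Two partial payments that clear a $1,000 invoice on day 45.
const paidInFull = debtorWait(100000, 60, [
  { day: 20, cents: 40000 },
  { day: 45, cents: 60000 },
]);

// The second payment arrives on day 75, after the 60-day timer.
const wentOverdue = debtorWait(100000, 60, [
  { day: 20, cents: 40000 },
  { day: 75, cents: 60000 },
]);
```

The point of the model is the partial state at the deadline: when the timer wins, the workflow still knows exactly how much was paid, so the collections step that follows starts from facts instead of a guess.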

Recognizing the shape

Most of the value here isn't in the API. It's in learning to recognize when you're dealing with a workflow rather than an application.

You're probably looking at a durable workflow when:

The process lives longer than a single request.
Humans participate in the process.
External systems control part of the timeline.
Steps must happen in a specific order.
Failure halfway through creates business damage.
Someone regularly asks, "Where is this application right now?"
You have a status enum that's growing faster than you'd like.
A cron job exists primarily to find work that got stuck.

If several of those sound familiar, there's a good chance you're not building CRUD anymore.

Once you have that lens, factoring stops looking special. The same shape appears in onboarding and KYC, order fulfillment, account provisioning, subscription lifecycles, document processing, reconciliation systems, payouts, and large-scale migrations. Different industries, same underlying problem.

And when not to: a 200ms CRUD endpoint, a fire-and-forget notification, plain request/response. Durable workflows cost something to run (a cluster or Temporal Cloud, plus workers), so for short, stateless work they're pure overkill. The payoff scales with how long the process lives and how much it hurts when it breaks midway.

What you stop writing

This is the real return. You stop writing status columns, because the code's position is the state. You stop writing polling jobs, bespoke retry logic, and reconciliation scripts that guess what happened after a crash. You stop writing the defensive "what if we deploy mid-process" handling, because mid-flight deploys become a non-event. And you get observability for free: every run has a complete, queryable history of what happened and when, so "why did this sit for nine days?" is a log query, not a forensic exercise. This is where durable workflows stop being an architectural preference and start becoming an economic argument.
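
Here's roughly what "a log query" can look like once the history is in hand. This is a sketch under assumptions: HistoryEntry is a simplified shape I invented (real histories carry much more per event), the timestamps are made up, and longestGap is a hypothetical helper, not a platform API.

```typescript
// Hypothetical, simplified history shape for illustration.
interface HistoryEntry {
  type: string; // what happened
  at: string;   // ISO-8601 timestamp of when it happened
}

interface Gap {
  after: string;  // event the process was sitting behind
  before: string; // event that finally moved it along
  days: number;   // how long it sat
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Finds the longest stretch between two consecutive events.
function longestGap(entries: HistoryEntry[]): Gap | undefined {
  let worst: Gap | undefined;
  let prev: HistoryEntry | undefined;
  for (const entry of entries) {
    if (prev !== undefined) {
      const days = (Date.parse(entry.at) - Date.parse(prev.at)) / MS_PER_DAY;
      if (worst === undefined || days > worst.days) {
        worst = { after: prev.type, before: entry.type, days };
      }
    }
    prev = entry;
  }
  return worst;
}

// Made-up run: the application waited on an underwriter's signal.
const stalledRun: HistoryEntry[] = [
  { type: 'WorkflowExecutionStarted', at: '2024-03-01T09:00:00Z' },
  { type: 'ActivityTaskCompleted', at: '2024-03-01T09:00:05Z' },
  { type: 'WorkflowExecutionSignaled', at: '2024-03-10T09:00:05Z' },
  { type: 'WorkflowExecutionCompleted', at: '2024-03-10T09:00:07Z' },
];

const stall = longestGap(stalledRun);
```

In this made-up run the answer is that the application sat between a completed activity and an incoming signal, which is to say it was waiting on a person. That's the kind of question we used to answer with three queries and a Slack thread.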

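The one thing you take on is a discipline: Workflow code has to be deterministic, because it gets re-executed on replay and must make the same decisions every time. Anything that touches the outside world (network calls, database reads, file I/O) lives in Activities, and the clock and randomness come from the SDK's replay-safe versions rather than the system's. In practice that's just orchestration in the Workflow, outside-world stuff in Activities, which is the separation you wanted anyway.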
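
Here's why the rule exists, in a few lines. The sketch below is a toy, not SDK code: makeSideEffect and routeForReview are hypothetical names, and the 50/50 routing rule is invented. It records an unpredictable value the first time through and hands back the recording on every replay, so the branch taken never changes.

```typescript
// Toy illustration of replay-safe randomness. Names are hypothetical.
type Produce = () => number;

// Returns a function that records a value on first execution and
// returns the recorded value on every later replay of the same log.
function makeSideEffect(recordedValues: number[]): (produce: Produce) => number {
  let cursor = 0;
  return (produce) => {
    const recorded = recordedValues[cursor];
    cursor++;
    if (recorded !== undefined) return recorded;
    const fresh = produce();
    recordedValues.push(fresh);
    return fresh;
  };
}

// "Workflow" logic that branches on a random roll.
function routeForReview(sideEffect: (produce: Produce) => number): string {
  const roll = sideEffect(() => Math.random());
  return roll < 0.5 ? 'manual-review' : 'auto-approve';
}

// First execution rolls the dice for real and records the result.
const recordedRolls: number[] = [];
const firstRoute = routeForReview(makeSideEffect(recordedRolls));

// Every replay against the same recording takes the same branch.
const replayedRoutes = [1, 2, 3].map(() => routeForReview(makeSideEffect(recordedRolls)));
```

If routeForReview called Math.random() directly, a replay could send an application down the other branch and the rebuilt state would no longer match what actually happened. Keeping the unpredictable part recorded is what makes the replay trustworthy.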

The takeaway

The skill isn't knowing Temporal's API. It's recognizing the shape: long-running, human-gated, stateful, can't-fail-halfway. When a process has that shape, stop hand-rolling durability one cron job at a time.

Write the process as one honest function and let the platform handle the waiting, the retries, and the recovery. Otherwise you'll build your own workflow engine anyway, slowly, accidentally, in production at 3am.

The first time one of these survives a crash untouched, it feels like cheating. It isn't. You're just finally building the system instead of babysitting it.