Updated: June 9, 2026 at 5:47 PM

Workflow Orchestration: Temporal, Cadence, Step Functions

medium

How to design long-running business processes in microservices: durable execution, retries, compensations, workflow state, and platform trade-offs between Temporal, Cadence, and AWS Step Functions.

Workflow orchestration matters once a business process outlives individual requests, services, and even platform restarts.

In real design work, long-running processes, compensations, state ownership, and durable execution shape the system more deeply than the choice between Temporal, Cadence, and Step Functions.

In interviews and engineering discussions, this chapter helps you compare orchestration and choreography by control visibility, evolution cost, and the risk of hung or duplicated actions.

Practical value of this chapter

Design in practice

Design long-running processes with explicit compensation steps and state ownership.

Decision quality

Compare orchestration and choreography by control visibility and evolution complexity.

Interview articulation

Frame Saga answers through the main process path, failure paths, and recovery rules.

Failure framing

Set timeout and retry limits so workflows do not hang or duplicate side effects.

Primary source

Temporal Workflows

Core model for durable execution and Temporal Workflow semantics.

Workflow orchestration is an architectural layer for coordinating long-running business processes across microservices. It centralizes process state, retry and timeout policies, compensations, and operational control over execution.

When Orchestration Is Actually Needed

The process runs for minutes, hours, or days

When a business process outlives a single HTTP request, you need durable state and safe continuation after failures.

Compensations and rollback paths are part of the design

If steps touch multiple services, an orchestrator makes Saga execution explicit: compensations, rollback order, and a transparent action history.

Retry and timeout policies must be consistent

Shared rules for retries, backoff, and deadlines remove duplicated infrastructure logic from individual services.
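
As a rough illustration of what "shared rules" can mean, the sketch below keeps one policy object that every caller reuses. It is plain TypeScript, not the API of Temporal, Cadence, or Step Functions; `RetryPolicy`, `defaultPolicy`, `backoffDelayMs`, and `withRetry` are hypothetical names, and the default values are arbitrary assumptions.

```typescript
// Hypothetical shared retry policy; names and defaults are illustrative assumptions.
interface RetryPolicy {
  maxAttempts: number;        // total attempts, including the first call
  initialDelayMs: number;     // delay before the second attempt
  backoffCoefficient: number; // multiplier applied after each failure
  maxDelayMs: number;         // upper bound for a single delay
}

const defaultPolicy: RetryPolicy = {
  maxAttempts: 4,
  initialDelayMs: 100,
  backoffCoefficient: 2,
  maxDelayMs: 1000,
};

// Delay to wait after the given failed attempt (1-based), capped at maxDelayMs.
function backoffDelayMs(policy: RetryPolicy, failedAttempt: number): number {
  const raw = policy.initialDelayMs * Math.pow(policy.backoffCoefficient, failedAttempt - 1);
  return Math.min(raw, policy.maxDelayMs);
}

// Runs `operation` until it succeeds or the policy's attempts are exhausted.
// `sleep` is injectable so callers (and tests) control how waiting happens.
async function withRetry<T>(
  operation: () => Promise<T>,
  policy: RetryPolicy = defaultPolicy,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Wait only if another attempt will follow.
      if (attempt < policy.maxAttempts) {
        await sleep(backoffDelayMs(policy, attempt));
      }
    }
  }
  throw lastError;
}
```

An orchestration platform typically provides this behavior itself; the point of the sketch is only that the policy lives in one place rather than being re-implemented in each service.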

Operational control matters

Replay, manual step restart, pause/resume, audit, and workflow-state metrics need to live in one operational plane.

Temporal, Cadence, and Step Functions: Practical Comparison

Temporal

State model: Durable execution and event history
Authoring model: Process logic in SDK code (Go/Java/TS/...)
Retries and timeouts: Retry policies for activities and workflows, plus timers
Trade-offs: Requires deterministic workflow discipline and a dedicated operating plane.

Cadence

State model: Durable execution, architecturally close to Temporal
Authoring model: Process logic in SDK code
Retries and timeouts: Activity retry policies and domain-level controls
Trade-offs: Mostly seen in existing installations and in migration paths (Temporal began as a fork of Cadence).

AWS Step Functions

State model: Managed state machine defined in Amazon States Language (ASL), with a visual view of states
Authoring model: Declarative state machines and AWS integrations
Retries and timeouts: State-level retry and error handling
Trade-offs: Strong AWS integration with higher vendor lock-in risk.
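
To make "state-level retry and error handling" concrete, the sketch below builds a small ASL definition as a TypeScript object and serializes it to JSON. The field names (`Retry`, `Catch`, `ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`, `TimeoutSeconds`) and the error names `States.Timeout` and `States.ALL` come from Amazon States Language; the state names, the placeholder resource ARNs, the numeric values, and the narrow TypeScript interfaces are assumptions made for illustration.

```typescript
// Minimal types for the subset of Amazon States Language (ASL) used in this sketch.
interface Retrier {
  ErrorEquals: string[];
  IntervalSeconds: number;
  MaxAttempts: number;
  BackoffRate: number;
}

interface Catcher {
  ErrorEquals: string[];
  Next: string;
}

interface TaskState {
  Type: "Task";
  Resource: string;
  TimeoutSeconds?: number;
  Retry?: Retrier[];
  Catch?: Catcher[];
  Next?: string;
  End?: boolean;
}

interface StateMachine {
  StartAt: string;
  States: Record<string, TaskState>;
}

// Hypothetical two-state fragment: charge the customer, release inventory on failure.
const orderStateMachine: StateMachine = {
  StartAt: "ChargePayment",
  States: {
    ChargePayment: {
      Type: "Task",
      Resource: "arn:aws:lambda:REGION:ACCOUNT_ID:function:chargePayment", // placeholder ARN
      TimeoutSeconds: 30,
      // Retry only timeouts: wait 2s, then 4s, then 8s before giving up.
      Retry: [
        { ErrorEquals: ["States.Timeout"], IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2 },
      ],
      // Any error left after retries routes to the compensation state.
      Catch: [{ ErrorEquals: ["States.ALL"], Next: "ReleaseInventory" }],
      End: true,
    },
    ReleaseInventory: {
      Type: "Task",
      Resource: "arn:aws:lambda:REGION:ACCOUNT_ID:function:releaseInventory", // placeholder ARN
      End: true,
    },
  },
};

// The JSON form is what a state machine definition would contain.
const aslJson: string = JSON.stringify(orderStateMachine, null, 2);
```

Unlike the SDK-based platforms, the retry and catch rules here are data attached to a state rather than code, which is what makes them visible in the console but harder to unit test as ordinary functions.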

Reference Process With Compensations

A typical order process reserves inventory, charges payment, creates a shipment, and sends confirmation. If a step fails, compensations run in reverse order.

Reference Orchestration Process

Happy path and Saga compensations in a single visual flow.

Success path: Order received (workflow started) → reserveInventory() (reserve items) → chargePayment() (charge customer) → createShipment() (prepare shipment) → sendConfirmation() (notify customer) → Completed (workflow done)

Compensation path (failure in createShipment): Step failure (shipment creation failed) → refundPayment() (reverse payment) → releaseInventory() (release reservation) → Rolled Back (saga compensated)

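// Activities (reserveInventory, chargePayment, ...) and OrderInput are assumed to be defined elsewhere.
export async function OrderWorkflow(input: OrderInput): Promise<void> {
  // A compensation is registered only after its step succeeds,
  // so a failed charge is never "refunded".
  const compensations: Array<() => Promise<void>> = [];

  try {
    const reservation = await reserveInventory(input.orderId, input.items);
    compensations.push(() => releaseInventory(input.orderId));

    await chargePayment(input.orderId, input.amount);
    compensations.push(() => refundPayment(input.orderId));

    await createShipment(input.orderId, reservation.warehouseId);
    await sendConfirmation(input.orderId);
  } catch (error) {
    // Undo completed steps in reverse order, then surface the failure.
    for (const compensate of compensations.reverse()) {
      await compensate();
    }
    throw error;
  }
}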

Execution Contract and Reliability Checklist

Execution contract

  • Every activity is idempotent: re-execution must not corrupt business state.
  • Every external call and the overall process have explicit timeouts and deadlines.
  • Compensations are business-valid reverse actions, not only technical rollbacks.
  • Workflow logic is versioned so running instances can finish under older rules.
  • Every workflow state is visible through metrics and tracing.
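
The first rule of the contract, idempotent activities, can be sketched as a wrapper that runs an action at most once per business key. This is a minimal in-memory illustration, assuming a single process; `IdempotentExecutor` and the key format are hypothetical, and a real system would need a durable store or the downstream provider's own idempotency keys.

```typescript
// Hypothetical in-memory idempotency guard; not durable and not shared across processes.
class IdempotentExecutor<T> {
  private readonly calls = new Map<string, Promise<T>>();

  // Runs `action` at most once per key; repeated calls get the recorded outcome.
  // A failed action is forgotten so that a later retry can execute it again.
  run(key: string, action: () => Promise<T>): Promise<T> {
    const existing = this.calls.get(key);
    if (existing !== undefined) {
      return existing;
    }
    const pending = (async () => {
      try {
        return await action();
      } catch (error) {
        this.calls.delete(key);
        throw error;
      }
    })();
    this.calls.set(key, pending);
    return pending;
  }
}
```

A payment activity could then call `executor.run("charge:" + orderId, ...)`, so that a retried activity returns the first receipt instead of charging the customer twice.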

Reliability checklist

  • Every workflow instance has a stable business key, such as `orderId`, and a deduplication policy.
  • Workflow code avoids hidden nondeterministic calls (clock, randomness, direct I/O); such calls live in activities or are wrapped in explicit side-effect primitives.
  • Errors are split into retryable and non-retryable classes with different handling policies.
  • Manual operations such as resume, terminate, and restart from a failed step are documented as runbooks.
  • The orchestration SLO is measured separately: start latency, completion time, and failed-process rate.
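
The split between retryable and non-retryable errors can be sketched as a small decision function. This is an assumption-laden illustration in plain TypeScript: `ActivityError` and `decide` are hypothetical names, and treating unknown errors as non-retryable is one possible default, chosen here to avoid repeating side effects blindly.

```typescript
// Hypothetical error type that carries its own retry classification.
class ActivityError extends Error {
  readonly retryable: boolean;

  constructor(message: string, retryable: boolean) {
    super(message);
    // Keeps `instanceof` working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, ActivityError.prototype);
    this.name = "ActivityError";
    this.retryable = retryable;
  }
}

type Decision = "retry" | "fail";

// Decides what to do after `attempt` (1-based) failed with `error`.
function decide(error: unknown, attempt: number, maxAttempts: number): Decision {
  const retryable = error instanceof ActivityError && error.retryable;
  if (!retryable) {
    return "fail"; // unclassified errors are treated as non-retryable in this sketch
  }
  return attempt < maxAttempts ? "retry" : "fail";
}
```

For example, a gateway timeout would be raised as `new ActivityError("gateway timeout", true)`, while a declined card would be `new ActivityError("card declined", false)` and would go straight to the failure path.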

Implementation Risks

Mixing business logic with transport details

Keep the process as a coordination layer; move domain decisions and external integration details into separate activity and handler layers.

Implicit compensations

Define compensations next to each step and test them separately with fault injection.
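
One way to follow this advice is to declare each step together with its compensation and then inject a failure at a chosen step. The sketch below is plain TypeScript under stated assumptions: `SagaStep`, `runSaga`, and `buildOrderSaga` are hypothetical names, the step names mirror the reference order process, and a failing compensation is not handled.

```typescript
// A step pairs a forward action with the compensation that undoes it.
interface SagaStep {
  name: string;
  action: () => Promise<void>;
  compensate?: () => Promise<void>; // omitted when there is nothing to undo
}

// Runs steps in order; on failure, compensates completed steps in reverse order and rethrows.
async function runSaga(steps: SagaStep[]): Promise<void> {
  const completed: SagaStep[] = [];
  try {
    for (const step of steps) {
      await step.action();
      completed.push(step);
    }
  } catch (error) {
    for (const step of completed.reverse()) {
      if (step.compensate) {
        await step.compensate();
      }
    }
    throw error;
  }
}

// Fault injection: builds the order saga with one step forced to fail.
// Every successful action and every compensation appends its name to `log`.
function buildOrderSaga(failAt: string | null, log: string[]): SagaStep[] {
  const act = (name: string) => async () => {
    if (name === failAt) {
      throw new Error(`injected failure in ${name}`);
    }
    log.push(name);
  };
  const undo = (name: string) => async () => {
    log.push(name);
  };
  return [
    { name: "reserveInventory", action: act("reserveInventory"), compensate: undo("releaseInventory") },
    { name: "chargePayment", action: act("chargePayment"), compensate: undo("refundPayment") },
    { name: "createShipment", action: act("createShipment") },
    { name: "sendConfirmation", action: act("sendConfirmation") },
  ];
}
```

Running the saga once per possible `failAt` value and asserting on `log` gives a separate test for each compensation path, without touching real services.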

One giant workflow

Split the flow into subprocesses with clear inputs, outputs, and bounded-context ownership.

Insufficient observability

Publish metrics for step status, retry depth, queue growth, and time to completion.
