This Theme 9 chapter focuses on workflow orchestration, compensation, and idempotent step handling.
In real-world design, this material helps drive decisions using measurable constraints: latency budget, blast radius, contract stability, and integration operating cost.
For system design interviews, it provides a clear narrative: why this approach was chosen, which alternatives were considered, and which operational risks must be made explicit.
Practical value of this chapter
- Design in practice: design long-running processes with explicit compensation steps and state ownership.
- Decision quality: compare orchestration and choreography by control visibility and evolution complexity.
- Interview articulation: frame saga answers via happy path, failure path, and recovery policy.
- Failure framing: set timeout/retry limits so workflows do not hang or duplicate side effects.
Primary source: Temporal Workflows, the core model for durable execution and workflow semantics.
Workflow orchestration is an architectural layer for coordinating long-running business processes across microservices. It centralizes process state, retry/timeout policies, compensation logic, and runtime operational control.
Signals That Orchestration Is Actually Needed
The process runs for minutes, hours, or days
When a business flow outlives a single HTTP request, you need durable state and safe continuation after failures.
You need compensations and rollback scenarios
If steps touch multiple services, an orchestrator simplifies Saga execution: explicit compensations, rollback order, and transparent history.
Standardized retry/timeout policies are required
A shared policy for retries, backoff, and deadlines removes duplicated infrastructure logic from each microservice.
Operational control matters
You need replay, manual step restart, pause/resume, audit, and workflow state metrics in one operational plane.
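A shared retry policy of the kind described above can be sketched in a few lines. This is a minimal illustration, not the API of Temporal, Cadence, or Step Functions: `RetryPolicy`, `backoffDelay`, and `runWithRetry` are hypothetical names, and an overall deadline is left out for brevity. The sleep function is injected so the backoff schedule can be checked without real waiting.

```typescript
// Hypothetical policy shape; field names are illustrative, not taken from any SDK.
interface RetryPolicy {
  maxAttempts: number;     // total attempts, including the first one
  initialDelayMs: number;  // delay before the second attempt
  backoffFactor: number;   // multiplier applied after each failed attempt
  maxDelayMs: number;      // upper bound for a single delay
  isRetryable: (error: unknown) => boolean; // false means fail immediately
}

type SleepFn = (ms: number) => Promise<void>;

const realSleep: SleepFn = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay to wait after failed attempt number `attempt` (1-based), capped at maxDelayMs.
function backoffDelay(policy: RetryPolicy, attempt: number): number {
  const delay = policy.initialDelayMs * Math.pow(policy.backoffFactor, attempt - 1);
  return Math.min(delay, policy.maxDelayMs);
}

// Runs `step` until it succeeds, the error is non-retryable, or attempts run out.
async function runWithRetry<T>(
  step: () => Promise<T>,
  policy: RetryPolicy,
  sleep: SleepFn = realSleep,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await step();
    } catch (error) {
      const exhausted = attempt >= policy.maxAttempts;
      if (exhausted || !policy.isRetryable(error)) {
        throw error;
      }
      await sleep(backoffDelay(policy, attempt));
    }
  }
}
```

Keeping the policy as data rather than hand-written loops in each service is what makes it possible to standardize and audit retry behavior in one place.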
Temporal, Cadence, Step Functions: Practical Comparison
| Platform | State model | Authoring model | Retry/timeout | Trade-offs |
|---|---|---|---|---|
| Temporal | Durable execution + event history | Code-first workflows in SDKs (Go/Java/TS/...) | Retry policies on activities and workflow executions + durable timers | Requires deterministic workflow code and a dedicated ops plane. |
| Cadence | Durable execution (architecturally close to Temporal) | Code-first workflows in SDKs | Activity retry policies + domain-level controls | Mostly relevant for existing installations and migration paths (Temporal began as a fork of Cadence). |
| AWS Step Functions | Managed state machine (Amazon States Language, visual states) | Declarative state machines + AWS integrations | Retry/Catch per state | Strong AWS integration with a higher vendor lock-in risk. |
Reference Flow With Compensations
Typical order flow: reserve inventory, charge payment, create shipment, send confirmation. If a step fails, compensations for the steps that already completed run in reverse order.
Reference Orchestration Process
Happy path and Saga compensations in a single visual flow.
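```typescript
// Activity functions (reserveInventory, chargePayment, ...) and OrderInput
// are assumed to be defined elsewhere.
export async function OrderWorkflow(input: OrderInput): Promise<void> {
  // Compensations for completed steps only; run newest-first on failure.
  const compensations: Array<() => Promise<void>> = [];
  try {
    const reservation = await reserveInventory(input.orderId, input.items);
    compensations.push(() => releaseInventory(input.orderId));
    await chargePayment(input.orderId, input.amount);
    compensations.push(() => refundPayment(input.orderId));
    // Note: createShipment has no compensation in this sketch.
    await createShipment(input.orderId, reservation.warehouseId);
    await sendConfirmation(input.orderId);
  } catch (error) {
    for (const compensate of compensations.reverse()) {
      await compensate();
    }
    throw error;
  }
}
```

Execution Contract and Reliability Checklist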
Execution contract
- Every activity is idempotent: re-execution must not corrupt business state.
- Every external call and the overall workflow have explicit timeout/deadline boundaries.
- Compensations are business-valid reverse actions, not only technical rollbacks.
- Workflow logic is versioned so running instances can finish on older behavior safely.
- Every workflow state is observable through metrics and tracing.
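The idempotency requirement in the contract above is often met with an idempotency key. The sketch below is one possible shape, assuming an in-memory map where a real system would need a durable store with an atomic insert; `IdempotencyStore` and `chargePaymentOnce` are hypothetical names, and the counter stands in for a real side effect.

```typescript
// In-memory record of outcomes, keyed by idempotency key.
// Assumption: production code would use a durable store with an atomic insert.
class IdempotencyStore<T> {
  private readonly results = new Map<string, Promise<T>>();

  // Runs `action` at most once per key; repeated calls reuse the recorded outcome.
  execute(key: string, action: () => Promise<T>): Promise<T> {
    const existing = this.results.get(key);
    if (existing !== undefined) {
      return existing;
    }
    const pending = action().catch((error) => {
      // Failed attempts are forgotten so that a retry can run the action again.
      this.results.delete(key);
      throw error;
    });
    this.results.set(key, pending);
    return pending;
  }
}

// Hypothetical payment activity; `chargesMade` counts the simulated side effect.
let chargesMade = 0;
const chargeStore = new IdempotencyStore<string>();

function chargePaymentOnce(orderId: string, amountCents: number): Promise<string> {
  return chargeStore.execute(`charge:${orderId}`, async () => {
    chargesMade += 1;
    return `receipt-${orderId}-${amountCents}`;
  });
}
```

Calling `chargePaymentOnce("o-1", 500)` twice performs the simulated charge once and returns the same receipt both times, which is the property an orchestrator relies on when it re-executes an activity after a timeout.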
Reliability checklist
- Every workflow has a stable business key (for example, `orderId`) and a dedup policy.
- Workflow code avoids hidden nondeterministic behavior (clock reads, random values, direct I/O); such operations run in activities or are wrapped in side-effect primitives.
- Errors are split into retryable vs non-retryable with distinct handling policies.
- Manual operations (`resume`, `terminate`, `retry from failed step`) are documented as runbooks.
- Orchestration SLO is measured separately: start latency, completion latency, failure rate.
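The three orchestration indicators named in the checklist (start latency, completion latency, failure rate) can be derived from run records. The following is a rough sketch under assumed data: `WorkflowRunRecord` and its fields are hypothetical, and the choice of a mean for start latency and a nearest-rank p95 for completion latency is only one reasonable option.

```typescript
// Hypothetical record of one workflow run; timestamps are epoch milliseconds.
interface WorkflowRunRecord {
  requestedAt: number; // when the start was requested
  startedAt: number;   // when the first step began executing
  finishedAt: number;  // when the run reached a terminal state
  outcome: "completed" | "failed";
}

interface OrchestrationSli {
  meanStartLatencyMs: number;
  p95CompletionLatencyMs: number;
  failureRate: number;
}

// Nearest-rank percentile over an unsorted, non-empty sample; `p` is in (0, 100].
function percentile(values: number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

function computeSli(runs: WorkflowRunRecord[]): OrchestrationSli {
  if (runs.length === 0) {
    throw new Error("at least one run is required");
  }
  const startLatencies = runs.map((r) => r.startedAt - r.requestedAt);
  const completionLatencies = runs.map((r) => r.finishedAt - r.requestedAt);
  const failedRuns = runs.filter((r) => r.outcome === "failed").length;
  return {
    meanStartLatencyMs: startLatencies.reduce((sum, v) => sum + v, 0) / runs.length,
    p95CompletionLatencyMs: percentile(completionLatencies, 95),
    failureRate: failedRuns / runs.length,
  };
}
```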
Implementation Risks
Mixing business logic with transport details
Keep workflows as a coordination layer; move domain logic and external integration details to activity/handler layers.
Implicit compensations
Define compensations next to each step and test them separately with fault-injection scenarios.
One giant workflow
Split flows into subprocesses with clear inputs/outputs and explicit bounded-context ownership.
Insufficient observability
Publish metrics for step status, retry depth, queue lag, and time-to-completion.
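One way to keep compensations explicit, as recommended above, is to declare each compensation next to its step and exercise it with injected failures. The sketch below assumes a plain in-process runner; `SagaStep`, `runSaga`, and `buildOrderSaga` are hypothetical names, and `cancelShipment` is an illustrative compensation label that is not part of the reference flow.

```typescript
// A saga step pairs a forward action with the compensation that undoes it.
interface SagaStep {
  label: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Runs steps in order; on failure, compensates the completed steps in reverse.
async function runSaga(steps: SagaStep[]): Promise<void> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
    } catch (error) {
      for (const done of completed.reverse()) {
        await done.compensate();
      }
      throw error;
    }
  }
}

// Fault-injection helper: builds an order flow in which the step named
// `failAt` throws. Every action and compensation appends its label to `log`.
function buildOrderSaga(failAt: string | null, log: string[]): SagaStep[] {
  const makeStep = (label: string, undoLabel: string): SagaStep => ({
    label,
    action: async () => {
      if (label === failAt) {
        throw new Error(`injected failure at ${label}`);
      }
      log.push(label);
    },
    compensate: async () => {
      log.push(undoLabel);
    },
  });
  return [
    makeStep("reserveInventory", "releaseInventory"),
    makeStep("chargePayment", "refundPayment"),
    makeStep("createShipment", "cancelShipment"),
  ];
}
```

Running `runSaga(buildOrderSaga("createShipment", log))` should reject and leave `log` showing the two completed steps followed by their compensations in reverse order, which gives a compact regression test per failure point.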
References
Related chapters
- Interservice communication patterns - Core sync/async interaction context and boundary decisions across services.
- Distributed Transactions: 2PC and 3PC - Distributed transaction context and why Saga often wins over 2PC in practice.
- Event-Driven Architecture: Event Sourcing, CQRS, Saga - How orchestration and choreography compare in event-driven systems.
- Service Discovery - Stable routing and service lookup for workflow steps across microservices.
- Fault Tolerance Patterns: Circuit Breaker, Bulkhead, Retry - Failure-management and graceful-degradation patterns for each workflow step.
