Updated: April 7, 2026 at 6:22 AM

Agentic Workflows and Tool Calling Architecture

medium

How to design agentic systems: tool registries, planning and execution loops, state, approvals, and safe failure handling.

Once an agent is allowed to call tools, the architecture stops being a prompt flow and becomes a system with permissions, state, and budgets.

The chapter shows how to design planning loops, tool registries, approval gates, retries, and stopping rules so the agent adds value instead of uncontrolled operational complexity.

In interviews, it lets you discuss a concrete runtime rather than an abstract agent: who can call what, how arguments are validated, and how the system degrades safely under failure.

Practical value of this chapter

Runtime shape

Translate agent loops and tool-calling concepts into architecture decisions about permissions, state, budgets, and safe-failure paths.

Autonomy control

Evaluate the system through tool reliability, autonomy control, latency, cost, and operational risk.

Interview material

Frame the answer around the agent loop, tool registry, approvals, and stopping conditions, showing where constraints appear and how you manage them.

System trade-offs

Make trade-offs explicit in agent loops, tool calling, and controlled action execution: experiment speed, quality, explainability, resource budget, and maintenance complexity.

Once an agent is allowed to call tools, read external context, and affect system state, the problem is no longer “a clever prompt.” It becomes a small runtime with permissions, budget tiers, checkpoint state, and explicit safe-failure paths.

The most useful way to discuss that system is through capability boundaries, recovery semantics, approvals, and observability. Those layers determine whether the agent remains controllable under load, during failures, and on the write path.

Capability boundary

The main decision is not just which model to use, but which actions the agent is allowed to see and invoke at all.

Recovery semantics

You need to define ahead of time how the workflow resumes after a timeout, tool failure, or partially completed step.

Human override

Write paths, costly actions, and external side effects should have an explicit approval path and a clear manual takeover option.

Runtime building blocks

The easiest way to reason about an agent system is layer by layer: who chooses the next step, who checks access, who executes the action, and who owns recovery after failure.

Planner / Orchestrator

The orchestrator receives intent, current run state, and the active budget tier, then decides only the next step instead of trying to predict the full workflow up front.

Key contract: Input: task, history, capabilities, and risk context. Output: one next action with a reason code and a budget boundary.

What to watch: Loop length, budget burn rate, average replanning depth, and the share of runs that reach a stop condition without human intervention.
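
The contract above can be sketched as two small types. This is a minimal illustration, not a prescribed interface: `PlannerInput`, `NextAction`, `ReasonCode`, and the `search_docs` tool name are hypothetical, and `plan_next_step` is a rule-based stand-in for the model call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReasonCode(Enum):
    NEED_MORE_CONTEXT = "need_more_context"
    READY_TO_ANSWER = "ready_to_answer"
    BUDGET_EXHAUSTED = "budget_exhausted"


@dataclass(frozen=True)
class PlannerInput:
    task: str
    history: tuple[str, ...]       # summaries of steps already taken
    capabilities: frozenset[str]   # tools visible to this run
    risk_context: str              # e.g. "read_only"
    steps_left: int                # remaining step budget


@dataclass(frozen=True)
class NextAction:
    tool: Optional[str]            # None means "stop the loop"
    reason: ReasonCode
    max_cost_units: int            # budget boundary for this single step


def plan_next_step(inp: PlannerInput) -> NextAction:
    """Hypothetical stand-in for the model: returns exactly one next action."""
    if inp.steps_left <= 0:
        return NextAction(None, ReasonCode.BUDGET_EXHAUSTED, 0)
    if not inp.history and "search_docs" in inp.capabilities:
        return NextAction("search_docs", ReasonCode.NEED_MORE_CONTEXT, 1)
    return NextAction(None, ReasonCode.READY_TO_ANSWER, 0)
```

Returning a single action keeps every step individually checkable by the layers that follow.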

Tool registry and contracts

The tool registry defines a typed surface for the model: capability names, input/output schema, access scope, and risk class.

Key contract: Every tool should have an argument schema, response contract, action allowlist, and an explicit description of side effects.

What to watch: Schema violations, denied tool calls, stale contract definitions, and the share of calls that produced no useful outcome.
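
A registry along these lines might look like the sketch below. It assumes a deliberately simplified schema (argument name mapped to a Python type) rather than full JSON Schema, and the names `ToolSpec`, `ToolRegistry`, and the ticket tools are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolSpec:
    name: str
    arg_types: dict[str, type]   # simplified argument schema: name -> type
    scope: str                   # access scope, e.g. "tickets:read"
    risk_class: str              # "read" or "write"
    side_effects: str            # explicit description of side effects


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def visible_to(self, granted_scopes: set[str]) -> list[str]:
        """Only tools whose scope was granted are exposed to the model."""
        return sorted(n for n, s in self._tools.items() if s.scope in granted_scopes)

    def validate(self, name: str, args: dict[str, Any]) -> list[str]:
        """Return schema violations; an empty list means the call is well-formed."""
        spec = self._tools.get(name)
        if spec is None:
            return [f"unknown tool: {name}"]
        errors = [f"missing argument: {k}" for k in spec.arg_types if k not in args]
        errors += [f"unexpected argument: {k}" for k in args if k not in spec.arg_types]
        errors += [
            f"wrong type for {k}: expected {t.__name__}"
            for k, t in spec.arg_types.items()
            if k in args and not isinstance(args[k], t)
        ]
        return errors
```

Filtering visibility by scope means a tool the run is not entitled to never appears in the model's tool list, which is stricter than rejecting the call afterwards.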

Executor / sandbox

The execution layer turns the orchestrator's decision into a concrete command, API call, or internal service action without exposing the model to an unlimited runtime.

Key contract: Each call runs in a bounded environment with timeouts, retry policy, idempotency expectations, and a log of actual side effects.

What to watch: Tool latency, timeout rate, repeated calls for the same step, side-effect success rate, and the share of manual rollbacks after execution.
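
One way to picture the bounded call is the sketch below, which combines a timeout, a capped retry count, an idempotency key, and a log of what actually ran. `BoundedExecutor` is a hypothetical name, and a thread pool is used only for brevity: a timed-out thread keeps running, so real isolation would need a separate process, container, or microVM.

```python
import concurrent.futures as cf
from typing import Any, Callable, Optional


class BoundedExecutor:
    def __init__(self, timeout_s: float, max_attempts: int) -> None:
        self.timeout_s = timeout_s
        self.max_attempts = max_attempts
        self._results: dict[str, Any] = {}      # idempotency key -> stored result
        self.side_effect_log: list[str] = []    # record of what actually ran
        self._pool = cf.ThreadPoolExecutor(max_workers=4)

    def run(self, idempotency_key: str, fn: Callable[[], Any]) -> Any:
        # A repeated key returns the stored result instead of re-running the effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        last_error: Optional[BaseException] = None
        for attempt in range(1, self.max_attempts + 1):
            future = self._pool.submit(fn)
            try:
                result = future.result(timeout=self.timeout_s)
            except cf.TimeoutError as exc:
                # The worker thread is not killed here; see the note above.
                last_error = exc
                self.side_effect_log.append(f"{idempotency_key}: attempt {attempt} timed out")
                continue
            except Exception as exc:
                last_error = exc
                self.side_effect_log.append(f"{idempotency_key}: attempt {attempt} failed")
                continue
            self._results[idempotency_key] = result
            self.side_effect_log.append(f"{idempotency_key}: attempt {attempt} ok")
            return result
        raise RuntimeError(
            f"{idempotency_key}: gave up after {self.max_attempts} attempts"
        ) from last_error
```

The idempotency key is what makes a resumed run safe: replaying a step that already succeeded returns the recorded result rather than repeating the side effect.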

Policy and approvals

This layer separates read-only flows from actions that can cost money, change data, or trigger an external operation.

Key contract: Each action receives a risk tier, approval requirement, deny path, and a safe rejection reason that is understandable to both the user and the operator.

What to watch: Approval rate, denied actions, false-positive blocking, human review latency, and the number of write paths that ran without required approval.
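
As an illustration of that contract, a policy check can return a verdict together with a reason code and a message that is safe to show. The sketch is plain Python rather than a policy engine; the tier names, reason codes, and the `cost_limit` default are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class Decision:
    verdict: Verdict
    reason_code: str   # machine-readable, for operators and metrics
    message: str       # safe to show to the user


def decide(risk_class: str, cost_units: int, tenant_match: bool, cost_limit: int = 100) -> Decision:
    # Deny paths are checked first so no later rule can override them.
    if not tenant_match:
        return Decision(Verdict.DENY, "TENANT_BOUNDARY", "The action targets another tenant.")
    if cost_units > cost_limit:
        return Decision(Verdict.DENY, "COST_LIMIT", f"Estimated cost exceeds {cost_limit} units.")
    if risk_class == "read":
        return Decision(Verdict.ALLOW, "READ_ONLY", "Read-only action.")
    # Anything that is not explicitly read-only goes to a human.
    return Decision(Verdict.NEEDS_APPROVAL, "WRITE_PATH", "This action changes data and needs approval.")
```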

State and memory

Transient context, checkpoint state, and long-term memory should not be blended together: they have different lifecycles and serve different decisions.

Key contract: Short-lived context stores current-request data, run state stores step progress, and durable memory stores facts and prior outcomes.

What to watch: Resume success rate, stale memory hits, checkpoint size, and drift between the current world and stored memory.
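
The three lifecycles can be kept apart at the type level. The sketch below assumes JSON checkpoints and uses hypothetical names (`TransientContext`, `RunState`, `MemoryRecord`); the point is only that the checkpoint contains run progress and nothing from the request context.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TransientContext:
    """Lives for one request and is discarded afterwards."""
    prompt: str
    retrieved_chunks: list[str] = field(default_factory=list)


@dataclass
class RunState:
    """Checkpointed after every step so the run can resume."""
    run_id: str
    next_step: int = 0
    completed: list[str] = field(default_factory=list)

    def checkpoint(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def restore(cls, blob: str) -> "RunState":
        return cls(**json.loads(blob))


@dataclass(frozen=True)
class MemoryRecord:
    """Durable fact that outlives the run and carries its own provenance."""
    fact: str
    source: str
    recorded_at: str   # ISO 8601 timestamp
```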

Observability and evaluation

An agent runtime should be observed like a distributed system: model spans, tool spans, approvals, failures, and outcomes belong in one trace.

Key contract: One trace id links intent, tool calls, approvals, fallback, and final outcome, while replay/eval pipelines test regressions before rollout.

What to watch: Task completion, rollback rate, grounded answer rate, replay regressions, and the cost of a useful workflow rather than the cost of a single model call.
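
A toy in-memory trace shows the idea of one id linking every span. It is not the OpenTelemetry API; `Trace`, `Span`, and the span kinds are hypothetical names chosen for the example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    kind: str          # "model", "tool", "approval", "fallback", or "outcome"
    name: str
    ok: bool
    cost_units: int


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[Span] = field(default_factory=list)

    def record(self, kind: str, name: str, ok: bool = True, cost_units: int = 0) -> Span:
        span = Span(self.trace_id, kind, name, ok, cost_units)
        self.spans.append(span)
        return span

    def workflow_cost(self) -> int:
        """Cost of the whole workflow, not of a single model call."""
        return sum(s.cost_units for s in self.spans)

    def first_failure(self) -> Optional[Span]:
        return next((s for s in self.spans if not s.ok), None)
```

Summing cost over the trace is what makes "cost of a useful workflow" measurable, since retries and fallbacks are included.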

Reference agent runtime architecture

[Diagram: reference agent runtime]
Components: Intent and Task (request, context, budget tier); Planner / Orchestrator (next step, budget, stop rules); Tool Registry (capabilities, schema, allowlist); Policy / Approval (risk gates, human review, deny path); Executor / Sandbox (tool call, timeout, retry, side effect); State / Memory (run state, checkpoints, durable memory); Observability / Eval (trace, tool spans, replay, regressions).
Edge labels: intent + budget; allowed tools; approval / deny; validated action; checkpoint / restore; trace + eval.

The orchestrator should not do everything

It decides the next step and budget, while execution and side effects stay behind a separate boundary.

Registry and policy reduce tool surface area

The agent only sees typed capabilities, and dangerous actions are routed through an approval path.

State and traces are separate systems

Without checkpoints and an audit trail, the loop is hard to recover and almost impossible to investigate.

Execution path: from intent to a safe action

It helps to read the runtime not as one monolithic workflow, but as a sequence of steps with checks, budgets, and checkpoints that let the system resume safely after failure or pause.

1. Intent classification and budget tier assignment

~20-80 ms

At the start, the system decides which capabilities are visible at all, which budget is allowed, and which stop rules apply to this workflow.

Control point: Decide whether the run is read-only, write-capable, or approval-required.

2. Context and run-state restoration

~20-120 ms

If the run continues after a failure or pause, the runtime should restore only relevant state instead of mixing it with stale memory.

Control point: Hydrate session context, checkpoint, and the facts required from memory.

3. Planning the next step only

~100-500 ms

A good orchestrator formulates the next verifiable action unit: which tool is needed, what input is expected, and how to tell whether the step succeeded.

Control point: Choose a capability, not a full workflow script.

4. Argument, ACL, and risk-code validation

~10-70 ms

Before the real tool call, the runtime checks argument schema, access rules, action cost, and whether human approval is required.

Control point: Stop the dangerous call before execution, not after it.

5. Sandboxed execution and result inspection

~50 ms to several seconds

After the call, the runtime evaluates the tool response, the result quality, and the presence of side effects, then decides whether to continue, replan, or fall back.

Control point: Capture output, side effects, and reason-coded failures.

6. Trace persistence, stop, or loop continuation

~10-50 ms

Even a successful step should write a trace, update the checkpoint, and pass a stop-condition check so the loop does not keep running by inertia.

Control point: Persist a checkpoint and finish the workflow deterministically.
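
Steps 3 to 6 can be compressed into a short loop. This sketch leaves out classification and state restoration, treats the planner, validator, and executor as injected callables, and uses a hypothetical action shape of `{"tool": ..., "args": ...}`; the returned status strings are also assumptions.

```python
from typing import Callable, Optional

Action = dict  # hypothetical shape: {"tool": str, "args": dict}


def run_loop(
    plan: Callable[[list], Optional[Action]],
    validate: Callable[[Action], Optional[str]],
    execute: Callable[[Action], object],
    max_steps: int,
) -> dict:
    history: list = []   # doubles as the checkpoint in this sketch
    for step in range(max_steps):
        action = plan(history)                     # step 3: next step only
        if action is None:
            return {"status": "done", "steps": step, "history": history}
        problem = validate(action)                 # step 4: stop before execution
        if problem is not None:
            return {"status": "rejected", "reason": problem, "steps": step, "history": history}
        result = execute(action)                   # step 5: bounded execution
        history.append((action["tool"], result))   # step 6: persist progress
    # The step budget is a stop rule: the loop always ends in a known state.
    return {"status": "stopped", "reason": "step_budget_exhausted", "steps": max_steps, "history": history}
```

Every exit path returns an explicit status, so the caller never has to infer why the run ended.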

Execution loop and decision points

[Diagram: execution loop and decision points]
Runtime loop: Plan Step (goal, budget, tool class); Select Tool (capability, next action); Validate Args (schema, ACL, risk code); Execute (sandbox, timeout, idempotency); Inspect Result (output quality, side effects, logs); Decision (retry, approve, stop).
Controlled outcomes: Retry / Replan (fix args, swap tools, or shrink scope); Approval Path (a human confirms the write path or external side effect); Stop / Fallback (budget is exhausted, risk is high, or confidence is too low).
Edge labels: runtime loop; controlled outcomes; loop back.

Validation should be explicit

Schema, ACL, and risk checks before execution reduce expensive and unsafe tool calls.

Every result needs a decision node

The agent should not treat any output as an automatic justification for the next side effect.

Retry is not a blind repeat

A useful retry changes scope, arguments, or the tool rather than running the exact same call again.
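
A retry that shrinks scope instead of repeating the call could look like the sketch below. `ToolError`, the `RESULT_TOO_LARGE` code, and the `limit` argument are hypothetical; argument values are assumed to be hashable so identical calls can be detected.

```python
from typing import Callable


class ToolError(Exception):
    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


def call_with_adjusted_retry(tool: Callable[[dict], list], args: dict, max_attempts: int = 3) -> list:
    attempt_args = dict(args)
    seen: set[tuple] = set()
    for _ in range(max_attempts):
        signature = tuple(sorted(attempt_args.items()))
        if signature in seen:
            break   # refuse to send an identical call a second time
        seen.add(signature)
        try:
            return tool(attempt_args)
        except ToolError as err:
            if err.code == "RESULT_TOO_LARGE":
                # Shrink scope: halve the page size before the next attempt.
                attempt_args["limit"] = max(1, attempt_args["limit"] // 2)
            # Other failures leave the arguments unchanged, so the loop stops.
    raise ToolError("RETRY_EXHAUSTED")
```

A failure the retry logic cannot address by changing the call is surfaced immediately, which hands the decision back to the planner or the fallback path.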

Agent workflow examples

The two scenarios below show the difference between a retrieval-first assistant and an agent that can prepare changes, but reaches the write path only after a separate approval step.

Retrieval-first copilot

A workflow for enterprise assistants and knowledge copilots: the main job is to build a grounded answer from retrieval and business read tools without enabling write actions by default.

1. Intent -> classify the request and choose a read-only capability set
2. Retrieval -> search knowledge with ACLs, tenant filters, and freshness checks
3. Business read tools -> reference APIs, service status, runbooks, tickets
4. Grounded answer -> citations, source links, and reason codes if the answer is incomplete
5. Approval only for write actions -> creating tickets, changing state, or launching workflows only after explicit approval
Read-only mode should be the default: the assistant explains and answers, but does not change the world without a separate step.
Retrieval and read tools must be tenant-aware and leave traceable source attribution.
If sources conflict or confidence is low, it is safer to return a partial answer than to invent a complete workflow ending.

Write-capable agent

A workflow for coding agents, internal ops automation, or managed remediation: the agent may prepare a change, but the path to a side effect is separated from planning and protected by approval gates.

1. Plan -> break the task into the smallest next write step and assign a risk tier
2. Dry-run / sandbox -> execute the verifiable action in a safe copy of the environment or a limited scope
3. Human approval -> show the diff, impact summary, cost/risk, and request approval for the real side effect
4. Side effect -> apply the change, call an external API, or update system state
5. Validation -> check expectations after the change: tests, health checks, consistency constraints
6. Rollback / escalation -> rollback, hand off to a human, or switch back to read-only mode on failure
Approval should happen after the dry run and before the real side effect; otherwise the human approves a plan that is still too abstract.
The write path should leave an audit trail: who approved, what changed, which checks passed, and what was rolled back.
The rollback path must be designed explicitly; otherwise agent autonomy only works in one direction.
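
The write path above can be sketched against an in-memory dictionary standing in for real system state. `guarded_write` and its callbacks are hypothetical; the approval callback sees only the dry-run diff, and a failed health check restores the previous value.

```python
from typing import Callable

_MISSING = object()   # distinguishes "key absent" from "value is None"


def guarded_write(
    store: dict,
    key: str,
    new_value: object,
    approve: Callable[[dict], bool],
    healthy: Callable[[dict], bool],
) -> dict:
    old = store.get(key, _MISSING)
    # Dry run: compute the diff without touching the store.
    diff = {"key": key, "old": None if old is _MISSING else old, "new": new_value}
    audit = [("dry_run", diff)]
    if not approve(diff):
        audit.append(("denied", None))
        return {"status": "denied", "audit": audit}
    audit.append(("approved", None))
    store[key] = new_value               # the real side effect
    audit.append(("applied", None))
    if healthy(store):
        audit.append(("validated", None))
        return {"status": "applied", "audit": audit}
    # Rollback: restore exactly the previous state.
    if old is _MISSING:
        del store[key]
    else:
        store[key] = old
    audit.append(("rolled_back", None))
    return {"status": "rolled_back", "audit": audit}
```

The audit list records each transition in order, which is the minimum needed to answer who approved what and what was undone.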

Tooling roles and concrete examples

This is not a mandatory stack. It is a map of common layers: first why the layer exists, then which concrete tools or approaches are often used, and what trade-off they introduce.

Orchestration
Why it exists: Coordinates step lifecycle, budgets, retries, pause/resume, and stop conditions.
Examples: custom state machine, LangGraph, Temporal
Best fit: Useful when the workflow is multi-step, recoverable, and needs to live longer than one model call.
Main trade-off: The stronger the orchestration framework, the higher the cost of schemas, migrations, and operational ownership of the runtime itself.

Tool contract / registry
Why it exists: Describes capabilities as typed actions with schema, scope, and risk metadata.
Examples: Model Context Protocol, OpenAPI/JSON Schema, internal capability registry
Best fit: Useful when tools grow quickly and need to be exposed to the model as a limited, explainable surface area.
Main trade-off: More formal contracts reduce chaos, but make fast ad-hoc tool evolution more expensive.

Sandboxed execution
Why it exists: Isolates real command execution, code execution, or external API side effects from the model loop.
Examples: isolated worker, Docker, Firecracker microVM
Best fit: Needed whenever the agent can run code, shell commands, filesystem actions, or expensive integrations.
Main trade-off: Stronger isolation lowers blast radius, but adds latency, infrastructure cost, and debugging complexity.

State / memory
Why it exists: Stores session context, checkpoint state, durable memory, and recovery cursors.
Examples: Redis, Postgres, workflow history store, vector database
Best fit: Needed when workflows must resume after pauses, scale beyond one step, or reuse facts across runs.
Main trade-off: More memory improves continuity, but raises the risk of stale context, inconsistent recall, and data-governance debt.

Observability / evaluation
Why it exists: Connects model calls, tool spans, approvals, and outcomes into one trace and one regression loop.
Examples: OpenTelemetry, Langfuse, Phoenix, custom replay harness
Best fit: Useful in production runtimes where it is important not only to see success, but to investigate concrete failures.
Main trade-off: The more detailed the tracing, the higher the storage cost, redaction burden, and analytics complexity.

Policy / approval
Why it exists: Turns risk decisions into a separate architectural layer instead of text inside a prompt.
Examples: Open Policy Agent, Cedar, approval queue, custom review service
Best fit: Needed whenever write actions, money, PII, tenant boundaries, or production state cannot be fully trusted to the model.
Main trade-off: Stronger guardrails reduce risk, but can slow UX and reduce perceived autonomy if the rules are designed too coarsely.

State, control plane, and safe fallback

This is where the separation between short-lived context, durable memory, and safe fallback really matters. Without it, the system accumulates stale facts and fails badly under interruption.

[Diagram: state, control plane, and safe fallback]
Components: Transient Context (current prompt, retrieved context, session vars); Agent Runtime (planner + executor + decision loop); Run State (checkpoint, step history, recovery cursor); Long-Term Memory (durable facts, embeddings, prior outcomes); Policy Gates (ACL, approval rules, cost and risk limits); Audit Log (tool traces, approvals, outputs, incident evidence); Safe Fallback (human takeover, deterministic path, rollback).
Edge labels: request data; checkpoint / resume; memory read; allow / deny; trace write; escalate / rollback.

Transient context is not durable state

You can discard request context, but run state is required to resume after failure and to investigate a run repeatably.

Policy gates verify permission to act

Guardrails matter as a separate decision layer with reason codes, not just as text inside the prompt.

Fallback should be predictable

When risk is high, the system should end in a deterministic path: human handoff, rollback, or a read-only answer.

Failures, approval paths, and graceful degradation

A good agent architecture is not measured by the number of autonomous steps it can take, but by the quality of its controlled degradation: whether it can stop, explain the reason, hand context to a human, and finish the workflow without a hidden side effect. That is best tested on replay sets, not only on one-off happy-path demos.

Schema mismatch and invalid tool output

Symptom: The model chooses the right capability class, but fails on arguments or receives a partially broken response.

Architectural response: Make validation and normalization a separate step, and require replanning after an invalid response instead of using blind retry.

Loop drift and budget burn

Symptom: The agent keeps calling similar tools without actually getting closer to task completion.

Architectural response: Limit step count, introduce budget-burn alerts, and define stop rules that switch the workflow into human handoff or deterministic fallback.
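
Stop rules of this kind can be expressed as a small counter object. The sketch assumes hashable argument values and uses hypothetical names and reason codes (`StopRules`, `LOOP_DRIFT`, `BUDGET_EXHAUSTED`, `STEP_LIMIT`); what happens after a stop, such as human handoff, is left to the caller.

```python
from collections import Counter
from typing import Optional


class StopRules:
    def __init__(self, max_steps: int, max_cost: int, max_repeats: int) -> None:
        self.max_steps = max_steps
        self.max_cost = max_cost
        self.max_repeats = max_repeats
        self.steps = 0
        self.cost = 0
        self._seen: Counter = Counter()

    def record(self, tool: str, args: dict, cost: int) -> Optional[str]:
        """Record one executed step; return a stop reason, or None to continue."""
        self.steps += 1
        self.cost += cost
        signature = (tool, tuple(sorted(args.items())))
        self._seen[signature] += 1
        if self._seen[signature] > self.max_repeats:
            return "LOOP_DRIFT"          # same call repeated without progress
        if self.cost >= self.max_cost:
            return "BUDGET_EXHAUSTED"
        if self.steps >= self.max_steps:
            return "STEP_LIMIT"
        return None
```

Exact-match signatures only catch literal repeats; detecting near-duplicate calls would need a similarity measure that this sketch does not attempt.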

Stale memory and incorrect state restoration

Symptom: The workflow continues from an outdated checkpoint or uses facts that no longer match the current world.

Architectural response: Version checkpoints and memory records, track source freshness, and require rehydration before any risky action.

Unsafe side effect before approval

Symptom: A write action starts before a human has seen the impact summary, diff, or error cost.

Architectural response: Separate dry-run and real execution architecturally: approval should unlock a distinct write capability rather than confirm an already-running call.
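
One way to make approval unlock a distinct capability is a single-use token bound to the exact diff the human saw. `ApprovalGate` and its error codes are hypothetical, and the sketch keeps grants in memory only.

```python
import hashlib
import secrets
from typing import Callable


def _digest(diff: str) -> str:
    return hashlib.sha256(diff.encode("utf-8")).hexdigest()


class ApprovalGate:
    def __init__(self) -> None:
        self._grants: dict[str, str] = {}   # one-time token -> digest of approved diff

    def approve(self, dry_run_diff: str) -> str:
        """Called by the human review step; returns a token for exactly this diff."""
        token = secrets.token_hex(16)
        self._grants[token] = _digest(dry_run_diff)
        return token

    def execute_write(self, token: str, diff: str, apply: Callable[[str], str]) -> str:
        approved = self._grants.pop(token, None)   # single use: removed on first attempt
        if approved is None:
            raise PermissionError("NO_APPROVAL")
        if approved != _digest(diff):
            raise PermissionError("DIFF_CHANGED_AFTER_APPROVAL")
        return apply(diff)
```

Because the write function cannot run without a token, there is no already-running call for the approval to confirm.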

Downstream tool failure or provider outage

Symptom: A required tool becomes unavailable, hangs, or returns unpredictable output on a critical step.

Architectural response: Prepare the fallback path ahead of time: an alternative tool, a read-only answer, delayed execution, or operator escalation.
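
A fallback path prepared ahead of time can be as simple as an ordered list of options that ends in escalation. `run_with_fallbacks` and the path names are hypothetical; a production version would distinguish failure types rather than catching every exception.

```python
from typing import Callable, Sequence


def run_with_fallbacks(options: Sequence[tuple[str, Callable[[], str]]]) -> dict:
    failures = []
    for name, call in options:
        try:
            return {"path": name, "result": call(), "failures": failures}
        except Exception as exc:
            # Keep the failure for the trace, then try the next prepared option.
            failures.append((name, type(exc).__name__))
    # Nothing worked: end in a deterministic path instead of looping.
    return {"path": "operator_escalation", "result": None, "failures": failures}
```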

Anti-patterns

Giving the model a wide list of tools and hoping the right prompt will constrain behavior on its own.
Storing prompt, run state, tool logs, and long-term memory in one context blob without lifecycle or ownership boundaries.
Treating an approval UI as sufficient protection if the side effect is already prepared or partially executed before approval.
Measuring quality only by completed runs without analyzing rollback, denied actions, replay regressions, and hidden cost.

Practical recommendations

Design the agent as a runtime with capability boundaries rather than a chain of prompt heuristics.
Keep read-only mode as the default, and unlock write capabilities only through distinct risk tiers and approval paths.
Log not just the final outcome, but each decision point: why a tool was chosen, why a step was rejected, and why the loop stopped.
Collect replay sets and failure buckets by type so you can improve not only the prompt, but also contracts, policies, recovery semantics, and fallback.

Related materials

  • Model Context Protocol - A practical standard for describing tools and resources as typed contracts for model runtimes.
  • Temporal Workflows - A reference for durable long-running workflows, replay, and recovery semantics.
  • Open Policy Agent - A policy-as-code approach for capability decisions, approval logic, and reason-coded deny paths.
  • OpenTelemetry - A foundation for traces and metrics across model calls, tool spans, approvals, and final outcomes.
  • OWASP GenAI Security Project - Reference material on prompt injection, tool abuse, and threat modeling for LLM and agent systems.
