Once an agent is allowed to call tools, the architecture stops being a prompt flow and becomes a system with permissions, state, and budgets.
The chapter shows how to design planning loops, tool registries, approval gates, retries, and stopping rules so the agent adds value instead of uncontrolled operational complexity.
In interviews, it helps you discuss a concrete runtime rather than an abstract agent: who can call what, how arguments are validated, and how the system degrades safely under failure.
Practical value of this chapter
Runtime shape
Translate agent loops and tool-calling concepts into architecture decisions about permissions, state, budgets, and safe-failure paths.
Autonomy control
Evaluate the system through tool reliability, autonomy control, latency, cost, and operational risk.
Interview material
Frame the answer around the agent loop, tool registry, approvals, and stopping conditions, showing where constraints appear and how you manage them.
System trade-offs
Make trade-offs explicit in agent loops, tool calling, and controlled action execution: experiment speed, quality, explainability, resource budget, and maintenance complexity.
Once an agent is allowed to call tools, read external context, and affect system state, the problem is no longer “a clever prompt.” It becomes a small runtime with permissions, budget tiers, checkpoint state, and explicit safe-failure paths.
The most useful way to discuss that system is through capability boundaries, recovery semantics, approvals, and observability. Those layers determine whether the agent remains controllable under load, during failures, and on the write path.
Capability boundary
The main decision is not just which model to use, but which actions the agent is allowed to see and invoke at all.
Recovery semantics
You need to define ahead of time how the workflow resumes after a timeout, tool failure, or partially completed step.
Human override
Write paths, costly actions, and external side effects should have an explicit approval path and a clear manual takeover option.
Runtime building blocks
The easiest way to reason about an agent system is layer by layer: who chooses the next step, who checks access, who executes the action, and who owns recovery after failure.
Planner / Orchestrator
The orchestrator receives intent, current run state, and the active budget tier, then decides only the next step instead of trying to predict the full workflow up front.
Key contract: Input: task, history, capabilities, and risk context. Output: one next action with a reason code and a budget boundary.
What to watch: Loop length, budget burn rate, average replanning depth, and the share of runs that reach a stop condition without human intervention.
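The contract above can be sketched as a pair of small types. This is only an illustration: `RunState`, `NextAction`, `plan_next_step`, and the `search_docs` capability are hypothetical names, and the planner is a hard-coded rule standing in for a model call.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NextAction:
    """One step proposed by the orchestrator (hypothetical shape)."""
    capability: str       # a registry capability name, or "stop"
    arguments: dict
    reason_code: str      # why this step was chosen
    max_cost_units: int   # budget boundary for this single step


@dataclass
class RunState:
    task: str
    history: list = field(default_factory=list)  # actions already taken
    capabilities: tuple = ()                     # what this run may see
    budget_left: int = 0
    risk_context: str = "read_only"


def plan_next_step(state: RunState) -> NextAction:
    """Stand-in for a model call: returns exactly one next action."""
    if state.budget_left <= 0:
        return NextAction("stop", {}, "BUDGET_EXHAUSTED", 0)
    if not state.history and "search_docs" in state.capabilities:
        return NextAction("search_docs", {"query": state.task},
                          "NEED_CONTEXT", min(5, state.budget_left))
    return NextAction("stop", {}, "TASK_COMPLETE", 0)
```

The point of the shape is that the planner never returns a list of steps: every call yields one action, one reason code, and one budget boundary that the rest of the runtime can check.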
Tool registry and contracts
The tool registry defines a typed surface for the model: capability names, input/output schema, access scope, and risk class.
Key contract: Every tool should have an argument schema, response contract, action allowlist, and an explicit description of side effects.
What to watch: Schema violations, denied tool calls, stale contract definitions, and the share of calls that produced no useful outcome.
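A minimal sketch of such a registry, assuming a deliberately simple schema (argument name mapped to a Python type) rather than a full JSON Schema validator. `ToolSpec`, `ToolRegistry`, and the example tools are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class RiskClass(Enum):
    READ = "read"
    WRITE = "write"
    EXTERNAL = "external"


@dataclass(frozen=True)
class ToolSpec:
    name: str
    required_args: dict   # argument name -> expected Python type
    scope: str            # access scope the caller must hold
    risk: RiskClass
    side_effects: str     # human-readable description


class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def visible_to(self, scopes: set) -> list:
        """Only capabilities inside the caller's scopes are shown to the model."""
        return sorted(n for n, s in self._tools.items() if s.scope in scopes)

    def validate(self, name: str, args: dict) -> list:
        """Return schema violations; an empty list means the call is well-formed."""
        spec = self._tools.get(name)
        if spec is None:
            return [f"unknown tool: {name}"]
        errors = []
        for arg, typ in spec.required_args.items():
            if arg not in args:
                errors.append(f"missing argument: {arg}")
            elif not isinstance(args[arg], typ):
                errors.append(f"wrong type for {arg}: expected {typ.__name__}")
        errors += [f"unexpected argument: {a}" for a in args
                   if a not in spec.required_args]
        return errors
```

Counting the non-empty results of `validate` gives the schema-violation metric directly, and `visible_to` is where the tool surface shrinks before the model ever sees it.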
Executor / sandbox
The execution layer turns the orchestrator's decision into a concrete command, API call, or internal service action without exposing the model to an unlimited runtime.
Key contract: Each call runs in a bounded environment with timeouts, retry policy, idempotency expectations, and a log of actual side effects.
What to watch: Tool latency, timeout rate, repeated calls for the same step, side-effect success rate, and the share of manual rollbacks after execution.
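One way to picture the bounded call is a wrapper that enforces a timeout and remembers outcomes by idempotency key. This is a sketch under assumptions: the `Executor` class and its fields are hypothetical, and a thread pool only bounds the wait, so real isolation would still need a process or container boundary.

```python
import concurrent.futures
import time


class Executor:
    """Runs tool callables with a timeout and an idempotency cache (sketch)."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self._results = {}         # idempotency_key -> recorded outcome
        self.side_effect_log = []  # steps that actually completed
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def run(self, idempotency_key: str, tool, **kwargs) -> dict:
        if idempotency_key in self._results:
            # Same step replayed: return the recorded outcome, do not re-execute.
            return self._results[idempotency_key]
        future = self._pool.submit(tool, **kwargs)
        try:
            value = future.result(timeout=self.timeout_s)
        except concurrent.futures.TimeoutError:
            # The wait is abandoned, but the worker thread keeps running:
            # the outcome is unknown, so nothing is cached or logged.
            return {"status": "timeout", "value": None}
        except Exception as exc:
            return {"status": "error", "value": repr(exc)}
        outcome = {"status": "ok", "value": value}
        self.side_effect_log.append((idempotency_key, tool.__name__))
        self._results[idempotency_key] = outcome
        return outcome
```

Calling `run("step-1", some_tool, ...)` twice executes `some_tool` once; the second call returns the stored outcome, which is what keeps a resumed workflow from repeating a side effect.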
Policy and approvals
This layer separates read-only flows from actions that can cost money, change data, or trigger an external operation.
Key contract: Each action receives a risk tier, approval requirement, deny path, and a safe rejection reason that is understandable to both the user and the operator.
What to watch: Approval rate, denied actions, false-positive blocking, human review latency, and the number of write paths that ran without required approval.
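As a rough illustration, the policy decision can be a pure function that returns an outcome plus a reason code. The tier names, reason codes, and `decide` signature below are assumptions made for the example, not a reference to any specific policy engine.

```python
from dataclasses import dataclass

RISK_TIERS = {"read": 0, "write": 1, "external": 2}  # illustrative tiers


@dataclass(frozen=True)
class Decision:
    outcome: str      # "allow", "needs_approval", or "deny"
    reason_code: str
    message: str      # safe to show to both the user and the operator


def decide(action_risk: str, caller_scopes: set, required_scope: str,
           approved: bool = False) -> Decision:
    if action_risk not in RISK_TIERS:
        return Decision("deny", "UNKNOWN_RISK",
                        "This action has no risk classification.")
    if required_scope not in caller_scopes:
        return Decision("deny", "SCOPE_MISSING",
                        f"Requires scope {required_scope}.")
    if RISK_TIERS[action_risk] == 0:
        return Decision("allow", "READ_ONLY", "Read-only action.")
    if approved:
        return Decision("allow", "APPROVED", "A reviewer approved this action.")
    return Decision("needs_approval", "REVIEW_REQUIRED",
                    "Waiting for human review.")
```

Because every branch carries a reason code, the metrics listed above (denied actions, approval rate, false-positive blocking) can be counted from the decisions themselves rather than reconstructed from logs.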
State and memory
Transient context, checkpoint state, and long-term memory should not be blended together: they have different lifecycles and serve different decisions.
Key contract: Short-lived context stores current-request data, run state stores step progress, and durable memory stores facts and prior outcomes.
What to watch: Resume success rate, stale memory hits, checkpoint size, and drift between the current world and stored memory.
Observability and evaluation
An agent runtime should be observed like a distributed system: model spans, tool spans, approvals, failures, and outcomes belong in one trace.
Key contract: One trace id links intent, tool calls, approvals, fallback, and final outcome, while replay/eval pipelines test regressions before rollout.
What to watch: Task completion, rollback rate, grounded answer rate, replay regressions, and the cost of a useful workflow rather than the cost of a single model call.
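To make the single-trace idea concrete, here is a small in-memory sketch. It does not use a real tracing SDK: the `Trace` class and its span fields are hypothetical, and a production runtime would export the same records through its tracing backend.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collects model, tool, and approval spans under one trace id."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, kind: str, name: str, **attrs):
        record = {"trace_id": self.trace_id, "kind": kind, "name": name,
                  "attrs": attrs, "status": "ok"}
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            # Record the failure on the span, then let the caller handle it.
            record["status"] = f"error:{type(exc).__name__}"
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)
```

A run would wrap each phase in `with trace.span("model", "plan"):` or `with trace.span("tool", "search_docs", query=...):`, so a failed tool call and the approval that preceded it end up in the same record set.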
Reference agent runtime architecture
The orchestrator should not do everything
It decides the next step and budget, while execution and side effects stay behind a separate boundary.
Registry and policy reduce tool surface area
The agent only sees typed capabilities, and dangerous actions are routed through an approval path.
State and traces are separate systems
Without checkpoints and an audit trail, the loop is hard to recover and almost impossible to investigate.
Execution path: from intent to a safe action
It helps to read the runtime not as one monolithic workflow, but as a sequence of steps with checks, budgets, and checkpoints that let the system resume safely after failure or pause.
1. Intent classification and budget tier assignment
~20-80 ms. At the start, the system decides which capabilities are visible at all, which budget is allowed, and which stop rules apply to this workflow.
Control point: Decide whether the run is read-only, write-capable, or approval-required.
2. Context and run-state restoration
~20-120 ms. If the run continues after a failure or pause, the runtime should restore only relevant state instead of mixing it with stale memory.
Control point: Hydrate session context, checkpoint, and the facts required from memory.
3. Planning the next step only
~100-500 ms. A good orchestrator formulates the next verifiable action unit: which tool is needed, what input is expected, and how to tell whether the step succeeded.
Control point: Choose a capability, not a full workflow script.
4. Argument, ACL, and risk-code validation
~10-70 ms. Before the real tool call, the runtime checks argument schema, access rules, action cost, and whether human approval is required.
Control point: Stop the dangerous call before execution, not after it.
5. Sandboxed execution and result inspection
~50 ms to seconds. After the call, the runtime evaluates the tool response, the result quality, and the presence of side effects, then decides whether to continue, replan, or fall back.
Control point: Capture output, side effects, and reason-coded failures.
6. Trace persistence, stop, or loop continuation
~10-50 ms. Even a successful step should write a trace, update the checkpoint, and go through a stop condition so the loop does not keep running by inertia.
Control point: Persist a checkpoint and finish the workflow deterministically.
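The six steps can be compressed into a small driver loop. This is a sketch, not a reference implementation: `run_loop` and its callable parameters are hypothetical, the checkpoint is a plain dict, and intent classification and state restoration (steps 1 and 2) are assumed to have happened before the loop starts.

```python
def run_loop(task, plan, validate, execute, max_steps=5):
    """Drive plan -> validate -> execute -> checkpoint until a stop condition."""
    checkpoint = {"task": task, "steps": [], "status": "running"}
    for _ in range(max_steps):
        action = plan(checkpoint)          # step 3: plan the next step only
        if action["capability"] == "stop":
            checkpoint["status"] = "done"
            return checkpoint
        problems = validate(action)        # step 4: checks before execution
        if problems:
            checkpoint["steps"].append({"action": action, "rejected": problems})
            checkpoint["status"] = "handoff"
            return checkpoint
        result = execute(action)           # step 5: bounded execution
        # Step 6: record the step so a resumed run knows what already happened.
        checkpoint["steps"].append({"action": action, "result": result})
    checkpoint["status"] = "step_limit"    # hard stop, no loop by inertia
    return checkpoint
```

Every exit path sets an explicit `status`, which is the deterministic finish the last control point asks for: `done`, `handoff`, or `step_limit`.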
Execution loop and decision points
Validation should be explicit
Schema, ACL, and risk checks before execution reduce expensive and unsafe tool calls.
Every result needs a decision node
The agent should not treat any output as an automatic justification for the next side effect.
Retry is not a blind repeat
A useful retry changes scope, arguments, or the tool rather than running the exact same call again.
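A small sketch of that retry rule, assuming each attempt is a `(tool, kwargs)` pair with hashable argument values. `retry_with_changes` is a hypothetical helper that skips any attempt identical to one that already failed.

```python
def retry_with_changes(attempts):
    """Try each (tool, kwargs) variant once; never repeat an identical call."""
    seen = set()
    errors = []
    for tool, kwargs in attempts:
        signature = (tool.__name__, tuple(sorted(kwargs.items())))
        if signature in seen:
            continue  # identical call already failed; skip the blind repeat
        seen.add(signature)
        try:
            return {"status": "ok", "tool": tool.__name__,
                    "value": tool(**kwargs)}
        except Exception as exc:
            errors.append(f"{tool.__name__}: {exc}")
    return {"status": "failed", "errors": errors}
```

In practice the attempt list would come from replanning, for example the same search with a narrower `limit`, or a different tool that can answer the same question.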
Agent workflow examples
The two scenarios below show the difference between a retrieval-first assistant and an agent that can prepare changes, but reaches the write path only after a separate approval step.
Retrieval-first copilot
A workflow for enterprise assistants and knowledge copilots: the main job is to build a grounded answer from retrieval and business read tools without enabling write actions by default.
Write-capable agent
A workflow for coding agents, internal ops automation, or managed remediation: the agent may prepare a change, but the path to a side effect is separated from planning and protected by approval gates.
Tooling roles and concrete examples
This is not a mandatory stack. It is a map of common layers: first why the layer exists, then which concrete tools or approaches are often used, and what trade-off they introduce.
| Layer | Why it exists | Examples | Best fit | Main trade-off |
|---|---|---|---|---|
| Orchestration | Coordinates step lifecycle, budgets, retries, pause/resume, and stop conditions. | custom state machine, LangGraph, Temporal | Useful when the workflow is multi-step, recoverable, and needs to live longer than one model call. | The stronger the orchestration framework, the higher the cost of schemas, migrations, and operational ownership of the runtime itself. |
| Tool contract / registry | Describes capabilities as typed actions with schema, scope, and risk metadata. | Model Context Protocol, OpenAPI/JSON Schema, internal capability registry | Useful when tools grow quickly and need to be exposed to the model as a limited, explainable surface area. | More formal contracts reduce chaos, but make fast ad-hoc tool evolution more expensive. |
| Sandboxed execution | Isolates real command execution, code execution, or external API side effects from the model loop. | isolated worker, Docker, Firecracker microVM | Needed whenever the agent can run code, shell commands, filesystem actions, or expensive integrations. | Stronger isolation lowers blast radius, but adds latency, infrastructure cost, and debugging complexity. |
| State / memory | Stores session context, checkpoint state, durable memory, and recovery cursors. | Redis, Postgres, workflow history store, vector database | Needed when workflows must resume after pauses, scale beyond one step, or reuse facts across runs. | More memory improves continuity, but raises the risk of stale context, inconsistent recall, and data-governance debt. |
| Observability / evaluation | Connects model calls, tool spans, approvals, and outcomes into one trace and one regression loop. | OpenTelemetry, Langfuse, Phoenix, custom replay harness | Useful in production runtimes where it is important not only to see success, but to investigate concrete failures. | The more detailed the tracing, the higher the storage cost, redaction burden, and analytics complexity. |
| Policy / approval | Turns risk decisions into a separate architectural layer instead of text inside a prompt. | Open Policy Agent, Cedar, approval queue, custom review service | Needed whenever write actions, money, PII, tenant boundaries, or production state cannot be fully trusted to the model. | Stronger guardrails reduce risk, but can slow UX and reduce perceived autonomy if the rules are designed too coarsely. |
State, control plane, and safe fallback
This is where the separation between short-lived context, durable memory, and safe fallback really matters. Without it, the system accumulates stale facts and fails badly under interruption.
Transient context is not durable state
You can discard request context, but run state is required to resume after failure and to investigate a run repeatably.
Policy gates verify permission to act
Guardrails matter as a separate decision layer with reason codes, not just as text inside the prompt.
Fallback should be predictable
When risk is high, the system should end in a deterministic path: human handoff, rollback, or a read-only answer.
Failures, approval paths, and graceful degradation
A good agent architecture is not measured by the number of autonomous steps it can take, but by the quality of its controlled degradation: whether it can stop, explain the reason, hand context to a human, and finish the workflow without a hidden side effect. That is best tested on replay sets, not only on one-off happy-path demos.
Schema mismatch and invalid tool output
Symptom: The model chooses the right capability class, but fails on arguments or receives a partially broken response.
Architectural response: Make validation and normalization a separate step, and require replanning after an invalid response instead of using blind retry.
Loop drift and budget burn
Symptom: The agent keeps calling similar tools without actually getting closer to task completion.
Architectural response: Limit step count, introduce budget-burn alerts, and define stop rules that switch the workflow into human handoff or deterministic fallback.
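The stop rules above can be sketched as a guard that the loop consults after every step. `LoopGuard`, its thresholds, and the reason codes are illustrative assumptions, and "similar" is simplified here to "same capability with the same arguments".

```python
from collections import Counter
from typing import Optional


class LoopGuard:
    def __init__(self, max_steps: int, budget: float, max_repeats: int = 2):
        self.max_steps = max_steps
        self.budget = budget
        self.max_repeats = max_repeats
        self.steps = 0
        self.spent = 0.0
        self._seen = Counter()

    def record(self, capability: str, arguments: dict,
               cost: float) -> Optional[str]:
        """Record one step; return a stop reason code, or None to continue."""
        self.steps += 1
        self.spent += cost
        key = (capability, repr(sorted(arguments.items())))
        self._seen[key] += 1
        if self._seen[key] > self.max_repeats:
            return "REPEATED_CALL"
        if self.spent > self.budget:
            return "BUDGET_EXCEEDED"
        if self.steps >= self.max_steps:
            return "STEP_LIMIT"
        return None
```

Any non-`None` result is the signal to leave the loop and switch to human handoff or a deterministic fallback, with the reason code attached to the trace.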
Stale memory and incorrect state restoration
Symptom: The workflow continues from an outdated checkpoint or uses facts that no longer match the current world.
Architectural response: Version checkpoints and memory records, track source freshness, and require rehydration before any risky action.
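A sketch of the resume check, assuming a checkpoint carries a schema version and a save timestamp. The `Checkpoint` fields, the version constant, and the decision strings are hypothetical.

```python
from dataclasses import dataclass

CHECKPOINT_SCHEMA_VERSION = 2  # hypothetical current version


@dataclass(frozen=True)
class Checkpoint:
    run_id: str
    schema_version: int
    step: int
    saved_at: float  # unix seconds


def resume_decision(cp: Checkpoint, now: float, max_age_s: float,
                    risky_action: bool) -> str:
    """Decide whether a run may continue from this checkpoint as-is."""
    if cp.schema_version != CHECKPOINT_SCHEMA_VERSION:
        return "reject:SCHEMA_VERSION_MISMATCH"
    if risky_action:
        # Risky steps always re-read the current world before acting.
        return "rehydrate:RISKY_ACTION"
    if now - cp.saved_at > max_age_s:
        return "rehydrate:STALE_CHECKPOINT"
    return "resume:OK"
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to replay old runs against the same rule.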
Unsafe side effect before approval
Symptom: A write action starts before a human has seen the impact summary, diff, or error cost.
Architectural response: Separate dry-run and real execution architecturally: approval should unlock a distinct write capability rather than confirm an already-running call.
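One possible shape for that separation: the dry run returns a digest of the exact plan, approval is granted to that digest, and the write path refuses anything else. `WriteGate` and `plan_digest` are hypothetical names, and the plan is assumed to be JSON-serializable with a `changes` list.

```python
import hashlib
import json


def plan_digest(plan: dict) -> str:
    """Stable fingerprint of a plan, so approval binds to its exact content."""
    return hashlib.sha256(json.dumps(plan, sort_keys=True).encode()).hexdigest()


class WriteGate:
    def __init__(self):
        self._approved = set()
        self.applied = []  # stand-in for real side effects

    def dry_run(self, plan: dict) -> dict:
        """Produce an impact summary without touching real state."""
        return {"digest": plan_digest(plan),
                "summary": f"would update {len(plan['changes'])} record(s)"}

    def approve(self, digest: str) -> None:
        self._approved.add(digest)

    def execute(self, plan: dict) -> str:
        digest = plan_digest(plan)
        if digest not in self._approved:
            raise PermissionError("plan was not approved in this exact form")
        self._approved.discard(digest)  # approval is single-use
        self.applied.append(plan)
        return "applied"
```

Because approval is tied to the digest, a plan that changes after review no longer matches and is rejected, and the write capability cannot start before a reviewer has seen the summary.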
Downstream tool failure or provider outage
Symptom: A required tool becomes unavailable, hangs, or returns unpredictable output on a critical step.
Architectural response: Prepare the fallback path ahead of time: an alternative tool, a read-only answer, delayed execution, or operator escalation.
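A pre-declared fallback path can be as simple as an ordered list that ends in escalation. In this sketch `run_with_fallback` is a hypothetical helper, each step is a zero-argument callable, and the labels are illustrative.

```python
def run_with_fallback(primary, fallbacks):
    """Try the primary step, then each pre-declared fallback, in order."""
    errors = []
    for label, step in [("primary", primary)] + list(fallbacks):
        try:
            return {"path": label, "value": step(), "errors": errors}
        except Exception as exc:
            errors.append(f"{label}: {exc}")
    # Nothing worked: end in a deterministic path instead of looping.
    return {"path": "operator_escalation", "value": None, "errors": errors}
```

The order of `fallbacks` is decided at design time, for example an alternative tool first and a read-only answer second, so an outage never leaves the choice of degradation path to the model.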
Anti-patterns
Practical recommendations
Related materials
- Model Context Protocol - A practical standard for describing tools and resources as typed contracts for model runtimes.
- Temporal Workflows - A reference for durable long-running workflows, replay, and recovery semantics.
- Open Policy Agent - A policy-as-code approach for capability decisions, approval logic, and reason-coded deny paths.
- OpenTelemetry - A foundation for traces and metrics across model calls, tool spans, approvals, and final outcomes.
- OWASP GenAI Security Project - Reference material on prompt injection, tool abuse, and threat modeling for LLM and agent systems.
Related chapters
- Prompt Engineering for LLMs (short summary) - Why prompt engineering quickly becomes context design and then full runtime design.
- An Illustrated Guide to AI Agents (short summary) - A book companion on memory, planning loops, reflection, and the organization of tool use.
- GenAI/RAG System Architecture - A neighboring runtime-first case where retrieval, guardrails, and observability already operate as a production system.
- LLM Guardrails, Prompt Injection, and Safety Patterns - How to turn safety and trust boundaries into a separate architectural layer rather than one rule in the prompt.
- AI Coding Agent Platform - A practical case about sandboxing, tool permissions, approvals, and a controlled write path for coding agents.
