Serving architecture matters not simply because a model must run somewhere, but because this is where answer quality collides with latency, cost, and resilience.
The chapter treats online, batch, and stream inference as distinct operating contracts with different queues, dependencies, and degradation rules.
That perspective helps discuss routing, fallback, and autoscaling as part of product quality rather than as infrastructure in isolation.
Practical value of this chapter
Runtime architecture
Separate the critical path, execution layer, and degradation path, and treat them as parts of one operational system.
Inference economics
Weigh batching, CPU/GPU routing, and autoscaling as trade-offs between latency and cost.
Fallback strategy
Plan a lighter model, cached answers, and safe defaults before the main path starts failing.
Inference modes
Understand when to choose online, batch, or stream inference for different load patterns.
Related chapter
Feature Store & Model Serving
The feature and data plane often shapes tail latency more than the model itself.
Serving is where inference collides with queues, caches, timeouts, and infrastructure cost. This is also where latency limits turn into an explicit budget for each layer. Production ML and AI systems usually fail not because the model is weak, but because the live path was underdesigned and degraded behavior was never made operational.
Inference modes
Online inference
The synchronous user-facing path, where every millisecond affects the product experience and degraded modes must be prepared in advance.
Batch inference
Large score recomputations, backfills, and nightly refreshes. The main risks are queue buildup, stale results, and interference with shared resources.
Stream inference
Scoring events as they arrive in near real time. This keeps outputs fresh, but sharply increases the complexity of state, ordering, and overload handling.
Serving runtime architecture by layers
This diagram breaks the serving runtime into layers, from traffic ingress and routing to model execution, response shaping, and degraded behavior.
What to keep under control
It helps to see serving not only as a chain of services, but as a balance of latency, cost, and resilience across every layer.
- Latency budget
- Inference economics
- Resilience controls
How a request flows through the serving runtime
Below is a comparison of two broader execution paths: the latency-sensitive online path and the batch/stream path used for bulk or event-driven processing.
Comparing the online path with the batch/stream path
1. Intake and routing
The gateway or router accepts the request, checks tenant rules, and decides whether it can enter the path.
Latency-sensitive request path
- The online path is tightly constrained by latency.
- Tail latency and fallback policy are critical.
- Any slow dependency immediately degrades the user experience.
Latency budget decomposition
Request routing
5-15 ms
Admission control, tenant rules, and route selection should consume far less time than the inference path itself.
Feature and context fetch
20-60 ms
This is where tail latency often hides: cache misses, slow feature stores, and unnecessary dependency hops.
Model execution
30-90 ms
CPU versus GPU routing, batching windows, and model size define throughput, tail latency, and the cost envelope.
Post-processing
10-30 ms
Validation, threshold application, response shaping, citations, and policy filters still live on the critical path.
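As a rough illustration, the four ranges above can be added up to see what end-to-end envelope they imply, and observed stage latencies can be checked against the high end of each range. This is a minimal sketch, not a monitoring design: the `StageBudget` type, the stage identifiers, the `over_budget` helper, and the observed values are all hypothetical names and numbers introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageBudget:
    name: str
    low_ms: int   # optimistic end of the range
    high_ms: int  # pessimistic end of the range


# Ranges copied from the decomposition above; identifiers are illustrative.
BUDGET = [
    StageBudget("request_routing", 5, 15),
    StageBudget("feature_and_context_fetch", 20, 60),
    StageBudget("model_execution", 30, 90),
    StageBudget("post_processing", 10, 30),
]


def total_range_ms(stages):
    """Sum the low and high ends across all stages."""
    return sum(s.low_ms for s in stages), sum(s.high_ms for s in stages)


def over_budget(observed_ms, stages):
    """Return the names of stages whose observed latency exceeds the high end."""
    limits = {s.name: s.high_ms for s in stages}
    return [name for name, ms in observed_ms.items() if ms > limits[name]]


low, high = total_range_ms(BUDGET)
print(f"end-to-end envelope: {low}-{high} ms")

# Invented observation: the feature fetch blows its share of the budget.
observed = {
    "request_routing": 8,
    "feature_and_context_fetch": 75,
    "model_execution": 40,
    "post_processing": 12,
}
print("over budget:", over_budget(observed, BUDGET))
```

Summed this way, the ranges imply roughly 65 to 195 ms end to end. Per-stage tail percentiles do not add up exactly, so the sum is better read as a planning envelope than as a predicted p99.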
Execution policy
CPU/GPU routing
Heavy models and high-throughput paths often belong on GPU, but short bursts of traffic may be better served on CPU or by a lighter model.
Batching windows
Batching lowers the cost per request, but it almost always worsens tail latency. You need a hard maximum wait time and separate rules for different traffic classes.
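The hard maximum wait time can be sketched as a batch-forming rule: a batch is dispatched when it is full or when its oldest request has waited `max_wait_ms`, whichever comes first. The simulation below works on a list of arrival timestamps rather than a live queue, and the `POLICIES` table with its per-class values is an assumption made for illustration, not a recommended setting.

```python
# Hypothetical per-class policies: (max_batch_size, max_wait_ms).
POLICIES = {
    "interactive": (4, 10),
    "bulk": (64, 200),
}


def form_batches(arrivals_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times into batches.

    A batch closes when it is full, or when the next request arrives after
    the deadline of the oldest request in the batch.
    Returns a list of (dispatch_time_ms, [arrival_ms, ...]) tuples.
    """
    batches = []
    current = []
    for t in arrivals_ms:
        if current and t > current[0] + max_wait_ms:
            # The deadline passed before this request arrived: flush at the deadline.
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            # Full batch: dispatch immediately, at the time of the last arrival.
            batches.append((t, current))
            current = []
    if current:
        # Leftover partial batch is flushed when its oldest request hits the deadline.
        batches.append((current[0] + max_wait_ms, current))
    return batches


def max_queue_wait_ms(batches):
    """Longest time any single request spent waiting for its batch to dispatch."""
    return max(dispatch - arrival for dispatch, items in batches for arrival in items)


arrivals = [0, 2, 3, 20, 21, 22, 23, 50]
for traffic_class, (size, wait) in POLICIES.items():
    batches = form_batches(arrivals, size, wait)
    print(traffic_class, batches, "worst wait:", max_queue_wait_ms(batches), "ms")
```

With these inputs the interactive policy never holds a request longer than 10 ms, while the bulk policy packs everything into one batch at the price of a 200 ms wait for the earliest request. That is the cost-versus-tail-latency trade in miniature.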
Admission control
The queue must limit or shed lower-priority traffic before the whole serving path starts suffocating under total load.
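One simple way to shed lower-priority traffic first is to give each priority class its own ceiling on queue depth. The sketch below assumes a single bounded queue and two priority labels; the `AdmissionController` class, the `shed_fraction` parameter, and the threshold values are hypothetical and only illustrate the idea.

```python
from collections import deque


class AdmissionController:
    """Bounded queue that starts rejecting lower-priority traffic first."""

    def __init__(self, capacity, shed_fraction):
        # shed_fraction maps priority -> fraction of capacity at which
        # requests of that priority start being rejected.
        self.capacity = capacity
        self.shed_fraction = shed_fraction
        self.queue = deque()
        self.rejected = {priority: 0 for priority in shed_fraction}

    def try_admit(self, request_id, priority):
        """Admit the request unless the queue is too deep for its priority."""
        limit = int(self.capacity * self.shed_fraction[priority])
        if len(self.queue) >= limit:
            self.rejected[priority] += 1
            return False
        self.queue.append((request_id, priority))
        return True

    def take_next(self):
        """Hand the oldest admitted request to a worker, or None if idle."""
        return self.queue.popleft() if self.queue else None


# Assumed thresholds: low priority is shed once the queue is half full.
ctl = AdmissionController(capacity=10, shed_fraction={"high": 1.0, "low": 0.5})
low_results = [ctl.try_admit(f"low-{i}", "low") for i in range(6)]
high_results = [ctl.try_admit(f"high-{i}", "high") for i in range(6)]
print("low: ", low_results)
print("high:", high_results)
print("rejected:", ctl.rejected)
```

Rejecting at admission time keeps queue wait bounded for the traffic that does get in, and the rejection counters feed directly into the rejected-request percentage tracked alongside other serving metrics.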
Warm pools and autoscaling
If the heavy path warms up too slowly, autoscaling without warm capacity gives you a pretty graph and a poor user experience.
Degraded modes
- A cached answer or recent score for read paths that are sensitive to latency.
- A reduced feature set when the feature store or an external dependency degrades.
- A lighter model instead of the primary GPU-heavy path.
- A predefined fallback with a safe baseline when no inference path can be confirmed in time.
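The degraded modes above can be arranged as an ordered fallback chain under one overall deadline. The following is a minimal sketch, assuming each handler enforces its own timeout using the remaining time it is given; the handler names, the order of the chain, the in-memory `CACHE`, and the default score are all hypothetical.

```python
import time


def serve_with_fallback(request, steps, safe_default, deadline_s):
    """Try each (mode_name, handler) in order until one answers in time.

    A handler receives the request and the remaining time in seconds. It
    returns a result, or raises / returns None to signal it cannot answer.
    """
    start = time.monotonic()
    for mode, handler in steps:
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            break  # budget exhausted: stop trying inference paths
        try:
            result = handler(request, remaining)
        except Exception:
            continue  # a real system would log and count this failure
        if result is not None:
            return mode, result
    return "safe_default", safe_default


# Simulated handlers for the example only.
def primary_gpu_model(request, remaining_s):
    raise TimeoutError("primary path too slow")


def lighter_model(request, remaining_s):
    return None  # pretend the lighter model is unavailable


CACHE = {"user-42": 0.81}


def cached_score(request, remaining_s):
    return CACHE.get(request)


STEPS = [
    ("primary", primary_gpu_model),
    ("lighter_model", lighter_model),
    ("cached", cached_score),
]

print(serve_with_fallback("user-42", STEPS, safe_default=0.5, deadline_s=0.2))
print(serve_with_fallback("user-7", STEPS, safe_default=0.5, deadline_s=0.2))
```

Returning the mode name together with the result makes it possible to count how often each degraded path actually serves traffic, which is what turns a fallback plan into something observable.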
Unit economics metrics
- cost per 1K requests or per successfully resolved task
- GPU utilization and batching efficiency
- share of traffic on the lighter model and fallback frequency
- queue wait time, timeout rate, and rejected-request percentage
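Most of these metrics reduce to ratios over a handful of counters. The sketch below shows one possible set of definitions; the `ServingCounters` fields, the choice to define batching efficiency as mean batch size divided by maximum batch size, and the sample numbers are assumptions for illustration. Queue wait time is left out because it needs a latency distribution rather than a counter.

```python
from dataclasses import dataclass


@dataclass
class ServingCounters:
    total_requests: int
    rejected: int
    timeouts: int
    fallback_served: int
    light_model_served: int
    total_cost_usd: float
    gpu_busy_seconds: float
    gpu_provisioned_seconds: float
    batches_dispatched: int
    requests_batched: int
    max_batch_size: int


def unit_economics(c):
    """Derive ratio metrics from raw counters over one reporting window."""
    mean_batch = c.requests_batched / c.batches_dispatched
    return {
        "cost_per_1k_requests_usd": c.total_cost_usd * 1000 / c.total_requests,
        "gpu_utilization": c.gpu_busy_seconds / c.gpu_provisioned_seconds,
        # Assumed definition: how full batches are relative to the cap.
        "batching_efficiency": mean_batch / c.max_batch_size,
        "light_model_share": c.light_model_served / c.total_requests,
        "fallback_rate": c.fallback_served / c.total_requests,
        "timeout_rate": c.timeouts / c.total_requests,
        "rejected_pct": 100 * c.rejected / c.total_requests,
    }


# Invented counters for a single window.
window = ServingCounters(
    total_requests=200_000,
    rejected=2_000,
    timeouts=1_000,
    fallback_served=4_000,
    light_model_served=30_000,
    total_cost_usd=50.0,
    gpu_busy_seconds=5_400.0,
    gpu_provisioned_seconds=7_200.0,
    batches_dispatched=10_000,
    requests_batched=60_000,
    max_batch_size=8,
)

for name, value in unit_economics(window).items():
    print(f"{name}: {value:.4f}")
```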
Key trade-offs
- Leaning on GPU raises throughput, but it also makes capacity planning and cost forecasting harder.
- A larger batching window reduces inference cost, but it almost always hurts p99 and interactive user flows.
- Aggressive caching helps absorb spikes, but it increases stale-response risk and can hide degradation in the primary path.
- A unified serving stack reduces duplication, but it expands the blast radius across models and product surfaces.
Common mistakes
Recommendations
Related chapters
- Feature Store & Model Serving - A chapter about the feature plane and offline-online consistency that often defines the main tail-latency risks.
- Model Release, Calibration, and Experiment Loops - How staged rollout changes not just model weights, but also serving configuration, routing, and the live latency budget.
- The history of Google TPUs and their evolution - Why accelerator economics and hardware choice directly shape serving architecture.
- ML Ops Pipeline - How the inference path fits into the broader model lifecycle, monitoring, and retraining flow.
