Model Serving and Inference Architecture

Serving architecture matters not simply because a model must run somewhere, but because this is where answer quality collides with latency, cost, and resilience.

The chapter treats online, batch, and stream inference as distinct operating contracts with different queues, dependencies, and degradation rules.

That perspective makes it possible to discuss routing, fallback, and autoscaling as part of product quality rather than as isolated infrastructure.

Practical value of this chapter

Runtime architecture

Separate the critical path, execution layer, and degradation path, while treating them as parts of a single runtime.

Inference economics

Evaluate batching, CPU/GPU routing, and autoscaling as trade-offs between latency and cost.

Fallback strategy

Plan a lighter model, cached answers, and safe defaults before the main path starts failing.

Inference modes

Understand when to choose online, batch, or stream inference for different load patterns.

Related chapter

Feature Store & Model Serving

The feature and data plane often shapes tail latency more than the model itself.

Serving is where inference collides with queues, caches, timeouts, and infrastructure cost. This is also where latency limits turn into an explicit budget for each layer. Production ML and AI systems usually fail not because the model is weak, but because the live path was underdesigned and degraded behavior was never made operational.

Inference modes

Online inference

The user waits on the answer synchronously, and every millisecond shows up in the product. Degraded modes have to be ready ahead of time, not improvised during the first spike.

Batch inference

Batch inference covers large score recomputations, backfills, and nightly refreshes. The main risks are queue buildup, stale results, and interference with shared resources.

Stream inference

Stream inference scores events as they arrive, in near real time. This keeps outputs fresh, but sharply increases the complexity of state, ordering, and overload handling.

Serving runtime architecture by layers

It helps to read the serving runtime by layers: traffic ingress and routing, model execution, response shaping, and degraded behavior. Each layer adds its own slice of latency and its own failure point.

Clients and traffic ingress

API / SDK, Gateway, Auth, Rate limits

Routing and policy

Request router, Tenant rules, Admission control, Route selection

Context and features

Cache, Feature store, Retrieval, Context assembly

Model execution

CPU/GPU, Batching, Concurrency limits, Workers

Post-processing and response

Thresholds, Validation, Formatting, Response shaping

Degradation and recovery

Fallback, Light model, Safe defaults, Recovery

What to keep under control

It helps to see serving not only as a chain of services, but as a balance of latency, cost, and resilience across every layer.

Latency budget

p95/p99, queue wait, feature fetch, post-processing

Inference economics

GPU utilization, batch efficiency, cost per 1K requests

Resilience controls

fallback rate, degraded modes, warm capacity, recovery time

How a request flows through the serving runtime

The same request takes different routes depending on the mode: the latency-sensitive online path and the batch/stream path used for bulk or event-driven processing live by different rules.

Comparing the online path with the batch/stream path

1. Intake and routing

The gateway or router accepts the request, checks tenant rules, and decides whether it can enter the path.

Latency-sensitive request path

The online path is tightly constrained by latency.
Tail latency and fallback policy are critical.
Any slow dependency immediately hits UX.

Latency budget, Fallback, Tail latency

Latency budget decomposition

Request routing

5-15 ms

Admission control, tenant rules, and route selection should consume far less time than the inference path itself.

Feature and context fetch

20-60 ms

This is where tail latency often hides: cache misses, slow feature stores, and unnecessary dependency hops.

Model execution

30-90 ms

CPU versus GPU routing, batching windows, and model size define throughput, tail latency, and the cost envelope.

Post-processing

10-30 ms

Validation, threshold application, response shaping, citations, and policy filters still live on the critical path.
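The per-layer ranges above can be added up to see what the whole online path is allowed to spend. The sketch below is only an illustration: the layer keys and the helper names `total_budget` and `over_budget` are assumptions introduced here, while the millisecond ranges are the ones listed in this section.

```python
# Per-layer latency budgets in milliseconds, copied from the ranges above.
# The dictionary keys are illustrative names, not identifiers from a real system.
BUDGET_MS = {
    "request_routing": (5, 15),
    "feature_and_context_fetch": (20, 60),
    "model_execution": (30, 90),
    "post_processing": (10, 30),
}


def total_budget(budget):
    """Return the (lower, upper) end-to-end budget in milliseconds."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def over_budget(measured_ms, budget):
    """List the layers whose measured latency exceeds their upper bound."""
    return [
        layer
        for layer, value in measured_ms.items()
        if value > budget[layer][1]
    ]


print(total_budget(BUDGET_MS))  # (65, 195)
```

With these ranges the end-to-end budget comes to 65-195 ms, so a single slow feature fetch can consume a large share of the headroom before the model even runs.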

Execution policy

CPU/GPU routing

Heavy models and high-throughput paths often belong on GPU, but short bursts of traffic may be better served on CPU or by a lighter model.
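One way to picture such a routing rule is a small function that looks at current load and request size. Everything in this sketch is a hypothetical illustration: the `RoutingPolicy` fields, the threshold values, and the use of queue depth as a proxy for sustained throughput are assumptions, not values taken from the chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingPolicy:
    # Both thresholds are illustrative assumptions, not recommended values.
    gpu_min_queue_depth: int = 8        # load level that justifies the GPU path
    light_model_max_size: int = 256     # largest request the lighter model handles


def choose_route(queue_depth, request_size, gpu_available, policy=RoutingPolicy()):
    """Pick an execution route for one request.

    Sustained load goes to the GPU when it is available; small requests
    during quiet periods go to the lighter model; everything else runs on CPU.
    """
    if gpu_available and queue_depth >= policy.gpu_min_queue_depth:
        return "gpu"
    if request_size <= policy.light_model_max_size:
        return "light_model"
    return "cpu"
```

In practice the decision would probably also depend on model size and traffic class; the point of the sketch is only that the rule is explicit and testable rather than buried in deployment configuration.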

Batching windows

Batching lowers the cost per request, but it almost always worsens tail latency. You need a hard maximum wait time and separate rules for different traffic classes.
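The effect of a hard maximum wait can be seen in a small offline simulation. This is a minimal sketch, not how any particular inference server implements batching: `form_batches` and the `BATCH_RULES` values are assumptions made up for illustration.

```python
def form_batches(arrivals_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times (ms) into batches.

    A batch is dispatched when it is full or when its oldest request has
    waited max_wait_ms, whichever comes first. Returns a list of
    (dispatch_time_ms, [arrival times]) tuples.
    """
    batches = []
    current = []
    for t in arrivals_ms:
        # The window of the open batch expired before this arrival: flush it.
        if current and t > current[0] + max_wait_ms:
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        # The batch is full: dispatch immediately.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        # The last partial batch leaves when its window closes.
        batches.append((current[0] + max_wait_ms, current))
    return batches


# Hypothetical per-class rules: interactive traffic gets a short window,
# bulk traffic trades wait time for larger batches.
BATCH_RULES = {
    "interactive": {"max_batch_size": 4, "max_wait_ms": 10},
    "bulk": {"max_batch_size": 32, "max_wait_ms": 200},
}

arrivals = [0, 2, 3, 20, 21, 22, 23, 50]
print(form_batches(arrivals, **BATCH_RULES["interactive"]))
# [(10, [0, 2, 3]), (23, [20, 21, 22, 23]), (60, [50])]
```

No request in the interactive class waits in the batcher longer than 10 ms, which is the property the hard limit is meant to guarantee; the bulk class accepts a much longer wait in exchange for fuller batches.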

Admission control

The queue must limit or shed lower-priority traffic before the whole serving path starts suffocating under total load.
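Priority-aware shedding of this kind might look like the sketch below. The class name, the priority scheme (0 is most important), and the fill thresholds are all assumptions chosen for illustration.

```python
class AdmissionController:
    """Shed lower-priority traffic as the queue fills up.

    shed_thresholds maps a priority to the queue fill fraction at which
    requests of that priority start being rejected. The values used in
    the example are hypothetical.
    """

    def __init__(self, capacity, shed_thresholds):
        self.capacity = capacity
        self.shed_thresholds = shed_thresholds
        self.in_queue = 0

    def try_admit(self, priority):
        """Admit the request if its priority is still allowed at the current fill."""
        fill = self.in_queue / self.capacity
        if fill >= self.shed_thresholds[priority]:
            return False
        self.in_queue += 1
        return True

    def release(self):
        """Call when a request leaves the queue."""
        self.in_queue = max(0, self.in_queue - 1)


# Priority 2 is shed at 50% fill, priority 1 at 80%, priority 0 only when full.
controller = AdmissionController(capacity=10, shed_thresholds={0: 1.0, 1: 0.8, 2: 0.5})
```

The lowest priority class starts being rejected while half of the queue is still free, which leaves room for critical traffic instead of letting all classes time out together.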

Warm pools and autoscaling

If the heavy path warms up too slowly, autoscaling without warm capacity gives you a pretty graph and a poor user experience.

Degraded modes

A cached answer or recent score for read paths that are sensitive to latency.
A reduced feature set when the feature store or an external dependency degrades.
A lighter model instead of the primary GPU-heavy path.
A predefined fallback with a safe baseline when no inference path can answer in time.
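The degraded modes above can be arranged as an ordered chain that is tried within one request deadline. The following is a sketch under stated assumptions: `answer_with_fallback`, `PathUnavailable`, and the path names are hypothetical, and each path is assumed to respect the remaining time it is given and to raise when it cannot answer.

```python
import time


class PathUnavailable(Exception):
    """Raised by a path that cannot answer (hypothetical signal)."""


def answer_with_fallback(request, paths, safe_default, deadline_s, clock=time.monotonic):
    """Try each path in order until one answers within the deadline.

    paths is an ordered list of (name, callable) pairs. Each callable
    receives the request and the remaining time in seconds, and is
    expected to raise TimeoutError or PathUnavailable on failure.
    Returns (path_name, answer).
    """
    start = clock()
    for name, path in paths:
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            break  # Budget exhausted: stop trying and use the safe default.
        try:
            return name, path(request, remaining)
        except (PathUnavailable, TimeoutError):
            continue
    return "safe_default", safe_default


def primary(request, remaining_s):
    raise TimeoutError("primary path too slow")  # simulated failure


def cached(request, remaining_s):
    raise PathUnavailable("cache miss")  # simulated failure


def light_model(request, remaining_s):
    return "light:" + request


chain = [("primary", primary), ("cache", cached), ("light_model", light_model)]
print(answer_with_fallback("q", chain, safe_default="baseline", deadline_s=0.2))
# ('light_model', 'light:q')
```

Returning the name of the path that answered makes fallback frequency easy to count, so degraded behavior shows up in metrics instead of staying invisible.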

Unit economics metrics

cost per 1K requests or per successfully resolved task
GPU utilization and batching efficiency
share of traffic on the lighter model and fallback frequency
queue wait time, timeout rate, and rejected-request percentage
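These metrics can be derived from a handful of counters collected per reporting window. The sketch below is illustrative only: the `ServingWindow` fields and the choice of denominators (all requests for cost and rejections, admitted requests for the rest) are assumptions, not definitions from the chapter.

```python
from dataclasses import dataclass


@dataclass
class ServingWindow:
    """Counters for one reporting window (all field names are assumptions)."""
    requests: int              # requests received
    rejected: int              # requests refused by admission control
    timeouts: int              # admitted requests that timed out
    fallbacks: int             # admitted requests answered by a degraded path
    light_model_requests: int  # admitted requests served by the lighter model
    infra_cost: float          # total serving cost for the window, any currency


def unit_economics(w):
    """Compute per-window serving metrics from raw counters."""
    served = w.requests - w.rejected
    if w.requests <= 0 or served <= 0:
        raise ValueError("window has no served requests")
    return {
        "cost_per_1k_requests": 1000 * w.infra_cost / w.requests,
        "rejected_pct": 100 * w.rejected / w.requests,
        "timeout_rate": w.timeouts / served,
        "fallback_rate": w.fallbacks / served,
        "light_model_share": w.light_model_requests / served,
    }


window = ServingWindow(
    requests=10_000, rejected=200, timeouts=98,
    fallbacks=490, light_model_requests=1_960, infra_cost=25.0,
)
print(unit_economics(window))
```

Tracking cost next to fallback and rejection rates matters because a falling cost per 1K requests can simply mean that more traffic is being served by degraded paths.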

Key trade-offs

Leaning on GPU raises throughput, but it also makes capacity planning and cost forecasting harder.
A larger batching window reduces inference cost, but it almost always hurts p99 and interactive user flows.
Aggressive caching helps absorb spikes, but it increases stale-response risk and can hide degradation in the primary path.
A unified serving stack reduces duplication, but it expands the blast radius across models and product surfaces.

Common mistakes

Treating serving as a simple HTTP call to a model instead of designing it as a system in its own right.

Failing to separate online traffic from heavy batch or async work through execution rules and SLOs.

Relying on autoscaling without degraded modes, admission control, and pre-warmed capacity.

Looking only at latency and answer quality while ignoring processing cost and fallback frequency.

Recommendations

Break the latency budget down by layer and keep separate p95 and p99 metrics for each one.

Treat the critical path, execution layer, and degraded path as three controllable parts of the same runtime, especially for stream-heavy traffic.

Maintain at least two safe paths for trouble: a cached answer plus either a lighter model or a baseline decision.

Evaluate batching, routing, and autoscaling decisions together against quality, latency, and cost.
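The first recommendation, separate p95 and p99 per layer, can be sketched with a nearest-rank percentile over latency samples grouped by layer. The function names and the layer key are hypothetical, and a production system would more likely rely on histograms from its metrics backend than on raw sample lists.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def layer_tail_latency(samples_by_layer):
    """Return p95 and p99 for every layer, given latency samples in ms."""
    return {
        layer: {"p95": percentile(samples, 95), "p99": percentile(samples, 99)}
        for layer, samples in samples_by_layer.items()
    }


# Synthetic samples: 1..100 ms for a single layer.
print(layer_tail_latency({"model_execution": list(range(1, 101))}))
# {'model_execution': {'p95': 95, 'p99': 99}}
```

Keeping the breakdown per layer shows which layer is responsible when end-to-end p99 moves, which a single aggregate number cannot do.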

References

NVIDIA. Triton Inference Server: Batchers (dynamic batching, queues).
Kwon et al. Efficient Memory Management for LLM Serving with PagedAttention (vLLM, SOSP 2023).
KServe. Autoscaling inference with Kubernetes HPA (documentation).
Google Cloud. MLOps: Continuous delivery and automation pipelines in machine learning.

Related chapters

Feature Store & Model Serving - A chapter about the feature plane and offline-online consistency that often defines the main tail-latency risks.
Model Release, Calibration, and Experiment Loops - Staged rollout changes more than model weights — serving configuration, routing, and the live latency budget move with them.
The history of Google TPUs and their evolution - Why accelerator economics and hardware choice directly shape serving architecture.
ML Ops Pipeline - How the inference path fits into the broader model lifecycle, monitoring, and retraining flow.
Generative AI System Design Interview (short summary) - Interview context for discussing latency, token usage, GPU utilization, fallback, and GenAI inference cost.