Updated: April 5, 2026 at 1:05 PM

Model Serving and Inference Architecture

medium

How to design the live inference path for ML and LLM systems: online, batch, and stream modes, autoscaling, CPU/GPU routing, degraded behavior, and latency-cost trade-offs.

Serving architecture matters not simply because a model must run somewhere, but because this is where answer quality collides with latency, cost, and resilience.

The chapter treats online, batch, and stream inference as distinct operating contracts with different queues, dependencies, and degradation rules.

That perspective helps discuss routing, fallback, and autoscaling as part of product quality rather than as infrastructure in isolation.

Practical value of this chapter

Runtime architecture

Separate the critical path, execution layer, and degradation path while treating them as parts of a single runtime.

Inference economics

Discuss batching, CPU/GPU routing, and autoscaling through the balance of latency and cost.

Fallback strategy

Plan a lighter model, cached answers, and safe defaults before the main path starts failing.

Inference modes

Understand when to choose online, batch, or stream inference for different load patterns.

Related chapter

Feature Store & Model Serving

The feature and data plane often shapes tail latency more than the model itself.

Serving is where inference collides with queues, caches, timeouts, and infrastructure cost. This is also where latency limits turn into an explicit budget for each layer. Production ML and AI systems usually fail not because the model is weak, but because the live path was underdesigned and degraded behavior was never made operational.

Inference modes

Online inference

The synchronous user-facing path, where every millisecond affects the product experience and degraded modes must be prepared in advance.

Batch inference

Large score recomputations, backfills, and nightly refreshes. The main risks are queue buildup, stale results, and interference with shared resources.

Stream inference

Scoring events as they arrive in near real time. This keeps outputs fresh, but sharply increases the complexity of state, ordering, and overload handling.

Serving runtime architecture by layers

This diagram breaks the serving runtime into layers, from traffic ingress and routing to model execution, response shaping, and degraded behavior.

1. Clients and traffic ingress: API / SDK, Gateway, Auth, Rate limits
2. Routing and policy: Request router, Tenant rules, Admission control, Route selection
3. Context and features: Cache, Feature store, Retrieval, Context assembly
4. Model execution: CPU/GPU, Batching, Concurrency limits, Workers
5. Post-processing and response: Thresholds, Validation, Formatting, Response shaping
6. Degradation and recovery: Fallback, Light model, Safe defaults, Recovery

What to keep under control

It helps to see serving not only as a chain of services, but as a balance of latency, cost, and resilience across every layer.

Latency budget

p95/p99, queue wait, feature fetch, post-processing

Inference economics

GPU utilization, batch efficiency, cost per 1K requests

Resilience controls

fallback rate, degraded modes, warm capacity, recovery time

How a request flows through the serving runtime

Below is a comparison of two broader execution paths: the latency-sensitive online path and the batch/stream path used for bulk or event-driven processing.

Comparing the online path with the batch/stream path

1. Intake and routing

The gateway or router accepts the request, checks tenant rules, and decides whether it can enter the path.

Latency-sensitive request path

  • The online path is tightly constrained by latency.
  • Tail latency and fallback policy are critical.
  • Any slow dependency immediately hits UX.
Latency budget, Fallback, Tail latency

Latency budget decomposition

Request routing

5-15 ms

Admission control, tenant rules, and route selection should consume far less time than the inference path itself.

Feature and context fetch

20-60 ms

This is where tail latency often hides: cache misses, slow feature stores, and unnecessary dependency hops.

Model execution

30-90 ms

CPU versus GPU routing, batching windows, and model size define throughput, tail latency, and the cost envelope.

Post-processing

10-30 ms

Validation, threshold application, response shaping, citations, and policy filters still live on the critical path.
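
As a rough worked example, the per-layer ranges above can be added up to see the end-to-end envelope. This is a minimal sketch that assumes the four layers run strictly in sequence; the names `BUDGET_MS`, `total_budget`, and `over_budget` are illustrative and do not come from any library. Per-layer tail percentiles do not add exactly, so the sum is a planning bound rather than a measured p99.

```python
# Per-layer latency ranges in milliseconds, copied from the breakdown above.
BUDGET_MS = {
    "request_routing": (5, 15),
    "feature_and_context_fetch": (20, 60),
    "model_execution": (30, 90),
    "post_processing": (10, 30),
}


def total_budget(budget):
    """Sum the lower and upper bounds, assuming layers run one after another."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def over_budget(budget, measured_ms):
    """Return the layers whose measured latency exceeds their upper bound."""
    return [
        layer
        for layer, (_, upper) in budget.items()
        if measured_ms.get(layer, 0) > upper
    ]


if __name__ == "__main__":
    print(total_budget(BUDGET_MS))
    # Hypothetical measurements for one slow request.
    sample = {
        "request_routing": 8,
        "feature_and_context_fetch": 95,
        "model_execution": 70,
        "post_processing": 12,
    }
    print(over_budget(BUDGET_MS, sample))
```

With the ranges listed above, the envelope works out to 65-195 ms end to end, and the hypothetical sample flags only the feature and context fetch as over budget.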

Execution policy

CPU/GPU routing

Heavy models and high-throughput paths often belong on GPU, but short bursts of traffic may be better served on CPU or by a lighter model.
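
One way to make such a routing rule explicit is a small policy function. The sketch below is an illustration under stated assumptions: the `RoutingInput` fields, the route names, and the `max_gpu_queue` threshold are all hypothetical, and a real policy would likely use measured queue wait rather than raw depth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingInput:
    heavy_model: bool      # does the request target a GPU-sized model?
    interactive: bool      # is a user waiting on the response?
    gpu_queue_depth: int   # requests currently waiting for a GPU worker


def choose_route(req, max_gpu_queue=32):
    """Pick an execution target; the threshold is an assumed example value."""
    if not req.heavy_model:
        # Small models do not need GPU capacity in this sketch.
        return "cpu"
    if req.interactive and req.gpu_queue_depth >= max_gpu_queue:
        # The GPU pool is saturated: protect latency with the lighter model.
        return "light_model"
    # Heavy work with headroom, or bulk work that can tolerate queueing.
    return "gpu"
```

Keeping the rule in one function makes it easy to review alongside the latency budget, since a change in the threshold directly moves traffic between the cost and latency sides of the trade-off.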

Batching windows

Batching lowers the cost per request, but it almost always worsens tail latency. You need a hard maximum wait time and separate rules for different traffic classes.
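
The effect of a hard maximum wait can be sketched with a small offline simulation. This is not a production batcher: `plan_batches` and `max_queue_wait` are hypothetical helper names, arrival times are given in milliseconds and assumed sorted, and a batch is dispatched either when it is full or when its oldest request has waited `max_wait_ms`.

```python
def plan_batches(arrivals_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times into (dispatch_time, arrivals) batches."""
    batches = []
    current = []
    for t in arrivals_ms:
        # The open batch hit its deadline before this arrival: flush it there.
        if current and t > current[0] + max_wait_ms:
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        # A full batch is dispatched immediately, without waiting further.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        # The trailing partial batch is flushed at its deadline.
        batches.append((current[0] + max_wait_ms, current))
    return batches


def max_queue_wait(batches):
    """Longest time any request spent waiting for its batch to dispatch."""
    return max(dispatch - arrivals[0] for dispatch, arrivals in batches)


if __name__ == "__main__":
    arrivals = [0, 2, 3, 20, 21, 22, 23]
    # Assumed per-class windows: interactive traffic gets the shorter one.
    for traffic_class, wait_ms in {"interactive": 5, "bulk": 10}.items():
        plan = plan_batches(arrivals, max_batch_size=4, max_wait_ms=wait_ms)
        print(traffic_class, plan, max_queue_wait(plan))
```

In this example both windows produce the same two batches, but the worst-case queue wait drops from 10 ms to 5 ms with the shorter window, which is the lever the separate per-class rules are meant to expose.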

Admission control

The queue must limit or shed lower-priority traffic before the whole serving path starts suffocating under total load.
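
A minimal sketch of priority-based shedding follows. The priority numbering (0 is most important) and the thresholds in `SHED_THRESHOLDS` are assumptions chosen for illustration; the point is only that lower-priority classes are rejected at lower queue utilization, so the highest class still has room when load peaks.

```python
# Assumed policy: each priority class is admitted only while queue
# utilization stays below its threshold. Priority 0 is the most important.
SHED_THRESHOLDS = {0: 1.0, 1: 0.8, 2: 0.5}


def admit(priority, queue_depth, capacity, thresholds=SHED_THRESHOLDS):
    """Return True if a request of this priority may enter the queue."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    utilization = queue_depth / capacity
    return utilization < thresholds[priority]
```

For example, with a queue at 60 of 100 slots, priority 2 traffic is already being shed while priorities 0 and 1 are still admitted.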

Warm pools and autoscaling

If the heavy path warms up too slowly, autoscaling without warm capacity gives you a pretty graph and a poor user experience.
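
A back-of-the-envelope estimate shows why warm capacity matters. The sketch below assumes traffic grows linearly during a spike and that a new replica serves nothing until its cold start completes; `warm_replicas_needed` and its parameters are hypothetical names, and real sizing should be based on measured ramp rates and startup times.

```python
import math


def warm_replicas_needed(ramp_rps_per_s, cold_start_s, rps_per_replica):
    """Replicas that must already be warm to absorb growth during a cold start.

    Assumes load grows by `ramp_rps_per_s` every second and that a replica
    started now contributes no capacity for `cold_start_s` seconds.
    """
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    extra_rps = ramp_rps_per_s * cold_start_s
    return math.ceil(extra_rps / rps_per_replica)


if __name__ == "__main__":
    # Hypothetical numbers: +20 req/s each second, 90 s to load the model,
    # 150 req/s sustained per replica.
    print(warm_replicas_needed(20, 90, 150))
```

Under these assumed numbers, 12 replicas' worth of traffic arrives before the first newly started replica can serve anything, so that headroom has to exist before the spike.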

Degraded modes

  • A cached answer or recent score for read paths that are sensitive to latency.
  • A reduced feature set when the feature store or an external dependency degrades.
  • A lighter model instead of the primary GPU-heavy path.
  • A predefined fallback with a safe baseline when no inference path can be confirmed in time.
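
The degraded modes above can be sketched as an ordered fallback chain. This is an illustration under assumptions: the path names, their ordering, and the `serve_with_fallback` helper are hypothetical, and the deadline is only checked between attempts, so a single slow call is not interrupted. A real implementation would also enforce per-call timeouts.

```python
import time


def serve_with_fallback(paths, safe_default, deadline_s, clock=time.monotonic):
    """Try each (name, callable) in order until one succeeds or time runs out.

    Returns a (source, value) pair so the caller can record which path answered.
    """
    start = clock()
    for name, fn in paths:
        if clock() - start >= deadline_s:
            # Budget is spent: stop trying and use the predefined baseline.
            break
        try:
            return name, fn()
        except Exception:
            # This path failed; fall through to the next, cheaper option.
            continue
    return "safe_default", safe_default


if __name__ == "__main__":
    def primary_model():
        raise TimeoutError("GPU path did not answer in time")

    def light_model():
        return 0.42  # hypothetical score from a smaller model

    def cached_score():
        return 0.40  # hypothetical recent score from a cache

    chain = [
        ("primary", primary_model),
        ("light_model", light_model),
        ("cache", cached_score),
    ]
    print(serve_with_fallback(chain, safe_default=0.0, deadline_s=0.1))
```

Returning the source alongside the value is a deliberate choice in this sketch: it makes fallback frequency measurable, which the metrics later in the chapter depend on.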

Unit economics metrics

  • cost per 1K requests or per successfully resolved task
  • GPU utilization and batching efficiency
  • share of traffic on the lighter model and fallback frequency
  • queue wait time, timeout rate, and rejected-request percentage
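
These metrics can be derived from a handful of counters collected over a fixed window. The sketch below is a minimal example with assumed field names; `WindowCounters` and `unit_economics` are hypothetical, `total_requests` is assumed to count every arriving request (including rejected ones), and `infra_cost_usd` is assumed to be the serving cost attributed to the same window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounters:
    total_requests: int
    fallback_requests: int
    light_model_requests: int
    timed_out_requests: int
    rejected_requests: int
    infra_cost_usd: float


def unit_economics(c):
    """Turn raw window counters into per-request cost and rate metrics."""
    n = c.total_requests
    if n <= 0:
        raise ValueError("total_requests must be positive")
    return {
        "cost_per_1k_usd": c.infra_cost_usd / n * 1000,
        "fallback_rate": c.fallback_requests / n,
        "light_model_share": c.light_model_requests / n,
        "timeout_rate": c.timed_out_requests / n,
        "rejected_pct": c.rejected_requests / n * 100,
    }


if __name__ == "__main__":
    # Hypothetical one-hour window.
    window = WindowCounters(
        total_requests=200_000,
        fallback_requests=4_000,
        light_model_requests=30_000,
        timed_out_requests=600,
        rejected_requests=1_000,
        infra_cost_usd=4.0,
    )
    for name, value in unit_economics(window).items():
        print(f"{name}: {value:.4f}")
```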

Key trade-offs

  • Leaning on GPU raises throughput, but it also makes capacity planning and cost forecasting harder.
  • A larger batching window reduces inference cost, but it almost always hurts p99 and interactive user flows.
  • Aggressive caching helps absorb spikes, but it increases stale-response risk and can hide degradation in the primary path.
  • A unified serving stack reduces duplication, but it expands the blast radius across models and product surfaces.

Common mistakes

  • Treating serving as a simple HTTP call to a model instead of designing it as a system in its own right.
  • Failing to separate online traffic from heavy batch or async work through execution rules and SLOs.
  • Relying on autoscaling without degraded modes, admission control, and pre-warmed capacity.
  • Looking only at latency and answer quality while ignoring processing cost and fallback frequency.

Recommendations

  • Break the latency budget down by layer and keep separate p95 and p99 metrics for each one.
  • Treat the critical path, execution layer, and degraded path as three controllable parts of the same runtime, especially for stream-heavy traffic.
  • Maintain at least two safe paths for trouble: a cached answer plus either a lighter model or a baseline decision.
  • Evaluate batching, routing, and autoscaling decisions together against quality, latency, and cost.
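
For the first recommendation, per-layer p95 and p99 can be computed from raw latency samples. The sketch below uses the nearest-rank percentile definition with integer arithmetic; `percentile` and `tail_by_layer` are illustrative names, and production systems would more often rely on histograms from their metrics backend than on raw sample lists.

```python
def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= q <= 100:
        raise ValueError("q must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100 avoids floating-point rounding surprises.
    rank = (q * len(ordered) + 99) // 100
    return ordered[rank - 1]


def tail_by_layer(samples_by_layer):
    """Map each layer name to its (p95, p99) latency."""
    return {
        layer: (percentile(values, 95), percentile(values, 99))
        for layer, values in samples_by_layer.items()
    }


if __name__ == "__main__":
    # Hypothetical latencies in ms: a steady layer and one with a slow tail.
    samples = {
        "request_routing": [6] * 100,
        "feature_and_context_fetch": [25] * 97 + [140, 180, 220],
    }
    print(tail_by_layer(samples))
```

In the hypothetical data, the feature fetch layer looks healthy at p95 and only reveals its slow tail at p99, which is why the two percentiles are tracked separately.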
