System Design Space

Updated: March 2, 2026 at 6:37 PM

Performance Engineering


A systematic approach to performance: latency optimization, profiling, capacity planning and performance budgets in production.

Context

Observability & Monitoring Design

Without metrics, logs and traces, performance engineering becomes guesswork.


Performance Engineering is a systems discipline in which performance is designed up front and managed throughout the system's life cycle. The main focus: latency optimization, bottleneck profiling and capacity planning for load growth without losing stability.

Latency optimization

Request path and budget

Break the end-to-end request into stages (edge, API, DB, cache, external dependencies) and set a latency budget for each segment.

Reducing network round-trips

Batching, connection pooling, keep-alive, co-location of services and reduction of chatty interaction between microservices.

Data access optimization

Correct indexes, query plan profiling, read-through cache, data locality and control of N+1 patterns.
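To make the N+1 control concrete, here is a minimal sketch of collapsing an N+1 access pattern into one batched fetch. The `fetch_users_by_ids` helper and its in-memory `db` dict are hypothetical stand-ins for a real data-access layer:

```python
def fetch_users_by_ids(ids):
    """Hypothetical data-access helper: resolve all ids in ONE round-trip."""
    db = {1: "ana", 2: "bo", 3: "kim"}  # stand-in for a users table
    return {i: db[i] for i in set(ids)}

order_author_ids = [1, 2, 1, 3, 2]

# The N+1 antipattern would issue one query per order (5 round-trips here);
# batching on unique ids needs a single round-trip instead:
users = fetch_users_by_ids(order_author_ids)
print(len(users))  # 3 unique users resolved in one query
```

The same idea underlies dataloader-style batching in ORMs and GraphQL backends: collect the ids first, then fetch once.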

Queue and async boundaries

Move non-critical operations to the async pipeline to protect p95/p99 user requests.

Tail latency control

Hedged requests, timeout budget, adaptive retry and protection against retry storms during degradation.
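A hedged request can be sketched with the standard library alone: fire the primary call, and if it has not completed within the hedge delay, fire one backup and take whichever finishes first. The `slow_then_fast` dependency below is a contrived stand-in that makes the first call slow:

```python
import concurrent.futures
import threading
import time

def hedged_call(fn, hedge_delay, executor):
    """Fire a primary call; if it is still running after hedge_delay
    seconds, fire one backup and return whichever result arrives first."""
    primary = executor.submit(fn)
    done, _ = concurrent.futures.wait({primary}, timeout=hedge_delay)
    if done:
        return primary.result()
    backup = executor.submit(fn)
    done, _ = concurrent.futures.wait(
        {primary, backup},
        return_when=concurrent.futures.FIRST_COMPLETED,
    )
    return done.pop().result()

# Contrived dependency: first call takes 500 ms, the hedge takes 10 ms.
calls = iter([0.5, 0.01])
lock = threading.Lock()
def slow_then_fast():
    with lock:
        delay = next(calls)
    time.sleep(delay)
    return delay

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    result = hedged_call(slow_then_fast, hedge_delay=0.05, executor=pool)
print(result)  # 0.01 — the hedge wins
```

Hedging is safe only for idempotent reads, and the hedge itself must be counted against the retry budget, or it becomes a retry storm under degradation.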

Concurrency and backpressure

Limit concurrency on critical resources (DB, thread pool, external API) to prevent collapse during peaks.
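One way to enforce such a limit is load shedding with a non-blocking semaphore: work beyond the in-flight cap is rejected immediately instead of queueing without bound. A minimal sketch (the `ConcurrencyLimiter` class is illustrative, not a library API):

```python
import threading

class ConcurrencyLimiter:
    """Reject work beyond max_inflight instead of letting queues
    grow without bound (load shedding as explicit backpressure)."""
    def __init__(self, max_inflight):
        self._sem = threading.BoundedSemaphore(max_inflight)

    def try_run(self, fn, *args):
        if not self._sem.acquire(blocking=False):
            return ("rejected", None)  # caller can degrade or retry later
        try:
            return ("ok", fn(*args))
        finally:
            self._sem.release()

limiter = ConcurrencyLimiter(max_inflight=1)

def outer():
    # while this task holds the only slot, a second task is shed
    return limiter.try_run(lambda: "inner")

result = limiter.try_run(outer)
print(result)  # ('ok', ('rejected', None))
```

A fast rejection keeps latency bounded for the requests that are admitted, which is usually better than letting every request slow down together.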

Degradation by priorities

Define graceful degradation: non-essential features are disabled first, while the critical user path maintains the SLA.

Related

Event-Driven Architecture

Asynchronous boundaries and task queuing often remove latency from the synchronous path.


Profiling: where exactly is time lost?

CPU profiling

Look for hot paths and unnecessary allocations in business code and serialization.

Memory profiling

Detect leaks, fragmentation and pressure on the GC/allocator.

I/O profiling

Separately measure network wait, disk wait, lock contention and external API latency.

Distributed tracing

Link the profile to traces to see exactly where the request is wasting time in the distributed call graph.

Lock contention and thread states

Understand blocking, wait queues, and thread state: many latency spikes are caused by contention, not CPU.

GC and runtime pauses

Analyze stop-the-world pauses, memory pressure and churn of short-lived objects, especially in high-QPS services.

SLO and performance budget by layer

Edge / API gateway

20-40 ms

Routing, auth checks and rate limiting without noticeable delay for the client.

  • Policy/keys cache and minimization of synchronous external calls.
  • Connection reuse + TLS session resumption.
  • Strict timeout for downstream and fallback responses.
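The "strict timeout for downstream" bullet is easiest to get right with deadline propagation: the edge sets one end-to-end budget, and every downstream hop gets only the remaining time, never its full per-hop cap. A dependency-free sketch (the `Deadline` class is illustrative):

```python
import time

class Deadline:
    """Propagate one end-to-end deadline instead of stacking
    independent per-hop timeouts that can sum past the budget."""
    def __init__(self, budget_s):
        self._expires = time.monotonic() + budget_s

    def remaining(self):
        return max(0.0, self._expires - time.monotonic())

    def timeout_for_hop(self, hop_cap_s):
        # never grant a hop more than the remaining overall budget
        return min(hop_cap_s, self.remaining())

d = Deadline(budget_s=0.200)    # 200 ms edge budget
t1 = d.timeout_for_hop(0.150)   # downstream call capped at 150 ms
print(round(t1, 3))
```

gRPC and many service meshes carry such a deadline in request metadata, so every hop in the chain can cut off work that the client has already given up on.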

Application service

60-120 ms

Basic business logic and orchestration of neighboring services within the SLO.

  • Parallelization of independent calls and prohibition of chatty patterns.
  • Idempotent retries only within the retry budget.
  • Fan-out limitation and protection against N+1 requests.
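A retry budget can be as simple as a ratio guard: retries are allowed only while they stay under a fixed fraction of total requests, so retries cannot multiply traffic during an outage. A minimal sketch (the `RetryBudget` class is illustrative, loosely modeled on the budgets used in service meshes):

```python
class RetryBudget:
    """Allow retries only while they stay under `ratio` of total
    requests, so retries cannot amplify an outage into a storm."""
    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False

budget = RetryBudget(ratio=0.1)
for _ in range(100):
    budget.record_request()

allowed = sum(budget.can_retry() for _ in range(30))
print(allowed)  # 10 — only 10% of 100 requests may be retried
```

Combined with exponential backoff and jitter, this caps worst-case amplification at `1 + ratio` of the offered load.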

Data access (DB/cache/search)

40-120 ms

Stable p95/p99 on data with growing volume and competition.

  • Query plan regression checks before release.
  • Read/write separation, local cache and hot key mitigation.
  • Control of pool saturation, lock wait and slow queries.

External dependencies

depends on the provider's SLA

Limit the blast radius of external degradation to the user path.

  • Circuit breaker + fallback + bulkhead isolation.
  • Async queue for non-critical integrations.
  • Separate SLO for external hops and timeout contract.
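The circuit breaker from the first bullet can be sketched in a few lines: open after N consecutive failures, then allow a half-open probe once the cooldown elapses. The `CircuitBreaker` class below is a simplified illustration (real implementations also track success rates and limit concurrent probes); the injectable `clock` makes it testable:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a
    half-open probe after `cooldown` seconds."""
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        return self.clock() - self.opened_at >= self.cooldown  # half-open

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

now = [0.0]  # fake clock for the demo
cb = CircuitBreaker(threshold=2, cooldown=10.0, clock=lambda: now[0])
cb.record_failure()
cb.record_failure()
open_state = cb.allow()   # False: circuit is open, fail fast to fallback
now[0] = 11.0
half_open = cb.allow()    # True: cooldown elapsed, one probe may pass
print(open_state, half_open)
```

While the circuit is open, the caller serves the fallback immediately instead of waiting on timeouts, which is what keeps the blast radius bounded.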

Capacity planning

  • Determine the key workload units: requests/sec, events/sec, active users, data growth per day.
  • Build a baseline from p50/p95/p99 latency, saturation and error rate under the current load.
  • Track headroom separately for average and peak load: at least 20-30% for critical paths.
  • Check the scaling of bottlenecks one at a time: compute, storage IOPS, network egress, queue throughput.
  • Plan capacity together with the release roadmap and seasonality, not just from historical charts.
  • Separate the online serving path from batch/analytics workloads so background load does not eat into the user-facing budget.
  • Account for cost: the capacity plan must balance SLO, resilience and FinOps constraints.
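The headroom rule turns into a one-line sizing formula: provision for peak load plus headroom, divided by what one instance can sustain. The numbers below are illustrative, not benchmarks:

```python
import math

def instances_needed(peak_rps, per_instance_rps, headroom=0.3):
    """Size for peak load plus headroom (20-30% for critical paths)."""
    return math.ceil(peak_rps * (1 + headroom) / per_instance_rps)

# Illustrative figures: 12k rps at peak, 1.5k rps sustained per instance.
n = instances_needed(peak_rps=12_000, per_instance_rps=1_500, headroom=0.3)
print(n)  # ceil(15600 / 1500) = 11 instances
```

`per_instance_rps` must come from a measured saturation point (step-load test below), not from a datasheet, or the headroom is fictional.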

Load Testing Matrix

Steady-state baseline

Flat load for 30-60 minutes

Check the stability of p95/p99 and saturation on a typical production profile.

Success Criteria

  • p95/p99 stay within budget without an increase in error rate.
  • CPU/memory/IO stay below saturation thresholds.
  • No growth in queue lag or thread wait.
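Checking the p95/p99 criterion requires computing percentiles from the raw latency samples; the mean is useless here. A small dependency-free sketch using the nearest-rank method (production systems usually keep histograms instead of raw samples):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over raw samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 98 fast responses and 2 slow outliers (synthetic data)
latencies_ms = [10] * 98 + [900] * 2
mean = sum(latencies_ms) / len(latencies_ms)
p99 = percentile(latencies_ms, 99)
print(mean, p99)  # the mean looks healthy while p99 is 900 ms
```

This is exactly why the antipatterns section warns against judging by average latency: two bad requests in a hundred barely move the mean but dominate p99.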

Step-load growth

Stepwise load increase every 5-10 minutes

Find the degradation threshold and identify the first bottleneck.

Success Criteria

  • The point where tail latency begins to increase is known.
  • It is clear which resource saturates first (CPU, DB, network, lock).
  • There is a plan to move the bottleneck for the next release.

Spike / burst test

A sharp burst to 2-5x baseline

Check the behavior of auto-scaling, queue buffering and rate limiting under the burst.

Success Criteria

  • The service does not go into cascading failure.
  • Errors remain controllable and recoverable.
  • The time to return to normal is predictable.

Soak / endurance

Long-running test, 6-24 hours

Find memory leaks, cache degradation, fragmentation accumulation and latency drift.

Success Criteria

  • No monotonic growth of memory/FD/queue depth.
  • GC/allocator pauses do not get worse over time.
  • p99 does not drift when the input load is stable.
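A simple automated check for the "no monotonic growth" criterion: compare the second half of a soak-test metric series (RSS, FD count, queue depth) against the first half. The `drifts_upward` helper and its sample series are illustrative:

```python
def drifts_upward(samples, tolerance=0.05):
    """Flag a soak-test series whose second-half mean exceeds the
    first-half mean by more than `tolerance` (default 5%)."""
    half = len(samples) // 2
    first = sum(samples[:half]) / half
    second = sum(samples[half:]) / (len(samples) - half)
    return second > first * (1 + tolerance)

stable = [100, 102, 99, 101, 100, 98, 101, 100]    # noisy but flat
leaking = [100, 110, 120, 131, 142, 155, 168, 183]  # steady growth
print(drifts_upward(stable), drifts_upward(leaking))  # False True
```

For noisy production-grade series a linear-regression slope with a significance test is more robust, but the half-split comparison is enough to gate a CI soak run.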

Optimization playbook: from hypothesis to effect

Steps from baseline to durable impact:

  • Measurement
  • Diagnosis
  • Experiment
  • Validation
  • Operations

Typical antipatterns

  • Optimizing without baseline metrics or a reproducible load scenario.
  • Looking only at average latency while ignoring p95/p99/p999.
  • Trying to fix latency with hardware alone, without removing architectural bottlenecks.
  • Retrying endlessly, without a retry budget or backoff, during partial degradation.
  • Planning capacity without accounting for data growth and features that will change the load profile.
  • Mixing performance optimizations and functional changes in a single release, so the effect cannot be isolated.
  • Ignoring GC/allocator/lock contention and focusing only on SQL/cache.

Practical checklist

  • For critical user-journeys, p95/p99 SLO and latency budget are defined by layer.
  • There is a regular baseline test (steady-state) and a stress scenario (spike/step-load).
  • Profiling (CPU/memory/IO/lock) is performed before and after optimizations.
  • Saturation metrics are monitored: pool utilization, queue depth, thread wait, GC pause.
  • Retry, timeout and circuit breaker are consistent between services and do not create retry storms.
  • Performance regression checks are included in the release process for key APIs/queries.
  • The capacity plan is reviewed along with the roadmap and seasonal load peaks.
  • The team records the results of optimizations in the runbook/ADR: what helped, where are the risks, how to monitor.

Performance is not a one-time benchmark but a continuous engineering feedback loop.



© 2026 Alexander Polomodov