System Design Space
Updated: February 21, 2026 at 11:59 PM

Testing Distributed Systems

hard

A practical approach to testing distributed systems: chaos engineering, contract testing and integration testing at scale.

Support

Jepsen and consistency models

Key context on how distributed systems break down in reality.

Testing distributed systems is not a single type of test but a strategy that combines contract testing, integration tests on realistic infrastructure, and chaos engineering. The goal is not only to find bugs in the code but also to demonstrate the system's resilience to network loss, partial failures, state drifting out of sync, and dependency degradation.

Distributed systems testing stack

Deterministic component tests

Check business logic and state transitions in isolation, without network instability.
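As a minimal sketch of this layer, the example below tests a small state machine with no network, clock, or broker involved. The order lifecycle, the `TRANSITIONS` table, and the `Order` class are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical order lifecycle; states and events are illustrative only.
TRANSITIONS = {
    ("created", "pay"): "paid",
    ("paid", "ship"): "shipped",
    ("created", "cancel"): "cancelled",
    ("paid", "cancel"): "cancelled",
}


@dataclass
class Order:
    state: str = "created"
    history: list = field(default_factory=list)

    def apply(self, event: str) -> None:
        """Move to the next state or reject the event without changing state."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition: {event!r} in state {self.state!r}")
        self.history.append(key)
        self.state = TRANSITIONS[key]


def test_happy_path():
    order = Order()
    order.apply("pay")
    order.apply("ship")
    assert order.state == "shipped"


def test_illegal_transition_is_rejected():
    order = Order()
    order.apply("cancel")
    try:
        order.apply("ship")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
    assert order.state == "cancelled"  # state is unchanged after the rejection
```

Because nothing here depends on timing or I/O, the tests give the same result on every run, which makes failures at higher layers easier to attribute to the network rather than to the logic.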

Contract testing

Pin down API and event contracts between services so that changes do not break neighboring teams.

Integration testing

Run critical end-to-end scenarios in a realistic environment with brokers, databases, and retries.

Chaos experiments

Inject controlled failures: network loss, pod restarts, latency spikes, zone outages.

Production verification

Check SLOs, error budgets, and rollback readiness on live traffic, with guardrails in place.
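The error-budget part of production verification comes down to simple arithmetic. The sketch below assumes a request-based availability SLO; the function name and the sample numbers are illustrative and not taken from any particular tool.

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = total_requests * (1.0 - slo)
    if allowed_failures == 0:
        raise ValueError("no traffic observed, the budget is undefined")
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 1,000,000 requests allows roughly 1,000 failed requests.
remaining = error_budget_remaining(slo=0.999, total_requests=1_000_000, failed_requests=250)
print(f"{remaining:.0%} of the error budget is left")
```

With 250 failures out of roughly 1,000 allowed, about three quarters of the budget is still available. A negative result means the budget is already overspent.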

Ops

SRE and operational reliability

Tests should be linked to the SLO/error budget and the release decision.

Chaos engineering

  • Start with steady-state metrics (for example, p95 latency, success rate, lag).
  • Limit the blast radius: start with one service or region, then widen the experiment.
  • Define stop conditions before starting the experiment.
  • Automate rollback and record the results in postmortem format.
  • Run chaos experiments regularly, not as a one-time activity before release.
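The practices above can be sketched as a tiny experiment loop: inject a fault, sample steady-state metrics, abort when a stop condition trips, and always roll back. All names here (`run_experiment`, `ExperimentResult`, the metric keys) are hypothetical; a real setup would delegate injection to a chaos tool and read metrics from a monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ExperimentResult:
    aborted: bool
    reason: str
    samples: List[Dict[str, float]]  # raw readings, useful for the postmortem record


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    read_metrics: Callable[[], Dict[str, float]],
    stop_conditions: Dict[str, float],  # metric name -> upper limit, fixed before the run
    steps: int,
) -> ExperimentResult:
    """Inject a fault, sample metrics, and abort if any metric exceeds its limit."""
    samples: List[Dict[str, float]] = []
    inject()
    try:
        for _ in range(steps):
            metrics = read_metrics()
            samples.append(metrics)
            for name, limit in stop_conditions.items():
                if metrics.get(name, 0.0) > limit:
                    return ExperimentResult(True, f"{name} exceeded {limit}", samples)
        return ExperimentResult(False, "steady state held", samples)
    finally:
        rollback()  # runs on success, on abort, and on unexpected exceptions


# Usage with canned readings standing in for a monitoring system.
readings = iter([
    {"p95_ms": 120.0, "error_rate": 0.001},
    {"p95_ms": 480.0, "error_rate": 0.002},
])
result = run_experiment(
    inject=lambda: None,
    rollback=lambda: None,
    read_metrics=lambda: next(readings),
    stop_conditions={"p95_ms": 300.0, "error_rate": 0.01},
    steps=5,
)
print(result.aborted, result.reason)
```

Putting the rollback in a `finally` block is the design choice that matters here: the fault is removed even when the stop condition fires or the metrics source itself fails.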

Contract tests

Synchronous contracts

HTTP/gRPC schemas, required fields, error codes, timeouts, and retry semantics.

Asynchronous contracts

Event schema versioning, backward compatibility, and idempotent consumer logic.
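A minimal sketch of what an asynchronous contract test can assert: the consumer accepts both an old and a new schema version, and a redelivered event does not change state twice. The event fields, the `USD` default, and the `BalanceProjector` class are assumptions made for illustration.

```python
from typing import Dict, Set


def normalize(event: dict) -> dict:
    """Tolerant reader: accept v1 and v2 payloads and fill defaults for new fields."""
    return {
        "event_id": event["event_id"],
        "account": event["account"],
        "amount": event["amount"],
        # Assumption: "currency" was added in v2 and v1 events implied USD.
        "currency": event.get("currency", "USD"),
    }


class BalanceProjector:
    """Idempotent consumer: a redelivered event must not be applied twice."""

    def __init__(self) -> None:
        self.processed: Set[str] = set()
        self.balances: Dict[str, int] = {}

    def handle(self, raw: dict) -> bool:
        event = normalize(raw)
        if event["event_id"] in self.processed:
            return False  # duplicate delivery, safely ignored
        key = f'{event["account"]}:{event["currency"]}'
        self.balances[key] = self.balances.get(key, 0) + event["amount"]
        self.processed.add(event["event_id"])
        return True


projector = BalanceProjector()
v1 = {"event_id": "e1", "account": "acc-1", "amount": 100}
v2 = {"event_id": "e2", "account": "acc-1", "amount": 50, "currency": "USD", "schema_version": 2}
for delivery in (v1, v2, v1):  # the broker redelivers e1
    projector.handle(delivery)
assert projector.balances == {"acc-1:USD": 150}
```

In a real consumer the set of processed IDs and the state change would need to be persisted atomically; the in-memory set only illustrates the contract being tested.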

Consumer-driven contracts

Consumers set expectations; providers validate changes against them before merge or release.
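As a rough illustration of the flow, the consumer publishes an expectation and the provider replays it against its own handler. This standard-library sketch does not follow the format of any specific contract-testing tool; `CHECKOUT_EXPECTS`, `verify_provider`, and `provider_handler` are hypothetical names.

```python
from typing import Any, Dict, List

# Expectation published by a hypothetical "checkout" consumer.
CHECKOUT_EXPECTS = {
    "request": {"method": "GET", "path": "/users/42"},
    "response": {"status": 200, "body_types": {"id": int, "email": str}},
}


def verify_provider(expectation: Dict[str, Any], handler) -> List[str]:
    """Replay the consumer's request against the provider and list mismatches."""
    request = expectation["request"]
    expected = expectation["response"]
    status, body = handler(request["method"], request["path"])
    problems = []
    if status != expected["status"]:
        problems.append(f"status {status} != {expected['status']}")
    for name, expected_type in expected["body_types"].items():
        if name not in body:
            problems.append(f"missing field: {name}")
        elif not isinstance(body[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    return problems


def provider_handler(method: str, path: str):
    """Stand-in for the real provider; extra fields do not bother the consumer."""
    return 200, {"id": 42, "email": "a@example.com", "nickname": "al"}


assert verify_provider(CHECKOUT_EXPECTS, provider_handler) == []
```

The check only looks at fields the consumer actually relies on, so the provider stays free to add new fields without coordinating with every consumer.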

Contract as CI gate

The pipeline should block a release that contains incompatible contract changes.
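One way to wire a contract into CI is a script that compares the old and new schema and returns a non-zero exit code on breaking changes. The sketch below uses a deliberately simplified rule set (removed fields, changed types, new required fields) and a made-up schema representation; real schema registries apply richer compatibility rules.

```python
from typing import Dict, List

# Assumed representation: field name -> {"type": str, "required": bool}
Schema = Dict[str, Dict[str, object]]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List changes that would break parties relying on the old schema."""
    problems = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
    for name, spec in new.items():
        if name not in old and spec["required"]:
            problems.append(f"new required field: {name}")
    return problems


def gate_exit_code(old: Schema, new: Schema) -> int:
    """Return a process exit code: 0 lets the pipeline continue, 1 blocks it."""
    problems = breaking_changes(old, new)
    for problem in problems:
        print(f"BREAKING: {problem}")
    return 1 if problems else 0


OLD = {
    "order_id": {"type": "string", "required": True},
    "total": {"type": "number", "required": True},
}
COMPATIBLE = {**OLD, "coupon": {"type": "string", "required": False}}
INCOMPATIBLE = {"order_id": {"type": "integer", "required": True}}

assert gate_exit_code(OLD, COMPATIBLE) == 0
assert gate_exit_code(OLD, INCOMPATIBLE) == 1
```

In a pipeline step the result would typically be passed to `sys.exit(...)`, so that the CI system marks the job as failed and stops the release.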

Integration testing at scale

  • Ephemeral test environments per PR/branch with a subset of the production topology.
  • Seed datasets and replay real scenarios to check ordering and data consistency.
  • Fault injection in integration tests: packet loss, time skew, broker rebalance, DB failover.
  • Observability in tests: trace/span correlation, queue lag, retry depth, saturation signals.
  • A separate suite of long-running, large-scale tests (nightly/weekly) so that they do not slow down every delivery.
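Fault injection and retry-depth observability from the list above can be illustrated in miniature. This sketch wraps a fake broker client in a transport that fails a fixed number of times, then asserts both the outcome and how many attempts it took. `FlakyTransport`, `send_with_retry`, and `broker_send` are hypothetical; a real suite would inject faults at the network or infrastructure level.

```python
from typing import Callable, List, Tuple


class FlakyTransport:
    """Fails the first `failures` calls with ConnectionError, then delegates."""

    def __init__(self, send: Callable[[str], str], failures: int) -> None:
        self._send = send
        self._failures_left = failures
        self.calls = 0

    def __call__(self, message: str) -> str:
        self.calls += 1
        if self._failures_left > 0:
            self._failures_left -= 1
            raise ConnectionError("injected packet loss")
        return self._send(message)


def send_with_retry(transport, message: str, max_attempts: int) -> Tuple[str, int]:
    """Return (response, attempts used); re-raise once attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transport(message), attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")


delivered: List[str] = []


def broker_send(message: str) -> str:
    delivered.append(message)
    return "ack"


transport = FlakyTransport(broker_send, failures=2)
response, retry_depth = send_with_retry(transport, "order-1", max_attempts=3)
assert (response, retry_depth) == ("ack", 3)
assert delivered == ["order-1"]  # delivered once despite two injected failures
```

Using a deterministic failure count instead of random drops keeps the test reproducible, and asserting on `retry_depth` turns an observability signal into a checked property.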

Practical checklist

Services have SLOs, and tests validate them, not just the happy path.

There are contracts for synchronous APIs and asynchronous events.

The integration suite covers critical user journeys and degraded-mode scenarios.

Chaos experiments run on a schedule and have an owner.

The release gate considers test results, observability signals, and rollback time together.
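A combined release gate of this kind could be sketched as a single decision function. The signal names and the default thresholds (10% budget, 300 seconds) are assumptions chosen for illustration, not recommended values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReleaseSignals:
    tests_passed: bool
    alerts_firing: int
    error_budget_remaining: float  # fraction of the budget left, 0.0 to 1.0
    rollback_seconds: float        # measured during the last rollback drill


def release_decision(
    signals: ReleaseSignals,
    min_budget: float = 0.10,
    max_rollback_seconds: float = 300.0,
) -> Tuple[bool, List[str]]:
    """Return (allowed, blockers); every signal must pass for the release to go out."""
    blockers = []
    if not signals.tests_passed:
        blockers.append("test suites are not green")
    if signals.alerts_firing > 0:
        blockers.append(f"{signals.alerts_firing} alert(s) firing")
    if signals.error_budget_remaining < min_budget:
        blockers.append("error budget nearly exhausted")
    if signals.rollback_seconds > max_rollback_seconds:
        blockers.append("rollback is too slow")
    return (not blockers, blockers)


allowed, blockers = release_decision(
    ReleaseSignals(tests_passed=True, alerts_firing=0,
                   error_budget_remaining=0.05, rollback_seconds=120.0)
)
assert allowed is False
assert blockers == ["error budget nearly exhausted"]
```

Returning the list of blockers rather than a bare boolean makes a refused release explainable to whoever is waiting on it.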

Main anti-pattern: having only happy-path e2e tests and never testing controlled failure scenarios.

© 2026 Alexander Polomodov