Jepsen and consistency models
Key context on how distributed systems break down in reality.
Testing Distributed Systems - this is not a single kind of test, but a strategy that combines contract testing, integration tests on realistic infrastructure, and chaos engineering. The goal is not just to find bugs in the code, but to prove the system's resilience to network loss, partial failures, desynchronization, and dependency degradation.
Distributed systems testing stack
Deterministic component tests
Check business logic and state transitions in isolation, without network instability.
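As a minimal sketch of such a deterministic test, consider a pure state machine for a hypothetical order workflow (the states and transitions here are invented for illustration): the logic is exercised with no network, clocks, or concurrency, so the same inputs always produce the same result.

```python
# Hypothetical order state machine: pure transitions, no I/O, no clocks.
ALLOWED = {
    "created": {"paid", "cancelled"},
    "paid": {"shipped", "refunded"},
    "shipped": {"delivered"},
}

def transition(state: str, event: str) -> str:
    # Reject any transition not explicitly allowed for the current state.
    if event not in ALLOWED.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {event}")
    return event

# Deterministic assertions: rerunning them can never flake.
assert transition("created", "paid") == "paid"
assert transition("paid", "shipped") == "shipped"
try:
    transition("shipped", "paid")
    raise AssertionError("expected ValueError")
except ValueError:
    pass
```

Because the state machine is isolated from infrastructure, these tests stay fast and stable even while the surrounding system churns.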
Contract testing
Pin API and event contracts between services so that changes do not break neighboring teams.
Integration testing
Run end-to-end critical scenarios in a realistic environment with brokers, databases and retries.
Chaos experiments
Inject controlled failures: network loss, pod restarts, latency spikes, zone outage.
Production verification
Check SLO, error budget and rollback readiness on live traffic with guardrails.
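One guardrail computation behind such a check can be sketched as follows; the function name and thresholds are assumptions for illustration, not a real SRE toolkit API. It turns an SLO target and an observed success ratio into the fraction of error budget remaining, which a rollout controller could use as a halt signal.

```python
# Hypothetical guardrail: how much error budget is left during a rollout?
def error_budget_remaining(slo_target: float, success_ratio: float) -> float:
    """Fraction of the error budget left (1.0 = untouched, <= 0 = exhausted)."""
    budget = 1.0 - slo_target        # allowed failure ratio, e.g. 0.001
    burned = 1.0 - success_ratio     # observed failure ratio
    return 1.0 - burned / budget

# SLO of 99.9% success; observed 99.95% success -> half the budget left.
assert abs(error_budget_remaining(0.999, 0.9995) - 0.5) < 1e-9
# Observed 99.8% success -> budget exhausted; the rollout should halt.
assert error_budget_remaining(0.999, 0.998) <= 0
```

Wiring a signal like this into the deploy pipeline is what turns "production verification" from dashboards into an automatic release decision.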
SRE and operational reliability
Tests should be linked to the SLO/error budget and the release decision.
Chaos engineering
- Start with steady-state metrics (for example, p95 latency, success rate, lag).
- Limit blast radius: first to one service/region, then scale the experiment.
- Determine stop conditions before starting the experiment.
- Automate rollback and recording of results in postmortem format.
- Conduct chaos regularly, not as a one-time activity before release.
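The bullets above can be sketched as a minimal experiment loop. The hooks `inject_fault()`, `rollback()`, and `measure_p95_ms()` are hypothetical placeholders for real tooling; the point is that the steady-state threshold and stop condition are fixed before the run, and rollback always executes.

```python
# Minimal chaos-experiment loop: steady-state check, pre-agreed stop
# condition, guaranteed rollback. All hook names are hypothetical.
def run_chaos(inject_fault, rollback, measure_p95_ms,
              steady_p95_ms=200.0, stop_p95_ms=500.0, rounds=3):
    log = []
    try:
        for _ in range(rounds):
            inject_fault()
            p95 = measure_p95_ms()
            log.append(p95)
            if p95 > stop_p95_ms:        # stop condition hit: abort early
                return {"passed": False, "log": log}
    finally:
        rollback()                       # always restore steady state
    passed = all(p <= steady_p95_ms for p in log)
    return {"passed": passed, "log": log}

# Simulated run where p95 stays at 150 ms under fault: steady state holds.
result = run_chaos(lambda: None, lambda: None, lambda: 150.0)
assert result["passed"] and result["log"] == [150.0, 150.0, 150.0]
```

The returned log is exactly the kind of artifact worth recording in postmortem format, pass or fail.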
Contract tests
Synchronous contracts
HTTP/gRPC schemas, required fields, error codes, timeouts, and retry semantics.
Asynchronous contracts
Versioning of event schemas, backward compatibility, and idempotent consumer logic.
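Idempotent consumer logic can be sketched as deduplication by event id, so that broker redelivery never double-applies an event. The event shape and in-memory store below are illustrative assumptions; a real consumer would use a durable dedupe store.

```python
# Sketch of an idempotent consumer: redelivered events are ignored.
class Consumer:
    def __init__(self):
        self.seen = set()   # illustrative; production needs durable storage
        self.balance = 0

    def handle(self, event) -> bool:
        if event["id"] in self.seen:
            return False    # duplicate delivery, safely ignored
        self.seen.add(event["id"])
        self.balance += event["amount"]
        return True

c = Consumer()
assert c.handle({"id": "e1", "amount": 10}) is True
assert c.handle({"id": "e1", "amount": 10}) is False  # redelivered
assert c.balance == 10
```

This is exactly the property an asynchronous contract test should pin down: at-least-once delivery on the wire, exactly-once effect in the consumer.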
Consumer-driven contracts
Consumers set expectations, providers validate changes before merge/release.
Contract as CI gate
The pipeline must block a release when contract changes are incompatible.
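A toy version of the check such a gate runs: a provider change must not drop fields the consumer's contract marks as required. The contract representation here is invented for illustration; real tooling (e.g. consumer-driven contract frameworks) works on full schemas, not flat field sets.

```python
# Toy backward-compatibility check for a contract CI gate.
def breaking_changes(consumer_required: set, provider_fields: set) -> set:
    """Fields the consumer requires that the provider no longer sends."""
    return consumer_required - provider_fields

consumer = {"order_id", "status", "total"}
old_provider = {"order_id", "status", "total", "currency"}
new_provider = {"order_id", "status"}           # change dropped "total"

assert breaking_changes(consumer, old_provider) == set()
assert breaking_changes(consumer, new_provider) == {"total"}  # gate fails
```

A non-empty result is what should fail the pipeline, before the incompatible provider ever reaches a shared environment.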
Integration testing at scale
- Ephemeral test environments per PR/branch with a subset of production topology.
- Seed datasets + replay real scenarios to check ordering and data consistency.
- Fault injection in integration tests: packet loss, time skew, broker rebalance, DB failover.
- Observability in tests: trace/span correlation, queue lag, retry depth, saturation signals.
- A separate suite of long-running, large-scale tests (nightly/weekly) so they do not slow down every delivery.
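The fault-injection bullet above can be sketched with a deterministic wrapper that fails the first N attempts with a simulated network error, used to verify that the client's retry policy recovers. The helper names are illustrative, not a real library API.

```python
# Deterministic fault injection: fail the first `failures` attempts.
def flaky(fn, failures: int):
    state = {"left": failures}
    def wrapped(*args, **kwargs):
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("injected packet loss")
        return fn(*args, **kwargs)
    return wrapped

def call_with_retries(fn, attempts: int = 5):
    last = None
    for _ in range(attempts):
        try:
            return fn()
        except ConnectionError as exc:
            last = exc      # retry; a real client would back off here
    raise last

# Two injected failures, then the third attempt succeeds.
unreliable = flaky(lambda: "ok", failures=2)
assert call_with_retries(unreliable) == "ok"
```

Because the failure count is explicit rather than random, the test is reproducible, which keeps fault-injection suites out of the flaky-test graveyard.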
Practical checklist
Service SLOs exist, and tests validate them rather than just the happy path.
There are contracts for synchronous APIs and asynchronous events.
The integration suite covers critical user journeys and degraded-mode scenarios.
Chaos experiments run according to a schedule and have an owner.
The release gate weighs tests, observability, and rollback time together.
The main anti-pattern: shipping only happy-path e2e tests and never testing controlled failure scenarios.
