System Design Space

Updated: March 3, 2026 at 10:17 PM

Distributed tracing in microservices (Jaeger, Tempo)


Practical distributed tracing in microservices: tracing architecture, Jaeger and Tempo, write/read path, sampling strategy, and operational trade-offs.

Context

Observability & Monitoring Design

This chapter is a distributed tracing deep dive within the broader observability architecture.


Distributed tracing in microservices provides precise latency and error diagnostics on end-to-end request flows. This chapter extends Observability & Monitoring Design and focuses on architecture and operational decisions for Jaeger and Tempo.

Tooling: Jaeger and Tempo

Jaeger

A classic open-source tracing backend with a familiar UI and fast onboarding for teams starting distributed tracing.

Strengths

  • Fast time-to-value: waterfall and critical path become visible in the first rollout.
  • Mature ecosystem and straightforward integration with OpenTelemetry Collector.
  • Useful as an operational interface for on-call and incident triage.

Trade-offs

  • As trace volume grows, index and storage cost can become significant.
  • Retention strategy and tag cardinality control are mandatory.

Tempo

Grafana Labs tracing backend focused on low-cost storage in object storage and large-scale span ingestion.

Strengths

  • Cost-efficient at scale due to object-storage-first design and block format.
  • Fits naturally into Grafana Explore with shared observability workflows.
  • Strong option for high-throughput tracing platforms with longer retention.

Trade-offs

  • Troubleshooting quality depends on strong attributes and good sampling policies.
  • Read-path design must be intentional to keep trace lookup fast.

Tracing system visualization

[Figure: tracing topology in microservices. One telemetry platform serves two paths: the write path (ingest) and the read path (incident diagnostics). The ingest path runs top-to-bottom through four stages: instrumented microservices, where SDKs create spans and propagate trace context (traceparent) across HTTP/gRPC/Kafka hops; the OTel Collector, which applies enrichment, filtering, sampling, and routing policies; ingest backends, where the stream fans out to Jaeger Collector and/or Tempo Distributor; and tracing storage, where Jaeger writes indexes and spans to Cassandra/Elasticsearch while Tempo stores blocks in object storage (S3/GCS/MinIO) with retention tiers.]

Write path: how spans reach storage

  1. Service SDKs create spans and propagate traceparent across HTTP/gRPC/Kafka hops.
  2. OTel Collector enriches telemetry (service, env, tenant) and applies sampling policy.
  3. Collector exports the stream to Jaeger Collector and/or Tempo Distributor.
  4. Jaeger writes indexes and spans to backend; Tempo writes blocks into object storage.
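Step 1 hinges on the W3C Trace Context `traceparent` header, whose hex layout (`version-traceid-parentid-flags`) is fixed by the spec. A minimal sketch of what an SDK does at each hop; the helper names are hypothetical, and real OTel SDKs handle this automatically:

```python
import re
import secrets

# W3C Trace Context "traceparent": version-traceid-parentid-flags,
# hex-encoded as 2-32-16-2 characters.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge: fresh trace-id and span-id."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)     # 8 random bytes  -> 16 hex chars
    flags = "01" if sampled else "00"  # bit 0 = sampled
    return f"00-{trace_id}-{span_id}-{flags}"

def propagate(incoming: str) -> str:
    """On each hop: keep trace-id and flags, mint a new parent span-id."""
    m = TRACEPARENT_RE.match(incoming)
    if m is None:                      # broken context -> start a new trace
        return new_traceparent()
    trace_id, _, flags = m.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

If any hop drops or mangles this header, the trace silently splits into disconnected fragments, which is why the write-path checklist below treats propagation across every boundary as mandatory.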

Operational focus

  • Jaeger is convenient for fast onboarding and familiar troubleshooting UX.
  • Tempo reduces cost by using object storage and avoiding heavy classic trace indexing.
  • High-cardinality workloads depend on tail sampling and retention tiers.
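Tail sampling, mentioned above, decides per completed trace rather than per span. A minimal sketch of such a decision policy, with assumed thresholds and a hypothetical helper (not a Jaeger/Tempo API):

```python
import random

def keep_trace(spans, latency_ms=500, base_rate=0.05, rng=random.random):
    """Tail-sampling decision over a completed trace.

    Each span is (duration_ms, is_error). Keep every trace with an error
    or a slow span; keep only a probabilistic baseline of the rest.
    """
    if any(is_error for _, is_error in spans):
        return True                        # always keep failed requests
    if max(duration for duration, _ in spans) > latency_ms:
        return True                        # always keep slow requests
    return rng() < base_rate               # 5% baseline for healthy traffic
```

The trade-off: the backend must buffer all spans of a trace until it completes, which is exactly the memory/quota pressure the ingest-budget point above warns about.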

Design decisions for write/read path

Write path (ingest)

  • Context propagation must be guaranteed across every boundary: HTTP, gRPC, message broker, and background jobs.
  • OTel Collector acts as a policy point: enrichment, filtering, tail sampling, and backend routing.
  • For high-load systems, set ingest quotas so tracing does not consume logs/metrics budget.
  • Storage and retention are product-level requirements: cold/warm tiers, TTL, and cost targets.
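The Collector-as-policy-point idea can be sketched as a pipeline config. This is illustrative only: it assumes the contrib Collector build (which ships the `tail_sampling` processor), and the endpoints and policy values are placeholders, not recommendations:

```yaml
# Illustrative OTel Collector pipeline; endpoints and thresholds are assumptions.
receivers:
  otlp:
    protocols:
      grpc:

processors:
  tail_sampling:
    decision_wait: 10s               # buffer spans before deciding per trace
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317  # Jaeger accepts OTLP natively
  otlp/tempo:
    endpoint: tempo-distributor:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/jaeger, otlp/tempo]
```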

Read path (query)

  • Start trace search from impact signals: service, status=ERROR, and p95/p99 latency windows.
  • Jaeger UI and Grafana Explore should follow unified naming conventions for services and span attributes.
  • Track query latency as a dedicated SLI: engineers need traces in seconds, not minutes.
  • Trace + logs + metrics correlation is required; without it, investigations become manual and slow.

Practical rollout checklist

  • Define mandatory span attributes: service.name, deployment.environment, tenant, and error.type.
  • Create separate sampling profiles for critical APIs, background jobs, and noisy endpoints.
  • Load-test collectors and storage with synthetic and production-like traffic.
  • Embed trace links into runbooks and incident response templates for on-call teams.
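The mandatory-attribute rule is easiest to enforce as an automated contract check in CI or in a Collector-side validator. A sketch with a hypothetical helper (the attribute names follow OTel semantic conventions plus the tenant field from the checklist; the validator itself is not a library API):

```python
# Mandatory attributes from the observability contract above.
REQUIRED_ATTRIBUTES = {"service.name", "deployment.environment",
                       "tenant", "error.type"}

def missing_attributes(span_attributes: dict) -> set:
    """Return the contract attributes a span failed to set."""
    return REQUIRED_ATTRIBUTES - span_attributes.keys()

# Example span missing its error.type attribute:
span = {"service.name": "checkout", "deployment.environment": "prod",
        "tenant": "acme", "http.method": "POST"}
```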

Common anti-patterns

  • Enabling 100% sampling without estimating storage/query cost or profiling traffic criticality.
  • Capturing only edge spans and missing internal microservice hops.
  • Allowing inconsistent span/tag naming across teams without an observability contract.
  • Treating tracing as a separate tool instead of part of the incident-management workflow.


© 2026 Alexander Polomodov