System Design Space

Updated: March 24, 2026 at 3:23 PM

Distributed tracing in microservices (Jaeger, Tempo)

Practical distributed tracing in microservices: tracing architecture, Jaeger and Tempo, write/read path, sampling strategy, and operational trade-offs.

Distributed tracing becomes mandatory once the critical request path can no longer be reconstructed from memory or scattered logs.

The chapter breaks down the tracing pipeline through Jaeger, Tempo, write and read paths, and sampling, showing how microservice observability is constrained by the balance between data depth, latency, and storage cost.

In interviews, it helps you explain when traces beat metrics or logs, how to find latency hotspots, and why full-fidelity tracing is not always worth the price.

Practical value of this chapter

Design in practice

Turn guidance on distributed tracing in microservices and latency-path diagnostics into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.

Decision quality

Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.

Interview articulation

Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.

Trade-off framing

Make trade-offs explicit for distributed tracing in microservices and latency-path diagnostics: release speed, automation level, observability cost, and operational complexity.

Context

Observability & Monitoring Design

This chapter is a distributed tracing deep dive within the broader observability architecture.

Distributed tracing in microservices provides precise latency and error diagnostics on end-to-end request flows. This chapter extends Observability & Monitoring Design and focuses on architecture and operational decisions for Jaeger and Tempo.

Tooling: Jaeger and Tempo

Jaeger

A classic open-source tracing backend with a familiar UI and fast onboarding for teams starting distributed tracing.

Strengths

  • Fast time-to-value: waterfall and critical path become visible in the first rollout.
  • Mature ecosystem and straightforward integration with OpenTelemetry Collector.
  • Useful as an operational interface for on-call and incident triage.

Trade-offs

  • As trace volume grows, index and storage cost can become significant.
  • Retention strategy and tag cardinality control are mandatory.
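
Tag cardinality control can be sketched as a per-tag value budget that collapses overflow values into a single bucket. This is a hypothetical helper (the name `cap_tag_value` and the `__overflow__` sentinel are assumptions, not a Jaeger feature):

```python
def cap_tag_value(value: str, seen: set, limit: int,
                  overflow: str = "__overflow__") -> str:
    """Admit a tag value only while the per-tag value budget allows;
    once the budget is exhausted, collapse every new value into a
    single overflow bucket so index cardinality stays bounded."""
    if value in seen:
        return value          # already indexed, no new cardinality
    if len(seen) >= limit:
        return overflow       # budget exhausted: degrade, don't grow
    seen.add(value)
    return value
```

In practice the same idea is applied per attribute key (e.g. `user.id`), with the budget tuned to what the index backend can tolerate.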

Tempo

Grafana Labs' tracing backend, focused on low-cost storage in object stores and large-scale span ingestion.

Strengths

  • Cost-efficient at scale due to object-storage-first design and block format.
  • Fits naturally into Grafana Explore with shared observability workflows.
  • Strong option for high-throughput tracing platforms with longer retention.

Trade-offs

  • Troubleshooting quality depends on strong attributes and good sampling policies.
  • Read-path design must be intentional to keep trace lookup fast.

Tracing system visualization

Tracing topology in microservices

The same telemetry platform serves two different paths: write path (ingest) and read path (incident diagnostics).

Vertical ingest-path view

Stages are arranged top-to-bottom: source, processing, backend, and outcome.

Instrumented microservices

SDKs create spans and propagate trace context across HTTP/gRPC/Kafka hops.

Tags: SDK · HTTP/gRPC · Kafka · traceparent
Transition to next stage: spans + traceparent

OTel Collector

Collector applies enrichment, filtering, sampling, and telemetry routing policies.

Tags: enrichment · sampling policy · routing
Transition to next stage: OTLP export

Ingest backends

The stream is exported into Jaeger Collector and/or Tempo Distributor.

Tags: Jaeger Collector · Tempo Distributor · fan-out
Transition to next stage: index + spans / blocks

Tracing storage

Jaeger writes indexes and spans to its storage backend; Tempo stores blocks in object storage.

Tags: Cassandra/ES · S3/GCS/MinIO · retention tiers

Write path: how spans reach storage

  1. Service SDKs create spans and propagate traceparent across HTTP/gRPC/Kafka hops.
  2. OTel Collector enriches telemetry (service, env, tenant) and applies sampling policy.
  3. Collector exports the stream to Jaeger Collector and/or Tempo Distributor.
  4. Jaeger writes indexes and spans to its storage backend; Tempo writes blocks into object storage.
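
Step 1 above relies on the W3C Trace Context `traceparent` header. A minimal stdlib-only sketch of what propagation preserves across a hop (the function names `new_traceparent` and `propagate` are hypothetical, not an OTel SDK API; real SDKs do this automatically):

```python
import re
import secrets

# W3C Trace Context, version 00: 00-<trace-id>-<span-id>-<flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: fresh trace-id and root span-id."""
    trace_id = secrets.token_hex(16)   # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)     # 8 bytes  -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def propagate(incoming: str) -> str:
    """Continue the trace across a hop: keep trace-id and sampling flags,
    mint a new child span-id. Broken context falls back to a new trace."""
    m = TRACEPARENT_RE.match(incoming)
    if m is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = m.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

The key invariant: every hop (HTTP header, gRPC metadata, Kafka message header) carries the same trace-id and flags, while each span gets its own span-id.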

Operational focus

  • Jaeger is convenient for fast onboarding and familiar troubleshooting UX.
  • Tempo reduces cost by storing blocks in object storage and avoiding heavy classic trace indexing.
  • High-cardinality workloads depend on tail sampling and retention tiers.
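
Head sampling is usually made consistent by hashing the trace-id, so every service reaches the same keep/drop decision without coordination. A sketch under that assumption (the SHA-256 bucketing and the name `head_sample` are illustrative, not a specific Jaeger/Tempo mechanism):

```python
import hashlib

def head_sample(trace_id_hex: str, rate: float) -> bool:
    """Deterministic head-sampling decision: map the trace-id into
    [0, 1) via a hash and keep the trace if it falls below the rate.
    Every hop computes the same answer for the same trace."""
    digest = hashlib.sha256(trace_id_hex.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the decision is a pure function of the trace-id, a trace is never half-sampled across services, which is what makes head sampling cheap but blind to errors and latency that appear later in the trace.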

Design decisions for write/read path

Write path (ingest)

  • Context propagation must be guaranteed across every boundary: HTTP, gRPC, message broker, and background jobs.
  • OTel Collector acts as a policy point: enrichment, filtering, tail sampling, and backend routing.
  • For high-load systems, set ingest quotas so tracing does not consume logs/metrics budget.
  • Storage and retention are product-level requirements: cold/warm tiers, TTL, and cost targets.
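
Tail sampling at the Collector policy point decides per whole trace, after buffering all its spans. A simplified sketch of such a policy (the `Span` shape, thresholds, and `keep_trace` are hypothetical; the real OTel tail-sampling processor is configured declaratively, not coded like this):

```python
import random
from dataclasses import dataclass

@dataclass
class Span:
    duration_ms: float
    is_error: bool

def keep_trace(spans: list, latency_threshold_ms: float = 500.0,
               baseline_rate: float = 0.05) -> bool:
    """Decide once the whole trace is buffered: always keep errors
    and slow traces, keep a small probabilistic baseline otherwise."""
    if any(s.is_error for s in spans):
        return True                                   # error policy
    if max(s.duration_ms for s in spans) > latency_threshold_ms:
        return True                                   # latency policy
    return random.random() < baseline_rate            # baseline sample
```

The trade-off this sketch exposes: tail sampling keeps exactly the traces on-call engineers need, but requires buffering complete traces in the Collector, which is where the ingest quota and memory budget come in.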

Read path (query)

  • Start trace search from impact signals: service, status=ERROR, and p95/p99 latency windows.
  • Jaeger UI and Grafana Explore should follow unified naming conventions for services and span attributes.
  • Track query latency as a dedicated SLI: engineers need traces in seconds, not minutes.
  • Trace + logs + metrics correlation is required; without it, investigations become manual and slow.
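
The impact-signal starting point can be sketched as a filter over stored traces. This is a toy in-memory model (real backends run such queries server-side; the dict shapes and `find_incident_traces` are assumptions for illustration):

```python
def find_incident_traces(traces: list, service: str,
                         p99_window_ms: float) -> list:
    """Read path entry point: return trace-ids where the target service
    either errored or exceeded the p99 latency window."""
    hits = []
    for trace in traces:
        relevant = [s for s in trace["spans"] if s["service"] == service]
        if not relevant:
            continue
        has_error = any(s["status"] == "ERROR" for s in relevant)
        slow = max(s["duration_ms"] for s in relevant) > p99_window_ms
        if has_error or slow:
            hits.append(trace["trace_id"])
    return hits
```

Starting from status and latency rather than free-text search is what keeps trace lookup within the seconds-level SLI mentioned above.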

Practical rollout checklist

Define mandatory span attributes: service.name, deployment.environment, tenant, and error.type.
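
An attribute contract like this can be enforced at ingest. A sketch of a validator using the attribute list above (the `is_error` gate for `error.type` and the name `missing_attributes` are assumptions, not an OTel API):

```python
# Contract-mandated attributes from the rollout checklist.
REQUIRED_ATTRS = {"service.name", "deployment.environment", "tenant"}

def missing_attributes(attrs: dict, *, is_error: bool = False) -> set:
    """Return which mandatory attributes a span is missing or left empty.
    error.type is only required on failed spans (assumption)."""
    required = set(REQUIRED_ATTRS)
    if is_error:
        required.add("error.type")
    return {k for k in required if not attrs.get(k)}
```

Run as a Collector-side check or a CI lint over instrumentation, this turns the naming contract into something enforceable rather than aspirational.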

Create separate sampling profiles for critical APIs, background jobs, and noisy endpoints.

Load-test collectors and storage on synthetic and production-like traffic.

Embed trace links into runbooks and incident response templates for on-call teams.

Common anti-patterns

Enable 100% sampling without estimating storage/query cost and without traffic criticality profiles.

Capture only edge spans and miss internal microservice hops.

Allow inconsistent span/tag naming across teams without an observability contract.

Treat tracing as a separate tool instead of part of the incident-management workflow.
