
Updated: May 13, 2026 at 12:00 PM

Distributed tracing in microservices (Jaeger, Tempo)

medium

Practical distributed tracing in microservices: Jaeger, Tempo, OpenTelemetry, write and read paths, sampling, trace storage, and latency investigation.

Distributed tracing becomes mandatory once the critical request path can no longer be reconstructed from memory or scattered logs.

The chapter breaks down Jaeger, Tempo, write and read paths, and sampling, showing how microservice observability is constrained by the balance between data depth, latency, and storage cost.

In interviews, it helps you explain when traces are better than metrics or logs, how to find latency hotspots, and why full-fidelity tracing is not always worth the price.

Practical value of this chapter

Design in practice

Design the span path from service to storage and back to investigation: instrumentation, collector, sampling, storage, and lookup.

Decision quality

Evaluate the stack through trace lookup speed, storage cost, tag cardinality, and critical-path completeness.

Interview articulation

Show when tracing is more useful than metrics or logs, how it isolates latency, and why sampling must be intentional.

Trade-off framing

Make the cost of full-fidelity tracing, attribute depth, retention, and investigation UI speed explicit.

Context

Observability & Monitoring Design

This chapter dives deeper into distributed tracing inside the broader observability platform.


Distributed tracing in microservices shows the path of a request across services, queues, and dependencies so teams can find latency or errors from connected spans instead of guesses. This chapter covers trace context, context propagation, write path, read path, sampling, tail sampling, and trace storage in Jaeger and Tempo. It extends Observability & Monitoring Design and focuses on the operational decisions that shorten incident investigations.
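To make "latency from connected spans" concrete, here is a minimal Python sketch that computes each span's self time, meaning its duration minus the time covered by its direct children. Sorting spans by self time is one common first step when looking for a hotspot. The `Span` fields, the `self_times` helper, and the sample service names are hypothetical; Jaeger and Tempo use their own span models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float


def self_times(spans):
    """Map span_id -> time spent in the span itself, excluding direct children."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    result = {}
    for span in spans:
        covered = 0.0
        cursor = span.start_ms  # end of the child time already counted
        for child in sorted(children.get(span.span_id, []), key=lambda c: c.start_ms):
            # Clip the child to the parent and skip any part already counted,
            # so overlapping (parallel) children are not subtracted twice.
            start = max(child.start_ms, cursor)
            end = min(child.end_ms, span.end_ms)
            if end > start:
                covered += end - start
                cursor = end
        result[span.span_id] = (span.end_ms - span.start_ms) - covered
    return result


# Hypothetical trace: gateway -> orders -> (db, cache)
trace = [
    Span("a", None, "gateway", 0, 120),
    Span("b", "a", "orders", 10, 110),
    Span("c", "b", "db", 20, 90),
    Span("d", "b", "cache", 30, 40),
]
times = self_times(trace)
hotspot = max(times, key=times.get)  # "c", the db span in this sample
```

In this sample the gateway span is the longest, but most of its duration is spent waiting on children; the db span has the largest self time and is the better place to start an investigation.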

Tooling: Jaeger and Tempo

Jaeger

A classic open-source tracing backend with a familiar interface and fast onboarding for teams starting distributed tracing.

Strengths

  • Fast time-to-value: waterfall view and critical path become useful during the first staged rollout.
  • Mature integrations with OpenTelemetry Collector and the wider tracing ecosystem.
  • Works well as an operational interface for on-call engineers and incident triage.

Trade-offs

  • As trace volume grows, storage and index cost become noticeable.
  • Retention strategy and tag-cardinality control need to be explicit.

Tempo

Grafana Labs' tracing backend, optimized for low-cost object storage and high-volume span ingestion.

Strengths

  • Cost-efficient at scale thanks to object storage and a block-based data format.
  • Fits naturally with Grafana Explore and metrics in one diagnostic workflow.
  • A strong fit for high-volume tracing platforms with longer retention tiers.

Trade-offs

  • Troubleshooting quality depends on good span attributes and sampling policy.
  • The read path must be designed intentionally or trace lookup becomes slow.

Distributed tracing diagram

Tracing topology in microservices

The same telemetry platform serves two different paths: the write path (ingest) and the read path (incident diagnostics).

Vertical ingest-path view

Stages are arranged top-to-bottom: source, processing, backend, and outcome.

Instrumented microservices

SDKs create spans and propagate trace context across HTTP/gRPC/Kafka hops.

SDK, HTTP/gRPC, Kafka, traceparent
Transition to next stage: spans + traceparent

OpenTelemetry Collector

The Collector applies enrichment, filtering, sampling, and telemetry routing policies.

enrichment, sampling policy, routing
Transition to next stage: OTLP export

Ingest backends

The stream is exported to the Jaeger Collector, the Tempo Distributor, or both.

Jaeger Collector, Tempo Distributor, fan-out
Transition to next stage: index + spans / blocks

Tracing storage

Jaeger writes indexes and spans to its storage backend; Tempo stores blocks in object storage.

Cassandra/ES, S3/GCS/MinIO, retention tiers

Write path: how spans reach storage

  1. Service SDKs create spans and propagate traceparent across HTTP/gRPC/Kafka hops.
  2. The OTel Collector enriches telemetry (service, env, tenant) and applies the sampling policy.
  3. The Collector exports the stream to the Jaeger Collector, the Tempo Distributor, or both.
  4. Jaeger writes indexes and spans to its storage backend (for example Cassandra or Elasticsearch); Tempo writes blocks into object storage.
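Step 1 of the write path relies on the W3C `traceparent` header, which has the form `version-traceid-parentid-flags`. The sketch below shows, under simplifying assumptions, how a service might parse an incoming header and build the header for an outgoing call. In practice the OpenTelemetry SDK propagators handle this; `parse_traceparent` and `child_traceparent` are hypothetical helper names, and the sketch accepts only the four-field layout.

```python
import re
import secrets

# version(2 hex)-trace id(32 hex)-parent span id(16 hex)-flags(2 hex), lowercase.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)
    return trace_id, parent_id, sampled


def child_traceparent(incoming):
    """Build the header for an outgoing call: same trace ID, new span ID."""
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed is None:
        # No usable context: start a new trace. Marking it sampled is an
        # assumption of this sketch, not a rule from the spec.
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, _, sampled = parsed
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

The important property is that the trace ID survives every hop while the span ID changes; a boundary that drops the header (a queue consumer, a background job) splits one request into disconnected traces.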

Operational focus

  • Jaeger is convenient for fast onboarding and familiar troubleshooting UX.
  • Tempo reduces cost by keeping traces in object storage and avoiding the heavy per-span indexing of classic backends.
  • High-volume, high-cardinality workloads stay affordable only with tail sampling and retention tiers.
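Tail sampling decides whether to keep a trace only after its spans have arrived, so the decision can use outcome data such as errors and latency. The sketch below illustrates the idea in plain Python; it is not the OpenTelemetry Collector's implementation (the Collector's contrib distribution provides a `tail_sampling` processor for this). The `Span` fields, the `keep_trace` helper, and the threshold and rate defaults are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    error: bool = False


def keep_trace(spans, latency_threshold_ms=500.0, baseline_rate=0.05):
    """Return True if a completed trace (all spans share a trace_id) should be stored."""
    if not spans:
        return False
    if any(span.error for span in spans):
        return True  # always keep traces that contain an error
    if max(span.duration_ms for span in spans) >= latency_threshold_ms:
        return True  # always keep slow traces
    # Keep a deterministic fraction of the remaining healthy traces. Hashing
    # the trace ID means every replica reaches the same decision.
    digest = hashlib.sha256(spans[0].trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(baseline_rate * 2**64)
```

The trade-off is buffering: spans must be held until the trace is considered complete, which costs collector memory and requires all spans of one trace to reach the same decision point.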

Decisions for the write and read paths

Write path

  • Trace context must propagate across every boundary: HTTP, gRPC, message brokers, and background jobs.
  • OpenTelemetry Collector acts as the policy point for enrichment, filtering, tail sampling, and telemetry routing.
  • For high-load systems, set separate ingest quotas so tracing does not consume the logs and metrics budget.
  • Storage is a product requirement: retention tiers, TTL, and target storage cost need to be designed up front.
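A back-of-the-envelope estimate helps when sizing retention up front. The sketch below multiplies traffic, span count, span size, sampling rate, and retention; every input value is hypothetical, and the result ignores compression, index overhead, and replication, so treat it as an order-of-magnitude figure only.

```python
SECONDS_PER_DAY = 86_400


def retained_bytes(requests_per_second, spans_per_request, bytes_per_span,
                   sample_rate, retention_days):
    """Rough raw span volume held in storage at steady state."""
    spans_per_day = (
        requests_per_second * SECONDS_PER_DAY * spans_per_request * sample_rate
    )
    return spans_per_day * bytes_per_span * retention_days


# Hypothetical workload: 2,000 req/s, 20 spans per request, 500 bytes per span,
# 14 days of retention.
full = retained_bytes(2_000, 20, 500, 1.0, 14)
sampled = retained_bytes(2_000, 20, 500, 0.05, 14)
print(f"100% sampling: {full / 1e12:.2f} TB")    # 24.19 TB
print(f"5% sampling:   {sampled / 1e12:.2f} TB")  # 1.21 TB
```

Running the same calculation per retention tier (for example, a short hot tier and a longer cold tier) makes the cost of each extra day of retention explicit.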

Read path

  • Start trace search from impact signals: service, status=ERROR, and p95/p99 high-latency windows.
  • Jaeger UI and Grafana Explore should share naming conventions for services and span attributes.
  • Track query latency as its own SLI: engineers need traces in seconds, not minutes.
  • Trace correlation with logs and metrics is mandatory; without it, investigation becomes manual and slow.
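Starting the search from impact signals can be sketched as a filter over trace summaries: keep traces for one service that either contain an error or fall at or above a latency percentile, and list errors first. This is an illustration in plain Python, not the Jaeger or Tempo query API; `TraceSummary`, `percentile`, and `candidates` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str
    root_service: str
    duration_ms: float
    has_error: bool = False


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def candidates(summaries, service, pct=99):
    """Traces worth opening first: errors, then the slowest requests."""
    in_scope = [t for t in summaries if t.root_service == service]
    if not in_scope:
        return []
    cutoff = percentile([t.duration_ms for t in in_scope], pct)
    hits = [t for t in in_scope if t.has_error or t.duration_ms >= cutoff]
    # Errors first, then longest duration first.
    return sorted(hits, key=lambda t: (not t.has_error, -t.duration_ms))
```

The same shape applies when the filter runs inside the backend: narrow by service and time window first, then by error status and duration, so the query touches as little stored data as possible.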

Practical rollout checklist

Define mandatory span attributes: service.name, deployment.environment, tenant, and error.type.

Create separate sampling profiles for critical APIs, background jobs, and noisy endpoints.

Load-test the telemetry collector and trace storage with synthetic and production-like traffic.

Embed trace links into runbooks and incident-response templates for on-call teams.
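The mandatory-attribute item in this checklist can be enforced with a small contract check, for example in a CI test or a collector-side validation step. The sketch below is one possible shape; `missing_attributes` is a hypothetical helper, and requiring `error.type` only on failed spans is an assumption of this sketch rather than a rule stated in the checklist.

```python
# Attribute names taken from the checklist above.
REQUIRED_ATTRIBUTES = ("service.name", "deployment.environment", "tenant")


def missing_attributes(attributes, is_error=False):
    """Return the required attribute keys that are absent or empty."""
    required = list(REQUIRED_ATTRIBUTES)
    if is_error:
        # Assumption: error.type is only meaningful when the span failed.
        required.append("error.type")
    return [key for key in required if not attributes.get(key)]
```

A check like this is cheap to run against sampled spans from each service and surfaces naming drift before it reaches dashboards and search queries.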

Common anti-patterns

Enabling 100% sampling without estimating storage and query cost or defining profiles by traffic criticality.

Capturing only edge spans and losing internal microservice hops.

Letting teams use inconsistent span and tag naming without a shared observability contract.

Treating tracing as a standalone tool instead of part of the incident-management workflow.
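As a counterpoint to the blanket 100% sampling anti-pattern, the sketch below assigns sampling rates by traffic criticality and makes a deterministic head-sampling decision from the trace ID. The profile names, rates, and route prefixes are hypothetical, and using the low 64 bits of the trace ID as the sampling key is a simplification chosen for this sketch.

```python
# Hypothetical profiles: fraction of traces kept per traffic class.
PROFILE_RATES = {"critical": 1.0, "default": 0.10, "noisy": 0.01}

# Hypothetical route prefixes mapped to profiles; the first match wins.
ROUTE_PROFILES = [
    ("/api/payments", "critical"),
    ("/healthz", "noisy"),
    ("/metrics", "noisy"),
]


def profile_for(route):
    """Pick the sampling profile for a request route."""
    for prefix, profile in ROUTE_PROFILES:
        if route.startswith(prefix):
            return profile
    return "default"


def should_sample(trace_id, route):
    """Deterministic decision: the same trace ID always gets the same answer."""
    rate = PROFILE_RATES[profile_for(route)]
    low_bits = int(trace_id[-16:], 16)  # trace_id is 32 lowercase hex chars
    return low_bits < int(rate * 2**64)
```

Because the decision depends only on the trace ID and the profile, every service that sees the same trace reaches the same answer without coordination.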
