Distributed tracing in microservices (Jaeger, Tempo)

Distributed tracing becomes mandatory once the critical request path can no longer be reconstructed from memory or scattered logs.

The chapter breaks down Jaeger, Tempo, write and read paths, and sampling, showing how microservice observability is constrained by the balance between data depth, latency, and storage cost.

In interviews, it helps you explain when traces are better than metrics or logs, how to find latency hotspots, and why full-fidelity tracing is not always worth the price.

Practical value of this chapter

Design in practice

Design the span path from service to storage and back to investigation: instrumentation, collector, sampling, storage, and lookup.

Decision quality

Evaluate the stack through trace lookup speed, storage cost, tag cardinality, and critical-path completeness.

Interview articulation

Show when tracing is more useful than metrics or logs, how it isolates latency, and why sampling must be intentional.

Trade-off framing

Make the cost of full-fidelity tracing, attribute depth, retention, and investigation UI speed explicit.

Context

Observability & Monitoring Design

Here we drop one level below baseline observability: how tracing follows a request across services and where it saves time during an investigation.


Once a request passes through a dozen services, queues, and dependencies, "where exactly is it slow" stops being answerable from one service's logs. Distributed tracing in microservices reconstructs the full path of a request from connected spans, so teams find latency or errors from facts instead of guesses. This chapter covers trace context, context propagation, write path, read path, sampling, tail sampling, and trace storage in Jaeger and Tempo. It extends Observability & Monitoring Design and focuses on the operational decisions that shorten incident investigations.
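As a rough illustration of how a path is reconstructed from connected spans, the sketch below links spans through their parent ids and follows the longest-running child at each level. The `Span` fields, service names, and timings are hypothetical, and the "slowest child" walk is a simplification: it ignores parallel children and overlap, which a real critical-path analysis in a tracing UI has to handle.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def build_children(spans):
    """Index spans by parent id so the tree can be walked top-down."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)
    return children


def slowest_path(spans):
    """Follow the longest-running child at each level, starting from the root."""
    children = build_children(spans)
    roots = children[None]
    if len(roots) != 1:
        raise ValueError("expected exactly one root span")
    path = [roots[0]]
    while children[path[-1].span_id]:
        path.append(max(children[path[-1].span_id], key=lambda s: s.duration_ms))
    return [s.service for s in path]


# Hypothetical trace: five spans from four hops behind a gateway.
spans = [
    Span("a", None, "gateway", 0, 480),
    Span("b", "a", "orders", 10, 460),
    Span("c", "b", "inventory", 20, 60),
    Span("d", "b", "payments", 70, 450),
    Span("e", "d", "bank-adapter", 80, 440),
]
print(slowest_path(spans))
```

With this data the walk goes gateway, orders, payments, bank-adapter, which is the kind of answer "where exactly is it slow" needs and that a single service's logs cannot give.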

Tooling: Jaeger and Tempo

Jaeger

When you are just standing up tracing, seeing the first result fast matters most. Jaeger is a classic open-source tracing backend with a familiar interface that gets a team to distributed tracing without a long setup.

Strengths

Fast time-to-value: waterfall view and critical path become useful during the first staged rollout.
Mature integrations with OpenTelemetry Collector and the wider tracing ecosystem.
Works well as an operational interface for on-call engineers and incident triage.

Trade-offs

As trace volume grows, storage and index cost become noticeable.
Retention strategy and tag-cardinality control need to be explicit.

Tempo

Tempo is Grafana Labs' tracing backend. It bets on low-cost object storage, so it can absorb high-volume span ingestion at the point where index-heavy backends start running up a storage bill.

Strengths

Cost-efficient at scale thanks to object storage and a block-based data format.
Metrics and traces sit side by side in Grafana Explore, so on-call engineers move from a latency spike to a specific span without switching tools.
A strong fit for high-volume tracing platforms with longer retention tiers.

Trade-offs

Troubleshooting quality depends on good span attributes and sampling policy.
The read path must be designed intentionally or trace lookup becomes slow.

Distributed tracing diagram

Tracing topology in microservices

The same telemetry platform serves two different paths: write path (ingest) and read path (incident diagnostics).

Vertical ingest-path view

Stages are arranged top-to-bottom: source, processing, backend, and outcome.

Instrumented microservices

SDKs create spans and propagate trace context across HTTP/gRPC/Kafka hops.

Labels: SDK, HTTP/gRPC, Kafka, traceparent

Passed to the next stage: spans + traceparent

OpenTelemetry Collector

Collector applies enrichment, filtering, sampling, and telemetry routing policies.

Labels: enrichment, sampling policy, routing

Passed to the next stage: OTLP export

Ingest backends

The stream is exported into Jaeger Collector and/or Tempo Distributor.

Labels: Jaeger Collector, Tempo Distributor, fan-out

Passed to the next stage: index + spans / blocks

Tracing storage

Jaeger writes indexes/spans to backend, Tempo stores blocks in object storage.

Labels: Cassandra/ES, S3/GCS/MinIO, retention tiers

Write path: how spans reach storage

Service SDKs create spans and propagate traceparent across HTTP/gRPC/Kafka hops.
OTel Collector enriches telemetry (service, env, tenant) and applies sampling policy.
Collector exports the stream to Jaeger Collector and/or Tempo Distributor.
Jaeger writes indexes and spans to its storage backend (for example Cassandra or Elasticsearch); Tempo writes blocks into object storage.
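The first step of this write path relies on the W3C `traceparent` header, whose value has the form `version-traceid-parentid-flags` in lowercase hex. The sketch below parses an incoming header and builds the header for an outgoing call. It is a simplified illustration that accepts only the four-field form; the function names are hypothetical, and in practice the OpenTelemetry SDK propagators would normally do this for you.

```python
import re
import secrets

# W3C Trace Context: 2 hex version, 32 hex trace id, 16 hex parent span id, 2 hex flags.
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled) or None if the header is invalid."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero ids are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming):
    """Build the header for an outgoing call: same trace id, fresh span id."""
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed is None:
        # No usable context: start a new trace (marked sampled here for simplicity).
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, _, sampled = parsed
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
# For a Kafka hop the same value travels as a message header (bytes) instead of an HTTP header.
kafka_headers = [("traceparent", outgoing.encode("ascii"))]
```

The trace id stays constant across every hop while the span id changes, which is what lets the backend stitch spans from different services into one trace.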

Operational focus

Jaeger is convenient for fast onboarding and familiar troubleshooting UX.
Tempo reduces cost by keeping traces in object storage and avoiding the heavy indexing of classic trace backends.
High-volume, high-cardinality workloads stay affordable only with tail sampling and retention tiers.

Decisions for the write and read paths

Write path

Trace context is lost at any boundary where you forget to pass it on, so it has to travel through everything: HTTP, gRPC, message brokers, and background jobs. One missed hop and the trace breaks exactly where the investigation is happening.
OpenTelemetry Collector acts as the policy point for enrichment, filtering, tail sampling, and telemetry routing.
For high-load systems, set separate ingest quotas so tracing does not consume the logs and metrics budget.
Storage is a product requirement, not a default setting: retention tiers, TTL, and target storage cost are decided up front, or the budget leaks quietly and the old traces are not there when you need them.
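To make the tail sampling point concrete, here is a minimal sketch of the decision itself: it runs after the whole trace has been seen, always keeps errors and slow requests, and keeps only a small share of healthy traffic. The thresholds, the `FinishedTrace` type, and the function name are assumptions for illustration. In a real deployment this policy would typically be expressed as Collector configuration rather than application code.

```python
import random
from dataclasses import dataclass


@dataclass
class FinishedTrace:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(trace, latency_threshold_ms=500.0, baseline_rate=0.05, rng=random.random):
    """Tail-sampling decision made once all spans of the trace have arrived."""
    if trace.has_error:
        return True  # always keep failures
    if trace.duration_ms >= latency_threshold_ms:
        return True  # always keep slow requests
    return rng() < baseline_rate  # keep a small share of healthy traffic


print(keep_trace(FinishedTrace("t1", 20.0, has_error=True)))
print(keep_trace(FinishedTrace("t2", 900.0, has_error=False)))
```

The trade-off is that the decision point has to buffer spans until the trace is complete, so memory on the collector side becomes part of the ingest budget.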

Read path

Start trace search from user-impact signals rather than from an arbitrary request: service name, status=ERROR, and time windows with high p95/p99 latency. That way the investigation begins with what actually hurts, not with a random sample.
Jaeger UI and Grafana Explore should share naming conventions for services and span attributes.
Track query latency of the tracing UI with its own SLI: during an incident an engineer must get a trace in seconds, or the slow lookup itself becomes part of the outage.
Trace correlation with logs and metrics is mandatory; without it, investigation becomes manual and slow.
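One common way to get the trace-to-log correlation described above is to stamp every log line with the active trace id, so a trace found in the UI leads straight to its logs and back. The sketch below does this with the standard `logging` module and a context variable. The logger name, format string, and the way the trace id is set are assumptions for illustration; a real service would take the id from its tracing SDK.

```python
import contextvars
import io
import logging

# Holds the trace id of the request currently being handled (hypothetical wiring).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace id to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def make_logger(stream):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("checkout")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.error("payment declined")
print(buffer.getvalue(), end="")
```

Once the id is a searchable field in the log backend, moving from a span to the matching log lines becomes a single query instead of a manual timestamp match.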

Practical rollout checklist

Define mandatory span attributes: service.name, deployment.environment, tenant, and error.type.

Create separate sampling profiles for critical APIs, background jobs, and noisy endpoints.

Load-test the telemetry collector and trace storage with synthetic and production-like traffic.

Embed trace links into runbooks and incident-response templates for on-call teams.
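The sampling-profile item in this checklist can be sketched as a small lookup plus a deterministic decision based on the trace id, so that every service seeing the same trace reaches the same answer. The route prefixes, rates, and function names below are hypothetical, and using the low 56 bits of the trace id as the bucket is one possible choice rather than a required scheme.

```python
# Hypothetical profiles: route prefix -> share of traces to keep.
SAMPLING_PROFILES = [
    ("/api/checkout", 1.0),  # critical API: keep everything
    ("/jobs/", 0.10),        # background jobs
    ("/healthz", 0.0),       # noisy endpoint: drop
]
DEFAULT_RATE = 0.05
BUCKETS = 1 << 56  # number of distinct values in the low 56 bits of a trace id


def rate_for(route):
    """Pick the sampling rate of the first matching profile."""
    for prefix, rate in SAMPLING_PROFILES:
        if route.startswith(prefix):
            return rate
    return DEFAULT_RATE


def should_sample(trace_id_hex, route):
    """Deterministic decision: the same trace id always gets the same answer."""
    bucket = int(trace_id_hex[-14:], 16)  # low 56 bits as an integer
    return bucket < int(rate_for(route) * BUCKETS)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(should_sample(trace_id, "/api/checkout"), should_sample(trace_id, "/healthz"))
```

Comparing integers instead of dividing into a float keeps the edge cases exact: a rate of 1.0 keeps every trace and a rate of 0.0 keeps none.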

Common anti-patterns

Enabling 100% sampling without estimating storage and query cost or defining profiles by traffic criticality — the storage bill grows faster than the value of the extra traces.

Capturing only edge spans and losing internal microservice hops.

Letting teams use inconsistent span and tag naming without a shared observability contract.

Keeping tracing as a just-in-case standalone tool instead of part of the incident-management workflow: it then gets used only occasionally and turns up empty at the moment it matters.
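For the first anti-pattern, a back-of-the-envelope estimate is usually enough to show why 100% sampling needs a budget decision. The sketch below multiplies request rate, spans per request, span size, and sampling rate into a daily volume. All input numbers are hypothetical, and the result ignores compression, replication, and index overhead.

```python
def daily_trace_storage_gb(requests_per_second, spans_per_request, bytes_per_span, sampling_rate):
    """Rough stored volume per day in GB, before compression and replication."""
    seconds_per_day = 86_400
    spans_per_day = requests_per_second * spans_per_request * seconds_per_day * sampling_rate
    return spans_per_day * bytes_per_span / 1e9


# Hypothetical workload: 2,000 req/s, 12 spans per request, about 500 bytes per span.
full = daily_trace_storage_gb(2_000, 12, 500, 1.0)
sampled = daily_trace_storage_gb(2_000, 12, 500, 0.05)
print(f"100% sampling: {full:.1f} GB/day, 5% sampling: {sampled:.1f} GB/day")
# prints: 100% sampling: 1036.8 GB/day, 5% sampling: 51.8 GB/day
```

Multiplying the daily figure by the retention period gives the steady-state footprint, which is the number to compare against the value of the extra traces.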

References

Related chapters

Observability & Monitoring Design - The observability baseline: metrics say what got worse, logs say what happened in one service, and tracing ties it into a request path for faster incident triage.
Inter-Service Communication Patterns - Covers HTTP, gRPC, and async boundaries — the very places where trace context is easiest to lose if you do not plan for it up front.
Service Mesh Architecture - Shows service-mesh telemetry and tracing controls at the data-plane level in microservice environments.
Prometheus Architecture - Connects tracing with metrics to correlate latency, spans, and regressions.
Performance Engineering - Adds latency-analysis and bottleneck-finding techniques where distributed tracing gives first-pass diagnostics.
Troubleshooting Interviews - Practical triage and root-cause workflow where trace data reduces investigation time.