Distributed tracing becomes mandatory once the critical request path can no longer be reconstructed from memory or scattered logs.
The chapter breaks down Jaeger, Tempo, write and read paths, and sampling, showing how microservice observability is constrained by the balance between data depth, latency, and storage cost.
In interviews, it helps you explain when traces are better than metrics or logs, how to find latency hotspots, and why full-fidelity tracing is not always worth the price.
Practical value of this chapter
Design in practice
Design the span path from service to storage and back to investigation: instrumentation, collector, sampling, storage, and lookup.
Decision quality
Evaluate the stack through trace lookup speed, storage cost, tag cardinality, and critical-path completeness.
Interview articulation
Show when tracing is more useful than metrics or logs, how it isolates latency, and why sampling must be intentional.
Trade-off framing
Make the cost of full-fidelity tracing, attribute depth, retention, and investigation UI speed explicit.
Context
Observability & Monitoring Design
This chapter dives deeper into distributed tracing inside the broader observability platform.
Distributed tracing in microservices shows the path of a request across services, queues, and dependencies, so teams can locate latency and errors from connected spans rather than guesswork. This chapter covers trace context, context propagation, the write path, the read path, sampling, tail sampling, and trace storage in Jaeger and Tempo. It extends Observability & Monitoring Design and focuses on the operational decisions that shorten incident investigations.
Tooling: Jaeger and Tempo
Jaeger
A classic open-source tracing backend with a familiar interface and fast onboarding for teams starting distributed tracing.
Strengths
- Fast time-to-value: waterfall view and critical path become useful during the first staged rollout.
- Mature integrations with OpenTelemetry Collector and the wider tracing ecosystem.
- Works well as an operational interface for on-call engineers and incident triage.
Trade-offs
- As trace volume grows, storage and indexing costs grow with it and have to be budgeted explicitly.
- Retention strategy and tag-cardinality control need to be explicit.
Tempo
Grafana Labs' tracing backend, optimized for low-cost object storage and high-volume span ingestion.
Strengths
- Cost-efficient at scale thanks to object storage and a block-based data format.
- Fits naturally with Grafana Explore and metrics in one diagnostic workflow.
- A strong fit for high-volume tracing platforms with longer retention tiers.
Trade-offs
- Troubleshooting quality depends on good span attributes and sampling policy.
- The read path must be designed intentionally or trace lookup becomes slow.
Distributed tracing diagram
Tracing topology in microservices
The same telemetry platform serves two different paths: write path (ingest) and read path (incident diagnostics).
Vertical ingest-path view
Stages are arranged top-to-bottom: source, processing, backend, and outcome.
Instrumented microservices
SDKs create spans and propagate trace context across HTTP/gRPC/Kafka hops.
OpenTelemetry Collector
Collector applies enrichment, filtering, sampling, and telemetry routing policies.
Ingest backends
The stream is exported into Jaeger Collector and/or Tempo Distributor.
Tracing storage
Jaeger writes indexes and spans to its storage backend; Tempo stores blocks in object storage.
Write path: how spans reach storage
- Service SDKs create spans and propagate traceparent across HTTP/gRPC/Kafka hops.
- OTel Collector enriches telemetry (service, env, tenant) and applies sampling policy.
- Collector exports the stream to Jaeger Collector and/or Tempo Distributor.
- Jaeger writes indexes and spans to its storage backend; Tempo writes blocks into object storage.
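The propagation step in the write path can be sketched with the W3C `traceparent` header, whose value has the form `version-traceid-parentid-flags` in lowercase hex. The sketch below is a minimal illustration using only the standard library, not how any particular SDK implements it; the function names `new_traceparent`, `parse_traceparent`, and `child_traceparent` are hypothetical.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def new_traceparent(sampled=True):
    """Start a new trace: fresh trace id and span id (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid in the W3C format.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming):
    """Header for an outgoing call: same trace id, new span id."""
    ctx = parse_traceparent(incoming) if incoming else None
    if ctx is None:
        # No usable context arrived, so this hop starts a new trace.
        return new_traceparent()
    span_id = secrets.token_hex(8)
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"
```

Each service would call something like `child_traceparent(request_headers.get("traceparent"))` before an outgoing HTTP, gRPC, or Kafka call. The trace id stays constant across hops while the span id changes, which is what lets the backend stitch spans into one trace.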
Operational focus
- Jaeger is convenient for fast onboarding and familiar troubleshooting UX.
- Tempo reduces cost by relying on object storage and avoiding heavy classic trace indexing.
- High-volume, high-cardinality workloads need tail sampling and retention tiers to stay affordable.
Decisions for the write and read paths
Write path
- Trace context must propagate across every boundary: HTTP, gRPC, message brokers, and background jobs.
- OpenTelemetry Collector acts as the policy point for enrichment, filtering, tail sampling, and telemetry routing.
- For high-load systems, set separate ingest quotas so tracing does not consume the logs and metrics budget.
- Storage is a product requirement: retention tiers, TTL, and target storage cost need to be designed up front.
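The tail-sampling decision mentioned above can be illustrated as plain logic: wait until a trace is complete, then keep it if it contains an error or is slow, and otherwise keep only a small deterministic baseline. This is a sketch of the idea, not the OpenTelemetry Collector's configuration format; the span dictionary shape, the `keep_trace` name, and the default thresholds are assumptions made for the example.

```python
import hashlib


def keep_trace(spans, latency_threshold_ms=500.0, baseline_rate=0.05):
    """Decide whether to keep a complete trace (hypothetical span shape).

    Each span is a dict with trace_id, status, start_ms, and duration_ms.
    """
    if not spans:
        return False

    # Policy 1: always keep traces that contain an error span.
    if any(span.get("status") == "ERROR" for span in spans):
        return True

    # Policy 2: keep traces whose end-to-end duration crosses the threshold.
    start = min(span["start_ms"] for span in spans)
    end = max(span["start_ms"] + span["duration_ms"] for span in spans)
    if end - start >= latency_threshold_ms:
        return True

    # Policy 3: deterministic baseline sample keyed on the trace id, so
    # every collector instance makes the same decision for the same trace.
    digest = hashlib.sha256(spans[0]["trace_id"].encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < baseline_rate * 2**64
```

Hashing the trace id rather than calling a random generator is a deliberate choice in this sketch: the outcome is reproducible, which makes sampling behavior easier to reason about during an investigation.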
Read path
- Start trace search from impact signals: service, status=ERROR, and p95/p99 high-latency windows.
- Jaeger UI and Grafana Explore should share naming conventions for services and span attributes.
- Track query latency as its own SLI: engineers need traces in seconds, not minutes.
- Trace correlation with logs and metrics is mandatory; without it, investigation becomes manual and slow.
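Starting a search from impact signals can be sketched as a filter over trace summaries: scope to one service, then keep traces that errored or that sit at or above a latency percentile. The in-memory list, the field names, and the functions `percentile` and `candidate_traces` are hypothetical; a real backend would run the equivalent query in Jaeger or Tempo.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def candidate_traces(traces, service, pct=99):
    """Trace ids for `service` that errored or reached the latency cutoff."""
    scoped = [t for t in traces if t["service"] == service]
    if not scoped:
        return []
    cutoff = percentile([t["duration_ms"] for t in scoped], pct)
    return [
        t["trace_id"]
        for t in scoped
        if t["status"] == "ERROR" or t["duration_ms"] >= cutoff
    ]
```

Narrowing by service and impact first keeps the result set small, so the engineer opens a handful of relevant traces instead of paging through normal traffic.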
Practical rollout checklist
- Define mandatory span attributes: service.name, deployment.environment, tenant, and error.type.
- Create separate sampling profiles for critical APIs, background jobs, and noisy endpoints.
- Load-test the telemetry collector and trace storage with synthetic and production-like traffic.
- Embed trace links into runbooks and incident-response templates for on-call teams.
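The mandatory-attribute item in the checklist above lends itself to an automated check, for example in a CI test or a collector-side validator. The sketch below assumes span attributes arrive as a plain dictionary and, as a further assumption not stated in the checklist, requires `error.type` only on spans that represent an error; `missing_attributes` is a hypothetical helper.

```python
# Attributes every span must carry, taken from the checklist.
ALWAYS_REQUIRED = ("service.name", "deployment.environment", "tenant")
# Assumption for this sketch: error.type is only enforced on error spans.
REQUIRED_ON_ERROR = ("error.type",)


def missing_attributes(attributes, is_error=False):
    """Return the required attribute keys that are absent or empty."""
    required = ALWAYS_REQUIRED + (REQUIRED_ON_ERROR if is_error else ())
    return [key for key in required if attributes.get(key) in (None, "")]
```

A span that passes returns an empty list; anything else can be reported per service to show which teams have drifted from the shared attribute contract.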
Common anti-patterns
- Enabling 100% sampling without estimating storage and query cost, and without defining sampling profiles by traffic criticality.
- Capturing only edge spans and losing internal microservice hops.
- Letting teams use inconsistent span and tag naming without a shared observability contract.
- Treating tracing as a standalone tool instead of part of the incident-management workflow.
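The first anti-pattern is easy to check with back-of-the-envelope arithmetic before enabling full sampling. The sketch below estimates raw stored volume only; it ignores compression, replication, and index overhead, and every input (span rate, average span size, retention) is a hypothetical figure to be replaced with measured values.

```python
def trace_storage_gb(spans_per_second, avg_span_bytes, retention_days, sample_rate=1.0):
    """Rough raw storage needed for the retention window, in decimal GB."""
    seconds = retention_days * 86_400
    total_bytes = spans_per_second * sample_rate * avg_span_bytes * seconds
    return total_bytes / 1e9


# Hypothetical workload: 20,000 spans/s, 500 bytes per span, 7-day retention.
full_fidelity = trace_storage_gb(20_000, 500, 7)
sampled_10_pct = trace_storage_gb(20_000, 500, 7, sample_rate=0.1)
```

With these illustrative inputs, 100% sampling comes to about 6 TB of raw spans per week and a 10% sample to about 0.6 TB, which is the kind of difference that should drive sampling profiles rather than a default setting.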
References
Related chapters
- Observability & Monitoring Design - Provides the observability baseline where tracing complements logs and metrics for faster incident triage.
- Inter-Service Communication Patterns - Covers HTTP, gRPC, and async boundaries where reliable trace-context propagation is essential.
- Service Mesh Architecture - Shows service-mesh telemetry and tracing controls at the data-plane level in microservice environments.
- Prometheus Architecture - Connects tracing with metrics to correlate latency, spans, and regressions.
- Performance Engineering - Adds latency-analysis and bottleneck-finding techniques where distributed tracing gives first-pass diagnostics.
- Troubleshooting Interviews - Practical triage and root-cause workflow where trace data reduces investigation time.
