Distributed tracing becomes mandatory once the critical request path can no longer be reconstructed from memory or scattered logs.
The chapter breaks down the tracing pipeline (Jaeger, Tempo, the write and read paths, and sampling) and shows how microservice observability is constrained by the balance between data depth, latency, and storage cost.
In interviews, it helps you explain when traces beat metrics or logs, how to find latency hotspots, and why full-fidelity tracing is not always worth the price.
Practical value of this chapter
Design in practice
Turn guidance on distributed tracing in microservices and latency-path diagnostics into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.
Decision quality
Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.
Interview articulation
Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.
Trade-off framing
Make trade-offs explicit for distributed tracing in microservices and latency-path diagnostics: release speed, automation level, observability cost, and operational complexity.
Context
Observability & Monitoring Design
This chapter is a distributed tracing deep dive within the broader observability architecture.
Distributed tracing in microservices provides precise latency and error diagnostics on end-to-end request flows. This chapter extends Observability & Monitoring Design and focuses on architecture and operational decisions for Jaeger and Tempo.
Tooling: Jaeger and Tempo
Jaeger
A classic open-source tracing backend with a familiar UI and fast onboarding for teams starting distributed tracing.
Strengths
- Fast time-to-value: waterfall and critical path become visible in the first rollout.
- Mature ecosystem and straightforward integration with OpenTelemetry Collector.
- Useful as an operational interface for on-call and incident triage.
Trade-offs
- As trace volume grows, index and storage cost can become significant.
- Retention strategy and tag cardinality control are mandatory.
Tempo
Grafana Labs' tracing backend, focused on low-cost retention in object storage and large-scale span ingestion.
Strengths
- Cost-efficient at scale due to object-storage-first design and block format.
- Fits naturally into Grafana Explore with shared observability workflows.
- Strong option for high-throughput tracing platforms with longer retention.
Trade-offs
- Troubleshooting quality depends on strong attributes and good sampling policies.
- Read-path design must be intentional to keep trace lookup fast.
Tracing system visualization
Tracing topology in microservices
The same telemetry platform serves two different paths: write path (ingest) and read path (incident diagnostics).
Vertical ingest-path view
Stages are arranged top-to-bottom: source, processing, backend, and outcome.
Instrumented microservices
SDKs create spans and propagate trace context across HTTP/gRPC/Kafka hops.
OTel Collector
Collector applies enrichment, filtering, sampling, and telemetry routing policies.
Ingest backends
The stream is exported into Jaeger Collector and/or Tempo Distributor.
Tracing storage
Jaeger writes indexes and spans to its storage backend; Tempo stores blocks in object storage.
Write path: how spans reach storage
- Service SDKs create spans and propagate traceparent across HTTP/gRPC/Kafka hops.
- OTel Collector enriches telemetry (service, env, tenant) and applies sampling policy.
- Collector exports the stream to Jaeger Collector and/or Tempo Distributor.
- Jaeger writes indexes and spans to its storage backend; Tempo writes blocks into object storage.
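The propagation step above hinges on the W3C traceparent header surviving every hop. As a minimal sketch (stdlib only, not the OpenTelemetry SDK), this is roughly what creating and continuing trace context looks like; the function names are illustrative:

```python
import re
import secrets

# W3C trace context: version-traceid-spanid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def make_traceparent(trace_id=None, sampled=True):
    """Build a traceparent header, minting a new trace id if none is given."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)                # 8 bytes  -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def continue_trace(incoming_header):
    """Reuse the incoming trace id for the next hop with a fresh span id.
    A malformed header means broken propagation: start a new trace."""
    m = TRACEPARENT_RE.match(incoming_header)
    if not m:
        return make_traceparent()
    trace_id, _parent_span_id, flags = m.groups()
    return make_traceparent(trace_id=trace_id, sampled=flags == "01")
```

In a real deployment the SDK does this transparently for HTTP/gRPC, but broker and background-job hops often need the header carried explicitly in message metadata.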
Operational focus
- Jaeger is convenient for fast onboarding and familiar troubleshooting UX.
- Tempo reduces cost by relying on object storage and avoiding heavy classic trace indexing.
- High-cardinality workloads depend on tail sampling and retention tiers.
Design decisions for write/read path
Write path (ingest)
- Context propagation must be guaranteed across every boundary: HTTP, gRPC, message broker, and background jobs.
- OTel Collector acts as a policy point: enrichment, filtering, tail sampling, and backend routing.
- For high-load systems, set ingest quotas so tracing does not consume logs/metrics budget.
- Storage and retention are product-level requirements: cold/warm tiers, TTL, and cost targets.
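The tail-sampling policy mentioned above decides per completed trace, not per span. A minimal sketch of the decision logic (the span-dict shape with `status` and `duration_ms` keys is a hypothetical simplification, not an OTel Collector API):

```python
import random

def tail_sample(trace, latency_threshold_ms=500, base_rate=0.05):
    """Tail-based decision made after the whole trace has arrived.
    `trace` is a list of span dicts; the first span is assumed to be the root."""
    if any(span["status"] == "ERROR" for span in trace):
        return True                        # always keep failed requests
    if trace[0]["duration_ms"] >= latency_threshold_ms:
        return True                        # always keep latency outliers
    return random.random() < base_rate     # keep a small baseline of the rest
```

The same three-rule shape (errors, latency outliers, probabilistic baseline) is what the OTel Collector's tail-sampling processor expresses in configuration; the thresholds here are illustrative.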
Read path (query)
- Start trace search from impact signals: service, status=ERROR, and p95/p99 latency windows.
- Jaeger UI and Grafana Explore should follow unified naming conventions for services and span attributes.
- Track query latency as a dedicated SLI: engineers need traces in seconds, not minutes.
- Trace + logs + metrics correlation is required; without it, investigations become manual and slow.
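The "start from impact signals" guidance above can be sketched as a query: scope to a service, surface errors first, then latency outliers above a percentile cut. The trace-dict shape is hypothetical; real backends express this as Jaeger/TraceQL queries:

```python
def find_suspect_traces(traces, service, percentile=0.99):
    """Return error traces plus latency outliers for one service.
    `traces`: list of dicts with 'service', 'status', 'duration_ms' keys."""
    scoped = [t for t in traces if t["service"] == service]
    errors = [t for t in scoped if t["status"] == "ERROR"]
    durations = sorted(t["duration_ms"] for t in scoped)
    if not durations:
        return errors
    # pXX threshold over the scoped population
    cut = durations[min(int(len(durations) * percentile), len(durations) - 1)]
    slow = [t for t in scoped
            if t["status"] != "ERROR" and t["duration_ms"] >= cut]
    return errors + slow
```

This ordering matters operationally: an on-call engineer should land on a failed or slow exemplar trace in one query, not page through healthy traffic.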
Practical rollout checklist
Define mandatory span attributes: service.name, deployment.environment, tenant, and error.type.
Create separate sampling profiles for critical APIs, background jobs, and noisy endpoints.
Load-test collectors and storage on synthetic and production-like traffic.
Embed trace links into runbooks and incident response templates for on-call teams.
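The mandatory-attribute item in the checklist is easiest to enforce as an automated contract check in CI or in a collector-side validator. A minimal sketch, assuming spans are plain dicts and that `error.type` is required only on failed spans (an interpretation, not a rule from the text):

```python
REQUIRED_ATTRIBUTES = {"service.name", "deployment.environment", "tenant"}

def validate_span(span):
    """Check a span dict against the team attribute contract.
    Returns a sorted list of missing keys; an empty list means it passes."""
    attrs = span.get("attributes", {})
    missing = sorted(REQUIRED_ATTRIBUTES - attrs.keys())
    if span.get("status") == "ERROR" and "error.type" not in attrs:
        missing.append("error.type")
    return missing
```

Running a check like this against a sample of production spans is a cheap way to detect teams drifting from the observability contract before an incident exposes the gap.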
Common anti-patterns
Enabling 100% sampling without estimating storage/query cost and without traffic criticality profiles.
Capturing only edge spans and missing internal microservice hops.
Allowing inconsistent span/tag naming across teams without an observability contract.
Treating tracing as a separate tool instead of part of the incident-management workflow.
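The first anti-pattern is avoidable with a one-line back-of-envelope estimate before enabling full sampling. A sketch; all input numbers below are illustrative assumptions, not benchmarks:

```python
def daily_trace_storage_gb(requests_per_sec, spans_per_trace,
                           bytes_per_span, sample_rate=1.0):
    """Back-of-envelope daily span storage volume in GB."""
    spans_per_day = requests_per_sec * 86_400 * spans_per_trace * sample_rate
    return spans_per_day * bytes_per_span / 1e9

# e.g. 2,000 rps at 25 spans/trace and ~500 bytes/span:
# 100% sampling is ~2.16 TB/day; a 5% tail-kept rate is ~108 GB/day.
```

Even rough numbers like these make the sampling conversation concrete: they turn "100% vs sampled" into a storage and retention budget the team can actually defend.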
References
Related chapters
- Observability & Monitoring Design - Provides the observability baseline where tracing complements logs and metrics for faster incident triage.
- Inter-Service Communication Patterns - Covers HTTP/gRPC/async boundaries where reliable trace-context propagation is essential.
- Service Mesh Architecture - Shows mesh-level telemetry and data-plane tracing controls in microservice environments.
- Prometheus Architecture - Connects tracing with metrics to correlate latency spans and investigate production regressions.
- Performance Engineering - Adds latency-analysis and bottleneck-profiling techniques where tracing gives first-pass diagnostics.
- Troubleshooting Interview - Practical triage and root-cause workflow where trace data reduces investigation time.
