Context
Observability & Monitoring Design
This chapter is a distributed tracing deep dive that extends Observability & Monitoring Design. Distributed tracing in microservices provides precise latency and error diagnostics across end-to-end request flows; the focus here is on architectural and operational decisions for Jaeger and Tempo.
Tooling: Jaeger and Tempo
Jaeger
A classic open-source tracing backend with a familiar UI and fast onboarding for teams starting distributed tracing.
Strengths
- Fast time-to-value: waterfall and critical path become visible in the first rollout.
- Mature ecosystem and straightforward integration with OpenTelemetry Collector.
- Useful as an operational interface for on-call and incident triage.
Trade-offs
- As trace volume grows, index and storage cost can become significant.
- Retention strategy and tag cardinality control are mandatory.
Tempo
Grafana Labs' tracing backend, focused on low-cost trace storage in object stores and large-scale span ingestion.
Strengths
- Cost-efficient at scale due to object-storage-first design and block format.
- Fits naturally into Grafana Explore with shared observability workflows.
- Strong option for high-throughput tracing platforms with longer retention.
Trade-offs
- Troubleshooting quality depends on strong attributes and good sampling policies.
- Read-path design must be intentional to keep trace lookup fast.
Tracing system visualization
Tracing topology in microservices: the same telemetry platform serves two different paths, a write path (ingest) and a read path (incident diagnostics).
[Diagram: vertical ingest-path view, stages top to bottom: instrumented microservices (SDKs create spans and propagate trace context across HTTP/gRPC/Kafka hops) → OTel Collector (enrichment, filtering, sampling, and telemetry routing policies) → ingest backends (Jaeger Collector and/or Tempo Distributor) → tracing storage (Jaeger writes indexes/spans to its backend; Tempo stores blocks in object storage).]
Write path: how spans reach storage
- Service SDKs create spans and propagate traceparent across HTTP/gRPC/Kafka hops.
- OTel Collector enriches telemetry (service, env, tenant) and applies sampling policy.
- Collector exports the stream to Jaeger Collector and/or Tempo Distributor.
- Jaeger writes indexes and spans to its storage backend; Tempo writes blocks into object storage.
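The write path above can be sketched as an OTel Collector pipeline. This is a minimal illustration, not a production config: endpoint hostnames and the attribute value are assumptions, and both backends accept OTLP natively.

```yaml
# Sketch: Collector as the policy point on the write path.
# Hostnames and attribute values are illustrative.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  resource:
    attributes:
      - key: deployment.environment
        value: production        # enrichment applied to every span
        action: upsert
  batch: {}

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317   # Jaeger ingests OTLP directly
    tls:
      insecure: true
  otlp/tempo:
    endpoint: tempo-distributor:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp/jaeger, otlp/tempo]
```

Fan-out to both exporters lets a team run Jaeger for onboarding and Tempo for long retention from the same pipeline.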
Operational focus
- Jaeger is convenient for fast onboarding and familiar troubleshooting UX.
- Tempo reduces cost by keeping blocks in object storage and avoiding heavy classic trace indexing.
- High-cardinality workloads require tail sampling and tiered retention to keep storage and query costs bounded.
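The tail-sampling point above can be made concrete with the Collector's tail_sampling processor. A hedged sketch: the thresholds and percentages below are illustrative starting points, not recommendations.

```yaml
# Sketch: keep every error trace and every slow trace, plus a small
# probabilistic baseline of everything else. Numbers are illustrative.
processors:
  tail_sampling:
    decision_wait: 10s          # how long to buffer spans before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Because tail sampling decides per whole trace, all spans of a trace must reach the same Collector instance (load-balance by trace ID in front of the sampling tier).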
Design decisions for write/read path
Write path (ingest)
- Context propagation must be guaranteed across every boundary: HTTP, gRPC, message broker, and background jobs.
- OTel Collector acts as a policy point: enrichment, filtering, tail sampling, and backend routing.
- For high-load systems, set ingest quotas so tracing does not consume the logs/metrics budget.
- Storage and retention are product-level requirements: cold/warm tiers, TTL, and cost targets.
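Context propagation across boundaries rests on the W3C `traceparent` header. Real services should use the OTel SDK propagators; the stdlib-only sketch below just illustrates the header format and what "keep the trace ID, mint a new span ID" means at each hop.

```python
import re
import secrets

# W3C traceparent: version "00" - 32-hex trace-id - 16-hex span-id - 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent() -> str:
    """Start a new trace: random trace-id and span-id, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"

def child_traceparent(header: str) -> str:
    """Propagate across a hop: keep the trace-id, mint a fresh span-id."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, _parent_id, flags = m.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

A broken hop (a proxy or job queue that drops this header) is exactly what splits one request into several disconnected traces.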
Read path (query)
- Start trace search from impact signals: service, status=ERROR, and p95/p99 latency windows.
- Jaeger UI and Grafana Explore should follow unified naming conventions for services and span attributes.
- Track query latency as a dedicated SLI: engineers need traces in seconds, not minutes.
- Trace + logs + metrics correlation is required; without it, investigations become manual and slow.
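Starting the search from p95/p99 windows implies someone computes those thresholds from span durations. A minimal stdlib sketch, with illustrative data (the service name and durations are assumptions):

```python
import statistics

def latency_thresholds(durations_ms: list[float]) -> tuple[float, float]:
    """Return (p95, p99) cut points for a set of span durations."""
    cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
    return cuts[94], cuts[98]

# Illustrative span durations (ms) for one service over a window.
durations = [12, 15, 14, 18, 22, 30, 45, 60, 120, 480]
p95, p99 = latency_thresholds(durations)
# A trace query would then target the slow tail, e.g.:
# service=checkout AND status=ERROR AND duration > p99
```

In practice the backend computes these (e.g. from span metrics in Grafana); the point is that trace search should start from the tail, not from random sampling of traces.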
Practical rollout checklist
- Define mandatory span attributes: service.name, deployment.environment, tenant, and error.type.
- Create separate sampling profiles for critical APIs, background jobs, and noisy endpoints.
- Load-test collectors and storage on synthetic and production-like traffic.
- Embed trace links into runbooks and incident response templates for on-call teams.
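The mandatory-attribute contract from the checklist is easiest to keep when it is enforced mechanically, e.g. as a CI check over emitted spans. A hedged sketch: service.name and deployment.environment follow OTel semantic conventions, while tenant is this chapter's custom attribute (error.type is omitted here since it applies only to error spans).

```python
# Sketch: validate a span's attributes against the observability contract.
MANDATORY_ATTRS = {"service.name", "deployment.environment", "tenant"}

def missing_attributes(span_attrs: dict) -> set:
    """Return the mandatory attributes a span is missing."""
    return MANDATORY_ATTRS - span_attrs.keys()

span = {"service.name": "checkout", "deployment.environment": "prod"}
# missing_attributes(span) -> {"tenant"}
```

The same check can run Collector-side (dropping or flagging non-conforming spans) so the contract holds even for teams that skip the CI gate.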
Common anti-patterns
- Enabling 100% sampling without estimating storage/query cost and without traffic criticality profiles.
- Capturing only edge spans and missing internal microservice hops.
- Allowing inconsistent span/tag naming across teams without an observability contract.
- Treating tracing as a separate tool instead of part of the incident-management workflow.
