Updated: May 16, 2026 at 11:00 AM

Prometheus: The Documentary

medium

The history of Prometheus: SoundCloud, the pull model, PromQL, Alertmanager, CNCF, and the path to a monitoring standard.

The Prometheus story matters not because it is nostalgic, but because a simple metric-collection model fit distributed platforms unusually well.

Its path from SoundCloud to a monitoring standard explains why the pull model, PromQL, and multidimensional time series proved practical for platform teams and SRE workflows.

For engineering discussions, the film gives context for why teams converge on tools, how standards emerge, and how an observability stack shapes the operating language of an organization.

Practical value of this chapter

Design in practice

Turn the Prometheus story, and the idea of metrics as an operating language, into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.

Decision quality

Evaluate architecture against SLOs, error budgets, MTTR, and critical-path resilience rather than feature completeness alone.

Interview articulation

Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.

Trade-off framing

State the trade-offs of a Prometheus-style metrics stack explicitly: release speed, automation level, observability cost, and operational complexity.

The story of Prometheus: from an internal SoundCloud tool to a monitoring standard

Year: 2022
Production: Honeypot

Source: Book cube (original post recommending the documentary)

What is the film about?

The documentary shows how Prometheus was born inside SoundCloud in 2012 and became the de facto monitoring standard for cloud-native systems. The story starts with reliability pain: the team already had its own workload orchestrator, but still lacked a practical way to see service health and explain degradation quickly.

This chapter reads Prometheus through its pull model, scrape targets, time-series database, PromQL, alerting rules, Alertmanager, metric cardinality, federation, and the role of metrics in SRE practice.

How the story developed

2012

SoundCloud and reliability pain

Matt T. Proud and Julius Volz, both former Google engineers, took on reliability at SoundCloud. The company already had its own workload orchestrator, but teams still lacked a clear view of service health.

2012

The limits of StatsD and Graphite

Monitoring the cluster with the existing StatsD and Graphite setup proved too hard, so the team started building a system inspired by Borgmon, Google's monitoring system for Borg.

2012-2013

Prometheus is born

The new approach combined a pull model, scrape targets, a time-series database, and PromQL for querying metrics.

2015

Open development and public announcement

The code is published on GitHub, then SoundCloud officially announces Prometheus. Early users outside SoundCloud help validate the model beyond one company.

2016

Joining CNCF

Prometheus joins CNCF as its second hosted project, after Kubernetes. This strengthens neutral project governance and accelerates ecosystem growth.

2018

CNCF graduation

Prometheus becomes the second CNCF graduated project after Kubernetes. For the market, this signals maturity: an active community, clear governance, and readiness for production use.

2022

Prometheus v2.40 and native histograms

Release 2.40 introduces experimental native histograms, an important step toward more accurate distribution metrics under high load.

2024

Prometheus 3.0

Prometheus ships its first major release in seven years. The project modernizes its technical foundation while keeping its role as a monitoring standard for cloud-native systems.

2025+

3.x stabilization

Development continues in the 3.x line; starting with v3.8, native histograms are marked stable, making them easier to adopt in operational use.

Key technical ideas

Pull model

Prometheus actively scrapes targets, giving operators stronger control over service discovery, collection frequency, and endpoint health.
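
The pull model can be illustrated with a minimal sketch. It assumes each target is a plain Python callable that returns metrics text, standing in for an HTTP GET against a metrics endpoint; the target names and the `scrape_once` and `parse_samples` helpers are hypothetical. The one detail taken from Prometheus itself is the synthetic `up` series, which records whether a scrape succeeded.

```python
from typing import Callable, Dict


def parse_samples(text: str) -> Dict[str, float]:
    """Parse simple 'name value' lines, skipping comments and blank lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


def scrape_once(targets: Dict[str, Callable[[], str]]) -> Dict[str, Dict[str, float]]:
    """Pull metrics from every target; a failed scrape yields only up=0."""
    results: Dict[str, Dict[str, float]] = {}
    for target, fetch in targets.items():
        try:
            samples = parse_samples(fetch())
            samples["up"] = 1.0
        except Exception:
            samples = {"up": 0.0}
        results[target] = samples
    return results


# Hypothetical targets: one healthy, one refusing connections.
def healthy() -> str:
    return "# TYPE http_requests_total counter\nhttp_requests_total 42\n"


def broken() -> str:
    raise ConnectionError("connection refused")


result = scrape_once({"api:8080": healthy, "db:9100": broken})
print(result)
```

Because the collector initiates every request, a target that stops answering shows up immediately as `up == 0`, without the target having to report its own failure.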

Time series

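The time-series database is built for timestamped numeric samples identified by a metric name and a set of labels, rather than for general-purpose analytics. Every distinct label combination is a separate series, so label cardinality directly drives storage and query cost.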
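
To make the cardinality point concrete, here is a small sketch of how series identity could be modeled: a metric name plus an unordered set of label pairs. The `series_key` and `count_series` helpers and the label names are illustrative assumptions, not Prometheus internals.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Tuple

SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """A series is identified by its metric name and its set of label pairs."""
    return (name, frozenset(labels.items()))


def count_series(name: str, label_values: Dict[str, List[str]]) -> int:
    """Count the distinct series produced by every combination of label values."""
    label_names = list(label_values)
    keys = set()
    for combo in product(*(label_values[n] for n in label_names)):
        keys.add(series_key(name, dict(zip(label_names, combo))))
    return len(keys)


base_labels = {"method": ["GET", "POST"], "status": ["200", "404", "500"]}
low = count_series("http_requests_total", base_labels)

# Adding one unbounded label (a hypothetical user_id) multiplies the series count.
with_user = dict(base_labels, user_id=[str(i) for i in range(1000)])
high = count_series("http_requests_total", with_user)

print(low, high)
```

The count grows multiplicatively with each label, which is why unbounded values such as user or request IDs are usually kept out of labels.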

PromQL

PromQL lets teams aggregate metrics, compute derived signals, and test hypotheses during incidents.
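
As an illustration of a derived signal, the sketch below approximates what a per-second rate over a counter computes. It is deliberately simplified: it handles counter resets but skips the window-boundary extrapolation that PromQL's `rate()` performs, so it should be read as an explanation of the idea rather than a reimplementation. The `simple_rate` name and the sample data are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase of a counter across the samples, tolerating resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the new value
        # itself is the increase since the reset.
        increase += cur if cur < prev else cur - prev
    return increase / elapsed


# The counter resets between t=20 and t=40 (160 -> 30).
samples = [(0, 100), (20, 160), (40, 30), (60, 60)]
per_second = simple_rate(samples)
print(per_second)
```

Working on increases rather than raw counter values is what lets the same query survive process restarts during an incident.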

Rules and alerts

Recording rules, alerting rules, and Alertmanager turn metrics into an operational signal that teams can route and act on.
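
The step from metric to operational signal can be sketched as a small state machine. The example assumes a threshold rule with a hold duration, similar in spirit to the `for` clause of an alerting rule: the condition must stay true for the whole duration before the alert fires. The `AlertRule` class, the threshold, and the evaluation times are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float
    for_seconds: float
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, now: float, value: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for this evaluation."""
        if value <= self.threshold:
            self.active_since = None  # condition cleared, start over
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# Error ratio sampled once a minute against a 5% threshold held for 2 minutes.
rule = AlertRule(threshold=0.05, for_seconds=120)
observations = [(0, 0.01), (60, 0.09), (120, 0.08), (180, 0.07), (240, 0.02)]
states = [rule.evaluate(t, v) for t, v in observations]
print(states)
```

The pending state filters out short spikes, so only sustained degradation is handed to routing and notification.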

Exporters

Prometheus exporters make databases, queues, nodes, and external systems observable without rewriting those systems.
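
An exporter's core job is translating another system's state into the text exposition format. The sketch below renders a gauge with `# HELP` and `# TYPE` lines followed by labeled samples. The `render_gauge` helper, the `queue_depth` metric, and its values are made up for illustration; a real exporter would normally use an official client library and serve this text over HTTP.

```python
from typing import Dict, List, Tuple


def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes, and newlines inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str,
                 samples: List[Tuple[Dict[str, str], float]]) -> str:
    """Render one gauge metric family as exposition-format text."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical queue depths read from some external system.
output = render_gauge(
    "queue_depth",
    "Messages waiting in the queue.",
    [({"queue": "orders"}, 12.0), ({"queue": "emails"}, 3.0)],
)
print(output)
```

Keeping the translation outside the monitored system is what lets databases and queues be observed without changes to their own code.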

Ecosystem

Grafana, Kubernetes integrations, Prometheus federation, and remote storage help scale monitoring beyond one server.

References

Related chapters

  • Site Reliability Engineering - Connects Prometheus metrics with SLOs, SRE practice, and incident response.
  • Kubernetes: The Documentary - Shows the Kubernetes ecosystem that made Prometheus a natural monitoring layer.
  • Cloud Native - Provides the platform context where observability and metrics become part of daily operations.
  • Kubernetes Patterns - Adds Kubernetes operational patterns around health checks, resources, operators, and the metrics loop.
  • Building Microservices - Covers metrics and observability practices in microservices, where Prometheus is often the baseline choice.