Prometheus: The Documentary

The Prometheus story matters not because it is nostalgic, but because a simple metric-collection model fit distributed platforms unusually well.

Its path from SoundCloud to a monitoring standard explains why the pull model, PromQL, and multidimensional time series proved practical for platform teams and SRE workflows.

For engineering discussions, the film gives context for why teams converge on tools, how standards emerge, and how an observability stack shapes the operating language of an organization.

Practical value of this chapter

Design in practice

Turn the Prometheus story, and the idea of metrics as an operating language, into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.

Decision quality

Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.

Interview articulation

Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.

Trade-off framing

Make the trade-offs behind a metrics-driven operating model explicit: release speed, automation level, observability cost, and operational complexity.

Watch on YouTube

The story of Prometheus: from an internal SoundCloud tool to a monitoring standard

Year: 2022

Production: Honeypot

Source

Book cube

Original post recommending the documentary

Go to the site

What is the film about?

The documentary shows how Prometheus was born inside SoundCloud in 2012 and became the de facto monitoring standard for cloud-native systems. It starts with a concrete pain: the team already had its own workload orchestrator, but had nothing to quickly see service health and explain a degradation during an incident. Once services come and go on their own, the usual tools stop giving a clear picture.

This chapter reads Prometheus through its pull model, scrape targets, time-series database, PromQL, alerting rules, Alertmanager, metric cardinality, federation, and the role of metrics in SRE practice.

How the story developed

2012

SoundCloud and reliability pain

Matt T. Proud and Julius Volz, both former Google engineers, joined SoundCloud and ran into its reliability problems. The company already had its own workload orchestrator, but teams still lacked a clear view of service health.

2012

The limits of statsd and Graphite

Existing tools could not keep up with a dynamic cluster: metrics did not map to specific services, and debugging turned into guesswork. The team started building a system modeled on Borgmon, the monitoring system Google used for Borg.

2012-2013

Prometheus is born

The new approach combined a pull model, scrape targets, a time-series database, and PromQL for querying metrics.

2015

Open development and public announcement

The code is published on GitHub, then SoundCloud officially announces Prometheus. Early users outside SoundCloud help validate the model beyond one company.

2016

Joining CNCF

Prometheus is accepted by CNCF as its second hosted project after Kubernetes. This strengthens neutral project governance and accelerates ecosystem growth.

2018

CNCF graduation

Prometheus becomes the second CNCF graduated project after Kubernetes. For the market, this signals maturity: an active community, clear governance, and readiness for production use.

2022

Prometheus v2.40 and native histograms

Release 2.40 introduces experimental native histograms: an attempt to measure latency distributions more accurately under high load without inflating the number of metrics and label cardinality.

2024

Prometheus 3.0

Prometheus ships its first major release in seven years. The project modernizes its technical foundation while keeping its role as a monitoring standard for cloud-native systems.

2025+

3.x stabilization

Development continues in the 3.x line; starting with v3.8, native histograms are marked stable. In practice this means you can enable them in production without betting on breaking changes.

Key technical ideas

Pull model

Prometheus actively scrapes targets, giving operators stronger control over service discovery, collection frequency, and endpoint health.
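To make the pull direction concrete, here is a minimal sketch in Python using only the standard library. It is not how Prometheus is implemented: `MetricsHandler`, `scrape`, `demo`, and the sample payload are hypothetical names invented for illustration, and the parser ignores timestamps and other details of the real text format. The point it shows is that the collector initiates the request, so a failed scrape is itself a health signal (Prometheus records this as the `up` series).

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical payload in the Prometheus text exposition format.
METRICS_TEXT = (
    "# TYPE http_requests_total counter\n"
    'http_requests_total{method="get"} 42\n'
)


class MetricsHandler(BaseHTTPRequestHandler):
    """Stand-in for an instrumented service that exposes /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS_TEXT.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def scrape(url, timeout=2.0):
    """Pull one target and return (up, samples)."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            text = resp.read().decode("utf-8")
    except OSError:
        return 0, {}  # unreachable target: the collector notices by itself
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            series, value = line.rsplit(" ", 1)
            samples[series] = float(value)
    return 1, samples


def demo():
    """Scrape a live target, stop it, then scrape the same address again."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0: OS picks one
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{port}/metrics"
    healthy = scrape(url)
    server.shutdown()
    server.server_close()
    dead = scrape(url)
    return healthy, dead


if __name__ == "__main__":
    print(demo())
```

In a push model the second case would look like silence, which is ambiguous. With pull, the failed request turns into an explicit "target down" observation.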

Time series

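The time-series database is built for metrics and timestamps, not general-purpose analytics. The cost of a mistake is label cardinality: every unique combination of label values is a separate time series, so one high-cardinality label (a user ID, for example) multiplies the series count and quickly bloats memory and storage.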
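The arithmetic behind cardinality is simple enough to sketch. The snippet below is an illustration, not a Prometheus API: `series_count` and the label sets are made up, and the product is an upper bound, since not every combination of values necessarily occurs in practice.

```python
from math import prod


def series_count(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical label sets for a single request counter.
base = {
    "method": ["GET", "POST", "PUT"],
    "status": ["200", "404", "500"],
    "instance": [f"node-{i}" for i in range(20)],
}
# The same metric with one extra, unbounded label.
with_user = dict(base, user_id=[str(i) for i in range(10_000)])

print(series_count(base))       # 180
print(series_count(with_user))  # 1800000
```

Three bounded labels give 180 series at most. Adding a single label with 10,000 values raises the bound to 1.8 million for the same metric.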

PromQL

PromQL lets teams aggregate metrics, compute derived signals, and test hypotheses during incidents.
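A typical derived signal is a per-second rate computed from a counter, which in PromQL looks like `rate(http_requests_total[1m])`. The Python sketch below approximates the idea only. `simple_rate` is a hypothetical helper, and it deliberately skips the extrapolation to window boundaries that the real `rate()` performs. It does keep the counter-reset handling, which is the part that most often surprises people.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a list of (timestamp, value) samples.

    Simplified on purpose: handles counter resets, but does not extrapolate
    to the window boundaries the way PromQL's rate() does.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter restarted from zero (e.g. a process restart),
        # so the whole new value counts as increase.
        increase += value if value < prev else value - prev
        prev = value
    return increase / elapsed


# Hypothetical scrapes every 15 seconds; the counter resets before t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 30), (60, 60)]
print(simple_rate(samples))  # 2.0
```

Subtracting the last value from the first would give a negative number here. Treating the drop as a reset recovers the steady 2 requests per second.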

Rules and alerts

Recording rules, alerting rules, and Alertmanager turn metrics into an operational signal that teams can route and act on.
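One detail that makes alerting rules usable in practice is the `for` duration: a condition has to hold continuously before the alert fires, which filters out short spikes. The sketch below models that state machine in Python under simplifying assumptions. `alert_states`, the threshold, and the sample data are hypothetical, and routing, grouping, and silencing in Alertmanager are out of scope.

```python
def alert_states(evaluations, threshold, for_seconds):
    """Return 'inactive', 'pending', or 'firing' for each (timestamp, value) evaluation."""
    states = []
    active_since = None  # when the condition first became true in the current run
    for ts, value in evaluations:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the clock
            states.append("inactive")
    return states


# Hypothetical error ratio, evaluated once a minute.
evaluations = [
    (0, 0.01), (60, 0.06), (120, 0.07), (180, 0.02),
    (240, 0.08), (300, 0.09), (360, 0.10),
]
print(alert_states(evaluations, threshold=0.05, for_seconds=120))
# ['inactive', 'pending', 'pending', 'inactive', 'pending', 'pending', 'firing']
```

The first excursion above the threshold never fires because it clears before two minutes pass. Only the second, sustained one reaches the firing state and would be handed to routing.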

Exporters

No need to embed a client in every system: exporters read metrics from databases, queues, nodes, and external services and expose them in the Prometheus format, so Prometheus can scrape those systems without anyone rewriting them.
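At its core an exporter is a translation step: take whatever statistics the third-party system offers and render them in the Prometheus text exposition format. The following Python sketch shows just that rendering step. `render_metric`, `escape_label_value`, and the queue statistics are hypothetical, and a real exporter would normally use an official client library and serve the result over HTTP.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family; samples is a list of (labels_dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical stats read from a message broker's own admin interface.
queue_stats = {"orders": 12, "emails": 3}

samples = [({"queue": queue}, depth) for queue, depth in sorted(queue_stats.items())]
output = render_metric(
    "queue_depth", "gauge", "Messages waiting in the queue.", samples
)
print(output, end="")
```

The broker itself stays untouched. The exporter process owns the mapping from the broker's native stats to metric names and labels.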

Ecosystem

Grafana, Kubernetes integrations, Prometheus federation, and remote storage help scale monitoring beyond one server.

References

Prometheus: The Documentary
Prometheus on GitHub
Prometheus announcement from SoundCloud
CNCF Prometheus project
Prometheus graduation announcement
Prometheus 3.0 release
Prometheus native histograms specification

Related chapters

Site Reliability Engineering - Connects Prometheus metrics with SLOs, SRE practice, and incident response.
Kubernetes: The Documentary - The story of the Kubernetes platform that made Prometheus the default monitoring layer: once workloads became dynamic, scraping them with static configs stopped working.
Cloud Native - Provides the platform context where observability and metrics become part of daily operations.
Kubernetes Patterns - Kubernetes operational patterns that shape what Prometheus actually scrapes: health checks, resource limits, operators, and the metrics loop.
Building Microservices - Why metrics and observability matter in microservices: without them a single dependency failure is hard to tell apart from a system-wide degradation. Prometheus is usually the default choice here.