Updated: March 24, 2026 at 3:23 PM

Prometheus: The Documentary

medium

History of Prometheus: SoundCloud, PromQL and the path to a standard for cloud-native monitoring.

The Prometheus story is interesting not because it is nostalgic, but because one monitoring model ended up matching cloud-native systems unusually well.

Its path from SoundCloud to an industry standard explains why pull-based scraping, PromQL, and multidimensional time series proved practical enough for platform teams and SRE workflows.

For engineering discussions, the film is useful as context for questions about tooling adoption, standardization pressure, and how the chosen observability stack shapes the operating language of an entire organization.

Practical value of this chapter

Design in practice

Turn the history of Prometheus and the role of metrics in cloud-native reliability into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.

Decision quality

Evaluate an architecture by its SLOs, error budget, MTTR, and critical-path resilience rather than by feature completeness alone.
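To make the error-budget part of this concrete, here is a minimal arithmetic sketch. The function names, the 30-day window, and the request counts are illustrative assumptions, not values taken from the film or from any specific SLO policy.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability an availability SLO allows per window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, good_events, total_events):
    """Fraction of the error budget still unspent for a request-based SLO.

    Returns 1.0 when nothing is spent, 0.0 when the budget is exactly used up,
    and a negative number when the SLO is already violated.
    """
    allowed_bad = (1.0 - slo) * total_events
    if allowed_bad == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Hypothetical numbers: a 99.9% SLO over 30 days, and a service that has
# served 1,000,000 requests of which 500 failed.
minutes = error_budget_minutes(0.999)
remaining = budget_remaining(0.999, good_events=999_500, total_events=1_000_000)
```

With these assumed inputs the window allows roughly 43 minutes of downtime, and about half of the request budget is still available. In practice the good and total counts would come from metrics collected by the monitoring system.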

Interview articulation

Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.

Trade-off framing

Make trade-offs explicit when discussing Prometheus and the role of metrics in cloud-native reliability: release speed, automation level, observability cost, and operational complexity.

Prometheus: The Documentary

The history of the monitoring system that became a standard for the cloud-native ecosystem

Year: 2022
Production: not specified

Source

Book cube

Original post recommending the documentary


What is the film about?

The documentary shows how Prometheus was born inside SoundCloud in 2012 and became the de facto standard for monitoring cloud-native applications. The story begins with reliability problems and with how hard it was to observe workloads running on SoundCloud's in-house orchestrator.

How the story developed

2012

SoundCloud and SRE pain

Ex-Googlers at SoundCloud (Matt T. Proud and Julius Volz, later joined by Björn Rabenstein) were responsible for the reliability of the platform. SoundCloud already ran its own workload orchestrator, before the advent of Kubernetes.

2012

Failed attempts with statsd and graphite

Monitoring the cluster with these tools turned out to be too difficult, so the engineers began building a system modeled on Borgmon, the monitoring system used with Borg at Google.

2012-2013

Birth of Prometheus

A new approach takes shape: a pull-based collection model, a purpose-built time-series database, and PromQL as the query language.

2015

Open source and announcement

The code had been public on GitHub from the start; in 2015 SoundCloud officially announces the system, and another company picks it up as an early adopter.

2016

Joining the CNCF

Prometheus is accepted into CNCF as the second hosted/incubating project after Kubernetes. This reinforces the neutral governance model and accelerates ecosystem growth.

2018

Graduated status in CNCF

Prometheus becomes the second CNCF graduated project after Kubernetes. For the market, this is a signal of maturity: stable governance, an active community and a production-ready profile.

2022

Prometheus v2.40 and native histograms (experimental)

In release 2.40, experimental support for native histograms appears. This is an important step towards more accurate distribution metrics under high load.

2024

Prometheus 3.0

Version 3.0 ships, the first major release in 7 years: the project updates its technical foundation and keeps evolving without losing its role as the cloud-native monitoring standard.

2025+

Stabilization of 3.x

Development continues in the 3.x branch; native histograms are declared stable (starting with v3.8), which makes them easier to adopt in production.
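The native histograms mentioned in the timeline use exponentially spaced buckets instead of hand-picked boundaries. The sketch below illustrates how a positive observation could be mapped to such a bucket, assuming the commonly documented scheme in which a `schema` value sets the growth factor to 2^(2^-schema). The function names are illustrative, this is not the Prometheus implementation, and floating-point rounding exactly at bucket boundaries is ignored.

```python
import math


def bucket_index(value, schema):
    """Index of the exponential bucket that holds a positive observation.

    Assumes base = 2 ** (2 ** -schema) and that bucket i covers the
    half-open interval (base ** (i - 1), base ** i].
    """
    if value <= 0:
        raise ValueError("this sketch only handles positive observations")
    return math.ceil(math.log2(value) * 2 ** schema)


def bucket_bounds(index, schema):
    """Lower (exclusive) and upper (inclusive) bound of bucket `index`."""
    base = 2 ** (2 ** -schema)
    return base ** (index - 1), base ** index


# A higher schema means finer resolution: the same observation lands in a
# narrower bucket.
coarse = bucket_bounds(bucket_index(3.0, 0), 0)  # schema 0: boundaries double
fine = bucket_bounds(bucket_index(3.0, 3), 3)    # schema 3: 8 buckets per doubling
```

The design point this illustrates is that resolution becomes a single parameter rather than a list of boundaries that has to be chosen in advance for every metric.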

Key technical ideas

Pull model

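Prometheus itself polls (scrapes) its targets over HTTP on a schedule. Targets only need to expose their current metrics, which keeps client code simple and lets the server control collection frequency and load.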
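As a rough illustration of the pull model, the sketch below uses only the Python standard library: a target exposes its current counters at `/metrics` in the Prometheus text exposition format, and a separate `scrape` function plays the role of the monitoring server that initiates the request. The metric name `demo_requests_total`, the hard-coded counter values, and the helper names are assumptions made for the example; a real service would normally use an official client library.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters standing in for real instrumentation.
REQUESTS_TOTAL = {"GET": 3, "POST": 1}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = ["# TYPE demo_requests_total counter"]
    for method, value in sorted(counters.items()):
        lines.append(f'demo_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """The target side: it only answers when somebody asks."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def scrape(url):
    """The monitoring side: one scrape is one HTTP GET it initiates itself."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return resp.read().decode("utf-8")


def demo():
    """Start a target on a free local port, scrape it once, shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns the same text that `render_metrics(REQUESTS_TOTAL)` produces. The target holds no knowledge of where the monitoring system lives; adding or removing a scraper requires no change on the client side.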

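Time-series database

Storage optimized for metrics as multidimensional time series, where each series is identified by a metric name and a set of labels. The number of label combinations (cardinality) still has to be kept under control.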
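A small sketch can make the multidimensional model and its cardinality cost more tangible. It assumes the usual description of the data model, in which a series is identified by the metric name plus the full label set; the metric and label names and the helper functions are made up for the example.

```python
from math import prod


def series_key(name, labels):
    """Identity of a series: metric name plus the full, order-independent label set."""
    return (name, tuple(sorted(labels.items())))


def worst_case_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Three samples, but only two distinct series: label order does not matter,
# while a different label value creates a new series.
samples = [
    ("http_requests_total", {"method": "GET", "status": "200"}),
    ("http_requests_total", {"status": "200", "method": "GET"}),
    ("http_requests_total", {"method": "GET", "status": "500"}),
]
distinct = {series_key(name, labels) for name, labels in samples}

# Hypothetical label sets: 2 methods x 3 statuses x 50 instances.
bound = worst_case_series({
    "method": ["GET", "POST"],
    "status": ["200", "404", "500"],
    "instance": [f"host{i}" for i in range(50)],
})
```

The multiplication is the reason unbounded labels such as user IDs are usually avoided: every additional label value multiplies the number of series the database may have to store.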

PromQL

A flexible query language for aggregations and calculations on top of metrics.
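One of the most common PromQL calculations is the per-second rate of a counter, written for example as `rate(http_requests_total[1m])`. The sketch below approximates the idea in plain Python under simplifying assumptions: it sums the increases between consecutive samples, treats any drop in value as a counter reset, and divides by the time between the first and last sample. The real `rate()` additionally extrapolates to the edges of the window, so its results can differ; the function names here are illustrative.

```python
def simple_increase(samples):
    """Total counter increase over a list of (timestamp_seconds, value) samples.

    A drop in value is treated as a counter reset, so the new value is
    counted in full as the increase since the restart.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return None  # not enough data to compute a rate
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return simple_increase(samples) / elapsed


# Hypothetical samples scraped every 15 seconds; the process restarts
# between t=30 and t=45, so the counter drops from 160 to 10.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
per_second = simple_rate(window)
```

For this window the increases are 30, 30, 10 and 30, which gives 100 requests over 60 seconds. Handling resets inside the query function is what allows counters to stay simple, monotonically increasing values on the client side.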

Ecosystem

Exporters, Alertmanager, Grafana and integrations for Kubernetes.

Useful links

Related chapters

  • Site Reliability Engineering - Provides SLI/SLO and incident response practices where Prometheus metrics become an operational baseline.
  • Kubernetes: The Documentary - Shows the rise of the cloud-native ecosystem where Prometheus became a default monitoring layer.
  • Cloud Native - Connects platform architecture with observability workflows and the role of metrics in distributed operations.
  • Kubernetes Patterns - Extends the topic with Kubernetes operational patterns where monitoring and alerting are built into delivery.
  • Building Microservices - Covers observability and service-metrics practices where Prometheus is often the standard production choice.
