Prometheus: The Documentary
The story of the monitoring system that became a standard for the cloud-native ecosystem
Source
Book cube
Original post recommending the documentary
What is the film about?
The documentary shows how Prometheus was born inside SoundCloud in 2012 and became the de facto standard for monitoring cloud-native applications. The story begins with reliability problems and the difficulty of getting observability into the company's in-house workload orchestrator.
How the story developed
SoundCloud and SRE pain
Two ex-Googlers, Julius Volz and Björn Rabenstein, were responsible for reliability at SoundCloud. The company already ran an in-house workload orchestrator, predating Kubernetes.
Failed attempts with StatsD and Graphite
Monitoring the cluster with these tools proved too difficult, so the engineers began building a system modeled on Borgmon, the monitoring system Google used for Borg.
Birth of Prometheus
A new approach: a pull-based collection model, a time-series database, and PromQL as the query language.
Open source and announcement
The code is published on GitHub from the start; SoundCloud later officially announces the system, and another company picks it up as an early adopter.
Joining the CNCF
Prometheus is accepted into the CNCF as its second hosted project, after Kubernetes. This reinforces the neutral governance model and accelerates ecosystem growth.
Graduated status in CNCF
Prometheus becomes the second CNCF graduated project after Kubernetes. For the market, this is a signal of maturity: stable governance, an active community and a production-ready profile.
Prometheus v2.40 and native histograms (experimental)
In release 2.40, experimental support for native histograms appears. This is an important step towards more accurate distribution metrics under high load.
Prometheus 3.0
Version 3.0 ships, the first major release in 7 years: the project updates its technical foundation and continues to evolve while keeping its role as the cloud-native monitoring standard.
Stabilization in 3.x
Development continues in the 3.x branch; native histogram support is declared stable (starting from v3.8), which simplifies production use.
Key technical ideas
Pull model
The system itself polls targets, which simplifies scaling and reduces the burden on clients.
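The pull model can be illustrated with a minimal sketch in which the collector, not the application, initiates every request. This is not Prometheus code: the names `http_fetch`, `parse_exposition`, and `scrape_once` are hypothetical, and the parser handles only a simplified subset of the text exposition format (no timestamps, no escaped label values). The synthetic `up` series mirrors the way Prometheus records whether a scrape succeeded.

```python
import time
import urllib.request
from typing import Callable, Dict, List, Tuple

# (series name with labels, value, scrape timestamp); hypothetical type alias.
Sample = Tuple[str, float, float]


def http_fetch(url: str, timeout: float = 5.0) -> str:
    """Fetch one target's metrics page over HTTP."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_exposition(text: str) -> Dict[str, float]:
    """Parse `series value` lines; comment lines and blanks are skipped.

    Simplification: a trailing timestamp on a sample line is not supported.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces survive.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def scrape_once(
    targets: List[str],
    fetch: Callable[[str], str] = http_fetch,
    now: Callable[[], float] = time.time,
) -> List[Sample]:
    """Poll every target once; the collector decides when and whom to ask."""
    collected: List[Sample] = []
    for url in targets:
        ts = now()
        try:
            body = fetch(url)
        except OSError:
            # Target unreachable: record that fact instead of failing the round.
            collected.append((f'up{{instance="{url}"}}', 0.0, ts))
            continue
        collected.append((f'up{{instance="{url}"}}', 1.0, ts))
        for series, value in parse_exposition(body).items():
            collected.append((series, value, ts))
    return collected
```

Because the collector owns the target list and the schedule, an application only has to expose a page of current values; it needs no knowledge of where the monitoring system lives, and a failed scrape is itself a signal.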
Time-series database
Storage optimized for metrics: each metric is kept as a set of labeled time series, including data with many label combinations (high cardinality).
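The data model behind this idea can be sketched in a few lines: a series is identified by a metric name plus a set of label pairs, and every distinct label combination creates a new series. The class `TinySeriesStore` below is a hypothetical in-memory illustration, not the on-disk format Prometheus actually uses.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

# A series is identified by its metric name and its full label set.
SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]
Point = Tuple[float, float]  # (timestamp, value)


class TinySeriesStore:
    """Hypothetical in-memory store keyed by (name, labels)."""

    def __init__(self) -> None:
        self._series: Dict[SeriesKey, List[Point]] = defaultdict(list)

    def append(self, name: str, labels: Dict[str, str], ts: float, value: float) -> None:
        # frozenset makes the key independent of label order.
        key = (name, frozenset(labels.items()))
        self._series[key].append((ts, value))

    def select(self, name: str, **matchers: str) -> Dict[SeriesKey, List[Point]]:
        """Return series of `name` whose labels include all equality matchers."""
        wanted = set(matchers.items())
        return {
            key: points
            for key, points in self._series.items()
            if key[0] == name and wanted <= key[1]
        }

    def cardinality(self) -> int:
        """Number of distinct series, i.e. distinct label combinations."""
        return len(self._series)
```

The `cardinality` method shows why label design matters: adding a label with many possible values multiplies the number of series the store has to track.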
PromQL
A flexible query language for aggregations and calculations on top of metrics.
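As an illustration of the kind of calculation PromQL expresses, a query such as `sum by (code) (rate(http_requests_total[5m]))` computes a per-second rate for each counter series and then adds the rates up per status code. The sketch below approximates those two steps in Python. The functions `simple_rate` and `sum_by` are hypothetical, and `simple_rate` is deliberately simplified: unlike the real `rate()`, it does no extrapolation to the window boundaries.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Point]) -> float:
    """Per-second increase of a counter over the given samples.

    A drop in value is treated as a counter reset, so the new value
    counts as an increase from zero.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        increase += value if value < prev else value - prev
        prev = value
    return increase / duration


def sum_by(label: str, rated: List[Tuple[Dict[str, str], float]]) -> Dict[str, float]:
    """Add up values per distinct value of one label, dropping all other labels."""
    totals: Dict[str, float] = defaultdict(float)
    for labels, value in rated:
        totals[labels.get(label, "")] += value
    return dict(totals)
```

For example, `simple_rate([(0.0, 10.0), (30.0, 40.0), (60.0, 70.0)])` yields 1.0 request per second, and feeding several such results into `sum_by("code", ...)` collapses them into one number per status code.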
Ecosystem
Exporters, Alertmanager, Grafana and integrations for Kubernetes.

