Apache Cassandra
History, architecture and features of Apache Cassandra.
Apache Cassandra is a distributed wide-column database that combines ideas from Dynamo and Bigtable. It is designed for scalability and high availability, and its consistency level can be tuned to match system requirements.
Cassandra specifics
Wide-column store
Data is organized into keyspaces and tables that are designed around known query patterns.
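To make the idea of a table built for one query pattern concrete, here is a minimal Python sketch. It is an in-memory analogy, not Cassandra code: the table name `messages_by_user`, the columns `user_id` and `sent_at`, and both helper functions are hypothetical. The partition key selects one bucket of rows, and the clustering key keeps the rows inside that bucket sorted.

```python
import bisect
from collections import defaultdict

# Hypothetical table "messages_by_user":
# partition key = user_id, clustering key = sent_at.
messages_by_user: dict[str, list[tuple[int, str]]] = defaultdict(list)


def insert_message(user_id: str, sent_at: int, body: str) -> None:
    """Insert a row, keeping the partition sorted by the clustering key."""
    bisect.insort(messages_by_user[user_id], (sent_at, body))


def latest_messages(user_id: str, limit: int) -> list[str]:
    """The one query this layout serves well: newest messages of one user."""
    if limit <= 0:
        return []
    rows = messages_by_user.get(user_id, [])
    return [body for _, body in reversed(rows[-limit:])]


insert_message("alice", 3, "third")
insert_message("alice", 1, "first")
insert_message("alice", 2, "second")
print(latest_messages("alice", 2))  # ['third', 'second']
```

A question this layout was not designed for, such as "all messages sent at time 2 by any user", would have to scan every partition, which is why the read queries are usually decided before the tables.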
Masterless architecture
All nodes are equal, eliminating a single point of failure and increasing availability.
AP + tunable consistency
The system prioritizes availability and partition tolerance, and the consistency level is configurable per request.
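The trade-off behind tunable consistency can be sketched with simple arithmetic. The example below assumes the usual rule of thumb that a read is guaranteed to reach at least one replica holding the latest acknowledged write when the number of write acknowledgements plus the number of read responses exceeds the replication factor. The function names are illustrative, and the sketch ignores failures, hinted handoff and repair.

```python
def quorum(rf: int) -> int:
    """Smallest majority of rf replicas."""
    return rf // 2 + 1


def reads_see_latest_write(rf: int, write_acks: int, read_acks: int) -> bool:
    """True when any read replica set must overlap any write replica set."""
    return write_acks + read_acks > rf


RF = 3
print(quorum(RF))                                          # 2
print(reads_see_latest_write(RF, quorum(RF), quorum(RF)))  # True
print(reads_see_latest_write(RF, 1, 1))                    # False
```

With RF=3, quorum writes plus quorum reads overlap on at least one replica, while writing and reading at a single replica trades that guarantee for lower latency and higher availability.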
Limitations and compromises
- Limited support for complex joins and ad-hoc queries.
- The schema must be designed in advance around the read queries it will serve.
- Performs best with large data volumes and write-heavy workloads.
Architecture visualization
Ring Topology
Consistent Hashing
Replication Factor = 3
Each key is stored on three nodes: the primary node and the next two nodes clockwise on the ring.
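The placement rule above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: one token per node, a 32-bit token space, and MD5 as a stand-in hash. Real clusters use their configured partitioner, typically many tokens per node, and a replication strategy that may also take racks and datacenters into account. The names `token_for`, `replicas_for` and `node-N` are hypothetical.

```python
import bisect
import hashlib

RING_SIZE = 2**32  # toy token space (assumption, not Cassandra's real range)


def token_for(key: str) -> int:
    """Hash a partition key to a ring position (MD5 is only a stand-in)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # 4 bytes, so always < 2**32


def replicas_for(key: str, ring: list[tuple[int, str]], rf: int = 3) -> list[str]:
    """Return the primary node and the next rf-1 nodes clockwise.

    `ring` is a list of (token, node_name) pairs sorted by token.
    """
    tokens = [token for token, _ in ring]
    # First node whose token is >= the key's token; wrap around past the end.
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Six hypothetical nodes spaced evenly around the ring.
ring = [(i * RING_SIZE // 6, f"node-{i}") for i in range(6)]
print(replicas_for("user:42", ring))
```

Because placement depends only on the key's hash and the node tokens, any node can compute the replica set locally, and adding a node moves only the keys in the token range it takes over.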
Write Path
- Client -> any node (coordinator)
- Coordinator computes hash(key) -> token
- Token -> primary node + RF-1 replicas
- Parallel write to all replicas
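The last two steps of the write path can be sketched as a coordinator that fans a write out to all replicas and counts acknowledgements. This is a toy model: `Replica` and `coordinate_write` are hypothetical names, replicas are in-memory objects, and the coordinator here waits for every response before deciding, whereas a real coordinator can answer the client as soon as enough replicas have acknowledged.

```python
from concurrent.futures import ThreadPoolExecutor


class Replica:
    """Hypothetical in-memory stand-in for a replica node."""

    def __init__(self, name: str, up: bool = True):
        self.name = name
        self.up = up
        self.data: dict[str, str] = {}

    def write(self, key: str, value: str) -> bool:
        if not self.up:
            return False  # a node that is down never acknowledges
        self.data[key] = value
        return True


def coordinate_write(replicas: list[Replica], key: str, value: str,
                     required_acks: int) -> bool:
    """Send the write to all replicas in parallel, then compare the
    number of acknowledgements with the required count."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        acks = sum(pool.map(lambda replica: replica.write(key, value), replicas))
    return acks >= required_acks


replicas = [Replica("a"), Replica("b"), Replica("c", up=False)]
print(coordinate_write(replicas, "user:42", "Ada", required_acks=2))  # True
print(coordinate_write(replicas, "user:42", "Ada", required_acks=3))  # False
```

With one of three replicas down, a write that needs two acknowledgements still succeeds, while one that needs all three fails, which is the availability side of the consistency setting.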
History: key milestones
Facebook -> open source
Cassandra was created at Facebook and released as open source in 2008.
Apache Incubator
The project moved to the Apache Incubator and began to develop as an open-source initiative.
Top-level project
Apache Cassandra became an Apache top-level project.
1.0: first stable major release
The release established Cassandra as a production-ready, standalone distributed DBMS.
2.0: LWT and development of CQL
Lightweight transactions (compare-and-set based on Paxos) were introduced, along with notable improvements to the query model.
3.0: major storage update
Major internal changes to the storage layer and performance improvements.
4.0: Focus on stability
A release with a focus on reliability, predictability and operational maturity.
5.0: SAI and vector search
New major release with Storage-Attached Indexes and capabilities for modern search/AI workloads.
IBM and DataStax
IBM announced its acquisition of DataStax, which strengthens enterprise backing for the Cassandra ecosystem.
Cassandra architecture by layers
The architecture consists of a coordinator layer, a replication layer, and LSM-style storage built on a commit log, memtables and SSTables.
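How the three storage components cooperate can be shown with a toy in-memory store. This is a teaching sketch rather than Cassandra's implementation: `ToyLSMStore` and `memtable_limit` are hypothetical names, SSTables are plain dictionaries instead of on-disk files, and compaction, tombstones and bloom filters are left out.

```python
from typing import Optional


class ToyLSMStore:
    """Commit log + memtable + immutable SSTables, all kept in memory."""

    def __init__(self, memtable_limit: int = 2):
        self.commit_log: list[tuple[str, str]] = []  # append-only durability log
        self.memtable: dict[str, str] = {}           # recent writes
        self.sstables: list[dict[str, str]] = []     # flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # 1. log the write first
        self.memtable[key] = value            # 2. then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Freeze the memtable into a key-sorted SSTable and reset the log."""
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # logged writes are now covered by the SSTable

    def read(self, key: str) -> Optional[str]:
        """Check the memtable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None


store = ToyLSMStore(memtable_limit=2)
store.write("a", "1")
store.write("b", "2")      # second write reaches the limit and triggers a flush
store.write("a", "3")      # newer value stays in the memtable
print(store.read("a"))     # 3
print(store.read("b"))     # 2
print(len(store.sstables)) # 1
```

Writes only append to the log and update memory, which is why this layout favors write throughput; reads may have to consult several tables, and the newest value wins.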
Cluster architecture
Data model
DDL vs DML: how a request flows
DDL operates on keyspace and table schemas, while DML operates on data. The basic steps for both request types are outlined below.
How a request flows through Cassandra
Comparing the execution chain for DDL (schema) and DML (data)
1. Node accepts request
Any cluster node can accept a DML request.
Data operations
- DML works with data and indexes without changing the schema.
- Write path is optimized for high write throughput.
- Consistency level defines write acknowledgement behavior.
Why choose Cassandra
- Linear scaling when adding nodes.
- High availability without a single point of failure.
- Good write performance thanks to LSM-like storage.
- Flexible consistency settings for different scenarios.
