Cassandra is worth understanding not as generic NoSQL, but as a very specific architectural bet on availability, linear write scalability, and designing from access patterns.
In a real system, this chapter helps you shape tables around partition keys, clustering columns, and partition boundaries, and choose consistency levels based on business-critical behavior rather than defaults.
In interviews and architecture discussions, it gives you a stronger language for explaining why Cassandra fits write-heavy, geo-distributed systems while demanding discipline on the read and query side.
Practical value of this chapter
Query-driven model
Design tables from access patterns: partition key, clustering columns, and partition-size boundaries.
Tunable consistency
Match consistency levels to operation criticality, latency budget, and product requirements.
Operational cycle
Treat compaction, repair, tombstone control, and capacity management as continuous architecture work.
Interview narrative
Position Cassandra as a fit for write-heavy geo-distributed systems with explicit read-side trade-offs.
Source
Apache Cassandra
History, architecture and features of Apache Cassandra.
Apache Cassandra is a distributed wide-column database that combines ideas from Amazon's Dynamo and Google's Bigtable. It is designed for scalability and high availability, with consistency tunable to system requirements.
Cassandra specifics
Wide-column store
Data is organized into keyspaces and tables optimized for known query patterns.
Masterless architecture
All nodes are equal, eliminating a single point of failure and increasing availability.
AP + tunable consistency
The system prioritizes availability and partition tolerance (the AP side of CAP), while the consistency level is configurable per operation.
Limitations and compromises
- No joins and only limited support for ad-hoc queries.
- The query model requires tables to be designed up front around known read patterns.
- Performs best at large data volumes and high write rates.
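The query-first constraint above can be made concrete with a toy model. This is a minimal sketch assuming a hypothetical messaging workload ("fetch the newest messages in a conversation"); the class and all names are illustrative, not Cassandra APIs. It mimics a table with partition key `conversation_id` and clustering column `sent_at`:

```python
from bisect import insort
from collections import defaultdict

# Toy in-memory model of query-driven table design (illustrative only).
# CQL analogue: PRIMARY KEY ((conversation_id), sent_at)
#               WITH CLUSTERING ORDER BY (sent_at DESC)
class MessagesByConversation:
    def __init__(self):
        # One partition per conversation_id; rows kept sorted by clustering key.
        self.partitions = defaultdict(list)

    def insert(self, conversation_id, sent_at, text):
        insort(self.partitions[conversation_id], (sent_at, text))

    def recent(self, conversation_id, limit=10):
        # A read touches exactly one partition and walks clustering
        # order newest-first -- the access pattern the table was built for.
        rows = self.partitions[conversation_id]
        return [text for _, text in reversed(rows[-limit:])]

table = MessagesByConversation()
table.insert("c1", 1, "hi")
table.insert("c1", 3, "bye")
table.insert("c1", 2, "how are you?")
print(table.recent("c1", 2))  # -> ['bye', 'how are you?']
```

The point of the sketch: the table exists to serve one query shape, and any other query (e.g. "all messages by author") would need its own table.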
Architecture visualization
Ring Topology
Consistent Hashing
Each key hashes to a token that determines its position on the ring (RF=3):
Replication Factor = 3
Each key is stored on 3 nodes: the primary replica plus the next 2 nodes clockwise on the ring.
Write Path
- Client -> any node (coordinator)
- Coordinator computes hash(key) -> token
- Token -> primary node + RF-1 replicas
- Parallel write to all replicas; the coordinator acknowledges once the consistency level is satisfied
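The steps above can be simulated in a few lines. This is a hedged sketch with made-up node names; `required_acks` mirrors how consistency levels map to acknowledgement counts (QUORUM = RF/2 + 1):

```python
import hashlib

RF = 3
NODES = ["n1", "n2", "n3", "n4"]  # illustrative node names
DOWN = {"n3"}                     # simulate one failed replica

def replicas_for(key):
    # Stand-in for token-based placement: pick RF consecutive nodes deterministically.
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(RF)]

def required_acks(cl, rf=RF):
    # How many replica acknowledgements each consistency level needs.
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[cl]

def write(key, value, cl):
    # Coordinator sends to all replicas in parallel and counts acknowledgements.
    targets = replicas_for(key)
    acks = sum(1 for node in targets if node not in DOWN)
    return acks >= required_acks(cl)

# QUORUM (2 of 3) still succeeds when at most one replica is down.
print(write("order:17", "paid", "QUORUM"))  # -> True
```

Note the trade-off this encodes: ALL would fail here if a down node is in the replica set, while ONE accepts the write even when two replicas are unreachable.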
History: key milestones
Facebook -> open source
Cassandra was created at Facebook and released to the community as open source in 2008.
Apache Incubator
The project moved to the Apache Incubator and began to develop as an open-source initiative.
Top-level project
Apache Cassandra graduated to an Apache top-level project in 2010.
1.0: first stable major release
Release 1.0 (2011) cemented Cassandra's production-ready status as a standalone distributed DBMS.
2.0: LWT and development of CQL
Lightweight transactions (compare-and-set built on Paxos) arrive, along with notable improvements to CQL.
3.0: major storage update
Major internal changes to the storage layer and performance improvements.
4.0: Focus on stability
A release with a focus on reliability, predictability and operational maturity.
5.0: SAI and vector search
New major release with Storage-Attached Indexes and capabilities for modern search/AI workloads.
IBM and DataStax
IBM's announced acquisition of DataStax strengthens the enterprise ecosystem around Cassandra.
Cassandra architecture by layers
The architecture layers a coordinator, replication, and LSM-style storage built from a commit log, memtables, and SSTables.
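The commit log -> memtable -> SSTable flow can be sketched as a toy in-memory store. This is a simplification under stated assumptions: no compaction, bloom filters, or real disk I/O, and all names are illustrative:

```python
# Simplified LSM write/read path: commit log -> memtable -> SSTable flush.
class LSMStore:
    def __init__(self, memtable_limit=3):
        self.commit_log = []          # append-only durability log
        self.memtable = {}            # in-memory write buffer
        self.sstables = []            # immutable flushed segments, newest last
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. durable sequential append
        self.memtable[key] = value             # 2. in-memory update (no disk seek)
        if len(self.memtable) >= self.memtable_limit:
            self._flush()                      # 3. flush when the memtable is full

    def _flush(self):
        # The memtable becomes an immutable, sorted SSTable (a dict copy here).
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Check the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

store = LSMStore()
for i in range(4):
    store.write(f"k{i}", i)
print(store.read("k0"), len(store.sstables))  # -> 0 1 (k0 was flushed to the first SSTable)
```

This shape is why writes are cheap (sequential appends) while reads may touch several SSTables, which is exactly what compaction exists to bound.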
Cluster architecture
Data model
DDL vs DML: how the request goes
DDL operates on keyspaces and table schemas; DML operates on data. Below are the basic steps for both request types.
How a request flows through Cassandra
Comparing the execution chain for DDL (schema) and DML (data)
1. Node accepts request
Any cluster node can accept a DML request and act as the coordinator.
Data operations
- DML works with data and indexes without changing schema.
- Write path is optimized for high write throughput.
- Consistency level defines write acknowledgement behavior.
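The last point can be quantified: a read is guaranteed to observe the latest acknowledged write when read replicas plus write replicas exceed the replication factor (R + W > RF). A small sketch of that arithmetic; the consistency-level names match Cassandra's, but the functions themselves are illustrative:

```python
def acks_needed(cl: str, rf: int) -> int:
    # Replica acknowledgements required per consistency level.
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[cl]

def strongly_consistent(read_cl: str, write_cl: str, rf: int) -> bool:
    # R + W > RF guarantees the read and write replica sets overlap,
    # so the read sees at least one replica with the latest write.
    return acks_needed(read_cl, rf) + acks_needed(write_cl, rf) > rf

RF = 3
print(strongly_consistent("QUORUM", "QUORUM", RF))  # -> True  (2 + 2 > 3)
print(strongly_consistent("ONE", "ONE", RF))        # -> False (1 + 1 <= 3)
```

QUORUM/QUORUM is the common middle ground; ONE/ONE buys latency and availability at the cost of possibly stale reads.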
Why choose Cassandra
- Linear scaling when adding nodes.
- High availability without a single point of failure.
- Good write performance thanks to LSM-like storage.
- Flexible consistency settings for different scenarios.
Related chapters
- Database Selection Framework - How to decide when Cassandra is the right fit for write-heavy distributed workloads versus when another store is preferable.
- Replication and sharding - Operational patterns for replica placement, balancing, and failure management in distributed data layers.
- CAP theorem - Foundational context for availability/consistency trade-offs behind Cassandra's architectural choices.
- PACELC theorem - CAP extension for evaluating latency/consistency in normal operation and choosing tunable consistency levels.
- Jepsen and consistency models - How to validate real distributed-database guarantees under partitions and failure scenarios.
- Key-Value Database - Case-study view of distributed KV design with partitioning/quorum, close to Cassandra-class system requirements.
