Deep Dive into Cassandra

Mohit Sharma
28 min read · Dec 26, 2023

Relational Database Management Systems (RDBMS) are versatile but may not be suitable for all types of workloads due to several reasons:

  1. Schema Rigidity: RDBMS require a predefined schema that dictates the structure of the data. This schema can be rigid and may not accommodate unstructured, semi-structured, or complex hierarchical data (such as JSON or XML). In contrast, NoSQL databases like document stores or key-value stores are more flexible in handling such scenarios.
  2. Scalability: While traditional RDBMS can scale vertically by adding more resources (e.g., CPU, RAM) to a single server, they can have limitations when it comes to horizontal scalability. Distributing data across multiple servers for massive scalability can be complex and may not be cost-effective with traditional RDBMS. NoSQL databases, especially those designed for distributed systems, often handle horizontal scaling more gracefully.
  3. High Velocity and Volume: In scenarios where data is generated at an extremely high velocity (e.g., IoT sensor data) or where the volume of data is enormous (e.g., big data analytics), RDBMS may struggle to provide the necessary performance and scalability. Distributed and specialized databases like Apache Cassandra or Hadoop are better choices for such cases.
  4. Read-Heavy vs. Write-Heavy Workloads: RDBMS are traditionally designed for transactional workloads with a balanced mix of reads and writes. In cases where the workload is heavily skewed towards either reads or writes, there may be more specialized databases or caching solutions that can outperform RDBMS.
  5. Cost: RDBMS can be expensive to license and maintain, particularly for large-scale deployments. NoSQL and open-source alternatives can often provide cost savings for certain workloads.
  6. Data Consistency vs. Availability: RDBMS prioritize strong data consistency (ACID properties), which can lead to performance bottlenecks in distributed environments. NoSQL databases often relax consistency in favor of availability and partition tolerance, making them better suited for distributed systems.

The needs of today’s workloads

Speed: Modern workloads demand high performance and low latency to ensure quick response times for users and applications.

Avoid Single Point of Failure (SPoF): Reliability is crucial, and systems should be designed to eliminate single points of failure to ensure continuous operation even in the event of hardware or software failures.

Low TCO (Total Cost of Ownership): Cost-efficiency is a priority. Organizations aim to minimize the total cost of owning and operating their IT infrastructure while maximizing value.

Fewer System Administrators: Automation and efficient management tools are essential to reduce the need for extensive human intervention and lower administrative overhead.

Incremental Scalability: The ability to scale resources (e.g., compute, storage, or network) up or down incrementally based on changing demands is critical. This ensures efficient resource utilization and avoids over-provisioning.

Scale Out, Not Up: Scalability should be achieved by adding more commodity hardware or nodes to the infrastructure (scale-out), rather than relying on a single, more powerful machine (scale-up). This approach provides better fault tolerance and cost-effectiveness.

Scale out, not Scale up

The concept of “scale out, not scale up” refers to the approach of expanding a computing cluster or infrastructure by adding more commodity, commercial off-the-shelf (COTS) machines, rather than upgrading to more powerful, expensive machines. Here are the key points about this approach:

Scale Up (Vertical Scaling):

  • Traditional Approach: Historically, organizations would increase their cluster capacity by replacing existing hardware with more powerful machines.
  • Not Cost-Effective: Scaling up can be costly because you end up buying hardware that is above the optimal price-to-performance “sweet spot” on the cost curve.
  • Frequent Replacements: As technology advances, you may need to replace machines more often to keep up with performance demands.

Scale Out (Horizontal Scaling):

  • Incremental Growth: Scaling out involves adding more commodity hardware or COTS machines to the existing infrastructure incrementally.
  • Cost-Effective: This approach is typically more cost-effective because you can use readily available, affordable hardware components.
  • Long-Term Strategy: Over a longer duration, you can phase in a few newer (faster) machines as you phase out older ones, maintaining a balance between performance and cost.
  • Widely Used: Many companies, including those running data centers and cloud environments, prefer the scale-out approach due to its cost-effectiveness and flexibility.

NoSQL

NoSQL, which stands for “Not Only SQL,” is a category of database management systems designed to handle various types of data, including unstructured, semi-structured, or structured data, and to provide flexible, scalable, and high-performance solutions.

NoSQL databases are particularly well-suited for use cases where traditional relational databases (SQL databases) may not be the best fit due to factors such as scalability, schema flexibility, and high data velocity.

There are several types of NoSQL databases, each optimized for different types of data and use cases. Here are four common types of NoSQL systems:

  1. Document-Based NoSQL Databases (MongoDB, CouchDB)
  2. Key-Value Stores (Redis, Amazon DynamoDB)
  3. Column-Family Stores (Wide-Column Stores) (Apache Cassandra, HBase)
  4. Graph Databases (Neo4j, Amazon Neptune)

Column-Oriented Storage

Columnar Data Storage:

  • In column-oriented storage, data is organized and stored in a columnar fashion rather than the traditional row-based storage. Each column represents a specific attribute or field of the dataset.
  • For example, in a database containing customer information, you might have separate columns for names, addresses, and phone numbers.
  • All values within a column are of the same data type (e.g., integers, strings), which allows for efficient storage and retrieval of data.

Data Compression:

  • Column-oriented databases often apply compression techniques to reduce the storage footprint of the data.
  • Data compression algorithms are highly effective when applied to columns because they can exploit the repetitiveness or similarity of data values within a single column.
  • Compression reduces storage costs and can improve query performance by reducing I/O.

Reduced I/O:

  • One of the significant advantages of column-oriented storage is that it minimizes I/O operations during queries.
  • Since queries typically access only a subset of columns (specific attributes), the database engine can read and process only the relevant columns, reducing the amount of data transferred from storage and speeding up query performance.

Aggregation Efficiency:

  • Column-oriented storage excels at performing aggregation operations, such as calculating the SUM, AVG, COUNT, or MAX of values within a single column.
  • The contiguous storage of values in a column makes it more efficient to scan and aggregate data quickly.

Scalability:

  • Many column-oriented databases are designed to be horizontally scalable, allowing organizations to add more nodes or servers to accommodate large datasets or increasing workloads.
  • This scalability is crucial for handling vast amounts of data efficiently in data warehousing and analytical environments.

Column-oriented databases link the values that belong to one row by using a row identifier (RID), a unique identifier for each row. The RID may be stored explicitly alongside each column, or it may be implicit in the value's position within the column.
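
To make the layout difference concrete, here is a small, self-contained Python sketch (no database involved; the customer data is made up) that stores the same rows in row-oriented and column-oriented form and shows why a single-column aggregate only needs to scan one homogeneous array in the columnar layout:

    # Row-oriented: each record keeps all of its attributes together.
    rows = [
        {"id": 1, "name": "Asha", "city": "Pune", "balance": 120},
        {"id": 2, "name": "Bilal", "city": "Delhi", "balance": 340},
        {"id": 3, "name": "Chen", "city": "Mumbai", "balance": 220},
    ]

    # Column-oriented: each attribute is stored as its own contiguous array.
    columns = {
        "id": [1, 2, 3],
        "name": ["Asha", "Bilal", "Chen"],
        "city": ["Pune", "Delhi", "Mumbai"],
        "balance": [120, 340, 220],
    }

    # Aggregating one attribute in the row layout touches every field of every row...
    total_from_rows = sum(r["balance"] for r in rows)

    # ...while the columnar layout scans a single, homogeneous array, which is also
    # what makes per-column compression so effective.
    total_from_columns = sum(columns["balance"])

    assert total_from_rows == total_from_columns == 680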

Cassandra

A primary key in Cassandra is a column or set of columns that uniquely identify a row in a column family. Primary keys are used to partition the data across the nodes in a Cassandra cluster, and once defined, a primary key cannot be changed. The first component of the primary key is designated as the partition key (it can itself be composite); any remaining primary key columns are clustering columns that order rows within a partition.
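
As a concrete illustration, the sketch below uses the DataStax Python driver (cassandra-driver) to create a table whose primary key combines a partition key and a clustering column. The keyspace, table, and column names are invented for this example, and a node is assumed to be reachable on 127.0.0.1. Here user_id is the partition key, and signup_date is a clustering column that orders rows within each partition:

    from cassandra.cluster import Cluster

    # Assumes a single local node; adjust the contact points for a real cluster.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)

    # user_id is the partition key (it decides which nodes own the row);
    # signup_date is a clustering column (it orders rows inside the partition).
    session.execute("""
        CREATE TABLE IF NOT EXISTS demo.users_by_id (
            user_id     uuid,
            signup_date date,
            name        text,
            email       text,
            PRIMARY KEY ((user_id), signup_date)
        )
    """)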

Data Placement Strategies

How do you decide which server(s) a key-value resides on?

The decision of which server (node) a key-value will be stored on is determined through partitioning. Cassandra uses a component called the partitioner, which assigns each key to a specific node (and its replicas) within the cluster.

I. SimpleStrategy

SimpleStrategy is a replication strategy that is often used in single-data-center deployments. Under SimpleStrategy, you define the number of replicas, and Cassandra uses a basic token-based approach to distribute data across nodes in the cluster.

Partitioners

  1. RandomPartitioner (Hash-based): This is a partitioner that uses a hash-based approach to distribute data across nodes. It is similar to the Chord DHT (Distributed Hash Table) and assigns data to nodes based on a hashed token value. This evenly distributes data but may not be ideal for range queries.
  2. ByteOrderedPartitioner (Range-based): This partitioner assigns ranges of keys to servers in a clockwise manner around the ring. It is more suitable for range queries because data for adjacent keys is often stored on adjacent nodes. This makes it efficient for queries like “Get me all Twitter users starting with [a-b].”

II. NetworkTopologyStrategy

NetworkTopologyStrategy is designed for multi-data-center deployments, where data needs to be replicated across different data centers (DCs) to ensure high availability and disaster recovery. This strategy allows you to specify the number of replicas per data center, taking into account the replication factor for each DC.

Placement Rules for NetworkTopologyStrategy:

  1. First replica placement is determined by the Partitioner, just like with SimpleStrategy.
  2. For subsequent replicas, Cassandra moves clockwise around the ring until it hits a different rack within the same data center.
  3. If the data center has multiple racks, Cassandra distributes replicas across those racks to ensure fault tolerance within the data center.
  4. You can configure the replication factor for each DC separately, allowing fine-grained control over how many replicas are stored in each data center.
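
The replication strategy is chosen per keyspace at creation time. Below is a minimal sketch using the Python driver; the keyspace names and the data-center names (dc_east, dc_west) are illustrative and must match whatever names your snitch actually reports:

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()

    # SimpleStrategy: one replication factor for the whole cluster,
    # sensible only for single-data-center (or test) deployments.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo_simple
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)

    # NetworkTopologyStrategy: a replication factor per data center.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo_multi_dc
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'dc_east': 3,
            'dc_west': 2
        }
    """)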

Hash-Based Data Placement Strategies vs Range-Based Data Placement Strategies

I. Hash-Based Data Placement Strategies

Cassandra’s default and most commonly used data placement strategy is hash-based. It relies on partitioners like Murmur3Partitioner and RandomPartitioner to distribute data across nodes based on the hash of the partition key.

With hash-based placement, data is evenly distributed across the cluster’s nodes, which helps balance the load and achieve uniform data distribution.

RandomPartitioner (which is based on consistent hashing internally) is similar to a Chord-like Distributed Hash Table (DHT) and assigns data to nodes based on a hashed token value.

Consistent hashing works by hashing the partition key to produce a token. That token determines the key's position on the ring and therefore which server the key-value should be stored on.

Once the token has been calculated, Cassandra uses a ring structure to determine which server the key-value should be stored on. The ring is a logical representation of the cluster, with each server represented by a node on the ring. The nodes are arranged in clockwise order, and the key's token determines which node the key-value is stored on.
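
The toy sketch below (plain Python, not Cassandra's actual partitioner) mimics this ring walk: a key is hashed to a token with MD5, which is roughly what RandomPartitioner does, and the first node encountered clockwise from that token owns the key. The node names and keys are made up:

    import bisect
    import hashlib

    RING_SIZE = 2 ** 127  # RandomPartitioner's token range is 0 .. 2**127 - 1

    def token(key: str) -> int:
        """Hash a partition key to a ring position (MD5-based, like RandomPartitioner)."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

    # Each node owns a token; real clusters assign these via configuration or virtual nodes.
    nodes = {name: token(name) for name in ("node-a", "node-b", "node-c")}
    ring = sorted((tok, name) for name, tok in nodes.items())
    tokens = [tok for tok, _ in ring]

    def owner(key: str) -> str:
        """Walk clockwise from the key's token to the first node on the ring."""
        i = bisect.bisect_right(tokens, token(key)) % len(ring)
        return ring[i][1]

    for key in ("alice", "bob", "carol"):
        print(key, "->", owner(key))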

II. Range-Based Data Placement Strategies

While hash-based placement is the default and recommended strategy in Cassandra, you can also implement range-based placement strategies using partitioners like ByteOrderedPartitioner.

Range-based placement allows you to group related data together based on the order of keys. This can be useful for optimizing queries that involve range scans, as data for adjacent keys is often stored on adjacent nodes.

Range-based placement is particularly beneficial for scenarios where range queries, such as retrieving data within a specific key range, are a common use case.

Snitches

Snitches in Cassandra are responsible for providing information about the network topology of the cluster to the other nodes in the cluster. This information is used by Cassandra to route requests efficiently and to place replicas of data in different locations to improve availability.

Here are some common snitch options in Cassandra:

  1. SimpleSnitch: This is the simplest snitch and is rack-unaware. It doesn’t consider the physical network topology. It treats all nodes as if they are in the same data center and the same rack. It’s suitable for single-data-center deployments or scenarios where you don’t need to consider rack awareness.
  2. RackInferringSnitch: The RackInferringSnitch infers the rack and data center of each node from its IP address: the second octet is treated as the data center and the third octet as the rack. It requires no extra configuration, but it only produces a correct topology if your IP addressing scheme actually follows that convention; otherwise the inferred placement will be wrong.
  3. PropertyFileSnitch: It uses a configuration file (cassandra-topology.properties) to define the mapping of nodes to data centers and racks. This allows you to explicitly specify the topology information for each node in your cluster. It provides more control and precision over the topology configuration.

Writes in Cassandra

1. Writes Overview (Client → Coordinator → Replica)

1.1 Lock-Free and Fast Writes: Cassandra is designed for high write throughput and aims to provide fast and lock-free write operations. This means that writes are executed without the need for locks and are optimized for speed.

1.2 Coordinator Node: When a write request is received from a client, it is sent to one coordinator node within the Cassandra cluster. The coordinator node is responsible for managing the write operation and ensuring its success.

1.3 Per-Key, Per-Client, or Per-Query Coordinator: The coordinator node can be selected on a per-key, per-client, or per-query basis, depending on the specific configuration and use case. Using a per-key coordinator ensures that writes for a specific key are serialized, preventing concurrent updates to the same data.

1.4 Partitioner and Replica Nodes: The write request includes a partition key. The coordinator node uses the partitioner (which takes the partition key as input) to determine which nodes in the cluster are responsible for the data associated with the given key. The write request is then sent to all replica nodes that are responsible for the key.

1.5 Acknowledgement Upon Receiving X Replicas:

  • Cassandra follows a configurable consistency level, and the coordinator waits for a specified number of replica nodes (X replicas) to acknowledge the write operation. Once the required number of replicas have responded, the coordinator sends an acknowledgement to the client, indicating the successful write.
  • Cassandra provides a number of write consistency levels, which control how many replicas must acknowledge the write before it is considered successful. The default is ONE (many drivers default to LOCAL_ONE), meaning a single replica's acknowledgement is enough; stricter levels such as QUORUM or ALL wait for more, or all, of the replicas.

1.6 Hinted Handoff Mechanism

  • Cassandra employs a mechanism called Hinted Handoff to handle temporary failures. If a replica node is temporarily unavailable when a write occurs, the coordinator writes the data to all other reachable replicas.
  • The coordinator keeps the write locally until the unavailable replica comes back online, at which point the hinted handoff is replayed to ensure data consistency. Hinted handoffs can also be replayed manually if needed.

1.7 Buffering Writes When All Replicas Are Down:

  • In cases where some or all of the replicas are unavailable (e.g., due to a network partition or node failures), the coordinator stores hints for the missed writes rather than losing them, but only for a bounded window (max_hint_window_in_ms, three hours by default). Hints accumulated within that window are replayed once the replicas come back.
  • If a replica stays down beyond the hint window, new hints stop being stored for it and the replica must be brought back in sync by a repair. The write request itself fails only if fewer replicas than the requested consistency level respond within the write request timeout (write_request_timeout_in_ms).

1.8 One Ring Per Datacenter and Cross-Datacenter Coordination:

  • Cassandra uses a ring topology for each datacenter within the cluster. For multi-datacenter deployments, the coordinator forwards a write to a coordinating node in each remote data center, which relays it to the other replicas in that data center. This coordination ensures that writes are properly replicated across data centers.
  • Early descriptions of Cassandra's design mention electing such per-datacenter coordinators with a distributed coordination service like Zookeeper (itself built on a Paxos-style consensus protocol). Apache Cassandra today has no Zookeeper dependency, although it does use a Paxos variant internally for lightweight transactions.
  • Cassandra also supports a “network topology” feature that allows you to define custom network topologies for your cluster. This can be useful for optimizing performance and reliability in complex deployments.

2. Write Operations at Replica Nodes

Steps Taken at a Replica Node in Cassandra When It Receives a Write Operation:

2.1 Log Write in Disk Commit Log (WAL — Write-Ahead Log):

  • When a replica node receives a write request, the first step is to log the write operation in the disk commit log. This log is often referred to as the Write-Ahead Log (WAL).
  • The commit log is an append-only log file that records all write operations. It serves as a durable record of write operations and is used for recovery in case of node failures or crashes.

2.2 Update Memtables:

  • The replica node then updates its memtable. A memtable is an in-memory, per-table data structure that holds recently written key-value pairs, kept sorted by key.
  • Memtables are optimized for fast writes and can be searched by key; a newer write to the same cell simply replaces the older in-memory value.
  • Cassandra uses a write-back cache approach for memtables, which means that writes are first stored in memory for fast writes and are later flushed to disk for durability.

2.3 Flush Memtable to Disk:

  • When a memtable becomes full or reaches a certain age threshold, it is flushed to disk. This process creates a new data file known as an SSTable (Sorted String Table).
  • An SSTable is an immutable list of key-value pairs, sorted by key. Once created, an SSTable is never modified in place; updates and deletes are handled by writing newer SSTables, which are later merged during compaction.
  • In addition to the data file (SSTable), an index file is created. The index file contains (key, position in data SSTable) pairs, which allow for efficient data retrieval.
  • To further optimize data retrieval, a Bloom filter is also associated with the SSTable. A Bloom filter is a probabilistic data structure that provides a fast and memory-efficient way to check whether a given key exists in the SSTable.
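
The sketch below is a deliberately simplified, plain-Python imitation of this step (it is not Cassandra's on-disk format): an in-memory dict plays the role of the memtable, and "flushing" it produces an immutable, key-sorted file plus a small (key, position) index:

    import json

    class TinyMemtable:
        """A toy memtable: an in-memory map of key -> (timestamp, value)."""

        def __init__(self):
            self.data = {}

        def write(self, key, value, timestamp):
            self.data[key] = (timestamp, value)

        def flush(self, path):
            """Flush to a toy SSTable: an immutable, key-sorted file plus an index."""
            entries = sorted(self.data.items())  # SSTables are sorted by key
            index = {}
            with open(path, "w") as f:
                for key, (ts, value) in entries:
                    index[key] = f.tell()  # (key, position-in-file) pairs
                    f.write(json.dumps({"key": key, "ts": ts, "value": value}) + "\n")
            self.data.clear()  # the memtable is emptied once the flush completes
            return index

    mt = TinyMemtable()
    mt.write("user:42", {"name": "Asha"}, timestamp=1)
    mt.write("user:7", {"name": "Bilal"}, timestamp=2)
    print(mt.flush("sstable-0001.json"))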

2.4 Sending Acknowledgements (ACKs)

  • A replica node in Cassandra sends an acknowledgment (ACK) to the coordinator node once it has appended the write to its commit log and applied it to its memtable. The coordinator node waits for ACKs from the required number of replicas before responding to the client with a success or failure message.
  • The number of required replicas depends on the write consistency level specified by the client. For example, if the client specified the ALL consistency level, the coordinator waits for ACKs from every replica before responding; if the client specified LOCAL_QUORUM, the coordinator waits only for ACKs from a quorum of replicas in its local data center.

Note:

  • Every node in Cassandra has memtables, and any node can act as the coordinator for a request. When a client sends a write, the coordinator forwards it to the replica nodes for the key; each replica logs the write to its commit log and then applies it to its memtable. If the coordinator is itself a replica for that key, it does the same locally. Keeping recent writes in memtables lets Cassandra serve many reads without touching disk.
  • When a memtable reaches a certain size, it is flushed to disk as an SSTable. SSTables are immutable files that store data in sorted order. This allows Cassandra to efficiently perform range queries and scans.
  • Replica nodes flush their memtables to disk independently, so one replica may flush before another. This is not a problem: each node's commit log guarantees that its acknowledged writes are not lost even if the node crashes before its memtable has been flushed to disk.

Lock-free writes with version vectors

Lock-Free Writes

In Cassandra, write operations are designed to be lock-free, which means that multiple clients can simultaneously write to the same data without causing delays or blocking each other. Cassandra achieves this through a combination of techniques:

  1. Memtables: Cassandra initially writes data to in-memory structures called memtables before eventually flushing them to disk. Importantly, memtables are not locked, meaning that multiple clients can concurrently write to the same memtable without any contention or blocking.
  2. Version Vectors: Cassandra employs version vectors as a key mechanism to manage concurrent writes to the same data while ensuring data integrity. Version vectors are used to keep track of changes made to a piece of data, allowing Cassandra to merge concurrent writes without any data loss.
  3. Commit Logs: Before applying writes to memtables, Cassandra logs all write operations in a durable commit log that resides on disk. This commit log acts as a safety net, ensuring that data remains intact even if the coordinator node fails before all replica nodes have had a chance to flush their memtables to disk.

Version Vectors

Version vectors play a crucial role in maintaining data consistency in Cassandra, especially when multiple clients attempt to write to the same data simultaneously. Here’s how version vectors work:

A version vector is essentially an array of timestamps, with each timestamp representing a specific change made to a piece of data. When a client initiates a write request in Cassandra, the coordinator node assigns a version vector to that particular write. This version vector is then included in the write request that gets forwarded to the replica nodes.

Upon receiving a write request, a replica node merges the version vector provided in the request with its own local version vector. This merging process ensures that the replica node has a comprehensive record of all changes made to the data, including concurrent writes from different clients.

In cases where two or more clients attempt to write to the same data simultaneously, the coordinator node assigns different version vectors to each write request. This differentiation allows Cassandra to track and manage these concurrent writes effectively.

When a replica node flushes its memtable to disk, it includes the version vectors associated with all the writes in that memtable into the resulting SSTable (Sorted String Table). This practice ensures that the SSTable contains a comprehensive history of all the changes made to the data.

When a client sends a read request to Cassandra, the coordinator node performs a crucial task by merging the version vectors of all the replicas involved. This merger helps determine the most recent version of the data. Subsequently, the coordinator node returns this latest version of the data to the client, ensuring data consistency even in the presence of concurrent writes.
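
In practice, Cassandra's version tracking boils down to a write timestamp attached to each cell, and reconciliation keeps the value with the highest timestamp (last write wins). The sketch below shows that merge step in plain Python, with made-up values and timestamps:

    # A replica's current view of one cell, and an incoming concurrent write.
    # Real Cassandra tracks a write timestamp for every cell; these values are invented.
    current = {"value": "alice@old-mail.example", "ts": 1_695_000_000_000}
    incoming = {"value": "alice@new-mail.example", "ts": 1_695_000_050_000}

    def merge(existing, update):
        """Keep whichever version carries the higher write timestamp (last write wins)."""
        return update if update["ts"] > existing["ts"] else existing

    current = merge(current, incoming)
    print(current)  # the newer write wins; no lock was needed to decide this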

Hinted handoff mechanism

Hinted handoff is a mechanism in Cassandra that allows writes to be successful even if one or more replicas are unavailable. It works by storing the write on the coordinator node and then replaying it to the unavailable replicas later, when they become available.

Hinted handoff is enabled by default in Cassandra, but it can be disabled in the cassandra.yaml file. It is important to note that hinted handoff is not a guarantee of write consistency: if a replica stays unavailable for longer than the hint window (max_hint_window_in_ms, three hours by default), hints stop being stored for it, and that replica will miss those writes until a read repair or anti-entropy repair brings it back in sync.
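
Here is a toy, plain-Python sketch of the idea (not Cassandra's implementation; the replica names and statuses are made up): the coordinator stores a hint for any replica it cannot reach and replays the hint once that replica is reported up again:

    import time

    class ToyCoordinator:
        """A toy coordinator that stores hints for replicas it could not reach."""

        def __init__(self, replicas):
            self.replicas = replicas  # replica name -> "up" or "down"
            self.hints = []           # (target_replica, key, value, timestamp)

        def write(self, key, value):
            ts = time.time()
            for name, status in self.replicas.items():
                if status == "up":
                    print(f"wrote ({key}={value}) to {name}")
                else:
                    # Replica unreachable: keep a hint to replay later.
                    self.hints.append((name, key, value, ts))
                    print(f"{name} is down, stored hint for ({key}={value})")

        def replay_hints(self, name):
            """Called when a previously unavailable replica is seen to be back up."""
            remaining = []
            for target, key, value, ts in self.hints:
                if target == name:
                    print(f"replaying hint ({key}={value}) to {name}")
                else:
                    remaining.append((target, key, value, ts))
            self.hints = remaining

    coord = ToyCoordinator({"replica-1": "up", "replica-2": "down"})
    coord.write("user:42", "alice@new-mail.example")
    coord.replay_hints("replica-2")  # replica-2 came back online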

Bloom Filters

  • SSTables in Cassandra use bloom filters. Bloom filters are a probabilistic data structure that allows Cassandra to determine whether an SSTable has data for a particular row.
  • Bloom filters work by storing a bit array of a certain size. When a row is inserted into an SSTable, the bloom filter is updated by setting the corresponding bits in the bit array. To check if a row exists in an SSTable, Cassandra simply checks the corresponding bits in the bloom filter. If any of the bits are not set, then Cassandra knows that the row does not exist in the SSTable.
  • Bloom filters are not perfect, and there is a small chance of a false positive. This means that Cassandra may tell the client that a row exists in an SSTable, when in reality it does not. However, bloom filters are very efficient, and they can significantly reduce the number of SSTables that Cassandra needs to read for a given query.
  • Bloom filters can be configured for different levels of accuracy. The more accurate the bloom filter is, the less likely it is to give a false positive. However, the more accurate the bloom filter is, the more memory it will consume.
  • Cassandra uses bloom filters to improve read performance. When a client issues a read, Cassandra checks the bloom filter of each candidate SSTable to see whether that SSTable might contain the requested row, and only reads from the SSTables whose filters match. If no filter matches (and the row is not in a memtable), Cassandra knows the row does not exist without touching those SSTables on disk.
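
For intuition, here is a minimal Bloom filter in plain Python (not Cassandra's implementation; the bit-array size and hashing scheme are arbitrary). Note the asymmetry: false positives are possible, false negatives are not:

    import hashlib

    class TinyBloomFilter:
        """A minimal Bloom filter: m bits, k hash functions, no deletions."""

        def __init__(self, m_bits=1024, k_hashes=3):
            self.m = m_bits
            self.k = k_hashes
            self.bits = 0  # a Python int doubles as an arbitrarily long bit array

        def _positions(self, key):
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, key):
            for pos in self._positions(key):
                self.bits |= 1 << pos

        def might_contain(self, key):
            # If any bit is unset the key is definitely absent; if all are set,
            # the key is *probably* present (false positives are possible).
            return all((self.bits >> pos) & 1 for pos in self._positions(key))

    bf = TinyBloomFilter()
    bf.add("user:42")
    print(bf.might_contain("user:42"))  # True
    print(bf.might_contain("user:99"))  # almost certainly False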

Write path

The following is a high-level overview of the write path in Cassandra:

  1. The client sends a write request to a coordinator node.
  2. The coordinator uses the partitioner to determine which replica nodes own the partition key and forwards the write to them.
  3. If the coordinator is itself one of the replicas, it also logs the write to its own commit log and applies it to its memtable.
  4. Each replica node appends the write to its commit log.
  5. Each replica node applies the write to its memtable and sends an acknowledgment back to the coordinator.
  6. Once the coordinator has received acknowledgments satisfying the requested consistency level, it responds to the client with a success or failure message.
  7. If any of the replica nodes are unavailable, the coordinator stores the write as a hint.
  8. When the coordinator detects that an unavailable replica has become available again, it replays the hinted writes to that replica.
  9. Independently of individual requests, each replica flushes full memtables to disk as SSTables in the background.

Content of SSTables

SSTables within Cassandra store data in the form of sorted key-value pairs. The keys are arranged in ascending order, and the values can encompass various data types supported by Cassandra, such as strings, numbers, lists, and maps.

In addition to key-value pairs, SSTables also incorporate several vital components:

  • Bloom filter: A Bloom filter serves as a probabilistic data structure, enabling Cassandra to rapidly determine whether an SSTable contains data for a specific key.
  • Index: An index is instrumental in swiftly locating the data associated with a particular key within an SSTable.
  • Tombstones: Tombstones serve as markers for deleted rows. Cassandra utilizes tombstones to ensure that deleted data remains invisible to readers until it has undergone compaction.

Compaction in SSTables

Compaction in Cassandra is the process of consolidating multiple SSTables to generate new, more compact SSTables. This procedure yields several advantages, including:

  • Reduced disk space usage: SSTables can potentially include duplicate data and tombstones (markers for deleted rows). Compaction combines SSTables, eliminating duplicate data and tombstones, which can substantially diminish Cassandra’s disk space requirements.
  • Enhanced read performance: Compaction arranges SSTables in a sorted order, consequently enhancing the efficiency of range queries and scans.
  • Lower memory and read overhead: Each SSTable carries its own in-memory structures (bloom filter, index summary) and a read may have to consult several SSTables for the same partition. Compaction merges SSTables, reducing the number of per-SSTable structures kept in memory and the number of files each read has to touch.

When Cassandra performs compaction on SSTables, it merges them into a sorted order, eliminating duplicate data and tombstones. This process results in the creation of new, more streamlined SSTables that are more efficient for both reading and writing.

Compaction operates as a background process in Cassandra and is triggered automatically by the configured compaction strategy (for example, when enough similarly sized SSTables accumulate). Users can also trigger compaction manually.

Here’s a high-level overview of how compaction functions in Cassandra:

  1. Cassandra selects a set of SSTables for compaction.
  2. The selected SSTables are merged together, eliminating duplicate data and tombstones.
  3. The newly merged SSTable is written to disk.
  4. The old SSTables are deleted.
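
A deliberately simplified sketch of that merge in plain Python (keys, timestamps, and values are made up; None stands in for a tombstone): the newest version of each key wins and tombstoned keys are dropped. Real Cassandra only purges tombstones older than gc_grace_seconds:

    # Two toy SSTables: key -> (timestamp, value). value=None stands in for a tombstone.
    older = {"a": (10, "apple"), "b": (11, "banana"), "d": (12, "date")}
    newer = {"a": (20, "apricot"), "b": (21, None), "c": (22, "cherry")}

    def compact(*sstables):
        """Merge SSTables, keep the newest version of each key, drop tombstones."""
        merged = {}
        for table in sstables:
            for key, (ts, value) in table.items():
                if key not in merged or ts > merged[key][0]:
                    merged[key] = (ts, value)
        # Purge tombstoned keys and return a new, key-sorted table.
        return {k: v for k, v in sorted(merged.items()) if v[1] is not None}

    print(compact(older, newer))
    # {'a': (20, 'apricot'), 'c': (22, 'cherry'), 'd': (12, 'date')}  ('b' was deleted)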

It’s important to note that the same key can exist in two SSTables in Cassandra, and this can occur due to various reasons, including:

  • Memtable flushes over time: Each memtable flush produces a new SSTable, so writes to the same key that land in different memtables end up in different SSTables.
  • Updates: When a row is updated in Cassandra, the previous version is not modified or deleted in place. A new version with a newer timestamp is written to a later SSTable, so the old and new versions coexist until compaction merges them and keeps the newest one. (Deletes work similarly, except that a tombstone is written instead of a new value.)
  • Concurrent writes: When two clients write to the same key around the same time, their writes may land in different memtables or flushes, producing separate versions in separate SSTables; compaction later reconciles them by timestamp.

Read

At Coordinator

  1. Coordinator Selection: In Cassandra, the read operation is initiated by a client, and a coordinator node is responsible for managing the read request. The coordinator node is selected based on predefined criteria, often aiming to minimize network latency. Depending on the configuration, the coordinator can contact a specific number of replicas, denoted as “X,” which are typically chosen based on factors like rack locality or other criteria that optimize data retrieval.
  2. Choosing Fast Responders: One of Cassandra’s performance optimizations involves selecting replicas that have responded quickly to previous requests. This means that the coordinator node may prioritize contacting replicas that have demonstrated lower response times in the past. This approach minimizes read latency by directing requests to replicas known for their responsiveness.
  3. Selecting the Latest Value: When the read request is sent to the chosen replicas, Cassandra collects responses from these replicas and identifies the version of the data with the latest timestamp among the responses. The data in Cassandra is versioned using timestamps, which allows the system to determine which version is the most recent. The coordinator returns this latest-timestamped value to the client. The specific number of replicas to be contacted (X) can vary depending on your Cassandra cluster’s configuration.
  4. Background Consistency Checks: Cassandra maintains data consistency through background processes. While the primary goal is to serve the client’s read request efficiently, Cassandra also ensures data consistency over time. In the background, Cassandra compares values from replicas that didn’t initially respond to the read request. If any two values differ, it initiates a read repair.

A read repair is a process where Cassandra updates replicas with outdated data to match the most recent version. This mechanism ensures that eventually, all replicas are brought up to date and consistent with the latest changes. It’s important to note that read repairs are generally performed asynchronously and do not impact the immediate response to the client’s read request.
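
A simplified sketch of that decision in plain Python (replica names, values, and timestamps are made up): the coordinator returns the newest-timestamped value to the client and schedules a repair for any replica that answered with something older:

    # Responses from the R replicas contacted for a read: (value, write_timestamp).
    responses = {
        "replica-1": ("v2", 170),
        "replica-2": ("v1", 150),  # stale
        "replica-3": ("v2", 170),
    }

    # The coordinator returns the value with the newest timestamp to the client...
    latest_value, latest_ts = max(responses.values(), key=lambda pair: pair[1])
    print("returned to client:", latest_value)

    # ...and, in the background, re-writes that value to any replica that answered
    # with an older version (the read repair).
    for replica, (_, ts) in responses.items():
        if ts < latest_ts:
            print(f"read repair: send ({latest_value}, ts={latest_ts}) to {replica}")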

At the Replica Node

When the read request arrives at a replica node, the replica node first looks for the requested data in its Memtables. Memtables are in-memory data structures that store recently written data. They serve as a fast, in-memory cache for recent writes and are checked before accessing data on disk.

If the data is not found in Memtables, the replica node proceeds to search for the data in the SSTables on disk. SSTables are sorted and immutable tables that store data persistently. The read process may involve accessing multiple SSTables, especially if the requested data is distributed across them.

Handling Data Split Across SSTables: In a distributed database like Cassandra, data for a single row may be distributed across multiple SSTables due to the nature of distributed data storage and compaction. When this occurs, read operations may require accessing and combining data from multiple SSTables. While this multi-SSTable access can make reads slightly slower than writes, Cassandra is still optimized for fast read performance, especially with the use of Memtables and indexing structures.

CAP Theorem

The CAP Theorem is a fundamental concept in distributed computing that highlights the trade-offs that exist when designing distributed systems in terms of three key properties: Consistency, Availability, and Partition Tolerance.

Here’s a breakdown of the “CAP” components:

  1. Consistency: This refers to the guarantee that all nodes in a distributed system will see the same data at the same time. In other words, every read operation following a write operation will return the most recent write’s value. Achieving strong consistency in a distributed system can lead to slower response times and increased complexity in the presence of network partitions.
  2. Availability: Availability implies that every request (read or write) to the system receives a response, without any guarantee about the data’s consistency. In other words, even if some nodes are experiencing network issues or are temporarily unavailable, the system as a whole continues to operate and respond to requests. Ensuring high availability can sometimes lead to sacrificing strong consistency.
  3. Partition Tolerance: Partition tolerance addresses the system’s ability to continue functioning even when network partitions or communication failures occur between nodes. In a distributed system, network partitions can lead to nodes becoming temporarily unreachable or isolated from each other.

The CAP Theorem states that a distributed system can fully provide at most two of these three properties at once: when a network partition occurs, the system must sacrifice either consistency or availability.

CAP Theorem Fallout

The CAP Theorem indeed has significant implications for distributed systems, especially in today’s cloud computing environments where network partitions and distributed architectures are common. Here’s how the CAP theorem fallout affects system design choices, using Cassandra and traditional RDBMSs as examples:

Cassandra:

  • Cassandra is designed with a focus on Availability and Partition-Tolerance, which are essential in cloud computing and highly distributed systems.
  • It offers what is known as “eventual consistency,” which means that it may not guarantee strong immediate consistency like traditional RDBMSs. Instead, it prioritises providing quick responses and high availability, even in the presence of network partitions.
  • Cassandra’s data model and architecture are well-suited for scenarios where maintaining high availability and being able to continue functioning during network disruptions are critical. It allows for trade-offs in consistency to achieve these goals.

Traditional RDBMSs:

  • Traditional Relational Database Management Systems (RDBMSs) tend to prioritise strong consistency over availability in the presence of network partitions.
  • They are designed with the assumption of a reliable network and often use techniques like distributed transactions to ensure data consistency across multiple nodes.
  • In situations where maintaining strict data consistency is paramount, such as financial systems or applications dealing with sensitive data, RDBMSs are often chosen.

Eventual Consistency

Eventual consistency is a consistency model for distributed systems in which, under certain conditions, all replicas or nodes in a distributed database will eventually converge to the same consistent state. In other words, given enough time and assuming no new writes to a particular data item (key), all replicas of that data will reflect the same value, ensuring consistency.

Key Characteristics of Eventual Consistency:

  1. Convergence Over Time: Eventual consistency doesn’t guarantee immediate consistency. Instead, it acknowledges that in a distributed system, data may temporarily appear inconsistent due to factors like network delays, replication latency, or concurrent updates.
  2. Continual Convergence: As long as new writes to a key don’t occur, the system always strives to converge to a consistent state. It achieves this by continuously replicating and updating data across all nodes in the system.
  3. “Wave” of Updates: When writes occur, a “wave” of updated values begins propagating through the system. This wave of updates lags behind the latest values sent by clients but constantly tries to catch up. The goal is to ensure that all replicas eventually receive the same data.
  4. Potential for Stale Values: In practice, eventual consistency may still result in the occasional retrieval of stale or outdated values, especially in scenarios where there are many back-to-back writes. However, this temporary inconsistency is usually acceptable for many distributed applications and use cases.
  5. Quick Convergence During Low Write Periods: Eventual consistency works particularly well in situations with occasional or low write activity. In such cases, the system can quickly converge to a consistent state because there are fewer concurrent writes to manage.

RDBMS vs. Key-value stores

Comparing Relational Database Management Systems (RDBMS) with Key-Value Stores like Cassandra, you’ll find differences in their data models and the consistency models they offer:

RDBMS (Relational Database Management Systems):

  1. ACID Properties: RDBMS systems are known for providing ACID properties:
  • Atomicity: Ensures that a transaction is treated as a single, indivisible unit of work. It either fully succeeds, or nothing happens.
  • Consistency: Guarantees that a transaction brings the database from one consistent state to another, preserving data integrity and adhering to defined constraints.
  • Isolation: Ensures that concurrent transactions do not interfere with each other. Each transaction appears to execute in isolation.
  • Durability: Guarantees that once a transaction is committed, its changes are permanent and survive system failures.
  2. Structured Data Model: RDBMS systems use a structured, tabular data model with tables, rows, and columns. Data is organized into well-defined schemas with relationships between tables.
  3. SQL Language: RDBMS systems typically use SQL (Structured Query Language) for querying and manipulating data.

Key-Value Stores (e.g., Cassandra):

  1. BASE Properties: Key-Value stores, including Cassandra, often provide BASE properties:
  • Basically Available: Emphasizes system availability, ensuring that every request (read or write) receives a response. Even under network partitions or failures, the system remains operational.
  • Soft-state: Acknowledges that system state can be soft or temporarily inconsistent due to factors like replication lag or concurrent updates. Soft-state implies that system state may not always reflect the latest write.
  • Eventual Consistency: Guarantees that, given enough time and no new writes to a data item, all replicas will eventually converge to the same consistent state. Eventual consistency prioritizes availability over immediate data consistency.
  2. Semi-Structured Data Model: Key-Value stores use a more flexible, semi-structured data model where data is stored as key-value pairs. Each value can be a complex data structure, making it suitable for handling diverse and evolving data.
  3. NoSQL Query Language: Key-Value stores often provide their own query languages or APIs for data access, which may differ from SQL.

Consistency levels in Cassandra

Cassandra’s flexibility in choosing consistency levels for each operation (read or write) is one of its key features, allowing developers to fine-tune the trade-offs between consistency and performance based on their application requirements. Here are some of the common consistency levels in Cassandra:

  1. ANY: In the ANY consistency level, the request is considered successful as long as it reaches any server, even if it’s not a replica of the data being accessed. This level provides the fastest response time since it doesn’t wait for replication or strong consistency. It’s often used in scenarios where the lowest latency is more critical than data consistency, such as collecting statistics or logging.
  2. ONE: With the ONE consistency level, the request is considered successful as soon as at least one replica acknowledges it (for writes, the data is still sent to all replicas, but only one acknowledgement is awaited). ONE is much faster than ALL while still guaranteeing the data reached at least one replica, and it is commonly used when a balance between speed and consistency is required.
  3. QUORUM: QUORUM requires a quorum of replicas across all datacenters (DCs) to acknowledge the request. The quorum is usually calculated as (replication_factor / 2) + 1, where the replication factor is the number of replicas. This level offers a strong level of consistency, as it ensures that a majority of replicas agree on the data state. QUORUM is commonly used in multi-datacenter setups where maintaining strong consistency across datacenters is essential.
  4. LOCAL_QUORUM: Operates like QUORUM but only within the coordinator’s local datacenter (DC). It offers a compromise between global consistency and reduced latency, since it waits only for a quorum of replicas in the coordinator’s DC rather than across all DCs. This consistency level is suitable for applications that need strong consistency within a datacenter but do not require global consistency.
  5. EACH_QUORUM: EACH_QUORUM allows each datacenter to establish its own quorum for the operation, supporting hierarchical replies. This level is useful in scenarios where you want to maintain strong consistency within each datacenter while allowing for eventual consistency between datacenters.
  6. ALL: The ALL consistency level requires that all replica nodes acknowledge the request before it’s considered successful. It provides the highest level of data consistency but at the cost of slower response times. ALL is typically used in scenarios where data integrity and strong consistency are critical, such as financial applications.

Each of these consistency levels allows developers to tailor the behavior of their Cassandra cluster to the specific requirements of their application, whether it prioritises low latency, high availability, or strong consistency. The choice of consistency level should align with the application’s use case and performance expectations.
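
With the Python driver (cassandra-driver), the consistency level can be chosen per statement. The sketch below assumes the illustrative demo.users_by_id table from earlier and a locally reachable node; it writes at QUORUM and reads at ONE:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("demo")  # keyspace name is illustrative

    # A write that must be acknowledged by a quorum of replicas.
    write = SimpleStatement(
        "INSERT INTO users_by_id (user_id, signup_date, name) VALUES (uuid(), toDate(now()), %s)",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(write, ("Asha",))

    # A latency-sensitive read that is satisfied by a single replica's answer.
    read = SimpleStatement(
        "SELECT name FROM users_by_id LIMIT 10",
        consistency_level=ConsistencyLevel.ONE,
    )
    for row in session.execute(read):
        print(row.name)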

Quorums in distributed databases, such as Cassandra, are a fundamental concept for achieving a balance between data consistency and system availability. Here’s a detailed explanation of how quorums work in Cassandra:

Reads with Quorums:

  • In Cassandra, when a client performs a read operation, it specifies a value for R, which represents the read consistency level.
  • The coordinator node, responsible for handling the read request, waits for responses from R replica nodes before sending the result back to the client. These R replicas are often referred to as the "read replicas."
  • In the background, the coordinator checks the consistency of the remaining (N - R) replicas, where N is the total number of replicas for that key. If any inconsistencies are detected, it initiates a read repair process to ensure data consistency across all replicas.

Writes with Quorums:

  • Write operations in Cassandra involve specifying a value for W, which represents the write consistency level.
  • When a client writes new data, it writes the data to W replicas and returns. These W replicas are often called the "write replicas."
  • There are two flavors of write operations:
  1. Blocking Writes: The coordinator blocks until a quorum (defined by W) is reached. This ensures that the write is replicated to at least a majority of replicas before returning a success response to the client.
  2. Asynchronous Writes: In this mode, the coordinator performs the write operation and immediately returns a success response to the client without waiting for a quorum. The coordination of consistency happens asynchronously in the background.

Conditions for Quorums:

  • For quorum-based reads and writes to guarantee that a read always sees the latest acknowledged write (strong consistency), two conditions must hold (a quick check in code follows this list):
  1. W + R > N: The sum of the write consistency level (W) and the read consistency level (R) must be greater than the total number of replicas (N) for that key.
  2. W > N/2: The write consistency level (W) must be greater than half of the total number of replicas (N) for that key.
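
These two conditions are easy to check mechanically. The tiny Python sketch below evaluates a few (W, R) combinations for N = 3 replicas; the combinations are just examples:

    def strongly_consistent(n: int, w: int, r: int) -> bool:
        """Check the two quorum conditions from the text: W + R > N and W > N/2."""
        return (w + r > n) and (w > n / 2)

    N = 3
    for w, r in [(1, 1), (2, 2), (3, 1), (1, 3)]:
        print(f"N={N} W={w} R={r} -> strongly consistent: {strongly_consistent(N, w, r)}")
    # N=3 W=1 R=1 -> False  (a read may miss the latest write)
    # N=3 W=2 R=2 -> True   (classic QUORUM writes and reads)
    # N=3 W=3 R=1 -> True   (slow writes, fast consistent reads)
    # N=3 W=1 R=3 -> False  (W > N/2 fails: two concurrent writes might not overlap)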

Selecting Quorum Values:

  • The choice of W and R values depends on the application's specific requirements and workload characteristics.
  • Different combinations of W and R are suitable for different scenarios:
  • (W=1, R=1): Suitable for scenarios with very few writes and reads.
  • (W=N, R=1): Ideal for read-heavy workloads: every replica already holds the data, so a single-replica read is fast and still returns the latest acknowledged write, at the cost of slower, less available writes.
  • (W=N/2+1, R=N/2+1): Effective for write-heavy workloads where maintaining strong consistency is crucial.
  • (W=1, R=N): Great for write-heavy workloads where a single client mostly writes to a key.

In summary, quorums in Cassandra provide a flexible and tunable mechanism for achieving the desired balance between consistency and availability based on the specific needs of the application. The choice of quorum values is a crucial decision in designing a Cassandra cluster to meet the performance and data integrity requirements of the application.

Hope you enjoyed reading. I’m always open to suggestions and new ideas. Please write to me :)
