Introduction to NoSQL databases

TL;DR

This video explains NoSQL databases, their advantages, disadvantages, and use cases, comparing them with SQL databases. It then dives into the Cassandra architecture, discussing key concepts like data distribution, replication, quorum, and SSTable compaction. The video highlights how NoSQL databases are optimized for specific scenarios like high write volume and scalability, while also pointing out their limitations in consistency and complex relationships.

NoSQL databases are useful for handling large volumes of data with flexible schemas.
Cassandra's architecture ensures high availability and fault tolerance through data replication and distribution.
Understanding quorum and SSTable compaction is crucial for managing consistency and storage efficiency in NoSQL systems.

Intro [0:00]

The video kicks off with a discussion of NoSQL databases, clarifying that while they're powerful, they aren't a one-size-fits-all solution. It's important to know when to use them and when not to. The speakers point out that while NoSQL is great for scaling, many popular applications like YouTube, Stack Overflow, Instagram, and WhatsApp don't actually use them.

NoSQL explanation and comparison [1:08]

The speaker explains the core difference between SQL and NoSQL databases using an example of storing user data. In SQL, complex objects like addresses are stored in separate tables with foreign key mappings. In contrast, NoSQL stores each record as a single "blob" in JSON format, with nested objects and no foreign keys. This makes NoSQL efficient for insertions and retrievals, since all the relevant data sits in one block and no join is needed. The schema is flexible, so a new field can be added without an expensive column addition across the whole table. NoSQL databases have horizontal partitioning built in, which makes them focused on availability and built for scale. They are also built for aggregations, such as computing metrics and deriving insights from the data.
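To make the contrast concrete, here is a minimal Python sketch of the same user stored both ways. Plain lists and dicts stand in for the databases; the names (`users_table`, `addresses_table`, `document_store`) and the sample values are hypothetical and not taken from the video.

```python
import json

# Relational layout: the address lives in its own table and is
# linked to the user through a foreign key (address_id).
users_table = [{"id": 1, "name": "Alice", "address_id": 10}]
addresses_table = [{"id": 10, "city": "Pune", "country": "India"}]


def load_user_relational(user_id):
    """Reassemble a user by following the foreign key (a join, in SQL terms)."""
    user = next(u for u in users_table if u["id"] == user_id)
    address = next(a for a in addresses_table if a["id"] == user["address_id"])
    return {
        "id": user["id"],
        "name": user["name"],
        "address": {"city": address["city"], "country": address["country"]},
    }


# Document layout: the whole user, nested address included,
# is stored as one JSON blob under its key.
document_store = {
    1: json.dumps(
        {"id": 1, "name": "Alice", "address": {"city": "Pune", "country": "India"}}
    )
}


def load_user_document(user_id):
    """A single key lookup returns everything; no second table is touched."""
    return json.loads(document_store[user_id])
```

Both functions return the same user, but the relational version needs two lookups while the document version needs one. The trade-off is that changing the address in the document layout means rewriting the whole blob.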

However, NoSQL databases have disadvantages. They are not designed for frequent updates, which can lead to data inconsistency. ACID properties are not guaranteed, so you generally can't rely on transactions in a NoSQL database. These databases are not read-optimized, so reads are comparatively slow. They also carry no implicit information about relations between records, which makes joins hard.

NoSQL databases are best used when each record is a blob, there are few updates, and you want to keep all of a record's fields together. They are also useful when you want built-in redundancy or need to run aggregations over the data.

Cassandra Architecture [10:27]

The video transitions into Cassandra's architecture, using a cluster of five nodes where each request is routed to a node based on a hash of the request key. A good hash function distributes requests uniformly across the nodes, maximizing resource utilization. If the initial hash function distributes keys poorly, a two-layer cluster can be used, with a different hash function on the second layer to spread the load more evenly. Data is replicated across multiple nodes to prevent data loss if a node crashes. The hashing scheme makes replication straightforward: copies of the data are stored on the nodes that follow the one the key hashes to. This setup provides load balancing and redundancy, improving both read speed and durability.
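The routing and replication scheme described above can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's actual implementation: it uses SHA-256 and a plain modulo over five numbered nodes, whereas Cassandra assigns token ranges on a ring. The names `NODE_COUNT`, `REPLICATION_FACTOR`, `primary_node`, and `replica_nodes` are assumptions made for this example.

```python
import hashlib

NODE_COUNT = 5          # cluster size used in the video's example
REPLICATION_FACTOR = 3  # assumed number of copies kept per key


def primary_node(key, node_count=NODE_COUNT):
    """Hash the request key and map it onto one of the nodes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def replica_nodes(key, node_count=NODE_COUNT, replication_factor=REPLICATION_FACTOR):
    """Store copies on the primary node and on the nodes that follow it."""
    first = primary_node(key, node_count)
    return [(first + offset) % node_count for offset in range(replication_factor)]
```

Calling `replica_nodes("user:42")` returns three consecutive node indices, wrapping around after node 4. Because the hash is deterministic, any node that receives a request can compute where the key lives without a central lookup table.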

Quorum [18:00]

The discussion moves to distributed consensus, particularly the concept of quorum. With a replication factor of three, data is copied across multiple nodes. Quorum is a mechanism where multiple nodes agree on a value to return to the user. For example, if a write operation is in progress and a node crashes, the remaining nodes use timestamps to determine the latest data version. A quorum of two out of three nodes is often used, meaning if two nodes agree on a value, that value is considered the truth. While this approach carries a small risk of returning incorrect data, it prioritizes availability over strict consistency.
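As a rough illustration of the quorum read described above, the following Python sketch assumes each replica replies with a value and a write timestamp, or with `None` if it is down. It follows the video's simplified description (majority agreement, with timestamps as a fallback) rather than Cassandra's real read path. `Versioned` and `quorum_read` are hypothetical names.

```python
from collections import Counter
from typing import NamedTuple


class Versioned(NamedTuple):
    value: str
    timestamp: int  # write time; a larger number means a newer write


def quorum_read(responses, replication_factor=3):
    """Return the value a majority of replicas agree on.

    `responses` holds one entry per replica; None marks a crashed
    or unreachable replica.
    """
    quorum = replication_factor // 2 + 1  # 2 out of 3
    live = [r for r in responses if r is not None]
    if len(live) < quorum:
        raise RuntimeError("not enough replicas responded to form a quorum")

    answer, votes = Counter(live).most_common(1)[0]
    if votes >= quorum:
        return answer
    # No majority: fall back to the most recent write by timestamp.
    return max(live, key=lambda r: r.timestamp)
```

With replies `[old, old, new]`, the two stale replicas outvote the fresh one and the stale value is returned. That is the small risk of incorrect data that the quorum approach accepts in exchange for availability.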

Compaction of SSTables [21:30]

Finally, the video explains how Cassandra stores and writes data using SSTables (Sorted String Tables). Incoming writes are first appended to a log and buffered in memory, which is fast because the writes are sequential. Periodically, the in-memory data is flushed to SSTables, where the keys are sorted. SSTables are immutable, so an update writes a new record instead of modifying the old one. To reclaim the space taken up by duplicate keys, Cassandra runs compaction, which merges several SSTables into one. Because the inputs are already sorted, compaction is an O(n) operation, similar to the merge step of merge sort. Deleted records are marked with tombstones, which tell reads to ignore dead records and allow compaction to remove them.
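The merge at the heart of compaction can be sketched as follows. This is a minimal Python illustration under stated assumptions: each SSTable is a list of `(key, value)` pairs sorted by key, a value of `None` stands in for a tombstone, and the two tables passed in are the only ones that exist, so tombstones can be dropped right away. The names `TOMBSTONE` and `compact` are hypothetical.

```python
TOMBSTONE = None  # assumed marker for a deleted record


def compact(older, newer):
    """Merge two key-sorted SSTables in one linear pass.

    When both tables contain the same key, the record from the
    newer table wins. Tombstoned keys are dropped from the result.
    """
    merged = []
    i = j = 0
    while i < len(older) and j < len(newer):
        old_key, new_key = older[i][0], newer[j][0]
        if old_key < new_key:
            merged.append(older[i])
            i += 1
        elif old_key > new_key:
            merged.append(newer[j])
            j += 1
        else:
            # Duplicate key: keep only the newer record.
            merged.append(newer[j])
            i += 1
            j += 1
    merged.extend(older[i:])
    merged.extend(newer[j:])
    # Remove dead records now that newer data has shadowed older data.
    return [(key, value) for key, value in merged if value is not TOMBSTONE]
```

For example, merging `[("a", "1"), ("b", "2"), ("d", "4")]` with the newer `[("b", "20"), ("c", "3"), ("d", None)]` keeps the updated `b`, adds `c`, and removes `d`. Each input record is visited once, which is why the cost is linear in the total number of records.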