Introduction
Wide column database systems store data in column families. Below is a small example of a NoSQL database:
Characterstics
Although, these relations look like tables, but they’re not actually tables. This is because, we cannot make arbitrary selections based on non-primary columns. For example, the following query will fail:
Therefore, similar to key-value stores, we can only query by using primary keys.
A primary key in a wide column database consists of one or more partition keys and zero or more clustering keys (or sorting keys).
Wide column databases allow horizontal partitioning of data among multiple nodes enabling horizontal scalability.
Data modeling
One important point to consider is that wide column databases store data in denormalized form. This means, that all data related to an object is stored along with that object at the same place.
For example, in a traditional relational database schema. Consider an entity user:
|
|
And an entity called comments:
|
|
In this scheme, we can retrieve the comments made by a particular user using the following query:
|
|
However, this does not work in wide column databases like cassandra. This is because, they do not support joins.
We need to redefine the schema as follows:
|
|
Now we should be able to retrieve the comments using the following query:
|
|
Since we no longer need to perform joins, retrieving data in wide column databases is much faster and efficient. However, storing data in a denormalized form has its own problems. For example we usually need to store duplicate data when modeling data in denormalized form. This makes it hard to keep copies in sync leading to data inconsistencies.
Also, you need to model you data such that the column you want to search over for a record (say user_id) needs to be the partition key. For example, suppose you need to run the following query:
|
|
Now, you need to model your data such that search_column is the partition key.
But what if you need to search over multiple columns?
In that case, you would need to create another table which is campatible with your query pattern. Below is a simple example illustrating this:
Data partioning
Since partitioning is embeded into the data model through partition keys. Horizontal scalability is natural to wide column databases. We can scale up or down depending on the demand simply by adding removing nodes.
But what happens, when a node is added or removed. Is the data redistributed across all nodes? No. This is not the case. For example, cassandra uses consistent hashing technique to minimize the data transfer in case of addition/removal of nodes. Each node also has a doppelganger (german for identical double), on the hash ring. The duplicate representations of a node on the hash ring are called virtual nodes.
Write performance
Widecolumn databases provide much better write performance compared to traditional relational databases. This is primarily because of 2 reasons:
- Write optimized storage mechanisms: Wide column databases are
optimized for writes. They use special data structures to achieve
this. For example, cassandra uses Log Structured Merge Trees
(commonly called LSM trees) and SStables. These data structures delay
the
flush
system calls, combining multiple writes and improving performance.
- Since, data is stored partioned in multiple nodes. Writes can be executed in parallel improving performance.
Note: The content of this article is based on this video which is also a great resource to learn about other types of database systems