
Hadoop Tutorial

Apache Cassandra

Cassandra is a distributed database management system designed to handle large volumes of structured data on commodity servers. Data is replicated to more than one machine according to a replication factor, which gives the system high availability. Because every piece of data is stored on multiple nodes, there is no single point of failure, which is Cassandra's main feature.

Hardware can fail at any time and any node can go down, so Cassandra's architecture was built to tolerate this situation. When a node fails, the copy of the data stored on another node is used instead. This is a consequence of Cassandra's peer-to-peer distributed architecture, in which data is spread across many nodes. All nodes use the Gossip protocol to exchange information with each other.
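The idea that a copy on another node covers for a failed node can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's actual placement logic: the `Ring` class, the `token` helper, the node names, and the use of MD5 are all assumptions made for the example, and real clusters use a configurable partitioner and replication strategy.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: each key is stored on `replication_factor` consecutive nodes."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed the number of nodes")
        self.replication_factor = replication_factor
        # Sort nodes by their ring position so we can walk the ring clockwise.
        self.ring = sorted((token(name), name) for name in nodes)
        self.tokens = [position for position, _ in self.ring]

    def replicas(self, key):
        """Return the nodes that hold a copy of `key`."""
        start = bisect_right(self.tokens, token(key)) % len(self.ring)
        return [
            self.ring[(start + offset) % len(self.ring)][1]
            for offset in range(self.replication_factor)
        ]

    def readable_from(self, key, down):
        """Return the replicas of `key` that are still up."""
        return [node for node in self.replicas(key) if node not in down]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")
# Simulate the first replica going down: the other copies remain readable.
survivors = ring.readable_from("user:42", down={owners[0]})
print("replicas:", owners)
print("still readable from:", survivors)
```

With a replication factor of 3, losing one replica still leaves two nodes that can serve the key, which is the behaviour the paragraph above describes.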

Features of Cassandra:

  • Open Source: Cassandra is open source and has a large community, which gives people a platform to discuss their questions and points of view. Cassandra can also be integrated with other Apache open-source projects such as Hadoop, Apache Pig, and Apache Hive.

  • Peer-to-peer Architecture: In a master-slave architecture there is a central unit, and the remaining units communicate with it. Apache Cassandra instead follows a peer-to-peer architecture, in which the nodes communicate with each other, so there is no single point of failure. Besides, any number of nodes can be added to a cluster in any of the data centres, which gives Cassandra a robust architecture.

  • Elastic Scalability: One of Cassandra's most significant advantages is that the cluster can easily be scaled up or down. Nodes can be added to or removed from the cluster without disruption, and there is no need to restart the cluster while scaling. As nodes are added, read and write throughput both increase, and applications are not interrupted during the scaling process.

  • Fault Tolerance and High Availability: Cassandra replicates data, which makes it highly fault tolerant and highly available. If one node fails, the data can still be retrieved from the other nodes that hold a replica. The number of replicas is set by the user, and each row is replicated in the cluster based on its row key. Data can also be replicated across multiple data centres, which supports backup and recovery.

  • High Performance: Cassandra is known for high performance compared with many other NoSQL databases. It was designed to utilize the capabilities of multi-core machines. Cassandra has also proven reliable with large data sets, so many organizations that deal with huge amounts of data on a daily basis, and cannot afford to lose that data, use it.

  • Schema-Free: Cassandra allows columns to be created freely within rows, which is why its data model is described as schema-free. Rows in the same column family need not have the same set of columns, so the columns an application requires do not all have to be defined up front. This flexibility within a column family is one of the essential Cassandra features.

  • Column-Oriented: The data model of Cassandra is column-oriented. In other databases, column names contain only metadata, while in Cassandra column names can contain actual data. Columns within a row are stored sorted by column name, and a row can contain a large number of columns.

  • Tuneable Consistency: Cassandra offers two types of consistency, eventual consistency and strong consistency, and the developer can choose either to suit the requirements. With eventual consistency, the client is acknowledged as soon as the cluster accepts the write. With strong consistency, any update is transmitted to all the nodes or machines that hold the specific data. The two can also be combined, because the consistency level is chosen per operation.
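A common way to reason about the choice between eventual and strong consistency is the replica-overlap rule: if the number of replicas that must acknowledge a write plus the number consulted on a read exceeds the replication factor, every read overlaps with the latest write. The sketch below only illustrates that arithmetic. The function names `quorum` and `is_strongly_consistent` are hypothetical helpers, not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Size of a majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def is_strongly_consistent(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set must overlap every write set (W + R > RF)."""
    return write_replicas + read_replicas > replication_factor


RF = 3  # illustrative replication factor

# Write to one replica, read from one replica: fast, but only eventually consistent.
print(is_strongly_consistent(RF, write_replicas=1, read_replicas=1))

# Quorum writes and quorum reads: the read and write sets always share a replica.
print(is_strongly_consistent(RF, quorum(RF), quorum(RF)))

# Write to all replicas, read from one: also strongly consistent.
print(is_strongly_consistent(RF, write_replicas=RF, read_replicas=1))
```

Mixing levels in this way, for example strict writes with cheap reads, is one sense in which the two kinds of consistency can be combined.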

The architecture of Cassandra:

Cassandra has the following major components:

Node: The basic component of Cassandra. Data is stored on a node.

Data Centre: A collection of related nodes is called a data centre.

Cluster: A cluster is a group of one or more data centres.

Commit Log: Every write operation is written to the commit log, which is used for crash recovery.

Mem-table: After data is written to the commit log, it is written to the Mem-table, an in-memory structure that holds the data temporarily.

SSTable: When the Mem-table reaches a certain threshold, its data is flushed to an SSTable file on disk.

                                                Fig: Architecture of Cassandra
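The write path through the commit log, Mem-table, and SSTable can be sketched as a toy in-memory model. Everything here is an assumption made for illustration: the `ToyStorageEngine` class, its method names, and a threshold counted in keys rather than bytes. Real SSTables are files on disk, which this sketch stands in for with sorted dictionaries.

```python
class ToyStorageEngine:
    """Toy model of the write path: commit log -> Mem-table -> SSTable."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # number of keys, for simplicity
        self.commit_log = []   # append-only record of writes, for crash recovery
        self.memtable = {}     # temporary in-memory store
        self.sstables = []     # stand-ins for immutable on-disk files

    def write(self, key, value):
        # 1. Record the write in the commit log first.
        self.commit_log.append((key, value))
        # 2. Then apply it to the Mem-table.
        self.memtable[key] = value
        # 3. Flush once the Mem-table reaches the threshold.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the Mem-table out as a sorted, read-only table.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        # Flushed writes no longer need the commit log for recovery.
        self.commit_log.clear()

    def read(self, key):
        # The newest data wins: check the Mem-table, then newest SSTable first.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None


engine = ToyStorageEngine(flush_threshold=2)
engine.write("a", 1)
engine.write("b", 2)   # reaches the threshold, so the Mem-table is flushed
engine.write("a", 3)   # newer value stays in the Mem-table for now
print(engine.read("a"), engine.read("b"), len(engine.sstables))
```

In this model, a read of "a" returns the newer value from the Mem-table while "b" is served from the flushed table, which mirrors the ordering of the components listed above.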