

HDFS Features and Goals

Hadoop doesn't require expensive, highly reliable hardware. It is specifically designed to run on clusters of commonly available commodity hardware that can be purchased from multiple vendors, for which, at least in large clusters, the chance of node failure somewhere in the cluster is high. HDFS is designed to keep working without noticeable interruption to the user in the face of such failures. The following section explains the main features and goals of HDFS:

  • Fault Tolerance: Fault tolerance refers to a system's ability to keep working in unfavourable conditions and to how the system handles such situations. HDFS is highly fault-tolerant. Data is divided into blocks in HDFS, and multiple copies of each block (the number is configurable) are created on different machines in the cluster. So whenever any machine in the cluster goes down, a client can still access their data from another machine that holds a copy of the same blocks. HDFS also replicates data blocks onto another rack, so if a machine fails unexpectedly, a user can still access the data from slaves in another rack.

  • High Availability: HDFS is a highly available file system. Data is replicated among the DataNodes of the HDFS cluster by creating copies of the blocks on the other slaves present in the cluster. Whenever a user wants to access this data, they can read it from the slaves that hold its blocks, via the cluster's nearest node. During unfavourable situations, such as the failure of a node, a user can still access their data from the other nodes, because duplicate copies of the blocks holding the user's data are kept on other nodes in the HDFS cluster. The read sketch after this list shows how a client opens a file without ever choosing a replica itself.

  • Distributed Storage: All of the features above are achieved through distributed storage and replication. HDFS stores data in a distributed manner across the nodes of the cluster: data is divided into blocks, the blocks are stored on the nodes of the HDFS cluster, and replicas of each block are then created and stored on other nodes of the cluster. So if any machine in the cluster fails, we can still access our data from the other nodes that hold a replica, as the block-location sketch after this list illustrates.

  • Data Reliability: HDFS is Hadoop's distributed file system, and it stores data reliably on a cluster of nodes. HDFS partitions the data into blocks, stores these blocks on the nodes of the HDFS cluster, and creates a replica of each block on other nodes in the cluster, which is what provides fault tolerance. If a node containing data goes down, a user can still access that data from the other nodes that hold a copy of the same data. By default, HDFS creates three copies of each block across the nodes of the cluster. As a result, data remains easily available, and users do not face the problem of data loss.

  • Replication: Data replication is one of the most significant and unique features of Hadoop HDFS. Replication addresses the problem of data loss in unfavourable conditions such as a node crashing or hardware failure. HDFS manages the replication process at regular intervals and keeps creating replicas of user data on the various machines available in the cluster. So whenever any machine in the cluster crashes, a user can access their data from the other machines that hold blocks of that data, and the chance of data loss is very small. The replication sketch after this list shows how to inspect and change a file's replication factor.

  • Scalability: Because HDFS stores data on multiple nodes, the cluster can be scaled as requirements grow. Two scalability mechanisms are available in HDFS: vertical scalability, which adds more resources (disk, memory, CPU) to the existing nodes of the cluster, and horizontal scalability, which adds more machines to the cluster. The horizontal approach is preferred because the cluster can be scaled from tens of nodes to hundreds of nodes on the fly, without any downtime. The DataNode-listing sketch after this list is one way to see which nodes currently make up the cluster.
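
The read sketch below uses Hadoop's Java FileSystem API. It is a minimal sketch, not production code: /data/sample.txt is a placeholder path, and it assumes the client's fs.defaultFS configuration already points at the cluster's NameNode. The important point is that the client names the file by path only; the HDFS client library reads each block from the closest reachable replica and fails over to another replica if a DataNode dies mid-read.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster's default file system; this assumes
        // fs.defaultFS is configured to point at the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The client addresses the file by path only. The NameNode
        // returns block locations, and the client library reads each
        // block from the closest available replica, retrying against
        // another replica if a DataNode fails mid-read.
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```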
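
The block-location sketch makes the distributed layout visible: it asks the NameNode which blocks make up a file and which DataNodes hold a replica of each block. It makes the same assumptions as the read sketch, and the path is again a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path; substitute any file that exists in your cluster.
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which blocks make up the file and which
        // DataNodes hold a replica of each block.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```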
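
The replication sketch reads the cluster-wide default replication factor (the dfs.replication property, 3 unless overridden) and a file's current factor, then raises the per-file factor. The target of 5 is an arbitrary illustration; the NameNode schedules the extra copies in the background.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Cluster-wide default, governed by dfs.replication (3 unless overridden).
        int defaultReplication = conf.getInt("dfs.replication", 3);
        System.out.println("Default replication: " + defaultReplication);

        // Placeholder path, as in the sketches above.
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication of " + file + ": "
                + status.getReplication());

        // Raise the replication target for this one file to 5 (an
        // arbitrary example value); the NameNode creates the extra
        // copies asynchronously.
        fs.setReplication(file, (short) 5);
        fs.close();
    }
}
```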
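
Finally, the DataNode-listing sketch prints the DataNodes currently registered with the NameNode, which is one way to watch the cluster grow as machines are added horizontally. It assumes the default file system really is HDFS, since getDataNodeStats is only available on DistributedFileSystem.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // getDataNodeStats is specific to HDFS, so the generic
        // FileSystem handle is narrowed to DistributedFileSystem.
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        for (DatanodeInfo node : dfs.getDataNodeStats()) {
            System.out.printf("%s capacity=%d remaining=%d%n",
                    node.getHostName(), node.getCapacity(),
                    node.getRemaining());
        }
        dfs.close();
    }
}
```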