Hadoop Tutorial

HDFS Architecture

Hadoop Distributed File System (HDFS) is a block-structured file system in which each file is split into blocks of a predetermined size. These blocks are stored across a cluster of one or more machines. The HDFS architecture follows a master/slave design: a cluster comprises a single NameNode as the master, and all the other nodes, the DataNodes, are slaves. HDFS can be deployed on a wide range of machines that support Java. Although several DataNodes can run on a single machine, in practice they are spread across multiple machines.

                                    Fig: Hadoop Architecture

NameNode:

The NameNode is the master node in the HDFS architecture; it maintains and manages the blocks present on the DataNodes. The NameNode, which is a highly available server, manages the file system namespace and controls clients' access to files. The HDFS architecture is designed so that user data never resides on the NameNode; data resides on the DataNodes only.

Functions of NameNode:

  • It is the master node that maintains and manages the slaves (DataNodes).

  • It records the metadata of all the files stored in the cluster. Metadata usually contains the location of the stored blocks, permissions, the directory hierarchy, the size of the files, etc. Two files are associated with the metadata:

    • The FsImage file contains the inodes and the list of blocks that define the metadata. It holds a complete snapshot of the file system's metadata at a given point in time.

    • The EditLog (edits) file contains the history of modifications made since the last FsImage. Incremental changes, such as appending data or renaming a file, are tracked in the edit log to ensure durability, instead of creating a new FsImage snapshot every time the namespace is altered.

  • It records every update to the metadata of the file system. For example, if a file is removed from HDFS, the NameNode immediately records this in the EditLog.

  • It receives a Heartbeat and a block report from all the DataNodes in the cluster to confirm that the DataNodes are alive.

  • It keeps track of all blocks in HDFS and also records which nodes are associated with which blocks.

  • The NameNode is also responsible for taking care of the replication factor of all the blocks.

  • In the event of a DataNode failure, the NameNode selects new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
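
The bookkeeping described in the list above can be illustrated with a small in-memory model. This is only a sketch and not Hadoop's implementation: the class ToyNameNode and its method names are hypothetical, and a block report is assumed to be a plain list of block IDs.

```python
from collections import defaultdict


class ToyNameNode:
    """Hypothetical, simplified model of the NameNode's block bookkeeping."""

    def __init__(self, replication_factor=3):
        self.replication_factor = replication_factor
        # block ID -> set of DataNode names that reported holding a replica
        self.block_locations = defaultdict(set)

    def receive_block_report(self, datanode, block_ids):
        """Record which blocks a DataNode says it stores."""
        # Forget what this node reported earlier, then apply the new report.
        self.remove_datanode(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def remove_datanode(self, datanode):
        """Drop a failed (or re-reporting) DataNode from every location set."""
        for nodes in self.block_locations.values():
            nodes.discard(datanode)

    def under_replicated_blocks(self):
        """Return the blocks with fewer live replicas than the target."""
        return sorted(
            block_id
            for block_id, nodes in self.block_locations.items()
            if len(nodes) < self.replication_factor
        )


namenode = ToyNameNode(replication_factor=3)
namenode.receive_block_report("dn1", ["blk_1", "blk_2"])
namenode.receive_block_report("dn2", ["blk_1", "blk_2"])
namenode.receive_block_report("dn3", ["blk_1", "blk_2"])
print(namenode.under_replicated_blocks())  # []

namenode.remove_datanode("dn3")  # simulate a DataNode failure
print(namenode.under_replicated_blocks())  # ['blk_1', 'blk_2']
```

Once a block shows up as under-replicated, a real NameNode would schedule new replicas on other DataNodes; the sketch stops at detecting the shortfall.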

DataNode:

DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware, that is, an inexpensive system that does not need to be of high quality or high availability. The DataNode, which is a block server, stores the data in a local file system such as ext3 or ext4.

Functions of DataNode:

  • DataNodes are slave daemons (processes) that run on each slave machine.

  • DataNodes store the actual data.

  • DataNodes serve low-level read and write requests from the file system's clients.

  • They periodically send heartbeats to the NameNode to report HDFS' overall health. By default, this heartbeat frequency is set to 3 seconds.
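
As a rough illustration of how heartbeats turn into a liveness decision, the sketch below treats a DataNode as dead once no heartbeat has arrived within a timeout. The 3-second interval comes from the text above. The 5-minute recheck interval and the formula 2 * recheck + 10 * heartbeat are assumptions based on the commonly documented defaults of dfs.namenode.heartbeat.recheck-interval and dfs.heartbeat.interval, and should be checked against the Hadoop version in use. The function names are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 3     # default heartbeat frequency from the text
RECHECK_INTERVAL_S = 5 * 60  # assumed default recheck interval (5 minutes)


def dead_node_timeout(heartbeat_s=HEARTBEAT_INTERVAL_S,
                      recheck_s=RECHECK_INTERVAL_S):
    """Seconds of silence after which a DataNode is treated as dead.

    The formula is an assumption, not taken from the tutorial text.
    """
    return 2 * recheck_s + 10 * heartbeat_s


def is_alive(last_heartbeat_s, now_s, timeout_s=None):
    """Return True if the last heartbeat is recent enough."""
    if timeout_s is None:
        timeout_s = dead_node_timeout()
    return (now_s - last_heartbeat_s) <= timeout_s


print(dead_node_timeout())                      # 630
print(is_alive(last_heartbeat_s=0, now_s=600))  # True
print(is_alive(last_heartbeat_s=0, now_s=631))  # False
```

Under these assumptions, a single missed heartbeat does not mark a node as dead; the NameNode waits roughly ten and a half minutes before treating the DataNode as lost.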

Secondary NameNode:

Apart from the two main daemons, there is a third process called the Secondary NameNode. The Secondary NameNode acts as a helper daemon to the NameNode.

Functions of Secondary NameNode:

  • The Secondary NameNode reads the file system metadata from the NameNode's memory and writes it to the local file system on disk.

  • It combines the EditLogs with FsImage from the NameNode.

  • It regularly downloads the EditLogs from the NameNode and applies them to the FsImage. The new FsImage is copied back to the NameNode and is used the next time the NameNode starts.

The Secondary NameNode thus performs regular checkpoints in HDFS, which is why it is also referred to as the CheckpointNode.
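
The checkpoint step, merging the EditLog into the FsImage, can be sketched as follows. This is a toy model under stated assumptions: the namespace is represented as a plain dictionary from path to block list, and the edit records are tuples in an invented format. The real FsImage and EditLog are binary files with a much richer set of operations, and the function names here are hypothetical.

```python
def apply_edits(fsimage, edits):
    """Return a new namespace snapshot with the logged edits applied in order."""
    namespace = dict(fsimage)  # copy, so the old snapshot stays untouched
    for op, *args in edits:
        if op == "create":
            path, blocks = args
            namespace[path] = list(blocks)
        elif op == "append":
            path, blocks = args
            namespace[path] = namespace[path] + list(blocks)
        elif op == "rename":
            src, dst = args
            namespace[dst] = namespace.pop(src)
        elif op == "delete":
            (path,) = args
            del namespace[path]
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


def checkpoint(fsimage, edits):
    """Merge the edit log into the snapshot and start a fresh, empty edit log."""
    return apply_edits(fsimage, edits), []


fsimage = {"/logs/a.txt": ["blk_1"]}
edits = [
    ("create", "/logs/b.txt", ["blk_2"]),
    ("append", "/logs/a.txt", ["blk_3"]),
    ("rename", "/logs/b.txt", "/logs/c.txt"),
    ("delete", "/logs/a.txt"),
]
new_fsimage, new_edits = checkpoint(fsimage, edits)
print(new_fsimage)  # {'/logs/c.txt': ['blk_2']}
print(new_edits)    # []
```

Because the merged snapshot already contains every logged change, a restart only has to load the new FsImage and replay whatever was logged after the checkpoint, rather than the full history.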

Blocks:

Generally, in any file system, data is saved as a collection of blocks. Similarly, HDFS stores each file as blocks that are scattered throughout the Hadoop cluster. The default block size is 128 MB in Apache Hadoop 3.x, and it can be configured as required.

HDFS does not always use the full block size for every block. A file is split into blocks of the configured size (128 MB, 256 MB, etc.), and only the last block may be smaller. Suppose we have a file of size 514 MB. With the default block size of 128 MB, 5 blocks (4 * 128 MB + 1 * 2 MB) will be created for the file. The first four blocks will be 128 MB each, but the last one will be only 2 MB.
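
The arithmetic for the 514 MB example can be checked with a few lines of Python. This is a minimal sketch; the function name split_into_blocks is hypothetical and is not part of any Hadoop API.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


sizes = split_into_blocks(514 * MB)
print(len(sizes))                      # 5
print([size // MB for size in sizes])  # [128, 128, 128, 128, 2]
```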

Replication Management:

HDFS provides a reliable way to store massive amounts of data as data blocks in a distributed environment. To provide fault tolerance, the blocks are also replicated. The default replication factor is three, which is also configurable. As the figure shows, each block is replicated three times (assuming the default replication factor), and the replicas are stored on different DataNodes.

                                                      Fig: HDFS Replication

Therefore, if we want to store a 128 MB file in HDFS using the default configuration, we end up consuming 384 MB (3 * 128 MB) of storage, because the block is replicated three times and each replica resides on a separate DataNode.
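
The same calculation can be written as a one-line helper. This is a sketch only; raw_storage_mb is a hypothetical name, and the figure it returns ignores metadata and any other overhead.

```python
def raw_storage_mb(file_size_mb, replication_factor=3):
    """Total cluster storage (MB) consumed by a file, counting every replica."""
    return file_size_mb * replication_factor


print(raw_storage_mb(128))                        # 384
print(raw_storage_mb(514))                        # 1542
print(raw_storage_mb(128, replication_factor=2))  # 256
```

Note that the 514 MB file from the earlier example consumes 1542 MB rather than 5 * 128 * 3 MB, because its last block occupies only 2 MB per replica.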

Rack Awareness:

In a large Hadoop cluster, to reduce network traffic while reading or writing HDFS files, the NameNode chooses DataNodes that are on the same rack as, or on a rack close to, the client making the read/write request. The NameNode obtains rack information by maintaining the rack ID of each DataNode. Rack Awareness in Hadoop is the concept of choosing DataNodes based on this rack information.

In the HDFS architecture, the NameNode ensures that the replicas of a block are not all located on a single rack. It uses the Rack Awareness Algorithm to reduce latency and improve fault tolerance. We know that the default replication factor is 3. Under the default placement policy, the first replica of a block is stored on the writer's local node (or on a node in the local rack), the second on a DataNode in a different rack, and the third on a different DataNode in the same rack as the second. Rack Awareness is essential to improve:

  • High availability and reliability of data.

  • The performance of the cluster.

  • Network bandwidth utilization.
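
A rack-aware placement for a replication factor of three can be sketched as below. The sketch places the first replica on the writer's node, the second on a node in another rack, and the third on a different node in that second rack, which is the ordering given in the Apache Hadoop HDFS architecture guide; the result is three replicas spread over two racks. The function place_replicas, the topology dictionary, and the deterministic "pick the first candidate" rule are all assumptions made for illustration. The real policy also weighs node load and free space and picks candidates at random.

```python
def place_replicas(writer_node, topology):
    """Pick three DataNodes for a block using a simplified rack-aware policy.

    topology maps a rack ID to the list of DataNode names in that rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Replica 1: the writer's own node.
    first = writer_node

    # Replicas 2 and 3: two distinct nodes in one remote rack.
    remote_rack = next(
        (rack for rack in sorted(topology)
         if rack != local_rack and len(topology[rack]) >= 2),
        None,
    )
    if remote_rack is None:
        raise ValueError("this sketch needs a remote rack with two DataNodes")
    second, third = topology[remote_rack][:2]

    return [first, second, third]


topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5"],
}
replicas = place_replicas("dn1", topology)
print(replicas)  # ['dn1', 'dn3', 'dn4']
```

With this layout, the loss of an entire rack still leaves at least one replica available, while only one of the two replication transfers has to cross racks.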