Hadoop Tutorial

Storage layer - HDFS (Hadoop Distributed File System)

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are significant: it is highly fault-tolerant and is intended for deployment on low-cost hardware. It provides high-throughput access to application data and suits applications with large datasets.

HDFS splits each file into smaller units called blocks and stores them across the cluster. It runs two daemons: the Name Node on the master node and the Data Node on each slave node.

HDFS is written in Java, so the Name Node and Data Node can be deployed on any machine with Java installed. In a typical cluster, one dedicated machine runs the Name Node and every other node runs a Data Node. The Name Node holds metadata such as the location of blocks on the Data Nodes.
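As a rough illustration of the kind of metadata the Name Node holds, the sketch below maps a file to its blocks and each block to the Data Nodes that store it. The path, block IDs, node names, and the locate function are all hypothetical, chosen only for this example; they are not part of the HDFS API.

```python
# Hypothetical in-memory metadata, loosely modelled on what the Name Node tracks.
# File path -> ordered list of block IDs that make up the file.
file_to_blocks = {
    "/logs/app.log": ["blk_1", "blk_2"],
}

# Block ID -> Data Nodes that hold a copy of that block.
block_to_nodes = {
    "blk_1": ["datanode-a", "datanode-b"],
    "blk_2": ["datanode-b", "datanode-c"],
}


def locate(path):
    """Return (block_id, data_nodes) pairs for a file, in block order."""
    return [(block_id, block_to_nodes[block_id]) for block_id in file_to_blocks[path]]


if __name__ == "__main__":
    for block_id, nodes in locate("/logs/app.log"):
        print(block_id, "->", ", ".join(nodes))
```

A client reading the file would ask for these locations first and then fetch each block from one of the listed Data Nodes.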

Block in HDFS:

A block is the smallest unit of storage that HDFS allocates to a file: a contiguous chunk of the file's data stored on a Data Node. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later. The size is configurable and is often raised, for example to 256 MB.
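The arithmetic of splitting a file into blocks can be sketched as follows. The block size is a parameter, set here to the 64 MB figure mentioned above; the function name split_into_blocks is hypothetical and only illustrates the calculation, not how HDFS itself is implemented.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file of this size would occupy.

    Every block is full except possibly the last, which holds only the
    remaining data.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    print(split_into_blocks(200))        # [64, 64, 64, 8]
    print(split_into_blocks(200, 256))   # [200]
```

With a 64 MB block size, a 200 MB file occupies four blocks, the last holding 8 MB. With a 256 MB block size, the same file fits in a single block, which means fewer entries for the Name Node to track.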

Name Node:

Every file and directory in the namespace is represented on the Name Node by an inode (index node), which records attributes such as permissions, modification and access times, namespace quota, and disk space quota. The Name Node keeps the complete file system structure in memory. It persists this state in two files, fsimage and edits, so that it survives restarts.

The fsimage file contains the inodes and the list of blocks that define the metadata. It is a complete snapshot of the file system's metadata at a given point in time.

The edits file records the modifications made since the fsimage snapshot was taken. Incremental changes such as appending data or renaming a file are written to the edit log to ensure durability, rather than creating a new fsimage snapshot every time the namespace changes.

When the Name Node starts, it loads the fsimage file and applies the contents of the edits file to recover the latest state of the file system. The problem is that the edits file grows over time, consuming disk space and slowing down restarts. This is where the Secondary Name Node comes in. At regular intervals it fetches the fsimage and edits log from the primary Name Node, loads them into main memory, and applies each operation in the edits log to the fsimage. It then copies the resulting fsimage back to the primary Name Node, which uses it in place of the old one.
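The replay of the edits log onto the fsimage can be sketched in a few lines. This is a minimal model under stated assumptions: the namespace is a plain dictionary from path to block IDs, and the operation names (create, append, rename, delete) and the functions apply_edits and checkpoint are hypothetical. The real on-disk formats are far richer.

```python
def apply_edits(fsimage, edits):
    """Return a new snapshot with every logged operation applied in order.

    fsimage: dict mapping a path to its list of block IDs (assumed model).
    edits:   list of tuples such as ("rename", "/a", "/b") (assumed format).
    """
    namespace = dict(fsimage)  # work on a copy; the input snapshot is untouched
    for op, *args in edits:
        if op == "create":
            (path,) = args
            namespace[path] = []
        elif op == "append":
            path, block_id = args
            namespace[path] = namespace[path] + [block_id]  # new list, no aliasing
        elif op == "rename":
            src, dst = args
            namespace[dst] = namespace.pop(src)
        elif op == "delete":
            (path,) = args
            del namespace[path]
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


def checkpoint(fsimage, edits):
    """Merge the log into a fresh snapshot and return it with an empty log."""
    return apply_edits(fsimage, edits), []


if __name__ == "__main__":
    image = {"/a.txt": ["blk_1"]}
    log = [("append", "/a.txt", "blk_2"), ("rename", "/a.txt", "/b.txt"), ("create", "/c.txt")]
    new_image, new_log = checkpoint(image, log)
    print(new_image)  # {'/b.txt': ['blk_1', 'blk_2'], '/c.txt': []}
    print(new_log)    # []
```

After a checkpoint the log starts empty again, so the next restart has only a short log to replay. That is the reason for merging periodically rather than letting the log grow without bound.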

Data Node:

A Data Node manages the storage of an HDFS node and serves the blocks held on it. A Data Node machine can run CPU-intensive jobs such as semantic and language analysis, statistics, and machine learning tasks, as well as I/O-intensive jobs such as clustering, data import, data export, search, decompression, and indexing. It therefore needs substantial I/O capacity for data processing and transfer.

On startup, every Data Node connects to the Name Node and performs a handshake that verifies the Data Node's namespace ID and software version. If either does not match, the Data Node shuts down automatically. A Data Node reports the block replicas it holds by sending a block report to the Name Node; the first block report is sent as soon as the Data Node registers. Each Data Node also sends a heartbeat to the Name Node every 3 seconds by default to confirm that it is operating and that the block replicas it hosts are available.
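The handshake check and heartbeat bookkeeping described above might look roughly like this. It is a toy model, not the HDFS protocol: the class NameNodeSketch, its method names, and the 30-second timeout after which a node is treated as unavailable are assumptions made for the example. Only the 3-second heartbeat interval comes from the text.

```python
HEARTBEAT_INTERVAL_S = 3   # interval quoted in the text
TIMEOUT_S = 30             # hypothetical cutoff chosen for this sketch


class NameNodeSketch:
    """Toy model of the Name Node's view of its Data Nodes."""

    def __init__(self, namespace_id, version):
        self.namespace_id = namespace_id
        self.version = version
        self.last_heartbeat = {}  # node ID -> time of the latest heartbeat (seconds)

    def handshake(self, namespace_id, version):
        """Accept a Data Node only if both namespace ID and version match."""
        return namespace_id == self.namespace_id and version == self.version

    def heartbeat(self, node_id, now):
        """Record that node_id reported in at time now."""
        self.last_heartbeat[node_id] = now

    def live_nodes(self, now, timeout=TIMEOUT_S):
        """Return the nodes heard from within the last timeout seconds."""
        return sorted(
            node for node, seen in self.last_heartbeat.items() if now - seen <= timeout
        )


if __name__ == "__main__":
    nn = NameNodeSketch(namespace_id=42, version="3.3")
    print(nn.handshake(42, "3.3"))   # True
    print(nn.handshake(42, "2.10"))  # False: the Data Node would shut down
    nn.heartbeat("datanode-a", now=0)
    nn.heartbeat("datanode-b", now=0)
    nn.heartbeat("datanode-a", now=HEARTBEAT_INTERVAL_S)
    print(nn.live_nodes(now=32))     # ['datanode-a']
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to test.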