Characteristics of Big Data
Application of Big Data Processing
Introduction to BIG DATA
Where to get Big Data?
Types of Big Data
Storage layer - HDFS (Hadoop Distributed File System)
MapReduce
YARN
How Hadoop works?
Hadoop Eco System
Hadoop Architecture
Hadoop Installation & Environment Setup
Setting Up A Single Node Hadoop Cluster
Ubuntu User Configuration
SSH Setup With Key Generation
Disable IPv6
Download and Install Hadoop 3.1.2
Working with Configuration Files
Start The Hadoop instances
Hadoop Distributed File System (HDFS)
HDFS Features and Goals
HDFS Architecture
Read Operations in HDFS
Write Operations In HDFS
HDFS Operations
YARN
YARN Features
YARN Architecture
Resource Manager
Node Manager
Application Master
Container
Application Workflow in Hadoop YARN
Hadoop MapReduce
How MapReduce Works?
MapReduce Examples with Python
Running The MapReduce Program & Storing The Data File To HDFS
Create A Python Script
Hadoop Environment Setup
Execute The Script
Apache Hive Definition
Why Apache Hive?
Features Of Apache Hive
Hive Architecture
Hive Metastore
Hive Query Language
SQL vs Hive
Hive Installation
Apache Pig Definition
MapReduce vs. Apache Pig vs. Hive
Apache Pig Architecture
Installation Process Of Apache Pig
Execute Apache Pig Script
Hadoop Eco Components
NoSQL Data Management
Apache HBase
Apache Cassandra
MongoDB
Introduction To Kafka
The Architecture of Apache Flume
Apache Spark Ecosystem
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are significant: it is highly fault-tolerant and intended for deployment on low-cost hardware. It provides high-throughput access to application data and suits applications with large datasets.
HDFS divides each file into smaller units called blocks and stores them across the cluster. It runs two daemons: the Name Node on the master node and the Data Node on each slave node.
HDFS is written in Java, so the Name Node and Data Node can be deployed on any machine that has Java installed. In a typical deployment, one dedicated machine runs the Name Node and every other node in the cluster runs a Data Node. The Name Node holds metadata such as the location of blocks on the Data Nodes.
Block in HDFS:
A block is the smallest unit of storage that HDFS allocates to a file, stored as one contiguous piece of data. The default block size was 64 MB in Hadoop 1.x and is 128 MB in Hadoop 2.x and later; it can be raised, for example to 256 MB, through the dfs.blocksize property.
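To make the block arithmetic concrete, here is a minimal sketch of how a file's size maps to a list of block sizes. The function name split_into_blocks is hypothetical and exists only for illustration; the 128 MB default is an assumption that matches Hadoop 2.x and later. The sketch models only the size calculation, not how HDFS places or replicates blocks.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes in bytes of the blocks a file would be split into.

    Hypothetical helper for illustration; block_size defaults to 128 MB,
    an assumed value that can be changed per cluster or per file.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(300 * MB)
print(len(blocks), [size // MB for size in blocks])  # 3 [128, 128, 44]
```

A 300 MB file therefore occupies two full blocks plus one 44 MB block. Passing block_size=256 * MB shows the effect of a larger block size: the same file fits in two blocks.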
Name Node:
All files and directories in the namespace are represented on the Name Node by inodes (index nodes), which record attributes such as permissions, modification timestamp, disk space quota, namespace quota and access times. The Name Node keeps the complete file system structure in memory. The fsimage and edits files persist that structure across restarts.
The fsimage file contains the inodes and the list of blocks that define the metadata. It holds a complete snapshot of the file system's metadata at a given point in time.
The edits file records the modifications made since the fsimage was written. Incremental changes such as appending data or renaming a file are tracked in the edit log to ensure durability, instead of creating a new fsimage snapshot every time the namespace is altered.
When the Name Node starts, it loads the fsimage file and applies the contents of the edits file to recover the latest state of the file system. The drawback is that the edits file grows over time, consuming disk space and slowing down restarts. This is where the Secondary Name Node helps. At regular intervals it fetches the fsimage and edits log from the primary Name Node, loads both into main memory, and applies each operation from the edits log to the fsimage. It then copies the resulting fsimage back to the primary Name Node, which replaces its old fsimage with it.
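The relationship between fsimage, the edits log and a checkpoint can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the namespace is a plain dictionary from path to block IDs, the operation names (create, append, rename, delete) are simplified stand-ins, and the functions replay and checkpoint are hypothetical. Real HDFS uses binary on-disk formats and many more operation types.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace (toy model)."""
    op, *args = edit
    if op == "create":
        (path,) = args
        namespace[path] = []
    elif op == "append":
        path, block_id = args
        namespace[path].append(block_id)
    elif op == "rename":
        src, dst = args
        namespace[dst] = namespace.pop(src)
    elif op == "delete":
        (path,) = args
        del namespace[path]
    else:
        raise ValueError(f"unknown operation: {op}")


def replay(fsimage, edits):
    """Rebuild the latest namespace: load the snapshot, then apply each edit."""
    namespace = copy.deepcopy(fsimage)
    for edit in edits:
        apply_edit(namespace, edit)
    return namespace


def checkpoint(fsimage, edits):
    """Merge the edits into a new snapshot and start with an empty edits log."""
    return replay(fsimage, edits), []


fsimage = {"/logs/a.txt": ["blk_1"]}
edits = [
    ("create", "/logs/b.txt"),
    ("append", "/logs/b.txt", "blk_2"),
    ("rename", "/logs/a.txt", "/logs/old.txt"),
]

state_before = replay(fsimage, edits)
new_fsimage, new_edits = checkpoint(fsimage, edits)

# The checkpoint changes how the state is stored, not the state itself.
assert replay(new_fsimage, new_edits) == state_before
print(new_fsimage, len(new_edits))
```

After the checkpoint, a restart only has to load the new snapshot because the edits log is empty. This is why periodic checkpoints keep restart time from growing with the log.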
Data Node:
A Data Node manages the state of an HDFS node and serves the blocks stored on it. The machine hosting a Data Node may also run CPU-intensive jobs such as semantic and language analysis, statistics and machine learning tasks, as well as I/O-intensive jobs such as clustering, data import, data export, search, decompression and indexing. A Data Node therefore needs high I/O capacity for data processing and transfer.
On startup, every Data Node connects to the Name Node and performs a handshake that verifies the Data Node's namespace ID and software version. If either does not match, the Data Node shuts down automatically. A Data Node reports the block replicas it holds by sending a block report to the Name Node; the first block report is sent as soon as the Data Node registers. After that, the Data Node sends a heartbeat to the Name Node every 3 seconds (the default interval) to confirm that it is operating and that the block replicas it hosts are available.
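The handshake, block report and heartbeat sequence can be illustrated with a small simulation. This is a sketch under stated assumptions, not the HDFS protocol: the NameNode class and its methods are hypothetical, time is passed in as a plain number instead of being read from a clock, and the DEAD_AFTER timeout is an invented value chosen to keep the example short.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL = 3.0  # seconds between heartbeats, as described in the text
DEAD_AFTER = 30.0         # assumed timeout for this sketch, not an HDFS default


@dataclass
class NameNode:
    namespace_id: str
    software_version: str
    last_heartbeat: dict = field(default_factory=dict)  # node id -> last time seen
    block_map: dict = field(default_factory=dict)       # block id -> set of node ids

    def handshake(self, namespace_id, software_version):
        """Accept a Data Node only if both identifiers match."""
        return (namespace_id == self.namespace_id
                and software_version == self.software_version)

    def register(self, node_id, block_report, now):
        """Record a newly registered Data Node and its first block report."""
        self.last_heartbeat[node_id] = now
        for block_id in block_report:
            self.block_map.setdefault(block_id, set()).add(node_id)

    def heartbeat(self, node_id, now):
        """Note that the Data Node was alive at time `now`."""
        self.last_heartbeat[node_id] = now

    def live_nodes(self, now):
        """Return the Data Nodes heard from within the timeout."""
        return sorted(node for node, seen in self.last_heartbeat.items()
                      if now - seen <= DEAD_AFTER)


name_node = NameNode(namespace_id="ns-1", software_version="3.1.2")

# A Data Node with a mismatched namespace ID fails the handshake.
print(name_node.handshake("ns-1", "3.1.2"), name_node.handshake("ns-2", "3.1.2"))

name_node.register("dn-1", ["blk_1", "blk_2"], now=0.0)
name_node.register("dn-2", ["blk_2"], now=0.0)

# dn-1 keeps sending heartbeats; dn-2 goes silent after registering.
for tick in range(1, 4):
    name_node.heartbeat("dn-1", now=tick * HEARTBEAT_INTERVAL)

print(name_node.live_nodes(now=35.0))  # ['dn-1']
```

In this model the Name Node stops counting dn-2 as live once no heartbeat has arrived within the timeout, while block_map still shows which replicas it last reported.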