Characteristics of Big Data
Application of Big Data Processing
Introduction to Big Data
Where to get Big Data?
Types of Big Data
Storage layer - HDFS (Hadoop Distributed File System)
MapReduce
YARN
How Hadoop Works?
Hadoop Ecosystem
Hadoop Architecture
Hadoop Installation & Environment Setup
Setting Up A Single Node Hadoop Cluster
Ubuntu User Configuration
SSH Setup With Key Generation
Disable IPv6
Download and Install Hadoop 3.1.2
Working with Configuration Files
Start The Hadoop instances
Hadoop Distributed File System (HDFS)
HDFS Features and Goals
HDFS Architecture
Read Operations in HDFS
Write Operations In HDFS
HDFS Operations
YARN
YARN Features
YARN Architecture
Resource Manager
Node Manager
Application Master
Container
Application Workflow in Hadoop YARN
Hadoop MapReduce
How MapReduce Works?
MapReduce Examples with Python
Running The MapReduce Program & Storing The Data File To HDFS
Create A Python Script
Hadoop Environment Setup
Execute The Script
Apache Hive Definition
Why Apache Hive?
Features Of Apache Hive
Hive Architecture
Hive Metastore
Hive Query Language
SQL vs Hive
Hive Installation
Apache Pig Definition
MapReduce vs. Apache Pig vs. Hive
Apache Pig Architecture
Installation Process Of Apache Pig
Execute Apache Pig Script
Hadoop Ecosystem Components
NoSQL Data Management
Apache HBase
Apache Cassandra
MongoDB
Introduction To Kafka
The Architecture of Apache Flume
Apache Spark Ecosystem
In this section, I will explain the HDFS write operation in detail.
Fig: HDFS Write operation
An HDFS client initiates a write operation by calling the create() method of the DistributedFileSystem object, which creates a new file for writing.
The DistributedFileSystem object connects to the NameNode through an RPC call and initiates the creation of the new file. At this stage, however, the NameNode does not associate any blocks with the file. It is the NameNode's responsibility to verify that the file does not already exist and that the client has the permissions needed to create it. If the file already exists, or the client lacks sufficient authority, the client receives an IOException. Otherwise, the operation succeeds and the NameNode creates a new record for the file.
Once the new record is created in the NameNode, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into HDFS by invoking its write method.
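To make the client-side sequence concrete, here is a minimal sketch using the Hadoop Java FileSystem API. The NameNode URI (hdfs://localhost:9000), the file path, and the payload are assumptions for illustration; create(), write(), and close() are the calls this walkthrough describes.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; match it to fs.defaultFS on your cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // create() asks the NameNode to create the file record;
        // no blocks are associated with the file yet.
        FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"));

        // Written bytes are split into packets and held in the DataQueue
        // before being streamed to the DataNode pipeline.
        out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));

        // close() flushes the remaining packets and waits for acknowledgements.
        out.close();
        fs.close();
    }
}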
FSDataOutputStream contains a DFSOutputStream object that handles the communication with the NameNode and the DataNodes. As the client writes data, DFSOutputStream splits it into packets. These packets wait in a queue called the DataQueue.
A separate component, the DataStreamer, consumes this DataQueue. The DataStreamer also asks the NameNode to allocate new blocks, which determines the DataNodes to be used for replication.
Now the replication process begins with the creation of a pipeline of DataNodes. Since we have chosen a replication factor of 3, there are 3 DataNodes in the pipeline.
The DataStreamer streams the packets to the first DataNode in the pipeline.
Each DataNode in the pipeline stores every packet it receives and forwards it to the next DataNode in the pipeline.
DFSOutputStream maintains another queue, the 'Ack Queue', to hold packets that are waiting for acknowledgement from the DataNodes.
Once acknowledgements for a packet have been received from all the DataNodes in the pipeline, the packet is removed from the 'Ack Queue'. If a DataNode fails, the packets in this queue are used to restart the write.
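The interplay between the DataQueue, the DataStreamer, and the Ack Queue can be pictured with two queues and a consumer. The sketch below is a simplified model of the mechanism just described, not Hadoop's actual DFSOutputStream internals; Packet and sendToPipeline() are hypothetical stand-ins.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified model of the DataQueue / Ack Queue mechanism (illustrative only).
class PipelineModel {
    record Packet(int seq, byte[] data) {}

    private final BlockingQueue<Packet> dataQueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<Packet> ackQueue = new LinkedBlockingQueue<>();

    // Client side: each write becomes a packet waiting in the DataQueue.
    void enqueue(Packet p) throws InterruptedException {
        dataQueue.put(p);
    }

    // DataStreamer side: take a packet from the DataQueue, send it down the
    // DataNode pipeline, and park it in the Ack Queue until acknowledged.
    void streamOnce() throws InterruptedException {
        Packet p = dataQueue.take();
        sendToPipeline(p); // the first DataNode forwards it to the next, and so on
        ackQueue.put(p);
    }

    // A full acknowledgement from all DataNodes removes the packet; if a
    // DataNode fails, packets still in the Ack Queue are re-sent.
    void onAck(int seq) {
        ackQueue.removeIf(p -> p.seq() == seq);
    }

    private void sendToPipeline(Packet p) { /* network send, omitted here */ }
}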
After the client finishes writing data, it calls the close() method. The call to close() flushes the remaining data packets into the pipeline and waits for acknowledgements.
Once the final acknowledgement is received, the NameNode is contacted to signal that the file write operation is complete.
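As a quick sanity check after the write completes, the file's metadata can be read back. getFileStatus() and getReplication() are standard FileSystem calls; the path below matches the hypothetical example used earlier.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsVerifyWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://localhost:9000"), new Configuration());
        // Confirm the write completed and the replication factor took effect.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));
        System.out.println("size = " + status.getLen()
                + ", replication = " + status.getReplication());
        fs.close();
    }
}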