Characteristics of Big Data
Application of Big Data Processing
Introduction to BIG DATA
Where to get Big Data?
Types of Big Data
Storage layer - HDFS (Hadoop Distributed File System)
MapReduce
YARN
How Hadoop Works?
Hadoop Ecosystem
Hadoop Architecture
Hadoop Installation & Environment Setup
Setting Up A Single Node Hadoop Cluster
Ubuntu User Configuration
SSH Setup With Key Generation
Disable IPv6
Download and Install Hadoop 3.1.2
Working with Configuration Files
Start The Hadoop Instances
Hadoop Distributed File System (HDFS)
HDFS Features and Goals
HDFS Architecture
Read Operations in HDFS
Write Operations In HDFS
HDFS Operations
YARN
YARN Features
YARN Architecture
Resource Manager
Node Manager
Application Master
Container
Application Workflow in Hadoop YARN
Hadoop MapReduce
How MapReduce Works?
MapReduce Examples with Python
Running The MapReduce Program & Storing The Data File To HDFS
Create A Python Script
Hadoop Environment Setup
Execute The Script
Apache Hive Definition
Why Apache Hive?
Features Of Apache Hive
Hive Architecture
Hive Metastore
Hive Query Language
SQL vs Hive
Hive Installation
Apache Pig Definition
MapReduce vs. Apache Pig vs. Hive
Apache Pig Architecture
Installation Process Of Apache Pig
Execute Apache Pig Script
Hadoop Ecosystem Components
NoSQL Data Management
Apache HBase
Apache Cassandra
MongoDB
Introduction To Kafka
The Architecture of Apache Flume
Apache Spark Ecosystem
Hadoop does not require expensive, highly reliable hardware. It is designed to run on clusters of commonly available commodity hardware that can be purchased from multiple vendors, for which, at least in large clusters, the chance of some node failing somewhere in the cluster is high. HDFS is designed to keep working without any noticeable interruption to the user in the face of such failure. The main features and goals of HDFS are explained in the following section:
Fault Tolerance: Fault tolerance refers to the ability of a system to keep working under unfavourable conditions and to handle such situations gracefully. HDFS is highly fault-tolerant. Data is divided into blocks in HDFS, and multiple copies of each block (the number is configurable) are created on different machines in the cluster. So whenever any machine in the cluster goes down, a client can still access the data from another machine that holds a copy of the same blocks. HDFS is also rack-aware: it places replicas of data blocks on another rack, so if a machine or an entire rack fails unexpectedly, a user can still read the data from slaves in another rack.
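As a minimal sketch of how this is configured (assuming the single-node setup described earlier, where hdfs-site.xml lives under $HADOOP_HOME/etc/hadoop), the number of copies HDFS keeps of each block is set by the dfs.replication property; 3 is the HDFS default:

    <!-- hdfs-site.xml: how many replicas HDFS keeps of every block -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>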
High Availability: HDFS is a highly available file system. Data is replicated among the DataNodes of the HDFS cluster by creating copies of the blocks on the other slaves present in the cluster. Whenever a user wants to access data, they can read it from the slave that holds the required blocks and is the nearest accessible node in the cluster. And during unfavourable situations, such as the failure of a node, a user can still access their data from the other nodes, because duplicate copies of the blocks containing the user's data are kept on the other nodes present in the HDFS cluster.
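As a hedged illustration (the path /user/hduser/sample.txt is only a placeholder), the standard hdfs fsck command lists which DataNodes hold the replicas of each block of a file, i.e. the nodes a client can read from:

    hdfs fsck /user/hduser/sample.txt -files -blocks -locations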
Distributed Storage: All of the above features are achieved in HDFS through distributed storage and replication. Data in HDFS is stored in a distributed way across the nodes of the cluster: it is divided into blocks, and the blocks are stored on the nodes of the HDFS cluster. Replicas of each block are then created and stored on other nodes of the cluster. So if any machine in the cluster fails, we can still access our data from the other nodes that hold a replica.
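For illustration, the size of the blocks a file is split into before being distributed is governed by the dfs.blocksize property in hdfs-site.xml; the value below simply restates the Hadoop 3.x default of 128 MB:

    <!-- hdfs-site.xml: block size used when splitting files for distribution -->
    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value> <!-- 128 MB, the default in Hadoop 3.x -->
    </property>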
Data Reliability: HDFS is Hadoop's distributed file system, and it provides reliable storage on a cluster of nodes. HDFS partitions data into blocks and stores these blocks on the nodes of the cluster. It stores data reliably by creating a replica of each block on other nodes present in the cluster, and thus provides fault tolerance. If a node containing the data goes down, a user can still access that data from the other nodes in the HDFS cluster that hold a copy of it. By default, HDFS creates three copies of each block across the nodes of the cluster. As a result, data remains easily available to users and data loss is avoided.
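As a small check (again with a placeholder path), the replication factor actually applied to a stored file can be read back with the stat command:

    hdfs dfs -stat "replication: %r, size: %b bytes" /user/hduser/sample.txt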
Replication: Data replication is one of the most significant and unique features of Hadoop HDFS. Replication addresses the problem of data loss under unfavourable conditions such as a node crash or hardware failure. HDFS manages the replication process at regular intervals and keeps creating replicas of user data on different machines available in the cluster. So whenever any machine in the cluster crashes, the user can access their data from other machines that hold the blocks of that data. As a consequence, the risk of data loss is minimized.
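As a hedged example of managing replication by hand (placeholder path again), the replication factor of an existing file can be changed with setrep; the -w flag makes the command wait until the new replicas are actually in place:

    hdfs dfs -setrep -w 2 /user/hduser/sample.txt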
Scalability: Because HDFS stores data on multiple nodes, the cluster can be scaled as requirements grow. Two scalability mechanisms are available in HDFS: vertical scalability, which adds more resources (disk, memory, CPU) to the existing nodes of the cluster, and horizontal scalability, which adds more machines to the cluster. The horizontal approach is preferred because it lets us scale the cluster from tens of nodes to hundreds of nodes on the fly, without any downtime.
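As a rough sketch of horizontal scaling on a Hadoop 3.x cluster (the hostname newnode01 is hypothetical, and HADOOP_HOME is assumed to point at the installation from the setup chapter), a new machine is registered in the workers file on the master and the DataNode daemon is started on the new machine, after which the NameNode begins placing blocks on it:

    # On the master node: register the new slave (Hadoop 3.x uses etc/hadoop/workers)
    echo "newnode01" >> $HADOOP_HOME/etc/hadoop/workers

    # On the new machine: start the DataNode daemon so it joins the cluster
    $HADOOP_HOME/bin/hdfs --daemon start datanode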