Characteristics of Big Data
Application of Big Data Processing
Introduction to Big Data
Where to get Big Data?
Types of Big Data
Storage layer - HDFS (Hadoop Distributed File System)
MapReduce
YARN
How Hadoop Works?
Hadoop Ecosystem
Hadoop Architecture
Hadoop Installation & Environment Setup
Setting Up A Single Node Hadoop Cluster
Ubuntu User Configuration
SSH Setup With Key Generation
Disable IPv6
Download and Install Hadoop 3.1.2
Working with Configuration Files
Start The Hadoop Instances
Hadoop Distributed File System (HDFS)
HDFS Features and Goals
HDFS Architecture
Read Operations in HDFS
Write Operations In HDFS
HDFS Operations
YARN
YARN Features
YARN Architecture
Resource Manager
Node Manager
Application Master
Container
Application Workflow in Hadoop YARN
Hadoop MapReduce
How MapReduce Works?
MapReduce Examples with Python
Running The MapReduce Program & Storing The Data File To HDFS
Create A Python Script
Hadoop Environment Setup
Execute The Script
Apache Hive Definition
Why Apache Hive?
Features Of Apache Hive
Hive Architecture
Hive Metastore
Hive Query Language
SQL vs Hive
Hive Installation
Apache Pig Definition
MapReduce vs. Apache Pig vs. Hive
Apache Pig Architecture
Installation Process Of Apache Pig
Execute Apache Pig Script
Hadoop Ecosystem Components
NoSQL Data Management
Apache HBase
Apache Cassandra
MongoDB
Introduction To Kafka
The Architecture of Apache Flume
Apache Spark Ecosystem
MapReduce transforms lists of input data into lists of output data elements. A MapReduce program does this using two list-processing idioms: Map and Reduce. Between Map and Reduce there is a smaller intermediate phase called Shuffle and Sort.
First, we will discuss the first phase of the MapReduce paradigm, the Map. Input data given to a mapper is processed through a user-defined function written for the mapper. All the required complex business logic is placed in the map phase, so that the heavy processing is done by the mappers in parallel; usually, the number of mappers is much larger than the number of reducers. A mapper generates intermediate data in the form of key-value pairs, and this output goes as input to the reducer.
The key is the field on which the data is grouped and aggregated on the reducer side; the value is the data on which we operate.
Fig: Key-Value pair of MapReduce
This intermediate result (the key-value pairs) is then processed by a user-defined function written for the reducer, and the final result is generated. Usually, only light processing is done in the reduce part. The final result is stored in HDFS, and replication is done as usual.
Fig: MapReduce Process
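To make the flow concrete, here is a minimal pure-Python sketch of the paradigm (a local simulation, not Hadoop itself): a map phase emits key-value pairs, a shuffle-and-sort step groups them by key, and a reduce phase folds each group into a single result. The function names and the small word-count example are illustrative assumptions, not taken from the book.

    from itertools import groupby
    from operator import itemgetter

    def map_phase(records, mapper):
        # Apply the user-defined map function to every input record;
        # each call may emit zero or more (key, value) pairs.
        for record in records:
            yield from mapper(record)

    def shuffle_sort(pairs):
        # Group the intermediate pairs by key, as the framework does
        # between the map and reduce phases.
        ordered = sorted(pairs, key=itemgetter(0))
        for key, group in groupby(ordered, key=itemgetter(0)):
            yield key, [value for _, value in group]

    def reduce_phase(grouped, reducer):
        # Apply the user-defined reduce function to each key's values.
        for key, values in grouped:
            yield key, reducer(key, values)

    # Tiny word-count demonstration of the three phases.
    lines = ["big data", "big hadoop"]
    mapper = lambda line: [(word, 1) for word in line.split()]
    reducer = lambda key, values: sum(values)
    print(list(reduce_phase(shuffle_sort(map_phase(lines, mapper)), reducer)))
    # [('big', 2), ('data', 1), ('hadoop', 1)]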
MapReduce Example with Movie Data
We have a toy data set with four columns: USER_ID, MOVIE_ID, RATING, and TIMESTAMP. We want to know the number of ratings provided by each user, so our key is USER_ID and our value is MOVIE_ID.
Tab: Movie Data set
Mapping
This is the first phase in the execution of the MapReduce program. In this phase, data is passed to a mapping function to produce intermediate output values. In our example, the key is USER_ID and the value is MOVIE_ID, and the job of the mapping phase is to emit each record as a (USER_ID, MOVIE_ID) pair so that occurrences of the same key can be brought together in the next phase; a sketch of such a mapper follows.
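A minimal mapper for this step, in the Hadoop Streaming style used by the book's later Python examples, might look like the sketch below. It assumes each input line is tab-separated in the column order USER_ID, MOVIE_ID, RATING, TIMESTAMP; the script name and the exact field layout are assumptions, not taken from the book's listing.

    #!/usr/bin/env python3
    # mapper.py (hypothetical name) - Hadoop Streaming mapper sketch.
    # Expected input, one rating per line:
    # USER_ID<TAB>MOVIE_ID<TAB>RATING<TAB>TIMESTAMP
    import sys

    for line in sys.stdin:
        fields = line.strip().split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        user_id, movie_id = fields[0], fields[1]
        # Emit USER_ID as the key and MOVIE_ID as the value.
        print(f"{user_id}\t{movie_id}")

By default, Hadoop Streaming treats everything before the first tab of an output line as the key, which is exactly what the Shuffle and Sort phase will group on.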
Shuffling and Sort
This is an intermediate phase that consumes the output of the Mapping phase. It consolidates the relevant records from the mapper output: in our example, records with the same USER_ID are grouped together along with their respective MOVIE_IDs.
Fig: MapReduce process for Movie data
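In a real job this grouping is performed by the framework rather than by user code; when testing a Streaming pipeline locally, it is commonly simulated with a plain sort. The snippet below illustrates the idea; the movie IDs are made up for illustration and are not from the book's table.

    # The framework sorts the mapper output by key, so lines with the
    # same USER_ID become adjacent; locally, sorting does the same job.
    intermediate = ["186\t302", "166\t242", "186\t377", "186\t51"]  # illustrative pairs
    for pair in sorted(intermediate):
        print(pair)
    # 166  242   <- user 166's single rating
    # 186  302   <- user 186's three ratings, now grouped
    # 186  377
    # 186  51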
Reducing
In this phase, the output values from the Shuffling phase are aggregated: the values for each key are combined and a single output value per key is returned. In short, this phase summarizes the complete dataset. In our example, it returns that user 166 rated one movie, user 186 rated three movies, and so on; a sketch of such a reducer follows.
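A matching reducer in the same Hadoop Streaming style could count the ratings per user as sketched below. It relies on the framework delivering the mapper output sorted by key, so all lines for one USER_ID arrive consecutively; the script name is again an assumption.

    #!/usr/bin/env python3
    # reducer.py (hypothetical name) - Hadoop Streaming reducer sketch.
    # Input arrives sorted by key: USER_ID<TAB>MOVIE_ID per line.
    import sys

    current_user = None
    count = 0

    for line in sys.stdin:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        user_id, _movie_id = parts
        if user_id != current_user:
            if current_user is not None:
                # Emit the ratings count for the user we just finished.
                print(f"{current_user}\t{count}")
            current_user = user_id
            count = 0
        count += 1

    if current_user is not None:
        print(f"{current_user}\t{count}")  # flush the last user

The whole pipeline can be tested locally before it is run through Hadoop Streaming, for example with: python mapper.py < ratings.tsv | sort | python reducer.py (the input file name ratings.tsv is hypothetical).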