
Hadoop Tutorial

MapReduce

Hadoop MapReduce is the processing layer of Hadoop. This programming model is designed to process large volumes of data in parallel by dividing the work into a set of independent tasks. We only need to express the business logic in the way MapReduce works; the framework takes care of everything else.

Developers write MapReduce programs in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. In MapReduce, the input is a list, and the program converts it into an output that is again a list. MapReduce is the heart of Hadoop; much of Hadoop's power and efficiency comes from MapReduce and its parallel processing capability.

In MapReduce, a problem is divided into a large number of smaller problems, each of which is processed independently to give an individual output. These outputs are then processed further to produce the final result. Hadoop MapReduce is highly scalable and can run across many machines: many small computers can together process jobs that a single large machine could not.
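To make the list-in, list-out idea concrete, the following sketch uses ordinary Java only (not the Hadoop API) to simulate a word count on in-memory lists: the map step turns each input line into (word, 1) pairs, a grouping step plays the role of the shuffle, and the reduce step sums the counts per word.

import java.util.*;

public class MapReduceOnLists {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("deer bear river", "car car river", "deer car bear");

        // Map: each input line becomes a list of (word, 1) pairs.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : input) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle: group the intermediate pairs by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce: sum the values for each key, producing the output list.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count;
            }
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}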

The Map task runs in the following phases:

a. Record Reader

The record reader transforms the input split into records. It parses the data into records but does not parse the records themselves. It passes the data to the mapper as key-value pairs. Usually, the key is positional information (for example, the byte offset of a line within the file) and the value is the chunk of data that makes up the record.
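For example, the record reader supplied by Hadoop's default TextInputFormat emits one record per line of the input split: the key is the byte offset of the line within the file and the value is the line itself. A minimal driver sketch selecting it:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class RecordReaderExample {
    public static void main(String[] args) throws Exception {
        // TextInputFormat's record reader turns each line of an input split into
        // a (byte offset, line contents) pair, for example:
        //   (0,  "deer bear river")
        //   (16, "car car river")
        // These key-value pairs are what the mapper receives as input.
        Job job = Job.getInstance();
        job.setInputFormatClass(TextInputFormat.class);
    }
}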

b. Mapper

The mapper is a user-defined function that processes each key-value pair from the record reader and generates zero or more intermediate key-value pairs. What those key-value pairs contain is decided entirely by the logic of the mapper function.
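As an example, a word-count mapper (a sketch modeled on the standard Hadoop WordCount example) splits each line received from the record reader into tokens and emits an intermediate (word, 1) pair for every token:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each call handles one record from the record reader and may emit
        // zero or more intermediate (word, 1) pairs.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}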

c. Combiner

The combiner is a localized reducer that groups the intermediate data coming out of the mapper, on the map side. In many circumstances this significantly decreases the amount of data that has to move over the network. When the reduce logic is commutative and associative, a combiner can provide a large performance gain with very little drawback.
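In the word-count case, the reduce logic (summing counts) is commutative and associative, so the reducer class itself can be reused as the combiner. The fragment below is a single driver line (the full driver is sketched under Output Format); IntSumReducer is the reducer shown later in the Reduce section.

// Reusing the reducer as a combiner: partial sums are computed on the map
// side, shrinking the data that must travel across the network to the reducers.
job.setCombinerClass(IntSumReducer.class);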

d. Partitioner

The partitioner takes the intermediate key-value pairs from the mapper, splits them into shards, and sends one shard to each reducer. By default, the partitioner takes the hash code of the key and performs a modulus by the number of reducers: key.hashCode() % numReduceTasks. This distributes the keyspace roughly evenly across the reducers and ensures that the same key, even when it comes from different mappers, ends up at the same reducer. The partitioned data from each map task is written to the local file system, where it waits for the reducer to pull it.
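The default behaviour described above corresponds to Hadoop's HashPartitioner. The sketch below spells out the same logic for the word-count key-value types; the bitwise AND with Integer.MAX_VALUE simply keeps the hash code non-negative before the modulus.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Hash the key, force the result to be non-negative, and take the
        // remainder by the number of reducers to pick the target shard.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}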

The Reduce task runs in the following phases:

i. Shuffle and Sort

Shuffle and sort is the first step of the reducer. This step copies the data written by the partitioner to the machine where the reducer is running and sorts the individual data pieces into one large data list. The purpose of this sort is to bring equivalent keys together so that the reduce task can iterate over their values easily. The framework handles all of this automatically.
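Although the framework performs this step on its own, the driver can influence how intermediate keys are sorted and how they are grouped into reduce calls. A minimal sketch using Hadoop's Job hooks; Text.Comparator is already the default comparator for Text keys and is named here only to show where a custom comparator would plug in.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleSortHooks {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Comparator used to sort intermediate keys before they reach the reducer.
        job.setSortComparatorClass(Text.Comparator.class);
        // Comparator that decides which sorted keys are grouped into one reduce call.
        job.setGroupingComparatorClass(Text.Comparator.class);
    }
}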

ii. Reduce

The reducer runs the user-defined reduce function once per key grouping. The framework passes the function the key and an iterator over all the values associated with that key. We can write the reducer to aggregate, filter, and combine data in many different ways. When the reduce function finishes, it passes zero or more key-value pairs to the output format.
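Continuing the word-count sketch, the reducer below is called once per key grouping with an iterator over all the counts shuffled to that key, and emits a single (word, total) pair to the output format:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key grouping: sum every count shuffled to this key.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        // Emit zero or more pairs; here exactly one (word, total) pair per key.
        context.write(key, result);
    }
}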

iii. Output Format

This is the final step of the MapReduce job. It takes the reducer's key-value pairs and writes them to the file through the record writer. By default, a tab separates the key and value, and a newline character separates the records. The output format can be customized to produce more efficient output. The final data is written to HDFS.
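A sketch of a complete driver ties these pieces together. It wires in the TokenizerMapper, IntSumReducer (also used as the combiner), and WordPartitioner classes sketched above, keeps the default TextOutputFormat, and shows the configuration property (used by Hadoop 2 onwards) that replaces the default tab separator; the input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // By default TextOutputFormat separates key and value with a tab;
        // this property changes the separator to a comma.
        conf.set("mapreduce.output.textoutputformat.separator", ",");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setPartitionerClass(WordPartitioner.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The final output records are written to HDFS under args[1].
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}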

Fig: Hadoop Architecture