
Hadoop Tutorial

How Does MapReduce Work?

MapReduce transforms lists of input data elements into lists of output data elements. A MapReduce program does this using two list-processing idioms: Map and Reduce. Between Map and Reduce there is a smaller intermediate phase called Shuffle and Sort.

First, we will discuss the Map phase, the first phase of the MapReduce paradigm. Input data given to a mapper is processed by a user-defined function written at the mapper. All the required complex business logic is placed in the mapper phase so that the heavy processing is done by the mappers in parallel; usually, the number of mappers is much higher than the number of reducers. Each mapper generates a result as intermediate key-value pairs, and this output goes as input to the reducer.

  • The key is the field on which the data is grouped and aggregated on the reducer side.

  • The value is the data on which we operate.

                        Fig: Key-Value pair of MapReduce

This intermediate key-value result is then processed by a user-defined function written at the reducer, and the final result is generated. Usually, only light processing is done in the reduce part. The final result is stored in HDFS, and replication is done as usual.
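Production MapReduce jobs are normally written in Java against Hadoop's API, but the Map → Shuffle and Sort → Reduce flow can be simulated in a few lines of plain Python. The sketch below uses the classic word-count example; the helper names (`map_phase`, `shuffle_sort`, `reduce_phase`) are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    # Apply the user-defined mapper to each input record,
    # producing a flat list of intermediate (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_sort(pairs):
    # Sort by key, then group together all values that share a key,
    # mimicking the Shuffle and Sort phase.
    pairs.sort(key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]

def reduce_phase(grouped, reducer):
    # Apply the user-defined reducer to each (key, [values]) group.
    return [reducer(k, vals) for k, vals in grouped]

# Word count: the mapper emits (word, 1); the reducer sums the counts.
docs = ["to be or not to be"]
mapped = map_phase(docs, lambda line: [(w, 1) for w in line.split()])
result = reduce_phase(shuffle_sort(mapped), lambda k, vals: (k, sum(vals)))
print(result)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In real Hadoop, the framework (not user code) performs the shuffle, sort, and parallel scheduling; only the mapper and reducer functions are user-defined.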

                                        Fig: MapReduce Process

MapReduce Example with Movie data:

We have a toy data set with four columns: USER_ID, MOVIE_ID, RATING, and TIMESTAMP. We want to know the number of ratings provided by each user, so our key is USER_ID and our value is MOVIE_ID.

                     Tab: Movie Data set

Mapping

This is the first phase in the execution of the MapReduce program. In this phase, data is passed to a mapping function to produce intermediate output values. In our example, the key is USER_ID and the value is MOVIE_ID, and the job of the mapping phase is to emit each occurrence as a key-value pair.
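The mapping step above can be sketched in plain Python. The specific record values below are invented sample rows (chosen so USER 166 appears once and USER 186 three times, matching the result described later); a real Hadoop mapper would receive its input from HDFS splits.

```python
# Toy movie records: (USER_ID, MOVIE_ID, RATING, TIMESTAMP) -- sample values.
records = [
    (186, 302, 3, 891717742),
    (186, 377, 1, 878887116),
    (166, 346, 1, 886397596),
    (186, 474, 4, 884182806),
]

def mapper(record):
    user_id, movie_id, rating, timestamp = record
    # Emit USER_ID as the key and MOVIE_ID as the value;
    # RATING and TIMESTAMP are not needed for this job.
    return (user_id, movie_id)

intermediate = [mapper(r) for r in records]
print(intermediate)  # [(186, 302), (186, 377), (166, 346), (186, 474)]
```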

Shuffling and Sort

This is an intermediate phase that consumes the output of the Mapping phase and consolidates the relevant records. In our example, records with the same USER_ID are grouped together along with their respective MOVIE_IDs.
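A minimal sketch of this grouping, using the intermediate pairs produced by the mapper above (sample values, not real data); in Hadoop this consolidation is done automatically by the framework:

```python
from collections import defaultdict

# Intermediate (USER_ID, MOVIE_ID) pairs as emitted by the mappers.
intermediate = [(186, 302), (186, 377), (166, 346), (186, 474)]

def shuffle_and_sort(pairs):
    # Club together all MOVIE_IDs that share the same USER_ID.
    groups = defaultdict(list)
    for user_id, movie_id in pairs:
        groups[user_id].append(movie_id)
    # Sort by key so each reducer sees its keys in order.
    return sorted(groups.items())

grouped = shuffle_and_sort(intermediate)
print(grouped)  # [(166, [346]), (186, [302, 377, 474])]
```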

                                          Fig: MapReduce process for Movie data

Reducing

In this phase, output values from the Shuffling phase are aggregated. The reducer combines the values for each key and returns a single output value; in short, this phase summarizes the complete dataset. In our example, it returns that USER 166 rated one movie, USER 186 rated three movies, and so on.
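Continuing the sketch, the reducer only has to count the movies grouped under each user. The input below is the (sample) output of the shuffle/sort step above:

```python
# Output of the shuffle/sort phase: each USER_ID with its list of MOVIE_IDs.
grouped = [(166, [346]), (186, [302, 377, 474])]

def reducer(user_id, movie_ids):
    # The number of ratings a user provided is simply the
    # number of MOVIE_IDs grouped under that user's key.
    return (user_id, len(movie_ids))

result = [reducer(k, v) for k, v in grouped]
print(result)  # [(166, 1), (186, 3)]
```

This matches the result stated above: USER 166 rated one movie and USER 186 rated three.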