
Hadoop Tutorial

MapReduce Examples with Python

In this part of the document, we will work with the movie rating dataset and use Python libraries to execute a MapReduce job. Let's look at the data set (u.data) first. It has 1725 observations of 4 columns (variables): the first is user_id, the second is movie_id, the third is rating, and the last is time.
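As a quick sanity check, below is a minimal sketch of how we could load and inspect u.data with pandas; this assumes the file is tab-separated with no header row, and pandas is used here only for inspection, not as part of the Hadoop job itself:

    import pandas as pd

    # Assumed layout: tab-separated, no header, four columns as described above.
    column_names = ["user_id", "movie_id", "rating", "time"]
    ratings = pd.read_csv("u.data", sep="\t", names=column_names)

    print(ratings.shape)   # (number of observations, 4)
    print(ratings.head())  # first few rows: user_id, movie_id, rating, time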

Now, using Python code, we are going to count the number of movies in each rating category. The code will run as a MapReduce job inside the Hadoop environment, and it has two parts: map and reduce. In the sections below, I will explain the Python code for this MapReduce job.

Code for mapping phase:

The key is the rating, and we take the value as 1. So, for each observation, the mapper will generate a pair (rating, 1).
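As an illustration, here is a minimal sketch of such a mapper, assuming the mrjob library (a common Python library for writing Hadoop MapReduce jobs) and tab-separated input lines; the class name RatingsBreakdown is only an example:

    from mrjob.job import MRJob

    class RatingsBreakdown(MRJob):
        def mapper(self, _, line):
            # Split one tab-separated input line into its four fields.
            user_id, movie_id, rating, timestamp = line.split("\t")
            # Emit the pair (rating, 1) for every observation.
            yield rating, 1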

Code for reduce phase:

In the reduce phase, the output is the sum of the 1's for each rating. So, it adds up all the 1's for rating 1, then for rating 2, and so on.
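Continuing the same hypothetical RatingsBreakdown job, the reducer could be sketched as below; the full class is repeated here so the example is self-contained:

    from mrjob.job import MRJob

    class RatingsBreakdown(MRJob):
        def mapper(self, _, line):
            # Same mapper as above: emit (rating, 1) for each observation.
            user_id, movie_id, rating, timestamp = line.split("\t")
            yield rating, 1

        def reducer(self, rating, counts):
            # counts is an iterator over all the 1's emitted for this rating,
            # so summing it gives the number of ratings with this value.
            yield rating, sum(counts)

    if __name__ == "__main__":
        RatingsBreakdown.run()

If this sketch were saved as ratings_breakdown.py, it could be tested locally with "python ratings_breakdown.py u.data", and mrjob's Hadoop runner (the -r hadoop option) could submit the same script to the cluster; the exact HDFS path to u.data depends on your environment.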

This example uses only a small chunk of the data to illustrate the process.