
Hadoop ≅ HDFS + MapReduce (Part – II)

In this post, we will discuss the following with regard to the MapReduce framework:

  1. Motivation for a parallel processing framework
  2. What is MapReduce?
  3. How MapReduce solves a problem?
  4. MapReduce in detail
    1. Map Phase
    2. Sort & Shuffle Phase
    3. Reduce Phase

Motivation for a parallel processing framework

As you know, we are living in the age of data. Today the web contains more than 40 billion pages. If we assume that each page holds about 20 KB of data, we are looking at somewhere close to 800 terabytes. At a read speed of around 100 MB/s, it would take a single computer on the order of three months just to read the web, and merely storing such a volume of data would require thousands of computers.

Clearly we need a mechanism to divide such a huge workload among many computers. If we spread the same task of reading 800+ terabytes across, say, 1,000 computers, it can be done in under three hours.

But our task is not limited to reading data; we also need to process it. Consolidating the data from the storage nodes and transferring it to separate compute nodes would take a lot of time. We therefore need a programming framework that distributes the computation and runs the pieces in parallel.

What is MapReduce?

MapReduce is one such programming model, designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. Each of these tasks is then run on an individual node in a cluster. The advantage of the MapReduce framework is that it not only divides the work but also shields the programmer from the issues that come with distributed computing, such as communication and coordination, recovery from machine failure, status reporting, debugging, optimization, and data locality.

How MapReduce solves a problem?

A typical problem solved with MapReduce can be broken down into the following steps (a minimal sketch of these steps in plain Java follows the list):

  1. Read a lot of data
  2. Extract something you are interested in from every record
  3. Shuffle and sort the intermediate results
  4. Aggregate (filter, sort, or transform) the intermediate results
  5. Write out the result
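
Before going into Hadoop specifics, here is a minimal sketch, in plain Java and with no Hadoop dependencies, that simulates these five steps on a tiny in-memory document; all names in it are illustrative.

import java.util.*;

public class MiniMapReduce {
    public static void main(String[] args) {
        // 1. Read a lot of data (here: one tiny in-memory "document")
        List<String> lines = Arrays.asList("to be or not to be");

        // 2. Extract something of interest from every record (the Map step):
        //    emit a (word, 1) pair for every word
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // 3. Shuffle and sort: group the values by key; a TreeMap keeps the keys sorted
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // 4. Aggregate the intermediate results (the Reduce step: sum the 1s)
        // 5. Write out the result (here: just print it)
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) sum += one;
            System.out.println(entry.getKey() + " -> " + sum);
        }
    }
}

Run on the sample sentence used later in this post, it prints be -> 2, not -> 1, or -> 1, to -> 2, with the keys in sorted order.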

MapReduce in detail

The figure below gives an overview of MapReduce. In MapReduce everything is expressed in terms of key-value pairs. The phases of a MapReduce job are Map, Sort & Shuffle, and Reduce.

The two daemons associated with MapReduce are the JobTracker and the TaskTracker. The client submits a job to the master node, which runs the JobTracker. The JobTracker then assigns Map and Reduce tasks to other nodes in the cluster. These nodes run the TaskTracker daemons, which start the Map and Reduce tasks and send progress reports back to the JobTracker. In short, the JobTracker is the master and the TaskTrackers are the slaves. A hedged sketch of such a client-side driver is shown after the figure below.

[Figure: Overview of MapReduce]
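
To give a feel for what job submission looks like from the client side, here is a hedged sketch of a driver program written against Hadoop's org.apache.hadoop.mapreduce Java API. The class names WordCountDriver, WordCountMapper and WordCountReducer, as well as the use of command-line arguments for the input and output paths, are assumptions for illustration; sketches of the Mapper and Reducer appear later in this post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver for the word-frequency job discussed in this post.
// WordCountMapper and WordCountReducer are assumed class names; sketches of
// both appear later in this post.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submitting the job: from here the JobTracker schedules the Map and
        // Reduce tasks on the TaskTracker nodes and tracks their progress.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}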

Map Phase

One of the major advantages of Hadoop is data locality. Hadoop always tries to take the computation to the data rather than transfer the data to the computation. In other words, the node chosen to run a Map task is the one closest to the data (if possible, the very node that stores the data). This greatly reduces network traffic and the time spent moving data from one node to another.

As you can see in the figure above, multiple Mappers run the Map tasks in parallel, each processing a portion of the input data. Each Mapper receives its input as key-value pairs and emits a list of zero or more intermediate key-value pairs.

map(key, value) -> list(intermediate_key, intermediate_value)

Map is essentially a function that takes a key-value pair as input and emits intermediate key-value pairs. Let us make this concrete with a "word frequency" Mapper, whose goal is to find the frequency of each word in a document. The key here is the offset of the word (where the word appears in the document) and the value is the word itself. When such a pair is passed to the Mapper, it emits the word as the key and 1 as the value.

For instance, let the document be “to be or not to be”.

So the Map function would look something like,

map(k, v) = (v, 1)

map(offset1, 'to')  = ('to', 1)
map(offset2, 'be')  = ('be', 1)
map(offset3, 'or')  = ('or', 1)
map(offset4, 'not') = ('not', 1)
map(offset5, 'to')  = ('to', 1)
map(offset6, 'be')  = ('be', 1)
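
In Hadoop itself the Mapper is written as a Java class. Below is a hedged sketch of what such a word-frequency Mapper might look like with the org.apache.hadoop.mapreduce API; note that with Hadoop's default TextInputFormat the input key is actually the byte offset of a line and the value is the whole line, so the Mapper splits the line into words itself. The class name WordCountMapper is an assumption, chosen to match the driver sketched earlier.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-frequency Mapper. With the default TextInputFormat the
// input key is the byte offset of a line and the value is the line of text,
// so the Mapper splits the line into words and emits (word, 1) for each one.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit the intermediate (word, 1) pair
            }
        }
    }
}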

Sort & Shuffle Phase

After all the Map tasks are complete, we have a set of intermediate key-value pairs. All the values that share the same key are now grouped together, and the keys are sorted. This is the Sort & Shuffle phase of MapReduce. It is important to note that the output of the Map phase is stored on the local disk of the node, not on HDFS.

Continuing with our example, the output after Sort & Shuffle phase will be,

('to', [1, 1]), ('be', [1, 1]), ('or', [1]), ('not', [1])

Reduce Phase

After the Sort & Shuffle phase, the intermediate key-value lists are fed to the Reducer, where they are subjected to the Reduce function. The Reducer emits zero or more final key-value pairs; usually, it emits one key-value pair for every intermediate key. The output of the Reducer is written to HDFS.

The Reduce function here adds up all the intermediate values associated with a particular key. So the final output will be,

('to', 2), ('be', 2), ('or', 1), ('not', 1)

As you can see, the Reducer gives us the required output: the frequency of each word in the document.
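
To round off the example, here is a hedged sketch of the corresponding Reducer, again using the org.apache.hadoop.mapreduce API and the illustrative name WordCountReducer from the driver sketched earlier. It simply sums the list of 1s associated with each word and writes the totals, which end up on HDFS as the job's final output.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative word-frequency Reducer. For each word it receives the list of
// intermediate 1s produced by the Mappers and writes (word, total count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);   // e.g. ('to', 2), ('be', 2), ('or', 1), ('not', 1)
    }
}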

To conclude, I reiterate that MapReduce is a great abstraction that lets programmers focus on the problem itself rather than on all the messy details involved in large-scale distributed programming.

