Monday, November 9, 2009

dirty basic MapReduce implementation (well, not even an implementation)

So, just to understand the concept of MapReduce better, I tried to write some Java code that uses the MapReduce 'pattern' to solve the word count problem (i.e. counting the number of times each unique word occurs in a given set of documents). Of course, this has only been tested on a sample of three basic text files, has no error checking whatsoever, and was implemented using the first way that came to mind.

FileKeyValue.java - File name and file value (the list of words in the file)
WordKeyValue.java - Word and its count
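These two are just simple key/value holders. A minimal sketch of what they might look like (the field names here are my guesses, not the original code):

```java
import java.util.List;

// Hypothetical sketch of the two key/value holder classes (field names assumed).
class FileKeyValue {
    final String fileName;    // key: the file name
    final List<String> words; // value: the list of words in the file
    FileKeyValue(String fileName, List<String> words) {
        this.fileName = fileName;
        this.words = words;
    }
}

class WordKeyValue {
    final String word; // key: the word
    final int count;   // value: how many times it occurred
    WordKeyValue(String word, int count) {
        this.word = word;
        this.count = count;
    }
}
```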
MapReducer.java - This does the bulk of the work.
  1. It creates a list of FileKeyValue objects for all the input files.
  2. Then threads out each FileKeyValue object to be processed in a Mapper.
  3. Waits for all the Mappers to finish.
  4. Then sorts the output of all the Mappers by the output key (i.e. the word) and consolidates the output values from all the Mappers for that key (i.e. creates the intermediate list).
  5. Threads out each unique (output key, intermediate list) pair to a Reducer for reduction.
  6. Waits for all reducers to complete.
  7. Prints out the results.
Mapper.java - Breaks a FileKeyValue object into a list of WordKeyValue objects. Each Mapper runs as a separate Thread.
Reducer.java - Sums up the intermediate list values for a given word and passes the result back to the MapReducer class. Each Reducer runs as a separate Thread.
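The whole flow above can be sketched in one self-contained class. This is not the original code: class and method names are illustrative, and I've used an ExecutorService rather than raw Threads, but the steps (map per file, wait, sort/group by word, reduce per word, wait, print) are the same:

```java
import java.util.*;
import java.util.concurrent.*;

// Hedged sketch of the map/sort/reduce flow described above.
// Names and structure are illustrative, not the original code.
public class WordCountSketch {

    // "Mapper" step: break one file's word list into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(List<String> words) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : words) out.add(Map.entry(w, 1));
        return out;
    }

    // "Reducer" step: sum the intermediate counts for one word.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static Map<String, Integer> run(List<List<String>> files) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Steps 2-3: one mapper task per file; Future.get() below waits for each.
        List<Future<List<Map.Entry<String, Integer>>>> mapped = new ArrayList<>();
        for (List<String> file : files)
            mapped.add(pool.submit(() -> map(file)));

        // Step 4: sort/group the mapper output by word (TreeMap keeps keys sorted).
        Map<String, List<Integer>> intermediate = new TreeMap<>();
        for (Future<List<Map.Entry<String, Integer>>> f : mapped)
            for (Map.Entry<String, Integer> e : f.get())
                intermediate.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                            .add(e.getValue());

        // Steps 5-6: one reducer task per unique word; wait for all to complete.
        Map<String, Future<Integer>> reduced = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : intermediate.entrySet())
            reduced.put(e.getKey(), pool.submit(() -> reduce(e.getValue())));

        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Future<Integer>> e : reduced.entrySet())
            result.put(e.getKey(), e.getValue().get());

        pool.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> files = List.of(
                List.of("the", "quick", "fox"),
                List.of("the", "lazy", "dog"),
                List.of("the", "fox"));
        // Step 7: print the results.
        System.out.println(run(files)); // {dog=1, fox=2, lazy=1, quick=1, the=3}
    }
}
```

Threading out one Reducer per unique word is wildly inefficient for real data, of course, but it mirrors the structure of the real thing, where reducers run on separate machines rather than separate threads.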
