Thursday, November 26, 2009

The Happiness at Work Matrix - 4Ps

For me there are four important factors that decide whether the time I spend at work is worthwhile. They are what I call the 4Ps -

  • People - This is the most important factor for me. I really like being with people who are smarter than me (oh, which is not that hard to find), people I respect and people who are fun to work with.
  • Problem - The problem you are trying to solve at work, which is the second most important thing for me. It is about how excited you are to work on the problem; it doesn't have to be rocket science to be fun. For example, on my first project, creating a web page was an amazing experience.
  • Power - The third most important thing. This is the amount of influence you have at the place you work, how much people listen to your opinions and how important they think you are to the project. I think this is directly related to how you are "growing" at work: the higher your power, the faster you are growing.
  • PayCheck - The obvious one: how much you make.
So you can draw a little matrix with years on the Y axis and the 4Ps on the X axis, and mark a cell Blue if that P was good for you that year, Orange for medium and Red for bad. It should give you a good perspective of how your years went by. Here is mine. Just because I have so many blues for the PayCheck does not mean I am the best paid guy around - everything is your perspective of things, so all a blue means is that I "thought" I was doing well on that P.




Tuesday, November 24, 2009

Closures

I keep hearing people talk about closures, but they always seemed hard to grasp from just a cursory look at the material. For the uninitiated, Wikipedia's definition is hard to grasp:

What is a closure? In computer science, a closure is a first-class function with free variables that are bound in the lexical environment.

Oh, what is a first-class function? A programming language is said to support first-class functions (or function types or function literals) if it treats functions as first-class objects. Specifically, this means that the language supports constructing new functions during the execution of a program, storing them in data structures, passing them as arguments to other functions, and returning them as the values of other functions. This concept doesn't cover any means external to the language and program (metaprogramming), such as invoking a compiler or an eval function to create a new function.

Now, what the heck is a free variable? In computer programming, a free variable is a variable referred to in a function that is not a local variable or an argument of that function[1].


And what do you mean by bound? In programming languages, name binding is the association of values with identifiers[1]. An identifier bound to a value is said to reference that value.

Oh, and finally, what is a lexical environment? In computer programming, a lexical environment or scope is an enclosing context where values and expressions are associated.

Kinda still hard to get if we don't know the formal terminology used.

But from what I got, crudely put: a closure is the ability to pass a piece of "executable code" around, and when that executable code is executed it has access to the same set of variables that were available at the place you passed it from. Here is a simple example in Ruby:

def foo
  local_variable = "foo variable"
  f = Proc.new { puts "local variable = " + local_variable }
  bar(f)
end

def bar(f)
  local_variable = "bar variable"
  f.call
end

foo

This is going to print "local variable = foo variable". So I was able to pass the { puts "local variable = " + local_variable } code around, and wherever I executed it, it had access to the same variables as the place it was passed from, i.e. foo.
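Java (where I spend my days) doesn't have closures yet; the nearest analogue I know of is an anonymous inner class, which can only capture final locals. A rough sketch of the same example, my code and not from any of the material above:

public class ClosureIsh {
    static void foo() {
        final String localVariable = "foo variable";
        Runnable f = new Runnable() {
            public void run() {
                System.out.println("local variable = " + localVariable);
            }
        };
        bar(f);
    }

    static void bar(Runnable f) {
        String localVariable = "bar variable"; // the Runnable never sees this one
        f.run();
    }

    public static void main(String[] args) {
        foo(); // prints "local variable = foo variable", just like the Ruby version
    }
}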

MIT OCW - Lecture 7 (Hashing and Hash Functions)

Today morning I took the 7th Lecture, which covered Hashing and Hash Functions. This was not as math intensive as the previous few lectures and was more intuition based (if there is such a thing - basically what I mean is you could kinda see why things were good or bad without even having to do the math). Here is what was covered:

  • Symbol table problem that is encountered in compilers, and operations on the symbol table, i.e. insertion, deletion and search
  • Direct Access table - Just an array where you use the key to index into the array to find the value
  • What is "Hashing"?
  • How to resolve collisions by chaining
  • Runtime analysis for hashing with collisions resolved by chaining - worst case is Theta(n) and average case is Theta(1 + alpha), where alpha is the load factor (i.e. n/m, where n is the number of elements in the set and m is the number of slots in the hash table)
  • How to choose a hash function? - It should distribute keys uniformly, and regularity in the keys (for example, all keys being even numbers) should not affect the uniformity
  • Hash function using the Division Method => h(k) = k mod m; don't pick an m that has a small divisor (for example 2), and don't pick m as a power of 2; typically pick a prime that is not too close to a power of 2 or 10
  • Hash function using the Multiplication Method => h(k) = (A*k mod 2**w) >> (w - r), where A is an odd integer between 2**(w-1) and 2**w and not too close to those bounds, 2**r = m, and w is the word length in bits of the computer (see the sketch after this list)
  • The above method is fast since multiplication and right shifting are typically faster on a computer than division
  • Resolving collisions through open addressing, i.e. when you don't have storage for the pointers needed to resolve collisions by chaining - you basically keep probing, computing a different slot from the hash function each time, until you find an empty slot or the search key you are looking for
  • Linear Probing Strategy => h(k,i) = (h(k,0) + i) mod m. This method suffers from primary clustering, i.e. basically we search the hash table linearly, one slot after another, so if we have a block of 10 occupied slots in one stretch, every time we hash into it we have to check those 10 slots before moving on to the next empty slot
  • Double Hashing Strategy => h(k,i) = (h1(k) + i*h2(k)) mod m; typically you pick m as a power of 2 and h2(k) odd for uniform distribution
  • Analysis of runtime for open addressing => expected number of probes <= 1/(1 - alpha), where alpha is the load factor. The lecture proves this theorem.
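To make the two hash functions concrete, here is a little Java sketch (the constants are my picks, not the lecture's):

public class HashSketch {
    // Division method: h(k) = k mod m, with m a prime not too close to a power of 2 or 10
    static int divisionHash(int k, int m) {
        int h = k % m;
        return h < 0 ? h + m : h;   // keep the slot non-negative for negative keys
    }

    // Multiplication method: h(k) = (A*k mod 2**w) >> (w - r), with m = 2**r slots and w = 32.
    // Java int multiplication already wraps mod 2**32, so the "mod 2**w" comes for free;
    // the unsigned shift then keeps the top r bits. Assumes 1 <= r < 32.
    static int multiplicationHash(int k, int r) {
        final int w = 32;
        final int A = 0x95E2D5CB;   // an odd constant with the high bit set (my pick)
        return (A * k) >>> (w - r);
    }

    public static void main(String[] args) {
        System.out.println(divisionHash(1234567, 701));      // 701 is prime
        System.out.println(multiplicationHash(1234567, 10)); // a slot in 0..1023
    }
}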

Network Programming in C - Beej's Guide Summary (Part 1)

So to brush up my C skills, I thought of doing some C programming, and network programming was something I was always curious about, so what better way than to do network programming in C. I haven't gotten to the programming part yet. So far I have just completed about half of Beej's Guide to Network Programming.
And since I have fallen into this good (or bad? who knows) habit of taking notes, here is what I have covered so far from the tutorial:
  • What is a socket?
  • Difference between Stream Sockets and Datagram Sockets (SOCK_STREAM and SOCK_DGRAM)
  • Stream sockets use the Transmission Control Protocol and datagram sockets use the User Datagram Protocol
  • Stream sockets keep a connection open during transmission and have all the good stuff like ensuring the packets reach the other end in the same order, whereas datagram sockets don't ensure any of that; the reason to use UDP over TCP is speed, and speed
  • Layers of Network stack and why it is layered (very brief)
  • Why IPv6? Because we were running out of addresses in IPv4 - an IPv4 address is 32 bits, an IPv6 address is 128 bits
  • Representation of an IPv4 address, e.g. 192.0.2.111, and an IPv6 address, e.g. 2001:0db8:c9d2:0012:0000:0000:0000:0051; you can actually drop the zeros and represent it as 2001:db8:c9d2:12::51
  • The loopback address is ::1 in IPv6 and 127.0.0.1 in IPv4
  • What is a subnet and how it is represented?
  • What are port numbers?
  • Byte orders - big endian and little endian, and the need to convert from network byte order to host byte order and vice versa when receiving and sending packets
  • Explains structs - addrinfo, sockaddr, sockaddr_in, in_addr, sockaddr_in6, in6_addr, sockaddr_storage
  • Explains the inet_pton() function, which converts the string representation of an IP address (e.g. 172.78.89.1) to an in_addr struct, and inet_ntop() for the reverse conversion (see the Java analogue sketched after this list)
  • Private networks and Network Address Translation (NAT) - for example, when you are connected to the internet from home via a router, your IP address as shown by ipconfig is different from your IP address on this site http://www.whatismyip.com/. Also, in theory there can be so many addresses with IPv6 that NAT won't be needed
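Beej's guide is all C, but just to ground the inet_pton()/inet_ntop() idea in Java-land, here is a rough analogue (my sketch, not from the guide):

import java.net.InetAddress;
import java.util.Arrays;

public class AddressDemo {
    public static void main(String[] args) throws Exception {
        // Parsing a literal does not hit DNS; getAddress() returns the raw bytes
        // in network (big-endian) byte order, like the buffer inet_pton() fills in.
        InetAddress v4 = InetAddress.getByName("192.0.2.111");
        InetAddress v6 = InetAddress.getByName("2001:db8:c9d2:12::51");
        System.out.println(Arrays.toString(v4.getAddress())); // 4 raw bytes
        System.out.println(v6.getHostAddress());               // text form, like inet_ntop()
    }
}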

Monday, November 23, 2009

MIT OCW - Lecture 6 (Order Statistics, Median)

I took Lecture 6 on Sunday. This is what was covered:

  • What an order statistic is, i.e. finding the k-th smallest element in an array
  • Naive algorithm for doing it, i.e. sorting the array and returning the k-th element from it - Theta(n lg(n))
  • Algorithm using randomized divide and conquer (basically using quicksort's random partitioning method - see the sketch after this list)
  • Intuitive analysis of the running time of randomized partition for finding the order statistic: all cases where we split the array into any ratio other than 0:n-1 are Theta(n), and only the worst case, when we always pick a pivot such that the array splits 0:n-1, is Theta(n**2)
  • Formal analysis of the running time of the above algorithm using indicator random variables and the substitution method
  • To overcome the worst case of the above algorithm, another algorithm, by Blum, Floyd, Pratt, Rivest and Tarjan, was covered
  • The above algorithm basically involves splitting the input array into n/5 groups of 5 elements, finding the median of each group and then recursively finding the median of these medians
  • The above algorithm was covered at a high level and a proof was provided of why it would always run in Theta(n) time, but the important part is that the constant is so high that this may not be a practical algorithm
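Here is a quick Java sketch of the randomized divide and conquer version (my code, not the lecture's):

import java.util.Random;

public class RandomizedSelect {
    private static final Random RNG = new Random();

    // Returns the k-th smallest element of a[lo..hi] (k is 1-based).
    static int select(int[] a, int lo, int hi, int k) {
        if (lo == hi) return a[lo];
        int p = partition(a, lo, hi);
        int rank = p - lo + 1;                 // the pivot is the rank-th smallest in a[lo..hi]
        if (k == rank) return a[p];
        if (k < rank) return select(a, lo, p - 1, k);
        return select(a, p + 1, hi, k - rank);
    }

    // Quicksort's partition with a random pivot, so no fixed input is the worst case
    static int partition(int[] a, int lo, int hi) {
        swap(a, lo + RNG.nextInt(hi - lo + 1), hi);
        int pivot = a[hi], i = lo - 1;
        for (int j = lo; j < hi; j++)
            if (a[j] <= pivot) swap(a, ++i, j);
        swap(a, i + 1, hi);
        return i + 1;
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

    public static void main(String[] args) {
        int[] a = {9, 2, 7, 4, 6, 1};
        System.out.println(select(a, 0, a.length - 1, 3)); // prints 4, the 3rd smallest
    }
}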
Other than the above lecture, I also cleaned up this blog a little by changing the URL and the blog template, editing a few blog entries with no title, etc. While in the cleaning-up mood, I also cleaned up my home directory to group all logically related projects/files into one parent folder. Then I played around with C a little and I think I got back 80% of the C I knew 8 years back; pointer arithmetic funness is something I probably need to spend a little more time on to get back.

Thursday, November 19, 2009

MIT OCW - Lecture 5 (Linear-time Sorting: Lower Bounds, Counting Sort, Radix Sort)

Today morning I took the 5th Lecture, which talked about decision trees and linear-time sorting algorithms. Specifically it covered:

  • Started by reviewing how fast the already covered sorting algorithms are, i.e. Merge, Insertion, Quick and Heap Sort (which is not covered in the lectures, but Wikipedia has good information on it)
  • Discussed what a computation model is
  • Comparison sorting model - all sorts which use comparison operators for sorting
  • What a decision tree model is
  • How we can represent comparison-based sorting algorithms in the decision tree model
  • Proof using the decision tree model that all comparison sorts have a lower bound of at least n*lg(n)
  • Merge Sort, Heap Sort and randomized QuickSort are asymptotically optimal comparison-model sorting algorithms
  • Counting Sort for sorting in linear time - O(n+k), where n is the input size and k is the range of the key values (see the sketch after this list)
  • Limitations of Counting Sort, i.e. k needs to be fairly small, else it will need a very large counter array
  • What a stable sorting algorithm is - it keeps the relative order of equal elements in the output the same as in the input array
  • Radix Sort
  • Run time analysis of Radix Sort and also how to optimally break the input integer into digits, i.e. with r = lg(n)-bit digits each pass is O(n)
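A little Java sketch of Counting Sort (my code, assuming the keys are ints in 0..k-1):

public class CountingSort {
    // Stable counting sort for keys in 0..k-1; Theta(n + k) time and extra space.
    static int[] countingSort(int[] a, int k) {
        int[] count = new int[k];
        for (int x : a) count[x]++;                           // histogram of key values
        for (int i = 1; i < k; i++) count[i] += count[i - 1]; // prefix sums: count[v] = #elements <= v
        int[] out = new int[a.length];
        for (int i = a.length - 1; i >= 0; i--)               // scan backwards to keep it stable
            out[--count[a[i]]] = a[i];
        return out;
    }

    public static void main(String[] args) {
        int[] sorted = countingSort(new int[] {4, 1, 3, 4, 3}, 5);
        System.out.println(java.util.Arrays.toString(sorted)); // [1, 3, 3, 4, 4]
    }
}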
Then I got a little bored of all the theory, so I went ahead and installed Go. Go uses Mercurial for version control. I have used CVS, use ClearCase now at work and use Git for GitHub (still know just the basics), and hopefully I will get to a point with Go where I can compare Mercurial with the other version control systems. Always fun to compare different designs for the same problem.

I have been in the object-oriented Java world so long that I am yet to digest the C-ish syntax of Go, and unfortunately all its speed is lost on my computer because I run Ubuntu on VMware and that is sloooooooooooow.

Wednesday, November 18, 2009

MIT OCW - Lecture 4 (Quicksort and Randomized Algorithms)

I completed the 4th Lecture today morning, which was on QuickSort and randomized QuickSort. Here is what was covered:

  • How QuickSort fits into the divide and conquer paradigm
  • QuickSort pseudocode and algorithm explanation
  • Time analysis of the partition method, i.e. Theta(n)
  • Worst case time analysis of QuickSort, i.e. when the input is sorted or reverse sorted (Theta(n**2))
  • Best case time analysis of QuickSort, when the pivot always splits the array into two equal halves (Theta(n lg(n)))
  • Analysis of the average case time of QuickSort, when the pivot splits the array 1/10 : 9/10 (Theta(n lg(n)))
  • Time analysis of QuickSort when we alternate between best case and worst case for each recursive partition (Theta(n lg(n)))
  • Randomized QuickSort - how to overcome the worst case by randomly picking a pivot, so that the running time is independent of the input array's order
  • Time analysis of randomized QuickSort (Theta(n lg(n)))
This lecture expects a background in the basics of probability and random variables (the proof for the time analysis of randomized QuickSort uses indicator random variables).

All in all, QuickSort is one of the best practical algorithms available for sorting. Of course it may need to be tuned a little on a case-by-case basis, e.g. if you expect sorted inputs at times, use randomized QuickSort; if the partition array size is, say, less than 5 elements, use some other algorithm to sort it instead of recursing all the way down to one element; etc.

Here is a basic implementation of QuickSort in Java - a minimal randomized sketch along the lecture's lines, not tuned at all:
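import java.util.Random;

public class QuickSort {
    private static final Random RNG = new Random();

    public static void sort(int[] a) {
        sort(a, 0, a.length - 1);
    }

    private static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;            // 0 or 1 elements: already sorted
        int p = partition(a, lo, hi);
        sort(a, lo, p - 1);
        sort(a, p + 1, hi);
    }

    // Partition with a random pivot, so sorted input is no longer the worst case
    private static int partition(int[] a, int lo, int hi) {
        swap(a, lo + RNG.nextInt(hi - lo + 1), hi);
        int pivot = a[hi], i = lo - 1;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) swap(a, ++i, j);
        swap(a, i + 1, hi);
        return i + 1;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }

    public static void main(String[] args) {
        int[] a = {5, 2, 8, 1, 9, 3};
        sort(a);
        System.out.println(java.util.Arrays.toString(a)); // [1, 2, 3, 5, 8, 9]
    }
}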

Sunday, November 15, 2009

MIT OCW - Lecture 2 (Asymptotic Notation, Recurrences: Substitution, Master Method) and Lecture 3 (Divide and Conquer: Strassen, Fibonacci)

So it was a rather productive weekend.

On Saturday, I finished MIT's Lecture 2, which covered the subjects below. It was fairly mathematical.
  1. Asymptotic notation: Big O, Big Omega, Big Theta, small omega, small o notations
  2. Asymptotic analysis of recursive functions
  • Substitution method
  • Recursion tree method
  • Master Theorem method

I also read the basics of arithmetic series, geometric series and logarithms.

I implemented a basic Merge Sort.

On Sunday I took Lecture 3 from MIT's Open CourseWare, which covered Divide and Conquer: Strassen, Fibonacci and polynomial multiplication. This lecture covered a lot of neat ways of doing things, like computing Fibonacci numbers and powering a number. Specifically it covered:
  1. Recap of Master Theorem
  2. What is Divide and Conquer (recursion)
  3. Running time of Merge Sort using Master Theorem
  4. Running time of Binary Search
  5. Powering a number (naive way and divide and conquer way - see the sketch after this list)
  6. Fibonacci sequence (naive recursion, bottom up, naive recursive squaring (using the golden ratio), recursive squaring (using matrix multiplication))
  7. Matrix multiplication - divide and conquer using Strassen's method
  8. Embedding a binary tree on a chip (a naive way which needs an area of n*lg(n), and another method (derived using the Master Theorem) which needs an area of n)
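A quick Java sketch of the two ways of powering a number (my code, not the lecture's):

public class Power {
    // Naive way: n-1 multiplications, Theta(n)
    static long naivePower(long x, int n) {
        long result = 1;
        for (int i = 0; i < n; i++) result *= x;
        return result;
    }

    // Divide and conquer: x^n = (x^(n/2))^2, times x if n is odd,
    // so T(n) = T(n/2) + Theta(1) = Theta(lg n)
    static long power(long x, int n) {
        if (n == 0) return 1;
        long half = power(x, n / 2);
        return (n % 2 == 0) ? half * half : half * half * x;
    }

    public static void main(String[] args) {
        System.out.println(power(3, 13) + " = " + naivePower(3, 13)); // both print 1594323
    }
}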
I also implemented a naive solution to this game called Drench, which my friend showed me on Friday. The game basically involves coloring an n*n matrix following a simple set of rules. Here is the implementation. For a 14*14 matrix (which is what the online game Drench has), this solution solved it on average in about 20 to 22 colorings. I use a random number generator to generate the base matrix. I have not tested it thoroughly yet and the program has a lot of scope for optimization.

I want to finish chapter 4 of the Introduction to Algorithms book (which is recommended for the course), which covers the substitution method, recursion tree method and Master Theorem, before I proceed further with the lectures. That way I will have a better understanding of the math involved, which will help me understand the other lectures better. Hopefully I can do that tomorrow.

Friday, November 13, 2009

DB2 V9 SYSIBMADM views

If you are using DB2 V9 or later and want lots of nice performance monitoring information - such as long-running SQL and how much time it is taking, how your buffer pools are performing, lock snapshots etc. - there are a bunch of views in the SYSIBMADM schema which make it easy to write scripts to monitor your database for any offending SQL.

Check out Table 343 in the link below:

http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp?topic=/com.ibm.db2.udb.admin.doc/doc/r0023485.htm

MIT OCW - Lecture 1 (Analysis of Algorithms, Insertion Sort, Merge Sort)

I took the first lecture on Introduction to Algorithms from MIT OpenCourseWare today morning. I had already completed two lectures, one on Introduction to Algorithms and the second on Stacks, from the National Programme on Technology Enhanced Learning of India's YouTube channel.

I just wanted to compare the two before deciding which one I would take fully. I think I will go with MIT's lectures because:
  • I found the explanation of how the algorithms work more insightful on the MIT Lectures.
  • The IIT lectures use PowerPoint slides whereas the MIT guy uses a blackboard and writes things out during the class. I liked the second approach better because it gives the students time to think, and the blackboard is so free-form that it is easy to explain things much better using it
  • There are already lecture notes people have taken and posted on their blogs which I can refer to
  • The MIT lectures seem to have more stress on math (which is hard for me but good for me)
  • I already have the book the MIT course recommends to use for this course
But since the lectures will be using a blackboard, I think it is better to use pen and paper than Notepad for the notes. So I will not be posting the notes on the blog; I'll still put the code I write in my GitHub repo.

Today I tried to capture at least some of the lecture using my notepad, and here it is:
Define Algorithm Analysis
Talked about things in a software program that are more important than performance
Insertion Sort Algorithm
The running time of an algorithm depends on:
a. type of input (whether fully sorted or partially sorted)
b. size of input
c. we want upper bounds on the time taken most of the time

Kinds of analysis
Worst case (usually)
T(n) = max time on any input of size n
Average case (sometimes)
T(n) = expected time over all inputs of size n
expected time is the time for every input multiplied by the probability that the input will occur,
hence for this we need the statistical distribution of inputs (for example, all inputs equally likely)
Best case (bogus)
you can cheat by showing a slow algorithm as good on some input, where on the vast majority of inputs it is probably very bad

What is insertion sort's running time?
Depends on the computer
- relative speed (compare two algorithms on the same machine)
- absolute speed (compare two algorithms on different machines)

Asymptotic Analysis
a. Ignore machine dependent constants
b. Look at the growth of the running time with respect to the input size (instead of looking at the actual running time)

Asymptotic notation
Theta notation - a. Drop lower order terms
b. Ignore leading constants

Analysis of Insertion Sort
Worst case analysis
T(n) = Summation (j = 2 to n) Theta(j) = Theta(n**2) (arithmetic series)
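(A quick sketch of the algorithm being analyzed - my code, not the lecture's. The worst case is reverse-sorted input, where the inner loop runs j times, giving the Theta(n**2) summation above.)

public class InsertionSort {
    static void insertionSort(int[] a) {
        for (int j = 1; j < a.length; j++) {
            int key = a[j];
            int i = j - 1;
            while (i >= 0 && a[i] > key) { // shift everything bigger than key one slot right
                a[i + 1] = a[i];
                i--;
            }
            a[i + 1] = key;                // drop key into the hole
        }
    }
}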

Merge Sort A[1..n]

1 if n = 1, done. Time = Theta(1)
2 Recursively sort A[1..n/2] and A[n/2+1..n]. Time = 2*T(n/2)
3 Merge the two sorted lists. Time = Theta(n)

Merge sort time
if n = 1 (array is already sorted), T(n) = Theta(1)
if n > 1, T(n) = 2*T(n/2) + Theta(n)

Merge sort - Theta(n lg(n)) - the analysis was done graphically on the blackboard (recursion tree)
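A minimal sketch of the three steps above (my code, assuming int arrays):

import java.util.Arrays;

public class MergeSort {
    static int[] mergeSort(int[] a) {
        if (a.length <= 1) return a;                                   // step 1: Theta(1)
        int mid = a.length / 2;
        int[] left = mergeSort(Arrays.copyOfRange(a, 0, mid));         // step 2: 2*T(n/2)
        int[] right = mergeSort(Arrays.copyOfRange(a, mid, a.length));
        int[] out = new int[a.length];                                 // step 3: merge in Theta(n)
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length)
            out[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
        while (i < left.length) out[k++] = left[i++];
        while (j < right.length) out[k++] = right[j++];
        return out;
    }
}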

Thursday, November 12, 2009

my github repository

So I installed Eclipse and Git today on my Ubuntu on VMware Player.

Step 1 -> I installed Git following the instructions in this link http://chrisolsen.org/2008/03/10/installing-git-on-ubuntu/
Step 2 -> I learnt the quick basics of the git commands from here http://www.spheredev.org/wiki/Git_for_the_lazy
Step 3 -> I installed Eclipse, which was straightforward: sudo apt-get install eclipse
Step 4 -> I opened a micro account on github.com
Step 5 -> I created my private/public SSH key pair and added the public key to my GitHub account using these instructions http://help.github.com/linux-key-setup/

And here is my github repository http://github.com/bgurupra/gpb_play_projs

I have two folders in it: the map_reduce folder has the basic Java implementation of the MapReduce idea using threads, and the algorithms folder has the implementations of whatever algorithms I do based on the algorithm lectures.

Wednesday, November 11, 2009

Git, VMware and Ubuntu!

So I started reading up a little bit on Git today, and hopefully I'll soon get a GitHub account. But before that I wanted to set up Ubuntu on the VMware Player, which I got done. Now I need to set up Eclipse and Git and all the good stuff on my Ubuntu.

Here is an old talk on Git by Linus and also the intro to Git on github.com, but the gist I got so far is:
  • Git keeps a copy of ALL the files and ALL the versions locally, unlike your other version control things like CVS, SVN and ClearCase which keep only the latest version locally
  • Git is a distributed version control system, so it can have multiple servers instead of just one centralized server
  • In Git each person can have as many branches as they want and work on them. Since merging is easy, you can have a "main" branch and merge all verified changes into that branch
  • This does not mean Git takes a whole lotta space; it is still probably less than what Subversion takes to keep just the latest version of the files
  • Since Git keeps everything on your disk, it is super duper fast
  • Since Git keeps everything on your disk, you can do pretty much everything other than push and pull while you are offline
  • Branching and merging are very fast

I would have liked to have at least completed my Ubuntu setup and Git setup today. But dumb fuckers and dumb-fuck processes of corporate IT spoiled my mood a little, and then I had a lengthy, lengthy argument with the missus which took up a lot of my time :-(

Summary of all the MIT Introduction to Algorithms lectures by Peteris Krumins

Wow, somebody has done such an awesome, awesome job of summarizing it:

Summary of all the MIT Introduction to Algorithms lectures

Today I'll probably take MIT's first lecture and see how it compares to the one I am taking.

Stack implementation - NPTEL - Lecture 2 (Stacks)

Below is the implementation of a simple stack based on the Stacks lecture - a straightforward implementation in Java which supports both the growth strategy and the tight strategy for increasing the stack size.


Stack - Interface with the stack methods
MyStack - Concrete implementation
StackOverFlowException - Thrown when we run out of memory
StackEmptyException - Thrown when the user calls top or pop when the stack is empty
StockSpanAlgorithm - This class implements calculating the stock span (the javadoc describes what the problem is) using both a trivial implementation (O(n**2)) and using stacks (O(n))
Utils - Utils class which prints the array contents to the output

I am thinking that instead of using Pastie to post my code, I am better off using GitHub. Let's see how that goes.

Tuesday, November 10, 2009

NPTEL - Lecture 2 (Stacks)

So on Day 2 I sat through the Stacks lecture, and here are the notes.

Stacks

Abstract Data Type - Here is the fancy definition: an ADT is a mathematically specified entity that defines a set of its instances with:

A specific interface - a collection of signatures of operations that can be invoked on an instance
A set of axioms (preconditions and postconditions) that define the semantics of the operations (i.e. what the operations do to the instance of the ADT, not how)

So in terms of Java, an ADT is pretty much like an interface with decent JavaDoc.

There are three types of operations:
Construction (a constructor in Java)
Access functions (something like a getter(), list())
Manipulation functions (something like a setter(), put(), add())


Stack
A LIFO datatype

Operations on a stack: new - creates it, push - pushes, pop - pops, top - just reads the topmost element without removing it, size - gives the size, isEmpty - checks if it is empty

Formally, the axioms are defined like this, for example:
Pop(Push(S,v)) = S
Top(Push(S,v)) = v

The Stack interface in Java and some basics of exceptions were covered

Creating an array-based stack in Java and its implementation was shown


All methods of Stack run in O(1) - so a stack is very efficient

Example of calculating the stock span: if you have an array of daily stock prices, create another array such that for each element in the stock price array you have a corresponding element for the span of that day. The span is defined as the difference between the current day's index and the index of the last day the stock price was greater than the current day's. For example, say the stock prices varied like {10, 3, 1, 8, 3, 2}: the span for the day the price was 2 is 1, and for 8 it is 3.

1) First approach using a loop within a loop, so the order was O(n**2)
2) More efficient stack approach where the order is O(n) - see the sketch below
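Something like this in Java (my sketch, not the lecture's code) - it is O(n) because each index is pushed and popped at most once:

import java.util.ArrayDeque;
import java.util.Deque;

public class StockSpan {
    static int[] stockSpan(int[] prices) {
        int[] span = new int[prices.length];
        Deque<Integer> stack = new ArrayDeque<Integer>(); // indices of days not yet dominated
        for (int i = 0; i < prices.length; i++) {
            while (!stack.isEmpty() && prices[stack.peek()] <= prices[i])
                stack.pop();                              // discard days dominated by today's price
            span[i] = stack.isEmpty() ? i + 1 : i - stack.peek();
            stack.push(i);
        }
        return span;
    }

    public static void main(String[] args) {
        int[] span = stockSpan(new int[] {10, 3, 1, 8, 3, 2});
        System.out.println(java.util.Arrays.toString(span)); // [1, 1, 1, 3, 1, 1]
    }
}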


A growable array-based stack: every time the stack is full you increase the size
1) Tight Strategy - add a constant c to the current max size and create a new stack - O(n**2/c)
2) Growth Strategy - double the size of the stack every time it is full - O(n)

So the growth strategy is the better option. A quick sketch of it (my code, just the push/pop core):
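public class GrowableStack {
    private int[] elems = new int[4];
    private int size = 0;

    void push(int v) {
        if (size == elems.length) {               // full: double instead of adding a constant,
            int[] bigger = new int[elems.length * 2]; // so each element is copied O(1) times on average
            System.arraycopy(elems, 0, bigger, 0, size);
            elems = bigger;
        }
        elems[size++] = v;
    }

    int pop() {
        if (size == 0) throw new RuntimeException("stack empty");
        return elems[--size];
    }
}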

Monday, November 9, 2009

Other 3 Lectures on Cluster Computing and MapReduce from Google + NPTEL - Lecture 1 (Introduction to Algorithms)

So today I took the last three lectures as well, from Google Cluster Computing and MapReduce.

The third one, on distributed file systems, talked about the Google File System and was really good. It was not at all hard to understand, and it was kind of amazing in the sense that with ideas that are not even that hard to understand, you can actually manage petabytes of data. GFS isn't your general-purpose file system, though, but a file system optimized to store a few million very large files (the block size is 64 MB!!), optimized for reading large chunks of data in one shot and for appending data at the end of a file.

The other two lectures, on clustering algorithms and graph algorithms, were a little hard for me to appreciate because I had no background in them whatsoever.

So I started off by hunting for videos on YouTube that start with algorithm basics (I do have a book on the same subject, but watching online lectures seems to be a much better option for me). Guess what, I found a gold mine of lectures from a professor at the Indian Institute of Technology in Delhi; here is the link.

I'll try to listen to one lecture every day and post the class notes here. So today's lecture was on "Introduction to Algorithms", and here are the class notes I took.

Data Structures and Algorithms
  • What is an Algorithm?
  • What is a good algorithm? - Small running time and low memory use

First Sorting Algorithm - Insertion Sort

Analysis of running time - At the very basic level, there are some fundamental operations, for example comparison operations (>, <, ==), arithmetic operations (+, -, *, /) and logical operations (&&, ||). So the run time is just the number of times each fundamental operation is executed for a given algorithm, multiplied by the time taken for each of these fundamental operations.

There are best case times, worst case times and average case times. Typically you would want to consider the worst case because that is the upper bound. Secondly, the average case is typically as bad as the worst case, and the average case is difficult to compute as well.

Asymptotic analysis: This method simplifies the analysis of running time by getting rid of implementation-specific details like what hardware, what software etc. (The obvious way to measure the run time would be to implement the algorithm and measure the time taken to run the program, but that is hard and not feasible for obvious reasons: the input size can vary all over the place, and the run time will vary based on OS load, hardware etc.)

Asymptotic analysis just captures the essence of how the algorithm's running time increases with the input size.

Big Oh notation: f(n) = O(g(n)) if there exist constants c and n0 such that f(n) <= c*g(n) for all n >= n0.
For example: f(n) = 2n+6 and g(n) = n, with c = 4 and n0 = 3.

Big Oh notation is used for worst case analysis


But really, outside the fancy definition, the simple rule is: just drop the lower order terms and the constants from the function.

For example, if the run time is the function 50n*log(n), then in Big Oh notation you would simply represent it as O(n log(n)).

So if you are running a loop within a loop over an input of size n, your algorithmic efficiency is O(n (outer loop) * n (inner loop)) = O(n**2); if you have a loop within a loop within a loop it is O(n**3), etc.

O(log(n)) is better than O(n), which is better than O(n**2), which is better than O(n**k), which is better than O(a**n).

There are also big Omega (lower bound) and big Theta (tight bound) - but these notations aren't as widely used as Big Oh.

dirty basic MapReduce implementation(well not even an implementation)

So, just to understand the concept of MapReduce better, I tried to create some Java code which uses the MapReduce 'pattern' to solve the word count problem (i.e. counting the number of times each unique word occurs in a given set of documents). Of course, this has been tested only on a sample of three basic text files, with no error checking whatsoever, and it was implemented using the first way that came to mind.

FileKeyValue.java - File name and file value (the list of words in the file)
WordKeyValue.java - Word name and word count
MapReducer.java - This does the bulk of the work.
  1. It creates a list of FileKeyValue objects for all the input files.
  2. Then threads out each FileKeyValue object to be processed in a Mapper.
  3. Waits for all the Mappers to finish.
  4. Then sorts the output of all the Mappers by the output key (i.e. the word) and consolidates the output values from all the Mappers for that key (i.e. creates the intermediate list).
  5. Threads out each unique (output key, intermediate list) combination to a Reducer for reduction.
  6. Waits for all reducers to complete.
  7. Prints out the results
Mapper.java - Breaks a FileKeyValue object into a list of WordKeyValue objects. Each Mapper runs as a separate thread.
Reducer.java - Sums up the intermediate list values for a given word and passes the result back to the MapReducer class. Each Reducer runs as a separate thread. (A compressed sketch of the whole flow is below.)
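To make the flow concrete, here is a compressed sketch of the same idea (hypothetical names, nowhere near as structured as the classes above): each document is mapped to word counts in its own thread, the results are grouped by word (a TreeMap stands in for the sort step), and then each word's list is reduced to a sum.

import java.util.*;
import java.util.concurrent.*;

public class TinyMapReduce {
    public static void main(String[] args) throws Exception {
        List<String> docs = Arrays.asList("the cat sat", "the dog sat");
        Map<String, List<Integer>> intermediate = new TreeMap<String, List<Integer>>();
        ExecutorService pool = Executors.newFixedThreadPool(docs.size());
        List<Future<Map<String, Integer>>> mapped = new ArrayList<Future<Map<String, Integer>>>();
        for (final String doc : docs)                       // map phase: one thread per document
            mapped.add(pool.submit(new Callable<Map<String, Integer>>() {
                public Map<String, Integer> call() {
                    Map<String, Integer> counts = new HashMap<String, Integer>();
                    for (String w : doc.split("\\s+"))
                        counts.put(w, counts.containsKey(w) ? counts.get(w) + 1 : 1);
                    return counts;
                }
            }));
        for (Future<Map<String, Integer>> f : mapped)       // wait for all Mappers, then group by word
            for (Map.Entry<String, Integer> e : f.get().entrySet()) {
                if (!intermediate.containsKey(e.getKey()))
                    intermediate.put(e.getKey(), new ArrayList<Integer>());
                intermediate.get(e.getKey()).add(e.getValue());
            }
        for (Map.Entry<String, List<Integer>> e : intermediate.entrySet()) { // reduce phase
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            System.out.println(e.getKey() + " = " + sum);
        }
        pool.shutdown();
    }
}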

Sunday, November 8, 2009

Lecture on Cluster Computing and MapReduce from Google

For a while I was curious about what exactly MapReduce was, and if you are curious too, these lectures from Google will help:

http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html

So far I have taken the first and the second one. Both are approximately an hour long.

The first one gives an overview of distributed computing and its history:
  • Difference between parallel computing and distributed computing
  • Synchronization primitives and Semaphores
  • Condition variables
  • Fundamentals of Networking (what is a port, TCP, IP etc)
The second lecture goes into the details of MapReduce:
  • Overview of Functional Programming
  • What is Map and Fold in the context of Functional Programming
  • Overview of MapReduce with the example of a word count algorithm over a bunch of files
I started writing a very basic MapReduce implementation in Java (using regular threads, of course, to parallelize the Mappers and Reducers) for the word count algorithm. I have it about 50% completed, and hopefully I will complete it all tomorrow and post the code out here.