Tuesday, November 24, 2009

MIT OCW - Lecture 7 (Hashing and Hash Functions)

This morning I watched the 7th lecture, which covered Hashing and Hash Functions. This was not as math intensive as the previous few lectures and was more intuition based (if there is such a thing - but basically what I mean is you could kinda see why things were good or bad without even having to do the math). Here is what was covered:

  • Symbol table problem that is encountered in compilers, and operations on the symbol table, i.e. insertion, deletion and search
  • Direct Access table - Just an array where you use the key to index into the array to find the value
  • What is "Hashing"?
  • How to resolve collisions by chaining
  • Runtime analysis for hashing with collisions resolved by chaining - worst case is Theta(n) and average case is Theta(1 + alpha), where alpha is the load factor (i.e. n/m, where n = number of elements in the set and m = number of slots in the hash table)
  • How to choose a hash function? - It should distribute keys uniformly, and regularity in the keys (for example, all keys being even numbers) should not affect the uniformity
  • Hash function using the Division Method => h(k) = k mod m. Don't pick an m that has a small divisor (for example 2), and don't pick m as a power of 2; typically pick a prime which is not too close to a power of 2 or 10
  • Hash function using the Multiplication Method => h(k) = (A*k mod 2**w) >> (w - r), where A is an odd integer between 2**(w-1) and 2**w and not too close to those bounds, m = 2**r, and w is the word length in bits of the computer.
  • The above method is fast since multiplication and right shifting are typically faster on a computer than division
  • Resolving collisions through open addressing, i.e. when you don't have storage for the pointers needed for chaining - you basically keep probing a sequence of slots (the slot depends on the probe number i as well as the key) until you find either an empty slot or the search key you are looking for
  • Linear Probing Strategy => h(k,i) = (h(k,0) + i) mod m. This method suffers from primary clustering, i.e. we basically search the hash table linearly one slot after another, so if we have a block of 10 occupied slots in one stretch, every time we have to check those 10 slots before moving on to the next empty slot
  • Double Hashing Strategy => h(k,i) = (h1(k) + i*h2(k)) mod m. Typically you pick m as a power of 2 and make h2(k) odd, so that h2(k) is relatively prime to m and the probe sequence can reach every slot
  • Analysis of runtime for open addressing => Expected number of probes <= 1/(1 - alpha), where alpha is the load factor (assuming uniform hashing). The lecture proves this theorem
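
To make the chaining idea concrete, here is a minimal sketch of a chained hash table in Python (the class and method names are mine, not from the lecture):

```python
class ChainedHashTable:
    """Hash table resolving collisions by chaining (minimal sketch)."""

    def __init__(self, m=8):
        self.m = m                           # number of slots
        self.slots = [[] for _ in range(m)]  # one chain (list) per slot

    def _h(self, key):
        return hash(key) % self.m            # any hash function would do here

    def insert(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                     # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        j = self._h(key)
        self.slots[j] = [(k, v) for (k, v) in self.slots[j] if k != key]

t = ChainedHashTable(m=4)
t.insert("x", 1)
t.insert("y", 2)
print(t.search("x"))   # 1
```

Search and delete walk the whole chain, which is why the worst case is Theta(n) when every key lands in one slot, but only Theta(1 + alpha) on average.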
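
The division method, and the warning about picking a bad m, fits in a couple of lines of Python (m = 701 below is just one example of a suitable prime, not a value from the lecture):

```python
def h_division(k, m):
    # Division method: h(k) = k mod m
    return k % m

# Bad choice: m = 8 is a power of 2, so only the low 3 bits of k matter,
# and every even key lands in an even slot.
print(all(h_division(k, 8) % 2 == 0 for k in range(0, 100, 2)))   # True

# Better: a prime not too close to a power of 2 or 10, e.g. m = 701.
print(h_division(1000003, 701))   # 377
```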
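
The multiplication method translates almost directly into code. In this Python sketch I assume a 64-bit word and r = 10 (so m = 2**10 = 1024 slots); the constant A is just one odd 64-bit value between 2**63 and 2**64, not something from the lecture:

```python
w = 64                         # word length in bits (assumption)
r = 10                         # m = 2**r = 1024 slots
A = 0x9E3779B97F4A7C15         # an odd integer with 2**63 < A < 2**64

def h_mult(k):
    # Multiply, keep only the low w bits (the "mod 2**w" part),
    # then right-shift to keep the top r of those bits.
    return ((A * k) & ((1 << w) - 1)) >> (w - r)

print(h_mult(123456789))       # some slot in [0, 1024)
```

On real hardware the "mod 2**w" is free (the multiply just wraps around in a w-bit register), which is why this beats a division.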
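
And here is a small open addressing sketch showing both probe sequences (again, the names and constants are mine; with m a power of 2, forcing h2(k) to be odd keeps it relatively prime to m):

```python
m = 16  # power of 2, so an odd step size reaches every slot

def h1(k):
    return k % m

def h2(k):
    return (k % (m - 1)) | 1         # force an odd step size

def probe_linear(k, i):
    return (h1(k) + i) % m

def probe_double(k, i):
    return (h1(k) + i * h2(k)) % m

def insert(table, k, probe):
    for i in range(m):               # try slots in the probe order
        j = probe(k, i)
        if table[j] is None:
            table[j] = k
            return j
    raise RuntimeError("table is full")

table = [None] * m
# 5, 21 and 37 all collide under h1 (each is 5 mod 16)
slots = [insert(table, k, probe_double) for k in (5, 21, 37)]
print(slots)   # [5, 12, 3]
```

Because the step size h2(k) depends on the key, colliding keys scatter to different slots instead of piling up next to each other, which is exactly the primary clustering problem that linear probing has.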
