ECS110 Lecture Notes for Monday, November 20, 1995

Professor Rogaway conducting

Lecture 21

Scribe of the day: Gabriel Moreno




Today:				Handout:
  finish Hashing			Assign. #4
  universal hashing			Assign. #4 supplement
  graph representations

Reading:
 6->6.2.4
 6.3
 6.4
_____________________________________________________________________

Recall:
  When we are given a set of keys "K", we can find a function 'h'
  on universe "U" such that for all k != k' in "K", h(k) != h(k').
  Recall that such an 'h' is called a perfect hashing function.
 We know: You can find a linear-time computable 'h' in O(|K|) time.
  However, we don't always know ahead of time what keys will be used.
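 The idea can be illustrated with a small sketch (our own code, not the
 O(|K|) construction the lecture refers to): for a fixed key set "K",
 keep drawing random members of a simple family until one happens to be
 collision-free on "K".  The prime P, the family h(k) = ((a*k+b) mod P)
 mod m, and the sample keys are all our assumptions.

```python
import random

# Sketch only: repeatedly draw h(k) = ((a*k + b) % P) % m and keep the
# first h that is perfect (injective) on the fixed key set K.
P = 10007  # an assumed prime, larger than any key we use below

def find_perfect_hash(keys, m):
    while True:
        a = random.randrange(1, P)
        b = random.randrange(P)
        h = lambda k, a=a, b=b: ((a * k + b) % P) % m
        if len({h(k) for k in keys}) == len(keys):  # injective on K?
            return h

K = [12, 47, 88, 305, 999]
h = find_perfect_hash(K, m=25)
print(sorted({h(k) for k in K}))  # 5 distinct slots in [0, 24]
```

 Since "K" is fixed and known, the random search terminates quickly with
 high probability; the point of the rest of the lecture is what to do
 when "K" is NOT known in advance.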

 Recall: When we talked about a compiler identifier-dictionary, we
         expected to have many underscores, repeated prefixes, etc.
        We can't always expect to anticipate what identifiers a user
        might choose to use.
         Ideally, we'd like to find a hash function which distributes 
       every "K" nicely.
    Solution:  We need not one single hash function, but a 'family'
              of hash functions.
    Definition: "H"(denoted by script-H) is a set {h: U-> [0,...,m-1]}
              And when your program runs, choose h from "H" at random.
 

  This situation is similar to Quicksort. Recall, when we simply 
    pivoted on the first element, we usually had OK results, but there  
    are cases where quicksort ran very slowly.  Similarly, in this 
    case we have the following:

 A family of hash functions:
             For every "K" the hashing is usually good.
			  vs.
 A single hash function:
		For almost every "K" the hashing is good.

 Example: suppose "H" is a family of hash functions mapping keys made up
          of r 8-bit bytes into the range [0...m-1], where m = p is a
          prime (p > 255, so every byte value fits in [0,...,p-1]).
       To choose a random h from "H" (recall h is a particular hash
        function, "H" represents the family), choose r random numbers
        between [0...p-1]:
      -> a1, a2, ..., a_r from [0,...,p-1]
    We define h_{a1,...,ar}(k1,...,kr) = (a1*k1 + a2*k2 + ... + ar*kr) mod p
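 The family just defined can be sketched in a few lines of Python (our
 own code, not from the notes; the prime 257 and the key length are
 assumptions for the sake of the example):

```python
import random

# Sketch of the universal family H: keys are vectors of r bytes, the
# coefficients a1, ..., ar are drawn at random when the program starts,
# and h(k1,...,kr) = (a1*k1 + ... + ar*kr) mod p.
P = 257  # smallest prime > 255, so every byte value fits in [0, p-1]

def make_hash(r):
    """Pick a random member h_{a1,...,ar} of the family H."""
    a = [random.randrange(P) for _ in range(r)]
    def h(key):                 # key is a tuple of r bytes
        assert len(key) == r
        return sum(ai * ki for ai, ki in zip(a, key)) % P
    return h

h = make_hash(4)
print(h((104, 97, 115, 104)))   # the bytes of "hash"; some value in [0, 256]
```

 Note that the randomness is in the choice of h, not in the keys: once
 a1,...,ar are fixed, h is an ordinary deterministic function.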

Proposition: for all k != k' in "U", the probability (over the random
             choice of h) that h(k) = h(k') is (1/p).
	     This is a measure of the "goodness" of "H".

Proof:  left as an exercise.



Let's consider another application of universal hash functions (we've
already considered the compiler example): PROGRAM CHECKING, due to
Manuel Blum, 1986.

The idea behind program checking is simple, and we learned one example
of it a long time ago.  Recall when you first learned how to do long
division: the teacher probably had you go back and multiply the quotient
by the divisor to check that you got back the dividend.  Similarly,
program checking says that after writing a huge chunk of code, why not
write a few more lines, i.e. a program checker, that verifies that the
answers you get are correct?

Example: Imagine we are sorting a set "X" = {x1, x2, ..., xn} into a
sorted set "Y" = {y1, y2, ..., yn}, where y1 <= y2 <= ... <= yn and
{x1, ..., xn} = {y1, ..., yn} as multi-sets.  We can use universal
hashing.

Idea:
  1. Check y1 <= y2 <= ... <= yn.
       If not ----> the sorting program is BUGGY!
  2. Check that the set {y1, ..., yn} has the correct number of elements.
       If not ----> BUGGY!
  3. Choose a random h from "H" such that h hashes all elements of the
     universe into 32-bit strings.  Compute the sums over all elements
     of "X" and "Y":
         h(x1) + h(x2) + ... + h(xn)   and   h(y1) + h(y2) + ... + h(yn)
     If they are not equal, then the program is BUGGY.
     Otherwise, declare the output correct.

Proposition: Suppose "H" is a random family of functions, mapping from
some domain D into 32-bit words.  Then, if your program (the program
being checked) is correct, your program checker always answers
'correct'.  If your program mis-sorts some given input
"X" = x1, x2, ..., xn, the probability that your checker answers
"BUGGY" is greater than 1 - 2^(-32).

For example, suppose you feed your sorting program 2,3,5,2 and it
outputs 2,3,5,5.  The first two parts of the program checking algorithm
would not have returned BUGGY, since the output is non-decreasing and
has four elements.  The third part of the checker will ask:

    Is h(2) + h(3) + h(5) + h(2)  =  h(2) + h(3) + h(5) + h(5) ?

Clearly this will only be true if h(2) = h(5).
The odds of this happening are 2^(-32).

This ends hashing.  Note to students: You will have to understand
perfect hashing and universal hashing from the lectures, since these
topics are not covered in the text.
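The three-step sort checker described above can be sketched as follows
(our own Python sketch; it stands in for a random h from "H" by
memoizing a fresh random 32-bit value per element, which plays the role
of a random function from D into 32-bit words):

```python
import random

def check_sort(x, y):
    """Blum-style checker: was y a correct sort of x?"""
    # Step 1: output must be non-decreasing.
    if any(y[i] > y[i + 1] for i in range(len(y) - 1)):
        return "BUGGY"
    # Step 2: output must have the right number of elements.
    if len(x) != len(y):
        return "BUGGY"
    # Step 3: pick a random h into 32-bit words and compare hash sums.
    table = {}
    def h(k):
        if k not in table:
            table[k] = random.getrandbits(32)  # random map into 32-bit words
        return table[k]
    if sum(map(h, x)) != sum(map(h, y)):
        return "BUGGY"
    return "probably correct"

print(check_sort([2, 3, 5, 2], sorted([2, 3, 5, 2])))  # probably correct
print(check_sort([2, 3, 5, 2], [2, 3, 5, 5]))  # BUGGY, except with prob. 2^(-32)
```

Note the one-sided error, exactly as in the proposition: a correct sort
is never flagged, while a mis-sort slips through only when the hash
sums collide.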
GRAPH REPRESENTATIONS

    1 ---------------------2
                           |\
                           | \
                           |  \
             0             |   \
                           |    \
                           4-----3

In the computer, there are two main ways to represent the above graph:
an ADJACENCY MATRIX or an ADJACENCY LIST.

Let V be the set of vertices and E the set of edges.  Then G = (V,E),
n = |V|, m = |E|.

An adjacency matrix for the graph above would look like this:

     _0_1_2_3_4_
    0|0 0 0 0 0|
    1|0 0 1 0 0|
    2|0 1 0 1 1|
    3|0 0 1 0 1|
    4|0 0 1 1 0|

Note: the matrix is symmetric, because in an undirected graph a is
adjacent to b IFF b is adjacent to a.

Also: this implementation can represent multi-edges, simply by storing
the number of edges in the matrix entry instead of just a '1'.  It can
also represent self-loops by putting a '1' at position (i,i), where i
is an element of V.

Advantages: very fast look-up time -- O(1) to see if a particular edge
e is contained in the set of edges E.

On the other hand, this implementation grows very large:
space = O(n^2).  If the matrix is dense, then O(n^2) is OK; but if it
isn't, you're wasting a tremendous amount of space.

This brings us to our alternative, the adjacency list:

    0:             0|_|--->//
    1: 2           1|_|--->|2|--->//
    2: 1, 3, 4     2|_|--->|1|--->|3|--->|4|--->//
    3: 2, 4        3|_|--->|2|--->|4|--->//
    4: 2, 3        4|_|--->|2|--->|3|--->//

It takes O(d) time to find the d vertices adjacent to a given vertex.
This is the quickest time we could hope for; the matrix takes O(n) time
by comparison.  Space is O(n+m), vs. O(n^2) for the matrix.

On the down-side: it takes O(d) time to decide if a given edge e is in
the graph.
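Both representations can be built in a few lines (our own sketch; the
edge list below is read off the adjacency matrix for the graph above):

```python
# The 5-vertex graph from the notes: edges {1,2}, {2,3}, {2,4}, {3,4};
# vertex 0 is isolated.
n = 5
edges = [(1, 2), (2, 3), (2, 4), (3, 4)]

# Adjacency matrix: O(n^2) space, O(1) edge lookup.
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    matrix[u][v] = matrix[v][u] = 1   # undirected, so the matrix is symmetric

# Adjacency list: O(n + m) space, O(d) to list a vertex's d neighbors.
adj = [[] for _ in range(n)]
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

print(matrix[2][3])   # 1: O(1) membership test
print(adj[2])         # [1, 3, 4]: neighbors of vertex 2 in O(d) time
```

The trade-off from the notes shows up directly: `matrix[u][v]` answers
"is (u,v) an edge?" in O(1), while the list version must scan `adj[u]`,
but the lists use only O(n+m) space.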