ECS110 Lecture Notes for Wednesday, November 15, 1995

Professor Rogaway conducting

Lecture 20

Scribe of the day: Paula Barter




REMINDER!  Remember to delete the large files from program 3!!

Today's Lecture
-Hashing


Review: ADT Dictionary

  // Data: a set of elements S, each element having an associated
  // key, drawn from some universe U of keys.
  // No two elements have the same key.
  // Operations:
  void insert(Element i);   // adds i to S
  int in(Element i);        // returns 1 if i is in S, else returns 0
  void delete(Element i);   // removes i from S

Now incorporate the idea of hashing into this ADT. S is a set of keys
drawn from the universe U. h is the hash function mapping keys in S to
the hash table; each section of the table is called a slot. A collision
occurs when two different keys map to the same slot; collisions are
resolved by chaining (each slot holds a linked list of the keys that
hash there).

Dynamic Hashing: used when you don't know n (the number of items) in
advance. Whenever alpha (the load factor, n/m) exceeds 1, double m (the
size of the hash table) and reinsert everything. This is called
rehashing. Note: with each increase of m, a new hash function is needed.

Claim: Under the uniform hashing assumption, a sequence of n operations
takes expected Theta(n) time.

Example: Assume m = 4 and the following 4 inserts are called:
  insert("Fred")
  insert("Sam")
  insert("Alice")
  insert("Ali")
Now insert("Bob") is called. This would make the load factor greater
than 1, so the old hash table is rehashed (with m doubled to 8) under a
new hash function.

Example: initialize m = 100 and insert the first 400 items. Counting
the rehashes, the cost per insert averages out:
  inserts 1 - 101:   average cost 2   (insert 101 triggers a rehash, m -> 200)
  inserts 102 - 201: average cost 3   (insert 201 triggers a rehash, m -> 400)
  inserts 202 - 401: average cost 3   (insert 401 triggers a rehash, m -> 800)
  inserts 402 - 801: average cost 3

An accounting trick shows Theta(1) expected (amortized) time per
operation over a sequence of operations:
  Suppose the "real" cost of a "simple" insert is $1.00, and the cost
  of rehashing n items is $n.00. Charge $3.00 for every insert:
    - for a "simple" insert, use $1.00 to cover the cost and put the
      other $2.00 in savings;
    - for a rehash, pay out of savings.

Claim: There will always be enough money in savings to pay for the
rehash cost.
Proof: Preceding a rehash of n items there were n/2 "simple" inserts
(since the last rehash), each of which deposited $2.00. So there is $n
in savings, which is enough to cover the cost.
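The doubling scheme above can be sketched as a small C++ class. This is a minimal illustration only (the class and member names are my own, not from the lecture), assuming integer keys, chaining with std::list, and a simple modular hash:

```cpp
#include <list>
#include <vector>

// Minimal dictionary of integer keys with chaining and doubling.
// Illustrative sketch; names and details are not from the lecture.
class HashDict {
    std::vector<std::list<int>> table;
    std::size_t n = 0;  // number of stored keys

    std::size_t slot(int key) const {
        return static_cast<std::size_t>(key) % table.size();
    }

    // Rehash: double m and reinsert every key under the new hash function.
    void rehash() {
        std::vector<std::list<int>> old = std::move(table);
        table.assign(old.size() * 2, {});
        for (const auto& chain : old)
            for (int key : chain)
                table[slot(key)].push_back(key);
    }

public:
    explicit HashDict(std::size_t m = 4) : table(m) {}

    bool in(int key) const {
        for (int k : table[slot(key)])
            if (k == key) return true;
        return false;
    }

    void insert(int key) {
        if (in(key)) return;       // no two elements have the same key
        if (n + 1 > table.size())  // load factor alpha would exceed 1
            rehash();              // double m, reinsert with new hash
        table[slot(key)].push_back(key);
        ++n;
    }

    std::size_t size() const { return n; }
    std::size_t slots() const { return table.size(); }
};
```

Starting from m = 4, the fifth insert pushes the load factor past 1 and triggers the doubling rehash, just as insert("Bob") does in the example above.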
Here are three ways to make your h (hashing) function:

1. Division Method
   k is a key, viewed as an (enormous) integer -- especially common
   when k is a word.
     h(k) = k % m
   Choose m to be prime, but not a prime close to a power of 2.
   For example, m = 256 would depend only on the last byte of k;
   m = 701 is good for about 1000 items.

2. Multiplication Method
     h(k) = floor( m * frac(beta * k) )   // frac(x) = fractional part of x
   where beta is a "weird" constant,
   e.g. (sqrt(5) - 1) / 2 = 0.6180339887...

3. Folding
   Write k = x1 x2 x3 ... xt, where each piece satisfies |xi| = 16 bits.
     h(k) = (x1 + x2 + ... + xt) % m
   However, this is insensitive to the arrangement of the pieces.
   For example:
     x1x2x3x4   x2x4x1x3   x3x2x4x1
   all have the same value.

Note: simple hash functions increase speed.

Perfect Hashing (an arrangement of things so there are no collisions):
  - choose h such that no two keys from S collide
  - you need to know S in advance
  - you do know S, for example, in a compiler

Let's say a compiler has 60 reserved words. We are given a word and
need to decide whether it is a reserved word or an identifier.
  Given: a string s
  Return: (reserved word, the word itself) or (identifier, the word itself)

Three approaches:
1. Compare the word against the list of reserved words. Very slow
   (up to 60 comparisons).
2. Binary search tree. Better, but each lookup still costs comparisons
   proportional to the depth of the tree (up to n if it is unbalanced).
3. Find a perfect hash function. This is very fast: one probe, no
   collisions.
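The three construction methods can be written out directly. A short C++ sketch, assuming m = 701 and 16-bit pieces as in the examples above (function names are mine, not from the lecture):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// 1. Division method: h(k) = k % m, with m a prime not near a power of 2.
std::uint32_t div_hash(std::uint64_t k, std::uint32_t m = 701) {
    return static_cast<std::uint32_t>(k % m);
}

// 2. Multiplication method: h(k) = floor(m * frac(beta * k)), where
//    beta = (sqrt(5) - 1) / 2, the "weird" constant from the lecture.
//    (Doubles lose precision for very large k; this is a demo only.)
std::uint32_t mult_hash(std::uint64_t k, std::uint32_t m = 701) {
    const double beta = 0.6180339887;
    double frac = std::fmod(beta * static_cast<double>(k), 1.0);
    return static_cast<std::uint32_t>(m * frac);
}

// 3. Folding: split k into 16-bit pieces x1..xt and sum them mod m.
//    Addition is commutative, so rearranging the pieces cannot
//    change h(k) -- the weakness noted in the lecture.
std::uint32_t fold_hash(const std::vector<std::uint16_t>& pieces,
                        std::uint32_t m = 701) {
    std::uint32_t sum = 0;
    for (std::uint16_t x : pieces) sum += x;
    return sum % m;
}
```

For instance, fold_hash({1, 2, 3, 4}) and fold_hash({2, 4, 1, 3}) return the same slot, which is exactly the arrangement-insensitivity noted above.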