---------------------------------------------------------------------------
COMP 731 - Data Struc & Algorithms - Lecture 18 - Thursday, August 10, 2000
---------------------------------------------------------------------------

Today:  o  Explanation for Assignment #4
        o  Hashing

A *dictionary* is the following ADT:

//DATA: A collection of ITEMS, each item having some KEY
//      from a totally-ordered set.  Initially, there are no
//      items in the collection.  We assume that there is at
//      most one item with any given key.

//OPERATIONS:
   Insert(item i) - Insert item i into the dictionary.  We assume that
                    no item with key i.key already exists in the
                    dictionary.
   Find(key k)    - Return a pointer to the item with key k, if such an
                    item exists, and return NIL otherwise.
   Delete(key k)  - Delete the item with key k.  Undefined if there is
                    no such item.  Not all dictionaries support this
                    operation.

Basic idea:

   ----------------                    |----|
   |              |                   0|    |
   | UNIVERSE of  |                    |----|
   | Possible     |        h          1|    |
   | Keys, U      |   ---------->      |----|
   | (BIG)        |                   2|    |
   |              |                    |----|
   ----------------                   3|    |
                                       |----|
                                       |    |
                                       |----|
                                       |    |
                                       |----|
                                       |    |
                                       |----|
                                    m-1|    |
                                       |----|

Every key hashes under h into exactly one SLOT of the hash table.
The hash table has m slots, so  h : U -> [0..m-1]
It is common to choose m to be a prime number.

Collision: x different from x' but h(x) = h(x').
If |U| > m, then collisions are unavoidable.  But we want to make them rare.

Dealing with collisions:

(Method 1) Collision resolution by CHAINING.

Explain: each slot holds a linked list (a chain) of all the items whose
keys hash to that slot.  Then analysis.

             n      # of entries in the hash table
   alpha =  --- =  --------------------------------
             m      # of slots in the hash table

Back-of-envelope analysis:

Insert into the hash table: O(1) time (worst case; assumes that you know
       that the item is not already present).

Find - item IS in the dictionary:
       Have to do expected 1 + alpha/2 work.  Why?  Average chain length
       is n/m = alpha, and we have to go half-way down the chain.  The 1
       is for the constant amount of work (we assume) to compute h(x).
       So: O(1 + alpha) expected.
     - item is NOT in the dictionary:
       as above, but have to go to the end of the chain.
       O(1 + alpha) expected work.

Delete - expected 1 + alpha/2 work, as for a successful Find.
       So: O(1 + alpha) expected.

All of these assume the UNIFORM HASHING MODEL, which says:
    For purposes of analysis, we treat the hash function h as though
    it were a random function.

(Method 2) Collision resolution by OPEN ADDRESSING.

Explain: all items are stored in the table itself, at most one per slot;
on a collision, probe other slots.  Assume no deletions!

General approach: compute h_0(x), h_1(x), ... until you find an empty slot.
Ideally, h_0(x) ... h_{m-1}(x) should be a permutation of [0..m-1], so
that, as long as there is an empty slot, you will eventually find it.

   h_i(x) = (hash(x) + i) mod m      - linear probing
        Problem: CLUSTERING.  Keep the hash table less than half full
        (alpha < 0.5) so that clustering does not become a major problem.

   h_i(x) = (hash(x) + i^2) mod m    - quadratic probing
        Less severe clustering problem.

   ... various other methods ...

How to choose h?

Want h to be simple and fast to compute.
Depends a lot on the universe of keys, and on what sort of distribution
one is expecting over that universe.

Example: U is random 32-bit numbers.
    Then h(x) = x mod m is a good choice, where m is a power of 2, i.e.,
    h(x) = the low-order d bits of x, where 2^d = m.
    But this is a terrible choice if the keys are not random, and in
    particular if many likely inputs differ only in their high-order
    bits: all such inputs collide!

Example: U is a set of strings.  Let x = x[1] ... x[t].
    H(x) = (x[1] + ... + x[t]) mod m
    // will not work well if lots of inputs differ only in the order in
    // which their characters appear.  May also be slow if x is long.

    Often something very ad hoc, like
    H(x) = (t xor 16 x[1] xor 128 x[2] xor 512 x[t]) mod 1007
    will work well.

Perfect hashing on S: no collisions within the set S.  So only possible
if |S| <= m.  Easiest to find if m is much larger than |S|.

Example: hashing C++ reserved words using some ad hoc hash function as a
way to determine if a string is an identifier or a reserved word.
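
To make the two collision-resolution methods above concrete, here is a minimal sketch in Python (not part of the original lecture). The class names ChainedTable and LinearProbingTable, the default m = 11, and the use of Python's built-in hash() as the underlying hash function are all assumptions made for illustration. The open-addressing table follows the notes in supporting no deletions.

```python
class ChainedTable:
    """Dictionary ADT with collision resolution by chaining (illustrative)."""

    def __init__(self, m=11):
        self.m = m                            # number of slots (assumed default)
        self.slots = [[] for _ in range(m)]   # one chain (list) per slot

    def _h(self, key):
        # Assumption: Python's built-in hash() stands in for h.
        return hash(key) % self.m

    def insert(self, key, value):
        # As in the notes, assumes key is not already present: O(1).
        self.slots[self._h(key)].append((key, value))

    def find(self, key):
        # Walk the chain for this slot: O(1 + alpha) expected.
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None                           # plays the role of NIL

    def delete(self, key):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
        # The notes leave this case undefined; raising is one choice.
        raise KeyError(key)


class LinearProbingTable:
    """Open addressing with h_i(x) = (hash(x) + i) mod m; no deletions."""

    def __init__(self, m=11):
        self.m = m
        self.slots = [None] * m               # None marks an empty slot

    def _probe(self, key):
        # Yields h_0(x), ..., h_{m-1}(x): a permutation of [0..m-1].
        start = hash(key) % self.m
        for i in range(self.m):
            yield (start + i) % self.m

    def insert(self, key, value):
        for j in self._probe(key):
            if self.slots[j] is None:
                self.slots[j] = (key, value)
                return
        raise RuntimeError("table is full")

    def find(self, key):
        for j in self._probe(key):
            if self.slots[j] is None:
                return None                   # empty slot ends the search
            if self.slots[j][0] == key:
                return self.slots[j][1]
        return None                           # probed every slot
```

With m = 11, the integer keys 3, 14 and 25 all hash to slot 3. In the chained table they end up in one chain of length 3. In the linear-probing table they occupy the adjacent slots 3, 4 and 5, a small instance of the clustering problem mentioned above.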