---------------------------------------------------------------------------
COMP 731 - Data Struc & Algorithms - Lecture 18 - Thursday, August 10, 2000
---------------------------------------------------------------------------
Today:
o Explanation for Assignment #4
o Hashing
A *dictionary* is the following ADT:
//DATA: A collection of ITEMS, each item having some KEY
// from a totally-ordered set. Initially, there are no
// items in the collection. We assume that there is at
// most one item with any given key.
//OPERATIONS:
Insert(item i) - Insert item i into the dictionary.
We assume that no item with key i.key
already exists in the dictionary.
Find(key k) - Return a pointer to the item with key k,
if such an item exists, and return NIL
otherwise.
Delete(key k) - Delete the item with key k. Undefined
if there is no such item.
Not all dictionaries support this operation.
Basic idea:
  ----------------                    |----|
  |              |                  0 |    |
  |  UNIVERSE of |                    |----|
  |  Possible    |        h         1 |    |
  |  Keys, U     |   ---------->      |----|
  |  (BIG)       |                  2 |    |
  |              |                    |----|
  ----------------                  3 |    |
                                      |----|
                                      |    |
                                      |----|
                                      |    |
                                      |----|
                                      |    |
                                      |----|
                                      |    |
                                      |----|
                                      |    |
                                      |----|
                                  m-1 |    |
                                      |----|
Every key hashes under h into exactly one SLOT of the hash
table. The hash table has m slots, so
h : U -> [0..m-1]
Common to select m as a prime number.
Collision:
x different from x' but h(x) = h(x').
If |U|>m, then collisions are unavoidable. But we want
to make them rare.
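A small illustration of the pigeonhole argument, as a sketch only. The table size m = 7 and the hash h(x) = x mod m are assumptions chosen for the example; they are not taken from the lecture.

```python
from collections import defaultdict

m = 7                       # assumed table size (a small prime)

def h(x):
    # Assumed hash function for this illustration: h(x) = x mod m.
    return x % m

# Hash m + 1 distinct keys into m slots: some slot must receive two keys.
slots = defaultdict(list)
for key in range(m + 1):
    slots[h(key)].append(key)

collisions = {s: ks for s, ks in slots.items() if len(ks) > 1}
print(collisions)           # {0: [0, 7]}
```

Keys 0 and 7 are different but h(0) = h(7) = 0, which is exactly the situation x different from x' with h(x) = h(x').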
Dealing with collisions:
(Method 1) Collision resolution by CHAINING.
Explain. Then analysis.
          n       # of entries in the hash table
alpha =  ---  =  --------------------------------
          m       # of slots in the hash table
Back-of-envelope analysis:
Insert into the hash table: O(1) time (worst case -- assumes
that you know that the item is not already present).
Find - item IS in the dictionary: Have to do expected
1 + alpha/2 work. Why?
Average chain length is n/m and have to go half-way down
the chain. The 1 is for the constant amount of work
(we assume) to compute h(x). So:
O(1 + alpha) expected.
- item is NOT in the dictionary: as above, but have to go
to the end of the chain.
O(1 + alpha) expected work.
Delete - 1 + alpha/2 expected work, as for a successful Find: O(1 + alpha) expected.
All of these assume the UNIFORM HASHING MODEL, which says:
For purposes of analysis,
we treat the hash function h as though it were a
random function.
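Chaining (Method 1) can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the class name `ChainedHashTable`, the use of integer keys with h(k) = k mod m, the default m = 11, and the choice to make Delete a no-op on a missing key are all assumptions.

```python
class ChainedHashTable:
    """Dictionary with collision resolution by chaining (illustrative sketch)."""

    def __init__(self, m=11):
        self.m = m                              # number of slots
        self.n = 0                              # number of stored items
        self.slots = [[] for _ in range(m)]     # one chain (list) per slot

    def _h(self, key):
        # Assumption: integer keys and h(k) = k mod m.
        return key % self.m

    def insert(self, key, value):
        # O(1): assumes the key is not already present, as in the notes.
        self.slots[self._h(key)].append((key, value))
        self.n += 1

    def find(self, key):
        # Walk the chain for the key's slot; None plays the role of NIL.
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        # The ADT leaves this undefined for a missing key; here it is a no-op.
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                self.n -= 1
                return

    def load_factor(self):
        return self.n / self.m                  # alpha = n / m


table = ChainedHashTable(m=11)
for k in (5, 16, 27, 3):                        # 5, 16 and 27 all hash to slot 5
    table.insert(k, str(k))
```

With these four keys, slot 5 holds a chain of length three, so a Find for 27 walks past 5 and 16 first; alpha is 4/11.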
(Method 2) Collision resolution by OPEN ADDRESSING.
Explain. Assume no deletions!
General approach: compute h_0(x), h_1(x), ... until
you find an empty slot. Ideally, h_0(x) ... h_{m-1}(x)
should be a permutation of [0..m-1], so that, as long as there
is an empty slot, you will eventually find it.
h_i(x) = (hash(x) + i) mod m - linear probing
Problem: CLUSTERING. Keep the hash table less than half full
(alpha < 0.5) so that clustering does not become a major problem.
h_i(x) = (hash(x) + i^2) mod m - quadratic probing
Less severe clustering problem.
... various other methods ...
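The two probe sequences above can be sketched as follows, with no deletions. The class name `OpenAddressTable`, integer keys, hash(x) = x mod m, and m = 11 are assumptions made for the example.

```python
def linear_probe(h0, i, m):
    return (h0 + i) % m             # h_i(x) = (hash(x) + i) mod m

def quadratic_probe(h0, i, m):
    return (h0 + i * i) % m         # h_i(x) = (hash(x) + i^2) mod m


class OpenAddressTable:
    """Open addressing without deletions (illustrative sketch)."""

    def __init__(self, m=11, probe=linear_probe):
        self.m = m
        self.probe = probe
        self.slots = [None] * m     # None marks an empty slot

    def _h(self, key):
        # Assumption: integer keys and hash(x) = x mod m.
        return key % self.m

    def insert(self, key):
        # Try h_0(x), h_1(x), ... until an empty slot is found.
        h0 = self._h(key)
        for i in range(self.m):
            j = self.probe(h0, i, self.m)
            if self.slots[j] is None:
                self.slots[j] = key
                return j
        raise RuntimeError("no empty slot found in m probes")

    def find(self, key):
        # Follow the same probe sequence; an empty slot means "not present".
        h0 = self._h(key)
        for i in range(self.m):
            j = self.probe(h0, i, self.m)
            if self.slots[j] is None:
                return None
            if self.slots[j] == key:
                return j
        return None


lin = OpenAddressTable(11, linear_probe)
quad = OpenAddressTable(11, quadratic_probe)
lin_pos = [lin.insert(k) for k in (5, 16, 27)]      # all three hash to slot 5
quad_pos = [quad.insert(k) for k in (5, 16, 27)]
print(lin_pos, quad_pos)                            # [5, 6, 7] [5, 6, 9]
```

Linear probing packs the colliding keys into adjacent slots (the start of a cluster), while quadratic probing spreads them out. Note that the quadratic sequence is not a permutation of the slots in general, so this sketch can fail to find an empty slot in a nearly full table.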
How to choose h?
Want h to be simple and fast to compute.
Depends a lot on the universe of keys, and on what sort
of distribution one expects over that universe.
Example: U is random 32-bit numbers. Then h(x) = x mod m
is a good choice, where m is a power of 2, i.e.,
h(x) = x & (m-1) where 2^d = m (keeps the low-order d bits)
But this is a terrible choice if U is not random,
and in particular if there are lots of likely inputs
that differ only in their high-order bits!
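A quick sketch of why this fails on non-random keys. Taking x mod m with m = 2^d keeps only the low-order d bits, which can be computed as x & (m - 1). The value d = 4 and the sample keys are assumptions chosen for illustration.

```python
d = 4
m = 1 << d                  # m = 2^d = 16 slots

def h(x):
    # x mod m for a power-of-two m keeps only the low-order d bits.
    return x & (m - 1)

# Sanity check: the bit mask agrees with mod for non-negative integers.
assert all(h(x) == x % m for x in range(1000))

# Keys that differ only in their high-order bits all land in the same slot.
keys = [0x00005, 0x10005, 0x20005, 0x30005]
print([h(k) for k in keys])     # [5, 5, 5, 5]
```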
U is a set of strings.
Let x = x[1] ... x[t].
H(x) = (x[1] + ... + x[t]) mod m
// will not work well if lots of inputs differ only in the order
// in which their characters appear. May also be slow if x is long.
Often something very ad hoc, like
H(x) = (t xor 16 x[1] xor 128 x[2] xor 512 x[t]) mod 1007
will work well.
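Both string hashes can be sketched as follows. The transcription of the ad hoc formula rests on assumptions: juxtaposition is read as multiplication, characters are taken as their code points, x[1] and x[2] are the first two characters, x[t] is the last, and mod applies to the whole xor. The function names are hypothetical.

```python
def additive_hash(s, m):
    # H(x) = (x[1] + ... + x[t]) mod m; touches every character.
    return sum(ord(c) for c in s) % m

def ad_hoc_hash(s, m=1007):
    # Assumed reading of the ad hoc formula; needs len(s) >= 2.
    # Only the length, the first two characters and the last one are used,
    # so the cost does not grow with the length of s.
    t = len(s)
    return (t ^ (16 * ord(s[0])) ^ (128 * ord(s[1])) ^ (512 * ord(s[-1]))) % m

# Anagrams always collide under the additive hash, whatever m is.
print(additive_hash("listen", 101) == additive_hash("silent", 101))   # True

# The ad hoc hash separates this particular pair.
print(ad_hoc_hash("listen") != ad_hoc_hash("silent"))                 # True
```

The ad hoc hash has its own blind spot: strings of the same length that agree on their first two and last characters collide, so its quality depends on the expected inputs.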
Perfect hashing on S: no collisions within the set S.
So only possible if |S| <= m.
Easiest to find if m is much larger than |S|.
Example: hashing C++ reserved words using some ad hoc hash function
as a way to determine if a string is an identifier or a reserved word.
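One way to find such a function for a small fixed set is brute-force search, sketched below. Everything here is an assumption for illustration: the polynomial hash family, the parameter names a and m, the search order, and the short list of reserved words (a sample, not the full C++ list). A table with m >= |S| slots is needed for a collision-free assignment to exist.

```python
from itertools import count

def poly_hash(s, a, m):
    # Assumed hash family: evaluate the string as a polynomial in a, mod m.
    h = 0
    for c in s:
        h = (h * a + ord(c)) % m
    return h

def find_perfect_hash(words):
    # Try table sizes m = |S|, |S|+1, ... and multipliers a = 1 .. m-1
    # until some (a, m) sends every word to a different slot.
    for m in count(len(words)):
        for a in range(1, m):
            if len({poly_hash(w, a, m) for w in words}) == len(words):
                return a, m

RESERVED = ["if", "else", "for", "while", "do", "int", "char", "return"]
a, m = find_perfect_hash(RESERVED)

# Build the table: at most one reserved word per slot, by construction.
table = [None] * m
for w in RESERVED:
    table[poly_hash(w, a, m)] = w

def is_reserved(s):
    # One hash computation and one string comparison, no chain to walk.
    return table[poly_hash(s, a, m)] == s

print(is_reserved("while"), is_reserved("whale"))   # True False
```

Because the search is done once, ahead of time, its cost does not matter; each later lookup is a single probe.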