---------------------------------------------------------------------------
COMP 731 - Data Struc & Algorithms - Lecture 19 - Friday, August 11, 2000
---------------------------------------------------------------------------

Today: 

o Hashing applications
o Binary search trees


1. Symbol table in a compiler.  Mentioned yesterday in connection
   with perfect hashing.

2. Given sets S and T, does S intersect T?
   Suppose |S|=|T|=n.
   Naive algorithm is Theta(n^2) time.
   Here is an expected O(n) time algorithm (under the uniform
   hashing assumption)

      for each s in S do
         Insert(dict, s)
      for each t in T do
         if Find(dict, t) then return "S and T intersect"
      return "S and T are disjoint"
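
   The pseudocode above can be sketched in Python as follows.  This is
   only an illustration: the name intersects is made up here, and Python's
   built-in set (a hash table) stands in for the dictionary with Insert
   and Find.

```python
def intersects(S, T):
    """Return True if the collections S and T share at least one element."""
    table = set()        # built-in hash table, standing in for "dict"
    for s in S:          # Insert(dict, s)
        table.add(s)
    for t in T:          # Find(dict, t)
        if t in table:
            return True
    return False


print(intersects([3, 1, 4, 1, 5], [9, 2, 6, 5]))   # True  (5 is shared)
print(intersects([3, 1, 4], [9, 2, 6]))            # False
```

   Each Insert and Find takes expected O(1) time under the uniform hashing
   assumption, which is where the expected O(n) total comes from.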

3. Associative array
       A["hello"] = 7
       A["there"] = 21
       A["algorithm"] = 103
   To compute A[str], do a Find(str).
   To set A[str]=x, replace the item returned by Find(str)
   by an item with the new information (or Insert a new item if
   Find(str) comes back empty).
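
   As a rough sketch of this idea, here is a toy associative array built
   on a chained hash table.  The class name AssocArray, the fixed bucket
   count, and the use of Python's built-in hash() are all assumptions made
   for illustration; a real table would also resize as it fills.

```python
class AssocArray:
    """Toy associative array: a chained hash table of [key, value] items."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def _find(self, key):
        """Find(key): return the stored [key, value] item, or None."""
        for item in self._bucket(key):
            if item[0] == key:
                return item
        return None

    def __getitem__(self, key):            # computes A[key]
        item = self._find(key)
        if item is None:
            raise KeyError(key)
        return item[1]

    def __setitem__(self, key, value):     # performs A[key] = value
        item = self._find(key)
        if item is None:
            self._bucket(key).append([key, value])   # new key: Insert
        else:
            item[1] = value                # existing key: replace the information


A = AssocArray()
A["hello"] = 7
A["there"] = 21
A["algorithm"] = 103
A["hello"] = 8       # overwrites the item found for "hello"
print(A["hello"], A["there"], A["algorithm"])   # 8 21 103
```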

......

Binary Search Trees 
 Like hash tables, these also support
   Insert
   Find
   Delete
In addition,
   FindMin
   FindMax
   Succ
   Pred
are well supported.

BST property:
 key of every node in LeftTree(x) < key (x) < key of every node in RightTree(x)

Show how to insert  -- easy

Show how to delete -- 
   case 1: the node being deleted has no children.  Remove it.
   case 2: the node being deleted has a single child.  So remove that node
           and move up the child.
   case 3: the node being deleted, x, has two children.  Find succ(x),
           which is the node you get by moving right once and then left as far
           as possible.  Necessarily that node has no left child.  So remove it
           using case 1 or case 2, and have it replace x.

Minor problem: case 3 will tend to "skew" the tree:
      since we are always replacing a node x with a node from the RIGHT subtree of
      x, the tree will start to get left-heavy.   One possibility is to ignore this.
      Another possibility is to choose at random whether to replace x by the successor
      (right and then left as far as possible) or the predecessor (left and then
      right as far as possible).  A simpler possibility (no pseudorandom number
      generation needed) is to ALTERNATE between these two possibilities.

Tree traversals:
     Described breadth-first traversal (uses a QUEUE), and
     Depth-first traversals:   
        preorder(x):  visit(x)            preorder(left(x))    preorder(right(x))
        inorder(x):   inorder(left(x))    visit(x)             inorder(right(x))
        postorder(x): postorder(left(x))  postorder(right(x))  visit(x)

     Draw a tree and illustrate....
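
     In place of a drawing, here is a small sketch of the four traversals
     run on one example tree.  The function names and the tree itself are
     chosen only for illustration.

```python
from collections import deque


class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def breadth_first(root):
    """Visit level by level, using a FIFO queue of nodes still to be visited."""
    order = []
    queue = deque([root] if root is not None else [])
    while queue:
        x = queue.popleft()
        order.append(x.key)                 # visit(x)
        for child in (x.left, x.right):
            if child is not None:
                queue.append(child)
    return order


def preorder(x):
    return [] if x is None else [x.key] + preorder(x.left) + preorder(x.right)


def inorder(x):
    return [] if x is None else inorder(x.left) + [x.key] + inorder(x.right)


def postorder(x):
    return [] if x is None else postorder(x.left) + postorder(x.right) + [x.key]


#        50
#      /    \
#    30      70
#   /  \    /  \
#  20  40  60  80
root = Node(50, Node(30, Node(20), Node(40)), Node(70, Node(60), Node(80)))
print(breadth_first(root))   # [50, 30, 70, 20, 40, 60, 80]
print(preorder(root))        # [50, 30, 20, 40, 70, 60, 80]
print(inorder(root))         # [20, 30, 40, 50, 60, 70, 80]
print(postorder(root))       # [20, 40, 30, 60, 80, 70, 50]
```

     Note that the inorder traversal of a BST lists the keys in sorted order.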

The operations on a tree take time proportional to the height of the tree.
    (The height of the tree is the length of the longest 
    path from the root to a leaf.  The height of a node is the length 
    of a longest path from that node to a leaf, where edges are directed
    from the root towards the leaves.)  
    So good trees are "balanced" -- bushy.
    The height is Theta(n) in the worst case (the tree is a chain)
    and Theta(lg n) in the best case (the tree is as bushy as possible).
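
    To make the two extremes concrete, the sketch below builds two BSTs on
    the keys 1..15, one by inserting in sorted order and one by inserting
    in an order chosen to keep the tree bushy.  The helper names (Node,
    insert, height, build) are illustrative only; the empty tree is given
    height -1 so that a single node has height 0.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None


def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root


def height(root):
    """Length (in edges) of the longest path from root down to a leaf."""
    if root is None:
        return -1
    return 1 + max(height(root.left), height(root.right))


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


chain = build(range(1, 16))                                   # sorted order
bushy = build([8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15])
print(height(chain))   # 14, i.e. n - 1
print(height(bushy))   # 3,  i.e. floor(lg n)
```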
 
Are most nodes in most trees closer to being lg n in depth, 
or closer to n in depth?
The former.
Let's think about the average distance of a node to the root in a "random"
BST.  We have to think what we mean by a "random" BST.  Here I will
mean the tree produced by inserting n distinct keys in random order (for
example, random numbers in the interval [0,1]), so that each key is equally
likely to be the first one inserted, i.e. the root.  (There are other
interpretations.  One is to choose uniformly at random among all the possible
shapes of binary trees with n nodes, but then the splits tabulated below are
not equally likely and the analysis that follows does not apply.)

One node is selected for the root.
There are then n-1 remaining nodes to distribute.
We could put

      LEFT         RIGHT
--------------------------
       0             n-1
       1             n-2
       2             n-3
      ...            ...
      n-3             2
      n-2             1
      n-1             0

We are saying that all of these possibilities are equi-probable.

F(n) = the expected TOTAL path length of a random tree with n nodes, 

     = sum  (distance between x and the root r)
        x
     = sum   depth(x)
        x 

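Then F(0) = 0 and, since each of the n-1 non-root nodes lies one level
deeper in the whole tree than in its own subtree,

              n-1
  F(n) = 1/n  Sum (F(i) + F(n-1-i) + (n-1))
              i=0

              n-1
       = 2/n  Sum  F(i)   +    (n-1)
              i=0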

Look familiar?  This is the same recurrence as for the expected running
time of Quicksort!   So its solution, you will recall, is Theta(n lg n).

So the expected total path length is about c n lg n.  There are n nodes, so
the expected depth of a node in a random tree is about c lg n.
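
As a numerical sanity check, the recurrence F(n) = (2/n) Sum F(i) + (n-1),
with F(0) = 0, can simply be evaluated.  The sketch below (function name
chosen for illustration) prints n, the average depth F(n)/n, and the ratio
of that average depth to lg n; if the Theta(n lg n) claim is right, the
ratio should stay bounded by a small constant as n grows.

```python
import math


def expected_path_lengths(n_max):
    """Return [F(0), ..., F(n_max)] for F(n) = (2/n) * sum_{i<n} F(i) + (n - 1)."""
    F = [0.0]          # F(0) = 0: the empty tree has no nodes
    prefix = 0.0       # running value of F(0) + ... + F(n-1)
    for n in range(1, n_max + 1):
        F.append(2.0 * prefix / n + (n - 1))
        prefix += F[n]
    return F


F = expected_path_lengths(100000)
for n in (10, 100, 1000, 10000, 100000):
    avg_depth = F[n] / n
    print(n, round(avg_depth, 2), round(avg_depth / math.log2(n), 3))
```

Keeping a running prefix sum makes the whole table cost O(n) time instead
of the O(n^2) that re-summing for every n would take.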


But this is of little help if our keys are, for example, in order!
I want to consider two ways to deal with this, two ways to try to force
balance.

1. AVL trees    - a "classical" method, Adelson-Velskii and Landis (1962)
2. Splay trees  - an elegant, "modern" method of Sleator-Tarjan (1985)
                  balance will be obtained in an amortized sense