ECS110 Lecture Notes for Friday, December 8, 1995

Professor Rogaway conducting

Lecture 28

Scribe of the day:
                                      1 - O                            
                                    /       \                          
                                 3A - O     2 - O                      
                                /    \     /     \
                           000010  001100 /       \                    
                                         /       3B - O                
                                        /        /    \                
                                      5 - O  110111   111000           
                                      /   \                            
                                 100001   100011                       
Example find for 100001:
  • At 1 - O, we look at 1st bit. It is a 1, so go right to 2 - O.
  • At 2 - O, we look at 2nd bit. It is a 0, so go left to 5 - O.
  • At 5 - O, we look at 5th bit. It is a 0, so go left to 100001 and compare. They are equal so 100001 is in our trie.
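The find traced above can be sketched in code. This is my own reconstruction (the node layout and names are assumptions, not from lecture): internal nodes test one bit position, leaves hold whole keys.

```python
class Node:
    def __init__(self, bit=None, key=None, left=None, right=None):
        self.bit = bit      # which bit (1-indexed) this internal node tests
        self.key = key      # the stored bit string, if this is a leaf
        self.left = left
        self.right = right

def find(root, key):
    """Return True if `key` (a bit string like '100001') is in the trie."""
    node = root
    while node.key is None:              # still at an internal node
        if key[node.bit - 1] == '0':     # bit is 0: go left
            node = node.left
        else:                            # bit is 1: go right
            node = node.right
    return node.key == key               # at a leaf: compare whole keys

# The trie from the figure: node 1 tests bit 1, 3A and 3B test bit 3, etc.
leaf = lambda k: Node(key=k)
root = Node(1,
            left=Node(3, left=leaf('000010'), right=leaf('001100')),
            right=Node(2,
                       left=Node(5, left=leaf('100001'),
                                    right=leaf('100011')),
                       right=Node(3, left=leaf('110111'),
                                     right=leaf('111000'))))
```

With this tree, find(root, '100001') follows exactly the three steps above: bit 1 is 1 (go right), bit 2 is 0 (go left), bit 5 is 0 (go left), then compare.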

Patricia

What if we took our binary trie and used only one type of node? This node would contain the two pointers to its children and the data. This is the Patricia (no, I'm not kidding; this specialized trie is really called a Patricia).

Because the number of leaf nodes is at most one more than the number of internal nodes, we need to create an extra node at the zero level to guarantee that all our data can be put in the Patricia.

The convention we use here is that if a node does not have a left child, then its left pointer points back to itself. If it doesn't have a right child, then its right pointer points back to its parent or the zero-level node.

We search through a Patricia in the same manner as before, where the bit determines the direction taken. The catch is that when the bit level stays the same or decreases, we stop and do our comparison.


[Patricia]

Example find for 100001:
  • We skip the zero level node and start at the 1st level node that contains 001100.
  • We are at a 1st level node. We check the 1st bit of 100001 which is a 1, and follow the right pointer to 2 - 100011. The bit level has increased so we do not compare.
  • We are at a 2nd level node. We check the 2nd bit of 100001 which is a 0, and follow the left pointer to 5 - 100001. The bit level has increased so we do not compare.
  • We are at a 5th level node. We check the 5th bit of 100001 which is a 0, and follow the left pointer, which loops back to the same node, 5 - 100001. The bit level is the same as before, so we compare 100001 and 100001. They are the same so 100001 is in our Patricia.
Example find for 111000:
  • We skip the zero level node and start at the 1st level node that contains 001100.
  • We are at a 1st level node. We check the 1st bit of 111000 which is a 1, and follow the right pointer to 2 - 100011. The bit level has increased so we do not compare.
  • We are at a 2nd level node. We check the 2nd bit of 111000 which is a 1, and follow the right pointer to 3 - 110111. The bit level has increased so we do not compare.
  • We are at a 3rd level node. We check the 3rd bit of 111000 which is a 1, and follow the right pointer to 0 - 111000. The bit level decreased, so we compare 111000 and 111000. They are the same so 111000 is in our Patricia.
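Since the Patricia figure itself didn't survive in these notes, here is one reconstruction in code that is consistent with both worked examples above. The node layout, names, and the exact placement of keys in nodes are my own assumptions:

```python
class PNode:
    def __init__(self, bit, key):
        self.bit = bit                  # bit level this node tests (0 = header)
        self.key = key                  # every node also stores one key
        self.left = self.right = self   # pointers default to self-loops

def patricia_find(header, key):
    """Return True if the bit string `key` is stored in the Patricia."""
    prev, node = header, header.left    # skip the zero-level header node
    while node.bit > prev.bit:          # bit level still increasing: keep going
        prev = node
        node = node.left if key[node.bit - 1] == '0' else node.right
    return node.key == key              # level same or decreased: compare

# A reconstruction consistent with the two example finds above:
header = PNode(0, '111000')
n1  = PNode(1, '001100')
n3a = PNode(3, '000010')
n2  = PNode(2, '100011')
n5  = PNode(5, '100001')
n3b = PNode(3, '110111')
header.left = n1
n1.left,  n1.right  = n3a, n2
n3a.left, n3a.right = n3a, n1        # self-loop and back-pointer up
n2.left,  n2.right  = n5, n3b
n5.left,  n5.right  = n5, n2         # self-loop and back-pointer up
n3b.left, n3b.right = n3b, header    # back-pointer to the zero-level node
```

Tracing patricia_find(header, '111000') reproduces the second example: node 1, node 2, node 3 (110111), then the back-pointer to the zero-level node, where the bit level drops and the keys compare equal.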

Variant on DSTs

We could have more branches per node so we end up with a fatter tree of very shallow depth. Professor Rogaway mentions possible DSTs with 1000 branches. No, I am not giving an example of this. Sorry ;->
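The lecture gives no example here (see above), so what follows is purely my own minimal sketch of the idea: branch on K bits at a time, so each node has up to 2**K children. K = 10 would give the roughly 1000-way nodes mentioned above; K = 3 keeps the sketch readable.

```python
K = 3                                    # 3 bits per step -> 8-way nodes

def insert(root, key):
    """Insert a bit string whose length is a multiple of K (root is a dict)."""
    node = root
    for i in range(0, len(key), K):
        node = node.setdefault(key[i:i+K], {})   # one child per K-bit chunk
    node['$'] = True                     # mark that a key ends here

def find(root, key):
    node = root
    for i in range(0, len(key), K):
        node = node.get(key[i:i+K])
        if node is None:
            return False
    return '$' in node
```

Each lookup now walks len(key)/K levels instead of len(key), which is why the fatter tree is so shallow.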


Application of DST: File compression using Huffman encoding

Suppose we have a file with lots of redundancy and want to compress it.

We have a 1MB file whose entries are the ASCII letters a, b, c, d, e, and f with the following frequencies:

a | 450000
b | 130000
c | 120000
d | 160000
e |  90000
f |  50000
If each of these letters is represented with 8 bits, then we end up with a 1MB file.
Now because we only have 6 entries, we could more efficiently represent each entry with a 3-bit code.
entry       3-bit code
    a        000
    b        001
    c        010
    d        011
    e        100
    f        101
Because we are using only 3 bits per entry, we end up with a file that is 3/8 the size of the original 8-bit file: about 375KB.

Now what if we try something else and represent the entries in the following manner.

entry      code
    a      0     Notice that each code is not a prefix of another.
    b      101   For example, no other code begins with a's code of 0.
    c      100   Nor does any other code begin with c's code of 100.
    d      111
    e      1101
    f      1100
This is Huffman encoding.

Now to decode 0101111110100, we just parse the string into the different codes and output the entry. Because the codes are not prefixes of other codes, when we make a match, we are guaranteed that the matched code is the correct one.

Here is how the above bit string is decoded.

code      0 101 111 1101 0 0
output    a  b   d   e   a a
This results in a file size of 280KB.
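The decoding step and the 280KB figure can both be sketched in code. The code table and frequencies are the ones above; the decoder itself is my own sketch. Because no code is a prefix of another, we can emit a symbol the moment the bits read so far match some code.

```python
codes = {'a': '0', 'b': '101', 'c': '100',
         'd': '111', 'e': '1101', 'f': '1100'}
decode_table = {bits: ch for ch, bits in codes.items()}

def decode(bitstring):
    """Decode a prefix-free bit string back into its entries."""
    out, buf = [], ''
    for bit in bitstring:
        buf += bit
        if buf in decode_table:          # a complete code matched: emit it
            out.append(decode_table[buf])
            buf = ''
    return ''.join(out)

# The frequencies from the table above, and the resulting file size:
freqs = {'a': 450000, 'b': 130000, 'c': 120000,
         'd': 160000, 'e': 90000, 'f': 50000}
size_bytes = sum(freqs[ch] * len(codes[ch]) for ch in freqs) // 8
```

decode('0101111110100') parses the string into 0 101 111 1101 0 0 and yields abdeaa, and size_bytes works out to the 280KB claimed above.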
Here is the corresponding Huffman DST.
            /\
         0 /  \ 1
          /    \           The 0's and 1's next to the branches
         a     /\          indicate the path to follow depending upon
              /  \         the bit. 0, go left. 1, go right.
           0 /    \ 1
            /      \
           /\      /\
         0/  \1  0/  \ 1
         /    b  /    \
        c       /      d
               /\
             0/  \ 1
             f    e 

How to build a Huffman tree

We build a Huffman tree by being greedy. Here are the frequencies again in ten thousands.

a: 45     b: 13     c: 12     d: 16     e: 9     f: 5
We start building the tree by grouping the two least likely items to occur. We then add the frequencies to come up with a combined frequency for the combined entries.

We first see that e (9) and f (5) are the two lowest frequencies. We combine them and come up with a combined frequency of 9+5=14.

a: 45     b: 13     c: 12     d: 16     14
                                       /  \
                                      f    e
Now b (13) and c (12) have the two lowest frequencies. We combine them and come up with a combined frequency of 13+12=25.

a: 45     25     d: 16     14
         /  \             /  \
        c    b           f    e
Now d (16) and the f-e group (14) have the two lowest frequencies. We combine them and come up with a combined frequency of 16+14=30.

a: 45     25            30
         /  \          /  \
        c    b        /    d
                     /\
                    /  \
                   f    e      
Now the c-b group (25) and the d-f-e group (30) have the two lowest frequencies. We combine them and come up with a combined frequency of 25+30=55.

a: 45         55
             /  \
            /    \
           /      \
          /\      /\
         /  \    /  \
        c    b  /    d
               /\
              /  \
             f    e      
Now we combine the remaining groups and end up with our final Huffman tree.

            /\
           /  \
          a   /\
             /  \
            /    \
           /      \
          /\      /\
         /  \    /  \
        c    b  /    d
               /\
              /  \
             f    e      
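The greedy build stepped through above can be sketched with a min-heap of (frequency, tree) pairs. The heap representation and tie-breaking counter are my own choices, not from lecture, but with this frequency table the sketch reproduces exactly the code table shown earlier.

```python
import heapq
from itertools import count

def huffman(freqs):
    """Return {symbol: code} built greedily from a frequency table."""
    tick = count()                       # tie-breaker so heapq never compares trees
    heap = [(f, next(tick), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # the two least likely groups...
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (t1, t2)))  # ...get merged
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):      # internal node: 0 goes left, 1 goes right
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
        else:
            codes[tree] = prefix or '0'  # a lone symbol still needs one bit
        return codes
    return walk(heap[0][2], '')

table = huffman({'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5})
```

The merges happen in the same order as above (f+e=14, c+b=25, 14+d=30, 25+30=55, then a), so table comes out as a=0, b=101, c=100, d=111, e=1101, f=1100.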

Conclusion

Lessons we learned in this course.

1. Programming is a thoughtful endeavor.

    "If you're used to programming by reading the assignment sheet and going to the computer and typing it in, then you probably haven't gotten most of the programs. Maybe it worked on the first one, but my guess is that it didn't work on 2, 3, and 4. And if you only solved program 1 and managed to get working solutions for none of 2, 3, and 4. A root cause might be that you never came around to the opinion that in order to get real programs to work, you can't just sit down and write them. You really have to think about how you're going to attack this problem. Give it some thought."
2. Data Structures and Algorithms are intertwined.
    "This is a course in data structures, right? But I feel when I'm teaching it that I'm spending a third of my time discussing algorithms. I don't know how to avoid that. I think that data structures and algorithms go hand in hand. The interesting thing about data structures is the algorithms you run on them. The interesting thing about algorithms is that you can in fact make them run by intertwining them with data structures."
3. Think Abstractly.
    "Another possible reason you've not succeeded in programming assignments is that somehow you never really got the point of building these abstraction boundaries and really believing in them. Some people make the abstraction boundaries. They implement the priority queue as a binary heap, but they never really put it out of their head that the way to then think of that data structure is via the operations which are acting on it. Insert. Delete. Instead, every time they see that insert, in their head, somehow they're translating it into what operations are taking place in the underlying data structure. You probably know if you do this. Every time you do this, you have to sort of hit yourself and say "You're not abstracting properly." The failure to make an abstraction boundary at this data structure level means ultimately that you can't see the problem clearly enough to solve it when the problems become abstract. Ultimately you have to cut away the details, like how your priority queue is implemented and see it as this abstract thing on which you can do use this set of operations. So here we've seen abstract data types used throughout, so really you need to think of your abstract data types as abstract data types and treat this encapsulation seriously"

If you find any errors, or have a question about the posted notes included on this page, please feel free to contact me via email. I can be reached at
lbsiakkhasone@ucdavis.edu