This assignment should be done in groups of two.
Due: Wed. May 23, 11:59pm.
What to submit: a Makefile, an executable called “speak”, and all the code files needed to build your solution on the CSIF Linux machines. Do not submit object files. Use handin to turn in your files to cs60 p4.
Speaker of Words
This project will use statistical methods to “write” text that bears some similarity to famous books.
n-grams capture a simple idea: that the present may depend on the past. More specifically, they describe the likelihood of a present state, given the states that came before.
In many cases, sequences are not random. Consider, for example, English words. If you know the last letter I typed was a ‘q’, you know that it is very unlikely that the next letter I type will be an ‘m’ and quite likely that it will be a ‘u’. Given a large corpus of data, we can build a model that will predict the likelihood of the various sequences in the data. If we built this model for letters using a large sample of English text, it would capture the relationship between ‘q’ and letters that might follow it, as described above.
n-grams formalize this concept. An n-gram is a sequence of n items (letters, words, DNA base pairs, etc.). Bigrams are sequences of 2 items, trigrams sequences of 3, and so on. Often we are interested in predicting the n-th item given the preceding n-1 items. This n-gram model is represented as an (n-1)-order Markov model: P(x_n | x_1, …, x_{n-1}). For the above example, we might have P(x_n = m | x_{n-1} = q) = 0.0001 and P(x_n = u | x_{n-1} = q) = 0.98. (The first reads: “the probability of x_n being m, given that x_{n-1} was q”.)
We can apply this idea to generate new text. Given an input text, say a book, we can learn an n-gram model that predicts the likelihood of word patterns. We can then use this model to generate novel text: random writing that captures some of the patterns of the original. For example, if you have a bigram model, start by picking a first word. Then randomly select the next word, weighted by the likelihood of that word appearing after the first. In the letter example above, most of the time we would pick ‘u’, but every so often we would pick ‘m’, to match the actual distribution seen in the book. Given the second word, we can then use it as input to generate the third, and so on until we have as much text as we want. If we were using trigrams instead of bigrams, we would use the previous two words to predict the next.
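To make this concrete, here is a minimal sketch of a bigram generator in C++. It assumes the model has already been stored as a map from each word to the list of words observed after it (duplicates kept, so a uniform draw is automatically frequency-weighted); the names and structure are illustrative only, not part of the provided starter code.

// Sketch: generate words from a bigram model stored as
// word -> list of observed next words. Keeping duplicates means a
// uniform random draw reproduces the observed frequencies.
#include <cstdlib>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    std::map<std::string, std::vector<std::string> > model;
    model["this"].push_back("is");        // counts taken from the worked example later in this handout
    model["this"].push_back("is");
    model["this"].push_back("thing");
    model["this"].push_back("was");
    model["is"].push_back("good,");
    model["is"].push_back("definitely");

    std::string current = "this";
    for (int i = 0; i < 6 && model.count(current) > 0; ++i) {
        std::cout << current << " ";
        const std::vector<std::string>& next = model[current];
        current = next[rand() % next.size()];   // weighted random choice
    }
    std::cout << current << std::endl;
    return 0;
}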
As n increases, the coherency of the text will increase (but the range of possible output will decrease). If n=1, there is no relationship between the items in the sequence. The following examples were generated on text from Sherlock Holmes:
n=1: tension past abhorrent dilate shipping 'we'll spies governess spreading go hat architects tangible doctor? explanations alarm wisdom geese! ruefully circumstantial 4th gaunt don't deserve insensibility eighteen compress cage interested expenses? satisfactory embellish
n=2: tossing aside altogether look at once or dreaming he shook himself it bore unmistakable signs of the bride gave a child by his signature is alive who comes back and that his
n=3: 'well it is equally certain too that whatever danger threatened an occupant of the way folk who were opposed to its highest pitch of expectancy there was not there when i came
n=4: brawls took place two of which ended in the police-court until at last he became the terror of the village and even in the distant parsonage that cry raised the sleepers from
n-grams have applications in computational linguistics, bioinformatics and several other areas. They can be used for prediction and also to measure statistics on a sequence.
The design objective is to make the writing and analysis queries that you implement as fast as possible. It is OK to spend time pre-processing the data and you don’t need to minimize the amount of space you use. You need to determine appropriate data structures in order to do this.
One way to tackle this problem would be to store all the text in memory and then, at run time, search it for all occurrences of words 1 to n-1 in order to predict the odds of word n. This approach requires no pre-processing, but a lot of work for each prediction. In your implementation, you should instead store the n-gram data in a way that lets you complete the commands quickly. This requires pre-processing.
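As one illustration (not a required design), the counting pass might build a map keyed on the n-1 context words, so that each prediction becomes a single lookup; the type and function names below are placeholders.

// Sketch: one pass over the tokenized text to count n-grams, so that
// lookups at speak time are cheap. Keys are the n-1 context words joined
// with spaces; values count each observed next word.
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::map<std::string, int> > NGramCounts;

NGramCounts buildCounts(const std::vector<std::string>& words, int n) {
    NGramCounts counts;
    for (size_t i = 0; i + n <= words.size(); ++i) {
        std::string context = words[i];
        for (int j = 1; j < n - 1; ++j)
            context += " " + words[i + j];
        counts[context][words[i + n - 1]]++;
    }
    return counts;
}

With the counts built up front, the speak and analysis commands never need to rescan the original text.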
Call your executable “speak”. It should take the following command line arguments:
speak <n-gram size> <output length> <text file>
where <n-gram size> is the value of n, e.g. 3
<output length> is the number of words that you should generate on a speak command
<text file> is the name of the file that should be analyzed to build the n-gram model
The program should execute two commands. A “speak” command will generate <output length> words of text by randomly sampling the n-gram model (Part 1). An analysis command will print the 10 most common n-grams that start with a specified word (Part 2).
The program should provide the following text interface and allow users to repeatedly enter these commands:
Options:
1. To make the oracle speak, enter: s
2. To see likely n-grams starting with a word, enter: t <word>
3. To quit, enter: q
You must press enter after each command.
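Here is a hedged sketch of the surrounding program: command-line parsing plus this menu loop. The provided main file already handles some of this (e.g. reading and tokenizing the file), so treat everything below as a placeholder outline rather than required structure.

// Sketch: argument handling and the interactive command loop for "speak".
#include <cstdlib>
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    if (argc != 4) {
        std::cerr << "usage: speak <n-gram size> <output length> <text file>" << std::endl;
        return 1;
    }
    int n = std::atoi(argv[1]);             // n-gram size
    int outputLength = std::atoi(argv[2]);  // words to generate per speak command
    std::string fileName = argv[3];         // text to analyze
    // ... read fileName, tokenize it, and build the n-gram model here ...

    std::cout << "Options:\n"
              << "1. To make the oracle speak, enter: s\n"
              << "2. To see likely n-grams starting with a word, enter: t <word>\n"
              << "3. To quit, enter: q\n"
              << "You must press enter after each command." << std::endl;

    std::string command;
    while (std::cin >> command) {
        if (command == "s") {
            // Part 1: generate outputLength words plus the branches-not-taken count
        } else if (command == "t") {
            std::string word;
            std::cin >> word;
            // Part 2: print the 10 most common n-grams starting with word
        } else if (command == "q") {
            break;
        }
    }
    return 0;
}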
Part 1 should write random sentences to stdout based on the n-gram model. Note that an n-gram model is simply based on counting. Consider the input text:
“This is good, this thing. This is definitely better. Did you see how this was before?”
To calculate the bigram model for this, we simply count all the occurrences of “this” and the word after it. The odds of a particular bigram occurring are just the count of that bigram divided by the total number of bigrams that start with “this”. For example:
Bigram        Occurrence Count    P(x_n | x_{n-1} = this)
this is       2                   0.5
this thing    1                   0.25
this was      1                   0.25
If “this” is the last word generated so far, we can generate the next word by generating a random number between 1 and 4. For 1 or 2, the next word would be “is”, for 3 it would be “thing”, and for 4 it would be “was”.
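Here is a small sketch of that weighted draw, assuming the counts for one context are stored in a map as in the earlier sketch (e.g. “is” -> 2, “thing” -> 1, “was” -> 1 for “this”); the function name is illustrative.

// Sketch: pick the next word in proportion to its count.
#include <cstdlib>
#include <map>
#include <string>

std::string weightedPick(const std::map<std::string, int>& nextCounts) {
    int total = 0;
    for (std::map<std::string, int>::const_iterator it = nextCounts.begin();
         it != nextCounts.end(); ++it)
        total += it->second;             // 4 in the example above

    int r = rand() % total;              // 0..total-1, like rolling 1..4
    for (std::map<std::string, int>::const_iterator it = nextCounts.begin();
         it != nextCounts.end(); ++it) {
        r -= it->second;
        if (r < 0)
            return it->first;            // r landed inside this word's share
    }
    return nextCounts.rbegin()->first;   // not reached when total > 0
}

Note that nextCounts.size() - 1 at the moment of the draw is one way to count the “branches not taken” discussed below.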
Note that one way to represent an n-gram is as a tree. How would you represent the full set of n-grams? Your main challenge is determining how to build the n-gram model in order to make the speak and analysis queries efficient.
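For instance (purely illustrative; the choice of structure is yours), a node of such a tree might hold a count and a map of children keyed by the next word:

// Sketch: one node of an n-gram tree. A path of length n from the root
// spells out one n-gram; "count" at the final node records how often that
// n-gram occurred in the text.
#include <map>
#include <string>
#include <vector>

struct NGramNode {
    int count;
    std::map<std::string, NGramNode*> children;
    NGramNode() : count(0) {}
};

// Insert one n-gram (a sequence of n words) under root.
void insertNGram(NGramNode* root, const std::vector<std::string>& words) {
    NGramNode* node = root;
    for (size_t i = 0; i < words.size(); ++i) {
        if (node->children.count(words[i]) == 0)
            node->children[words[i]] = new NGramNode();
        node = node->children[words[i]];
    }
    node->count++;
}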
For larger n, the model will generate more coherent text, but there will also be less variation (longer sequences are less likely to repeat). One way to give an indication of this is to count the number of branches not taken when generating the output. For example, if “thing” was picked as the next word above, there would be two branches not taken (“is” and “was”). You should print the total number of branches not taken after you print your story. See how this changes as you vary n.
I set my output to only write 8 words per line.
In Part 2, the user will enter a word and you need to print the 10 most frequent n-grams that start with this word. Think about how you can make this query fast through appropriate pre-computation and data structure design.
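One naive sketch of that query, shown mainly to pin down the output: it scans a flat map from the full n-gram text to its count, collects the n-grams whose first word matches, and sorts them by count. Your pre-processing should let you avoid the linear scan (for example by grouping n-grams under their first word ahead of time); the names below are placeholders.

// Sketch: print the 10 most frequent n-grams that start with "word".
// ngramCounts maps the full n-gram text (words separated by spaces) to its count.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

void printTop10(const std::map<std::string, int>& ngramCounts,
                const std::string& word) {
    std::vector<std::pair<int, std::string> > candidates;
    int total = 0;
    for (std::map<std::string, int>::const_iterator it = ngramCounts.begin();
         it != ngramCounts.end(); ++it) {
        if (it->first.compare(0, word.size() + 1, word + " ") == 0) {
            candidates.push_back(std::make_pair(it->second, it->first));
            total += it->second;
        }
    }
    std::sort(candidates.rbegin(), candidates.rend());   // largest count first
    for (size_t i = 0; i < candidates.size() && i < 10; ++i)
        std::cout << "n-gram: \" " << candidates[i].second << " \" frequency: "
                  << double(candidates[i].first) / total << std::endl;
}

Dividing each count by the total count of n-grams starting with the queried word appears to be how the frequencies in the sample run at the end of this handout were computed.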
Try modifying your model to work on individual characters rather than words.
Be sure to follow good coding practices throughout. Think carefully about your implementation before you begin to code. In particular, think about how to solve both parts of the assignment and decide on a data structure(s) that will support the two operations. This will save you time.
My main file is located in ~neff/cs60/p4. You are free to modify it as you like, but it will take care of a number of low-level issues for you (e.g. reading the file and tokenizing it).
You may use the Standard Template Library (STL) for this assignment. I used the following data structures: std::list, std::vector, std::map and a tree I wrote myself.
Project Gutenberg is a great source for the large text files that you will need for this assignment. They have text versions of a large number of books.
When developing your code, work with a small input file that you can easily predict the correct behavior for. This will make it easier to debug and it will also run faster, shortening your cycle time. Then move to a large file and test it on that.
You will be evaluated on the following criteria:
- Correct output
- Efficient design
- Clean code
I’ll create a forum where you can post amusing, interesting or profound output that you generate. Please indicate the text used to generate your output and the parameters (e.g. the value of n).
>speak 3 32 LeavesOfGrass.txt
Options:
1. To make the oracle speak, enter: s
2. To see likely n-grams starting with a word, enter: t <word>
3. To quit, enter: q
You must press enter after each command.
s
mastless hulk and 'mid its teeming madden'd half-drown'd crowds
nor helm nor helmsman dim smitten star orb
not of the old graveyards of the marsh
at night alone i sing for him
Number of branches not taken: 1624
s
air; receive the summer air the water the slender
tapering trees the liliput countless armies of those
who sought a new song a poem of
earth of the houses! appearances now or
Number of branches not taken: 874
t song
n-gram: " song of the " frequency: 0.161616
n-gram: " song of myself " frequency: 0.030303
n-gram: " song for you " frequency: 0.020202
n-gram: " song for the " frequency: 0.020202
n-gram: " song of my " frequency: 0.020202
n-gram: " song as i " frequency: 0.020202
n-gram: " song song of " frequency: 0.020202
n-gram: " song for occupations! " frequency: 0.010101
n-gram: " song for all " frequency: 0.010101
n-gram: " song for thee " frequency: 0.010101
t grass
n-gram: " grass of the " frequency: 0.0425532
n-gram: " grass in the " frequency: 0.0425532
n-gram: " grass my tongue " frequency: 0.0212766
n-gram: " grass loose the " frequency: 0.0212766
n-gram: " grass is itself " frequency: 0.0212766
n-gram: " grass is very " frequency: 0.0212766
n-gram: " grass is no " frequency: 0.0212766
n-gram: " grass it may " frequency: 0.0212766
n-gram: " grass of graves--o " frequency: 0.0212766
n-gram: " grass of spring " frequency: 0.0212766
This assignment was inspired by an art piece by David Rokeby, an assignment by Joe Zachary (who was in turn inspired by ideas put forward by Claude Shannon) and the use of n-grams in computational linguistics.