COMP 731 - Assignment 4 - Due 18 August 2000

A letter is one of the 52 characters: a,b,c,...,z,A,B,...,Z. A word is a maximal-length sequence of contiguous letters. The lower-case of a word, LC(x), replaces all upper-case characters in word x by their lower-case equivalents. Words x and x' are different if LC(x) is different from LC(x').
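The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a required approach: the names `WORD_RE` and `extract_words` are made up for the example, and the input is assumed to be already available as a string.

```python
import re

# A word is a maximal run of the 52 letters a-z, A-Z.
# The regex is greedy, so each match is a maximal-length sequence.
WORD_RE = re.compile(r"[A-Za-z]+")


def extract_words(text):
    """Yield LC(x) for every word x in text, in order of appearance."""
    for match in WORD_RE.finditer(text):
        # Only ASCII letters can match, so lower() is exactly LC.
        yield match.group().lower()
```

Note that under these definitions any non-letter ends a word, so "don't" yields the two words "don" and "t", and "Hello" and "HELLO" count as the same word.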

Write a program that, given a file f and a number N, finds the N most frequently occurring different words in f. I will provide you with a large input file (several MBytes) to run your program on. Your program should run fast. Report your answers either in alphabetical order or sorted in order of most-common-word to least-common-word (among the top N words). Arrange the output nicely in columns.
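One possible shape for such a program is sketched below in Python, using a hash-based counter so that counting takes a single pass over the input. All function names, the default of 4 columns, and the choice to read the file as Latin-1 are assumptions made for the illustration. The assignment does not say how to break ties at the N-th place, and this sketch makes no special effort there.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z]+")  # maximal runs of the 52 letters


def top_words(text, n):
    """Return (word, count) pairs for the n most frequent lower-cased words."""
    counts = Counter(m.group().lower() for m in WORD_RE.finditer(text))
    return counts.most_common(n)


def top_words_in_file(path, n):
    """Same as top_words, reading the text from a file (hypothetical helper)."""
    # Latin-1 maps every byte to one character, so decoding never fails;
    # bytes outside a-z, A-Z simply act as word separators.
    with open(path, encoding="latin-1") as f:
        return top_words(f.read(), n)


def by_frequency(pairs):
    """Most common first; equal counts are ordered alphabetically."""
    return sorted(pairs, key=lambda p: (-p[1], p[0]))


def alphabetical(pairs):
    """Order the pairs by word."""
    return sorted(pairs)


def format_columns(pairs, columns=4):
    """Lay out 'word count' cells in rows of `columns` aligned cells."""
    if not pairs:
        return ""
    word_w = max(len(w) for w, _ in pairs)
    count_w = max(len(str(c)) for _, c in pairs)
    cells = [f"{w:<{word_w}} {c:>{count_w}}" for w, c in pairs]
    rows = [cells[i:i + columns] for i in range(0, len(cells), columns)]
    return "\n".join("   ".join(row) for row in rows)
```

For example, `print(format_columns(alphabetical(top_words_in_file("input.txt", 500))))` would print the alphabetical report, where `input.txt` stands in for whatever file is supplied. Whether a hash table is fast enough, or whether some other structure does better, is part of what the assignment asks you to work out.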

Please find for me the 500 most common words in the input file I provide, sorted each of the two ways.