ECS 271, Machine Learning: Project and Potential Topics

Instructor: Prof. Rao Vemuri, rvemuri@ucdavis.edu

What is expected of you

Projects are due in my office by 3 pm on the last day of classes.
Turn in a project write-up of approximately 6-8 pages in a format normally required for conference papers. Turn in the code electronically with instructions for running it.
Your written submission should include the following sections (some may be very short).
          A title
          Your name, affiliation and e-mail address
          A 250-word abstract, followed by 3 or 4 key words
          Problem definition
          Related work
          Your approach(es)
          Experimental results
          Conclusions and ideas for future work
          References

Appendix: An appendix of one or two pages containing a brief introduction to the program you have written. The body of the paper should NOT contain any references to the Appendix. If you need to discuss the program, excerpt the relevant code segments and include them in your appendix. Submit your program electronically.

A note on "Related work." Your project must include a bibliography of at least 2-4 papers, and a brief discussion (one page is plenty) of their content and
relevance to your project. EVERY project is expected to have a section of this kind. If you do not know of related papers yet, then you might try browsing recent
conference proceedings such as the International Conference on Machine Learning, and recent journal articles in the journal Machine Learning.

Suggested Project Topics

A list of suggested projects is given below. Students are encouraged to make their own suggestions. Make your suggestions by following the format I have given for my suggestions.

One way of selecting a topic is to consider the methods of Machine Learning as tools in a toolbox for use in your proposed dissertation, and try to solve a simplified version of that problem using one of the techniques in the textbook. Students are expected to submit a proposed title and a brief description of their selected topic one week into the Quarter. You may change your mind only once, and only within the second week of the Quarter. No further changes will be allowed.

Following are some potential, tentative sample course projects for ECS 271, Machine Learning. These are ONLY suggestions. I may add more later. If you would like to pursue one of these ideas, contact me. You may also propose your own project, subject to approval by the instructor.

One way to start off is to locate a published paper that deals with a topic of interest to you and see if you can either duplicate the results, repeat them by modifying one of the assumptions, etc. It is important that you start thinking early about where to get the training data your agent will use to learn. There are databases available for handwritten characters, faces, etc.

In either case, please turn in a one-page project proposal to rvemuri@ucdavis.edu by the beginning of class on ?? January 2002 or no later than one week from the first day we meet. Your proposal should explain (1) the problem you will look at, (2) the approach and algorithm(s) you will use, and (3) how you will evaluate the results.

===============================================================

TITLE 1: Text Classification with Bayesian Methods

DESCRIPTION: Given the growing volume of online text, automatic document classification is of great practical value, and an increasingly important area for research.
Naive Bayes has been applied to this problem with considerable success; however, naive Bayes makes many assumptions about the data distribution that are clearly not true
of real-world text. This project aims to improve upon naive Bayes by selectively removing some of these assumptions. I imagine beginning the project by removing the
assumption that document length is independent of class, thus designing a new version of naive Bayes that uses document length to help classify more accurately.
If we finish this, we'll move on to other assumptions, such as the word independence assumption, and experiment with methods that capture some dependencies between
words. The paper at
http://www.cs.cmu.edu/~mccallum/papers/multinomial-aaai98w.ps
is a good place to start some reading. You should be highly proficient in C
programming, since you will be modifying rainbow
(http://www.cs.cmu.edu/~mccallum/bow/rainbow).
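To make the idea concrete, here is a small illustrative sketch in Python (not rainbow itself, which is in C). It folds a per-class Poisson model of document length into the standard naive Bayes score; the class name and the choice of a Poisson length model are my own assumptions for illustration, not part of the paper above.

```python
import math
from collections import Counter

class LengthAwareNaiveBayes:
    """Multinomial naive Bayes with an added per-class Poisson model of document length."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.word_counts = {c: Counter() for c in self.classes}
        self.lengths = {c: [] for c in self.classes}
        self.priors = Counter(labels)
        self.vocab = set()
        for words, c in zip(docs, labels):
            self.word_counts[c].update(words)
            self.lengths[c].append(len(words))
            self.vocab.update(words)
        # mean length per class, used as the Poisson rate
        self.mean_len = {c: sum(v) / len(v) for c, v in self.lengths.items()}
        return self

    def score(self, words, c):
        n = sum(self.word_counts[c].values())
        v = len(self.vocab)
        s = math.log(self.priors[c] / sum(self.priors.values()))
        for w in words:  # Laplace-smoothed word likelihoods
            s += math.log((self.word_counts[c][w] + 1) / (n + v))
        # length term: log Poisson pmf of len(words) under the class mean length
        lam, k = self.mean_len[c], len(words)
        s += k * math.log(lam) - lam - math.lgamma(k + 1)
        return s

    def predict(self, words):
        return max(self.classes, key=lambda c: self.score(words, c))
```

Dropping the length term recovers plain multinomial naive Bayes, so the two variants are easy to compare on the same data.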

===============================================================

TITLE 2: Support vector machines for face recognition

DESCRIPTION: Face recognition is a learning problem that has recently received a lot of attention. One standard approach involves reducing the dimensionality of the
problem using Principal Component Analysis (PCA) and then selecting the nearest class (eigenfaces). Support Vector Machines (SVM) are becoming very popular in the machine learning community as a technique for tackling high-dimensional problems. No one has yet (to my knowledge) applied SVMs to face recognition. Can SVMs outperform standard face recognition algorithms?

Issues that the student should address:
- How best to apply SVMs to the n-class problem of face recognition;
- Figure out training and/or image preprocessing strategies (wavelets?);
- Determine how SVMs compare to other techniques (see notes).
Notes:
- A good implementation of SVMs is available (Thorsten Joachims's SVMlight);
- You can try to get access to two datasets widely used in the community, ORL and FERET, for training and testing;
- Results for eigenfaces, fisherfaces and JPRC's face recognition system on these datasets are available, as well as implementations, so comparing SVMs to other algorithms
will be straightforward.
- We can recommend tutorials and papers on SVMs to supplement what was covered in class, if needed.
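For quick prototyping of the PCA-then-SVM pipeline described above, a sketch in Python with the scikit-learn library may help before moving to SVMlight. This is only an assumption that Python prototyping is acceptable; the digits dataset is a stand-in, since ORL and FERET require access, and the parameter values are illustrative, not tuned.

```python
# Sketch: reduce dimensionality with PCA (eigenfaces-style), then classify with an SVM.
from sklearn.datasets import load_digits          # stand-in for a face dataset
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)               # replace with flattened face images
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Project onto the top principal components, then fit an RBF-kernel SVM.
model = make_pipeline(PCA(n_components=30), SVC(kernel="rbf", C=10, gamma="scale"))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)                     # held-out accuracy
```

Swapping the SVC for a nearest-neighbor classifier in PCA space gives the eigenfaces baseline for comparison.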

===============================================================
TITLE 3: Predictive Exponential Models

DESCRIPTION: A new predictive model recently introduced by Chen & Rosenfeld can incorporate arbitrary features using exponential distributions and sampling (see
www.cs.cmu.edu/~roni/wsme.ps). Although the model was originally developed for language modeling, it can be used for prediction or classification in any domain. In this
project you will be expected to read and understand this paper, and to apply the model to a Machine Learning problem of your choice. For example, you could choose
one of the ML problem cases used in the course, and try to improve on the existing, "baseline", solution.
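To get a feel for exponential models with arbitrary features, here is a sketch of a simpler relative of the Chen & Rosenfeld model: a conditional exponential (maximum entropy) classifier trained by stochastic gradient ascent on the log-likelihood. The feature functions and the toy attributes ("long", "caps") are invented for illustration only.

```python
import math

# Hypothetical binary features f_i(x, y); x is a dict of attributes for a toy task.
def features(x, y):
    return [1.0 if (x["long"] and y == 1) else 0.0,   # fires for long inputs, class 1
            1.0 if (x["caps"] and y == 1) else 0.0,   # fires for caps inputs, class 1
            1.0 if y == 1 else 0.0]                   # bias feature for class 1

def prob(w, x, y, classes=(0, 1)):
    """Conditional exponential model: p(y|x) proportional to exp(sum_i w_i f_i(x, y))."""
    score = {c: math.exp(sum(wi * fi for wi, fi in zip(w, features(x, c))))
             for c in classes}
    return score[y] / sum(score.values())

def train(data, n_features, lr=0.5, epochs=200):
    """Stochastic gradient ascent: gradient is observed minus expected feature value."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in data:
            observed = features(x, y)
            expected = [sum(prob(w, x, c) * features(x, c)[i] for c in (0, 1))
                        for i in range(n_features)]
            for i in range(n_features):
                w[i] += lr * (observed[i] - expected[i])
    return w

def predict(w, x):
    return 1 if prob(w, x, 1) > 0.5 else 0
```

The whole-sentence model in the paper additionally requires sampling to estimate the expected feature values, since its normalizer cannot be computed exactly; this conditional version avoids that by summing over a small label set.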

COMMENT: This project is open to more than one student. Each student could work on their own ML problem, or we can choose a larger problem for joint work.

===============================================================

TITLE 4: Natural Language Feature Selection for Exponential Models

DESCRIPTION: A new predictive model recently introduced by Chen & Rosenfeld can incorporate arbitrary features using exponential distributions and sampling (see
www.cs.cmu.edu/~roni/wsme.ps). The model was originally developed for modeling of natural language, and has highlighted feature selection as the main challenge in that
domain. In this project you will be expected to read and understand this paper. Then, you will be given two corpora. The first one consists of transcribed over-the-phone
conversations. The second corpus is artificial, and was generated from the best existing language model (which was trained on the first corpus). Your job is to use
machine learning and statistical methods of your choice (and other methods if you wish) to find systematic differences between the two corpora. These differences
translate directly into new features, which will be added to the model in an attempt to improve on it (an improvement in language modeling can increase the quality of
language technologies such as speech recognition, machine translation, text classification, spell-checking, etc.).
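One simple statistical starting point for finding systematic differences between the two corpora is a smoothed log-odds-ratio of word frequencies; words with extreme scores are candidate features. This is only one possible method, sketched here on toy data:

```python
import math
from collections import Counter

def log_odds(corpus_a, corpus_b, smoothing=1.0):
    """Smoothed log-odds-ratio of unigram frequencies between two corpora.
    Large positive values mark words characteristic of corpus_a;
    large negative values mark words characteristic of corpus_b."""
    ca, cb = Counter(corpus_a), Counter(corpus_b)
    vocab = set(ca) | set(cb)
    na, nb = sum(ca.values()), sum(cb.values())
    v = len(vocab)
    return {w: math.log((ca[w] + smoothing) / (na + smoothing * v))
             - math.log((cb[w] + smoothing) / (nb + smoothing * v))
            for w in vocab}
```

The same idea extends beyond unigrams: any measurable property (sentence length, disfluencies, n-grams) that scores differently on the real and artificial corpora is a candidate feature for the exponential model.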

COMMENT: This project is open to several students, who would be working separately.

===============================================================

TITLE 5: Learning of strategies for energy trading

DESCRIPTION: In my design course, students write software agents for energy trading. Each agent is given an energy quota and money to spend to fill this quota, for each of
about 50 consecutive periods. There are penalties for not filling the quota, including death (elimination) if the quotas are not filled in 5 consecutive periods. Agents obtain
energy through a double auction: they submit bids and the highest bids win. After each round, all the bids are made public, so each agent knows what its competitors did
in the past. The purpose is to spend as little money as possible and still meet one's quotas, that is, to anticipate what the other agents will bid in the next period, and then
bid just enough to get the energy one needs.

The students know little or nothing about automatic learning. They rely strictly on their intuition to devise their bidding algorithms. It would be interesting to see
if some students in this course could use automatic learning techniques to build winning agents.
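As a trivial baseline a learning agent would need to beat, here is a sketch of a bidder that predicts the next clearing price from the public bid history and adds a safety margin. The interface (observe a clearing price, then produce a bid) is my assumption; the actual auction software is not described here.

```python
class MovingAverageBidder:
    """Bids a small margin above the average of recently observed clearing prices."""

    def __init__(self, margin=0.05, window=5):
        self.margin = margin      # safety margin above the predicted price
        self.window = window      # how many past periods to average over
        self.history = []         # clearing prices made public after each round

    def observe(self, clearing_price):
        self.history.append(clearing_price)

    def bid(self, default=1.0):
        if not self.history:
            return default        # no information yet: bid a nominal default
        recent = self.history[-self.window:]
        return (sum(recent) / len(recent)) * (1 + self.margin)
```

A genuine learning agent might instead model each competitor's bidding pattern from the public history, or adjust the margin by reinforcement from wins and losses.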
 

===============================================================

TITLE 6: Learning from labeled and unlabeled data

DESCRIPTION: The recent paper by Blum & Mitchell on co-training proposes an algorithm for learning from unlabeled as well as labeled data in certain problem
settings (see www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/colt98_final.ps). In this project you will be expected to read and understand this paper, and to extend the experimental results in this paper. In particular, I have some ideas for creating synthetic data sets that test the robustness of the algorithm to changes in the problem setting discussed in the paper.
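The core loop of co-training is easy to sketch: two learners, each seeing a different "view" of the data, repeatedly label their most confident unlabeled examples for each other. The sketch below uses synthetic two-view data and scikit-learn's Gaussian naive Bayes as a stand-in learner; both choices are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data with two redundant views, mimicking the co-training setting.
rng = np.random.default_rng(0)
n = 200
y = np.tile([0, 1], n // 2)
view1 = y[:, None] + rng.normal(0, 0.3, (n, 2))      # view 1 of each example
view2 = 2 * y[:, None] + rng.normal(0, 0.3, (n, 2))  # view 2, also predictive

labels1 = {i: int(y[i]) for i in range(10)}  # tiny labeled pool for learner 1
labels2 = dict(labels1)                      # and for learner 2
unlabeled = list(range(10, n))

for _ in range(20):  # co-training rounds
    c1 = GaussianNB().fit(view1[list(labels1)], list(labels1.values()))
    c2 = GaussianNB().fit(view2[list(labels2)], list(labels2.values()))
    if not unlabeled:
        break
    # Each learner labels its most confident unlabeled example for the other.
    p1 = c1.predict_proba(view1[unlabeled]).max(axis=1)
    p2 = c2.predict_proba(view2[unlabeled]).max(axis=1)
    i1 = unlabeled[int(p1.argmax())]
    i2 = unlabeled[int(p2.argmax())]
    labels2[i1] = int(c1.predict(view1[[i1]])[0])  # c1 teaches c2
    labels1[i2] = int(c2.predict(view2[[i2]])[0])  # c2 teaches c1
    unlabeled = [i for i in unlabeled if i not in (i1, i2)]

accuracy = float((c1.predict(view1) == y).mean())
```

Synthetic generators like this one are exactly where the robustness questions arise: the views here are conditionally independent given the class, and degrading that independence is one way to stress-test the algorithm.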

===============================================================

TITLE 7: Similarity matching in high-dimensional space on discrete data

DESCRIPTION: Given a database with hundreds of attributes (or fields) and thousands of tuples (or records), finding similar tuples is very difficult, and we do
not have efficient algorithms to accomplish this task. I have some ideas for new algorithms that may prove to be effective. In this project, you will implement these
algorithms and explore variants to determine their effectiveness.
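Any new algorithm will need a correctness and speed baseline. For binary attributes, a natural one is exact nearest neighbor under Hamming distance, with each record packed into an integer so the distance is a single XOR-and-popcount. This sketch is only such a baseline, not the instructor's proposed algorithms (which are not described here):

```python
def pack(record):
    """Pack a sequence of 0/1 attribute values into one integer bitmask."""
    bits = 0
    for v in record:
        bits = (bits << 1) | (v & 1)
    return bits

def hamming(a, b):
    """Number of differing attributes between two packed records."""
    return bin(a ^ b).count("1")

def nearest(query, records):
    """Index of the record most similar to the query (smallest Hamming distance)."""
    q = pack(query)
    packed = [pack(r) for r in records]
    return min(range(len(records)), key=lambda i: hamming(q, packed[i]))
```

This brute-force scan is linear in the number of records; the research question is whether a smarter algorithm can beat it on databases with hundreds of attributes and thousands of tuples.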

===============================================================

TITLE 8: Using a repository of old text to answer new questions

DESCRIPTION: Consider a repository of email messages in which discussions center around living with a disease, such as celiac disease, heart disease or diabetes. Frequently,
newly diagnosed people join the list, resulting in a good number of questions being asked repeatedly. Unfortunately, messages do not adhere to a restricted
vocabulary, and so traditional web-based keyword searching is often ineffective. In this project, you will use and evaluate algorithms to generate responses to new email
messages based on the repository of old email messages. You can begin with a Bayesian text classifier [as discussed in class: Lewis, 1991; Lang, 1995; Joachims, 1996]
and a semantic generalization algorithm I have constructed, and, based on your analysis, explore interesting variants to determine the effectiveness of this new approach.

===============================================================

TITLE 9: Data mining of consumer purchase data

The easiest approach would be to look at consumer purchase data and find items that are frequently bought or frequently transacted together.

TITLE 10: Newsletter that learns user interests

Create a personalized newsletter that looks at selected news sources and brings news stories that are of interest to a user. In the easier case, the user gives a list of keywords and your program brings the most relevant news stories and presents them to the user. In the more interesting case, you observe the behavior of the user, learn his/her interests, and bring the relevant news stories.
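The easier, keyword-driven case can be sketched in a few lines: score each story by the fraction of the user's keywords it contains, then return the top-scoring stories. The function names and the scoring rule are illustrative assumptions, the simplest possible starting point before any learning is added:

```python
def relevance(story, keywords):
    """Fraction of the user's keywords that appear in the story text."""
    words = set(story.lower().split())
    return sum(1 for k in keywords if k.lower() in words) / len(keywords)

def top_stories(stories, keywords, k=3):
    """Rank stories by keyword relevance and return the k best."""
    return sorted(stories, key=lambda s: relevance(s, keywords), reverse=True)[:k]
```

The learning variant would replace the fixed keyword list with a model of the user's interests, updated from which stories the user actually reads.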

TITLE 11: An agent that learns how to play the game of Checkers

TITLE 12: RBF network that classifies intruders from normal users

Useful sites where you can find data, code and papers

UCI Machine Learning Repository

JMLR (Journal of Machine Learning Research)