CS 224 Fall 2009 String Algorithms and Algorithms in Computational
Biology

Dan Gusfield

4:10 - 5:30 Tues, Thurs; 167 Olson

There is also a scheduled discussion section, Thurs. 10:30 - 11:30,
that will be used only for student presentations at the end of the
quarter.


The formal prerequisite for this class is CS 222A. Whether or not you
have had that course, what is critical is that you understand the
basic game-plan of worst-case algorithm analysis, that you can think
about algorithms without the need to see low-level programming details,
and that you can follow and construct proofs of algorithmic properties.
The focus of the class is on algorithms that illustrate techniques and
problems that are relevant in computational biology, but it is not a
course on practical computational biology or bioinformatics. CS 124
covers that material, and it is not a prerequisite for CS 224.

The class will be a mixture of two major topic areas: 1) string
algorithms, particularly based on suffix arrays and related techniques
such as constant-time least common ancestor queries; 2) algorithms and
combinatorial structure for phylogenetic networks, particularly
algorithms to deduce phylogenetic networks involving recombination as
well as mutation. In both areas of concentration, algorithmic
efficiency and combinatorial structure will be emphasized. The
phylogenetic topics will also cover combinatorial optimization methods
and integer programming.

Open problems will be identified. In the four times that this class has
been offered, at least two published papers have resulted from open
problems examined in the class.

There will be regular homework assignments and a final exam (most likely
take-home). Each student will also be required to read a current paper
(I will suggest some later), and present it in the scheduled discussion
section. The following lecture topic list is subject to revision as we
go. There is no assigned textbook. Course notes or copied materials
will be handed out.

The main topics are:

1. Exact Matching, Z-algorithm, Boyer-Moore Algorithm using the
Z-algorithm for fast preprocessing, linear-time construction of suffix
trees, suffix arrays, and LCP information; new (2004 and 2006)
linear-time preprocessing methods for constant-time least common
ancestor and longest common extension queries. Many possible
applications, such as rapid identification of pathogens, finding all
tandem repeats in linear time, use of LCA for accumulating common
substring statistics, approximate tandem repeat finding, and fast
unique decipherability testing using suffix trees.

2. Non-trivial applications of dynamic programming: hybrid dynamic
programming with suffix trees, linear-space applications, Four-Russians
algorithms, improvements in RNA folding algorithms, circular string
edit distance.

3. Perfect Phylogeny: 2-state and 3-state algorithms. Reduction of the
3-state problem to 2-SAT. Relation to chordal graph theory and the
exploitation of that connection.

4. Neighbor joining algorithm for distance-based phylogenetics. New
proofs of consistency. Generalization of the metric-uniqueness
theorem.

5. Phylogenetic Networks involving recombination: galled trees,
decomposition, construction of good networks, lower bounds on the
number of recombinations needed, many uses of integer programming,
applications to topics such as genome-wide association studies.
Phylogenetic networks involving hybridization and other reticulation
events; relation to SPR and maximum agreement forests. Introduction to
the open field of extending the binary case to the multi-state case.