CS 224, Fall 2009
String Algorithms and Algorithms in Computational Biology
Dan Gusfield
4:10 - 5:30 Tues, Thurs; 167 Olson

There is also a scheduled discussion section, Thurs. 10:30 - 11:30, that will be used only for student presentations at the end of the quarter.

The formal prerequisite for this class is CS 222A. But whether or not you have had that course, what is critical is that you understand the basic game plan of worst-case algorithm analysis, that you can think about algorithms without needing to see low-level programming details, and that you can follow and construct proofs of algorithmic properties. The focus of the class is on algorithms that illustrate techniques and problems relevant to computational biology, but it is not a course on practical computational biology or bioinformatics. That course is CS 124, which is not a prerequisite for CS 224.

The class will be a mixture of two major topic areas: 1) string algorithms, particularly those based on suffix arrays and related techniques such as constant-time least common ancestor queries; 2) algorithms and combinatorial structure for phylogenetic networks, particularly algorithms to deduce phylogenetic networks involving recombination as well as mutation. In both areas, algorithmic efficiency and combinatorial structure will be emphasized. The phylogenetic topics also cover combinatorial optimization methods and integer programming. Open problems will be identified. In the four times that this class has been offered, at least two published papers resulted from open problems examined in the class.

There will be regular homeworks and a final exam (most likely take-home). Each student will also be required to read a current paper (I will suggest some later) and present it in the scheduled discussion section.

There is no assigned textbook. Course notes or copied materials will be handed out.

The following lecture topic list is subject to revision as we go. The main topics are:

1. Exact matching, the Z-algorithm, the Boyer-Moore algorithm using the Z-algorithm for fast preprocessing, linear-time construction of suffix trees, suffix arrays, and LCP information; new (2004 and 2006) linear-time preprocessing methods for constant-time least common ancestor and longest common extension queries. Many possible applications, such as rapid identification of pathogens, finding all tandem repeats in linear time, use of LCA for accumulating common substring statistics, approximate tandem repeat finding, fast unique decipherability using suffix trees, etc.

2. Non-trivial applications of dynamic programming: hybrid dynamic programming with suffix trees, linear-space applications, Four-Russians algorithms, improvements in RNA folding algorithms, circular string edit distance.

3. Perfect phylogeny: 2-state and 3-state algorithms. Reduction of the 3-state problem to 2-SAT. Relation to chordal graph theory and the exploitation of that connection.

4. The neighbor-joining algorithm for distance-based phylogenetics. New proofs of consistency. Generalization of the metric-uniqueness theorem.

5. Phylogenetic networks involving recombination: galled trees, decomposition, construction of good networks, lower bounds on the number of recombinations needed, many uses of integer programming, applications to topics such as genome-wide association studies. Phylogenetic networks involving hybridization and other reticulation events; relation to SPR and maximum agreement forests. Introduction to the open field of extending the binary case to the multi-state case.