Lab 7, Week of May 17

Exercises are due in class May 26

In Lab 7 we ask you to do two things:

ToyBLAST. You will refine your toy BLAST program. If you didn't get everything working in the Lab 6 part of that program, continue working on that as well as the new parts. If you absolutely cannot get Lab 6 finished and working, and hence can't continue with Lab 7, let us know, and we will give you some parts of the code for Lab 6. But it is best if you do it all yourself.

Multiple Sequence Alignment. You will learn how to use ClustalX, one of the most popular global multiple alignment programs, and (optionally) you will use a homegrown multiple alignment program, star.pl, that we will discuss in class.

In addition, if you have not already done so, please read the third set of notes on Perl, distributed about two weeks ago. Please let me know if you find any errors or ambiguities in those notes.

What you need to turn in are your answers to the questions and exercises below. For Perl programs, use the Unix script command to record a session that prints out the program and shows how it runs on data.

1. ToyBLAST

There are several deficiencies of the toy BLAST program you did for Lab 6. Here we will tackle a few of them.

Use kmer4.pl instead of kmerfirst.pl, so that every location of each different 4-mer in Q is collected. Then when a 4-mer common to Q and S is found, the left and right scan is done at each location in Q.
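
As a rough illustration of what "every location of each different 4-mer" means, here is a minimal sketch. It is not the code of kmer4.pl; the subroutine name kmer_positions and the sample query are made up for this example, and positions are assumed to be 0-based.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from kmer4.pl): map each k-mer in $query
# to the list of every 0-based position where it starts.
sub kmer_positions {
    my ($query, $k) = @_;
    my %positions;
    for my $i (0 .. length($query) - $k) {
        my $kmer = substr($query, $i, $k);
        push @{ $positions{$kmer} }, $i;   # keep all positions, not just the first
    }
    return \%positions;
}

# Toy query: ACGT occurs twice, so it gets two positions.
my $index = kmer_positions("ACGTACGTTT", 4);
for my $kmer (sort keys %$index) {
    print "$kmer: @{ $index->{$kmer} }\n";
}
```

With a table like this, a 4-mer found in S can be looked up once, and the left and right scan repeated for each stored position in Q.
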
When a database string matches a w-character substring of the query string, where w > k (k is the k-mer length, 4 here), the same substring is found multiple times, once from each k-mer it contains. If w > t, the threshold for reporting, it will also get reported many times; that is not desirable. Real BLAST has ways to avoid finding the same substring multiple times, but we will only make sure that the substring is not reported multiple times. We can do it using a hash called stringhash, like this: whenever BLAST finds a reportable substring starting at position $i, say, in the database string, it looks to see if $stringhash{$i} is defined. If so, it does not report the string. Otherwise it assigns the string to $stringhash{$i} and does report the string.
Implement this change.
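
The stringhash check described above might look roughly like the sketch below. The subroutine name report_once and the sample matches are made up for illustration, and clearing the hash between database strings is an assumption that the handout does not state.

```perl
use strict;
use warnings;

my %stringhash;   # start position in the database string => reported substring

# Hypothetical helper: returns 1 if the match is new (and records and
# prints it), 0 if a match starting at $i was already reported.
sub report_once {
    my ($i, $match) = @_;
    return 0 if defined $stringhash{$i};
    $stringhash{$i} = $match;
    print "match at $i: $match\n";
    return 1;
}

# The same 6-character match reached from three overlapping 4-mers
report_once(12, "ACGTAC") for 1 .. 3;   # printed only the first time
report_once(30, "TTGACA");              # different start, so it is printed

%stringhash = ();   # assumption: clear before moving to the next database string
```
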
We would like to process strings that are more than a single line long. So in the file each string will be held in consecutive lines, with strings separated by blank lines. That is like saying that each string is a paragraph instead of just a single line. To read in a whole paragraph at a time, put the line

$/ = "";

somewhere in the program before the reading begins. Read about this on page 102 of Johnson.
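
A minimal sketch of paragraph-mode reading follows. The subroutine name read_paragraphs and the sample data are made up for this example. It sets $/ with local inside the subroutine, a variation on the global assignment shown above that keeps the change from affecting other reads. Removing the internal newlines, so that each string becomes one long line, is an assumption about what the rest of your program expects.

```perl
use strict;
use warnings;

# Hypothetical helper: read blank-line-separated strings from a filehandle
# and return each one as a single string with the newlines removed.
sub read_paragraphs {
    my ($fh) = @_;
    local $/ = "";              # paragraph mode: one record per paragraph
    my @strings;
    while (my $para = <$fh>) {
        $para =~ s/\s+//g;      # drop newlines and any stray whitespace
        push @strings, $para if length $para;
    }
    return @strings;
}

# Demonstration on an in-memory file holding two multi-line strings
my $data = "ACGT\nTTGA\n\nGGCC\nAATT\nCC\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my @db = read_paragraphs($fh);
close $fh;
print "$_\n" for @db;
```
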

2. Multiple Sequence Alignment

i) ClustalX

Your first task is to work through the practical tutorial on ClustalX found at ClustalX Tutorial.

Download the file aligned globins, which shows a polished, hand-optimized multiple alignment of many globin sequences.

Download the file packed globins which has the same sequences as the globins file, but with the spaces and names removed.

Now adapt the packed globins data so that it can be used as input to ClustalX, and use ClustalX to get a multiple alignment of those sequences. How does it compare to the original one in the aligned globins file?
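
One possible way to adapt the data is to write it in FASTA format, which ClustalX accepts as input. The sketch below assumes you have already read the packed sequences into a list, one string per sequence. The subroutine name to_fasta, the generated names seq1, seq2, and so on, and the toy sequences are all made up for this example; they are not taken from the globins files.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of bare sequences into FASTA text.
# The names seq1, seq2, ... are invented, since the packed file has none.
sub to_fasta {
    my (@seqs) = @_;
    my $out = "";
    my $n = 0;
    for my $seq (@seqs) {
        $n++;
        $out .= ">seq$n\n";
        # wrap the sequence at 60 characters per line
        $out .= "$1\n" while $seq =~ /(.{1,60})/g;
    }
    return $out;
}

# Toy placeholder sequences, not real globin data
print to_fasta("ACDEFGHIKL" x 7, "MNPQRSTVWY");
```

Redirect the output to a file and load that file into ClustalX.
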

ii) (Optional) A Homegrown multiple alignment program: star.pl

Now download the program star.pl, which we will use to multiply align the sequences.

You will also need a weight matrix, weight.txt, when you run star.pl. Download that matrix from weight.txt.

Run this multiple alignment program with the packed globins.

The multiple alignment will have to be cut out from among other output. Cut it out and save it to a file. How does it compare, just by eyeballing the alignments, to the multiple alignment we started from in the file aligned globins, and to the alignment produced by ClustalX? Take note of the ratio of the optimal pairwise to the induced scores produced by star.pl.

In the star.pl program, you are initially asked for a center number and told to use the mini center first. After the first multiple alignment is found, the program gives you the option to specify another center and find the resulting multiple alignment. Do this a couple of times, picking centers other than the mini center. Each time, see how well the resulting multiple alignment matches the original alignment in aligned globins, and what the ratio of optimal to induced scores is. Record these ratios, and state your conclusions.

How does this program work?