Lab 6, Week of May 10

Exercises are due in lab May 17

In this lab, we will use Blast on the web, and write a bit of a toy Blast program in Perl.

Blast:

Coffee Break (Aug 18)

http://www.ncbi.nlm.nih.gov/Coffeebreak/Archive/archive.html

From the Coffee Break archive go to the August 18, 1999 entry on the CLOCK protein. This is a good example of how homology searching can be useful. After you have read through the text answer the following exercises related to the Drosophila CLOCK protein.

BLAST Exercise:

First we would like to find the protein record for the Drosophila clock protein. The entry has the unique gi number 3219726. Find the entry. What is the accession number?
There will be a format display inidcator near the top of the screen. It is set to GenPept by default, change the display to FASTA format. Now copy the sequence beginning with the > and ending with the last amino acid.
You will now perform a few BLAST searches with your protein. From the BLAST home page listed above choose the Advanced BLAST Search option. Paste in your sequence, and then make the following selections:

Program: blastp
Database: SwissProt
Descriptions: 100
Alignments: 10
Scoring Matrix: BLOSUM 45

What are the SwissProt names of the top 3 hits and their corresponding E-values?

There is a protein called NPA1 that also contains the same circadian domain described in the report on the CLOCK protein. Find the first two "hits" for this protein in your BLAST results. What are their names and corresponding E-values?
Now repeat the experiment with the scoring matrix set to BLOSUM 80 and answer the same questions.
In your own words, what is the E-value? You may wish to consult the BLAST Notes or the Online Help.
Based on the results of your two BLAST searches which scoring method seems better? There are a couple of forces at work so the answer is not so straightforward - but you may be familiar with terms like sensitivity and selectivity which are often used to describe the "goodness" of search results.
Based on your search results towards what end of the protein do you think the circadian domain resides?
We will now support your conclusion above as well as examine the effects of length on the significance of search results. We will accomplish this by dividing the query in half and searching both halves against the SwissProt database using the BLOSUM45 scoring matrix, all other settings as indicated above. First search the first 7 lines of the FASTA entry against the database. Use the 5 sequences identified above as a point of comparison. Of the ones that are retrieved as hits, what are their scores and E-values. Now repeat with the next 7 lines of the FASTA entry. Does this support your answer above? How?
Many times when looking for the location of a gene an expressed amino acid sequence is searched againsed a nucleotide genome. Which blast program would you use to do this?

The start of a toy BLAST program in Perl:

Several kmer searching programs in Perl are available from here. I will discuss these in detail in lecture on Monday, along with another kmer program (kmerfirst.pl) available here. The new elements of Perl include the substr function, the length function, two-dimensional hashes, the defined function, building up a list as a hash value, and maybe some others. You should read Johnson on these elements of Perl, although I will talk about them explicitly in class (but I have no notes on them). Eventually, you will need to understand all of these for the larger BLAST program we will write (in coming labs). For this lab, kmerfirst.pl will be the most useful. Program kmerfirst.pl finds the first position of each of the different kmers of length k. You will extract what you need from this, and then expand on it, to write the start of a toy BLAST program. Your program should do the following things:

1. Read in from file a query string Q.

2. For k = 4, adapt program kmerfirst.pl to find the first location of each different k-mer in Q. This information is kept in a hash as in program kmerfirst.pl.

3. Successively read in one string at a time from a file called toygenbank. When a string S is read in, scan through it 4-mers, using the same hash as before. For this, extract and adapt what you need from kmerfirst.pl

4. Whenever (if ever) a 4-mer in S is determined to be in Q, extract the location of the first occurrence of that 4-mer in Q. Then put the characters of Q and S in arrays (as we did in needleman.pl) so that you can examine individual characters. Then scan left from the k-mer in Q and in S, as long as you find matching characters. Repeat to the right. Let L denote the length of the whole match obtained in this way. If L is greater than 10, then print a message that a good HSP has been found between Q and S, and print S.

Here is some test data to work with. When you hand in the lab for this week, be sure to show the program working on that data.

Notice that the same HSP gets reported multiple times. Explain why that happens.

If you want to work ahead, next week the lab will ask you to modify and use kmer4.pl instead of kmerfirst.pl, so that every location of each different 4-mer in Q is collected. Then when a 4-mer common to Q and S is found, the left and right scan is done at each location in Q. Also, the program should use a hash to avoid reporting the same HSP multiple times.