Lab 6, Week of May 10
Exercises are
due in lab May 17
In this lab, we will use Blast on the web, and write a bit of
a toy Blast program in Perl.
Blast:
Coffee Break (Aug 18)
From the Coffee Break archive go to the August 18, 1999 entry
on the CLOCK protein. This is a good example of how homology searching
can be useful. After you have read through the text
answer the following exercises
related to the Drosophila CLOCK protein.
BLAST Exercise:
- First we would like to find the protein record for the Drosophila
clock protein. The entry has the unique gi number 3219726.
Find the entry. What is the accession number?
- There will be a format display inidcator near the top of
the screen. It is set to GenPept by default, change the display
to FASTA format. Now copy the sequence beginning with the >
and ending with the last amino acid.
- You will now perform a few BLAST searches with your protein.
From the BLAST home page listed above choose the Advanced BLAST
Search option. Paste in your sequence, and then make the following
selections:
Program: blastp
Database: SwissProt
Descriptions: 100
Alignments: 10
Scoring Matrix: BLOSUM 45
What are the SwissProt names of the top 3 hits and their corresponding
E-values?
There is a protein called NPA1 that also contains the same circadian
domain described in the report on the CLOCK protein. Find the
first two "hits" for this protein in your BLAST results.
What are their names and corresponding E-values?
- Now repeat the experiment with the scoring matrix set to
BLOSUM 80 and answer the same questions.
- In your own words, what is the E-value? You may wish to consult
the BLAST Notes or the Online Help.
- Based on the results of your two BLAST searches which scoring
method seems better? There are a couple of forces at work so
the answer is not so straightforward - but you may be familiar
with terms like sensitivity and selectivity which are often used
to describe the "goodness" of search results.
- Based on your search results towards what end of the protein
do you think the circadian domain resides?
- We will now support your conclusion above as well as examine
the effects of length on the significance of search results.
We will accomplish this by dividing the query in half and searching
both halves against the SwissProt database using the BLOSUM45 scoring
matrix, all other settings as indicated above. First search the first 7 lines
of the FASTA entry against the database. Use the 5 sequences identified above
as a point of comparison. Of the ones that are retrieved as hits, what are
their scores and E-values. Now repeat with the next 7 lines of the
FASTA entry. Does this support your answer above? How?
- Many
times when looking for the location of a gene an expressed amino acid sequence
is searched againsed a nucleotide genome. Which blast program would you use
to do this?
The start of a toy BLAST program in Perl:
Several kmer searching programs in Perl are available from here.
I will discuss these in detail
in lecture on Monday, along with another kmer program (kmerfirst.pl)
available here.
The new elements of Perl include the substr function,
the length function, two-dimensional hashes, the defined function,
building up a list as a hash value, and maybe some others. You should
read Johnson on these elements of Perl, although I will talk about them
explicitly in class (but I have no notes on them). Eventually, you
will need to understand all of these for the larger BLAST program
we will write (in coming labs). For this lab, kmerfirst.pl will
be the most useful.
Program kmerfirst.pl finds the first position of each of the different kmers
of length k. You will extract what you need from this, and then expand
on it, to write the start of a toy BLAST program. Your program should
do the following things:
1. Read in from file a query string Q.
2. For k = 4, adapt program kmerfirst.pl to find the first location of
each different k-mer in Q.
This information is kept in a hash as in program
kmerfirst.pl.
3. Successively read in one string at a time from a file called
toygenbank. When a string S
is read in, scan through it 4-mers, using the same hash as before.
For this, extract and adapt what you need from kmerfirst.pl
4. Whenever (if ever) a
4-mer in S is determined to be in Q, extract the location of
the first occurrence of that 4-mer in Q. Then put the characters of
Q and S in arrays (as we did in needleman.pl) so that you can examine
individual characters. Then scan left from the k-mer in Q and in
S, as long as you find matching characters. Repeat to the right.
Let L denote the length of the whole match obtained in this way.
If L is greater than
10, then print a message that a good HSP has been found between Q and
S, and print S.
Here is some test data to work with.
When you hand in the lab for this week, be sure to show the program
working on that data.
Notice that the same HSP gets reported multiple times. Explain why
that happens.
If you want to work ahead, next week the lab will ask you to modify and use
kmer4.pl instead of kmerfirst.pl, so that every
location of each different
4-mer in Q is collected. Then when a 4-mer common to Q and S is found,
the left and right scan is done at each location in Q. Also, the program
should use a hash to avoid reporting the same HSP multiple times.