
Sequence-oriented bioinformatics as a first paradigm

Sequence comparison, particularly when combined with the systematic collection, curation and search of databases containing biomolecular sequences, has become essential in modern molecular biology. One fact explains the importance of molecular sequence data and sequence comparison in biology:

In biomolecular sequences (DNA, RNA or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity.

Evolution reuses, builds on, duplicates and modifies those structures (proteins, exons, DNA regulatory sequences, morphological features, enzymatic pathways, etc.) that have been "successful" (a notion deliberately left vague here). Life is based on a repertoire of structured and interrelated molecular building blocks that are shared and passed around. The same and related molecular structures and mechanisms show up repeatedly in the genome of a single species, and across a very wide spectrum of divergent species. "Duplication with modification" [#!D!#,#!DO90!#,#!DO93!#,#!DO183!#] is the central paradigm of protein evolution, wherein new proteins and/or new biological functions are fashioned from earlier ones. R. Doolittle [#!DO183!#,#!DO90!#] emphasizes this point as follows:

The vast majority of extant proteins are the result of a continuous series of genetic duplications and subsequent modifications. As a result, redundancy is a built-in characteristic of protein sequences, and we should not be surprised that so many new sequences resemble already known sequences.

and

"... all of biology is based on an enormous redundancy ...".

The following quotes reinforce this view and suggest the utility of the "enormous redundancy" in the practice of molecular biology. The first quote is from Eric Wieschaus, co-winner of the 1995 Nobel Prize in medicine for work on the genetics of Drosophila development. Describing work done years earlier, Wieschaus says:

We didn't know it at the time, but we found out everything in life is so similar, that the same genes that work in flies are the ones that work in humans.

And fruit flies aren't special. The following is from a book review on DNA repair [#!STR!#]:

Throughout the present work we see the insights gained through our ability to look for sequence homologies by comparison of the DNA of different species. Studies on yeast are remarkable predictors of the human system!


So "redundancy" and "similarity" are central phenomena in biology. But similarity has its limits: humans and flies do differ in some respects. These differences make conserved similarities even more significant, which in turn makes comparison and analogy very powerful tools in biology. Lesk [#!LESK!#] writes:

It is characteristic of biological systems that objects that we observe to have a certain form arose by evolution from related objects with similar but not identical form. They must, therefore, be robust, in that they retain the freedom to tolerate some variation. We can take advantage of this robustness in our analysis: By identifying and comparing related objects, we can distinguish variable and conserved features, and thereby determine what is crucial to structure and function.
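As a small illustration of distinguishing variable from conserved features, the following Python sketch marks which columns agree across a set of related sequences. It assumes the sequences have already been aligned to equal length and contain no gap characters; the function name `conserved_columns` and the toy peptide strings are hypothetical and chosen only for this example.

```python
def conserved_columns(aligned):
    """Mark each column '*' if every sequence has the same symbol there, '.' otherwise.

    Assumes the input sequences are pre-aligned (equal length, no gap symbols).
    """
    lengths = {len(seq) for seq in aligned}
    if len(lengths) != 1:
        raise ValueError("sequences must be pre-aligned to equal length")
    # zip(*aligned) walks the alignment column by column.
    return "".join("*" if len(set(col)) == 1 else "." for col in zip(*aligned))


# Toy, made-up peptide fragments that differ at two positions.
toy_family = ["MKTAYIAK", "MKSAYLAK", "MKTAYVAK"]
mask = conserved_columns(toy_family)
for seq in toy_family:
    print(seq)
print(mask)
```

Columns marked `*` are candidates for positions that matter to structure or function; in practice one would first compute the alignment and would weigh conservation by more than strict identity.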


The important "related objects" to compare include much more than sequence data, because biological universality occurs at many levels of detail. But it is usually easier to acquire and examine sequences than it is to examine fine details of genetics or cellular biochemistry or morphology. For example, there are vastly more protein sequences known (deduced from underlying DNA sequences) than there are known three-dimensional protein structures. Nor is it only convenience that makes sequences important. Rather, biological sequences encode and reflect the more complex common molecular structures and mechanisms that appear as features at the cellular or biochemical levels. Moreover, "nowhere in the biological world is the Darwinian notion of 'descent with modification' more apparent than in the sequences of genes and gene products" [#!DO90!#]. Hence a tractable, though partly heuristic, way to search for functional or structural universality in biological systems is to search for similarity and conservation at the sequence level. The power of this approach is made clear in the following quotes from [#!PEARSON-95!#] and [#!CRLS!#] respectively:

Today, the most powerful method for inferring the biological function of a gene (or the protein that it encodes) is by sequence similarity searching on protein and DNA sequence databases. With the development of rapid methods for sequence comparison, both with heuristic algorithms and powerful parallel computers, discoveries based solely on sequence homology have become routine.


Determining function for a sequence is a matter of tremendous complexity, requiring biological experiments of the highest order of creativity. Nevertheless, with only DNA sequence it is possible to execute a computer-based algorithm comparing the sequence to a database of previously characterized genes. In about 50% of the cases, such a mechanical comparison will indicate a sufficient degree of similarity to suggest a putative enzymatic or structural function that might be possessed by the unknown gene.
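The "mechanical comparison" of a new sequence against a database of characterized genes can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular search program: it scores each database entry with a plain Smith-Waterman local alignment using a simple match/mismatch/gap scheme (the values 2, -1, -2 are arbitrary assumptions), whereas production tools rely on heuristics and substitution matrices for speed and sensitivity. The names `local_alignment_score`, `search`, and the toy database entries are hypothetical.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b.

    Uses linear gap penalties and keeps only two rows of the DP table.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database):
    """Return (score, name) pairs for every database entry, best score first."""
    hits = [(local_alignment_score(query, seq), name) for name, seq in database.items()]
    return sorted(hits, reverse=True)


# Toy database of made-up sequences standing in for "previously characterized genes".
toy_database = {
    "geneA": "TTACGTACGTTT",
    "geneB": "GGGGCCCC",
    "geneC": "ACGAACGT",
}
for score, name in search("ACGTACGT", toy_database):
    print(name, score)
```

A high-scoring hit only suggests a putative function for the query; deciding whether a score is significant, and confirming the function, are separate steps.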

So large-scale sequence comparison, usually organized as database search, is a very powerful tool for biological inference in modern molecular biology, and it is already used almost universally by molecular biologists. The final quote reflects the potential impact on biology of sequence data and of its exploitation through sequence database searching. It is from an article [#!GIL!#] by Nobel laureate Walter Gilbert:

The new paradigm, now emerging, is that all the 'genes' will be known (in the sense of being resident in databases available electronically), and that the starting point of biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture, only then turning to experiment to follow or test that hypothesis.



Dan Gusfield
1999-11-03