Primer on Finding Homologous Proteins or Genes
Hugh Nicholas and Alexander RopelewskiLooking for proteins homologous to a known protein sequence is one of the most intensively studied areas in computational biology, and there are many algorithms and programs available. Identifying all sequences homologous to a single known sequence involves several cycles of specifying a query, followed by searching the database for additional members of the family of homologous proteins.
Step 1: Query the protein databases for one or more homologous sequences
The first step is always to use the single known protein sequence as a query to search one or more of the protein databases for homologous sequences. The search technique of choice is to use an algorithm that finds a "local" alignment. "Local" alignment searches identify the regions or subsections in a pair of sequences that show the highest degree of similarity and provides the strongest evidence in favor of the hypothesis of homology -- that is, that the two sequences are related to each other through evolution from a common ancestral sequence.You can search the database for local alignments with one of several widely used heuristic algorithms, or with a rigorous algorithm based on dynamic programming that is guaranteed to report the best matching region between the query sequence and every sequence in the database.
The heuristic algorithms approximate the mathematically rigorous approach but may run a couple of orders of magnitude faster. The price for this speed up is that the heuristic algorithms can miss up to 3 to 5 percent of the matches that would be found using the rigorous algorithm.
All the heuristic algorithms begin, and get most of their speed up, by using a hashing algorithm to examine the sequences for a minimum number of contiguous exact or high scoring matches. Only the regions of sequences showing these contiguous high density matches are examined further. This makes the heuristic methods both less sensitive and more selective than the rigorous approach. In rare instances the additional selectivity is an advantage.
The rigorous searches usually use the Smith-Waterman algorithm or the extensions later made to it by Waterman called MaxSegs -- which finds the "k best" scoring independent sub regions. A highly portable and flexible implementation of the MaxSegs algorithm is available at the Pittsburgh Supercomputing Center (PSC). This version is also well optimized for vector supercomputers and can be run at the PSC. The two most widely used heuristic algorithms are FASTA and BLAST. FASTA is available in several implementations from a wide variety of sources and BLAST is available at the National Library of Medicine by anonymous ftp or you can use it there through an e-mail server.
If you use one of the heuristic algorithms you will want to use MaxSegs to get a rigorous and more complete alignment of homologous sequences initially identified by the heuristic algorithm.
One of the problems that can limit the effectiveness of the entire investigation is selecting scoring parameters to use in this initial database search. A Primer on Scoring Methods is available.
Step 2: Get a complete alignment
After you have found homologues of the initial sequence you will want to get a complete alignment. If the sequences appear related over their entire length rather than just sharing just a common domain, such as a nucleotide binding fold, you will want a "global" alignment. A "global" alignment is an alignment over the entire length of the sequences.If you found only a single homologous sequence then you will need a global pairwise algorithm. The most commonly used global alignment programs use the rigorous dynamic programming algorithm of Needleman and Wunsch with the extensions suggested by Gotoh. The PSC has a highly portable and flexible implementation of this algorithm that we refer to as NWGap. This program runs very efficiently on the PSC's Cray C-90 vector supercomputer and is capable of aligning very long sequences.
If you have identified several homologous sequences then you will want a global multiple sequence alignment program. There are two approaches to the global alignment of several sequences. First, and in many ways the most powerful, is multidimensional dynamic programming. This is an extension of the two sequence dynamic programming approach to more than two sequences. The best program for this purpose was written at the National Library of Medicine and is called MSA. It is available by anonymous ftp. We have a version of MSA that runs on the Cray C-90 at the PSC that we have enhanced to make it easier for the user to specify different scoring parameters. This approach can use large amounts of computer time and memory and is thus well suited to the PSC's C-90.
A second approach to creating global multiple sequence alignments is the progressive alignment approach. The first step is a cluster analysis of the sequences. Then the most similar pair of sequences is aligned. Next, either another pair of sequences is aligned or another sequence is aligned to the cluster of already aligned sequences. This process is repeated until all the sequences have been merged into a single aligned cluster of sequences.
The progressive approach uses much less computer time and memory than multidimensional dyn amic programming. However it suffers from being subjected to what mathematicians call "being trapped in a local maximum". The local maximum results from the progressive nature of the process which looks at only part of the data in the early steps.
Since the alignment is the result of a progressive series of pairwise alignments its overall correctness is very dependent on the correctness of alignment of the initial pairs of sequences aligned. Unfortunately, pairwise alignments are fairly unreliable, particularly if the sequences have diverged to an appreciable extent. The progressive alignment process cannot correct errors introduced into the alignment at an earlier stage as it adds more sequences to the alignment. Since early errors are not corrected they propagate and lead to more errors in the later part of the progressive alignment process.
Thus the alignment show the maximum similarity of the sequences given the order in which they were joined into the alignment. This local maximum of similarity is unlikely to be overall or global maximum of similarity over all of the sequences in the alignment. Such a global maximum is obtainable with the multidimensional dynamic programming approach to multiple sequence alignment.
Studies at the PSC (unpublished data) indicate that alignments produced by multidimensional dynamic programming (MSA) are much closer to those predicted from the superposition of three dimensional protein structures than are alignments derived from the progressive approach. This can be true even in cases selected to give the progressive alignment method a high probability of producing the correct alignment.
Step3: Use the alignment to create a sequence Profile
After the homologous sequences have been aligned the next step should be to use the alignment to create a sequence Profile. The Profile is used to again search the sequence databases for additional sequences homologous to those in the set of aligned sequences used to create the Profile. This allows you to identify sequences that are more distantly related to your initial, single sequence. Searching the sequence database with a Profile is always done using the rigorous dynamic programming approach.The PSC has a program for this called Profile-SS (Profile search and scan). As with other PSC programs Profile-SS runs efficiently on our vector supercomputer. Other implementations of the Profile Search program are available from Mike Gribskov at the San Diego Supercomputing Center and in the commercial GCG suite of programs, although the PSC version has features not available in other implementations.
After you identify more distantly related homologues with the Profile search, you will want to add the new sequences to the multiple sequence alignment. Profile-SS has a global alignment option that allows you to do this in a straight forward manner. Alternatively, you can begin with all of the sequences unaligned and start at the beginning of the global alignment process.
Step 4: Repeat the process until no new sequences are discovered.
This whole process of realignment, profile creation, and search the database with the profile can be repeated until no new sequence are discovered in the Profile search stage.
See also:
- Bibliography of readings on the topics covered above.