Science Fair Project Encyclopedia
Sequence alignment is an arrangement of two or more sequences, highlighting their similarity. The sequences are padded with gaps (usually denoted by dashes) so that wherever possible, columns contain identical or similar characters from the sequences involved:
tcctctgcctctgccatcat---caaccccaaagt |||| ||| ||||| ||||| |||||||||||| tcctgtgcatctgcaatcatgggcaaccccaaagt
It is usually used to study the evolution of the sequences from a common ancestor, especially biological sequences such as protein sequences or DNA sequences. Mismatches in the alignment correspond to mutations, and gaps correspond to insertions or deletions. Sequence alignment can also be used to study things like the evolution of languages and the similarity between texts.
The term sequence alignment may also refer to the process of constructing such alignment or finding significant alignments in a database of potentially unrelated sequences.
Typically, the purpose of this is to find homologues (relatives) of a gene or gene-product in a database of known examples. This information is useful for answering a variety of biological questions. The most important application of pairwise alignment is identification of sequences of unknown structure or function. Another important use is the study of molecular evolution.
A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment. Global alignments are useful mostly for finding closely-related sequences. As these sequences are also easily identified by local alignment methods global alignment is now somewhat deprecated as a technique. Further, there are several complications to molecular evolution (such as domain shuffling - see below) which prevent these methods from being useful.
Local alignment methods find related regions within sequences - in other words they can consist of a subset of the characters within each sequence. For example, positions 20-40 of sequence A might be aligned with positions 50-70 of sequence B.
This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins (which is known as domain shuffling) can be identified as being related. This is not possible with global alignment methods.
Significance of alignment
The typical assumption in the use of alignment is the mechanism of molecular evolution. DNA carries over genetic material from generation to generation, by virtue of its semi-conservative duplication mechanism. Changes in the material are introduced by occasional errors and mutations in the duplication, and by viruses and other mechanisms which sometimes move sub-sequences within the chromosome and between individuals. Consequently, an alignment between sequences indicates that the sequences evolved from a common ancestor which contained the matching subsequences. In the case of genetic sequences, it implies that their carriers evolved from a common ancestor.
Using assumptions about the probabilities of these change events, we can estimate the time when sequences diverged from a common ancestor or the time required for changing one sequence into another. However, there is disagreement about the value and nature of these probabilities for biological evolution. One school of thought assumes a simple, constant rate of change (an application of Occam's Razor) while the other school assumes short evolutionary periods when changes were extremely high.
Many genetic changes are evolutionarily non-significant (that is, neutral, neither positive or negative) in that they do not affect the selective advantage (the fitness) of the carrier. This is the basis of Kimura's Neutral theory. This is partly supported by examples such as the Globin protein sequence which shows normal functionality in-spite of several changes at various bases but loses functionality only when changes occur at critical sequence locations.
The actual biological meaning of any alignment can never be absolutely guaranteed. However, statistical methods can be used to assess the likelihood of finding an alignment between two regions (or sequences) by chance, given the size of the database and its composition.
Two important related issues for sequence alignment are:
- How should we choose the best alignment between two sequences (or regions of sequences)?
- How do we rank the alignments between a query and the many sequences in the database, according to their significance in the domain (such as biological significance)?
In studying evolution, the first question can be addressed by developing a model of how likely are certain changes between characters in the sequence. There are many ways to do this, none of which is generally superior. The models are derived empirically using related sequences, and are expressed as substitution matrices. These matrices are used by the algorithms named below to evaluate alternative alignments between two sequences. The actual biological quality of the alignments then depends upon the evolutionary model used to generate the score.
The second question is purely statistical. Research has determined a few hard theoretical rules and many approximations. It is now generally accepted that the scores of alignments between random sequences follow the extreme value distribution. Pairwise alignment programs such as BLAST use simulation to estimate the parameters of this distribution given a particular query, database, substitution matrix and certain other parameters. Alignments can then be given a statistical significance value, allowing inference of possible relationships between sequences.
One potential way to validate sequence alignments is to use structural alignments where possible. Because protein structure is more conserved through evolution than is sequence (as shown by Cyrus Chothia and Arthur Lesk in 1986), structural alignments can be more reliable over long evolutionary distances, when the sequences have diverged so much that simple sequence comparison can detect their similarity. However, although structural alignments are useful, they are completely artifactual.
Multiple alignment is an extension of pairwise alignment to incorporate several sequences. Multiple alignment methods find common regions between the sequences at once, without making pairwise alignments first. There are several approaches, one of the most popular being the progressive alignment strategy used by the CLUSTAL family of programs. This is used in cladistics to build phylogenetic trees, and to build sequence profiles which are used to search sequence the database for more distant relatives. The two most popular methods for detecting remote-homologues, PSI-BLAST and Hidden Markov model- (HMM-) based methods, both work on this principle.
Pairwise global alignment only.
Pairwise, local or global alignment. It works by generating all six possible translations of each nucleotide sequence, then uses those translations for protein search.
This is an extension of Smith-Waterman, for pairwise alignment between a protein sequence and a nucleotide sequence.
Framesearch dynamically considers every possible single-nucleotide insertion or deletion (indel) to generate the translation that best matches the protein sequence. This is unlike six-frame-translated BLAST or Smith-Waterman, which generate all six possible translations of each nucleotide sequence and use them for protein search. This allows dramatically better alignments when the nucleotide sequence contains many indels, as is often the case with single-read expressed sequence tag sequences and very early versions of genomic sequences.
Unfortunately Framesearch is extremely CPU-intensive. Its practical use typically requires a large Computer cluster or specialized computer hardware optimized for running dynamic programming algorithms on a large scale. Several companies sell such systems, which use either custom integrated circuits or Field-programmable gate array technology.
Implements the standard Smith-Waterman algorithm. It is considerably slower than the more modern BLAST and FASTA methods. However, Smith-Waterman remains the gold standard for protein-protein or nucleotide-nucleotide pairwise alignment; its speed can be improved by using specialized hardware or a computer cluster.
BLAST (Basic Local Alignment Search Tool)
Main article: BLAST
This method uses a pre-computed hash table to serve as an index for short sequences. Given a query sequence, the sub-sequences are looked up in the index to reduce the amount of time and searching involved. Several parameters need to be provided to make this method faster or more accurate. Once patterns that match the search sequence are found, more accurate and intensive algorithms may be applied.
BLAST uses a pairwise local search and uses a number of methods to increase the speed of the original Smith-Waterman algorithm.
Main article: FASTA
Pairwise local search. Superseded by BLAST.
Progressive multiple alignment method. Comes in several varieties (ClustalW, ClustalX etc.)
- An excellent article at the NCBI web site on the methodology of the BLAST algorithm and the statistical significance of sequence alignments in general.
- JAligner is an open source Java implementation of the dynamic programming algorithm Smith-Waterman for biological pairwise local sequence alignment.
The contents of this article is licensed from www.wikipedia.org under the GNU Free Documentation License. Click here to see the transparent copy and copyright details