Science Fair Project Encyclopedia
Gene finding is the area of computational biology concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions.
In prokaryotes, this task is relatively straightforward thanks to the presence of specific promoter sequences as well as the absence of splicing mechanisms.
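Because a prokaryotic gene is not interrupted by splicing, a first approximation is to scan for open reading frames (ORFs): a start codon followed, in the same frame, by a stop codon. The following is a minimal sketch of that idea, not a real gene finder. The function name `find_orfs` and the `min_codons` threshold are illustrative assumptions, only the forward strand is scanned, only ATG is treated as a start codon, and promoter signals are ignored.

```python
# Toy open-reading-frame (ORF) scan on the forward strand only.
# Assumptions: ATG is the only start codon; TAA, TAG and TGA are stops.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of ORFs on the forward strand.

    start is the index of the first base of the start codon and end is one
    past the last base of the stop codon, so dna[start:end] is the ORF.
    min_codons counts the codons before the stop, start codon included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open start codon, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close this ORF and look for the next start
    return sorted(orfs)


if __name__ == "__main__":
    sequence = "CCATGAAATTTGGGTAACC"
    for begin, end in find_orfs(sequence):
        print(begin, end, sequence[begin:end])
```

A practical tool would also scan the reverse complement, allow alternative start codons, and use a much larger minimum length to suppress ORFs that arise by chance.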
In eukaryotes, a variety of approaches have been developed, none of which is entirely successful. A major difficulty is splicing: genes are interrupted by introns, and there are often multiple, overlapping candidate splice sites. Splice site identification is sometimes treated as a separate problem within computational biology.
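To illustrate why candidate splice sites multiply, the sketch below lists every span that begins with GT and ends with AG, the dinucleotides that bound most introns. It is a toy under stated assumptions: the function name `candidate_introns` and the `min_len` cutoff are hypothetical, real introns are far longer than the default used here, and actual splice site predictors score the surrounding sequence rather than matching two bases.

```python
def candidate_introns(dna, min_len=4):
    """Return (start, end) spans that begin with GT and end with AG.

    dna[start:end] is the candidate intron. Every GT is paired with every
    downstream AG, which is what makes the number of candidates grow fast.
    min_len is an illustrative cutoff, not a biological constant.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len
    ]


if __name__ == "__main__":
    sequence = "AAGTCCAGTTAGCC"
    for start, end in candidate_introns(sequence):
        print(start, end, sequence[start:end])
```

Even this 14-base sequence yields several overlapping candidates, which is the ambiguity that dedicated splice site methods try to resolve.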
Computational approaches to gene finding can be broadly classified into three major categories.
- In extrinsic approaches, the available DNA sequence is searched for similarity to known protein sequences (see Sequence alignment). If a stretch of DNA, once translated, is very similar to the sequence of a known protein, it may be assumed that the stretch encodes a related protein. A notable extrinsic gene annotation pipeline is the Ensembl system.
- In ab initio approaches, an individual DNA sequence is analyzed for telltale signals that may suggest its function. These methods of analysis may be categorized as follows:
  - Linguistic approaches - these assume that the DNA sequence contains semantic elements that can be pieced together like the words of a sentence. They use lexical analysis and grammar rules.
  - Pattern approaches - these assume that coding sequences, and possibly even specific protein families, contain specific patterns that can be expressed as regular expressions. These algorithms are often implemented as deterministic finite automata (DFAs).
  - Statistical approaches - these assume that coding and non-coding sequences differ in their statistical properties. Some approaches look at the entropy of the sequences, while others look at specific nucleotide ratios. Others improve upon the pattern approaches above by interpreting the pattern rules probabilistically rather than strictly; these are the hidden Markov model-based gene finders, the most notable of which is the GENSCAN system.
- As the entire genomes of many different species are sequenced, a promising new frontier in gene finding is a comparative genomics approach. This is based on the principle that genes and other functional elements undergo mutation at a slower rate than the rest of the genome, since mutations in functional elements are more likely to negatively impact the organism than mutations elsewhere. Genes can thus be detected by comparing the genomes of related species to detect this evolutionary pressure for conservation.
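The comparative idea in the last category can be sketched as a sliding-window identity scan over two sequences from related species. This is only an illustration under stated assumptions: the two sequences are taken to be already aligned to equal length (with `-` marking gaps), and the function name `conserved_windows`, the window size, and the identity threshold are hypothetical choices rather than values from any published method.

```python
def conserved_windows(seq_a, seq_b, window=6, min_identity=0.9):
    """Return start indices of windows whose identity is >= min_identity.

    seq_a and seq_b are assumed to be pre-aligned and of equal length.
    A column counts as a match only if both bases agree and are not gaps.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    hits = []
    for start in range(len(seq_a) - window + 1):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        matches = sum(x == y and x != "-" for x, y in zip(a, b))
        if matches / window >= min_identity:
            hits.append(start)
    return hits


if __name__ == "__main__":
    species_1 = "ACGTACGTAAAA"
    species_2 = "ACGTACGTCCCC"
    # Windows lying entirely inside the shared prefix are reported.
    print(conserved_windows(species_1, species_2, window=4, min_identity=1.0))
```

Windows that stay highly similar between the two species are candidates for functional elements; real comparative methods use proper alignments, more than two genomes, and a model of the expected background mutation rate.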
The contents of this article are licensed from www.wikipedia.org under the GNU Free Documentation License.