NCBI-101: How to search for the Tree in the Forest ... or the Gene in the Genome
Owing to numerous sequencing projects around the world, more than 70 genomes of bacteria, archaea, and eukaryotes, and over 700 viral and organellar genomes, have been completely sequenced. The human genome is almost deciphered. However, the number of protein-encoding genes is still under debate, and finding the genes in a genome is a difficult task.
Finding the genes in microbial genomes can be as simple as finding a translational start codon (ATG, coding for the initial methionine) at the beginning of a long open reading frame (ORF). In prokaryotes, the ORFs are long and not interrupted by non-coding regions (introns), and the gene density is high. Moreover, conserved sequence patterns are found upstream of the coding regions, e.g., the Pribnow box (a TATAAT consensus region at position -10) or the -35 region. Thus, the genome can be scanned for such sequences to flag regions that are likely to contain genes. Even though genes can be located on either the plus or the minus strand of the DNA, finding the protein-encoding genes is relatively straightforward.
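The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular gene finder: it assumes ATG is the only start codon, uses the three standard stop codons, and the function names and the `min_codons` cutoff are hypothetical choices made for this example.

```python
# Illustrative sketch only: real gene finders combine many more signals.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Yield (strand, start, end, orf) for ATG...stop ORFs on both strands.

    start and end are 0-based, end-exclusive positions on the strand that
    was scanned (for "-" that is the reverse complement of the input).
    min_codons is a hypothetical length cutoff; it counts the ATG but not
    the stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG of the current ORF
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, start, i + 3, s[start:i + 3]
                    start = None  # look for the next ORF in this frame
```

Calling `list(find_orfs(genome_sequence))` returns the candidate ORFs from all three frames of each strand. A promoter scan for the -10 and -35 regions could then be applied upstream of each candidate.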
In eukaryotic genomes the situation is more complex, and predicting protein-encoding genes is more difficult. The gene density in eukaryotes is low, which means that most of the DNA in these genomes is non-coding sequence. Cis-acting elements in eukaryotes are not as thoroughly characterized as in prokaryotes. Furthermore, the genes are not simply transcribed into mRNAs as in prokaryotes; the mRNAs are also processed in different ways prior to translation into proteins. mRNA maturation involves the removal of non-coding regions (splicing) and/or editing (the exchange of nucleotides), resulting in differences between the mRNA and the genomic DNA sequence. Splice junctions have to be identified, and they can even vary, resulting in "alternatively spliced genes" and alternative protein products. The molecular mechanisms of RNA modification are still under investigation, but once RNA-binding sites and RNA-binding proteins are more thoroughly understood, this knowledge will help to predict gene structures and their variations.
Today, many computer programs take care of finding the genes within the bulk of DNA sequences, and different gene-finding strategies are in use. In general, three approaches can be distinguished: content-based, site-based, and comparative. Content-based methods determine the overall properties of a sequence, including an evaluation of the codon usage. Synonymous codons (codons that stand for the same amino acid) are not used at random: their usage differs among species and also between weakly and strongly expressed genes in the same organism, so this method can be used to distinguish coding from non-coding regions. Site-based methods determine transcription factor binding sites, polyA signals, start and stop codons, splice junctions, and other specific sequences or sequence patterns. Comparative methods make use of already determined sequences by comparing sequence data. Thus, these methods are "trained", and the closer the test sequences are to the sequences of the training set, the better the results. Ultimately, a gene prediction is most reliable when different methods and programs arrive at the same predicted gene structure.
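As a rough illustration of the content-based idea, the following Python sketch tabulates codon frequencies for a sequence and compares two such profiles. It is only a toy under stated assumptions: the function names and the simple distance measure are hypothetical and are not taken from any of the programs discussed here.

```python
from collections import Counter


def codon_frequencies(seq):
    """Return relative frequencies of the codons read in frame 0 of seq."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(freq_a, freq_b):
    """Sum of absolute frequency differences over codons in either profile.

    Hypothetical measure for illustration: 0.0 means identical usage,
    2.0 means the two profiles share no codons at all.
    """
    codons = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in codons)
```

A candidate region whose codon profile lies close to a reference profile built from known genes of the same organism is more likely to be coding than one whose profile lies far from it.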
Two divisions of the National Institutes of Health offer freely available software programs to help with the gene annotation process.
The ORF Finder (Open Reading Frame Finder) is available at the National Center for Biotechnology Information (NCBI). The sequence is submitted as a GI or accession number, or in FASTA format. The program translates it in all six reading frames and returns a graphic that indicates the location of each ORF found. The sequences of the predicted protein products can be submitted directly for BLAST similarity searching. The program identifies open reading frames using the standard or alternative genetic codes. If you are not sure which genetic code applies to the organism under investigation, check the genetic code in NCBI's Taxonomy Browser or search the codon usage database at the Kazusa DNA Research Institute (KDRI) in Japan.
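To make the six-frame translation concrete, here is a small Python sketch using the standard genetic code. It is not ORF Finder's implementation; the function names and the frame labels ("+1" to "+3", "-1" to "-3") are conventions assumed for this example, and unknown codons are shown as "X".

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate seq from its first base; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # "X" for codons with other symbols
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return a dict mapping frame labels to protein translations."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames
```

For an organism that uses an alternative genetic code, only the `AMINO_ACIDS` string would need to be swapped for the corresponding table.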
GeneMachine from the National Human Genome Research Institute (NHGRI) is an integrated tool intended to perform both comparative and predictive gene identification techniques in a single run. The result file (returned in ASN.1 format by E-mail) can then be viewed using NCBI's Sequin.
The integrated analysis programs are:
- GRAIL for internal coding exon prediction,
- MZEF for coding exon prediction,
- GENSCAN for gene structure prediction,
- FGENES for gene structure prediction,
- RepeatMasker for complexity and interspersed repeat prediction,
- Sputnik for repeat prediction, and
- BLASTX and BLASTN for sequence homology searches.
Nevertheless, there are limits to computational analysis and gene prediction methods. It is still hard to find RNA genes (genes that function at the RNA level) and very small genes. It is also a challenge to explore alternative splicing and the multiple protein products that a single gene can encode. However, the latest news about a program for the "Computational identification of promoters and first exons in the human genome" comes from R.V. Davuluri, I. Grosse and M.Q. Zhang at Cold Spring Harbor Laboratory (NY). Their publication in Nature Genetics reports on a new program called FirstEF (First Exon Finder). Using the comparative approach, the program is reportedly able to recognize features within ~500 bp on either side of the first exon, thus recognizing a potential promoter and coding and/or non-coding first exons. Even though the program was developed for the annotation of promoter regions and first exons in the human genome, the authors claim that it is also useful for the annotation of other mammalian genomes.
Further reading:
- A Bibliography on Computational Gene Recognition
- Mount, David W. 2001. Bioinformatics: Sequence and Genome Analysis, Chapter 8, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York
- Baxevanis, Andreas D. and Ouellette, Francis B.F. 2001. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Chapter 10, Wiley-Interscience
- Davuluri, Ramana V., Grosse, Ivo, Zhang, Michael Q. 2001. Computational identification of promoters and first exons in the human genome. Nature Genetics 29:412-417







