Characterization and Prediction of Eukaryotic Start Codons.

Anders Gorm Pedersen, Henrik Nielsen and Soren Brunak

Center for Biological Sequence Analysis, Technical University of Denmark, Building 207, DK-2800 Lyngby, Denmark

The choice of start codon in eukaryotes depends on position as well as on context. Usually, translational initiation takes place at the first occurrence of the triplet AUG in an mRNA, but in some cases an AUG further downstream is selected. According to the so-called scanning hypothesis, the small subunit of the ribosome scans the mRNA from the 5' end until a suitable translation initiation site is found; if the first AUG is not selected, it implies that the surrounding sequence must be unfavorable for initiation.
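The scanning model described above can be illustrated with a minimal sketch. The function names `scan_for_start` and `is_favorable` are hypothetical, and the context rule used here (a purine at position -3 or a G at position +4, a commonly cited Kozak-like rule) is only an assumed placeholder; it is not the rule derived in this work, where the context signals turn out to differ between systematic groups.

```python
def is_favorable(mrna, i):
    """Assumed placeholder rule: purine at -3 or G at +4 relative to the A of AUG."""
    minus3 = mrna[i - 3] if i >= 3 else ""
    plus4 = mrna[i + 3] if i + 3 < len(mrna) else ""
    return minus3 in ("A", "G") or plus4 == "G"


def scan_for_start(mrna, favorable=is_favorable):
    """Scan from the 5' end; return the 0-based index of the first AUG
    whose context is accepted by `favorable`, or None if there is none."""
    mrna = mrna.upper().replace("T", "U")  # accept DNA-style input as well
    pos = mrna.find("AUG")
    while pos != -1:
        if favorable(mrna, pos):
            return pos
        # Unfavorable context: keep scanning downstream for the next AUG.
        pos = mrna.find("AUG", pos + 1)
    return None
```

Passing a different `favorable` function would be one way to swap in a group-specific context model without changing the scanning loop.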

Here we present a comprehensive statistical analysis and comparison of translation initiation sites in 21 eukaryotic species. We have investigated several vertebrate species belonging to both Tetrapoda (land vertebrates) and fish, several monocot and dicot plants, as well as fruit fly, roundworm, and two different yeasts. Interestingly, we find that the start codon context signals are specific to systematic groups:

1) all signals from the vertebrate group are similar to each other,

2) all signals from monocots are similar,

3) all signals from dicots are similar,

4) the signals from the remaining species (fruit fly, roundworm, and the two yeasts) are significantly different.

We show sequence logos and nucleotide frequencies of start codon contexts for the 7 different systematic groups.

Based on the redundancy-reduced data sets used in this analysis, we have constructed a neural-network-based method, NetStart, that is able to predict start codons in vertebrates and dicots. Prediction of translation initiation sites is particularly useful for the analysis of EST and genome data, where the entire mature mRNA sequence is not known. NetStart is available as a WWW server at the following address: http://www.cbs.dtu.dk/services/NetStart/. We are currently expanding the number of phylogenetic groups covered by NetStart, and we plan to improve the method further by specifically analyzing the coding potential of the input data.