GeneMark: Background information
GeneMarkS, self-training for prokaryotic genomes
This program combines GeneMark.hmm (prokaryotic) and GeneMark (prokaryotic) with a self-training procedure that determines the parameters of the models of both GeneMark.hmm and GeneMark. (Self-training works for sequences longer than 100KB.)
* Note 1: If you are interested only in GeneMark (prokaryotic), you will be able to run it as a component of GeneMarkS.
* Note 2: GeneMarkS can be used
- for eukaryotic genomes with prokaryotic-type gene organization (low fraction of intron-containing genes);
- for bacteriophages.
GeneMark-ES, self-training for eukaryotic genomes
The program combines GeneMark.hmm (eukaryotic) with a self-training procedure that determines parameters for the models used in GeneMark.hmm. (Self-training works for sequences longer than 10MB.)
For details, see publications.
GeneMark, developed in 1993, was the first gene-finding method recognized as an efficient and accurate tool for genome projects. GeneMark was used to annotate the first completely sequenced bacterium, Haemophilus influenzae, and the first completely sequenced archaeon, Methanococcus jannaschii. The GeneMark algorithm uses species-specific inhomogeneous Markov chain models of protein-coding DNA sequence as well as homogeneous Markov chain models of non-coding DNA. Parameters of the models are estimated from training sets of sequences of known type. The major step of the algorithm computes the a posteriori probability that a sequence fragment carries genetic code in one of six possible frames (including three frames in the complementary DNA strand) or is "non-coding".
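The posterior computation described above can be illustrated with a small sketch. It is not the GeneMark implementation: it assumes a zero-order (independent-base) simplification of the periodic Markov chain, equal priors over the seven hypotheses, and invented toy parameters (`TOY_CODING`, `TOY_BG`). The function names and the frame labels are hypothetical as well.

```python
import math

# Toy parameters (assumptions for illustration, not trained GeneMark models).
# TOY_CODING[k][b] = probability of base b at codon position k (3-periodic model).
TOY_CODING = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
# Homogeneous (position-independent) model of non-coding DNA.
TOY_BG = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

LABELS = ["+1", "+2", "+3", "-1", "-2", "-3", "non-coding"]


def revcomp(seq):
    """Reverse complement of an upper-case ACGT string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))


def log_lik_coding(seq, coding, frame):
    """Log-likelihood of seq when its first base sits at codon position `frame`."""
    return sum(math.log(coding[(frame + i) % 3][b]) for i, b in enumerate(seq))


def log_lik_noncoding(seq, bg):
    """Log-likelihood of seq under the homogeneous non-coding model."""
    return sum(math.log(bg[b]) for b in seq)


def frame_posteriors(seq, coding, bg, priors=None):
    """Posterior probability of each of the seven hypotheses (Bayes' rule)."""
    if priors is None:
        priors = [1.0 / 7.0] * 7  # assumption: equal priors
    rc = revcomp(seq)
    logs = [log_lik_coding(seq, coding, f) for f in range(3)]   # direct strand
    logs += [log_lik_coding(rc, coding, f) for f in range(3)]   # complementary strand
    logs.append(log_lik_noncoding(seq, bg))
    logs = [ll + math.log(p) for ll, p in zip(logs, priors)]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(ll - top) for ll in logs]
    total = sum(weights)
    return {label: w / total for label, w in zip(LABELS, weights)}


posteriors = frame_posteriors("GAC" * 10, TOY_CODING, TOY_BG)
best_hypothesis = max(posteriors, key=posteriors.get)
```

With these toy parameters the fragment `"GAC" * 10` matches the first direct-strand frame, so `best_hypothesis` is `"+1"`. A realistic model would condition each base on several preceding bases rather than treating bases as independent.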
The GeneMark.hmm algorithm was designed to improve gene prediction quality, particularly to improve on GeneMark in finding exact gene starts. The idea was to integrate the GeneMark models into a naturally designed hidden Markov model framework, with gene boundaries modeled as transitions between hidden states. Additionally, a ribosome binding site model is used to make gene start predictions more accurate. Evaluations by different groups showed that GeneMark.hmm is significantly more accurate than GeneMark in exact gene prediction. Since 1998, GeneMark.hmm and its self-training version, GeneMarkS, have been the standard tools for gene identification in new prokaryotic genomic sequences, including metagenomes.
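To illustrate how boundaries can emerge as transitions between hidden states, here is a minimal Viterbi sketch on a two-state toy HMM. It is far simpler than GeneMark.hmm: there is no ribosome binding site model, no frame or strand structure, and no state durations. The state names (`N` for non-coding, `C` for coding) and all probabilities are invented for the example.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for a non-empty observation string (log space)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    backpointers = []
    for symbol in obs[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for arriving in state s at this position.
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            row[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][symbol])
            ptr[s] = best
        scores.append(row)
        backpointers.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def boundaries(path):
    """Positions where the decoded state changes (the predicted boundaries)."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


# Toy model (assumption): the "coding" state is G+C-rich, "non-coding" is A+T-rich.
STATES = ["N", "C"]
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

sequence = "ATATATAT" + "GCGCGCGCGC" + "ATATATAT"
decoded = viterbi(sequence, STATES, START, TRANS, EMIT)
predicted_boundaries = boundaries(decoded)
```

In this toy case the decoded path switches state at positions 8 and 18, which are the ends of the G+C-rich block.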
The next step after developing prokaryotic GeneMark.hmm was to extend the approach to eukaryotic genomes, where accurate prediction of protein-coding exon boundaries presents the major challenge. The HMM architecture of eukaryotic GeneMark.hmm consists of hidden states for initial, internal and terminal exons, introns, intergenic regions and single-exon genes located on both DNA strands. It also includes hidden states for the initiation site, the termination site, and the donor and acceptor splice sites. GeneMark.hmm has been frequently used for annotation of plant and animal genomes.
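The hidden states listed above can be written down as a small transition table. The sketch below covers the forward strand only. The state names come from the text, but the allowed transitions are an assumption based on conventional gene structure, not taken from the GeneMark.hmm source. The identifiers (`FORWARD_TRANSITIONS`, `is_valid_state_path`) are hypothetical.

```python
# Assumed forward-strand transitions between the hidden states named in the text.
FORWARD_TRANSITIONS = {
    "intergenic": {"initiation_site"},
    "initiation_site": {"initial_exon", "single_exon_gene"},
    "initial_exon": {"donor_site"},
    "donor_site": {"intron"},
    "intron": {"acceptor_site"},
    "acceptor_site": {"internal_exon", "terminal_exon"},
    "internal_exon": {"donor_site"},
    "terminal_exon": {"termination_site"},
    "single_exon_gene": {"termination_site"},
    "termination_site": {"intergenic"},
}


def is_valid_state_path(path):
    """True if every consecutive pair of states is an allowed transition."""
    return all(
        nxt in FORWARD_TRANSITIONS.get(cur, set())
        for cur, nxt in zip(path, path[1:])
    )


# A three-exon gene flanked by intergenic regions.
three_exon_gene = [
    "intergenic", "initiation_site", "initial_exon",
    "donor_site", "intron", "acceptor_site", "internal_exon",
    "donor_site", "intron", "acceptor_site", "terminal_exon",
    "termination_site", "intergenic",
]
# Invalid: an initial exon cannot be followed directly by a terminal exon.
broken_path = ["intergenic", "initiation_site", "initial_exon", "terminal_exon"]

valid_ok = is_valid_state_path(three_exon_gene)
broken_ok = is_valid_state_path(broken_path)
```

A model covering both strands would add a mirrored set of states for the complementary strand, with the order of the signals reversed.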
Computer methods for accurate gene finding in DNA sequences require models of protein-coding and non-coding regions derived either from experimentally validated training sets or from large amounts of anonymous DNA sequence. A heuristic method for deriving the parameters of inhomogeneous Markov models of protein-coding regions was proposed in 1999. The heuristic method uses the observation that the parameters of the Markov models used in GeneMark can be approximated by functions of the sequence G+C content. Therefore, a short DNA sequence sufficient for estimating the genome G+C content (a fragment longer than 400 nt) is also sufficient for deriving the parameters of the Markov models used in GeneMark and GeneMark.hmm. Models built by the heuristic approach can be used to find genes in small fragments of anonymous prokaryotic genomes, such as metagenomic sequences, as well as in genomes of organelles, viruses, phages and plasmids. The method can also be used for highly inhomogeneous genomes, where the Markov models need to be adjusted to the local DNA composition. The heuristic method provides evidence that the mutational pressure that shapes G+C content is the driving force of the evolution of the codon usage pattern.
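The first step of such an approach, estimating G+C content from a short fragment, can be sketched as follows. The 400 nt threshold comes from the text; everything else is an assumption. In particular, `background_from_gc` builds only a simple zero-order non-coding model from the G+C fraction. It stands in for the idea of "parameters as functions of G+C content" and is not the published heuristic, whose actual functions are not given here.

```python
MIN_FRAGMENT_NT = 400  # threshold stated in the text: fragment must be longer than this


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "GC" for b in bases) / len(bases)


def is_long_enough(seq):
    """True if the fragment exceeds the minimum length named in the text."""
    return len(seq) > MIN_FRAGMENT_NT


def background_from_gc(gc):
    """Zero-order base frequencies implied by a G+C fraction.

    Illustrative assumption: G and C share the G+C mass equally, as do A and T.
    """
    return {"A": (1 - gc) / 2, "T": (1 - gc) / 2, "G": gc / 2, "C": gc / 2}


fragment = "GGCCAATT" * 60            # 480 nt toy fragment
fragment_gc = gc_content(fragment)    # 0.5 for this toy fragment
model = background_from_gc(fragment_gc)
```

A fuller version would map the same G+C value to the codon-position-specific parameters of the coding model, which is the part the heuristic method supplies.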
GeneMark for Metagenomes
Refined heuristic models for metagenomic sequences