Sequence Analysis: Where We Were, Where We Are, and Where We Might Go

Russell F. Doolittle
Center for Molecular Genetics,University of California, San Diego

The last 35 years have been witness to enormous strides in sequence analysis. Earlier efforts at numerical taxonomy not withstanding, it was the availabilty of digitized information in the way of amino acid sequences coincident with the development of the digital computer that was so propitious. The unraveling of the genetic code along with the revelation that gene duplications were at the heart of organismal diversity set the stage for molecular explanations of life on Earth. The early use of digital computers for aligning characters and determining relationships predated recombinant DNA studies. In turn, the DNA revolution has provided the torrent of data that may make a complete undestanding possible. At this point, the field is brimming with technology and data both. What should we be trying to learn? The pincipal goal of the evolutionist remains the reconstruction of past events that have given rise to the present. Initially, it was thought that it would be a straightforward matter to compare genes and genomes and work backwards to a common ancestor. A major complication arose with the unexpected frequency of apparent lateral gene transfers, to the point where at this point there isn't universal agreement on the general nature of the Tree of Life. A more ambitious challenge is determining the likely gene content of the earliest cells. Computers alone will not reveal the truth here. More data are needed in certain critical areas, especially from early diverging protists harboring particular bacterial and archaeal symbionts.

Predicting the Future Course of Human Influenza Virus Evolution

Walter M. Fitch(1), Robin M. Bush(1), Catherine A. Bender(2), Kanta Subbarao(2), and Nancy J. Cox(2)
1. University of California-Irvine, Irvine, CA
2. Centers for Disease Control, Atlanta, GA

We have recently identified 18 codons in the hemagglutinin gene (H3) that are under positive selection to change the encoded amino acid. We believe that this selection is to reduce the effectiveness of the human's immune surveillance by altering the viral surface and using the time bought to get off further rounds of replication before the body cures the disease.

We have also noted previously that the shape of the phylogenetic tree for human influenza is unusual in its having one long trunk with many very short side branches. The trunk obviously represents what may be considered the evolving winners in a competition among the viruses to be the progenitor of future strains of influenza virus. If all that is so, then perhaps a study of replacements in these 18 positions will be indicative of who will win that race before the race is over.

Accordingly, we examined the evolution of the amino acid sequences of HA1 on the evolutionary tree from the genes of 357 strains of human influenza virus isolated from 1983 through 1997. We counted how many replacements in those sites occurred between the root and the tip where each isolate was located. We did this for each flu season using only the sequences through that flu season. The strain with the most replacements in these 18 positions was predicted to indicate where the trunk would emerge in future influenza seasons. This was done for eleven flu seasons. The method correctly predicted the location of the trunk in nine of the eleven trials and in all of the last eight seasons examined.

Introns and Modules in Ancient Conserved Genes

Walter Gilbert
Harvard University, Cambridge, MA

We have studied the correlation between introns and modules, compact regions of protein structure, in genes whose products have a known 3-dimensional structre and which are homologous between bacteria and eukaryotes. Using two definitions of modules, we can show that phase zero introns, those that lie between the codons, are significantly correlated with the boundaries of modules while introns that lie in phases one and two, that interrupt the codons are not so correlated. We will discuss the significance of this finding, which suggests that some of the phase zero introns are residues of the original construction of the genes while the phase one and two introns may have been added later in evolution.

(Alphabetical listing by presenter)

Progress in ab initio Protein Structure Prediction

David Baker
University of Washington, Seattle, WA

I will first discuss the development of computational models for local sequence-structure relationships in proteins. We used clustering methods to identify recurrent sequence patterns that transcend protein family boundaries, and subsequently an iterative refinement procedure to sharpen local sequence-structure relationships. The resulting ISITES library of sequence-structure motifs has proven capable of identifying segments of proteins which adopt native like structure in isolation. The ISITES motifs have been developed into a mixture model and a hidden Markov model for general protein sequence which have shown promise for gene recognition and local structure prediction. I will then describe ROSETTA, a new method for ab initio tertiary structure prediction, which is based on the idea that successful tertiary structure prediction requires modeling of both local and non local interactions in proteins, but at different levels of detail. Promising results in the recent CASP3 protein structure prediction experiment suggest that the low resolution structure prediction problem for proteins of less than 100 amino acids may be solved in the relatively near future, and that ab initio structure prediction methods may be able to contribute to the interpretation/annotation of genomic sequence information.

Exploiting the Past and the Future in Protein Secondary Structure Prediction

Pierre Baldi
University of California, Irvine

Predicting the secondary structure of a protein (alpha-helix, beta-sheet, coil) is an important step towards elucidating its three dimensional structure, as well as its function. Presently, the best predictors are based on machine learning approaches, in particular neural network architectures with a fixed, and relatively short, input window of amino acids, centered at the prediction site. Although a fixed small window avoids overfitting problems, it does not permit to capture variable long-ranged information.

We introduce a family of novel architectures which can learn to make predictions based on variable ranges of dependencies. These architectures extend recurrent neural networks, introducing non-causal bidirectional dynamics to capture both upstream and downstream information. The prediction algorithm is completed by the use of mixtures of estimators that leverage evolutionary information, expressed in terms of multiple alignments, mostly at the input level. While our system currently achieves an overall performance exceeding 75% correct prediction - at least comparable to the best existing systems - the main emphasis here is on the development of new algorithmic ideas.

Computational Approaches for Structural Genomics

Steven E. Brenner
Department of Structural Biology,Stanford University

Structural genomics aims to provide a good experimental structure or computational model of every tractable protein in a complete genome. Underlying this goal is the immense value of protein structure, especially in permitting recognition of distant evolutionary relationships for proteins whose sequence analysis has failed to find any significant homolog. A considerable fraction of the genes in all sequenced genomes have no known function, and structure determination provides a direct means of revealing homology which may be used to infer their putative molecular function. The solved structures will be similarly useful for elucidating the biochemical or biophysical role of proteins that have been previously ascribed only phenotypic functions. More generally, knowledge of an increasingly complete repertoire of protein structures will aid structure prediction methods, improve understanding of protein structure, and ultimately lend insight into molecular interactions and pathways.

We use computational methods to select families whose structures cannot be predicted and which are likely to be amenable to experimental characterization. Methods to be employed included modern sequence analysis and clustering algorithms. A critical component is consultation of the Presage database for structural genomics, which records the community's experimental work underway and computational predictions. The protein families are ranked according to several criteria including taxonomic diversity and known functional information. Individual proteins, often homologs from hyperthermophiles, are selected from these families as targets for structure determination. The solved structures are examined for structural similarity to other proteins of known structure. Homologous proteins in sequence databases are computationally modeled, to provide a resource of protein structure models complementing the experimentally solved protein structures.

The Delphic Boat : What Bacterial Genomes Tell Us?

Antoine Danchin
Institut Pasteur, Paris, France

Contrary to an intuitive idea there is often no predictable link between structure and function in biological objects. Therefore one cannot use genomes alone to discover the meaning of the genome texts. We shall illustrate how the concept of "neighborhood" can be used to create links between a variety of information sources, making the genome text mean something relevant. In particular having analyzed the overall way in which DNA is manipulated in various bacteria we shall see that, despite strong variation, there are strong constraints in the organisation of genes. Observing biases in features which would be thought to be unbiased is the hallmark of some selection pressure. Because the genetic code is redundant, coding sequences can be studied by analyzing their codon preference. The selection pressure maintaining this bias is linked to the organization of the cell's cytoplasm, which is clearly not a tiny test tube. It consists in a ribosome network, moving slowly with respect to local diffusion of the small molecules and macromolecules present in the cell. Ribosomes act as attractors of certain tRNA species, as a function of the local codon usage of the mRNA molecules they translate. This adapts codon usage of the gene corresponding to a given function to the position of its product in the cell. In particular if two genes have a very different codon usage this means that the messager RNAs are not translated at the same place in the cell. The messenger threads are pulled out from DNA by the network of translating ribosomes, going from one ribosome to the next one, as in a wiredrawing machine. Organization of the genes into polycistronic operons results in the fact that proteins having related functions are co-expressed locally, allowing compartmentalization of the corresponding substrates and products. As a consequence, if one goes from a very biased ribosome to a less biased one, the local concentration of the most biased tRNAs decreases. In turn this creates a selection pressure that produces a gradient in codon usage, as one goes further away from the most biased messengers and ribosomes. Investigation of the sequence of known genomes suggests that genes are not randomly distributed along the chromosome. In fact there is an extreme bias in the genes that permit one to differentiate explicitely genes coded by the leading strand from genes coded by the lagging strand. We shall discuss how these features should be implemented in the construction of realistic models of genomes in order to identify new unexpected signals.

Computational Screens for Noncoding RNA Genes in Archaeal Genomes

Sean Eddy
Deptartment of Genetics, Washington University School of Medicine

Some genes produce functional RNAs instead of encoding proteins. Current genefinding approaches focus almost exclusively on protein-coding genes, so the diversity of the "modern RNA world" is an open question. We are developing probabilistic modeling approaches -- specifically, hidden Markov models and stochastic context-free grammars -- to identify noncoding RNA genes in genome sequence data. We have been using these methods to study the diversity of small nucleolar RNAs (snoRNAs), which are responsible for guiding specific nucleotide modifications of eukaryotic ribosomal RNAs. In collaboration with Pat Dennis's group at the University of British Columbia, we have recently found that Archaeal genomes also have numerous snoRNA genes. Many of our predictions have been confirmed both experimentally and by comparative analysis among three available Pyrococcus genome sequences. Because snoRNAs are still unknown in Bacteria, this result suggests another shared character between the Archaeal and Eukaryotic lineages.

Comparative Genomics of Three Pyrococcus Species: Evidences for a Single Replication Origin in these Hyperthermophilic Archaea and Gene Transfer From Bacteria To Archaea

Patrick Forterre(1), Olivier Pocs(2), Hannu Myllykallio(1), Philippe Lopez(3), Hervé Philippe(3), Odile Leconte(2), Raymond Ripp(2), Jean-Claude Thierry(2), Roland Helig(4), Valérie Barbe(4), William Saurin(4), Jean Weissenbach(4), Yvan Zivanovic(1)
1. Institut de Génétique et Microbiologie, Université de Paris-Sud, Orsay, France
2. Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, CU de Strasbourg , France
3. Laboratoire de Biologie cellulaire, Université de Paris-Sud, Orsay, France
4. Génoscope, Centre National de Séquençage, Evry, France

We have completely sequenced the genome of the hyperthermophilic archaeon Pyrococcus abyssi and performed in silico comparison with the genomes of Pyrococcus horikoshii and Pyrococcus furiosus. This study emphasizes the plasticity of genomes at the genus level and reveals extensive transfer of genes from Bacteria to Pyrococcus. We have detected in silico (1) a single and similar origin of replication in the three Pyrococcus species and obtained experimental evidence that validate our prediction in P. abyssi. The implication of our findings for the evolution of Archaea in general and the evolution of the replication apparatus in particular will be discussed in the light of alternative theories about the nature of the Last Universal Common Ancestor (LUCA) (1,3)

1. LOPEZ, P., PHILIPPE, H, MYLLYKALLIO, H. and FORTERRE, P. Identification of putative chromosomal origins of replication in archaea. Mol. Microbiol. 32, 883-886 (1999)
2. FORTERRE, P. Displacement of cellular proteins by functional analogues from plasmids or viruses could explain puzzling phylogenies of many DNA informational proteins, Mol. Microbiol.33, 457-465 (1999)
3. FORTERRE, P. and PHILIPPE, H Where is the root of the universal tree of life? Bioessays, in press

Analysis of Microarray Gene Expression Data Using Support Vector Machines

David Haussler
Co-authors: M. P. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, and M. Ares, Jr.
Computer Science Department, University of California, Santa Cruz

We have developed a new method of functionally classifying genes using gene expression data from DNA microarray hybridization experiments. We train a Support Vector Machine (SVM) to distinguish expression patterns from genes in one class of co-regulated genes from expression patterns of genes in other classes. We have applied the method with some success to expression data collected for yeast genes by the Pat Brown laboratory. After training on expression patterns for two thirds of the ribosomal yeast genes, it identified the remaining known ribosomal genes from among the approximately 6000 yeast genes with few errors, and identified one incorrect annotation in the MYGD database. Experiments with other classes of co-regulated genes also gave useful results, but accuracy depends on how strongly the class of co-regulated genes exhibits a common and distinctive expression pattern in the set of DNA microarray hybridization experiments that is performed.

Block-Based Methods for Detecting Protein Homology

Steven Henikoff, Jorja G. Henikoff and Shmuel Pietrokovski
Fred Hutchinson Cancer Research Center

Whereas sequence databanks continue to expand rapidly, most newly discovered protein sequences belong to families characterized by conserved alignment blocks. Methods that utilize blocks rather than single sequences can efficiently detect weak similarities that reflect functional constraints. These include: 1) Embedding methods, which extract multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain; 2) A graphing method, which is applied to the results of an all-versus-all search of the Blocks Database using the LAMA (Local Alignment of Multiple Alignment) searching tool, for detection of structurally-constrained motifs shared by different protein families; 3) Design of PCR primers for isolating distantly-related protein-coding sequences using the COnsensus-DEgenerate Hybrid Oligonucleotide Primer (CODEHOP) method.

Characterizing Highly Expressed Genes of Diverse Genomes

Samuel Karlin
Department of Mathematics, Stanford University

Intense current biotechnological research centers on macro and micro (DNA chip) arrays aiming to dissect gene expression under clinical and environmentally relevant conditions. Our approach in ascertaining gene expression levels relates to codon usage differences of genes relative to one or several sets of standard gene classes. Genes that deviate strongly in codon usage from the average gene but are relatively similar in codon usage to ribosomal protein genes (RPs) tend to be "highly expressed." By these criteria, "highly expressed" genes in most prokaryotic genomes include RPs, transcription and translation processing factors, chaperonins, and genes of principal energy metabolism. In particular, for the fast-growing E. coli, H. influenzae, and B. subtilis species (and S. cerevisiae), major glycolysis genes are highly expressed. In Synechocystis, genes of photosynthesis are highly expressed and in methanogens highly expressed genes are those generally essential for methanogenesis. A gene is referred to as alien if its codon usage difference from the average gene of the genome exceeds a high threshold and codon usage difference from RP genes is also high. Such genes with high codon bias might be acquired through recent horizontal transfer or might be deviant because of other disrupting influences. Thus, alien genes in Haemophilus influenzae include genes of the restriction systems HindIII, HincII, and HgeD generally acquired through plasmid integration. The identification of alien genes is of great interest with regard to pathogenicity and, in the context of evolutionary biology with regard to lateral gene transfer. In this talk we will describe our methods with applications of characterizing highly expressed genes in diverse prokaryotic and eukaryotic organisms.

Genetic Headroom

Jeffrey Lawrence
Department of Biological Sciences, University of Pittsburgh

Why do some groups of organisms contain more species than others? What factors allow for diversification of some lineages, but others? These questions can be addressed in bacteria using a genomic approach. Horizontal genetic transfer provides a powerful mechanism by which an organism can acquire novel metabolic capabilities, exploit novel environments, and diversify from parental and sibling lineages. I have described methods for assessing the amount of horizontally transferred DNA in a genome, and for estimating the date of introgression of each segment; these methods allow quantitation of the rate of horizontal genetic transfer in a particular lineage. A question remains as to what limits the maintenance of horizontally transferred DNA in a genome. Population genetics (and common sense) tells us that a finite amount of information can be maintained in a species' genome. I describe here methods of assessing the apportionment of natural selection in the counterselection of mutations at a variety of nucleotide sites. This method provides an estimate of the relative selective "effort" an organism expends in maintaining different kinds of genomic information. A portion of this information includes a variety of sites where non-lethal mutations are being counterselected (like those affecting synonymous codon choice or codon context). This parameter is termed "genetic headroom," and represents the genomic potential for maintaining the protein-coding capacity of additional genes when they introduced by horizontal transfer. Genomic surveys show that the amount of genetic headroom in a genome is indeed correlated to the rate of horizontal transfer. A model is presented where genetic headroom allows an organism to exploit the novel metabolic functions introduced by horizontal transfer, and may be an accurate predictor of speciation rates.

Evolution Teaches Structure Prediction

Burkhard Rost
Columbia University, New York

In the wake of the genome data flow, we need - more urgently than ever - accurate tools to predict protein structure. The problem of predicting protein structure from sequence remains fundamentally unsolved despite more than three decades of intensive research effort. However, the wealth of evolutionary information deposited in current databases enabled a significant improvement for methods predicting protein structure in 1D: secondary structure, transmembrane helices, and solvent accessibility. In particular, the combination of evolutionary information with neural networks proved extremely successful. The new generation of prediction methods proved to be accurate and reliable enough to be useful in genome analysis, and in experimental structure determination. Moreover, the new generation of theoretical methods is increasingly influencing experiments in molecular biology, and ready to digest large amounts of sequence data.

CATH Protein Families: Insights into Structural Evolution and Function

Orengo, C.A, Pearl, F., Lee, D., Bray, J., Todd, A. & Thornton, J.M.
Biomolecular Structure and Modelling Unit, University College, London

The rapid progress of the international genome projects has resulted in a wealth of sequence information for protein families. There are nearly 500,000 sequences known which can be grouped into ~20,000 families. Although the structural data has lagged behind the sequences, analysis suggests that with the advent of structural genomics initiatives we may soon have structural representatives for many evolutionary protein families. Since structure can provide crucial information on a protein's biological role, the next step is to accumulate and analyse functional data within these families.

At UCL, we have clustered all the well-resolved protein structures, in the PDB, into structural families. Proteins are first divided into separate domains and both sequence and structure alignment methods used to identify relationships. Data on families is stored within the CATH database (Class, Architecture, Topology or fold and Homologous superfamily). To date, there are ~18,000 domains within CATH, which cluster into ~1100 homologous superfamilies and ~650 fold families.

Techniques are now being developed for assigning structural families to genome sequences. For example, we can identify probable families for >30% of proteins in Mycoplasma Genitalium (Salamov et al. 1999). Tools are also being developed for analysing available functional data within each homologous superfamily. Protein ligand interactions are identified and displayed (DOMPLOT, Todd et al. 1998) and correlations between sequence and structure motifs are captured using a new analysis program (CORA, Orengo, 1999).

Comparitive Protein Structure Modeling in Genomics

Speaker: Andrej Sali
Roberto Sanchez, Francisco Melo, Ursula Pieper, Nebojsa Mirkovic, Marc Marti-Renom, Andras Fiser, Ashley Stuart and Andrej Sali
The Rockefeller University

Structural genomics aims to determine or accurately predict 3D structure of most proteins (1). This aim will be achieved by a focused, large-scale determination of protein structures by X-ray crystallography and NMR spectroscopy, combined efficiently with accurate protein structure prediction. Comparative protein structure modeling will be discussed in this context. To allow large-scale modeling, we automated fold assignment, sequence-structure alignment, comparative model building, and model evaluation (2-5). These steps were implemented mostly in our program Modeller, which is available on the Web at http://guitar.rockefeller.edu/. The approach has so far been applied to the proteins in 23 complete genomes, including the C. elegans genome. All-atom 3D models for substantial segments of 20-45% of the proteins in the 19 genomes have been obtained and stored in the ModBase database of comparative protein structure models (5,6). Several examples of how comparative modeling can be useful in the biological analysis of individual proteins as well as whole genomes will be described.

1. A. Sali. 100,000 protein structures for the biologist. Nature Structural Biology 5, 1029-1032. 1998.
2. A. Sali and T.L. Blundell. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815, 1993.
3. R. Sanchez and A. Sali. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA 95, 13597-13602, 1998.
4. R. Sanchez and A. Sali. Comparative protein structure modeling in genomics. J. Comp. Phys. 150, 1-14, 1999.
5. URL http://guitar.rockefeller.edu:/modbase/.
6. R. Sanchez and A. Sali. ModBase: A database of comparative protein structure models. Bioinformatics, in press.

Defining Domains in Protein Structure

William Taylor
MRC National Institute of Medical Research, London, UK

A simple method for the definition of protein structural domains is described that requires only alpha-carbon coordinate data. The basic method, which encodes no specific aspects of protein structure, captures the essence of most domains but does not give high enough priority to the integrity of beta-sheet structure. This aspect was encouraged both by a bias toward attaining intact sheetss and also as as an acceptance condition on the final result. The method has only one variable parameter, reflecting the granularity level of the domains, and an attempt was made to set this level automatically for each protein based on the best agreement attained between the domains predicted on the native structure and a set of smoothed coordinates. While not perfect, this feature allowed some tightly packed domains to be separated that would have remained undivided had the best fixed granularity level been used. The quality of the results was high and when compared to a large collection of accepted domain definitions, only a few could be said to be clearly incorrect. The simplicity of the method allows its easy extension to the simultaneous definition of domains across related structures in a way that does not involve loss of detail trough averaging the structures. This was found to be a useful approach to reconciling differences among structural family members. The method is fast, taking about 1 second per 100 residues for smaller proteins.


Structural Basis for Triplet Repeat Disorders: A Computational Analysis

Speaker: Soren Brunak
Pierre Baldi, Soren Brunak, Yves Chauvin, and Anders Gorm Pedersen
Technical University of Denmark, Lyngby, Denmark

Over a dozen major degenerative disorders, including myototonic distrophy, Huntington's disease, and fragile X syndrome, result from unstable expansions of particular trinucleotides. Remarkably, only some of all the possible triplets, namely CAG/CTG, CGG/CCG and GAA/TTC, have been associated with the known pathological expansions. This raises some basic questions at the DNA level. Why do particular triplets seem to be singled out? What is the mechanism for their expansion and how does it depend on the triplet itself? Could other triplets or longer repeats be involved in other diseases? Using several different computational models of DNA structure, we show that the triplets involved in the pathological repeats generally fall into extreme classes. Thus, CAG/CTG repeats are particularly flexible, whereas GCC, CGG and GAA repeats appear to display both flexible and rigid (but curved) characteristics depending on the method of analysis. The fact that (1) trinucleotide repeats often become increasingly unstable when they exceed a length of approximately 50 repeats, and (2) repeated 12-mers display a similar increase in instability above 13 repeats, together suggest that approximately 150 bp is a general threshold length for repeat instability. Since this is about the length of DNA wrapped up in a single nucleosome core particle, we speculate that chromatin structure may play an important role in the expansion mechanism. We furthermore suggest that expansion of a dodecamer repeat, which we predict to have very high flexibility, may play a role in the pathogenesis of the neurodegenerative disorder multiple system atrophy (MSA).

Evaluation of gene finding programs using gene contigs and application to A. thaliana

Speaker Pierre Rouzé
Nathalie Pavy, Stephane Rombauts, Patrice Déhais, Catherine Mathé, Davuluri V. V. Ramana, Philippe Leroy and Pierre Rouzé
VIB Department of Plant Genetics & INRA-associated Laboratory, University of Ghent, Belgium

Performing routine annotation of genome sequences is a huge and difficult task. Due to the complexity of gene structure and lack of homologues for many genes, even the first part of this task, locating the genes along the sequence and more particularly finding their proper borders, remains a problem for eucaryotic genomes, plants being no exception. To improve the annotation process, one need to choose and combine the most appropriate tools to use inside a computer-assisted annotation platform. In silico gene prediction has to deal with DNA sequences with genes occurring on both strands and with intergenic regions, and the ability of the programs to build multi-gene models, and not only exons, has to be evaluated.

We have developed AraSet, a data set of authentic contigs of validated genes, enabling the evaluation of multi-gene models for the Arabidopsis genome. Besides conventional metrics to evaluate gene prediction at the site and the exon levels, new measures were introduced to consider the prediction at the protein sequence level as well as to evaluate gene models. This evaluation method can apply to any new gene prediction software and to any eukaryotic genome for which a similar dataset can be built. Using this evaluation for the Arabidopsis genome, we show that if the accuracies of splice site and exon prediction are quite high, gene modeling accuracy remains very low. From our analysis, the performance of the publicly available prediction programs for the Arabidopsis genome differs significantly, suggesting GeneMark.hmm as presently the most accurate software at all three levels. Interestingly, gene modeling could be further improved by specific combination of prediction software.

The AraSet sequence set and validation programs are available at http://sphinx.rug.ac.be:8080/biocomp/napav/.

Genome Rearrangement with Gene Families

David Sankoff
Universite de Montreal, Montreal, Canada

The theory and practice of genome rearrangement analysis breaks down in the biologically widespread contexts where each gene may be present in a number of copies, not necessarily contiguous. In some of these contexts it is, however, appropriate to ask which members of each gene family in two genomes are its 'true exemplars', i.e. which best reflect the original position of the ancestral gene in the common ancestor genome. This entails a search for the two 'exemplar strings' of same length having the smallest possible rearrangement distance: the 'exemplar distance'.

A branch and bound algorithm calculates these distances efficiently when based on easily calculated traditional rearrangement distances, such as signed reversals distance or breakpoint distance, which also satisfy a property of monotonicity in the number of genes.

A Simple Algorithm for Detecting Circular Permutations in Proteins

Ron Unger
Bar Ilan University, Tel-Aviv, Israel

A circular permutation of a protein is a genetic event in which part of the C-terminal of the protein is moved to its N-terminal. Recently, it has been shown that proteins that undergo engineered circular permutations generally maintain their three dimensional structure and biological function. This observation raises the possibility that circular permutation has been used by Nature during evolution. In this scenario a protein underwent circular permutation into another protein, thereafter both proteins further diverged by standard genetic operations. To study this possibility one needs an efficient algorithm that for a given pair of proteins can detect the underlying event of circular permutations. A naive algorithm might take time proportional to N**3 or even N**4, which is prohibitively slow for a large scale survey. A sophisticated algorithm that runs in asymptotic time of N**2 was recently suggested, but it is not practical for a large scale survey. Here we present a simple and efficient algorithm that runs in time N**2. The algorithm is based on duplicating one of the two sequences, and then performing a modified version of the standard dynamic programming algorithm. While the algorithm is not guaranteed to find the optimal results, we present data that indicate that in practice the algorithm performs very well. The algorithm was used as a part of a large scale survey of Swissprot for circular permutations. Few interesting examples will be discussed.

Conference Home Contact Information

Please send questions or comments about this site to John D. Besemer, john@amber.biology.gatech.edu