
Identifying Gene Function and Features through Comprehensive Automated Analysis

Michael M. Mueller

Project Manager, Molecular Applications Group, PO Box 51110, Palo Alto, CA 94303

Genomic data contain vast amounts of diverse information ready to be extracted and used. However, obtaining novel biological understanding from this mass of data is not easy. Currently, the main barrier is human scalability: the ability to incorporate genomics data into all relevant research processes. Scalability raises two human issues: bringing the power of bioinformatics to all researchers, beyond the small handful of experts familiar with these tools; and eliminating demands for human time or intervention in any process that must be automated in order to be applied to whole genomes.

Both of these issues pose interesting scientific problems for the integration and automation of diverse bioinformatics analyses: elimination of unreliable, insignificant, or redundant results; cross-checking; data reduction; and identification of "interesting" results.
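As a rough illustration of eliminating insignificant and redundant results, the sketch below filters a list of similarity-search hits by an E-value cutoff and keeps only the best-scoring hit for each query/target pair. The `Hit` record, the `reduce_hits` function, and the cutoff of 1e-5 are hypothetical names and values chosen for the example; they are not taken from the system described here.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one similarity-search result."""
    query: str
    target: str
    evalue: float


def reduce_hits(hits, max_evalue=1e-5):
    """Drop insignificant hits and collapse duplicates per (query, target)."""
    best = {}
    for hit in hits:
        # Insignificant: E-value above the (assumed) cutoff.
        if hit.evalue > max_evalue:
            continue
        key = (hit.query, hit.target)
        # Redundant: keep only the lowest E-value for each pair.
        if key not in best or hit.evalue < best[key].evalue:
            best[key] = hit
    # Most significant hits first.
    return sorted(best.values(), key=lambda h: h.evalue)


hits = [
    Hit("query1", "PAX3", 1e-30),
    Hit("query1", "PAX3", 1e-10),   # redundant with the hit above
    Hit("query1", "unrelated", 0.5),  # insignificant
]
reduced = reduce_hits(hits)
```

A real pipeline would need method-specific significance measures; a single E-value cutoff is only the simplest possible stand-in.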

We will describe an automated approach that performs comprehensive analysis on sequence data, generating a rich dataset of functional and feature information ranging from open reading frame identification and indication of active site residues to tissue expression patterns. Our approach strongly emphasizes the use and graphical presentation of multiple independent predictions and analyses, allowing cross-validation of results.
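To make the simplest of these analyses concrete, here is a minimal sketch of open reading frame identification. It assumes a textbook definition (an ATG start codon followed in frame by one of the standard stop codons) and scans only the three forward frames. The function name `find_orfs` and the `min_codons` parameter are illustrative assumptions, not the method used by the system described here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG..stop ORFs on the forward strand.

    start is the 0-based index of the ATG; end is one past the stop codon.
    min_codons counts codons from the ATG up to, but excluding, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


orfs = find_orfs("CCATGAAATTTGGGTAACC")
```

A production ORF finder would also scan the reverse complement and handle alternative start codons and genetic codes; those are omitted here for brevity.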

A number of examples of automated function and feature identification will be presented, including the proteins BRCA1, the human transcription factor PAX3, and thrombin.

Currently, the system automates function and feature discovery in four areas: 1) functional and structural features, such as functional motifs, secondary structure, predicted fold, and domains; 2) homology families, analyzed and cross-validated by family "fingerprinting"; 3) expression patterns, indicating tissue-, cell-, or disease-specific expression levels, drawing on data from the dbEST, TIGR, and Incyte LifeSeq databases; and 4) disease association data, including genetic mapping and polymorphism.
