Poster 33

Classifiers for genome annotation built on Gene Ontology

Soumyadeep Nandi, Andrew M Lynn
Jawaharlal Nehru University, School of Information Technology, New Delhi, India

Sequence annotation is a classification problem. A wide variety of methods (sequence-similarity-based methods such as BLAST, and profile-based methods such as PSI-BLAST and HMMER) assign function to a sequence by comparing it with well-annotated databases and taking the annotation of the top hit as the prediction for the query. Methods to classify sequences without known orthologs are still in development. Conserved hypothetical proteins, i.e. predicted proteins conserved in more than one organism, constitute a substantial fraction of newly sequenced genomes.
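The top-hit transfer described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: it assumes BLAST tabular output (12 tab-separated columns, with the e-value and bit score in the last two), and the function name, the e-value cutoff, and the annotation lookup table are hypothetical.

```python
def top_hit_annotations(blast_lines, subject_annotation, max_evalue=1e-5):
    """Transfer the annotation of the best-scoring hit to each query.

    blast_lines: iterable of tab-separated lines (BLAST tabular layout assumed).
    subject_annotation: hypothetical dict mapping subject id -> annotation.
    max_evalue: illustrative cutoff; hits above it are ignored.
    """
    best = {}  # query id -> (subject id, bit score)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    # Queries without an acceptable hit get no entry at all.
    return {q: subject_annotation.get(s) for q, (s, _) in best.items()}
```

Queries that return no entry here are the cases the abstract is concerned with: sequences for which similarity search alone gives no usable annotation.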

Hierarchical classification systems and disparate sources of data can improve supervised classification by providing more training information. Hierarchical classification systems offer two advantages: first, the ability to build more discriminative classifiers using positive and negative training sequences; and second, the ability to assign sequences to a lower (finer) level of the hierarchy, i.e. a more precise functional category. Exploiting the hierarchical structure of the Gene Ontology (GO) and annotating with GO terms yields a more precise annotation expressed in the GO controlled vocabularies.
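One way a hierarchy can supply positive and negative training sequences is sketched below. The abstract does not state how the sets were drawn, so the rule used here is an assumption: positives are sequences annotated to a term or any of its descendants, and negatives are sequences annotated under its sibling terms. The term names and sequence identifiers are hypothetical toy data, not real GO identifiers.

```python
from collections import defaultdict

# Hypothetical toy hierarchy: child term -> list of parent terms.
PARENTS = {
    "binding": ["root"],
    "catalysis": ["root"],
    "kinase": ["catalysis"],
    "hydrolase": ["catalysis"],
}

# Hypothetical annotations: sequence id -> directly assigned terms.
ANNOTATIONS = {
    "seq1": {"kinase"},
    "seq2": {"hydrolase"},
    "seq3": {"binding"},
    "seq4": {"catalysis"},
}

def children_of(parents):
    """Invert the child -> parents map into parent -> children."""
    children = defaultdict(set)
    for child, term_parents in parents.items():
        for parent in term_parents:
            children[parent].add(child)
    return children

def descendants(term, children):
    """All terms strictly below `term` in the hierarchy."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def training_sets(term, parents, annotations):
    """Positives: annotated at or below `term`.
    Negatives (assumed rule): annotated at or below a sibling of `term`."""
    children = children_of(parents)
    in_term = {term} | descendants(term, children)
    positives = {s for s, terms in annotations.items() if terms & in_term}
    sibling_terms = set()
    for parent in parents.get(term, []):
        for sibling in children[parent]:
            if sibling != term:
                sibling_terms |= {sibling} | descendants(sibling, children)
    negatives = {s for s, terms in annotations.items() if terms & sibling_terms}
    return positives, negatives - positives
```

Drawing negatives from sibling terms keeps them functionally close to the positives, which is what forces each classifier to discriminate at the finer level of the hierarchy.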

Supervised learning methods have previously been used to classify sequences, but have been restricted to sub-families that share significant homology. Our approach uses a support vector machine. The training data comprise sequences mapped onto Gene Ontology terms, and the system is trained on fold patterns extracted from the Pfam and SUPERFAMILY databases, as well as functional motifs from the PROSITE database.
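The sketch below shows one way such features could feed a support vector machine. The encoding and training details are assumptions, since the abstract does not give them: each sequence becomes a binary vector marking which signatures it matches, and a linear SVM is fitted by plain subgradient descent on the regularised hinge loss. The signature identifiers are hypothetical placeholders, and a real system would more likely use an established SVM library.

```python
def build_vocabulary(signature_sets):
    """Sorted list of every signature seen in the training data."""
    return sorted(set().union(*signature_sets))

def encode(signatures, vocabulary):
    """Binary presence/absence vector; unknown signatures are ignored."""
    return [1.0 if v in signatures else 0.0 for v in vocabulary]

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * ||w||^2 + hinge loss by subgradient descent.

    X: list of feature vectors; y: labels in {+1, -1}.
    lam, lr and epochs are illustrative values, not tuned settings.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]  # regularisation step
            if margin < 1.0:  # sample violates the margin: hinge subgradient
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(signatures, vocabulary, w, b):
    """Return +1 if the sequence is assigned to the term, else -1."""
    x = encode(signatures, vocabulary)
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical training data for a single term (labels: +1 in term, -1 not).
TRAIN = [
    ({"PFAM:domA", "SF:foldX"}, 1),
    ({"PFAM:domA"}, 1),
    ({"PFAM:domB", "PROSITE:motif1"}, -1),
    ({"PFAM:domB"}, -1),
]
VOCAB = build_vocabulary([sigs for sigs, _ in TRAIN])
W, B = train_linear_svm([encode(s, VOCAB) for s, _ in TRAIN],
                        [label for _, label in TRAIN])
```

Calling `predict({"PFAM:domA"}, VOCAB, W, B)` then classifies a new sequence from its signature matches alone, without requiring a close homolog in the training set.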

We validate this system on a subset of the tree whose leaf nodes are completely populated. The sensitivity and specificity were 92.2% and 95.1%, respectively. The system is used to functionally characterize proteins previously classified as 'conserved hypothetical'.
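For reference, sensitivity and specificity follow from the confusion counts as TP / (TP + FN) and TN / (TN + FP). The short sketch below computes them from true and predicted labels. The function name is hypothetical, and toy labels will not reproduce the figures reported above.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) for binary labels 1 / 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```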