POSOLE: Automated Ontological Annotation for Function Prediction

Karin Verspoor, Judith Cohn, Susan Mniszewski, Cliff Joslyn

For BioCreAtIvE Task 2, we were provided with a protein identifier (a Swiss-Prot identifier) and a relevant journal publication, and asked to predict the function of the protein, as represented by a set of GO nodes, on the basis of that publication. The application defines a POSOLE QueryBuilder that is responsible for associating terms in the publication with GO nodes. This is accomplished through natural language processing components. Specifically, the document is processed to morphologically normalize terms to their base forms, identify sentence boundaries, and calculate the relative importance of terms using the statistical measure TFIDF (term frequency inverse document frequency) with respect to a background corpus.

We then identify all references to the input protein in the document and collect the terms in a context window around those references. These terms are considered to be in the contextual neighborhood of the protein, and are assumed to be most indicative of the protein's function. They are in turn mapped to specific GO nodes through lexical matches between the document text and the text of the GO nodes: the node labels and node definitions, as well as additional sets of terms that were previously associated with specific GO nodes via unsupervised learning (see (2)). An input query for POSOC is then constructed, consisting of the set of matched GO nodes, each weighted according to the TFIDF of the matching term.

3. CASP APPLICATION

For the CASP function prediction task, we were provided with a protein sequence and asked to predict the function of the protein, again in terms of a set of GO nodes. The application defines a POSOLE QueryBuilder that is responsible for associating the input sequence with GO nodes. In this case, we use a "nearest neighbor" approach: we identify close neighbors of the input sequence in sequence space and collect the GO nodes associated with those neighbors in a curated data set (Swiss-Prot).
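The BioCreAtIvE query construction described above (TFIDF weighting, a context window around protein mentions, lexical mapping to GO nodes) can be illustrated with a minimal Python sketch. This is not the POSOLE implementation: the function names, the TFIDF variant, the window size, the exact-token matching, and the rule of keeping the largest weight when several terms match one node are all assumptions made for illustration, and the toy tokens, background corpus, and term-to-GO table are invented.

```python
import math
from collections import Counter

def tfidf(doc_tokens, background_docs):
    """TFIDF of each term in the document relative to a background corpus.

    background_docs is a list of sets of terms (hypothetical representation).
    """
    tf = Counter(doc_tokens)
    n_docs = len(background_docs) + 1  # background corpus plus this document
    scores = {}
    for term, count in tf.items():
        # Document frequency, counting the current document itself.
        df = 1 + sum(1 for d in background_docs if term in d)
        scores[term] = (count / len(doc_tokens)) * math.log(n_docs / df)
    return scores

def context_terms(tokens, protein_names, window=5):
    """Collect terms within `window` tokens of any mention of the protein."""
    terms = set()
    for i, tok in enumerate(tokens):
        if tok in protein_names:
            lo, hi = max(0, i - window), i + window + 1
            terms.update(t for t in tokens[lo:hi] if t not in protein_names)
    return terms

def build_query(tokens, protein_names, background_docs, term_to_go, window=5):
    """Map context terms to GO nodes, weighted by the TFIDF of the matching term."""
    weights = tfidf(tokens, background_docs)
    query = {}
    for term in context_terms(tokens, protein_names, window):
        for go_id in term_to_go.get(term, ()):
            # Assumption: if several terms match one node, keep the largest weight.
            query[go_id] = max(query.get(go_id, 0.0), weights[term])
    return query

# Toy data (invented): already-normalized tokens and a tiny lexical table.
tokens = "the kinase abc1 phosphorylate substrate in the nucleus".split()
background = [{"the", "in", "cell"}, {"the", "protein", "nucleus"}]
term_to_go = {"kinase": ["GO:0016301"], "nucleus": ["GO:0005634"]}

narrow_query = build_query(tokens, {"abc1"}, background, term_to_go, window=2)
wide_query = build_query(tokens, {"abc1"}, background, term_to_go, window=5)
```

With a window of 2, only "kinase" falls in the neighborhood of the protein mention, so the query contains a single GO node; widening the window to 5 also pulls in "nucleus". The window size therefore directly controls how many candidate nodes reach the query.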
To identify close neighbors of a target sequence, we performed a PSI-BLAST (Position-Specific Iterated BLAST) (6) search of the target against the NCBI NR database, with 5 iterations, using the default e-value threshold of 10. Once the nearest neighbors have been identified, we collect the GO nodes associated with these sequences using the UniProt Swiss-Prot to GO mappings. Finally, we build a weighted collection of GO nodes, in which each node is weighted according to the PSI-BLAST e-value of the neighbor it came from. Several near neighbors of the original target sequence may map to the same nodes. In this case, each occurrence of a GO node is weighted individually according to its source.
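The construction of the weighted GO node collection can be sketched as follows. This is an illustrative sketch only: the text does not say how an e-value is converted to a weight, so the negative log10 transform (clamped at zero) is an assumption, as are the function names, the accessions, and the toy Swiss-Prot to GO table. The PSI-BLAST search itself is not run here; the hits are assumed to have been parsed already into (accession, e-value) pairs.

```python
import math

def evalue_weight(evalue, floor=1e-180):
    """Assumed transform: smaller e-values give larger weights.

    E-values of 1 or more get weight 0; `floor` avoids log10(0).
    """
    return max(0.0, -math.log10(max(evalue, floor)))

def build_go_collection(hits, swissprot_to_go, max_evalue=10.0):
    """Build a weighted collection of GO nodes from nearest-neighbor hits.

    Each occurrence of a GO node is kept as its own entry, weighted by the
    hit it came from, rather than merged with other occurrences.
    """
    collection = []
    for accession, evalue in hits:
        if evalue > max_evalue:
            continue  # outside the e-value threshold
        weight = evalue_weight(evalue)
        for go_id in swissprot_to_go.get(accession, ()):
            collection.append((go_id, weight, accession))
    return collection

# Toy data (invented accessions and mappings).
hits = [("P00001", 1e-50), ("P00002", 1e-5), ("P00003", 20.0)]
swissprot_to_go = {
    "P00001": ["GO:0016301"],
    "P00002": ["GO:0016301", "GO:0005634"],
    "P00003": ["GO:0005634"],
}

collection = build_go_collection(hits, swissprot_to_go)
```

Here GO:0016301 appears twice in the collection, once per neighbor, each with its own weight. Keeping the occurrences separate leaves the decision of how to combine them to the downstream categorizer.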