Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature Hammad Afzal,...

Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature

Hammad Afzal, Robert Stevens, Goran Nenadic

School of Computer Science, University of Manchester


Motivation

A number of bioinformatics tools and resources are available for service use and composition
- a guesstimate is 3000+ publicly available Web Services
- how to find a service? what is out there to use? provenance?

Semantic annotation of bioinformatics services
- annotate functional capabilities
- e.g. Taverna, myGrid, myExperiment, EBI, BioMOBY

Not only services and tools
- databases, repositories, corpora

Motivation

Manual curation
- e.g. myGrid, BioCatalogue etc.
- e.g. Taverna/Feta: only ~15-20% functionally described
- backlog, and the number of services is growing

Annotations combine
- textual descriptions
- ontological mappings

Example

Service
- Name: Emma
- Description: Performs a multiple alignment of nucleic acid or protein sequences using ClustalW program.

Operation
- Name: emma
- Description: Performs a multiple alignment of nucleic acid or protein sequences using ClustalW program.

Op-Input Parameter
- Name: sequence_usa
- Description: The Uniform Sequence Address, or USA, is a standard way of specifying a sequence to be read into a program in EMBOSS. The most common ways of specifying a sequence is to type (database: entry), where database can be embl, uniprot or swissprot and entry is either the sequence’s entry or ID name, or its Accession number in that database…
- SemanticType: http://www.mygrid.org.uk/mygrid-moby-service#simpleParameter

Op-Output Parameter
- Name: outseq
- Description: Returns a multiple sequence alignment report
- SemanticType: http://www.mygrid.org.uk/ontology#multiple_sequence_alignment_report

text

ontological descriptions

- multiple local align.

- Soaplab

BioCatalogue

- Single registration point for Web Service providers
- Single search site for scientists and developers
- Place where the community can find contacts and meet the experts and maintainers of these services
- Community-sourced annotation, expert oversight
- Mixed annotations: free text, tags, controlled vocabularies, community ontologies

BioCatalogue

- Beta version at http://beta.biocatalogue.org/
- Launch June 2009 at ISMB

Our approach

Collect service semantic descriptions by extracting and integrating information from text resources
- full-text bioinformatics journal publications

Approach:
- identify descriptors that are used for service and resource annotations
- locate them in text
- infer the annotations: textual evidence and mappings to an ontology

The rest of the talk

Methodology
- mining bioinformatics terminology
- extraction of service description profiles

Experiments and results
- semi-automated curation

What next?

Methodology

[Figure: pipeline overview. Components: Corpus; Information Retrieval; Sentence Filtering; Domain Ontology (e.g. myGrid); Identifying Topic Related Terms; Text Mining Engine (Information Extraction); Semantic Description of Services; Semantic Network of Services; Service Discovery]

Bioinformatics terminology

1) get a corpus

2) get all terms

3) get seed examples

4) find relevant ones using term profiling and comparison to seed examples

Learn bioinformatics terms from literature

Use seed terms to bootstrap
- e.g. known descriptors used in existing service descriptions, either in literature or service repositories
- 250 terms identified, manual pruning after automatic term recognition
- examples of lexical constituents and textual behaviour (pragmatics)
  - lexical profiling
  - contextual profiling

Lexical profiling
- what is in the name

Contextual profiling
- characterise sentences in which terms appear (nouns, verbs and context-patterns)

Comparing candidate term profiles to seed term profiles
- average
- best-match


Two domain experts evaluated the top 300 terms
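The profiling and comparison steps above can be sketched roughly as follows. This is a minimal illustration only: it assumes bag-of-words contextual profiles and cosine similarity, neither of which is stated on the slides, and the function names and toy sentences are hypothetical.

```python
import math
from collections import Counter

def contextual_profile(term, sentences):
    """Count the words that co-occur with `term` in the sentences mentioning it."""
    profile = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        if term.lower() in lowered:
            # Remove the term itself so that only its context is counted.
            context = lowered.replace(term.lower(), " ")
            profile.update(w for w in context.split() if w.isalpha())
    return profile

def cosine(p, q):
    """Cosine similarity between two sparse count vectors (an assumed measure)."""
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def score_candidate(candidate_profile, seed_profiles):
    """Return (average, best-match) similarity of a candidate to the seed profiles."""
    sims = [cosine(candidate_profile, s) for s in seed_profiles]
    if not sims:
        return 0.0, 0.0
    return sum(sims) / len(sims), max(sims)

# Toy corpus (hypothetical sentences, not taken from the experiments).
sentences = [
    "We used the BLAST program to align protein sequences",
    "The ClustalW program was used to align nucleic acid sequences",
    "The weather station was used to record rainfall",
]
seeds = [contextual_profile("BLAST program", sentences)]
clustalw = score_candidate(contextual_profile("ClustalW program", sentences), seeds)
weather = score_candidate(contextual_profile("weather station", sentences), seeds)
```

In this toy setting the candidate that shares more context words with the seed term scores higher, so ranking candidates by either score would place it first.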

Semantic classes – myGrid

Informatics concepts
- general concepts of data, data structures, databases, metadata

Bioinformatics concepts
- domain-specific data sources and algorithms for searching and analysing data
- e.g. Smith-Waterman algorithm

Semantic classes – myGrid

Molecular biology concepts
- higher level concepts used to describe bioinformatics data types, used as inputs and outputs in services
- e.g. protein sequence, nucleic acid sequence

Task concepts
- generic tasks a service operation can perform
- e.g. retrieving, displaying, aligning

Semantic classes

Engineered from the myGrid bioinformatics sub-ontology

Class: examples
- Algorithm: SigCalc algorithm, CHAOS local alignment, SNP analysis, KEGG Genome-based approach, GeneMark method, K-fold cross validation procedure
- Application: PreBIND Searcher program, Apollo2Go Web Service, FLIP application, Apollo Genome Annotation curation tool, GenePix software, Pegasys system
- Data: GeneBank record, Genome Microbial CoDing sequences, Drug Data report
- Data resource: PIR Protein Information Resource, BIND database, TIGR dataset, BioMOBY Public Code repository

Semantic classes and instances

Service mentions

Named-entity recognition (NER) task

Recognition of service mentions using
- terminological (semantic) heads of automatically recognised terms
  - Apollo2Go Web Service is an Application
  - BIND database is a Data resource
  - assign the corresponding semantic class
- Hearst patterns (co-ordinations, appositions, enumerations, etc.)
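A rough sketch of the two recognition cues above: classification by terminological head, and propagation of a class across a co-ordination. The head-word lexicon, the "last word is the head" heuristic and the example phrase are assumptions made for illustration; the actual lexicon and patterns are not given on the slides.

```python
import re

# Hypothetical head-word lexicon (illustrative only).
HEAD_TO_CLASS = {
    "algorithm": "Algorithm", "method": "Algorithm", "procedure": "Algorithm",
    "program": "Application", "tool": "Application", "software": "Application",
    "service": "Application", "system": "Application", "application": "Application",
    "record": "Data", "report": "Data",
    "database": "Data resource", "dataset": "Data resource",
    "repository": "Data resource", "resource": "Data resource",
}

def classify_by_head(term):
    """Assign a semantic class from the term's head (assumed to be its last word)."""
    words = term.split()
    if not words:
        return None
    return HEAD_TO_CLASS.get(words[-1].lower())

def split_coordination(phrase):
    """Split a simple 'A, B and C' enumeration into its items."""
    return [p.strip() for p in re.split(r",\s*(?:and\s+)?|\s+and\s+", phrase) if p.strip()]

def propagate_in_coordination(items, classes):
    """Give unclassified co-ordinated items the class shared by their classified siblings."""
    found = {classes.get(item) for item in items} - {None}
    if len(found) != 1:
        return dict(classes)  # no evidence, or conflicting evidence
    label = next(iter(found))
    updated = dict(classes)
    for item in items:
        if updated.get(item) is None:
            updated[item] = label
    return updated

items = split_coordination("ClustalW, Kalign and BLAST program")  # toy phrase
classes = {item: classify_by_head(item) for item in items}
propagated = propagate_in_coordination(items, classes)
```

Here only "BLAST program" has a recognisable head, so the co-ordination is what lets the bare names inherit its class.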

Semantic descriptors

Recognition of phrases depicting semantic roles used to describe services

Flexible dictionary look-up
- terms from myGrid ontology
- terms/noun phrases from existing descriptions of bioinformatics resources (collected from Taverna and other Web service providers)
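One possible reading of "flexible" look-up is sketched below: descriptors and text are normalised in the same way (case, separators, a crude plural rule) and the longest matching token sequence wins. The normalisation rules and function names are assumptions; the slides do not say how flexibility was achieved.

```python
import re

def normalise(phrase):
    """Lower-case, split on non-alphanumerics, and strip a simple plural 's'."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    return tuple(
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith(("ss", "is", "us")) else t
        for t in tokens
    )

def build_index(descriptors):
    """Map each normalised descriptor to its original dictionary form."""
    return {normalise(d): d for d in descriptors}

def find_descriptors(text, index, max_len=5):
    """Greedy longest-match look-up of descriptors in normalised text."""
    tokens = normalise(text)
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            match = index.get(tokens[i:i + n])
            if match:
                hits.append(match)
                i += n
                break
        else:
            i += 1  # no descriptor starts at this token
    return hits

index = build_index([
    "multiple_sequence_alignment_report",
    "protein sequence",
    "nucleic acid sequence",
])
hits = find_descriptors(
    "Performs a multiple alignment of nucleic acid or protein sequences "
    "and returns a multiple sequence alignment report.",
    index,
)
```

The exact-sequence matching used here misses "nucleic acid sequence" in "nucleic acid or protein sequences"; handling such co-ordinated forms would need an extra rule.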

Mining service descriptions

Extraction/functional rules
- Predicate-driven rules: each verb associated with the type of “information content” it provides

Extraction/functional rules
- Manually designed predicate-driven rules: Subject (Arg) – Verb (Predicate) – Object (Arg)
- Applied on dependency-parsed sentences
  - Stanford parser
  - no phrase structures
  - complex sentences
  - information in sub-clause

Extraction/functional rules
- Phrase structures identified and integrated with the dependency parse
- Predicate-dependent rules applied to extract specific ‘content’ and profile the services
- Profiles collated for all mentions
  - service name variation
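A minimal sketch of how a Subject-Verb-Object rule might be applied, assuming the parser output has already been reduced to (relation, head, dependent) triples with lemmatised verbs and merged noun phrases. The predicate lexicon and the hand-written triples are illustrative assumptions; no parser is called here.

```python
# Hypothetical predicate lexicon: verb lemma -> type of information it provides.
PREDICATE_TYPES = {
    "show": "functional description",
    "extend": "functional description",
    "predict": "Input/Output",
}

def apply_rules(dependencies, service_names):
    """Apply Subject-Verb-Object rules to (relation, head, dependent) triples.

    A rule fires when the verb is in the predicate lexicon and either its
    subject or its object is a known service mention.
    """
    subjects = {head: dep for rel, head, dep in dependencies if rel == "nsubj"}
    objects = {head: dep for rel, head, dep in dependencies if rel == "dobj"}
    extracted = []
    for verb, info_type in PREDICATE_TYPES.items():
        subj, obj = subjects.get(verb), objects.get(verb)
        if subj is None or obj is None:
            continue
        for service in service_names:
            if service in (subj, obj):
                extracted.append({
                    "service": service,
                    "type": info_type,
                    "predicate": verb,
                    "subject": subj,
                    "object": obj,
                })
    return extracted

# Hand-written triples standing in for parser output (simplified).
dependencies = [
    ("nsubj", "extend", "We"),
    ("dobj", "extend", "GeneClass algorithm"),
    ("nsubj", "predict", "GeneClass algorithm"),
    ("dobj", "predict", "differential gene expression"),
]
extracted = apply_rules(dependencies, ["GeneClass algorithm"])
```

Keeping the lexicon separate from the rule logic means adding a predicate only requires a new dictionary entry.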

Semantic service profiles

For a given service, collection of
- descriptors, including parameters
- links to other related instances
- related myGrid ontology semantic labels
- “informative” sentences
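Collating extractions into one profile per service could look roughly like this. The name-normalisation rule, the tuple layout and the "Gene-Class" spelling variant are hypothetical and only serve to illustrate merging mentions despite name variation.

```python
from collections import Counter, defaultdict

GENERIC_HEADS = {"algorithm", "program", "tool", "method", "software", "system"}

def canonical_name(mention):
    """Collapse simple name variants: case, hyphens, and a trailing generic head."""
    words = mention.lower().replace("-", "").split()
    if len(words) > 1 and words[-1] in GENERIC_HEADS:
        words = words[:-1]
    return " ".join(words)

def collate(extractions):
    """Merge (mention, descriptor, sentence) tuples into one profile per service."""
    profiles = defaultdict(lambda: {"descriptors": Counter(), "sentences": []})
    for mention, descriptor, sentence in extractions:
        profile = profiles[canonical_name(mention)]
        profile["descriptors"][descriptor] += 1
        if sentence not in profile["sentences"]:
            profile["sentences"].append(sentence)
    return dict(profiles)

# Toy extractions; "s1" and "s2" stand in for full sentences.
profiles = collate([
    ("GeneClass algorithm", "Motif data", "s1"),
    ("GeneClass", "Motif data", "s2"),
    ("Gene-Class", "Differential gene expression", "s2"),
])
top = profiles["geneclass"]["descriptors"].most_common(1)
```

The descriptor counter gives a frequency listing per service, and the sentence list keeps the textual evidence alongside it.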

Example – GeneClass

Descriptors | Freq
GeneClass algorithm | 5
Motif data | 4
Reliable predictive model | 2
Genome-wide protein-DNA binding data | 2
Differential gene expression | 3
Transcriptional gene regulation | 2

Example – GeneClass

Functions, parameters

Type | Predicate | Object
functional description | show | show how to incorporate genome-wide protein-DNA binding data from ChIP-chip experiments into the GeneClass algorithm
Input/Output | predict | predicting differential gene expression
Input/Output | predict | starts with a candidate set of motifs μ
functional description | extend | extend the original GeneClass algorithm to use all target genes for which both motif and expression data is available

Example – GeneClass Sentences

- We extend the original GeneClass algorithm to use all target genes for which both motif and expression data is available.
- In order to study different aspects of target gene regulation we use different sets of motifs and parents with the GeneClass algorithm.
- The GeneClass algorithm for predicting differential gene expression starts with a candidate set of motifs; representing known or putative regulatory element sequence patterns and a candidate set of regulators or parentSS.

Experiments

2120 BMC Bioinformatics articles
- full-text articles before March 2008

Service descriptors dictionary
- 471 descriptors from myGrid/Feta
- 450 descriptors collected from other bioinformatics service/tools providers

108 predicates used

Experiments

Number of candidate resources

Semantic class | # of instances identified using terminological head comparison | # of instances identified using coordination co-occurrences | Total # of instances
Algorithm | 5658 | 64 | 5722
Application | 1862 | 43 | 1905
Data | 1424 | 18 | 1442
Data Resource | 2307 | 34 | 2341

Number of descriptions collected using rules

Experiments

Evaluated for their capability to be used for semantic description of a given bioinformatics resource
- ratings: irrelevant, partially useful, useful

Example sentences
- HeatMapper: The HeatMapper tool has already proven to be very useful in several studies
- Kalign: To compare Kalign to other MSA programs, the following test sets were used.
- Cognitor: To add a new species to the COG system, the annotated protein sequences from the respective genome were compared to the proteins in the COG database by using the BLAST program and assigned to pre-existing COGs by using the COGNITOR program

Evaluation of semantic profiles

Two experiments:
- 5 well-known resources with descriptions already available
  - excellent rating for sentences
  - average rating for semantic descriptors, predicate functions
- 5 new, unknown resources
  - excellent rating for sentences
  - average rating for semantic descriptors, predicate functions

Evaluation of semantic profiles

What next?

Good recall, poor precision
- context needs a better model

Mining parameter values
- sub-language of parameters

Candidate service/resource mentions
- an entity whose profile looks like a service
- comparison of semantic profiles
- network of services [ISMB 2009]

Do we have good service ontologies?

http://gnode1.mib.man.ac.uk/bioinf

Conclusion

Literature mining approach to service description and annotation

Aims
- reduce curation efforts
- provide semantic synopses of services for the Semantic Web

Potential of text mining
- integration with other annotation approaches
- extracting the entire service context is still challenging

Acknowledgements

gnTEAM (text extraction, analytics, mining)
- H. Yang, I. Spasic, H. Afzal, A. Gledson, J. Eales, M. Greenwood, F. Sarafraz

myGrid team: Franck Tanoh

BBSRC
- “Mining term associations from literature to support knowledge discovery in biology” (2005-2008)
- “pubmed2ensembl” (2009-2010)
- “BioCatalogue” (2008-2011)

Announcement

Journal of BioMedical Semantics
- published by BioMed Central
- launched at ISMB 2009

Topics include
- Infrastructure for biomedical semantics
  - semantic resources and repositories
  - meta-data management and resource description
  - knowledge representation and semantic frameworks
  - Biomedical Semantic Web
  - life-long management of semantic resources
- Semantic mining, annotation and analysis
