Mining Semantic Descriptions of Bioinformatics Web
Resources from the Literature
Hammad Afzal, Robert Stevens, Goran Nenadic
School of Computer Science, University of Manchester
Motivation
- A number of bioinformatics tools and resources are available for service use and composition
  - guesstimate: 3000+ Web Services publicly available
  - how to find a service? what is out there to use? provenance?
- Semantic annotation of bioinformatics services
  - annotate functional capabilities
  - e.g. Taverna, myGrid, myExperiment, EBI, BioMOBY
- Not only services and tools
  - databases, repositories, corpora
Motivation
- Manual curation
  - e.g. myGrid, BioCatalogue etc.
  - e.g. Taverna/Feta: only ~15-20% functionally described
  - backlog, and the number of services is growing
- Annotations combine
  - textual descriptions
  - ontological mappings
Example
Service
  Name: Emma
  Description: Performs a multiple alignment of nucleic acid or protein sequences using the ClustalW program.
Operation
  Name: emma
  Description: Performs a multiple alignment of nucleic acid or protein sequences using the ClustalW program.
Op-Input Parameter
  Name: sequence_usa
  Description: The Uniform Sequence Address, or USA, is a standard way of specifying a sequence to be read into a program in EMBOSS. The most common way of specifying a sequence is to type (database: entry), where database can be embl, uniprot or swissprot and entry is either the sequence’s entry or ID name, or its Accession number in that database…
  SemanticType: http://www.mygrid.org.uk/mygrid-moby-service#simpleParameter
Op-Output Parameter
  Name: outseq
  Description: Returns a multiple sequence alignment report
  SemanticType: http://www.mygrid.org.uk/ontology#multiple_sequence_alignment_report
text
ontological descriptions
- multiple local align.
- Soaplab
BioCatalogue
- Single registration point for Web Service providers
- Single search site for scientists and developers
- Place where the community can find contacts and meet the experts and maintainers of these services
- Community-sourced annotation, expert oversight
- Mixed annotations: free text, tags, controlled vocabularies, community ontologies
BioCatalogue
- Beta version at http://beta.biocatalogue.org/
- Launch June 2009 at ISMB
Our approach
- Collect service semantic descriptions by extracting and integrating information from text resources
  - full-text bioinformatics journal publications
- Approach:
  - identify descriptors that are used for service and resource annotations
  - locate them in text
  - infer the annotations
  - textual evidence and mappings to an ontology
The rest of the talk
- Methodology
  - mining bioinformatics terminology
  - extraction of service description profiles
- Experiments and results
  - semi-automated curation
- What next?
Methodology
[Figure: pipeline diagram. Components: Corpus, Information Retrieval, Sentence Filtering, Domain Ontology (e.g. myGrid), Identifying Topic Related Terms, Text Mining Engine (Information Extraction), Semantic Description of Services, Semantic Network of Services, Service Discovery]
Bioinformatics terminology
1) get a corpus
2) get all terms
3) get seed examples
4) find relevant ones using term profiling and comparison to seed examples
Learn bioinformatics terms from literature
Bioinformatics terminology
- Use seed terms to bootstrap
  - e.g. known descriptors used in existing service descriptions, either in literature or service repositories
  - 250 terms identified, manual pruning after automatic term recognition
  - examples of lexical constituents and textual behaviour (pragmatics)
    - lexical profiling
    - contextual profiling
Bioinformatics terminology
- Lexical profiling
  - what is in the name
- Contextual profiling
  - characterise sentences in which terms appear (nouns, verbs and context-patterns)
- Comparing candidate term profiles to
  - average seed term
  - best match
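A minimal sketch of how the profile comparison could work. The slides do not specify the profile representation or the similarity measure, so the bag-of-words counters and cosine similarity below are assumptions, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def lexical_profile(term):
    """Words that make up the term itself ("what is in the name")."""
    return Counter(term.lower().split())

def contextual_profile(term, sentences):
    """Words of the sentences in which the term appears (term words excluded)."""
    own = set(term.lower().split())
    profile = Counter()
    for sentence in sentences:
        if term.lower() in sentence.lower():
            profile.update(w for w in sentence.lower().split() if w not in own)
    return profile

def cosine(a, b):
    """Cosine similarity between two word-count profiles (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def average_profile(profiles):
    """Element-wise mean of a non-empty list of profiles."""
    total = Counter()
    for p in profiles:
        total.update(p)
    return {k: v / len(profiles) for k, v in total.items()}

def score(candidate, seed_profiles):
    """Return (similarity to the average seed profile, best-match similarity)."""
    avg = cosine(candidate, average_profile(seed_profiles))
    best = max(cosine(candidate, s) for s in seed_profiles)
    return avg, best
```

Candidates could then be ranked by either score; which of the two (or which combination) was used for the ranking is not stated in the slides.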
Bioinformatics terminology
Two domain experts evaluated the top 300 terms
Semantic classes – myGrid
- Informatics concepts
  - general concepts of data, data structures, databases, metadata
- Bioinformatics concepts
  - domain-specific data sources and algorithms for searching and analysing data
  - e.g. Smith-Waterman algorithm
Semantic classes – myGrid
- Molecular biology concepts
  - higher level concepts used to describe bioinformatics data types, used as inputs and outputs in services
  - e.g. protein sequence, nucleic acid sequence
- Task concepts
  - generic tasks a service operation can perform
  - e.g. retrieving, displaying, aligning
Semantic classes
- Engineered from myGrid bioinformatics sub-ontology
Class: examples
Algorithm: SigCalc algorithm, CHAOS local alignment, SNP analysis, KEGG Genome-based approach, GeneMark method, K-fold cross validation procedure
Application: PreBIND Searcher program, Apollo2Go Web Service, FLIP application, Apollo Genome Annotation curation tool, GenePix software, Pegasys system
Data: GeneBank record, Genome Microbial CoDing sequences, Drug Data report
Data resource: PIR Protein Information Resource, BIND database, TIGR dataset, BioMOBY Public Code repository
Semantic classes and instances
Service mentions
- Named-entity recognition (NER) task
- Recognition of service mentions using
  - terminological (semantic) heads of automatically recognised terms
    - Apollo2Go Web Service is an Application
    - BIND database is a Data resource
    - assign the corresponding semantic class
  - Hearst patterns (co-ordinations, appositions, enumerations, etc.)
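A minimal sketch of head-based class assignment with a simple coordination step. The head of a term is assumed here to be its last word, the head list is an illustrative subset taken from the examples in the semantic class table, and the majority-vote propagation is an assumption about how co-ordinations might be exploited; all names are hypothetical.

```python
from collections import Counter

# Illustrative head words mapped to the four semantic classes (subset only).
HEAD_TO_CLASS = {
    "algorithm": "Algorithm", "method": "Algorithm", "procedure": "Algorithm",
    "program": "Application", "service": "Application", "application": "Application",
    "tool": "Application", "software": "Application", "system": "Application",
    "record": "Data", "report": "Data",
    "database": "Data resource", "dataset": "Data resource", "repository": "Data resource",
}

def semantic_class(term):
    """Assign a class from the term's head, taken here to be its last word."""
    words = term.lower().split()
    return HEAD_TO_CLASS.get(words[-1]) if words else None

def propagate_by_coordination(items):
    """Give unclassified members of a coordinated list the most common
    class among the members that could be classified by their head."""
    classes = {item: semantic_class(item) for item in items}
    known = Counter(c for c in classes.values() if c)
    if not known:
        return classes
    majority = known.most_common(1)[0][0]
    return {item: c or majority for item, c in classes.items()}
```

For a co-ordination such as "Kalign, ClustalW and the HeatMapper tool" (an invented example), only the last item has a recognisable head, and the other two would inherit its class.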
Semantic descriptors
- Recognition of phrases depicting semantic roles used to describe services
- Flexible dictionary look-up
  - terms from myGrid ontology
  - terms/noun phrases from existing descriptions of bioinformatics resources (collected from Taverna and other Web service providers)
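A minimal sketch of what a flexible dictionary look-up might involve. The slides do not say what "flexible" covers; the normalisation below (case folding, unifying separators, crude plural stripping) and the longest-match strategy are assumptions, and the function names are hypothetical.

```python
import re

def normalise(phrase):
    """Lower-case, split on spaces/underscores/hyphens, strip a plural "s" (crude)."""
    tokens = re.split(r"[\s_\-]+", phrase.lower().strip())
    return tuple(
        t[:-1] if t.endswith("s") and not t.endswith("ss") and len(t) > 3 else t
        for t in tokens if t
    )

def build_dictionary(descriptors):
    """Map each normalised descriptor to its original form."""
    return {normalise(d): d for d in descriptors}

def find_descriptors(sentence, dictionary, max_len=5):
    """Scan the sentence left to right, preferring the longest dictionary match."""
    tokens = re.findall(r"[A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*", sentence)
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = normalise(" ".join(tokens[i:i + n]))
            if key in dictionary:
                hits.append(dictionary[key])
                i += n
                break
        else:
            i += 1  # no descriptor starts at this token
    return hits
```

With this normalisation, an ontology label such as multiple_sequence_alignment_report would match the running text "Multiple Sequence Alignment report".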
Mining service descriptions
Extraction/functional rules
- Predicate-driven rules: each verb is associated with the type of “information content” it provides
Extraction/functional rules
- Manually designed predicate-driven rules:
  - Subject (Arg) – Verb (Predicate) – Object (Arg)
- Applied on dependency-parsed sentences
  - Stanford parser
  - no phrase structures
  - complex sentences
  - information in sub-clause
Extraction/functional rules
- Phrase structures identified and integrated with the dependency parse
- Predicate-dependent rules applied to extract specific ‘content’ and profile the services
- Profiles collated for all mentions
  - service name variation
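A minimal sketch of applying predicate-driven rules to subject-verb-object triples and collating the results per service. It assumes the triples have already been produced by a dependency parser (no parser API is called here); the three verb rules are the ones that appear in the GeneClass example, the name canonicalisation is an assumed way of handling name variation, and "Gene-Class" is an invented variant.

```python
from collections import defaultdict

# Verb lemma -> type of "information content" (three of the 108 predicates).
PREDICATE_RULES = {
    "show": "functional description",
    "extend": "functional description",
    "predict": "Input/Output",
}

def canonical_name(mention):
    """Collapse simple name variation, e.g. "Gene-Class" and "GeneClass"."""
    return "".join(ch for ch in mention.lower() if ch.isalnum())

def apply_rules(triples, service_mentions):
    """Collect (type, predicate, object) entries for each known service.

    triples: iterable of (subject, verb_lemma, object) strings.
    """
    known = {canonical_name(m) for m in service_mentions}
    profiles = defaultdict(list)
    for subject, verb, obj in triples:
        info_type = PREDICATE_RULES.get(verb)
        if info_type is None:
            continue  # no rule for this predicate
        args = [canonical_name(subject), canonical_name(obj)]
        for name in known:
            if any(name in arg for arg in args):
                profiles[name].append((info_type, verb, obj))
    return dict(profiles)
```

Triples whose verb has no rule are dropped, which is one reason a rule-based approach of this kind depends on the coverage of the predicate list.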
Semantic service profiles
- For a given service, collection of
  - descriptors, including parameters
  - links to other related instances
  - related myGrid ontology semantic labels
  - “informative” sentences
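A minimal sketch of a container for such a profile. The field names and methods are hypothetical; the slides only list which kinds of information a profile collects.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class ServiceProfile:
    """One profile per service, filled in mention by mention."""
    name: str
    descriptors: Counter = field(default_factory=Counter)  # descriptor -> frequency
    related_instances: set = field(default_factory=set)    # other services/resources
    ontology_labels: set = field(default_factory=set)      # e.g. myGrid classes
    sentences: list = field(default_factory=list)          # "informative" sentences

    def add_mention(self, sentence, descriptors=(), related=(), labels=()):
        """Merge the information extracted from one sentence into the profile."""
        self.descriptors.update(descriptors)
        self.related_instances.update(related)
        self.ontology_labels.update(labels)
        if sentence not in self.sentences:
            self.sentences.append(sentence)

    def top_descriptors(self, n=5):
        """Most frequent descriptors, as (descriptor, frequency) pairs."""
        return self.descriptors.most_common(n)
```

A descriptor/frequency listing like the one in the GeneClass example would correspond to the output of top_descriptors.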
Example – GeneClass Descriptors
Descriptors Freq
GeneClass algorithm 5
Motif data 4
Reliable predictive model 2
Genome-wide protein-DNA binding data 2
Differential gene expression 3
Transcriptional gene regulation 2
Example – GeneClass Functions, parameters
Type | Predicate | Object
functional description | show | show how to incorporate genome-wide protein-DNA binding data from ChIP chip experiments into the GeneClass algorithm
Input/Output | predict | predicting differential gene expression
Input/Output | predict | starts with a candidate set of motifs μ
functional description | extend | extend the original GeneClass algorithm to use all target genes for which both motif and expression data is available
Example – GeneClass Sentences
- We extend the original GeneClass algorithm to use all target genes for which both motif and expression data is available.
- In order to study different aspects of target gene regulation we use different sets of motifs and parents with the GeneClass algorithm.
- The GeneClass algorithm for predicting differential gene expression starts with a candidate set of motifs μ, representing known or putative regulatory element sequence patterns, and a candidate set of regulators or parents.
Experiments
- 2120 BMC Bioinformatics articles
  - full-text articles before March 2008
- Service descriptors dictionary
  - 471 descriptors from myGrid/Feta
  - 450 descriptors collected from other bioinformatics service/tools providers
- 108 predicates used
Experiments
Number of candidate resources
Semantic class | # of instances identified using terminological head comparison | # of instances identified using coordination co-occurrences | Total # of instances
Algorithm | 5658 | 64 | 5722
Application | 1862 | 43 | 1905
Data | 1424 | 18 | 1442
Data Resource | 2307 | 34 | 2341
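A quick consistency check on the table, as a sketch: each total should equal the sum of the two identification routes. The grand total computed at the end is derived here and is not stated on the slide.

```python
# (head comparison, coordination co-occurrences, reported total) per class
CANDIDATES = {
    "Algorithm": (5658, 64, 5722),
    "Application": (1862, 43, 1905),
    "Data": (1424, 18, 1442),
    "Data Resource": (2307, 34, 2341),
}

def totals_consistent(rows):
    """True if every reported total equals head + coordination counts."""
    return all(head + coord == total for head, coord, total in rows.values())

def grand_total(rows):
    """Sum of the reported totals over all semantic classes."""
    return sum(total for _, _, total in rows.values())
```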
Number of descriptions collected using rules
Experiments
Evaluated for their capability to be used for semantic description of a given bioinformatics resource
Rating scale: irrelevant, partially useful, useful
Example sentences:
- HeatMapper: The HeatMapper tool has already proven to be very useful in several studies.
- Kalign: To compare Kalign to other MSA programs, the following test sets were used.
- Cognitor: To add a new species to the COG system, the annotated protein sequences from the respective genome were compared to the proteins in the COG database by using the BLAST program and assigned to pre-existing COGs by using the COGNITOR program.
Evaluation of semantic profiles
- Two experiments:
  - 5 well-known resources with descriptions already available
    - excellent rating for sentences
    - average rating for semantic descriptors, predicate functions
  - 5 new, unknown resources
    - excellent rating for sentences
    - average rating for semantic descriptors, predicate functions
Evaluation of semantic profiles
What next?
- Good recall, poor precision
  - context needs a better model
- Mining parameter values
  - sub-language of parameters
- Candidate service/resource mentions
  - an entity whose profile looks like a service
  - comparison of semantic profiles
  - network of services [ISMB 2009]
- Do we have good service ontologies?
http://gnode1.mib.man.ac.uk/bioinf
Conclusion
- Literature mining approach to service description and annotation
- Aims
  - reduce curation efforts
  - provide semantic synopses of services for the Semantic Web
- Potential of text mining
  - integration with other annotation approaches
  - extracting the entire service context is still challenging
Acknowledgements
- gnTEAM (text extraction, analytics, mining): H. Yang, I. Spasic, H. Afzal, A. Gledson, J. Eales, M. Greenwood, F. Sarafraz
- myGrid team: Franck Tanoh
- BBSRC
  - “Mining term associations from literature to support knowledge discovery in biology” (2005-2008)
  - “pubmed2ensembl” (2009-2010)
  - “BioCatalogue” (2008-2011)
Announcement
- Journal of BioMedical Semantics
  - published by BioMed Central
  - launched at ISMB 2009
- Topics include
  - Infrastructure for biomedical semantics
    - semantic resources and repositories
    - meta-data management and resource description
    - knowledge representation and semantic frameworks
    - Biomedical Semantic Web
    - life-long management of semantic resources
  - Semantic mining, annotation and analysis