Integrating BioMedical Text Mining Services into a Distributed Workflow Environment
Rob Gaizauskas, Neil Davis, George Demetriou, Yikun Guo, Ian Roberts
UK E-Science All Hands Meeting, Nottingham
September 1-3, 2004
September 1-3, 2004 All Hands Meeting, Nottingham
Outline
Introduction: Workflows, Web Services and Text Mining for Bioinformatics
Two Case Studies: Graves’ Disease and Williams Syndrome
Text Services
– Text Collection Server
– Text Services Workflow Server
– Interface/Browsing Client
Conclusions/Future Work
Workflows, Web Services and Text Mining for Bioinformatics
Workflows
– useful computational models for processes that require repeated execution of a series of complex analytical tasks
– E.g. biologist researching genetic basis of a disease repeatedly
    – maps reactive spot in microarray data to gene sequence
    – uses a sequence alignment tool to find proteins/DNA of similar structure
    – mines info about these homologues from remote DBs
    – annotates unknown gene sequence with this discovered info
Workflows, Web Services and Text Mining for Bioinformatics
Web services
– Processing resources that
    – are available via the Internet
    – use standardised messaging formats, such as XML
    – enable communication between applications without being tied to a particular operating system/programming language
– Useful for bioinformatics where data used in research is
    – heterogeneous in nature – DB records, numerical results, NL texts
    – distributed across the internet in research institutions around the world
    – available on a variety of platforms and via non-uniform interfaces
Workflows, Web Services and Text Mining for Bioinformatics
Text mining
– any process of revealing information – regularities, patterns or trends – in textual data
– includes more established research areas such as information extraction (IE), information retrieval (IR), natural language processing (NLP), knowledge discovery from databases (KDD)
– relevant to bioinformatics because of
    – explosive growth of biomedical literature
    – availability of some information in textual form only, e.g. clinical records
Workflows, Web Services and Text Mining for Bioinformatics
[Diagram: Workflows, Web services, Text mining, Bioinformatics]
Context
Objective: deliver text services for the myGrid and CLEF projects
myGrid has adopted the workflow model for delivering an e-biologist’s workbench
– Scufl workflow specification language
– Taverna workflow design tool
– Freefluo workflow enactment engine
Problem: how to integrate text mining into a biological workflow?
– Most text mining runs off-line and supports interactive browsing of results
– Most workflows run end to end with no user intervention
– What are the inputs to text mining to be?
Solution: tap off result of a workflow step and treat as implicit query
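The "tap off" idea can be illustrated with a minimal Python sketch. It is not the myGrid/Freefluo mechanism: the runner, the step names and the toy step functions are all hypothetical, chosen only to show a workflow that runs end to end while the output of one step is copied aside as an implicit text mining query.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A step is a (name, function) pair; every name here is illustrative.
Step = Tuple[str, Callable[[object], object]]


def run_workflow(steps: List[Step], data: object,
                 taps: Optional[Dict[str, Callable[[object], None]]] = None) -> object:
    """Run the steps in order with no user intervention.

    If a step name appears in `taps`, its output is also handed to the
    tap callback, without changing what flows to the next step.
    """
    taps = taps or {}
    for name, func in steps:
        data = func(data)
        if name in taps:
            taps[name](data)  # side channel: the implicit query
    return data


# Toy stand-ins for real analysis services (assumptions, not real tools).
def map_spot_to_sequence(spot_id: str) -> str:
    return f"SEQ_FOR_{spot_id}"


def align_sequence(sequence: str) -> List[str]:
    return [f"{sequence}_HOMOLOGUE_1", f"{sequence}_HOMOLOGUE_2"]


def annotate(homologues: List[str]) -> Dict[str, List[str]]:
    return {"annotations": homologues}


implicit_queries: List[object] = []
steps = [("map", map_spot_to_sequence),
         ("align", align_sequence),
         ("annotate", annotate)]

# Tap the alignment step: its result doubles as the text mining input.
result = run_workflow(steps, "spot42", taps={"align": implicit_queries.append})
```

The main workflow is unaffected by the tap, so text mining can be attached to an existing workflow without redesigning its steps.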
Two Case Studies in the Genetic Basis of Disease
Graves’ Disease
– an autoimmune condition affecting tissues in the thyroid and orbit
– being investigated using micro-array methods
    – micro-array shows which genes are differentially expressed in normal patients vs patients with the disease = candidate genes
    – sequence alignment search (e.g. BLAST) finds genes/proteins with similar structure
    – function of these “homologues” may suggest function of candidate gene
– key step for text mining follows BLAST search for homologous proteins
    – BLAST report contains references to proteins in SWISSPROT protein database
    – Swissprot records contain ids of abstracts describing the protein in Medline abstract database
    – abstracts can be mined directly or used as “seed” documents to assemble a set of related abstracts
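As a rough illustration of the step from Swissprot record to abstract ids, the Python sketch below pulls PubMed ids out of the `RX` cross-reference lines of a Swiss-Prot flat-file entry. The sample record is invented and heavily truncated, and the function name is hypothetical; this is not the parser used in the system described here.

```python
import re
from typing import List

# Invented, simplified flat-file entry; only the RX lines matter here.
SAMPLE_RECORD = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   100 AA.
RN   [1]
RX   MEDLINE=00000001; PubMed=1111111;
RN   [2]
RX   MEDLINE=00000002; PubMed=2222222;
RN   [3]
RX   PubMed=1111111;
//
"""

PUBMED_RE = re.compile(r"PubMed=(\d+);")


def extract_pubmed_ids(record: str) -> List[str]:
    """Return the distinct PubMed ids cited on RX lines, in order of first use."""
    ids: List[str] = []
    for line in record.splitlines():
        if line.startswith("RX"):
            for pmid in PUBMED_RE.findall(line):
                if pmid not in ids:  # skip repeated citations
                    ids.append(pmid)
    return ids


seed_ids = extract_pubmed_ids(SAMPLE_RECORD)
```

The resulting ids can then serve as the "seed" set from which related abstracts are assembled.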
Two Case Studies in the Genetic Basis of Disease
Williams Syndrome
– congenital disorder resulting in mental retardation caused by deletion of genetic material on 7th chromosome
– area in which deletions occur not well characterised – better sequence info is becoming available
– as new sequence information becomes available
    – gene finding software run against it
    – BLAST is run against new putative genes to identify homologues whose function may be known
– BLAST reports provide links to abstracts in the literature
Text Services Architecture
[Architecture diagram. Components: User Client; Workflow Server (Workflow Enactment, with steps Initial Workflow, Extract PubMed Id, Get Medline Abstract, Cluster Abstracts, Get Related Abstracts); Medline Server (Medline: pre-processed offline to extract biomedical terms + indexed). Labelled data flows: Swissprot/Blast record; Workflow definition + parameters; Clustered PubMed Ids + titles; PubMed Ids; Term-annotated Medline abstracts; Medline abstracts.]
Text Services Architecture
3-way division of labour is a sensible way to deliver distributed text mining services
– Providers of e-archives, such as Medline, will make archives available via web-services interface
    – Cannot offer tailored services for every application
    – Will provide core, common services
– Specialist workflow designers will add value to basic services from archive to meet their organization’s needs
– Users will prefer to execute predefined workflows via standard light clients such as a browser
Architecture appropriate for many research areas, not just bioinformatics
Text Collection Server
Text collection is Medline (www.ncbi.nlm.nih.gov/)
– > 10 million abstracts since 1950’s
– largest repository of biomedical abstracts
– copies made available for research, updated annually
– records contain semi-structured information annotated in XML
    – Unique id – PubMed id
    – Citation information – author(s), journal, year, etc.
    – Manually assigned controlled vocabulary keywords (MeSH terms)
    – Text of abstract
Text Collection Server (cont)
Local copy
– Loaded in MySQL, indexed on various fields, e.g. MeSH terms
– Text portion indexed for search engines (Lucene, Madcow)
– Text pre-processed with text mining tools, and indexes built for term classes (proteins, genes, diseases, etc.)
    – Tokenisation
    – Terminology look-up
    – Part-of-speech tagging
    – Term parsing
Server accepts web service calls to, e.g.
– Return text of abstract given a PubMed id
– Return MeSH terms of abstracts given PubMed ids
– Return PubMed ids of abstracts with given MeSH terms
– Return PubMed ids of abstracts matching a free text query
– Return PubMed ids of abstracts containing a specific term
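The web service calls listed above can be pictured with a small in-memory Python sketch. It is only an illustration under stated assumptions: the class, method and field names are hypothetical, plain dictionaries stand in for the MySQL and search engine indexes, and free text matching is reduced to naive substring tests.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Abstract:
    pmid: str                                           # PubMed id
    text: str                                           # text of abstract
    mesh_terms: Set[str] = field(default_factory=set)   # MeSH keywords
    terms: Set[str] = field(default_factory=set)        # pre-extracted terms


class TextCollection:
    """Toy stand-in for the text collection server (names are assumptions)."""

    def __init__(self, abstracts: Iterable[Abstract]) -> None:
        self._by_pmid: Dict[str, Abstract] = {a.pmid: a for a in abstracts}

    def get_text(self, pmid: str) -> str:
        """Return text of abstract given a PubMed id."""
        return self._by_pmid[pmid].text

    def get_mesh_terms(self, pmids: Iterable[str]) -> Dict[str, List[str]]:
        """Return MeSH terms of abstracts given PubMed ids."""
        return {p: sorted(self._by_pmid[p].mesh_terms) for p in pmids}

    def find_by_mesh(self, mesh_terms: Iterable[str]) -> List[str]:
        """Return PubMed ids of abstracts carrying all the given MeSH terms."""
        wanted = set(mesh_terms)
        return sorted(p for p, a in self._by_pmid.items() if wanted <= a.mesh_terms)

    def find_by_text(self, query: str) -> List[str]:
        """Return PubMed ids of abstracts whose text contains every query word."""
        words = query.lower().split()
        return sorted(p for p, a in self._by_pmid.items()
                      if all(w in a.text.lower() for w in words))

    def find_by_term(self, term: str) -> List[str]:
        """Return PubMed ids of abstracts annotated with a specific term."""
        return sorted(p for p, a in self._by_pmid.items() if term in a.terms)


# Invented demo data, not real Medline records.
collection = TextCollection([
    Abstract("1", "Thyroid autoantibodies in Graves disease",
             {"Graves Disease", "Autoantibodies"}, {"thyroid"}),
    Abstract("2", "Elastin deletion in Williams syndrome",
             {"Williams Syndrome"}, {"elastin"}),
])
```

For example, `collection.find_by_mesh(["Graves Disease"])` returns the ids of the demo abstracts tagged with that MeSH term.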
Workflow Server
Workflow server runs Freefluo enactment engine to execute Scufl workflow (designed using Taverna)
[Figure: Graves’ disease workflow]
Interface/Browsing Client
Two components
– Submit workflow for enactment
– Explore results and launch follow-on queries
Three types of follow-on search
– Find other texts containing terms in current text
– Find texts containing a specific search string (free text search)
– Find other texts “like” current one (with same MeSH terms)
Implemented as a Java-Swing applet for easy inclusion in portals
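One way to read the third follow-on search ("like" the current text, with same MeSH terms) is to rank other abstracts by how many MeSH terms they share with the current one. The Python sketch below illustrates that reading only; the scoring rule, the function name and the demo data are assumptions, and it is unrelated to the Java client code.

```python
from typing import Dict, List, Set, Tuple


def rank_by_shared_mesh(current_pmid: str,
                        mesh_index: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Rank other abstracts by the number of MeSH terms shared with the current one.

    Abstracts sharing no term are dropped; ties are broken by PubMed id.
    """
    current = mesh_index[current_pmid]
    scored = [(pmid, len(current & terms))
              for pmid, terms in mesh_index.items()
              if pmid != current_pmid]
    scored = [(pmid, n) for pmid, n in scored if n > 0]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Invented MeSH assignments for illustration.
mesh_index = {
    "100": {"Graves Disease", "Thyroid Gland", "Autoantibodies"},
    "200": {"Graves Disease", "Thyroid Gland"},
    "300": {"Graves Disease"},
    "400": {"Williams Syndrome"},
}

related = rank_by_shared_mesh("100", mesh_index)
```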
Interface/Browsing Client
[Screenshot of the client. Labelled parts: abstract body, MeSH tree, abstract titles, free text search, search scope restrictors, linked terms, Get Related Abstracts]
Conclusion
Have implemented a set of text mining web services that run in a workflow to support biologists in exploring the genetic basis of disease
Implementation based on a generic 3-component architecture (archive server, workflow server, browser client) with wider applicability
Basic idea is to glean an implicit query from a workflow operation (e.g. sequence alignment)
– find abstracts of papers related to abstracts describing homologous proteins/genes of gene of interest
– Cluster results and present to user
User can explore results and issue follow-on queries via a richly-featured graphical interface
Future Work
Integrate in practice with rest of Graves’/Williams workflows in myGrid and get feedback from biologists
Explore other interpretations of “relatedness” for abstracts in addition to MeSH terms
– in assembling corpus of related abstracts (e.g. vector space/language model notions of similarity)
– in clustering results (e.g. k-means/agglomerative clustering)
Explore other ways of deriving implicit queries from workflows – e.g. mining provenance data
Explore further interface search filtering operations and interface design issues
Scale up to process all of Medline for term/entity identification
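The vector space notion of similarity mentioned above can be sketched in Python as cosine similarity between raw term-frequency vectors. This is a generic textbook formulation given for illustration, not something the slides report as implemented; whitespace tokenisation and the absence of any term weighting are simplifying assumptions.

```python
import math
from collections import Counter


def tf_vector(text: str) -> Counter:
    """Build a raw term-frequency vector using naive whitespace tokenisation."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def abstract_similarity(text_a: str, text_b: str) -> float:
    """Convenience wrapper: similarity of two abstracts given as plain text."""
    return cosine_similarity(tf_vector(text_a), tf_vector(text_b))
```

A pairwise score of this kind could feed either the assembly of related abstracts or a clustering step such as k-means.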