Integrating BioMedical Text Mining Services into a Distributed Workflow Environment


Rob Gaizauskas, Neil Davis, George Demetriou, Yikun Guo, Ian Roberts

UK E-Science All Hands Meeting, Nottingham

September 1-3, 2004


Outline

Introduction: Workflows, Web Services and Text Mining for Bioinformatics

Two Case Studies: Graves’ Disease and Williams Syndrome

Text Services
– Text Collection Server
– Text Services Workflow Server
– Interface/Browsing Client

Conclusions/Future Work


Workflows, Web Services and Text Mining for Bioinformatics

Workflows
– useful computational models for processes that require repeated execution of a series of complex analytical tasks
– E.g. biologist researching genetic basis of a disease repeatedly
  – maps reactive spot in microarray data to gene sequence
  – uses a sequence alignment tool to find proteins/DNA of similar structure
  – mines info about these homologues from remote DBs
  – annotates unknown gene sequence with this discovered info


Workflows, Web Services and Text Mining for Bioinformatics

Web services
– Processing resources that
  – are available via the Internet
  – use standardised messaging formats, such as XML
  – enable communication between applications without being tied to a particular operating system/programming language
– Useful for bioinformatics, where data used in research is
  – heterogeneous in nature: DB records, numerical results, NL texts
  – distributed across the internet in research institutions around the world
  – available on a variety of platforms and via non-uniform interfaces


Workflows, Web Services and Text Mining for Bioinformatics

Text mining
– any process of revealing information (regularities, patterns or trends) in textual data
– includes more established research areas such as information extraction (IE), information retrieval (IR), natural language processing (NLP), knowledge discovery from databases (KDD)
– relevant to bioinformatics because of
  – explosive growth of biomedical literature
  – availability of some information in textual form only, e.g. clinical records


Workflows, Web Services and Text Mining for Bioinformatics

[Diagram labels: Workflows, Web services, Text mining, Bioinformatics]


Context

Objective: deliver text services for the myGrid and CLEF projects

myGrid has adopted the workflow model for delivering an e-biologist’s workbench
– Scufl workflow specification language
– Taverna workflow design tool
– Freefluo workflow enactment engine

Problem: how to integrate text mining into a biological workflow?
– Most text mining runs off-line and supports interactive browsing of results
– Most workflows run end to end with no user intervention
– What are the inputs to text mining to be?

Solution: tap off the result of a workflow step and treat it as an implicit query


Two Case Studies in the Genetic Basis of Disease

Graves’ Disease
– an autoimmune condition affecting tissues in the thyroid and orbit
– being investigated using micro-array methods
  – micro-array shows which genes are differentially expressed in normal patients vs patients with the disease = candidate genes
  – sequence alignment search (e.g. BLAST) finds genes/proteins with similar structure
  – function of these “homologues” may suggest function of candidate gene
– key step for text mining follows BLAST search for homologous proteins
  – BLAST report contains references to proteins in the SWISSPROT protein database
  – Swissprot records contain ids of abstracts describing the protein in the Medline abstract database
  – abstracts can be mined directly or used as “seed” documents to assemble a set of related abstracts
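The slides do not say how the abstract ids are pulled out of a Swissprot record. As a minimal sketch, assuming the record is in the Swiss-Prot flat-file format, where reference cross-references sit on `RX` lines such as `RX   MEDLINE=...; PubMed=...;`, the extraction could look like this. The function name and the sample record are hypothetical.

```python
import re
from typing import List

# Matches a PubMed cross-reference on a Swiss-Prot "RX" line,
# e.g. "RX   MEDLINE=00000001; PubMed=1111111;"
_PUBMED_RE = re.compile(r"PubMed=(\d+);")


def extract_pubmed_ids(swissprot_record: str) -> List[str]:
    """Return the PubMed ids cited in a flat-file record.

    Ids are returned in order of first appearance, without duplicates.
    """
    ids: List[str] = []
    for line in swissprot_record.splitlines():
        # Only reference cross-reference lines carry PubMed ids.
        if not line.startswith("RX"):
            continue
        for pmid in _PUBMED_RE.findall(line):
            if pmid not in ids:
                ids.append(pmid)
    return ids


# Hypothetical record fragment, for illustration only.
SAMPLE_RECORD = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   100 AA.
RN   [1]
RX   MEDLINE=00000001; PubMed=1111111;
RN   [2]
RX   PubMed=2222222;
RN   [3]
RX   PubMed=1111111;
"""

if __name__ == "__main__":
    print(extract_pubmed_ids(SAMPLE_RECORD))
```

The resulting ids are the "seed" set: they can be used to fetch abstracts directly or to look for related abstracts.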


Two Case Studies in the Genetic Basis of Disease

Williams Syndrome
– congenital disorder resulting in mental retardation caused by deletion of genetic material on 7th chromosome
– area in which deletions occur not well characterised; better sequence info is becoming available
– as new sequence information becomes available
  – gene finding software is run against it
  – BLAST is run against new putative genes to identify homologues whose function may be known
– BLAST reports provide links to abstracts in the literature


Text Services Architecture

[Architecture diagram]
Components: User Client; Workflow Server (Workflow Enactment); Medline Server
Workflow steps shown: Initial Workflow, Extract PubMed Id, Get Medline Abstract, Get Related Abstracts, Cluster Abstracts
Data exchanged: Workflow definition + parameters; Swissprot/Blast record; PubMed Ids; Medline Abstracts; Term-annotated Medline abstracts; Clustered PubMed Ids + titles
Medline: pre-processed offline to extract biomedical terms + indexed



3-way division of labour is a sensible way to deliver distributed text mining services
– Providers of e-archives, such as Medline, will make archives available via a web-services interface
  – Cannot offer tailored services for every application
  – Will provide core, common services
– Specialist workflow designers will add value to basic services from the archive to meet their organization’s needs
– Users will prefer to execute predefined workflows via standard light clients such as a browser

Architecture appropriate for many research areas, not just bioinformatics





Text Collection Server

Text collection is Medline (www.ncbi.nlm.nih.gov/)
– More than 10 million abstracts since the 1950s
– largest repository of biomedical abstracts
– copies made available for research, updated annually
– records contain semi-structured information annotated in XML
  – Unique id: PubMed id
  – Citation information: author(s), journal, year, etc.
  – Manually assigned controlled vocabulary keywords (MeSH terms)
  – Text of abstract


Text Collection Server (cont)

Local copy
– Loaded in MySQL, indexed on various fields, e.g. MeSH terms
– Text portion indexed for search engines (Lucene, Madcow)
– Text pre-processed with text mining tools, and indexes built for term classes (proteins, genes, diseases, etc.)
  – Tokenisation
  – Terminology look-up
  – Part-of-speech tagging
  – Term parsing

Server accepts web service calls to, e.g.
– Return text of abstract given a PubMed id
– Return MeSH terms of abstracts given PubMed ids
– Return PubMed ids of abstracts with given MeSH terms
– Return PubMed ids of abstracts matching a free text query
– Return PubMed ids of abstracts containing a specific term
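To make the shape of these calls concrete, here is a small in-memory stand-in for the text collection server. It is only a sketch: the class and method names are hypothetical, the real server sits on MySQL and Lucene/Madcow indexes behind a web service interface, and the free text search here is a plain all-words-must-match lookup. The term-class lookup is left out because it depends on the offline term annotation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class TextCollection:
    """Toy in-memory abstract store with inverted indexes."""

    def __init__(self) -> None:
        self._abstracts: Dict[str, str] = {}
        self._mesh: Dict[str, Set[str]] = {}
        self._by_mesh: Dict[str, Set[str]] = defaultdict(set)
        self._by_word: Dict[str, Set[str]] = defaultdict(set)

    def add(self, pmid: str, text: str, mesh_terms: Iterable[str]) -> None:
        """Store one abstract and update the MeSH and word indexes."""
        terms = set(mesh_terms)
        self._abstracts[pmid] = text
        self._mesh[pmid] = terms
        for term in terms:
            self._by_mesh[term].add(pmid)
        for word in text.lower().split():
            self._by_word[word.strip(".,;:()")].add(pmid)

    def get_abstract(self, pmid: str) -> str:
        """Return text of abstract given a PubMed id."""
        return self._abstracts[pmid]

    def get_mesh_terms(self, pmids: Iterable[str]) -> Dict[str, List[str]]:
        """Return MeSH terms of abstracts given PubMed ids."""
        return {pmid: sorted(self._mesh[pmid]) for pmid in pmids}

    def find_by_mesh(self, terms: Iterable[str]) -> List[str]:
        """Return PubMed ids of abstracts carrying all the given MeSH terms."""
        sets = [self._by_mesh.get(term, set()) for term in terms]
        return sorted(set.intersection(*sets)) if sets else []

    def search(self, query: str) -> List[str]:
        """Return PubMed ids of abstracts containing every query word."""
        sets = [self._by_word.get(word, set()) for word in query.lower().split()]
        return sorted(set.intersection(*sets)) if sets else []
```

A follow-on "find texts like this one" query can then be phrased as `find_by_mesh` applied to the MeSH terms returned by `get_mesh_terms` for the current abstract.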





Workflow Server

Workflow server runs Freefluo enactment engine to execute Scufl workflow (designed using Taverna)

Graves’ disease workflow:
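The workflow figure itself is not reproduced here. As a rough stand-in, the steps named in the architecture diagram (extract PubMed ids, get related abstracts, get Medline abstracts, cluster abstracts) can be chained as below. This is not Scufl and not the Freefluo API: the function name, its parameters and the order of the steps are assumptions made for illustration, and each step is passed in as a plain callable where the real workflow would call a web service.

```python
from typing import Callable, Dict, Iterable, List


def run_text_mining_steps(
    blast_record: str,
    extract_pubmed_ids: Callable[[str], List[str]],
    get_related_ids: Callable[[str], Iterable[str]],
    get_abstract: Callable[[str], str],
    cluster: Callable[[Dict[str, str]], Dict[str, List[str]]],
) -> Dict[str, List[str]]:
    """Chain the text mining steps over one Swissprot/BLAST record."""
    # Step 1: the implicit query, i.e. ids of abstracts cited by the record.
    seed_ids = extract_pubmed_ids(blast_record)

    # Step 2: widen the seed set with related abstracts,
    # skipping ids that have already been collected.
    all_ids = list(seed_ids)
    for pmid in seed_ids:
        for related in get_related_ids(pmid):
            if related not in all_ids:
                all_ids.append(related)

    # Step 3: fetch the abstract text for every id.
    abstracts = {pmid: get_abstract(pmid) for pmid in all_ids}

    # Step 4: group the abstracts for presentation in the client.
    return cluster(abstracts)
```

Passing the steps in as callables keeps the sketch independent of any particular archive server or clustering method.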





Interface/Browsing Client

Two components
– Submit workflow for enactment
– Explore results and launch follow-on queries

Three types of follow-on search
– Find other texts containing terms in current text
– Find texts containing a specific search string (free text search)
– Find other texts “like” current one (with same MeSH terms)

Implemented as a Java-Swing applet for easy inclusion in portals


Interface/Browsing Client

[Screenshot of the client, with labelled regions: MeSH tree, abstract titles, abstract body, free text search, search scope restrictors, linked terms, get related abstracts]


Conclusion

Have implemented a set of text mining web services that run in a workflow to support biologists in exploring the genetic basis of disease

Implementation based on a generic 3 component architecture (archive server, workflow server, browser client) with wider applicability

Basic idea is to glean an implicit query from a workflow operation (e.g. sequence alignment)
– find abstracts of papers related to abstracts describing homologous proteins/genes of gene of interest
– Cluster results and present to user

User can explore results and issue follow-on queries via a richly-featured graphical interface


Future Work

Integrate in practice with rest of Graves’/Williams workflows in myGrid and get feedback from biologists

Explore other interpretations of “relatedness” for abstracts in addition to MeSH terms
– in assembling corpus of related abstracts (e.g. vector space/language model notions of similarity)
– in clustering results (e.g. k-means/agglomerative clustering)

Explore other ways of deriving implicit queries from workflows – e.g. mining provenance data

Explore further interface search filtering operations and interface design issues

Scale up to process all of Medline for term/entity identification
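As an illustration of the vector space notion of relatedness mentioned above, the sketch below scores two abstracts by the cosine of the angle between their raw term-count vectors. It is a deliberately simple assumption-laden example: whitespace tokenisation, no stop-word removal and no tf-idf weighting, all of which a real system would likely revisit. The function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Build a bag-of-words count vector from lower-cased, whitespace-split text."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors, in [0, 1]."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty text is treated as unrelated to everything.
        return 0.0
    return dot / (norm_a * norm_b)
```

Pairwise scores of this kind could feed either the assembly of a related-abstract corpus (keep abstracts above a threshold) or an agglomerative clustering step.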
