1 Improving the Reuse of Scientific Workflows and their By-products Xiaorong Xiang National Evolutionary Synthesis Center (NESCent) Duke University, University of North Carolina - Chapel Hill, and North Carolina State University Gregory Madey Department of Computer Science and Engineering University of Notre Dame 2007 IEEE International Conference on Web Services (ICWS 2007) Salt Lake City, Utah, July 2007 Supported in part by the Indiana Center for Insect Genomics (ICIG) & the Indiana 21st Century Fund

1

Improving the Reuse of Scientific Workflows and their By-products

Xiaorong Xiang
National Evolutionary Synthesis Center (NESCent)
Duke University, University of North Carolina - Chapel Hill, and North Carolina State University

Gregory Madey
Department of Computer Science and Engineering
University of Notre Dame

2007 IEEE International Conference on Web Services (ICWS 2007)
Salt Lake City, Utah, July 2007

Supported in part by the Indiana Center for Insect Genomics (ICIG) & the Indiana 21st Century Fund

2

Collaborators: Xiaorong Xiang & Jeanne Romero-Severson

3

Outline: two parts

• Production system (MoGServ) for bioinformatics workflow
  • Bioinformatics application
  • Productivity improvement
• Prototype system exploring ideas for end-user composition
  • Workflow reuse
  • Knowledge management/discovery

4

Bioinformatics today

• Rapidly accumulating data: DNA sequences, contigs, expression data, annotations, etc.
• Non-standard, independently developed, heterogeneous data sources
• Data sharing and security
• Productivity problem!

Figure source: Folker Meyer, “Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade,” CTWatch Quarterly, vol. 2, no. 3, August 2006.

5

SOA in Bioinformatics

• Recent exposure of data & analysis tools as services
  • Large public databases and bioinformatics tools
• Middleware projects
  • Provide infrastructure to compose, manage, execute, and connect the distributed services
• MORE community efforts needed to provide more shared and reliable services
• More demonstration projects needed => best practices, measured utility, feedback to middleware projects, etc.

6

Mother of Green (MoG) project

• Biological science
  • In collaboration with Prof. Jeanne Romero-Severson, Biological Sciences, University of Notre Dame
  • Study the deep phylogeny of plastids
• Computer science
  • Provide an environment to support scientists’ investigations
  • A case study of using SOA for data and application integration
  • A prototype for future research in the service-oriented architecture domain

7

Mother of Green

• Malaria causes 1.5 - 2.7 million deaths every year
• 3,000 children under age five die of malaria every day
• Plasmodium falciparum (a protozoan parasite) causes human malaria
• Drug resistance a world-wide problem
• Targeted drug design through phylogenomics

[Image: P. falciparum]

8

Mother of Green

• P. falciparum has three genomes: nuclear, mitochondrial, plastid
• Animals and insects have only two
• Target the third genome
• No harm to animals
• New antimalarial drug
• High risk, high tech, high payoff

J. Romero-Severson, Department of Biological Sciences
Greg Madey & Xiaorong Xiang, Department of Computer Science & Engineering

9

Mother of Green

• Plastids are the third genome
• Intracellular organelles
• Terrestrial plants, algae, apicomplexans
• Functions in plants and algae
  • Photosynthesis
  • Oxidation of water
  • Reduction of NADP
  • Synthesis of ATP
  • Fatty acid biosynthesis
  • Aromatic amino acid biosynthesis
• Functions in apicomplexans?

[Images: chloroplast in plant cell; plastid in Toxoplasma sp.; apicoplast in P. falciparum]

10

Mother of Green

• The apicoplast appears to code for <30 proteins.
• Repair, replication and transcription proteins
• Why is the apicoplast essential?

11

Mother of Green: Phylogenomics

• Find the ancestors of the apicoplast
• Identify genes in the ancestors
• Determine gene function
• Look for these genes in the P. falciparum nucleus
• Then study regulatory mechanisms in candidate genes

12

Phylogenomics of plastids

• Very old lineage (> 2.5 billion years)
• Cyanobacterial ancestor
• Three main plastid lineages
  • Glaucophytes
    • Group of freshwater algae
    • Chloroplast resembles intact cyanobacteria
  • Chlorophytes
    • Green plant lineage
    • Chloroplast genome reduced
    • Many chloroplast genes now in nuclear genome
  • Rhodophytes
    • Red algal lineage
    • Chloroplast genome bigger than in green plants
    • Oomycetes
    • Apicomplexans

13

Phylogenomics of plastids

• One cyanobacterial ancestor?
• Many?
• Lineages are not linear

[Figure panels: one plastid origin; multiple plastid origins]

14

[Figure: The process of endosymbiosis. Horizontal gene transfer (arrows) from the plastid to the nucleus. The nucleomorph is a remnant of the original endosymbiont nucleus. Labels: cyanobacteria, primitive eukaryote, nucleus, endosymbiont plastid, second eukaryote, secondary endosymbionts, nucleomorph, secondary nonphotosynthetic endosymbiont (plastid disappears).]

15

[Figure: Tertiary endosymbiosis and horizontal gene transfer. Labels: secondary endosymbiont, third eukaryote, tertiary endosymbionts, tertiary nonphotosynthetic endosymbiont (plastid disappears), P. falciparum.]

16

The information gathering problem

• Rapid accumulation of raw sequence information
  • ~100 sequenced chloroplast genomes
  • ~57 sequenced cyanobacterial genomes
  • Rate of accumulation is increasing
  • Information accumulates faster than analyses finish
  • Information in forms not readily accessible
• Solution
  • Semi-automated web services
  • “Smart” web services
  • Semantic web

17

A typical in-silico investigation – Data driven research

A: Query complete genome sequences given a taxon
B: Query protein coding genes for each genome sequence
C: Eliminate vector sequences
D: Sequence alignment
E: Phylogenetic analysis

18

Time consuming manual web-based operations

Data collection Copy & paste!

Analysis tool usage Copy & paste!

Experiment data recording Copy & paste!

Repetitive experiments for scientific discovery Copy & paste!

Repeat as new data becomes available Copy & paste!

19

MoGServ system architecture

• MoGServ interface
  • Web interface
  • Application interface
• MoGServ middle layer
  • Data access storage
  • Data and analysis services
  • Service and workflow registry
  • Indexing and querying metadata
  • Service and workflow enactment
• Acting in two roles: service requester and service provider

MoGServ System Architecture (diagram)

[Figure: layered architecture.
 Clients: Web Interface; Applications; Services Access Client.
 MoGServ middle layer: Application Server; Data Access Services; Data Analysis Services; Job Manager; Job Launcher; Service/Workflow Registry; Metadata Search; Local Data Storage; Workflow/SOAP Engines; Services.
 Data/services providers: NCBI, DDBJ, EMBL, others.]

21

Data storage and access services

Local database
• Integrating data from multiple data sources that match scientists’ interests
• Supporting repetitive investigations against several subsets of sequences
• Avoiding network traffic and service failures when retrieving data on-the-fly from public data sources
• Accessing the data in the local database through services

22

Service and workflow registry

• A table-based description with necessary properties
  • Text description
  • Service location
  • Input/output
  • Provider
  • Version
  • Algorithm
  • Invocation method
• Not intended for supporting service discovery or composition
• To answer end-users’ questions about their results
  • Provenance: “Which algorithm was used to generate the data and what is the source of the input data?”
• A repository of services and workflows used by local application developers
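
The provenance question above can be pictured with a small sketch. It assumes a registry keyed by task name and a job record that remembers its input source; the class names, field names, and example values are hypothetical and are not taken from MoGServ itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    """One row of a table-based service/workflow registry (fields follow the slide)."""
    description: str
    location: str
    inputs: tuple
    outputs: tuple
    provider: str
    version: str
    algorithm: str
    invocation: str


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical record of one executed job."""
    job_id: str
    task_name: str
    input_source: str


def provenance(job, registry):
    """Answer: which algorithm produced the data, and where did the input come from?"""
    entry = registry[job.task_name]
    return {
        "algorithm": entry.algorithm,
        "version": entry.version,
        "provider": entry.provider,
        "input_source": job.input_source,
    }


# Illustrative values only.
registry = {
    "align": RegistryEntry(
        description="Multiple sequence alignment",
        location="http://example.org/services/align?wsdl",
        inputs=("setid",),
        outputs=("alignment_report",),
        provider="example provider",
        version="1.0",
        algorithm="ClustalW",
        invocation="SOAP",
    )
}
job = JobRecord(job_id="job-1", task_name="align", input_source="local database")
answer = provenance(job, registry)
```

Because the registry is only consulted by task name, this kind of lookup needs no service discovery or composition logic, which matches the limited role the slide gives the registry.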

23

Indexing and querying metadata

• Metadata
  • Service and workflow descriptions
  • Description of sequence data in order to track the origin of the data
  • Experimental data: output, input, and intermediate data
• Indexing and querying with keywords
  • Lucene
  • Implemented as services
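
To illustrate keyword indexing and querying of metadata, here is a minimal inverted-index sketch in plain Python. It is not Lucene and does not use Lucene's API; the class name, document identifiers, and sample descriptions are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into a set of keyword terms."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


class MetadataIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        hits = [self._postings.get(term, set()) for term in terms]
        return set.intersection(*hits)


# Hypothetical metadata entries for a job and a service.
index = MetadataIndex()
index.add("job-1", "ClustalW alignment of ATP gene sequences")
index.add("svc-2", "Format conversion with readseq")
```

A query such as `index.search("atp alignment")` returns only entries whose description contains both terms, which is the kind of lookup an end-user needs when tracing a saved result back to its description.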

24

Service and workflow enactment

[Figure: service and workflow enactment flow]
• Input: parameters, task name, timer
• Job Manager: finds the service/workflow definition in the Service/Workflow Registry using the task name and forms a job description
• Output: job ID
• Job Launcher: passes the job information to instances of workflow/service engines
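
The enactment flow can be sketched as follows. This is a simplified illustration, not the MoGServ implementation: the class and method names are hypothetical, the registry is a plain dictionary, the launcher is any callable, and the timer is stored as an opaque value because the slides do not say how it is used.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobDescription:
    job_id: str
    task_name: str
    definition: str          # service/workflow definition found in the registry
    parameters: dict
    timer: Optional[int] = None  # semantics not specified in the slides


class JobManager:
    """Looks up a task by name, forms a job description, and hands it to a launcher."""

    def __init__(self, registry, launcher):
        self.registry = registry      # task name -> definition
        self.launcher = launcher      # callable that starts an engine instance
        self.jobs = {}
        self._counter = itertools.count(1)

    def submit(self, task_name, parameters, timer=None):
        definition = self.registry[task_name]  # raises KeyError for unknown tasks
        job = JobDescription(
            job_id=f"job-{next(self._counter)}",
            task_name=task_name,
            definition=definition,
            parameters=dict(parameters),
            timer=timer,
        )
        self.jobs[job.job_id] = job
        self.launcher(job)
        return job.job_id  # the caller only receives the job ID


# Illustrative use: the "launcher" just records what it was asked to run.
launched = []
manager = JobManager({"align": "clustalw-workflow"}, launched.append)
job_id = manager.submit("align", {"setid": "42"})
```

Returning only the job ID keeps the caller decoupled from the engine, so results can be fetched later using that ID.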

25

Implementation

• Development and deployment
  • J2EE, JSP, XSLT
  • Tomcat 5.0.18 / Axis 1.2
• Database
  • PostgreSQL 8.1
• Index and search of metadata
  • Apache Lucene library
• Service implementation
  • Java2WSDL
  • Wrap command-line applications with the JLaunch library
• Workflow
  • Taverna workbench, part of the myGrid project
  • Freefluo workflow engine

26

Data and services

• Services, workflows
  • Data collection from remote databases
  • Query local database
  • Data analysis tools: blast, clustalw
  • Data format conversion: readseq
  • Management of data sets and jobs
  • Download and upload
• Data
  • Complete genome sequences
  • ATP gene sequences
  • Sequence sets
  • Saved jobs

27

Taverna workbench

28

A workflow created using the Taverna workbench tool

29

Improvement opportunities

• Use existing domain ontologies in the bioinformatics community to describe services, workflows, and data
• Integrate semantic web technology to support end-users’ workflow creation based on their knowledge of the scientific domain
• Support users with limited knowledge of scientific processes
• Record various workflow representations
• Facilitate the discovery and reuse of prior workflows
  • Knowledge management
  • Knowledge discovery

30

Service Composition and workflows

• Service composition
  • Ad-hoc
  • Semi-automated
    • Semantic annotation + reasoning
  • Automated
    • Semantic annotation + planning
• Scientific workflows
  • Workflows composed on a service-oriented architecture to assist scientists in accessing and analyzing data

31

Current workflow management systems

• Existing workflow management systems and bioinformatics middleware
  • Taverna, Kepler, Triana, Pegasus
  • Design, execute, monitor, re-run
• Support ad-hoc, semi-automated, and automated service discovery and composition from scratch

32

Our approach

• Reuse the verified knowledge and workflows in the community
  • Increase the correctness of composed workflows over time
  • Provide more accurate guidelines for users
• A four-level hierarchical workflow structure
• An enhanced workflow system

33

Three user-defined workflows from different views
Question: “Are gene genealogies for ATP subunits α, β, and γ different?”

[Figure: three workflow diagrams]
• Workflow A, defined by a less experienced user using the functional definition of services: Retrieving, Aligning
• Workflow B, defined by an intermediate user with executable services: queryGene, clustalW
• Workflow C, defined by an expert user with two extra executable services to ensure the accurate output of the biological process: queryGene, setIds, setFilter, clustalW

34

Enhanced workflow system

[Figure: components of the enhanced workflow system, built around MoGServ]
• User: creates an abstract workflow using the ontology
• Service annotator: annotates services using the ontology
• Ontology and OWL-DL reasoner
• Semantics-enabled service registry
• Semantics-enabled service discovery and service matchmaking: find appropriate services
• Workflow composer (software agent/experienced users): produces the concrete workflow
• Workflow execution engine
• Data provenance management: collects and manages information about data origination
• Knowledge base management and knowledge discovery

35

Our hierarchical workflow structure

• Abstract workflow (Task A, Task B)
  • Encode and convert the high-level definition to a low-level executable
• Concrete workflow (Service A, Service B, Service C, Service D)
  • Replace individual services with their optimal alternatives
• Optimal workflow (Service A, Service B, Service C’, Service D)
  • Invoke a workflow with specific input data and record the data provenance and the performance of services and workflows
• Workflow instance (input, Service A, Service B, Service C’, Service D, output)

Pegasus workflow structure

• Abstract workflow: FFT filea
• Concrete workflow
  • Data transfer: move filea from host1://home/filea to host2://home/file1
  • /usr/local/bin/fft /home/file1
  • Data registration

36

Reusable knowledge

• Connectivity
  • Helps to convert from abstract workflow to concrete workflow
• Alternative services
  • Helps to convert from concrete workflow to optimal workflow
• Quality profile of services
  • Helps discover optimal workflows
• Mapping of abstract workflow and concrete workflow
  • Helps to choose reusable workflows
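
As a rough illustration of how this reusable knowledge could drive the conversions between workflow levels, the sketch below maps abstract tasks to services and then swaps in higher-quality alternatives. The function names, the single numeric quality score, and the service name `clustalW_mirror` are assumptions for the example; the slides do not specify how quality profiles are represented.

```python
def to_concrete(abstract_tasks, task_to_service):
    """Abstract -> concrete: bind each task to a known executable service."""
    return [task_to_service[task] for task in abstract_tasks]


def to_optimal(concrete_services, alternatives, quality):
    """Concrete -> optimal: replace a service when an alternative scores higher."""
    optimal = []
    for service in concrete_services:
        candidates = [service] + list(alternatives.get(service, []))
        # max() keeps the original service on ties because it is listed first.
        optimal.append(max(candidates, key=lambda s: quality.get(s, 0.0)))
    return optimal


# Hypothetical knowledge-base contents.
abstract = ["retrieving", "aligning"]
mapping = {"retrieving": "queryGene", "aligning": "clustalW"}
alternatives = {"clustalW": ["clustalW_mirror"]}
quality = {"queryGene": 0.9, "clustalW": 0.6, "clustalW_mirror": 0.8}

concrete = to_concrete(abstract, mapping)
optimal = to_optimal(concrete, alternatives, quality)
```

Keeping the two conversions separate mirrors the hierarchy on the previous slide: the task-to-service mapping can be reused even when the quality profile changes.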

37

Connectivity identification (match detection)

Service: QueryLocal
  Operation: createSet
  performTask: mygrid:retrieving
  inputPara: Settype(String, mog:gene), Queryterm(String, null)
  outputPara: Setid(string, mog:geneset)
  useResource: MoG

Service: ClustalW
  Operation: runClustalWdf
  performTask: mygrid:aligning
  inputPara: Setid(String, mog:set), Sequencetype(String, mog:sequence)
  outputPara: filen(string, mygrid:sequence_alignment_report)
  useResource: EBI

Service: FormatConversion
  Operation: convert
  performTask: mygrid:translating
  inputPara: filen(String, mygrid:sequence_alignment_report)
  outputPara: Out(String, mygrid:nexus_paup_format)
  useResource: MoG

Parameter notation: name(data type, semantic type)

Matching rule: operation_ij → operation_mn if there exists a parameter_k that is an output parameter of operation_ij and a parameter_o that is an input parameter of operation_mn such that
  data type(parameter_o) = data type(parameter_k) and
  semantic type(parameter_o) = semantic type(parameter_k)
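
The matching rule can be written down directly. The sketch below is one possible reading, with two assumptions that go beyond the slide: data types are compared case-insensitively (the slide writes both `String` and `string`), and a parameter with no semantic type never matches. The class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Parameter:
    name: str
    data_type: str
    semantic_type: Optional[str]


@dataclass(frozen=True)
class Operation:
    service: str
    name: str
    inputs: tuple
    outputs: tuple


def matches(producer, consumer):
    """Return (output, input) pairs that satisfy the matching rule."""
    return [
        (out, inp)
        for out in producer.outputs
        for inp in consumer.inputs
        if out.semantic_type is not None            # assumption: unannotated never matches
        and out.data_type.lower() == inp.data_type.lower()
        and out.semantic_type == inp.semantic_type  # strict equality, no subsumption
    ]


# Annotations copied from the slide.
clustalw = Operation(
    "ClustalW", "runClustalWdf",
    inputs=(Parameter("Setid", "String", "mog:set"),
            Parameter("Sequencetype", "String", "mog:sequence")),
    outputs=(Parameter("filen", "string", "mygrid:sequence_alignment_report"),),
)
convert = Operation(
    "FormatConversion", "convert",
    inputs=(Parameter("filen", "String", "mygrid:sequence_alignment_report"),),
    outputs=(Parameter("Out", "String", "mygrid:nexus_paup_format"),),
)
create_set = Operation(
    "QueryLocal", "createSet",
    inputs=(Parameter("Settype", "String", "mog:gene"),
            Parameter("Queryterm", "String", None)),
    outputs=(Parameter("Setid", "string", "mog:geneset"),),
)
```

Under strict equality, `runClustalWdf` connects to `convert`, while `createSet` (output `mog:geneset`) does not connect to `runClustalWdf` (input `mog:set`). Whether the original system treats one of these types as a subtype of the other is not stated in the slides.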

38

Need for verified service connectivity: the mismatching problem

[Figure: match detection output compared with the real match]

                         Real match: Yes   Real match: No
  Match detected: Yes    TP                FP
  Match detected: No     FN                TN

• TP, TN: accurate annotation
• FP, FN: inaccurate annotation, lack of semantic annotation, inaccurate reasoning

Example connections that fail (marked X in the figure):
• GenBankService (out: GenBank record) to Blastp (in: protein sequence): requires a mediator, adaptor, or shim
• DDBJ-XML (out: sequence data record) to NCBI blast (in: sequence data record): fasta format versus self-defined format

Some mismatches can be detected automatically; others may be detected by experts at design time or after a run.

39

Connectivity Graph Implementation

• Registration process: automatically identify the connectivity against the registry and store it in the knowledge base
• Workflow translation / service composition process: refine, update, and decompose the workflow
• connect(service_a, operation_ai, parameter_c, service_b, operation_bi, parameter_d)
• identifyConnect(single service, RDF repository)
• Search at the syntactic level
  • Search for a path between two nodes
  • Search for the next available service
  • Automatic composition based on input and output
• Implementation: Dijkstra’s shortest-path algorithm
• Finding the connectivity between services is converted to finding a path between two nodes in a graph
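
A minimal sketch of the path search, assuming the connectivity graph is stored as an adjacency mapping and every connection has unit cost (the slides do not say how edges are weighted). The node names and edges below are illustrative, not the contents of the real knowledge base.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm with unit edge weights.

    graph maps a node to an iterable of successor nodes.
    Returns the list of nodes from source to target, or None if unreachable.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in graph.get(node, ()):
            nd = d + 1
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Illustrative connectivity between operations.
connectivity = {
    "QueryLocal.createSet": ["ClustalW.runClustalWdf"],
    "ClustalW.runClustalWdf": ["FormatConversion.convert"],
}
path = shortest_path(connectivity, "QueryLocal.createSet", "FormatConversion.convert")
```

With unit weights a breadth-first search would give the same result; Dijkstra's algorithm becomes useful if edges are later weighted, for example by a service quality profile.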

40

Ontological modules used for semantic description of data, services & workflows

• Generic Service Description Ontology (myGrid/Feta model)
• Service Domain Ontology (myGrid)
• MoGServ Application Domain Ontology (MoGServ)

[Figure: the ontologies describe data, services, and workflows; software components for annotation; RDF store]

41

MoGServ Application Domain Ontology

• To better track the data origination
• To support the automation of workflow creation
• To better share the data on the web in the future

Example concepts and properties defined in MoGServ

  Property       Domain   Range
  invokedby      Job      User
  isParentOf     Set      Set
  isInstanceOf   Job      Service
  hasSetName     Set      XML:String

Ontological modules

  Module              Concepts   Object properties   Datatype properties
  MoGServ             12         9                   7
  myGrid              419        8 (only one property count appears in the source)
  myGrid/Feta model   26         11                  17

42

Sample service/workflow annotation

Question: Which service has an operation that accepts nucleotide_sequence as a parameter?

Answer:
  Uri: http://www.ebi.ac.uk …/alignment:blastn_ncbi
  OperationName: Run

[Figure: annotation graph displayed by RDF-Gravity]

43

Implementation of annotation and query components for data, services & workflows

• Sesame 1.2.6 library
  • Supports files, RDBMS, SeRQL
• Sesame RDF store
• Annotation components
  • Annotation templates (data)
  • Annotation templates (service)
• Query components
  • Query templates

Example SeRQL query:

  Select Y, W, X
  from {Y} mg:hasOperation {W} mg:inputParameter {X} rdf:type {mog:set}
  using namespace
    rdf = <http://www.w3.org/1999/02/22-rdf-syntax-ns#>,
    mg = <http://www.mygrid.org.uk/ontology#>,
    mog = <http://almond.cse.nd.edu:10000/mog#>

Result:

  Service: http:host.cse.nd.edu/axis/services/ClustalW?wsdl
  Operation: runClustalWdf
  inputParameter: setid

44

Experiment

• Used 418 concepts from the domain ontology for semantic type; defined 10 concepts for data type
• Randomly generated service annotations: 1 input, 1 output
• Intel Pentium mobile, 1.5 GHz

  Number of services   Number of matched pairs   Load RDF repository (ms)   Average match detection time per service (ms)
  200                  10                        1547                       12.02
  400                  34                        2346                       13.01
  600                  84                        2600                       12.31
  800                  138                       3015                       12.35
  1000                 225                       3325                       12.51

Connectivity graph for 1000 services (figure, right side)

  Number of nodes                      724
  Number of arcs                       587
  Average path search time (ms)        less than 1
  Connectivity graph load time (ms)    220

Paths by length: length 0 = 724, length 1 = 587, length 2 = 448, length 3 = 281, length 4 = 114, length 5 = 71, length 6 = 28, length 7 = 16, length 8 = 4, length 9 = 2

Conclusion: feasible solution.

45

Reuse of workflows

• Reuse of abstract workflows
• Reuse of concrete workflows
• Compare the structural similarity of two workflows
• Implementation: SUBDUE algorithm
  • SUBDUE has a graph match utility that is part of its data mining system
  • A given workflow is converted to a graph and fed to the SUBDUE match algorithm

Abstract example …

Graph view: two task vertices linked by hasNext. The first task has an input (hasInput) with parameter query_term (hasParameter) and performs retrieving (performTask). The second task has an output (hasOutput) with parameter multiple_alignment_report (hasParameter) and performs aligning (performTask).

SUBDUE input format:

  v 1 input
  v 2 output
  v 3 task
  v 4 task
  v 5 query_term
  v 6 retrieving
  v 7 aligning
  v 8 multiple_alignment_report
  e 3 4 hasNext
  e 3 1 hasInput
  e 4 2 hasOutput
  e 3 6 performTask
  e 4 7 performTask
  e 1 5 hasParameter
  e 2 8 hasParameter
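
The conversion from a workflow graph to the `v`/`e` text listing shown on this slide can be sketched as below. The function name and the short vertex keys are hypothetical; only the output layout follows the slide, and the example is modeled on the abstract workflow above.

```python
def to_subdue(vertices, edges):
    """Serialize a labeled graph as 'v <id> <label>' and 'e <src> <dst> <label>' lines.

    vertices: list of (key, label) pairs; ids are assigned in order, starting at 1.
    edges: list of (source key, target key, label) triples.
    """
    ids = {}
    lines = []
    for number, (key, label) in enumerate(vertices, start=1):
        ids[key] = number
        lines.append(f"v {number} {label}")
    for src, dst, label in edges:
        lines.append(f"e {ids[src]} {ids[dst]} {label}")
    return "\n".join(lines)


# Keys are needed because two vertices share the label "task".
vertices = [
    ("in", "input"), ("out", "output"),
    ("t1", "task"), ("t2", "task"),
    ("q", "query_term"), ("r", "retrieving"),
    ("a", "aligning"), ("rep", "multiple_alignment_report"),
]
edges = [
    ("t1", "t2", "hasNext"),
    ("t1", "in", "hasInput"),
    ("t2", "out", "hasOutput"),
    ("t1", "r", "performTask"),
    ("t2", "a", "performTask"),
    ("in", "q", "hasParameter"),
    ("out", "rep", "hasParameter"),
]
subdue_text = to_subdue(vertices, edges)
```

Two workflows serialized this way can then be handed to a graph matcher to compare their structure, which is the role the slide assigns to SUBDUE.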

46

Conclusion

• Pro
  • Increase the correctness of the formed workflow over time
    • Avoid incorrect, inaccurate semantic annotations
    • Take advantage of verified knowledge
    • Avoid the ontological reasoning process
  • Better support for semi-automated and automated service composition over time
  • Provide more accurate guidelines to users over time
• Con
  • The connectivity graph can be big
    • Number of parameters
    • Number of services
  • Searching the connectivity of a service when it is registered in the system may take a relatively long time
    • More complex matching rule
    • Number of parameters
  • May not have high accuracy at the beginning

47

Future work

• Integrate GridSAM into MoGServ for execution and monitoring
• Integrate Grid computing technology for resource allocation
• Refine the MoGServ application domain ontology
• Create an interface for end-user workflow creation
• Create an interface for individual workspaces
• Evaluate the scalability and accuracy of the connectivity graph approach and the graph matching approach with a large number of real workflows and services

48

Thank you

Questions?