Post on 12-Jan-2016
Towards a Provenance Architecture
Karen Schuchardt
PNNL
Kepler Provenance Meeting Jan 05
Outline
Past and Present Work
Use Cases
Thoughts on Workflow Provenance and Architectures
Past and Present Provenance Work
Ecce Chemistry Environment (mid 90s)
Electronic Laboratory Notebooks (late 90s)
Collaboratory for Multi-Scale Chemical Science (CMCS) (2000)
Scientific Annotation Middleware (2000)
Towards a Semantic Data Grid for Systems Science (2004-2006)
Ecce Chemistry Environment
Chemistry-based calculation workflow
Provenance captured as user performs actions
  W’s (who, what, when)
  Job submission/status info
  Relationships (XLinks) between calculations, outputs, inputs, etc.
  Linkbase for molecular dynamics multi-step processes
WebDAV-based server captures all inputs, outputs, and metadata
Provenance used to
  provide at-a-glance summary of work performed
  duplicate and rerun
  search
  bind rules based on types and relationships
Electronic Laboratory Notebooks
Hierarchical, chronological chapters/pages/notes
File upload, sketch, text, equations, forms, image capture, …
Add/View/Search Notes
Records functionality:
  Non-repudiation - digital signatures and timestamps
  Persistence/completeness - write-once/no deletions/audit trail
  Standardized lifecycle - signing/witnessing policies, archiving, retention schedules, …
Now based on WebDAV
Provenance
  Structure of notebook
  Records data
  Mimetype-based functionality
Collaboratory for Multi-Scale Chemical Sciences (CMCS)
Dublin Core for basic pedigree: title, creator, dates, publisher, is-referenced-by, references, replaces, is-replaced-by, has-version
  Dublin Core Element Set and Qualified Dublin Core
  Both XML and RDF to encode metadata values
  Use of XLink to express values of relationships
CMCS properties for chemical science to enable searching: species name, CAS, chemical properties, and chemical formula
CMCS properties for defining scientific data: has-inputs, has-outputs, and is-part-of-project
CMCS properties for scientific publication and peer review annotations: is-sanctioned-by
Flexible infrastructure for addition of new metadata. As new metadata is added to the infrastructure, current apps will not break!
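The claim that current apps do not break when new metadata is added can be illustrated with a minimal sketch. It assumes metadata is held as open-ended property/value pairs; the "dc:" and "cmcs:" prefixes, the values, and the example.org URLs are hypothetical and are not the actual CMCS schema.

```python
# Hypothetical metadata record. "dc:" keys follow Dublin Core element names and
# "cmcs:" keys mirror the CMCS properties named on the slide; prefixes, values,
# and URLs are illustrative assumptions only.
record = {
    "dc:title": "Thermochemistry of OH",
    "dc:creator": "A. Researcher",
    "dc:date": "2004-11-02",
    "cmcs:has-inputs": ["http://example.org/data/input-1"],
    "cmcs:has-outputs": ["http://example.org/data/output-1"],
    "cmcs:is-part-of-project": "http://example.org/projects/demo",
}


def summarize(rec):
    """An existing application: it reads only the properties it knows about."""
    known = ("dc:title", "dc:creator", "dc:date")
    return {key: rec[key] for key in known if key in rec}


before = summarize(record)

# New metadata is added later, for example a peer-review annotation.
record["cmcs:is-sanctioned-by"] = "http://example.org/reviewers/panel-1"

after = summarize(record)
print(before == after)  # the existing application sees the same view
```

Because the application selects properties by name and ignores the rest, adding a property changes nothing for it.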
Scientific Annotation Middleware
Provides a node plus metadata/relationship view of underlying data sources
Support put/get/search/access control of arbitrary data/metadata
Configurable metadata extraction from binary/ASCII/XML files
Configurable data translation
Semantic/graph queries
RDF export
Notebook services (page display, signatures, timestamps, …)
Pluggable security
The direct connection between metadata and resource limits its use as a next-generation provenance store
Towards a Semantic Data Grid
Explore frameworks for advanced model-driven data integration capabilities
  Seamlessly integrate files, databases
  Automated scientific workflow mechanisms
  Capture, represent, and disseminate knowledge
  Identify changes via discovery mechanisms
Internally funded 2-year project
Towards a Semantic Data Grid
What proteins in my organism(s) are both predicted and shown by experiment to interact with E. coli?
Resources required
  Microarray spreadsheets
  NCBI data services
  BIND database
  DIP database
  Work-group specific databases
  Other data services
  Extraction
  Translation
  Merging
  HPC services
  Public Web services
  Discovery
Use Case - Personal Records
Capturing and organizing the display of provenance simplifies the job of keeping track of activities performed over the course of a long research process.
Example: A bioinformaticist performs data integration/analysis for many diverse projects. After 6 months, he/she can’t remember what a particular result pertained to or how it was generated.
Use Case - Verifiability
Data generated from instruments/experiments undergoes numerous automatic processes before becoming available to researcher(s).
Example: Data from high-throughput biology experiments runs through several automated, and in some cases manual, processes before it becomes available to the bioinformaticist. The bioinformaticist often does not trust the data. They want to know who created it, what was done to it, when it was generated….
Use Case - Applicability
Increasingly, research problems span disciplines or scales. Though data needs to move across these boundaries, doing so is often a manual process involving personal communications.
Example: In the combustion multi-scale research environment, data generated at one scale (e.g. thermochemical data) serves as input to successive scales (e.g. mechanisms). But it’s not that simple: we must be able to determine the applicability of available data. Are the theoretical underpinnings under which it was generated consistent with the intended use?
Use Case - Best Practices
By capturing and providing access to provenance of prior work, best practices can be shared.
Example: This is a little bit hypothetical but… best practices can be shared by sharing workflow definitions or by viewing provenance (and inputs) from instances of workflows.
Types of Provenance in Workflow Environment
Interaction Provenance
  Data that moves between services
State Provenance
  Data known only to the actor itself
Observable Provenance
  Start/completion times
  Error detection
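As a rough illustration of the three types, the sketch below wraps a single workflow step and records one entry of each kind. All names (ProvenanceKind, ProvenanceRecord, run_actor, normalize) are hypothetical and chosen for this example; none of them comes from Kepler or any other workflow engine.

```python
from dataclasses import dataclass, field
from enum import Enum
from time import time


class ProvenanceKind(Enum):
    INTERACTION = "interaction"  # data that moves between services
    STATE = "state"              # data known only to the actor itself
    OBSERVABLE = "observable"    # start/completion times, error detection


@dataclass
class ProvenanceRecord:
    kind: ProvenanceKind
    actor: str
    content: dict
    recorded_at: float = field(default_factory=time)


def run_actor(name, func, payload, log):
    """Run one step and append interaction, state, and observable records."""

    def report_state(**details):
        # Only the actor can supply this information, so it gets a callback.
        log.append(ProvenanceRecord(ProvenanceKind.STATE, name, details))

    started, result, error = time(), None, None
    try:
        result = func(payload, report_state)
    except Exception as exc:  # error detection is part of observable provenance
        error = repr(exc)
    finished = time()

    log.append(ProvenanceRecord(
        ProvenanceKind.INTERACTION, name, {"input": payload, "output": result}))
    log.append(ProvenanceRecord(
        ProvenanceKind.OBSERVABLE, name,
        {"started": started, "finished": finished, "error": error}))
    return result


def normalize(values, report_state):
    """Example actor: the scale factor is internal state unless reported."""
    peak = max(values)
    report_state(scale_factor=peak)
    return [v / peak for v in values]


log = []
out = run_actor("normalize", normalize, [2.0, 4.0], log)
print(out, [record.kind.value for record in log])
```

Interaction and observable provenance can be captured from outside the actor, while state provenance depends on the actor choosing to report it.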
Other Provenance
Other applications will record data
  Pedigree/Provenance
  Experiment metadata
  Project organization
  Categorization
  Detected features
  Instrument logs
  Digital signatures
  Endorsements
  Community annotations
  Other workflow engines
Logical Architecture
[Figure: logical architecture diagram. Extracted from escience Strawman - Moreau. Components shown:]
Provenance Store(s)
  Query Interface
  Submission Interface
Client Query Library
Client Submission Library
User Recording Tools
  Portlets
  Annotator
  Notebooks
  Science Applications
Experiment Services
  Workflow engine
  Domain specific services
Presentation Services
  Visualizer/Browser
  Difference Visualizer
  Workflow construction
Processing Services
  Difference Analyzer
  Quality Analyzer
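One way to read the diagram is that recording clients depend only on a submission interface and analysis or presentation services depend only on a query interface, with the store behind both. The sketch below assumes that reading; the class and function names and the shape of an assertion are hypothetical and are not taken from the strawman.

```python
from abc import ABC, abstractmethod


class SubmissionInterface(ABC):
    @abstractmethod
    def submit(self, assertion: dict) -> str:
        """Store one provenance assertion and return its identifier."""


class QueryInterface(ABC):
    @abstractmethod
    def query(self, **criteria) -> list:
        """Return the assertions whose fields equal all given criteria."""


class InMemoryProvenanceStore(SubmissionInterface, QueryInterface):
    """Toy store for illustration; a real store would be persistent."""

    def __init__(self):
        self._assertions = {}

    def submit(self, assertion):
        assertion_id = f"assertion-{len(self._assertions) + 1}"
        self._assertions[assertion_id] = dict(assertion)
        return assertion_id

    def query(self, **criteria):
        return [a for a in self._assertions.values()
                if all(a.get(k) == v for k, v in criteria.items())]


def record_step(store: SubmissionInterface, source, activity):
    """Recording side (workflow engine, notebook, annotator, ...)."""
    return store.submit({"source": source, "activity": activity})


def difference(store: QueryInterface, source_a, source_b):
    """Processing side: a very small 'difference analyzer'."""
    a = {x["activity"] for x in store.query(source=source_a)}
    b = {x["activity"] for x in store.query(source=source_b)}
    return a ^ b  # activities recorded by exactly one of the two sources


store = InMemoryProvenanceStore()
record_step(store, "workflow-engine", "run simulation")
record_step(store, "notebook", "run simulation")
record_step(store, "notebook", "annotate result")
print(difference(store, "workflow-engine", "notebook"))
```

Keeping the two interfaces separate means a recording tool never needs query access, and an analyzer never needs write access.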
Components of Physical Architecture
One or more RDF triple stores
Global naming service
Arbitrary data stores for data referenced by the provenance
Security services (pluggable for scalability)
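How the first three components might fit together can be sketched with stand-ins: a tiny in-memory triple set in place of an RDF triple store, UUID-based URNs in place of a global naming service, and a dictionary in place of an external data store. These substitutions, the predicate names, and the sample data are assumptions for illustration; security services are left out.

```python
import uuid


class TripleStore:
    """Minimal in-memory stand-in for an RDF triple store."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return [t for t in self._triples
                if (subject is None or t[0] == subject)
                and (predicate is None or t[1] == predicate)
                and (obj is None or t[2] == obj)]


def mint_name():
    """Stand-in for a global naming service: returns a unique URN."""
    return uuid.uuid4().urn


# Stand-in for an arbitrary data store; the provenance only references it.
data_store = {}

input_name, output_name = mint_name(), mint_name()
data_store[input_name] = b"raw instrument readings"
data_store[output_name] = b"cleaned readings"

triples = TripleStore()
triples.add(output_name, "derived-from", input_name)
triples.add(output_name, "created-by", "cleaning-service")

# Follow the provenance link, then dereference the name in the data store.
(_, _, source), = triples.match(subject=output_name, predicate="derived-from")
print(data_store[source])
```

The provenance store holds only names and relationships, so the data itself can live in whichever store suits it.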
Workflow and Provenance
Requires binding to provenance service
Need mechanism to associate provenance from workflow instance
  Id? Links?
Requires communication of service information or other mechanism for actors to contribute state provenance
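A minimal sketch of the "Id?" option: the engine mints an identifier per workflow run, binds to a provenance service, and hands each actor a context through which it can contribute state provenance. ProvenanceService, ActorContext, and run_workflow are hypothetical names for this example, and the link-based alternative is not shown.

```python
import uuid


class ProvenanceService:
    """Hypothetical provenance service; keeps records in a list."""

    def __init__(self):
        self.records = []

    def record(self, instance_id, actor, kind, details):
        self.records.append({"workflow_instance": instance_id, "actor": actor,
                             "kind": kind, "details": details})

    def for_instance(self, instance_id):
        return [r for r in self.records
                if r["workflow_instance"] == instance_id]


class ActorContext:
    """Handed to each actor so it can contribute state provenance."""

    def __init__(self, service, instance_id, actor):
        self._service = service
        self._instance_id = instance_id
        self._actor = actor

    def contribute_state(self, **details):
        self._service.record(self._instance_id, self._actor, "state", details)


def run_workflow(actors, data, service):
    """Run actors in sequence; every record carries the same instance id."""
    instance_id = str(uuid.uuid4())
    for name, func in actors:
        data = func(data, ActorContext(service, instance_id, name))
    return instance_id, data


def scale(values, ctx):
    factor = 10
    ctx.contribute_state(factor=factor)
    return [v * factor for v in values]


def total(values, ctx):
    ctx.contribute_state(method="sum")
    return sum(values)


service = ProvenanceService()
first_id, first = run_workflow([("scale", scale), ("total", total)], [1, 2, 3], service)
second_id, _ = run_workflow([("total", total)], [4, 5], service)
print(first, len(service.for_instance(first_id)))
```

Passing the context explicitly keeps actors unaware of how the service is located, which is one possible answer to the "communication of service information" requirement.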
Summary
We’ve done a lot of work on provenance but see value in moving to a more flexible architecture
Workflow engines are just one component that can contribute to the provenance of research results
Provenance capture should be thought of as a cross-cutting technology
Models for provenance need to be flexible, allowing arbitrary content
Provenance services need to be scalable
  low-footprint usages for individual applications
  large experimental facilities
  virtual organizations