Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher...

75
Semantic Technologies: Towards Semantic Technologies: Towards Making a Difference in Scientific Making a Difference in Scientific Data Management Data Management Bertram Lud Bertram Lud ä ä scher scher ([email protected]) ([email protected]) San Diego Supercomputer Center Associate Professor Associate Professor Dept. of Computer Science & Genome Center Dept. of Computer Science & Genome Center University of California, Davis University of California, Davis Fellow Fellow San Diego Supercomputer Center San Diego Supercomputer Center University of California, San Diego University of California, San Diego UC DAVIS Department of Computer Science

Transcript of Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher...

Page 1: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

Semantic Technologies: Towards Semantic Technologies: Towards

Making a Difference in Scientific Data Making a Difference in Scientific Data

Management Management

Bertram Bertram LudLudääscherscher

([email protected])([email protected])

San Diego Supercomputer Center

Associate ProfessorAssociate ProfessorDept. of Computer Science & Genome Dept. of Computer Science & Genome

CenterCenterUniversity of California, DavisUniversity of California, Davis

FellowFellow

San Diego Supercomputer CenterSan Diego Supercomputer CenterUniversity of California, San DiegoUniversity of California, San Diego

UC DAVISDepartment ofComputer Science

Page 2: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

2DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

OutlineOutline

• Semantics & Scientific Data IntegrationSemantics & Scientific Data Integration

• Semantics & Scientific Workflow ManagementSemantics & Scientific Workflow Management

• ConclusionsConclusions

Page 3: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

3DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Anatomy of the Science Anatomy of the Science Environment for Ecological Environment for Ecological

Knowledge (SEEK) Collaboratory Knowledge (SEEK) Collaboratory • Domain Science DriverDomain Science Driver

– Ecology (LTER), biodiversity, …

• Analysis & Modeling SystemAnalysis & Modeling System– Design & execution of ecological

models & analysis (“scientific workflows”)

– {application,upper}-ware Kepler system

• Semantic Mediation SystemSemantic Mediation System– Data Integration of hard-to-relate

sources and processes– Semantic Types and Ontologies– upper middleware Sparrow Toolkit

• EcoGridEcoGrid – Access to ecology data and tools– {middle,under}-ware unified API to SRB/MCAT,

MetaCat, DiGIR, … datasetssample CS problem [DILS’04]

Page 4: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

4DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Common Collaboratories / Distributed Common Collaboratories / Distributed Science / Cyberinfrastructure PiecesScience / Cyberinfrastructure Pieces

• Seamless and uniform data access (“Data-Grid”)Seamless and uniform data access (“Data-Grid”)– data & metadata registry

• distributed and high performance computing distributed and high performance computing platform (“Compute-Grid”)platform (“Compute-Grid”)– service registry

• User-friendly workbench / problem-solving User-friendly workbench / problem-solving environmentenvironment– scientific workflow system

• A common problem:A common problem:– integrating (or at least linking) data from multiple sites,

investigators, communities, …, scales, …, species, …

Federated, integrated, mediated databasesFederated, integrated, mediated databases– often use of semantic extensions (e.g. ontologies)

Page 5: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

5DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Interoperability & Integration Interoperability & Integration ChallengesChallenges

• System aspects: “Grid” Middleware• distributed data & computing, SOA• web services, WSDL/SOAP, WSRF, OGSA, …• sources = functions, files, data sets …

• Syntax & Structure: (XML-Based) Data Mediators

• wrapping, restructuring • (XML) queries and views• sources = (XML) databases

• Semantics: Model-Based/Semantic Mediators

• conceptual models and declarative views • Knowledge Representation: ontologies,

description logics (RDF(S),OWL ...)• sources = knowledge bases (DB+CMs+ICs)

• Synthesis: Scientific Workflow Design & Execution

• Composition of declarative and procedural components into larger workflows

• (re)sources = services, processes, actors, … • Semantic extensions needed here as well!

reconciling reconciling SS55 heterogeneitiesheterogeneities

““gluing” together gluing” together resources resources

bridging information and bridging information and knowledge gaps knowledge gaps computationallycomputationally

Page 6: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

6DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Information Integration Challenges: Information Integration Challenges: SS44 Heterogeneities Heterogeneities• SSystemystem aspects aspects

– platforms, devices, data & service distribution, APIs, protocols, … Grid middleware technologies + e.g. single sign-on, platform independence, transparent use of remote

resources, …

• SSyntaxyntax & & SStructuretructure– heterogeneous data formats (one for each tool ...)– heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files,

…) – heterogeneous schemas (one for each DB ...) Database mediation and warehousing technologies+ XML-based data exchange, integrated views, transparent query

rewriting, …

• SSemanticsemantics– descriptive metadata, different terminologies, implicit assumptions &

hidden semantics (“context”) of experiments, simulations, observation, …

Knowledge representation & semantic mediation technologies+ “smart” data discovery & integration+ e.g. ask about X (‘mafic’); find data about Y (‘diorite’); be happy

anyways!

Page 7: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

7DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Information Integration Challenges: Information Integration Challenges: SS55 Heterogeneities Heterogeneities

• SSynthesisynthesis of applications, analysis tools, data & of applications, analysis tools, data & query components, … into “query components, … into “scientific workflowsscientific workflows” ” – How to make use of these wonderful things & put them

together to solve a scientist’s problem? Scientific Problem Solving Environments (PSEs)Scientific Problem Solving Environments (PSEs)

Portals,Workbench (“scientist’s view”, end user)+ ontology-enhanced data registration, discovery, manipulation+ creation and registration of new data products from existing

ones, … Scientific Workflow System (“engineer’s view”, tool

maker)+ for designing, re-engineering, deploying analysis pipelines

and scientific workflows; a tool to make new tools … + e.g., creation of new datasets from existing ones, dataset

registration, …

Not discussed here: the “6th S”: Social challenges …

Page 8: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

8DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Our FocusOur Focus• Scientific Data Integration: Scientific Data Integration:

– need DB/DI + KR (“semantic mediation”)

• Automation of Scientific Data Analysis, Automation of Scientific Data Analysis, Process & Application IntegrationProcess & Application Integration– need for scientific workflow systems– need for semantic extensions

• But first:But first:– Some data & information integration

problems

Page 9: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

9DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

An Online Shopper’s Information Integration An Online Shopper’s Information Integration ProblemProblem

El Cheapo: “Where can I get the cheapest copy (including shipping cost) of El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logicus-Philosophicus within a week?” Wittgenstein’s Tractatus Logicus-Philosophicus within a week?”

??Information Information IntegrationIntegration

addall.comaddall.com

“One-World”Mediation

amazon.comamazon.comamazon.comamazon.com A1books.comA1books.comA1books.comA1books.comhalf.comhalf.comhalf.comhalf.combarnes&noble.combarnes&noble.combarnes&noble.combarnes&noble.com

Mediator (virtual DB)Mediator (virtual DB)(vs. Datawarehouse)(vs. Datawarehouse)NOTE: non-trivial NOTE: non-trivial

data engineering challenges!data engineering challenges!

Page 10: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

10DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

A Home Buyer’s Information Integration A Home Buyer’s Information Integration ProblemProblem

What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood

with below-average crime rate and diverse population?

?Information Integration

RealtorRealtor DemographicsDemographicsSchool RankingsSchool RankingsCrime StatsCrime Stats

“Multiple-Worlds”Mediation

Page 11: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

11DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Information Integration from a Information Integration from a Database Perspective Database Perspective

• Information Integration ProblemInformation Integration Problem– Given: data sources S1, ..., Sk (databases, web sites, ...)

and user questions Q1,..., Qn that can –in principle– be answered using the information in the Si

– Find: the answers to Q1, ..., Qn

• The Database Perspective: The Database Perspective: source = source = “database”“database” Si has a schema (relational, XML, OO, ...) Si can be queried define virtual (or materialized) integrated (or

global) view G over local sources S1 ,..., Sk using database query languages (SQL, XQuery,...)

questions become queries Qi against G(S1,..., Sk)

Page 12: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

12DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

5. Post processing5. Post processing2. Query rewriting2. Query rewriting

Standard Mediator ArchitectureStandard Mediator Architecture

MEDIATORMEDIATOR

Integrated Global(XML) View G

Integrated ViewDefinition

G(..) S1(..)…Sk(..)

USER/ClientUSER/Client

1. Query Q ( G (S1. Query Q ( G (S11,..., S,..., Skk) )) )

S1

Wrapper

(XML) View

S2

Wrapper

(XML) View

Sk

Wrapper

(XML) Viewweb services as wrapper APIs

3. Q1 Q2 Q33. Q1 Q2 Q3

4. {answers(Q1)} {answers(Q2)} {answers(Q3)}4. {answers(Q1)} {answers(Q2)} {answers(Q3)}

6. {answers(Q)}6. {answers(Q)}

Page 13: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

13DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Query Planning in Data Query Planning in Data IntegrationIntegration

• GivenGiven: : – Declarative user query Q: answer(…) …G ...– … & { G … S … } global-as-view (GAV)– … & { S … G … } local-as-view (LAV)– … & { ic(…) … S … G… } integrity constraints (ICs)

• FindFind: : – equivalent (or minimal containing, maximal contained) query plan Q’: answer(…) … S …

query rewriting (logical/calculus, algebraic, physical levels)

• ResultsResults::– A variety of results/algorithms; depending on classes of

queries, views, and ICs: P, NP, … , undecidable– hot research area in core CS (database community)

Page 14: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

14DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

A Neuroscientist’s A Neuroscientist’s Information Integration Information Integration

ProblemProblemWhat is the cerebellar distribution of rat proteins with more than

70% homology with human NCS-1? Any structure specificity?How about other rodents?

?Information Integration

protein localization(NCMIR)

neurotransmission(SENSELAB)

sequence info(CaPROT)

morphometry(SYNAPSE)

“Complex Multiple-Worlds”

Mediation

Biomedical InformaticsBiomedical InformaticsResearch NetworkResearch Networkhttp://nbirn.nethttp://nbirn.net

Inter-source links:• unclear for the non-scientists• hard for the scientist

Page 15: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

15DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Page 16: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

16DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Scientific Data Integration Scientific Data Integration using using

Semantic ExtensionsSemantic Extensions

Page 17: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

17DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Example: Geologic Map Example: Geologic Map IntegrationIntegration

• GivenGiven: : – Geologic maps from different state geological

surveys (shapefiles w/ different data schemas)– Different ontologies:

• Geologic age ontology (e.g. USGS)• Rock classification ontologies:

– Multiple hierarchies (chemical, fabric, texture, genesis) from Geological Survey of Canada (GSC)

– Single hierarchy from British Geological Survey (BGS)

• ProblemProblem::– Support uniform queries across all maps – … possibly using different ontologies– Support registration w/ ontology A, querying

w/ ontology B

Page 18: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

18DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

IntegrationSchema

Schema Integration Schema Integration (“registering” (“registering” local schemas to the global schema)local schemas to the global schema)

Arizona

Colorado

Utah

Nevada

Wyoming

New Mexico

Montana E.

Idaho

Montana West

FormationFormation ……

AgeAge ……

FormationFormation ……

AgeAge ……

FormationFormation ……

AgeAge ……

FormationFormation ……

AgeAge ……

FormationFormation ……

AgeAge ……

FormationFormation ……

AgeAge ……

FormationFormation ……

AgeAge ……

…… FormationFormation

…… AgeAge

…… CompositionComposition

…… FabricFabric

…… TextureTexture

…… FormationFormation

…… AgeAge

…… CompositionComposition

…… FabricFabric

…… TextureTexture

ABBREV

PERIOD

PERIOD

NAME

PERIOD

TYPE

TIME_UNIT

FMATN

PERIOD

NAME

PERIOD

NAME

FORMATION

PERIOD

FORMATION

FORMATION

LITHOLOGY

LITHOLOGY

AGE

AGE

andesitic sandstone

Livingston formation

Tertiary-Cretaceous

Sources

Sources

Page 19: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

19DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Multihierarchical Rock Classification Multihierarchical Rock Classification “Ontology” (Taxonomies) for “Thematic “Ontology” (Taxonomies) for “Thematic

Queries” (GSC)Queries” (GSC)

Composition

Genesis

Fabric

Texture

Page 20: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

20DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Ontology-Enabled Application Example:Ontology-Enabled Application Example:Geologic Map IntegrationGeologic Map Integration

Show formations where AGE = ‘Paleozic’

(without age ontology)

Show formations where AGE = ‘Paleozic’

(without age ontology)

Show formations where AGE = ‘Paleozic’

(with age ontology)

Show formations where AGE = ‘Paleozic’

(with age ontology)

+/- a few hundred million years

domainknowledge

domainknowledge

Knowledge r

epresentatio

n

Geologic Age

ONTOLOGY

NevadaNevada

Page 21: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

21DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Querying by Geologic Age … Querying by Geologic Age …

Page 22: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

22DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Querying by Geologic Age: Querying by Geologic Age: ResultsResults

Page 23: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

23DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Semantic Mediation Semantic Mediation (via “semantic (via “semantic registration” of schemas and ontology registration” of schemas and ontology

articulations)articulations)

• Schema elements and/or data values are Schema elements and/or data values are associated with concept expressions from associated with concept expressions from the target ontologythe target ontology conceptual queries “through” the ontology

• Articulation ontology Articulation ontology source registration to A, querying through B

• Semantic mediation: query rewriting w/ ontologiesSemantic mediation: query rewriting w/ ontologies

Database1

Database2

Ontology A

Ontology B

semanticregistration

ontology articulations

Concept-based(“semantic”)

queries

semanticregistration

Page 24: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

24DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Different views on State Different views on State Geological MapsGeological Maps

Page 25: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

25DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Sedimentary Rocks: BGS Sedimentary Rocks: BGS OntologyOntology

Page 26: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

26DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Sedimentary Rocks: GSC Sedimentary Rocks: GSC OntologyOntology

Page 27: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

27DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Some Thoughts … Some Thoughts … • Translate this idea of multiple conceptual (ontology) Translate this idea of multiple conceptual (ontology)

views to views to your your domain!domain!– e.g. datasets biological pathways registration

• Your data is valuable (time & $$$ spent in producing it)Your data is valuable (time & $$$ spent in producing it)data (re-)usabilitydata (re-)usability

• Metadata helps to discover, localize, assess relevant data sets, Metadata helps to discover, localize, assess relevant data sets, given particular scientific questions & queriesgiven particular scientific questions & queries

• Does your system “understand” what to do with the metadata? Does your system “understand” what to do with the metadata? • Capturing more semanticsCapturing more semantics of a data set in a way that humans of a data set in a way that humans

and systems can exploit it is an and systems can exploit it is an investment in reusabilityinvestment in reusability– “We are producing more and more data”– Today “we can store everything!”– But can we use anything? (i.e., is anyone looking at the data after

the initial creation?)• Design system, interfaces, data and metadata models with Design system, interfaces, data and metadata models with

reusability in mind (think archives and “time capsules”)reusability in mind (think archives and “time capsules”)• This may even be pushed to the experiment/simulation/workflow This may even be pushed to the experiment/simulation/workflow

design…design…

Page 28: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

28DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Data Semantics and Ontologies should be Data Semantics and Ontologies should be useful for Humans and “The Machine” useful for Humans and “The Machine”

Page 29: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

29DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscherformalized as domain map/ontology

Purkinje cells and Pyramidal cells have dendritesthat have higher-order branches that contain spines.Dendritic spines are ion (calcium) regulating components.Spines have ion binding proteins. Neurotransmissioninvolves ionic activity (release). Ion-binding proteinscontrol ion activity (propagation) in a cell. Ion-regulatingcomponents of cells affect ionic activity (release).

domain expert knowledge

Made usable for the system using Description Logic

Example: Domain Example: Domain KnowledgeKnowledge to “glue” to “glue”

SYNAPSESYNAPSE & & NCMIR DataNCMIR Data

Page 30: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

30DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

““Semantic Source Browsing”: Semantic Source Browsing”: Domain Maps/Ontologies (left) & conceptually linked data (right)Domain Maps/Ontologies (left) & conceptually linked data (right)

Page 31: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

31DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

A Semantic Mediation A Semantic Mediation Result ViewResult View

Page 32: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

32DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Source Contextualization Source Contextualization through Ontology Refinementthrough Ontology Refinement

In addition to registering (“hanging off”) data relative toexisting concepts, a source may also refine the mediator’s domain map...

sources can register new concepts at the mediator ...

increase your data usability

Page 33: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

33DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

OutlineOutline

• Semantics & Scientific Data IntegrationSemantics & Scientific Data Integration

• Semantics & Scientific Workflow ManagementSemantics & Scientific Workflow Management

• ConclusionsConclusions

Page 34: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

34DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

What is a Scientific Workflow What is a Scientific Workflow (SWF)?(SWF)?

• GoalsGoals::– automate a scientist’s repetitive data management

and analysis tasks – typical phases:

• data access, scheduling, generation, transformation, aggregation, analysis, visualization

design, test, share, deploy, execute, reuse, … SWFs

Page 35: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

35DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Promoter Identification Promoter Identification WorkflowWorkflow

Source: Matt Coleman (LLNL)Source: Matt Coleman (LLNL)

Page 36: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

36DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Source: NIH BIRN (Jeffrey Grethe, UCSD)Source: NIH BIRN (Jeffrey Grethe, UCSD)

Page 37: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

37DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Ecology: GARP Analysis Ecology: GARP Analysis Pipeline for Invasive Species Pipeline for Invasive Species

PredictionPrediction

Training sample

(d)

GARPrule set

(e)

Test sample (d)

Integrated layers

(native range) (c)

Speciespresence &

absence points(native range)

(a)EcoGridQuery

EcoGridQuery

LayerIntegration

LayerIntegration

SampleData

+A3+A2

+A1

DataCalculation

MapGeneration

Validation

User

Validation

MapGeneration

Integrated layers (invasion area) (c)

Species presence &absence points

(invasion area) (a)

Native range

predictionmap (f)

Model qualityparameter (g)

Environmental layers (native

range) (b)

GenerateMetadata

ArchiveTo Ecogrid

RegisteredEcogrid

Database

RegisteredEcogrid

Database

RegisteredEcogrid

Database

RegisteredEcogrid

Database

Environmental layers (invasion

area) (b)

Invasionarea prediction

map (f)

Model qualityparameter (g)

Selectedpredictionmaps (h)

Source: NSF SEEK (Deana Pennington et. al, UNM)Source: NSF SEEK (Deana Pennington et. al, UNM)

Page 38: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

38DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Page 39: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

39DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Commercial & Open Source Commercial & Open Source Scientific “Workflow” (often Scientific “Workflow” (often DataflowDataflow) Systems) Systems

Kensington Discovery Edition from InforSense

Taverna

Triana

Page 40: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

40DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

SCIRun: Problem Solving Environments SCIRun: Problem Solving Environments for Large-Scale Scientific Computingfor Large-Scale Scientific Computing

• SCIRun: PSE for interactive construction, debugging, and SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations and visualizationssteering of large-scale scientific computations and visualizations

• Component model, based on generalized dataflow programmingComponent model, based on generalized dataflow programming

Steve Parker (cs.utah.edu)Steve Parker (cs.utah.edu)

Page 41: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

41DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Ptolemy II

see!see!see!see!

try!try!try!try!

read!read!read!read!

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Page 42: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

42DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Why Ptolemy II (and thus KEPLER)?Why Ptolemy II (and thus KEPLER)?• Ptolemy II Objective:Ptolemy II Objective:

– “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”

• Dataflow Process Networks w/ natural support for Dataflow Process Networks w/ natural support for abstractionabstraction, , pipeliningpipelining (streaming) (streaming) actor-orientation, actor-orientation, actor actor reusereuse

• User-OrientationUser-Orientation– Workflow design & exec console (Vergil GUI)– “Application/Glue-Ware”

• excellent modeling and design support• run-time support, monitoring, …• not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …)• but middle-/underware is conveniently accessible through actors!

• PRAGMATICSPRAGMATICS– Ptolemy II is mature, continuously extended & improved, well-documented

(500+pp) – open source system– Ptolemy II folks actively participate in KEPLER

Page 43: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

43DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

KEPLER/KEPLER/CSPCSP: : CContributors, ontributors, SSponsors, ponsors, PProjectsrojects(or loosely coupled (or loosely coupled CCommunicating ommunicating SSequential equential PPersons ;-)ersons ;-)

Ilkay Altintas Ilkay Altintas SDM, ResurgenceSDM, Resurgence

Kim Baldridge Kim Baldridge Resurgence, NMIResurgence, NMI

Chad Berkley Chad Berkley SEEKSEEK

Shawn Bowers Shawn Bowers SEEKSEEK

Terence Critchlow Terence Critchlow SDMSDM

Tobin Fricke Tobin Fricke ROADNetROADNet

Jeffrey Grethe Jeffrey Grethe BIRNBIRN

Christopher H. Brooks Christopher H. Brooks Ptolemy IIPtolemy II

Zhengang Cheng Zhengang Cheng SDMSDM

Dan Higgins Dan Higgins SEEKSEEK

Efrat Jaeger Efrat Jaeger GEONGEON

Matt Jones Matt Jones SEEKSEEK

Werner Krebs, Werner Krebs, EOLEOL

Edward A. Lee Edward A. Lee Ptolemy IIPtolemy II

Kai Lin Kai Lin GEONGEON

Bertram Ludaescher Bertram Ludaescher SDM, SEEKSDM, SEEK, , GEONGEON, , BIRN,BIRN, ROADNetROADNet

Mark Miller Mark Miller EOLEOL

Steve Mock Steve Mock NMINMI

Steve Neuendorffer Steve Neuendorffer Ptolemy IIPtolemy II

Jing Tao Jing Tao SEEKSEEK

Mladen Vouk Mladen Vouk SDMSDM

Xiaowen Xin Xiaowen Xin SDMSDM

Yang Zhao Yang Zhao Ptolemy IIPtolemy II

Bing Zhu Bing Zhu SEEKSEEK

••••••

Ptolemy IIPtolemy II

                                                

                                            

www.kepler-project.orgwww.kepler-project.org

Page 44: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

44DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

KEPLER: An Open CollaborationKEPLER: An Open Collaboration• Initiated by members from DOE SDM/SPA and NSF SEEK; now several Initiated by members from DOE SDM/SPA and NSF SEEK; now several

other projects (GEON, Ptolemy II, EOL, Resurgence/NMI, …)other projects (GEON, Ptolemy II, EOL, Resurgence/NMI, …)• Open Source (BSD-style license)Open Source (BSD-style license)• Intensive Communications: Intensive Communications:

– Web-archived mailing lists– IRC (!)– Meetings, Hackathons

• Co-development: Co-development: – via shared CVS repository– joining as a new co-developer (currently):

• get a CVS account (read-only)• local development + contribution via existing KEPLER member• be voted “in” as a member/co-developer

• Software & social engineeringSoftware & social engineering– How to better accommodate new groups/communities?– How to better accommodate different usage/contribution models (core dev …

special purpose extender … user)?

Page 45: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

45DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Ptolemy II/KEPLER GUI (Vergil)Ptolemy II/KEPLER GUI (Vergil)“Directors” define the component interaction & execution semantics

Large, polymorphic component (“Actors”) and Directors libraries (drag & drop)

Page 46: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

46DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Web Services Web Services Actors Actors (WS Harvester)(WS Harvester)

12

3

4

“Minute-made” (MM) WS-based application integration• Similarly: MM workflow design & sharing w/o implemented

components

Page 47: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

47DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Rapid Web Service-based Prototyping Rapid Web Service-based Prototyping

(Here: ROADNet Command & Control Services for LOOKING Kick-Off Mtg)(Here: ROADNet Command & Control Services for LOOKING Kick-Off Mtg)

Source: Ilkay Altintas, SDM, NLADRROADNet: Vernon, Orcutt et al

Web services: Tony Fountain et al

Source: Ilkay Altintas, SDM, NLADRROADNet: Vernon, Orcutt et al

Web services: Tony Fountain et al

Page 48: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

48DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

An “early” example: An “early” example: Promoter Identification Promoter Identification

SSDBM, AD 2003SSDBM, AD 2003

• Scientist models application as a “workflow” of connected components (“actors”)

• If all components exist, the workflow can be automated/ executed

• Different directors can be used to pick appropriate execution model (often “pipelined” execution: PN director)

Page 49: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

PIW Workflow Today

Page 50: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

50DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Enter initial inputs,

Run

and

Display results

“Run Window”

Page 51: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

51DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Custom Output Visualizer

Page 52: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

52DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Job Management (here: Job Management (here: NIMROD)NIMROD)

• Job management infrastructure in place• Results database: under development• Goal: 1000’s of GAMESS jobs (quantum mechanics)

Page 53: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

53DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Some Recent Actor AdditionsSome Recent Actor Additions

Page 54: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

54DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

in KEPLER in KEPLER (w/ editable (w/ editable script)script)

Source: Dan Higgins, Kepler/SEEKSource: Dan Higgins, Kepler/SEEK

Page 55: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

55DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

in KEPLER in KEPLER (interactive session)(interactive session)

Source: Dan Higgins, Kepler/SEEKSource: Dan Higgins, Kepler/SEEK

Page 56: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

56DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Blurring Blurring Design (ToDo)Design (ToDo) and Execution and Execution

Page 57: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

57DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Some Scientific Workflow Some Scientific Workflow ChallengesChallenges

• Typical FeaturesTypical Features– data-intensive and/or compute-intensive– plumbing-intensive (consecutive web services won’t

fit)– dataflow-oriented– distributed (remote data, remote processing)– user-interaction “in the middle”, …– … vs. (C-z; bg; fg)-ing (“detach” and reconnect)– advanced programming constructs (map(f), zip,

takewhile, …)– logging, provenance, “registering back”

(intermediate) products…

Page 58: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

58DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Scientific Workflows & Scientific Workflows & SemanticsSemantics• Registering data to ontologies: semantic Registering data to ontologies: semantic

types (in addition to structural data types)types (in addition to structural data types)• Smarter data set discovery & integrationSmarter data set discovery & integration• Now also: Now also:

– Smarter workflow design– More “intelligent” (semantics-aware)

component composition– Improved (re-)usability of data, services

(actors), and workflows– Given semantic type of my input ports, what

other data sets / actors produce such input

Page 59: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

59DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Reengineering a Geoscientist’s Mineral Classification Workflow

Add semantic types to ports!!

Page 60: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

60DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Beginnings: Ontology-based Beginnings: Ontology-based Actor/Service DiscoveryActor/Service Discovery

Ontology based actor (service) and dataset

search

Result Display

Page 61: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

61DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Semantics & Scientific WorkflowsSemantics & Scientific Workflows

Data comes from Data comes from heterogeneousheterogeneous sources sources– Real-world observations– Spatial-temporal contexts– Collection/measurement protocols and procedures– Many representations for the same information (count,

area, density)– Schematically heterogeneous

Data discovered and “synthesized” Data discovered and “synthesized” manuallymanually

Hard to reuse/repurpose existing analytical Hard to reuse/repurpose existing analytical steps (another form of heterogeneity)steps (another form of heterogeneity)

Page 62: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

62DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

The KR/SMS “Waterfall” … The KR/SMS “Waterfall” …

Ontologies

SemanticAnnotation

IterativeDevelopment

Resource Discovery

ResourceIntegration

WorkflowPlanning

WorkflowAnalysis

Source: [Bowers-SEEK-AHM-04]Source: [Bowers-SEEK-AHM-04]

Page 63: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

63DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

A KR+DI+Scientific Workflow A KR+DI+Scientific Workflow ProblemProblem

• Services can be Services can be semantically compatiblesemantically compatible, , but but structurally incompatiblestructurally incompatible

SourceServiceSourceService

TargetServiceTargetService

Ps Pt

SemanticType Ps

SemanticType Ps

SemanticType Pt

SemanticType Pt

StructuralType Pt

StructuralType Pt

StructuralType Ps

StructuralType Ps

Desired Connection

Incompatible

Compatible

(⋠)

(⊑)

(Ps)(Ps) (≺)

Ontologies (OWL)Ontologies (OWL)

Source: [Bowers-Ludaescher, DILS’04]Source: [Bowers-Ludaescher, DILS’04]

Page 64: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

64DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Ontology-Informed Data Ontology-Informed Data Transformation Transformation (“Structure-Shim”)(“Structure-Shim”)

SourceServiceSourceService

TargetServiceTargetService

Ps Pt

SemanticType Ps

SemanticType Ps

SemanticType Pt

SemanticType Pt

StructuralType Pt

StructuralType Pt

StructuralType Ps

StructuralType Ps

Desired Connection

Compatible ( )⊑

RegistrationMapping (Output)

RegistrationMapping (Input)

CorrespondenceCorrespondence

Generate (Ps)(Ps)

Ontologies (OWL)Ontologies (OWL)

Transformation

Source: [Bowers-Ludaescher, DILS’04]Source: [Bowers-Ludaescher, DILS’04]

Page 65: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

65DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

OutlineOutline

• Scientific Data IntegrationScientific Data Integration

• Scientific Workflow ManagementScientific Workflow Management

• Musings & Conclusions Musings & Conclusions

Page 66: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

66DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Some Thoughts … Some Thoughts … • Translate this idea of multiple conceptual (ontology) Translate this idea of multiple conceptual (ontology)

views to views to your your domain!domain!– e.g. datasets biological pathways registration

• Your data is valuable (time & $$$ spent in producing it)Your data is valuable (time & $$$ spent in producing it)data (re-)usabilitydata (re-)usability

• Metadata helps to discover, localize, assess relevant data sets, Metadata helps to discover, localize, assess relevant data sets, given particular scientific questions & queriesgiven particular scientific questions & queries

• Does your system “understand” what to do with the metadata? Does your system “understand” what to do with the metadata? • Capturing more semanticsCapturing more semantics of a data set in a way that humans of a data set in a way that humans

and systems can exploit it is an and systems can exploit it is an investment in reusabilityinvestment in reusability– “We are producing more and more data”– Today “we can store everything!”– But can we use anything? (i.e., is anyone looking at the data after

the initial creation?)• Design system, interfaces, data and metadata models with Design system, interfaces, data and metadata models with

reusability in mind (think archives and “time capsules”)reusability in mind (think archives and “time capsules”)• This may even be pushed to the experiment/simulation/workflow This may even be pushed to the experiment/simulation/workflow

design…design…

Page 67: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

67DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

The FutureThe Future• We start to see the benefits of We start to see the benefits of semantic semantic

technologiestechnologies in scientific data in scientific data managementmanagement– BTW: semantic technologies have been

there for a while!• think conceptual models, ER diagrams, …• or Gottlob Frege (German mathematician,

logician, philosopher; 1848-1925) • Today: momentum through “Semantic Web”,

“Semantic Grid”

• Where will semantics lead us in 10, 20, Where will semantics lead us in 10, 20, 50 years?50 years?

Page 68: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

68DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

A 50 year forecast in retrospectiveA 50 year forecast in retrospective(even if a hoax … you get the idea…)(even if a hoax … you get the idea…)

Page 69: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

69DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

KEPLER – a Collaboration KEPLER – a Collaboration ExampleExample

• A A grass-rootsgrass-roots project project– Needed a coalition of the (really!) willing – People matter!

• Intra-project Intra-project linkslinks– e.g. in SEEK: AMS SMS EcoGrid

• Inter-projectInter-project links links– SEEK ITR, GEON ITR, ROADNet ITRs, DOE SciDAC SDM,

Ptolemy II, NIH BIRN (coming we hope …), UK eScience myGrid, …

• Inter-technologyInter-technology links links– Globus, SRB, JDBC, web services, soaplab services, command

line tools, R, GRASS, XSLT, …

• InterdisciplinaryInterdisciplinary links links– CS, IT, domain sciences, … (recently: usability engineer)

Page 70: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

70DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

GEON Dataset Generation & GEON Dataset Generation & RegistrationRegistration

(a co-development in KEPLER)(a co-development in KEPLER)

Xiaowen (SDM)

Edward et al.(Ptolemy)

Yang (Ptolemy)

Efrat(GEON)

Ilkay(SDM)

SQL database access (JDBC)Matt,Chad,

Dan et al. (SEEK)

% Makefile$> ant run

% Makefile$> ant run

Page 71: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

71DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Summary/Lessons Learned Summary/Lessons Learned • Semantics mattersSemantics matters• Collaboration tools neededCollaboration tools needed

– CVS repositories (+cvsview, webcvs)– Mailing lists (e.g. mailman googlified) – Bugzilla (detailed tracking of tech. issues & bugs)– WIKI (community authored web resource, e.g. high-level tech.

issues)• People matterPeople matter• Repositories matterRepositories matter

– EcoGrid (SEEK) registry, GEON registry, BIRN registry KEPLER actor & datasets repository, …– UDDI what?

• “ “Melting Pots”: Melting Pots”: – Places, projects, organizations (GGF), tools:

• National Labs, …, SDSC, NCEAS, LTER, NLADR (w/ NCSA), KU Specify, …, new Genome Center@UC Davis (moving in …), …

• SDM, BIRN, GEON, SEEK, … • Kepler, …

Page 72: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

72DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Q & AQ & A

Page 73: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

73DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Further ReadingFurther Reading

under review – available upon request from [email protected]

Page 74: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

74DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Related PublicationsRelated Publications• Semantic Data Registration and IntegrationSemantic Data Registration and Integration• On Integrating Scientific Resources through Semantic RegistrationOn Integrating Scientific Resources through Semantic Registration, S. Bowers, K. Lin, , S. Bowers, K. Lin,

and B. Ludäscher, and B. Ludäscher, 16th International Conference on Scientific and Statistical Database 16th International Conference on Scientific and Statistical Database ManagementManagement ( (SSDBM'04SSDBM'04), 21-23 June 2004, Santorini Island, Greece. ), 21-23 June 2004, Santorini Island, Greece.

• A System for Semantic Integration of Geologic Maps via A System for Semantic Integration of Geologic Maps via OntologiesOntologies, K. Lin and B. , K. Lin and B. Ludäscher. In Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific DataSemantic Web Technologies for Searching and Retrieving Scientific Data ( (SCISWSCISW), Sanibel Island, Florida, 2003. ), Sanibel Island, Florida, 2003.

• Towards a Generic Framework for Semantic Registration of Scientific DataTowards a Generic Framework for Semantic Registration of Scientific Data, S. Bowers , S. Bowers and B. Ludäscher. In and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Semantic Web Technologies for Searching and Retrieving Scientific DataScientific Data ( (SCISWSCISW), Sanibel Island, Florida, 2003. ), Sanibel Island, Florida, 2003.

• The Role of XML in Mediated Data Integration Systems with Examples from Geological (MThe Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data Interoperabilityap) Data Interoperability, B. Brodaric, B. Ludäscher, and K. Lin. In , B. Brodaric, B. Ludäscher, and K. Lin. In Geological Society of America (GSA) Geological Society of America (GSA) Annual MeetingAnnual Meeting, volume 35(6), November 2003. , volume 35(6), November 2003.

• Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON GSemantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Gridrid, K. Lin, B. Ludäscher, B. Brodaric, D. Seber, C. Baru, and K. A. Sinha. In , K. Lin, B. Ludäscher, B. Brodaric, D. Seber, C. Baru, and K. A. Sinha. In Geological Society of America (GSA) Annual MeetingGeological Society of America (GSA) Annual Meeting, volume 35(6), November 2003. , volume 35(6), November 2003.

• Query Planning and RewritingQuery Planning and Rewriting• Processing First-Order Queries under Limited Access PatternsProcessing First-Order Queries under Limited Access Patterns, Alan Nash and B. , Alan Nash and B.

Ludäscher, Ludäscher, Proc. 23rd ACM Symposium on Principles of Database SystemsProc. 23rd ACM Symposium on Principles of Database Systems ( (PODS'04PODS'04) ) Paris, France, June 2004. Paris, France, June 2004.

• Processing Unions of Conjunctive Queries with Negation under Limited Access PatternsProcessing Unions of Conjunctive Queries with Negation under Limited Access Patterns, , Alan Nash and B. Ludäscher., Alan Nash and B. Ludäscher., 9th Intl. Conference on Extending Database Technology9th Intl. Conference on Extending Database Technology ((EDBT'04EDBT'04) Heraklion, Crete, Greece, March 2004, LNCS 2992. ) Heraklion, Crete, Greece, March 2004, LNCS 2992.

• Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and Negationwith Union and Negation, B. Ludäscher and Alan Nash. Research abstract (poster), , B. Ludäscher and Alan Nash. Research abstract (poster), 20th Intl. Conference on 20th Intl. Conference on Data EngineeringData Engineering ( (ICDE'04ICDE'04) Boston, IEEE Computer Society, April 2004.) Boston, IEEE Computer Society, April 2004.

Page 75: Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate.

75DOE NC Meeting, Boulder, Dec 1st 2004 Semantic Technologies in SDM, B.Ludäscher

Related PublicationsRelated Publications• Scientific WorkflowsScientific Workflows• KeplerKepler: An Extensible System for Design and Execution of Scientific Workflows: An Extensible System for Design and Execution of Scientific Workflows, I. Altintas, C. , I. Altintas, C.

Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, 16th International Conference on 16th International Conference on Scientific and Statistical Database ManagementScientific and Statistical Database Management ( (SSDBM'04SSDBM'04), 21-23 June 2004, Santorini ), 21-23 June 2004, Santorini Island, Greece. Island, Greece.

• KeplerKepler: Towards a Grid-Enabled System for Scientific Workflows: Towards a Grid-Enabled System for Scientific Workflows, Ilkay Altintas, Chad , Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher, Steve Mock, Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher, Steve Mock, Workflow in Grid Systems (GGF10)Workflow in Grid Systems (GGF10), Berlin, March 9th, 2004., Berlin, March 9th, 2004.

• An Ontology-Driven Framework for Data Transformation in Scientific WorkflowsAn Ontology-Driven Framework for Data Transformation in Scientific Workflows, S. Bowers , S. Bowers and B. Ludäscher, and B. Ludäscher, Intl. Workshop on Data Integration in the Life SciencesIntl. Workshop on Data Integration in the Life Sciences ( (DILS'04DILS'04), March ), March 25-26, 2004 Leipzig, Germany, LNCS 2994. 25-26, 2004 Leipzig, Germany, LNCS 2994.

• A Web Service Composition and Deployment Framework for Scientific Workflows, I. A Web Service Composition and Deployment Framework for Scientific Workflows, I. Altintas, E. Jaeger, K. Lin, B. Ludaescher, A. Memon, In the Altintas, E. Jaeger, K. Lin, B. Ludaescher, A. Memon, In the 2nd Intl. Conference on 2nd Intl. Conference on Web ServicesWeb Services ( (ICWSICWS), San Diego, California, July 2004.), San Diego, California, July 2004.