
Data Grid Research Group
Dept. of Computer Science and Engineering
The Ohio State University
Columbus, Ohio 43210, USA

David Chiu & Gagan Agrawal

Enabling Ad Hoc Queries over Low-Level Scientific Data Sets

D. Chiu & G. Agrawal. Enabling Ad Hoc Queries over Low-Level Scientific Data Sets

SSDBM ’09

Presentation Outline

• Motivation
  ‣ Current Trends in Scientific Data Management
  ‣ Problem Discussion

• Data Registration Indexing
  ‣ Metadata Extraction
  ‣ Transformation

• Service Composition

• Conclusion

Increased tremendously over the years

Scientific Data Sets

• The collection of scientific data has increased over the years with new instruments, simulations, etc.

• Data sets are stored in repositories around the globe

• Just within U.S. entities in the geospatial domain
  ‣ NOAA: oceanic, climate, water quality, ...
  ‣ NASA: ozone, air quality, tropical, ...
  ‣ NRCS: land quality, watershed, ...

Data Repositories

Web or Data Grid Infrastructure

Mass Storage Systems (MSS)

Scientific Data Sets

• Data sets are typically low level, i.e.,
  ‣ Unstructured or semi-structured

    0101071895  0.34 -2.45  0.50 -0.65 -0.62 -0.71  0.00 -0.96
    0101071896 -1.71  0.49  0.27 -0.79 -1.53  0.60  0.09 -2.21
    0101071897 -0.53  0.14  4.32  1.95 -1.55 -1.68 -1.32 -0.69
    0101071898  1.90 -2.64 -1.70  1.11 -2.18 -1.08 -0.53 -0.25
    0101071899  0.44  0.97  1.65 -0.71 -2.02 -2.10 -0.50 -2.03
    0101071900 -1.65  1.19 -1.34  0.57 -1.37  7.00 -0.48 -1.77
    . . .

• However, data is well-documented
  ‣ Accompanying XML-based metadata describing data sets is typically required in today’s repositories
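To make "low level" concrete, here is a minimal Python sketch that reads rows shaped like the sample on this slide. The slides do not say what the columns mean, so treating the first field as a record identifier and the rest as numeric measurements is purely an assumption, and `parse_row` is a hypothetical helper name.

```python
from typing import List, Tuple

def parse_row(line: str) -> Tuple[str, List[float]]:
    """Split one whitespace-delimited row into (identifier, values).

    Assumption: the first field is an identifier and every other field
    is a number. Nothing in the file itself tells us this, which is why
    accompanying metadata is needed to interpret the data.
    """
    fields = line.split()
    return fields[0], [float(value) for value in fields[1:]]

# Two rows copied from the sample on the slide.
sample = """0101071895 0.34 -2.45 0.50 -0.65 -0.62 -0.71 0.00 -0.96
0101071896 -1.71 0.49 0.27 -0.79 -1.53 0.60 0.09 -2.21"""

records = [parse_row(line) for line in sample.splitlines()]
```

The parser recovers numbers but not their meaning (units, variable names, spatial extent), which is the gap the metadata fills.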

Data Repositories

Mass Storage Systems (MSS)

Grid/Web Services & portals

Web or Data Grid Infrastructure

Data Repositories on a Global Scale

US, EU, AU, ...

What Do the Users Want?

US, EU, AU, ...

I don’t care where data is located.

I also want to share my own data with others!

Don’t just give me the data, but...

- Transform it
- Manipulate it
- Compose it with other processes and data sets

And do this with the least amount of work required from me!

System Goals

• To enable queries over low level data sets, which involves:
  ‣ identification of relevant data sets
  ‣ automatic planning for the composition of dependent services (processes) for derivation

• ... while being non-intrusive to existing schemes, i.e.,
  ‣ avoids requiring a standardized format for storing data sets
  ‣ accommodates heterogeneous metadata
  ‣ this system should "fit" into existing MSS and scientific computing infrastructures (Data Grid & the Web)

That’s good and all, but...

Challenges

• Not without challenges...
  ‣ dealing with metadata from multiple entities
  ‣ efficiently identifying relevant data sets
  ‣ planning and executing accurate service compositions on the spot

• And without question, the need for DOMAIN KNOWLEDGE & SEMANTICS

The AUSPICE System

AUSPICE: Automatic Service Planning and Execution in Cloud/Grid Environments

The Semantics Layer

A Need for Domain Level Knowledge

• Assume the following service retrieves a satellite image pertaining to (x,y) at a resolution given by r

    get_sat_image(double x, double y, double r)

  (Figure: the parameters x, y, and r are linked by inputsTo edges to the domain concepts longitude, latitude, and grid_size; the service is linked by an outputsTo edge to the concept satellite image.)

• Questions to ask the system:
  ‣ How to deduce that this service can be used?
  ‣ How to determine what information is needed for input?
  ‣ Did the user provide enough information to invoke this service?
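The three questions on this slide can be illustrated with a small Python sketch. It is not AUSPICE's actual ontology representation: the `Service` class, `services_for`, and `missing_inputs` are hypothetical names, and the pairing of x, y, r with longitude, latitude, grid_size is read off the slide's figure layout.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class Service:
    """A service annotated with domain concepts (hypothetical structure)."""
    name: str
    inputs: Dict[str, str]  # parameter name -> domain concept it represents
    output: str             # domain concept the service derives

# Annotation of the slide's example service; the mapping is an assumption.
GET_SAT_IMAGE = Service(
    name="get_sat_image",
    inputs={"x": "longitude", "y": "latitude", "r": "grid_size"},
    output="satellite image",
)

def services_for(concept: str, registry: List[Service]) -> List[Service]:
    """Question 1: which registered services can derive this concept?"""
    return [service for service in registry if service.output == concept]

def missing_inputs(service: Service, provided: Set[str]) -> List[str]:
    """Questions 2 and 3: which required concepts did the user not supply?"""
    return [c for c in service.inputs.values() if c not in provided]
```

For example, `missing_inputs(GET_SAT_IMAGE, {"longitude", "latitude"})` reports that `grid_size` is still needed, so the system would know the query cannot yet invoke the service.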

In the Semantics Layer

Applying Domain Information

Domain concepts can be derived from executing a service

Domain concepts can also be derived from retrieving an existing data set

Service parameters represent different domain concepts

Data Registration Service

Indexing Data Sets

• Handling heterogeneous metadata

• For instance, just within the geospatial domain,

Country    Metadata Standard
US         CSDGM
AU, NZ     ANZLIC
EU         ???
CDN        ???
...        ...

• Metadata Transformation (transform to spatial index)

• Metadata to DB transformations (insert)
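The registration steps (extract metadata attributes, transform them, insert them into a database) might look roughly like the Python sketch below. It assumes CSDGM-style bounding coordinates (`westbc`, `eastbc`, `southbc`, `northbc` under `idinfo/spdom/bounding`); the table schema, the file path, and the coordinate values are invented for illustration. A plain SQLite table with range predicates stands in for the spatial index named on the slide.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up metadata fragment in the CSDGM bounding-coordinates layout.
CSDGM_SAMPLE = """<metadata><idinfo><spdom><bounding>
<westbc>-84.8</westbc><eastbc>-80.5</eastbc>
<northbc>41.9</northbc><southbc>38.4</southbc>
</bounding></spdom></idinfo></metadata>"""

def extract_bbox(xml_text):
    """Pull (west, east, south, north) out of the XML metadata."""
    box = ET.fromstring(xml_text).find("idinfo/spdom/bounding")
    tags = ("westbc", "eastbc", "southbc", "northbc")
    return tuple(float(box.findtext(tag)) for tag in tags)

def register(conn, path, xml_text):
    """Insert one data set's extracted attributes (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS datasets "
        "(path TEXT, west REAL, east REAL, south REAL, north REAL)"
    )
    conn.execute(
        "INSERT INTO datasets VALUES (?, ?, ?, ?, ?)",
        (path, *extract_bbox(xml_text)),
    )

def find_covering(conn, lon, lat):
    """Return paths of registered data sets whose box contains the point."""
    rows = conn.execute(
        "SELECT path FROM datasets "
        "WHERE west <= ? AND east >= ? AND south <= ? AND north >= ?",
        (lon, lon, lat, lat),
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
register(conn, "/data/example.dat", CSDGM_SAMPLE)
```

Supporting another metadata standard would then only require a different `extract_bbox`, since the table and the lookup do not depend on the source format.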

In the Semantics Layer

Applying Domain Information

Data registration simplifies identification process within

Indexing Services

• Services (inputs, outputs) are also registered in much the same way

The Planning Layer

Service Composition: An Example

A subset of the ontology (unrolled)

The Planning Layer

Service Composition

begin compSrvc(concept, Q[...])
    W := ()
    //perform DFS starting from concept
    let v := concept be the currently visited node
    if v is a data type then
        W := (W, index.getData(v, Q))
    else //v is a service
        let (p1,..,pn) be v’s params
        //recursive call on each pi
        W := (W, (v, compSrvc(p1, Q), ... , compSrvc(pn, Q)))
    end if
    return W
end
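The recursion in compSrvc can be sketched in runnable Python as follows. This is a simplified reading, not the authors' implementation: it assumes an acyclic ontology in which each concept is derived by at most one service, and the concept names, service name, and file names in the toy tables are invented.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical toy ontology: concept -> (deriving service, parameter concepts).
SERVICES: Dict[str, Tuple[str, List[str]]] = {
    "water_level": ("get_water_level", ["dem", "shoreline"]),
}

# Hypothetical stand-in for the data registration index.
DATA_INDEX: Dict[str, List[str]] = {
    "dem": ["dem_2004.dat"],
    "shoreline": ["shore_2004.dat"],
}

def get_data(concept: str, query: Dict[str, Any]) -> List[str]:
    """Stand-in for index.getData(v, Q); this toy version ignores the query."""
    return DATA_INDEX.get(concept, [])

def comp_srvc(concept: str, query: Dict[str, Any]):
    """Depth-first composition: returns a nested plan for `concept`.

    A service node becomes (service, plan_for_p1, ..., plan_for_pn);
    a data-type node becomes the list of matching data sets.
    """
    if concept in SERVICES:
        service, params = SERVICES[concept]
        return (service, *[comp_srvc(p, query) for p in params])
    return get_data(concept, query)

plan = comp_srvc("water_level", {})
```

The nested tuple mirrors the W := (W, (v, ...)) structure of the pseudocode: the outermost element names the service to execute last, and each inner element is the sub-plan that supplies one of its parameters.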

The Planning Layer

Service Composition: An Example

Ontology (unrolled)

A Derived Execution Plan

This is what data registration provides

Planning Times

Conclusion

• The AUSPICE System...
  ‣ unifies heterogeneous metadata
  ‣ extracts certain metadata attributes and indexes low level data sets and services for fast access from distributed repositories
  ‣ automatically composes these services and data sets to answer user queries

• Questions - Comments?
  ‣ David Chiu [email protected]
  ‣ Gagan Agrawal [email protected]