CASJobs: A Workflow Environment Designed for Large Scientific Catalogs

Nolan Li, Johns Hopkins University



Page 2: What is CASJobs

Terabytes of scientific data
Web-based system
Data distribution
Server-side analysis
Optimize user work patterns
Server-side user storage and programmability

Page 3: Sloan Digital Sky Survey (SDSS)

Astronomical survey
Images (FITS): 15.7 TB
Other data products (masks, JPEG images, etc.) (DAS, FITS format): 26.8 TB
Catalogs (CAS, SQL database): 18 TB
Data is public
Delivery?

Page 4: Database

Bandwidth is expensive!
10 terabytes is big!
So database it (SkyServer)
Partial delivery
Move work to data
Scalability
  Traffic++
  Complexity++
  Data++
So…
  Cap execution time
  Cap results
  Build something else

[Figure: Monthly CAS Usage. Series: Web Hits, SQL Queries. Vertical axis ticks from 1.E+04 to 1.E+07.]
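The "cap execution time, cap results" policy on this slide can be illustrated with a small sketch. It is not SkyServer's implementation: Python's built-in sqlite3 module stands in for the catalog database, and the names run_capped, max_seconds, max_rows, and photoobj are hypothetical.

```python
import sqlite3
import time


def run_capped(conn, sql, max_seconds=1.0, max_rows=1000):
    """Run sql, aborting after max_seconds and returning at most max_rows rows.

    Hypothetical helper: the limits and names are illustrative assumptions.
    """
    deadline = time.monotonic() + max_seconds
    timed_out = False

    # SQLite calls this handler every 1000 virtual-machine instructions;
    # a non-zero return value interrupts the running statement.
    def over_time():
        nonlocal timed_out
        timed_out = time.monotonic() > deadline
        return 1 if timed_out else 0

    conn.set_progress_handler(over_time, 1000)
    try:
        cursor = conn.execute(sql)
        rows = cursor.fetchmany(max_rows)            # cap the result size
        truncated = cursor.fetchone() is not None    # were more rows available?
        return rows, truncated
    except sqlite3.OperationalError:
        if timed_out:
            raise TimeoutError("query exceeded its execution time cap") from None
        raise                                        # some other SQL error
    finally:
        conn.set_progress_handler(None, 0)           # remove the handler


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photoobj (objid INTEGER PRIMARY KEY, ra REAL)")
conn.executemany(
    "INSERT INTO photoobj VALUES (?, ?)", [(i, i * 0.001) for i in range(5000)]
)

rows, truncated = run_capped(conn, "SELECT * FROM photoobj", max_rows=100)
print(len(rows), truncated)
```

Both caps protect the server at the expense of the user, which is why the slide ends with "build something else".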

Page 5: CASJobs

Catalog Archive Server Jobs
Server-side user storage and programmability
  MyDB
Hardware abstraction and long-term query portability
  Contexts
Complete, automatic query logging
Scalable performance
  Controlled asynchronous query execution
Data sharing
  Groups
http://casjobs.sdss.org/casjobs

Page 6: MyDB

Server-side user database
Intermediate storage
Data import
User programmable

SELECT *
FROM DR4 a
WHERE a.objid = 38573498
   OR a.objid = 92837451
   OR a.objid = 20394833
   OR a.objid = 90284723

SELECT *
FROM DR4 a, MyDB.MyTable b
WHERE a.objid = b.objid
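The two queries above contrast listing object IDs by hand with joining against a table uploaded to MyDB. A minimal sketch of the second pattern follows, using Python's built-in sqlite3 module as a stand-in. The table photoobj, its columns, and the two extra catalog rows are assumptions made for illustration; an attached in-memory database named mydb plays the role of the user's MyDB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A second in-memory database stands in for the user's server-side MyDB.
conn.execute("ATTACH DATABASE ':memory:' AS mydb")

# Hypothetical catalog table; the real schema is not given in the slides.
conn.execute("CREATE TABLE photoobj (objid INTEGER PRIMARY KEY, ra REAL)")
catalog_ids = [38573498, 92837451, 20394833, 90284723, 11111111, 22222222]
conn.executemany(
    "INSERT INTO photoobj VALUES (?, ?)", [(objid, 0.0) for objid in catalog_ids]
)

# "Data import": the user uploads the IDs of interest once.
conn.execute("CREATE TABLE mydb.MyTable (objid INTEGER PRIMARY KEY)")
wanted = [38573498, 92837451, 20394833, 90284723]
conn.executemany("INSERT INTO mydb.MyTable VALUES (?)", [(o,) for o in wanted])

# The join replaces the chain of OR conditions.
matches = conn.execute(
    "SELECT a.objid FROM photoobj a, mydb.MyTable b "
    "WHERE a.objid = b.objid ORDER BY a.objid"
).fetchall()

# "Intermediate storage": keep the result server-side for later queries.
conn.execute(
    "CREATE TABLE mydb.Matches AS "
    "SELECT a.* FROM photoobj a, mydb.MyTable b WHERE a.objid = b.objid"
)
stored = conn.execute("SELECT COUNT(*) FROM mydb.Matches").fetchone()[0]
print(matches, stored)
```

The query text stays the same length whether the uploaded table holds four IDs or four million, and the result never has to leave the server.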

Page 7: Logging

Automatically log all user queries
Resubmit old queries
Reconstruct database objects
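One way to picture automatic logging and resubmission is a thin wrapper that records every query before running it. This is only a sketch under stated assumptions: the class LoggedConnection, the table query_log, and its columns are hypothetical, and sqlite3 stands in for the real server.

```python
import sqlite3


class LoggedConnection:
    """Hypothetical wrapper that records every submitted query."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS query_log ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "submitted TEXT DEFAULT CURRENT_TIMESTAMP, "
            "sql TEXT NOT NULL)"
        )

    def submit(self, sql):
        # Log first, so queries that later fail are recorded too.
        self.conn.execute("INSERT INTO query_log (sql) VALUES (?)", (sql,))
        return self.conn.execute(sql).fetchall()

    def resubmit(self, log_id):
        # Look up an old query by its log id and run it again.
        row = self.conn.execute(
            "SELECT sql FROM query_log WHERE id = ?", (log_id,)
        ).fetchone()
        if row is None:
            raise KeyError(log_id)
        return self.submit(row[0])

    def history(self):
        # Replaying this list on an empty database would rebuild its objects.
        return [r[0] for r in self.conn.execute(
            "SELECT sql FROM query_log ORDER BY id")]


db = LoggedConnection(sqlite3.connect(":memory:"))
db.submit("CREATE TABLE mytable (objid INTEGER)")
db.submit("INSERT INTO mytable VALUES (38573498)")
first = db.submit("SELECT COUNT(*) FROM mytable")
again = db.resubmit(3)  # the SELECT above was the third logged query
print(first, again, len(db.history()))
```

Because the log holds the statements that created each table, replaying it in order is one plausible route to the "reconstruct database objects" item on the slide.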

Page 8: Contexts

Databases are identified by their data, not their location
Queries are independent of hardware configuration

SELECT TOP 10 *
FROM [server].[catalog].[user].MyTable

SELECT TOP 10 *
FROM DR4.MyTable
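The idea of a context can be sketched as a lookup table that turns the short, location-free name into the four-part physical name. The mapping below and the names CONTEXTS, resolve, server01, server02, catalog_dr4, and dbo are hypothetical, and the regular-expression rewrite is a deliberate simplification: a production system would presumably parse the SQL rather than pattern-match it.

```python
import re

# Hypothetical mapping from context name to physical location.
CONTEXTS = {
    "DR4": {"server": "server01", "catalog": "catalog_dr4", "user": "dbo"},
}


def resolve(sql, contexts=CONTEXTS):
    """Rewrite Context.Table references to [server].[catalog].[user].Table."""

    def expand(match):
        context, table = match.group(1), match.group(2)
        location = contexts.get(context)
        if location is None:
            return match.group(0)  # not a context (e.g. alias.column): keep
        return "[{server}].[{catalog}].[{user}].{table}".format(
            table=table, **location
        )

    return re.sub(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b", expand, sql)


query = "SELECT TOP 10 * FROM DR4.MyTable"
print(resolve(query))

# Moving the data to other hardware changes only the mapping, not the query.
moved = {"DR4": {"server": "server02", "catalog": "catalog_dr4", "user": "dbo"}}
print(resolve(query, moved))
```

The user's saved query text is identical before and after the move, which is the long-term portability the slide refers to.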

Page 9: Quick Jobs

Executes right away
But not for very long
Restricted memory usage
For things like…
  How many objects?
  Table previews
  Preliminary queries
  System queries

Page 10: Long Jobs

Asynchronous
Less restricted execution time
Storage capped by MyDB size
For things like…
  Heavy IO
  Heavy computation
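Asynchronous execution of long jobs can be sketched as a queue with a background worker: submitting returns a job id at once, and the status is checked later. This is an assumption-laden illustration, not the CASJobs scheduler. The class JobQueue, its status strings, and the table name result are hypothetical, and sqlite3 stands in for the database.

```python
import queue
import sqlite3
import threading


class JobQueue:
    """Hypothetical asynchronous queue: submit() returns an id immediately."""

    def __init__(self, db_path):
        self.db_path = db_path
        self.jobs = queue.Queue()
        self.status = {}
        self.results = {}
        self.lock = threading.Lock()
        self.next_id = 1
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, sql):
        with self.lock:
            job_id = self.next_id
            self.next_id += 1
            self.status[job_id] = "queued"
        self.jobs.put((job_id, sql))
        return job_id

    def _run(self):
        # The connection is created and used only inside the worker thread.
        conn = sqlite3.connect(self.db_path)
        while True:
            job_id, sql = self.jobs.get()
            with self.lock:
                self.status[job_id] = "running"
            try:
                rows = conn.execute(sql).fetchall()
                conn.commit()
                outcome = "finished"
            except sqlite3.Error as exc:
                rows, outcome = str(exc), "failed"
            with self.lock:
                self.results[job_id] = rows
                self.status[job_id] = outcome
            self.jobs.task_done()

    def wait(self):
        self.jobs.join()  # block until every queued job has been processed


jobs = JobQueue(":memory:")
first = jobs.submit("CREATE TABLE result AS SELECT 1 AS objid UNION ALL SELECT 2")
second = jobs.submit("SELECT COUNT(*) FROM result")
bad = jobs.submit("SELECT * FROM missing_table")
jobs.wait()
print(jobs.status, jobs.results[second])
```

Writing the output of the first job into a table instead of returning it mirrors the slide's point that long-job storage is bounded by the size of MyDB, not by what can be streamed back to a browser.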

Page 11: Groups

Non-exclusive sets of CASJobs users
Share data
Keep more work at the data

SELECT *
FROM myGroup.otherUser.theirTable
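The three-part name in the query above (group, owner, table) suggests a simple access check. The sketch below shows one way such a check could work. The data structures GROUPS and SHARED, the function can_read, and the users alice and bob are all hypothetical; the slides do not describe how CASJobs enforces sharing.

```python
# Hypothetical membership: groups are non-exclusive, so alice is in two.
GROUPS = {
    "myGroup": {"alice", "otherUser"},
    "lensing": {"alice", "bob"},
}

# Hypothetical record of which tables have been shared with which group.
SHARED = {("myGroup", "otherUser", "theirTable")}


def can_read(user, reference):
    """Return True if user may read a table named group.owner.table."""
    group, owner, table = reference.split(".")
    members = GROUPS.get(group, set())
    return (
        user in members
        and owner in members
        and (group, owner, table) in SHARED
    )


print(can_read("alice", "myGroup.otherUser.theirTable"))
print(can_read("bob", "myGroup.otherUser.theirTable"))
```

Since the shared table is read in place, group members can join against it without anyone downloading or re-uploading a copy.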

Page 12: Hardware

Flexible configuration
1+ machine per context (non-exclusive)
1+ machine for MyDBs

Page 13: Interface

Web site
Web services

Page 14: Usage

> two million jobs
> 2200 users
Astro deployments
  Galaxy Evolution Explorer (GALEX)
  Palomar Quest
  Panoramic Survey Telescope and Rapid Response System (Pan-STARRS)[3]
Non-astro deployments
  Ameriflux
  Swiss Institute of Bioinformatics (ISB)

[Figure: Monthly CASJobs. Horizontal axis: dates from 8/29/2003 to 8/31/2008. Vertical axis: 0 to 250000.]