CASJobs: A Workflow Environment Designed for Large Scientific Catalogs

Post on 15-Jan-2016


CASJOBS: A WORKFLOW ENVIRONMENT DESIGNED FOR LARGE SCIENTIFIC CATALOGS

Nolan Li, Johns Hopkins University

What is CASJobs

Terabytes of scientific data
Web-based system
Data distribution
Server-side analysis
Optimize user work patterns
Server-side user storage and programmability

Sloan Digital Sky Survey (SDSS)

Astronomical survey
Images (FITS) - 15.7 TB
Other data products (masks, JPEG images, etc.) (DAS, FITS format) - 26.8 TB
Catalogs (CAS, SQL database) - 18 TB
Data is public
Delivery?

Database

Bandwidth is expensive!
10 terabytes is big!
So database it (SkyServer)
  Partial delivery
  Move work to data
Scalability
  Traffic++
  Complexity++
  Data++
So…
  Cap execution time
  Cap results
  Build something else

[Figure: Monthly CAS Usage. Series: Web Hits, SQL Queries. Axis ticks at 1.E+04, 1.E+05, 1.E+06, 1.E+07.]

CASJobs

Catalog Archive Server Jobs
Server-side user storage and programmability
  MyDB
Hardware abstraction and long-term query portability
  Contexts
Complete, automatic query logging
Scalable performance
Controlled asynchronous query execution
Data sharing
  Groups
http://casjobs.sdss.org/casjobs

MyDB

Server-side user database
Intermediate storage
Data import
User programmable

SELECT * FROM DR4 a WHERE a.objid = 38573498 OR a.objid = 92837451 OR a.objid = 20394833 OR a.objid = 90284723

SELECT * FROM DR4 a, MyDB.MyTable b WHERE a.objid = b.objid
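The contrast between the two queries above can be tried locally. The following is a minimal sketch, not CASJobs itself: it assumes SQLite (through Python's standard `sqlite3` module) as a stand-in for the server, and the table names `catalog` and `my_table`, the column names, and the coordinate values are all made up for illustration. Only the four object IDs come from the slide.

```python
import sqlite3

# In-memory stand-ins: "catalog" plays the role of a DR4 table and
# "my_table" plays the role of a user table held in MyDB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalog (objid INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL)"
)
conn.execute("CREATE TABLE my_table (objid INTEGER PRIMARY KEY)")

# Made-up catalog rows; the first four IDs are the ones from the slide.
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?)",
    [
        (38573498, 10.0, -1.0),
        (92837451, 11.0, -2.0),
        (20394833, 12.0, -3.0),
        (90284723, 13.0, -4.0),
        (55555555, 14.0, -5.0),
    ],
)

# "Data import": the user's object list is loaded into the server-side table once.
wanted = [38573498, 92837451, 20394833, 90284723]
conn.executemany("INSERT INTO my_table VALUES (?)", [(i,) for i in wanted])

# The query text stays the same size however many IDs were imported.
rows = conn.execute(
    "SELECT a.objid, a.ra_deg, a.dec_deg "
    "FROM catalog a JOIN my_table b ON a.objid = b.objid "
    "ORDER BY a.objid"
).fetchall()
matched_ids = [row[0] for row in rows]
print(matched_ids)
```

The point of the join form is that the ID list lives next to the data, so it can be reused by later queries instead of being pasted into each one.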

Logging

Automatically log all user queries
Resubmit old queries
Reconstruct database objects
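One way to picture how a complete query log allows database objects to be reconstructed is to replay the logged statements against an empty database. This is a rough sketch under stated assumptions: `LoggedSession` and `replay` are hypothetical names, SQLite stands in for the real server, and the table `bright` and its rows are invented. The slides do not describe how CASJobs stores or replays its log.

```python
import sqlite3


class LoggedSession:
    """Hypothetical session wrapper that records every statement it runs."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.log = []

    def execute(self, sql):
        # Log before running, so failed statements are recorded too.
        self.log.append(sql)
        return self.conn.execute(sql).fetchall()


def replay(statements):
    """Rebuild database objects by re-running logged statements in order."""
    fresh = LoggedSession()
    for sql in statements:
        fresh.execute(sql)
    return fresh


session = LoggedSession()
session.execute("CREATE TABLE bright (objid INTEGER, mag REAL)")
session.execute("INSERT INTO bright VALUES (1, 14.2), (2, 15.9)")
original = session.execute("SELECT objid, mag FROM bright ORDER BY objid")

# Replaying the first two logged statements recreates the table elsewhere.
rebuilt = replay(session.log[:2])
restored = rebuilt.execute("SELECT objid, mag FROM bright ORDER BY objid")
print(restored == original)
```

Resubmitting an old query is the same operation applied to a single log entry.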

Contexts

Databases are identified by their data, not their location

Queries are independent of hardware configuration

SELECT TOP 10 * FROM [server].[catalog].[user].MyTable

SELECT TOP 10 * FROM DR4.MyTable
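The idea that a query names a context rather than a machine can be sketched as a lookup step between the user's text and the physical location. Everything here is an assumption made for illustration: the `resolve` function, the registry dictionaries, and the host and catalog names are hypothetical, and a naive regular expression is used where a real system would need an SQL parser.

```python
import re


def resolve(sql, contexts):
    """Rewrite `Context.Table` to `<physical location>.Table` (naive sketch)."""

    def swap(match):
        context, table = match.group(1), match.group(2)
        location = contexts.get(context)
        if location is None:
            # Not a known context (for example an alias such as a.objid).
            return match.group(0)
        return f"{location}.{table}"

    return re.sub(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b", swap, sql)


USER_QUERY = "SELECT TOP 10 * FROM DR4.MyTable"

# Hypothetical registries before and after the data moves to new hardware.
physical_before = {"DR4": "[hostA].[catalogA].[dbo]"}
physical_after = {"DR4": "[hostB].[catalogB].[dbo]"}

print(resolve(USER_QUERY, physical_before))
print(resolve(USER_QUERY, physical_after))
```

The user's query text is identical in both calls; only the registry changes, which is what keeps old queries portable when the hardware configuration does.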

Quick Jobs

Executes right away
But not for very long
Restricted memory usage
For things like…
  How many objects?
  Table previews
  Preliminary queries
  System queries

Long Jobs

Asynchronous
Less restricted execution time
Storage capped by MyDB size
For things like…
  Heavy IO
  Heavy computation
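The split between quick and long jobs can be illustrated with a small local model. This is only a sketch under assumptions, not the CASJobs scheduler: SQLite stands in for the server, `run_quick` and `long_job_worker` are hypothetical names, the time limits are arbitrary, and the MyDB storage cap is not modelled. The quick path uses `sqlite3`'s progress handler to abort a statement once a deadline passes; the long path hands the statement to a background worker through a queue.

```python
import queue
import sqlite3
import threading
import time


def run_quick(conn, sql, time_limit_s):
    """Run immediately, but give up once the time limit has passed."""
    deadline = time.monotonic() + time_limit_s
    # SQLite calls the handler every 1000 VM instructions; nonzero aborts.
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() >= deadline else 0, 1000
    )
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError:
        # The abort surfaces as OperationalError; this sketch does not
        # distinguish it from other operational errors.
        return None
    finally:
        conn.set_progress_handler(None, 1)


def long_job_worker(jobs, results):
    """Drain the queue asynchronously; a None item stops the worker."""
    conn = sqlite3.connect(":memory:")  # the worker owns its own connection
    while True:
        item = jobs.get()
        if item is None:
            break
        job_id, sql = item
        results[job_id] = conn.execute(sql).fetchall()


# A deliberately heavy statement: count 200000 generated rows.
HEAVY = (
    "WITH RECURSIVE n(i) AS "
    "(SELECT 1 UNION ALL SELECT i + 1 FROM n WHERE i < 200000) "
    "SELECT count(*) FROM n"
)

quick_conn = sqlite3.connect(":memory:")
quick_ok = run_quick(quick_conn, "SELECT 1", time_limit_s=5.0)
quick_rejected = run_quick(quick_conn, HEAVY, time_limit_s=0.0)

jobs = queue.Queue()
results = {}
worker = threading.Thread(target=long_job_worker, args=(jobs, results))
worker.start()
jobs.put((1, HEAVY))  # submit and return at once; the result arrives later
jobs.put(None)
worker.join()
print(quick_ok, quick_rejected, results[1])
```

A zero-second limit is used for the rejected case only to make the abort deterministic; a real quick queue would presumably allow a short, nonzero window.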

Groups

Non-exclusive sets of CASJobs users
Share data
Keep more work at the data

SELECT * FROM myGroup.otherUser.theirTable

Hardware

Flexible configuration

1+ machine per context (non-exclusive)

1+ machine for MyDBs

Interface

Web Site
Web Services

Usage

> two million jobs
> 2200 users
Astro deployments
  Galaxy Evolution Explorer (GALEX)
  Palomar Quest
  Panoramic Survey Telescope and Rapid Response System (Pan-STARRS)[3]
Non-astro deployments
  Ameriflux
  Swiss Institute of Bioinformatics (ISB)

[Figure: Monthly CASJobs. Value axis ticks from 0 to 250000; date axis ticks from 8/29/2003 to 8/31/2008.]