PS1 PSPS Object Data Manager Design
PSPS Critical Design Review November 5-6, 2007
IfA
slide 2
Outline
ODM Overview
Critical Requirements Driving Design
Work Completed
Detailed Design
Spatial Querying [AS]
ODM Prototype [MN]
Hardware/Scalability [JV]
How Design Meets Requirements
WBS and Schedule
Issues/Risks
[AS] = Alex, [MN] = Maria, [JV] = Jan
slide 3
ODM Overview
The Object Data Manager will:
Provide a scalable data archive for the Pan-STARRS data products
Provide query access to the data for Pan-STARRS users
Provide detailed usage tracking and logging
slide 4
ODM Driving Requirements
Total size 100 TB:
• 1.5 × 10^11 P2 detections
• 8.3 × 10^10 P2 cumulative-sky (stack) detections
• 5.5 × 10^9 celestial objects
Nominal daily rate (divide by 3.5 × 365):
• P2 detections: 120 million/day
• Stack detections: 65 million/day
• Objects: 4.3 million/day
Cross-match requirement: 120 million / 12 hrs ≈ 2800/s
DB size requirement:
• 25 TB/yr
• ~100 TB by end of PS1 (3.5 yrs)
slide 5
Work completed so far
Built a prototype
Scoped and built prototype hardware
Generated simulated data
• 300M SDSS DR5 objects, 1.5B Galactic plane objects
Initial load done – created a 15 TB DB of simulated data
• Largest astronomical DB in existence today
Partitioned the data correctly using the Zones algorithm
Able to run simple queries on the distributed DB
Demonstrated critical steps of incremental loading
It is fast enough:
• Cross-match > 60k detections/sec
• Required rate is ~3k/sec
slide 6
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM; CasJobs for prototype)
slide 7
High-Level Organization
[Architecture diagram. Web Based Interface (WBI) → Query Manager (QM) → Data Storage (DS). The DS head database (PS1) holds the PartitionsMap, Objects, LnkToObj, and Meta tables plus partitioned views of the detections; slice servers P1…Pm hold the partitioned tables [Objects_p1…pm], [LnkToObj_p1…pm], and [Detections_p1…pm] with their own Meta tables, reached via linked servers. The Data Loading Pipeline (DLP) consists of a LoadAdmin server plus LoadSupport1…LoadSupportn servers, each holding objZoneIndx, orphans, Detections_l, and LnkToObj_l tables and a PartitionMap, also connected via linked servers; it is fed by the Data Transformation Layer (DX). Legend: database, full table, [partitioned table], output table, partitioned view.]
slide 8
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM; CasJobs for prototype)
slide 9
Data Transformation Layer (DX)
Based on SDSS sqlFits2CSV package
• LINUX/C++ application
• FITS reader driven off header files
Convert IPP FITS files to
• ASCII CSV format for ingest (initially)
• SQL Server native binary later (3x faster)
Follow the batch and ingest verification procedure described in the ICD
• 4-step batch verification
• Notification and handling of broken publication cycle
Deposit CSV or binary input files in directory structure
• Create “ready” file in each batch directory
Stage input data on the LINUX side as it comes in from IPP
slide 10
DX Subtasks
[DX subtask diagram:
Initialization Job – FITS schema, FITS reader, CSV converter, CSV writer
Batch Ingest – interface with IPP, naming convention, uncompress batch, read batch, verify batch
Batch Verification – verify manifest, verify FITS integrity, verify FITS content, verify FITS data, handle broken cycle
Batch Conversion – CSV converter, binary converter, “batch_ready”, interface with DLP]
slide 11
DX-DLP Interface
Directory structure on staging FS (LINUX):
• Separate directory for each JobID_BatchID
• Contains a “batch_ready” manifest file
– Name, #rows, and destination table of each file (see the example below)
• Contains one file per destination table in the ODM
– Objects, Detections, other tables
Creation of the “batch_ready” file is the signal to the loader to ingest the batch
Batch size and frequency of ingest cycle TBD
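For illustration, the manifest amounts to one line per file giving its name, row count, and destination table; a hypothetical example (the real layout is defined in the ICD, and the JobID/BatchID and counts here are invented):

    # batch_ready for Job42_Batch7 (hypothetical)
    Objects.csv        4300000   Objects
    P2PsfFits.csv    120000000   P2PsfFits
    P2ToObj.csv      120000000   P2ToObj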
slide 12
Detailed Design
Reuse SDSS software as much as possible Data Transformation Layer (DX) – Interface to IPP Data Loading Pipeline (DLP) Data Storage (DS)
• Schema and Test Queries• Database Management System• Scalable Data Architecture• Hardware
Query Manager (QM: CasJobs for prototype)
slide 13
Data Loading Pipeline (DLP)
sqlLoader – SDSS data loading pipeline
• Pseudo-automated workflow system
• Loads, validates and publishes data
– From CSV to SQL tables
• Maintains a log of every step of loading
• Managed from the Load Monitor Web interface
Has been used to load every SDSS data release
• EDR, DR1-6, ~15 TB of data altogether
• Most of it (since DR2) loaded incrementally
• Kept many data errors from getting into the database
– Duplicate ObjIDs (symptom of other problems)
– Data corruption (CSV format invaluable in catching this)
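At its core, the CSV-to-table step is SQL Server bulk loading; a minimal sketch for one file, with hypothetical path and table names (the pipeline wraps this in stored procedures with logging and error handling):

    -- Bulk-load one CSV batch file into a task-DB table (names illustrative).
    BULK INSERT TaskDB.dbo.P2PsfFits
    FROM '\\lxps01\staging\Job42_Batch7\P2PsfFits.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        TABLOCK         -- table lock enables a faster, minimally logged load
    );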
slide 14
sqlLoader Design
Existing functionality
• Shown for SDSS version
• Workflow, distributed loading, Load Monitor
New functionality
• Schema changes
• Workflow changes
• Incremental loading
slide 15
sqlLoader Workflow
Distributed design achieved with linked servers and SQL Server Agent
LOAD stage can be done in parallel by loading into temporary task databases
PUBLISH stage writes from task DBs to final DB
FINISH stage creates indices and auxiliary (derived) tables
[Workflow diagram: the LOAD stage comprises Export (EXP), Check CSV (CHK), Build Task DBs (BLD), Build SQL Schema (SQL), Validate (VAL), Backup (BCK), and Detach (DTC); the PUBLISH stage comprises Publish (PUB) and Cleanup (CLN); the FINISH stage (FIN) completes the pipeline.]
Loading pipeline is a system of VB and SQL scripts, stored procedures and functions
slide 16
Load Monitor Tasks Page
slide 17
Load Monitor Active Tasks
slide 18
Load Monitor Statistics Page
slide 19
Load Monitor – New Task(s)
slide 20
Data Validation
Tests for data integrity and consistency
Scrubs data and finds problems in upstream pipelines
Most of the validation can be performed within the individual task DB (in parallel)
[Validation diagram:
Test Uniqueness of Primary Keys – test the unique key in each table
Test Foreign Keys – test for consistency of keys that link tables
Test Cardinalities – test consistency of numbers of various quantities
Test HTM IDs – test the Hierarchical Triangular Mesh IDs used for spatial indexing
Test Link Table Consistency – ensure that links are consistent]
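Each of these checks reduces to a simple aggregate or anti-join that can run inside the task DB; two sketches with illustrative table and column names:

    -- Primary-key uniqueness: any objID occurring twice is an error.
    SELECT objID, COUNT(*) AS n
    FROM Objects
    GROUP BY objID
    HAVING COUNT(*) > 1;

    -- Foreign-key consistency: detections linking to a nonexistent object.
    SELECT COUNT(*) AS brokenLinks
    FROM P2ToObj AS p
    LEFT JOIN Objects AS o ON o.objID = p.objID
    WHERE o.objID IS NULL;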
slide 21
Distributed Loading
[Diagram: a Master (LoadAdmin) server holds the master schema, the Load Monitor, and the publish schema; Slave (LoadSupport) servers each run task DBs (TaskData) built from views of the master schema, loading Samba-mounted CSV/binary files in parallel; Publish Data and Finish steps then write from the task DBs into the final database.]
slide 22
Schema Changes
Schema in task and publish DBs is driven off a list of schema DDL files to execute (xschema.txt)
Requires replacing DDL files in the schema/sql directory and updating xschema.txt with their names
PS1 schema DDL files have already been built
Index definitions have also been created
Metadata tables will be automatically generated using metadata scripts already in the loader
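In other words, swapping in the PS1 schema is a file-list change; a hypothetical xschema.txt (file names invented for illustration):

    PS1Tables.sql
    PS1Indexes.sql
    PS1Views.sql
    PS1Procedures.sql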
slide 23
Workflow Changes
Cross-Match and Partition steps will be added to the workflow
Cross-match will match detections to objects
Partition will horizontally partition data, move it to slice servers, and build DPVs on the main server
[Diagram: the LOAD stage becomes Export → Check CSVs → Create Task DBs → Build SQL Schema → Validate → XMatch; the PUBLISH stage gains a Partition step.]
slide 24
Matching Detections with Objects
Algorithm described fully in prototype section
Stored procedures to cross-match detections will be part of the LOAD stage in the loader pipeline
Vertical partition of the Objects table kept on the load server for matching with detections
Zones cross-match algorithm used to do 1″ and 2″ matches (sketched below)
Detections with no matches saved in the Orphans table
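The zones algorithm turns the spherical match into an equality/range join that the query optimizer executes efficiently. A minimal sketch of the 1″ pass, assuming simplified table and column names and unit-vector coordinates (cx, cy, cz); the production procedures also widen the RA window by 1/cos(dec) and handle zone-boundary and RA-wraparound cases:

    -- Zones are fixed-height declination stripes:
    --   zoneID = floor((dec + 90) / zoneHeight)
    DECLARE @zoneHeight float, @radius float;
    SET @zoneHeight = 4.0 / 3600.0;   -- 4-arcsec zones (height is a tuning choice)
    SET @radius     = 1.0 / 3600.0;   -- 1-arcsec match radius, in degrees

    SELECT d.detectID, o.objID
    FROM   Detections  AS d
    JOIN   ObjZoneIndx AS o
      ON   o.zoneID BETWEEN d.zoneID - 1 AND d.zoneID + 1      -- adjacent zones
     AND   o.ra     BETWEEN d.ra - @radius AND d.ra + @radius  -- coarse RA cut
    WHERE  POWER(d.cx - o.cx, 2) + POWER(d.cy - o.cy, 2) + POWER(d.cz - o.cz, 2)
           < POWER(2 * SIN(RADIANS(@radius) / 2), 2);          -- exact chord test

Clustering both tables on (zoneID, ra) turns this join into a sequential scan per zone, which is what makes bulk cross-match rates like the prototype's possible.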
slide 25
XMatch and Partition Data Flow
[Data-flow diagram: detections enter the Loadsupport DB (LoadDetections → Detections_In) and are cross-matched against ObjZoneIndx (XMatch → LinkToObj_In), with unmatched detections going to Orphans; PullChunk produces Detections_chunk and LinkToObj_chunk, which MergePartitions merges into the per-slice tables Detections_m and LinkToObj_m on slice Pm; UpdateObjects refreshes Objects_m, and Pull/PartitionSwitch/Partition move Objects_m and LinkToObj_m into the Objects and LnkToObj tables of the PS1 head database.]
slide 26
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM; CasJobs for prototype)
slide 27
Data Storage – Schema
slide 28
PS1 Table Sizes Spreadsheet
Assumptions: Stars 5.00E+09 / 1.51E+11; Galaxies 5.00E+08 / 3.675E+10; total objects 5.50E+09; P2 detections per year 4.30E+10. Release size fractions of full PS1: Prototype = 0.3 × DR1, DR1 = 0.29, DR2 = 0.57, DR3 = 0.86, DR4 = 1.00.

tablename            grows  cols  bytes/row  rows       size (TB)  FG  frac
AltModels              0      7     1547     10         1.5E-08     1   1
CameraConfig           0      5      287     30         8.6E-09     1   1
FileGroupMap           0      4     4335     100        4.3E-07     1   1
IndexMap               0      7     2301     100        2.3E-07     1   1
Objects                0     88      420     5.50E+09   2.31        1   0.33
ObjZoneIndx            0      7       63     5.50E+09   0.35        1   0
PartitionMap           0      3     4111     100        4.1E-07     1   1
PhotoCal               0     10      151     1000       1.5E-07     1   1
PhotozRecipes          0      2      267     10         2.7E-09     1   1
SkyCells               0      2       10     50000      5.0E-07     1   1
Surveys                0      2      267     30         8.0E-09     1   1
DropP2ToObj            1      4       39     4.00E+06   1.6E-04     1   0.33
DropStackToObj         1      4       39     4.00E+06   1.6E-04     1   0.33
P2AltFits              1     13       71     1.51E+10   1.07        0   0.33
P2FrameMeta            1     18      343     1.05E+06   3.6E-04     1   1
P2ImageMeta            1     64     2870     6.72E+07   0.19        1   1
P2PsfFits              1     34      183     1.51E+11   27.54       0   0.33
P2ToObj                1      3       31     1.51E+11   4.67        1   0.33
P2ToStack              1      2       15     1.51E+11   2.26        0   0.33
StackDeltaAltFits      1     13       71     3.68E+09   0.26        0   0.33
StackHiSigDeltas       1     32      167     3.68E+10   6.14        0   0.33
StackLowSigDelta       1      2     5000     1.65E+06   0.008       0   0.33
StackMeta              1     49     1551     7.00E+05   0.001       0   0.33
StackModelFits         1    131      535     7.50E+09   4.01        0   0.33
StackPsfFits           1     44      215     8.25E+10   17.74       0   0.33
StackToObj             1      4       39     8.25E+10   3.22        1   0.33
StationaryTransient    1      2       23     5.00E+08   0.012       1   0.33

Totals (TB): tables 69.77; indices (+20%) 13.95; grand total 83.72.
Totals by release (TB): Prototype 7.86; DR1 26.20; DR2 49.21; DR3 72.22; DR4 83.72. Per-table sizes at each release follow the fractions above (non-growing tables stay constant after DR1).

grows: 0 means the table size is essentially the same for all data releases; 1 means the table size will grow.
FG (primary filegroup): 0 means full table; 1 means the table is partitioned and distributed across the cluster.
frac: fraction of the table contained on each partition.
Note: these estimates are for the whole of PS1, assuming 3.5 years; 7 bytes are added to each row for overhead, as suggested by Alex.
slide 29
PS1 Table Sizes - All Servers
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            4.63     4.63     4.61     4.59
StackPsfFits       5.08    10.16    15.20    17.76
StackToObj         1.84     3.68     5.56     6.46
StackModelFits     1.16     2.32     3.40     3.96
P2PsfFits          7.88    15.76    23.60    27.60
P2ToObj            2.65     5.31     8.00     9.35
Other Tables       3.41     6.94    10.52    12.67
Indexes (+20%)     5.33     9.76    14.18    16.48
Total             31.98    58.56    85.07    98.87
Sizes are in TB
slide 30
Data Storage – Test Queries
Drawn from several sources
• Initial set of SDSS 20 queries
• SDSS SkyServer Sample Queries
• Queries from PS scientists (Monet, Howell, Kaiser, Heasley)
Two objectives
• Find potential holes/issues in schema
• Serve as test queries
– Test DBMS integrity
– Test DBMS performance
Loaded into CasJobs (Query Manager) as sample queries for the prototype
slide 31
Data Storage – DBMS
Microsoft SQL Server 2005
• Relational DBMS with excellent query optimizer
Plus
• Spherical/HTM (C# library + SQL glue)
– Spatial index (Hierarchical Triangular Mesh)
• Zones (SQL library)
– Alternate spatial decomposition with dec zones
• Many stored procedures and functions
– From coordinate conversions to neighbor search functions
• Self-extracting documentation (metadata) and diagnostics
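The neighbor search functions let users express cone searches directly in SQL; a sketch in the SkyServer style, assuming the SDSS function fGetNearbyObjEq (ra and dec in degrees, radius in arcminutes) is carried over to the PS1 schema:

    -- All objects within 1 arcminute of (ra, dec) = (180.0, 0.0).
    SELECT o.objID, o.ra, o.dec, n.distance
    FROM dbo.fGetNearbyObjEq(180.0, 0.0, 1.0) AS n
    JOIN Objects AS o ON o.objID = n.objID;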
slide 32
Documentation and Diagnostics
slide 33
Data Storage – Scalable Architecture
Monolithic database design (à la SDSS) will not do it
SQL Server does not have a cluster implementation
• Do it by hand
Partitions vs slices
• Partitions are file-groups on the same server
– Parallelize disk accesses on the same machine
• Slices are data partitions on separate servers
• We use both!
Additional slices can be added for scale-out
For PS1, use SQL Server Distributed Partitioned Views (DPVs)
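Filegroup partitioning within one server maps ranges of a key to different disk volumes; a minimal SQL Server 2005 sketch, with hypothetical zoneID boundaries and assuming filegroups fg1–fg4 already exist:

    -- Map declination-zone ranges to four filegroups on the same server.
    CREATE PARTITION FUNCTION pfZone (int)
    AS RANGE RIGHT FOR VALUES (20000, 40000, 60000);   -- illustrative cuts

    CREATE PARTITION SCHEME psZone
    AS PARTITION pfZone TO (fg1, fg2, fg3, fg4);       -- one filegroup per range

    CREATE TABLE Detections_p (
        detectID  bigint NOT NULL,
        zoneID    int    NOT NULL,
        ra        float  NOT NULL,
        dec       float  NOT NULL,
        PRIMARY KEY (zoneID, detectID)
    ) ON psZone(zoneID);   -- rows are placed by zoneID range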
slide 34
Distributed Partitioned Views
Difference between DPVs and file-group partitioning• FG on same database• DPVs on separate DBs• FGs are for scale-up• DPVs are for scale-out
Main server has a view of a partitioned table that includes remote partitions (we call them slices to distinguish them from FG partitions)
Accomplished with SQL Server’s linked server technology
NOT truly parallel, though
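Concretely, a DPV is a UNION ALL view over identically structured member tables reached through linked servers; a schematic sketch assuming slice servers S1–S3 are registered as linked servers and hold the per-slice detection tables:

    -- Head-node view spanning the remote detection slices.
    -- CHECK constraints on the partitioning column of each member table
    -- let the optimizer route a query to only the relevant slice.
    CREATE VIEW Detections AS
        SELECT * FROM S1.PS1.dbo.Detections_S1
        UNION ALL
        SELECT * FROM S2.PS1.dbo.Detections_S2
        UNION ALL
        SELECT * FROM S3.PS1.dbo.Detections_S3;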
slide 35
Scalable Data Architecture
Shared-nothing architecture
Detections split across cluster
Objects replicated on Head and Slice DBs
DPVs of Detections tables on the Head node DB
Queries on Objects stay on head node
[Diagram: the Head node holds the full Objects table, the replicated Objects_S1–Objects_S3 copies, and the Detections DPV over Detections_S1–Detections_S3; slices S1, S2, S3 each hold their local Detections and Objects tables.]
Queries on detections use only local data on slices
slide 36
Hardware - Prototype
[Diagram: prototype cluster and storage layout.
Server naming convention: LX = Linux, L = load server, S/Head = DB server, M = MyDB server, W = web server; PS0x = 4-core, PS1x = 8-core.
Servers: LXPS01 (Linux staging, 10 TB, RAID5), L1PS13 and L2/MPS05 (loading, 9 TB, RAID10), HeadPS11 (8-core head DB), S1PS12 (8-core slice), S2PS03 and S3PS04 (4-core slices), WPS02 (4-core web server, MyDB); 39 TB total DB space on RAID10.
Storage racks: 10A = 10 × [13 × 750 GB], 3B = 3 × [12 × 500 GB]; disk/rack configurations 14D/3.5W and 12D/4W.]
slide 37
Hardware – PS1
[Diagram: ping-pong configuration. Each database exists as three copies: a Live copy serving queries, an Offline copy receiving ingest, and a Spare copy. Queries run against the live copy while ingest and replication proceed on the offline copy; when replication finishes, the copies swap roles.]
Ping-pong configuration to maintain high availability and query performance
2 copies of each slice and of the main (head) node database on fast hardware (hot spares)
3rd spare copy on slow hardware (can be just disk)
Updates/ingest on the offline copy, then switch copies when ingest and replication are finished
Synchronize second copy while first copy is online
Both copies live when no ingest
3× the basic configuration for PS1
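One lightweight way to realize the switch is to reach the live copy through a synonym and repoint it when ingest completes; a hypothetical sketch (the database names PS1_A and PS1_B are invented for illustration):

    -- Flip the query endpoint from copy A to copy B after ingest.
    DROP SYNONYM dbo.Objects_live;
    CREATE SYNONYM dbo.Objects_live FOR PS1_B.dbo.Objects;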
slide 38
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM; CasJobs for prototype)
slide 39
Query Manager
Based on SDSS CasJobs
Configured to work with the distributed database and DPVs
Direct links (contexts) to slices can be added later if necessary
Segregates quick queries from long ones
Saves query results server-side in MyDB
Gives users a powerful query workbench
Can be scaled out to meet any query load
PS1 Sample Queries available to users
PS1 Prototype QM demo
slide 40
ODM Prototype Components
Data Loading Pipeline
Data Storage
CasJobs
• Query Manager (QM)
• Web Based Interface (WBI)
Testing
slide 41
Spatial Queries (Alex)
slide 42
Prototype (Maria)
slide 43
Hardware/Scalability (Jan)
slide 44
How Design Meets Requirements
Cross-matching detections with objects
• Zone cross-match part of loading pipeline
• Already exceeded requirement with prototype
Query performance
• Ping-pong configuration for querying during ingest
• Spatial indexing and distributed queries
• Query Manager can be scaled out as necessary
Scalability
• Shared-nothing architecture
• Scale out as needed
• Beyond PS1 we will need truly parallel query plans
slide 45
WBS/Development Tasks
Refine Prototype/Schema
Staging/Transformation
Initial Load
Load/Resolve Detections
Resolve/Synchronize Objects
Create Snapshot
Replication Module
Redistribute Data
Query Processing
• Workflow Systems
• Logging
• Data Scrubbing
• SSIS (?) + C#
• QM/Logging
Hardware
Testing
Documentation
Per-task effort (PM, in slide order): 2, 3, 1, 3, 3, 1, 2, 2, 2, 2, 4, 4, 4, 2
Total Effort: 35 PM
Delivery: 9/2008
slide 46
Personnel Available
2 new hires (SW Engineers) 100%
Maria 80%
Ani 20%
Jan 10%
Alainna 15%
Nolan Li 25%
Sam Carliles 25%
George Fekete 5%
Laszlo Dobos 50% (for 6 months)
slide 47
Issues/Risks
Versioning• Do we need to preserve snapshots of monthly
versions?• How will users reproduce queries on subsequent
versions?• Is it ok that a new version of the sky replaces the
previous one every month? Backup/recovery
• Will we need 3 local copies rather than 2 for safety• Is restoring from offsite copy feasible?
Handoff to IfA beyond scope of WBS shown• This will involve several PMs
Mahalo!
slide 49
Query Manager
[Screenshot callouts:
Context that query is executed in
MyDB table that query results go into
Name that this query job is given
Check query syntax
Get graphical query plan
Run query in quick (1 minute) mode
Submit query to long (8-hour) queue
Query buffer
Load one of the sample queries into query buffer]
slide 50
Query Manager
[Screenshot callouts: stored procedure arguments; SQL code for stored procedure]
slide 51
Query Manager
[Screenshot callouts:
MyDB context is the default, but other contexts can be selected
The space used and total space available
Multiple tables can be selected and dropped at once
Table list can be sorted by name, size, type
User can browse DB views, tables, functions and procedures]
slide 52
Query Manager
[Screenshot callout: the query that created this table]
slide 53
Query Manager
[Screenshot callouts: search radius; table to hold results; context to run search on]