The Pipeline Processing Framework LSST Applications Meeting IPAC Feb. 19, 2008 Raymond Plante National Center for Supercomputing Applications

The Pipeline Processing Framework

LSST Applications Meeting, IPAC

Feb. 19, 2008

Raymond Plante, National Center for Supercomputing Applications

LSST Applications Meeting

February 19-20, 2008

Overview

• Pipeline Framework provides
  – a container for hosting science algorithms
  – a mechanism for applying algorithm in parallel
• Data-Parallel Processing Model
  – algorithm implemented as “stage” of the pipeline
  – stage can have optional serial sections
  – parallel section applied to one data-parallel unit of data
    • one CCD amplifier
    • one section of sky
  – algorithm implementation usually avoids doing I/O
    • I/O handled in separate steps
    • stage is handed data it is supposed to work on
    • exception: database access

Pipeline Concepts

• Pipeline = a sequence of processing Stages
• Each stage can be distributed across multiple processors.
  – Each stage starts and ends with synchronized serial steps
• Slice = Parts of the stages working on the same portion of data.
  – Can reside in one address space on a single machine

DC2 Pipeline Harness

[Figure: a Pipeline and several Slices, each drawn as a chain of Stages connected by Queues. The Pipeline is labelled "Serial processing"; the Slices are labelled "Parallel processing".]

• Pipeline Process
  – executes serial processing
  – controls the parallel slice workers
• Slice Worker Processes
  – processes one data-parallel portion of the data (e.g. a CCD)

Pipeline Execution

• Pipeline Harness manages parallel processing on HPC platforms
  – Message Passing Interface
    • MPI-2 functionality via MPICH2
    • Explicit process spawning, control
  – Coordination of Serial & Parallel Processing
    • Pipeline is a sequence of Stages
    • “Slices” serve as data parallel worker threads
    • Pipeline manager instructs Slices in execution of Stages
    • Pipeline <=> Slices communicate via MPI
• Pipeline Harness interface hides complexity
  – Application Stage developers implement Stage API
    • process(): parallel processing
    • preprocess(), postprocess(): serial processing
  – Python as Stage “glue”
    • Stage developer writes algorithm code in C++
    • Python interface is generated
    • Stitches algorithm code together to create a Stage using Python
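
As a rough illustration of the Stage API named above, the following Python sketch shows a stage with serial preprocess() and postprocess() steps around a parallel process() step. Only those three method names come from the slides. The Stage base class, the SubtractBackgroundStage example, the run_stage driver, and the plain loop standing in for MPI slices are all assumptions made for illustration, not the harness's actual interface.

```python
class Stage:
    """Hypothetical base class; only the three method names come from the slides."""

    def preprocess(self, units):
        """Serial step run once before the parallel section."""
        return units

    def process(self, unit):
        """Parallel step applied to one data-parallel unit (e.g. one CCD)."""
        raise NotImplementedError

    def postprocess(self, results):
        """Serial step run once after every slice has finished."""
        return results


class SubtractBackgroundStage(Stage):
    """Toy stage: subtract a per-unit median from a list of pixel values."""

    def process(self, unit):
        ordered = sorted(unit)
        median = ordered[len(ordered) // 2]
        return [value - median for value in unit]

    def postprocess(self, results):
        # Serial summary computed after all slices are done.
        return {"units": len(results), "results": results}


def run_stage(stage, units):
    """Stand-in for the harness: a plain loop plays the role of the slices."""
    prepared = stage.preprocess(units)
    per_slice = [stage.process(unit) for unit in prepared]  # would run in parallel
    return stage.postprocess(per_slice)


summary = run_stage(SubtractBackgroundStage(), [[3, 1, 2], [10, 30, 20]])
```

In a real run, each call to process() would presumably execute in its own slice, with the serial steps executed by the pipeline process.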

Pipeline Dataflow

• Data flows through Stages via Queues
• A stage can add data products to its output Queue.
• Products can be persisted at any point in the chain.

[Figure: a Pipeline Manager and its Pipeline, drawn as Stages linked by Queues, with New Input Data entering and Output Products leaving.]
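
The dataflow on this slide can be sketched with ordinary Python queues. This is a minimal single-process illustration: the function names (detect, measure, run_pipeline) and the dictionary used as a bundle of data products are hypothetical and do not reflect the harness's real Queue classes.

```python
from queue import Queue


def detect(products):
    """Toy stage: adds a 'detections' product next to the input image."""
    products["detections"] = [p for p in products["image"] if p > 5]
    return products


def measure(products):
    """Toy stage: adds a 'count' product derived from the detections."""
    products["count"] = len(products["detections"])
    return products


def run_pipeline(stages, new_input):
    """Pass one bundle of data products through the stages, queue by queue."""
    queues = [Queue() for _ in range(len(stages) + 1)]
    queues[0].put(new_input)
    for stage, inbox, outbox in zip(stages, queues, queues[1:]):
        products = inbox.get()
        outbox.put(stage(products))  # a stage adds products to its output queue
    return queues[-1].get()


output = run_pipeline([detect, measure], {"image": [1, 7, 9, 3]})
```

Because every stage only reads from its input queue and writes to its output queue, the bundle could be persisted between any two stages without the stages knowing about it.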

Coupling Pipelines via the Event Framework

[Figure: three pipelines, each with its own Pipeline Manager and chain of Stages and Queues (Image/Detection Pipeline, Object Association Pipeline, Moving Objects Pipeline), coupled through the Event System by the events “New Detections available” and “New Moving Object Candidates Available”.]
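
One way to picture this coupling is a small publish/subscribe hub. The sketch below is an in-process stand-in, not the LSST event system: the EventSystem class and its methods are hypothetical, and the assignment of publishers and subscribers (detection publishes, association consumes and then announces candidates, moving objects consumes those) is an assumption read off the figure labels.

```python
from collections import defaultdict


class EventSystem:
    """Hypothetical in-process publish/subscribe hub."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        for callback in self._subscribers[topic]:
            callback(payload)


events = EventSystem()
log = []


def association_pipeline(detections):
    # Assumed consumer of detections; it then announces moving-object candidates.
    log.append(("association", detections))
    candidates = [d for d in detections if d["moved"]]
    events.publish("New Moving Object Candidates Available", candidates)


def moving_objects_pipeline(candidates):
    log.append(("moving objects", candidates))


events.subscribe("New Detections available", association_pipeline)
events.subscribe("New Moving Object Candidates Available", moving_objects_pipeline)

# The image/detection pipeline would publish this when its output is ready.
events.publish("New Detections available",
               [{"id": 1, "moved": False}, {"id": 2, "moved": True}])
```

The pipelines never call each other directly; each one only knows the event topics it listens to and emits.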

Tools for Stage Implementations

• Configuring a stage with Policies
  – Policy: a set of data properties as name-value pairs
  – Provided to stage implementation when stage is configured
• Recording messages: Logging
  – Messages have an associated “loudness”
    • “DEBUG” = soft; “WARN” = louder
  – Messages sent to a named topic
    • topics have an associated loudness threshold
    • messages louder than the threshold will be recorded
  – Messages can have data properties associated with them
    • all messages automatically timestamped
      – can be used to time sub-portions of implementation
    • caller can attach other arbitrary properties
  – Framework handles destination of messages
    • outside of pipeline harness, messages printed to screen
    • inside a parallel pipeline, messages sent out through event system, recorded in database
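
The policy and logging ideas above can be sketched in a few lines of Python. Everything named here is hypothetical: the Log class, the numeric loudness values, the default threshold of 0, and the policy keys are illustrative choices, not the framework's API. The sketch also assumes that a message exactly at the threshold is recorded, which the slide does not specify.

```python
import time

# Hypothetical loudness scale; the slides only say DEBUG is softer than WARN.
DEBUG, WARN = -10, 10


class Log:
    """Hypothetical topic-based logger with a per-topic loudness threshold."""

    def __init__(self):
        self.thresholds = {}  # topic name -> minimum loudness recorded
        self.records = []     # stand-in for the screen or the database

    def set_threshold(self, topic, loudness):
        self.thresholds[topic] = loudness

    def log(self, topic, loudness, message, **properties):
        # Assumption: a message exactly at the threshold is also recorded.
        if loudness < self.thresholds.get(topic, 0):
            return
        record = {"topic": topic, "loudness": loudness, "message": message,
                  "timestamp": time.time()}  # every message is timestamped
        record.update(properties)            # caller-supplied properties
        self.records.append(record)


# A policy as a set of name-value pairs handed to the stage at configuration.
policy = {"detectionThreshold": 5.0, "logTopic": "detection", "logThreshold": 0}

logger = Log()
logger.set_threshold(policy["logTopic"], policy["logThreshold"])
logger.log("detection", DEBUG, "entering loop")  # too soft, dropped
logger.log("detection", WARN, "few sources found", nSources=3)
```

Since each record carries a timestamp, the difference between two records gives a rough timing of the code that ran between them.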

Possible Variations

• Fine control over inter-slice communication
  – normal communication between master and slices
  – stage could have direct access to other slices via MPI commands
• Custom pipeline
  – managed by pipeline orchestration layer for monitoring
  – external communication via events

Building the Stack

• Basic Installation instructions: http://dev.lsstcorp.org/pkgs/GettingStarted.html

    setenv LSST_HOME $PWD/stack
    mkdir $LSST_HOME; cd $LSST_HOME
    curl -o newinstall.sh http://dev.lsstcorp.org/pkgs/newinstall.sh
    sh ./newinstall.sh
    source loadLSST.sh
    lsstpkg fetch LSSTPipe

• Best supported platform:
  – Linux, gcc v3.4.6
• Alternatives to building the stack
  – Logging into LSST cluster @ NCSA
  – Running Virtual Machine with stack pre-installed

Working with code from the repository

• GettingStarted document contains SVN survival guide
• LSST software organized into packages
  – packages are separately versioned
  – usually one person is in charge of tracking each package's state
• Building from SVN

    setenv LSST_DC2 svn+ssh://svn.lsstcorp.org/DC2
    svn co $LSST_DC2/fw/trunk fw-trunk   # check out the package
    cd fw-trunk
    setup -r .                           # load required environment
    scons                                # build it in place
    scons install                        # install it into the stack

Testing Your Code

• Outside the framework
  – Create classes that can apply your algorithm to arbitrary data
  – Classes should not depend on pipeline framework
  – Create unit tests (in tests subdir.) or examples (in examples subdir.) that exercise the class
  – testing can occur in C++, Python or both
• Inside the framework
  – Create python implementation of a Stage class
  – Create a policy file for configuring stage
  – Create a simple pipeline using policy files
  – Use the launchDC2.py script from the dc2pipe package to run
    • provide identifying name for run (run ID) as input
• Process will likely change somewhat for DC3
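
The "outside the framework" advice might look like the following in Python: an algorithm class with no pipeline dependency, plus unit tests of the sort that would sit in the package's tests subdirectory. BackgroundEstimator and its median-based estimate are made-up examples, not LSST code.

```python
import unittest


class BackgroundEstimator:
    """Hypothetical algorithm class with no dependency on the pipeline framework."""

    def estimate(self, pixels):
        """Return the median of a non-empty sequence of pixel values."""
        if not pixels:
            raise ValueError("no pixels supplied")
        ordered = sorted(pixels)
        middle = len(ordered) // 2
        if len(ordered) % 2:
            return ordered[middle]
        return (ordered[middle - 1] + ordered[middle]) / 2


class BackgroundEstimatorTest(unittest.TestCase):
    """The kind of test that would live in the package's tests subdirectory."""

    def test_odd_length(self):
        self.assertEqual(BackgroundEstimator().estimate([5, 1, 3]), 3)

    def test_even_length(self):
        self.assertEqual(BackgroundEstimator().estimate([4, 1, 3, 2]), 2.5)

    def test_empty_input(self):
        with self.assertRaises(ValueError):
            BackgroundEstimator().estimate([])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(BackgroundEstimatorTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the algorithm free of framework imports means the same class can later be wrapped by a Stage without changing the tests.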

Running on the cluster

• The LSST cluster @ NCSA
  – up-to-date software stack
  – input data organized and ready for use
  – standard pipelines configured to write output in organized tree
    • /lsst/DC2root contains a directory for each run ID
    • each run ID has a subdirectory named for each pipeline that was run
    • each pipeline directory contains...
      – input: the input data processed
      – output: the output image products
      – work: the pipeline's working directory; contains copies of all input policy files and a log capturing stdout and stderr from the master process
• output database products
  – saved in MySQL database on lsst10
  – database named after run ID
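
The directory convention described above can be written down as a small helper. This is only a sketch of the layout as stated on the slide: the pipeline_dirs function is hypothetical, and the run ID and pipeline name in the example are invented.

```python
from pathlib import PurePosixPath

DC2_ROOT = PurePosixPath("/lsst/DC2root")  # root directory named on the slide


def pipeline_dirs(run_id, pipeline):
    """Return the input, output and work directories for one pipeline of a run."""
    base = DC2_ROOT / run_id / pipeline
    return {name: base / name for name in ("input", "output", "work")}


# "test0042" and "detection" are made-up names used only for illustration.
dirs = pipeline_dirs("test0042", "detection")
```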

Dealing with bugs

• Bugs, issues and milestones are tracked using trac
  – life as a trac “ticket”
• Life Cycle:
  – ticket is created and assigned to a developer
  – developer creates copy of relevant package under the package's tickets subdirectory in svn. Example: ticket #350 for change to fw

      svn copy -m "addressing #350" $LSST_DC2/fw/trunk $LSST_DC2/fw/tickets/350

  – changes are implemented, tested, checked into ticket branch
  – request code review: checked for compliance against coding standards
  – reviewed code merged into trunk
• some refinement of process is expected