Association Pipeline Take difference sources, moving object predictions for a visit Compare them to...


Page 1: Association Pipeline Take difference sources, moving object predictions for a visit Compare them to current knowledge of the sky –objects, historical difference.

Association Pipeline

• Take difference sources, moving object predictions for a visit

• Compare them to current knowledge of the sky

– objects, historical difference sources

• Use results to

– determine if something interesting happened (send out alerts)

– improve knowledge of the sky (create new objects and update existing ones)

Page 2

Outline

• Spatial Matching

– algorithm & implementation

• Architecture

• DC3 Discussion

Page 3

2 Spatial Matches

• difference sources Dv vs. objects Ov

– d ∈ Dv and o ∈ Ov match iff distance(d, o) < Rv

– avoid alerting on known variables, capture variability of known objects

• mops predictions Pv vs. difference sources Dv

– m ∈ Pv and d ∈ Dv match iff d falls within positional error ellipse of m

– avoid alerting on known movers, don’t create entries for moving objects in the stationary object catalog

– only match against difference sources that did not match a known variable object
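The two match rules and their ordering can be written down as a brute-force reference. This is only a sketch under stated assumptions, not the pipeline's code: positions are (ra, dec) in degrees, objects carry a hypothetical `is_variable` flag, predictions are hypothetical tuples `(ra, dec, semi_major, semi_minor, pos_angle)`, the ellipse test uses a small-angle approximation, and ra wrap-around at 0/360 is ignored. The function names are invented for illustration.

```python
import math

def ang_dist(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def in_ellipse(src, pred):
    """True if src (ra, dec) lies inside the error ellipse of pred.

    pred = (ra, dec, semi_major, semi_minor, pos_angle), all in degrees;
    pos_angle is measured from north through east (an assumption).
    """
    ra, dec, a, b, pa = pred
    dx = (src[0] - ra) * math.cos(math.radians(dec))  # offset towards east
    dy = src[1] - dec                                 # offset towards north
    t = math.radians(pa)
    u = dx * math.sin(t) + dy * math.cos(t)  # component along the major axis
    v = dx * math.cos(t) - dy * math.sin(t)  # component along the minor axis
    return (u / a) ** 2 + (v / b) ** 2 <= 1.0

def associate(sources, objects, predictions, radius):
    """objects are (ra, dec, is_variable); returns index-based match lists."""
    # Match 1: every difference source against every object.
    obj_matches = [[j for j, o in enumerate(objects)
                    if ang_dist(s[0], s[1], o[0], o[1]) < radius]
                   for s in sources]
    # Match 2 only sees sources that did not match a known variable object.
    eligible = [i for i, m in enumerate(obj_matches)
                if not any(objects[j][2] for j in m)]
    mops_matches = [[i for i in eligible if in_ellipse(sources[i], p)]
                    for p in predictions]
    movers = {i for m in mops_matches for i in m}
    # New objects: sources matching neither an object nor a predicted mover.
    new_objects = [i for i, m in enumerate(obj_matches)
                   if not m and i not in movers]
    return obj_matches, mops_matches, new_objects
```

A quadratic scan like this is only useful as a correctness check for the indexed matching described on the following slides.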

Page 4

Algorithm 1

• For now assume: objects Ov, difference sources Dv, predictions Pv for FOV are in memory, focus on matching only

• Create ZoneIndex (ZoneTypes.h) for Ov. Goal: support fast proximity searches

– bucket sort positions by declination: obtain a set of constant height zones

– within each zone, sort positions by right ascension
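The index construction above can be sketched in a few lines. This is an illustration of the idea, not the ZoneIndex class from ZoneTypes.h: the function name, the dictionary layout and the `(ra, dec, idx)` entry shape are assumptions made for the example.

```python
import math
from collections import defaultdict

def build_zone_index(positions, zone_height):
    """Bucket (ra, dec) positions in degrees into declination zones.

    Returns {zone number: [(ra, dec, idx), ...]} with each zone's entries
    sorted by right ascension; idx is the position's index in the input.
    """
    zones = defaultdict(list)
    for idx, (ra, dec) in enumerate(positions):
        # Bucket sort step: zones are constant-height bands in declination.
        zone = int(math.floor((dec + 90.0) / zone_height))
        zones[zone].append((ra, dec, idx))
    for entries in zones.values():
        entries.sort()  # tuples sort on ra first
    return dict(zones)
```

Storing entries as sorted lists per zone is what later allows a binary search on right ascension.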

Page 5

Algorithm 2

• Choosing zone height ≥ Rv means that for p in zone Z, only need to look at zones immediately above/below to find potential matches

• Furthermore, for each Z can compute a width α s.t. any 2 points within Z separated by more than α in ra are separated by distance ≥ Rv

• Given point p=(ra,dec), look at entries in range [ra - α, ra + α] for 3 zones. Zone entries are sorted on right ascension - use binary search to locate candidates.

• Finally, compute distance between p and small set of candidates -- done! Well, except that…
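A sketch of the search step follows. The ra half-width uses the standard small-circle bound sin(width) = sin(R) / cos(dec), evaluated at the largest |dec| in the zone; whether ZoneTypes.h computes it this way is an assumption. The zone index layout (`{zone: sorted [(ra, dec, idx), ...]}`) and the function names are hypothetical, and ra wrap-around at 0/360 is not handled.

```python
import math
from bisect import bisect_left, bisect_right

def ang_dist(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def ra_half_width(radius, max_abs_dec):
    """ra half-width (degrees) bounding any circle of `radius` degrees
    centered at a declination with |dec| <= max_abs_dec."""
    if max_abs_dec + radius >= 90.0:
        return 180.0  # the circle can reach a pole: all ra values qualify
    s = math.sin(math.radians(radius)) / math.cos(math.radians(max_abs_dec))
    return math.degrees(math.asin(s))

def find_matches(p, zone_index, zone_height, radius):
    """Return idx of all indexed entries within `radius` of p = (ra, dec).

    Requires zone_height >= radius so that 3 zones suffice.
    """
    ra, dec = p
    zone = int(math.floor((dec + 90.0) / zone_height))
    dec_lo = zone * zone_height - 90.0
    alpha = ra_half_width(radius, max(abs(dec_lo), abs(dec_lo + zone_height)))
    hits = []
    for z in (zone - 1, zone, zone + 1):
        entries = zone_index.get(z, [])
        # Binary search for the ra window [ra - alpha, ra + alpha].
        lo = bisect_left(entries, (ra - alpha,))
        hi = bisect_right(entries, (ra + alpha, math.inf))
        for e_ra, e_dec, idx in entries[lo:hi]:
            if ang_dist(ra, dec, e_ra, e_dec) < radius:  # exact test
                hits.append(idx)
    return hits
```

The window only prunes candidates; the final distance test decides the match, so a slightly generous width costs time but never correctness.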

Page 6

Algorithm 3

• If one picks difference sources at random, cache miss rate will be high (4.0e6 objects, index > 100MB)

• So, create ZoneIndex for difference sources as well (re-used later)

• Process difference sources one zone at a time, in ra order

• Can process zones in parallel

• Within a zone, use linear search to find entries in range [ra + ε - α, ra + ε + α] from those in range [ra - α, ra + α] (α: the zone's ra search width; ε: the ra step to the next difference source)
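The linear search can be pictured as a window over the sorted object entries that only ever moves forward as difference sources are visited in ra order. The sketch below works on bare lists of ra values and a fixed half-width `alpha`; both the function name and this simplified data layout are assumptions for illustration.

```python
def windowed_candidates(source_ras, object_ras, alpha):
    """For sorted source and object ra lists, return (i, lo, hi) triples
    such that object_ras[lo:hi] are the entries with ra inside
    [source_ras[i] - alpha, source_ras[i] + alpha]."""
    lo = hi = 0
    n = len(object_ras)
    windows = []
    for i, s in enumerate(source_ras):
        # Advance the lower edge past entries that fell out of the window.
        while lo < n and object_ras[lo] < s - alpha:
            lo += 1
        hi = max(hi, lo)
        # Advance the upper edge to take in entries that entered the window.
        while hi < n and object_ras[hi] <= s + alpha:
            hi += 1
        windows.append((i, lo, hi))
    return windows
```

Because `lo` and `hi` never move backwards, a whole zone is processed in time linear in the number of sources plus objects, and memory is touched sequentially, which is the point of the cache argument above.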

Page 7

Algorithm 4

• See distanceMatch() in Match.h for details

• Similar algorithm used to find DiaSources within the error ellipse of moving object predictions

• Differences

– No fixed Rv: error ellipses have wildly varying size. Compute a bounding box in zone,ra for each ellipse

– Find ra range of bounding box by treating ellipse as a circle with radius equal to the semi-major axis

– Don’t create a ZoneIndex for the ellipses, just sort based on lowest zone of ellipse bounding box

– Cache issues less severe (100,000 difference sources worst case, rather than 10 million objects)

• See ellipseMatch(), ellipseGroupedMatch() in Match.h for matching algorithm, EllipseTypes.h for ellipse representation
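The bounding-box step for ellipses can be sketched as follows, treating each ellipse as a circle whose radius is the semi-major axis. The tuple layout `(ra, dec, semi_major)` in degrees and both function names are hypothetical; EllipseTypes.h may represent ellipses quite differently, and the returned ra range is not normalized to [0, 360).

```python
import math

def ellipse_bounds(ra, dec, semi_major, zone_height):
    """Return (min_zone, max_zone, ra_min, ra_max) for an error ellipse,
    treated as a circle of radius semi_major (all angles in degrees)."""
    dec_lo = max(dec - semi_major, -90.0)
    dec_hi = min(dec + semi_major, 90.0)
    min_zone = int(math.floor((dec_lo + 90.0) / zone_height))
    max_zone = int(math.floor((dec_hi + 90.0) / zone_height))
    if abs(dec) + semi_major >= 90.0:
        return min_zone, max_zone, 0.0, 360.0  # circle touches a pole
    # ra extent of a small circle: sin(alpha) = sin(radius) / cos(dec)
    alpha = math.degrees(math.asin(
        math.sin(math.radians(semi_major)) / math.cos(math.radians(dec))))
    return min_zone, max_zone, ra - alpha, ra + alpha

def sort_by_lowest_zone(ellipses, zone_height):
    """Order (ra, dec, semi_major) ellipses by the lowest zone they touch,
    in place of building a ZoneIndex for them."""
    return sorted(
        ellipses,
        key=lambda e: ellipse_bounds(e[0], e[1], e[2], zone_height)[0])
```

Sorting by lowest zone lets the matcher sweep upward through the difference source zones while keeping only the ellipses whose zone range covers the current zone.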

Page 8

Algorithm 5

• All these match routines work on generic structures (Ellipse, ZoneEntry) that contain position information and references to further data (e.g. a full DiaSource).

• Each takes 2 functor parameters that can be used to filter out a particular difference source, object, or prediction from the match.

• Also, each routine has a MatchListProcessor or MatchPairProcessor parameter

• MatchListProcessor: operator() takes e.g. a DiaSource and the list of all matching objects. In DC2, only action is to keep a record of matches found, but more complicated matching logic could be implemented here

• MatchPairProcessor: same as MatchListProcessor, but works on a single match at a time.
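The filter-and-processor design can be mimicked with plain callables. The sketch below is not the Match.h interface: `distance_match`, `MatchRecorder`, the dictionary records and the planar distance are all stand-ins chosen to keep the example short, and calling the processor only for sources with at least one match is an assumption.

```python
import math

def distance_match(sources, objects, radius,
                   source_filter, object_filter, processor):
    """Brute-force match that delegates policy to three callables.

    source_filter(s) / object_filter(o) return False to exclude an item;
    processor(s, matches) receives a source and all its matching objects.
    """
    for s in sources:
        if not source_filter(s):
            continue
        matches = [o for o in objects
                   if object_filter(o)
                   and math.hypot(s["x"] - o["x"], s["y"] - o["y"]) < radius]
        if matches:
            processor(s, matches)

class MatchRecorder:
    """List-style processor that only keeps a record of matches found."""

    def __init__(self):
        self.records = []

    def __call__(self, source, matches):
        self.records.append((source["id"], [o["id"] for o in matches]))
```

A usage example with hypothetical records:

```python
recorder = MatchRecorder()
distance_match(
    sources=[{"id": 1, "x": 0.0, "y": 0.0}],
    objects=[{"id": 10, "x": 0.1, "y": 0.0, "flagged": False}],
    radius=0.5,
    source_filter=lambda s: True,
    object_filter=lambda o: not o["flagged"],
    processor=recorder,
)
```

Keeping the matching loop free of policy is what allows the same routine to serve both the object match and the prediction match.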

Page 9

Timing

• ~0.33 seconds to build ZoneIndex for 400k Objects

• Parallelizable

• ~5 ms to match 2.6k difference sources to 400k Objects (1.8k matches)

• DC2: max of 22 moving object predictions per FOV, matching those to difference sources takes ~0.1ms

• results at -O0

Page 10

Architecture: High Level

• Load phase (lsst.ap.LoadStage)

– read positions for objects within FOV from files (main Object table is in RDBMS, association pipeline keeps the files and table in sync).

– build ZoneIndex for objects

• Compute phase (triggered by detection)

– Read difference sources coming from detection (lsst.dps.IOStage)

– Build spatial index for difference sources (lsst.ap.MatchDiaSourcesStage)

– Match difference sources against objects (lsst.ap.MatchDiaSourcesStage)

– Write match results to database (lsst.dps.IOStage)

– Read moving object predictions from database (lsst.dps.IOStage)

– Match them to difference sources that didn’t match a known variable object (lsst.ap.MatchMopsPredsStage)

– Create objects from difference sources that didn’t match any object (lsst.ap.MatchMopsPredsStage)

– Write matches and new objects to database (lsst.dps.IOStage)

• Store phase (lsst.ap.StoreStage)

– Write positions of new objects to chunk delta files

– Execute MySQL scripts that update the Object table, insert new objects into it, append per-visit tables to historical tables and finally drop per-visit tables

Page 11

Architecture: Details 1

Stripes are a fraction of a field-of-view high (less than one to minimize wasted I/O around the circular FOV)

Chunks are one stripe height wide

Objects are divided into declination stripes, and physically partitioned into chunks. LoadStage only reads chunks overlapping the circular FOV, keeping IO per visit low

CircularRegion/RectangularRegion classes represent FOVs, ZoneStripeChunkDecomposition maps positions to zones, stripes and chunks, computeChunkIds() maps FOVs to overlapping chunks.
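A rough sketch of mapping a circular FOV to overlapping chunks is given below. It is not the computeChunkIds() implementation: the chunk numbering as `(stripe, chunk)` pairs, the rule for how many chunks a stripe holds, and the use of the circle's ra/dec bounding box (which can include a few chunks the circle does not actually touch) are all assumptions made for the example.

```python
import math

def chunks_per_stripe(stripe, stripe_height):
    """Number of chunks in a stripe, chosen so that chunks are about one
    stripe height wide at the stripe's edge nearest the equator."""
    dec_lo = stripe * stripe_height - 90.0
    dec_hi = dec_lo + stripe_height
    min_abs_dec = 0.0 if dec_lo <= 0.0 <= dec_hi else min(abs(dec_lo), abs(dec_hi))
    return max(1, int(360.0 * math.cos(math.radians(min_abs_dec)) / stripe_height))

def compute_chunk_ids(ra, dec, radius, stripe_height):
    """Return (stripe, chunk) pairs overlapping the bounding box of a
    circular FOV centered at (ra, dec); all angles in degrees."""
    n_stripes = int(math.ceil(180.0 / stripe_height))
    s_lo = max(0, int(math.floor((dec - radius + 90.0) / stripe_height)))
    s_hi = min(n_stripes - 1,
               int(math.floor((dec + radius + 90.0) / stripe_height)))
    if abs(dec) + radius >= 90.0:
        alpha = 180.0  # FOV reaches a pole: every chunk in the stripe
    else:
        alpha = math.degrees(math.asin(
            math.sin(math.radians(radius)) / math.cos(math.radians(dec))))
    ids = []
    for stripe in range(s_lo, s_hi + 1):
        n = chunks_per_stripe(stripe, stripe_height)
        width = 360.0 / n
        if alpha >= 180.0:
            chunks = range(n)
        else:
            first = int(math.floor((ra - alpha) / width))
            last = int(math.floor((ra + alpha) / width))
            # The modulo handles FOVs that straddle ra = 0/360.
            chunks = sorted({c % n for c in range(first, last + 1)})
        ids.extend((stripe, c) for c in chunks)
    return ids
```

With stripes a fraction of the FOV high, the list stays close to the true footprint of the circle, which is the I/O argument made on this slide.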

Page 12

Architecture: Details 2

• For each spatial chunk have a:

– chunk file that contains object position, id, variability probabilities as specified at the beginning of a run. These are read-only.

– chunk delta file that contains new objects created during visit processing (new objects are also stored in the database). These are rewritten every visit.

• Multiple slices load stripes of object chunks in parallel

• DC2:

– no way to send this data back to master via pipeline framework

– no way to communicate it to other slices

• Solution: shared memory

[Figure: Slice 0 and Slice 1]

Page 13

Architecture: Details 3

• each slice stores loaded object positions into a shared memory chunk store

• works so long as master and workers live on same machine

• allows parallel IO in workers, single-threaded matching by master

• allows multiple association pipeline instances to co-exist

– so that Load, Compute, Store can be overlapped

• has non-trivial effect on code

– cannot store pointers in shared memory (different processes may map the shared memory segment to different virtual addresses)

– instead, must store offsets relative to the beginning of the shared memory area

– requires hand-rolled code for a lot of things normally taken for granted: associative container, memory allocation, etc…
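The offsets-instead-of-pointers constraint can be illustrated with a tiny linked list laid out inside a flat buffer. This is a sketch, not the chunk store's layout: the header and node formats and the function names are invented, and a `bytearray` stands in for the mapped shared memory segment (the `buf` of a `multiprocessing.shared_memory.SharedMemory` block could presumably be used the same way).

```python
import struct

HEADER = struct.Struct("<qq")  # (offset of head node or -1, next free offset)
NODE = struct.Struct("<ddq")   # (ra, dec, offset of next node or -1)

def init_store(buf):
    """Write an empty-list header at the start of the buffer."""
    HEADER.pack_into(buf, 0, -1, HEADER.size)

def push(buf, ra, dec):
    """Prepend a node, taking space from a hand-rolled bump allocator."""
    head, free = HEADER.unpack_from(buf, 0)
    if free + NODE.size > len(buf):
        raise MemoryError("shared buffer is full")
    # Links are offsets from the start of the buffer, never addresses.
    NODE.pack_into(buf, free, ra, dec, head)
    HEADER.pack_into(buf, 0, free, free + NODE.size)

def items(buf):
    """Walk the list by following offsets."""
    offset, _ = HEADER.unpack_from(buf, 0)
    while offset != -1:
        ra, dec, offset = NODE.unpack_from(buf, offset)
        yield ra, dec
```

Since every link is relative to the start of the buffer, the structure stays valid wherever the bytes end up in a process's address space; a byte-for-byte copy of the buffer walks identically.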

Page 14

Architecture: Details 4

• This is mostly hidden from view by the Chunk and SharedSimpleObjectChunkManager classes.

• The chunk manager

– tracks visits that are in-flight

– allocates memory for chunks

– registers new visits

– tracks which chunks are being used by a visit, enforces 1 owner per chunk

– allows a visit to wait until it has acquired ownership of all chunks overlapping the visit FOV (makes sure concurrent pipelines don’t step on each other)

– allows skipping reads of chunks that a previous visit is still holding in memory

– either commits or rolls back all changes to the in-memory chunk store for a given visit.

Page 15

Architecture: Details 5

• A Chunk

– supports inserting/removing/updating entries

– copy-on-write semantics to allow for rollback:

    • removing an entry really means flagging it as DELETED

    • updating an entry means flagging the existing version as DELETED and inserting a modified copy

– provides access to entry flags

– supports reading and writing of chunk and chunk delta files, with and without gzip compression

– allows a series of inserts/deletes to be marked as committed or rolled back
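The copy-on-write behaviour can be sketched with per-entry flags. Only the DELETED flag is named on the slide; the two pending flags, the class layout and the method names below are assumptions made to show how commit and rollback could work, not the actual Chunk class.

```python
DELETED = 0x1         # entry is logically removed
INSERT_PENDING = 0x2  # hypothetical: inserted by the in-flight visit
DELETE_PENDING = 0x4  # hypothetical: deleted by the in-flight visit

class Chunk:
    def __init__(self):
        self.entries = []  # each entry is [object_id, position, flags]

    def insert(self, object_id, position):
        self.entries.append([object_id, position, INSERT_PENDING])

    def remove(self, object_id):
        for entry in self.entries:
            if entry[0] == object_id and not entry[2] & DELETED:
                entry[2] |= DELETED | DELETE_PENDING  # flag, do not erase

    def update(self, object_id, position):
        # Copy-on-write: the old version is flagged, a new copy is added.
        self.remove(object_id)
        self.insert(object_id, position)

    def live(self):
        return {e[0]: e[1] for e in self.entries if not e[2] & DELETED}

    def commit(self):
        # Deleted entries are dropped for good, pending flags are cleared.
        self.entries = [[e[0], e[1], 0] for e in self.entries
                        if not e[2] & DELETED]

    def rollback(self):
        # Pending inserts vanish; pending deletes are undone.
        self.entries = [[e[0], e[1], 0] for e in self.entries
                        if not e[2] & INSERT_PENDING]
```

Because nothing is overwritten before commit, rolling back a failed visit only requires discarding its inserts and clearing the flags it set.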

Page 16

C++/Python Boundary

• LoadStage constructs a VisitProcessingContext which contains per-visit state and parameters. It is SWIGed and passed between stages on a Clipboard

• Stages.h declares functions for each of the high level steps outlined in previous slides - each takes a VisitProcessingContext as parameter

• Pipeline logic is almost entirely in C++.

• Exception is StoreStage, which generates SQL scripts from a template and then calls mysql to run them.
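The template-then-mysql approach of StoreStage might look roughly like the sketch below. Apart from the Object table, every table name, the template text, and the connection parameters are hypothetical; the real stage's SQL and the way it invokes mysql are not shown in these slides.

```python
import subprocess
from string import Template

# Hypothetical per-visit script: table names other than Object are invented.
STORE_TEMPLATE = Template("""\
INSERT INTO Object SELECT * FROM NewObject_visit$visit_id;
INSERT INTO DiaSourceToObjectMatches
    SELECT * FROM DiaSourceToObjectMatches_visit$visit_id;
DROP TABLE NewObject_visit$visit_id;
DROP TABLE DiaSourceToObjectMatches_visit$visit_id;
""")

def render_store_script(visit_id):
    """Fill the per-visit table names into the SQL template."""
    return STORE_TEMPLATE.substitute(visit_id=int(visit_id))

def run_store_script(script, host, user, database):
    """Feed the script to the mysql command line client on stdin."""
    cmd = ["mysql", "--host=" + host, "--user=" + user, database]
    subprocess.run(cmd, input=script, text=True, check=True)
```

Generating one script per visit keeps the SQL reviewable and lets the per-visit tables be appended and dropped in a single client invocation.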

Page 17

DC3 Code Items

• More extensive Python/C++ interface?

• Get rid of shared memory chunk store?

– would simplify code a lot!

– but would also lose functionality

    • ability to spread visits across pipeline instances

    • overlapping Load, Compute, Store

• Transactions/fault tolerance:

– Support this for DC3? At what granularity?

• Go to a shared nothing architecture with message passing?

– Need support from middleware for passing data from master to slice and back. Depending on what is parallelized, even slice to slice communication could be necessary. Could use MPI directly.

– Better fit with current pipeline model, can scale beyond a single server

– But not clear this is necessary unless algorithms get heavier: by 2014 expect many (32+) cores per server.

Page 18

DC3 Algorithms

• Use source classification information from detection (or compute it in association)

– http://dev.lsstcorp.org/htmldocs/SourceClassificationTable.pdf

• Vary the match-radius on a per difference source basis?

• Probabilistic matching (make use of error ellipses for difference sources and objects)?

• Take magnitudes of difference sources into account? How?

• Cadence of observations often results in pairs of visits to the same FOV (within 30min)

– Take pairs of DiaSources from both visits and do a full orbital fit against the orbits intersecting the FOV (http://listserv.lsstcorp.org/mailman/private/lsst-data/2007-June/003268.html)

– Cross-match new objects from both to avoid generating back to back alerts for a new moving object (http://listserv.lsstcorp.org/mailman/private/lsst-data/2007-June/003264.html)