
Stampede

A Cluster Programming Middleware for Interactive Stream-oriented Applications

Umakishore Ramachandran, Rishiyur Nikhil, James Matthew Rehg, Yavor Angelov, Arnab Paul, Sameer Adhikari, Kenneth Mackenzie, Nissim Harel, Kathleen Knobe

IEEE Transactions on Parallel and Distributed Systems, November 2003

Introduction

New application domains: interactive vision, multimedia collaboration, animation
• Interactive
• Process temporal data
• High computational requirements
• Exhibit task & data parallelism
• Dynamic – unpredictable at compile time

Stampede: programming system to enable execution on SMPs/clusters
• Support for task, data parallelism
• Temporal data handling, buffer management
• High-level data sharing: space-time memory

Example: Smart Kiosk

Public device for providing information, entertainment

Interact with multiple people

Capable of initiating interaction

I/O: video cameras, microphones, touch screens, infrared, speakers, …

Kiosk application characteristics

Tasks have different computational requirements

higher level tasks may be more expensive

May not run as often – data dependent

Multiple (heterogeneous) time-correlated data sets

Tasks have different priorities

e.g., interacting with customer vs. looking for new customers

Input may not be accessed in strict order

e.g., skip all but most recent data

May need to re-analyze earlier data

Claim: streams, lists not expressive enough

Space-time memory

Distributed shared data structures for temporal data

STM channel: random access

STM queue: FIFO access

STM register: cluster-wide shared variable

Unique system wide names

Threads attach, detach dynamically

Threads communicate only via STM

STM channels

STM channel API

Channels
• Supports bounded/unbounded size
• Separate API for typed access, hooks for marshalling/unmarshalling

Timestamp wildcards
• Request newest/oldest item in channel
• Newest value not previously read

Get/put
• Blocking/nonblocking operation
• Timestamps can be out of order
• Copy-in, copy-out semantics
• Get can be called on an item 0 to #connections times
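To make these semantics concrete, here is a minimal single-process sketch; the names (stm_channel, stm_put, stm_get, STM_NEWEST) are invented for illustration and are not the actual Stampede API. It shows timestamp-keyed random access, a newest-item wildcard, out-of-order puts, and copy-in/copy-out.

```c
/* Toy STM channel: timestamp-keyed random access with copy-in/copy-out.
 * All names here are illustrative, not the real Stampede API.
 * Single-process; no blocking, bounds, or GC. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ITEMS 64
#define STM_NEWEST (-1)   /* timestamp wildcard: newest item in channel */

typedef struct {
    long   ts[MAX_ITEMS];
    void  *buf[MAX_ITEMS];
    size_t len[MAX_ITEMS];
    int    n;
} stm_channel;

/* put: copy-in semantics -- the channel stores a private copy */
void stm_put(stm_channel *c, long ts, const void *data, size_t len) {
    int i = c->n++;
    c->ts[i]  = ts;
    c->buf[i] = malloc(len);
    memcpy(c->buf[i], data, len);
    c->len[i] = len;
}

/* get: copy-out semantics; ts may be a wildcard such as STM_NEWEST.
 * Timestamps need not arrive in order, so we search rather than index. */
int stm_get(stm_channel *c, long ts, void *out, size_t len) {
    int best = -1;
    for (int i = 0; i < c->n; i++) {
        if (ts == STM_NEWEST) {
            if (best < 0 || c->ts[i] > c->ts[best]) best = i;
        } else if (c->ts[i] == ts) {
            best = i;
        }
    }
    if (best < 0) return -1;   /* a real blocking get would wait here */
    memcpy(out, c->buf[best], len < c->len[best] ? len : c->len[best]);
    return 0;
}

int main(void) {
    stm_channel frames = { .n = 0 };
    int f7 = 7, f5 = 5;
    stm_put(&frames, 7, &f7, sizeof f7);   /* out-of-order puts are legal */
    stm_put(&frames, 5, &f5, sizeof f5);
    int v;
    stm_get(&frames, STM_NEWEST, &v, sizeof v);
    printf("newest item: %d\n", v);        /* prints 7 */
    return 0;
}
```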

STM queue

Supports data parallelism

Get/put behave as enqueue/dequeue
• Get: items retrieved exactly once
• Put: multiple items w/ same timestamp can be added

Used for partitioning data items (regions in a frame)
• Runtime adds a ticket for a unique id
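A similarly hedged sketch of the queue semantics, again with invented names (stm_queue, q_put, q_get); the runtime-added ticket is modeled as a simple counter.

```c
/* Toy STM queue: FIFO get/put where several items may share a timestamp.
 * The ticket the runtime appends for uniqueness is modeled as a counter.
 * Illustrative only -- not the actual Stampede API; no overflow check. */
#include <stdio.h>

#define QCAP 16

typedef struct { long ts; int ticket; int region; } qitem;
typedef struct { qitem it[QCAP]; int head, tail, next_ticket; } stm_queue;

void q_put(stm_queue *q, long ts, int region) {
    qitem *x = &q->it[q->tail++ % QCAP];
    x->ts     = ts;
    x->ticket = q->next_ticket++;   /* runtime-added unique id */
    x->region = region;
}

/* each item is retrieved exactly once */
int q_get(stm_queue *q, qitem *out) {
    if (q->head == q->tail) return -1;   /* a real get could block */
    *out = q->it[q->head++ % QCAP];
    return 0;
}

int main(void) {
    stm_queue q = {0};
    /* partition frame 42 into two horizontal stripes, one timestamp */
    q_put(&q, 42, /*region=*/0);
    q_put(&q, 42, /*region=*/1);
    qitem w;
    while (q_get(&q, &w) == 0)
        printf("ts=%ld ticket=%d region=%d\n", w.ts, w.ticket, w.region);
    return 0;
}
```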

Garbage collection

How to determine if an STM item is no longer needed?

Consume API call indicates this for a connection

Queues
• Items have implicit reference count of 1
• GC after consume

Channels
• Number of consumers unknown
• Threads can skip items
• New connections can be created dynamically

Reachability via timestamps
• GC if item cannot be accessed by any current or future connection
• System: item not GCed until marked consumed by all connections
• Application: must mark each item consumed (can mark timestamp ranges)
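A small sketch of the channel rule above, with invented data structures: an item becomes GC-eligible only once every attached connection has marked it consumed.

```c
/* Sketch of the channel GC rule: an item can be collected only after
 * every attached input connection has marked it consumed. Illustrative. */
#include <stdio.h>
#include <stdbool.h>

#define MAX_CONN 4

typedef struct {
    long ts;
    bool consumed[MAX_CONN];   /* one flag per attached connection */
} chan_item;

bool gc_eligible(const chan_item *it, int nconn) {
    for (int c = 0; c < nconn; c++)
        if (!it->consumed[c]) return false;   /* still reachable */
    return true;
}

int main(void) {
    chan_item it = { .ts = 42 };
    int nconn = 2;
    it.consumed[0] = true;                    /* connection 0: consume(42) */
    printf("eligible: %d\n", gc_eligible(&it, nconn));  /* 0: conn 1 pending */
    it.consumed[1] = true;                    /* connection 1: consume(42) */
    printf("eligible: %d\n", gc_eligible(&it, nconn));  /* 1: can GC */
    return 0;
}
```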

GC and timestamps

Threads propagate input timestamps to output

Threads at data source (e.g. camera) generate timestamps

Virtual time: per thread, application specific (e.g. frame number)

Visibility: per-thread, minimum of virtual time & item timestamps from all connections

Put: item timestamp >= visibility

Create thread: child virtual time >= visibility

Attach: items < visibility implicitly consumed

Set virtual time: any value >= visibility; either set it to infinity, or guarantee that it advances

Global minimum timestamp, ts_min. Minimum of:

Virtual time of all threads

Timestamps of items on all queues

Timestamps of unconsumed items on all input connections of all channels

Items with timestamps < ts_min can be garbage collected
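Putting the three sources above together, here is a worked sketch of the ts_min computation; the data layout is invented for illustration.

```c
/* Sketch of ts_min: the minimum over thread virtual times, items on
 * queues, and unconsumed items on channel input connections. */
#include <stdio.h>

long min2(long a, long b) { return a < b ? a : b; }

long ts_min(const long *vtimes, int nthreads,
            const long *qitems, int nq,
            const long *unconsumed, int nu) {
    long m = vtimes[0];
    for (int i = 1; i < nthreads; i++) m = min2(m, vtimes[i]);     /* virtual times */
    for (int i = 0; i < nq; i++)       m = min2(m, qitems[i]);     /* queued items */
    for (int i = 0; i < nu; i++)       m = min2(m, unconsumed[i]); /* unconsumed channel items */
    return m;
}

int main(void) {
    long vt[] = {12, 9};   /* two threads */
    long q[]  = {10};      /* one queued item */
    long u[]  = {8, 11};   /* unconsumed items on channel connections */
    printf("ts_min = %ld\n", ts_min(vt, 2, q, 1, u, 2));  /* 8: items < 8 GCed */
    return 0;
}
```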

Code samples

People tracker for Smart Kiosk

Track multiple moving targets based on color

Goals: low latency, keep up with frame rate

Application: color-based tracking

[Figure: color-based tracking with two color models (Model 1, Model 2)]

Mapping to Stampede

Expected bottleneck: target detection

Data parallelize by color models, frame regions (horizontal stripes)

Placement on cluster
• 1 node: all threads except inner DPS
• N nodes: 1 inner DPS each

Color tracking results

Setup: 17-node cluster (Dell 8450s)
• 8 CPUs/node: 550 MHz P3 Xeon
• 4 GB memory/node
• 2 MB L2 cache/CPU
• Gigabit Ethernet
• OS: Linux
• Stampede used CLF messaging

Data: 1 MB/frame @ 30 fps, 8 models

Bottleneck was histogram thread

Application: video textures

Batch video processing: generate video loop from set of frames

Randomly transition between computed cut points, or create a loop of specified length

Calculate best places to cut – pairwise frame comparison

Comparisons independent – lots of parallelism

Problem: data distribution – don’t send every frame everywhere

Mapping to Stampede

[Figure: mapping of pipeline stages onto cluster nodes]

Decentralized data distribution
• Single source: one node fetches all images
• Decentralized: each node fetches a subset and reuses images
• “tiling with chaining”
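A hedged sketch of the decentralized idea: split the upper-triangular N x N comparison matrix into per-node blocks so each node fetches only the frames its block touches. The paper’s actual “tiling with chaining” scheme is more refined; this only illustrates sending each node a subset of the work.

```c
/* Naive row-block partition of the pairwise comparison matrix.
 * Illustrative only -- not the paper's tiling-with-chaining scheme. */
#include <stdio.h>

int main(void) {
    int N = 316, nodes = 4;
    int rows_per_node = (N + nodes - 1) / nodes;
    for (int n = 0; n < nodes; n++) {
        int r0 = n * rows_per_node;
        int r1 = (n + 1) * rows_per_node;
        if (r1 > N) r1 = N;
        long cmps = 0;
        for (int i = r0; i < r1; i++)
            cmps += N - 1 - i;          /* pairs (i, j) with j > i */
        /* node n needs its rows r0..r1-1 plus every column j > r0 */
        printf("node %d: fetch frames %d..%d, %ld comparisons\n",
               n, r0, N - 1, cmps);
    }
    return 0;
}
```

Even this toy shows the tension the slides describe: coarse row blocks leave the first node with most of the frames and comparisons, which is why smaller tiles balance load better at the cost of more communication.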

Stripe size experiment

Tune image comparison for L2 cache size
• Compare image regions rather than whole images
• Find stripe size (#rows) s.t. comparisons fit in cache
• Measure single-node speedup as a function of stripe size, number of worker threads

Setup: cluster as before

Data: 316 frames, 640x480, 24 bit color (~900KB)

comparisons = N(N-1)/2 = 316 · 315 / 2 = 49,770
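The arithmetic above, worked out in code from the stated data sizes; the 32-row stripe is an arbitrary example, not the paper’s tuned value.

```c
/* Worked arithmetic for the stripe-size experiment. */
#include <stdio.h>

int main(void) {
    int N = 316;
    printf("comparisons = N(N-1)/2 = %d\n", N * (N - 1) / 2);  /* 49770 */

    int bytes_per_row = 640 * 3;   /* 640 pixels, 24-bit color */
    int rows = 32;                 /* example stripe size, not the tuned one */
    printf("full frame: %d bytes; %d-row stripe: %d bytes\n",
           480 * bytes_per_row, rows, rows * bytes_per_row);
    /* a comparison touches one stripe from each of the two frames */
    printf("two operand stripes: %d bytes vs. a 2 MB L2\n",
           2 * rows * bytes_per_row);
    return 0;
}
```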

Stripe size results

Memory bottleneck

[Plot: runtime in seconds vs. stripe size; “whole image comparison” marks the unstriped baseline]

Data distribution experiment

Single-source vs. decentralized data distribution
• Measure speedup as a function of nodes, threads/node
• Tile size varies with number of nodes
• Larger tiles: better compute/communication ratio
• Smaller tiles: better load balancing

Compare to algorithm-limited speedup
• No communication costs
• Shows effect of load imbalances

Setup: as before

Full image comparisons

Data distribution results

Single-source bottleneck – as #nodes grows, communication time exceeds computation time

1-thread vs. 8-thread performance: communication for the initial tile fetch has no computation to overlap with