Stampede
A Cluster Programming Middleware for Interactive Stream-oriented Applications
Umakishore Ramachandran, Rishiyur Nikhil, James Matthew Rehg, Yavor Angelov, Arnab Paul, Sameer Adhikari, Kenneth
Mackenzie, Nissim Harel, Kathleen Knobe
IEEE Transactions on Parallel and Distributed Systems, November 2003
Introduction
New application domains: interactive vision, multimedia collaboration, animation
• Interactive: process temporal data
• High computational requirements
• Exhibit task & data parallelism
• Dynamic – unpredictable at compile time
Stampede: programming system to enable execution on SMPs/clusters
• Support for task, data parallelism
• Temporal data handling, buffer management
• High-level data sharing: space-time memory
Example: Smart Kiosk
Public device for providing information, entertainment
Interact with multiple people
Capable of initiating interaction
I/O: video cameras, microphones, touch screens, infrared, speakers, …
Kiosk application characteristics
Tasks have different computational requirements
higher level tasks may be more expensive
May not run as often – data dependent
Multiple (heterogeneous) time correlated data sets
Tasks have different priorities
e.g., interacting with customer vs. looking for new customers
Input may not be accessed in strict order
e.g., skip all but most recent data
May need to re-analyze earlier data
Claim: streams, lists not expressive enough
Space-time memory
Distributed shared data structures for temporal data
STM channel: random access
STM queue: FIFO access
STM register: cluster-wide shared variable
Unique system wide names
Threads attach, detach dynamically
Threads communicate only via STM
STM channel API
Channels
• Supports bounded/unbounded size
• Separate API for typed access; hooks for marshalling/unmarshalling
Timestamp wildcards
• Request newest/oldest item in channel
• Newest value not previously read
Get/put
• Blocking/nonblocking operation
• Timestamps can be out of order
• Copy-in, copy-out semantics
• Get can be called on an item 0–#connections times
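The channel semantics above (out-of-order puts, copy-out gets, timestamp wildcards) can be sketched as a toy Python class. The class and method names here are illustrative assumptions, not Stampede's actual C API:

```python
import threading

class STMChannel:
    """Toy sketch of an STM channel: random access by timestamp,
    copy-in/copy-out semantics. Illustrative only."""

    NEWEST = object()   # timestamp wildcard: latest item in the channel
    OLDEST = object()   # timestamp wildcard: earliest item in the channel

    def __init__(self):
        self.items = {}              # timestamp -> value
        self.lock = threading.Lock()

    def put(self, ts, value):
        # Copy-in: the channel keeps its own copy; puts may arrive
        # with out-of-order timestamps.
        with self.lock:
            self.items[ts] = value

    def get(self, ts):
        # Copy-out: caller gets a copy; get may be called on the same
        # item anywhere from 0 to #connections times.
        with self.lock:
            if ts is STMChannel.NEWEST:
                ts = max(self.items)
            elif ts is STMChannel.OLDEST:
                ts = min(self.items)
            return ts, self.items[ts]

ch = STMChannel()
ch.put(2, "frame-2")
ch.put(1, "frame-1")                # out-of-order put is legal
print(ch.get(STMChannel.NEWEST))    # -> (2, 'frame-2')
```

The "newest value not previously read" wildcard would additionally need per-connection read state, omitted here for brevity.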
STM queue
Supports data parallelism
Get/put behave as enqueue/dequeue
• Get: items retrieved exactly once
• Put: multiple items with the same timestamp can be added
Used for partitioning data items (regions in a frame)
• Runtime adds a ticket for a unique id
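The queue's ticket mechanism can be illustrated with a minimal sketch, assuming a hypothetical Python-style interface rather than the real Stampede API:

```python
from collections import deque
from itertools import count

class STMQueue:
    """Toy sketch of an STM queue: FIFO get/put, where multiple items may
    share a timestamp and the runtime appends a ticket to keep ids unique."""

    def __init__(self):
        self.q = deque()
        self.ticket = count()   # runtime-generated unique ticket

    def put(self, ts, value):
        # Several puts with the same timestamp are legal (e.g., stripes of
        # one frame); the ticket disambiguates them.
        self.q.append(((ts, next(self.ticket)), value))

    def get(self):
        # Each item is retrieved exactly once (dequeue), so worker threads
        # pull disjoint items -- this is what supports data parallelism.
        return self.q.popleft()

q = STMQueue()
q.put(7, "stripe-0")
q.put(7, "stripe-1")   # same timestamp, different ticket
print(q.get())         # -> ((7, 0), 'stripe-0')
```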
Garbage collection
How to determine if an STM item is no longer needed?
Consume API call indicates this for a connection
Queues
• Items have implicit reference count of 1
• GC after consume
Channels
• Number of consumers unknown
• Threads can skip items
• New connections can be created dynamically
Reachability via timestamps
• GC if item cannot be accessed by any current or future connection
• System: item not GCed until marked consumed by all connections
• Application: must mark each item consumed (can mark timestamp ranges)
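The consume-based rule for channels can be shown as a toy reference-count sketch: an item becomes collectable only after every attached connection has marked it consumed. All names here are hypothetical:

```python
class Channel:
    """Toy sketch of consume-based channel GC. Illustrative only."""

    def __init__(self, connections):
        self.connections = set(connections)
        self.items = {}        # ts -> value
        self.unconsumed = {}   # ts -> connections that have not consumed it

    def put(self, ts, value):
        # Every attached connection must consume the item before GC.
        self.items[ts] = value
        self.unconsumed[ts] = set(self.connections)

    def consume(self, conn, ts):
        # Application marks (conn, ts) consumed; timestamp ranges could be
        # marked with a loop over ts values.
        self.unconsumed[ts].discard(conn)

    def gc(self):
        # Collect items no connection still needs; return their timestamps.
        dead = [ts for ts, pending in self.unconsumed.items() if not pending]
        for ts in dead:
            del self.items[ts]
            del self.unconsumed[ts]
        return dead

ch = Channel({"c1", "c2"})
ch.put(5, "item")
ch.consume("c1", 5)
print(ch.gc())       # -> [] : c2 has not consumed yet
ch.consume("c2", 5)
print(ch.gc())       # -> [5]
```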
GC and timestamps
Threads propagate input timestamps to output
Threads at data source (e.g. camera) generate timestamps
Virtual time: per thread, application specific (e.g. frame number)
Visibility: per-thread, minimum of virtual time & item timestamps from all connections
Put: item timestamp >= visibility
Create thread: child virtual time >= visibility
Attach: items < visibility implicitly consumed
Set virtual time: any value >= visibility; either infinity, or the thread must guarantee advancement
Global minimum timestamp, ts_min. Minimum of:
Virtual time of all threads
Timestamps of items on all queues
Timestamps of unconsumed items on all input connections of all channels
Items with timestamps < ts_min can be garbage collected
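The ts_min rule above can be written out as a small helper with made-up numbers; the function name and inputs are illustrative, not part of Stampede:

```python
def ts_min(thread_vtimes, queue_item_ts, unconsumed_channel_ts):
    """Global minimum timestamp: the minimum over (1) virtual times of all
    threads, (2) timestamps of items on all queues, and (3) timestamps of
    unconsumed items on all input connections of all channels.
    Items with timestamps strictly below this value can be GCed."""
    return min(list(thread_vtimes)
               + list(queue_item_ts)
               + list(unconsumed_channel_ts))

# Worked example with made-up numbers:
tmin = ts_min(thread_vtimes=[10, 12],
              queue_item_ts=[11],
              unconsumed_channel_ts=[9, 14])
print(tmin)   # -> 9; items with timestamp < 9 are garbage
```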
People tracker for Smart Kiosk
Track multiple moving targets based on color
Goals: low latency, keep up with frame rate
Application: color-based tracking
Model 1 Model 2
Mapping to Stampede
Expected bottleneck: target detection
Data parallelize by color models, frame regions (horizontal stripes)
Placement on cluster
• 1 node: all threads except inner DPS
• N nodes: 1 inner DPS each
Color tracking results
Setup: 17-node cluster (Dell 8450s)
• 8 CPUs/node: 550 MHz P3 Xeon
• 4 GB memory/node
• 2 MB L2 cache/CPU
• Gigabit Ethernet
• OS: Linux
• Stampede used CLF messaging
Data: 1 MB/frame @ 30 fps, 8 models
Bottleneck was histogram thread
Application: video textures
Batch video processing: generate video loop from set of frames
Randomly transition between computed cut points, or create a loop of specified length
Calculate best places to cut – pairwise frame comparison
Comparisons independent – lots of parallelism
Problem: data distribution – don’t send every frame everywhere
Decentralized data distribution
Single source: one node fetches all images; decentralized: each node fetches a subset and reuses images
“tiling with chaining”
Stripe size experiment
Tune image comparison for L2 cache size
• Compare image regions rather than whole images
• Find stripe size (#rows) s.t. comparisons fit in cache
• Measure single-node speedup as a function of stripe size, number of worker threads
Setup: cluster as before
Data: 316 frames, 640x480, 24 bit color (~900KB)
comparisons = N(N-1)/2 = 49770
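A quick arithmetic check of the comparison count: every unordered pair of distinct frames is compared once.

```python
# Pairwise frame comparisons for N frames: N choose 2 = N(N-1)/2.
N = 316
comparisons = N * (N - 1) // 2
print(comparisons)   # -> 49770, matching the figure above
```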
Data distribution experiment
Single-source vs. decentralized data distribution
• Measure speedup as a function of nodes, threads/node
• Tile size varies with number of nodes
• Larger tiles: better compute/communication ratio
• Smaller tiles: better load balancing
Compare to algorithm-limited speedup
• No communication costs
• Shows effect of load imbalances
Setup: as before
Full image comparisons