Comparison of the Various Stanford Streaming Languages
Lance Hammond
Stanford University
October 12, 2001
The Basics: Variables & Operators
- Only DSP-C addresses variable-precision integers
- “Native” short-vector types
  - Brook introduces short-vector types (like vec2i) to avoid the use of streaming operations with these types
    - The i/f distinction alone may not be sufficient!
  - DSP-C just defines these as very short CAAs, which should not cause the generation of kernels
    - Allows type definition of any kind of short vector you may want, using a general language specification
- Matrix multiply operations
  - Brook uses “vector * vector” to mean a matrix multiply
    - So other “vector OP vector” operations are illegal??
  - DSP-C uses “vector OP vector” (including multiply) to mean an element-by-element operation on the vectors
    - A conventional matrix multiply could be specified as a special operator of its own, however
“Stream” Definition
- DSP-C uses arrays
  - Sections of the arrays are selected for use in calculations using array indices or array range specifications
- Stream-C defines 1-D, array-like streams
  - Arbitrary sections of the base arrays can be selected in advance for streaming into execution kernels
- Brook defines much more abstract streams
  - Dynamic-length streams
  - Could be multidimensional (but only fixed length??)
  - Entire stream is a monolithic object
- Brook problems, as I see them (currently)
  - No substreams can be defined . . . you have to use the whole thing
  - No in-place R/W stream usage to save memory when possible
  - Implicit loop start/end isn’t well defined (Where do you start if you have ±1 usage on an input? What if input streams differ in size?)
  - Memory allocation of dynamic-length streams?
    - At the very least, we need to be able to specify when dynamic allocation isn’t actually necessary, to optimize fixed-length code
Parallelism Classification
- Independent thread parallelism
  - Stick with pthreads or other high-level definition
- Loop iteration, data-division parallelism
  - Divide up loop iterations among functional units
  - Loop iterations must be data-independent (no critical dependencies)
- Pipelining of segments of “serial” code
  - Find places to overlap non-dependent portions of serial code
    - Ex. 1: Start a later loop before an earlier one finishes
    - Ex. 2: Start different functions on different processors
  - Harder than loop-iteration parallelism because of load balancing
- Pipelining between timesteps
  - Run multiple timesteps in parallel, using a pipeline
  - Doesn’t necessarily require finding overlap of loops or functions, since running them on different timesteps makes them data-parallel
  - StreaMIT is the best example of a language designed to optimize for this, and up to now I don’t think any of our proposals have addressed it
Parallelism Classification Graphic
Note: Any parallel blocks without dependencies in the above figure can be considered unordered with respect to each other, and rescheduled accordingly. There is technically no need to specify “unordered” explicitly, as a result.
[Figure: “Original Code” vs. “Parallelization of the Program,” plotted as threads across time. A task on thread 1 consists of step #1, step #2 (not dependent), a big loop (#3), and step #4; the parallelized version splits the big loop into Loop 3.1–3.4 across threads, while other threads run timed-event steps (TES #1–#3) pipelined across timesteps t = 0..3.]

Legend:
1. Independent tasks can be parallelized
2. “Serial” execution blocks without dependencies can be parallelized
3. Loops can be parallelized by distributing nondependent iterations into threads
4. Pipelines that execute continuously can be split into threads by pipe stage
Kernel Usage/Parallelism
- DSP-C extracts kernels from loop nests
  - Standard C loops or “dsp_for” loops
  - Can find loop-iteration and overlap parallelism, but relies on compiler analysis to find possible dependencies
    - Probably OK for some cases, but may be too complex in general
  - Flexible: allows the compiler to pick the “best” loop nest for parallelism, if it is smart enough to make a good choice
    - We can supply “hints” to overcome some compiler stupidity
- Stream-C explicitly calls function-like kernels
  - Makes most inter-iteration dataflow and parallelism fairly obvious
    - Stream I/O behavior is clearly stated
    - Still allows inter-iteration scalar dependencies
  - Overlapping/combining of kernels requires some Stream-C analysis
  - Kernels are static and cannot be arbitrarily broken up to fit hardware
- Brook also calls function-like kernels
  - Makes most inter-iteration dataflow and parallelism fairly obvious
    - Stream inputs are simple, but outputs can vary in length
    - Inter-iteration dependencies are completely eliminated
  - Overlapping of kernels can often be determined using “prototypes”
  - Kernels are static and cannot be arbitrarily broken up to fit hardware
Stream Communication Allowed
- DSP-C: written CAA entries to read CAA entries
  - Compiler must ensure that all potential dependencies are maintained
  - Pattern of reading and writing within CAAs must be determined
    - Static “footprint” of reads and writes per iteration should be calculated to maximize parallelism
    - Dynamic address calculation limits parallelism for R/W arrays
  - Much of this analysis must be done anyway to optimize memory layout
    - May affect the compiler’s selection of a loop nest to parallelize
- Stream-C: written streams to read streams
  - Kernels normally must be serialized when a kernel reads a stream written by another one
  - Overlap is possible if the compiler checks substream usage
- Brook: written output to read stream footprint
  - Similar to Stream-C, but the read-footprint specification in the kernel header allows kernel overlap without compiler analysis inside the kernel
Scalar Communication Allowed
- DSP-C: anything allowed
  - Pattern matcher in the compiler looks for reduction operations and turns them into parallel reductions with combining code at the end
  - Other dependencies are left alone and make the loop serial, unless they happen to map to specialized target hardware
    - Short-hop communication would map to the inter-cluster communication hardware on Imagine
    - Use of CAA input or output ports with “if . . . then port++” usage to target Imagine conditional streams
    - Another possibility: allow explicit communication when desired, using an extra “parallelism” dimension on CAAs (Nuwan’s idea)
- Stream-C: anything allowed
  - Uses tricks like prefix calculation to minimize inter-cluster dependencies, when possible (How effective?? How automatic??)
  - Uses high-bandwidth inter-cluster communication when necessary
  - Supports high-speed conditional stream pointer hardware in Imagine
- Brook: only certain reductions allowed
  - Reduction operator can be applied to a single kernel return value
Compiler Analysis in DSP-C
- Picking parallel loops
  - Could be done automatically, but would probably be best done with programmer hints (“parallel for” loops)
  - Various nesting levels can be considered for parallel splitting
- Calculation of the memory footprint used by loop array accesses
  - Try to take advantage of the DSP-C register/local/main memory layers
  - Used to perform local memory (SRF, memory mat, etc.) allocation/buffering of streams during execution
  - Optional rearrangement of non-dependent loop iterations to improve memory-system usage
- Analysis of dependencies between loop iterations
  - Elimination of reductions using reduction optimizations
  - Marking of required dependencies for communication
- Generation of parallel kernels
  - Unroll loops as desired and recalculate memory footprints
  - Apply SIMD operations to (partially) parallelize one dimension of the loop nest, if possible
  - Generate the “best” parallel kernel based on all previous analysis (and probably based on profiling of several “good” choices)
CA: Memory Footprint Graphic
[Figure: memory footprints at each level of a 2-D convolution loop nest, built from an input array, a coefficient array, an accumulator, and an output stream.]

- Single iteration: reads two input values and both reads and writes an accumulator. Footprint: 3; per-iteration I/O: 4.
- Inner loop: runs down a row (or column) of elements from the two input arrays, while reusing only a single accumulator repeatedly in a classic reduction operation. Footprint: 19; per-iteration I/O: 2.
- Second loop: goes the opposite direction from the inner loop to complete the 2-D input arrays. At this level, an output pixel is produced; other than the accumulator, nothing is used more than once in the calculation. Footprint: 164; per-iteration I/O: 18.
- Third loop: the accumulator and coefficients become fixed and are reused every time, while the input data is streamed in, with a single new column (or row) brought in each iteration to replace an old one; output is a simple stream of data flowing out. Footprint: 154 + 10 x rowsize; per-iteration I/O: 10.
- Final loop: extends the third loop’s footprint by multiplying the input and output streams by a number of rows (or columns), while keeping the coefficient array and accumulator identical. Footprint: 146 + 8 x rowsize + 8 x columnsize + 2 x rowsize x columnsize; per-iteration I/O: 10.
CA: Memory Model Graphic
[Figure: three-level memory model and kernel execution timeline.]

- Registers (size: a few scalar values)
- Local memory (size: a small number of arrays); communication with registers: load individual array elements
- Main memory (size: essentially unlimited); communication with local memory: load entire arrays (or segments)

Timeline: load the input arrays into local memory from main memory; run the computation plus local <-> register communication; store the results from local memory back to main memory. The previous kernel’s LM -> MM store and the next kernel’s MM -> LM load can overlap with the current kernel’s computation.
CA: SIMD Extraction Graphic
[Figure: six ways of extracting SIMD parallelism across parallel execution units PE0–PE3; light-to-dark colors represent different SIMD parallel “slots.”]

- Interleaved-unit parallelism
- Array-division parallelism
- Double-interleaved, SIMD first
- Double-interleaved, parallel execution units first
- SIMD interleaved, parallel units divided
- Parallel units interleaved, SIMD divided
Pipeline-in-Time Syntax
- It might be a good idea to steal some of StreaMIT’s dataflow-definition technology to allow pipelining in time
- In DSP-C, we must add an optional “time” dimension that can be added to any CAA
  - Acts like a standard dimension, except that it can only be accessed relative to the current time, and not absolutely
  - The size of this dimension is unspecified (determined by blocking)
- More Stream-C/Brook-style kernel encoding of dataflow would probably be better, to allow automatic determination of blocking-in-time by the compiler
- Need special syntax if we want to handle out-of-flow messages, like control signals that interrupt normal time flow
  - This is my current project!