Comparison of the Various Stanford Streaming Languages
Lance Hammond
Stanford University
October 12, 2001
The Basics: Variables & Operators
- Only DSP-C addresses variable-precision integers
- “Native” short-vector types
  - Brook introduces short-vector types (like vec2i) to avoid the use of streaming operations with these types
    - The i/f distinction alone may not be sufficient!
  - DSP-C just defines these as very short CAAs, which should not cause the generation of kernels
    - Allows type definition of any kind of short vector you may want, using a general language specification
- Matrix multiply operations
  - Brook uses “vector * vector” to mean a matrix multiply
    - So other “vector OP vector” operations are illegal??
  - DSP-C uses “vector OP vector” (including multiply) to mean an element-by-element operation on the vectors
    - A conventional matrix multiply could be specified as a special operator of its own, however
“Stream” Definition
- DSP-C uses arrays
  - Sections of the arrays are selected for use in calculations using array indices or array range specifications
- Stream-C defines 1-D, array-like streams
  - Arbitrary sections of the base arrays can be selected in advance for streaming into execution kernels
- Brook defines much more abstract streams
  - Dynamic-length streams
  - Could be multidimensional (but only fixed length??)
  - Entire stream is a monolithic object
- Brook problems, as I see them (currently)
  - No substreams can be defined . . . you have to use the whole thing
  - No in-place R/W stream usage to save memory when possible
  - Implicit loop start/end isn’t well defined (Where do you start if you have ±1 usage on an input? What if input streams differ in size?)
  - Memory allocation of dynamic-length streams?
    - At the very least, we need to be able to specify when dynamic allocation isn’t actually necessary, to optimize fixed-length code
Parallelism Classification
- Independent thread parallelism
  - Stick with pthreads or other high-level definition
- Loop iteration, data-division parallelism
  - Divide up loop iterations among functional units
  - Loop iterations must be data-independent (no critical dependencies)
- Pipelining of segments of “serial” code
  - Find places to overlap non-dependent portions of serial code
    - Ex. 1: Start a later loop before an earlier one finishes
    - Ex. 2: Start different functions on different processors
  - Harder than loop-iteration parallelism because of load balancing
- Pipelining between timesteps
  - Run multiple timesteps in parallel, using a pipeline
  - Doesn’t necessarily require finding overlap of loops or functions, since running them on different timesteps makes them data-parallel
  - StreaMIT is the best example of a language designed to optimize for this, and up to now I don’t think any of our proposals have addressed it
Parallelism Classification Graphic
Note: Any parallel blocks without dependencies in the above figure can be considered unordered with respect to each other, and rescheduled accordingly. There is technically no need to specify “unordered” explicitly, as a result.
[Figure: “Original Code” vs. “Parallelization of the Program,” plotted as threads across time. A task on thread 1 consists of step #1, step #2 (not dependent), a big loop (#3), and step #4; the parallelized version splits the big loop into Loop 3.1–3.4 across threads, while other threads run timed-event steps (TES #1–#3) pipelined across timesteps t = 0..3.]

Legend:
1. Independent tasks can be parallelized
2. “Serial” execution blocks without dependencies can be parallelized
3. Loops can be parallelized by distributing nondependent iterations into threads
4. Pipelines that execute continuously can be split into threads by pipe stage
Kernel Usage/Parallelism
- DSP-C extracts kernels from loop nests
  - Standard C loops or “dsp_for” loops
  - Can find loop-iteration and overlap parallelism, but relies on compiler analysis to find possible dependencies
    - Probably OK for some cases, but may be too complex in general
  - Flexible: allows the compiler to pick the “best” loop nest for parallelism, if it is smart enough to make a good choice
    - We can supply “hints” to overcome some compiler stupidity
- Stream-C explicitly calls function-like kernels
  - Makes most inter-iteration dataflow and parallelism fairly obvious
    - Stream I/O behavior is clearly stated
    - Still allows inter-iteration scalar dependencies
  - Overlapping/combining of kernels requires some Stream-C analysis
  - Kernels are static and cannot be arbitrarily broken up to fit hardware
- Brook also calls function-like kernels
  - Makes most inter-iteration dataflow and parallelism fairly obvious
    - Stream inputs are simple, but outputs can vary in length
    - Inter-iteration dependencies are completely eliminated
  - Overlapping of kernels can often be determined using “prototypes”
  - Kernels are static and cannot be arbitrarily broken up to fit hardware
Stream Communication Allowed
- DSP-C: written CAA entries to read CAA entries
  - Compiler must ensure that all potential dependencies are maintained
  - Pattern of reading and writing within CAAs must be determined
    - Static “footprint” of reads and writes per iteration should be calculated to maximize parallelism
    - Dynamic address calculation limits parallelism for R/W arrays
  - Much of this analysis must be done anyway to optimize memory layout
    - May affect the compiler’s selection of a loop nest to parallelize
- Stream-C: written streams to read streams
  - Kernels normally must be serialized when a kernel reads a stream written by another one
  - Overlap is possible if the compiler checks substream usage
- Brook: written output to read stream footprint
  - Similar to Stream-C, but the read-footprint specification in the kernel header allows kernel overlap without compiler analysis inside the kernel
Scalar Communication Allowed
- DSP-C: anything allowed
  - Pattern matcher in the compiler looks for reduction operations and turns them into parallel reductions with combining code at the end
  - Other dependencies are left alone and make the loop serial, unless they happen to map to specialized target hardware
    - Short-hop communication would map to the inter-cluster communication hardware on Imagine
    - Use of CAA input or output ports with “if . . . then port++” usage to target Imagine conditional streams
    - Another possibility: allow explicit communication when desired, using an extra “parallelism” dimension on CAAs (Nuwan’s idea)
- Stream-C: anything allowed
  - Uses tricks like prefix calculation to minimize inter-cluster dependencies, when possible (How effective?? How automatic??)
  - Uses high-bandwidth inter-cluster communication when necessary
  - Supports high-speed conditional stream pointer hardware in Imagine
- Brook: only certain reductions allowed
  - Reduction operator can be applied to a single kernel return value
Compiler Analysis in DSP-C
- Picking parallel loops
  - Could be done automatically, but would probably be best done with programmer hints (“parallel for” loops)
  - Various nesting levels can be considered for parallel splitting
- Calculation of the memory footprint used by loop array accesses
  - Try to take advantage of the DSP-C register/local/main memory layers
  - Used to perform local memory (SRF, memory mat, etc.) allocation/buffering of streams during execution
  - Optional rearrangement of non-dependent loop iterations to improve memory-system usage
- Analysis of dependencies between loop iterations
  - Elimination of reductions using reduction optimizations
  - Marking of required dependencies for communication
- Generation of parallel kernels
  - Unroll loops as desired and recalculate memory footprints
  - Apply SIMD operations to (partially) parallelize one dimension of the loop nest, if possible
  - Generate the “best” parallel kernel based on all previous analysis (and probably based on profiling of several “good” choices)
CA: Memory Footprint Graphic
[Figure: memory footprints at each level of a 2-D convolution loop nest, built from an input array, a coefficient array, an accumulator, and an output stream.]

- Single iteration: reads two input values and both reads and writes an accumulator. Footprint: 3; per-iteration I/O: 4.
- Inner loop: runs down a row (or column) of elements from the two input arrays, while reusing only a single accumulator repeatedly in a classic reduction operation. Footprint: 19; per-iteration I/O: 2.
- Second loop: goes the opposite direction from the inner loop to complete the 2-D input arrays. At this level, an output pixel is produced; other than the accumulator, nothing is used more than once in the calculation. Footprint: 164; per-iteration I/O: 18.
- Third loop: the accumulator and coefficients become fixed and are reused every time, while the input data is streamed in, with a single new column (or row) brought in each iteration to replace an old one; output is a simple stream of data flowing out. Footprint: 154 + 10 x rowsize; per-iteration I/O: 10.
- Final loop: extends the third loop’s footprint by multiplying the input and output streams by a number of rows (or columns), while keeping the coefficient array and accumulator identical. Footprint: 146 + 8 x rowsize + 8 x columnsize + 2 x rowsize x columnsize; per-iteration I/O: 10.
CA: Memory Model Graphic
[Figure: three-level memory model and kernel execution timeline.]

- Registers (size: a few scalar values)
- Local memory (size: a small number of arrays); communication with registers: load individual array elements
- Main memory (size: essentially unlimited); communication with local memory: load entire arrays (or segments)

Timeline: load the input arrays into local memory from main memory; run the computation plus local <-> register communication; store the results from local memory back to main memory. The previous kernel’s LM -> MM store and the next kernel’s MM -> LM load can overlap with the current kernel’s computation.
CA: SIMD Extraction Graphic
[Figure: six ways of extracting SIMD parallelism across parallel execution units PE0–PE3; light-to-dark colors represent different SIMD parallel “slots.”]

- Interleaved-unit parallelism
- Array-division parallelism
- Double-interleaved, SIMD first
- Double-interleaved, parallel execution units first
- SIMD interleaved, parallel units divided
- Parallel units interleaved, SIMD divided
Pipeline-in-Time Syntax
- It might be a good idea to steal some of StreaMIT’s dataflow-definition technology to allow pipelining in time
- In DSP-C, we must add an optional “time” dimension that can be added to any CAA
  - Acts like a standard dimension, except that it can only be accessed relative to the current time, and not absolutely
  - The size of this dimension is unspecified (determined by blocking)
- More Stream-C/Brook-style kernel encoding of dataflow would probably be better, to allow automatic determination of blocking-in-time by the compiler
- Need special syntax if we want to handle out-of-flow messages, like control signals that interrupt normal time flow
  - This is my current project!