University at Albany, SUNY
lrm 04/20/23
Levels of Processor/Memory Hierarchy
• Can be modeled by increasing the dimensionality of the data array.
  – Additional dimension for each level of the hierarchy.
  – Envision data as reshaped to reflect the increased dimensionality.
  – The calculus automatically transforms the algorithm to reflect the reshaped data array.
  – Data layout, data movement, and scalarization are automatically generated based on the reshaped data array.
Levels of Processor/Memory Hierarchy (continued)
• Math and indexing operations in the same expression
• Framework for design space search
  – Rigorous and provably correct
  – Extensible to complex architectures
Approach: Mathematics of Arrays
Example: "raising" array dimensionality, y = conv(x)
[Figure: memory hierarchy (Main Memory, L2 Cache, L1 Cache) vs. parallelism.
 x = < 0 1 2 … 35 > is mapped across processors P0, P1, P2:
   P0: < 0 1 2 >    < 3 4 5 >    < 6 7 8 >    < 9 10 11 >
   P1: < 12 13 14 > < 15 16 17 > < 18 19 20 > < 21 22 23 >
   P2: < 24 25 26 > < 27 28 29 > < 30 31 32 > < 33 34 35 >]
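The mapping above can be sketched as an ordinary row-major reshape, adding one dimension per hierarchy level. This is an illustrative sketch in plain C++, not MoA notation; the function name `lift` and the nested-vector representation are assumptions.

```cpp
#include <cassert>
#include <vector>

// Reshape a flat array of p*blocks*len elements into shape p x blocks x len:
// one dimension for the processors (P0..P2), one for the blocks assigned to
// each processor, one for the elements in each block.
std::vector<std::vector<std::vector<int>>> lift(const std::vector<int>& x,
                                                int p, int blocks, int len) {
    std::vector<std::vector<std::vector<int>>> r(
        p, std::vector<std::vector<int>>(blocks, std::vector<int>(len)));
    for (int i = 0; i < p; ++i)
        for (int j = 0; j < blocks; ++j)
            for (int k = 0; k < len; ++k)
                r[i][j][k] = x[(i * blocks + j) * len + k];  // row-major index
    return r;
}
```

With x = <0 1 … 35>, `lift(x, 3, 4, 3)` reproduces the figure's layout: processor P0 holds blocks <0 1 2> through <9 10 11>, and so on.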
Application Domain: Signal Processing
3-D Radar Data Processing: Composition of Monolithic Array Operations
• The algorithm is input.
• Architectural information is input.
  – Hardware info: memory, processor
• Change the algorithm to better match hardware/memory/communication: lift the dimension algebraically.
Processing chain: Pulse Compression (convolution), Doppler Filtering, Beamforming (matrix multiply), Detection.
Model processors (dim = dim+1); model time-variance (dim = dim+1); model Level 1 cache (dim = dim+1).
Model all three: dim = dim+3.
Some Psi Calculus Operations
Convolution: PSI Calculus Description
PSI Calculus operators compose to form higher-level operations.

Definition of y = conv(h, x):
  y[n] = Σ_{k=0}^{M-1} h[k]·x'[n+k]
where x has N elements, h has M elements, 0 ≤ n < N+M-1, and x' is x padded by M-1 zeros on either end.

Algorithm and PSI Calculus Description (x = < 1 2 3 4 >, h = < 5 6 7 >):

  Algorithm step             Psi Calculus
  Initial step               x = < 1 2 3 4 >, h = < 5 6 7 >
  Form x'                    x' = cat(reshape(<M-1>, <0>), cat(x, reshape(<M-1>, <0>))) = < 0 0 1 2 3 4 0 0 >
  Rotate x' (N+M-1) times    x'rot = binaryOmega(rotate, 0, iota(N+M-1), 1, x') = < < 0 0 1 2 … > < 0 1 2 3 … > < 1 2 3 4 … > … >
  Take the size-of-h part    x'final = binaryOmega(take, 0, reshape(<N+M-1>, <M>), 1, x'rot) = < < 0 0 1 > < 0 1 2 > < 1 2 3 > … >
  Multiply                   Prod = binaryOmega(*, 1, h, 1, x'final) = < < 0 0 7 > < 0 6 14 > < 5 12 21 > … >
  Sum                        Y = unaryOmega(sum, 1, Prod) = < 7 20 38 … >
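The pad/rotate/take/multiply/sum pipeline can be sketched in plain C++. This is illustrative only (the real description composes binaryOmega/unaryOmega operators); the rotate, take, multiply, and sum steps are fused into one loop nest.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = conv(h, x) per the slide's definition: y[n] = sum_k h[k] * x'[n+k],
// where x' is x padded with M-1 zeros on either end.
std::vector<int> conv(const std::vector<int>& h, const std::vector<int>& x) {
    const int M = (int)h.size(), N = (int)x.size(), tz = N + M - 1;
    std::vector<int> xp(M - 1, 0);                 // leading zeros
    xp.insert(xp.end(), x.begin(), x.end());       // x itself
    xp.insert(xp.end(), M - 1, 0);                 // trailing zeros
    std::vector<int> y(tz, 0);
    for (int n = 0; n < tz; ++n)                   // one rotation per output
        for (int k = 0; k < M; ++k)                // take M, multiply, sum
            y[n] += h[k] * xp[(std::size_t)(n + k) % xp.size()];
    return y;
}
```

For x = <1 2 3 4> and h = <5 6 7> this yields Y beginning < 7 20 38 … >, matching the table above.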
Experimental Platform and Method
Hardware
• DY4 CHAMP-AV board
  – Contains 4 MPC7400s and 1 MPC 8420
• MPC7400 (G4)
  – 450 MHz
  – 32 KB L1 data cache
  – 2 MB L2 cache
  – 64 MB memory/processor
Software
• VxWorks 5.2
  – Real-time OS
• GCC 2.95.4 (non-official release)
  – GCC 2.95.3 with patches for VxWorks
  – Optimization flags: -O3 -funroll-loops -fstrict-aliasing
Method
• Run many iterations; report average, minimum, and maximum time
  – From 10,000,000 iterations for small data sizes to 1,000 for large data sizes
• All approaches run on the same data
• Only average times shown here
• Only one G4 processor used
• Use of the VxWorks OS resulted in very low variability in timing
• High degree of confidence in results
Convolution and Dimension Lifting
• Model processor and Level 1 cache.
  – Start with 1-d inputs (input dimension).
  – Envision a 2nd dimension ranging over output values.
  – Envision processors reshaped into a 3rd dimension; the 2nd dimension is partitioned.
  – Envision cache reshaped into a 4th dimension; the 1st dimension is partitioned.
  – "psi" Reduce to Normal Form.
– Envision a 2nd dimension ranging over output values.
Let tz = N+M-1, where M is the size of h (here 4) and N is the size of x.
[Figure: a tz × tz array whose rows are shifted copies of < h3 h2 h1 h0 > aligned against the padded input < 0 0 0 x0 … x4 … >.]
– Envision processors reshaped into a 3rd dimension; the 2nd dimension is partitioned.
Let p = 2.
[Figure: the tz × tz array reshaped to p × tz/p × tz, here 2 × tz/2 × tz, one block of rows per processor.]
– Envision cache reshaped into a 4th dimension; the 1st dimension is partitioned.
[Figure: each processor's tz/2 × tz block is reshaped again, splitting the partitioned dimension into cache-sized pieces and yielding a 4-dimensional array.]
ONF for the Convolution Decomposition with Processors & Cache
Generic form: 4-dimensional, after "psi" Reduction
Let tz = N+M-1, where M is the size of h and N is the size of x.  (Time Domain)

1. For i0 = 0 to p-1 do:                        (processor loop)
2.   For i1 = 0 to tz/p - 1 do:                 (time loop)
3.     sum ← 0
4.     For icacherow = 0 to M/cache - 1 do:     (cache loop)
5.       For i3 = 0 to cache - 1 do:
6.         sum ← sum + h[(M - ((icacherow · cache) + i3)) - 1] · x'[((tz/p · i0) + i1) + (icacherow · cache) + i3]

sum is calculated for each element of y.
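The 4-deep loop nest transcribes directly into C++. This is an illustrative sketch: `conv_onf` is an assumed name, and it assumes p divides tz and cache divides M. Note the loop indexes h reversed against the padded input, so it computes the standard convolution h*x.

```cpp
#include <cassert>
#include <vector>

// ONF loop nest: processor loop (i0), time loop (i1), cache loops (icr, i3).
// x' is x padded with M-1 zeros on either end; each (i0, i1) pair produces
// one element of y.
std::vector<int> conv_onf(const std::vector<int>& h, const std::vector<int>& x,
                          int p, int cache) {
    const int M = (int)h.size(), N = (int)x.size(), tz = N + M - 1;
    std::vector<int> xp(M - 1, 0);
    xp.insert(xp.end(), x.begin(), x.end());
    xp.insert(xp.end(), M - 1, 0);
    std::vector<int> y(tz);
    for (int i0 = 0; i0 < p; ++i0)                    // processor loop
        for (int i1 = 0; i1 < tz / p; ++i1) {         // time loop
            int sum = 0;
            for (int icr = 0; icr < M / cache; ++icr) // cache row loop
                for (int i3 = 0; i3 < cache; ++i3)    // within cache row
                    sum += h[(M - (icr * cache + i3)) - 1]
                         * xp[(tz / p) * i0 + i1 + icr * cache + i3];
            y[(tz / p) * i0 + i1] = sum;
        }
    return y;
}
```

The result is independent of the p and cache decomposition: any valid (p, cache) pair yields the same y.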
Outline
• Overview
• Array Algebra: MoA and Index Calculus: Psi Calculus
• Time Domain Convolution
• Other algorithms in Radar
  – Modified Gram-Schmidt QR Decomposition: MoA to ONF, Experiments
  – Composition of Matrix Multiplication in Beamforming: MoA to DNF, Experiments
  – FFT
• Benefits of Using MoA and Psi Calculus
Algorithms in Radar
[Diagram: three algorithms (Time Domain Convolution(x,y), Modified Gram-Schmidt QR(A), and Beamforming A × (BH × C)) each receive a manual description & derivation for 1 processor using MoA & Psi Calculus, yielding a DNF; the dimension is lifted (processor, L1 cache) and the DNF is reformulated into an ONF for 1 processor. Related threads: mechanize using expression templates; use to reason about RAW; benchmark at NCSA with LAPACK; compiler optimizations from DNF to ONF; implement DNF/ONF in Fortran 90; thoughts on an abstract machine.]
Benefits of Using MoA and Psi Calculus
• The processor/memory hierarchy can be modeled by reshaping data, using an extra dimension for each level.
• Composition of monolithic operations can be reexpressed as composition of operations on smaller data granularities
  – Matches memory hierarchy levels
  – Avoids materialization of intermediate arrays
• The algorithm can be automatically (algebraically) transformed to reflect the array reshapings above.
• Facilitates programming expressed at a high level
  – Facilitates intentional program design and analysis
  – Facilitates portability
• This approach is applicable to many other problems in radar.
ONF for the QR Decomposition with Processors & Cache
Modified Gram-Schmidt
[Diagram: the main loop contains Initialization, Compute Norm, Normalize, Dot Product, and Orthogonalize; Compute Norm and Normalize sit inside processor loops, Dot Product and Orthogonalize inside processor/cache loops.]
DNF for the Composition of A × (BH × C)     (Beamforming)
Generic form: 4-dimensional
Given A, B, X, Z: n by n arrays

1. Z = 0
2. For i = 0 to n-1 do:
3.   For j = 0 to n-1 do:
4.     For k = 0 to n-1 do:
5.       Z[k;] ← Z[k;] + A[k;j] × X[j;i] × B[i;]
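In plain C++ the DNF reads as follows. This is an illustrative sketch (`Mat` and `beamform_dnf` are assumed names); the row update Z[k;] is expanded into an explicit inner loop, and the nest computes Z = A·X·B without materializing the intermediate product X·B.

```cpp
#include <cassert>
#include <vector>

using Mat = std::vector<std::vector<int>>;

// DNF loop nest: Z[k;] += A[k;j] * X[j;i] * B[i;], i.e. Z = A * X * B,
// with no temporary array for the inner product X * B.
Mat beamform_dnf(const Mat& A, const Mat& X, const Mat& B) {
    const int n = (int)A.size();
    Mat Z(n, std::vector<int>(n, 0));            // 1. Z = 0
    for (int i = 0; i < n; ++i)                  // 2.
        for (int j = 0; j < n; ++j)              // 3.
            for (int k = 0; k < n; ++k)          // 4.
                for (int l = 0; l < n; ++l)      // 5. row update Z[k;]
                    Z[k][l] += A[k][j] * X[j][i] * B[i][l];
    return Z;
}
```

With identity matrices for two of the operands, the result collapses to the third, which makes the composition easy to sanity-check.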
Typical C++ Operator Overloading
Example: A = B + C (vector add)
[Diagram: 1. Main passes B& and C& to operator+; 2. operator+ creates a temporary result vector; 3. results are calculated and stored in the temporary; 4. a copy of the temporary is returned; 5. the result's reference is passed to operator=; 6. operator= performs the assignment. 2 temporary vectors created.]
Additional memory use:
• Static memory
• Dynamic memory (also affects execution time)
• Cache misses / page faults
Additional execution time:
• Time to create a new vector
• Time to create a copy of a vector
• Time to destruct both temporaries
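A minimal sketch of the pattern being criticized (not the paper's code; the `Vec` type is an assumption). The overloaded operator+ builds a full temporary result vector, which is then returned to the caller and assigned into A; under the C++98 semantics the slides describe, the return also produces a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive overloaded vector addition: every B + C materializes a temporary.
struct Vec {
    std::vector<double> d;
    explicit Vec(std::size_t n, double v = 0) : d(n, v) {}
};

Vec operator+(const Vec& b, const Vec& c) {
    Vec temp(b.d.size());                        // 2. temporary result vector
    for (std::size_t i = 0; i < b.d.size(); ++i)
        temp.d[i] = b.d[i] + c.d[i];             // 3. results into temporary
    return temp;                                 // 4. returned to caller
}
```

An expression like A = B + C + D would materialize one temporary per +, which is what motivates the expression-template approach on the next slide.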
C++ Expression Templates and PETE
Parse trees, not vectors, created.
Example: A = B + C
  Expression: A = B + C
  Expression type: BinaryNode<OpAdd, Reference<Vector>, Reference<Vector> >
  Parse tree: + over B and C
[Diagram: 1. Main passes B& and C& to operator+; 2. operator+ creates an expression parse tree; 3. the parse tree is returned; 4. the expression tree reference is passed to operator=; 5. operator= calculates the result and performs the assignment.]
Reduced memory use:
• Parse tree only contains references
Reduced execution time:
• Better cache use
• Loop fusion style optimization
• Compile-time expression tree manipulation
PETE: http://www.acl.lanl.gov/pete
• PETE, the Portable Expression Template Engine, is available from the Advanced Computing Laboratory at Los Alamos National Laboratory
• PETE provides:
  – Expression template capability
  – Facilities to help navigate and evaluate parse trees
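A stripped-down sketch of the idea (illustrative only; the names are not PETE's actual API). B + C builds a lightweight node holding only references, and the single fused loop runs inside operator=.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vector;

// Parse-tree node for "+": holds references, allocates no temporaries.
struct AddNode {
    const Vector& l;
    const Vector& r;
};

struct Vector {
    std::vector<double> d;
    explicit Vector(std::size_t n, double v = 0) : d(n, v) {}
    Vector& operator=(const AddNode& e);   // evaluates the parse tree
};

// operator+ returns the parse tree, not a result vector.
AddNode operator+(const Vector& b, const Vector& c) { return AddNode{b, c}; }

Vector& Vector::operator=(const AddNode& e) {
    for (std::size_t i = 0; i < d.size(); ++i)   // one fused loop
        d[i] = e.l.d[i] + e.r.d[i];
    return *this;
}
```

PETE generalizes this with templated node types (BinaryNode<OpAdd, …>) so arbitrary expressions build their tree type at compile time.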
Implementing Psi Calculus with Expression Templates
Example: A = take(4, drop(3, rev(B)))
B = < 1 2 3 4 5 6 7 8 9 10 >,  A = < 7 6 5 4 >
Recall: Psi Reduction for 1-d arrays always yields one or more expressions of the form:
  x[i] = y[stride·i + offset],  l ≤ i < u
1. Form expression tree
2. Add size information
3. Apply Psi Reduction rules
4. Rewrite as sub-expressions with iterators at the leaves, and loop bounds information at the root
[Figure: the expression tree take(4, drop(3, rev(B))), annotated bottom-up with sizes and reduced step by step:
  B:           size=10   A[i] = B[i]
  rev(B):      size=10   A[i] = B[-i + B.size - 1] = B[-i+9]
  drop(3, …):  size=7    A[i] = B[-(i+3)+9] = B[-i+6]
  take(4, …):  size=4    A[i] = B[-i+6]
 yielding an iterator with offset=6, stride=-1, size=4.]
• Iterators used for efficiency, rather than recalculating indices for each i
• One "for" loop to evaluate each sub-expression
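The end result of the reduction can be sketched as a single loop (illustrative C++; `eval_reduced` is an assumed name that hard-codes the stride, offset, and size derived above). No intermediate arrays for rev or drop are ever built.

```cpp
#include <cassert>
#include <vector>

// A = take(4, drop(3, rev(B))) after psi reduction: the three operations
// collapse to one strided access A[i] = B[stride*i + offset].
std::vector<int> eval_reduced(const std::vector<int>& B) {
    const int stride = -1, offset = 6, size = 4;  // from the reduction above
    std::vector<int> A(size);
    for (int i = 0; i < size; ++i)                // one "for" loop
        A[i] = B[stride * i + offset];
    return A;
}
```

For B = <1 2 3 4 5 6 7 8 9 10> this produces A = <7 6 5 4>, matching the example.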