Post on 25-Sep-2020
An excellent way of looking at FIR optimization as a function of processor architecture
Assignment 3
Knowledge expected by midterm
Start with basic FIR filter
float FIR_Filter(float newValue, float *FIFO, float *coeffs, int numTaps)
The first three parameters arrive in registers R4, R8 and R12 – where does numTaps arrive?
Course exams – I WILL PROBABLY say – pretend numTaps comes in R16
How to handle in real life – write it in C++ first and see what the compiler does to handle this situation – then copy that
Careful – the compiler treats these situations differently, as "it knows more" in the second case:

float FIR_Filter_1(float newValue, float *FIFO, float *coeffs, int numTaps) {
}

And:

extern volatile float FIFO[ ];
extern volatile float coeffs[ ];

float FIR_Filter_2(float newValue, int numTaps) {
}

And these differently – and perhaps differently between debug and release modes:

extern volatile float FIFO[ ];
extern volatile float coeffs[ ];
#define numTaps 120

float FIR_Filter_2(float newValue) {
}

extern volatile float FIFO[ ];
extern volatile float coeffs[ ];
volatile int numTaps = 120;

float FIR_Filter_2(float newValue) {
}

extern volatile float FIFO[ ];
extern volatile float coeffs[ ];
int numTaps = 120;

float FIR_Filter_2(float newValue) {
}
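The difference can be seen in a small compilable sketch (the sizes and values below are made up for illustration): when the tap count arrives as a runtime parameter the compiler must keep the loop general, but when it is a #define the trip count is known at compile time and the loop can be fully unrolled; volatile on the arrays forces every element to be re-read either way.

```c
#define NUM_TAPS 4                     /* hypothetical tap count for illustration */

volatile float FIFO[NUM_TAPS]   = {1.0f, 2.0f, 3.0f, 4.0f};
volatile float coeffs[NUM_TAPS] = {0.5f, 0.5f, 0.5f, 0.5f};

/* Variant 1: tap count arrives at run time - trip count unknown to the compiler */
float FIR_sum_param(int numTaps) {
    float sum = 0.0f;
    for (int i = 0; i < numTaps; i++)
        sum += FIFO[i] * coeffs[i];    /* volatile: each element re-read from memory */
    return sum;
}

/* Variant 2: tap count fixed by #define - the compiler "knows more"
   and is free to unroll the loop completely */
float FIR_sum_define(void) {
    float sum = 0.0f;
    for (int i = 0; i < NUM_TAPS; i++)
        sum += FIFO[i] * coeffs[i];
    return sum;
}
```

Both return the same value; what differs is the code the compiler is allowed to generate, which is worth comparing between debug and release builds.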
Standard FIR filter from Lab 1
float FIR_Filter(float newValue, float *FIFO, float *coeffs, int numTaps) {
    for (int count = 1; count < numTaps; count++)
        FIFO[count - 1] = FIFO[count];
    float *FIFOpt = FIFO + numTaps - 1;    // Does C do pointer arithmetic? Yes
    *FIFOpt = newValue;
    float sum = 0.0f;
    for (int count = 0; count < numTaps; count++)
        sum = sum + *FIFOpt-- * *coeffs++;
    return sum;
}
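A self-contained, compilable version of the Lab 1 filter for checking at a desk (the 4-tap moving-average values in the test below are made up for illustration):

```c
/* Standard FIR: shift the FIFO, insert the new sample, then dot-product.
   coeffs[0] multiplies the newest sample because FIFOpt walks down
   while coeffs walks up. */
float FIR_Filter(float newValue, float *FIFO, float *coeffs, int numTaps) {
    for (int count = 1; count < numTaps; count++)
        FIFO[count - 1] = FIFO[count];      /* shift older samples down */
    float *FIFOpt = FIFO + numTaps - 1;     /* yes, C does pointer arithmetic */
    *FIFOpt = newValue;                     /* newest sample at the top */
    float sum = 0.0f;
    for (int count = 0; count < numTaps; count++)
        sum = sum + *FIFOpt-- * *coeffs++;
    return sum;
}
```

Fed the samples 1, 2, 3, 4 through a 4-tap moving average (all coefficients 0.25), the outputs are 0.25, 0.75, 1.5 and 2.5.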
Assume the processor architecture is von Neumann and can't do a data fetch, add or multiplication in the same cycle
Now – increase cycle time by 25% to do pt++ in same cycle as fetch – STEP 1
Now – increase cycle time by 25% to do pt++ in same cycle as fetch – STEP 1 – Change pipeline to allow 1 math op to occur during next fetch – STEP 2
UNROLL LOOP TO OPEN UP OTHER POSSIBLE PARALLEL INSTRUCTIONS
TOTALLY MEMORY / DAG 1 RESOURCE LIMITED
NEED TO CHANGE PROCESSOR ARCHITECTURE
Instead of 1 cycle mult + 1 cycle add
Use 2 cycle (pipelined) MACC instruction
Multiply / Accumulate
Does a 1 or 2 cycle MACC improve performance?
• FETCH MULT INSTRUCTION
• DO MULT – FETCH ADD INSTRUCTION
• DO ADD

Compared to 2 cycle MACC:
• FETCH MACC INSTRUCTION
• DO MULT
• DO ADD
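A toy counting model of why the MACC helps (an assumption for illustration, not exact SHARC timing): each tap needs one mult and one add, so with separate instructions the inner loop fetches two instruction words per tap, while a MACC fetches one – and on a von Neumann bus every instruction fetch competes with the data fetches.

```c
/* Instruction fetches needed by the inner loop, per this toy model */
int fetches_separate(int numTaps) { return 2 * numTaps; }  /* mult + add per tap */
int fetches_macc(int numTaps)     { return 1 * numTaps; }  /* one MACC per tap  */

/* MACC semantics in C terms: one instruction doing acc = acc + x * c */
float macc(float acc, float x, float c) { return acc + x * c; }
```

For a 120-tap filter that is 240 instruction fetches against 120; whether a 1-cycle MACC then beats a pipelined 2-cycle MACC depends on whether the two internal cycles can be overlapped with the next fetch.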
Assume a Harvard architecture with floating-point MACC (SHARC)
Harvard processor without the MACC
Colour each resource for an instruction
Take advantage (carefully) of parallel DM and PM operations to fetch instructions earlier
In principle 4 cycles faster for twice round the loop – but data dependencies conflict
You complete the analysis with separate Add and Mult instructions
Show the advantages of using a 2 cycle MACC instruction. Does a 1 cycle MACC offer any further advantage?
Move over to Super Harvard architecture with the instruction cache always in use. Start using the PM bus for data ops
• DON’T LOOK AT NEXT SLIDE UNTIL YOU HAVE TACKLED LAST SLIDE
Loop of size 10 for twice around the loop
Key resource – FETCH INSTR 8 / 10
Using the cache ONLY when instr / data conflict on the PM bus means you can have a smaller (cheaper) cache
Get more speed by UNROLLING THE LOOP 3 times and then thinking
Re-roll the loop and execute N-2 times
Next step – MOVE TO VLIW instruction set
WHERE INSTR ALLOWS MATH-OP, dm and pm fetch at the same time
DOES NOT HAVE TO WAIT
Next step – MOVE TO V-VLIW instruction set
WHERE INSTR ALLOWS + and *, dm and pm fetch at the same time
DOES NOT HAVE TO WAIT
IF USE V-VLIW INSTR (*, +, dm, pm)
then loop is 1 cycle
The FIR loop looks like this
• FETCH DATA1
• FETCH DATA2, DO MULT OF DATA1
• FETCH DATA3, DO MULT OF DATA2, ADD OF DATA1
• FETCH DATA4, DO MULT OF DATA3, ADD OF DATA2
• DO MULT OF DATA4, ADD OF DATA3
• ADD OF DATA4
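That schedule is software pipelining, and it can be emulated in C (a sketch with hypothetical names): a short prologue fills the pipe, a steady-state body runs numTaps - 2 times doing one fetch, one mult and one add per pass (the single V-VLIW line), and an epilogue drains it – the result matches the plain dot product.

```c
/* Plain dot product for reference */
float fir_reference(const float *fifo, const float *coeffs, int numTaps) {
    float sum = 0.0f;
    for (int i = 0; i < numTaps; i++)
        sum += fifo[i] * coeffs[i];
    return sum;
}

/* Software-pipelined version mirroring the slide's schedule (needs numTaps >= 2) */
float fir_pipelined(const float *fifo, const float *coeffs, int numTaps) {
    float data = fifo[0];                 /* FETCH DATA1                    */
    float prod = data * coeffs[0];        /* DO MULT OF DATA1 ...           */
    float next = fifo[1];                 /* ... FETCH DATA2 (same line)    */
    float sum = 0.0f;
    for (int i = 2; i < numTaps; i++) {   /* steady state: numTaps - 2 passes */
        sum += prod;                      /* ADD of data(i-2)               */
        prod = next * coeffs[i - 1];      /* MULT of data(i-1)              */
        next = fifo[i];                   /* FETCH data(i)                  */
    }
    sum += prod;                          /* epilogue: drain the pipe       */
    prod = next * coeffs[numTaps - 1];
    sum += prod;
    return sum;
}
```

Each pass of the steady-state loop touches three different data items at three different pipeline stages, which is exactly why one V-VLIW instruction per tap is enough.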
Lab 2
• Programming VLIW assembly code (single cycle FIR hardware loop)
• Does C++ automatically switch to this mode in release mode if we pass dm and pm memory array pointers?
• If not – how do we make C++ switch to this mode?
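On Analog Devices' SHARC toolchains, placement in the two memory spaces is exposed to C/C++ through dm and pm type qualifiers (an assumption based on the VisualDSP++/CCES compiler conventions – check your toolchain manual for the exact spelling and for which optimization settings actually emit the parallel-move loop):

```c
/* Hypothetical sketch - SHARC compiler extension, not portable C */
#define NUM_TAPS 120                /* hypothetical size */

float dm FIFO[NUM_TAPS];            /* data buffer on the DM bus     */
float pm coeffs[NUM_TAPS];          /* coefficients on the PM bus    */

float FIR_Filter(float newValue, float dm *fifo, float pm *c, int numTaps);
```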