Post on 11-Jan-2016
Ultra sound solution
Impact of C++ DSP optimization techniques
Research Team discussion
Ultrasound probe (20 MHz) sends signals into the body that reflect off moving blood cells (artery? vein?)
The received ultrasound frequency is Doppler shifted compared to the transmitted frequency
Same as the sound when an ambulance goes by: higher if approaching, lower if receding
They get the positive frequencies (towards) on the left audio channel and the negative frequencies (away) on the right audio channel.
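The Doppler relationship described above can be made concrete with a short calculation. This is a minimal sketch, assuming the standard pulse-echo Doppler formula fd = 2 * f0 * v * cos(theta) / c and a speed of sound in soft tissue of about 1540 m/s. Neither the formula nor the helper names `dopplerShiftHz` and `belongsOnLeftChannel` appear on the slides, and the blood velocity used below is purely illustrative.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from the slides): Doppler shift in Hz for a
// reflector moving at velocityMps (positive = towards the probe).
// Assumes the standard pulse-echo formula fd = 2 * f0 * v * cos(theta) / c.
double dopplerShiftHz(double transmitHz, double velocityMps,
                      double angleRad, double soundSpeedMps = 1540.0) {
    assert(soundSpeedMps > 0.0);
    return 2.0 * transmitHz * velocityMps * std::cos(angleRad) / soundSpeedMps;
}

// Positive shifts (towards) go to the left channel, negative (away) to the right.
bool belongsOnLeftChannel(double shiftHz) {
    return shiftHz > 0.0;
}
```

With a 20 MHz probe and an assumed 0.5 m/s flow straight towards the probe, `dopplerShiftHz(20.0e6, 0.5, 0.0)` comes out at roughly 13 kHz, which is in the audio range and is consistent with the shifts being delivered on audio channels.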
04/21/23.ENCM515 – Ultrasound ProblemCopyright smithmr@ucalgary.ca 2 / 33
Picture looks like this
Note that the display loses all direction information
Can I help them to output the maximum frequency?
Captured audio signal
Engineering Problems
Problem 5 – Different amplitudes in the two channels are common
Problem 6 – Why are funny dead spots not lining up in left and right channels?
Handling stereo not mono signals
Incorrect labeling / misinterpretation
Problem 7 – How to remove dead-spots?
Max frequency – definition 1
Frequency below which X% of the frequencies fall
Noisy signal for large thresholds (> 80%)
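Definition 1 can be sketched as a cumulative sum over a spectrum. This is a minimal illustration, assuming that "X% of the frequencies" means X% of the total spectral power and that the spectrum is already available as one power value per frequency bin. The function name `maxFrequencyBin` is hypothetical and not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the slides): returns the index of the first
// spectrum bin at which the cumulative power reaches `percent` of the total.
// Assumes "X% of the frequencies" means X% of the total spectral power.
std::size_t maxFrequencyBin(const std::vector<double>& power, double percent) {
    assert(percent >= 0.0 && percent <= 100.0);

    double total = 0.0;
    for (double p : power) total += p;
    if (power.empty() || total <= 0.0) return 0;

    const double target = total * percent / 100.0;
    double running = 0.0;
    for (std::size_t bin = 0; bin < power.size(); ++bin) {
        running += power[bin];
        if (running >= target) return bin;  // first bin reaching the threshold
    }
    return power.size() - 1;  // only reached through rounding error
}
```

One plausible reason for the noise at large thresholds: near the top of the cumulative curve each bin adds very little power, so a small change in the noise floor can move the returned bin a long way.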
After XPI Stage 2
Have a working algorithm concept
Engineering Problem 1 – Complex math (a + jb) on SHARC!
Engineering Problem 2 – Define maximum frequency
Zillions of blood cells, therefore a distribution of frequencies
Workable prototype – discuss more with customer
Engineering Problem 3 – SHARC D/A can’t handle DC signal
Workable prototype – discuss more with customer
Engineering Problem 4 – Can SHARC handle all this in real-time?
Problem 5 – Are different amplitudes of input channels common? Yes
Problem 6 – Why are funny dead spots not lining up in left and right channels?
Artifact – mislabeled and misinterpreted samples
Problem 7 – How to remove dead-spots? – Discuss more with customer
ProcessBlock DONE OUTSIDE INTERRUPT
AVOIDS RACE
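The idea of keeping the block processing out of the interrupt can be sketched on a host machine as follows. This is a minimal illustration, not the slide's code. The names `audioInterrupt`, `processBlock`, `pollAndProcess` and the block length of 128 are assumptions, and a real audio task would probably double-buffer so that the interrupt cannot overwrite a block that is still being processed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kBlockSize = 128;   // assumed block length

float inputBlock[kBlockSize];             // filled by the interrupt routine
float outputBlock[kBlockSize];            // written only by the main loop
std::atomic<bool> blockReady{false};      // the only signal between the two

// Stand-in for the audio interrupt: store the samples, raise a flag, return.
void audioInterrupt(const float* samples) {
    assert(samples != nullptr);
    for (std::size_t i = 0; i < kBlockSize; ++i) inputBlock[i] = samples[i];
    blockReady.store(true);
}

// Stand-in for ProcessBlock: here it simply halves every sample.
void processBlock() {
    for (std::size_t i = 0; i < kBlockSize; ++i) outputBlock[i] = 0.5f * inputBlock[i];
}

// Called repeatedly from the main loop, never from the interrupt.
// Returns true when a block was waiting and has now been processed.
bool pollAndProcess() {
    if (!blockReady.exchange(false)) return false;  // nothing new yet
    processBlock();
    return true;
}
```

The interrupt stays short and never touches `outputBlock`, so the long-running work cannot be re-entered halfway through by the next interrupt.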
Real life problem -- Stereo
Minor changes to Audio Preemptive Task
Make “C” code more general
Moved buffer[ ] to external files
Unknown size of arrays being processed
Switch to Release mode
Switching to the optimizing compiler (ReleaseNWC) means you can no longer set breakpoints – fix with these steps
First look at code
Timing -- software loop with r2 as loop counter – test at end
N * (10 – 1) cycles (jump is not db)
-1 for 1 parallel instruction
Use Compiler Info button
3 Stalls – 2 on software jump. 1 on ?
Obvious things to do
We are already processing left and right channels in one program
Switch to left audio in dm memory and right audio in pm memory
Need to do
Make right buffers ‘pm’
Change prototype of function to pass pm
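The prototype change might look roughly like the sketch below. It assumes the SHARC compiler's `dm` and `pm` memory-space qualifiers. Because a host compiler has no such keywords, the sketch uses two hypothetical macros, `DM_SPACE` and `PM_SPACE`, that expand to nothing here. The function name `copyStereoBlock` is also an assumption, and the exact spelling and placement of the qualifiers should be checked against the compiler manual.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical macros: empty on a host compiler. On the SHARC toolchain they
// would be replaced by the `dm` and `pm` qualifiers (assumption, see lead-in).
#define DM_SPACE
#define PM_SPACE

// Left channel buffers are meant to live in dm memory, right channel buffers
// in pm memory, so one left and one right access can share a cycle.
void copyStereoBlock(const float DM_SPACE* leftIn, const float PM_SPACE* rightIn,
                     float DM_SPACE* leftOut, float PM_SPACE* rightOut,
                     std::size_t n) {
    assert(leftIn && rightIn && leftOut && rightOut);
    for (std::size_t i = 0; i < n; ++i) {
        leftOut[i] = leftIn[i];     // dm read, dm write
        rightOut[i] = rightIn[i];   // pm read, pm write
    }
}
```

On the host this is an ordinary copy loop. The point of the sketch is only that both the buffer declarations and the function prototype have to carry the memory-space information.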
As expected 2 cycles saved
Parallel dm and pm reads and writes
Why software loop?
Compiler does not know the size of the loop, so it can’t optimize the loop
THIS PRAGMA IS A CONTRACT BETWEEN THE DEVELOPER AND COMPILER
DON’T LIE
This does not compile
Pragma variables not handled by preprocessor
Variable as end of loop
Compiler will not optimize when the loop parameter is declared external, internal, or static
Loop parameters all constants known to compiler
Drop from 8 cycles to 2 cycles as compiler knows enough to switch to hardware loop control – STALLS FROM JUMP GONE
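The difference between a variable and a constant loop bound can be illustrated with two otherwise identical loops. This is a sketch under stated assumptions: the names `gNumPoints`, `kNumPoints`, `copyVariableBound` and `copyConstantBound` are hypothetical, and which loop becomes a hardware loop depends on the actual compiler and its settings.

```cpp
#include <cassert>
#include <cstddef>

// Run-time variable: the trip count is not known when the loop is compiled.
std::size_t gNumPoints = 128;

void copyVariableBound(const float* in, float* out) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < gNumPoints; ++i) out[i] = in[i];
}

// Compile-time constant: the trip count is visible to the compiler.
constexpr std::size_t kNumPoints = 128;

void copyConstantBound(const float* in, float* out) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < kNumPoints; ++i) out[i] = in[i];
}
```

Both functions produce the same output. Only the second gives the compiler a fixed trip count, which is the information the slides say it needs in order to replace the software loop and its jump.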
Where am I getting all my info?
Can we switch to SIMD mode
VECTORIZATION
MAY NOT BE POSSIBLE IF COMPILER DOES NOT KNOW ABOUT ALIGNMENT OF ARRAYS
(How arrays placed in memory)
Impact of vectorization
Before -- loop count was 0x80
With memory operations of the form r2 = dm(i4, m6) where m6 = 1, meaning code is doing r2 = *i4++;
New instructions – SIMD mode
Bit set mode1 0x200000 (bit clr mode1)
Processor doing r2 = dm(i5, 2)
Same as r2 = dm(i5, 1) AND s2 = dm(i5, 1)
Loading two registers
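The effect of SIMD mode on the loop can be imitated in portable code by handling two elements per iteration. This is only an emulation on a host compiler, not SHARC code. The function names are hypothetical, and the local variables `r2` and `s2` merely mirror the register names on the slide.

```cpp
#include <cassert>
#include <cstddef>

// One element per iteration, like r2 = dm(i4, m6) with m6 = 1.
// Returns the number of loop iterations executed.
std::size_t copyScalar(const float* in, float* out, std::size_t n) {
    std::size_t iterations = 0;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i];
        ++iterations;
    }
    return iterations;
}

// Two elements per iteration, imitating one SIMD access that fills two registers.
std::size_t copyTwoLanes(const float* in, float* out, std::size_t n) {
    assert(n % 2 == 0);               // the sketch assumes an even length
    std::size_t iterations = 0;
    for (std::size_t i = 0; i < n; i += 2) {
        const float r2 = in[i];       // first register
        const float s2 = in[i + 1];   // second register, loaded "for free"
        out[i] = r2;
        out[i + 1] = s2;
        ++iterations;
    }
    return iterations;
}
```

For 0x80 points the scalar version runs 0x80 iterations and the two-lane version runs 0x40, which is the halving of the loop count that vectorization provides.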
Try using #pragma inline
BEFORE / AFTER (20 cycles faster?)
C++ showing out of order execution
WARNING
Let’s do “inline”
ProcessOneBlock( ) is called by four subroutines – let’s inline
Mixed mode view is interesting
Mixed Mode
Out of order execution with 4 copies of the code for DoCopyBlock( ) (one for each of Process0, Process1, Process2, Process3)
NO CODE OF ProcessOneBlock( )
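The call structure implied by the mixed mode view might be sketched as below. This uses the standard C++ `inline` keyword rather than the `#pragma inline` mentioned on the slides. The function signatures, and the assumption that ProcessOneBlock( ) simply forwards to DoCopyBlock( ), are illustrative guesses rather than the original code.

```cpp
#include <cassert>
#include <cstddef>

// Assumed body: a plain copy loop.
inline void DoCopyBlock(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i];
}

// Assumed to be a thin wrapper. Once inlined it leaves no code of its own
// behind, only the loop from DoCopyBlock at each call site.
inline void ProcessOneBlock(const float* in, float* out, std::size_t n) {
    assert(in != nullptr && out != nullptr);
    DoCopyBlock(in, out, n);
}

// Four callers, so full inlining yields four copies of the copy loop.
void Process0(const float* in, float* out, std::size_t n) { ProcessOneBlock(in, out, n); }
void Process1(const float* in, float* out, std::size_t n) { ProcessOneBlock(in, out, n); }
void Process2(const float* in, float* out, std::size_t n) { ProcessOneBlock(in, out, n); }
void Process3(const float* in, float* out, std::size_t n) { ProcessOneBlock(in, out, n); }
```

In standard C++ `inline` is only a request, so whether four copies actually appear depends on the compiler and optimization level. The trade is larger code in exchange for removing the call and return overhead.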
Speed improvement
Moving away from the software loop and using dm and pm memories caused a change from 8 cycles / pt to 2 cycles for two points processed in SIMD (4 CALLS * 7 CYCLES SAVED * N POINTS PROCESSED)
Moving to IN_LINE causes a change of around 120 cycles for each subroutine call (4 CALLS * 120 CYCLES SAVED)
N = 128 -- (4 * 1800 to 4 * 120)
480 MHz processor -- 15 us to 1 us
LESSON LEARNT – SPEND YOUR TIME OPTIMIZING THE LOOPS – REST IS SMALLER AND GETS SMALLER WITH LARGER N
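The conversion from the quoted cycle counts to microseconds is simple arithmetic, sketched below. The helper names are hypothetical. The inputs (4 calls, 1800 and 120 cycles, 480 MHz) are the figures quoted above.

```cpp
#include <cassert>

// Cycles spent across `calls` subroutine calls costing `cyclesPerCall` each.
double totalCycles(int calls, double cyclesPerCall) {
    return calls * cyclesPerCall;
}

// Converts a cycle count to microseconds.
// 1 MHz is one cycle per microsecond, so microseconds = cycles / MHz.
double cyclesToMicroseconds(double cycles, double clockMHz) {
    assert(clockMHz > 0.0);
    return cycles / clockMHz;
}
```

`cyclesToMicroseconds(totalCycles(4, 1800.0), 480.0)` gives 15 and `cyclesToMicroseconds(totalCycles(4, 120.0), 480.0)` gives 1, matching the 15 us to 1 us figures.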
Other improvements depend on code characteristics / specifics
Profile guided optimization
Memory alignment can be important
After the first char fetch, the system can switch to moving 8 chars at a time in SIMD
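Why alignment matters for a character copy can be sketched with a routine that moves single chars until the destination reaches an 8-byte boundary and then moves 8 chars at a time. This is a host-side illustration under assumptions: the function name `copyChars` is hypothetical, `std::memcpy` of 8 bytes stands in for one wide move, and real hardware would also care about the alignment of the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies n chars and returns how many 8-char moves were
// used. Single chars are moved until `dst` reaches an 8-byte boundary, then
// 8 chars at a time, then whatever tail is left over.
std::size_t copyChars(const char* src, char* dst, std::size_t n) {
    assert(src != nullptr && dst != nullptr);
    std::size_t i = 0;
    std::size_t wideMoves = 0;

    // Head: one char at a time until the destination is 8-byte aligned.
    while (i < n && reinterpret_cast<std::uintptr_t>(dst + i) % 8 != 0) {
        dst[i] = src[i];
        ++i;
    }
    // Body: 8 chars per move.
    for (; i + 8 <= n; i += 8) {
        std::memcpy(dst + i, src + i, 8);
        ++wideMoves;
    }
    // Tail: remaining chars.
    for (; i < n; ++i) dst[i] = src[i];
    return wideMoves;
}
```

With an 8-byte aligned destination, 16 chars need two wide moves and no single-char moves. Starting one char past the boundary, the routine first has to spend single-char moves to reach alignment.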
Conditional code (manual PGO)
Correct ways to process loops
#pragma all_aligned
#pragma loop_unroll N
#pragma SIMD_for
#pragma align num
#pragma alignment_region( and #pragma alignment_region_end