Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow...
-
Upload
edwin-scott -
Category
Documents
-
view
223 -
download
4
Transcript of Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow...
![Page 1: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/1.jpg)
Parallelizing Audio Feature Extraction Using an Automatically-Partitioned
Streaming Dataflow Language
Eric BattenbergMark Murphy
CS 267, Spring 2008
![Page 2: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/2.jpg)
Introduction
• World is going Multicore– Sequential performance has hit the “Three Walls”
• Power Wall + Memory Wall + ILP Wall = Brick Wall [Patterson]
• Software development is about to get really hard– For performance, must utilize parallel processors– Multiple, non-portable granularities:
• SIMD vs Multi-Core vs Multi-Socket vs Local-Stores vs GPU ...
![Page 3: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/3.jpg)
The StreamIt Language
![Page 4: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/4.jpg)
The StreamIt Compiler Strategy
• 1. Fuse Stateless filters – Eliminates communication and buffer copies
• 2. Data-Parallelize– Allow for concurrent execution of future work
• 3. Pipeline– Optimal partitioning of the work
Coarsen Granularity
Data Parallelize
Software Pipeline
Graphic from:Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs.Michael Gordon.ASPLOS 2006, San Jose, CA, October, 2006.
![Page 5: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/5.jpg)
Audio Feature Extraction• Frequency-warped spectrum
– Human ear has better spectral resolution at lower frequencies
– To get necessary spectral resolution need FFT of length ~1024 (23ms)
– With frequency warping only need length 32(~1ms)
• Allows finer time resolution – Improves analysis of rhythmic
patterns and fast transients.
0 0.5 1 1.5 2 2.5
x 104
0
0.2
0.4
0.6
0.8
1
frequency [Hz}
mag
nitu
de r
espo
nse
16-point DFT
0 0.5 1 1.5 2 2.5
x 104
0
0.2
0.4
0.6
0.8
1
frequency [Hz}
mag
nitu
de r
espo
nse
16-point Warped Delay Line DFT
![Page 6: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/6.jpg)
Audio Feature Extraction• Frequency-warped spectrum
– Human ear has better spectral resolution at lower frequencies
– To get necessary spectral resolution need FFT of length ~1024 (23ms)
– With frequency warping only need length 32(~1ms)
• Allows finer time resolution – Improves analysis of rhythmic
patterns and fast transients.
![Page 7: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/7.jpg)
Stream Graph
• Produced by StreamIt compiler.
• Stream graph for a 4 point warped FFT.
• For an application could be up to 128 point.
• Fine-grained version of the algorithm.
![Page 8: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/8.jpg)
Interesting Architectures: Intel Clovertown
• r41 in RAD cluster• Dual-Socket X Quad Core, 2.66 GHz 4-wide
pipelines
![Page 9: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/9.jpg)
Interesting Architectures: AMD Opteron
• r28 in RAD cluster• Dual Socket X Dual Core, 2.2 GHz, 3-wide
pipelines
![Page 10: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/10.jpg)
Interesting Architectures: UltraSparc T2 (Niagara2)
• r45 in RAD Cluster• 1 socket X 8-core X 8-threads, 1.4 GHz
– two-wide in-order pipelines, 1 FPU per core
![Page 11: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/11.jpg)
StreamIt Scaling Performance
• StreamIt compiler thinks its doing a good job.• It isn't.
![Page 12: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/12.jpg)
Portability Problems
• StreamIt code base in a primitive state– Unable to produce working code for Niagara– Frequently buggy even on x86 Multicores
• Bugs ranged from “simple” to “subtle”– Simple: Check return of fopen() for NULL– Subtle: Die nondeterministically with cryptic error
message• Lack of root access
– Made meeting compiler dependencies difficult
![Page 13: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/13.jpg)
Dual-Core Performance• As a sanity check, ran on Core 2 Duo
T9300 (Penryn)– Eric’s Laptop, root access, (w00t)
• Coarse-grained implementation designed for easy parallelization– Shows speedup with –O1 flag but
is actually faster with –O0• Fine-grained version shows definite
slowdown.– But performs much faster than
coarse-grained.– Compilation failed at 64 and 128
(out of memory)
![Page 14: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/14.jpg)
Dual-Core Performance
• Best runs of each implementation compared– Coarse: 2 Cores (-O0)– Fine: 1 Core (-O1)– Control: Hand-optimized Matlab implementation, 2 threads
• Single processor StreamIt performance very acceptable.– Considering the productivity
of the StreamIt language and the performance of the automatically-optimized
• Why can’t the multi-core code work this well yet on x86 architectures?
― No, seriously, I want it now. ―
![Page 15: Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.](https://reader035.fdocuments.in/reader035/viewer/2022081513/56649cec5503460f949b8c6f/html5/thumbnails/15.jpg)
Conclusions
• High-Level, Domain-Specific programming environments– Productivity win: we didn't have to write streaming
communication library• Compile-time transformations, lowering to C++
– Efficiency win: remove run-time overhead of High-Level language
• StreamIt optimizations incomplete– Agnostic of cache-coherence/communication
topologies