MUDA: MUltiple Data Accelerator language
Project Overview
Feb 24, 2008
Syoyo FUJITA
[Chart: Nikkei 225 index, Feb 2007 - Feb 2008]
+800%: Mac Pro octa, 204 Gflops
Geforce 9800 GX2 (rumor): 1 TFlops? (3x of G80) / 500 GFlops? (+50% of G80)
No update! PS3: 179.2 Gflops
GPU slumps, CPU soars

Future of GPU trend
[Chart: Nikkei 225 index]
Subprime shock! Credit boom ends! US economy declines! Green IT!
CPU GPU
GPGPU
Accelerated computing
many-core
NO!
GPGPU is already dead!! GPU will be dead soon!!
Why GPU -> GPGPU is BAD
• Larger latency: host <-> GPU transfers over PCI-Express
• Internal architecture is a black box
• Only the GPU maker knows it
• Larger cost of branching
• Debugger?
• A program runs only on a specific GPU maker's GPUs
• Not portable.
Why CPU -> Accelerated computing is GOOD
• Easy to program
• CPU makers provide good internal spec documentation
• Fast execution of branching
• gdb :-)
• Portable & Versatile
CPU
Accelerated computing
many-core
MUDA
MUDA’s goal
• Draw out the CPU’s maximum floating-point performance for large data
• SIMD
• Cache optimized computation
MUDA example
vec sqrtmu(vec x)
{
    vec y0, y0x, y0xhalf;
    vec oneish = bit(0x3f800001);

    y0      = rsqrt(x);
    y0x     = y0 * x;
    y0xhalf = 0.5 * y0x;

    return ((oneish - y0 * y0x) * y0xhalf + y0x);
}
MUDA code
__m128 sqrtmu(const __m128 *x)
{
    __m128 y0;
    __m128 y0x;
    __m128 y0xhalf;
    const __m128 t_vec4 = (__m128)_mm_set1_epi32(1065353217);
    __m128 oneish = t_vec4;
    const __m128 t_vec6 = (*x);
    const __m128 t_vec5 = _mm_rsqrt_ps(t_vec6);
    y0 = t_vec5;
    const __m128 t_vec8 = y0;
    const __m128 t_vec9 = (*x);
    const __m128 t_vec7 = _mm_mul_ps(t_vec8, t_vec9);
    y0x = t_vec7;
    const float t_float13 = 0.5;
    const float t_float12 = t_float13;
    const __m128 t_vec10 = _mm_set_ps1(t_float12);
    const __m128 t_vec14 = y0x;
    const __m128 t_vec11 = _mm_mul_ps(t_vec10, t_vec14);
    y0xhalf = t_vec11;
    const __m128 t_vec19 = oneish;
    const __m128 t_vec20 = y0;
    const __m128 t_vec21 = y0x;
    const __m128 t_vec15 = _mm_mul_ps(t_vec20, t_vec21);
    const __m128 t_vec16 = _mm_sub_ps(t_vec19, t_vec15);
    const __m128 t_vec22 = y0xhalf;
    const __m128 t_vec17 = _mm_mul_ps(t_vec16, t_vec22);
    const __m128 t_vec23 = y0x;
    const __m128 t_vec18 = _mm_add_ps(t_vec17, t_vec23);
    return t_vec18;
}
x86/SSE output
Why MUDA?
No unified way to describe SIMD ops
• SSE: _mm_add_ps()
• AltiVec: vec_add
• SPE: spu_add
CPU ISA changes frequently
• SSE2(2000), SSE3(2004), SSE4(2006)
• SSE5 and coming new CPU designs(?)
• 8-element SIMD? No SIMD at all in future CPUs?
• Keeping up with them is hard and not productive: a waste of your time.
MUDA (portable, CPU-independent description)
  --> MUDA compiler -->
      SSE2 C code / SSE4 C code / VMX C code / LLVM IR
      (CPU- or arch-dependent code)
Status
• SSE2 backend : 75 %
• SSE4 backend : 0 %
• VMX backend : 20 %
• LLVM IR backend : 30 %
• SIMD math function for MUDA : 5 %
• Automatic optimizer : TODO
= what I’m currently working on
Future direction
• Cache miss analysis and memory access optimization
• Valgrind, Cache Miss Equations (CME)
• Automatic optimization
• As FFTW, ATLAS, and Spiral do
• Automatic error measurement for floating point computation
• Interval Arithmetic, Affine Arithmetic, Gappa
Performance gap

[Bar chart, 0-100 scale, SIMD vs. memory (taller = better):
 Scalar : SIMD = 1 : 4
 cache miss : cache hit = 1 : 100]
Optimizing memory access is much more important than SIMDization