Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in...

17

Transcript of Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in...

Page 2: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created

Page 3: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created

1 11

111

Page 4: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created

1 11

111

#define T doublevoid add(T* x, T* y, T* z, int N) {

for(int i = 0; i < N; ++i) {T x1, y1, z1; x1 = x[i];y1 = y[i];z1 = x1 + y1;z[i] = z1;

}}

#define T doublevoid add(T* x, T* y, T* z, int N) {for(int i = 0; i < N; i += 4) {

__m256d x1, y1, z1;x1 = _mm256_loadu_pd(x + i);y1 = _mm256_loadu_pd(y + i);z1 = _mm256_add_pd(x1, y1);_mm256_storeu_pd(z + i, z1);

}}

#define T doublevoid add(T* x, T* y, T* z, int N) {

for(int i = 0; i < N; ++i) {T x1, y1, z1; x1 = x[i];y1 = y[i];z1 = x1 + y1;z[i] = z1;

}}

#define T doublevoid add(T* x, T* y, T* z, int N) {for(int i = 0; i < N; i += 4) {

__m256d x1, y1, z1;x1 = _mm256_loadu_pd(x + i);y1 = _mm256_loadu_pd(y + i);z1 = _mm256_add_pd(x1, y1);_mm256_storeu_pd(z + i, z1);

}}

LBB0_3:movsd (%rdi,%rax,8), %xmm0addsd (%rsi,%rax,8), %xmm0movsd %xmm0, (%rdx,%rax,8)incq %raxcmpl %eax, %r9djne LBB0_3

LBB0_3:vmovupd (%rdi,%r10,8), %ymm0vaddpd (%rsi,%r10,8), %ymm0, %ymm0vmovupd %ymm0, (%rax) addq $4, %r10addq $32, %raxaddq $1, %rcxjne LBB0_3

Page 5: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created

That’s not all

Page 6: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created
Page 7: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created
Page 8: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created

Page 9: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created

def add (a: Float, b: Float

): Float = {a + b

}

def add (a: Rep[Float], b: Rep[Float]

): Rep[Float] = { a + b

}

add (2,3)

add (2,3) Plus(2, 3)

5

2 3

def add (a: Rep[Float], b: Rep[Float]

): Rep[Float] = { a + b

}

float add (float a, float b

) = { return a - b;

}

def add (a: Rep[Float], b: Rep[Float]

): Rep[Float] = { a + b * (-1)

}

a

b-1

a b

Page 10: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created

data-3.3.16.xml

Parse XML intrinsics specification

Generate ISA specific DSL in LMS

intrinsics.jar: SSE.scala / AVX.scala / …

Split each ISA specific DSL into subclasses

1case class Def[T]

2Def[T] → Exp[T]

3Exp[T] → Def[T]

4Def[T] → String

Definition

SSA

Mirroring

Un-parsing

Infer intrinsic mutability

Ifm

uta

ble

, han

dle

LM

S effect

s

Parse XML &

Generate eDSLs

in LMS

Start developing SIMD code

1Implement a native function placeholder

2Mixin one or several ISA-specific eDSLs

3Implement the SIMD staged function

4Call compile to generate native code

Start the Java Virtual Machine (JVM)

Infer available ISAs and compiler flags

LMS: remove abstraction & generate C code

Inspect the system through CPUID

Detect available C/C++ compilers

Co

mpile

tim

e (

deve

loper)

Runti

me

(LM

S +

com

pile

rpip

elin

e)

Compile and link the code to the JVM

Use high performance SIMD code

Use SIMD eDSLs:

AVX.scala / SSE.scala

Generate, Compile &

Link staged Code in

the

JVM Runtime

Page 11: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created

<intrinsic tech='AVX' rettype='__m256d'name='__mm256_add_pd'><type>Floating Point</type><CPUID>AVX</CPUID><category>Arithmetic<category><parameter varname=’a' type='__m256d’ /><parameter varname=’b' type='__m256d’ /><instruction name='vaddpd' form='ymm, ymm, ymm’ /><header>immintrin.h</header>

</intrinsic>

data-3.3.16.xml

Parse XML intrinsics specification

Generate ISA specific DSL in LMS

intrinsics.jar: SSE.scala / AVX.scala / …

Split each ISA specific DSL into subclasses

1case class Def[T]

2Def[T] → Exp[T]

3Exp[T] → Def[T]

4Def[T] → String

Definition

SSA

Mirroring

Un-parsing

Infer intrinsic mutability

Ifm

uta

ble

, han

dle

LM

S effect

s

Parse XML &

Generate eDSLs

in LMS

case class MM256_ADD_PS (a: Exp[__m256], b: Exp[__m256]) extends IntrinsicDef[__m256] { val category = List(IntrinsicsCategory.Arithmetic) val intrinsicType = List(IntrinsicsType.FloatingPoint) val performance = Map.empty[MicroArchType, Performance] val header = "immintrin.h"

}

def _mm256_add_ps(a: Exp[__m256], b: Exp[__m256]): Exp[__m256] = { MM256_ADD_PS(a, b)

}

override def mirror[A:Typ](e: Def[A], f: Transformer)(implicit pos: SourceContext): Exp[A] = (e match { case MM256_ADD_PS (a, b) => _mm256_add_ps(f(a), f(b)) // Pattern match against all other nodescase _ => super.emitNode(sym, rhs)

}

override def emitNode(sym: Sym[Any], rhs: Def[Any]) = rhs match { case iDef@MM256_ADD_PS(a, b) => headers += iDef.headeremitValDef(sym, s"_mm256_add_ps(${quote(a)}, ${quote(b)})")

// Pattern match against all other nodescase _ => super.emitNode(sym, rhs)

}

data-3.3.16.xml

Parse XML intrinsics specification

Generate ISA specific DSL in LMS

intrinsics.jar: SSE.scala / AVX.scala / …

Split each ISA specific DSL into subclasses

1case class Def[T]

2Def[T] → Exp[T]

3Exp[T] → Def[T]

4Def[T] → String

Definition

SSA

Mirroring

Un-parsing

Infer intrinsic mutability

Ifm

uta

ble

, han

dle

LM

S effect

s

Page 12: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created

<intrinsic tech='AVX' rettype='__m256d'name='_mm256_loadu_pd'><type>Floating Point</type><CPUID>AVX</CPUID>

<category>Load</category>

<parametervarname='mem_addr' type='double const *’

/><instruction name='vmovupd' form='ymm, m256’ /><header>immintrin.h</header>

</intrinsic>

Challenge #1

Challenge #2

mirror

Page 13: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created

Challenge #3

Challenge #4

Challenge #5

https://github.com/ivtoskov/lms-

intrinsics

Page 14: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created

Start developing SIMD code

1Implement a native function placeholder

2Mixin one or several ISA-specific eDSLs

3Implement the SIMD staged function

4Call compile to generate native code

Start the Java Virtual Machine (JVM)

Infer available ISAs and compiler flags

LMS: remove abstraction & generate C code

Inspect the system through CPUID

Detect available C/C++ compilers

Co

mpile

tim

e (

deve

loper)

Runti

me

(LM

S +

com

pile

rpip

elin

e)

Compile and link the code to the JVM

Use high performance SIMD code

class NSaxpy { // Step 1: Placeholder for the SAXPY native function@native def apply(a: Array[Float], b: Array[Float], scalar: Float, n: Int

): Unit

// Step 2: DSL instance of the intrinsicsval cIR = new AVX2; import cIR._

// Step 3: Staged SAXPY function using AVX + FMA def saxpy_staged( a_imm : Rep[Array[Float]], b : Rep[Array[Float]], scalar : Rep[Float], n : Rep[Int]

): Rep[Unit] = { import ImplicitLift._ // make array `a` mutable val a_sym = a_imm.asInstanceOf[Sym[Array[Float]]] val a = reflectMutableSym(a_sym) // start with the computation val n0 = (n >> 3) << 3 val vec_s = _mm256_set1_ps(scalar) forloop(0, n0, fresh[Int], 8, (i : Rep[Int]) => { val vec_a = _mm256_loadu_ps(a, i) val vec_b = _mm256_loadu_ps(b, i) val res = _mm256_fmadd_ps(vec_b, vec_s, vec_a) _mm256_storeu_ps(a, res, i)

}) forloop(n0, n, fresh[Int], 1, (i : Rep[Int]) => { a(i) = a(i) + b(i) * scalar

})}

// Step 4: generate the saxpy function, // compile it and link it to the JVM compile(saxpy_staged _, this, nameOf(apply _))

}

Use SIMD eDSLs:

AVX.scala / SSE.scala

Generate, Compile &

Link staged Code in

the

JVM Runtime

Page 15: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created

// Perform Matrix-Matrix-Multiplicationdef staged_mmm_blocked ( a : Rep[Array[Float]], b : Rep[Array[Float]], c_imm : Rep[Array[Float]], n : Rep[Int] // assume n == 8k

): Rep[Unit] = { val c_sym = c_imm.asInstanceOf[Sym[Array[Float]]]val c = reflectMutableSym(c_sym) forloop(0, n, fresh[Int], 8, (kk: Exp[Int]) => { forloop(0, n, fresh[Int], 8, (jj: Exp[Int]) => { // Load the block of matrix B and transpose it val blockB = transpose((0 to 7).map { i => _mm256_loadu_ps(b, (kk + i) * n + jj)

}) // Multiply all the vectors of a of the // corresponding block column with the running // block and store the result in matrix C forloop(0, n, fresh[Int], 1, (i: Exp[Int]) => { val rowA = _mm256_loadu_ps(a, i * n + kk) val mulAB = transpose( blockB.map(_mm256_mul_ps(rowA, _))

)def f(l: Seq[Exp[__m256]]): Exp[__m256] = l.size match { case 1 => l.headcase s => val lhs = f(l.take(s/2)) val rhs = f(l.drop(s/2)) _mm256_add_ps(lhs, rhs)

} val rowC = _mm256_loadu_ps(c, i * n + jj) val accC = _mm256_add_ps(f(mulAB), rowC) _mm256_storeu_ps(c, accC, i * n + jj)

})})

})}

Page 16: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created
Page 17: Abstracting Vector Architectures in Library Generators ... · Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters Author: Alen Stojanov Created

https://github.com/astojanov/NGen

Start developing SIMD code

1Implement a native function placeholder

2Mixin one or several ISA-specific eDSLs

3Implement the SIMD staged function

4Call compile to generate native code

Start the Java Virtual Machine (JVM)

Infer available ISAs and compiler flags

LMS: remove abstraction & generate C code

Inspect the system through CPUID

Detect available C/C++ compilers

Co

mpile

tim

e (

deve

loper)

Runti

me

(LM

S +

com

pile

rpip

elin

e)

Compile and link the code to the JVM

Use high performance SIMD code

data-3.3.16.xml

Parse XML intrinsics specification

Generate ISA specific DSL in LMS

intrinsics.jar: SSE.scala / AVX.scala / …

Split each ISA specific DSL into subclasses

1case class Def[T]

2Def[T] → Exp[T]

3Exp[T] → Def[T]

4Def[T] → String

Definition

SSA

Mirroring

Un-parsing

Infer intrinsic mutability

Ifm

uta

ble

, han

dle

LM

S effect

s