EECS 583 – Class 22 Research Topic 4: Automatic SIMDization - Superword Level Parallelism


University of Michigan

December 10, 2012


Announcements

Last class today!
  » No more reading

Dec 12-18 – Project presentations
  » Each group sign up for a 30-minute slot
  » See me after class if you have not signed up

Course evaluations reminder
  » Please fill one out; it will only take 5 minutes
  » I do read them
  » They improve the experience for future 583 students


Notes on Project Demos

Demo format
  » Each group gets 30 minutes
  » Strict deadlines enforced because there are many back-to-back groups; don't be late!
  » Figure out your room number ahead of time (see schedule on my door)
  » Plan for 20 mins of presentation (no more!), 10 mins of questions
  » Some slides are helpful; try to have all group members say something
  » Talk about what you did (basic idea, previous work), how you did it (approach + implementation), and results
  » Demos or real code examples are good

Report
  » 5 pg double spaced including figures – what you did + why, implementation, and results
  » Due either when you do your demo or Dec 18 at 6pm

SIMD Processors: Larrabee (now called Knights Corner) Block Diagram
[figure not included in transcript]

Vector Unit Block Diagram
[figure not included in transcript]

Processor Core Block Diagram
[figure not included in transcript]

Larrabee vs Conventional GPUs

Each Larrabee core is a complete Intel processor
  » Context switching & pre-emptive multi-tasking
  » Virtual memory and page swapping, even in texture logic
  » Fully coherent caches at all levels of the hierarchy

Efficient inter-block communication
  » Ring bus for full inter-processor communication
  » Low-latency, high-bandwidth L1 and L2 caches
  » Fast synchronization between cores and caches

Larrabee: the programmability of IA with the parallelism of graphics processors

Exploiting Superword Level Parallelism with Multimedia Instruction Sets

Multimedia Extensions

• Additions to all major ISAs
• SIMD operations

Instruction Set   Architecture   SIMD Width   Floating Point
AltiVec           PowerPC        128          yes
MMX/SSE           Intel          64/128       yes
3DNow!            AMD            64           yes
VIS               Sun            64           no
MAX2              HP             64           no
MVI               Alpha          64           no
MDMX              MIPS V         64           yes

Using Multimedia Extensions

• Library calls and inline assembly
  – Difficult to program
  – Not portable

• Different extensions to the same ISA
  – MMX and SSE
  – SSE vs. 3DNow!

• Need automatic compilation

Vector Compilation

• Pros:
  – Successful for vector computers
  – Large body of research

• Cons:
  – Involved transformations
  – Targets loop nests

Superword Level Parallelism (SLP)

• Small amount of parallelism
  – Typically 2- to 8-way

• Exists within basic blocks

• Uncovered with a simple analysis

• Independent isomorphic operations
  – A new paradigm

1. Independent ALU Ops

R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835

Packed form:

[R G B] = [R G B] + [XR XG XB] * [1.08327 1.89234 1.29835]

2. Adjacent Memory References

R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]

Packed form:

[R G B] = [R G B] + X[i:i+2]

3. Vectorizable Loops

for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0]

Unrolled by 4:

for (i=0; i<100; i+=4)
  A[i+0] = A[i+0] + B[i+0]
  A[i+1] = A[i+1] + B[i+1]
  A[i+2] = A[i+2] + B[i+2]
  A[i+3] = A[i+3] + B[i+3]

Vectorized:

for (i=0; i<100; i+=4)
  A[i:i+3] = A[i:i+3] + B[i:i+3]

4. Partially Vectorizable Loops

for (i=0; i<16; i+=1)
  L = A[i+0] - B[i+0]
  D = D + abs(L)

Unrolled by 2:

for (i=0; i<16; i+=2)
  L = A[i+0] - B[i+0]
  D = D + abs(L)
  L = A[i+1] - B[i+1]
  D = D + abs(L)

Partially vectorized:

for (i=0; i<16; i+=2)
  [L0 L1] = A[i:i+1] - B[i:i+1]
  D = D + abs(L0)
  D = D + abs(L1)

Exploiting SLP with SIMD Execution

• Benefit:
  – Multiple ALU ops → one SIMD op
  – Multiple ld/st ops → one wide memory op

• Cost:
  – Packing and unpacking
  – Reshuffling within a register

Packing/Unpacking Costs

C = A + 2
D = B + 3

As a packed op:

[C D] = [A B] + [2 3]

• Packing source operands: if A and B are produced by scalar code,

A = f()
B = g()
C = A + 2
D = B + 3

then A and B must first be packed into [A B] before the SIMD add.

• Unpacking destination operands: if C and D are later consumed by scalar code,

A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7

then [C D] must also be unpacked back into scalars.

Optimizing Program Performance

• To achieve the best speedup:
  – Maximize parallelization
  – Minimize packing/unpacking

• Many packing possibilities
  – Worst case: n ops → n! configurations
  – Different cost/benefit for each choice

Observation 1: Packing Costs Can Be Amortized

• Use packed result operands:

A = B + C
D = E + F

G = A - H
I = D - J

The packed pair [A D] produced by the first two statements feeds the second two directly, so no repacking is needed.

• Share packed source operands:

A = B + C
D = E + F

G = B + H
I = E + J

Here [B E] is packed once and reused by both packed statements.

Observation 2: Adjacent Memory is Key

• Large potential performance gains
  – Eliminate ld/st instructions
  – Reduce memory bandwidth

• Few packing possibilities
  – Only one ordering exploits pre-packing

SLP Extraction Algorithm

• Identify adjacent memory references

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

A and B load adjacent elements of X, so they pack:

[A B] = X[i:i+1]

• Follow def-use chains: H and J consume A and B with isomorphic subtracts, so the pair (H, J) packs too:

[H J] = [C D] - [A B]

• Follow use-def chains: packing (H, J) demands the source pair (C, D), whose definitions are isomorphic multiplies:

[C D] = [E F] * [3 5]

SLP Availability

[Chart: percent of dynamic SUIF instructions eliminated (y-axis 0-100) for SPEC95fp benchmarks (swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, fpppp) and multimedia kernels (FIR, IIR, VMM, MMM, YUV), at 128-bit and 1024-bit SIMD widths]

SLP vs. Vector Parallelism

[Chart: fraction of parallelism captured by SLP vs. vector compilation (y-axis 0-100%) across SPEC95fp benchmarks (swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, fpppp)]

Conclusions

• Multimedia architectures abundant
  – Need automatic compilation

• SLP is the right paradigm
  – 20% non-vectorizable in SPEC95fp

• SLP extraction successful
  – Simple, local analysis
  – Provides speedups of 1.24–6.70

• Found SLP in general-purpose codes