ESE 345 Computer Architecture
Data Level Parallelism (DLP) and Single-Instruction Multiple Data (SIMD) Computing
Computer Architecture
CA: DLP and SIMD
SIMD Architectures
Data-Level Parallelism (DLP): Executing one operation on multiple data streams
Example: Multiplying a coefficient vector by a data vector (e.g. in filtering)
y[i] := c[i] × x[i], 0 ≤ i < n
Sources of performance improvement:
One instruction is fetched & decoded for the entire operation
Multiplications are known to be independent
Pipelining/concurrency in memory access as well
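The coefficient-by-data multiply above can be sketched in C with SSE intrinsics. This is a minimal illustration, not code from the slides: the function names vec_mul_scalar and vec_mul_sse are hypothetical, and the sketch assumes an x86 target with SSE and makes no alignment assumptions (hence the unaligned loads and stores).

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Scalar reference: one multiply per loop iteration. */
void vec_mul_scalar(float *y, const float *c, const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = c[i] * x[i];
}

/* SIMD version: one multiply instruction covers four floats. */
void vec_mul_sse(float *y, const float *c, const float *x, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vc = _mm_loadu_ps(c + i);   /* unaligned load of c[i..i+3] */
        __m128 vx = _mm_loadu_ps(x + i);   /* unaligned load of x[i..i+3] */
        _mm_storeu_ps(y + i, _mm_mul_ps(vc, vx));
    }
    for (; i < n; i++)                     /* leftover elements when n is not a multiple of 4 */
        y[i] = c[i] * x[i];
}
```

Because the four products in each group are independent, the SIMD loop fetches and decodes roughly a quarter as many multiply instructions as the scalar loop.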
Intel’s SIMD Instructions
X86 SIMD Processing Extensions (1)
MMX – Intel Pentium MMX 1997
8 64-bit integer registers aliased with x87 FP regs
3DNow! – AMD K6-2 1998
Similar to MMX
extended with 21 SP FP ops
SSE (Streaming SIMD Extensions) – Intel Pentium III 1999
Single-precision floating-point instructions
8 new 128 bit XMM registers
SSE-2 – Intel Pentium 4 2000
Double-precision floating-point ops
X86 SIMD Processing Extensions (2)
SSE3 – 2004, SSE4 – 2007
AVX (Advanced Vector Extensions) – proposed by Intel in 2008
Intel Sandy Bridge processor – 2011
16 256-bit registers
AVX2 – Intel Haswell processor 2013
Most integer SSE ops extended to 256 bits
Fused multiply-add (FMA3)
AVX-512 Intel Knights Landing processor 2016
32 512-bit regs
4 operand instructions
Example: SIMD Array Processing
SSE Instruction Categories for Multimedia Support
SSE-2+ supports wider data types to allow 16 × 8-bit and 8 × 16-bit operands
Intel Architecture SSE2+ 128-Bit SIMD Data Types
XMM Registers
SSE/SSE2 Floating Point Instructions
SSE/SSE2 Floating Point Instructions
Example: Add Single Precision FP Vectors
Packed and Scalar Double-Precision Floating-Point Operations
Example: Image Converter (1/5)
Example: Image Converter (2/5)
Example: Image Converter (3/5)
Example: Image Converter (4/5)
Example: Image Converter (5/5)
Intel SSE Intrinsics
Sample of SSE Intrinsics
Example: 2 × 2 Matrix Multiply
Example: 2 × 2 Matrix Multiply
Example: 2 × 2 Matrix Multiply
Example: 2 × 2 Matrix Multiply
Example: 2 × 2 Matrix Multiply
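The slide figures for this example did not survive extraction, so the following is only a sketch of how a 2 × 2 double-precision multiply can be written with SSE2 intrinsics, not a reproduction of the slide code. It assumes column-major storage and an x86 target with SSE2; the function name matmul2x2_sse2 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_loadu_pd, _mm_set1_pd, _mm_mul_pd, _mm_add_pd */

/* C = A * B for 2x2 matrices of doubles stored column-major:
   M[0] = m00, M[1] = m10, M[2] = m01, M[3] = m11. */
void matmul2x2_sse2(double *C, const double *A, const double *B)
{
    __m128d c0 = _mm_setzero_pd();              /* accumulates column 0 of C */
    __m128d c1 = _mm_setzero_pd();              /* accumulates column 1 of C */

    for (int k = 0; k < 2; k++) {
        __m128d a  = _mm_loadu_pd(A + 2 * k);   /* column k of A: (a0k, a1k) */
        __m128d b0 = _mm_set1_pd(B[k]);         /* element b_k0 copied into both lanes */
        __m128d b1 = _mm_set1_pd(B[k + 2]);     /* element b_k1 copied into both lanes */
        c0 = _mm_add_pd(c0, _mm_mul_pd(a, b0)); /* two multiply-adds per instruction pair */
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    }
    _mm_storeu_pd(C, c0);
    _mm_storeu_pd(C + 2, c1);
}
```

Each 128-bit XMM register holds one column of two doubles, so every packed multiply and add updates a whole column of the result at once.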
Inner loop from gcc -O -S
Performance-Driven ISA Extensions
Architectures for Data Parallelism
The Current Landscape: Chips that deliver TeraOps/s in 2014, and how they differ.
E5-26xx v2: Stretching the Xeon server approach for compute-intensive apps.
GK110: nVidia’s Kepler GPU, customized for compute applications.
SONY/IBM PS3 Cell processor in 2006.
GM107: nVidia’s Maxwell GPU, customized for energy-efficiency.
Sony/IBM Playstation PS3 Cell Chip - Released 2006
8 SPEs,
3.2 GHz clock,
200 GigaOps/s (peak)
Sony PS3 Cell Processor SPE Floating-Point Unit
4 single-precision multiply-adds issue in lockstep (SIMD) per cycle.
6-cycle latency
Single-Instruction Multiple-Data
3.2 GHz clock --> 25.6 SP GFLOPS
Intel Xeon Ivy Bridge E5-2697v2 (2013)
12-core Xeon Ivy Bridge
0.52 TeraOps/s (130W)
12 cores @ 2.7 GHz
Each core can issue 16 single-precision operations per cycle.
$2,600 per chip
Haswell: 1.04 TeraOps/s
Intel E5-2697v2 vs. Haswell
12 cores @ 2.7 GHz
Each core can issue 16 single-precision operations per cycle.
Haswell cores issue 32 SP FLOPS/cycle.
How?
Advanced Vector Extension (AVX) unit
Relative area has increased in Haswell
Smaller than L3 cache, but larger than L2 cache.
Die closeup of one Sandy Bridge core
AVX: Not Just Single-precision Floating-point
256-bit version -> double-precision vectors of length 4
AVX instruction variants interpret 128-bit registers as 4 floats, 2 doubles, 16 8-bit integers, etc.
Sandy Bridge, Haswell (2014)
Sandy Bridge extends register set to 256 bits: vectors are twice the size.
x86-64 AVX/AVX2 has 16 registers (IA-32: 8)
Haswell adds 3-operand fused multiply-add (FMA) instructions: a*b + c
2 EX units with FMA --> 2X increase in ops/cycle
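The 16 versus 32 operations-per-cycle figures can be checked with a little arithmetic, sketched here in C. The helper name sp_flops_per_cycle is hypothetical. The sketch assumes that an Ivy Bridge core issues one 256-bit add and one 256-bit multiply per cycle (the slides state only the total of 16), and that a fused multiply-add counts as two operations per lane.

```c
/* Peak single-precision FLOPs per cycle for one core.
   units:          vector execution units that can issue in the same cycle
   vector_bits:    register width in bits (256 for AVX)
   flops_per_lane: 1 for a separate add or multiply, 2 for a fused multiply-add */
int sp_flops_per_cycle(int units, int vector_bits, int flops_per_lane)
{
    int lanes = vector_bits / 32;   /* 32-bit single-precision elements per register */
    return units * lanes * flops_per_lane;
}
```

Under these assumptions sp_flops_per_cycle(2, 256, 1) gives 16 for Ivy Bridge and sp_flops_per_cycle(2, 256, 2) gives 32 for Haswell: the vector width and the number of units are unchanged, and the doubling comes from each unit performing a multiply and an add per lane.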
Out of Order Issue in Haswell (2014)
Haswell sustains 4 micro-op issues per cycle.
One possibility:
2 for AVX, and
2 for Loads, Stores and book-keeping.
Haswell has two copies of the FMA engine, on separate ports.
Graphical Processing Units
Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
Basic idea:
Heterogeneous execution model
CPU is the host, GPU is the device
Develop a C-like programming language for GPU
Unify all forms of GPU parallelism as CUDA threads
Programming model is “Single Instruction Multiple Thread”
Kepler GK 110 nVidia GPU
5.12 TeraOps/s
2880 MACs @ 889 MHz
single-precision multiply-adds
$999 for a GTX Titan Black with 6GB GDDR5 (and 1 GPU)
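The peak throughput figures quoted for the Cell, Xeon, Haswell, and GK110 chips all follow the same product of unit count, operations per cycle, and clock rate. The C sketch below reproduces them. The function name peak_gops is hypothetical, and the sketch assumes a multiply-add counts as two operations, which is the convention that makes the quoted numbers consistent.

```c
/* Peak throughput in giga-operations per second.
   units:         cores, SPEs, or multiply-accumulate (MAC) units
   ops_per_cycle: operations each unit completes per cycle
   clock_ghz:     clock frequency in GHz */
double peak_gops(int units, int ops_per_cycle, double clock_ghz)
{
    return units * ops_per_cycle * clock_ghz;
}
```

With the numbers from the slides: one Cell SPE is peak_gops(1, 8, 3.2), about 25.6; eight SPEs give about 204.8 (quoted as 200 GigaOps/s); the 12-core Xeon at 16 operations per cycle and 2.7 GHz gives about 518.4 (0.52 TeraOps/s); the same core count and clock at 32 operations per cycle gives about 1036.8 (1.04 TeraOps/s); and 2880 MACs at 0.889 GHz give about 5120.6 (5.12 TeraOps/s).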
Applications
Multimedia processing
Video compression
Graphics
Image processing
Simulations
Engineering tools
CAD
Cryptography
Etc…
SIMD Summary
Intel SSE/AVX SIMD Instructions
One instruction fetch that operates on multiple operands simultaneously
512/256/128/64-bit multimedia registers (ZMM, YMM, XMM, and MMX)
Embed the SSE machine instructions directly into C programs through use of intrinsics
ESE345 Project: Pipelined Multimedia SIMD Unit Block Diagram
128-bit Multimedia ALU
Three 128-bit Inputs
One 128-bit Output
Packed Register Format
Word: Four 32-bit Fields
Halfword (HW): Eight 16-bit Fields
Bits:   127..96   95..64   63..32   31..0
Field:  Word 3    Word 2   Word 1   Word 0
Bits:   127..112  111..96  95..80  79..64  63..48  47..32  31..16  15..0
Field:  HW 7      HW 6     HW 5    HW 4    HW 3    HW 2    HW 1    HW 0
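The packed register layout above can be modeled in C as follows. This is a sketch under stated assumptions: the reg128 type and the get_word and get_halfword helpers are hypothetical names, and the 128-bit register is represented as four 32-bit words with w[0] holding bits 31..0.

```c
#include <stdint.h>

/* A 128-bit register modeled as four 32-bit words; w[0] holds bits 31..0. */
typedef struct { uint32_t w[4]; } reg128;

/* Word k (0..3) occupies bits 32k+31 .. 32k. */
uint32_t get_word(const reg128 *r, int k)
{
    return r->w[k];
}

/* Halfword k (0..7) occupies bits 16k+15 .. 16k. */
uint16_t get_halfword(const reg128 *r, int k)
{
    uint32_t word = r->w[k / 2];                 /* the word that contains halfword k */
    return (uint16_t)(word >> (16 * (k % 2)));   /* even k: low half, odd k: high half */
}
```

Halfwords 2k and 2k+1 are the low and high halves of word k, so the same 128 bits can be viewed as either four or eight fields.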
Example: Add instruction
Each instruction operates on every field of the specified size
Fields are treated as separate registers
Carry does not propagate into the field to the left
  a3      a2      a1      a0
+ b3      b2      b1      b0
= a3+b3   a2+b2   a1+b1   a0+b0
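The rule that a carry never enters the field to the left can be illustrated with a small C model. The names reg128, add_words, and add_halfwords are hypothetical, and the sketch assumes wraparound (non-saturating) addition, which the slide does not specify.

```c
#include <stdint.h>

/* A 128-bit register modeled as four 32-bit words; w[0] holds bits 31..0. */
typedef struct { uint32_t w[4]; } reg128;

/* Add four 32-bit fields independently. */
reg128 add_words(reg128 a, reg128 b)
{
    reg128 r;
    for (int i = 0; i < 4; i++)
        r.w[i] = a.w[i] + b.w[i];   /* unsigned wraparound: carry out of bit 31 is dropped */
    return r;
}

/* Add eight 16-bit fields independently. */
reg128 add_halfwords(reg128 a, reg128 b)
{
    reg128 r;
    for (int i = 0; i < 4; i++) {
        uint32_t lo = (a.w[i] + b.w[i]) & 0x0000FFFFu;           /* low halfword, carry masked off */
        uint32_t hi = ((a.w[i] >> 16) + (b.w[i] >> 16)) << 16;   /* high halfword, carry shifted out */
        r.w[i] = hi | lo;
    }
    return r;
}
```

For example, adding 0x0001FFFF and 0x00000001 gives 0x00020000 as one 32-bit field but 0x00010000 as two 16-bit fields, because the carry out of the low halfword is discarded instead of entering the high halfword.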
Load Immediate
Unlike MIPS, an Immediate value can be loaded directly
The Load Index specifies which field the Immediate is placed into
Can only load 16 bits (halfword) at a time
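A Load Immediate of this kind might be modeled in C as below. The names reg128 and load_immediate are hypothetical, and the sketch assumes that the seven fields not selected by the Load Index keep their previous contents, which the slide implies but does not state.

```c
#include <stdint.h>

/* A 128-bit register modeled as four 32-bit words; w[0] holds bits 31..0. */
typedef struct { uint32_t w[4]; } reg128;

/* Write a 16-bit immediate into halfword field `index` (0..7) of r.
   Assumption: all other fields are left unchanged. */
reg128 load_immediate(reg128 r, int index, uint16_t imm)
{
    int shift = 16 * (index % 2);                  /* position inside the 32-bit word */
    uint32_t mask = (uint32_t)0xFFFFu << shift;    /* bits occupied by the target field */
    r.w[index / 2] = (r.w[index / 2] & ~mask) | ((uint32_t)imm << shift);
    return r;
}
```

Since only 16 bits are loaded at a time, building a 32-bit constant takes two loads (one per halfword), and filling a whole 128-bit register takes eight.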
R4 Instructions
Multiply and Add/Subtract Instructions
Takes 3 Inputs
Multiplication fields are half the size of the addition/subtraction fields
The High/Low bit selects which half of each field is multiplied
Supports two different multiplication sizes
With Saturation
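One field of a multiply-add with saturation could behave as in the C sketch below. This is an assumption-laden illustration, not the project specification: the names sign_extend16, saturate32, and mul_add_sat are hypothetical, the operands are assumed to be signed, and only the 16-bit-multiply, 32-bit-add size is shown.

```c
#include <stdint.h>

/* Interpret the low 16 bits of x as a signed two's-complement value. */
static int32_t sign_extend16(uint32_t x)
{
    int32_t v = (int32_t)(x & 0xFFFFu);
    return v >= 0x8000 ? v - 0x10000 : v;
}

/* Clamp a 64-bit intermediate result into the signed 32-bit range. */
static int32_t saturate32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* One 32-bit field of a multiply-add: acc + (half of a) * (half of b).
   high == 0 selects bits 15..0 of a and b; high != 0 selects bits 31..16. */
int32_t mul_add_sat(int32_t acc, uint32_t a, uint32_t b, int high)
{
    int shift = high ? 16 : 0;
    int32_t ha = sign_extend16(a >> shift);   /* signed 16-bit multiplicand */
    int32_t hb = sign_extend16(b >> shift);   /* signed 16-bit multiplier */
    int64_t product = (int64_t)ha * hb;       /* a 16x16 product always fits in 32 bits */
    return saturate32((int64_t)acc + product);
}
```

The full unit would apply this to all four 32-bit fields in parallel; a subtract variant would compute acc minus the product with the same clamping.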
Acknowledgements
These slides contain material developed and copyrighted by:
Morgan Kaufmann (Elsevier, Inc.)
Arvind (MIT)
Krste Asanovic (MIT/UCB)
Joel Emer (Intel/MIT)
James Hoe (CMU)
John Kubiatowicz (UCB)
David Patterson (UCB)
Justin Hsia (UCB)
Mikhail Dorojevets (SBU)