Modern Processor Architectures (II)

Hung-Wei Tseng

• How many of the following would happen given the modern processor microarchitecture?
① The branch predictor will predict not taken for branch A
② The cache may contain the content of array2[array1[16] * 512];
③ temp can potentially become the value of array2[array1[16] * 512];
④ The program will raise an exception
A. 0
B. 1
C. 2
D. 3
E. 4

Putting it all together

unsigned int array1_size = 16;
uint8_t array1[160] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 260 };
uint8_t array2[256 * 512];

void bar(size_t x) {
    if (x < array1_size) { // Branch A: Taken if the statement is not going to be executed.
        temp &= array2[array1[x] * 512];
    }
}

void foo(size_t x) {
    int i = 0, j = 0;
    for (j = 0; j < 10000; j++)
        bar(rand() % 17);
}

Answers:
① very likely
② possibly (this is where the security issues come from)
③ impossible
④ not really, as x < array1_size

Spectre and Meltdown

• Exceptions and incorrect branch prediction can cause “rollback” of transient instructions

  • Old register states are preserved, can be restored
  • Memory writes are buffered, can be discarded
  • Cache modifications are not restored!

What happens when mis-speculation is detected

• Execution without speculation is safe
  • CPU will never read array1[x] for any x ≥ array1_size
• Execution with speculation can be exploited
  • Attacker sets up some conditions
    • train branch predictor to assume ‘if’ is likely true
    • make array1_size and array2[] uncached
  • Invokes code with out-of-bounds x such that array1[x] is a secret
  • Processor recognizes its error when array1_size arrives, restores its architectural state, and proceeds with ‘if’ false
  • Attacker detects cache change (e.g. basic FLUSH+RELOAD or EVICT+RELOAD)
    • E.g. the next read to array2[i*256] will be fast for i = array1[x], since this line got cached

Speculative execution on the following code

if (x < array1_size)
    y = array2[array1[x] * 256];
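The sequence above (transient read, rollback, surviving cache footprint) can be illustrated with a toy software model. This is only a sketch under loud assumptions: it does not touch the real cache or branch predictor, `line_cached` stands in for "which line of array2 is cached", the `predicted_in_bounds` flag stands in for a trained predictor, and the secret value 42 and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARRAY1_SIZE 16
#define NUM_VALUES 256 /* one probe line per possible byte value */

/* Toy memory: 16 in-bounds bytes followed by one hypothetical secret byte. */
static uint8_t memory[ARRAY1_SIZE + 1] = {
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
    42 /* the "secret" stored just past array1 */
};

/* Toy cache: line_cached[v] is true when array2[v * 256] would be cached. */
static bool line_cached[NUM_VALUES];

/* Stand-in for FLUSH: evict every probe line. */
static void flush_probe_lines(void) {
    memset(line_cached, 0, sizeof line_cached);
}

/*
 * Models "if (x < array1_size) y = array2[array1[x] * 256];".
 * With predicted_in_bounds set, the body runs transiently even for an
 * out-of-bounds x. Returns true only if the load commits architecturally.
 */
static bool victim(size_t x, bool predicted_in_bounds) {
    assert(x < sizeof memory); /* keep the toy model itself in bounds */
    bool in_bounds = x < ARRAY1_SIZE;
    if (!in_bounds && !predicted_in_bounds)
        return false;          /* no speculation: memory[x] is never read */
    uint8_t value = memory[x]; /* possibly transient load */
    line_cached[value] = true; /* the cache fill survives a rollback */
    return in_bounds;          /* register state is rolled back if false */
}

/* Stand-in for RELOAD: the one "fast" line reveals the byte; -1 if none. */
static int probe(void) {
    for (int v = 0; v < NUM_VALUES; v++)
        if (line_cached[v])
            return v;
    return -1;
}
```

Calling `flush_probe_lines()`, then `victim(16, true)`, then `probe()` recovers the byte stored past the array in this model, even though `victim` reports that nothing committed. With `predicted_in_bounds` false, the probe finds nothing, which matches the "execution without speculation is safe" bullet.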

• Compiler can only see/optimize static instructions, instructions in the compiled binary
• Compiler cannot optimize dynamic instructions, the real instruction sequence when executing the program
  • Compiler cannot re-order 3, 5 or 4, 5
  • Compiler cannot predict cache misses
• Compiler optimization is constrained by false dependencies due to limited number of registers (even worse for x86)
  • Instructions lw $t1, 0($a0) and addi $a0, $a0, 4 do not depend on each other
• Compiler optimizations do not work for all architectures
  • The code optimization in the previous example works for single pipeline, but not for superscalar

Why OoO instead of compilers

Static instructions
LOOP: lw   $t1, 0($a0)
      addi $a0, $a0, 4
      add  $v0, $v0, $t1
      bne  $a0, $t0, LOOP
      lw   $t0, 0($sp)
      lw   $t1, 4($sp)

Dynamic instructions
1: lw   $t1, 0($a0)
2: addi $a0, $a0, 4
3: add  $v0, $v0, $t1
4: bne  $a0, $t0, LOOP
5: lw   $t1, 0($a0)
6: addi $a0, $a0, 4
7: add  $v0, $v0, $t1
8: bne  $a0, $t0, LOOP
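To make the static/dynamic distinction concrete, here is a rough C equivalent of the loop. It is a sketch, not the code the slides were compiled from: the function name `sum_words` and the `dynamic_count` out-parameter are hypothetical, and the count simply charges four instructions (lw, addi, add, bne) per iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Instructions in the static loop body: lw, addi, add, bne. */
enum { LOOP_BODY_INSTRUCTIONS = 4 };

/*
 * Sums n words starting at a. *dynamic_count receives the number of
 * dynamic instructions the MIPS loop would execute. The assembly loop
 * tests its condition at the bottom, so n must be at least 1.
 */
static int sum_words(const int *a, size_t n, size_t *dynamic_count) {
    assert(n >= 1);
    const int *end = a + n; /* plays the role of $t0 */
    int sum = 0;            /* $v0 */
    size_t executed = 0;
    do {
        int t1 = *a;        /* lw   $t1, 0($a0)   */
        a++;                /* addi $a0, $a0, 4   */
        sum += t1;          /* add  $v0, $v0, $t1 */
        executed += LOOP_BODY_INSTRUCTIONS;
    } while (a != end);     /* bne  $a0, $t0, LOOP */
    *dynamic_count = executed;
    return sum;
}
```

The binary contains the four loop instructions once, but two iterations execute eight dynamic instructions, which is the numbered sequence 1 to 8 above. The compiler only ever sees the former.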

• The modern OOO processors have 3-6 issue widths
• Keeping instruction window filled is hard
  • Branches are every 4-5 instructions.
  • If the instruction window is 32 instructions the processor has to predict 6-8 consecutive branches correctly to keep IW full.
• The ILP within an application is low
  • Usually 1-2 per thread
  • ILP is even lower if data depends on memory operations (if cache misses) or long latency operations
• Demo
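The "6-8 consecutive branches" figure follows from dividing the window size by the branch spacing, and it compounds quickly. The sketch below assumes each prediction is independently correct with a fixed accuracy, which is an idealization; the function names and the 95% accuracy used in the note are illustrative, not from the slides.

```c
#include <assert.h>

/* Branches inside a window, given one branch every `spacing` instructions. */
static int branches_in_window(int window_size, int spacing) {
    assert(window_size > 0 && spacing > 0);
    return window_size / spacing; /* 32/5 gives 6, 32/4 gives 8 */
}

/*
 * Probability that `branches` consecutive predictions are all correct,
 * assuming each one is independently correct with probability `accuracy`.
 */
static double all_correct_probability(double accuracy, int branches) {
    assert(accuracy >= 0.0 && accuracy <= 1.0 && branches >= 0);
    double p = 1.0;
    for (int i = 0; i < branches; i++)
        p *= accuracy;
    return p;
}
```

Under these assumptions, a predictor that is right 95% of the time gets eight branches in a row right only about two thirds of the time (0.95^8 is roughly 0.66), so a 32-entry window is frequently holding wrong-path instructions.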

Problems with OOO+Superscalar

• Consider the following dynamic instructions
1: lw  $t1, 0($a0)
2: lw  $a0, 4($a0)
3: add $v0, $v0, $t1
4: bne $a0, $zero, LOOP
5: lw  $t1, 0($a0)
6: lw  $t2, 4($a0)
7: add $v0, $v0, $t1
8: bne $t2, $zero, LOOP
Assume a superscalar processor with unlimited issue width and physical registers that can fetch up to 4 instructions per cycle and takes 2 cycles to execute a memory instruction. How many cycles does it take to issue all instructions?

Dynamic execution with register naming + ROB

A. 1
B. 2
C. 3
D. 4
E. 5

[Figure: dependence graph placing instructions 1 to 8 in cycle #1 through cycle #5, with two "Wasted" labels]

Your code looks like this when performing “linked list” traversal
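For reference, a C loop of the following shape would produce that load/load/add/branch pattern. This is a sketch: the `node` layout assumes 4-byte `int` and 4-byte pointers, as on 32-bit MIPS, so that the offsets 0 and 4 line up, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int value;         /* read by: lw $t1, 0($a0) */
    struct node *next; /* read by: lw $a0, 4($a0) */
};

/*
 * Sums a NULL-terminated list. The list must have at least one node,
 * matching the bottom-tested shape of the assembly loop.
 */
static int sum_list(const struct node *head) {
    assert(head != NULL);
    int sum = 0;              /* $v0 */
    do {
        int t1 = head->value; /* lw  $t1, 0($a0) */
        head = head->next;    /* lw  $a0, 4($a0): next address comes from memory */
        sum += t1;            /* add $v0, $v0, $t1 */
    } while (head != NULL);   /* bne $a0, $zero, LOOP */
    return sum;
}
```

Each iteration cannot start its loads until the previous iteration's `next` pointer has arrived from memory, so no amount of issue width shortens the chain. That serialized wait is where the wasted cycles come from.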

Simultaneous multithreading

• Fetch instructions from different threads/processes to fill the unused part of the pipeline
• Exploit “thread level parallelism” (TLP) to solve the problem of insufficient ILP in a single thread
• You need to create an illusion of multiple processors for OSs


• To create an illusion of a multi-core processor and allow the core to run instructions from multiple threads concurrently, how many of the following units in the processor must be duplicated?
① Program counter
② Register file
③ ALUs
④ Data cache
⑤ Reorder buffer
A. 1
B. 2
C. 3
D. 4
E. 5

Illusion of a multi-core processor

• Keep separate architectural states for each thread
  • PC
  • Register Files
  • Reorder Buffer
• The rest of superscalar processor hardware is shared

• Invented by Dean Tullsen — my thesis advisor
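The split between duplicated and shared state can be written down as a data structure. The sketch below is illustrative only: the sizes, the field names, and the round-robin fetch choice are assumptions for the example, not a description of any real SMT design.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_THREADS 4    /* hypothetical: four hardware threads, T0 to T3 */
#define NUM_ARCH_REGS 32 /* assumption: MIPS-style register file */
#define ROB_ENTRIES 32   /* hypothetical size */

/* State kept separately for every hardware thread. */
struct thread_context {
    uint32_t pc;
    uint32_t regs[NUM_ARCH_REGS];
    uint32_t rob[ROB_ENTRIES];
    unsigned rob_head, rob_tail;
};

/* One SMT core: per-thread contexts plus hardware shared by all threads. */
struct smt_core {
    struct thread_context threads[NUM_THREADS]; /* duplicated per thread */
    unsigned num_alus;                          /* shared */
    unsigned dcache_bytes;                      /* shared */
    unsigned next_fetch_thread;                 /* assumed round-robin pointer */
};

/* Picks the thread to fetch from this cycle and advances the pointer. */
static unsigned select_fetch_thread(struct smt_core *core) {
    assert(core->next_fetch_thread < NUM_THREADS);
    unsigned t = core->next_fetch_thread;
    core->next_fetch_thread = (t + 1) % NUM_THREADS;
    return t;
}
```

The operating system sees `NUM_THREADS` program counters and register files and therefore schedules onto what looks like that many processors, while the ALUs and data cache exist only once.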


Simplified SMT-OOO pipeline

[Figure: block diagram with the labels Instruction Fetch: T0, Instruction Decode, Register renaming logic, Schedule, Execution Units, Data Cache, and one reorder buffer per thread (ROB: T0, ROB: T1, ROB: T2, ROB: T3)]

• Fetch 2 instructions from each thread/process at each cycle to fill the unused part of the pipeline

• Issue width is still 2, commit width is still 4

SMT

T1 1: lw   $t1, 0($a0)
T1 2: lw   $a0, 0($t1)
T2 1: sll  $t0, $a1, 2
T2 2: add  $t1, $a0, $t0
T1 3: addi $a1, $a1, -1
T1 4: bne  $a1, $zero, LOOP
T2 3: lw   $v0, 0($t1)
T2 4: addi $t1, $t1, 4
T2 5: add  $v0, $v0, $t2
T2 6: jr   $ra

[Figure: cycle-by-cycle pipeline diagram showing each instruction above moving through the IF, ID, Ren, Sch, EXE, MEM, and ROB stages]

• How many of the following about SMT are correct?
① SMT makes processors with deep pipelines more tolerant of mis-predicted branches
② SMT can improve the throughput of a single-threaded application
③ SMT processors can better utilize hardware during cache misses compared with superscalar processors with the same issue width
④ SMT processors can have higher cache miss rates compared with superscalar processors with the same cache sizes when executing the same set of applications.
A. 0
B. 1
C. 2
D. 3
E. 4

SMT

It can hurt, because you are sharing resources with other threads.

• Improve the throughput of execution
  • May increase the latency of a single thread
• Less branch penalty per thread
• Increase hardware utilization
• Simple hardware design: Only need to duplicate PC/Register Files
• Real Case:
  • Intel HyperThreading (supports up to two threads per core)
    • Intel Pentium 4, Intel Atom, Intel Core i7
  • AMD Ryzen

SMT

Pentium 4

Intel Skylake

INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES

2.1 THE SKYLAKE MICROARCHITECTURE

The Skylake microarchitecture builds on the successes of the Haswell and Broadwell microarchitectures. The basic pipeline functionality of the Skylake microarchitecture is depicted in Figure 2-1.

The Skylake microarchitecture offers the following enhancements:
• Larger internal buffers to enable deeper OOO execution and higher cache bandwidth.
• Improved front end throughput.
• Improved branch predictor.
• Improved divider throughput and latency.
• Lower power consumption.
• Improved SMT performance with Hyper-Threading Technology.
• Balanced floating-point ADD, MUL, FMA throughput and latency.

The microarchitecture supports flexible integration of multiple processor cores with a shared uncore sub-system consisting of a number of components including a ring interconnect to multiple slices of L3 (an off-die L4 is optional), processor graphics, integrated memory controller, interconnect fabrics, etc. A four-core configuration can be supported similar to the arrangement shown in Figure 2-3.

Figure 2-1. CPU Core Pipeline Functionality of the Skylake Microarchitecture

[Figure 2-1 block labels]
Front end: BPU; 32K L1 Instruction Cache; Legacy Decode Pipeline (5 uops/cycle); MSROM (4 uops/cycle); Decoded Icache (DSB) (6 uops/cycle); Instruction Decode Queue (IDQ, or micro-op queue)
Allocate/Rename/Retire/Move Elimination/Zero Idiom
Scheduler
Port 0: Int ALU, Vec FMA, Vec MUL, Vec Add, Vec ALU, Vec Shft, Divide, Branch2
Port 1: Int ALU, Fast LEA, Vec FMA, Vec MUL, Vec Add, Vec ALU, Vec Shft, Int MUL, Slow LEA
Port 5: Int ALU, Fast LEA, Vec SHUF, Vec ALU, CVT
Port 6: Int ALU, Int Shft, Branch1
Port 2: LD/STA
Port 3: LD/STA
Port 4: STD
Port 7: STA
Memory: 32K L1 Data Cache; 256K L2 Cache (Unified)
Slide annotations: 4-issue integer pipeline, 4-issue memory pipeline

Parallel architectures (I)

Hung-Wei Tseng

Von Neumann architecture

CPU is a dominant factor of performance since we heavily rely on it to execute programs

By pointing “PC” to different parts of your memory, we can perform different functions!

CPU performance scales well before 2002

[Figure: History of Processor Performance, annotated "52%/year"]

• Using PCIe P2P to eliminate CPU/DRAM from the data plane
  • Reduce the CPU load
  • Avoid redundant data copies in DRAM
  • Improve the latency

The slowdown of CPU scaling

[Figure: SPECRate (y-axis, 0 to 90) over time (x-axis, quarterly from Sep-06 to Sep-17), annotated "5x in 67 months" and "1.5x in 67 months"]

• Intel P4 (2000): 1 core
• Intel Nehalem (2010): 4 cores
• Nvidia Tegra 3 (2011): 5 cores
• SPARC T3 (2010): 16 cores
• AMD Zambezi (2011): 16 cores
• AMD Athlon 64 X2 (2005): 2 cores

Die photo of a CMP processor

• Each processor has its own local cache

Memory hierarchy on CMP

[Figure: Core 0, Core 1, Core 2, and Core 3, each with its own Local $, connected by a Bus to a Shared $]

• Assuming both application X and application Y have a similar instruction mix, say 60% ALU, 20% load/store, and 20% branches. Consider two processors:
P1: CMP with a 2-issue pipeline on each core. Each core has a private 32KB L1 D-cache
P2: SMT with a 4-issue pipeline and a 64KB L1 D-cache
Which one do you think is better?
A. P1
B. P2

Comparing SMT and CMP
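One way to reason about the question is a toy issue-slot model. This is a deliberately crude sketch: it assumes P1 has two cores with one thread each, treats each thread's ILP as a fixed number of ready instructions per cycle, and ignores the caches entirely, which is exactly where the private 32KB versus shared 64KB difference would show up. The function names are hypothetical.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* P1 (assumed two cores): each thread is capped by its own 2-issue core. */
static int cmp_issued_per_cycle(int ilp_x, int ilp_y) {
    assert(ilp_x >= 0 && ilp_y >= 0);
    return min_int(ilp_x, 2) + min_int(ilp_y, 2);
}

/* P2: both threads share a single 4-issue SMT pipeline. */
static int smt_issued_per_cycle(int ilp_x, int ilp_y) {
    assert(ilp_x >= 0 && ilp_y >= 0);
    return min_int(ilp_x + ilp_y, 4);
}
```

In this model the two designs tie when both threads have the same ILP (2 and 2 gives 4 on either), while SMT wins when the ILP is uneven (3 and 1 gives 3 on the CMP but 4 on SMT) because one thread can borrow the slots the other leaves idle. Cache contention between the SMT threads can pull the result the other way, and the model does not capture that.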

Parallelism

• Instruction-level parallelism
• Data-level parallelism
• Thread-level parallelism

Parallelism in modern computers

• SISD: single instruction stream, single data
  • Pipelining instructions within a single program
  • Superscalar
• SIMD: single instruction stream, multiple data
  • Vector instructions
  • GPUs
• MIMD: multiple instruction stream (e.g. multiple threads, multiple processes), multiple data
  • Multicore processors
  • Multiple processors
  • Simultaneous multithreading

Processing models

We will focus on this today

• How can we create programs to utilize these cores?
A. Parallel programming is easy, programmers will just write parallel programs from now on.
B. Parallel programming was hard, but architects have generally solved this problem in the 10 years since we saw the problem
C. You don’t need to write parallel code, Intel’s new compilers know how to extract thread level parallelism
D. Intel (and everyone else) is just building the chips, it’s on you to figure out how to use them.

How to utilize these cores?
