
Computer Architecture

ELEC3441

Lecture 13 – Multi-Core Processors

Dr. Hayden Kwok-Hay So

Department of Electrical and Electronic Engineering

End of an Era …

[Figure: processor performance relative to the VAX-11/780 (log scale, 1 to 100,000) plotted by year, 1978 to 2012. Growth-rate segments are labeled 25%/year, 52%/year, and 22%/year. Labeled machines range from the VAX-11/780 (5 MHz) to a 6-core Intel Xeon (3.3 GHz, boost to 3.6 GHz); plotted values rise from 1 to 24,129.]

Limited by Power, ILP, Memory speed

HKUEEE ENGG3441 - HS

Ways to Achieve Parallelism

• Instruction Level Parallelism (ILP)
  – Parallel operations come from instructions that execute in parallel
  – Dynamic: superscalar processors, out-of-order (OOO) execution
  – Static: VLIW
• Data Level Parallelism (DLP)
  – Parallel operations come from concurrent operations on independent data
  – Vector machines, SIMD extensions
• Thread Level Parallelism

Multiprocessor Systems on a Chip

• Machines with more than one processor were popular among servers and supercomputers in the 1980s and 1990s
• Uniprocessor speed growth came to a halt due to the power wall
• All major processor vendors moved to multi-core designs

[Figure: a board-level multi-processor next to a chip multi-processor.]

Connecting Cores

[Figure: three ways to connect CPUs in a multi-processor system-on-chip: a direct network, an on-chip network, and shared memory.]

Direct Connections

• Usually in the form of a low-latency, high-throughput, point-to-point network between processors
  – Bypasses I/O subsystems
• Allows low-latency communication between neighboring processors
  – Sometimes with dedicated machine instructions
• Multi-hop routing for more distant processors
  – The topology of the network plays an important role
  – e.g. ring, torus, mesh…
• Often tied to the distributed memory system
• Often a proprietary design
• Commercial examples:
  – AMD: HyperTransport
  – Intel: QuickPath Interconnect

Network Topology

[Figure: ring, mesh, and torus network topologies.]
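To see why topology matters for multi-hop routing, the following minimal sketch counts the minimum number of hops between two nodes in a ring, a 2D mesh, and a 2D torus. The function names and the coordinate-based node numbering are assumptions made for illustration; they do not come from the slides.

```python
def ring_hops(src, dst, n):
    """Minimum hops between nodes src and dst on a bidirectional ring of n nodes."""
    d = abs(src - dst) % n
    return min(d, n - d)


def mesh_hops(src, dst):
    """Minimum hops between (x, y) nodes on a 2D mesh (no wrap-around links)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, width, height):
    """Minimum hops on a 2D torus: a mesh whose rows and columns wrap around."""
    return ring_hops(src[0], dst[0], width) + ring_hops(src[1], dst[1], height)


if __name__ == "__main__":
    # 16 cores: either one ring of 16 nodes or a 4x4 grid.
    print(ring_hops(0, 8, 16))                # farthest pair on the ring
    print(mesh_hops((0, 0), (3, 3)))          # corner to corner on the mesh
    print(torus_hops((0, 0), (3, 3), 4, 4))   # same corners with wrap-around links
```

With the same 16 nodes, the wrap-around links of the torus shorten the corner-to-corner path compared with the mesh, at the cost of extra wiring.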


On-chip Network

• The study of building networks in a system-on-chip
  – A complete computer system on a chip
  – Including graphics, peripheral and memory controllers, accelerators
• MPSoC: multi-processor system on a chip
  – Multiple compute cores in the system
  – Different types of cores
• Mostly proprietary
• Some examples of on-chip networks:
  – Advanced Microcontroller Bus Architecture (AMBA): on-chip interconnect developed by ARM
  – Wishbone: OpenCores standard

Shared memory cores

• Common topology for commercial multi-core processors
• Various combinations of shared and private cache/memory

[Figure: two cache hierarchies, each with two CPU cores and main memory.
  – Private L1I$ and L1D$ per core, one shared L2$: e.g. Intel Core, Core 2
  – Private L1I$, L1D$, and L2$ per core, one shared L3$: e.g. Intel Nehalem, Sandy Bridge, Ivy Bridge]

Symmetric Multiprocessors

[Figure: two processors and memory on a CPU-Memory bus; a bridge connects it to an I/O bus with I/O controllers for graphics output and networks.]

Symmetric:
• All memory is equally far away from all processors
• Any processor can do any I/O (set up a DMA transfer)

Synchronization

The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system).

Two classes of synchronization:

• Producer-Consumer: a consumer process must wait until the producer process has produced data
• Mutual Exclusion: ensure that only one process uses a resource at a given time

[Figure: a producer feeding a consumer; processes P1 and P2 contending for a shared resource.]
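The two classes of synchronization can be sketched at the software level with Python threads. This is only an illustration: the helper names `producer_consumer` and `mutual_exclusion` are hypothetical, and the standard-library `queue.Queue` and `threading.Lock` stand in for whatever mechanism a real system provides.

```python
import queue
import threading


def producer_consumer(items):
    """Producer-consumer: the consumer blocks until the producer has produced data."""
    q = queue.Queue()
    received = []

    def producer():
        for item in items:
            q.put(item)
        q.put(None)  # sentinel: nothing more will be produced

    def consumer():
        while True:
            item = q.get()  # waits here until data is available
            if item is None:
                break
            received.append(item)

    threads = [threading.Thread(target=consumer), threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


def mutual_exclusion(num_threads, increments):
    """Mutual exclusion: only one thread at a time updates the shared counter."""
    lock = threading.Lock()
    shared = {"count": 0}

    def worker():
        for _ in range(increments):
            with lock:  # the shared resource is used by one thread at a time
                shared["count"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]
```

Calling `producer_consumer([1, 2, 3])` returns the items in the order they were produced, and `mutual_exclusion(4, 1000)` returns 4000 because no increment is lost.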


A Producer-Consumer Example

[Figure: a queue in memory between producer and consumer, with pointers tail and head; the producer uses register Rtail, the consumer uses registers Rtail, Rhead, and R.]

buf* tail;
buf* head;

Producer posting Item x:
        Load Rtail, 0(tail)
        Store 0(Rtail), x
        Rtail=Rtail+1
        Store 0(tail), Rtail

Consumer:
        Load Rhead, 0(head)
  spin: Load Rtail, 0(tail)
        if Rhead==Rtail goto spin
        Load R, 0(Rhead)
        Rhead=Rhead+1
        Store 0(head), Rhead
        process(R)

The program is written assuming instructions are executed in order.

Problems?

A Producer-Consumer Example continued

Producer posting Item x:
        Load Rtail, 0(tail)
        Store 0(Rtail), x            (1)
        Rtail=Rtail+1
        Store 0(tail), Rtail         (2)

Consumer:
        Load Rhead, 0(head)
  spin: Load Rtail, 0(tail)          (3)
        if Rhead==Rtail goto spin
        Load R, 0(Rhead)             (4)
        Rhead=Rhead+1
        Store 0(head), Rhead
        process(R)

Can the tail pointer get updated before the item x is stored?

Programmer assumes that if 3 happens after 2, then 4 happens after 1.

Problem sequences are:
        2, 3, 4, 1
        4, 1, 2, 3
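The problem sequences can be checked by brute force. The sketch below assumes that events 1 and 2 are the producer's store of item x and store of the tail pointer, and that events 3 and 4 are the consumer's load of the tail pointer and load of the item; this mapping is inferred from the programmer's assumption stated on the slide. It enumerates every global order of the four events and keeps those in which the consumer sees the new tail (3 after 2) yet reads the item before it was stored (4 before 1).

```python
from itertools import permutations


def positions(order):
    """Map each event number to its position in the global order."""
    return {event: i for i, event in enumerate(order)}


def violates_assumption(order):
    """True if 3 happens after 2 but 4 does NOT happen after 1."""
    pos = positions(order)
    return pos[2] < pos[3] and pos[4] < pos[1]


def keeps_program_order(order):
    """True if each processor's own events stay in program order (1<2 and 3<4)."""
    pos = positions(order)
    return pos[1] < pos[2] and pos[3] < pos[4]


all_orders = list(permutations([1, 2, 3, 4]))
problem_orders = [o for o in all_orders if violates_assumption(o)]
in_order_problems = [o for o in problem_orders if keeps_program_order(o)]

if __name__ == "__main__":
    print("problem orders:", problem_orders)
    print("problem orders that keep program order:", in_order_problems)
```

Both sequences listed on the slide appear in `problem_orders`, and `in_order_problems` is empty: the bug can only occur when at least one processor's memory operations are reordered.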

Sequential Consistency: A Memory Model

[Figure: several processors P sharing a single memory M.]

“A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program”
        Leslie Lamport

Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs

Sequential Consistency

Sequential concurrent tasks: T1, T2
Shared variables: X, Y (initially X = 0, Y = 10)

T1:                          T2:
Store (X), 1    # X ← 1      Load R1, (Y)
Store (Y), 11   # Y ← 11     Store (Y’), R1   # Y’ ← Y
                             Load R2, (X)
                             Store (X’), R2   # X’ ← X

What are the legitimate answers for X’ and Y’?

(X’,Y’) ∈ {(1,11), (0,10), (1,10), (0,11)} ?

If Y’ is 11 then X’ cannot be 0, so (0,11) is not a legitimate answer.
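One way to check which answers are legitimate is to enumerate every order-preserving interleaving of T1 and T2 and run each against a single shared memory. The sketch below does this; the tuple encoding of the instructions and the helper names are assumptions made for illustration.

```python
def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one interleaving against a single memory; return (X', Y')."""
    mem = {"X": 0, "Y": 10, "X'": None, "Y'": None}
    regs = {}
    for op, dst, src in schedule:
        if op == "store_imm":      # Store (dst), constant
            mem[dst] = src
        elif op == "load":         # Load dst, (src)
            regs[dst] = mem[src]
        elif op == "store_reg":    # Store (dst), register
            mem[dst] = regs[src]
    return mem["X'"], mem["Y'"]


T1 = [("store_imm", "X", 1), ("store_imm", "Y", 11)]
T2 = [("load", "R1", "Y"), ("store_reg", "Y'", "R1"),
      ("load", "R2", "X"), ("store_reg", "X'", "R2")]

sc_outcomes = {run(s) for s in interleavings(T1, T2)}

if __name__ == "__main__":
    print(sorted(sc_outcomes))
```

The set contains (0,10), (1,10), and (1,11) but never (0,11): once T2 has read Y = 11, T1's earlier store to X must already be visible.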


Sequential Consistency

Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies.

What are these in our example?

T1:                          T2:
Store (X), 1    # X ← 1      Load R1, (Y)
Store (Y), 11   # Y ← 11     Store (Y’), R1   # Y’ ← Y
                             Load R2, (X)
                             Store (X’), R2   # X’ ← X

[Arrows on the slide distinguish uniprocessor program dependencies from the additional SC requirements.]

Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory? More on this later.

Issues in Implementing Sequential Consistency

[Figure: several processors P sharing a single memory M.]

Implementation of SC is complicated by two issues:

• Out-of-order execution capability

    Instruction pair        Can be reordered?
    Load(a); Load(b)        yes
    Load(a); Store(b)       yes if a ≠ b
    Store(a); Load(b)       yes if a ≠ b
    Store(a); Store(b)      yes if a ≠ b

• Caches
    Caches can prevent the effect of a store from being seen by other processors

No common commercial architecture has a sequentially consistent memory model!

Memory Fences: Instructions to serialize memory accesses

Processors with relaxed or weak memory models (i.e., those that permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses.

Examples of processors with relaxed memory models:
    Sparc V8 (TSO, PSO): Membar
    Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
    PowerPC (WO): Sync, EIEIO
    ARM: DMB (Data Memory Barrier)
    x86/64: mfence (Global Memory Barrier)

Memory fences are expensive operations; however, one pays the cost of serialization only when it is required.
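The effect of a fence can be illustrated with a toy machine in which each CPU has a FIFO store buffer, so a store followed by a load to a different address can appear reordered. The model below is an assumption made for illustration and does not describe any of the listed processors in detail; the instruction tuples, the `fence` opcode, and the `outcomes` function are hypothetical. It explores every schedule of a two-CPU test in which each CPU stores to one variable and then loads the other.

```python
def outcomes(programs, initial_mem):
    """Return every possible final register tuple on a toy store-buffer machine.

    Stores enter a per-CPU FIFO buffer and drain to memory at arbitrary later
    times. A load returns the CPU's own youngest buffered value for the address
    if there is one, otherwise the value in memory. A fence cannot complete
    until the CPU's own store buffer is empty.
    """
    results = set()
    n = len(programs)

    def explore(pcs, bufs, mem, regs):
        moved = False
        for cpu in range(n):
            if bufs[cpu]:
                # Choice 1: drain this CPU's oldest buffered store to memory.
                (addr, val), rest = bufs[cpu][0], bufs[cpu][1:]
                explore(pcs, bufs[:cpu] + (rest,) + bufs[cpu + 1:],
                        {**mem, addr: val}, regs)
                moved = True
            if pcs[cpu] == len(programs[cpu]):
                continue
            # Choice 2: execute this CPU's next instruction.
            op = programs[cpu][pcs[cpu]]
            next_pcs = pcs[:cpu] + (pcs[cpu] + 1,) + pcs[cpu + 1:]
            if op[0] == "store":
                entry = (op[1], op[2])
                explore(next_pcs,
                        bufs[:cpu] + (bufs[cpu] + (entry,),) + bufs[cpu + 1:],
                        mem, regs)
                moved = True
            elif op[0] == "load":
                reg, addr = op[1], op[2]
                val = mem[addr]
                for a, v in bufs[cpu]:  # forward from own store buffer
                    if a == addr:
                        val = v
                explore(next_pcs, bufs, mem, {**regs, reg: val})
                moved = True
            elif op[0] == "fence" and not bufs[cpu]:
                explore(next_pcs, bufs, mem, regs)
                moved = True
        if not moved:  # all programs finished and all buffers drained
            results.add(tuple(v for _, v in sorted(regs.items())))

    explore((0,) * n, ((),) * n, dict(initial_mem), {})
    return results


no_fence = [
    [("store", "X", 1), ("load", "r0", "Y")],
    [("store", "Y", 1), ("load", "r1", "X")],
]
with_fence = [
    [("store", "X", 1), ("fence",), ("load", "r0", "Y")],
    [("store", "Y", 1), ("fence",), ("load", "r1", "X")],
]

if __name__ == "__main__":
    init = {"X": 0, "Y": 0}
    print("without fences:", sorted(outcomes(no_fence, init)))
    print("with fences:   ", sorted(outcomes(with_fence, init)))
```

Without fences the outcome (r0, r1) = (0, 0) is reachable, which no sequentially consistent interleaving allows; with a fence between each store and the following load it disappears. The cost is visible in the model too: a fenced CPU must wait for its buffer to drain before it can continue.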

Memory Coherence in SMPs

[Figure: CPU-1 with cache-1 and CPU-2 with cache-2 on a CPU-Memory bus with memory; cache-1, cache-2, and memory each hold A = 100.]

Suppose CPU-1 updates A to 200.
• write-back: memory and cache-2 have stale values
• write-through: cache-2 has a stale value

Do these stale values matter?
What is the view of shared memory for programming?
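The stale-value scenario can be reproduced with three dictionaries standing in for cache-1, cache-2, and memory. This is a deliberately minimal sketch with no coherence protocol at all; the function name `update_a` is hypothetical.

```python
def update_a(policy):
    """CPU-1 writes A = 200 with no coherence protocol; return who still holds 100."""
    memory = {"A": 100}
    cache1 = {"A": 100}
    cache2 = {"A": 100}

    cache1["A"] = 200              # the write always hits cache-1
    if policy == "write-through":
        memory["A"] = 200          # write-through updates memory immediately
    elif policy != "write-back":
        raise ValueError(policy)
    # Nothing tells cache-2 about the write, so its copy is never refreshed.

    copies = {"cache-1": cache1, "cache-2": cache2, "memory": memory}
    return sorted(name for name, copy in copies.items() if copy["A"] != 200)


if __name__ == "__main__":
    for policy in ("write-back", "write-through"):
        print(policy, "-> stale copies in", update_a(policy))
```

In both cases cache-2 keeps serving the old value to CPU-2, which is the problem a coherence protocol has to solve.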


Write-back Caches & SC

prog T1          prog T2
ST X, 1          LD Y, R1
ST Y,11          ST Y’, R1
                 LD X, R2
                 ST X’,R2

Step                            cache-1       memory                       cache-2
• T1 is executed                X=1, Y=11     X=0, Y=10, X’= , Y’=         Y= , Y’= , X= , X’=
• cache-1 writes back Y         X=1, Y=11     X=0, Y=11, X’= , Y’=         Y= , Y’= , X= , X’=
• T2 executed                   X=1, Y=11     X=0, Y=11, X’= , Y’=         Y=11, Y’=11, X=0, X’=0
• cache-1 writes back X         X=1, Y=11     X=1, Y=11, X’= , Y’=         Y=11, Y’=11, X=0, X’=0
• cache-2 writes back X’ & Y’   X=1, Y=11     X=1, Y=11, X’=0, Y’=11       Y=11, Y’=11, X=0, X’=0

Memory ends with X’ = 0 and Y’ = 11: inconsistent
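The sequence of steps on this slide can be replayed in a few lines. The sketch below models two private write-back caches with no coherence protocol; the helper names are hypothetical, and the order of the write-backs is the one that produces the inconsistent result (cache-1 writes back Y, then T2 runs, then cache-1 writes back X).

```python
def replay_write_back():
    """Replay T1 and T2 with write-back caches and no coherence protocol."""
    memory = {"X": 0, "Y": 10, "X'": None, "Y'": None}
    cache1, cache2 = {}, {}

    def load(cache, addr):
        """Read through the cache, fetching from memory on a miss."""
        if addr not in cache:
            cache[addr] = memory[addr]
        return cache[addr]

    def write_back(cache, addr):
        """Copy one cached value out to memory."""
        memory[addr] = cache[addr]

    # T1 is executed: both stores stay in cache-1.
    cache1["X"] = 1
    cache1["Y"] = 11

    # cache-1 writes back Y, but not yet X.
    write_back(cache1, "Y")

    # T2 is executed: it sees the new Y but the old X.
    cache2["Y'"] = load(cache2, "Y")
    cache2["X'"] = load(cache2, "X")

    # cache-1 writes back X; cache-2 writes back X' and Y'.
    write_back(cache1, "X")
    write_back(cache2, "X'")
    write_back(cache2, "Y'")
    return memory["X'"], memory["Y'"]


if __name__ == "__main__":
    print(replay_write_back())
```

The function returns (X’, Y’) = (0, 11), the one combination that sequential consistency rules out for this program.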

Write-through Caches & SC

prog T1          prog T2
ST X, 1          LD Y, R1
ST Y,11          ST Y’, R1
                 LD X, R2
                 ST X’,R2

Step              cache-1       memory                       cache-2
Initially         X=0, Y=10     X=0, Y=10, X’= , Y’=         Y= , Y’= , X=0, X’=
• T1 executed     X=1, Y=11     X=1, Y=11, X’= , Y’=         Y= , Y’= , X=0, X’=
• T2 executed     X=1, Y=11     X=1, Y=11, X’=0, Y’=11       Y=11, Y’=11, X=0, X’=0

Write-through caches don’t preserve sequential consistency either

Maintaining Cache Coherence

• Hardware support is required such that
  – only one processor at a time has write permission for a location
  – no processor can load a stale copy of the location after a write

⇒ cache coherence protocols

Cache Coherence vs. Memory Consistency

• A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors, for one memory address
  – i.e., updates are not lost
• A memory consistency model gives the rules on when a write by one processor can be observed by a read on another, across different addresses
  – Equivalently, what values can be seen by a load
• A cache coherence protocol is not enough to ensure sequential consistency
  – But if sequentially consistent, then caches must be coherent
• The combination of a cache coherence protocol and the processor's memory reorder buffer is used to implement a given architecture’s memory consistency model

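Snoopy Cache, Goodman 1983

• Idea: Have the cache watch (or snoop upon) DMA transfers, and then “do the right thing”
• Snoopy cache tags are dual-ported

[Figure: a cache between the processor and the memory bus, with a "Tags and State" array and a "Data (lines)" array. Address (A), data (D), and R/W signals are used to drive the memory bus when the cache is bus master; a separate snoopy read port (A, R/W) on the tags is attached to the memory bus.]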

Shared Memory Multiprocessor

[Figure: processors M1, M2, and M3, each with a snoopy cache, share a memory bus with physical memory and a DMA engine connected to disks.]

Use the snoopy mechanism to keep all processors’ view of memory coherent

Snoopy Cache Coherence Protocols

write miss: the address is invalidated in all other caches before the write is performed

read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read

Cache State Transition Diagram: The MSI protocol

Each cache line has state bits stored with its address tag.

M: Modified
S: Shared
I: Invalid

[State diagram, listed as transitions. Cache state in processor P1.]

From   To   Event
I      M    Write miss (P1 gets line from memory)
I      S    Read miss (P1 gets line from memory)
S      M    P1 intent to write
S      S    Read by any processor
S      I    Other processor intent to write
M      M    P1 reads or writes
M      S    Other processor reads (P1 writes back)
M      I    Other processor intent to write (P1 writes back)

Two Processor Example (Reading and writing the same cache line)

[Figure: the MSI state diagram drawn twice, once for the cache state in P1 and once for P2, with the roles of the two processors swapped.]

Access sequence to trace through both diagrams:
    P1 reads
    P1 writes
    P2 reads
    P2 writes
    P1 writes
    P2 writes
    P1 reads
    P1 writes
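The access sequence can be traced mechanically. The following sketch implements the MSI transitions for a single cache line shared by two processors, assuming both caches start in the Invalid state; the function `msi_access` and its return format are hypothetical names chosen for illustration.

```python
def msi_access(states, cpu, op):
    """Apply one access to a single cache line under MSI.

    states: tuple of 'M'/'S'/'I', one entry per processor's cache.
    Returns (new_states, list of processors that had to write the line back).
    """
    new = list(states)
    writebacks = []
    others = [i for i in range(len(states)) if i != cpu]
    if op == "read":
        if states[cpu] == "I":            # read miss goes on the bus
            for o in others:
                if new[o] == "M":         # dirty copy: write back, then share
                    writebacks.append(o)
                    new[o] = "S"
            new[cpu] = "S"
        # a read hit in S or M changes nothing
    elif op == "write":
        if states[cpu] != "M":            # write miss or intent to write
            for o in others:
                if new[o] == "M":
                    writebacks.append(o)
                new[o] = "I"              # every other copy is invalidated
            new[cpu] = "M"
    else:
        raise ValueError(op)
    return tuple(new), writebacks


# The slide's sequence; P1 is index 0 and P2 is index 1.
SEQUENCE = [(0, "read"), (0, "write"), (1, "read"), (1, "write"),
            (0, "write"), (1, "write"), (0, "read"), (0, "write")]


def trace(sequence, start=("I", "I")):
    """Return the (P1, P2) states after each access in the sequence."""
    states, history = start, []
    for cpu, op in sequence:
        states, _ = msi_access(states, cpu, op)
        history.append(states)
    return history


if __name__ == "__main__":
    for (cpu, op), states in zip(SEQUENCE, trace(SEQUENCE)):
        print(f"P{cpu + 1} {op + 's':7} -> P1={states[0]} P2={states[1]}")
```

At every step at most one cache holds the line in M, and whenever one does, the other cache is in I.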

Observation

[Figure: the MSI state diagram for the cache state in P1, repeated from the earlier slide.]

• If a line is in the M state then no other cache can have a copy of the line!
• Memory stays coherent; multiple differing copies cannot exist

MESI: An Enhanced MSI protocol
increased performance for private data

Each cache line has a tag: address tag plus state bits.

M: Modified Exclusive
E: Exclusive but unmodified
S: Shared
I: Invalid

[State diagram, listed as transitions. Cache state in processor P1.]

From   To   Event
I      M    Write miss
I      E    Read miss, not shared
I      S    Read miss, shared
E      E    P1 read
E      M    P1 write
E      S    Other processor reads
E      I    Other processor intent to write
S      S    Read by any processor
S      M    P1 intent to write
S      I    Other processor intent to write
M      M    P1 write or read
M      S    Other processor reads, P1 writes back
M      I    Other processor intent to write, P1 writes back
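The benefit for private data can be made concrete by counting bus transactions. The sketch below is a simplified model, assuming that every miss or intent-to-write costs one bus transaction and that a write to a line in E is silent; the function names and the `use_e` switch (False for MSI, True for MESI) are hypothetical.

```python
def access(states, cpu, op, use_e):
    """One access to a single line; returns (new_states, used_bus).

    use_e=False behaves like MSI; use_e=True adds the Exclusive state.
    """
    new = list(states)
    others = [i for i in range(len(new)) if i != cpu]
    used_bus = False
    if op == "read":
        if new[cpu] == "I":                    # read miss
            used_bus = True
            shared = any(new[o] != "I" for o in others)
            for o in others:
                if new[o] in ("M", "E"):       # an M copy is also written back
                    new[o] = "S"
            new[cpu] = "S" if (shared or not use_e) else "E"
    elif op == "write":
        if new[cpu] == "E":
            new[cpu] = "M"                     # silent upgrade, no bus traffic
        elif new[cpu] != "M":                  # write miss or intent to write
            used_bus = True
            for o in others:
                new[o] = "I"
            new[cpu] = "M"
    else:
        raise ValueError(op)
    return tuple(new), used_bus


def bus_transactions(sequence, use_e, num_cpus=2):
    """Count bus transactions for a sequence of (cpu, op) accesses."""
    states = ("I",) * num_cpus
    count = 0
    for cpu, op in sequence:
        states, used_bus = access(states, cpu, op, use_e)
        count += used_bus
    return count


if __name__ == "__main__":
    private = [(0, "read"), (0, "write")]      # only P1 touches the line
    print("MSI :", bus_transactions(private, use_e=False))
    print("MESI:", bus_transactions(private, use_e=True))
```

For a line that only one processor reads and then writes, the MSI model needs two bus transactions while the MESI model needs one, because the E to M transition requires no intent-to-write broadcast.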

Optimized Snoop with Level-2 Caches

[Figure: four CPUs, each with an L1 $ and an L2 $; a snooper is attached to each L2 $.]

• Processors often have two-level caches
• small L1, large L2 (usually both on chip now)
• Inclusion property: entries in L1 must be in L2
    invalidation in L2 ⇒ invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth

What problem could occur?

Intervention

[Figure: CPU-1 with cache-1 holding A = 200 and CPU-2 with cache-2 on a CPU-Memory bus; memory holds stale data, A = 100.]

When a read miss for A occurs in cache-2, a read request for A is placed on the bus
• Cache-1 needs to supply the data and change its state to shared
• The memory may respond to the request also!

Does memory know it has stale data?

Cache-1 needs to intervene through the memory controller to supply correct data to cache-2

False Sharing

[Cache line layout: state | line addr | data0 | data1 | ... | dataN]

A cache line contains more than one word

Cache coherence is done at the line level and not the word level

Suppose M1 writes word_i and M2 writes word_k and both words have the same line address.

What can happen?
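A rough way to see the cost is to count how often ownership of a line has to move between caches. The sketch below assumes an invalidation-based protocol in which a write requires exclusive ownership of the whole line; the 64-byte line size, the byte addresses, and the function name are assumptions chosen for illustration.

```python
LINE_BYTES = 64  # assumed line size


def count_invalidations(writes, line_bytes=LINE_BYTES):
    """Count how many times a write must invalidate another cache's copy.

    writes: list of (cpu, byte_address) write accesses in program order.
    """
    owner = {}  # line address -> cpu currently holding the line in M
    invalidations = 0
    for cpu, addr in writes:
        line = addr // line_bytes
        previous = owner.get(line)
        if previous is not None and previous != cpu:
            invalidations += 1  # the line ping-pongs to the writing cache
        owner[line] = cpu
    return invalidations


if __name__ == "__main__":
    rounds = 4
    # M1 writes the word at byte 0, M2 writes the word at byte 8: same line.
    same_line = [(cpu, addr) for _ in range(rounds) for cpu, addr in ((1, 0), (2, 8))]
    # Same pattern, but M2's word is placed on the next line (byte 64).
    padded = [(cpu, addr) for _ in range(rounds) for cpu, addr in ((1, 0), (2, 64))]
    print("same line:     ", count_invalidations(same_line))
    print("separate lines:", count_invalidations(padded))
```

Although the two processors never touch the same word, every write in the first pattern after the initial one invalidates the other cache's copy; placing the words on separate lines removes the traffic.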

Out-of-Order Loads/Stores & CC

[Figure: a CPU with load/store buffers in front of a cache (line states I/S/E) with a snooper, connected to memory through the CPU/Memory interface. Messages shown: S-req, E-req; S-rep, E-rep; pushout (Wb-rep); Wb-req, Inv-req, Inv-rep.]

Blocking caches
    One request at a time + CC ⇒ SC

Non-blocking caches
    Multiple requests (different addresses) concurrently + CC ⇒ Relaxed memory models

CC ensures that all processors observe the same order of loads and stores to an address

Acknowledgements

• These slides contain material developed and copyrighted by:
  – Arvind (MIT)
  – Krste Asanovic (MIT/UCB)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
  – David Patterson (UCB)
  – John Lazzaro (UCB)
• MIT material derived from course 6.823
• UCB material derived from courses CS152, CS252