COMP25212 CPU Multi Threading

Post on 30-Dec-2015


• Learning Outcomes: to be able to:
– Describe the motivation for multithreading support in CPU hardware
– Distinguish the benefits and implementations of coarse-grain, fine-grain and simultaneous multithreading
– Explain when multithreading is inappropriate
– Describe a multithreading implementation
– Estimate the performance of these implementations
– State the important assumptions of this performance model

Revision: Increasing CPU Performance

[Diagram: a clocked five-stage pipeline – Fetch Logic, Decode Logic, Exec Logic, Mem Logic, Write Logic – with the Inst Cache feeding the fetch stage and the Data Cache serving the memory stage; labels a–f mark the points addressed on the next slide.]

How can throughput be increased?

Increasing CPU Performance

a) By increasing clock frequency
b) By increasing Instructions per Clock
c) Minimising memory access impact – data cache
d) Maximising inst issue rate – branch prediction
e) Maximising inst issue rate – superscalar
f) Maximising pipeline utilisation – avoid instruction dependencies – out-of-order execution
g) (What does lengthening the pipeline do?)

Increasing Program Parallelism

– Keep issuing instructions after a branch?
– Keep processing instructions after a cache miss?
– Process instructions in parallel?
– Write a register while a previous write is pending?

• Where can we find additional independent instructions?
– In a different program!

Revision – Process States

[Diagram: process state machine – New → Ready (waiting for a CPU); Ready → Running on a CPU (dispatch by the scheduler); Running → Blocked (needs to wait, e.g. I/O); Blocked → Ready (I/O occurs); Running → Ready (pre-empted, e.g. timer); Running → Terminated.]

Revision – Process Control Block

• Process ID
• Process State
• PC
• Stack Pointer
• General Registers
• Memory Management Info
• Open File List, with positions
• Network Connections
• CPU time used
• Parent Process ID

Revision: CPU Switch

[Diagram: context switch between Process P0 and Process P1 via the Operating System – while P0 runs, an interrupt or system call enters the OS, which saves P0's state into PCB0 and loads P1's state from PCB1; P1 then runs until the OS saves its state into PCB1 and loads P0's state from PCB0, resuming P0.]

What does the CPU load on dispatch?

• Process ID
• Process State
• PC
• Stack Pointer
• General Registers
• Memory Management Info
• Open File List, with positions
• Network Connections
• CPU time used
• Parent Process ID

What does the CPU need to store on deschedule?

• Process ID
• Process State
• PC
• Stack Pointer
• General Registers
• Memory Management Info
• Open File List, with positions
• Network Connections
• CPU time used
• Parent Process ID

CPU Support for Multithreading

[Diagram: the five-stage pipeline (Fetch, Decode, Exec, Mem, Write logic) with shared Inst Cache and Data Cache, extended with duplicated per-thread state – two program counters (PCA, PCB), two sets of general-purpose registers (GPRsA, GPRsB), and two virtual-address mappings (VA MappingA, VA MappingB) feeding the Address Translation logic.]

How Should OS View Extra Hardware Thread?

• A variety of solutions

• Simplest is probably to declare an extra CPU

• Needs a multiprocessor-aware OS

CPU Support for Multithreading

[Diagram: the same duplicated-state pipeline as above – PCA/PCB, GPRsA/GPRsB, VA MappingA/VA MappingB over a shared pipeline and caches.]

Design issue: when to switch threads?

Coarse-Grain Multithreading

• Switch thread on an “expensive” operation:
– E.g. I-cache miss
– E.g. D-cache miss

• Some are easier than others!

Switch Threads on I-cache miss

Cycle:    1    2    3     4    5    6    7
Inst a    IF   ID   EX    MEM  WB
Inst b         IF   ID    EX   MEM  WB
Inst c              MISS  -    -    -    -
Inst X                    IF   ID   EX   MEM
Inst Y                         IF   ID   EX
Inst Z                              IF   ID

(On inst c's I-cache miss, fetch switches to thread B's instructions X, Y, Z; thread A's insts d, e, f were never fetched, so nothing already in the pipeline has to be aborted.)

Performance of Coarse Grain

• Assume (conservatively):
– 1 GHz clock (1 ns clock tick!), 20 ns memory (= 20 clocks)
– 1 I-cache miss per 100 instructions
– 1 instruction per clock otherwise

• Then, time to execute 100 instructions without multithreading:
– 100 + 20 clock cycles
– Inst per Clock = 100 / 120 = 0.83

• With multithreading, time to execute 100 instructions:
– 100 [+ 1] clock cycles
– Inst per Clock = 100 / 101 = 0.99

Switch Threads on D-cache miss

Cycle:    1    2    3    4       5     6     7
Inst a    IF   ID   EX   M-MISS  MISS  MISS  MISS
Inst b         IF   ID   EX      -     -     -    \
Inst c              IF   ID      -     -     -     } abort these
Inst d                   IF      -     -     -    /
Inst X                           IF    ID    EX
Inst Y                                 IF    ID

(Inst a's miss is only detected at its MEM stage, so insts b, c, d are already in the pipeline behind it and must be aborted before thread B's X and Y can issue.)

Performance: similar calculation (STATE ASSUMPTIONS!)

Where to restart after the memory cycle? I suggest instruction “a” – why?