Page 1:

CPE 631: Multithreading: Thread-Level Parallelism Within a Processor

Electrical and Computer Engineering, University of Alabama in Huntsville

Aleksandar Milenkovic, milenka@ece.uah.edu

http://www.ece.uah.edu/~milenka

Page 2:

Outline

Trends in microarchitecture
Exploiting thread-level parallelism
Exploiting TLP within a processor
Resource sharing
Performance implications
Design challenges
Intel’s HT technology

Page 3:

Trends in microarchitecture

Higher clock speeds: to achieve high clock frequency, the pipeline is made deeper (superpipelining). Events that disrupt the pipeline (branch mispredictions, cache misses, etc.) become very expensive in terms of lost clock cycles.

ILP – Instruction-Level Parallelism: extract parallelism within a single program. Superscalar processors have multiple execution units working in parallel; the challenge is to find enough instructions that can be executed concurrently. Out-of-order execution => instructions are sent to execution units based on instruction dependencies rather than program order.
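To make the last point concrete, a micro-sketch (in C, with illustrative types and names, not any particular design): an instruction is eligible for an execution unit as soon as its source registers are ready, regardless of its position in program order.

```c
#include <stdbool.h>

typedef struct { int src1, src2, dst; bool issued; } instr_t;

/* Scan the instruction window in program order, but issue any instruction
 * whose sources are ready - a younger independent instruction can bypass
 * an older stalled one. */
int pick_ready(const instr_t win[], int n, const bool reg_ready[]) {
    for (int i = 0; i < n; i++)
        if (!win[i].issued && reg_ready[win[i].src1] && reg_ready[win[i].src2])
            return i;
    return -1;                     /* nothing ready to issue this cycle */
}
```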

Page 4:

Trends in microarchitecture

Cache hierarchies: the processor-memory speed gap leads to caches being used to reduce memory latency, with multiple levels of caches – smaller and faster closer to the processor core.

Thread-level parallelism: multiple programs execute concurrently. Web servers have an abundance of software threads, and users run many tasks at once: surfing the web, listening to music, encoding/decoding video streams, etc.

Page 5:

Exploiting thread-level parallelism

CMP – Chip Multiprocessing: multiple processors, each with a full set of architectural resources, reside on the same die. The processors may share an on-chip cache, or each can have its own cache. Examples: HP Mako, IBM Power4. Challenges: power, die area (cost).

Time-slice multithreading: the processor switches between software threads after a predefined time slice. This can minimize the effects of long-lasting events, but some execution slots are still wasted.

Page 6:

Multithreading Within a Processor

Until now, we have executed multiple threads of an application on different processors – can multiple threads execute concurrently on the same processor?

Why is this desirable? It is inexpensive – one CPU, no external interconnects – and there are no remote or coherence misses (though capacity misses increase).

Why does this make sense? Most processors can't find enough work – peak IPC is 6, but average IPC is only 1.5! Because threads can share resources, we can increase the number of threads without a corresponding linear increase in die area.

Page 7:

What Resources are Shared?

Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)

For correctness, each thread needs its own PC, its own logical regs (and its own mapping from logical to phys regs)

For performance, each thread could have its own ROB (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc. (for low interference), although note that more sharing means better utilization of resources

Each additional thread costs a PC, rename table, and ROB – cheap!
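As a minimal sketch (in C, with illustrative sizes and names rather than those of any specific core) of which state is replicated per thread and which is pooled:

```c
#include <stdint.h>

#define NUM_THREADS    2     /* hardware thread contexts (assumed)      */
#define NUM_ARCH_REGS 32     /* architectural registers per thread      */
#define NUM_PHYS_REGS 256    /* one shared physical register pool       */

/* Replicated per thread: cheap to add for each extra thread. */
typedef struct {
    uint64_t pc;                            /* private program counter          */
    uint16_t rename_map[NUM_ARCH_REGS];     /* logical -> physical register map */
    /* a private ROB (or a ROB partition) would also live here */
} thread_context_t;

/* Shared across threads: physical registers, caches, execution units. */
typedef struct {
    uint64_t phys_regs[NUM_PHYS_REGS];
    thread_context_t ctx[NUM_THREADS];
} smt_core_t;
```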

Page 8:

Approaches to Multithreading Within a Processor

Fine-grained multithreading: switches threads on every clock cycle. Pro: hides the latency of both short and long stalls. Con: slows down the execution of individual threads that are ready to go.

Coarse-grained multithreading: switches threads only on costly stalls (e.g., L2 cache misses). Pros: no switching every clock cycle, no slowdown for ready-to-go threads. Con: limited ability to hide shorter stalls.

Simultaneous multithreading (SMT): exploits TLP at the same time it exploits ILP.
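A toy sketch (in C, with illustrative names; the two-thread count and the stall flags are assumptions, not taken from any real core) contrasting the two switching policies above:

```c
/* Select which hardware thread runs this cycle. */
enum { N_THREADS = 2 };

int pick_thread_fine(int cycle) {
    return cycle % N_THREADS;               /* fine-grained: rotate every clock cycle */
}

int pick_thread_coarse(int current, const int l2_miss_pending[]) {
    if (!l2_miss_pending[current])
        return current;                     /* no costly stall: keep the same thread */
    for (int t = 0; t < N_THREADS; t++)     /* coarse-grained: switch only on an L2 miss */
        if (!l2_miss_pending[t])
            return t;
    return current;                         /* all threads stalled this cycle */
}
```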

Page 9:

How Are Resources Shared?

Each box represents an issue slot for a functional unit. Peak throughput is 4 IPC.

• Superscalar processor has high under-utilization – not enough work every cycle, especially when there is a cache miss

• Fine-grained multithreading can only issue instructions from a single thread in a cycle – it cannot fill every issue slot every cycle, but cache misses can be tolerated

• Simultaneous multithreading can issue instructions from any thread every cycle – has the highest probability of finding work for every issue slot

[Figure: issue slots across consecutive cycles for four organizations – Superscalar, Coarse-grained Multithreading, Fine-grained Multithreading, and Simultaneous Multithreading – with each slot either filled by Thread 1–4 or left idle.]

Page 10:

Resource Sharing

[Figure: two threads are fetched and renamed separately, then share a common issue queue, register file, and four functional units (FU FU FU FU).]

Thread-1 code: R1 ← R1 + R2; R3 ← R1 + R4; R5 ← R1 + R3
Thread-2 code: R2 ← R1 + R2; R5 ← R1 + R2; R3 ← R5 + R3

After renaming, Thread-1 becomes: P73 ← P1 + P2; P74 ← P73 + P4; P75 ← P73 + P74
After renaming, Thread-2 becomes: P76 ← P33 + P34; P77 ← P33 + P76; P78 ← P77 + P35

Both renamed streams are placed in the shared issue queue and dispatched to the shared functional units, reading and writing one common physical register file.
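The renaming above can be reproduced with a short sketch: each thread keeps its own rename map while destination registers are allocated from one shared physical pool. The starting mappings (P1, P2, P4 for Thread-1; P33–P35 for Thread-2) and the free list beginning at P73 are taken from the slide; everything else is illustrative:

```c
#include <stdio.h>

#define ARCH_REGS 32

static int next_free = 73;          /* next free physical register (per the slide) */
static int map[2][ARCH_REGS];       /* map[thread][logical] -> physical            */

/* Rename "Rd <- Ra + Rb" for a thread: read sources through the thread's map,
 * then allocate a fresh physical register for the destination. */
static void rename_add(int thread, int rd, int ra, int rb) {
    int pa = map[thread][ra], pb = map[thread][rb];
    int pd = next_free++;
    map[thread][rd] = pd;
    printf("T%d: P%d <- P%d + P%d\n", thread + 1, pd, pa, pb);
}

int main(void) {
    map[0][1] = 1;  map[0][2] = 2;  map[0][4] = 4;    /* Thread-1 initial mappings */
    map[1][1] = 33; map[1][2] = 34; map[1][3] = 35;   /* Thread-2 initial mappings */

    rename_add(0, 1, 1, 2);   /* R1 <- R1 + R2  =>  P73 <- P1  + P2  */
    rename_add(0, 3, 1, 4);   /* R3 <- R1 + R4  =>  P74 <- P73 + P4  */
    rename_add(0, 5, 1, 3);   /* R5 <- R1 + R3  =>  P75 <- P73 + P74 */
    rename_add(1, 2, 1, 2);   /* R2 <- R1 + R2  =>  P76 <- P33 + P34 */
    rename_add(1, 5, 1, 2);   /* R5 <- R1 + R2  =>  P77 <- P33 + P76 */
    rename_add(1, 3, 5, 3);   /* R3 <- R5 + R3  =>  P78 <- P77 + P35 */
    return 0;
}
```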

Page 11:

Performance Implications of SMT

Single thread performance is likely to go down (caches, branch predictors, registers, etc. are shared) – this effect can be mitigated by trying to prioritize one thread

While fetching instructions, thread priority can dramatically influence total throughput – a widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources

With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x

Alpha 21464 and Intel Pentium 4 are examples of SMT
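A minimal sketch of an ICOUNT-style fetch choice, assuming the hardware keeps a per-thread count of instructions currently in the front end and issue queue (the counter and thread count are illustrative):

```c
#define N_THREADS 8

/* Fetch this cycle from the thread with the fewest in-flight instructions,
 * which tends to equalize each thread's share of processor resources. */
int icount_pick(const int in_flight[N_THREADS]) {
    int best = 0;
    for (int t = 1; t < N_THREADS; t++)
        if (in_flight[t] < in_flight[best])
            best = t;
    return best;
}
```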

Page 12:

Design Challenges

How many threads? Many are needed to find enough parallelism; however, mixing many threads will compromise the execution of individual threads.

Processor front-end (instruction fetch): fetch as far as possible in a single thread (to maximize single-thread performance); however, this limits the number of instructions available for scheduling from other threads.

Other challenges: larger register files (multiple contexts), keeping the clock cycle time short, and cache conflicts.

Page 13:

Pentium 4: Hyperthreading architecture

One physical processor appears as multiple logical processors

The HT implementation on the NetBurst microarchitecture has two logical processors

[Figure: two copies of the architectural state, one per logical processor, on top of a single set of shared processor execution resources.]

Architectural state:

- general purpose registers

- control registers

- APIC: advanced programmable interrupt controller

Page 14:

Pentium 4: Hyperthreading architecture

Main processor resources are shared: caches, branch predictors, execution units, buses, control logic.

Duplicated resources:

- register alias tables (map the architectural registers to physical rename registers)

- next-instruction pointer and associated control logic

- return stack pointer

- instruction streaming buffer and trace cache fill buffers

Page 15:

Pentium 4: Die Size and Complexity

Page 16:

Pentium 4: Resource sharing schemes

Partition – dedicate equal resources to each logical processor. Good when expecting high utilization and somewhat unpredictable demand.

Threshold – flexible resource sharing with a limit on maximum resource usage. Good for small resources with bursty utilization, where the micro-ops stay in the structure for short, predictable periods.

Full sharing – flexible sharing with no limits. Good for large structures with variable working-set sizes.
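The three schemes reduce to three different admission checks when a logical processor tries to claim an entry in a structure. A sketch with illustrative sizes, not actual NetBurst parameters:

```c
#define TOTAL     64       /* entries in the structure (illustrative)          */
#define THRESHOLD 48       /* per-logical-processor cap for threshold sharing  */

/* Partitioned (e.g., major pipeline queues): each logical processor owns half. */
int can_alloc_partitioned(int used_by_lp) { return used_by_lp < TOTAL / 2; }

/* Threshold (e.g., schedulers): flexible, but capped below the full size. */
int can_alloc_threshold(int used_by_lp)   { return used_by_lp < THRESHOLD; }

/* Fully shared (e.g., caches): only total capacity limits either processor. */
int can_alloc_shared(int used_total)      { return used_total < TOTAL; }
```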

Page 17:

Pentium 4: Shared vs. partitioned queues

[Figure: the same queue structure organized as fully shared between the two logical processors vs. partitioned in half.]

Page 18:

NetBurst Pipeline

[Figure: the NetBurst pipeline, with resources marked as either partitioned or threshold-shared.]

Page 19:

Pentium 4: Shared vs. partitioned resources

Partitioned: e.g., major pipeline queues.

Threshold: puts a limit on the number of resource entries a logical processor can have; e.g., the schedulers.

Fully shared resources: e.g., caches. Modest interference, and a benefit if the threads share code and/or data.

Page 20:

Pentium 4: Scheduler occupancy

Page 21:

Pentium 4: Shared vs. partitioned cache

Page 22:

Pentium 4: Performance Improvements

Page 23:

Multi-Programmed Speedup