Superscalar Organization Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based...

Superscalar OrganizationSuperscalar Organization

Prof. Mikko H. LipastiUniversity of Wisconsin-Madison

Lecture notes based on notes by John P. ShenUpdated by Mikko Lipasti

Limitations of Scalar PipelinesLimitations of Scalar Pipelines

Scalar upper bound on throughput– IPC <= 1 or CPI >= 1

Inefficient unified pipeline– Long latency for each instruction

Rigid pipeline stall policy– One stalled instruction stalls all newer

instructions

Parallel PipelinesParallel Pipelines

(a) No Parallelism (b) Temporal Parallelism

(c) Spatial Parallelism

(d) Parallel Pipeline

Spatial Pipeline UnrollingSpatial Pipeline Unrolling

Pipeline Unrolling - PowerPipeline Unrolling - Power

12-stage pipeline, 60% latch power, 25% latch delay+setup, 12% area on latches, 20% leakage power

Intel Pentium Parallel PipelineIntel Pentium Parallel Pipeline

U - Pipe V - Pipe

Diversified PipelinesDiversified Pipelines

• • •

• • •IF

ALU MEM1 FP1 BR

MEM2 FP2

Power4 Diversified PipelinesPower4 Diversified PipelinesPCI-Cache

BR Scan

BR Predict

Fetch Q

Decode

Reorder BufferBR/CRIssue Q

CRUnit

BRUnit

FX/LD 1Issue Q

FX1Unit LD1

FX/LD 2Issue Q

LD2Unit

FX2Unit

FPIssue Q

FP1Unit

FP2Unit

D-Cache

Rigid Pipeline Stall PolicyRigid Pipeline Stall Policy

Bypassing of StalledInstruction

Stalled Instruction

Backward Propagationof Stalling

Not Allowed

Dynamic PipelinesDynamic Pipelines

• • •

• • •IF

ALU MEM1 FP1 BR

MEM2 FP2

DispatchBuffer

ReorderBuffer

( in order )

( out of order )

( in order )

Interstage BuffersInterstage Buffers

• • •

Stage i

Buffer (n)

Stage i +1

Stage i

Buffer (> n)

Stage i + 1

Stage i

Buffer (1)

Stage i + 1

(a) (b)

• • •

( in order )

( out of order )_

( in order )1

• • •

( in order )

Superscalar Pipeline StagesSuperscalar Pipeline Stages

Instruction Buffer

Dispatch Buffer

Decode

Issuing Buffer

Dispatch

Completion Buffer

Execute

Store Buffer

Complete

Retire

In Program

Limitations of Scalar PipelinesLimitations of Scalar Pipelines

Scalar upper bound on throughput– IPC <= 1 or CPI >= 1– Solution: wide (superscalar) pipeline

Inefficient unified pipeline– Long latency for each instruction– Solution: diversified, specialized pipelines

Rigid pipeline stall policy– One stalled instruction stalls all newer instructions– Solution: Out-of-order execution, distributed

execution pipelines

Impediments to High IPCImpediments to High IPC

I-cache

DECODE

COMMIT

D-cache

BranchPredictor Instruction

Buffer

StoreQueue

ReorderBuffer

Integer Floating-point Media Memory

Instruction

RegisterData

MemoryData

EXECUTE

Superscalar Pipeline DesignSuperscalar Pipeline Design

Instruction Fetching Issues Instruction Decoding Issues Instruction Dispatching Issues Instruction Execution Issues Instruction Completion & Retiring Issues

Instruction FlowInstruction Flow

Challenges:– Branches: control dependences– Branch target misalignment– Instruction cache misses

Solutions– Code alignment (static vs.dynamic)– Prediction/speculation

Instruction Memory

3 instructions fetched

Objective: Fetch multiple instructions per cycle

I-Cache OrganizationI-Cache OrganizationR

••

CacheLine

••

Address

1 cache line = 1 physical row

••

•Cache

••

Address

1 cache line = 2 physical rows

TAG Ro

Fetch AlignmentFetch Alignment

RIOS-I Fetch HardwareRIOS-I Fetch Hardware

Issues in DecodingIssues in Decoding

Primary Tasks– Identify individual instructions (!)– Determine instruction types– Determine dependences between

instructionsTwo important factors

– Instruction set architecture– Pipeline width

Pentium Pro Fetch/DecodePentium Pro Fetch/Decode

Predecoding in the AMD K5Predecoding in the AMD K5

Instruction Dispatch and IssueInstruction Dispatch and Issue

Parallel pipeline– Centralized instruction fetch– Centralized instruction decode

Diversified pipeline– Distributed instruction execution

Necessity of Instruction DispatchNecessity of Instruction Dispatch

Centralized Reservation StationCentralized Reservation Station

Distributed Reservation StationDistributed Reservation Station

Issues in Instruction ExecutionIssues in Instruction Execution

Current trends– More parallelism bypassing very challenging– Deeper pipelines– More diversity

Functional unit types– Integer– Floating point– Load/store most difficult to make parallel– Branch– Specialized units (media)

Very wide datapaths (256 bits/register or more)

Bypass NetworksBypass Networks

O(n2) interconnect from/to FU inputs and outputs Associative tag-match to find operands Solutions (hurt IPC, help cycle time)

– Use RF only (IBM Power4) with no bypass network– Decompose into clusters (Alpha 21264)

PCI-Cache

BR Scan

BR Predict

Fetch Q

Decode

Reorder BufferBR/CRIssue Q

CRUnit

BRUnit

FX/LD 1Issue Q

FX1Unit LD1

FX/LD 2Issue Q

LD2Unit

FX2Unit

FPIssue Q

FP1Unit

FP2Unit

D-Cache

Specialized unitsSpecialized units

Intel Pentium 4 staggered adders– Fireball

Run at 2x clock frequency

Two 16-bit bitslices Dependent ops

execute on half-cycle boundaries

Full result not available until full cycle later

Specialized unitsSpecialized units FP multiply-

accumulateR = (A x B) + C

Doubles FLOP/instruction

Lose RISC instruction format symmetry:– 3 source operands

Widely used

Media Data TypesMedia Data Types

Subword parallel vector extensions– Media data (pixels, quantized datum) often 1-2 bytes– Several operands packed in single 32/64b register

{a,b,c,d} and {e,f,g,h} stored in two 32b registers– Vector instructions operate on 4/8 operands in

parallel– New instructions, e.g. sum of abs. differences (SAD)

me = |a – e| + |b – f| + |c – g| + |d – h| Substantial throughput improvement

– Usually requires hand-coding of critical loops– Shuffle ops (gather/scatter of vector elements)

e f g ha b c d

Issues in Completion/RetirementIssues in Completion/Retirement

Out-of-order execution– ALU instructions– Load/store instructions

In-order completion/retirement– Precise exceptions– Memory coherence and consistency

Solutions– Reorder buffer– Store buffer– Load queue snooping (later)

A Dynamic Superscalar ProcessorA Dynamic Superscalar Processor

Impediments to High IPCImpediments to High IPC

I-cache

DECODE

COMMIT

D-cache

BranchPredictor Instruction

Buffer

StoreQueue

ReorderBuffer

Integer Floating-point Media Memory

Instruction

RegisterData

MemoryData

EXECUTE

Superscalar OverviewSuperscalar Overview

Instruction flow– Branches, jumps, calls: predict target, direction– Fetch alignment– Instruction cache misses

Register data flow– Register renaming: RAW/WAR/WAW

Memory data flow– In-order stores: WAR/WAW– Store queue: RAW– Data cache misses

Superscalar Organization Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based...

Documents

Transcript of Superscalar Organization Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based...

High Performance Embedded Computing © 2007 Elsevier Lecture 8: Embedded Processor Issues Embedded Computing Systems Mikko Lipasti, adapted from M. Schulte.

Advanced Branch Prediction Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.

A Dynamic Binary Translation Approach to Architectural Simulation Harold “Trey” Cain, Kevin Lepak, and Mikko Lipasti Computer Sciences Department Department.

ECE/CS 552: Parallel Processors Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison.

Introduction to Computer Engineering ECE 252, Fall 2010 Prof. Mikko Lipasti Department of Electrical and Computer Engineering University of Wisconsin –

Circuit-Switched Coherence Natalie Enright Jerger*, Li-Shiuan Peh +, Mikko Lipasti* *University of Wisconsin - Madison + Princeton University 2 nd IEEE.

Memory Data Flow Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.

Precise and Accurate Processor Simulation Harold Cain, Kevin Lepak, Brandon Schwartz, and Mikko H. Lipasti University of Wisconsin—Madison pharm.

ECE/CS 552: Introduction to Superscalar Processors Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes partially based.

High Performance Embedded Computing 2007 Elsevier Lecture 7: Memory Systems Code Compression Embedded Computing Systems Mikko Lipasti, adapted from.

ECE/CS 552: Pipeline Hazards © Prof. Mikko Lipasti Lecture notes based in part on slides created by…

Engineering Ethics ECE/CS 252, Fall 2010 Prof. Mikko Lipasti Department of Electrical and Computer Engineering University of Wisconsin – Madison.

Introduction to Computer Engineering ECE/CS 252, Fall 2008 Prof. Mikko Lipasti Department of Electrical and Computer Engineering University of Wisconsin.

Characterization of Silent Stores Gordon B.Bell Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison pharm.

February 18, 2004 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Slide 1 of 23 Understanding Scheduling Replay Schemes Ilhyun Kim Mikko H. Lipasti PHARM Team.

Advanced Branch Prediction - University of … Branch Prediction Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko

Executing Multiple Threads Prof. Mikko H. Lipasti University of Wisconsin-Madison.

Memory Hierarchy CS 282 – KAUST – Spring 2010 Slides by: Mikko Lipasti Muhamed Mudawar.

Science and Engineering...John Paul Shen and Mikko H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, Tata McGraw-Hill. M. J. Flynn, Computer Architecture:

On the Value Locality of Store Instructions Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison pharm.

Circuit-Switched Coherence Natalie Enright Jerger, Li-Shiuan Peh +, Mikko Lipasti *University of Wisconsin - Madison + Princeton University 2 nd IEEE.