PIPELINING AND PROCESSOR PERFORMANCE
Slides by: Pedro Tomás
Additional reading: "Computer Architecture: A Quantitative Approach", 5th edition, Chapter 1, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
ADVANCED COMPUTER ARCHITECTURES
ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)
Advanced Computer Architectures, 2014
Outline
Revision
Single cycle processor
Multi cycle processor
Processor pipelining
Instruction flow in a pipeline processor
Execution conflicts
Evaluating processor performance
Revision of a RISC architecture
Single cycle processor
Revision of a RISC architecture
Multi-cycle processor
[Figure: datapath of the multi-cycle processor, divided into five stages (IF, ID&OF, EX, MEM, WB). IF: PC, next-PC logic (+4, JMP CTRL, COND, FLAGS) and instruction memory. ID&OF: decoder and register file (RF) with asynchronous read ports (AA, BA) and synchronous write port (DA, WE, DATA), plus immediate (IMM) and operand multiplexers (SEL A, SEL B). EX: ALU (OP SEL) producing flags. MEM: data memory (Address, Data, WE). WB: output multiplexer (SEL OUT) writing back to the register file. Each stage has its own enable signal: EnableIF, EnableID&OF, EnableEX, EnableMEM, EnableWB.]
Instruction flow
Single cycle processor
Each instruction takes 1 cycle to execute
Clock period limited by the worst case path of the whole processor
Example:
for (i = 0, aux = 0; i < 100; i++) {
    if (V[i] > aux) {
        aux = V[i];
    }
}
            LI    R2,100
            SLL   R2,R2,2
            LI    R1,4
            MOVE  R3,R0
LOOP_NXT:   LW    R4,100(R1)
            SUB   R5,R4,R3
            BLEZ  R5,LOOP_END
            MOVE  R3,R4
LOOP_END:   ADDI  R1,R1,4
            BNE   R2,R1,LOOP_NXT
Some of the instructions used (e.g., LI and MOVE) do not actually exist in the MIPS64 instruction
set. Hence, they should be replaced with an equivalent instruction such as OR DR,R0,operand.
However, they are left here to make the Assembly code easier to read.
[Figure: single cycle execution of the example code. Each instruction completes all five steps (IF, ID, EX, MEM, WB) within one clock cycle; the first four instructions occupy Cycle 1 to Cycle 4.]
Instruction flow
Multi cycle processor
Each instruction takes 5 cycles to execute
Clock period limited by the worst case path among the individual stages (the slowest stage)
The clock frequency is higher than in the single cycle processor, but the instruction throughput is lower
[Figure: multi cycle execution of the example code over clock cycles 1 to 16. Each instruction occupies five consecutive cycles (IF, ID, EX, MEM, WB) and the next instruction starts only when the previous one has finished: three instructions complete in cycles 1 to 15 and the fourth enters IF in cycle 16.]
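The trade-off between the single cycle and multi cycle designs can be sketched numerically. The stage latencies below are assumed values chosen only for illustration (they do not come from the slides), and the function names are hypothetical.

```python
# Hypothetical stage latencies in nanoseconds (assumed, not from the slides).
STAGE_LATENCY_NS = {"IF": 2.0, "ID": 1.0, "EX": 2.0, "MEM": 2.0, "WB": 1.0}


def single_cycle_time_ns(num_instructions, stages=STAGE_LATENCY_NS):
    """One long cycle per instruction: the period covers the whole datapath."""
    clock_period = sum(stages.values())
    return num_instructions * clock_period


def multi_cycle_time_ns(num_instructions, stages=STAGE_LATENCY_NS):
    """One short cycle per stage: the period covers only the slowest stage."""
    clock_period = max(stages.values())
    cycles_per_instruction = len(stages)
    return num_instructions * cycles_per_instruction * clock_period


if __name__ == "__main__":
    n = 10
    print("single cycle:", single_cycle_time_ns(n), "ns")
    print("multi cycle: ", multi_cycle_time_ns(n), "ns")
```

With these assumed latencies the single cycle design needs 8 ns per instruction, whereas the multi cycle design runs with a 2 ns clock period but needs 5 cycles (10 ns) per instruction.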
Instruction flow
Pipeline processor
In each clock cycle, the processor simultaneously executes one stage of up to 5 different instructions
The instruction throughput increases by up to 5x
[Figure: pipelined execution of the ten instructions of the example code over clock cycles 1 to 16. A new instruction enters IF in every cycle, so the IF, ID, EX, MEM and WB stages of consecutive instructions overlap. Annotation: pipeline overview during clock cycle 7, when five different instructions are in flight.]
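As a rough sketch, assuming an ideal 5-stage pipeline with no conflicts and the same clock period as the multi cycle processor, the cycle counts of the two designs can be compared as follows (the function names are illustrative only).

```python
def multi_cycle_cycles(num_instructions, num_stages=5):
    """Every instruction occupies the processor for num_stages cycles."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages=5):
    """Ideal pipeline: num_stages cycles for the first instruction,
    then one more instruction completes in every following cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def ideal_speedup(num_instructions, num_stages=5):
    """Cycle-count ratio (multi cycle / ideal pipeline), for at least 1 instruction."""
    return multi_cycle_cycles(num_instructions, num_stages) / pipelined_cycles(
        num_instructions, num_stages
    )


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, pipelined_cycles(n), round(ideal_speedup(n), 3))
```

For the 10 instructions of the example, the ideal pipeline needs 14 cycles instead of 50; the ratio only approaches 5x for long instruction sequences.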
Instruction flow
Pipeline processor
The instruction throughput can potentially increase by 5x
Much higher performance, but pipelining generates conflicts that must be solved to guarantee correct behaviour
[Figure: pipelined execution of the example code over clock cycles 1 to 16, with the ID stage of the dependent instructions labelled with the registers they read (SLL: Read R2; LW: Read R1; SUB: Read R4,R3; BLEZ: Read R5). Markers show the cycle in which each register value becomes valid after write back ("R2 is valid", "R1 is valid", "R3 is valid", "R4 is valid", "R5 is valid").]
Instruction flow
Solving conflicts from pipelining
The conflicts can be solved by delaying instruction issue whenever necessary
Whenever a conflict is found, the instruction pipeline is stalled
The real throughput gain from pipelining is smaller than 5x
[Figure: pipelined execution of the example code over clock cycles 1 to 16, with stalls. An instruction that reads a register that is not yet valid is held in the ID stage ("ID Stall") and the instruction behind it is held in the IF stage ("IF Stall") until the register becomes valid (markers: "R2 is valid", "R1 is valid", "R3 is valid", "R4 is valid").]
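A common first-order model (an assumption here, not taken from the slides) treats every stall as an extra cycle added to the ideal one cycle per instruction. Under that assumption, and with an unchanged clock period, the speedup over the multi cycle processor can be sketched as follows; the function names are hypothetical.

```python
def pipeline_cpi(stall_cycles_per_instruction):
    """Average cycles per instruction: 1 ideal cycle plus the average stalls."""
    return 1.0 + stall_cycles_per_instruction


def pipeline_speedup(num_stages, stall_cycles_per_instruction):
    """Speedup over a multi cycle processor that needs num_stages cycles
    per instruction, assuming both run with the same clock period."""
    return num_stages / pipeline_cpi(stall_cycles_per_instruction)


if __name__ == "__main__":
    for stalls in (0.0, 0.5, 1.5):
        print(stalls, round(pipeline_speedup(5, stalls), 2))
```

With no stalls the model gives the ideal 5x; an average of 1.5 stall cycles per instruction already reduces it to 2x.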
Instruction flow
Solving conflicts from pipelining
The additional logic to detect and solve conflicts increases the clock period
The performance increase is even smaller than expected
[Figure: pipelined execution of the example code with stalls ("ID Stall", "IF Stall"), showing clock cycles 1 to 14. Annotation: the clock period must increase to allow for conflict detection and resolution.]
Processor performance
The processor performance depends on a number of factors:
Clock frequency
Instruction Set Architecture (e.g., RISC vs CISC)
ISA implementation (e.g., single cycle vs multi cycle vs pipeline)
Benchmarks (programs) used
Compiler optimizations
Memory bandwidth and latency
…
What is the best metric to assess processor performance?
Measuring processor performance
Frequency (GHz)
Does not take into account architectural differences (e.g., ISA)
MIPS (million instructions per second)
$$\mathit{MIPS} = \frac{\#\mathit{Instructions}}{\mathit{Execution\ Time}\ [\mu s]} = \frac{\mathit{Clock\ Frequency}}{\mathit{CPI} \times 10^{6}}$$
Valid only when using the exact same program, compiler and OS
Requires that both processors use the same ISA
1 CISC instruction is equivalent to several RISC instructions, but takes longer to execute
MFLOPS (million floating point operations per second)
Has the same problems as the MIPS metric
Valid only for floating point intensive programs
e.g., does not make sense for H.264 video compression
CPI = Cycles Per Instruction
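The limitation of MIPS across different ISAs can be illustrated with a small sketch. Both machines and all of their numbers (instruction counts, CPI, clock frequency) are hypothetical and chosen only to make the point; the function names are likewise illustrative.

```python
def mips(clock_frequency_hz, cpi):
    """MIPS = clock frequency / (CPI * 10^6)."""
    return clock_frequency_hz / (cpi * 1e6)


def execution_time_s(instruction_count, cpi, clock_frequency_hz):
    """Execution time = instruction count * CPI / clock frequency."""
    return instruction_count * cpi / clock_frequency_hz


if __name__ == "__main__":
    clock_hz = 1e9  # both hypothetical machines run at 1 GHz

    # Machine A (CISC-like): fewer, slower instructions for the same program.
    a_count, a_cpi = 1e9, 4.0
    # Machine B (RISC-like): more, faster instructions for the same program.
    b_count, b_cpi = 4e9, 1.25

    print("A:", mips(clock_hz, a_cpi), "MIPS,",
          execution_time_s(a_count, a_cpi, clock_hz), "s")
    print("B:", mips(clock_hz, b_cpi), "MIPS,",
          execution_time_s(b_count, b_cpi, clock_hz), "s")
```

In this made-up case machine B has the higher MIPS rating (800 vs 250) but takes longer to run the program (5 s vs 4 s), which is why MIPS is only comparable when both processors execute the same instruction stream.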
Measuring processor performance
Use time to measure processor performance
Requires the implementation (or at least simulation) of the proposed
processor
$$\mathit{Performance}(P) = \frac{1}{\mathit{Execution\ Time}(P)}$$
What does it mean to say:
“Processor PA is x times faster than processor PB”
$$x = \frac{\mathit{Performance}(P_A)}{\mathit{Performance}(P_B)} = \frac{\mathit{Execution\ Time}(P_B)}{\mathit{Execution\ Time}(P_A)}$$
x is also called the speedup of processor PA versus processor PB
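The two definitions above translate directly into code. This is a minimal sketch; the function names and the execution times in the example are made up for illustration.

```python
def performance(execution_time_s):
    """Performance is the inverse of the execution time."""
    return 1.0 / execution_time_s


def speedup(time_a_s, time_b_s):
    """How many times faster processor A is than processor B."""
    return time_b_s / time_a_s


if __name__ == "__main__":
    time_a, time_b = 10.0, 15.0  # hypothetical execution times in seconds
    print("speedup of A over B:", speedup(time_a, time_b))
    print("same value from performances:",
          performance(time_a) / performance(time_b))
```

A speedup below 1 means that processor A is actually slower than processor B.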
Measuring processor performance:
Task selection
What are the best benchmarks?
Real applications of interest to the user (more realistic, fit for real systems)
Different users, different benchmarks
Representative programs, e.g., SPEC CPU 2006
Synthetic programs, e.g., Dhrystone (simpler programs, easier to test in simulation)
Different metrics:
Throughput (tasks/second) vs latency (seconds/task)
Measuring processor performance:
SPEC CPU 2006 (integer)
Benchmark | Lang. | Application Area | Description
400.perlbench | C | Programming Language | Derived from Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff (SPEC's tool that checks benchmark outputs).
401.bzip2 | C | Compression | Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
403.gcc | C | C Compiler | Based on gcc Version 3.2, generates code for Opteron.
429.mcf | C | Combinatorial Optimization | Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
445.gobmk | C | Artificial Intelligence: Go | Plays the game of Go, a simply described but deeply complex game.
456.hmmer | C | Search Gene Sequence | Protein sequence analysis using profile hidden Markov models (profile HMMs).
458.sjeng | C | Artificial Intelligence: chess | A highly-ranked chess program that also plays several chess variants.
462.libquantum | C | Physics / Quantum Computing | Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
464.h264ref | C | Video Compression | A reference implementation of H.264/AVC, encodes a videostream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2.
471.omnetpp | C++ | Discrete Event Simulation | Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
473.astar | C++ | Path-finding Algorithms | Pathfinding library for 2D maps, including the well known A* algorithm.
483.xalancbmk | C++ | XML Processing | A modified version of Xalan-C++, which transforms XML documents to other document types.
Measuring processor performance:
SPEC CPU 2006 (floating point)
Benchmark | Lang. | Application Area | Description
410.bwaves | Fortran | Fluid Dynamics | Computes 3D transonic transient laminar viscous flow.
416.gamess | Fortran | Quantum Chemistry | Gamess implements a wide range of quantum chemical computations.
433.milc | C | Quantum Chromodynamics | A gauge field generating program for lattice gauge theory programs.
434.zeusmp | Fortran | Physics / CFD | Computational fluid dynamics code for simulating astrophysical phenomena.
435.gromacs | C, Fortran | Biochemistry / Molecular Dynamics | Molecular dynamics, i.e. simulates Newtonian equations of motion for hundreds to millions of particles. The test case simulates protein Lysozyme in a solution.
436.cactusADM | C, Fortran | Physics / General Relativity | Solves the Einstein evolution equations using a staggered-leapfrog method.
437.leslie3d | Fortran | Fluid Dynamics | Large-Eddy Simulations with Linear-Eddy Model in 3D.
444.namd | C++ | Biology / Molecular Dynamics | Simulates large biomolecular systems.
447.dealII | C++ | Finite Element Analysis | Program library targeted at adaptive finite elements and error estimation.
450.soplex | C++ | Linear Programming, Optimization | Solves a linear program using a simplex algorithm and sparse linear algebra.
453.povray | C++ | Image Ray-tracing | Image rendering of a 1280x1024 anti-aliased landscape.
454.calculix | C, Fortran | Structural Mechanics | Finite element code for linear and nonlinear 3D structural applications.
459.GemsFDTD | Fortran | Computational Electromagnetics | Solves the Maxwell equations in 3D using the finite-difference time-domain (FDTD) method.
465.tonto | Fortran | Quantum Chemistry | An open source quantum chemistry package.
470.lbm | C | Fluid Dynamics | Simulates incompressible fluids in 3D.
481.wrf | C, Fortran | Weather | Weather modeling from scales of meters to thousands of kilometers.
482.sphinx3 | C | Speech recognition | A widely-known speech recognition system from Carnegie Mellon University.
Measuring processor performance:
Averaging performance
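Normal: all benchmarks have the same weight
Weighted ($W_i$): benchmarks are weighted by frequency or relevance
Arithmetic Mean
Normal:
$$\frac{1}{N}\sum_{i=1}^{N} T_i$$
Weighted:
$$\frac{\sum_{i=1}^{N} W_i T_i}{\sum_{i=1}^{N} W_i}$$
Harmonic Mean (less sensitive to large outliers and increases the influence of small values)
Normal:
$$\left(\frac{1}{N}\sum_{i=1}^{N} T_i^{-1}\right)^{-1} = \frac{N}{\sum_{i=1}^{N} \frac{1}{T_i}}$$
Weighted:
$$\frac{\sum_{i=1}^{N} W_i}{\sum_{i=1}^{N} \frac{W_i}{T_i}}$$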
Alternative:
Instead of using the execution time $T_i$,
use the speedup with respect to a standard reference machine: $\mathit{Speedup}_i = T_i^{\mathit{Ref}} / T_i$
SPEC benchmarks use a SPARCstation (Sun SPARCstation 10) as reference
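The four averages and the per-benchmark speedup can be written down in a few lines. This is a minimal sketch with illustrative function names; the execution times and weights in the example are invented and do not correspond to any real benchmark run.

```python
def arithmetic_mean(values):
    """(1/N) * sum(T_i)."""
    return sum(values) / len(values)


def weighted_arithmetic_mean(values, weights):
    """sum(W_i * T_i) / sum(W_i)."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def harmonic_mean(values):
    """N / sum(1 / T_i)."""
    return len(values) / sum(1.0 / v for v in values)


def weighted_harmonic_mean(values, weights):
    """sum(W_i) / sum(W_i / T_i)."""
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


def speedups(reference_times, times):
    """Speedup_i = T_i(reference machine) / T_i(machine under test)."""
    return [ref / t for ref, t in zip(reference_times, times)]


if __name__ == "__main__":
    times = [2.0, 4.0, 4.0]    # hypothetical execution times in seconds
    weights = [1.0, 1.0, 2.0]  # hypothetical benchmark weights
    print("arithmetic:         ", arithmetic_mean(times))
    print("weighted arithmetic:", weighted_arithmetic_mean(times, weights))
    print("harmonic:           ", harmonic_mean(times))
    print("weighted harmonic:  ", weighted_harmonic_mean(times, weights))
    print("speedups:           ", speedups([10.0, 20.0, 8.0], times))
```

For any set of positive values the harmonic mean is never larger than the arithmetic mean, which reflects the larger influence of small values.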
Measuring processor performance:
Amdahl's Law
Consider that we improve processor performance by better designing some
part of it
E.g., improve floating point calculations by 3x
What is the actual improvement?
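$$\mathit{Speedup} = \frac{T}{T'} = \frac{T}{T'(\mathit{FP}) + T'(\mathit{Non\ FP})}$$
The Non-FP instructions have the same execution time: $T'(\mathit{Non\ FP}) = T(\mathit{Non\ FP})$
FP instructions execution time:
$$T'(\mathit{FP}) = \frac{T(\mathit{FP})}{\mathit{Speedup}(\mathit{FP})} = \frac{1}{3}\,T(\mathit{FP})$$
$T$: execution time in the original processor; $T'$: execution time in the improved processor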
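$$\mathit{Speedup} = \frac{T}{\frac{T(\mathit{FP})}{\mathit{Speedup}(\mathit{FP})} + T(\mathit{Non\ FP})} = \frac{1}{\frac{T(\mathit{FP})/T}{\mathit{Speedup}(\mathit{FP})} + T(\mathit{Non\ FP})/T}$$
Let us consider that, in the original processor, the fraction of time spent executing
floating point instructions is $\alpha(\mathit{FP}) = T(\mathit{FP})/T = 0.25$
$$\mathit{Speedup} = \frac{1}{\frac{\alpha(\mathit{FP})}{\mathit{Speedup}(\mathit{FP})} + \left(1 - \alpha(\mathit{FP})\right)} = \frac{1}{\frac{0.25}{3} + 0.75} = 1.2$$
$T$: execution time in the original processor; $T'$: execution time in the improved processor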
Measuring processor performance:
Amdahl's Law (corollary)
Consider that we improve processor performance by better designing some
part of it
E.g., improve floating point calculations by 3x
What is the maximum improvement possible?
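$$\mathit{Speedup} = \frac{1}{\frac{\alpha(\mathit{FP})}{\mathit{Speedup}(\mathit{FP})} + \left(1 - \alpha(\mathit{FP})\right)} = \frac{1}{\frac{0.25}{3} + 0.75} = 1.2$$
Consider that $\mathit{Speedup}(\mathit{FP}) \to +\infty$
$$\mathit{Maximum\ achievable\ Speedup} = \frac{1}{1 - \alpha(\mathit{FP})} = \frac{1}{0.75} \approx 1.33$$
$T$: execution time in the original processor; $T'$: execution time in the improved processor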
Measuring processor performance:
Amdahl's Law (summary)
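Execution Time:
$$T_{\mathit{Improved\ machine}} = T_{\mathit{Reference\ machine}} \left[\left(1 - \mathit{Fraction}_{\mathit{Improved}}\right) + \frac{\mathit{Fraction}_{\mathit{Improved}}}{\mathit{Speedup}_{\mathit{Improved}}}\right]$$
Actual Speedup:
$$\mathit{Speedup}_{\mathit{Improved\ machine}} = \frac{1}{\left(1 - \mathit{Fraction}_{\mathit{Improved}}\right) + \frac{\mathit{Fraction}_{\mathit{Improved}}}{\mathit{Speedup}_{\mathit{Improved}}}}$$
Maximum achievable Speedup ($\mathit{Speedup}_{\mathit{Improved}} \to +\infty$):
$$\mathit{Maximum\ Speedup}_{\mathit{Improved\ machine}} = \frac{1}{1 - \mathit{Fraction}_{\mathit{Improved}}}$$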
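The three summary formulas can be checked against the floating point example (25% of the time in FP instructions, FP made 3x faster). The sketch below uses illustrative function names, and the 100 s reference time is an assumed value.

```python
def improved_time(reference_time, fraction_improved, speedup_improved):
    """Execution time after speeding up only a fraction of the original time."""
    return reference_time * (
        (1.0 - fraction_improved) + fraction_improved / speedup_improved
    )


def amdahl_speedup(fraction_improved, speedup_improved):
    """Overall speedup of the improved machine."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / speedup_improved)


def max_speedup(fraction_improved):
    """Limit of the overall speedup when the improved part takes no time at all."""
    return 1.0 / (1.0 - fraction_improved)


if __name__ == "__main__":
    fraction_fp, speedup_fp = 0.25, 3.0
    print("overall speedup:", round(amdahl_speedup(fraction_fp, speedup_fp), 3))
    print("maximum speedup:", round(max_speedup(fraction_fp), 3))
    # Hypothetical 100 s run on the reference machine.
    print("improved time:  ", round(improved_time(100.0, fraction_fp, speedup_fp), 2), "s")
```

Increasing the FP speedup far beyond 3x changes the overall result very little, because the 75% of the time spent outside FP instructions is untouched.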
Next lesson
More on processor pipelining
Conflict identification
Solving conflicts