2014-4-3 John Lazzaro (not a prof - “John” is always OK)
![Page 1: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/1.jpg)
UC Regents Spring 2014 © UCB CS 152 L18: Dynamic Scheduling I
2014-4-3
John Lazzaro (not a prof - “John” is always OK)
CS 152 Computer Architecture and Engineering
www-inst.eecs.berkeley.edu/~cs152/
TA: Eric Love
Lecture 19 -- Dynamic Scheduling II
![Page 2: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/2.jpg)
Case studies of dynamic execution
DEC Alpha 21264: High performance from a relatively simple implementation of a modern instruction set.
IBM Power: Evolving dynamic designs over many generations.
Simultaneous Multi-threading: Adapting multi-threading to dynamic scheduling.
Short Break
![Page 3: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/3.jpg)
DEC Alpha
21164: 4-issue in-order design.
21264: 4-issue out-of-order design.
21264 was 50% to 200% faster in real-world applications.
![Page 4: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/4.jpg)
500 MHz 0.5µ parts for in-order 21164 and out-of-order 21264.
Similarly-sized on-chip caches (116K vs 128K).
In-order 21164 has larger off-chip cache.
21264 has 55% more transistors than the 21164.
The die is 44% larger.
21264 has a 1.7x advantage on integer code, and a 2.7x advantage on floating-point code.
21264 consumes 46% more power than the 21164.
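To put the comparison in perspective, here is a small arithmetic sketch that divides each speedup by each cost ratio. The ratios come straight from the slide; the `efficiency` helper and the idea of reporting "performance per unit cost" are illustrative assumptions, not something the lecture computes.

```python
# Ratios of the 21264 relative to the 21164, as quoted on the slide.
speedup_int = 1.7    # integer code
speedup_fp = 2.7     # floating-point code
transistors = 1.55   # 55% more transistors
die_area = 1.44      # die is 44% larger
power = 1.46         # 46% more power


def efficiency(speedup, cost):
    """Relative performance per unit of cost (hypothetical helper).

    A value above 1.0 means the 21264 gets more performance per unit
    of that cost than the 21164 does.
    """
    return speedup / cost


for name, s in (("integer", speedup_int), ("floating-point", speedup_fp)):
    print(
        f"{name}: perf/watt x{efficiency(s, power):.2f}, "
        f"perf/area x{efficiency(s, die_area):.2f}, "
        f"perf/transistor x{efficiency(s, transistors):.2f}"
    )
```

Under these numbers every ratio stays above 1.0, which suggests the out-of-order design paid for itself on both integer and floating-point code.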
![Page 5: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/5.jpg)
UC Regents Spring 2014 © UCB CS 152 L19: Dynamic Scheduling II
The Real Difference: Speculation
If the ability to recover from mis-speculation is built into an implementation ... it offers the option to add speculative features to all parts of the design.
![Page 6: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/6.jpg)
[Figure: 21264 die. Labels: FP Pipe, Int Pipe (x2), OoO (x2), I-Cache (x2), Data Cache (x2), Fetch and predict.]
Separate OoO control for integer and floating point.
RISC decode happens in OoO blocks.
Unlabeled areas devoted to memory system control.
![Page 7: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/7.jpg)
21264 pipeline diagram
Rename and Issue stages are primary locations of dynamic scheduling logic. Load/store disambiguation support resides in Memory stage.
Slot: absorbs delay of long path on last slide.
![Page 8: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/8.jpg)
Fetch stage close-up:
Each cache line stores predictions of the next line, and the cache way to be fetched. If predictions are correct, fetcher maintains the required 4 instructions/cycle pace.
[Figure label: Speculative]
![Page 9: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/9.jpg)
Rename stage close-up:
(1) Allocates new physical registers for destinations.
(2) Looks up physical register numbers for sources.
(3) Handles rename dependences within the 4 issuing instructions in one clock cycle!
Input: 4 instructions specifying architected registers.
Output: 12 physical register numbers: 1 destination and 2 sources for each of the 4 instructions to be issued.
[Figure labels: For mis-speculation recovery; Time-stamped.]
![Page 10: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/10.jpg)
Recall: malloc() -- free() in hardware
The record-keeping shown in this diagram occurs in the rename stage.
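The rename steps and the malloc()/free() analogy can be sketched in software. This is a minimal model under stated assumptions, not the 21264 circuit: the `Renamer` class, its method names, and the register counts (32 architected, 80 physical) are chosen for illustration, and the hardware resolves the whole 4-instruction group in one cycle rather than looping.

```python
from collections import deque


class Renamer:
    """Toy register renamer: a map table plus a free list (hypothetical model)."""

    def __init__(self, num_arch=32, num_phys=80):
        # Assumed sizes. Initially architected register i lives in physical i.
        self.map = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))

    def rename_group(self, group):
        """Rename (dest, src1, src2) architected-register triples in program order.

        Because the map is updated after each instruction, a later instruction
        in the same group sees the mapping made by an earlier one. That is the
        intra-group dependence the hardware must handle in a single cycle.
        Returns (new_dest, src1, src2, old_dest) physical-register tuples.
        """
        renamed = []
        for dest, src1, src2 in group:
            p1, p2 = self.map[src1], self.map[src2]  # look up sources
            old = self.map[dest]                     # remembered for recovery
            new = self.free.popleft()                # "malloc" a physical register
            self.map[dest] = new
            renamed.append((new, p1, p2, old))
        return renamed

    def retire(self, old_dest):
        """At retirement the previous mapping is dead, so "free" it."""
        self.free.append(old_dest)


renamer = Renamer()
# r1 = r2 op r3; r4 = r1 op r5; r1 = r4 op r4; r6 = r1 op r1
for entry in renamer.rename_group([(1, 2, 3), (4, 1, 5), (1, 4, 4), (6, 1, 1)]):
    print(entry)
```

An empty free list raises `IndexError` in this sketch; real hardware would stall the rename stage instead. Keeping `old_dest` with each instruction is what makes both recovery (restore the old mapping) and freeing (release it at retirement) possible.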
![Page 11: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/11.jpg)
Issue stage close-up:
(1) Newly issued instructions placed in top of queue.
(2) Instructions check scoreboard: are 2 sources ready?
(3) Arbiter selects 4 oldest “ready” instructions.
(4) Update removes these 4 from queue.
Input: 4 just-issued instructions, renamed to use physical registers.
Output: The 4 oldest instructions whose 2 source registers are ready for use.
Scoreboard: Tracks writes to physical registers.
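The four issue steps above can be sketched as a small queue model. The `IssueQueue` class and its method names are hypothetical, and the scoreboard is reduced to a set of physical registers whose values have been written; the sketch only aims to show the oldest-ready-first selection.

```python
class IssueQueue:
    """Toy issue queue with a scoreboard (illustrative, not the 21264 design)."""

    def __init__(self, width=4):
        self.width = width
        self.entries = []    # oldest first: (tag, dest, src1, src2) physical regs
        self.ready = set()   # scoreboard: physical registers already written

    def insert(self, instrs):
        """Step 1: place newly renamed instructions in the queue."""
        self.entries.extend(instrs)

    def mark_written(self, preg):
        """Scoreboard update when a physical register's value becomes available."""
        self.ready.add(preg)

    def select(self):
        """Steps 2-4: pick up to `width` oldest ready entries and remove them."""
        chosen, remaining = [], []
        for entry in self.entries:
            _, _, src1, src2 = entry
            is_ready = src1 in self.ready and src2 in self.ready
            if is_ready and len(chosen) < self.width:
                chosen.append(entry)
            else:
                remaining.append(entry)
        self.entries = remaining
        return chosen


queue = IssueQueue()
for preg in (1, 2, 3):
    queue.mark_written(preg)
queue.insert([
    ("i0", 10, 1, 2), ("i1", 11, 10, 3), ("i2", 12, 3, 3),
    ("i3", 13, 2, 1), ("i4", 14, 1, 1), ("i5", 15, 2, 2),
])
print([e[0] for e in queue.select()])  # i1 waits on register 10; i5 exceeds the width
queue.mark_written(10)                 # i0's result arrives
print([e[0] for e in queue.select()])
```

Instructions leave the queue out of program order (i2 goes before i1), which is the essence of dynamic scheduling.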
![Page 12: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/12.jpg)
Execution close-up:
(1) Two copies of register files, to reduce port pressure.
(2) Forwarding buses are low-latency paths through CPU.
[Figure label: Relies on speculations]
![Page 13: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/13.jpg)
Latencies, from issue to retirement.
8 retirements per cycle can be sustained over short time periods.
Peak rate is 11 retirements in a single cycle.
Retirement managed here.
Short latencies keep buffers to a reasonable size.
![Page 14: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/14.jpg)
Execution unit close-up:
(1) Two arbiters: one for top pipes, one for bottom pipes.
(2) Instructions statically assigned to top or bottom.
(3) Arbiter dynamically selects left or right.
[Figure labels: Top, Top, Bottom]
Thus, 2 dual-issue dynamic machines, not a 4-issue machine. Why? Simplifies arbiter. Performance penalty? A few %.
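A toy model may help show where the small penalty comes from. The `arbitrate` function below is a hypothetical sketch: each ready instruction carries its static top/bottom assignment, and each half independently picks its two oldest. Giving the older pick to the left pipe is an arbitrary assumption made here for illustration.

```python
def arbitrate(ready_queue):
    """Pick at most two ready instructions per half (top, bottom).

    ready_queue: list of (tag, half) in oldest-first order, where half is
    "top" or "bottom" (the static assignment). Returns a dict that maps
    (half, side) to the issued tag, with side "left" or "right".
    """
    issued = {}
    for half in ("top", "bottom"):
        # Each arbiter only looks at instructions assigned to its own half.
        picks = [tag for tag, assigned in ready_queue if assigned == half][:2]
        for side, tag in zip(("left", "right"), picks):
            issued[(half, side)] = tag
    return issued


# Three ready instructions were assigned to the top pipes, one to the bottom.
slots = arbitrate([("a", "top"), ("b", "top"), ("c", "top"), ("d", "bottom")])
print(slots)
print(len(slots), "of 4 ready instructions issued this cycle")
```

Here instruction "c" waits a cycle even though a bottom pipe is idle. A single 4-wide arbiter over interchangeable pipes could have issued all four, at the cost of more complex selection logic.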
![Page 15: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/15.jpg)
Memory stages close-up:
Input: Say something
Loads and stores from execution unit appear as “Cluster 0/1 memory unit” in the diagram below.
1st stop: TLB, to convert virtual memory addresses.
2nd stop: Load Queue (LDQ) and Store Queue (STQ) each hold 32 instructions, until retirement ... So we can roll back!
3rd stop: Flush STQ to the data cache ... on a miss, place in Miss Address File. (MAF == MSHR)
[Figure label: “Double-pumped” 1 GHz]
![Page 16: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/16.jpg)
LDQ/STQ close-up:
Hazards we are trying to prevent:
To do so, LDQ and STQ keep lists of up to 32 loads and stores, in issued order. When a new load or store arrives, addresses are compared to detect/fix hazards:
![Page 17: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/17.jpg)
LDQ/STQ speculation
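When a load is found to have speculatively executed ahead of an older store to the same address, the hardware recovers and replays the load. It also marks the load instruction in a predictor, so that future invocations are not speculatively executed.
[Figure panels: First execution; Subsequent execution]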
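The first-execution / subsequent-execution behavior can be sketched as follows. This is a simplified model under assumptions: the `MemoryOrderChecker` class and its method names are invented for illustration, sequence numbers stand in for program order, and the replay of the offending load is not modeled.

```python
class MemoryOrderChecker:
    """Toy model of load/store order checking with a wait predictor."""

    def __init__(self):
        self.wait_pcs = set()      # predictor: load PCs that must not speculate
        self.executed_loads = []   # (seq, pc, addr) for loads already executed

    def may_speculate(self, load_pc):
        """A load may run ahead of older stores unless the predictor marked it."""
        return load_pc not in self.wait_pcs

    def load_executes(self, seq, pc, addr):
        self.executed_loads.append((seq, pc, addr))

    def store_executes(self, seq, addr):
        """Compare the store address against younger loads that already ran.

        Returns the PCs of loads that read `addr` too early and marks them
        in the predictor so later executions wait.
        """
        violators = [
            pc for (load_seq, pc, load_addr) in self.executed_loads
            if load_seq > seq and load_addr == addr
        ]
        self.wait_pcs.update(violators)
        return violators


checker = MemoryOrderChecker()
# First execution: the load (seq 5) speculatively runs before an older store (seq 3).
print(checker.may_speculate(0x40))
checker.load_executes(seq=5, pc=0x40, addr=0x1000)
print(checker.store_executes(seq=3, addr=0x1000))
# Subsequent execution: the predictor now holds the load back.
print(checker.may_speculate(0x40))
```

The address comparison is the software analogue of the LDQ/STQ match logic; the set of marked PCs plays the role of the predictor on the slide.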
![Page 18: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/18.jpg)
Designing a microprocessor is a team sport. Below are the author and acknowledgement lists for the papers whose figures I use.
There is no “i” in T-E-A-M ...
[Figure labels: circuits, architect, micro-architects]
![Page 19: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/19.jpg)
Break
![Page 20: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/20.jpg)
Multi-Threading
(Dynamic Scheduling)
![Page 21: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/21.jpg)
Power 4 (predates Power 5 shown earlier)
Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.
![Page 22: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/22.jpg)
For most apps, most execution units lie idle
From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995.
For an 8-way superscalar.
Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?
![Page 23: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/23.jpg)
Simultaneous Multi-threading ...
[Figure: issue-slot occupancy per cycle (cycles 1-9) across the 8 execution units M, M, FX, FX, FP, FP, BR, CC. Panels: One thread, 8 units; Two threads, 8 units.]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
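The one-thread versus two-thread picture can be imitated with a tiny slot-filling model. Everything here is a hypothetical illustration: the `issue_cycle` function, the fixed thread priority order, and the per-thread demand lists are made up, and only the unit mix (M, M, FX, FX, FP, FP, BR, CC) comes from the slide.

```python
UNITS = ("M", "M", "FX", "FX", "FP", "FP", "BR", "CC")


def issue_cycle(requests_by_thread):
    """Count how many of the 8 units are used in one cycle.

    requests_by_thread: one list per thread (in priority order) of the unit
    types that thread could use this cycle. A request is granted only if a
    unit of that type is still free.
    """
    free = list(UNITS)
    used = 0
    for requests in requests_by_thread:
        for unit in requests:
            if unit in free:
                free.remove(unit)  # claim one unit of this type
                used += 1
    return used


# Made-up demand for a single cycle.
thread_a = ["M", "FX", "BR"]
thread_b = ["M", "M", "FP", "FX"]

print("one thread :", issue_cycle([thread_a]), "of 8 units busy")
print("two threads:", issue_cycle([thread_a, thread_b]), "of 8 units busy")
```

In this example the second thread fills slots the first leaves idle, but one of its two load/store requests is refused because both M units are taken. That contention is why shared resources limit how many threads are worth supporting.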
![Page 24: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/24.jpg)
Power 4
Power 5
2 fetch (PC), 2 initial decodes
2 commits (architected register sets)
![Page 25: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/25.jpg)
Power 5 data flow ...
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.
![Page 26: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/26.jpg)
Power 5 thread performance ...
Relative priority of each thread controllable in hardware.
For balanced operation, both threads run slower than if they “owned” the machine.
![Page 27: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/27.jpg)
Multi-Core
![Page 28: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/28.jpg)
Recall: Superscalar utilization by a thread
For an 8-way superscalar.
Observation: In many cases, the on-chip cache and DRAM I/O bandwidth is also underutilized by one CPU.
So, let 2 cores share them.
![Page 29: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/29.jpg)
Most of Power 5 die is shared hardware
[Figure: Power 5 die. Labels: Core #1, Core #2, Shared Components (L2 Cache, L3 Cache Control, DRAM Controller).]
![Page 30: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/30.jpg)
Core-to-core interactions stay on chip
(1) Threads on two cores that use shared libraries conserve L2 memory.
(2) Threads on two cores share memory via L2 cache operations. Much faster than 2 CPUs on 2 chips.
![Page 31: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/31.jpg)
Sun Niagara
![Page 32: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/32.jpg)
The case for Sun’s Niagara ...
For an 8-way superscalar.
Observation: Some apps struggle to reach a CPI == 1.
For throughput on these apps, a large number of single-issue cores is better than a few superscalars.
![Page 33: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/33.jpg)
Niagara (original): 32 threads on one chip
8 cores: Single-issue, 1.2 GHz; 6-stage pipeline; 4-way multi-threaded; Fast crypto support.
Shared resources: 3MB on-chip cache; 4 DDR2 interfaces; 32G DRAM, 20 Gb/s; 1 shared FP unit; GB Ethernet ports.
Die size: 340 mm² in 90 nm. Power: 50-60 W
Sources: Hot Chips, via EE Times, Infoworld. J Schwartz weblog (Sun COO)
![Page 34: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/34.jpg)
The board that booted Niagara first-silicon
Source: J Schwartz weblog (then Sun COO, now CEO)
![Page 35: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/35.jpg)
Used in Sun Fire T2000: “Coolthreads”
Web server benchmarks used to position the T2000 in the market.
Claim: server uses 1/3 the power of competing servers.
![Page 36: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/36.jpg)
2014
IBM RISC chips, since Power 4 (2001) ...
![Page 37: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/37.jpg)
![Page 38: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/38.jpg)
Recap: Dynamic Scheduling
Three big ideas: register renaming, data-driven detection of RAW resolution, bus-based architecture.
Has saved architectures that have a small number of registers: IBM 360 floating-point ISA, Intel x86 ISA.
Very complex, but enables many things: out-of-order execution, multiple issue, loop unrolling, etc.
![Page 39: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)](https://reader035.fdocuments.in/reader035/viewer/2022070418/56815870550346895dc5cfe2/html5/thumbnails/39.jpg)
On Tuesday
Epilogue ...
Have a good weekend!