Post on 31-May-2020
Lecture 13 Slide 1EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
EECS 470
Lecture 13
Memory Speculation
Winter 2018
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, Univerity of Pennsylvania, and University of Wisconsin.
Lecture 13 Slide 2EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Tomasulo-Style Scheduler Implementation
Sch
edul
er
Logi
c
Results
Input/Result Networks
Reservation Stations
Inputs
Network
Control
Valid
Bits
Tag
Tag
V
V
Value
Value
FlagsOp
• Synchronization managed by scheduler logic
• Communication through input/output networks
• Infrastructure geared towards register communication
Lecture 13 Slide 3EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Out-of-Order Memory Operations
Scheduling is straightforward in out-of-order… Register inputs only
Register renaming captures all true dependences
Tags tell you exactly when you can execute
… except with loads and stores Register and memory inputs (older stores)
Register renaming does not tell you all dependences There are some in memory
How do loads find older in-flight stores to same address (if any)?
Issue of finding if addresses match is called “memory disambiguation”
Lecture 13 Slide 4EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
The Good: Register Communication
• Directly specified dependencies (contained in inst) Accurate description of communication
no false or missing dependency edges
permits realization of dataflow schedule
Early description of communication know dependencies upon decode allows scheduler logic to be pipelined without impacting speed of
communication
• Small communication name space (32-64 usually) Fast access to communication storage
possible to map entire communication space (no tags) possible to bypass communication storage
• Forwarding
Lecture 13 Slide 5EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
The Bad (and the ugly): Memory Scheduling
• Recall how we handle dependencies in registers We address false dependencies with register renaming.
We address true dependencies with the out-of-order machine CDB broadcast and wakeup
• Why can’t we do the same things with memory? What would rename look like for memory?
Why might it be hard to pull off?
What about wakeup? What’s tricky here?
• Cannot directly use the same techniques Indirectly specified memory dependencies
Large communication space (232-64 bytes!)
• Memory latency is variable• Complicates scheduling
Lecture 13 Slide 6EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Requirements for a Solution
• Accurate description of memory dependencies No (or few) missing or false dependencies
Permit realization of dataflow schedule
• Early presentation of dependencies Permit pipelining of scheduler logic
• Fast access to communication space Preferably as fast as register communication (zero cycles)
EECS 470Lecture 9 Slide 7EECS 470 EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
The Three Dependency “Flows”
Lecture 13 Slide 8EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Some trivial ways to handle LoadsAllow only one load or store in OoO core
Stall other operations at dispatch – very slow
No need for LSQ
Load/store only issues when LSQ/ROB head (in-order scheduling) Stall other operations at dispatch
Loads always get value from cache, only 1 outstanding load
More aggressive options for forwarding: Conservative load-to-store forwarding, when addresses known
Optimistic load-to-store forwarding – requires rewind mechanism
Aggressive options for memory: Aggressive fetching from memory if no load-to-store forwarding.
Lecture 13 Slide 9EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Implementation
Several hardware realizations: Unified LSQ (easier to understand, but nasty hardware)
Separate LQ* and SQ (more complicated, but fairly elegant)
We’ll start with a unified LSQ and move to separate LB and SQ.
*Might end up with a load buffer rather than a queue…
Lecture 13 Slide 10EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
In-order Load/Store Scheduling
• Idea: Schedule all loads and stores in program order This cannot violate true data
dependencies (non-speculative)
• Capabilities/limitations: Overly restrictive – likely to add many
false dependencies
Early presentation of dependencies (no addresses)
Not fast, all communication through memory structures
• Found in in-order issue pipelines
st X
ld Y
st Z
ld X
ld Z
true realized
Dependencies
prog
ram
ord
er
Lecture 13 Slide 11EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Consider the LSQ cases
Slot LSQ addresses (top is head)
A Store 0x44
B Load 0x20
C Load 0x44
D Store ????
E Load 0x44
F Store 0x20
G Load 0x20
H Load 0x60
• Which of the loads are we sure we will fulfill via D$/Memory?
• Which of the loads will we fulfill via load-to-store forwarding?
• Which aren’t we sure of?
• Identify what to do with each load.
Lecture 13 Slide 12EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Unified Load/Store Queue
Operates as a circular FIFO
Allocate on dispatch
De-allocate on retirement
Calc address in register dataflow order
A NxN comparator matrix detects memory address dependence (also considers relative age of entries)
Store ops are held until retirement
Load ops are issued when no dependency exists & all older store addresses known
==========
=========
=
========
==
=======
= = ====
======
=
address
calculation+
translation
Lecture 13 Slide 13EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Unified Load/Store Queue Questions
When do we search for store-to-load forwarding? As soon as we have the load address
What could happen once we have the load address? There is a store whose data we’ll use
There is no store whose data we’ll use
We aren’t sure which store’s data, if any, we’ll use.
What should we do for each of those three cases?
EECS 470Lecture 9 Slide 14EECS 470 EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Split LQ and SQ
D$/TLB + structures to handle in-flight loads/storesPerforms four functions
In-order store retirement
• Writes stores to D$ in order
• Basic, implemented by store queue (SQ)
Store-load forwarding
• Allows loads to read values from older un-retired stores
• Data provided to LQ from SQ.
Memory ordering violation detection
• Checks load speculation (more later)
• Advanced, implemented by load queue (LQ)
Memory ordering violation avoidance
• Advanced, implemented by dependence predictors
EECS 470Lecture 9 Slide 15EECS 470 EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Simple Data Memory FU: D$/TLB + SQ
Just like any other FU• 2 register inputs (addr, data in)
• 1 register output (data out)
• 1 non-register input (load pos)?
Store queue (SQ)• In-flight store address/value
• In program order (like ROB)
• Addresses associatively searchable
• Size heuristic: 15-20% of ROB
But what does it do?
valueaddress================
age
D$/TLB
head
tail
load position
address data in data out
Store Queue (SQ)
EECS 470Lecture 9 Slide 16EECS 470 EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
“Out-of-Order” Load Execution
In parallel with D$ access
Send address to SQ• Compare with all store addresses
• CAM: like FA$, or RS tag match
• Select all matching addresses
Age logic selects youngest store that is older than load• Uses load position input
• Three possibilities
• There is a store we can forward from
• Do so!
• There is no store we can forward from
• Get from D$
• We don’t know.
• ?????
valueaddress================
age
D$/TLB
head
tail
load position
address data in data out
Store Queue (SQ)
EECS 470Lecture 9 Slide 17EECS 470 EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
D$/TLB + SQ + LQ
Load queue (LQ)• In-flight load addresses
• In program-order (like ROB,SQ)
• Associatively searchable
• Size heuristic: 20-30% of ROB================
D$/TLB
head
tail
load queue
(LQ)
address================
tail
head
age
store position flush?
SQ
EECS 470Lecture 9 Slide 18EECS 470 EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Conservative Memory Scheduling
• Schedule loads and stores when all dependencies known satisfied• Conservative - won’t violate true
dependencies (non-speculative)
• Capabilities/limitations:• Accurate only if addresses arrive early
• Late presentation of dependencies (verified with addresses)
• Not fast, all communication through memory and/or complex store forward network
• Better for small windows
st X
ld Y
st?Z
ld X
ld Z
true realized
Dependencies
prog
ram
ord
er
EECS 470Lecture 9 Slide 19EECS 470 EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
ld Y
st X st X
Conservative Dataflow Scheduling
st X
ld Y
st?Z
ld X
ld Z
true realized
Dependencies
prog
ram
ord
er
time
ld Y
st?Z
ld X
ld Z
st?Z
ld X
ld Z
ld Y
st X
st?Z
ld X
ld Z
ld Y
st X
st Z
ld X
ld Z
ld Y
st X
st Z
ld X
ld Z
ld Y
st X
st Z
ld X
ld Z
Z
stall cycle
EECS 470Lecture 9 Slide 20EECS 470 EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Opportunistic Memory Scheduling
• Observe: on average, < 10% of loads forward from SQ• Even if older store address is unknown, chances are it won’t match
• Let loads execute in presence of older “ambiguous stores”
+ Increases performance
• But what if ambiguous store does match?
• Memory ordering violation: load executed too early • Must detect…
• And fix (e.g., by flushing/refetching instruction starting at load)
EECS 470Lecture 9 Slide 21EECS 470 EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Opportunistic Memory Speculation
• Schedule loads and stores when register dependencies satisfied• May violate true data dependencies
(speculative)
• Capabilities/limitations:• Accurate - if little in-flight
communication through memory
• Early presentation of dependencies (no dependencies!)
• Not fast, all communication through memory structures
• Most common with small windows
st X
ld Y
st Z
ld X
ld Z
true realized
Dependencies
prog
ram
ord
er
EECS 470Lecture 9 Slide 22EECS 470 EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Less aggressive speculation
• Once you have a load address, check if you will forward for certain (from somewhere, even if you aren’t sure where)• If not, get from memory/D$
• Don’t send to CDB until certain.
EECS 470Lecture 9 Slide 23EECS 470 EECS 470
© 2018 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch
Yet another idea…
• Let’s use a predictor
• Predict if a load is likely to be forwarding from a store• If not, get it from D$ and send it to the CDB
• If so, wait.
• Thoughts on cost/benefit of speculation?• How should that impact our predictor?
“Memory dependence prediction”