EECS 470 Lecture 13 Memory Speculation · Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar,...

Lecture 13 Slide 1EECS 470

Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch

EECS 470

Lecture 13

Memory Speculation

Winter 2018

Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, Univerity of Pennsylvania, and University of Wisconsin.

Tomasulo-Style Scheduler Implementation

Results

Input/Result Networks

Reservation Stations

Inputs

Network

Control

FlagsOp

• Synchronization managed by scheduler logic

• Communication through input/output networks

• Infrastructure geared towards register communication

Out-of-Order Memory Operations

Scheduling is straightforward in out-of-order… Register inputs only

Register renaming captures all true dependences

Tags tell you exactly when you can execute

… except with loads and stores Register and memory inputs (older stores)

Register renaming does not tell you all dependences There are some in memory

How do loads find older in-flight stores to same address (if any)?

Issue of finding if addresses match is called “memory disambiguation”

The Good: Register Communication

• Directly specified dependencies (contained in inst) Accurate description of communication

no false or missing dependency edges

permits realization of dataflow schedule

Early description of communication know dependencies upon decode allows scheduler logic to be pipelined without impacting speed of

communication

• Small communication name space (32-64 usually) Fast access to communication storage

possible to map entire communication space (no tags) possible to bypass communication storage

• Forwarding

The Bad (and the ugly): Memory Scheduling

• Recall how we handle dependencies in registers We address false dependencies with register renaming.

We address true dependencies with the out-of-order machine CDB broadcast and wakeup

• Why can’t we do the same things with memory? What would rename look like for memory?

Why might it be hard to pull off?

What about wakeup? What’s tricky here?

• Cannot directly use the same techniques Indirectly specified memory dependencies

Large communication space (232-64 bytes!)

• Memory latency is variable• Complicates scheduling

Requirements for a Solution

• Accurate description of memory dependencies No (or few) missing or false dependencies

Permit realization of dataflow schedule

• Early presentation of dependencies Permit pipelining of scheduler logic

• Fast access to communication space Preferably as fast as register communication (zero cycles)

EECS 470Lecture 9 Slide 7EECS 470 EECS 470

The Three Dependency “Flows”

Some trivial ways to handle LoadsAllow only one load or store in OoO core

Stall other operations at dispatch – very slow

No need for LSQ

Load/store only issues when LSQ/ROB head (in-order scheduling) Stall other operations at dispatch

Loads always get value from cache, only 1 outstanding load

More aggressive options for forwarding: Conservative load-to-store forwarding, when addresses known

Optimistic load-to-store forwarding – requires rewind mechanism

Aggressive options for memory: Aggressive fetching from memory if no load-to-store forwarding.

Implementation

Several hardware realizations: Unified LSQ (easier to understand, but nasty hardware)

Separate LQ* and SQ (more complicated, but fairly elegant)

We’ll start with a unified LSQ and move to separate LB and SQ.

*Might end up with a load buffer rather than a queue…

In-order Load/Store Scheduling

• Idea: Schedule all loads and stores in program order This cannot violate true data

dependencies (non-speculative)

• Capabilities/limitations: Overly restrictive – likely to add many

false dependencies

Early presentation of dependencies (no addresses)

Not fast, all communication through memory structures

• Found in in-order issue pipelines

true realized

Dependencies

Consider the LSQ cases

Slot LSQ addresses (top is head)

A Store 0x44

B Load 0x20

C Load 0x44

D Store ????

E Load 0x44

F Store 0x20

G Load 0x20

H Load 0x60

• Which of the loads are we sure we will fulfill via D$/Memory?

• Which of the loads will we fulfill via load-to-store forwarding?

• Which aren’t we sure of?

• Identify what to do with each load.

Unified Load/Store Queue

Operates as a circular FIFO

Allocate on dispatch

De-allocate on retirement

Calc address in register dataflow order

A NxN comparator matrix detects memory address dependence (also considers relative age of entries)

Store ops are held until retirement

Load ops are issued when no dependency exists & all older store addresses known

==========

=========

========

=======

= = ====

======

address

calculation+

translation

Unified Load/Store Queue Questions

When do we search for store-to-load forwarding? As soon as we have the load address

What could happen once we have the load address? There is a store whose data we’ll use

There is no store whose data we’ll use

We aren’t sure which store’s data, if any, we’ll use.

What should we do for each of those three cases?

Split LQ and SQ

D$/TLB + structures to handle in-flight loads/storesPerforms four functions

In-order store retirement

• Writes stores to D$ in order

• Basic, implemented by store queue (SQ)

Store-load forwarding

• Allows loads to read values from older un-retired stores

• Data provided to LQ from SQ.

Memory ordering violation detection

• Checks load speculation (more later)

• Advanced, implemented by load queue (LQ)

Memory ordering violation avoidance

• Advanced, implemented by dependence predictors

Simple Data Memory FU: D$/TLB + SQ

Just like any other FU• 2 register inputs (addr, data in)

• 1 register output (data out)

• 1 non-register input (load pos)?

Store queue (SQ)• In-flight store address/value

• In program order (like ROB)

• Addresses associatively searchable

• Size heuristic: 15-20% of ROB

But what does it do?

valueaddress================

D$/TLB

load position

address data in data out

Store Queue (SQ)

“Out-of-Order” Load Execution

In parallel with D$ access

Send address to SQ• Compare with all store addresses

• CAM: like FA$, or RS tag match

• Select all matching addresses

Age logic selects youngest store that is older than load• Uses load position input

• Three possibilities

• There is a store we can forward from

• Do so!

• There is no store we can forward from

• Get from D$

• We don’t know.

• ?????

valueaddress================

D$/TLB

load position

address data in data out

Store Queue (SQ)

D$/TLB + SQ + LQ

Load queue (LQ)• In-flight load addresses

• In program-order (like ROB,SQ)

• Associatively searchable

• Size heuristic: 20-30% of ROB================

D$/TLB

load queue

address================

store position flush?

Conservative Memory Scheduling

• Schedule loads and stores when all dependencies known satisfied• Conservative - won’t violate true

dependencies (non-speculative)

• Capabilities/limitations:• Accurate only if addresses arrive early

• Late presentation of dependencies (verified with addresses)

• Not fast, all communication through memory and/or complex store forward network

• Better for small windows

true realized

Dependencies

st X st X

Conservative Dataflow Scheduling

true realized

Dependencies

stall cycle

Opportunistic Memory Scheduling

• Observe: on average, < 10% of loads forward from SQ• Even if older store address is unknown, chances are it won’t match

• Let loads execute in presence of older “ambiguous stores”

+ Increases performance

• But what if ambiguous store does match?

• Memory ordering violation: load executed too early • Must detect…

• And fix (e.g., by flushing/refetching instruction starting at load)

Opportunistic Memory Speculation

• Schedule loads and stores when register dependencies satisfied• May violate true data dependencies

(speculative)

• Capabilities/limitations:• Accurate - if little in-flight

communication through memory

• Early presentation of dependencies (no dependencies!)

• Not fast, all communication through memory structures

• Most common with small windows

true realized

Dependencies

Less aggressive speculation

• Once you have a load address, check if you will forward for certain (from somewhere, even if you aren’t sure where)• If not, get from memory/D$

• Don’t send to CDB until certain.

Yet another idea…

• Let’s use a predictor

• Predict if a load is likely to be forwarding from a store• If not, get it from D$ and send it to the CDB

• If so, wait.

• Thoughts on cost/benefit of speculation?• How should that impact our predictor?

“Memory dependence prediction”

EECS 470 Lecture 13 Memory Speculation · Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar,...

Documents

Transcript of EECS 470 Lecture 13 Memory Speculation · Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar,...

Parallel Execution Models for Future Multicore Architectures Guri Sohi University of Wisconsin.

bIoMEdIcAl RFID - University of Wisconsin–Madisonpages.cs.wisc.edu/~sohi/cs252/Fall2014/handouts/spectrum07_rfid_ethics.pdf · States, approved an RFID tag for implantation in humans

The Mystery Machine: End-to-end performance analysis of large-scale Internet services Michael Chow David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch.

Rethinking Parallel Execution Guri Sohi (along with Matthew Allen, Srinath Sridharan, Gagan Gupta) University of Wisconsin-Madison.

Imprint - Österreichische Energieagentur · Imprint Authors Österreichisches Ökologie-Institut (Austrian Institute of Ecology): Andrea Wallner (Project Coordination) Antonia Wenisch

The State of Health Insurancepacificprime.cn/assets/landing/sohi-2020/SOHI-2020-report.pdf · What are the regional trends and challenges, and how do they affect individuals, businesses,

EECS 470 Lecture 10 Memory Speculation · EECS 470 Lecture 10 EECS 470 Slide 3 © Wenisch 2016 --Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith,

Multiscalar processors Gurindar S. Sohi Scott E. Breach T.N. Vijaykumar University of Wisconsin-Madison.

Drug Pricing in Canada Victoria Brown, Anureet Sohi, Lisa Weger SPHA 511.

Cooperative Caching for Chip Multiprocessors Jichuan Chang Guri Sohi University of Wisconsin-Madison ISCA-33, June 2006.

EECS 470 · Shen, Smith, Sohi, Tyson, Wenisch, Vijaykumar Instruction Set Architecture “Instruction set architecture (ISA) is the structure of a computer that a machine language

Lecture 13 Slide 1 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch.

Aasheesh Kolli, Thomas F. Wenisch (University of … RAS-Directed Instruction Prefetching Aasheesh Kolli, Thomas F. Wenisch (University of Michigan) and Ali Saidi (ARM) 1. The instruction

GENERATIONAL DIFFERENCES IN WORK PREFERENCESessay.utwente.nl/60041/1/MA_thesis_J_Hoff.pdf · considerable role in the preferences for work (Gilbert, Sohi, & McEachern, 2008). Therefore,

NPP Lithuania - Umweltbundesamt...NPP LITHUANIA Expert Statement to the EIA Report Antonia Wenisch Helmut Hirsch Gabriele Mraz Petra Seibert Klemens Leutgöb Ordered by the Federal

Introduction to Computer Engineeringpages.cs.wisc.edu/~sohi/cs252/Fall2012/lectures/lec07_assembly... · Introduction to Computer Engineering CS/ECE 252, Fall 2012 Prof. Guri Sohi

Naveen Sohi Sohi ABOUT ME I have been working in the communications and marketing ﬁeld for 5 years. I graduated with a Bachelors of Arts in Communications in June of 2014, yet have

Cooperative Caching for Chip Multiprocessors Jichuan Chang †, Enric Herrero ‡, Ramon Canal ‡ and Gurindar S. Sohi * HP Labs † Universitat Politècnica de.

Ali Hashemi Sohi -AVIMA10-Mater thesis-New aero engine concepts1

Introduction to Computer Engineering CS/ECE 252, Spring 2010 Prof. Guri Sohi Computer Sciences Department University of Wisconsin – Madison.