(C) 2005 Multifacet Project ISCA Tutorial June 5 th, 2005 Mike Marty, Brad Beckmann, Luke Yen, Alaa...


(C) 2005 Multifacet Project http://www.cs.wisc.edu/gems

ISCA Tutorial, June 5th, 2005

Mike Marty, Brad Beckmann, Luke Yen, Alaa Alameldeen, Min Xu, Kevin Moore

Please Ask Questions


What do you want to simulate?

• Uniprocessor
• Symmetric Multiprocessor
• Glueless Multiprocessor
• Chip Multiprocessor (CMP)
• Multiple-CMP


Open Source Release of GEMS

• GEMS v1.1 released as GPL software

http://www.cs.wisc.edu/gems

• Contributors

Alaa Alameldeen

Brad Beckmann

Ross Dickson

Pacia Harper

Milo Martin

Mike Marty

Carl Mauer

Kevin Moore

Manoj Plakal

Dan Sorin

Min Xu

Luke Yen

• Multifacet Project directed by Mark Hill & David Wood


GEMS Requirements

• Virtutech Simics 2.0.x or 2.2.x
– Personal academic licenses available
– http://www.virtutech.com

• Host Machine
– x86 (32 or 64-bit) Linux or Sparc/Solaris host machine
– > 1 GB Memory

• Workload Checkpoints YOU Create
– License issues w/ releasing checkpoints


GEMS From 50,000 Feet

[Diagram: Simics and Opal (detailed processor model) alongside the random tester and the microbenchmarks (deterministic, contended locks, trace file)]


Simics: Full-System Functional Simulator

• Boots unmodified Solaris 9
• BUT, each instruction takes 1 cycle
• www.virtutech.com


Ruby: Memory System Model

• Flexible multiprocessor memory hierarchy
• Includes domain-specific language


Opal: OoO Processor Model

• Implements partial SPARC v9 ISA
• Modeled after MIPS R10000


Other Drivers

• Testing independent of Simics
• Microbenchmarks


Outline

• Introduction and Motivation

• Demo: Simulating a Multiple-CMP System with GEMS

• Ruby: Memory system model

• BREAK

• Opal: Out-of-order processor model

• Demo: Two gems are better than one

• GEMS Source Code Tour and Extending Ruby

• Building Workloads


Full-System Simulation with GEMS

• Steps:
– Choosing a Ruby protocol
– Building Ruby and Opal
– Starting and configuring Simics
– Loading and configuring Ruby
– Loading and configuring Opal
– Running simulation
– Getting results


Choosing the Ruby System/Protocol

• Included with GEMS release v1.1

– CMP protocols

• MOESI_CMP_token: M-CMP token coherence

• MSI_MOSI_CMP_directory: 2-level Directory

• MOESI_CMP_directory: higher performing 2-level Directory

– SMP protocols

• MOSI_SMP_bcast: snooping on ordered interconnect

• MOSI_SMP_directory

• MOSI_SMP_hammer: based on AMD Hammer

• And more


Building Ruby and Opal

• Ruby module

cd $GEMS_ROOT/ruby

– Set compile-time defaults

vi config/rubyconfig.defaults

– Build module, choosing protocol and destination dir

make PROTOCOL=MOESI_CMP_token DESTINATION=MOESI_CMP_token

– SLICC runs, generates HTML and additional C++ files
– Ruby module built and moved to $GEMS_ROOT/simics/home/MOESI_CMP_token

• Build Opal

cd $GEMS_ROOT/opal
make module DESTINATION=MOESI_CMP_token


Starting Simics

• Start non-GUI Simics

maya(9)% cd $GEMS_ROOT/simics/home/MOESI_CMP_token/

maya(10)% ./simics

Checking out a license... done: academic license.

Looking for additional Simics modules in ./modules

+----------------+ Copyright 1998-2004 by Virtutech, All Rights Reserved

| Virtutech | Version: simics-2.0.23

| Simics | Compiled: Thu Oct 14 20:27:36 CEST 2004

+----------------+

www.simics.com "Virtutech" and "Simics" are trademarks of Virtutech AB

Type 'copyright' for details on copyright.

Type 'license' for details on warranty, copying, etc.

Type 'readme' for further information about this version.

Type 'help help' for info on the on-line documentation.

simics>


Checkpoint and Configuration

• Checkpoints should be created first
– Simics-only process

simics> read-configuration ../../checkpoints-u3/jbb/jbb-16p.check

– SpecJBB checkpoint loaded

• Load python scripts

simics> @sys.path.append("../../../gen-scripts")
simics> @import mfacet

• Configure Simics

simics> istc-disable
Turning I-STC off and flushing old data
simics> dstc-disable
Turning D-STC off and flushing old data
simics> instruction-fetch-mode instruction-fetch-trace
simics> magic-break-enable
simics> cpu-switch-time 1


Load and Configure Ruby

Load module

simics> load-module ruby

Setting # processors is required

simics> ruby0.setparam g_NUM_PROCESSORS 16

Create a M-CMP system (4 chips, 4 procs/chip)

simics> ruby0.setparam g_PROCS_PER_CHIP 4

Override compile-time defaults

simics> ruby0.setparam g_NUM_L2_BANKS 32
simics> ruby0.setparam L2_CACHE_ASSOC 4
simics> ruby0.setparam L2_CACHE_NUM_SETS_BITS 16
simics> ruby0.setparam NETWORK_LINK_LATENCY 50

Initialize

simics> ruby0.init
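The load/configure/initialize sequence above is easy to script. The following is a minimal sketch, assuming a hypothetical helper `ruby_setup_commands` that is not part of GEMS; it only formats the command strings shown on this slide and derives the chip count from the two required parameters.

```python
def num_chips(num_processors, procs_per_chip):
    """Chips in the M-CMP system: total processors divided by processors per chip."""
    return num_processors // procs_per_chip


def ruby_setup_commands(num_processors, procs_per_chip, overrides=None):
    """Return the Simics command lines that load, configure and initialize Ruby.

    Hypothetical helper: it does not talk to Simics, it only builds strings.
    """
    if num_processors % procs_per_chip != 0:
        raise ValueError("processors must divide evenly across chips")
    commands = [
        "load-module ruby",
        f"ruby0.setparam g_NUM_PROCESSORS {num_processors}",
        f"ruby0.setparam g_PROCS_PER_CHIP {procs_per_chip}",
    ]
    # Compile-time defaults are overridden one parameter at a time.
    for name, value in (overrides or {}).items():
        commands.append(f"ruby0.setparam {name} {value}")
    commands.append("ruby0.init")  # initialization comes last
    return commands


demo_commands = ruby_setup_commands(
    16,
    4,
    {
        "g_NUM_L2_BANKS": 32,
        "L2_CACHE_ASSOC": 4,
        "L2_CACHE_NUM_SETS_BITS": 16,
        "NETWORK_LINK_LATENCY": 50,
    },
)
```

With 16 processors and 4 processors per chip, `num_chips(16, 4)` gives the 4 chips named on the slide.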


Optionally Load and Configure Opal

Load module

simics> load-module opal

Initialize default processor

simics> opal0.init

simics> opal0.listparam

Start opal (but do not start simulating)

simics> opal0.sim-start "output.opal"


Running simulation

• Setup transaction-based simulation
– “magic breakpoints”
– Five JBB transactions

simics> @mfacet.setup_run_for_n_transactions(5,1)

• Start simulating
– Ruby only (Simics drives Ruby):

simics> c

– Opal is loaded (Opal steps Simics):

simics> opal0.sim-step 9999999999


Dumping Some Output

• Opal stats

simics> opal0.stats

• Ruby stats

simics> ruby0.dump-stats ruby.stats

• Ruby short stats

simics> ruby0.dump-short-stats

– Ruby_cycles is a good runtime metric


Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS

• Ruby: Memory system model
– Overview (Drivers & Memory System)
– Event-driven simulation
– Interconnection network
– SLICC: Specifying the logic of the system
– Simple example: SMP MI protocol
– Limitations

• BREAK

• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads


High-Level Infrastructure Map

[Diagram: two layers, Drivers and Memory System. Drivers are grouped as internal Ruby testers (random tester; microbenchmarks: deterministic, contended locks, trace file) and external CPU models (Simics; Opal, the detailed processor model).]


Ruby Driver: Random Tester

• “Verifying a Multiprocessor Cache Controller Using Random Test Generation” [Wood et al. 90]

• Purpose: Excite cache coherency bugs
• Competing actions performed then checked
• Utilizes false sharing
– Multiple writers - action
– Single read - check
• Randomly inserted delay


Ruby Driver: Microbenchmarks

• Deterministic tester
– Simple sequence of requests
– Sanity checking and performance tuning
– DeterministicDriver.C
  • GETX, SeriesGETS, Inv

• Contended locks
– Compare and swap atomic op.
– RequestGenerator.C / SyntheticDriver.C

• Trace file
– Issues requests one at a time
– Similar to cache warmup mechanism
– ‘-z <trace_file.gz>’


Ruby Driver: In-order Processor Model

• Simics blocking interface (in-order processor)
– Single issue, non-pipelined processor
– Only one outstanding request per CPU

• SIMICS_RUBY_MULTIPLIER > 1
– Estimates a higher performance processor
– Multiple Simics processor cycles == one Ruby cycle


Ruby Driver: In-order Processor Model

• Implements Simics’ mh_memorytracer_possible_cache_miss()
• “Callback” Simics with SIM_stall_cycle(proc_ptr, 0)

[Diagram: the Simics in-order processor model (processors P0-P3, instructions, Simics time queue) connected to the Ruby memory system model through stall()/unstall() on each processor]


Ruby Driver: Out-of-order Processor Model

• Opal (out-of-order processor)
– Super-scalar pipelined processor
– Multiple outstanding requests per CPU

• OPAL_RUBY_MULTIPLIER > 1
– Faster processor core frequency than memory
– Simulation execution optimization

What are they driving?


Ruby Multiprocessor Memory System

• Physical Components
– Caches
– Memory
– System Interconnect

• Determines the timing of memory requests
– Driver issues memory request to Ruby
– Ruby simulates the request
– Ruby eventually calls back the driver with the latency

• Ruby’s purpose: Return memory latency


Discrete Event-driven Simulation

• Discrete event-driven simulation
– Events change system state
– Series of scheduled events

• Global EventQueue
– Heart of Ruby
– Priority heap of event/time pairs
  • Not a true queue - not in FIFO order
  • Self-sorting queue
– Events scheduled for the same cycle occur in arbitrary order
– All events must be scheduled at least one unit of time ahead

GlobalEventQueue

Event    | Time
*Event G | 7
*Event B | 5
*Event J | 3
*Event S | 3
*Event A | 4
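The priority-heap behaviour described above can be sketched in a few lines. This is an illustrative model only, not Ruby's EventQueue code: the class and method names are assumptions, consumers are represented by their names, and ties within a cycle are broken by insertion order purely to keep the sketch deterministic (the slide says same-cycle order is arbitrary).

```python
import heapq
import itertools


class EventQueue:
    """Toy global event queue: a min-heap of (time, tiebreak, consumer)."""

    def __init__(self):
        self.now = 0
        self._heap = []
        self._tiebreak = itertools.count()

    def schedule(self, consumer, delay):
        # Events must be at least one unit of time in the future.
        if delay < 1:
            raise ValueError("delay must be at least one unit of time")
        heapq.heappush(self._heap, (self.now + delay, next(self._tiebreak), consumer))

    def run(self):
        """Pop events in time order and return the (consumer, time) wakeup trace."""
        trace = []
        while self._heap:
            time, _, consumer = heapq.heappop(self._heap)
            self.now = time
            trace.append((consumer, time))
        return trace


queue = EventQueue()
# Same events as the table above, inserted in table order (not FIFO order).
for name, time in [("G", 7), ("B", 5), ("J", 3), ("S", 3), ("A", 4)]:
    queue.schedule(name, time)
trace = queue.run()
```

Events J and S both fire at time 3; which of the two runs first is an artifact of the tiebreak, so code built on such a queue should not depend on it.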


Events and Consumers

• Event = Consumer Wakeup
– Consumer determines event type
– Consumer changes system state

• Typical event
– Consumer wakes up to observe its input ports
– Consumer acts upon the incoming message(s)
  • Change system state
  • Enqueue outgoing messages
– Consumer pops the incoming message(s)
– Consumer schedules the consumers of the outgoing message(s)

[Diagram: a consumer with an input port and two output ports, each output port leading to another consumer]


Events and Consumers

• Stalled event
– Consumer wakes up to observe its input ports
– Consumer encounters a stall
– Consumer schedules itself again
  • Doesn’t pop incoming queue
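A stalled wakeup can be illustrated with a small model. This is a sketch under stated assumptions, not Ruby's Consumer class: the stall condition (`resource_free_at`) and the driver loop `run` are hypothetical stand-ins for whatever resource the real controller is waiting on and for the global event queue.

```python
from collections import deque


class Consumer:
    """Toy consumer with one input port and a made-up stall condition."""

    def __init__(self, name, resource_free_at):
        self.name = name
        self.input_port = deque()
        self.resource_free_at = resource_free_at  # hypothetical: cycle when the stall clears
        self.handled = []
        self.wakeups = 0

    def wakeup(self, now):
        """Return the delay after which to wake up again, or None when idle."""
        self.wakeups += 1
        if not self.input_port:
            return None
        if now < self.resource_free_at:
            # Stalled: leave the message at the head of the queue and retry later.
            return 1
        message = self.input_port.popleft()
        self.handled.append((now, message))
        return 1 if self.input_port else None


def run(consumer, start_cycle=1):
    """Stand-in for the event queue: keep waking the consumer until it goes idle."""
    now = start_cycle
    while True:
        delay = consumer.wakeup(now)
        if delay is None:
            return now
        now += delay


cache = Consumer("cache", resource_free_at=3)
cache.input_port.append("GETX")
finished_at = run(cache)
```

The message is handled only on the third wakeup; during the two stalled wakeups it stays in the input port untouched.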


Interconnection Network

• A single flexible infrastructure
– Point-to-point links and switches: Consumers
– Both intra-chip and inter-chip networks

• Dynamic network creation
– Routing tables created at runtime
– Utilizes input parameters

• Two ways to generate topologies
1. Auto-generated
– Intra-chip network: Single on-chip switch
– Inter-chip network: 4 included (next slide)
2. Customized
– TopologyType_FILE_SPECIFIED
– Adjust individual link latency and bandwidth
– Specify one link per line

[Diagram labels: Link, Switch, Throttle.C, PerfectSwitch.C]


Auto-generated Inter-chip Network Topologies

TopologyType_TORUS_2D

TopologyType_CROSSBAR

TopologyType_HIERARCHICAL_SWITCH

TopologyType_PT_TO_PT


Network Characteristics

• Link latency
1. Auto-generated
– ON_CHIP_LINK_LATENCY
– NETWORK_LINK_LATENCY
2. Customized
– ‘link_latency:’

• Link bandwidth
– Bandwidth specified in 1000ths of a byte
1. Auto-generated
– On-chip = 10 x g_endpoint_bandwidth
– Off-chip = g_endpoint_bandwidth
2. Customized
– Individual link bandwidth = ‘bw_multiplier:’ x g_endpoint_bandwidth

• Buffer size
1. Infinite by default
2. Customized network supports finite buffering

• Prevent 2D-mesh network deadlock through e-cube restrictive routing
• ‘link_weight’

1. Perfect switch bandwidth
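The link bandwidth rules on this slide reduce to a small amount of arithmetic. The sketch below assumes a hypothetical function `link_bandwidth` and an illustrative endpoint bandwidth of 10000 (that is, 10 bytes when expressed in thousandths of a byte); neither the function nor the value comes from GEMS.

```python
def link_bandwidth(endpoint_bandwidth, on_chip=False, bw_multiplier=None):
    """Bandwidth of one link, in thousandths of a byte.

    Customized links scale the endpoint bandwidth by bw_multiplier;
    auto-generated links use 10x on-chip and 1x off-chip.
    """
    if bw_multiplier is not None:
        return bw_multiplier * endpoint_bandwidth
    return 10 * endpoint_bandwidth if on_chip else endpoint_bandwidth


def to_bytes(bandwidth_thousandths):
    """Convert a value given in thousandths of a byte to bytes."""
    return bandwidth_thousandths / 1000


endpoint = 10000  # illustrative value only
on_chip_link = link_bandwidth(endpoint, on_chip=True)
off_chip_link = link_bandwidth(endpoint)
custom_link = link_bandwidth(endpoint, bw_multiplier=4)
```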


Specification Language for Implementing Cache Coherence (SLICC)

• Domain-specific language
– Designed to specify state machines for cache coherence
– Syntactically similar to C/C++/Java
– Constrained to hardware-like structures (e.g. no loops)
– Generates C++ tightly coupled to Ruby

• Two purposes
1. Specify system coherence
– Per-memory-block State Machines
– I.e. cache and memory controller logic
2. Glue components together
– Caches with transaction buffers
– Network ports with controllers

[Diagram: a SLICC state machine between network in-ports and network out-ports]


System Flexibility via SLICC

• Substantial portion of Ruby code generated
– In combination with dynamic network creation
– Permits a tremendously flexible simulation infrastructure

• protocols/<protocol_name>.slicc
– Indicates the SLICC files needed by the protocol
– Specifies the necessary generated objects
  • Controller state machines
  • Network messages
    – Snooping protocol: requests and response messages
    – Directory protocol: requests, forwarded requests, and responses
– Allocates only C++ objects needed by the particular protocol
  • Ex. Shadow tags for an exclusive two-level cache
  • Ex. Persistent Request Table for Token coherence


Inside a SLICC State Machine

• Network buffers
– Outgoing and incoming ports

• States
– Base and transient states

• Events
– Internal events that cause state transitions

• Ruby Structures
– Caches, transaction buffers… etc.

• Trigger events
– Incoming messages trigger internal events

• Actions
– Operations performed on structures

• Transitions
– Cross-product of possible states and events
– Performs atomic sequence of actions

Layout of <controller_name>.sm: network ports, states, events, ruby structures, trigger events, actions, transitions


Creating a protocol with SLICC

• MI-example protocol
– Simple, SMP directory protocol
– Cache and directory/memory controller
– Assume ordered interconnect (for simplicity)

[Diagram: caches ($) and directories (dir) attached to the Ruby interconnect; cache states M and I, with edges labeled GETS/GETX and Fwd]


MI Cache Controller – States and Events

// STATES
enumeration(State, desc="Cache states") {
  // stable states
  I, desc="Not Present/Invalid";
  M, desc="Modified";

  // transient states
  MI, desc="Modified, issued PUT";
  II, desc="Not Present/Invalid, issued PUT";
  IS, desc="Issued request for IFETCH/GETX";
  IM, desc="Issued request for STORE/ATOMIC";
}

// EVENTS
enumeration(Event, desc="Cache events") {
  // from processor
  Load, desc="Load request from processor";
  Ifetch, desc="Ifetch request from processor";
  Store, desc="Store request from processor";

  Data, desc="Data from network";
  Fwd_GETX, desc="Forward from network";

  Replacement, desc="Replace a block";
  Writeback_Ack, desc="Ack from the directory for a writeback";
  Writeback_Nack, desc="Nack from the directory for a writeback";
}


MI Cache Controller – Network Ports

// NETWORK BUFFERS

MessageBuffer requestFromCache, network="To", virtual_network="0", ordered="true";

MessageBuffer responseFromCache, network="To", virtual_network="1", ordered="true";

MessageBuffer forwardToCache, network="From", virtual_network="2", ordered="true";

MessageBuffer responseToCache, network="From", virtual_network="1", ordered="true";

// NETWORK PORTS

out_port(requestNetwork_out, RequestMsg, requestFromCache);

out_port(responseNetwork_out, ResponseMsg, responseFromCache);

in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) {
  if (forwardRequestNetwork_in.isReady()) {
    peek(forwardRequestNetwork_in, RequestMsg) {
      if (in_msg.Type == CoherenceRequestType:GETX) {
        trigger(Event:Fwd_GETX, in_msg.Address);
      }
      else if (in_msg.Type == CoherenceRequestType:WB_ACK) {
        trigger(Event:Writeback_Ack, in_msg.Address);
      }
      else {
        error("Unexpected message");
      }
    }
  }
}


MI Cache Controller – Structures

// CacheEntry
structure(Entry, desc="...", interface="AbstractCacheEntry") {
  State CacheState, desc="cache state";
  bool Dirty, desc="Is the data dirty (different than memory)?";
  DataBlock DataBlk, desc="data for the block";
}

external_type(CacheMemory) {
  bool cacheAvail(Address);
  Address cacheProbe(Address);
  void allocate(Address);
  void deallocate(Address);
  Entry lookup(Address);
  void changePermission(Address, AccessPermission);
  bool isTagPresent(Address);
}

CacheMemory cacheMemory, template_hack="<L1Cache_Entry>", constructor_hack='L1_CACHE_NUM_SETS_BITS, L1_CACHE_ASSOC, MachineType_L1Cache, int_to_string(i)+"_L1"', abstract_chip_ptr="true";


MI Cache Controller – “Mandatory Queue”

// Mandatory Queue
in_port(mandatoryQueue_in, CacheMsg, mandatoryQueue, desc="...") {
  if (mandatoryQueue_in.isReady()) {
    peek(mandatoryQueue_in, CacheMsg) {
      if (cacheMemory.isTagPresent(in_msg.Address) == false &&
          cacheMemory.cacheAvail(in_msg.Address) == false ) {
        // make room for the block
        trigger(Event:Replacement, cacheMemory.cacheProbe(in_msg.Address));
      }
      else {
        trigger(mandatory_request_type_to_event(in_msg.Type), in_msg.Address);
      }
    }
  }
}
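The decision made by the mandatory queue port can be modelled outside SLICC. The sketch below is a rough Python analogue, not generated Ruby code: `TinyCache` is a hypothetical fully associative stand-in for CacheMemory that evicts its oldest tag, and the request type names (`LD`, `IFETCH`, `ST`) are assumptions chosen for illustration.

```python
class TinyCache:
    """Hypothetical stand-in for CacheMemory: fully associative, oldest tag is the victim."""

    def __init__(self, capacity, tags=()):
        self.capacity = capacity
        self.tags = list(tags)  # oldest first

    def is_tag_present(self, address):
        return address in self.tags

    def cache_avail(self, address):
        return len(self.tags) < self.capacity

    def cache_probe(self, address):
        return self.tags[0]  # address of the block that would be replaced


# Assumed request type names; the real mapping lives in the protocol's helper function.
REQUEST_TO_EVENT = {"LD": "Load", "IFETCH": "Ifetch", "ST": "Store"}


def mandatory_queue_event(cache, request_type, address):
    """Return (event, address) the way the in_port above chooses its trigger."""
    if not cache.is_tag_present(address) and not cache.cache_avail(address):
        # No room for the new block: trigger a replacement on the victim first.
        return ("Replacement", cache.cache_probe(address))
    return (REQUEST_TO_EVENT[request_type], address)


full_cache = TinyCache(capacity=2, tags=[0x40, 0x80])
half_cache = TinyCache(capacity=2, tags=[0x40])
```

Note that a replacement is triggered on the victim's address, not the requested one; the request itself stays in the queue and is examined again on a later wakeup.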


MI Cache Controller – Transitions

transition(I, Store, IM) {
  v_allocateTBE;
  i_allocateL1CacheBlock;
  a_issueRequest;
  m_popMandatoryQueue;
}

transition(IM, Data, M) {
  u_writeDataToCache;
  s_store_hit;
  w_deallocateTBE;
  n_popResponseQueue;
}

transition(M, Fwd_GETX, I) {
  e_sendData;
  o_popForwardedRequestQueue;
}

transition(M, Replacement, MI) {
  v_allocateTBE;
  b_issuePUT;
  x_copyDataFromCacheToTBE;
  h_deallocateL1CacheBlock;
}

Atomic sequence of actions
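One way to picture the transitions above is as a lookup table keyed by (state, event). The sketch below is an illustrative Python model of the four transitions shown, not the C++ that SLICC generates; the function `fire` and the action log are assumptions, and actions are recorded by name rather than executed.

```python
# (current state, event) -> (next state, atomic sequence of actions)
TRANSITIONS = {
    ("I", "Store"): ("IM", ["v_allocateTBE", "i_allocateL1CacheBlock",
                            "a_issueRequest", "m_popMandatoryQueue"]),
    ("IM", "Data"): ("M", ["u_writeDataToCache", "s_store_hit",
                           "w_deallocateTBE", "n_popResponseQueue"]),
    ("M", "Fwd_GETX"): ("I", ["e_sendData", "o_popForwardedRequestQueue"]),
    ("M", "Replacement"): ("MI", ["v_allocateTBE", "b_issuePUT",
                                  "x_copyDataFromCacheToTBE",
                                  "h_deallocateL1CacheBlock"]),
}


def fire(state, event, log):
    """Apply one transition: record all of its actions, then return the next state."""
    if (state, event) not in TRANSITIONS:
        raise KeyError(f"no transition for state {state} on event {event}")
    next_state, actions = TRANSITIONS[(state, event)]
    log.extend(actions)  # the whole sequence runs before any other event is handled
    return next_state


action_log = []
state = "I"
for event in ["Store", "Data", "Fwd_GETX"]:
    state = fire(state, event, action_log)
```

A store to an invalid block walks I, IM, M, and a forwarded GETX then returns the block to I. Any (state, event) pair missing from the table raises an error here.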


MI Cache Controller – Actions

action(a_issueRequest, "a", desc="Issue a request") {
  enqueue(requestNetwork_out, RequestMsg, latency="ISSUE_LATENCY") {
    out_msg.Address := address;
    out_msg.Type := CoherenceRequestType:GETX;
    out_msg.Requestor := machineID;
    out_msg.Destination.add(map_Address_to_Directory(address));
    out_msg.MessageSize := MessageSizeType:Control;
  }
}

action(e_sendData, "e", desc="Send data from cache to requestor") {
  peek(forwardRequestNetwork_in, RequestMsg) {
    enqueue(responseNetwork_out, ResponseMsg, latency="CACHE_RESPONSE_LATENCY") {
      out_msg.Address := address;
      out_msg.Type := CoherenceResponseType:DATA;
      out_msg.Sender := machineID;
      out_msg.Destination.add(in_msg.Requestor);
      out_msg.DataBlk := cacheMemory[address].DataBlk;
      out_msg.MessageSize := MessageSizeType:Response_Data;
    }
  }
}


SLICC-generated HTML tables

• http://www.cs.wisc.edu/gems/MI_example_html/


Testing MI_example

Build Protocol

cd $GEMS_ROOT/ruby
make PROTOCOL=MI_example

Random test
– stresses protocol with simultaneous false-sharing requests
– 16 processors (-p), 10000 requests (-l)

./amd64_linux/generated/MI_example/bin/tester.exec -p 16 -l 10000

Deterministic test with transition trace
– use a trace, requests handled one at a time
– input trace (-z), compressed or non-compressed
– transition debug (-s) starting at cycle 1

./amd64_linux/generated/MI_example/bin/tester.exec -p 16 -z ruby.trace.gz -s 1


Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model

• BREAK

• Opal: Out-of-order processor model
– Overview
– Pipeline
– Example: Load instruction
– Additional Tidbits

• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads


Overview

• What is OPAL?
– Out-of-Order SPARC processor simulator (modeled after MIPS R10K)
– Uses Timing-First design
– Realized as a Simics module, like RUBY
– Does NOT use Simics’ MAI interface

• Goal of this section
– Starting point for hacking Opal

• Learning approaches
– Code review / summarization (using Control Flow Graphs)
– Example: a load instruction
– Analogies to SimpleScalar… pay attention to the differences


Preview: OPAL & Simics

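• Use Opal’s opal0.sim-step command

[Diagram: OPAL pipeline for P0 (fetch, decode, schedule/execute, retire, check) with numbered steps 1-8; IFETCH and LOAD requests go to RUBY and both return HIT; SIMICS performs an instruction step; Phy_mem is also shown]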


Timing-First Simulation [Mauer Sigmetrics 02]

• Timing Simulator (Opal)
– functional execution of user/supervisor operations
– speculative, OoO multiprocessor timing simulation
– does NOT implement full ISA or any devices

• Functional Simulator (Simics)
– full-system multiprocessor simulation
– does NOT model detailed micro-architectural timing

KEY: Reload state if Opal state != Simics state
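The reload rule can be shown with a toy pair of simulators. This is a sketch of the timing-first idea only, under heavy assumptions: the "ISA" is a made-up accumulator update, the injected bug stands in for an instruction the timing simulator does not implement, and none of the names come from Opal or Simics.

```python
def functional_step(state):
    """Reference execution of one made-up instruction: pc += 1, acc += pc."""
    new = dict(state)
    new["pc"] += 1
    new["acc"] += new["pc"]
    return new


def timing_step(state, inject_bug=False):
    """Timing simulator's own functional execution; the bug skips the acc update."""
    new = dict(state)
    new["pc"] += 1
    if not inject_bug:
        new["acc"] += new["pc"]
    return new


def run(num_instructions, buggy_steps):
    timing = {"pc": 0, "acc": 0}
    functional = dict(timing)
    deviations = 0
    for step in range(num_instructions):
        timing = timing_step(timing, inject_bug=step in buggy_steps)
        functional = functional_step(functional)
        if timing != functional:
            # Deviation detected: reload the timing simulator from the functional one.
            deviations += 1
            timing = dict(functional)
    return timing, functional, deviations


final_timing, final_functional, deviation_count = run(10, buggy_steps={3, 7})
```

Without the reload, the first skipped update would leave the accumulator wrong for the rest of the run; with it, each bug costs exactly one counted deviation.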


Measured Deviations

• Less than 20 deviations per 100,000 instructions (0.02%)

• Worst case performance error: 2.4% (assuming deviation latency is pipeline flush)


Opal and UltraSparc

• Functionally simulates 103 of 183 UltraSparc ISA instructions (99.99% of all dynamic instructions in workloads)

• Sample of unimplemented instructions: ARRAY, EDGE, FEXPAND, FMUL8x16, FPADD, FPMERGE, RDSOFTINT, RDSTICK, SHUTDOWN, SIAM, SIR, WRSOFTINT, WRSTICK

• Does not functionally simulate devices or any I/O instructions
– SCSI controllers and disks
– PCI and SBUS interfaces
– interrupt and DMA controllers
– temperature sensors

Correctness type | % error
Functional       | 0
Performance      | 2.4 (worst case)

Slide 57 http://www.cs.wisc.edu/gems

Simulation Control (system.[C h])

system_t::simulate(int instrs)
→ Disable all Simics procs
→ Simulated enough instrs?
  – Yes: return
  – No: Forall seq->advanceCycle() → ruby->advanceTime() → global_cycle++ → check again

Annotations: “Pipeline is modeled here” (on Forall seq->advanceCycle()); “For MP sims: P0’s instrs counted here”
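
The control loop on this slide can be sketched as below. It is a toy model under loud assumptions: `Sequencer` and `MemorySystem` are hypothetical stand-ins for Opal's per-processor sequencer and for Ruby, and the toy sequencer retires exactly one instruction per cycle. Counting only processor 0's instructions follows the slide's note for MP simulations.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for Opal's per-processor sequencer and for Ruby.
struct Sequencer {
    std::uint64_t retired = 0;
    void advanceCycle() { ++retired; }   // toy model: retires 1 instr/cycle
};
struct MemorySystem {
    std::uint64_t time = 0;
    void advanceTime() { ++time; }
};

// Run until processor 0 has retired `instrs` instructions, advancing every
// sequencer and then the memory system once per global cycle.
// Returns the number of global cycles simulated.
std::uint64_t simulate(std::vector<Sequencer>& seqs, MemorySystem& ruby,
                       std::uint64_t instrs) {
    std::uint64_t global_cycle = 0;
    if (seqs.empty()) return global_cycle;
    const std::uint64_t start = seqs[0].retired;
    while (seqs[0].retired - start < instrs) {   // simulated enough instrs?
        for (Sequencer& seq : seqs) seq.advanceCycle();
        ruby.advanceTime();
        ++global_cycle;
    }
    return global_cycle;
}
```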

Slide 58 http://www.cs.wisc.edu/gems

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
– Overview
– Pipeline
– Example: Load instruction
– Additional Tidbits
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads

Slide 59 http://www.cs.wisc.edu/gems

What’s done in a cycle?

• SimpleScalar uses a reverse order, why?

pseq::advanceCycle()
→ FetchInstructions()
→ DecodeInstructions()
→ ScheduleInstructions()
→ Scheduler->execute()
→ RetireInstructions()
→ return

Uses separate queues (finitecycle.h) to record how many instructions are available for each stage.

The order is in fact not important here.
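
One way to see why the call order does not matter is a delay queue between stages: whatever a stage pushes in cycle t only becomes visible in a later cycle, so calling the producer before or after the consumer within the same cycle gives the same result. The class below is a minimal sketch of that idea; `DelayQueue` and its methods are hypothetical names and are not the contents of finitecycle.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical delay queue between two pipeline stages: an entry pushed in
// cycle t becomes visible to the consumer in cycle t + delay (delay >= 1).
class DelayQueue {
public:
    explicit DelayQueue(std::uint64_t delay) : m_delay(delay) {}

    void push(std::uint64_t now, int instr) {
        m_entries.push_back({now + m_delay, instr});
    }

    // Number of entries the consumer may take in cycle `now`.
    std::size_t available(std::uint64_t now) const {
        std::size_t n = 0;
        for (const Entry& e : m_entries) {
            if (e.ready <= now) ++n;
        }
        return n;
    }

    // Remove and return the oldest entry; call only if available(now) > 0.
    int pop() {
        int instr = m_entries.front().instr;
        m_entries.pop_front();
        return instr;
    }

private:
    struct Entry { std::uint64_t ready; int instr; };
    std::uint64_t m_delay;
    std::deque<Entry> m_entries;
};
```

With a delay of at least one cycle, an instruction fetched in cycle 0 is first available to decode in cycle 1, regardless of which stage function ran first in cycle 0.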

Slide 60 http://www.cs.wisc.edu/gems

Pipeline Model (pseq.[C h])

• Instructions stored/tracked in a RUU-like structure (iwindow.[C h])

• Flexible multi-stage pipeline
– Delay modeled with separate queues (finitecycle.h)

• Models fully-pipelined FUs
– Types: CONFIG_ALU_MAPPING
– Number: CONFIG_NUM_ALUS

[Figure: pipeline diagram. Fetch stages (F; depth FETCH_STAGES, width MAX_FETCH) → decode stages (D; depth DECODE_STAGES, width MAX_DECODE) → Sched (MAX_DISPATCH) → functional units FU0, FU1 (MAX_EXECUTE; depth determined by CONFIG_ALU_LATENCY) → retire stages (R; depth RETIRE_STAGES, width MAX_RETIRE)]

Slide 61 http://www.cs.wisc.edu/gems

Instructions ({dynamic,statici,memop,controlop}.[C h] )

[Figure: instruction class diagram]

Dynamic: decoded instr (statici.[C h]), Traps, Registers, Event Times, Seq #, Wait List ptr
– Control (controlop.[C h]): Predicted Addr, Actual Addr, Taken/Not Taken
– Memory (memop.[C h]): Virtual/Phys Addr, LSQ index
– ALU (dynamic.[C h])

Slide 62 http://www.cs.wisc.edu/gems

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
– Overview
– Pipeline
– Example: Load instruction
– Additional Tidbits
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads

Slide 63 http://www.cs.wisc.edu/gems

Fetch

Is Fetch Ready?
– No: Stall fetch
– Yes: Address Translation, then I-TLB Miss?
  • Yes: Emit NOP/Stall Fetch
  • No: Read instruction: pseq::getInstr()
    → Invoke Ruby to simulate Ifetch timing
    → Create Dynamic Instr (load_inst_t)

Slide 64 http://www.cs.wisc.edu/gems

Decode

Get load instr from instr window
→ dynamic_inst_t::decode()
  – Get current source operand mappings: arf::readDecodeMap() (regmap.[C h], arf.[C h])
  – Rename dest reg: arf::allocateRegister() (regmap.[C h], arf.[C h])
→ Insert decoded load inst in decode queue

Slide 65 http://www.cs.wisc.edu/gems

Schedule

Get load instr from instr window
→ Exceeded scheduling window?
  – Yes: Stop scheduling
  – No: TestSourceReadiness()
    • Source not ready: WAIT_XX_STAGE; Wakeup when source is ready
    • All sources ready?
      Yes: Scheduler->schedule()
      No: back to TestSourceReadiness()

Slide 66 http://www.cs.wisc.edu/gems

Execute

Read port avail?
– No: reschedule
– Yes: D-TLB address translate (memory_inst_t::addresstranslate())

TLB Miss?
– Yes: Raise TLB miss exception
– No: Invoke Ruby to simulate load timing (rubycache_t::access())

Cache Miss?
– Yes: CACHE_MISS_STAGE, then pseq->complete()
– No: Read value from Simics memory (pseq->readPhysicalMemory())

Slide 67 http://www.cs.wisc.edu/gems

Retire

Get completed LD inst
→ Traps?
  – Yes: takeTrap() (set trap state, squash pipeline)
  – No: Step Simics (pseq->advanceSimics())
→ checkCriticalState(): (PC, NPC, regs)
  – FAIL: FullSquash() (reload state & refetch from instr following LD)
  – Match: continue
→ checkChangedState() (verify load value; memory consistency)
  – FAIL: FullSquash() (reload state & refetch from instr following LD)
  – Match: Retire LD

Slide 68 http://www.cs.wisc.edu/gems

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
– Overview
– Pipeline
– Example: Load instruction
– Additional Tidbits
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads

Slide 69 http://www.cs.wisc.edu/gems

Opal-Ruby Interface

[Figure: asynchronous LD request across the Opal-Ruby interface, steps numbered 1 to 8]

OPAL
– load_inst_t::Execute(), Complete()
– rubycache_t: access(), complete()
– system_t: rubyCompletedRequest()
– pseq_t: completedRequest()

RUBY
– OpalInterface: isReady(), makeRequest(), hitCallback()

Slide 70 http://www.cs.wisc.edu/gems

Branch Prediction

pseq_t::createInstruction {
  …
  s_instr->nextPC()
  …
}
→ dynamic_inst_t::nextPC_call(), nextPC_predicated_branch(), nextPC_predict_branch(), nextPC_indirect()
→ Predict()

Branch predictor (fetch/{yags.[C h], …}):
  Predict()
  Update()

Controlop_t::Execute() { (check prediction and flush if mispredict) }

Retire() {
  …
  Bpred->Update()
  …
}
→ Update()

Slide 71 http://www.cs.wisc.edu/gems

Common Config Parameters

Processor Width: MAX_FETCH, MAX_DECODE, MAX_DISPATCH, MAX_EXECUTE, MAX_RETIRE

Pipeline Stages: FETCH_STAGES, DECODE_STAGES, RETIRE_STAGES

Register File Sizes: CONFIG_IREG_PHYSICAL (int), CONFIG_FPREG_PHYSICAL (fp), CONFIG_CCREG_PHYSICAL (cond code)

ROB Size: IWINDOW_ROB_SIZE

Scheduling Window Size: IWINDOW_WIN_SIZE

Slide 72 http://www.cs.wisc.edu/gems

Opal : Present and Future

• Implements Sparc instructions
– Simulating additional Sparc instructions easy task
– Porting to x86 substantial code rewrite

• Simulates timing of weaker memory consistency models
– Add SC checks in Opal
– Add write buffers for weaker models (like TSO)

• No functional simulation of I/O
– Plug in disk simulator that interacts with Opal

• Not currently using MAI interface
– Possible to replace Opal w/ MAI module that interacts with Ruby

• Aggressive micro-architectural techniques not modeled
– Add support for trace caches, mem. dependence pred., etc.

Slide 73 http://www.cs.wisc.edu/gems

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
– Breakdown network stats
– Example: Network contention with and without Opal
– Simulation runtimes
• GEMS Source Code Tour and Extending Ruby
• Building Workloads

Slide 74 http://www.cs.wisc.edu/gems

Breaking Down Ruby Stats Files

• Ruby system config print
– Values of all ruby config parameters

• Overall runtime
– Target and host machine runtimes, IPC, etc.

• Cache profiling: L1I, L1D, L2…etc.

• Structure occupancy
– Demand for cache ports, transaction buffers

• Latency breakdown
• Request vs. system state (optional)
• Message delay cycles (optional)
• Network stats
– Link and switch utilization

• CC event / transition counts

[Figure: layout of <system_config>.stats, with sections in the order Ruby config, Overall runtime, Cache profiling, Structure occupancy, Latency breakdown, Request vs. system state, Message delay cycles, Network stats, Event / transition counts]

Demo

Slide 75 http://www.cs.wisc.edu/gems

Two GEMS are Better than One

• Network behavior with and without Opal
• 8 processor CMP
• SPLASH benchmark: ocean
• 8 byte-wide links between CPUs & L2 cache banks
• Two runs using a customized network

1. Ruby only
• Allows only one request per processor
• Maximum 8 outstanding requests
• Low network utilization
• Little network contention

2. Ruby & Opal
• Allows multiple outstanding requests
• Maximum 128 outstanding requests
• Higher network utilization
• Noticeable network contention

Demo

Slide 76 http://www.cs.wisc.edu/gems

Two GEMS are Better than One

Ruby Only

Demo

Message Delayed Cycles
----------------------

Total_delay_cycles: [binsize: 16 max: 553 count: 22892759 average: 0.534205 | standard deviation: 4.18656 | 22855760 20077 1945 325 309 175 105 3935 7681 518 338 254 397 273 166 130 142 33 41 25 26 29 15 10 10 2 0 2 0 4 4 10 7 6 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]

Network Stats
-------------

links_utilized_percent_switch_0_link_3: 4.38966 bw: 8000 base_latency: 1

links_utilized_percent_switch_0_link_4: 4.36838 bw: 8000 base_latency: 1

Ruby_cycles: 41361869

Slide 77 http://www.cs.wisc.edu/gems

Two GEMS are Better than One

Ruby & Opal (Ruby-only values in parentheses)

Demo

Message Delayed Cycles
----------------------

Total_delay_cycles: [binsize: 16 max: 703 count: 22893122 average: 1.35992 (0.534205) | standard deviation: 6.55126 | 22608266 220366 29575 9084 4686 3248 2009 1687 6018 1798 1143 828 625 516 384 272 271 288 398 319 299 228 203 161 92 51 41 26 12 9 30 39 48 43 25 20 3 0 0 1 0 2 4 4 0 0 0 0 0 0 ]

Network Stats
-------------

links_utilized_percent_switch_0_link_3: 7.81863 (4.38966) bw: 8000 base_latency: 1

links_utilized_percent_switch_0_link_4: 7.64388 (4.36838) bw: 8000 base_latency: 1

Ruby_cycles: 72550169 (41361869)

Slide 78 http://www.cs.wisc.edu/gems

Simulation Time Comparison

• Comparisons of Runtimes
– Progressively add more simulation fidelity
  • Simics only
  • Simics + Ruby
  • Simics + Ruby + Opal
– Accuracy vs. simulation time tradeoff

• Target Machine
– 8 UltraSPARC™ III processor SMP (1 GHz)
– 4 GB of memory

• Host Machine
– AMD Opteron™ uniprocessor (2.2 GHz)
– 4 GB of memory

Slide 79 http://www.cs.wisc.edu/gems

Simulation Slowdown

Workload: 2000 JBB Transactions

                        Time          Slowdown     Slowdown / CPU
Target                  20 ms         1            1
Simics                  1 minute      3000 x       380 x
Simics + Ruby           15 minutes    45000 x      5600 x
Simics + Ruby + Opal    45 minutes    140000 x     17000 x

CAVEAT: These performance numbers may not reflect the optimal configuration of Virtutech Simics. For example, running Simics in “fast mode” (or emulation-only mode) can reduce the slowdown (per CPU) of Simics, compared to real hardware, to less than 10x.

Slide 80 http://www.cs.wisc.edu/gems

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
– GEMS software structure
– Directory Tour
– Demo: Extending Ruby and a CMP Protocol
• Building Workloads

http://www.cs.wisc.edu/gems

GEMS Software Structure

[Figure: GEMS software structure; the legend distinguishes components with multiple instantiations from those with one instantiation]

System
– Driver (common/Driver.h)
  • Internal Drivers
    Deterministic Tester (tester/DeterministicDriver.h)
    Contended Locks (tester/SyntheticDriver.h)
    Random Tester (tester/Tester.h)
  • Simics Interface (simics/SimicsInterface.h)
  • Opal Interface (interface/OpalInterface.h)
– Chip (generated/<protocol>/Chip.h)
– Profiler (profiler/Profiler.h)
– Network (network/simple)
  • Topology (network/simple/Topology.h)

http://www.cs.wisc.edu/gems

Ruby Software Structure

[Figure: Ruby software structure; the legend distinguishes SLICC-generated code from Ruby code]

Chip (generated/<protocol>/Chip.h; SLICC)
– Sequencer (system/Sequencer.h)
– Caches (system/CacheMemory.h)
  • Cache Line (generated/<protocol>/L1Cache_Entry.h)
– Directory (system/DirectoryMemory.h)
  • Directory State (generated/<protocol>/Directory_Entry.h)
– Cache Controllers and Directory Controller (SLICC)
  • generated/<protocol>/L1Cache_Controller.h
  • generated/<protocol>/L2Cache_Controller.h
  • generated/<protocol>/Directory_Controller.h
– Network Ports (buffer/MessageBuffer.h)

Slide 83 http://www.cs.wisc.edu/gems

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
– GEMS software structure
– Directory Tour
– Demo: Extending Ruby and a CMP Protocol
• Building Workloads

Slide 84 http://www.cs.wisc.edu/gems

Map of Directories: Top-Level

Top-Level Directory

ruby: Memory System Components
opal: Processor Components
slicc: Generator Code
protocols: Protocol Specification Files
common: Common GEMS C++ code
gen-scripts: Generated Simics Interface Scripts
scripts: Common GEMS scripts
microbenchmarks: Separate Microbenchmark Executables
results: Simulation Output
LICENSE, KNOWN_ISSUES, README

Slide 85 http://www.cs.wisc.edu/gems

Map of Directories: ruby

ruby

buffers: MessageBuffer between consumers
common: Common Ruby C++ structs
config: Ruby config files for module and tester
eventqueue: Global eventqueue
interfaces: Ruby → Opal & Simics
module: The ruby simics module
network: Simple network code
profiler: Profiling code
recorder: cache and trace recorders
simics: Simics → Ruby
slicc_interface: Abstract classes interface with different protocols
system: Physical memory components
tester: Random tester & ubenchmarks
platform: Object files & executables
generated: SLICC generated C++ files
html: Protocol tables
ruby.trace.gz: Example trace file
init.h/.C: Ruby initializer & destroyer
README.debugging: Ruby debug flag info
Makefile

Slide 86 http://www.cs.wisc.edu/gems

Map of Directories: ruby/system

ruby/system

CacheMemory.h: cache template data structure
DirectoryMemory.h/C: memory data structure
MachineID.h, NodeID.h: identifier objects; one identifies a unique chip or machine instantiation, the other uniquely identifies all ruby machines
NodePersistentTable.h/C: specific to token protocol
PerfectCacheMemory.h: a fully associative, unbounded cache memory template
PersistentArbiter.h/C: specific to token protocol
PersistentTable.h/C: specific to token protocol
Sequencer.h/C: manages memory requests between the driver and L1 cache controller
StoreBuffer.h/C: used to simulate TSO-like timing
StoreCache.h/C: used to simulate TSO-like timing
System.h/C: top-level object of the ruby memory system; all ruby objects can be accessed via the g_system_ptr
TBETable.h/C: transaction buffer entry table used by cache controllers for transient requests
TimerTable.h: specific to token protocol

Slide 87 http://www.cs.wisc.edu/gems

Map of Directories: ruby/slicc_interface

ruby/slicc_interface

AbstractCacheEntry.h/C: ruby abstract class for the protocol specific cache entries
AbstractChip.h/C: ruby abstract class for the protocol specific chip object
AbstractProtocol.h/C: contains booleans to define protocol characteristics to ruby
Message.h: parent class of all messages communicated between consumers via MessageBuffers
NetworkMessage.h: parent class of all network messages; each protocol implements unique network message objects to communicate between controllers
RubySlicc_ComponentMapping.h: all address manipulation to determine location and set mapping is here
RubySlicc_Profiler_interface.h/C: interface between the generated protocol logic and the ruby profiler code
RubySlicc_Util.h: miscellaneous ruby functions used by the generated controllers
RubySlicc_includes.h: wrapper for the RubySlicc interface files

Slide 88 http://www.cs.wisc.edu/gems

Map of Directories: slicc

slicc

ast: Abstract Syntax Tree code
doc: contains some old but useful documentation
parser: contains the lexer and parser that construct a protocol’s AST
symbols: contains SLICC objects created during the first pass of the AST; majority of code generated by these symbols
generator: file, html and MIF generator code
platform: Object files & executables
generated: generated lexer and parser files
main.h/C: main function of the SLICC executable
slicc_global.h: defines typedef, namespaces, etc.
README: Summary of how SLICC works
Makefile: Makefile for the SLICC code generator executable

Slide 89 http://www.cs.wisc.edu/gems

Map of Directories: opal

opal

benchmark: Micro-architecture benchmarks
bypassing: Misc. proc structs
common: Global Opal structs
config: Module and tester config files
design: Helpful informal design docs
fetch: Predictors (branch, Trap, RAS)
module: Code for Opal module
python: Misc test and graphing scripts
regression: Golden results for tester
sparc: Implementation-specific defines
system: Pipeline model
tester: Opal tester files
trace: Files for branch, memory traces
platform: Object files & executables
generated: Files for parsing config params
TODO: Todo wish list
README: Describes building & running Opal
README.memory_consistency: Opal handling of mem. consistency
Makefile

Slide 90 http://www.cs.wisc.edu/gems

Map of Directories: opal/system (1)

opal/system

actor.[C h]: General micro-arch. structure class
arf.[C h]: Register file interface
cache.[C h]: Opal’s built-in cache structures
chain.[C h]: Used to analyze mem dependencies
checkresult.h: Structs used in validation w/ Simics
config.include: Type defines for config params
controlop.[C h]: Branch instr type class
decode.[C h]: Per opcode stats collector class
dtlb.[C h]: TLB implementation for stand-alone sims
dx.[C h i]: Code for execution of dynamic instrs
dynamic.[C h]: Top-level class for all dynamic instrs
flatarf.[C h]: Non-renamed register file interface
flow.[C h]: CFG class
hfa.C: Opal-Simics interface
hfa_init.h: Opal-Simics interface externs
histogram.[C h]: Histogram stats class

Slide 91 http://www.cs.wisc.edu/gems

Map of Directories: opal/system (2)

opal/system

ipage.[C h]: Instruction page class
ipagemap.[C h]: Instruction page cache class
iwindow.[C h]: RUU-like struct for storing/tracking instrs
ix.[C h]: Code to execute CFG instructions
lockstat.[C h]: Stats on locks in system
lsq.[C h]: LSQ structure
memop.[C h]: Memory instr class
memstat.[C h]: Memory addr stats class
mf_api.h: Symlink to Opal-Ruby interface
mshr.[C h]: MSHR structure (used in Opal cache hierarchy only)
pipepool.[C h]: Wait-list object. Used to model MSHR when running w/ Ruby
pipestate.[C h]: Single waiter object for pipepool
pseq.[C h]: Top-level proc sequencer
pstate.[C h]: Functions used for API calls to Simics
ptrace.[C h]: Used for analyzing memory traces
regbox.[C h]: Contains interface ptrs to registers

Slide 92 http://www.cs.wisc.edu/gems

Map of Directories: opal/system (3)

opal/system

regfile.[C h]: Models the register file itself
regmap.[C h]: Rename map structure
rubycache.[C h]: Handles all Opal-Ruby memory transactions
scheduler.[C h]: Global event queue
simdist12.C: Dummy Simics functions for tester
sparx.C: Several includes
sstat.[C h]: Stats class for static insts
statici.[C h]: Decoded instr class
stopwatch.[C h]: Timer class, used to collect time stats
sysstat.[C h]: Stats class for dynamic insts
system.[C h]: Top-level class for manipulating sim
threadstat.[C h]: Stats class for tracking per-thread stats
wait.[C h]: Wait-list object for dynamic insts

Slide 93 http://www.cs.wisc.edu/gems

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
– GEMS software structure
– Directory Tour
– Demo: Extending Ruby and a CMP Protocol
• Building Workloads

Slide 94 http://www.cs.wisc.edu/gems

Extending Ruby

• Goal:
– Add new functionality to Ruby and interface to SLICC

• DemoPrefetcher
– Simple, L2->memory next-line prefetcher
– Module implemented as C++ object (DemoPrefetcher.C)
– New type added to SLICC
– Observes L1 GETS requests via function call
– Triggers event for prefetch in next cycle
  • Object is connected to an in_port
– Not the only way (or the right way) of implementing a prefetcher

Demo

Slide 95 http://www.cs.wisc.edu/gems

Implementing DemoPrefetcher

• Creating an object that can “wakeup” a controller

DemoPrefetcher.h

class DemoPrefetcher {
public:
  // An object in a SLICC controller will be passed a Chip*
  DemoPrefetcher(Chip* chip_ptr);

  // Allow an in_port to be attached
  void setConsumer(Consumer* consumer_ptr) { m_consumer_ptr = consumer_ptr; }

  // When wakeup() is called, ensure it should do something
  bool isReady() const;

  // functions to implement simple next-line prefetching
  const Address& popNextPrefetch();
  const Address& peekNextPrefetch() const;
  void cancelNextPrefetch();
  void observeL1Request(const Address& address);

  // (the rest of the class is not shown on the slide)
};

Demo

Slide 96 http://www.cs.wisc.edu/gems

Implementing DemoPrefetcher

DemoPrefetcher.C

void DemoPrefetcher::observeL1Request(const Address& address)
{
  // next-line prefetch address
  Address prefetch_addr = address;
  prefetch_addr.makeNextStrideAddress(1);

  // add to prefetch queue
  m_prefetch_queue.push( prefetch_addr );

  // when to wakeup-- choose 1 cycle later
  Time ready_time = g_eventQueue_ptr->getTime() + 1;

  // schedule a wakeup() so that the L2 controller can trigger
  g_eventQueue_ptr->scheduleEventAbsolute(m_consumer_ptr, ready_time);
}
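
To make the "next-line" step concrete, the address arithmetic can be sketched on plain integers. This is an assumption-laden illustration rather than Ruby's Address class: `next_stride_line` is a hypothetical helper, and the 64-byte default line size is an assumed value, not one taken from a GEMS configuration.

```cpp
#include <cstdint>

// Hypothetical helper: return the address of the cache line that is
// `stride` lines after the line containing `addr`.
// `line_bytes` must be a power of two (64 is an assumed default).
std::uint64_t next_stride_line(std::uint64_t addr, std::int64_t stride,
                               std::uint64_t line_bytes = 64) {
    const std::uint64_t line_base = addr & ~(line_bytes - 1);  // align down
    return line_base + static_cast<std::uint64_t>(stride) * line_bytes;
}
```

With 64-byte lines, a request to 0x1004 would prefetch the line at 0x1040.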

Demo

Slide 97 http://www.cs.wisc.edu/gems

Interfacing DemoPrefetcher to SLICC

external_type(DemoPrefetcher, inport="yes") {
  bool isReady();
  Address popNextPrefetch();
  void cancelNextPrefetch();
  Address peekNextPrefetch();
  void observeL1Request(Address);
}

DemoPrefetcher prefetcher;

// wakeup logic
in_port(prefetcher_in, Null, prefetcher) {
  if (prefetcher_in.isReady()) {
    if (L2cacheMemory.cacheAvail(prefetcher.peekNextPrefetch()) ||
        L2cacheMemory.isTagPresent(prefetcher.peekNextPrefetch())) {
      if (getState(prefetcher.peekNextPrefetch()) == State:I ||
          getState(prefetcher.peekNextPrefetch()) == State:NP) {
        trigger(Event:Prefetch, prefetcher.popNextPrefetch());
      } else {
        // tag is already present in a non-invalid state
        prefetcher.cancelNextPrefetch();
      }
    } else {
      trigger(Event:L2_Replacement,
              L2cacheMemory.cacheProbe(prefetcher.peekNextPrefetch()));
    }
  }
}

Demo

Slide 98 http://www.cs.wisc.edu/gems

Implementing DemoPrefetcher

• Nice property of TokenCMP: no tracking of prefetch
– A tag is allocated and a request issued to memory
– keeps received tokens/data if tag allocated

MOESI_CMP_tokenDEMO-L2cache.sm

transition(NP, Prefetch, I) {
  vv_allocateL2CacheBlock;
  a_issuePrefetch;
}

transition(I, Prefetch) {
  a_issuePrefetch;
}

transition({S,O,M,I_L,S_L}, Prefetch) {
  // do nothing
}

Demo

Slide 99 http://www.cs.wisc.edu/gems

Outline

• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads

Slide 100 http://www.cs.wisc.edu/gems

Workloads for Simics/GEMS

• Unfortunately, we cannot release our workloads (legal reasons)

• Steps for Workload Development
– Simple Example: Barnes-Hut
– What about more complex applications?

• Workload Simulation Methodology
– Simulating transactions/requests
– Coping with workload variability

Slide 101 http://www.cs.wisc.edu/gems

Workload Setup

• Simple Example: Barnes-Hut (Splash2 suite)
– Commands not to be taken literally! (might be different in different versions)

• Main Steps:
– Build OS checkpoint
– Copy application source or binary to simulation
– Create initial (cold) application checkpoint in Simics
– Create warm application checkpoint with Simics/Ruby

Slide 102 http://www.cs.wisc.edu/gems

Build OS Checkpoint

• Use Simics to boot your OS and get a checkpoint (assuming 16 processor serengeti target machine)
– cd simics/home/sarek
– ./simics -x sarek-16p.simics
  • Script loads configuration and boots Solaris
  • Scripts should be provided with your Simics distribution, assuming you have a Solaris license (contact Virtutech Simics Forum)
  • Modify scripts to fit your target configuration (e.g., memory, disk, network)
– At the end of your script, take a system snapshot (checkpoint):
  simics> write-configuration CHKPT_DIR/sarek-16p.check
  simics> quit
– Use this checkpoint to build all your workloads’ 16 processor checkpoints

Slide 103 http://www.cs.wisc.edu/gems

Copy Barnes Source or Binary

• Develop benchmark on real machine (if available)
– Use Simics “magic” instructions after initialization
  • See Simics reference manual for magic instruction use
– Compile benchmark with such instructions before running in Simics

• Load from your OS checkpoint
– ./simics
  simics> read-configuration CHKPT_DIR/sarek-16p.check
  simics> magic-break-enable

• Copy binary into simulated machine (or copy source and compile)
– Console commands:
  mount /host
  cp -r /host/workloads/splash2/codes/apps/barnes/BARNES .
  • See Simics reference manual on the use of the /host filesystem

Slide 104 http://www.cs.wisc.edu/gems

Obtain Initial Barnes Checkpoint

• Warm up application in Simics
– Console Commands:
  ./BARNES < input-warm
  • input-warm specifies Barnes parameters
  ./BARNES < input-warm
  • Use this second run to warm up cache (see next slide)
  ./BARNES < input-run > output; magic_call break

• After initial run, write checkpoint
  simics> write-configuration CHKPT_DIR/barnes-cold-16p.check
  simics> quit

• Checkpoint is ready for GEMS run

Slide 105 http://www.cs.wisc.edu/gems

Obtain Warm Barnes Checkpoint

• Load initial checkpoint
– setenv CHECKPOINT_AT_END yes
– setenv TRANSACTIONS 1
– setenv PROCESSORS 16
– setenv CHECKPOINT CHKPT_DIR/barnes-cold-16p.check
– ./simics -no-win -x GEMS_ROOT/gen-scripts/go.simics

• Script (provided in release) should load ruby and run till the end of the warmup run
– Also writes checkpoint at the end
  • Edit checkpoint to remove ruby object
– Modify script to suit your needs

Slide 106 http://www.cs.wisc.edu/gems

What About More Complex Applications?

• Setup on real hardware
– Tune workload, OS parameters
– Scale-down for PC memory limits
– Re-tune
– For details, [Alameldeen et al., IEEE Computer, Feb’03]

• What if we don’t have access to real hardware?
– Install applications and setup in Simics
– Checkpoint often
– Not optimal for large scale applications!


Slide 107 http://www.cs.wisc.edu/gems

Simulating Transactions/Requests

• Throughput-based applications

– Work-based unit to compare configurations

– IPC not always meaningful

• Counting Transactions during Simulation

– Enable magic breaks in Simics

– Benchmark traps to Simics on every magic instruction

– Count magic breaks until we reach required number of transactions

– Cope with benchmark variability
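The counting scheme above can be sketched in a few lines. This is a minimal, simulator-independent illustration: the names `TransactionCounter`, `on_magic_break`, and `run_until_done` are hypothetical and are not part of Simics or GEMS, and the magic-break events are modeled as a plain Python iterable rather than a real Simics callback.

```python
class TransactionCounter:
    """Counts magic breaks and reports when the target is reached.

    Hypothetical helper for illustration; not a Simics or GEMS class.
    """

    def __init__(self, target_transactions):
        if target_transactions <= 0:
            raise ValueError("target_transactions must be positive")
        self.target = target_transactions
        self.completed = 0

    def on_magic_break(self):
        """Call once per magic break; returns True when the run should stop."""
        self.completed += 1
        return self.completed >= self.target


def run_until_done(magic_break_events, target_transactions):
    """Consume magic-break events until the target count is reached.

    Returns the number of events consumed, which is smaller than the
    target if the event stream ends early.
    """
    counter = TransactionCounter(target_transactions)
    consumed = 0
    for _ in magic_break_events:
        consumed += 1
        if counter.on_magic_break():
            break
    return consumed
```

In a real run the target would presumably come from a setting such as the TRANSACTIONS variable shown earlier, and the stop condition would end the timing simulation instead of breaking out of a loop.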


Slide 108 http://www.cs.wisc.edu/gems

Why Consider Variability?

OLTP


Slide 109 http://www.cs.wisc.edu/gems

Workload Variability

• How can slower memory lead to faster workload?

• Answer: Multithreaded workload takes different paths

– Different lock race outcomes

– Different scheduling decisions

→ Runs from same initial conditions can be different

This can lead to wrong conclusions for deterministic simulations

• Solution with deterministic simulation

– Add pseudo-random delay on memory accesses (MEMORY_LATENCY)

– Simulate base (and enhanced) system multiple times

– Use simple or complex statistics [Alameldeen and Wood, HPCA 2003]
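A rough sketch of the "perturb, repeat, then compare" recipe follows. It is an assumption-laden illustration, not the GEMS implementation: `perturbed_latency`, `summarize`, and `intervals_overlap` are hypothetical names, the perturbation is a uniform integer offset, and the confidence interval uses a normal approximation (z = 1.96), which is optimistic for a small number of runs.

```python
import random
import statistics


def perturbed_latency(base_latency, max_extra, rng):
    """Return the base memory latency plus 0..max_extra pseudo-random cycles."""
    return base_latency + rng.randint(0, max_extra)


def summarize(samples, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    Uses a normal approximation; with few runs a t-distribution would
    give a wider (more honest) interval.
    """
    if len(samples) < 2:
        raise ValueError("need at least two runs")
    mean = statistics.mean(samples)
    stderr = statistics.stdev(samples) / len(samples) ** 0.5
    return mean, z * stderr


def intervals_overlap(base_runs, enhanced_runs):
    """True if the two confidence intervals overlap (difference not clear)."""
    mean_a, half_a = summarize(base_runs)
    mean_b, half_b = summarize(enhanced_runs)
    return abs(mean_a - mean_b) <= half_a + half_b
```

Each sample would be one simulation's result (for example, cycles per transaction) obtained with a different random seed. If the intervals for the base and enhanced systems overlap, a single-run comparison between them could easily point the wrong way.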


Slide 110 http://www.cs.wisc.edu/gems

The End

• Download and Subscribe to Mailing Lists

http://www.cs.wisc.edu/gems

• We encourage your contributions

– Workloads

– Additional timing fidelity


Slide 111 http://www.cs.wisc.edu/gems

Additional Opal Slides


Slide 112 http://www.cs.wisc.edu/gems

Sensitivity Analysis


Slide 113 http://www.cs.wisc.edu/gems

Sensitivity Results


Slide 114 http://www.cs.wisc.edu/gems

Opal and Memory Consistency

• Designed to be aggressive OoO processor

• Our use of Simics is sequentially consistent execution

• Models the performance of weaker models (such as TSO) for only SC memory interleavings

• Violations of SC in Opal:

– Identical MSHR entry for memory requests with same addr

– Executes Ld/St out of program order

– No snooping of LSQ for external stores

Slide 115 http://www.cs.wisc.edu/gems

Implemented UltraSparc Instructions (1)

add addc addcc addccc alignaddr alignaddrl and andcc andn andncc ba bcc bcs be bg bge bgu bl ble bleu bmask bn bne bneg bpa bpcc bpcs

bpe bpg bpge bpgu bpl bple bpleu bpn bpne bpneg bpos bppos bpvc bpvs brgez brgz brlez brlz brnz brz bshuffle bvc bvs call casa casxa cmp done retry fabsd fabsq fabss

faddd faddq fadds faligndata fba fbe fbg fbge fbl fble fblg fbn fbne fbo fbpa fbpe fbpg fbpge fbpl fbple fbplg fbpn fbpne fbpo fbpu fbpue fbpug fbpuge fbpul fbpule fbu fbue fbug

fbuge fbul fbule fcmpd fcmped fcmpeq fcmpeq16 fcmpeq32 fcmpes fcmpgt16 fcmpgt32 fcmple16 fcmple32 fcmpne16 fcmpne32 fcmpq fcmps fdivd fdivq fdivs fdmulq fdtoi fdtoq fdtos fdtox fitod fitoq fitos flush flushw fmovd fmovda fmovdcc fmovdcs fmovde

fmovdg fmovdge fmovdgu fmovdl fmovdle fmovdleu fmovdn fmovdne fmovdneg fmovdpos fmovdvc fmovdvs fmovfda fmovfde fmovfdg fmovfdge fmovfdl fmovfdle fmovfdlg fmovfdn fmovfdne fmovfdo fmovfdu fmovfdue fmovfdug fmovfduge fmovfdul fmovfdule fmovfqa fmovfqe fmovfqg fmovfqge fmovfql fmovfqle

fmovfqlg fmovfqn fmovfqne fmovfqo fmovfqu fmovfque fmovfqug fmovfquge fmovfqul fmovfqule fmovfsa fmovfse fmovfsg fmovfsge fmovfsl fmovfsle fmovfslg fmovfsn fmovfsne fmovfso fmovfsu fmovfsue fmovfsug fmovfsuge fmovfsul fmovfsule fmovq fmovqa fmovqcc fmovqcs fmovqe fmovqg fmovqge fmovqgu

fmovql fmovqle fmovqleu fmovqn fmovqne fmovqneg fmovqpos fmovqvc fmovqvs fmovrdgez fmovrdgz fmovrdlez fmovrdlz fmovrdnz fmovrdz fmovrqgez fmovrqgz fmovrqlez fmovrqlz fmovrqnz fmovrqz fmovrsgez fmovrsgz fmovrslez fmovrslz fmovrsnz fmovrsz fmovs fmovsa fmovscc fmovscs fmovse fmovsg fmovsge

fmovsgu fmovsl fmovsle fmovsleu fmovsn fmovsne fmovsneg fmovspos fmovsvc fmovsvs fmuld fmulq fmuls fnegd fnegq fnegs fqtod fqtoi fqtos fqtox fsmuld fsqrtd fsqrtq fsqrts fsrc1 fstod fstoi fstoq fstox fsubd fsubq fsubs fxtod fxtoq


Slide 116 http://www.cs.wisc.edu/gems

Implemented UltraSparc Instructions (2)

fxtos fzero fzeros ill impdep1 impdep2 jmp jmpl ldblk ldd ldda lddf lddfa ldf ldfa ldfsr ldqa ldqf ldqfa ldsb ldsba ldsh ldsha ldstub ldstuba ldsw ldswa ldub lduba lduh lduha lduw lduwa ldx

ldxa ldxfsr membar mov mova movcc movcs move movfa movfe movfg movfge movfl movfle movflg movfn movfne movfo movfu movfue movfug movfuge movful movfule movg movge movgu movl movle movleu movn movne movneg movpos

movrgez movrgz movrlez movrlz movrnz movrz movvc movvs mulscc mulx nop not or orcc orn orncc popc prefetch prefetcha rd rdcc rdpr restore restored retrn save saved sdiv sdivcc sdivx sethi sll sllx smul

smulcc sra srax srl srlx stb stba stbar stblk std stda stdf stdfa stf stfa stfsr sth stha stqf stqfa stw stwa stx stxa stxfsr sub subc subcc subccc swap swapa ta taddcc taddcctv

tcc tcs te tg tge tgu tl tle tleu tn tne tneg tpos trap tsubcc tsubcctv tvc tvs udiv udivcc udivx umul umulcc wr wrcc wrpr xnor xnorcc xor xorcc

Slide 117 http://www.cs.wisc.edu/gems

TLB Misses

• ITLB Misses

– emit special NOP instruction: STATIC_INSTR_MOP; stall fetch

– does NOT update PC, NPC

– fetch resumes whenever any instr (including special NOP) squashes

• DTLB Misses

– Set DTLB miss trap for instruction (setTrapType()) in Execute()

– In retireInstruction(), retrieve trap and call takeTrap() to set trap state for DTLB handler

– refetch from DTLB handler

Slide 118 http://www.cs.wisc.edu/gems

Example: Load instruction

• In dynamic_t::Schedule(), load waits until all operands ready (WAIT_XX_STAGE cases)

• Scheduler gets invoked when all operands ready

• Load waits until read port to L1 is available

• load_inst_t::Execute() gets called

– Generates virtual address

– Performs D-TLB address translation

– Inserts entry in LSQ

– Initiates cache access (via Ruby or Opal’s built-in simple cache hierarchy)

– If cache miss -> put on wait list (CACHE_MISS_STAGE) and is woken up by rubycache_t::complete()

• Invokes Simics to read actual memory value in load_inst_t::Complete()

• Retirement check of load value & squash if value deviates from Simics
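The sequence above can be summarized as a small trace function. This is only a schematic model of the ordering described on the slide: the function `load_stages` and its step labels are hypothetical, and none of Opal's actual scheduling, LSQ, or cache logic is reproduced here.

```python
def load_stages(operands_ready, l1_port_free, cache_hit, value_matches_simics):
    """Return the ordered steps one load takes in this simplified model.

    A load that is still waiting returns a trace ending in a "wait" step.
    """
    trace = []
    if not operands_ready:
        # Corresponds to the WAIT_XX_STAGE cases in dynamic_t::Schedule().
        trace.append("wait: operands")
        return trace
    if not l1_port_free:
        trace.append("wait: L1 read port")
        return trace

    # Work done in load_inst_t::Execute(), in slide order.
    trace.append("execute: generate virtual address")
    trace.append("execute: D-TLB translation")
    trace.append("execute: insert in LSQ")
    trace.append("execute: initiate cache access")

    if not cache_hit:
        # CACHE_MISS_STAGE until rubycache_t::complete() wakes the load.
        trace.append("wait: cache miss")
        trace.append("wakeup: cache access complete")

    # load_inst_t::Complete() reads the functional value from Simics.
    trace.append("complete: read value from Simics")

    # Retirement compares the load value against Simics.
    trace.append("retire" if value_matches_simics else "squash")
    return trace
```

For example, `load_stages(True, True, False, True)` yields a trace that passes through the cache-miss wait before completing and retiring.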


Slide 119 http://www.cs.wisc.edu/gems

Modifying Opal-Ruby Interface

• Ruby->Opal interface defined in mf_opal_api object (ruby/interfaces/mf_api.h)

• Opal->Ruby interface defined in mf_ruby_api object

• To create new Ruby->Opal callback (ex: hitCallback())

– Define function in ruby/interfaces/OpalInterface.C

– Add new function pointer to mf_opal_api object

– Create a new function handler in opal/system/system.C and assign m_opal_api object’s new function pointer to this function handler

• To create new Opal->Ruby callback (ex: makeRequest())

– Define function in ruby/interfaces/OpalInterface.C

– Add new function pointer to mf_ruby_api object

– Assign function pointer to new function in OpalInterface::installInterface()
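The pattern behind both directions is a table of function pointers that one side fills in and the other side calls. The sketch below mimics that pattern in Python purely for illustration: `ApiTable`, `hit_callback`, and `make_request` are hypothetical stand-ins, the real interfaces are C/C++ structs of function pointers, and the toy `make_request` completes every request immediately.

```python
class ApiTable:
    """Stand-in for an interface object that holds named function pointers."""

    def __init__(self, name):
        self.name = name
        self._handlers = {}

    def install(self, slot, handler):
        """Assign a handler to a slot (like setting a function pointer)."""
        if not callable(handler):
            raise TypeError("handler must be callable")
        self._handlers[slot] = handler

    def call(self, slot, *args):
        """Invoke the handler in a slot; fail loudly if it was never set."""
        if slot not in self._handlers:
            raise LookupError(f"{self.name}: no handler installed for {slot!r}")
        return self._handlers[slot](*args)


opal_api = ApiTable("opal_api")  # Ruby -> Opal direction
ruby_api = ApiTable("ruby_api")  # Opal -> Ruby direction

completed = []  # addresses whose requests have completed


def hit_callback(address):
    """Opal-side handler: record that a memory request finished."""
    completed.append(address)
    return True


def make_request(address, is_write):
    """Ruby-side handler: toy model that completes the request at once."""
    return opal_api.call("hitCallback", address)


# Each side installs its handler into the table the other side calls.
opal_api.install("hitCallback", hit_callback)
ruby_api.install("makeRequest", make_request)
```

Calling `ruby_api.call("makeRequest", 0x1000, False)` then flows from the Opal side into Ruby and back through `hitCallback`, which is the round trip the two bullet lists describe.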