Post on 19-Dec-2015

HAsim

Joel Emer†‡

Michael Adler†, Artur Klauser†, Angshuman Parashar†, Michael Pellauer‡,

Murali Vijayaraghavan‡

†VSSAD, Intel

‡CSAIL, MIT

2007.05.14 Hasim2

Overview

• Goal
  – Produce compelling evidence for architecture ideas

• Requirements
  – Cycle-accurate simulation
  – Representative simulation length
  – Software development (often)

• Current approach
  – Mostly software simulation (1 kHz to 10 kHz)

• New approach
  – Build a performance model in an FPGA

FPGA-based approaches

• Prototyping
  – Build a logically isomorphic representation of the design

• Modeling
  – Build a performance simulation in gates

• Hybrids
  – Build something that is partially a prototype and partially a model

Recreate Asim in hardware

• Modularity

• Inter-module communication

• Functional/Timing Partitioning

• Modeling Utilities

Why modularity?

• Speed of model development

• Shared components between products

• Reuse across generations

• Encourages isomorphism to design

• Improved fidelity

• Facilitates speed/fidelity trade-offs

• Architectural experimentation

• Factorial development and evaluations

• Sharing

ASIM Module Hierarchy

[Figure: module tree with node labels S; M, C, N; F, D, R, X, C, W; B]

ASIM Module Selection

[Figure: the module tree (S; M, C, N; F, D, R, X, C, W) shown with several candidate modules labeled B]

Module Selection

[Figure: two copies of the module tree (M, C, N over F, D, R, X, C, W) under S, shown with several candidate modules labeled B]

Module Replacement

[Figure: the module tree (S; M, C, N; F, D, R, X, C, W) shown with candidate modules labeled B and an additional label X]

(H)ASIM Module Hierarchy

Communication

[Figure: communication among modules; labels C; F, D, R, X, C, W; N, N]

Named connections

[Figure: modules S and D, with connection endpoints labeled A-out and A-in]
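The idea of named connections can be sketched in software as a registry that pairs a send endpoint with a receive endpoint sharing the same name, so that modules never hold direct references to each other. This is a minimal Python illustration, not HAsim's Bluespec implementation; `ConnectionRegistry`, `send_endpoint`, `recv_endpoint`, and `dangling` are hypothetical names introduced here.

```python
from collections import deque


class ConnectionRegistry:
    """Matches send and receive endpoints by name (hypothetical helper)."""

    def __init__(self):
        self._queues = {}
        self._senders = set()
        self._receivers = set()

    def _claim(self, name, claimed, role):
        # Each name may have at most one sender and one receiver.
        if name in claimed:
            raise ValueError(f"connection {name!r} already has a {role}")
        claimed.add(name)
        return self._queues.setdefault(name, deque())

    def send_endpoint(self, name):
        queue = self._claim(name, self._senders, "sender")
        return queue.append

    def recv_endpoint(self, name):
        queue = self._claim(name, self._receivers, "receiver")
        # Returns None when nothing has been sent yet.
        return lambda: queue.popleft() if queue else None

    def dangling(self):
        """Names that have only one of the two endpoints."""
        return sorted(self._senders ^ self._receivers)


registry = ConnectionRegistry()
a_out = registry.send_endpoint("A")   # created inside the sending module
a_in = registry.recv_endpoint("A")    # created inside the receiving module
b_out = registry.send_endpoint("B")   # no matching receiver yet

a_out("hello")
first, second = a_in(), a_in()
print(first, second, registry.dangling())
```

Because endpoints are matched by name only, a module can be swapped for another implementation as long as it creates the same named endpoints, and `dangling()` gives a simple check for connections left unmatched.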

Model and FPGA Cycles

[Figure: Module A and Module B connected by a Port, shown against FPGA cycles 1 to 8. Module A executes steps 1.1, 1.2, 1.3, 2.1, 2.2; Module B executes steps 1.1, 2.1, 2.2, 2.3.]
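One way to read the step labels on this slide is as "model cycle . FPGA sub-step": each module may spend a different number of FPGA cycles on one model cycle, and the port keeps the two modules consistent. The Python sketch below illustrates that reading. The per-cycle costs (`[3, 2]` for A, `[1, 3]` for B), the one-model-cycle port latency, and the class names are assumptions chosen to reproduce the labels listed for A and B, not values taken from HAsim.

```python
from collections import deque


class Port:
    """FIFO between two modules; preloaded entries model latency in model cycles."""

    def __init__(self, latency=1):
        # None stands for "no message was sent for this model cycle".
        self.fifo = deque([None] * latency)


class Module:
    def __init__(self, costs, inp=None, out=None):
        self.costs = costs        # FPGA cycles needed for each model cycle (assumed)
        self.inp, self.out = inp, out
        self.done = 0             # model cycles completed so far
        self.remaining = 0        # FPGA cycles left in the current model cycle
        self.trace = []           # one label per FPGA cycle, "-" when idle

    def step(self):
        """Advance this module by one FPGA cycle."""
        if self.done == len(self.costs):
            self.trace.append("-")            # nothing left to simulate
            return
        if self.remaining == 0:               # about to start a new model cycle
            if self.inp is not None:
                if not self.inp.fifo:
                    self.trace.append("-")    # stall until the producer catches up
                    return
                self.inp.fifo.popleft()       # consume this model cycle's input
            self.remaining = self.costs[self.done]
        sub_step = self.costs[self.done] - self.remaining + 1
        self.trace.append(f"{self.done + 1}.{sub_step}")
        self.remaining -= 1
        if self.remaining == 0:               # model cycle finished
            if self.out is not None:
                self.out.fifo.append(self.done + 1)
            self.done += 1


port = Port(latency=1)
a = Module(costs=[3, 2], out=port)
b = Module(costs=[1, 3], inp=port)
for _ in range(8):        # eight FPGA cycles, as on the slide
    b.step()              # consumer first, so it only sees sends from earlier cycles
    a.step()
print("A", " ".join(a.trace))
print("B", " ".join(b.trace))
```

In this sketch B finishes model cycle 1 in a single FPGA cycle and then idles until A, which needs three FPGA cycles, has sent its output for model cycle 1. The model-cycle ordering is preserved even though the FPGA-cycle timing of the two modules differs.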

Functional/Timing Decomposition

Timing Partition
• Micro-architecture

Functional Partition
• ISA semantics
• Platform semantics

[Figure: the Timing Partition sends Fetch(PC) to the Functional Partition, which returns an Instruction]

• Simplifies timing model

• Amortize functional model design effort over many models

• Can be pipelined for performance

• Can be FPGA-friendly design

• Can be split across hardware and software
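The split can be illustrated with a toy software model: the functional partition knows what each instruction does, and the timing partition only decides how many model cycles it costs. This is a minimal Python sketch under stated assumptions; the three-operation ISA, the latency table, and the class and method names are invented for illustration and are not HAsim interfaces.

```python
class FunctionalPartition:
    """Owns ISA semantics: what each instruction does (toy ISA, assumed)."""

    def __init__(self, program):
        self.program = program
        self.regs = [0] * 4

    def fetch(self, pc):
        # Corresponds to the Fetch(PC) -> Instruction arrow on the slide.
        return self.program[pc]

    def execute(self, inst):
        op, dst, a, b = inst
        if op == "addi":                      # dst = regs[a] + immediate b
            self.regs[dst] = self.regs[a] + b
        elif op == "add":                     # dst = regs[a] + regs[b]
            self.regs[dst] = self.regs[a] + self.regs[b]
        elif op == "mul":                     # dst = regs[a] * regs[b]
            self.regs[dst] = self.regs[a] * self.regs[b]
        else:
            raise ValueError(f"unknown opcode {op!r}")


class TimingPartition:
    """Owns micro-architecture: how many model cycles each instruction costs."""

    LATENCY = {"addi": 1, "add": 1, "mul": 3}  # assumed latencies

    def __init__(self, functional):
        self.functional = functional
        self.cycles = 0

    def run(self, n_insts):
        for pc in range(n_insts):
            inst = self.functional.fetch(pc)
            self.functional.execute(inst)
            self.cycles += self.LATENCY[inst[0]]
        return self.cycles


program = [("addi", 0, 0, 5), ("addi", 1, 1, 7), ("mul", 2, 0, 1)]
functional = FunctionalPartition(program)
timing = TimingPartition(functional)
total_cycles = timing.run(len(program))
print(total_cycles, functional.regs)
```

Changing the latency table, or replacing `TimingPartition` with a pipelined model, leaves `FunctionalPartition` untouched. That is the sense in which the functional design effort is amortized over many timing models.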

Execute@execute phases

Fetch instruction

Speculatively execute instruction

Read memory*

Speculatively write memory* (locally visible)

Commit or Abort instruction

Write memory* (globally visible)

* Optional depending on instruction type

Execution in phases

F D X R C

F D X W C W

F D X C

Assertion: All data dependencies can be represented in these phases

F D X R A

F D X X C W
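The ordering constraints implied by the phase list can be written down as a small checker. The sketch below assumes the letters stand for F = fetch, D = decode, X = speculative execute, R = read memory, W = write memory, C = commit, and A = abort. The slides do not spell out this legend, so it is an inference, and the rule that a globally visible write happens exactly when a speculative write commits is likewise an assumption. The row "F D X X C W" is left out because its meaning is not clear from the transcript.

```python
import re

# Assumed legend: F fetch, D decode, X speculative execute, R read memory,
# W write memory, C commit, A abort.
PHASE_ORDER = re.compile(r"^F D X( R| W)?( C( W)?| A)$")


def is_valid_sequence(seq):
    """Check one space-separated phase sequence against the assumed rules."""
    match = PHASE_ORDER.match(seq)
    if match is None:
        return False
    wrote_locally = match.group(1) == " W"
    committed = match.group(2).startswith(" C")
    wrote_globally = match.group(3) is not None
    # A globally visible write occurs exactly when a speculative write commits.
    return wrote_globally == (wrote_locally and committed)


examples = ["F D X R C", "F D X W C W", "F D X C", "F D X R A"]
results = {seq: is_valid_sequence(seq) for seq in examples}
print(results)
```

Under these assumptions the four rows checked here are all accepted, while a sequence such as "F D X C W" (a global write with no speculative write) or "F X D C" (phases out of order) is rejected.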

HAsim: Partitioning Overview

[Figure: the Timing Partition drives a Functional Partition made up of Token Gen, Fet, Dec, Exe, Mem, LCom, and GCom stages, backed by Memory State and Register State (RegFile)]

Common Infrastructure

• Modules

• Inter-module communication

• Statistics gathering

• Event logging

• Debug Tracing

• Simulation control

• …

Bluespec (Asim-style) module

module [HAsim_module] mkCache#() (Empty);

   // The cache receives requests on "a2cache" and sends responses on "cache2a".
   Port#(Addr) req_port  <- mkRecvPort("a2cache");
   Port#(Bool) resp_port <- mkSendPort("cache2a");

   TagArray tagarray <- mkTagArray();

   // Each model cycle: if a request arrived, reply with hit or miss.
   rule cycle (True);
      Maybe#(Addr) mx = req_port.get();
      if (isValid(mx))
         resp_port.put(tagarray.lookup(validValue(mx)));
   endrule

endmodule

Bluespec (Asim-style) submodule

module mkTagArray(TagArray);

   RegFile#(Bit#(12), Bit#(4)) tagArray <- mkRegFileFull(...);

   // Helper functions come first so the method below can use them.
   function Bit#(4) getTag(Addr x);
      return x[15:12];
   endfunction

   function Bit#(12) getIndex(Addr x);
      return x[11:0];
   endfunction

   // Hit when the stored tag matches the tag bits of the 16-bit address.
   method Bool lookup(Addr a);
      return (tagArray.sub(getIndex(a)) == getTag(a));
   endmethod

endmodule
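The address split used by the tag array (bits 15:12 as the tag, bits 11:0 as the index) can be checked with a small software equivalent. This is a Python sketch of the same lookup, not part of HAsim. The `fill` method is a hypothetical addition for the example, and, as on the slide, there is no valid bit.

```python
TAG_BITS = 4      # address bits 15:12
INDEX_BITS = 12   # address bits 11:0


def get_tag(addr):
    return (addr >> INDEX_BITS) & ((1 << TAG_BITS) - 1)


def get_index(addr):
    return addr & ((1 << INDEX_BITS) - 1)


class TagArray:
    """Direct-mapped tag store with one 4-bit tag per 12-bit index."""

    def __init__(self):
        self.tags = [0] * (1 << INDEX_BITS)

    def fill(self, addr):
        # Hypothetical helper: record that addr is now cached.
        self.tags[get_index(addr)] = get_tag(addr)

    def lookup(self, addr):
        # Same comparison as the slide's lookup method.
        return self.tags[get_index(addr)] == get_tag(addr)


tag_array = TagArray()
miss_before_fill = tag_array.lookup(0xA123)
tag_array.fill(0xA123)
hit_after_fill = tag_array.lookup(0xA123)
conflict = tag_array.lookup(0xB123)      # same index, different tag
print(miss_before_fill, hit_after_fill, conflict)
```

Without a valid bit, an untouched entry holds tag 0, so in this sketch any address whose tag bits are 0 reports a hit before it has ever been filled.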

Support functions - stats

[Figure: three modules, each with its own Stat Counter, all feeding one Stat Dumper]

module mkCache#(...) (Empty);
   ...
   cache_hits <- mkStat(...);
   ...
   hit = tagarray.lookup(...);
   if (hit)
      cache_hits.increment();
   ...
endmodule
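The pattern of per-module counters collected by a single dumper can be sketched as follows. This is a Python illustration of the structure only: `Stat`, `StatDumper`, and the stat names are hypothetical, and the real mechanism is implemented in hardware.

```python
class StatDumper:
    """Collects every counter that registers with it and reports them together."""

    def __init__(self):
        self._stats = []

    def register(self, stat):
        self._stats.append(stat)

    def dump(self):
        return {stat.name: stat.value for stat in self._stats}


class Stat:
    """A counter owned by one module; registers itself with the dumper."""

    def __init__(self, dumper, name):
        self.name = name
        self.value = 0
        dumper.register(self)

    def increment(self):
        self.value += 1


dumper = StatDumper()
cache_hits = Stat(dumper, "cache_hits")       # lives in the cache module
branches = Stat(dumper, "branches")           # lives in some other module

for hit in [True, False, True]:               # stand-in for tag lookups
    if hit:
        cache_hits.increment()
branches.increment()

report = dumper.dump()
print(report)
```

Each module only touches its own counter, and the dumper is the one place that knows about all of them, so adding a statistic does not require changing any shared reporting code.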

2Dreams

Support functions - events

[Figure: three modules, each with its own Event Reg, all feeding one Event Dumper]

module mkCache#(...) (Empty);
   ...
   cache_event <- mkEvent(...);
   ...
   hit = tagarray.lookup(...);
   cache_event.report(hit);
   ...
endmodule

Support functions – global controller

[Figure: three modules, each with its own Controller, all attached to one Global Controller]

module mkCache#(...) (Empty);
   ...
   ctrl <- mkCntrlr(...);
   ...
   rule (ctrl.run())
      ...
   endrule
endmodule

FPGA-based prototype

Prototyping Catch-22…

Module Instantiation

[Figure: module instantiation; labels U; M, C, N, C; M; C; and three copies of the pipeline F, D, R, X, C, W]

Factorial Coding/Experiments

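[Figure: four copies of the module tree (S over M, C, N), one for each combination of the alternatives SC or RC with SM or RM: (SC, SM), (RC, SM), (SC, RM), (RC, RM)]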
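The four configurations form a 2 x 2 factorial design: every choice on one axis is combined with every choice on the other. The Python sketch below enumerates them. Treating SC/RC as alternatives for the C module and SM/RM as alternatives for the M module is an assumption read off the labels, and `factorial_configs` is a hypothetical helper.

```python
from itertools import product

# Assumed axes: two alternatives for the C module, two for the M module.
C_CHOICES = ["SC", "RC"]
M_CHOICES = ["SM", "RM"]


def factorial_configs(**axes):
    """Return one dict per combination of the given module alternatives."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


configs = factorial_configs(C=C_CHOICES, M=M_CHOICES)
for config in configs:
    print(config)
```

With modular components each configuration is a selection of existing modules, so adding a third alternative on either axis grows the experiment set without writing a new model for each combination.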

HAsim: Current status - models

• Simple RISC functional model operating
  – Simple RISC ISA
  – Pipelined multi-phase instruction execution
  – Supports speculative OOO design
    • Physical Reg File and ROB
    • Small physically addressed memory
    • Fast speculative rewinds

• Instruction-per-cycle (APE) model
  – Runs simple benchmarks on FPGA

• Five stage pipeline
  – Supports branch mis-speculation
  – Runs simple benchmarks (in software simulation)

• X86 functional model architecture under development

Connections Implement Ports

[Figure: connections named foo, bar, and baz shown in two views of the PM]

PM (Module Tree w. Connections)

PM (Hardware Modules w. Wrappers)

Implemented via connections.

Timing Model Resources (Fast)

OOO, branch prediction, three functional units, 32KB 2-way set associative ICache and DCache, iTLB, dTLB

• 2142 slices (15% of a 2VP30)

• 21 block RAMs (15% of a 2VP30)

Configurable cache model

• 32KB 4-way set associative cache with 16B cache-lines
  – 165 slices (1% of a 2VP30)
  – 17 block RAMs (12% of a 2VP30)

• 2MB 4-way set-associative cache with 64B cache-lines
  – 140 slices (1% of a 2VP30)
  – 40 block RAMs (29% of a 2VP30)

Current FPGAs (4VFX140)

• 142,128 slices

• 552 block RAMs

• 2 PowerPCs
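As a rough back-of-the-envelope check, the block RAM figures above suggest how many copies of the timing model might fit on the larger part. The Python sketch below uses only the block RAM counts from this slide. It assumes block RAMs are comparable across the two devices and ignores slices, routing, and any infrastructure overhead, so the results are upper bounds at best.

```python
FPGA_BRAMS = 552                 # 4VFX140, from the slide
TIMING_MODEL_BRAMS = 21          # OOO timing model, from the slide
CACHE_2MB_BRAMS = 40             # 2MB configurable cache model, from the slide


def copies_that_fit(brams_per_copy, budget=FPGA_BRAMS):
    """Whole copies that fit in the block RAM budget (block RAM only)."""
    return budget // brams_per_copy


timing_only = copies_that_fit(TIMING_MODEL_BRAMS)
with_2mb_cache = copies_that_fit(TIMING_MODEL_BRAMS + CACHE_2MB_BRAMS)
print(timing_only, with_2mb_cache)
```

By this estimate block RAM alone would allow roughly two dozen timing models, or about nine if each is paired with a 2MB cache model.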