Constructing Virtual Architectures on Tiled...

56
1 Constructing Virtual Architectures on Tiled Processors David Wentzlaff Anant Agarwal MIT

Transcript of Constructing Virtual Architectures on Tiled...

Page 1: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

1

Constructing Virtual Architectures

on Tiled Processors

David Wentzlaff

Anant Agarwal

MIT

Page 2: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

2

Emulators and JITs

for Multi-Core

David Wentzlaff

Anant Agarwal

MIT

Page 3: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

3

Why Multi-Core?

Page 4: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

4

Why Multi-Core?

Future architectures will be on-chip parallel machines

Moore’s Law provides more parallel silicon resources

Diminishing sequential returns

Growth applications are parallel

Page 5: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

5

Why Multi-Core?

Future architectures will be on-chip parallel machines

Moore’s Law provides more parallel silicon resources

Diminishing sequential returns

Growth applications are parallel

Future architectures will be optimized for parallel applications

Hardware compatibility will be broken

Page 6: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

6

Why Emulators and JITs on Multi-Core?

Future architectures will be on-chip parallel machines

Moore’s Law provides more parallel silicon resources

Diminishing sequential returns

Growth applications are parallel

Future architectures will be optimized for parallel applications

Hardware compatibility will be broken

Page 7: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

7

Why Emulators and JITs on Multi-Core?

Future architectures will be on-chip parallel machines

Moore’s Law provides more parallel silicon resources

Diminishing sequential returns

Growth applications are parallel

Future architectures will be optimized for parallel applications

Hardware compatibility will be broken

Future architectures will need to run legacy applications

Market forces will require future chips to run 1983 “Frogger” for DOS

Software re-verification on new architectures too costly

Page 8: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

8

Are Emulators and JITs for Multi-Core Different?

Page 9: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

9

Are Emulators and JITs for Multi-Core Different?

Yes

Page 10: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

10

Are Emulators and JITs for Multi-Core Different?

Yes

Bountiful parallel resources

Example: Code optimization cost is reduced

“hot spot” analysis may miss sequential performance

Parallelize client application

Page 11: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

11

Are Emulators and JITs for Multi-Core Different?

Yes

Bountiful parallel resources

Example: Code optimization cost is reduced

“hot spot” analysis may miss sequential performance

Parallelize client application

Parameters for Multi-Core are different

Core-to-Core latencies reduced

Page 12: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

12

Road Map

Exploit on-chip parallel resources to accelerate emulation

Page 13: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

13

Road Map

Exploit on-chip parallel resources to accelerate emulation

Page 14: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

14

Road Map

Exploit on-chip parallel resources to accelerate emulation

Focused on performance

Page 15: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

15

Road Map

Exploit on-chip parallel resources to accelerate emulation

Focused on performance

Acceleration Mechanisms

Page 16: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

16

Road Map

Exploit on-chip parallel resources to accelerate emulation

Focused on performance

Acceleration Mechanisms

1. Pipelining Virtual Architectures

Page 17: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

17

Road Map

Exploit on-chip parallel resources to accelerate emulation

Focused on performance

Acceleration Mechanisms

1. Pipelining Virtual Architectures

2. Speculative Parallel Translation

Page 18: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

18

Road Map

Exploit on-chip parallel resources to accelerate emulation

Focused on performance

Acceleration Mechanisms

1. Pipelining Virtual Architectures

2. Speculative Parallel Translation

3. Static & Dynamic Architecture Reconfiguration

Page 19: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

19

Road Map

Exploit on-chip parallel resources to accelerate emulation

Focused on performance

Acceleration Mechanisms

1. Pipelining Virtual Architectures

2. Speculative Parallel Translation

3. Static & Dynamic Architecture Reconfiguration

Proof of concept system

All software parallel translator: x86 on Raw

Page 20: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

20

Data RAM

Disk

Background: Translation

x86

Binary

Runtime -- Execution

x86

Binary

Code Cache Code Cache

Tags

Translator

x86 Parser &

High Level

Translator

High Level

Optimization

Low Level

Code Generation

Low Level

Optimization and

Scheduling

Page 21: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

21

Background: “Old” Parallel Translation

Page 22: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

22

1. Pipelining Virtual Architectures

Page 23: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

23

1. Pipelining Virtual Architectures

Translator

Tile

Code Cache

Tile

Execution

Tile

MMU

Tile

Data Cache

Tile

Page 24: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

24

1. Pipelining Virtual Architectures

Utilize a Tiled Processor as fabric to construct virtual processor

Translator

Tile

Code Cache

Tile

Execution

Tile

MMU

Tile

Data Cache

Tile

Page 25: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

25

1. Pipelining Virtual Architectures

Utilize a Tiled Processor as fabric to construct virtual processor

Coarse grain pipelining to exploit parallelism

Translator

Tile

Code Cache

Tile

Execution

Tile

MMU

Tile

Data Cache

Tile

Page 26: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

26

Sequential TranslationExecution

Time

Translation

Time

Optimization

Time

Tim

e

Page 27: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

27

2. Speculative Parallel TranslationExecution

Time

Translation

Time

Optimization

Time

Tim

e

Page 28: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

28

3. Reconfiguration

Different programs have different characteristics

Processor Architect uses benchmarks to choose “compromise”

processor

Page 29: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

29

3. Reconfiguration

Different programs have different characteristics

Processor Architect uses benchmarks to choose “compromise”

processor

Static Reconfiguration

Choose different virtual machine configuration based off application

Dynamic Reconfiguration

Detect phases/programs dynamic needs and reconfigure at runtime

Cost to reconfiguration

Page 30: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

30

Reconfiguration

Photo

Courtesy

Intel Corp.

Page 31: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

31

Prototype System and Evaluation

Page 32: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

32

Background: Architectures

x86 (Pentium III)

CISC instruction set

Hardware Virtual Memory (VM)

Hardware Memory Protection

Condition Codes used for branching

Hardware instruction cache

1 superscalar processor core

3-way parallelism

Out-of-order processor

Raw

RISC instruction set

No VM

No Memory Protection

No Condition Codes

Software managed instruction memory

16 Processors arranged in 4x4 mesh

4 low latency networks

In-order processors

ISA

Impl

emen

tatio

n

* Photo

Courtesy

Intel Corp.

*

Page 33: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

33

System Design

Manager

L2 Code

Cache

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Banked L1.5 Code Cache

Runtime -- Execution

L1 Code

Cache

L1 Data

Cache

MMU

TLB

System

Functionality

& Loader

L2 Data $

Bank 0

L2 Data $

Bank 1

L2 Data $

Bank 2

L2 Data $

Bank 3

Page 34: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

34

System Design

Manager

L2 Code

Cache

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Banked L1.5 Code Cache

Runtime -- Execution

L1 Code

Cache

L1 Data

Cache

MMU

TLB

System

Functionality

& Loader

L2 Data $

Bank 0

L2 Data $

Bank 1

L2 Data $

Bank 2

L2 Data $

Bank 3

Runtime-ExecL1

I-$

L1

D-$

Page 35: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

35

Methodology

Cycle comparison of Raw vs. Pentium III

All results collected on real hardware

No hardware added

Same binaries (unmodified)

Metric: Slowdown

Raw executing x86 code compared by cycle against Pentium III

Page 36: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

36

Baseline Performance

Page 37: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

37

2. Speculative Parallel Translation

Page 38: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

38

L2 Code Cache Miss Rate

2. Speculative Parallel TranslationC

ode

Cac

he

Page 39: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

39

3. Static Reconfiguration

Manager

L2 Code

Cache

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Banked L1.5 Code CacheRuntime-Exec

L1

I-$

L1

D-$

MMU

TLB

System

Functionality

& Loader

L2 Data $

Bank 0

L2 Data $

Bank 1

L2 Data $

Bank 2

L2 Data $

Bank 3

Page 40: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

40

3. Static Reconfiguration

Manager

L2 Code

Cache

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Banked L1.5 Code CacheRuntime-Exec

L1

I-$

L1

D-$

MMU

TLB

System

Functionality

& Loader

L2 Data $

Bank 0

L2 Data $

Bank 1

L2 Data $

Bank 2

L2 Data $

Bank 3

Page 41: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

41

3. Static Reconfiguration

Manager

L2 Code

Cache

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Banked L1.5 Code CacheRuntime-Exec

L1

I-$

L1

D-$

MMU

TLB

System

Functionality

& Loader

L2 Data $

Bank 0

Translation

Slave

Translation

Slave

Translation

Slave

Page 42: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

42

3. Dynamic Reconfiguration

Manager

L2 Code

Cache

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Banked L1.5 Code CacheRuntime-Exec

L1

I-$

L1

D-$

MMU

TLB

System

Functionality

& Loader

L2 Data $

Bank 0

L2 Data $

Bank 1

L2 Data $

Bank 2

L2 Data $

Bank 3

Translation

Slave

Translation

Slave

Translation

Slave

Page 43: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

43

3. Dynamic Reconfiguration

Manager

L2 Code

Cache

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Translation

Slave

Banked L1.5 Code CacheRuntime-Exec

L1

I-$

L1

D-$

MMU

TLB

System

Functionality

& Loader

L2 Data $

Bank 0

L2 Data $

Bank 1

L2 Data $

Bank 2

L2 Data $

Bank 3

Translation

Slave

Translation

Slave

Translation

Slave

Page 44: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

44

3. Reconfiguration Zoom

Page 45: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

45

Baseline Performance Analysis

Page 46: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

46

Baseline Performance Analysis

Page 47: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

47

Baseline Performance Analysis

Intrinsic Raw Emulator Pentium III

latency occupancy latency occupancy

L1 Cache Hit 6 4 3 1

L2 Cache Hit 87 87 7 1

L2 Cache Miss 151 87 79 1

Page 48: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

48

Baseline Performance Analysis

Intrinsic Raw Emulator Pentium III

latency occupancy latency occupancy

L1 Cache Hit 6 4 3 1

L2 Cache Hit 87 87 7 1

L2 Cache Miss 151 87 79 1

CPI = (memory_access_rate * (((1 – L1_miss_rate) * L1_hit_occupancy) +

(L1_miss_rate * (((1 – L2_miss_rate * L2_miss_occupancy))))) + ((1 –

memory_access_rate) * non_memory_CPI)

Page 49: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

49

Baseline Performance Analysis

Intrinsic Raw Emulator Pentium III

latency occupancy latency occupancy

L1 Cache Hit 6 4 3 1

L2 Cache Hit 87 87 7 1

L2 Cache Miss 151 87 79 1

CPI = (memory_access_rate * (((1 – L1_miss_rate) * L1_hit_occupancy) +

(L1_miss_rate * (((1 – L2_miss_rate * L2_miss_occupancy))))) + ((1 –

memory_access_rate) * non_memory_CPI)

Memory CPI 3.9 1

Page 50: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

50

Baseline Performance Analysis

Memory System 3.9x slowdown

Lack of ILP 1.3x slowdown

Condition Codes (Flags) 1.1x slowdown

Page 51: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

51

Baseline Performance Analysis

Memory System 3.9x slowdown

Lack of ILP 1.3x slowdown

Condition Codes (Flags) x 1.1x slowdown

5.5x slowdown

Page 52: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

52

Baseline Performance Analysis

Memory System 3.9x slowdown

Lack of ILP 1.3x slowdown

Condition Codes (Flags) x 1.1x slowdown

5.5x slowdown

Code Cache Misses 1 – 20x slowdown

Page 53: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

53

Baseline Performance Analysis

Page 54: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

54

Future Work

Hardware additions to facilitate parallel emulation

x86 Server farm on a chip

Inter-Virtual Processor dynamic load balancing

Differing forms of Dynamic Reconfiguration

Vary number of functional units

Page 55: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

55

Questions ?

Page 56: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal.

56

Extras