
Constructing Virtual Architectures on Tiled Processors

Emulators and JITs for Multi-Core

David Wentzlaff
Anant Agarwal
MIT

Why Multi-Core?

Future architectures will be on-chip parallel machines

Moore's Law provides more parallel silicon resources

Diminishing sequential returns

Growth applications are parallel

Future architectures will be optimized for parallel applications

Hardware compatibility will be broken

Why Emulators and JITs on Multi-Core?

Future architectures will need to run legacy applications

Market forces will require future chips to run 1983 “Frogger” for DOS

Software re-verification on new architectures too costly

Are Emulators and JITs for Multi-Core Different?

Yes

Bountiful parallel resources

Example: Code optimization cost is reduced

“hot spot” analysis may miss sequential performance

Parallelize client application

Parameters for Multi-Core are different

Core-to-Core latencies reduced

Road Map

Exploit on-chip parallel resources to accelerate emulation

Focused on performance

Acceleration Mechanisms

1. Pipelining Virtual Architectures

2. Speculative Parallel Translation

3. Static & Dynamic Architecture Reconfiguration

Proof of concept system

All software parallel translator: x86 on Raw

Background: Translation

[Diagram: an x86 binary is read from disk and fed through the translator pipeline (x86 Parser & High Level Translator, High Level Optimization, Low Level Code Generation, Low Level Optimization and Scheduling) into a tagged code cache; the runtime then executes the cached translations against data RAM.]
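This is the standard translate-and-cache loop: at runtime the emulator looks up the current guest PC in the tagged code cache, runs the translator pipeline on a miss, and jumps into the cached native code. The C sketch below only illustrates that control flow; the identifiers (code_cache, translate_block, and so on) are placeholders, not the names used in the actual x86-on-Raw system.

```c
#include <stdint.h>
#include <stddef.h>

#define CODE_CACHE_ENTRIES 4096

typedef void *(*translated_block_fn)(void);   /* runs a block, returns the next guest PC */

struct cache_entry {
    uint32_t guest_pc;             /* tag: x86 address the block was translated from */
    translated_block_fn native;    /* pointer into the code cache                     */
};

static struct cache_entry code_cache[CODE_CACHE_ENTRIES];

/* Stand-in for the translator pipeline in the diagram: parse the x86
 * block at guest_pc, optimize, generate and schedule native code, and
 * return a pointer to the installed translation. */
extern translated_block_fn translate_block(uint32_t guest_pc);

static translated_block_fn lookup_or_translate(uint32_t guest_pc)
{
    struct cache_entry *e = &code_cache[guest_pc % CODE_CACHE_ENTRIES];
    if (e->native == NULL || e->guest_pc != guest_pc) {   /* code cache miss */
        e->native = translate_block(guest_pc);            /* slow path       */
        e->guest_pc = guest_pc;
    }
    return e->native;
}

void emulate(uint32_t entry_pc)
{
    uint32_t pc = entry_pc;
    for (;;) {
        translated_block_fn block = lookup_or_translate(pc);
        pc = (uint32_t)(uintptr_t)block();    /* execute, then chain to the next block */
    }
}
```

On Raw, the stages of this loop are what get split across tiles in the mechanisms that follow.
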


Background: “Old” Parallel Translation

1. Pipelining Virtual Architectures

Utilize a Tiled Processor as a fabric to construct a virtual processor

Coarse-grain pipelining to exploit parallelism (sketched below)

[Diagram: the virtual processor is assembled from tiles: a Translator tile, a Code Cache tile, an Execution tile, an MMU tile, and a Data Cache tile.]
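To illustrate what coarse-grain pipelining of the virtual processor means, the sketch below runs a translator stage and an execution stage concurrently, connected by a bounded queue. On Raw each stage would occupy its own tile and the queue would be one of the low-latency on-chip networks; pthreads and a mutex-protected ring buffer are used here purely as a stand-in, and all names are illustrative.

```c
#include <pthread.h>
#include <stdio.h>

#define QUEUE_SLOTS 8
#define NUM_BLOCKS  32

struct work { int block_id; };            /* a translated block handed downstream */

static struct work queue[QUEUE_SLOTS];
static int head, tail, count;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

static void push(struct work w)           /* stand-in for an on-chip network send */
{
    pthread_mutex_lock(&lock);
    while (count == QUEUE_SLOTS) pthread_cond_wait(&not_full, &lock);
    queue[tail] = w; tail = (tail + 1) % QUEUE_SLOTS; count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

static struct work pop(void)              /* stand-in for an on-chip network receive */
{
    pthread_mutex_lock(&lock);
    while (count == 0) pthread_cond_wait(&not_empty, &lock);
    struct work w = queue[head]; head = (head + 1) % QUEUE_SLOTS; count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return w;
}

static void *translator_stage(void *arg)  /* the "Translator tile" */
{
    (void)arg;
    for (int b = 0; b < NUM_BLOCKS; b++) {
        /* ... translate basic block b (omitted) ... */
        push((struct work){ .block_id = b });
    }
    return NULL;
}

static void *execution_stage(void *arg)   /* the "Execution tile" */
{
    (void)arg;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        struct work w = pop();
        printf("executing translated block %d\n", w.block_id);
    }
    return NULL;
}

int main(void)
{
    pthread_t t, e;
    pthread_create(&t, NULL, translator_stage, NULL);
    pthread_create(&e, NULL, execution_stage, NULL);
    pthread_join(t, NULL);
    pthread_join(e, NULL);
    return 0;
}
```

The benefit is that translation work overlaps with execution instead of serializing with it.
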

Sequential Translation

[Timeline: with sequential translation, execution time, translation time, and optimization time are serialized.]

2. Speculative Parallel Translation

[Timeline: with speculative parallel translation, translation and optimization time overlap with execution time, shortening total time.]
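One way to realize speculative parallel translation is for the manager to guess which guest blocks will be needed next and farm their translation out to idle translation-slave tiles before execution reaches them. The sketch below shows only that dispatch logic; in_code_cache, send_to_slave, and the round-robin policy are illustrative assumptions, not the actual system's interface.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SLAVES 6      /* matches the six translation-slave tiles in the prototype */

struct block_info {
    uint32_t pc;            /* guest (x86) address of the translated block */
    uint32_t taken_pc;      /* branch-target successor                     */
    uint32_t fallthru_pc;   /* fall-through successor                      */
};

/* Assumed helpers: a code-cache presence check and a message send that
 * asks a slave tile to translate a block. */
extern bool in_code_cache(uint32_t guest_pc);
extern void send_to_slave(int slave, uint32_t guest_pc);

static int next_slave;      /* simple round-robin slave selection */

static void dispatch_if_needed(uint32_t guest_pc)
{
    if (guest_pc == 0 || in_code_cache(guest_pc))
        return;                                   /* unknown target or already translated */
    send_to_slave(next_slave, guest_pc);          /* speculative translation request      */
    next_slave = (next_slave + 1) % NUM_SLAVES;
}

/* Called by the manager each time a block finishes translating:
 * speculatively request both possible successors. */
void speculate_on_successors(const struct block_info *b)
{
    dispatch_if_needed(b->taken_pc);
    dispatch_if_needed(b->fallthru_pc);
}
```

With six slave tiles, several blocks can be in translation at once while the execution tile keeps running already-translated code.
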

3. Reconfiguration

Different programs have different characteristics

Processor architects use benchmarks to choose a “compromise” processor

Static Reconfiguration

Choose a different virtual machine configuration based on the application

Dynamic Reconfiguration

Detect the program's or phase's dynamic needs and reconfigure at runtime (see the sketch below)

Reconfiguration has a cost
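As a concrete illustration of dynamic reconfiguration, the sketch below shows a monitor that could run on the manager tile: it samples simple miss counters each interval and trades a tile between the translation-slave and L2-data-bank roles when one resource clearly dominates. The counters, thresholds, and reassign_tile call are hypothetical; the point is only that reconfiguring is a policy decision that must weigh its own cost.

```c
#include <stdint.h>

enum tile_role { ROLE_TRANSLATION_SLAVE, ROLE_L2_DATA_BANK };

struct phase_counters {
    uint64_t code_cache_misses;   /* translation requests in this interval */
    uint64_t data_cache_misses;   /* L2 data misses in this interval       */
    uint64_t interval_cycles;
};

/* Assumed mechanism that rewires a tile to a new role (not free). */
extern void reassign_tile(int tile, enum tile_role new_role);

/* Called periodically with the counters from the last interval. */
void maybe_reconfigure(const struct phase_counters *c, int spare_tile)
{
    double code_rate = (double)c->code_cache_misses / (double)c->interval_cycles;
    double data_rate = (double)c->data_cache_misses / (double)c->interval_cycles;

    /* Only move a tile when one demand clearly dominates, since the
     * reconfiguration itself costs cycles. */
    if (code_rate > 2.0 * data_rate)
        reassign_tile(spare_tile, ROLE_TRANSLATION_SLAVE);
    else if (data_rate > 2.0 * code_rate)
        reassign_tile(spare_tile, ROLE_L2_DATA_BANK);
}
```
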

Reconfiguration

[Photo courtesy Intel Corp.]

Prototype System and Evaluation

Background: Architectures

x86 (Pentium III)

CISC instruction set

Hardware Virtual Memory (VM)

Hardware Memory Protection

Condition Codes used for branching

Hardware instruction cache

1 superscalar processor core

3-way parallelism

Out-of-order processor

Raw

RISC instruction set

No VM

No Memory Protection

No Condition Codes

Software managed instruction memory

16 Processors arranged in 4x4 mesh

4 low latency networks

In-order processors

(On the slide, the first four items under each architecture describe the ISA and the remaining items describe the implementation. Pentium III photo courtesy Intel Corp.)
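One consequence of this ISA mismatch is that x86 condition codes must be emulated on Raw, which has none. A common dynamic-binary-translation approach, shown below only as a sketch (not necessarily the scheme the x86-on-Raw translator uses), is lazy flag evaluation: record the operands of the last flag-setting instruction and compute only the flag a later branch actually tests.

```c
#include <stdint.h>
#include <stdbool.h>

enum last_op { OP_ADD, OP_SUB };

struct cc_state {
    enum last_op op;               /* which operation last set the flags */
    uint32_t src1, src2, result;   /* its operands and result            */
};

/* Emulated "add": do the arithmetic, remember enough to derive flags later. */
static uint32_t emul_add(struct cc_state *cc, uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    cc->op = OP_ADD; cc->src1 = a; cc->src2 = b; cc->result = r;
    return r;
}

/* Emulated "sub": same idea. */
static uint32_t emul_sub(struct cc_state *cc, uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    cc->op = OP_SUB; cc->src1 = a; cc->src2 = b; cc->result = r;
    return r;
}

/* Zero flag, computed on demand when a "jz"/"jnz" is reached. */
static bool flag_zf(const struct cc_state *cc)
{
    return cc->result == 0;
}

/* Carry flag, computed on demand when a "jc"/"jb" is reached. */
static bool flag_cf(const struct cc_state *cc)
{
    if (cc->op == OP_ADD)
        return cc->result < cc->src1;   /* unsigned wrap-around on add */
    return cc->src1 < cc->src2;         /* borrow on sub               */
}
```

Deriving flags on demand like this is cheaper than recomputing all of EFLAGS after every instruction, which is consistent with the modest 1.1x condition-code overhead reported in the baseline analysis later.
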

System Design

[Diagram: the Raw tiles are assigned roles: a Manager, an L2 Code Cache, six Translation Slaves, a banked L1.5 Code Cache, a Runtime-Execution tile with L1 code and data caches, an MMU/TLB tile, a System Functionality & Loader tile, and four L2 Data Cache banks (Banks 0-3).]
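How translated code is distributed across the banked L1.5 code cache is not spelled out on the slide; a simple possibility, shown purely as an assumption, is to choose the bank by hashing the guest PC, so that a lookup from the execution tile goes straight to one owning tile.

```c
#include <stdint.h>

#define L15_BANKS 6   /* assumed: one bank per translation-slave tile */

/* Assumed helper: perform the lookup on (or send a request to) the tile
 * that owns the given bank. */
extern void *bank_lookup(int bank, uint32_t guest_pc);

void *l15_code_cache_lookup(uint32_t guest_pc)
{
    /* Drop the low bits within a cache line, then spread PCs over banks. */
    int bank = (int)((guest_pc >> 5) % L15_BANKS);
    return bank_lookup(bank, guest_pc);
}
```
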

Methodology

Cycle-count comparison of Raw vs. Pentium III

All results collected on real hardware

No hardware added

Same binaries (unmodified)

Metric: Slowdown (cycles for Raw to execute the x86 code divided by cycles for the Pentium III to execute it natively)

Baseline Performance

[Chart: baseline slowdown of the emulator relative to the Pentium III.]

2. Speculative Parallel Translation

[Charts: code cache performance and L2 code cache miss rate with speculative parallel translation.]

3. Static Reconfiguration

[Diagrams: the baseline tile assignment (Manager, L2 Code Cache, six Translation Slaves, banked L1.5 Code Cache, Runtime-Exec with L1 I-$ and D-$, MMU/TLB, System Functionality & Loader, four L2 Data $ banks) is chosen per application; in the reconfigured layout, three of the L2 Data $ banks are reassigned as additional Translation Slave tiles.]

3. Dynamic Reconfiguration

[Diagrams: the tile assignment changes at runtime, with additional Translation Slave tiles allocated as the program's needs change.]

3. Reconfiguration Zoom

Baseline Performance Analysis

Intrinsic memory operation costs (cycles):

                  Raw Emulator            Pentium III
                  latency   occupancy     latency   occupancy
L1 Cache Hit         6          4            3          1
L2 Cache Hit        87         87            7          1
L2 Cache Miss      151         87           79          1

CPI = memory_access_rate * ((1 - L1_miss_rate) * L1_hit_occupancy
                            + L1_miss_rate * ((1 - L2_miss_rate) * L2_hit_occupancy
                                              + L2_miss_rate * L2_miss_occupancy))
      + (1 - memory_access_rate) * non_memory_CPI

Memory CPI: 3.9 (Raw emulator) vs. 1 (Pentium III)
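The model above can be evaluated directly. The short program below plugs the occupancies from the table into the formula; the access and miss rates are illustrative placeholders, since the measured rates behind the 3.9 vs. 1 result are not listed on the slide.

```c
#include <stdio.h>

/* CPI model from the slide: hit/miss rates weight the per-level
 * occupancies, plus a non-memory component. */
static double model_cpi(double mem_rate, double l1_miss, double l2_miss,
                        double l1_hit_occ, double l2_hit_occ,
                        double l2_miss_occ, double non_mem_cpi)
{
    double mem_occ = (1.0 - l1_miss) * l1_hit_occ
                   + l1_miss * ((1.0 - l2_miss) * l2_hit_occ
                                + l2_miss * l2_miss_occ);
    return mem_rate * mem_occ + (1.0 - mem_rate) * non_mem_cpi;
}

int main(void)
{
    /* Occupancies come from the table (Raw emulator: 4 / 87 / 87,
     * Pentium III: 1 / 1 / 1); the rates below are assumptions. */
    double mem_rate = 0.4, l1_miss = 0.05, l2_miss = 0.3;

    double raw = model_cpi(mem_rate, l1_miss, l2_miss, 4.0, 87.0, 87.0, 1.0);
    double p3  = model_cpi(mem_rate, l1_miss, l2_miss, 1.0, 1.0, 1.0, 1.0);

    printf("modelled CPI: Raw emulator %.2f, Pentium III %.2f\n", raw, p3);
    return 0;
}
```

The gap comes from the occupancies: every emulated memory reference costs at least 4 cycles on Raw versus 1 on the Pentium III, which is what drives the 3.9x memory-system slowdown in the breakdown that follows.
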

Baseline Performance Analysis

Memory System: 3.9x slowdown

Lack of ILP: 1.3x slowdown

Condition Codes (Flags): 1.1x slowdown

Combined (3.9 x 1.3 x 1.1): ~5.5x slowdown

Code Cache Misses: 1 – 20x slowdown

Future Work

Hardware additions to facilitate parallel emulation

x86 Server farm on a chip

Inter-Virtual Processor dynamic load balancing

Differing forms of Dynamic Reconfiguration

Vary number of functional units

Questions?


Extras