A Unified View of Non-monotonic Core Selection and Application Steering in Heterogeneous Chip Multiprocessors

Sandeep Navada, Niket K. Choudhary, Salil Wadhavkar, Eric Rotenberg

Department of Electrical and Computer Engineering

North Carolina State University

Sandeep Navada © 2013


Single-ISA HCMP

• Same ISA
• Different microarchitectures
  – Superscalar width
  – Structure sizes
  – Frequency

• Cores have different performance and power

• New run-time optimization lever


Monotonic HCMP

• Cores can be ranked independent of application
• Core 1 is faster than Core 2 for any application

[Figure: performance (y-axis) of Core 1 and Core 2 across applications A, B, C, D (x-axis)]

Monotonic HCMP example


HCMP literature

• Focus
  – Monotonic cores
  – Cores are preordained
  – Scheduling
• Single thread
  – Minimize energy for a given performance degradation threshold w.r.t. the highest-ranked core
• Multiple threads
  – Maximize throughput/Watt/mm²


Going beyond monotonic HCMP

• Cores can’t be ranked independent of application
• Cores are designed from the ground up, not pre-existing

[Figure: performance (y-axis) of Core 1 and Core 2 across applications A, B, C, D (x-axis)]

Non-monotonic HCMP

• High-contention scenario (optimize throughput): Kumar, et al., Core Architecture Optimization for Single-ISA Heterogeneous Multiprocessors
• Low-contention scenario (optimize latency): our work


Optimize latency

Performance = IPC × frequency

Complexity↑ => IPC↑, frequency↓

[Figure: two plots, one for App A and one for App B, each showing IPC, frequency, and perf versus core complexity]

This tradeoff plays out differently for different apps and depends on the ILP characteristics of each app.
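
As a rough illustration of why this tradeoff makes the core ranking application-dependent, the sketch below multiplies IPC by frequency for two cores and two apps. Every IPC and frequency value, and the core names "simple" and "complex", are hypothetical and chosen only so that the ranking flips; none of them come from the talk.

```python
# Hypothetical numbers, for illustration only (not measurements from the talk).
# Performance in BIPS = IPC x frequency in GHz.

FREQ_GHZ = {"simple": 2.0, "complex": 1.4}  # assume the complex core clocks lower

# Assumed IPC of each app on each core: App A has ample ILP for the complex
# core to exploit, App B has little.
IPC = {
    "App A": {"simple": 1.0, "complex": 2.0},
    "App B": {"simple": 0.9, "complex": 1.0},
}


def bips(app: str, core: str) -> float:
    """Billions of instructions per second for one app on one core."""
    return IPC[app][core] * FREQ_GHZ[core]


def best_core(app: str) -> str:
    """Core with the highest BIPS for the given app."""
    return max(FREQ_GHZ, key=lambda core: bips(app, core))


for app in IPC:
    scores = {core: round(bips(app, core), 2) for core in FREQ_GHZ}
    print(app, scores, "->", best_core(app))
```

With these assumed values the complex core wins for App A and the simple core wins for App B, so neither core can be ranked above the other independent of the application.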


Non-monotonic HCMP challenges

• Core selection: how to pick the core types comprising the heterogeneous design?
• Application steering: how to steer the applications to the best core?


CORE SELECTION


Core design space

Parameter                     Value range                                             Number
Front end width               2, 3, 4, 5, 6, 7, 8                                     7
Issue width                   2, 3, 4, 5, 6, 7, 8                                     7
Physical register file size   64, 128, 192, 256, 384, 512                             6
Issue queue size              16, 24, 32, 48, 64, 96, 128                             7
Load queue/Store queue size   8/8, 16/16, 24/24, 32/32, 40/40, 48/48, 56/56, 64/64    8
L1 I$ size                    8, 16, 32, 64, 128 KB                                   5
L1 D$ size                    8, 16, 32, 64, 128 KB                                   5
L2$ size                      2 MB                                                    1
Clock period                  0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2 ns               8
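
To get a feel for the size of this design space before any pruning, the sketch below multiplies the per-parameter counts from the "Number" column. It assumes every combination of values is a candidate design point; the dictionary keys are illustrative names, and the talk does not describe which combinations the pruning step discards.

```python
from math import prod

# Number of candidate values per parameter, copied from the "Number" column.
# The key names are illustrative, not identifiers from the talk.
CHOICES = {
    "front_end_width": 7,
    "issue_width": 7,
    "physical_register_file": 6,
    "issue_queue": 7,
    "load_store_queue": 8,
    "l1_icache": 5,
    "l1_dcache": 5,
    "l2_cache": 1,
    "clock_period": 8,
}


def design_space_size(choices: dict[str, int]) -> int:
    """Size of the full cross product of all parameter choices."""
    return prod(choices.values())


print(design_space_size(CHOICES))  # 3292800
```

Roughly 3.3 million unpruned points is a plausible reason a pruning step precedes simulation.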


Core selection

[Flow diagram]

• Core design space → pruning script → pruned design space
• SPEC benchmarks → SimPoint tool → 39 10M-instruction phases
• Pruned design space and phases → FabScalar toolset → IPC, frequency, power → performance of every phase on every design point
• Search with N = 1, 2, 3, 4 → optimal 1-, 2-, 3-, and 4-core-type HCMP

N: number of core types

Search for Optimal 4-core-type HCMP

BIPS of each phase on each core type:

Phase   A     B     C     D     E     F     G     H
1       1.5   3.2   1.3   2.2   1.6   1.7   1.3   2.0
2       0.5   2.3   2.5   1.9   3.1   1.8   2.0   1.2

Candidate 4-core-type HCMPs. Each phase is credited with its best BIPS among the four selected cores, and the performance of the HCMP is the harmonic mean across phases:

Core 1   Core 2   Core 3   Core 4   Performance
A        B        C        D        HMEAN(3.2, 2.5) = 2.81
E        B        C        D        HMEAN(3.2, 3.1) = 3.15
A        F        C        D        HMEAN(2.2, 2.5) = 2.34
E        F        C        D        HMEAN(2.2, 3.1) = 2.57
E        F        G        H        HMEAN(2.0, 3.1) = 2.43
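
The scoring in this example can be sketched in a few lines. The BIPS values are the ones from the table; the exhaustive enumeration with `itertools.combinations` is an assumption made for illustration, since the talk does not say how the search is implemented, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import harmonic_mean

# BIPS of each phase on each core type, from the example table.
BIPS = {
    1: {"A": 1.5, "B": 3.2, "C": 1.3, "D": 2.2, "E": 1.6, "F": 1.7, "G": 1.3, "H": 2.0},
    2: {"A": 0.5, "B": 2.3, "C": 2.5, "D": 1.9, "E": 3.1, "F": 1.8, "G": 2.0, "H": 1.2},
}
CORE_TYPES = sorted(BIPS[1])


def hcmp_performance(cores):
    """Harmonic mean over phases of the best BIPS available among `cores`."""
    return harmonic_mean([max(phase[c] for c in cores) for phase in BIPS.values()])


def search(n):
    """Score every n-core-type combination and return one with the top score."""
    return max(combinations(CORE_TYPES, n), key=hcmp_performance)


print(round(hcmp_performance("ABCD"), 2))  # 2.81, matching the first row
best = search(4)
print(best, round(hcmp_performance(best), 2))
```

With only the two example phases, any combination that contains both B and E ties for the best score, so the particular tuple returned is not meaningful; the real search scores all 39 phases.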


Kiviat diagram

• Visualize core parameters

[Figure: Kiviat diagram with three axes: Frequency (higher frequency), Window (larger structures), Width (increased superscalar width)]

Optimal 1-core-type HCMP

[Figure: Kiviat diagram (Frequency, Window, Width) showing core A]

• The “A” core is an average core that strikes a good balance between IPC and frequency.

Optimal 2-core-type HCMP

[Figure: Kiviat diagram (Frequency, Window, Width) showing cores A and LW]

• The “A” core is still selected!
• The “LW” (larger, wider) core targets the window and width bottlenecks in the “A” core.

Optimal 3-core-type HCMP

[Figure: Kiviat diagram (Frequency, Window, Width) showing cores A, LW, and N]

• The “A” core is still selected!
• The “LW” core is still selected.
• The “N” core targets the frequency bottleneck.

Optimal 4-core-type HCMP

[Figure: Kiviat diagram (Frequency, Window, Width) showing cores A, L, W, and N]

• “A” and “N” are selected again.
• “LW” is split into “L” and “W”, each addressing its bottleneck better!

LW split

[Figure: Kiviat diagram (Frequency, Window, Width) showing cores A, LW, L, and W]

Optimal HCMP

The optimal HCMP consists of:
1. An average core, which is the best homogeneous core
2. Accelerator cores that relieve distinct bottlenecks in the average core

Core type   Clock period (ns)   ILP-extracting buffers   Widths   Caches (KB)
A           0.6                 32, 128, 128             3, 4     64, 64
N           0.5                 32, 64, 64               2, 2     16, 16
L           0.7                 48, 128, 384             4, 4     128, 128
W           0.7                 32, 128, 128             6, 6     128, 32


APPLICATION STEERING


Bottleneck-driven steering

• The application is continuously diagnosed for bottlenecks on the current core using performance counters
• Migrate to a different core when the bottlenecks change
  – To an accelerator core that relieves at least one diagnosed bottleneck and doesn’t worsen any diagnosed bottleneck
  – To the average core if no accelerator meets this condition, or if there are no bottlenecks


Bottleneck-driven steering: steps

1. Track performance counters
2. Diagnose bottlenecks
3. Steer phase

Track performance counters

Counter      Description
Width_ctr    Ready instruction not issued due to limited issue width.
Window_ctr   Instruction not dispatched due to issue queue or reorder buffer full.
I$_ctr       Instruction stalled due to instruction cache miss.
D$_ctr       Load instruction stalled due to data cache miss.
Misp_ctr     Mispredicted branch.
L2_ctr       Instruction stalled due to L2 cache miss.
Cycle_ctr    Number of cycles.


Diagnose bottlenecks

• Every 10K instructions, evaluate bottlenecks using performance counters and thresholds
• Performance counters are normalized with respect to the cycle count
• If a normalized performance counter value is above its threshold, then the corresponding resource is a bottleneck

Bottleneck       Expression
bool Width       Width = (Width_ctr > Width_thresh)
bool Window      Window = (Window_ctr > Window_thresh)
bool Frequency   Frequency = (Misp_ctr > Misp_thresh) || (L2_ctr > L2_thresh)
bool I$          I$ = (I$_ctr > I$_thresh)
bool D$          D$ = (D$_ctr > D$_thresh)

Thresholds are determined empirically using a training process.
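
A minimal sketch of the diagnosis step, assuming the counters for one 10K-instruction interval arrive as a dictionary. The dictionary keys, the function name `diagnose`, and all threshold values are placeholders for illustration; the talk only states that thresholds come from an empirical training process.

```python
# Threshold values are placeholders: the talk says only that thresholds are
# determined empirically, so these numbers are hypothetical.
HYPOTHETICAL_THRESHOLDS = {
    "width": 0.30, "window": 0.30, "misp": 0.01,
    "l2": 0.05, "icache": 0.05, "dcache": 0.10,
}


def diagnose(counters: dict, thresholds: dict = HYPOTHETICAL_THRESHOLDS) -> dict:
    """Return the bottleneck flags for one 10K-instruction interval."""
    cycles = counters["cycle"]
    # Normalize every event counter with respect to the cycle count.
    norm = {name: count / cycles for name, count in counters.items() if name != "cycle"}
    return {
        "Width": norm["width"] > thresholds["width"],
        "Window": norm["window"] > thresholds["window"],
        "Frequency": norm["misp"] > thresholds["misp"] or norm["l2"] > thresholds["l2"],
        "I$": norm["icache"] > thresholds["icache"],
        "D$": norm["dcache"] > thresholds["dcache"],
    }


# Made-up counter readings for a single interval.
sample = {"cycle": 8000, "width": 3200, "window": 400,
          "misp": 20, "l2": 80, "icache": 40, "dcache": 160}
print(diagnose(sample))
```

In this made-up interval only the normalized width counter (0.4) exceeds its placeholder threshold, so Width is the sole diagnosed bottleneck.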


Steer phase

Core   Bottlenecks relieved   Bottlenecks worsened   Steering logic
W      Width                  Frequency              if (Width && !Frequency) → W
L      Window                 Frequency              else if (Window && !Frequency) → L
N      Frequency              Width, Window          else if (Frequency && !(Width || Window)) → N
A      n/a                    n/a                    else → A

The paper shows the full steering logic with I$ and D$ bottlenecks included.
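
The simplified steering table translates directly into a chain of conditionals. The sketch below covers only the Width, Window, and Frequency flags shown on the slide; the I$ and D$ cases from the paper are omitted, and the function name `steer` is hypothetical.

```python
def steer(width: bool, window: bool, frequency: bool) -> str:
    """Pick the core type for the next segment from the diagnosed bottlenecks.

    Mirrors the simplified steering table; the I$ and D$ cases are omitted.
    """
    if width and not frequency:
        return "W"
    if window and not frequency:
        return "L"
    if frequency and not (width or window):
        return "N"
    return "A"  # no bottleneck, or no accelerator qualifies


cases = [(True, False, False), (False, True, False),
         (False, False, True), (True, False, True)]
for flags in cases:
    print(flags, "->", steer(*flags))
```

The last case shows the fallback: when both Width and Frequency are flagged, W would worsen frequency and N would worsen width, so the application stays on (or returns to) the average core.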


RESULTS


Methodology

• Benchmarks: SPEC 2000
  – Simulate the first 4 billion instructions
• Metrics
  – Performance: BIPS
  – Efficiency: BIPS³/Watt
• Migration overhead
  – Default: 100 cycles
  – Sensitivity study: 1K, 10K cycles
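
For reference, the two metrics can be computed as follows. This is a sketch under the assumption that a run is summarized by its instruction count, cycle count, clock period, and average power; the function names and the example numbers are hypothetical.

```python
def bips(instructions: float, cycles: float, clock_period_ns: float) -> float:
    """Billions of instructions per second."""
    seconds = cycles * clock_period_ns * 1e-9
    return instructions / seconds / 1e9


def efficiency(bips_value: float, watts: float) -> float:
    """BIPS^3 / Watt: cubing BIPS weights performance more heavily than power."""
    return bips_value ** 3 / watts


# Hypothetical run: 4e9 instructions in 3e9 cycles at a 0.6 ns clock, 10 W.
perf = bips(4e9, 3e9, 0.6)
print(round(perf, 3), round(efficiency(perf, 10.0), 3))
```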


Steering algorithms

Algorithm    Description
Baseline     Run the entire 4B instructions on the average core
Sampling     Run on each core type for the sampling interval and then on the best core type for the switching interval
Bottleneck   Run the current 10K-instruction segment based on the bottlenecks of the prior 10K segment
Optimal      Run every 10K-instruction segment on the best core type of the prior 10K segment
Oracle       Run every 10K-instruction segment on the best core type
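
To make the difference between the Baseline, Optimal, and Oracle policies concrete, the sketch below applies them to a made-up trace. The per-segment BIPS values are hypothetical, migration overhead is ignored, and starting the Optimal policy on the average core for the first segment is an assumption, since the table does not say what happens before a prior segment exists.

```python
from statistics import harmonic_mean

# Hypothetical BIPS of four consecutive 10K-instruction segments per core type.
SEGMENTS = [
    {"A": 2.0, "N": 1.5, "L": 2.5, "W": 2.2},
    {"A": 2.0, "N": 1.6, "L": 2.6, "W": 2.1},
    {"A": 1.8, "N": 2.4, "L": 1.5, "W": 1.6},
    {"A": 1.9, "N": 2.5, "L": 1.4, "W": 1.7},
]


def best(segment):
    """Core type with the highest BIPS for one segment."""
    return max(segment, key=segment.get)


def baseline(segments):
    return ["A"] * len(segments)


def oracle(segments):
    return [best(s) for s in segments]


def optimal(segments, start="A"):
    """Best core of the prior segment; the first segment runs on `start` (assumed)."""
    return [start] + [best(s) for s in segments[:-1]]


def overall_bips(segments, choices):
    # Segments have equal instruction counts, so overall BIPS is the harmonic
    # mean of per-segment BIPS. Migration cost is ignored in this sketch.
    return harmonic_mean([s[c] for s, c in zip(segments, choices)])


for policy in (baseline, optimal, oracle):
    picks = policy(SEGMENTS)
    print(policy.__name__, picks, round(overall_bips(SEGMENTS, picks), 2))
```

In this trace the Optimal policy mispredicts at the phase change (it keeps the third segment on L), which is the gap that separates a last-segment predictor from the Oracle.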


4-core-type HCMP

• The 4-core-type HCMP outperforms the homogeneous CMP by up to 76%, and by 15% on average
• Our steering algorithm captures most of this gain

Sampling vs. bottleneck steering

• Sampling performs 8.9% better than the average core
• Bottleneck steering performs 12% better than the average core

Occupancy


Occupancy pattern varies dramatically across different applications


Efficiency

• Sampling performs 25% better than the average core
• Bottleneck steering performs 33% better than the average core


SUMMARY


• First proposal to architect and orchestrate multiple core types for latency reduction.

• With N core types, the optimal HCMP consists of an average core type coupled with N-1 accelerator core types.

• In the complementary steering algorithm, the application is continuously diagnosed for bottlenecks and is migrated to the core type which relieves the bottlenecks.


Future work

• HCMPs open up a whole new direction of microarchitecture research.

• Many microarchitecture optimizations don’t provide universal benefits.

• As each core-type targets a narrow workload space, HCMP provides a great platform to reconsider these optimizations.
