sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author:...

28
Asymmetry-aware execution placement on manycore chips Alexey Tumanov Joshua Wise, Onur Mutlu, Greg Ganger CARNEGIE MELLON UNIVERSITY

Transcript of sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author:...

Page 1: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Asymmetry-aware execution placement on

manycore chips Alexey Tumanov

Joshua Wise, Onur Mutlu, Greg Ganger

CARNEGIE MELLON UNIVERSITY

Page 2: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Introduction: Core Scaling

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 2

•  Moore’s Law continues: can still fit more transistors on a chip

?

Page 3: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Introduction: Memory Controllers

•  Multi-core multi-MC chips gain prevalence •  But, now latency is NOT:

•  uniform memory access latency (UMA) •  hierarchically non-uniform memory latency (NUMA)

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 3

core MC

Page 4: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

NUMA •  NUMA: symmetrically non-uniform

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 4

crossbar 2x 2x 2x 2x

2x 2x 2x 2x

x x 2x 2x

x x 2x 2x

bottom left MC access from all cores

Page 5: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

ANUMA •  ANUMA: asymmetrically non-uniform

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 5

bottom left MC access from all cores

core MC

Page 6: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

NUMA vs. ANUMA •  Latency & bandwidth asymmetries in multi-

MC manycore NoC systems are different •  [numa] sharp division into local/remote that dwarfs

other intra-node asymmetries •  [anuma] gradual continuum of access latencies

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 6

Page 7: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Problem Statement •  MC access asymmetries can result in

suboptimal placement w.r.t. MC access patterns

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 7

Page 8: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Motivating Experiments •  “Bare metal” latency differential ~ 14%

•  does it matter? •  GCC: 1 thread, 1 MC, 1 core at a time, tile linux

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 8

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

1.1e+07

1.12e+07

1.14e+07

1.16e+07

1.18e+07

1.2e+07

1.22e+07

1.24e+07

1.1e+07

1.12e+07

1.14e+07

1.16e+07

1.18e+07

1.2e+07

1.22e+07

1.24e+07

0 2 4 6 8 10 12

GC

C r

untim

e (µ

s)

Manhattan distance to MC

14%

not 0

Page 9: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Motivation •  Latency differential gets much worse

•  14% figure is NOT an upper bound •  Major factor: contention on the NoC

•  NoC node and link contention slows down packets •  amplifying effect on “bare metal” asymmetries

•  Examples: •  up to 100% latency differential in a simulator

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 9

Page 10: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Exploring Solution Space: NUMA •  Designate grid quadrants into NUMA zones (+) a well-explored approach (+) kernel code readily available (-) lacks flexibility (-) contention avoidance

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 10

Page 11: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Solution Space: data placement •  OS allocates physical frames in a way that:

•  balances the NoC contention •  optimizes some objective function •  maximizes throughput

(+) proactive (-) no crystal ball: which pages will be hot? (-) once allocated, can’t be easily undone

•  inter-MC page migration worsens NoC contention

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 11

Page 12: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Solution Space: thread placement •  Move execution closer to MCs used (-) waits for the problem to occur before acting (-) cache thrashing (+) better-informed decisions are possible

•  MC usage statistics collection

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 12

Page 13: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Solution Space: hybrid •  Combine execution and data placement

•  allocate frames, then adjust execution placement •  place execution, then adjust data placement

(+) Corrective action possible (-) Complexity

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 13

Page 14: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

MC Usage Stats Collection •  Interface:

•  procfs control handle to start/stop collection •  procfs data file for stats data extraction

– pages accessed per MC per sampling cycle •  Implementation

•  instrumented VM to page fault •  each page fault is attributed to corresponding MCi

•  very inefficient, but does work •  adjustable sampling frequency (20Hz)

– trades off overhead vs. accuracy

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 14

Page 15: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Placement Algorithm •  Place threads nearest to MCs they use

•  greedily based on memory intensity •  Weighted geometric placement algorithm

•  calculate thread’s desired coordinates as a function of – MC coordinates – thread’s access count vector (L1 normal)

•  Placement collisions resolved •  find second choices in direction of preferred MCs

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 15

Page 16: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Bandwidth Performance Results

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 16

250

300

350

400

450

500

550

(%'" BC)A#?@A+# ()*+"

<1+%'> <=1+%'>'

,-./0

!1!(2'"31456789

!$&$;*;'!"#$%&'

!%:$;*;'

(%'" BC)A#?@A+# ()*+"Base

Page 17: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Bandwidth Performance Results

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 17

250

300

350

400

450

500

550

(%'" BC)A#?@A+# ()*+"

<1+%'> <=1+%'>'

,-./0

!1!(2'"31456789

!$&$;*;'!"#$%&'

!%:$;*;'

(%'" BC)A#?@A+# ()*+"Base Overhead

Page 18: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Bandwidth Performance Results

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 18

250

300

350

400

450

500

550

(%'" BC)A#?@A+# ()*+"

<1+%'> <=1+%'>'

,-./0

!1!(2'"31456789

!$&$;*;'!"#$%&'

!%:$;*;'

(%'" BC)A#?@A+# ()*+"Base Ovrhd Weighted

Page 19: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Bandwidth Performance Results

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 19

250

300

350

400

450

500

550

(%'" BC)A#?@A+# ()*+"

<1+%'> <=1+%'>'

,-./0

!1!(2'"31456789

!$&$;*;'!"#$%&'

!%:$;*;'

(%'" BC)A#?@A+# ()*+"Base Ovrhd Wghtd Brute

Page 20: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Bandwidth Performance Results

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 20

250

300

350

400

450

500

550

(%'" BC)A#?@A+# ()*+"

<1+%'> <=1+%'>'

,-./0

!1!(2'"31456789

!$&$;*;'!"#$%&'

!%:$;*;'

(%'" BC)A#?@A+# ()*+"Base Ovrhd Wghtd Brute Base

Page 21: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Bandwidth Performance Results

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 21

250

300

350

400

450

500

550

(%'" BC)A#?@A+# ()*+"

<1+%'> <=1+%'>'

,-./0

!1!(2'"31456789

!$&$;*;'!"#$%&'

!%:$;*;'

(%'" BC)A#?@A+# ()*+"Base Ovrhd Wghtd Brute Base Ovrhd Wghtd Brute

Page 22: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Runtime Performance Results

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 22

4e+06

5e+06

6e+06

7e+06

8e+06

9e+06

1e+07

1.1e+07

1.2e+07

1.3e+07

N 0 1 2 N 0 1 2 N 0 1 2 N 0 1 2 N 0 1 2

1 threads 2 threads 4 threads 8 threads 16 threads

Mic

rose

cond

s to

com

plet

e be

nch2

MinimumsMedians

Maximums

ben

ch2

runt

ime

(us)

B O WG Br

Page 23: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Runtime Performance Results

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 23

4e+06

5e+06

6e+06

7e+06

8e+06

9e+06

1e+07

1.1e+07

1.2e+07

1.3e+07

N 0 1 2 N 0 1 2 N 0 1 2 N 0 1 2 N 0 1 2

1 threads 2 threads 4 threads 8 threads 16 threads

Mic

rose

cond

s to

com

plet

e be

nch2

MinimumsMedians

Maximums

ben

ch2

runt

ime

(us)

B O WG Br B O WG Br

Page 24: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Runtime Performance Results

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 24

4e+06

5e+06

6e+06

7e+06

8e+06

9e+06

1e+07

1.1e+07

1.2e+07

1.3e+07

N 0 1 2 N 0 1 2 N 0 1 2 N 0 1 2 N 0 1 2

1 threads 2 threads 4 threads 8 threads 16 threads

Mic

rose

cond

s to

com

plet

e be

nch2

MinimumsMedians

Maximums

ben

ch2

runt

ime

(us)

B O WG Br B O WG Br B O WG Br B O WG Br B O WG Br

Page 25: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Takeaways •  Promising, but stat collection overhead is high •  Call for per-core MC stats counters

•  in hardware •  Works despite high cost of migration

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 25

Page 26: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Future Work •  More experimentation to understand benefits

•  as a function of application properties •  heterogeneous workloads

•  ANUMCA •  exploring how these same ideas apply to cache •  cache access is asymmetrically non-uniform too

•  Optimizing the application-level goal •  SLAs, multi-threaded app. performance goals

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 26

Page 27: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

Conclusion •  Promising, but stat collection overhead is high •  Call for per-core MC stats counters •  Works despite high cost of migration

•  Questions?

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 27

Page 28: sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author: Alexey Tumanov Created Date: 4/16/2013 12:02:35 AM

References [1] Alexey Tumanov, Joshua Wise, Onur Mutlu, Greg Ganger.

“Asymmetry-Aware Execution Placement on Manycore Chips. In Proc. of SFMA’13, EuroSys 2013, Czech Republic.

[2] Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi. “Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems”. In Proc. of HPCA’2013.

[3] Tilera Corporation, “The Processor Architecture Overview for the TILEPro Series”, Feb 2013. [pdf]

Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 28