A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality...

22
A Template Library to Integrate Thread Scheduling and Locality Management for NUMA Multiprocessors Zoltán Majó Thomas R. Gross Computer Science Department ETH Zurich, Switzerland

Transcript of A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality...

Page 1: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

A Template Library to Integrate

Thread Scheduling and Locality Management

for NUMA Multiprocessors

Zoltán Majó Thomas R. Gross Computer Science Department ETH Zurich, Switzerland

Page 2: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Non-uniform memory architecture

Local memory accesses

bandwidth: 10.1 GB/s

latency: 190 cycles

Remote memory accesses

bandwidth: 6.3 GB/s

latency: 310 cycles

2

Processor 1

Core 4 Core 5

Core 6 Core 7

IC MC

DRAM

Processor 0

Core 0 Core 1

Core 2 Core 3

MC IC

DRAM

Key to good performance: data locality

All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO ’09], Molka [PACT ‘09])

Core 0

Page 3: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Remote memory references (2 processors)

3

0%

10%

20%

30%

40%

50%

60%

Subset of the PARSEC benchmark suite

Remote memory references / total memory references [%]

Page 4: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Outline

Introduction

Ferret

Memory behavior

Optimizing for data locality

Evaluation

Template library

Concluding remarks

4

Page 5: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Experimental setup

2-processor 8-core Intel Xeon

Nehalem microarchitecture

Linux + perfmon2

First-touch page placement

Ferret: pipeline parallelism

5

Stage 1 Stage 4 Input Output Stage 3 Stage 2

T1 T1

T1 T1

T1 T1

T1 T1

T2 T2

T2 T2

T2 T2

T2 T2

T3 T3

T3 T3

T3 T3

T3 T3

T4 T4

T4 T4

T4 T4

T4 T4

T-OUT T-IN

Processor 1

DRAM

Processor 0

DRAM

Page 6: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Profiling memory accesses

Data address profiling

Based on hardware-performance monitoring

Consider only heap accesses

6

Profile

Page 0 : 1000 accesses by Processor 0

Page 1 : 3000 accesses by Processor 1

Processor 1

DRAM

Processor 0

DRAM

Page 0 Page 1

Page 7: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Profiling memory accesses

Data address profiling

Based on hardware-performance monitoring

Consider only heap accesses

7

Profile

Page 0 : 1000 accesses by Processor 0

Page 1 : 3000 accesses by Processor 1

Processor 1

DRAM

Processor 0

DRAM

Page 2 : 4000 accesses by Processor 0 5000 accesses by Processor 1

Page 2

Ferret: 41% of profiled pages inter-processor shared

Page 8: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Index

Detailed look

Ferret: similarity search of images

Input: set of images

O utput: list of images similar to input

Index stage: most memory system intensive

Produces 90% of total memory bandwidth

8

Segment Extract Output Rank Input

img/butterfly.jpg img/mountain.jpg img/flower.jpg img/rose.jpg img/flower.jpg img/egg.jpg

Image database

query candidate set

Index

Page 9: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Database System

Image ID Features

1 ... 2 ... 3 ... 4 ... 5 ... 6 ... ... ... N-1 ... N ...

1 ...

4 ...

Locality-sensitive hashing

9

Image data

Hash value

1 2 3 4 5 6 ... H-1 H

Data index

Index

6

N ...

6 ...

Page 10: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Database System

Image ID Features

1 ... 2 ... 3 ... 4 ... 5 ... 6 ... ... ... N-1 ... N ...

Hash value

1 2 3 4 5 6 ... H-1 H

Data index

1 ...

4 ...

Index

6

N ...

6 ...

Locality-sensitive hashing

10

Index

Image data

4

1 ...

4 ...

N-1 ...

Page 11: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Bad data locality

Data accesses and parallelization decoupled

Database interface hides details about internal structure

All threads running the indexing phase access database

Optimizing for data locality

Precise control of data and computation allocation within database

Example: 2-processor system

11

Page 12: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Database System

Image ID Features

1 ... 2 ... 3 ... 4 ... ... ... N-1 ... N ...

Optimizing for data locality

12

Image data

Hash value

1 2 3 4 5 6 ... H-1 H

Data Index

Page 13: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Database System

Image ID Features

1 ... 2 ... ... ... N/2 ...

Optimizing for data locality

13

Image data: CPU 0

Hash value

1 2 3 4 5 6 ... H-1 H

Data Index

Image ID Features

N/2 + 1 ... ... ... N-1 ... N ...

Image data: CPU 1

Thread pool

T0 T0

T0 T0

CPU 0

T1 T1

T1 T1

CPU 1

Page 14: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Database System

Image ID Features

1 ... 2 ... ... ... N/2 ...

Optimizing for data locality

14

Image data: CPU 0

Hash value

1 2 3 4 5 6 ... H-1 H

Data Index

Image ID Features

N/2 + 1 ... ... ... N-1 ... N ...

Image data: CPU 1

Thread pool

T0 T0

T0 T0

CPU 0

T1 T1

T1 T1

CPU 1

Query dispatch

Index

W0 W1

6

N/2 ...

1 ...

N/2 + 1 ...

N ...

6 6

Page 15: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Evaluation

Two machines

2-processor 8-core Intel Nehalem (Xeon E5520)

4-processor 32-core Intel Westmere (Xeon E7-4830)

Two image database sizes

small: ~60 K images

large: ~744 K images

3500 image queries in both cases

Compare two configurations

default: first-touch page allocation, affinity scheduling

NUMA-aware memory allocation and computation scheduling

15

Page 16: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Data locality

0%

10%

20%

30%

40%

50%

60%

70%

80%

small large small large

default

NUMA-aware

16

Remote memory references / total memory references [%]

8-core machine 32-core machine

Page 17: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Performance

-10%

0%

10%

20%

30%

40%

50%

60%

70%

small large small large

17

Performance improvement of NUMA-aware over default [%]

8-core machine 32-core machine

Page 18: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Optimizing for data locality – summary

Optimizing data locality in ferret difficult

System developer: internals of shared database must be understood

Database developer: system must be understood

Data structures often key to good NUMA performance

Template library: basis for NUMA-aware data structures

18

Page 19: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Template library

Base class

Per-processor data allocation

Locality-aware task dispatch

19

SplittableData<Data,Result>

Data newAtProcessor(p:Processor)

Result dispatch(queryTask:Task)

Page 20: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

20

abstract class SplittableData<Data,Result> {

ThreadPool threadPool;

Map<Processor,Data> map;

Data newAtProcessor(Processor p) {

// new Data instance at Processor p

// record mapping Processor → Data

}

Result dispatch(Task queryTask) {

List<Result> results;

Data localData;

for (Processor p : processors) {

localData = map.get(p);

threadPool.submit(queryTask, p, localData);

}

for (Processor p : processors)

results.add(threadPool.get(p));

return Result.merge(results);

}

}

Template library

Page 21: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Concluding remarks

Some parallel programs NUMA-unfriendly by construction

Good performance: programmer intervention required

Template library to abstract low-level system details

21

Page 22: A Template Library to Integrate Thread Scheduling and ... · Key to good performance: data locality All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO

Thank you for your attention!

22