TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS

Handong Ye, Robert Pavel, Aaron Landwehr, Guang R. Gao
Department of Electrical & Computer Engineering, University of Delaware
MTAAP'2010, Atlanta, Georgia, 2010-04-23



Page 2: Introduction

• Modern OSes are based upon a sequential execution model (the von Neumann model).
• Rapid progress of multi-core/many-core chip technology.
  – Parallel computer systems are now implemented on single chips.

Page 3: Introduction (cont.)

• The conventional OS model must adapt to the underlying changes.
• Further exploit the many levels of parallelism.
  – Hardware as well as software.
• We introduce a study of how to do this adaptation for the IBM BlueGene/P multi-core system.

Page 4: Outline

• Introduction
• Contributions
• TNT on BlueGene/P
• Scheduling TNT across nodes
• Synchronization across nodes
• TNT Distributed Shared Memory
• Results
• Conclusions and Future Work

Page 5: Contributions

• Isolated traditional OS functions to a single core of the BG/P multi-core chip.
• Ported the TiNy Threads (TNT) execution model to allow for further utilization of the BG/P compute cores.
• Expanded the design framework to a multi-chip system designed for scalability to a large number of chips.


Page 7: TiNy Threads on BG/P

• TiNy Threads
  – Lightweight, non-preemptive threads.
  – API similar to POSIX Threads.
  – Originally presented in "TiNy Threads: A Thread Virtual Machine for the Cyclops64 Cellular Architecture".
  – Runs on IBM Cyclops64.
• Kernel Modifications
  – Alterations to the thread scheduler to allow for non-preemption.


Page 9: Multinode Thread Scheduler

• The thread scheduler allows TNT to run across multiple nodes.
  – Requests are facilitated through the RPCs of IBM's Deep Computing Messaging Framework (DCMF).
• Multiple scheduling algorithms
  – Workload-unaware
    • Random
    • Round-Robin
  – Workload-aware

Page 10: Multinode Thread Scheduling

[Diagram: Node A calls tnt_create(), which returns a thread id (tid) for a thread placed on Node B; Node A later calls tnt_join(), which completes once the remote thread calls tnt_exit().]


Page 12: Synchronization

• Three forms
  – Mutexes
  – Thread joining
  – Barriers
• Similar to thread scheduling
  – Lock requests, join requests, and barrier notifications are sent to the node responsible for that synchronization object.

Page 13: Multinode Thread Scheduling

[Diagram: a thread with id tid runs on Node B while Node A calls tnt_join(tid); when the remote thread calls tnt_exit(), a notification is sent back to Node A to complete the join.]


Page 15: Characteristics of TDSM

• Provides one-sided access to memory distributed among nodes, through IBM's DCMF.
• Allows for virtual address manipulation.
  – Maps distributed memory to a single virtual address space.
  – Allows for array indexing and memory offsets.
• Scalable to a variety of applications.
  – The size of the desired global shared memory is set at runtime.
• Mutability
  – The memory allocation and memory distribution algorithms can be easily altered and/or replaced.

Page 16: Example of TDSM

tdsm_read(global[15], local, 20*sizeof(int));

[Diagram: the global array is mapped at virtual base address 0x00040012 and distributed in blocks of 15 ints across Node 5 (global[0..14]), Node 6 (global[15..29]), and Node 7 (global[30..44]). The read covers global[15] to global[34], i.e. virtual addresses 0x0004004E to 0x0004009A; TDSM fetches Node 6's local elements 0 to 14 and Node 7's local elements 0 to 4 into the local buffer.]


Page 18: Summary of Results

• The TNT thread system shows speedup comparable to that of Pthreads running on the same hardware.
• The distributed shared memory operates at 95% of the experimental peak performance of the network, and distance between nodes is not a sensitive factor.
• The total cost of thread creation grows linearly as the number of threads increases.
• The cost of waiting at a barrier is constant and independent of the number of threads involved.

Page 19: Single-Node Thread System Performance

• Based upon the radix-2 Cooley-Tukey algorithm, with the Kiss FFT library for the underlying DFT.
• The underlying TNT thread model performs comparably to the POSIX standard when the number of threads does not exceed the number of available processor cores.

[Chart: absolute speed-up for two threads (y-axis, 0 to 2.5) vs. FFT size (x-axis, 1024 to 32768), for TNT and POSIX.]

Page 20: Memory System Performance

• Reads and writes of varying sizes between one and two nodes.
• For inter-node communications, data can be transferred at approximately 357 MB/s.
• Kumar et al. determined the experimental peak performance on BG/P to be 374 MB/s in their ICS'08 paper.

[Chart: latency in seconds (y-axis, 0 to 0.003) vs. transfer size in bytes (x-axis, 0 to 1048576), for local read, local write, remote read, and remote write.]

Page 21: Memory System Performance (cont.)

• The size of each read/write is a function of the number of nodes across which the data is distributed.
• Latency increases linearly as the amount of data increases, regardless of the distance between nodes.

[Chart: latency in seconds (y-axis, 0 to 0.006) vs. number of nodes (x-axis, 0 to 1024), for read and write.]

Page 22: Multinode Thread Creation Cost

• Approximately 0.2 seconds per thread.
• The per-thread cost remained effectively constant, so total creation time grows linearly with the number of threads.

[Chart: total creation time in seconds (y-axis, 0 to 120) vs. number of threads (x-axis, 0 to 600).]

Page 23: Synchronization Costs

• The performance of the barrier is effectively a constant 0.2 seconds.

[Chart: time in seconds (y-axis, 0.19990 to 0.20004) vs. number of threads included in the barrier (x-axis, 100 to 1000).]


Page 25: Conclusions and Future Work

• Proven feasibility of the system.
• Benefits of an execution-model-driven approach.
• Room for improvement:
  – Improvements to the kernel
  – More rigorous benchmarks
  – Improved allocation and scheduling algorithms

Page 26: Thank You

Page 27: Bibliography

• J. del Cuvillo, W. Zhu, Z. Hu, and G. R. Gao, "TiNy Threads: A thread virtual machine for the Cyclops64 cellular architecture," Parallel and Distributed Processing Symposium, International, vol. 15, p. 265b, 2005.
• S. Kumar, G. Dozsa, G. Almasi et al., "The Deep Computing Messaging Framework: generalized scalable message passing on the Blue Gene/P supercomputer," in ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing. New York, NY, USA: ACM, 2008, pp. 94–103.
• M. Borgerding, "Kiss FFT," December 2009, http://sourceforge.net/projects/kissfft/.