TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS


MTAAP’2010 1, Atlanta, Georgia

TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS
Handong Ye, Robert Pavel, Aaron Landwehr, Guang R. Gao
Department of Electrical & Computer Engineering, University of Delaware
2010-04-23

MTAAP’2010 2

Introduction
• Modern OS based upon a sequential execution model (the von Neumann model).
• Rapid progress of multi-core/many-core chip technology.
 – Parallel computer systems now implemented on single chips.

MTAAP’2010 3

Introduction
• Conventional OS model must adapt to the underlying changes.
• Further exploit the many levels of parallelism.
 – Hardware as well as software
• We introduce a study on how to do this adaptation for the IBM BlueGene/P multi-core system.

MTAAP’2010 4

Outline
• Introduction
• Contributions
• TNT on BlueGene/P
• Scheduling TNT across nodes
• Synchronization across nodes
• TNT Distributed Shared Memory
• Results
• Conclusions and Future Work

MTAAP’2010 5

Contributions
• Isolated traditional OS functions to a single core of the BG/P multi-core chip.
• Ported the TiNy Thread (TNT) execution model to allow for further utilization of BG/P compute cores.
• Expanded the design framework to a multi-chip system designed for scalability to a large number of chips.

MTAAP’2010 6

Outline
• Introduction
• Contributions
• TNT on BlueGene/P
• Scheduling TNT across nodes
• Synchronization across nodes
• TNT Distributed Shared Memory
• Results
• Conclusions and Future Work

MTAAP’2010 7

TiNy Threads on BG/P
• TiNy Threads
 – Lightweight, non-preemptive threads
 – API similar to POSIX Threads (see the sketch below)
 – Originally presented in “TiNy Threads: A Thread Virtual Machine for the Cyclops64 Cellular Architecture”
 – Runs on IBM Cyclops64
• Kernel Modifications
 – Alterations to the thread scheduler to allow for non-preemption
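To make the POSIX-like flavor of the API concrete, here is a minimal usage sketch. The names tnt_create(), tnt_join(), and tnt_exit() come from the slides; the signatures and the tnt_t handle type are assumptions modeled on Pthreads, not taken from the TNT headers.

    #include <stdio.h>

    typedef unsigned int tnt_t;   /* assumed thread-handle type */

    extern int  tnt_create(tnt_t *tid, void *(*fn)(void *), void *arg); /* assumed signature */
    extern int  tnt_join(tnt_t tid);                                    /* assumed signature */
    extern void tnt_exit(void *retval);                                 /* assumed signature */

    static void *worker(void *arg)
    {
        printf("hello from TNT thread %d\n", *(int *)arg);
        tnt_exit(NULL);                   /* terminate this non-preemptive thread */
        return NULL;
    }

    int main(void)
    {
        tnt_t tid;
        int rank = 0;
        tnt_create(&tid, worker, &rank);  /* spawn a lightweight TNT thread */
        tnt_join(tid);                    /* block until it calls tnt_exit() */
        return 0;
    }

Because TNT threads are non-preemptive, a thread runs until it yields or calls tnt_exit(); the runtime never interrupts it.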

MTAAP’2010 8

Outline
• Introduction
• Contributions
• TNT on BlueGene/P
• Scheduling TNT across nodes
• Synchronization across nodes
• TNT Distributed Shared Memory
• Results
• Conclusions and Future Work

MTAAP’2010 9

Multinode Thread Scheduler
• The thread scheduler allows TNT to run across multiple nodes.
 – Requests are facilitated through IBM’s Deep Computing Messaging Framework (DCMF) RPCs.
• Multiple scheduling algorithms (see the sketch below)
 – Workload-unaware
   • Random
   • Round-Robin
 – Workload-aware
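As a rough illustration of the workload-unaware policies named above, the sketch below picks a destination node for a new thread. pick_node() and its parameters are hypothetical; in the real system the creation request would then be shipped to the chosen node over a DCMF RPC.

    #include <stdlib.h>

    static int next_node = 0;                 /* round-robin cursor */

    /* Choose the node that will host a newly created thread. */
    int pick_node(int num_nodes, int use_random)
    {
        if (use_random)
            return rand() % num_nodes;        /* Random policy */
        int n = next_node;                    /* Round-Robin policy */
        next_node = (next_node + 1) % num_nodes;
        return n;
    }

A workload-aware policy would instead consult per-node state, such as run-queue length, before choosing.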

MTAAP’2010 10

Multinode Thread Scheduling
[Diagram: Node A calls tnt_create(), which places the new thread on Node B and returns a tid to Node A; Node A later calls tnt_join(tid), and when the thread on Node B calls tnt_exit(), the join on Node A completes.]

MTAAP’2010 11

Outline
• Introduction
• Contributions
• TNT on BlueGene/P
• Scheduling TNT across nodes
• Synchronization across nodes
• TNT Distributed Shared Memory
• Results
• Conclusions and Future Work

MTAAP’2010 12

Synchronization
• Three forms
 – Mutex
 – Thread joining
 – Barrier
• Similar to thread scheduling
 – Lock requests, join requests, and barrier notifications are sent to the node responsible for that synchronization object (see the sketch below).
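The sketch below illustrates the owner-node pattern for the barrier case: each participant sends a notification RPC to the node that owns the barrier, and the owner releases everyone once all have arrived. The struct and both function names are assumptions for illustration; the actual notifications travel over DCMF.

    extern void broadcast_release(void *barrier); /* hypothetical: wake every waiting node */

    typedef struct {
        int expected;   /* number of threads that must arrive */
        int arrived;    /* arrivals counted on the owner node */
    } tnt_barrier_t;

    /* Runs on the barrier's owner node each time a notification arrives. */
    void on_barrier_notify(tnt_barrier_t *b)
    {
        if (++b->arrived == b->expected) {
            b->arrived = 0;           /* reset so the barrier can be reused */
            broadcast_release(b);     /* release all participants */
        }
    }

Mutex lock requests and join requests follow the same pattern, with the owner node serializing access to the shared state.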

MTAAP’2010 13

Multinode Thread Scheduling
[Diagram: cross-node joining. Node A holds the tid of a thread running on Node B; when that thread calls tnt_exit(), a notification is sent back to Node A, completing the pending tnt_join().]

MTAAP’2010 14

Outline
• Introduction
• Contributions
• TNT on BlueGene/P
• Scheduling TNT across nodes
• Synchronization across nodes
• TNT Distributed Shared Memory
• Results
• Conclusions and Future Work

MTAAP’2010 15

Characteristics of TDSM
• Provides one-sided access to memory distributed among nodes through IBM’s DCMF.
• Allows for virtual address manipulation (see the sketch below)
 – Maps distributed memory to a single virtual address space.
 – Allows for array indexing and memory offsets.
• Scalable to a variety of applications
 – Size of the desired global shared memory is set at runtime.
• Mutability
 – The memory allocation algorithm and memory distribution algorithm can be easily altered and/or replaced.
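A minimal sketch of how a flat TDSM offset might be translated to a node and a local offset under a simple block distribution is shown below. The function, the struct, and the block_size parameter are illustrative assumptions; as noted above, the real distribution algorithm is pluggable.

    #include <stddef.h>

    typedef struct {
        int    node;    /* index of the node segment holding this offset */
        size_t local;   /* offset within that node's segment */
    } tdsm_loc;

    /* Map a global offset to (segment, local offset) under a block distribution. */
    tdsm_loc tdsm_translate(size_t global_off, size_t block_size)
    {
        tdsm_loc loc;
        loc.node  = (int)(global_off / block_size);
        loc.local = global_off % block_size;
        return loc;
    }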

MTAAP’2010 16

Example of TDSM
[Diagram: a global array distributed across Nodes 5, 6, and 7, with segment boundaries at indices 0, 15, 30, and 45.]

tdsm_read(global[15], local, 20*sizeof(int));

• Reads global[15] through global[34], starting at virtual address 0x00040012, into a local buffer spanning 0x0004004E to 0x0004009A.
• The request straddles two nodes: local indices 0 to 14 of Node 6’s segment and 0 to 4 of Node 7’s segment (see the worked check below).
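Plugging the slide’s numbers into the hypothetical tdsm_translate() sketch above (offsets counted in elements, 15 elements per node, segment k living on Node 5+k, as inferred from the diagram) reproduces the split:

    tdsm_loc lo = tdsm_translate(15, 15);  /* segment 1 -> Node 6, local index 0 */
    tdsm_loc hi = tdsm_translate(34, 15);  /* segment 2 -> Node 7, local index 4 */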

MTAAP’2010 17

Outline
• Introduction
• Contributions
• TNT on BlueGene/P
• Scheduling TNT across nodes
• Synchronization across nodes
• TNT Distributed Shared Memory
• Results
• Conclusions and Future Work

MTAAP’2010 18

Summary of Results
• The TNT thread system shows speedup comparable to that of Pthreads running on the same hardware.
• The distributed shared memory operates at 95% of the experimental peak performance of the network, with the distance between nodes not being a sensitive factor.
• The cost of thread creation grows linearly as the number of threads increases.
• The cost of waiting at a barrier is constant and independent of the number of threads involved.

MTAAP’2010 19

Single-Node Thread System Performance

• Based upon the Radix-2 Cooley-Tukey algorithm, with the Kiss FFT library for the underlying DFT (see the sketch after the chart).

• The underlying TNT thread model performs comparably to the POSIX standard when the number of threads does not exceed the number of available processor cores.

[Chart: absolute speed-up for two threads vs. size of FFT (1024 to 32768), comparing TNT and POSIX; y-axis from 0 to 2.5.]
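The slides do not include the benchmark source, so the following is only a sketch of the structure such a benchmark could take: one radix-2 Cooley-Tukey split, with the two half-size DFTs computed by Kiss FFT and one of them handed to a second TNT thread. kiss_fft_alloc(), kiss_fft(), and kiss_fft_cpx are the real Kiss FFT API; the tnt_* declarations reuse the assumed signatures from the earlier sketch, and fft2threads() itself is hypothetical.

    #include <math.h>
    #include <stdlib.h>
    #include "kiss_fft.h"   /* kiss_fft_alloc(), kiss_fft(), kiss_fft_cpx */

    typedef unsigned int tnt_t;                 /* assumed, as in the earlier sketch */
    extern int tnt_create(tnt_t *tid, void *(*fn)(void *), void *arg);
    extern int tnt_join(tnt_t tid);

    typedef struct { kiss_fft_cpx *in, *out; int n; } half_dft;

    /* Each thread computes one half-size DFT (even or odd samples). */
    static void *do_half(void *arg)
    {
        half_dft *h = (half_dft *)arg;
        kiss_fft_cfg cfg = kiss_fft_alloc(h->n, 0, NULL, NULL);
        kiss_fft(cfg, h->in, h->out);
        free(cfg);
        return NULL;
    }

    /* One radix-2 stage across two threads:
     * X[k] = E[k] + W^k O[k],  X[k+n/2] = E[k] - W^k O[k]. */
    void fft2threads(const kiss_fft_cpx *x, kiss_fft_cpx *X, int n)
    {
        int h = n / 2, k;
        kiss_fft_cpx *ev = malloc(4 * h * sizeof *ev);
        kiss_fft_cpx *od = ev + h, *E = od + h, *O = E + h;
        for (k = 0; k < h; k++) { ev[k] = x[2*k]; od[k] = x[2*k+1]; }

        half_dft he = { ev, E, h }, ho = { od, O, h };
        tnt_t t;
        tnt_create(&t, do_half, &he);   /* even half on a second TNT thread */
        do_half(&ho);                   /* odd half on the calling thread */
        tnt_join(t);

        for (k = 0; k < h; k++) {       /* combine with twiddle factors */
            double a = -2.0 * M_PI * k / n;
            kiss_fft_cpx w = { (float)cos(a), (float)sin(a) }, t2;
            t2.r = w.r * O[k].r - w.i * O[k].i;   /* t2 = W^k * O[k] */
            t2.i = w.r * O[k].i + w.i * O[k].r;
            X[k].r     = E[k].r + t2.r;  X[k].i     = E[k].i + t2.i;
            X[k + h].r = E[k].r - t2.r;  X[k + h].i = E[k].i - t2.i;
        }
        free(ev);
    }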

MTAAP’2010 20

Memory System Performance

• Reads and writes of varying sizes between one and two nodes.

• For inter-node communications, data can be transferred at approximately 357 MB/s.

• Kumar et al. determined the experimental peak performance on BG/P to be 374 MB/s in their ICS ’08 paper.

[Chart: latency (s, 0 to 0.003) vs. transfer size (bytes, 0 to 1048576) for local read, local write, remote read, and remote write.]

MTAAP’2010 21

Memory System Performance

• The size of each read/write is a function of the number of nodes across which the data is distributed.

• Latency increases linearly as the amount of data increases, regardless of the distance between nodes.

[Chart: latency (s, 0 to 0.006) vs. number of nodes (0 to 1024) for read and write.]

MTAAP’2010 22

Multinode Thread Creation Cost

• Approximately 0.2 seconds per thread.

• The per-thread cost remained effectively constant.

[Chart: total thread-creation time (seconds, 0 to 120) vs. number of threads (0 to 600); the growth is linear.]

MTAAP’2010 23

Synchronization Costs
• The performance of the barrier is effectively a constant 0.2 seconds.

[Chart: barrier time (s, between 0.1999 and 0.20004) vs. number of threads included in the barrier (100 to 1000); effectively flat at 0.2 s.]

MTAAP’2010 24

Outline
• Introduction
• Contributions
• TNT on BlueGene/P
• Scheduling TNT across nodes
• Synchronization across nodes
• TNT Distributed Shared Memory
• Results
• Conclusions and Future Work

MTAAP’2010 25

Conclusions and Future Work

• Proven the feasibility of the system
• Benefits of an execution-model-driven approach
• Room for improvement
 – Improvements to the kernel
 – More rigorous benchmarks
 – Improved allocation and scheduling algorithms

MTAAP’2010 26, Atlanta, Georgia

Thank You

MTAAP’2010 27

Bibliography
• J. del Cuvillo, W. Zhu, Z. Hu, and G. R. Gao, “TiNy Threads: A thread virtual machine for the Cyclops64 cellular architecture,” Parallel and Distributed Processing Symposium, International, vol. 15, p. 265b, 2005.
• S. Kumar, G. Dozsa, G. Almasi, et al., “The Deep Computing Messaging Framework: generalized scalable message passing on the Blue Gene/P supercomputer,” in ICS ’08: Proceedings of the 22nd Annual International Conference on Supercomputing. New York, NY, USA: ACM, 2008, pp. 94–103.
• M. Borgerding, “Kiss FFT,” December 2009, http://sourceforge.net/projects/kissfft/.