HPC Parallel Programming: Overview and Sequential Programming Optimization

Parallelization and Optimization Group, TATA Consultancy Services, Sahyadri Park, Pune, India

© TCS, all rights reserved

April 29, 2013


HPC Parallel Computing Course Overview

1. HPC Cluster Overview. (Last week)

2. Job Submission Cluster. (Today: April 29, 2013)

3. Parallel Programming:

3.1 Sequential Programming Optimization. (Today, April 29, 2013)
3.2 Multicore Programming Optimization. (April 30, 2013)
3.3 Multinode Programming Optimization. (May 2, 2013)
3.4 Tools. (April 30 and May 2, 2013)
3.5 Hands-on training exercises. (Afternoons)
3.6 Q&A.


Acknowledgements

The Parallelization and Optimization group within the TCS HPC group has created and delivered this HPC training. The specific people who have contributed are:

1. OpenMP presentation and Cache/OpenMP assignments: Anubhav Jain; Pthreads presentation: Ravi Teja.

2. Tools presentation and demo: Rihab, Himanshu, Ravi Teja, and Amit Kalele.

3. MPI presentation: Amit Kalele and Shreyas.

4. Cache assignments: Mastan Shaik.

5. Computer and cluster architecture, sequential optimization using the cache, multicore synchronization, multinode InfiniBand introduction, and general coordination and overall review: Dhananjay Brahme.


HPC Computing Cluster:

Figure: High-performance multicore, multinode cluster.

Source: Sanket Sinha, HPC Data Operations Presentation, TCS, Pune


Memory Access:

Figure: CPU-to-memory connection, NUMA. Source: www.intel.com

Figure: CPU-to-memory connection via front-side bus. Source: Wikipedia


CPU Memory Architecture

Figure: CPU cores, caches, and memory


CPU Memory Bandwidth: Sandy Bridge E5-2670

CPU Specs                       Value      Comment
No. of Sockets                  2
Technology                      32 nm
No. of Cores                    8
Clock Rate                      2.6 GHz
Floating-point ops per clock    8          8 ops * 3 operands * 8 bytes = 192 bytes/clock;
(per core)                                 2.6 GHz * 192 bytes = 499.2 GB/s per core;
                                           499.2 * 8 cores = 3993.6 GB/s total demand
QPI Speed                       8 GT/s
PCI Express 3                   40 lanes

Mem Specs                Value                      Comment
Memory Type              DDR3-800/1066/1333/1600    1333 MT/s * 8 bytes per transfer
No. of Channels          4                          allows parallel reads by the CPU
Memory-CPU Bus Width     64 bits
Max Memory Bandwidth     51.2 GB/s                  at 1333 MT/s: 1333 * 8 * 4 = 42.656 GB/s
Max Memory Size          750 GB

There is roughly a 100X gap between the bandwidth the cores can consume and what memory can deliver: 3993.6 / 42.656 ≈ 94.



Solution: On-Chip Memory

Table: Memory Hierarchy

             Cache 1      Cache 2      Memory        Speed factor
Size         32 KB        4 MB         2 GB          Decoding slower: O(log(Size))
Area         -            -            larger        Slower: O(sqrt(Size))
Speed        3 cycles     14 cycles    114 cycles    -
Technology   Static RAM   Static RAM   Dynamic RAM   Cheaper CMOS
Location     On-chip      On-chip      Off-chip      Slower: larger capacitance and resistance


Cache Line

Figure: A cache line is several bytes (here, 4)


Cache Details

Topic                    Policy
Cache line structure     Valid bit, address (tag) bits
Write policy             Write-back or write-through
Cache line replacement   Least recently used


Direct Mapped Cache

Principle         Implication
Resolve mapping   Store the higher address bits with the data
Resolve mapping   Compare the higher address bits on lookup
Locality          Lower bits map directly; higher bits cause overlap
Overlap?          Problem


Set Associative Cache

Figure: With the cache size doubled, overlap is reduced by a factor of 2

Figure: With the cache size doubled, data from any 2 of the 4 regions is stored


Set Associative (Contd):

Scheme            Placement choice
Direct mapped     Restricted to 1 out of 2 memory regions.
Set associative   Any 2 of the 2*2 = 4 regions, i.e. (4 choose 2) = 6 combinations, for each of the m sets in the cache.
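To make the mapping concrete, the sketch below splits an address into offset, set index, and tag bits. The geometry (64-byte lines, a 32 KB direct-mapped cache, i.e., 512 lines) and all names are illustrative assumptions, not taken from the slides:

    /* Sketch: how a direct-mapped cache decomposes an address.
     * Assumed geometry: 64-byte lines, 512 sets (32 KB cache). */
    #include <stdio.h>
    #include <stdint.h>

    #define LINE_BYTES 64u   /* bytes per line -> low 6 bits are the offset */
    #define NUM_SETS   512u  /* lines in cache -> next 9 bits are the index */

    int main(void) {
        uint64_t addr   = 0x7ffdc0de1234ULL;               /* example address */
        uint64_t offset = addr % LINE_BYTES;               /* byte within the line */
        uint64_t index  = (addr / LINE_BYTES) % NUM_SETS;  /* which line/set */
        uint64_t tag    = addr / (LINE_BYTES * NUM_SETS);  /* higher bits stored with data */
        printf("offset=%llu index=%llu tag=0x%llx\n",
               (unsigned long long)offset, (unsigned long long)index,
               (unsigned long long)tag);
        return 0;
    }

For a 2-way set associative cache of the same size, NUM_SETS would halve and each set would hold two lines, each carrying its own tag.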


Programming

Programming methodology to use the cache efficiently:

1. Principle: Use a cache line in as many computations as possible. This reduces cache misses.

2. Method:

2.1 Loop blocking.
2.2 Nested loops: interchange the loops.

3. Application:

3.1 Array access: access the array consecutively. Consider an array of 1M doubles. Initialize each element to 1.5 and compute the sum by adding up consecutive elements. How long did it take? Now compute the sum by adding up every 11th element until you have added all the elements. How long did it take? (A timing sketch follows this list.)

3.2 Matrix transpose: block transpose.
3.3 Matrix x matrix: interchange loops, block the loops.
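A minimal timing sketch of the array-access experiment in 3.1. The array size, initial value, and stride are from the slide; the timing harness and loop structure are our own assumptions. With 64-byte lines, consecutive access misses about once per 8 doubles, while the 88-byte stride misses on essentially every access:

    /* Sketch of experiment 3.1: sum 1M doubles consecutively,
     * then by stride 11 until every element has been added. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1024u * 1024u)

    int main(void) {
        double *a = malloc(N * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < N; i++) a[i] = 1.5;

        clock_t t0 = clock();
        double sum = 0.0;
        for (size_t i = 0; i < N; i++)
            sum += a[i];                      /* consecutive: one miss per 8 doubles */
        double t_seq = (double)(clock() - t0) / CLOCKS_PER_SEC;

        t0 = clock();
        double sum2 = 0.0;
        for (size_t s = 0; s < 11; s++)       /* 11 passes cover every element once */
            for (size_t i = s; i < N; i += 11)
                sum2 += a[i];                 /* 88-byte stride: ~one miss per access */
        double t_stride = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("consecutive: sum=%.1f, %.3f s\n", sum, t_seq);
        printf("stride-11:   sum=%.1f, %.3f s\n", sum2, t_stride);
        free(a);
        return 0;
    }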


More optimization

1. Reduce computation:

2. Application:

2.1 Move loop-invariant code outside the loop.
2.2 Loop unrolling.

3. Replace expensive operations with cheaper ones:

4. Application:

4.1 Replace multiplication by a power of 2 with a shift. (A sketch covering 2.1, 2.2, and 4.1 follows.)
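A minimal sketch combining all three transformations on a made-up function (the function and names are hypothetical illustrations, not from the slides). Unsigned types are used so the shift is well defined. Note that modern compilers often perform these rewrites themselves at -O2; the point here is to show them explicitly:

    /* Sketch: loop-invariant hoisting (2.1), 4-way unrolling (2.2),
     * and strength reduction of *8 into <<3 (4.1). */
    #include <stddef.h>

    void scale_naive(unsigned *dst, const unsigned *src, size_t n,
                     unsigned a, unsigned b) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * 8 + a * b;   /* a*b recomputed every iteration */
    }

    void scale_optimized(unsigned *dst, const unsigned *src, size_t n,
                         unsigned a, unsigned b) {
        unsigned c = a * b;                /* 2.1: invariant hoisted out */
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {       /* 2.2: unrolled by 4 */
            dst[i]     = (src[i]     << 3) + c;   /* 4.1: *8 -> <<3 */
            dst[i + 1] = (src[i + 1] << 3) + c;
            dst[i + 2] = (src[i + 2] << 3) + c;
            dst[i + 3] = (src[i + 3] << 3) + c;
        }
        for (; i < n; i++)                 /* remainder iterations */
            dst[i] = (src[i] << 3) + c;
    }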


Assignments

1. Write a program to transpose a matrix of 8192 x 8192 doubles in the straightforward way. Now implement a version that is optimized for the cache. Assume a cache line holds 64 bytes, i.e., 8 doubles. (A starting sketch for the blocked version follows the assignments.)

2. Write a program to multiply two matrices of 2048 x 2048 doubles in the straightforward way. Improve the efficiency by reordering the inner two loops. Compute B-transpose and use it to compute A x B. How long did it take? Then use blocking to compute A x B. How long did it take?
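As a starting point for assignment 1, a blocked transpose might look like the sketch below. The 8-double tile matches the assumed 64-byte cache line; the names and row-major layout are our own choices, not prescribed by the slides:

    /* Sketch: cache-blocked transpose of an N x N row-major matrix.
     * BLK = 8 doubles = one assumed 64-byte cache line; N divides by BLK. */
    #include <stddef.h>

    #define N   8192
    #define BLK 8

    void transpose_blocked(double *dst, const double *src) {
        for (size_t ib = 0; ib < N; ib += BLK)
            for (size_t jb = 0; jb < N; jb += BLK)
                /* transpose one BLK x BLK tile; both tiles stay in cache */
                for (size_t i = ib; i < ib + BLK; i++)
                    for (size_t j = jb; j < jb + BLK; j++)
                        dst[j * (size_t)N + i] = src[i * (size_t)N + j];
    }

The same blocking idea applies to assignment 2: multiply tile by tile so that the working tiles of A, B, and the result stay resident in cache.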


Thank You
