Transcript of Parallel & Scalable Machine Learning (morrisriedel.de/wp-content/uploads/2018/01/2018-01-15...)
Introduction to Machine Learning Algorithms
Dr.-Ing. Morris Riedel, Adjunct Associated Professor, School of Engineering and Natural Sciences, University of Iceland; Research Group Leader, Juelich Supercomputing Centre, Germany
PRACE and Parallel Computing Basics
January 15th, 2018, JSC, Germany
Parallel & Scalable Machine Learning
LECTURE 2
Review of Lecture 1
Lecture 2 – PRACE and Parallel Computing Basics
Machine Learning Prerequisites
1. Some pattern exists
2. No exact mathematical formula
3. Data exists
Linearly separable dataset: Iris
Perceptron model as hypothesis (simplest linear learning model, linearity in learned weights wi)
Understood Perceptron Learning Algorithm (PLA)
Learning Approaches: Supervised Learning, Unsupervised Learning, Reinforcement Learning
[Figure: perceptron model with decision boundary and learned weights wi]
Outline
Outline of the Course
1. Introduction to Machine Learning Fundamentals
2. PRACE and Parallel Computing Basics
3. Unsupervised Clustering and Applications
4. Unsupervised Clustering Challenges & Solutions
5. Supervised Classification and Learning Theory Basics
6. Classification Applications, Challenges, and Solutions
7. Support Vector Machines and Kernel Methods
8. Practicals with SVMs
9. Validation and Regularization Techniques
10. Practicals with Validation and Regularization
11. Parallelization Benefits
12. Cross-Validation Practicals
Day One – beginner
Day Two – moderate
Day Three – expert
Outline
PRACE
- Understanding High Performance Computing
- Persistent pan-European Supercomputing Infrastructure
- PRACE in Numbers & Resource Provisioning
- Scientific Domains & PRACE Services
- PRACE Advanced Training Centers & Portal
Parallel Computing Basics
- Supercomputers & Parallel Computing
- Modular Supercomputing & DEEP-EST Project
- Scheduling Principles & Job Scripts
- HPC Module Environment
- JURECA & JUROPA3 HPC System
PRACE
HTC: network interconnection less important!
Understanding High Performance Computing
High Performance Computing (HPC) is based on computing resources that enable the efficient use of parallel computing techniques through specific support with dedicated hardware such as high performance cpu/core interconnections.
High Throughput Computing (HTC) is based on commonly available computing resources such as commodity PCs and small clusters that enable the execution of ‘farming jobs’ without providing a high performance interconnection between the cpu/cores.
HPC: network interconnection important (focus in the next slides)
Partnership for Advanced Computing in Europe (PRACE)
Basic Facts
- HPC-driven infrastructure
- An international not-for-profit association under Belgian law (with its seat in Brussels)
- Has 25 members and 2 observers
- Governed by the PRACE Council in which each member has a seat
- Daily management of the association is delegated to the Board of Directors
[1] PRACE
PRACE as Persistent pan-European HPC Infrastructure
Mission: enabling world-class science through large scale simulations
Offering: HPC resources on leading edge capability systems
Resource award: through a single and fair pan-European peer review process for open research
[1] PRACE
PRACE – History
[1] PRACE
The Partnership for Advanced Computing in Europe (PRACE) is a persistent pan-European supercomputing infrastructure that offers HPC resources through a single and fair pan-European peer review process for open research
PRACE European vs. Regional Systems
[1] PRACE
PRACE TIER-0 Systems
- MareNostrum: IBM, BSC, Barcelona, Spain
- JUQUEEN: IBM BlueGene/Q, GAUSS/FZJ, Jülich, Germany
- CURIE: Bull Bullx, GENCI/CEA, Bruyères-le-Châtel, France
- SuperMUC: IBM, GAUSS/LRZ, Garching, Germany
- Hazel Hen: Cray, GAUSS/HLRS, Stuttgart, Germany
- MARCONI: Lenovo, CINECA, Bologna, Italy
- Piz Daint: Cray XC 30, CSCS, Lugano, Switzerland
[Top500 ranks (June 2017) shown per system in the original figure: #3, #21, #40, #17, #14, #85, #13]
[1] PRACE, [5] TOP500 supercomputing sites
[Video] JUQUEEN – Supercomputer Upgrade Example
PRACE Achievements
- More than 500 scientific projects enabled
- Over 14 thousand million core hours awarded since 2010, of which 63% are trans-national
- R&D access to industrial users, with >50 companies supported
- More than 60 Pflop/s of peak performance on 7 world-class systems
- 25 members, including 5 hosting members (France, Germany, Italy, Spain and Switzerland)
[1] PRACE
PRACE – Scientific Domains
[1] PRACE
[Research Domain Pie Chart up to and including Call 14, % of total core hours awarded:
- Biochemistry, Bioinformatics and Life Sciences
- Chemical Sciences and Materials
- Universe Sciences
- Mathematical and Computer Sciences
- Earth System Sciences
- Engineering
- Fundamental Constituents of Matter
(percentage labels not recoverable from the source figure)]
PRACE – Current Services at a Glance
PRACE Partners
Tier-0 systems (open R&D)
- Project Access (1-3 years)
- Preparatory Access (Type A, B, C, D)
Tier-1 systems (open R&D)
- DECI Programme
Application Enabling & Support
- Preparatory access Type C
- Preparatory access Type D
- Tier-1 for Tier-0
- SHAPE
- HLST support
Training (for academia and industry)
- Training Portal
- PATC, PTC
- Seasonal Schools & on demand
- International HPC Summer School
- MOOC
- Code Vault
- Best Practice Guides
- White Papers
Communication, Dissemination, Outreach
- Website
- PR
- Scientific Communication
- Summer of HPC
Operation & Coordination of the common PRACE Operational Services
- Service Catalogue
- PRACE MD-VPN network
- Security
HPC Commissioning & Prototyping
- Technology Watch, PCP activity
- Infrastructure WS
- Best Practices
- UEABS
[1] PRACE
PRACE Advanced Training Centers (PATC) & Portal
Selected Facts
- More than 10,000 people trained by 6 PRACE Advanced Training Centers (PATC) and other events
- The training portal consists of valuable material in all fields related to HPC & supercomputing
- Easy search function to find materials of past events
- Material of this training will also be available after the event
[2] PRACE Training Portal
[Video] PRACE – Introduction to Supercomputing
[3] YouTube, ’PRACE – Introduction to Supercomputing’
Parallel Computing Basics
Parallel Computing
All modern supercomputers depend heavily on parallelism
- Often known as 'parallel processing' of some problem space
- Tackle problems in parallel to enable the 'best performance' possible
'The measure of speed' in High Performance Computing matters
- Common measure for parallel computers established by the TOP500 list
- Based on a benchmark for ranking the best 500 computers worldwide
We speak of parallel computing whenever a number of ‘compute elements’ (e.g. cores) solve a problem in a cooperative way
[4] Introduction to High Performance Computing for Scientists and Engineers
[5] TOP 500 supercomputing sites
network interconnection important
TOP 500 List (November 2017)
[Figure: TOP500 list, November 2017 — power challenge; EU #1 highlighted]
[5] TOP 500 supercomputing sites
The impact of the TOP500 list is compromised by its own success: especially the top 50 HPC systems are often designed primarily to reach a great LINPACK benchmark result today
The LINPACK performance benchmark does not fully reflect the broad range of applications in HPC today
Towards new HPC Architectures – DEEP-EST EU Project
[Figure: DEEP-EST modular architecture — Cluster Nodes (CN) with general purpose CPUs and memory, Booster Nodes (BN) with many-core CPUs, Data Nodes (DN) with general purpose CPU, FPGA, memory and NVRAM, Network Attached Memory (NAM) with FPGA, a Global Collective Engine (GCE) with FPGA, and a possible application workload mapped across the modules]
[6] DEEP-EST EU Project
Modular Supercomputing @ JSC
- IBM Power 4+ JUMP, 9 TFlop/s
- IBM Power 6 JUMP, 9 TFlop/s
- IBM Blue Gene/L JUBL, 45 TFlop/s
- IBM Blue Gene/P JUGENE, 1 PFlop/s
- HPC-FF, 100 TFlop/s
- JUROPA, 200 TFlop/s
- IBM Blue Gene/Q JUQUEEN, 5.9 PFlop/s
- JURECA (2015), 2.2 PFlop/s
- JURECA Booster (2017), 5 PFlop/s
- JUST storage
JURECA CLUSTER & BOOSTER Module @ JSC
Tutorial will use the JURECA CLUSTER module on Monday & Wednesday (Tuesday maintenance)
Tutorial Machine: JSC JURECA System – CLUSTER Module
Characteristics
- Login nodes with 256 GB memory per node
- 45,216 CPU cores
- 1.8 (CPU) + 0.44 (GPU) Petaflop/s peak performance
- Two Intel Xeon E5-2680 v3 Haswell CPUs per node: 2 x 12 cores, 2.5 GHz
- 75 compute nodes equipped with two NVIDIA K80 GPUs (2 x 4992 CUDA cores)
Architecture & Network
- Based on T-Platforms V-class server architecture
- Mellanox EDR InfiniBand high-speed network with non-blocking fat tree topology
- 100 GiB per second storage connection to JUST
[7] JURECA HPC System
- Use our SSH keys to get access and use the reservation
- Put the private key into your ~/.ssh directory (UNIX)
- Use the private key with your putty tool (Windows)
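The UNIX-side steps above can be sketched as follows; the key file name ml-tutorial-key and the trainXYZ account are hypothetical placeholders for the key file you actually downloaded and the account you picked:

```shell
# Place the downloaded private key into ~/.ssh and restrict its permissions
# (ssh refuses to use a private key that is readable by other users).
mkdir -p ~/.ssh
mv ml-tutorial-key ~/.ssh/
chmod 600 ~/.ssh/ml-tutorial-key

# Log in with that key, substituting your own trainXYZ account
ssh -i ~/.ssh/ml-tutorial-key [email protected]
```

On Windows, the equivalent is loading the key under Connection > SSH > Auth in Putty, as shown on the later slide.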
Exercise – Login to JURECA with TrainXYZ Accounts
- train004 – train050 are training accounts that are used in the tutorial for JURECA using SSH keys
- Every participant needs to pick one trainXYZ account from the list
- Download keys: https://fz-juelich.sciebo.de/index.php/s/IR9u8LPQjNXIV37/authenticate
- Password: JSC_ml_2018
JURECA System – SSH Login
- Use your account train004 – train050
- Windows: use putty
- UNIX: ssh [email protected]
Remember to use your own trainXYZ account in order to log in to the JURECA system
Using SSH Clients for Windows
Example: using the Putty SSH client (other SSH tools exist, e.g. MobaXterm, etc.)
[8] PUTTY tool
Configure Keys under SSH, change username to trainXYZ, and use hostname for JURECA/JUROPA3
Tutorial Machine: JSC JUROPA3 System
Characteristics (JSC partition)
- 36 compute nodes with Intel Xeon E5-2650 (Sandy Bridge-EP) eight-core processors, 2.0 GHz
- 64 GB memory
- 704 CPU cores total
- Additional 4 compute nodes with 2 NVIDIA Tesla K20X GPUs (2688 CUDA cores)
- Additional 4 compute nodes with 2 Intel Xeon Phi 5110P (60 cores)
Architecture & Network
- InfiniBand FDR HCA
[9] JUROPA3 HPC System
- Use our SSH keys to get access and use the reservation
- Put the private key into your ~/.ssh directory (UNIX)
- Use the private key with your putty tool (Windows)
JUROPA3 System – SSH Login
- Use your account train004 – train050
- Windows: use putty
- UNIX: ssh [email protected]
Remember to use your own trainXYZ account in order to log in to the JUROPA3 system
Scheduling Principles – SLURM Scheduler in Tutorial
- HPC systems are typically not used in an interactive fashion
- A program application starts 'processes' on processors ('do a job for a user')
- Users of HPC systems send 'job scripts' to schedulers to start programs
- Scheduling enables the sharing of the HPC system with other users
- Closely related to operating systems, with a wide variety of algorithms
- E.g. First Come First Serve (FCFS): queues processes in the order that they arrive in the ready queue
- E.g. Backfilling: maximizes cluster utilization and throughput; the scheduler searches for jobs that can fill gaps in the schedule, so smaller jobs farther back in the queue run ahead of a job waiting at the front of the queue (but this job must not be delayed by backfilling!)
Scheduling is the method by which user processes are given access to processor time (shared)
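On a SLURM system such as JURECA, this shared scheduling state can be inspected directly from the login node; a short sketch (the reservation name follows the tutorial's ml-hpc-1, and output will of course differ per system):

```shell
squeue -u $USER                     # show your own queued and running jobs
sinfo -p batch                      # state of the nodes in the batch partition
scontrol show reservation ml-hpc-1  # details of the tutorial reservation
```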
Example: Supercomputer BlueGene/Q
[10] LLView Tool
JURECA/JUROPA3 Example – Tutorial Reservations
On JURECA:
{{{
ReservationName=ml-hpc-1 StartTime=2018-01-15T08:45:00 EndTime=2018-01-15T17:15:00 Duration=08:30:00
Nodes=jrc[0057,0066,0072,0085-0090,0092-0095,0098-0103,0105,0107-0116] NodeCnt=30 CoreCnt=720 Features=thin PartitionName=batch Flags=
TRES=cpu=1440
Users=mriedel,s.luehrs,train001,train002,train003,train004,train005,train006,train007,train008,train009,train010,train011,train012,train013,train014,train015,train016,train017,train018,train019,train020,train021,train022,train023,train024,train025,train026,train027,train028,train029,train030,train031,train032,train033,train034,train035,train036,train037,train038,train039,train040,train041,train042,train043,train044,train045,train046,train047,train048,train049,train050 Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
ReservationName=ml-hpc-3 StartTime=2018-01-17T08:45:00 EndTime=2018-01-17T17:15:00 Duration=08:30:00
Nodes=jrc[0057-0090,0092-0107] NodeCnt=50 CoreCnt=1200 Features=thin PartitionName=batch Flags=OVERLAP
TRES=cpu=2400
Users=mriedel,s.luehrs,train001,train002,train003,train004,train005,train006,train007,train008,train009,train010,train011,train012,train013,train014,train015,train016,train017,train018,train019,train020,train021,train022,train023,train024,train025,train026,train027,train028,train029,train030,train031,train032,train033,train034,train035,train036,train037,train038,train039,train040,train041,train042,train043,train044,train045,train046,train047,train048,train049,train050 Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
}}}
On JUROPA3:
{{{
ReservationName=ml-hpc-2 StartTime=2018-01-16T08:45:00 EndTime=2018-01-16T17:15:00 Duration=08:30:00
Nodes=j3c[006-020,025,035,053-055] NodeCnt=20 CoreCnt=320 Features=(null) PartitionName=batch Flags=
TRES=cpu=640
Users=train001,train002,train003,train004,train005,train006,train007,train008,train009,train010,train011,train012,train013,train014,train015,train016,train017,train018,train019,train020,train021,train022,train023,train024,train025,train026,train027,train028,train029,train030,train031,train032,train033,train034,train035,train036,train037,train038,train039,train040,train041,train042,train043,train044,train045,train046,train047,train048,train049,train050 Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
}}}
Jobscript JURECA/JUROPA3 Example & Reservation
- JURECA HPC System – Reservation Monday (2018-01-15): ml-hpc-1
- JUROPA3 HPC System – Reservation Tuesday (2018-01-16): ml-hpc-2
- JURECA HPC System – Reservation Wednesday (2018-01-17): ml-hpc-3
Every day the reservation string is changed on the HPC systems (below)
Change the number of nodes and tasks to use more or less CPUs for jobs
Use the command qsub <jobscript> in order to submit parallel jobs to the supercomputer, and remember your <job id> returned
Use the command qstat in order to check the status of your parallel job
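A minimal sketch of such a job script, assuming SLURM-style directives as used on JURECA; the node and task counts, module setup, and program name are illustrative placeholders, and on a plain SLURM installation the native submit/status commands are sbatch and squeue:

```shell
#!/bin/bash
#SBATCH --job-name=ml-tutorial      # name shown in the queue
#SBATCH --nodes=2                   # change to use more or fewer nodes
#SBATCH --ntasks-per-node=24        # one task per core on a JURECA node
#SBATCH --time=00:30:00             # wall-clock limit for the job
#SBATCH --reservation=ml-hpc-1      # tutorial reservation of the day (below)

srun ./my_parallel_program          # placeholder for the actual application
```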
HPC Environment – Modules
Module environment tool
- Avoids having to manually set up environment information for every application
- Simplifies shell initialization and lets users easily modify their environment
module avail
- Lists all available modules on the HPC system (e.g. compilers, MPI, etc.)
module spider
- Finds modules in the installed set of modules and shows more information
module load
- Loads particular modules into the current work environment, e.g.:
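A typical session with these commands might look as follows; the module names Intel and ParaStationMPI are illustrative and may differ on the actual system:

```shell
module avail                 # list everything installed on the system
module spider MPI            # search for MPI-related modules and details
module load Intel            # hypothetical compiler module name
module load ParaStationMPI   # hypothetical MPI module name
module list                  # verify what is currently loaded
```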
[Video] Parallel I/O with I/O Nodes
[11] Simplifying HPC Architectures, YouTube Video
Lecture Bibliography
Lecture Bibliography (1)
[1] Partnership for Advanced Computing in Europe (PRACE), Online: http://www.prace-ri.eu/
[2] PRACE Training Portal, Online: http://www.training.prace-ri.eu/material/index.html
[3] PRACE – Introduction to Supercomputing, Online: https://www.youtube.com/watch?v=D94FJx9vxFA
[4] Introduction to High Performance Computing for Scientists and Engineers, Georg Hager & Gerhard Wellein, Chapman & Hall/CRC Computational Science, ISBN 143981192X, English, ~330 pages, 2010, Online: http://www.amazon.de/Introduction-Performance-Computing-Scientists-Computational/dp/143981192X
[5] TOP500 Supercomputing Sites, Online: http://www.top500.org/
[6] DEEP-EST EU Project, Online: http://www.deep-projects.eu/
[7] JURECA HPC System @ JSC, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JURECA/JURECA_node.html
[8] Putty Tool, Online: http://www.putty.org/
[9] JUROPA3 HPC System @ JSC, Online: http://www.fz-juelich.de/ias/jsc/EN/Research/HPCTechnology/ClusterComputing/JUROPA-3/JUROPA-3_node.html
[10] LLView Tool, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/LLview/_node.html
Lecture Bibliography (2)
[11] YouTube Video, ‘Big Ideas: Simplifying High Performance Computing Architectures’, Online: https://www.youtube.com/watch?v=ISS_OGVamBk
Slides Available at http://www.morrisriedel.de/talks