Transcript of Parallel & Scalable Machine Learning (morrisriedel.de/wp-content/uploads/2018/01/2018-01-15...)
Introduction to Machine Learning Algorithms
Dr.-Ing. Morris Riedel, Adjunct Associated Professor, School of Engineering and Natural Sciences, University of Iceland; Research Group Leader, Juelich Supercomputing Centre, Germany
PRACE and Parallel Computing Basics
January 15th, 2018, JSC, Germany
Parallel & Scalable Machine Learning
LECTURE 2
Review of Lecture 1
Lecture 2 – PRACE and Parallel Computing Basics
Machine Learning Prerequisites
1. Some pattern exists
2. No exact mathematical formula
3. Data exists
Linearly separable dataset: Iris
Perceptron model as hypothesis (simplest linear learning model, linearity in learned weights wi)
Understood Perceptron Learning Algorithm (PLA)
Learning Approaches: Supervised Learning, Unsupervised Learning, Reinforcement Learning
[Figure: perceptron model with decision boundary and learned weights wi]
Outline
Outline of the Course
1. Introduction to Machine Learning Fundamentals
2. PRACE and Parallel Computing Basics
3. Unsupervised Clustering and Applications
4. Unsupervised Clustering Challenges & Solutions
5. Supervised Classification and Learning Theory Basics
6. Classification Applications, Challenges, and Solutions
7. Support Vector Machines and Kernel Methods
8. Practicals with SVMs
9. Validation and Regularization Techniques
10. Practicals with Validation and Regularization
11. Parallelization Benefits
12. Cross-Validation Practicals
Day One – beginner
Day Two – moderate
Day Three – expert
Outline
PRACE
- Understanding High Performance Computing
- Persistent pan-European Supercomputing Infrastructure
- PRACE in Numbers & Resource Provisioning
- Scientific Domains & PRACE Services
- PRACE Advanced Training Centers & Portal
Parallel Computing Basics
- Supercomputers & Parallel Computing
- Modular Supercomputing & DEEP-EST Project
- Scheduling Principles & Job Scripts
- HPC Module Environment
- JURECA & JUROPA3 HPC System
PRACE
HTC: network interconnection less important!
Understanding High Performance Computing
High Performance Computing (HPC) is based on computing resources that enable the efficient use of parallel computing techniques through specific support with dedicated hardware such as high performance cpu/core interconnections.
High Throughput Computing (HTC) is based on commonly available computing resources such as commodity PCs and small clusters that enable the execution of ‘farming jobs’ without providing a high performance interconnection between the cpu/cores.
HPC: network interconnection important (focus in the next slides)
Partnership for Advanced Computing in Europe (PRACE)
Basic Facts
- HPC-driven infrastructure
- An international not-for-profit association under Belgian law (with its seat in Brussels)
- Has 25 members and 2 observers
- Governed by the PRACE Council in which each member has a seat
- Daily management of the association is delegated to the Board of Directors
[1] PRACE
PRACE as Persistent pan-European HPC Infrastructure
Mission: enabling world-class science through large scale simulations
Offering: HPC resources on leading edge capability systems
Resource award: through a single and fair pan-European peer review process for open research
[1] PRACE
PRACE – History
[1] PRACE
The Partnership for Advanced Computing in Europe (PRACE) is a persistent pan-European supercomputing infrastructure that offers HPC resources through a single and fair pan-European peer review process for open research
PRACE European vs. Regional Systems
[1] PRACE
PRACE TIER-0 Systems
- MareNostrum: IBM, BSC, Barcelona, Spain
- JUQUEEN: IBM BlueGene/Q, GAUSS/FZJ, Jülich, Germany
- CURIE: Bull Bullx, GENCI/CEA, Bruyères-le-Châtel, France
- SuperMUC: IBM, GAUSS/LRZ, Garching, Germany
- Hazel Hen: Cray, GAUSS/HLRS, Stuttgart, Germany
- MARCONI: Lenovo, CINECA, Bologna, Italy
- Piz Daint: Cray XC 30, CSCS, Lugano, Switzerland
[Top500 ranks (June 2017) shown per system in the original figure: #3, #21, #40, #17, #14, #85, #13]
[1] PRACE, [5] TOP500 supercomputing sites
[Video] JUQUEEN – Supercomputer Upgrade Example
PRACE Achievements
- More than 500 scientific projects enabled
- Over 14 thousand million core hours awarded since 2010, of which 63% are trans-national
- R&D access to industrial users, with >50 companies supported
- More than 60 Pflop/s of peak performance on 7 world-class systems
- 25 members, including 5 hosting members (France, Germany, Italy, Spain and Switzerland)
[1] PRACE
PRACE – Scientific Domains
[1] PRACE
[Research Domain Pie Chart up to and including Call 14, % of total core hours awarded:
- Biochemistry, Bioinformatics and Life Sciences
- Chemical Sciences and Materials
- Universe Sciences
- Mathematical and Computer Sciences
- Earth System Sciences
- Engineering
- Fundamental Constituents of Matter
(percentage labels not recoverable from the source figure)]
PRACE – Current Services at a Glance
PRACE Partners
Tier-0 systems (open R&D)
- Project Access (1-3 years)
- Preparatory Access (Type A, B, C, D)
Tier-1 systems (open R&D)
- DECI Programme
Application Enabling & Support
- Preparatory access Type C
- Preparatory access Type D
- Tier-1 for Tier-0
- SHAPE
- HLST support
Training (for academia and industry)
- Training Portal
- PATC, PTC
- Seasonal Schools & on demand
- International HPC Summer School
- MOOC
- Code Vault
- Best Practice Guides
- White Papers
Communication, Dissemination, Outreach
- Website
- PR
- Scientific Communication
- Summer of HPC
Operation & Coordination of the common PRACE Operational Services
- Service Catalogue
- PRACE MD-VPN network
- Security
HPC Commissioning & Prototyping
- Technology Watch, PCP activity
- Infrastructure WS
- Best Practices
- UEABS
[1] PRACE
PRACE Advanced Training Centers (PATC) & Portal
Selected Facts
- More than 10,000 people trained by 6 PRACE Advanced Training Centers (PATC) and other events
- The training portal consists of valuable material in all fields related to HPC & supercomputing
- Easy search function to find materials of past events
- Material of this training will also be available after the event
[2] PRACE Training Portal
[Video] PRACE – Introduction to Supercomputing
[3] YouTube, ’PRACE – Introduction to Supercomputing’
Parallel Computing Basics
Parallel Computing
All modern supercomputers depend heavily on parallelism
- Often known as 'parallel processing' of some problem space
- Tackle problems in parallel to enable the 'best performance' possible
'The measure of speed' in High Performance Computing matters
- Common measure for parallel computers established by the TOP500 list
- Based on a benchmark for ranking the best 500 computers worldwide
We speak of parallel computing whenever a number of ‘compute elements’ (e.g. cores) solve a problem in a cooperative way
[4] Introduction to High Performance Computing for Scientists and Engineers
[5] TOP 500 supercomputing sites
network interconnection important
TOP 500 List (November 2017)
[Figure: TOP500 list, November 2017 — power challenge; EU #1 highlighted]
[5] TOP 500 supercomputing sites
The impact of the TOP500 list is compromised by its own success: especially the top 50 HPC systems are often designed primarily to reach a great LINPACK benchmark result today
The LINPACK performance benchmark does not fully reflect the broad range of applications in HPC today
Towards new HPC Architectures – DEEP-EST EU Project
[Figure: DEEP-EST modular architecture — Cluster Nodes (CN) with general purpose CPUs and memory, Booster Nodes (BN) with many-core CPUs, Data Nodes (DN) with general purpose CPU, FPGA, memory and NVRAM, Network Attached Memory (NAM) with FPGA, a Global Collective Engine (GCE) with FPGA, and a possible application workload mapped across the modules]
[6] DEEP-EST EU Project
Modular Supercomputing @ JSC
- IBM Power 4+ JUMP, 9 TFlop/s
- IBM Power 6 JUMP, 9 TFlop/s
- IBM Blue Gene/L JUBL, 45 TFlop/s
- IBM Blue Gene/P JUGENE, 1 PFlop/s
- HPC-FF, 100 TFlop/s
- JUROPA, 200 TFlop/s
- IBM Blue Gene/Q JUQUEEN, 5.9 PFlop/s
- JURECA (2015), 2.2 PFlop/s
- JURECA Booster (2017), 5 PFlop/s
- JUST storage
JURECA CLUSTER & BOOSTER Module @ JSC
Tutorial will use the JURECA CLUSTER module on Monday & Wednesday (Tuesday maintenance)
Tutorial Machine: JSC JURECA System – CLUSTER Module
Characteristics
- Login nodes with 256 GB memory per node
- 45,216 CPU cores
- 1.8 (CPU) + 0.44 (GPU) Petaflop/s peak performance
- Two Intel Xeon E5-2680 v3 Haswell CPUs per node: 2 x 12 cores, 2.5 GHz
- 75 compute nodes equipped with two NVIDIA K80 GPUs (2 x 4992 CUDA cores)
Architecture & Network
- Based on T-Platforms V-class server architecture
- Mellanox EDR InfiniBand high-speed network with non-blocking fat tree topology
- 100 GiB per second storage connection to JUST
[7] JURECA HPC System
- Use our SSH keys to get access and use the reservation
- Put the private key into your ~/.ssh directory (UNIX)
- Use the private key with your putty tool (Windows)
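The UNIX-side steps above can be sketched as follows; the key file name ml-tutorial-key and the trainXYZ account are hypothetical placeholders for the key file you actually downloaded and the account you picked:

```shell
# Place the downloaded private key into ~/.ssh and restrict its permissions
# (ssh refuses to use a private key that is readable by other users).
mkdir -p ~/.ssh
mv ml-tutorial-key ~/.ssh/
chmod 600 ~/.ssh/ml-tutorial-key

# Log in with that key, substituting your own trainXYZ account
ssh -i ~/.ssh/ml-tutorial-key [email protected]
```

On Windows, the equivalent is loading the key under Connection > SSH > Auth in Putty, as shown on the later slide.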
Exercise – Login to JURECA with TrainXYZ Accounts
- train004 – train050 are training accounts that are used in the tutorial for JURECA using SSH keys
- Every participant needs to pick one trainXYZ account from the list
- Download keys: https://fz-juelich.sciebo.de/index.php/s/IR9u8LPQjNXIV37/authenticate
- Password: JSC_ml_2018
JURECA System – SSH Login
- Use your account train004 – train050
- Windows: use putty
- UNIX: ssh [email protected]
Remember to use your own trainXYZ account in order to log in to the JURECA system
Using SSH Clients for Windows
Example: using the Putty SSH client (other SSH tools exist, e.g. MobaXterm, etc.)
[8] PUTTY tool
Configure Keys under SSH, change username to trainXYZ, and use hostname for JURECA/JUROPA3
Tutorial Machine: JSC JUROPA3 System
Characteristics (JSC partition)
- 36 compute nodes with Intel Xeon E5-2650 (Sandy Bridge-EP) eight-core processors, 2.0 GHz
- 64 GB memory
- 704 CPU cores total
- Additional 4 compute nodes with 2 NVIDIA Tesla K20X GPUs (2688 CUDA cores)
- Additional 4 compute nodes with 2 Intel Xeon Phi 5110P (60 cores)
Architecture & Network
- InfiniBand FDR HCA
[9] JUROPA3 HPC System
- Use our SSH keys to get access and use the reservation
- Put the private key into your ~/.ssh directory (UNIX)
- Use the private key with your putty tool (Windows)
JUROPA3 System – SSH Login
- Use your account train004 – train050
- Windows: use putty
- UNIX: ssh [email protected]
Remember to use your own trainXYZ account in order to log in to the JUROPA3 system
Scheduling Principles – SLURM Scheduler in Tutorial
- HPC systems are typically not used in an interactive fashion
- A program application starts 'processes' on processors ('do a job for a user')
- Users of HPC systems send 'job scripts' to schedulers to start programs
- Scheduling enables the sharing of the HPC system with other users
- Closely related to operating systems, with a wide variety of algorithms
- E.g. First Come First Serve (FCFS): queues processes in the order that they arrive in the ready queue
- E.g. Backfilling: maximizes cluster utilization and throughput; the scheduler searches for jobs that can fill gaps in the schedule, so smaller jobs farther back in the queue run ahead of a job waiting at the front of the queue (but this job must not be delayed by backfilling!)
Scheduling is the method by which user processes are given access to processor time (shared)
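On a SLURM system such as JURECA, this shared scheduling state can be inspected directly from the login node; a short sketch (the reservation name follows the tutorial's ml-hpc-1, and output will of course differ per system):

```shell
squeue -u $USER                     # show your own queued and running jobs
sinfo -p batch                      # state of the nodes in the batch partition
scontrol show reservation ml-hpc-1  # details of the tutorial reservation
```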
Example: Supercomputer BlueGene/Q
[10] LLView Tool
JURECA/JUROPA3 Example – Tutorial Reservations
On JURECA:
{{{
ReservationName=ml-hpc-1 StartTime=2018-01-15T08:45:00 EndTime=2018-01-15T17:15:00 Duration=08:30:00
Nodes=jrc[0057,0066,0072,0085-0090,0092-0095,0098-0103,0105,0107-0116] NodeCnt=30 CoreCnt=720 Features=thin PartitionName=batch Flags=
TRES=cpu=1440
Users=mriedel,s.luehrs,train001,train002,train003,train004,train005,train006,train007,train008,train009,train010,train011,train012,train013,train014,train015,train016,train017,train018,train019,train020,train021,train022,train023,train024,train025,train026,train027,train028,train029,train030,train031,train032,train033,train034,train035,train036,train037,train038,train039,train040,train041,train042,train043,train044,train045,train046,train047,train048,train049,train050 Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
ReservationName=ml-hpc-3 StartTime=2018-01-17T08:45:00 EndTime=2018-01-17T17:15:00 Duration=08:30:00
Nodes=jrc[0057-0090,0092-0107] NodeCnt=50 CoreCnt=1200 Features=thin PartitionName=batch Flags=OVERLAP
TRES=cpu=2400
Users=mriedel,s.luehrs,train001,train002,train003,train004,train005,train006,train007,train008,train009,train010,train011,train012,train013,train014,train015,train016,train017,train018,train019,train020,train021,train022,train023,train024,train025,train026,train027,train028,train029,train030,train031,train032,train033,train034,train035,train036,train037,train038,train039,train040,train041,train042,train043,train044,train045,train046,train047,train048,train049,train050 Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
}}}
On JUROPA3:
{{{
ReservationName=ml-hpc-2 StartTime=2018-01-16T08:45:00 EndTime=2018-01-16T17:15:00 Duration=08:30:00
Nodes=j3c[006-020,025,035,053-055] NodeCnt=20 CoreCnt=320 Features=(null) PartitionName=batch Flags=
TRES=cpu=640
Users=train001,train002,train003,train004,train005,train006,train007,train008,train009,train010,train011,train012,train013,train014,train015,train016,train017,train018,train019,train020,train021,train022,train023,train024,train025,train026,train027,train028,train029,train030,train031,train032,train033,train034,train035,train036,train037,train038,train039,train040,train041,train042,train043,train044,train045,train046,train047,train048,train049,train050 Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
}}}
Jobscript JURECA/JUROPA3 Example & Reservation
- JURECA HPC System – Reservation Monday (2018-01-15): ml-hpc-1
- JUROPA3 HPC System – Reservation Tuesday (2018-01-16): ml-hpc-2
- JURECA HPC System – Reservation Wednesday (2018-01-17): ml-hpc-3
Every day the reservation string is changed on the HPC systems (below)
Change the number of nodes and tasks to use more or less CPUs for jobs
Use the command qsub <jobscript> in order to submit parallel jobs to the supercomputer, and remember your <job id> returned
Use the command qstat in order to check the status of your parallel job
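A minimal sketch of such a job script, assuming SLURM-style directives as used on JURECA; the node and task counts, module setup, and program name are illustrative placeholders, and on a plain SLURM installation the native submit/status commands are sbatch and squeue:

```shell
#!/bin/bash
#SBATCH --job-name=ml-tutorial      # name shown in the queue
#SBATCH --nodes=2                   # change to use more or fewer nodes
#SBATCH --ntasks-per-node=24        # one task per core on a JURECA node
#SBATCH --time=00:30:00             # wall-clock limit for the job
#SBATCH --reservation=ml-hpc-1      # tutorial reservation of the day (below)

srun ./my_parallel_program          # placeholder for the actual application
```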
HPC Environment – Modules
Module environment tool
- Avoids having to manually set up environment information for every application
- Simplifies shell initialization and lets users easily modify their environment
module avail
- Lists all available modules on the HPC system (e.g. compilers, MPI, etc.)
module spider
- Finds modules in the installed set of modules and shows more information
module load
- Loads particular modules into the current work environment, e.g.:
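A typical session with these commands might look as follows; the module names Intel and ParaStationMPI are illustrative and may differ on the actual system:

```shell
module avail                 # list everything installed on the system
module spider MPI            # search for MPI-related modules and details
module load Intel            # hypothetical compiler module name
module load ParaStationMPI   # hypothetical MPI module name
module list                  # verify what is currently loaded
```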
[Video] Parallel I/O with I/O Nodes
[11] Simplifying HPC Architectures, YouTube Video
Lecture Bibliography
Lecture Bibliography (1)
[1] Partnership for Advanced Computing in Europe (PRACE), Online: http://www.prace-ri.eu/
[2] PRACE Training Portal, Online: http://www.training.prace-ri.eu/material/index.html
[3] PRACE – Introduction to Supercomputing, Online: https://www.youtube.com/watch?v=D94FJx9vxFA
[4] Introduction to High Performance Computing for Scientists and Engineers, Georg Hager & Gerhard Wellein, Chapman & Hall/CRC Computational Science, ISBN 143981192X, English, ~330 pages, 2010, Online: http://www.amazon.de/Introduction-Performance-Computing-Scientists-Computational/dp/143981192X
[5] TOP500 Supercomputing Sites, Online: http://www.top500.org/
[6] DEEP-EST EU Project, Online: http://www.deep-projects.eu/
[7] JURECA HPC System @ JSC, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JURECA/JURECA_node.html
[8] Putty Tool, Online: http://www.putty.org/
[9] JUROPA3 HPC System @ JSC, Online: http://www.fz-juelich.de/ias/jsc/EN/Research/HPCTechnology/ClusterComputing/JUROPA-3/JUROPA-3_node.html
[10] LLView Tool, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/LLview/_node.html
Lecture Bibliography (2)
[11] YouTube Video, ‘Big Ideas: Simplifying High Performance Computing Architectures’, Online: https://www.youtube.com/watch?v=ISS_OGVamBk
Slides Available at http://www.morrisriedel.de/talks