New Era of High Performance Computing (convergence of AI, Big Data, HPC)
Rajesh <[email protected]>
Petascale Cray XC Systems
Undisclosed systems
Top 500 Supercomputers in the World (June 2018)

              Top 10   Top 50   Top 100
Cray Systems    4        18       29
Vendor Rank    #1       #1       #1
Cray’s Supercomputing Leadership
Copyright 2018 Cray Inc. - Confidential and Proprietary
Top 500 Supercomputers in the World (Nov 2017)

              Top 10   Top 50   Top 100
Cray Systems    4        18       29
Vendor Rank    #1       #1       #1
The Convergence of Big Data, AI and HPC

Modeling the World
● Math Models: simulation and modelling of the natural world via mathematical equations
● Data Models: analysis of large datasets for knowledge discovery, insight, and prediction
● Data-Intensive Processing: hybrid workflows with a mix of simulation and analytics

Workloads are becoming more heterogeneous.
The Convergence of Big Data, AI and HPC
● Systems/container management
● Analytics/machine learning ecosystem
● Deep learning toolkits
● Big data

Today: running software built for the cloud on HPC hardware
Benefit: convergence of productivity and performance
Machine Learning Coming to Your Science Domain
● Clustering Daya Bay events
● Classifying LHC events
● Oxford Nanopore sequencing
● Detecting extreme weather
● FWI subsurface modelling
● Modeling galaxy shapes
● Turbine CFD modeling
● Protein-ligand binding

New and Larger Deep Learning Models Required
Deep Learning Use Cases

Consumer
● Use cases: search; face/object detection; image segmentation; speech understanding; NLP; text to speech
● Topologies: ResNet, SSD, LSTM, Attention, SparseNN, FCN

Retail
● Use cases: person and object detection; image segmentation; scene analytics; support; marketing; supply chain; security
● Topologies: ResNet, SSD, FCN, RNN

Energy
● Use cases: oil and gas exploration; smart grid; operational improvement; conservation
● Topologies: deep reinforcement learning

Financial
● Use cases: algorithmic trading; fraud detection; personal finance; risk mitigation; security
● Topologies: deep reinforcement learning

Health
● Use cases: enhanced diagnostics; 3D medical imaging; drug discovery; sensory aids
● Topologies: ResNet, SSD, FCN

Industrial
● Use cases: factory automation; predictive maintenance; precision agriculture; field automation
● Topologies: ResNet, SSD, deep reinforcement learning

Autonomous Driving
● Use cases: pedestrian, vehicle, and object detection and classification; ego motion; sensor fusion; environment modeling
● Topologies: deep reinforcement learning, LSTM, SSD
8/31/2018
NERSC – Deep Learning in Science
Opportunities to apply DL widely in support of classic HPC simulation and modelling
Molecular Engineering of Solar-Powered Windows (Jacqueline M. Cole, University of Cambridge, Argonne National Lab)
1. Extract compound data from scientific publications
2. Enrich data with ML and quantum chemical calculations
3. Filter the data set to a small number of candidates (ML)
4. Validate the final candidates (simulation)
Relationship Between AI, ML, & DL
● "AI" is a very broad term, with no clear boundaries
● "AI" and "deep learning" are not synonymous
● Machine learning is just a part of AI
● Deep learning is a specialization of machine learning
● Cray focuses on Deep Learning
Neural Network Workflow
NN workflows are similar in many ways to typical data science workflows.
Ingest/clean & transform can be major undertakings, as usual.
Training results in a model that can then be used for inference, which produces answers in production.
Cray's biggest contribution is to be made in the computationally intensive training phase!
Deep Learning in Production

Example of an end-to-end workflow:
Data Acquisition → Data Preparation → Model Training → Model Testing → Model Deployment

● Data Preparation: cleansing, shaping, enrichment; data annotation (ground truth)
● Model Training: split data into training, validation, and test sets; train the model; evaluate performance and optimize the model; cross-validation (iterative)
● Model Deployment: A/B testing in production; training and inferencing; model management
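The split into training, validation, and test sets in this workflow can be sketched in a few lines of Python. The function name and split ratios here are illustrative choices, not part of any Cray toolkit:

```python
import random

def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle samples and split them into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],                  # training set: fit the model weights
            shuffled[n_train:n_train + n_val],   # validation set: tune and optimize the model
            shuffled[n_train + n_val:])          # test set: held out for final evaluation

train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 700 150 150
```

Cross-validation repeats this idea by rotating which slice of the data serves as the validation set.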
Deep Learning: Behind the Scenes

Data Acquisition → Data Preparation → Model Training → Model Testing → Model Deployment
(A/B testing in production; training and inferencing)

Ideal training algorithm, for every training sample:
1. Run the sample forward through the model
2. Compute the error vs. the training data
3. Back-propagate the error through the NN to update the weights (gradient descent)

Typically broken up into "mini-batches", which exposes more intra-node parallelism and arguably reduces "noise". After all data is processed, adjust the "learning rate" and repeat until the desired accuracy is achieved.

DNN model with weights on all connections: the largest models now have hundreds of layers and millions (to billions) of nodes.
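The loop above (forward pass, error, back-propagation, mini-batches, learning-rate adjustment) can be illustrated with a deliberately tiny model: a one-weight linear fit trained by mini-batch gradient descent. This is a toy sketch of the algorithm, not Cray's training framework:

```python
import random

# Toy data: y = 3x, fit with the hypothetical model y_hat = w * x and squared-error loss.
random.seed(0)
data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(256)]]

w, lr, batch_size = 0.0, 0.1, 32
for epoch in range(20):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):           # process one mini-batch at a time
        batch = data[i:i + batch_size]
        # forward pass + error: dLoss/dw = 2(w*x - y)*x, averaged over the mini-batch
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad                                  # gradient-descent weight update
    lr *= 0.95                                          # adjust the learning rate each epoch

print(round(w, 3))  # w converges toward 3.0
```

A real DNN does the same thing per weight, with back-propagation computing the gradient layer by layer instead of this closed-form derivative.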
HPC Thinking: message-size, MPI-collective, and global all-reduce modifications
Source: Peter Mendygral and Jef Dawson, Cray PE and Performance
90%+ scaling efficiency, which can reduce training time from days to hours
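In data-parallel training, each worker computes gradients on its own shard of the mini-batch, and the per-worker gradients are combined with a global all-reduce (in MPI terms, an MPI_Allreduce) so that every rank ends up with the same averaged gradient. A minimal sketch of the pattern, simulating ranks in plain Python rather than using real MPI:

```python
def allreduce_mean(per_rank_grads):
    """Average gradient vectors across 'ranks', mimicking MPI_Allreduce(SUM) / size.

    Every simulated rank receives the same averaged gradient, which keeps
    the model replicas identical after each update step.
    """
    n_ranks = len(per_rank_grads)
    dim = len(per_rank_grads[0])
    # reduce: element-wise sum across ranks
    total = [sum(g[i] for g in per_rank_grads) for i in range(dim)]
    avg = [t / n_ranks for t in total]
    # broadcast: every rank gets the same result
    return [avg[:] for _ in range(n_ranks)]

# Four simulated ranks, each holding a gradient from its own mini-batch shard
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(allreduce_mean(grads)[0])  # [4.0, 5.0] on every rank
```

With real MPI this is one collective call, e.g. `comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)` in mpi4py followed by dividing by `comm.size` (standard mpi4py, not Cray-specific); the message-size and collective tuning the slide refers to happens inside that operation.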
Differentiating Results: TensorFlow
Cray Machine Learning / AI Environment
Cray Distributed Training Framework delivers up to 5x performance* over other distributed training approaches
* Actual performance depends on system, batch size, and model
Deep Learning Toolkits
Analytics/Machine Learning Ecosystem
NCCL
Not Just More Data But Also DIFFERENT I/O Patterns
● Modelling & simulation: large, streaming I/O (HDDs shine)
● Advanced analytics / artificial intelligence: small, random I/O (SSDs shine)
ClusterStor Converged Building Blocks: Embedded HA Lustre Object Storage Servers

                      L300                  L300F               L300N
Form factor           5U84 12Gb/s SAS       2U24 12Gb/s SAS     5U84 12Gb/s SAS
HDD/SSD               82/2                  0/24                82/2 or 80/4 (with NXD SW)
IOPS (4K rand. wr)    4,000                 500,000             40,000
Throughput (GB/s*)    10 rd / 10 wr         10 rd / 20 wr       10 rd / 10 wr
Cost/usable GB        1                     ~30                 1.15
Best used for         Large, streaming I/O  Small, random I/O   Mixed I/O

*Conservatively derated
Base rack
● No single point of failure
● 2 x GigE switches (1U each)
● 2 x IB switches (1U each)
● SMU / system management unit (2U)
● MMU / metadata management unit (2U)
● 6 x SSU (5U each)
Expansion rack
● No single point of failure
● 2 x GigE switches (1U each)
● 2 x IB switches (1U each)
● 7 x SSU (5U each)
SSU Specs – 7.2K RPM NL-SAS drives (expansion rack)

Rack     # drives (HDDs/SSDs)   8TB HDD TBs (U/R)   IOR perf (GB/s*)   Power (kW)
SSU #6   574 / 14               3304 / 4592         63                 14.9
SSU #5   492 / 12               2832 / 3936         54                 12.6
SSU #4   410 / 10               2360 / 3280         45                 10.9
SSU #3   328 / 8                1888 / 2624         36                 9.2
SSU #2   246 / 6                1416 / 1968         27                 7.4
SSU #1   164 / 4                944 / 1312          18                 5.7
SSU #0   82 / 2                 472 / 656           9                  4.0
Cray's Solution: ClusterStor System View for Multiple Systems

● Runtime variability: real-time and historical views of metrics to understand what is impacting applications
● Problem isolation: unified view of system activity enables problem isolation in complex environments
● Trend analysis: enable data-driven decisions and visualization of trends to optimize systems
● Alerting: spend more time on high-priority tasks and be notified when anomalies occur
● Activity: quickly see what jobs are running on the system and which jobs might present issues
● Utilization: performance and capacity information for multiple systems
Performance Visualization and Comparison
● Visualize: performance graphs over the life of the job for write, read, and metadata operations
● Compare: compare a job to the rest of the system at a glance
Cray's Next Machine: Convergence of Cluster & Supercomputer
● Liquid-cooled & air-cooled systems
● Single interconnect for either system
● Single system management software for either system
● Ability to carve out a portion of the system for dedicated projects with a few clicks!
● Ability to optimize the same platform for a variety of applications (cluster-focused, large-memory, large MPI jobs)

Performance
• Highest-power CPUs supported via direct liquid cooling
• Hardware & software scalable to exascale systems

TCO
• Warm-water cooling (W3 and W4 temps supported)
• Efficient power conversion
• Upgradeable for multiple generations
Cray Next Gen LC Supercomputer
Leadership Supercomputing

Highest number of systems in the HPC Top 100*
● Drive maximum computing performance while focusing on programmability
● Close the gap between observed and achievable performance
● Maximize cycles to the application
● Address issues of scale and complexity of HPC systems
● Cray developer tools profile applications with over 99,000 ranks
● Cray MPI runs with > 2 million ranks

*Nov 2017
Cray Software
● Admin interface
● Cray Linux Environment
● Lustre, Cray DVS, DataWarp
● Systems management
● High-speed interconnect
● Cray Programming Environment
● 3rd-party WLMs, containers, tools

A complete, fully integrated, extensible environment for HPC
Converged System Management
Support for broad operating and management ecosystems

● Infrastructure services: node bootstrap, orchestration, utility storage, monitoring, configuration management, network management, …
● Management services: hardware inventory, administrative control
● External interfaces: Cray REST APIs
● Managed resources: storage, events, networking, compute (CLE)
● Software ecosystems
Cray's Extensive Programming Environment

● Programming languages: Fortran, C, C++, Chapel, Python, R
● Programming models: Cray MPI, SHMEM; shared memory / GPU (OpenMP, OpenACC); PGAS & global view (UPC, Fortran coarrays, Coarray C++, Chapel)
● Programming environments: Cray Compiling Environment (PrgEnv-cray), GNU (PrgEnv-gnu), 3rd-party compilers (PrgEnv- for Intel, Allinea, LLVM, etc.)
● Tools: environment setup (Modules); debuggers (TotalView, DDT, gdb4hpc, Valgrind4hpc); Abnormal Termination Processing (ATP); distributed-memory debugging support (CCDB, STAT); performance analysis and porting (CrayPAT, Cray Apprentice2, Reveal)
● Optimized libraries: scientific libraries (LAPACK, ScaLAPACK, BLAS, Iterative Refinement Toolkit, FFTW); I/O libraries (NetCDF, HDF5)
● Analytics / AI: AI toolboxes, DL frameworks, Cray Distributed Training Framework, Chapel AI, Cray Urika AI/Analytics

Legend: Cray developed · Cray added value to 3rd party · Licensed ISV SW · 3rd-party packaging
Next Gen System-Software Themes

Scaling to exascale
• Building on current management and Linux scalability enhancements
• MPI scalability across full systems

Toward zero downtime
• Separate management and operating environments
• Concurrent maintenance
• Health and resiliency support

Run any workflow
• Customer choice of operating environment
• Broad container support
• Workload management and orchestration

Modularity
• Clean APIs between software components
• Customizable with easy integration