
Bare-Metal Style HPC Clusters On Google Cloud Platform

Dr. Joseph Schoonover
CEO, Cloud-HPC Systems Architect
Fluid Numerics, LLC
Boulder, CO
joe@fluidnumerics.com

This talk will show how we can replicate this type of environment on the cloud.

HPC Components

Compute, Network, Storage

Deployment Manager

Infrastructure as code → Python or Jinja templates and YAML dictionaries

https://cloud.google.com/deployment-manager/

resources:
- type: compute.v1.instance
  name: quickstart-deployment-vm
  properties:
    zone: us-central1-f
    machineType: <MACHINE TYPE>
    disks:
    - deviceName: boot
      type: PERSISTENT
      boot: true
      autoDelete: true
      initializeParams:
        sourceImage: <OS IMAGE>
    networkInterfaces:
    - network: <NETWORK>
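As a minimal sketch (assuming the snippet above is saved as vm.yaml; the deployment name here is arbitrary), a configuration like this can be previewed and then applied with the Deployment Manager CLI:

# Preview the deployment without creating any resources yet
gcloud deployment-manager deployments create quickstart-deployment --config=vm.yaml --preview

# Commit the previewed changes
gcloud deployment-manager deployments update quickstart-deployment

# Tear everything down when finished
gcloud deployment-manager deployments delete quickstart-deployment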

SchedMD : Slurm-GCP

https://github.com/schedmd/slurm-gcp

SchedMD : Slurm-GCP
How to use: Customize the YAML file

#slurm-cluster.yaml

imports:
- path: slurm.jinja

resources:
- name: slurm-cluster
  type: slurm.jinja
  properties:
    cluster_name            : g1
    static_node_count       : 2
    max_node_count          : 10
    zone                    : us-central1-b
    region                  : us-central1
    cidr                    : 10.10.0.0/16
    controller_machine_type : n1-standard-2
    compute_machine_type    : n1-standard-2
    login_machine_type      : n1-standard-2

SchedMD : Slurm-GCP
How to use: Deploy

gcloud deployment-manager deployments create slurm-cluster --project=<project id> --config=slurm-cluster.yaml
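Once submitted, the deployment and the resources it created can be inspected with the same CLI (deployment name as above):

# Check the deployment's state and list the instances, networks, etc. it created
gcloud deployment-manager deployments describe slurm-cluster --project=<project id>
gcloud deployment-manager resources list --deployment=slurm-cluster --project=<project id>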

SchedMD : Slurm-GCP
How to use: Login

ssh <user>@<ip-address>
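As an alternative sketch using the gcloud CLI (the instance name g1-login0 is an assumption based on the cluster_name g1 above and slurm-gcp's naming convention):

# List the cluster's instances and find the login node's external IP
gcloud compute instances list --filter="name ~ login"

# SSH through gcloud instead of a raw IP address
gcloud compute ssh g1-login0 --zone=us-central1-b

# Quick smoke test once logged in (elastic compute nodes may take a minute to boot)
sinfo
srun -N 2 hostname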

Customizations

● Stackdriver monitoring integrations

● Multiple, user-locked, micro-login nodes

● Additional NFS and Lustre storage

● Multiple partitions to support multiple applications and developer teams

● Federation: burst to expand on-prem resources

● Python/Jupyter notebook integrations
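As a hedged sketch of the notebook integration (the port number and hostnames below are illustrative assumptions, not something slurm-gcp sets up for you): run the notebook server under Slurm on a compute node, then tunnel to it through the login node.

# On the login node: start a Jupyter server on a compute node through Slurm
srun --pty jupyter notebook --no-browser --ip=0.0.0.0 --port=8888

# On your workstation: forward the notebook port through the login node
ssh -L 8888:<compute-node-name>:8888 <user>@<login-ip-address>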

Stackdriver

Built-in system monitoring and alerting platform

This data can be used to resize compute nodes to maximize resource utilization and reduce costs

Custom metrics can be integrated (e.g. GPU utilization)
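For instance, a minimal sketch of the collection side of a GPU utilization metric (pushing the samples into Stackdriver would go through the Cloud Monitoring API or agent and is not shown here):

# Sample GPU and GPU-memory utilization every 60 seconds on a GPU node;
# each sample would then be written to Stackdriver as a custom metric
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory --format=csv,noheader -l 60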

Slurm Cluster

github.com/schedmd/slurm-gcp

Customizations

+multi-partition +multi-zone

+environment modules

+github.com/spack/spack
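As an illustrative sketch (the package name is an arbitrary example), Spack builds software on the cluster and the result can be exposed through environment modules:

# Build and load a package with Spack
spack install openmpi
spack load openmpi

# Modules generated by Spack (or maintained by hand) show up as usual
module avail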

Compute Resources

Slurm + Stackdriver + BigQuery + DataStudio

Partitions

highmem-64 ( 64 CPU + 416 GB RAM ) + 8 Nvidia® Tesla® V100 GPUs

highmem-32 ( 32 CPU + 208 GB RAM ) + 4 Nvidia® Tesla® P100 GPUs

highmem-16 ( 16 CPU + 104 GB RAM ) + 4 Nvidia® Tesla® P4 GPUs

highmem-16 ( 16 CPU + 104 GB RAM ) + 4 Nvidia® Tesla® K80 GPUs

highmem-8 ( 8 CPU + 52 GB RAM ) + 1 Nvidia® Tesla® K80 GPU

highmem-8 ( 8 CPU + 52 GB RAM ) + 1 Nvidia® Tesla® P100 GPU

standard-32 ( 32 CPU + 120 GB RAM )
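As a hedged example of requesting one of these configurations through Slurm (the partition names above are assumed to map directly to Slurm partitions, and job.sh is a placeholder batch script):

# Request all 4 GPUs on a highmem-32 node (4x P100) for a two-hour job
sbatch --partition=highmem-32 --gres=gpu:4 --ntasks=32 --time=02:00:00 job.sh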

GPU Platform Comparisons

Use Case: Jeff Candy, General Atomics (CGYRO)

● Highmem-32 + 4xV100 GPU
● Highmem-32 + 4xP100 GPU
● Highmem-32 + 4xV100 GPU
● Highmem-32 + 4xV100 GPU + GPU-Direct MPI

GPU Platform Comparisons

Use Case: Jeff Candy, General Atomics (CGYRO)

Cori, Summit, and GCP comparisons

● Highmem-32 + 4xV100 GPU
● Highmem-32 + 4xP100 GPU
● Highmem-32 + 4xV100 GPU
● Highmem-32 + 4xV100 GPU + GPU-Direct MPI

Use Case

● Standard-32

● Highmem-8 + 1xP4 GPU

● Highmem-8 + 1xP100 GPU

● Highmem-8 + 1xV100 GPU

● Highmem-64 + 8xV100 GPU

Cloud Cost Analysis
Use Case

● Standard-32
● Highmem-8 + 1xP4 GPU
● Highmem-8 + 1xP100 GPU
● Highmem-8 + 1xV100 GPU
● Highmem-64 + 8xV100 GPU

(Cost chart values from the slide: $1.06, $16.54, $0.65, $0.75, $1.35, $2.07)

Cloud Cost Analysis
Use Case

● Standard-32
● Highmem-8 + 1xP4 GPU
● Highmem-8 + 1xP100 GPU
● Highmem-8 + 1xV100 GPU
● Highmem-64 + 8xV100 GPU

(Chart values from the slide: 345.8, 22,386.3, 4,070.5, 3,460.7, 1,570.6, 809.9)

Cloud Cost Analysis
Use Case

● Standard-32
● Highmem-8 + 1xP4 GPU
● Highmem-8 + 1xP100 GPU
● Highmem-8 + 1xV100 GPU
● Highmem-64 + 8xV100 GPU

(Cost chart values from the slide: $6.62, $1.59, $0.73, $0.72, $0.59, $0.47)

HPC DevOps

A Cultural Shift

Currently, organizations maintain one cluster that supports many teams, and the teams of scientists and developers are separated from the infrastructure and admin teams.

With per-team cloud clusters, developer teams or divisions can instead experiment and develop on their own cluster. This requires infrastructure and admin staff to interact more closely with scientists and developers.

Impromptu Tutorial this afternoon

At 3:45, we'll do a tutorial where you can spin up your own elastic Slurm cluster on GCP.

https://codelabs.developers.google.com/codelabs/hpc-slurm-on-gcp/

Further Questions

Contact : joe@fluidnumerics.com

https://fluidnumerics.com