Self-Service Supercomputing

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. London Summit July 2016 HPC Clusters as code in the [almost]* Infinite cloud Brendan Bouffler AWS Global Scientific Computing @boofla 2016-07-07 Wil Mayers Alces Flight Ltd (UK) @alcesflight

Transcript of Self-Service Supercomputing

Page 1: Self-Service Supercomputing

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

London Summit July 2016

HPC Clusters as code in the [almost]* Infinite cloud

Brendan Bouffler, AWS Global Scientific Computing

@boofla

2016-07-07

Wil Mayers, Alces Flight Ltd (UK)

@alcesflight

Page 2: Self-Service Supercomputing

Scientific Computing

Science is one of the greatest areas of computation and can benefit from the democratization in cost and global accessibility that the cloud brings.

It’s also where we think Amazon can make a huge, really disruptive impact on the world by participating - which is, at the most basic level, what we are about as a company.

Page 3: Self-Service Supercomputing

Disrupting science, wherever it’s happening.

Page 4: Self-Service Supercomputing

Existing
1. Oregon
2. California
3. Virginia
4. Dublin
5. Frankfurt
6. Singapore
7. Sydney
8. Seoul
9. Tokyo
10. Sao Paulo
11. Beijing
12. US GovCloud

1. Ohio
2. India
3. UK
4. Canada
5. China +1

AWS Region Availability Zone

Regions are sovereign. Your data never leaves.

Page 5: Self-Service Supercomputing

Public Data Sets

Workloads to the data

Data to the workloads

Page 6: Self-Service Supercomputing

Meeeeelions of uncorrelated workloads

[Chart: cores over time]

Collective action

When everyone comes together in the cloud to share the resource, and only pays for what they use, the efficiency is huge.

Page 7: Self-Service Supercomputing

Spot Market

[Chart: cores over time]

Our ultimate space filler.

Spot Instances allow you to name your own price for spare AWS EC2 computing capacity.

Great for workloads that aren’t time sensitive, and especially popular in research (hint: it’s really cheap).

Page 8: Self-Service Supercomputing

Spot Market Behavior

Spot Bid Advisor

The Spot Bid Advisor analyzes Spot price history to help you determine a bid price that suits your needs.

You should weigh your application’s tolerance for interruption and your cost saving goals when selecting a Spot instance and bid price.

The lower your frequency of being outbid, the longer your Spot instances are likely to run without interruption.

https://aws.amazon.com/ec2/spot/bid-advisor/

Bid Price & Savings

Your bid price affects your ranking when it comes to acquiring resources in the Spot market, and is the maximum price you will pay.

But frequently you’ll pay a lot less.
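
The relationship between bid price and what you actually pay can be sketched in a few lines of Python. This is a deliberately simplified model, not the real Spot billing rules: it assumes one price per hour, charges the market price (never the bid) whenever the market price is at or below the bid, and counts an interruption whenever a running instance is outbid. The price history and the function name simulate_spot_run are made up for illustration.

```python
from typing import List, Tuple


def simulate_spot_run(prices: List[float], bid: float) -> Tuple[float, int, int]:
    """Return (total_cost, hours_run, interruptions) for one instance.

    Simplified model (an assumption, not the real Spot billing rules):
    - each entry in `prices` is the market price for one hour;
    - if the market price is at or below the bid, the instance runs and
      is charged the market price, not the bid;
    - if the market price is above the bid, the instance does not run.
    """
    total_cost = 0.0
    hours_run = 0
    interruptions = 0
    running = False
    for price in prices:
        if price <= bid:
            total_cost += price
            hours_run += 1
            running = True
        else:
            if running:
                interruptions += 1  # we were running and got outbid
            running = False
    return total_cost, hours_run, interruptions


# Hypothetical hourly prices in USD, purely illustrative.
history = [0.10, 0.12, 0.11, 0.45, 0.13, 0.10]
cost, hours, outbid = simulate_spot_run(history, bid=0.30)
print(f"ran {hours} h, paid ${cost:.2f}, interrupted {outbid} time(s)")
print(f"paying the bid every hour would have cost ${0.30 * hours:.2f}")
```

In this toy history the instance pays well under its bid for every hour it runs and loses a single hour to a price spike, which is the trade-off between interruption tolerance and savings described above.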

Page 9: Self-Service Supercomputing

Agility is… Paying Only for IT You Use

Peak: 58K cores

Valley: 12K cores

Page 10: Self-Service Supercomputing

Breakthrough discoveries in the Cloud

The CHILES project astronomers have detected radio emissions from hydrogen in a galaxy more than 5 billion light years away, almost doubling the previous record distance. This has important implications for our understanding of how galaxies have evolved over time.

The team at ICRAR in Western Australia estimates that the amount of compute capacity required to shift and crunch this data would have made this work infeasible.

By using AWS, they were able to quickly and cheaply build their new pipelines, and then scale them as massive amounts of data arrived from their instruments.

Page 11: Self-Service Supercomputing

Science is about experimentation

Page 12: Self-Service Supercomputing

AWS Building blocks

TECHNICAL & BUSINESS SUPPORT
Account Management, Support, Professional Services, Solutions Architects, Training & Certification, Security & Pricing Reports, Partner Ecosystem

AWS MARKETPLACE
Backup, Big Data & HPC, Business Apps, Databases, Development, Industry Solutions, Security

MANAGEMENT TOOLS
Queuing, Notifications, Search, Orchestration, Email

ENTERPRISE APPS
Virtual Desktops, Storage Gateway, Sharing & Collaboration, Email & Calendaring, Directories

HYBRID CLOUD MANAGEMENT
Backups, Deployment, Direct Connect, Identity Federation, Integrated Management

SECURITY & MANAGEMENT
Virtual Private Networks, Identity & Access, Encryption Keys, Configuration, Monitoring, Dedicated

INFRASTRUCTURE SERVICES
Regions, Availability Zones, Compute, Storage (Objects, Blocks, Files), Databases (SQL, NoSQL, Caching), CDN, Networking

PLATFORM SERVICES
App: Mobile & Web Front-end, Functions, Identity, Data Store, Real-time
Development: Containers, Source Code, Build Tools, Deployment, DevOps
Mobile: Sync, Identity, Push Notifications, Mobile Analytics, Mobile Backend
Analytics: Data Warehousing, Hadoop, Streaming, Data Pipelines, Machine Learning

Page 13: Self-Service Supercomputing

EC2

There are a couple dozen EC2 compute instance types alone, each of which is optimized for different things.

One size does not fit all.

Page 14: Self-Service Supercomputing

C4

Intel Xeon E5-2666 v3, custom built for AWS.

Intel Haswell, 16 FLOPS/tick

2.9 GHz, turbo to 3.5 GHz

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html

Feature                                                   Specification
Processor Number                                          E5-2666 v3
Intel® Smart Cache                                        25 MiB
Instruction Set                                           64-bit
Instruction Set Extensions                                AVX 2.0
Lithography                                               22 nm
Processor Base Frequency                                  2.9 GHz
Max All Core Turbo Frequency                              3.2 GHz
Max Turbo Frequency                                       3.5 GHz (available on c4.2xLarge)
Intel® Turbo Boost Technology                             2.0
Intel® vPro Technology                                    Yes
Intel® Hyper-Threading Technology                         Yes
Intel® Virtualization Technology (VT-x)                   Yes
Intel® Virtualization Technology for Directed I/O (VT-d)  Yes
Intel® VT-x with Extended Page Tables (EPT)               Yes
Intel® 64                                                 Yes
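
The "16 FLOPS/tick" figure can be turned into a theoretical peak with simple arithmetic: cores x clock x FLOPs per cycle. The sketch below is illustrative only. It assumes the 16 FLOPs per cycle applies per physical core, and it assumes 18 physical cores, which is what a c4.8xlarge would offer if its 36 vCPUs are two Hyper-Threads per core. Real applications typically reach only a fraction of this number.

```python
def peak_gflops(cores: int, ghz: float, flops_per_cycle: int = 16) -> float:
    """Theoretical peak in GFLOPS = cores x clock (GHz) x FLOPs per cycle."""
    return cores * ghz * flops_per_cycle


# Assumption: 18 physical cores (36 vCPUs with Hyper-Threading), as on a c4.8xlarge.
CORES = 18

base = peak_gflops(CORES, 2.9)            # base frequency from the table
all_core_turbo = peak_gflops(CORES, 3.2)  # max all-core turbo from the table
print(f"base: {base:.1f} GFLOPS, all-core turbo: {all_core_turbo:.1f} GFLOPS")
```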

Page 15: Self-Service Supercomputing

cfnCluster - provision an HPC cluster in minutes

#cfncluster

https://github.com/awslabs/cfncluster

cfncluster is a sample code framework that deploys and maintains clusters on AWS. It is reasonably agnostic to what the cluster is for and can easily be extended to support different frameworks. The CLI is stateless, everything is done using CloudFormation or resources within AWS.

10 minutes

http://boofla.io/u/cfnCluster – (Boof’s HOWTO slides)

Page 16: Self-Service Supercomputing
Page 17: Self-Service Supercomputing

• 750+ popular scientific applications

AWS Marketplace

immediately

Introducing Alces Flight - self-scaling HPC clusters instantly ready to compute, billed by the hour and using the AWS Spot market by default to achieve supercomputing for ~1c per core per hour.

Self-service HPC … 2016

http://boofla.io/u/alcesFlight

Page 18: Self-Service Supercomputing

Requirements for Launching your HPC cluster

• An Amazon Web Services (AWS) account
• An SSH key-pair in your AWS region
• An SSH client
• Optionally, a VNC client
• A workload to process

Page 19: Self-Service Supercomputing

Wil Mayers, Alces

Page 20: Self-Service Supercomputing

Searching AWS Marketplace

Page 21: Self-Service Supercomputing

Selecting Alces Flight from Marketplace

Page 22: Self-Service Supercomputing

Launching a new cluster

Page 23: Self-Service Supercomputing

CloudFormation cluster launch

Page 24: Self-Service Supercomputing

Access IP address

Page 25: Self-Service Supercomputing

Logging in to your Flight Cluster

Page 26: Self-Service Supercomputing

Cluster Architecture VPC

• Virtual Private Cloud (VPC)
• One login node
• EBS volume for data/apps
• Compute node scaling group

• 2 to 1,152 cores
• Deployed in placement group
• Static or auto-scaling
• On-demand or Spot instances
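
As a rough sketch of how an auto-scaling compute group might size itself within fixed limits, the function below converts a number of requested cores into a node count and clamps it between a minimum and a maximum. This is not Alces Flight's actual scaling logic. The 36 cores per node (the vCPU count of a c4.8xlarge) and the limits of 1 and 32 nodes are assumptions chosen so that the upper bound matches 1,152 cores.

```python
import math


def desired_nodes(queued_cores: int, cores_per_node: int,
                  min_nodes: int, max_nodes: int) -> int:
    """Nodes needed to cover the queued cores, clamped to the configured limits."""
    needed = math.ceil(queued_cores / cores_per_node)
    return max(min_nodes, min(needed, max_nodes))


# Assumed values for illustration: 36 cores per node, between 1 and 32 nodes.
CORES_PER_NODE = 36
MIN_NODES, MAX_NODES = 1, 32

for queued in (0, 360, 5000):
    nodes = desired_nodes(queued, CORES_PER_NODE, MIN_NODES, MAX_NODES)
    print(f"{queued} queued cores -> {nodes} node(s), {nodes * CORES_PER_NODE} cores")
```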

Page 27: Self-Service Supercomputing

Linux cluster facilities

• CentOS Linux cluster
• Full root access to all nodes
• Genders utility
• PDSH utility
• YUM install any software

Page 28: Self-Service Supercomputing

Graphical Desktop sessions

• Create a session
• Share connection details
• Join the session via VNC
• Other collaborators can join

Page 29: Self-Service Supercomputing

Using Graphical Applications

Page 30: Self-Service Supercomputing

Installing Scientific Applications

• Simple command-line tool to install applications

Page 31: Self-Service Supercomputing

Installing by Scientific Discipline

• Choose a depot of applications to install

Page 32: Self-Service Supercomputing

Alces Gridware Application library

• Over 850 application, library and MPI versions
• Pre-optimized and stored in S3
• Option to compile and optimize on-demand
• Includes modules environment management
• Gridware project keeps pace with latest versions
• Support for commercial and licensed applications
• http://tiny.cc/gridware

Page 33: Self-Service Supercomputing

Using Storage Services

• Cluster includes large storage volume for data and apps

• Tools to manage data held in object storage

• Store your data in AWS S3 quickly and easily

S3

Page 34: Self-Service Supercomputing

Cluster job scheduler

• Choice of HPC cluster job schedulers

• Automate job processing on your HPC cluster

• Queue jobs for processing when nodes are available

• Auto-scaling compute nodes within user-defined limits

• Automatically rerun any jobs stopped when the Spot price exceeds your bid
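
The "queue, run, rerun if interrupted" behavior can be illustrated with a toy scheduler loop. This is a minimal sketch, not how any particular HPC scheduler is implemented: run_queue, flaky_node and the max_attempts limit are hypothetical names and choices, and a real scheduler would track nodes, resources and priorities as well.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def run_queue(jobs: List[str], try_run: Callable[[str, int], bool],
              max_attempts: int = 5) -> List[str]:
    """Run jobs in FIFO order, re-queueing any job that is interrupted.

    `try_run(job, attempt)` returns True if the job finished and False if
    it was stopped (for example, because its node was reclaimed). Returns
    the order in which jobs completed.
    """
    queue: Deque[Tuple[str, int]] = deque((job, 1) for job in jobs)
    completed: List[str] = []
    while queue:
        job, attempt = queue.popleft()
        if try_run(job, attempt):
            completed.append(job)
        elif attempt < max_attempts:
            queue.append((job, attempt + 1))  # back of the queue, try again later
        else:
            raise RuntimeError(f"{job} failed {max_attempts} times")
    return completed


# Illustrative stand-in for a compute node: job "b" is interrupted on its first attempt.
def flaky_node(job: str, attempt: int) -> bool:
    return not (job == "b" and attempt == 1)


print(run_queue(["a", "b", "c"], flaky_node))
```

Here job "b" is stopped once, goes to the back of the queue, and completes after "c", so no work is lost even though a node went away mid-run.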

Page 35: Self-Service Supercomputing

Workload to process #1

Landsat cloud coverage survey

Page 36: Self-Service Supercomputing

Landsat Satellite mapping data

• Continuous record of Earth’s surface

• Data from the 1970s to present day

• Public data set available to everyone

• Stored on object storage, including AWS S3

Page 37: Self-Service Supercomputing

Workload

• Survey of cloud cover around Northern Tropic
• Task-array job running 360 degrees around the Earth
• Measures average cloud cover in each image
• Generates a deck of sample images
• Uploads deck to S3 object storage
• Uses 360 x compute cores
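
A task-array job of this shape runs the same script 360 times, with each task handling one degree of longitude. The sketch below shows one plausible way a task could map its ID to a longitude and average cloud cover over its images. It is not the actual job script from the demo: the function names, the ID-to-longitude convention and the sample percentages are all assumptions. In practice the task ID usually comes from a scheduler-provided environment variable (for example SGE_TASK_ID under Grid Engine or SLURM_ARRAY_TASK_ID under Slurm).

```python
from statistics import mean
from typing import Dict, List


def longitude_for_task(task_id: int) -> int:
    """Map task IDs 1..360 to longitudes -180..179 (one degree per task)."""
    if not 1 <= task_id <= 360:
        raise ValueError("task_id must be between 1 and 360")
    return task_id - 181


def average_cloud_cover(scene_cover_percent: List[float]) -> float:
    """Average the per-image cloud cover values handled by one task."""
    return mean(scene_cover_percent)


# Hypothetical per-image cloud cover percentages for two of the 360 tasks.
sample: Dict[int, List[float]] = {1: [10.0, 30.0], 181: [50.0, 70.0, 90.0]}
for task_id, covers in sample.items():
    lon = longitude_for_task(task_id)
    print(f"task {task_id}: longitude {lon}, mean cloud cover {average_cloud_cover(covers):.1f}%")
```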


Page 38: Self-Service Supercomputing

Workflow

1. Launch your cluster
2. Enable object storage
3. Install application
4. Fetch job-script
5. Submit job

Page 39: Self-Service Supercomputing

Approximate costs

• 360 jobs each taking ~5 mins
• Total CPU time = 30 core hours

• Cost of 36 core hours in AWS Spot market* = $0.44
• Cost of one T2 login node for 1 hour* = $0.12
• Cost of 100GB EBS volume for apps* = <$0.01
• Alces Flight software cost = $0.00

• Total cost per daily run = $0.60 / 45p
• Cost for one year of research = $219 / £168

* based on C4.8xlarge spot rate in EU-West region; T2.large on-demand instance; EBS st1 volume; excludes S3 storage costs and sales tax where applicable
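
The arithmetic behind these figures can be checked with a short script. It uses only the numbers quoted on the slide; the one assumption is that the "<$0.01" EBS line is rounded up to $0.01. The itemized sum comes out slightly under the slide's rounded $0.60 per run, and the yearly figure follows from running once a day at that rounded price.

```python
# CPU time: 360 jobs of roughly 5 minutes each.
core_hours = 360 * 5 / 60
print(f"total CPU time: {core_hours:.0f} core hours")

# Line items from the slide, in USD per daily run.
spot_compute = 0.44   # 36 core hours on the Spot market
login_node = 0.12     # one T2 login node for 1 hour
ebs_volume = 0.01     # slide says "<$0.01"; rounded up here (assumption)
software = 0.00       # Alces Flight software

per_run = spot_compute + login_node + ebs_volume + software
print(f"itemized cost per run: ${per_run:.2f}")

rounded_run = 0.60    # the slide's rounded per-run figure
per_year = rounded_run * 365
print(f"one run per day for a year: ${per_year:.2f}")

# Implied Spot rate per core hour from the first line item.
print(f"implied Spot rate: ${spot_compute / 36:.4f} per core hour")
```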

Page 40: Self-Service Supercomputing

Workload to process #2

Computational Fluid Dynamics with OpenFOAM

Page 41: Self-Service Supercomputing

OpenFOAM CFD

• Computational Fluid Dynamics workload
• Simulates liquid and air-flow for engineering projects
• Open-source software available to all
• Commercial support available from CFD Direct Ltd.
• Run as a parallel job across multiple compute nodes

Page 42: Self-Service Supercomputing

Workload

• Generate a mesh representing the problem
• Decomposition of the problem into sections
• Processing of the sections
• Visualization of the solution
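
The decomposition step is what lets the solve run in parallel: the mesh is split into sections and each process works on its own section. The toy sketch below splits a one-dimensional range of cell indices into near-equal contiguous parts. It only illustrates the idea and is not OpenFOAM's own decomposition method, which works on 3D meshes and also has to manage the boundaries between sections; the function name and sizes are hypothetical.

```python
from typing import List


def decompose(n_cells: int, n_parts: int) -> List[range]:
    """Split cell indices 0..n_cells-1 into n_parts contiguous, near-equal ranges."""
    if n_parts < 1 or n_cells < n_parts:
        raise ValueError("need at least one cell per part")
    base, extra = divmod(n_cells, n_parts)
    parts: List[range] = []
    start = 0
    for rank in range(n_parts):
        size = base + (1 if rank < extra else 0)  # first `extra` parts get one more cell
        parts.append(range(start, start + size))
        start += size
    return parts


# Hypothetical sizes: 10 cells shared among 4 processes.
for rank, cells in enumerate(decompose(10, 4)):
    print(f"process {rank}: cells {list(cells)}")
```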

Page 43: Self-Service Supercomputing

Workflow

1. Launch your cluster
2. Enable object storage
3. Install application
4. Fetch job-script
5. Submit job
6. Start desktop
7. Visualize

Page 44: Self-Service Supercomputing

Visualization with ParaView

Page 45: Self-Service Supercomputing

Approximate costs (full solve)

• 1 job using 128 cores taking 4 hours
• Total CPU time = 1024 core hours

• Cost of 1024 core hours in AWS Spot market* = $7.04
• Cost of one T2 login node for 4 hours* = $0.45
• Cost of 100GB EBS volume for apps* = $0.02
• Alces Flight software cost = $0.00

• Total cost per simulation = $7.51 / £5.75

* based on C4.8xlarge spot rate in EU-West region; T2.large on-demand instance; EBS st1 volume; excludes sales tax where applicable

Page 46: Self-Service Supercomputing

Filesystems in the marketplace, too

BeeGFS is a scalable parallel cluster filesystem, developed by the Fraunhofer Institute with a strong focus on performance and designed for easy installation and management.

Intel Lustre® Cloud Edition is a scalable, parallel file system purpose-built for HPC and with a long history in the field supporting a range of workloads.

There’s more to come - the AWS Marketplace is growing all the time and new offerings are added frequently. Watch this space.

There are cluster filesystem options, too, for when you need extreme I/O scaling.

Page 47: Self-Service Supercomputing

How to start?

1. AWS Account

3. A problem to solve

Page 48: Self-Service Supercomputing

Please remember to rate this session under My Agenda on

awssummit.london

Page 49: Self-Service Supercomputing