Self-Service Supercomputing

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

London Summit July 2016

HPC Clusters as code in the [almost]* Infinite cloud

Brendan Bouffler, AWS Global Scientific Computing

@boofla

2016-07-07

Wil Mayers, Alces Flight Ltd (UK)

@alcesflight

Scientific Computing

Science is one of the greatest areas of computation and can benefit from a democratization in cost and global accessibility that the cloud brings.

It’s also where we think Amazon can make a huge, really disruptive, impact on the world by participating - which is, at the most basic level, what we are about as a company.

Disrupting science, wherever it’s happening.

Existing
1. Oregon
2. California
3. Virginia
4. Dublin
5. Frankfurt
6. Singapore
7. Sydney
8. Seoul
9. Tokyo
10. Sao Paulo
11. Beijing
12. US GovCloud

1. Ohio
2. India
3. UK
4. Canada
5. China+1

AWS Region Availability Zone

Regions are sovereign: your data never leaves.

Public Data Sets

workloads to the data, data to the workloads

Meeeeelions of uncorrelated workloads

[Chart: cores over time]

Collective action

When everyone comes together in the cloud to share the resource, and only pays for what they use, the efficiency is huge.
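A rough way to see why pooling helps: the toy simulation below (a sketch under stated assumptions, not data from AWS) compares the peak-to-average demand of one bursty workload with that of many independent ones. The busy probability, step count, and workload count are all illustrative choices.

```python
import random


def simulate_demand(n_workloads, n_steps, busy_probability, seed=0):
    """Return total cores in use at each time step.

    Each workload independently needs one core with the given probability
    at every step, which is a deliberately crude model of bursty demand.
    """
    rng = random.Random(seed)
    return [
        sum(1 for _ in range(n_workloads) if rng.random() < busy_probability)
        for _ in range(n_steps)
    ]


def peak_to_mean(demand):
    """Ratio of peak demand to average demand (infinite if never busy)."""
    mean = sum(demand) / len(demand)
    return float("inf") if mean == 0 else max(demand) / mean


one_user = simulate_demand(n_workloads=1, n_steps=500, busy_probability=0.1)
many_users = simulate_demand(n_workloads=1000, n_steps=500, busy_probability=0.1)

print("peak/mean for one workload:  ", peak_to_mean(one_user))
print("peak/mean for 1000 workloads:", round(peak_to_mean(many_users), 2))
```

A single user has to provision for a peak far above their average; the pooled demand is much flatter, so far less capacity sits idle.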

Spot Market

[Chart: cores over time]

Spot Market

Our ultimate space filler.

Spot Instances allow you to name your own price for spare AWS EC2 computing capacity.

Great for workloads that aren’t time sensitive, and especially popular in research (hint: it’s really cheap).

Spot Market Behavior: Spot Bid Advisor

The Spot Bid Advisor analyzes Spot price history to help you determine a bid price that suits your needs.

You should weigh your application’s tolerance for interruption and your cost saving goals when selecting a Spot instance and bid price.

The lower your frequency of being outbid, the longer your Spot instances are likely to run without interruption.
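To make "frequency of being outbid" concrete, here is a minimal sketch that estimates it from a price history. The prices are made up for illustration, and this is not how the Spot Bid Advisor itself is implemented.

```python
def outbid_frequency(price_history, bid):
    """Fraction of observed Spot prices that were strictly above the bid."""
    if not price_history:
        raise ValueError("price_history must not be empty")
    outbid = sum(1 for price in price_history if price > bid)
    return outbid / len(price_history)


# Hypothetical hourly Spot prices in USD for one instance type.
history = [0.40, 0.42, 0.44, 0.90, 0.41, 0.43, 1.20, 0.44]

for bid in (0.42, 0.45, 1.00, 1.50):
    print(f"bid ${bid:.2f}: outbid {outbid_frequency(history, bid):.0%} of the time")
```

Raising the bid lowers the outbid frequency, which is the trade-off between interruption tolerance and cost ceiling described above.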

https://aws.amazon.com/ec2/spot/bid-advisor/

Bid Price & Savings

Your bid price affects your ranking when it comes to acquiring resources in the Spot market, and is the maximum price you will pay.

But frequently you’ll pay a lot less.
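The gap between the bid and what you actually pay can be sketched with a simplified model: the instance runs only in hours where the market price is at or below the bid, and each of those hours is charged at the market price. The prices and bid below are hypothetical, and real billing has more rules than this.

```python
def spot_run_cost(hourly_prices, bid):
    """Simulate one instance under a simplified Spot model.

    For each hour the instance runs only if the market price is at or
    below the bid, and it is charged the market price, not the bid.
    Returns (hours_run, total_cost).
    """
    hours_run = 0
    total_cost = 0.0
    for price in hourly_prices:
        if price <= bid:
            hours_run += 1
            total_cost += price
    return hours_run, total_cost


prices = [0.40, 0.42, 0.44, 0.90, 0.41]  # hypothetical USD per hour
bid = 0.60

hours, cost = spot_run_cost(prices, bid)
print(f"ran {hours} h, paid ${cost:.2f}, bid ceiling would be ${hours * bid:.2f}")
```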

Agility is… Paying Only for IT You Use

Peak: 58K cores

Valley: 12K cores

Breakthrough discoveries in the Cloud

The CHILES project astronomers have detected radio emissions from hydrogen in a galaxy more than 5 billion light years away, almost doubling the previous distance record. This has important implications for our understanding of how galaxies have evolved over time.

The team at ICRAR in Western Australia estimates that the amount of compute capacity required to shift and crunch this data would otherwise have made this work infeasible.

By using AWS, they were able to quickly and cheaply build their new pipelines, and then scale them as massive amounts of data arrived from their instruments.

Science is about experimentation

AWS Building blocks

TECHNICAL & BUSINESS SUPPORT

Account Management

Support

Professional Services

Solutions Architects

Training & Certification

Security & Pricing Reports

Partner Ecosystem

AWS MARKETPLACE

Backup

Big Data & HPC

Business Apps

Databases

Development

Industry Solutions

Security

MANAGEMENT TOOLS

Queuing

Notifications

Search

Orchestration

Email

ENTERPRISE APPS

Virtual Desktops

Storage Gateway

Sharing & Collaboration

Email & Calendaring

Directories

HYBRID CLOUD MANAGEMENT

Backups

Deployment

Direct Connect

Identity Federation

Integrated Management

SECURITY & MANAGEMENT

Virtual Private Networks

Identity & Access

Encryption Keys

Configuration, Monitoring, Dedicated

INFRASTRUCTURE SERVICES

Regions, Availability Zones, Compute

Storage: Objects, Blocks, Files

Databases: SQL, NoSQL, Caching

CDN, Networking

PLATFORM SERVICES

App

Mobile & Web Front-end

Functions

Identity

Data Store

Real-time

Development

Containers

Source Code

Build Tools

Deployment

DevOps

Mobile

Sync

Identity

Push Notifications

Mobile Analytics

Mobile Backend

Analytics

Data Warehousing

Hadoop

Streaming

Data Pipelines

Machine Learning

EC2

There are a couple dozen EC2 compute instance types alone, each of which is optimized for different things.

One size does not fit all.

C4

Intel Xeon E5-2666 v3, custom built for AWS.

Intel Haswell, 16 FLOPS/tick

2.9 GHz, turbo to 3.5 GHz

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html
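As a back-of-the-envelope check, the per-core theoretical peak follows from multiplying the FLOPS-per-tick figure by the clock speed. The sketch below uses only the numbers quoted on this slide; sustained performance on real codes will be lower.

```python
def peak_gflops_per_core(flops_per_cycle, clock_ghz):
    """Theoretical peak GFLOPS for one core: FLOPs per cycle times GHz."""
    return flops_per_cycle * clock_ghz


FLOPS_PER_CYCLE = 16  # figure quoted on the slide for Haswell

for label, ghz in (("base", 2.9), ("max turbo", 3.5)):
    print(f"{label}: {peak_gflops_per_core(FLOPS_PER_CYCLE, ghz):.1f} GFLOPS per core")
```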

Feature                                                   Specification
Processor Number                                          E5-2666 v3
Intel® Smart Cache                                        25 MiB
Instruction Set                                           64-bit
Instruction Set Extensions                                AVX 2.0
Lithography                                               22 nm
Processor Base Frequency                                  2.9 GHz
Max All Core Turbo Frequency                              3.2 GHz
Max Turbo Frequency                                       3.5 GHz (available on c4.2xLarge)
Intel® Turbo Boost Technology                             2.0
Intel® vPro Technology                                    Yes
Intel® Hyper-Threading Technology                         Yes
Intel® Virtualization Technology (VT-x)                   Yes
Intel® Virtualization Technology for Directed I/O (VT-d)  Yes
Intel® VT-x with Extended Page Tables (EPT)               Yes
Intel® 64                                                 Yes

cfnCluster - provision an HPC cluster in minutes

#cfncluster

https://github.com/awslabs/cfncluster

cfncluster is a sample code framework that deploys and maintains clusters on AWS. It is reasonably agnostic to what the cluster is for and can easily be extended to support different frameworks. The CLI is stateless; everything is done using CloudFormation or resources within AWS.

10 minutes

http://boofla.io/u/cfnCluster – (Boof’s HOWTO slides)
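To illustrate what "everything is done using CloudFormation" can look like from a stateless client, here is a minimal sketch that assembles a stack request in the shape CloudFormation's CreateStack call accepts. It is not cfncluster's own code: the helper name, the template URL, and the parameter names (KeyName, InitialQueueSize) are hypothetical placeholders.

```python
def build_stack_request(cluster_name, template_url, settings):
    """Build keyword arguments in the shape CloudFormation's CreateStack expects.

    The parameter names inside `settings` are hypothetical; a real template
    defines its own.
    """
    return {
        "StackName": cluster_name,
        "TemplateURL": template_url,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in sorted(settings.items())
        ],
        "Capabilities": ["CAPABILITY_IAM"],
    }


request = build_stack_request(
    cluster_name="demo-cluster",
    template_url="https://example.com/cluster-template.json",  # placeholder
    settings={"KeyName": "my-keypair", "InitialQueueSize": 2},
)
print(request)

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("cloudformation").create_stack(**request)
```

Because the cluster's state lives in the stack rather than on the client, any machine with the right credentials can later describe, update, or delete it.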

• 750+ popular scientific applications

AWS Marketplace

immediately

Introducing Alces Flight - self-scaling HPC clusters instantly ready to compute, billed by the hour and using the AWS Spot market by default to achieve supercomputing for ~1c per core per hour.

Self-service HPC … 2016

http://boofla.io/u/alcesFlight

Requirements for Launching your HPC cluster

• An Amazon Web Services (AWS) account
• An SSH key-pair in your AWS region
• An SSH client
• Optionally – a VNC client
• A workload to process

Wil Mayers, Alces

Searching AWS Marketplace

Selecting Alces Flight from Marketplace

Launching a new cluster

CloudFormation cluster launch

Access IP address

Logging in to your Flight Cluster

Cluster Architecture VPC

• Virtual Private Cluster (VPC)
• One login node
• EBS volume for data/apps
• Compute node scaling group

• 2 to 1,152 cores
• Deployed in placement group
• Static or auto-scaling
• On-demand or Spot instances
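The core limits above translate into node counts once an instance size is chosen. The sketch below assumes 36 cores per node (a c4.8xlarge exposes 36 vCPUs); the function name and the clamping behaviour are illustrative, not Alces Flight's actual scaling logic.

```python
import math


def nodes_needed(requested_cores, cores_per_node, min_cores=2, max_cores=1152):
    """Clamp a core request to the cluster limits and round up to whole nodes."""
    if cores_per_node <= 0:
        raise ValueError("cores_per_node must be positive")
    cores = max(min_cores, min(requested_cores, max_cores))
    return math.ceil(cores / cores_per_node)


# Assumption: 36 cores per node, as on a c4.8xlarge (36 vCPUs).
for request in (1, 128, 360, 5000):
    print(request, "cores ->", nodes_needed(request, cores_per_node=36), "nodes")
```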

Linux cluster facilities

• CentOS Linux cluster
• Full root access to all nodes
• Genders utility
• PDSH utility
• YUM install any software

Graphical Desktop sessions

• Create a session
• Share connection details
• Join to the session via VNC
• Other collaborators can join

Using Graphical Applications

Installing Scientific Applications

• Simple command-line tool to install applications

Installing by Scientific Discipline

• Choose a depot of applications to install

Alces Gridware Application library

• Over 850 application, library and MPI versions
• Pre-optimized and stored in S3
• Option to compile and optimize on-demand

• Includes modules environment management
• Gridware project keeps pace with latest versions
• Support for commercial and licensed applications
• http://tiny.cc/gridware

Using Storage Services

• Cluster includes large storage volume for data and apps

• Tools to manage data held in object storage

• Store your data in AWS S3 quickly and easily

S3

Cluster job scheduler

• Choice of HPC cluster job schedulers

• Automate job processing on your HPC cluster

• Queue jobs for processing when nodes are available

• Auto-scaling compute nodes within user-defined limits

• Automatically rerun any jobs stopped when spot price exceeded
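The rerun behaviour can be pictured as a queue that puts interrupted work back at the end. This is a toy simulation only: real schedulers track far more state, and the `interrupted_attempts` set and `max_attempts` limit are hypothetical devices for the example.

```python
from collections import deque


def run_queue(jobs, interrupted_attempts, max_attempts=3):
    """Process jobs in FIFO order, requeueing any attempt that is interrupted.

    `interrupted_attempts` is a set of (job, attempt_number) pairs that the
    simulation treats as stopped by a Spot interruption.
    Returns the list of (job, attempt_number) pairs that completed.
    """
    queue = deque((job, 1) for job in jobs)
    completed = []
    while queue:
        job, attempt = queue.popleft()
        if (job, attempt) in interrupted_attempts:
            if attempt < max_attempts:
                queue.append((job, attempt + 1))  # rerun later
            continue
        completed.append((job, attempt))
    return completed


done = run_queue(
    jobs=["task-1", "task-2", "task-3"],
    interrupted_attempts={("task-2", 1)},
)
print(done)
```

Here task-2 loses its first attempt, goes to the back of the queue, and completes on its second attempt after task-3.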

Workload to process #1

Landsat cloud coverage survey

Landsat Satellite mapping data

• Continuous record of Earth’s surface

• Data from the 1970s to present day

• Public data set available to everyone

• Stored on object storage, including AWS S3

Workload

• Survey of cloud cover around Northern Tropic
• Task-array job running 360 degrees around the Earth
• Measures average cloud cover in each image
• Generates a deck of sample images
• Uploads deck to S3 object storage
• Uses 360 compute cores
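A task-array job of this kind gives each task an index and lets the task derive its own slice of the problem. The sketch below is a stand-in rather than the presenters' job script: the ID-to-longitude mapping and the 0/1 cloud mask are assumptions made for illustration.

```python
def longitude_for_task(task_id):
    """Map a 1-based task ID (1..360) to a longitude in degrees (-180..179)."""
    if not 1 <= task_id <= 360:
        raise ValueError("task_id must be between 1 and 360")
    return task_id - 181


def average_cloud_cover(cloud_mask):
    """Fraction of pixels flagged as cloud in a 2-D mask of 0/1 values."""
    pixels = [value for row in cloud_mask for value in row]
    if not pixels:
        raise ValueError("cloud_mask must contain at least one pixel")
    return sum(pixels) / len(pixels)


# Hypothetical 3x4 cloud mask standing in for one image.
mask = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]

task_id = 1  # a scheduler would normally supply this per array task
print(longitude_for_task(task_id), average_cloud_cover(mask))
```

Since every task is independent, the 360 tasks can run on 360 cores at once with no communication between them.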


Workflow

1. Launch your cluster
2. Enable object storage
3. Install application
4. Fetch job-script
5. Submit job

Approximate costs

• 360 jobs each taking ~5 mins
• Total CPU time = 30 core hours

• Cost of 36 core hours in AWS spot market* = $0.44
• Cost of one T2 login node for 1 hour* = $0.12
• Cost of 100GB EBS volume for apps* = <$0.01
• Alces Flight software cost = $0.00

• Total cost per daily run = $0.60 / 45p
• Cost for one year of research = $219 / £168

* based on C4.8xlarge spot rate in EU-West region; T2.large on-demand instance; EBS st1 volume; excludes S3 storage costs and sales tax where applicable
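The arithmetic behind these figures can be replayed in a few lines. The sketch uses only the numbers on the slide and treats "<$0.01" as $0.01; the itemized sum comes to a little under the quoted $0.60, which is presumably a rounded figure.

```python
jobs = 360
minutes_per_job = 5
core_hours = jobs * minutes_per_job / 60
print("core hours:", core_hours)

# Line items in USD as listed on the slide; "<$0.01" is taken as 0.01.
line_items = {"spot compute": 0.44, "login node": 0.12, "EBS volume": 0.01}
itemized_total = round(sum(line_items.values()), 2)
print("itemized total:", itemized_total)

quoted_daily_total = 0.60  # the slide's rounded per-run figure
annual_cost = round(quoted_daily_total * 365, 2)
print("annual cost:", annual_cost)
```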

Workload to process #2

Computational Fluid Dynamics with OpenFOAM

OpenFOAM CFD

• Computational Fluid Dynamics workload
• Simulates liquid and air-flow for engineering projects
• Open-source software available to all
• Commercial support available from CFD Direct Ltd.
• Run as a parallel job across multiple compute nodes

Workload

• Generate a mesh representing the problem
• Decomposition of the problem into sections
• Processing of the sections
• Visualization of the solution
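The decomposition step can be pictured as dividing the mesh cells among processes. The sketch below shows only the simplest possible scheme, contiguous near-equal ranges over a hypothetical 1-D cell numbering; it is not OpenFOAM's own decomposition, which partitions 3-D meshes and tries to limit the interfaces between sections.

```python
def decompose(n_cells, n_parts):
    """Split cell indices 0..n_cells-1 into n_parts contiguous, near-equal ranges."""
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    base, extra = divmod(n_cells, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# Hypothetical mesh of 10 cells shared between 4 processes.
for rank, cells in enumerate(decompose(10, 4)):
    print(f"rank {rank}: cells {list(cells)}")
```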

Workflow

1. Launch your cluster
2. Enable object storage
3. Install application
4. Fetch job-script
5. Submit job
6. Start desktop
7. Visualize

Visualization with ParaView

Approximate costs (full solve)

• 1 job using 128 cores taking 4 hours
• Total CPU time = 1024 core hours

• Cost of 1024 core hours in AWS spot market* = $7.04
• Cost of one T2 login node for 4 hours* = $0.45
• Cost of 100GB EBS volume for apps* = $0.02
• Alces Flight software cost = $0.00

• Total cost per simulation = $7.51 / £5.75

* based on C4.8xlarge spot rate in EU-West region; T2.large on-demand instance; EBS st1 volume; excludes sales tax where applicable

Filesystems in the marketplace, too

BeeGFS is a scalable parallel cluster filesystem developed by the Fraunhofer Institute with a strong focus on performance and designed for easy installation and management.

Intel Lustre® Cloud Edition is a scalable, parallel file system purpose-built for HPC and with a long history in the field supporting a range of workloads.

There’s more to come - the AWS Marketplace is growing all the time and new offerings are added frequently. Watch this space.

There are cluster filesystem options, too, for when you need extreme I/O scaling.

How to start?

1. AWS Account

3. A problem to solve

Please remember to rate this session under My Agenda on awssummit.london
