SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Gateways to Discovery:
Cyberinfrastructure for the Long Tail of Science
XSEDE’14 (16 July 2014)
R. L. Moore, C. Baru, D. Baxter, G. Fox (Indiana U), A. Majumdar, P. Papadopoulos, W. Pfeiffer, R. S. Sinkovits, S. Strande (NCAR), M. Tatineni, R. P. Wagner, N. Wilkins-Diehr, M. L. Norman
UCSD/SDSC (except as noted)
HPC for the 99%
Comet is in response to NSF’s solicitation (13-528) to
• "… expand the use of high end resources to a much larger and more diverse community
• … support the entire spectrum of NSF communities
• … promote a more comprehensive and balanced portfolio
• … include research communities that are not users of traditional HPC systems."
The long tail of science needs HPC
Jobs and SUs at various scales across NSF resources
[Chart: job size in cores (1 to 16K, log scale) vs. fraction of all jobs charged in 2012 (left axis, 0-100%) and millions of XD SUs charged (right axis, 0-3000); a marker indicates one node.]
• 99% of jobs run on NSF's HPC resources in 2012 used fewer than 2,048 cores
• Those jobs consumed ~50% of the total core-hours across NSF resources (see the sketch below for how such a cumulative breakdown is computed)
[Inset: cumulative usage vs. job size (cores).]
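The cumulative breakdown above comes from XD accounting data. As a rough illustration only, the following Python sketch (with made-up job records, not real XD accounting data) shows how the fraction of jobs and of SUs charged at or below a core-count threshold would be computed:

# Sketch: cumulative job-size analysis of the kind behind the chart above.
# The records below are hypothetical (cores, SUs charged) pairs per job.
jobs = [(16, 1200.0), (256, 90000.0), (1, 4.0), (4096, 2.5e6), (64, 8000.0)]

def cumulative_fractions(jobs, threshold_cores):
    """Fraction of jobs, and of SUs charged, at or below a core-count threshold."""
    n_below = sum(1 for cores, _ in jobs if cores <= threshold_cores)
    su_below = sum(su for cores, su in jobs if cores <= threshold_cores)
    su_total = sum(su for _, su in jobs)
    return n_below / len(jobs), su_below / su_total

frac_jobs, frac_sus = cumulative_fractions(jobs, 2048)
print(f"jobs <= 2048 cores: {frac_jobs:.0%} of jobs, {frac_sus:.0%} of SUs charged")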
Comet Will Serve the 99%
Comet: System Characteristics
• Available January 2015
• Total flops ~1.8-2.0 PF (see the peak-flops arithmetic sketched after this list)
• Dell primary integrator
• Intel next-generation processors (codename Haswell) with AVX2
• Aeon storage vendor
• Mellanox FDR InfiniBand
• Standard compute nodes
  • Dual Haswell processors
  • 128 GB DDR4 DRAM (64 GB/socket)
  • 320 GB SSD (local scratch)
• GPU nodes
  • Four NVIDIA GPUs/node
• Large-memory nodes (March 2015)
  • 1.5 TB DRAM
  • Four Haswell processors/node
• Hybrid fat-tree topology
  • FDR (56 Gbps) InfiniBand
  • Rack-level (72 nodes) full bisection bandwidth
  • 4:1 oversubscription cross-rack
• Performance Storage: 7 PB, 200 GB/s
  • Scratch & persistent storage
• Durable Storage (reliability): 6 PB, 100 GB/s
• Gateway hosting nodes and VM image repository
• 100 Gbps external connectivity to Internet2 & ESnet
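As a sanity check on the ~1.8-2.0 PF figure, the sketch below shows the usual peak-flops arithmetic for dual-socket Haswell nodes with AVX2 FMA. The node count, cores per socket, and clock rate are illustrative assumptions, not figures stated on this slide:

# Back-of-envelope peak-flops estimate for a Haswell-based system with AVX2.
nodes = 1944            # assumed number of standard compute nodes (illustrative)
sockets_per_node = 2    # dual-socket nodes (per slide)
cores_per_socket = 12   # assumed Haswell core count (illustrative)
clock_ghz = 2.5         # assumed base clock (illustrative)
flops_per_cycle = 16    # double precision: two 256-bit FMA units per core

peak_pf = (nodes * sockets_per_node * cores_per_socket
           * clock_ghz * 1e9 * flops_per_cycle) / 1e15
print(f"Estimated peak: {peak_pf:.2f} PFLOP/s")   # ~1.87 PF, consistent with ~1.8-2.0 PF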
Comet Architecture
[Diagram: R&E network access to Internet2 via a Juniper 100 Gbps router and dual Arista 40GbE switches, with 4x data movers; dual FDR InfiniBand core switches connect N racks of 72 Haswell nodes (320 GB SSD each), GPU nodes, 4 large-memory nodes, and 4x bridge nodes to Performance Storage (7 PB, 200 GB/s) and Durable Storage (6 PB, 100 GB/s); mid-tier and node-local storage links run over FDR, 40GbE, and 10GbE.]
7x 36-port FDR switches in each rack, wired as a full fat-tree; 4:1 oversubscription between racks.
Additional support components (not shown for clarity): NFS servers, virtual image repository, gateway/portal hosting nodes, login nodes, Ethernet management network, Rocks management nodes.
SSDs – building on Gordon success
Based on our experiences with Gordon, a number of applications will benefit from continued access to flash (a minimal usage sketch follows the list):
• Applications that generate large numbers of temp files
  • Computational finance – analysis of multiple markets (NASDAQ, etc.)
  • Text analytics – word correlations in Google Ngram data
• Computational chemistry codes that write one- and two-electron integral files to scratch
• Structural mechanics codes (e.g., Abaqus), which generate stiffness matrices that don't fit into memory
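A minimal sketch of the usage pattern above: pointing an application's temporary files at node-local SSD scratch rather than the parallel file system. The LOCAL_SCRATCH environment variable and fallback path below are hypothetical; the actual Comet scratch location may differ.

# Direct temporary files to node-local SSD scratch (path is an assumption).
import os
import tempfile

local_scratch = os.environ.get("LOCAL_SCRATCH", "/tmp")  # e.g. set by the batch system
os.makedirs(local_scratch, exist_ok=True)

with tempfile.NamedTemporaryFile(dir=local_scratch, suffix=".scr", delete=False) as f:
    # Intermediate data (integral files, stiffness-matrix blocks, temp files) goes here.
    f.write(b"intermediate scratch data\n")
    print("temp file on local SSD:", f.name)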
Large memory nodes
While most user applications will run well on the standard compute nodes, a few domains will benefit from the large-memory (1.5 TB) nodes:
• De novo genome assembly: ALLPATHS-LG, SOAPdenovo, Velvet
• Finite-element calculations: Abaqus
• Visualization of large data sets
GPU nodes
Comet’s GPU nodes will serve a number of domains
• Molecular dynamics applications have been one of the
biggest GPU success stories. Packages include Amber,
CHARMM, Gromacs and NAMD
• Applications that depend heavily on linear algebra
• Image and signal processing
Key Comet Strategies
• Target modest-scale users and new users/communities:
goal of 10,000 users/year!
• Support capacity computing, with a system optimized for
small/modest-scale jobs and quicker resource response
using allocation/scheduling policies
• Build upon and expand efforts with Science Gateways,
encouraging gateway usage and hosting via software
and operating policies
• Provide a virtualized environment to support
development of customized software stacks, virtual
environments, and project control of workspaces
Comet will serve a large number of users, including new communities/disciplines
• Allocation/scheduling policies optimized for high throughput of many modest-scale jobs (leveraging Trestles experience)
  • Optimized for rack-level jobs, but cross-rack jobs are feasible
  • Optimized for throughput (à la Trestles)
  • Per-project allocation caps to ensure large numbers of users
  • Rapid access for start-ups, with one-day account generation
  • Limits on job sizes, with the possibility of exceptions
• Gateway-friendly environment: science gateways reach large communities with easy user access
  • e.g., the CIPRES gateway alone currently accounts for ~25% of all users of NSF resources, with ~3,000 new users and ~5,000 total users per year
• Virtualization provides low barriers to entry (see later charts)
Changing the face of XSEDE HPC users
• System design and policies
  • Allocation, scheduling, and security policies that favor gateways
  • Support for gateway middleware and gateway hosting machines
  • Customized environments with high-performance virtualization
  • Flexible allocations for bursty usage patterns
  • Shared-node runs for small jobs; user-settable reservations
  • Third-party apps
• Leverage and augment investments elsewhere
  • FutureGrid experience, image packaging, training, on-ramp
  • XSEDE (ECSS NIP & Gateways, TEOS, Campus Champions)
  • Build off established successes supporting new communities
  • Example-based documentation in Comet focus areas
  • Unique HPC University contributions to enable community growth
Virtualization Environment
• Leveraging expertise of the Indiana U/FutureGrid team
• VM jobs scheduled just like batch jobs (not a conventional cloud environment with immediate elastic access)
• VMs will be an easy on-ramp for new users/communities, including low porting time
• Flexible software environments for new communities and apps
• VM repository/library
• Virtual HPC cluster (multi-node) with near-native IB latency and minimal overhead (SR-IOV)
Single Root I/O Virtualization in HPC
• Problem: complex workflows demand increasing flexibility from HPC platforms
  • Pro: virtualization provides flexibility
  • Con: virtualization traditionally loses I/O performance (e.g., excessive DMA interrupts)
• Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand HCAs
  • One physical function (PF) is exposed as multiple virtual functions (VFs), each with its own DMA streams, memory space, and interrupts
  • Allows DMA to bypass the hypervisor and go directly to VMs (a sketch of the generic Linux interface follows)
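For illustration, the generic Linux sysfs interface for carving VFs out of an SR-IOV-capable PF looks like the sketch below. The PCI address is hypothetical, and ConnectX-3 systems of this era often configured VFs through the mlx4_core num_vfs module parameter instead; this is not necessarily how Comet provisions VFs.

# Generic Linux SR-IOV sysfs interface (requires root and SR-IOV hardware).
from pathlib import Path

pf = Path("/sys/bus/pci/devices/0000:82:00.0")  # hypothetical HCA PCI address

total_vfs = int((pf / "sriov_totalvfs").read_text())
print(f"PF supports up to {total_vfs} VFs")

# Expose 8 VFs; each VF gets its own DMA streams, memory space, and interrupts,
# so a guest VM can drive the HCA without the hypervisor in the data path.
(pf / "sriov_numvfs").write_text("8")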
High-Performance Virtualization on Comet
• Mellanox FDR InfiniBand HCAs with SR-IOV
• Rocks and OpenStack Nova to manage VMs
• Flexibility to support complex science gateways and
web-based workflow engines
• Custom compute appliances and virtual clusters developed with
FutureGrid and their existing expertise
• Backed by virtualized Lustre running over virtualized InfiniBand
Benchmark comparisons of SR-IOV cluster vs. AWS (early 2013): Hardware/Software Configuration

Native / SR-IOV cluster:
• Platform: Rocks 6.1 (EL6); virtualization via KVM
• CPUs: 2x Xeon E5-2660 (2.2 GHz), 16 cores per node
• RAM: 64 GB DDR3 DRAM
• Interconnect: QDR 4X InfiniBand, Mellanox ConnectX-3 (MT27500); Intel VT-d and SR-IOV enabled in firmware, kernel, and drivers; mlx4_core 1.1; Mellanox OFED 2.0; HCA firmware 2.11.1192

Amazon EC2:
• Platform: Amazon Linux 2013.03 (EL6), cc2.8xlarge instances
• CPUs: 2x Xeon E5-2670 (2.6 GHz), 16 cores per node
• RAM: 60.5 GB DDR3 DRAM
• Interconnect: 10 GbE, common placement group
50x less latency than Amazon EC2
• SR-IOV
  • < 30% overhead for messages < 128 bytes
  • < 10% overhead for eager send/recv
  • ~0% overhead in the bandwidth-limited regime
• Amazon EC2
  • > 5000% worse latency
  • Time-dependent (noisy)
(Overhead here is relative to the native, non-virtualized run; a small sketch of the arithmetic follows.)
OSU Microbenchmarks (3.9, osu_latency)
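The overhead figures quoted on this and the next slide are relative slowdowns of the virtualized run versus native. A small sketch of that arithmetic, with made-up latency values in place of the measured osu_latency results:

# Relative overhead of a virtualized measurement versus native.
def overhead_pct(native, virtualized):
    """Percent overhead of the virtualized measurement relative to native."""
    return 100.0 * (virtualized - native) / native

# Hypothetical small-message latencies in microseconds (illustrative only).
native_us, sriov_us, ec2_us = 1.1, 1.4, 60.0
print(f"SR-IOV overhead:  {overhead_pct(native_us, sriov_us):.0f}%")
print(f"EC2 vs. native:   {overhead_pct(native_us, ec2_us):.0f}% worse")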
10x more bandwidth than Amazon EC2
• SR-IOV
  • < 2% bandwidth loss over the entire range
  • > 95% of peak bandwidth
• Amazon EC2
  • < 35% of peak bandwidth
  • 900% to 2500% worse bandwidth than virtualized InfiniBand
OSU Microbenchmarks (3.9, osu_bw)
Weather Modeling – 15% Overhead
• 96-core (6-node) calculation
• Nearest-neighbor communication
• Scalable algorithms
• SR-IOV incurs a modest (15%) performance hit
• ...but is still 20% faster*** than Amazon EC2 (a rough clock-normalization sketch follows)
WRF 3.4.1 – 3 hr forecast
*** 20% faster despite the SR-IOV cluster having 20% slower CPUs
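The footnote's clock argument can be made explicit with a rough normalization: scale the observed speedup by the ratio of CPU clocks from the configuration slide, assuming roughly clock-bound scaling. The wall times below are illustrative, not the measured WRF results.

# Rough clock-normalized comparison (illustrative numbers).
sriov_clock, ec2_clock = 2.2, 2.6       # GHz, from the configuration slide
t_sriov, t_ec2 = 100.0, 120.0           # hypothetical wall times (s): SR-IOV 20% faster

raw_speedup = t_ec2 / t_sriov           # observed advantage of SR-IOV cluster
clock_ratio = ec2_clock / sriov_clock   # EC2's per-core clock advantage
clock_normalized = raw_speedup * clock_ratio  # rough advantage at equal clock
print(f"observed: {raw_speedup:.2f}x faster, clock-normalized: ~{clock_normalized:.2f}x")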
Quantum ESPRESSO: 5x Faster than EC2
• 48-core (3-node) calculation
• CG matrix inversion (irregular communication)
• 3D FFT matrix transposes (all-to-all communication)
• 28% slower with SR-IOV
• SR-IOV still > 500% faster*** than EC2
Quantum ESPRESSO 5.0.2 – DEISA AUSURF112 benchmark
*** > 5x faster despite the SR-IOV cluster having 20% slower CPUs
SR-IOV is a huge step forward in high-
performance virtualization
• Shows substantial improvement in latency over Amazon
EC2, and it provides nearly zero bandwidth overhead
• Benchmark application performance confirms significant
improvement over EC2
• SR-IOV lowers performance barrier to virtualizing the
interconnect and makes fully virtualized HPC clusters
viable
• Comet will deliver virtualized HPC to new/non-traditional
communities that need flexibility without major loss of
performance
BACKUP
NSF 13-528: Competitive proposals should address:
• “Complement existing XD capabilities with new types of computational
resources attuned to less traditional computational science communities;
• Incorporate innovative and reliable services within the HPC environment
to deal with complex and dynamic workflows that contribute significantly
to the advancement of science and are difficult to achieve within XD;
• Facilitate transition from local to national environments via the use of
virtual machines;
• Introduce highly useable and cost efficient cloud computing capabilities
into XD to meet national scale requirements for new modes of
computationally intensive scientific research;
• Expand the range of data intensive and/or computationally-challenging
science and engineering applications that can be tackled with current XD
resources;
• Provide reliable approaches to scientific communities needing a high-
throughput capability.”
VCs on Comet: Operational Details – one VM per physical node
[Diagram: four virtual clusters (VC0-VC3), each a set of virtual machines running a user stack on physical nodes running the XSEDE stack, with one VM per physical node and a head node (HN) per virtual cluster.]
VCs on Comet: Operational Details – head node remains active after VC shutdown
[Diagram: same virtual-cluster layout; each VC's head node (HN) stays up after the rest of the VC is shut down.]
VCs on Comet: Spin-up/shutdown – each VC has its own ZFS file system for storing VMIs; latency-hiding tricks are used on startup (a minimal sketch of the per-VC dataset layout follows)
[Diagram: virtual clusters VC0-VC3 as above, with each VC's virtual machine disk images stored in its own ZFS pool.]
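As a rough illustration of the per-VC storage layout described above, the sketch below creates one ZFS dataset per virtual cluster to hold its VM disk images. The pool and dataset names are hypothetical, and this is not necessarily how Comet manages its VMI storage; zfs is the standard ZFS command-line tool and the call requires root.

# Hedged sketch: one ZFS dataset per virtual cluster for its VM disk images.
import subprocess

def create_vc_dataset(pool: str, vc_name: str) -> str:
    """Create a ZFS dataset for one virtual cluster's VM images (requires root)."""
    dataset = f"{pool}/{vc_name}"
    subprocess.run(["zfs", "create", "-o", "compression=lz4", dataset], check=True)
    return dataset

if __name__ == "__main__":
    print(create_vc_dataset("vmpool", "vc0"))   # e.g. vmpool/vc0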