SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Emerging Technology Trends in Data-Intensive Supercomputing
Dr. Richard Moore
Deputy Director, San Diego Supercomputer Center, University of California, San Diego
USA
HPC Advisory Council Meeting, October 28, 2012, Zhangjiajie, China
Data Crisis: Information Big Bang
• PCAST Digital Data report; NSF Experts Study; Wired; Nature; industry; "Data-Enabled Science"
• Storage Networking Industry Association (SNIA) 100 Year Archive Requirements Survey Report: "there is a pending crisis in archiving… we have to create long-term methods for preserving information, for making it available for analysis in the future."
• 80% of respondents: >50 yrs; 68%: >100 yrs
Gordon – An Innovative Data-Intensive Supercomputer
• Designed to accelerate access to massive amounts of data in areas of genomics, earth science, engineering, medicine, and others.
• Emphasizes memory and I/O over FLOPS.
• Appro-integrated 1,024-node Sandy Bridge cluster.
• 300 TB of high-performance Intel flash.
• Large-memory supernodes via vSMP Foundation from ScaleMP.
• 3D torus interconnect from Mellanox.
• Built from commodity hardware.
• In production since February 2012.
• Funded by the NSF and available through Extreme Science and Engineering Discovery Environment (XSEDE) allocations.
The Memory Hierarchy of a Typical Supercomputer
[Figure: memory hierarchy diagram spanning shared-memory programming within a single node, shared-memory programming across nodes via vSMP, and message-passing programming, with a latency gap between memory and disk I/O, where big data resides.]
vSMP Aggregation Software
Gordon 32-way Supernode
[Figure: one supernode aggregates 32 dual-socket Sandy Bridge compute nodes plus two I/O nodes, each I/O node providing 4.8 TB of flash SSD and dual Westmere I/O processors, under the vSMP aggregation software.]
Gordon Specifications

INTEL SANDY BRIDGE COMPUTE NODE
Sockets & cores: 2 & 16
Clock speed: 2.6 GHz
DRAM capacity and speed: 64 GB, 1,333 MHz

INTEL 710 eMLC FLASH I/O NODE
NAND flash SSD drives: 16
SSD capacity per drive & per node: 300 GB; 16 × 300 GB = 4.8 TB

SMP SUPERNODE (VIA vSMP)
Compute nodes / I/O nodes: 32 / 2
Addressable DRAM: 2 TB
Addressable memory including flash: 11.6 TB

GORDON (AGGREGATE)
Compute nodes: 1,024
Compute cores: 16,384
Peak performance: 341 Tflop/s
DRAM/SSD memory: 64 TB DRAM, 300 TB SSD

INFINIBAND INTERCONNECT
Architecture: dual-rail, 3D torus
Link bandwidth: QDR
Vendor: Mellanox

LUSTRE-BASED DISK I/O SUBSYSTEM (SHARED)
Total storage (current/planned): 4 PB / 6 PB (raw)
Total bandwidth: 100 GB/s
Exporting & Preserving Flash Performance
• Several layers of overhead reduce performance (SATA, Linux, network)
• I/O models need to be driven by the applications
• No one has really done this before
• iSCSI over RDMA (iSER) was the best protocol
• XFS performs well
• Continue to explore alternatives based on user needs
Gordon 3D Torus Interconnect Fabric: 4x4x4 3D Torus Topology
[Figure: each torus vertex comprises two 36-port fabric switches, each with 18 x 4X InfiniBand network connections, serving 16 compute nodes and 2 I/O nodes. The dual-rail network increases bandwidth and provides redundancy, with a single connection from each node to each network. The ends of the 4x4x4 mesh are folded in all three dimensions to form a 3D torus.]
Trestles System Description (Table 2.1. Trestles System Specification)

AMD MAGNY-COURS COMPUTE NODE
Sockets: 4
Cores: 32
Clock speed: 2.4 GHz
Flop speed: 307 Gflop/s
Memory capacity: 64 GB
Memory bandwidth: 171 GB/s
STREAM Triad bandwidth: 100 GB/s
Flash memory (SSD): 160 GB

FULL SYSTEM
Total compute nodes: 324
Total compute cores: 10,368
Peak performance: 100 Tflop/s
Total memory: 20.7 TB
Total memory bandwidth: 55.4 TB/s
Total flash memory: 52 TB

QDR INFINIBAND INTERCONNECT
Topology: fat tree
Link bandwidth: 8 GB/s (bidirectional)
Peak bisection bandwidth: 5.2 TB/s (bidirectional)
MPI latency: 1.3 µs

DISK I/O SUBSYSTEM
File systems: NFS, Lustre
Storage capacity (usable): 150 TB (Dec 2010), 2 PB (June 2011), 4 PB (July 2012)
I/O bandwidth: 50 GB/s
The Majority of TeraGrid/XD Projects Have Modest-Scale Resource Needs
• "80/20" at ~512 cores (FY09)
• ~80% of projects never run a job larger than this…
• …and those projects use <20% of resources
• Only ~1% of projects run jobs as large as 16K cores, and those consume >30% of TG resources
• Many projects/users only need modest-scale jobs/resources
• A modest-size resource can therefore serve a large number of these projects/users
Exceedance distributions of projects and usage as a function of the largest job (core count) run by a project over a full year (FY2009)
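The exceedance fractions behind these bullets can be computed directly from per-project job records. A minimal sketch follows, assuming a hypothetical list of (largest job core count, total core-hours) pairs per project; the data below are synthetic, not the FY2009 accounting records.

```python
# Sketch of the exceedance calculation behind the "80/20" observation.
# Input is hypothetical: one (largest_job_cores, total_core_hours) pair per project.
import numpy as np

def exceedance(projects, thresholds):
    """For each core-count threshold, return the fraction of projects whose largest
    job exceeds it and the fraction of total usage those projects consume."""
    largest = np.array([p[0] for p in projects], dtype=float)
    usage = np.array([p[1] for p in projects], dtype=float)
    total = usage.sum()
    proj_frac = [(largest > t).mean() for t in thresholds]
    usage_frac = [usage[largest > t].sum() / total for t in thresholds]
    return proj_frac, usage_frac

# Synthetic example: most projects peak at modest core counts.
rng = np.random.default_rng(0)
sample = list(zip(2 ** rng.integers(4, 15, size=1000),   # largest job: 16 to 16K cores
                  rng.lognormal(10, 1, size=1000)))       # core-hours charged
print(exceedance(sample, thresholds=[512, 16384]))
```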
SDSC is Deploying a New Repertoire of Storage Systems
SDSC Cloud
• Purpose: storage of digital data for ubiquitous access and high durability
• Access: multi-platform web interface, S3-type interfaces, backup software
Data Oasis (PFS)
• Purpose: high-performance parallel file system for HPC systems; partitioned for scratch and medium-term parking space
• Access: Lustre on HPC systems (Gordon, Trestles, Triton)
Project Storage
• Purpose: typical project / home directory / user file server storage needs
• Access: NFS/CIFS, iSCSI
Data Oasis Heterogeneous Architecture
[Figure: 64 OSS (Object Storage Servers, 72 TB each) provide 100 GB/s of performance and >4 PB of raw capacity; JBODs (Just a Bunch Of Disks, 90 TB each) provide capacity scale-out to an additional 5.8 PB. Redundant Arista 7508 10G switches provide reliability and performance. Three distinct network architectures connect the clusters: 64 Lustre LNET routers (100 GB/s) for the Gordon IB cluster, a Mellanox 5020 bridge (12 GB/s) for the Trestles IB cluster, and a Myrinet 10G switch (25 GB/s) for the Triton Myrinet cluster. Metadata servers (MDS) serve Gordon scratch, Trestles scratch, Triton scratch, and Gordon & Trestles project file systems.]
SDSC Cloud: A Paradigm Shift for Long-Term Storage: Focus on Access, Sharing & Collaboration
• Launched September 2011
• Largest, highest-performance known academic cloud
• 5.5 petabytes (raw), 8 GB/sec
• Automatic dual-copy and verification
• Capacity and performance scale linearly to 100's of petabytes
• Open-source platform based on NASA and Rackspace software
• http://cloud.sdsc.edu
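Because the cloud exposes object-storage REST interfaces (the NASA/Rackspace software line referenced above underlies OpenStack Swift), routine uploads can be scripted. The sketch below is illustrative only: the endpoint URL, account path, and token are placeholders, not actual SDSC Cloud values.

```python
# Hedged sketch: storing a file through a Swift-style object API over HTTPS.
# The base URL and token are placeholders; substitute values issued by the service.
import requests

BASE = "https://cloud.example.edu/v1/AUTH_myproject"   # placeholder account endpoint
TOKEN = "<auth token from the cloud's identity service>"

def upload(container, object_name, local_path):
    """PUT a local file into an object container."""
    with open(local_path, "rb") as f:
        r = requests.put(f"{BASE}/{container}/{object_name}",
                         headers={"X-Auth-Token": TOKEN}, data=f)
    r.raise_for_status()

# Example call (requires a real endpoint and token):
# upload("simulation-output", "run42/spectrum.nc", "spectrum.nc")
```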
Applications of SDSC Cloud
• Shared/published/curated data collections
• HPC simulation data storage and sharing
• Web/portal applications and site hosting
• Application integration using supported APIs
• Serving images/videos
• Backup services
Data Curation – Pilot Projects
• Project is mid-way through a two-year pilot phase
  • How do lab personnel work with librarians to curate their data?
  • How much work is required to curate data, and what are the options?
  • What is a sustainable business model that RCI should invest in?
• Five representative programs selected as pilots (http://rci.ucsd.edu/pilots)
  • The Brain Observatory (Annese)
  • Open Topography (Baru)
  • Levantine Archaeology Laboratory (Levy)
  • SIO Geological Collections (Norris)
  • Laboratory for Computational Astrophysics (Wagner)
• Using existing tools whenever possible
  • Storage at SDSC, campus high-speed networking, Digital Asset Management System (DAMS) at UCSD Libraries, Chronopolis digital preservation network
• Also developing Data Management Plan tools and providing training
• Anticipate production curation services in mid-2013
Center for Large-scale Data Systems Research (CLDS)
• New Industry-Academia partnership led by SDSC
• Research focus: technical and management challenges of large-scale data systems in business, government and academia
• “Big Data 2015” and “How Much Information? Phase 2” research projects
• How IT systems are used and valued in industry verticals
• POC: Chaitan Baru ([email protected])
Predictive Analytics Center of Excellence (PACE): Bringing Together Academia, Government, and Industry
[Figure: PACE activities include informing, educating and training; developing standards and methodology; scalable high-performance data mining; fostering research and collaboration; maintaining a data mining repository of very large data sets; providing predictive analytics services; and bridging the gap between industry and academia.]
PACE: Natasha Balac
Applications
Computational Style Code: Answering the Question "Why Gordon?"
V M F
C T L
V: Uses vSMP aggregation software
C: Computationally intensive, leverages Sandy Bridge
M: Uses larger memory/core on Gordon (4 GB/core)
T: Threaded
F: Uses flash
L: Lustre I/O intensive
Breadth First Search Comparison using SSD and HDD
Source: Sandeep Gupta, San Diego Supercomputer Center. Used by permission. 2011.
Graphs are mathematical and computational representations of relationships among objects in a network. Such networks occur in many natural and man-made scenarios, including communication, biological, and social contexts. Understanding the structure of these graphs is important for uncovering relationships among their members.
• Implementation of the breadth-first search (BFS) graph algorithm developed by Munagala and Ranade
• 134 million nodes
• Flash drives reduced I/O time by a factor of 6.5x
• Problem converted from I/O bound to compute bound
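A minimal sketch of the level-synchronous structure behind this kind of external-memory BFS is shown below: each frontier is spilled to a file so only one level lives in memory, and vertices seen in the current or previous level are removed before expanding the next one. The adjacency callback and file layout are illustrative assumptions, not the actual SDSC implementation.

```python
# Sketch of level-by-level BFS in the spirit of Munagala-Ranade external-memory BFS:
# frontiers are written to files; duplicates and recently visited vertices are pruned.
import os

def bfs_levels(neighbors, source, workdir="bfs_levels"):
    """neighbors(v) -> iterable of vertex ids; writes one file per BFS level."""
    os.makedirs(workdir, exist_ok=True)
    prev, curr, level = set(), {source}, 0
    while curr:
        with open(os.path.join(workdir, f"level_{level}.txt"), "w") as f:
            f.writelines(f"{v}\n" for v in sorted(curr))
        nxt = set()
        for v in curr:
            nxt.update(neighbors(v))
        nxt -= curr | prev          # duplicate elimination against the last two levels
        prev, curr, level = curr, nxt, level + 1
    return level                    # number of BFS levels written

# Toy example: an 8-vertex ring graph.
print(bfs_levels(lambda v: [(v - 1) % 8, (v + 1) % 8], source=0))
```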
Daphnia Genome Assembly using Velvet and vSMP
Source: Wayne Pfeiffer, San Diego Supercomputer Center. Used by permission.
Daphnia (a.k.a. the water flea) is a model species used for understanding mechanisms of inheritance and evolution, and as a surrogate species for studying human health responses to environmental changes.
De novo assembly of short DNA reads using the de Bruijn graph algorithm. Code parallelized using OpenMP directives.
Benchmark problem: Daphnia genome assembly from 44-bp and 75-bp reads using 35-mers
Photo: Dr. Jan Michels, Christian-Albrechts-University, Kiel
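For orientation, the sketch below shows the de Bruijn graph construction that Velvet-style assemblers perform: each read is decomposed into k-mers, and every k-mer adds an edge between its prefix and suffix (k-1)-mers. The reads and k value are toy examples, not the Daphnia data or Velvet's data structures.

```python
# Toy de Bruijn graph construction from short reads (illustrative, not Velvet's code).
from collections import defaultdict

def de_bruijn(reads, k):
    """Return adjacency map: (k-1)-mer prefix -> list of (k-1)-mer suffixes."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # edge between overlapping (k-1)-mers
    return graph

reads = ["ACGTACGA", "GTACGATT"]
for prefix, suffixes in de_bruijn(reads, k=4).items():
    print(prefix, "->", suffixes)
```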
Foxglove Calculation using Gaussian 09 with vSMP - MP2 Energy Gradient Calculation
Source: Jerry Greenberg, San Diego Supercomputer Center. January 2012.
The Foxglove plant (Digitalis) is studied for its medicinal uses. Digoxin, an extract of the Foxglove, is used to treat a variety of conditions, including diseases of the heart. Recent research suggests it may also be a beneficial cancer treatment.
Time to solution: 43,000 s
Processor footprint: 4 nodes, 64 threads
Memory footprint: 10 nodes, 700 GB
(1 compute node = 16 cores, 64 GB)
Axial compression of caudal rat vertebra - Very large memory simulation using Abaqus and vSMP
Source: Matthew Goff, Chris Hernandez. Cornell University. Used by permission. 2012
The goal of the simulations is to analyze how small variances in boundary conditions affect high-strain regions in the model. The research goal is to understand the response of trabecular bone to mechanical stimuli. This is relevant to paleontologists inferring habitual locomotion of ancient people and animals, and to treatment strategies for populations with fragile bones, such as the elderly.
• 5 million quadratic, 8-noded elements
• Model created with a custom Matlab application that converts 253 micro-CT images into voxel-based finite element models
vSMP and flash provide a large memory capability and speed-up by allowing very large Abaqus FEM models to be run in-core, or mixed in-core/flash mode.
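To illustrate the voxel-based meshing step that the custom Matlab tool performs at much larger scale, the sketch below converts a segmented 3D image into shared-node, 8-node hexahedral elements. The array and threshold are toy stand-ins for the micro-CT data.

```python
# Toy voxel-to-hexahedral-mesh conversion (illustrative, not the Cornell Matlab tool).
import numpy as np

def voxel_mesh(image, threshold):
    """Return node coordinates and 8-node hexahedral elements for voxels above threshold."""
    corner_offsets = np.array([[0,0,0],[1,0,0],[1,1,0],[0,1,0],
                               [0,0,1],[1,0,1],[1,1,1],[0,1,1]])
    nodes, elements = {}, []
    for ijk in np.argwhere(image > threshold):        # indices of "bone" voxels
        elem = []
        for off in corner_offsets:
            key = tuple(ijk + off)                     # share corner nodes with neighbours
            elem.append(nodes.setdefault(key, len(nodes)))
        elements.append(elem)
    coords = np.array(sorted(nodes, key=nodes.get), dtype=float)
    return coords, np.array(elements)

coords, elems = voxel_mesh(np.random.default_rng(2).random((8, 8, 8)), threshold=0.7)
print(coords.shape, elems.shape)
```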
Massive Data Analysis of Large-eddy Simulation of Deep Convection in Atmosphere (Clouds) using vSMP
Simulation Details
• GigaLES model run dataset (partial)
• 40 time steps (24-hour simulation)
• 256 vertical layers
• 204.8 x 204.8 kilometers
• 100 m horizontal resolution
R Analysis
• 160 GB data set (40 netCDF files @ 4 GB each)
• 340 GB memory footprint
• ~3.5 hours for data input and analysis
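The analysis is essentially "read many netCDF time-step files, reduce them to summary statistics." A minimal Python sketch of that pattern follows (the actual analysis was done in R); the file-name pattern and the variable name "QC" are assumptions, not the real GigaLES schema.

```python
# Illustrative pattern: stream per-time-step netCDF files and reduce to a profile.
# File glob and variable name are hypothetical placeholders for the GigaLES output.
import glob
import numpy as np
from netCDF4 import Dataset

profiles = []
for path in sorted(glob.glob("gigales_step_*.nc")):      # hypothetical file pattern
    with Dataset(path) as nc:
        field = nc.variables["QC"][:]                     # assumed (layers, y, x) layout
        profiles.append(field.mean(axis=(1, 2)))          # horizontal mean per layer
if profiles:
    mean_profile = np.mean(profiles, axis=0)              # average over time steps
    print(mean_profile.shape)
```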
The Center for Multi-scale Modeling of Atmospheric Processes (CMMAP) is an NSF Science and Technology Center focused on improving the representation of cloud processes in climate models.
• System for Atmospheric Modeling: M. Kharoutdinov, SUNY Stonybrook
• Visualization: J. Helly, A. Chourasia
• Analysis: J. Helly, S. Strande
Cosmology simulation: Matter power spectrum measurement using vSMP
Source: Rick Wagner, Michael L. Norman. SDSC.
The goal is to measure the effect of the light from the first stars on the evolution of the universe. To quantitatively compare the matter distribution of each simulation, we use radially binned 3D power spectra.
• 2 simulations
• 3200³ uniform 3D grids
• 15k+ files each
[Figure: individual simulations, their difference, and the resulting power spectra]
• Existing OpenMP code
• ~256 GB memory used
• ~5.5 hours per field
• Zero development effort
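A radially binned 3D power spectrum is a forward FFT of the density contrast followed by averaging |δ(k)|² in spherical shells of |k|. The sketch below performs that calculation on a tiny random grid; it is illustrative only and is not the project's OpenMP code.

```python
# Illustrative radially binned 3D power spectrum (toy grid, not the 3200^3 fields).
import numpy as np

def power_spectrum(density, nbins=16):
    """Return bin-center wavenumbers and the shell-averaged power spectrum."""
    n = density.shape[0]
    delta = density / density.mean() - 1.0                 # density contrast
    power = np.abs(np.fft.fftn(delta)) ** 2 / delta.size
    k = np.fft.fftfreq(n) * n                               # wavenumbers in grid units
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    kmag = np.sqrt(kx**2 + ky**2 + kz**2).ravel()
    bins = np.linspace(0.0, kmag.max(), nbins + 1)
    which = np.digitize(kmag, bins)
    pk = [power.ravel()[which == i].mean() for i in range(1, nbins + 1)]
    return 0.5 * (bins[1:] + bins[:-1]), np.array(pk)

k, pk = power_spectrum(np.random.default_rng(1).random((64, 64, 64)))
print(k[:4], pk[:4])
```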
Impact of high-frequency trading on financial markets
Source: Mao Ye, Dept. of Finance, U. Illinois. Used by permission. 6/1/2012.
To determine the impact of high-frequency trading activity on financial markets, it is necessary to construct nanosecond-resolution limit order books: records of all unexecuted orders to buy/sell stock at a specified price. Analysis provides evidence of quote stuffing, a manipulative practice that involves submitting a large number of orders with immediate cancellation to generate congestion.
Time to construct the limit order books is now under 15 minutes for the threaded application using 16 cores on a single Gordon compute node.
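Constructing a limit order book amounts to replaying the message stream: add messages create resting orders, and cancellations or executions remove them, while the book tracks outstanding shares at each price level. The sketch below is heavily simplified; the message format is a stand-in for the real nanosecond-resolution feed, not the study's code.

```python
# Simplified limit order book replay (illustrative message format, not a real feed).
from collections import defaultdict

def build_book(messages):
    """messages: (action, order_id, side, price, shares) tuples; returns the final book."""
    orders = {}                                              # order_id -> (side, price, shares)
    book = {"B": defaultdict(int), "S": defaultdict(int)}    # price -> outstanding shares
    for action, oid, side, price, shares in messages:
        if action == "add":
            orders[oid] = (side, price, shares)
            book[side][price] += shares
        elif action in ("cancel", "execute"):
            s, p, sh = orders.pop(oid)
            book[s][p] -= sh
    return book

msgs = [("add", 1, "B", 100.00, 500),
        ("add", 2, "S", 100.05, 300),
        ("cancel", 1, None, None, None),      # immediate cancellation, the quote-stuffing pattern
        ("add", 3, "B", 99.99, 200)]
print({side: dict(levels) for side, levels in build_book(msgs).items()})
```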
Protein Data Bank Query Comparisons with a DB2 Database on 2 Gordon I/O Nodes: HDDs vs. SSDs
Source: Vishwinath Nandigam, San Diego Supercomputer Center. 2011.
The Protein Data Bank (PDB) is the single worldwide repository of information about the 3D structures of large biological molecules. These are the molecules of life found in all organisms. Understanding the shape of a molecule helps to understand how it works.
• For single queries, HDDs and SSDs perform about the same.
• For concurrent queries, SSDs achieve a large speedup.
• Q5B is >10x faster, and performance varies by type of query.
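The concurrent-query comparison reduces to issuing the same query from many clients at once and timing the batch. The harness below sketches that pattern; run_query() is a placeholder for the actual DB2 client call, and its sleep is a synthetic stand-in for server-side work.

```python
# Sketch of a concurrent-query timing harness; run_query() is a placeholder, not a DB2 API.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(sql):
    time.sleep(0.1)            # stand-in for submitting sql and fetching results

def timed_concurrent(sql, nclients):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=nclients) as pool:
        list(pool.map(lambda _: run_query(sql), range(nclients)))
    return time.perf_counter() - start

for n in (1, 4, 16):
    print(n, "concurrent clients:", round(timed_concurrent("SELECT ...", n), 3), "s")
```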
Classification of sensor time series data
Source: Ramon Huerta, UCSD Bio Circuits Institute. Used by permission. 6/1/2012.
Chemical sensors (e-noses) will be placed in the homes of elderly participants in an effort to continuously and non-intrusively monitor their living environments. Time series classification algorithms will then be applied to the sensor data to detect anomalous behavior that may suggest a change in health status.
After optimizing the code, linking Intel's MKL, and porting to Gordon, runtime was reduced from 15.5 hours to 8 minutes.
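Much of the classification cost is dense linear algebra, which is why linking against an optimized BLAS such as Intel's MKL pays off. The sketch below reduces a nearest-centroid style classification (not necessarily the study's algorithm) to one matrix product; the window length and class count are illustrative.

```python
# Illustrative dense-linear-algebra core of a time-series classifier; the speed of the
# matrix product depends on the BLAS (e.g., Intel MKL) that NumPy is linked against.
import numpy as np

rng = np.random.default_rng(0)
windows = rng.random((10000, 256))       # sensor windows: samples x features (toy sizes)
centroids = rng.random((8, 256))         # one template per behaviour class

scores = windows @ centroids.T           # all-pairs similarity in a single BLAS call
labels = scores.argmax(axis=1)           # nearest-centroid label per window
print(np.bincount(labels))
```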
Summary
• Data-intensive supercomputing requires new approaches, not just more storage capacity
• Gordon is targeted at new classes of data-intensive applications using SSDs and vSMP
• Data Oasis is a robust, heterogeneous, high-bandwidth file system
• SDSC Cloud facilitates data access, sharing, search, and discovery
• Curation pilots are bridging libraries and researchers
• After the hardware, it takes people & expertise to bring the impact to science applications!
Thank you!
谢谢 (Thank you!)
Richard Moore [email protected]