
SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Michael L. Norman

Principal Investigator

Director, SDSC

Allan Snavely

Co-Principal Investigator

Project Scientist

A New NSF TeraGrid Resource for Data-Intensive Science

Slide 1


Coping with the data deluge

• Advances in computing technology have resulted in a "Moore's Law for data"
• The amount of digital data from instruments (DNA sequencers, CCD cameras, telescopes, MRIs, etc.) doubles every 18 months
• The density of storage media keeps pace with Moore's Law, but I/O rates do not
• The time to process these exponentially growing data sets is itself growing exponentially
• Latency for random access is limited by the speed of the disk read head

Slide 2


What is Gordon?

• A “data-intensive” supercomputer based on SSD flash memory and virtual shared memory SW

• Emphasizes MEM and IOPS over FLOPS

• A system designed to accelerate access to the massive databases being generated in all fields of science, engineering, medicine, and social science

• Random I/O to SSD is 10-100x faster than to HDD (see the sketch after this list)

• In production in 1Q2012

• We have a working prototype called Dash available for testing/evaluation now (LSST, PTF)
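As a rough illustration of the random I/O point above, here is a minimal Python sketch that times small reads at random offsets in a large file. The path, block size, and read count are placeholders, not Gordon-specific values:

    # Minimal sketch: estimate random-read rate by timing 4 KB reads at
    # random offsets in a large existing file. PATH is a placeholder.
    import os, random, time

    PATH = "/scratch/testfile"   # hypothetical large file on the device under test
    BLOCK = 4096                 # 4 KB, typical of small random database accesses
    N_READS = 10000

    size = os.path.getsize(PATH)
    fd = os.open(PATH, os.O_RDONLY)
    start = time.time()
    for _ in range(N_READS):
        offset = random.randrange(0, size - BLOCK)
        os.pread(fd, BLOCK, offset)          # positional read at a random offset
    elapsed = time.time() - start
    os.close(fd)
    print(f"{N_READS / elapsed:.0f} random 4 KB reads per second")
    # Caveat: the OS page cache inflates the result; a fair HDD-vs-SSD
    # comparison needs a file much larger than RAM or unbuffered (direct) I/O.

On a spinning disk each such read pays a mechanical seek; on flash it does not, which is where the 10-100x gap comes from.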

Slide 3


The Memory Hierarchy of a Typical HPC Cluster

[Figure: the memory hierarchy of a typical HPC cluster, with shared-memory programming at the top, message-passing programming across nodes, and a latency gap before disk I/O.]

Slide 5


The Memory Hierarchy of Gordon

[Figure: the memory hierarchy of Gordon, spanning shared-memory programming down to disk I/O.]

Slide 6


Gordon’s 3 Key Innovations

• Fill the latency gap with large amounts of flash SSD
  • 256 TB
  • >35 million IOPS
• Aggregate CPU, DRAM, and SSD resources into 32 shared-memory supernodes for ease of use
  • 8 TFLOPS, 2 TB DRAM, and 8 TB SSD per supernode
• High-performance parallel file system
  • 4 PB
  • >100 GB/s sustained

Slide 7


Results from Dash*:

a working prototype of Gordon

*available as a TeraGrid resource

Slide 8


Palomar Transient Factory (PTF) collab. with Peter Nugent

• Nightly wide-field surveys using Palomar Schmidt telescope

• Image data sent to LBL for archive/analysis

• 100 new transients every minute

• Large, random queries across multiple databases for IDs

Slide 9


PTF-DB Transient Search

              Forward Q1       Backward Q1
DASH-IO-SSD   11 s (145x)      100 s (24x)
Existing DB   1600 s           2400 s

Random queries requesting very small chunks of data about the candidate observations; speedups relative to the existing database are shown in parentheses. A minimal illustration of this access pattern follows.
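The query pattern can be sketched in Python; the file name, table, columns, and IDs below are made up for illustration and are not the actual PTF schema:

    # Illustrative only: many independent point lookups, each returning a tiny
    # slice of one candidate, so run time is dominated by random-access latency.
    import sqlite3

    conn = sqlite3.connect("candidates.db")      # hypothetical local copy of the DB
    cur = conn.cursor()
    candidate_ids = [101, 57234, 880412]         # made-up IDs

    for cid in candidate_ids:
        # With an index on id, each query touches a handful of pages scattered
        # across the file, exactly the access pattern that favors SSD over HDD.
        cur.execute("SELECT ra, dec, mag FROM candidate WHERE id = ?", (cid,))
        print(cur.fetchone())

    conn.close()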

Slide 10


MOPS Application Runs Faster Under vSMP than on Hardware SMP and ccNUMA

Moving Object Pipeline System (MOPS), used in asteroid tracking as part of the Large Synoptic Survey Telescope (LSST) project.

• Algorithm is serial (no MPI)
• 135 GB of memory required for the test case (see the sketch below)
• Dash node: dual-socket, 8-core Nehalem node with 48 GB of memory
• Triton PDAF: 8-socket, 32-core Shanghai node with 256 GB of memory
• Ember: an SGI Altix UV system with 384 Nehalem cores and 2 TB of RAM in a single system image (SSI)

Collab. with Jonathan Myers, LSST Corp.
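A minimal sketch of why vSMP helps this workload (a hypothetical stand-in for MOPS, not the actual pipeline code): the application is serial and simply needs a working set larger than any single node's DRAM.

    # Hypothetical stand-in for a serial, memory-hungry pipeline stage.
    import numpy as np

    WORKING_SET_GB = 135                      # test-case size quoted on the slide
    n = WORKING_SET_GB * 2**30 // 8           # number of float64 values
    tracks = np.ones(n)                       # ~135 GB: exceeds a 48 GB Dash node,
                                              # but fits in a 2 TB vSMP supernode
    print(f"allocated ~{tracks.nbytes / 2**30:.0f} GB, mean = {tracks.mean():.1f}")

No source changes are needed: under vSMP the aggregated DRAM of the supernode appears to the process as one ordinary address space.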

Slide 11


Gordon Training Events http://www.sdsc.edu/us/training/

• Getting Ready for Gordon: Using vSMP

• May 10-11, 2011 (next week)

• Will be recorded for web download

• Getting Ready for Gordon: Summer Institute

• August 6-17, 2011

• Contact Susan Rathbun (susan@sdsc.edu)

Slide 12


How to get time

• NOW: Request a start-up allocation on Dash at https://www.teragrid.org/web/user-support/startup

• After Sept 2011: Request a start-up or large allocation on Gordon at https://www.teragrid.org/web/user-support/allocations

• For more information, see http://www.sdsc.edu/us/resources/dash/

• Or email me at MLNORMAN@UCSD.EDU

Slide 13


RESERVE SLIDES


Gordon Architecture: “Supernode”

• 32 Appro Extreme-X compute nodes
  • Dual-processor Intel Sandy Bridge
  • 240 GFLOPS per node
  • 64 GB per node
• 2 Appro Extreme-X I/O nodes
  • Intel SSD drives, 4 TB each
  • 560,000 IOPS
• ScaleMP vSMP virtual shared memory
  • 2 TB RAM aggregate
  • 8 TB SSD aggregate
(The arithmetic check after the diagram shows how these per-node figures combine.)

[Diagram: supernode schematic showing 240 GF compute nodes with 64 GB RAM each and I/O nodes with 4 TB of SSD, joined into a single system by vSMP memory virtualization.]
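A quick arithmetic check in Python, using only the per-node figures listed above, shows how they combine into the supernode totals:

    # How the per-node figures combine into one supernode.
    compute_nodes = 32
    gflops_per_node = 240         # dual-socket Sandy Bridge figure from the slide
    ram_per_node_gb = 64
    io_nodes = 2
    ssd_per_io_node_tb = 4

    print(compute_nodes * gflops_per_node / 1000, "TFLOPS")  # 7.68 ~ 8 TFLOPS
    print(compute_nodes * ram_per_node_gb / 1024, "TB RAM")  # 2.0 TB DRAM aggregate
    print(io_nodes * ssd_per_io_node_tb, "TB SSD")           # 8 TB SSD aggregate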


Gordon Architecture: Full Machine

• 32 supernodes = 1024 compute nodes

• Dual rail QDR Infiniband network

• 3D torus (4x4x4); see the addressing sketch after the diagram

• 4 PB rotating disk parallel file system

• >100 GB/s

[Diagram: 32 supernodes (SN) interconnected by the 3D torus network, with the parallel disk storage (D) attached.]
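A minimal sketch of neighbor addressing in a 4x4x4 torus (an illustration of the topology only, not Gordon's actual routing code):

    # Each switch at coordinates (x, y, z) has six nearest neighbors, with
    # wraparound at the edges -- the defining property of a torus.
    DIM = 4

    def torus_neighbors(x, y, z, dim=DIM):
        nbrs = []
        for axis in range(3):
            for step in (-1, 1):
                coord = [x, y, z]
                coord[axis] = (coord[axis] + step) % dim   # wrap at the boundary
                nbrs.append(tuple(coord))
        return nbrs

    print(torus_neighbors(0, 0, 0))   # edges wrap to coordinate 3 on each axis

With wraparound, no switch is more than two hops away from any other along a given axis of a 4x4x4 torus.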


Gordon Aggregate Capabilities

Speed                  >200 TFLOPS
Mem (RAM)              64 TB
Mem (SSD)              256 TB
Mem (RAM + SSD)        320 TB
Ratio (MEM/SPEED)      1.31 bytes/FLOP
IO rate to SSDs        35 million IOPS
Network bandwidth      16 GB/s bidirectional
Network latency        1 µsec
Disk storage           4 PB
Disk IO bandwidth      >100 GB/sec
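As a consistency check, a short Python calculation scales the per-supernode and per-node figures from the architecture slides by the 32 supernodes; the per-node numbers are the ones quoted above, not independent measurements:

    # Deriving the aggregate table from the architecture slides.
    supernodes = 32
    compute_nodes = 32 * supernodes            # 1024 compute nodes
    io_nodes = 2 * supernodes                  # 64 I/O nodes

    speed_tflops = compute_nodes * 240 / 1000  # 240 GFLOPS per node
    ram_tb = supernodes * 2                    # 2 TB DRAM per supernode
    ssd_tb = supernodes * 8                    # 8 TB flash per supernode
    iops = io_nodes * 560_000                  # 560,000 IOPS per I/O node

    print(f"Speed       ~{speed_tflops:.0f} TFLOPS")               # ~246 (>200)
    print(f"RAM + SSD    {ram_tb + ssd_tb} TB")                    # 320 TB
    print(f"Bytes/FLOP   {(ram_tb + ssd_tb) / speed_tflops:.2f}")  # ~1.3
    print(f"IOPS        ~{iops / 1e6:.0f} million")                # ~36 million (>35)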