
A New NSF TeraGrid Resource for Data-Intensive Science

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Michael L. Norman, Principal Investigator and Director, SDSC
Allan Snavely, Co-Principal Investigator and Project Scientist

Slide 1


Coping with the data deluge

• Advances in computing technology have produced a Moore’s Law for Data
• The amount of digital data from instruments (DNA sequencers, CCD cameras, telescopes, MRIs, etc.) doubles every 18 months
• The density of storage media is keeping pace with Moore’s Law, but I/O rates are not
• The time to process exponentially growing amounts of data is therefore itself growing (illustrated in the sketch below)
• Latency for random access is limited by the speed of the disk read head

Slide 2
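To make the mismatch concrete, here is a minimal back-of-envelope sketch in Python. Only the 18-month doubling time comes from the slide; the starting data volume, I/O rate, and I/O improvement rate are illustrative assumptions.

# Illustrative only: data volume doubling every 18 months vs. an I/O rate
# that improves much more slowly. Starting values are assumptions.
data_tb = 10.0          # assumed initial data volume, TB
io_rate_gbs = 0.1       # assumed sustained I/O rate, GB/s
io_growth = 1.10        # assume the I/O rate improves ~10% per year

for year in range(0, 11, 2):
    volume_tb = data_tb * 2 ** (year / 1.5)        # 18-month doubling
    rate = io_rate_gbs * io_growth ** year         # slow I/O improvement
    hours = volume_tb * 1024 / rate / 3600         # time to stream the data once
    print(f"year {year:2d}: {volume_tb:9.1f} TB, {rate:5.2f} GB/s, "
          f"{hours:9.1f} h to read once")

Under these assumptions the time to read the data once keeps growing even though the hardware is nominally improving, which is the gap Gordon targets.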


What is Gordon?

• A “data-intensive” supercomputer based on SSD flash memory and virtual shared-memory software
• Emphasizes memory and IOPS over FLOPS
• Designed to accelerate access to the massive databases being generated in all fields of science, engineering, medicine, and social science
• Random I/O to SSD is 10-100x faster than to HDD (see the latency sketch below)
• Enters production in 1Q 2012
• A working prototype, Dash, is available now for testing and evaluation (LSST, PTF)

Slide 3
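A rough sense of where the 10-100x figure comes from, as a minimal Python sketch. The device latencies below are generic order-of-magnitude assumptions for disks and flash SSDs of that era, not Gordon or Dash measurements.

# Random reads are dominated by per-request latency.
# Latencies are generic assumptions, not measured Gordon/Dash values.
hdd_latency_s = 8e-3      # ~8 ms seek plus rotation for a 7200 rpm disk
ssd_latency_s = 80e-6     # ~80 microseconds for a flash SSD random read
n_requests = 1_000_000    # e.g., one small lookup per candidate object

for name, latency in [("HDD", hdd_latency_s), ("SSD", ssd_latency_s)]:
    total_s = n_requests * latency
    print(f"{name}: about {1 / latency:,.0f} IOPS per device, "
          f"{total_s / 3600:.2f} h for {n_requests:,} random reads")

print(f"latency ratio HDD/SSD: ~{hdd_latency_s / ssd_latency_s:.0f}x")

With these assumptions a million small random reads take hours on a single disk but on the order of a minute on an SSD, which is the regime of the PTF and MOPS examples later in the talk.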


The Memory Hierarchy of a Typical HPC Cluster

[Diagram: memory hierarchy of a typical HPC cluster, spanning shared-memory programming, message-passing programming, a latency gap, and disk I/O]

Slide 5


The Memory Hierarchy of Gordon

[Diagram: the memory hierarchy of Gordon, spanning shared-memory programming down to disk I/O]

Slide 6


Gordon’s 3 Key Innovations

• Fill the latency gap with large amounts of flash SSD
  • 256 TB
  • >35 million IOPS
• Aggregate CPU, DRAM, and SSD resources into 32 shared-memory supernodes for ease of use
  • 8 TFLOPS
  • 2 TB DRAM
  • 8 TB SSD
• High-performance parallel file system
  • 4 PB
  • >100 GB/s sustained

Slide 7
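As a minimal consistency check of these figures (assuming, as the supernode slide later suggests, that the 8 TFLOPS / 2 TB DRAM / 8 TB SSD values are per supernode), the system-wide totals follow from multiplying by the 32 supernodes:

# Roll up per-supernode figures to the system level (per-supernode reading assumed).
supernodes = 32
tflops_per_sn, dram_tb_per_sn, ssd_tb_per_sn = 8, 2, 8

print("flash SSD:", supernodes * ssd_tb_per_sn, "TB")      # 256 TB, as quoted
print("DRAM:     ", supernodes * dram_tb_per_sn, "TB")      # 64 TB system-wide
print("peak:     ", supernodes * tflops_per_sn, "TFLOPS")   # 256 with the rounded 8 TF/supernode; the aggregate slide quotes >200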


Results from Dash*: a working prototype of Gordon

*Available as a TeraGrid resource

Slide 8


Palomar Transient Factory (PTF), in collaboration with Peter Nugent

• Nightly wide-field surveys using the Palomar Schmidt telescope
• Image data sent to LBL for archiving and analysis
• 100 new transients every minute
• Large, random queries across multiple databases for IDs

Slide 9


PTF-DB Transient Search

                Forward Q1       Backward Q1
DASH-IO-SSD     11s (145x)       100s (24x)
Existing DB     1600s            2400s

Random queries requesting very small chunks of data about the candidate observations.

Slide 10
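The speedup factors in parentheses are just the ratio of the existing-database time to the Dash SSD time; a short check using only the timings from the table:

# Speedups from the PTF-DB table: existing-DB time / Dash SSD time.
timings = {"forward Q1": (1600, 11), "backward Q1": (2400, 100)}  # seconds
for query, (existing_s, dash_ssd_s) in timings.items():
    print(f"{query}: {existing_s / dash_ssd_s:.0f}x faster on Dash SSD")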


MOPS Application Runs Faster Under vSMP than on Hardware SMP and ccNUMA

The Moving Object Pipeline System (MOPS) is used in asteroid tracking as part of the Large Synoptic Survey Telescope (LSST) project.

• Algorithm is serial (no MPI)
• 135 GB required for the test case (see the memory sketch below)
• Dash node: dual-socket, 8-core Nehalem node with 48 GB of memory
• Triton PDAF: 8-socket, 32-core Shanghai node with 256 GB of memory
• Ember: SGI Altix UV system with 384 Nehalem cores and 2 TB of RAM in a single system image (SSI)

Collaboration with Jonathan Myers, LSST Corp.

Slide 11
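Why vSMP matters here, as a minimal sketch: the serial MOPS test case needs 135 GB, more than a single 48 GB Dash node holds, so vSMP has to stitch the DRAM of several nodes into one address space. The node sizes come from the slide; the usable-memory fraction is an assumed allowance for vSMP/OS overhead.

import math

working_set_gb = 135       # MOPS test case (from the slide)
dash_node_gb = 48          # DRAM per Dash node (from the slide)
usable_fraction = 0.85     # assumed fraction left after vSMP/OS overhead (guess)

nodes_needed = math.ceil(working_set_gb / (dash_node_gb * usable_fraction))
print(f"need at least {nodes_needed} Dash nodes under vSMP "
      f"to hold a {working_set_gb} GB serial working set")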


Gordon Training Events http://www.sdsc.edu/us/training/

• Getting Ready for Gordon: Using vSMP
  • May 10-11, 2011 (next week)
  • Will be recorded for web download
• Getting Ready for Gordon: Summer Institute
  • August 6-17, 2011
  • Contact Susan Rathbun ([email protected])

Slide 12


How to get time

• NOW: Request a start-up allocation on Dash at https://www.teragrid.org/web/user-support/startup

• After Sept 2011: Request a start-up or large allocation on Gordon at https://www.teragrid.org/web/user-support/allocations

• For more information, see http://www.sdsc.edu/us/resources/dash/

• Or email me at [email protected]

Slide 13


RESERVE SLIDES


Gordon Architecture: “Supernode”

• 32 Appro Extreme-X compute nodes
  • Dual-processor Intel Sandy Bridge
  • 240 GFLOPS
  • 64 GB
• 2 Appro Extreme-X I/O nodes
  • Intel SSD drives
  • 4 TB each
  • 560,000 IOPS
• ScaleMP vSMP virtual shared memory
  • 2 TB RAM aggregate
  • 8 TB SSD aggregate

[Diagram: supernode schematic with 240 GF compute nodes of 64 GB RAM each and a 4 TB SSD I/O node, joined by vSMP memory virtualization]
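A quick arithmetic check that the per-node figures reproduce the per-supernode totals quoted on this slide; the only assumption is that the totals are straight sums over the 32 compute nodes and 2 I/O nodes.

# Per-supernode totals from the per-node specs above.
compute_nodes, io_nodes = 32, 2
gflops_per_node, ram_gb_per_node = 240, 64
ssd_tb_per_io_node, iops_per_io_node = 4, 560_000

print("compute:", compute_nodes * gflops_per_node / 1000, "TFLOPS")  # 7.68, rounded to 8 earlier
print("DRAM:   ", compute_nodes * ram_gb_per_node / 1024, "TB")      # 2.0 TB aggregate
print("SSD:    ", io_nodes * ssd_tb_per_io_node, "TB")               # 8 TB aggregate
print("IOPS:   ", io_nodes * iops_per_io_node)                       # 1,120,000 per supernode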


Gordon Architecture: Full Machine

• 32 supernodes = 1024 compute nodes
• Dual-rail QDR InfiniBand network
  • 3D torus (4x4x4)
• 4 PB rotating-disk parallel file system
  • >100 GB/s

[Diagram: the full machine as a grid of supernodes (SN) attached to the parallel disk storage (D)]
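The same roll-up at full-machine scale. Node counts come from this slide and the previous one; reading 4x4x4 as 64 torus positions is an interpretation of the topology label, not a figure stated in the talk.

# Full-machine totals from the supernode and per-node figures.
supernodes = 32
compute_nodes = supernodes * 32              # 1024 compute nodes, as quoted
peak_tflops = compute_nodes * 240 / 1000     # ~245.8 TFLOPS, consistent with ">200 TFLOPS"
torus_positions = 4 * 4 * 4                  # assuming a 4x4x4 arrangement of switch positions

print(compute_nodes, "compute nodes,", round(peak_tflops, 1), "TFLOPS peak,",
      torus_positions, "torus positions")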


Gordon Aggregate Capabilities

Speed                   >200 TFLOPS
Memory (RAM)            64 TB
Memory (SSD)            256 TB
Memory (RAM + SSD)      320 TB
Ratio (memory/speed)    1.31 bytes/FLOP
I/O rate to SSDs        35 million IOPS
Network bandwidth       16 GB/s bidirectional
Network latency         1 µsec
Disk storage            4 PB
Disk I/O bandwidth      >100 GB/s
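Two internal-consistency checks on this table: the bytes/FLOP ratio is (RAM + SSD) divided by the peak speed, and the 35 million IOPS figure is roughly the 64 I/O nodes (32 supernodes with 2 apiece) times 560,000 IOPS each. The peak speed below is taken from the per-node figure rather than the rounded ">200 TFLOPS".

# Consistency checks on the aggregate capability table.
ram_tb, ssd_tb = 64, 256
peak_tflops = 1024 * 240 / 1000          # ~245.8 TFLOPS from the per-node spec
io_nodes = 32 * 2                        # 2 I/O nodes per supernode
iops_per_io_node = 560_000

bytes_per_flop = (ram_tb + ssd_tb) * 1e12 / (peak_tflops * 1e12)
total_iops = io_nodes * iops_per_io_node

print(f"memory/speed ratio: {bytes_per_flop:.2f} bytes/FLOP")    # ~1.30; the slide quotes 1.31
print(f"aggregate SSD IOPS: {total_iops / 1e6:.1f} million")      # ~35.8 million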