On Data Intensive Computing and Exascale
Alok Choudhary, John G. Searle Professor, Dept. of Electrical Engineering and Computer Science, and Professor, Kellogg School of Management, Northwestern University. [email protected]
Slide 2
Science and Society Transformed by Data
Slide 3
[Figure: crowd sourcing, devices/sensors, and simulations all feeding a DATA DRIVEN process.]
Discovering Knowledge from Massive Data – Data Driven?
Slide 4
Data Intensive (DI)
● Depends on the perspective – processor, memory, application, storage?
● An application can be data intensive without (necessarily) being I/O intensive
Data Driven (DD)
● Operations are driven (and defined) by data
  – Massive transactions
  – BIG analytics: top-down query (well-defined operations) vs. bottom-up discovery (unpredictable time-to-result)
  – BIG data processing
  – Predictive computing
● Usage model further differentiates these
  – Single application and few users vs. large numbers of users, sharing, and historical/temporal data
“Data Intensive” vs. “Data Driven”
Very few large-scale applications of practical importance are NOT Data Intensive
Slide 5
It is a Process: Transactional to Temporal
[Figure: the discovery pipeline – Raw Data → Target Data → Transformed Data → Patterns → Knowledge/Understanding, with integration throughout.]
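As a rough illustration of these stages, the sketch below (all data, sizes, and parameter choices are invented) pushes raw data through selection, transformation, and pattern discovery with off-the-shelf tools; the final knowledge step remains human interpretation.

```python
# Minimal sketch of the transactional-to-temporal process above.
# All data, sizes, and parameter choices are invented for illustration.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
raw = rng.normal(size=(10_000, 8))        # raw data, e.g. a sensor stream

target = raw[:, :4]                       # target data: select relevant variables
transformed = StandardScaler().fit_transform(target)   # transformed data

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(transformed)
patterns = model.cluster_centers_         # patterns: cluster prototypes

# Knowledge/understanding is the human step: interpret and validate the patterns.
print(patterns.round(2))
```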
Slide 6
Process Illustration: Predictive Insights: Extreme Events
Steps for Discovery of Multivariate Non-linear Interactions / Steps for Predictive Modeling of Hurricanes (e.g., the impact of SST on hurricane frequency & intensity).
Input: IPCC AR4 models (CMIP3 datasets) – monthly mean sea surface temperature, monthly mean atmospheric temperature, daily horizontal wind at 250/850 hPa.
1. Pre-process ancillary climate model outputs
2. Construct a multivariate nonlinear climate network
3. Detect & track communities; find non-linear relationships
4. Validate with hindcasts
5. Determine non-stationary, non-i.i.d. climate states & build hurricane models
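Steps 2–3 can be sketched roughly as below, using plain Pearson correlation and networkx community detection as stand-ins; the actual work used multivariate non-linear dependence measures, and all sizes and thresholds here are invented.

```python
# Rough sketch of steps 2-3: build a climate network from gridded series and
# detect communities. Pearson correlation with an arbitrary threshold is a
# stand-in for the multivariate non-linear measures used in the real analysis.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

rng = np.random.default_rng(1)
n_cells, n_months = 200, 600
sst = rng.normal(size=(n_cells, n_months))   # stand-in for monthly SST anomalies

corr = np.corrcoef(sst)                      # pairwise dependence between cells
ii, jj = np.nonzero(np.triu(np.abs(corr) > 0.1, k=1))  # keep stronger links only
G = nx.Graph()
G.add_edges_from(zip(ii.tolist(), jj.tolist()))

communities = greedy_modularity_communities(G)
print(f"{G.number_of_edges()} edges, {len(communities)} communities")
```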
Slide 7
CMIP3 → CMIP5
● Coupled Model Intercomparison Project
● Spatial resolution: 1 – 0.25 degrees
● Temporal resolution: 6 hours – 3 hours
● Models: 24 – 37
● Simulation experiments: 10s – 100s
  – Control runs & hindcasts
  – Decadal & centennial-scale forecasts
● Covers 1000s of simulation years
● 100+ variables
● 10s of TBs to 10s of PBs
[Figure: summary of CMIP5 model experiments, grouped into three tiers.]
Slide 8
A “DATA DRIVEN DISCOVERY” WORTH A THOUSAND SIMULATIONS?
A different way of thinking?
Slide 9
Discovering Materials: Simulations → Analytics
!"#$%&'()"#*"+*,-*.&/01()"#*02%232$/*
! "#$%&%'%(#)(*#+,#-$.%(/&'0(1$#/$()#2+34#$(5$5267(89:;(! :+,&2&*(,52&#.&*('3<=5(&$)#2+34#$(3..5.(85>6>(5=5*'2#($5634?&'7@(+3%%@(3'#+&*(23.&&@(A(?3=5$*5(%@(,@(.@()(5=5*'2#$%;(
4&/01()5/*6"0/71#8*
! "#$%'2-*'(.3'3(+&$&$6(+#.5=%('#(,25.&*'()#2+34#$(5$5267(-%&$6(*05+&*3=()#2+-=3(3$.(.52&?3<=5(5+,&2&*3=(&$)#2+34#$(
6"0/7*-527'2)"#*
! B5%'(+#.5=(#$(-$%55$(.3'3(! CDE)#=.(*2#%%(?3=&.34#$(8.3'3(.&?&.5.(&$'#(CD(%56+5$'%@(+#.5=(<-&='(#$(F(%56+5$'%(3$.('5%'5.(#$(25+3&$&$6(C(%56+5$'G(,2#*5%%(25,53'5.(CD(4+5%(/&'0(.&H525$'('5%'(%56+5$';(
92&8/*$(27/*,-*.&/01()"#*
! I-$(*#+<&$3'#2&3=(=&%'(#)(*#+,#-$.%('02#-60('05(9:(+#.5=(
:(&//#1#8*
! B052+#.7$3+&*(%'3<&=&'7(3$.(05-2&%4*%(
;27102)"#*
! J'2-*'-25(,25.&*4#$(! K-3$'-+(+5*03$&*3=(+#.5=&$6(
!"#$%&'(")%'*+*%,(+"-+(.)&')/+0"#1"2&3,+
4%,(+"-+1).3%05"&,+
67")(*%,(.3+7%879
1"(.&5'*+0'&3%3'(.,+
:;+#"3.*+
6('$*.+3%,0"<.).3+,()20(2).,+
='>+
=$>+
Slide 10
The Data Driven Discovery Ecosystem
[Diagram: the ecosystem loops from Transactional (data generation) through Historical (data processing/organization) to Relational (discovery/predictive modeling), with feedback to refine models and drive new experiments/control. Historical data trains learning models; triggers/questions invoke prediction; data reduction, data management, and data query support each stage.]
Slide 11
COMPUTE CENTRIC TO DISCOVERY CENTRIC: HOW MANY FLOPS IS A LOOKUP WORTH?
Potential System Architecture (IESP)
Systems: 2011 (K computer) vs. projected 2019; difference between today & 2019 in parentheses:
● System peak: 10.5 Pflop/s → 1 Eflop/s (O(100))
● Power: 12.7 MW → ~20 MW
● System memory: 1.6 PB → 32–64 PB (O(10))
● Node performance: 128 GF → 1, 2, or 15 TF (O(10)–O(100))
● Node memory BW: 64 GB/s → 2–4 TB/s (O(100))
● Node concurrency: 8 → O(1k) or 10k (O(100)–O(1000))
● Total node interconnect BW: 20 GB/s → 200–400 GB/s (O(10))
● System size (nodes): 88,124 → O(100,000) or O(1M) (O(10)–O(100))
● Total concurrency: 705,024 → O(billion) (O(1,000))
● MTTI: days → O(1 day) (-O(10))
Potential System Architecture with a cap of $200M and 20 MW
Systems: 2011 (K computer) vs. projected 2019; the figures shown match the IESP table above, row for row (system peak through MTTI).
Slide 14
Balanced Approach to Architecture
[Diagram: a compute-centric stack ordered FLOPS → Memory → Storage, contrasted with a balanced stack ordered Storage → Memory → FLOPS/OPS.]
Slide 15
I/O software stack in HPC – Narrow Universe
[Diagram: many application/compute nodes (A) funnel I/O through a smaller set of I/O delegate/discovery nodes (D) to the I/O servers (S).]
The software stack on the compute nodes:
● Applications
● Parallel netCDF, HDF5, ...
● MPI-IO
● Client-side file system
Compute nodes reach the I/O servers over the network.
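A minimal sketch of a collective write against this stack, assuming h5py built with parallel HDF5 over mpi4py (the talk's own results use Parallel netCDF, for which the pattern is analogous; file and dataset names are invented):

```python
# Collective-write sketch against the stack above:
# application -> h5py (HDF5) -> MPI-IO -> parallel file system.
# Assumes h5py built against parallel HDF5.
# Run with e.g.: mpiexec -n 4 python checkpoint.py
import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.rank, comm.size
n_local = 1024                        # rows owned by this rank (invented size)

with h5py.File("checkpoint.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("field", (nprocs * n_local, 64), dtype="f8")
    start = rank * n_local
    # Each rank writes its own contiguous slab; in collective mode the
    # MPI-IO layer can aggregate the slabs into large, aligned requests.
    with dset.collective:
        dset[start:start + n_local, :] = np.full((n_local, 64), rank, "f8")
```

The point of the collective path is that the high-level library and MPI-IO can merge many per-rank slabs into a few large, well-formed requests to the parallel file system, which is where most of the scaling headroom comes from.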
Slide 16
Supercomputers (Current): Illustration of Simulation Dataset Sizes (dated!)
Application (On-Line Data / Off-Line Data):
● FLASH: Buoyancy-Driven Turbulent Nuclear Burning: 75 TB / 300 TB
● Reactor Core Hydrodynamics: 2 TB / 5 TB
● Computational Nuclear Structure: 4 TB / 40 TB
● Computational Protein Structure: 1 TB / 2 TB
● Performance Evaluation and Analysis: 1 TB / 1 TB
● Kinetics and Thermodynamics of Metal and Complex Hydride Nanoparticles: 5 TB / 100 TB
● Climate Science: 10 TB / 345 TB
● Parkinson's Disease: 2.5 TB / 50 TB
● Plasma Microturbulence: 2 TB / 10 TB
● Lattice QCD: 1 TB / 44 TB
● Thermal Striping in Sodium Cooled Reactors: 4 TB / 8 TB
● Gating Mechanisms of Membrane Proteins: 10 TB / 10 TB
Slide 17
Balanced Approach to Architecture: I/O + Analytics
[Diagram: the balanced stack again (FLOPS/Memory/Storage vs. Storage/Memory/FLOPS/OPS), with Parallel netCDF and a parallel file system shown between memory and storage.]
Slide 18
Scaling I/O and Analytics: Accelerating time to Discovery
I/O was previously a major bottleneck; execution at this scale became possible only because the I/O was scaled up.
● Global Cloud Resolving Model (GCRM)
  – Simulates the circulation associated with large convective clouds
  – Developed by David Randall (Colorado State U) & Karen Schuchardt (PNNL)
● Geodesic grid model
● 1.4 PB of data per simulation
  – 4 km resolution, 3-hourly output, 1 simulated year
  – 1.5 TB per checkpoint
Slide 19
Analytics and I/O : Accelerating time to Discovery
● Improved I/O throughput
  – PnetCDF optimizations provide massive scalability
  – For a 3.5 km grid resolution, the grid has 41.9M cells with 256 vertical layers
  – Covers both data-analysis reads and simulation checkpoint writes
Slide 20
ARCHITECTURE, ALGORITHMS, BENCHMARKING, CO-DESIGN
Slide 21
Large-scale Analytics: Data Analysis Kernels
Performance is typically dominated by a few kernels (illustrative figures below).
Application | Kernel 1 (%) | Kernel 2 (%) | Kernel 3 (%) | Top-3 Σ (%)
K-means | Distance (68) | Center (21) | minDist (10) | 99
Fuzzy K-means | Center (58) | Distance (39) | fuzzySum (1) | 98
BIRCH | Distance (54) | Variance (22) | Redist (10) | 86
HOP | Density (39) | Search (30) | Gather (23) | 92
Naïve Bayesian | probCal (49) | Variance (38) | dataRead (10) | 97
ScalParC | Classify (37) | giniCalc (36) | Compare (24) | 97
Apriori | Subset (58) | dataRead (14) | Increment (8) | 80
Eclat | Intersect (39) | addClass (23) | invertC (10) | 72
SVMlight | quotMatrix (57) | quadGrad (38) | quotUpdate (2) | 97
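The dominance of one or two kernels is easy to reproduce: in k-means, the Distance and minDist kernels above are a single dense, data-parallel computation. A toy sketch (sizes invented, timings machine-dependent):

```python
# Toy illustration of kernel dominance in k-means: the Distance/minDist
# kernels from the table are one dense, data-parallel computation.
# Sizes are arbitrary; timings vary by machine.
import time
import numpy as np

rng = np.random.default_rng(3)
points = rng.normal(size=(100_000, 16))
centers = rng.normal(size=(32, 16))

t0 = time.perf_counter()
# Distance kernel: squared Euclidean distance from every point to every center.
d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
labels = d2.argmin(axis=1)            # minDist kernel
t1 = time.perf_counter()
# Center-update kernel: mean of the points assigned to each cluster.
new_centers = np.array([points[labels == k].mean(axis=0) for k in range(32)])
t2 = time.perf_counter()

print(f"distance+assign: {t1 - t0:.3f}s, center update: {t2 - t1:.3f}s")
```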
Slide 22
Data mining Programs: Do they have different characteristics?
[Chart: cluster assignments (cluster numbers 0–11) of benchmark programs grouped by architectural characteristics. NU-MineBench: apriori, bayesian, birch, eclat, hop, scalparc, kMeans, fuzzy, rsearch, semphy, snp, genenet, svm-rfe. SPEC INT: gcc, bzip2, gzip, mcf, twolf, vortex, vpr, parser. SPEC FP: apsi, art, equake, lucas, mesa, mgrid, swim, wupwise. MediaBench: rawcaudio, epic, encode, cjpeg, mpeg2, pegwit, gs, toast. TPC-H: Q17, Q3, Q4, Q6.]
Slide 23
NU-MineBench versus Others
● The number of data references per instruction is significantly higher.
● L2 miss rates are considerably higher, due to the inherently streaming nature of data retrieval.
● ALU operations per instruction are high, indicating the extensive amount of computation performed in data mining applications.
Parameter† | SPEC INT | SPEC FP | MediaBench | TPC-H | MineBench
Data references | 0.81 | 0.55 | 0.56 | 0.48 | 1.10
Bus accesses | 0.030 | 0.034 | 0.002 | 0.010 | 0.037
Instruction decodes | 1.17 | 1.02 | 1.28 | 1.08 | 0.78
Resource-related stalls | 0.66 | 1.04 | 0.14 | 0.69 | 0.43
CPI | 1.43 | 1.66 | 1.16 | 1.36 | 1.54
ALU instructions | 0.25 | 0.29 | 0.27 | 0.30 | 0.31
L1 misses | 0.023 | 0.008 | 0.010 | 0.029 | 0.016
L2 misses | 0.003 | 0.003 | 0.0004 | 0.002 | 0.006
Branches | 0.13 | 0.03 | 0.16 | 0.11 | 0.14
Branch mispredictions | 0.009 | 0.0008 | 0.016 | 0.0006 | 0.006
† The numbers shown are values per instruction.
Slide 24
Scalable Data Analytics Kernels
● Parallel hierarchical clustering
– Speedup of 18,000 on 16k processors
– I/O significant at large scale
Slide 25
Power-aware Data Analytics: Approximation is a TOP Option in analytics (Co-design)
[Chart: K-means clustering, error vs. energy. X-axis: bits used to represent the input data (4, 8, 12, 16, 20, 32 bits); left axis: relative energy consumed (w.r.t. no compression); right axis: average clustering error (%) in log scale; series for data sets A and B.]
Power-aware analytics
● Reduced-bit fixed-point representations
● Pearson correlation: 2.5–3.5 times faster; 50–70% less energy
● K-means: ~44% less energy with an error of only 0.03% using a 12-bit representation
[Charts: energy consumption and speedup for correlations.]
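A minimal sketch of the reduced-bit idea, assuming uniform fixed-point quantization and scikit-learn's KMeans (data is synthetic, and real energy figures must come from hardware measurement, not from this code):

```python
# Sketch of the reduced-bit idea: quantize inputs to b bits of uniform
# fixed point, re-run k-means, and compare centers against full precision.
# Data is synthetic; energy figures require hardware measurement.
import numpy as np
from sklearn.cluster import KMeans

def quantize(x, bits):
    """Snap x onto a uniform grid with 2**bits levels over its range."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    return np.round((x - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

rng = np.random.default_rng(4)
data = rng.normal(size=(50_000, 8))
ref = KMeans(n_clusters=32, n_init=5, random_state=0).fit(data)

for bits in (4, 8, 12, 16):
    km = KMeans(n_clusters=32, n_init=5, random_state=0).fit(quantize(data, bits))
    # Crude column-wise alignment of centers, just to get a drift number.
    drift = np.abs(np.sort(km.cluster_centers_, axis=0)
                   - np.sort(ref.cluster_centers_, axis=0)).mean()
    print(f"{bits:2d} bits: mean center drift {drift:.4f}")
```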
Slide 26
In-memory K-means Clustering
● The data set fits into device memory, so only a small amount of data is transferred over the PCI bus
● Data set A: 5.7 million records; data set B: 8 million records; attributes = 60; clusters = 32
● Energy savings: 44% for 12-bit quantization
[Charts: (left) energy consumed (kJ) in data transfer for 4–32 bit representations of data sets A and B; (right) energy consumption of the in-order GPU K-means, broken into quantization energy, kernel energy, and GPU active energy, for the same bit widths.]
Slide 27
Accelerators: Principal Component Analysis / Data Reduction
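The slide gives no further detail; as a generic illustration of PCA used for data reduction (sizes and the 95%-variance target are arbitrary choices, not from the talk):

```python
# Generic PCA-for-data-reduction sketch; the slide itself gives no detail,
# so sizes and the 95%-variance target are arbitrary illustrative choices.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Correlated synthetic data: most variance lives in a few directions.
latent = rng.normal(size=(100_000, 5))
data = latent @ rng.normal(size=(5, 64)) + 0.01 * rng.normal(size=(100_000, 64))

pca = PCA(n_components=0.95)          # keep enough components for 95% variance
reduced = pca.fit_transform(data)
print(f"64 -> {reduced.shape[1]} dims, "
      f"{pca.explained_variance_ratio_.sum():.1%} variance retained")
```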
Slide 28
Co-Design: Analytics Algorithms at Scale, with Power Efficiency and New Memory Hierarchies
Large-scale data-driven science for complex, multivariate, spatio-temporal, non-linear, and dynamic systems combines:
● High Performance Computing: efficient analytics on future-generation exascale HPC platforms with complex memory hierarchies
● Relationship Mining: discovery of complex dependence structures such as non-linear relationships
● Predictive Modeling: modeling typical and extreme behavior from multivariate spatio-temporal data
● Complex Networks: studying the collective behavior of interacting subsystems
These components exchange kernels, features, dependencies, relationships, and community structure/function/dynamics.
The traditional model of developing algorithms on small data and then parallelizing them WILL NOT WORK in most cases, because the characteristics of the algorithm and of the solution itself depend on the data: you don't want to "remove" the needle from the haystack, you want to find it!
Slide 29
Data Analytics – Broad Impact
Feature, data reduction, or analytics task (illustrative applications): data analysis kernels
● Clustering (Chemistry, Climate, Combustion, Cosmology, Fusion, Materials science, Plasma): k-means, fuzzy k-means, BIRCH, MAFIA, DBSCAN, HOP, SNN, Dynamic Time Warping, Random Walk
● Statistics (Biology, Climate, Combustion, Cosmology, Plasma, Renewable energy): extrema, mean, quantiles, standard deviation, copulas, value-based extraction, sampling
● Feature selection (Biology, Climate, Fusion, Plasma): data slicing, LVF, SFG, SBG, ABB, RELIEF
● Data transformations (Chemistry, Materials science, Plasma, Climate): Fourier transform, wavelet transform, PCA/SVD/EOF analysis, multidimensional scaling, differentiation, integration
● Topology (Combustion, Earth science): Morse-Smale complexes, Reeb graphs, level set decomposition
● Geometry (Earth science): fractal dimension, curvature, torsion
● Classification (Biology, Climate, Cosmology, Fusion): ScalParC, decision trees, Naïve Bayes, SVMlight, RIPPER
● Data compression (Chemistry, Climate, Combustion, Cosmology, Fusion, Plasma): PPM, LZW, JPEG, wavelet compression, PCA, fixed-point representation
● Anomaly detection (Climate): entropy, LOF, GBAD
● Similarity/distance (Climate, Earth science): cosine similarity, correlation (TAPER), mutual information, Student's t-test, Euclidean distance, Mahalanobis distance, Jaccard coefficient, Tanimoto coefficient, shortest paths
● Halos and sub-halos (Cosmology): SUBFIND, AHF
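Most of these kernels reduce to dense linear algebra; for instance, the cosine-similarity kernel in the similarity/distance row is one matrix product over normalized rows (a minimal sketch, data invented):

```python
# One kernel from the table: all-pairs cosine similarity between time
# series, computed as a single dense matrix product over normalized rows.
# The data is a synthetic stand-in for gridded climate series.
import numpy as np

rng = np.random.default_rng(6)
series = rng.normal(size=(1_000, 360))         # 1000 locations x 360 months

unit = series / np.linalg.norm(series, axis=1, keepdims=True)
cosine = unit @ unit.T                          # (1000, 1000) similarity matrix
print(cosine.shape, cosine[0, :3].round(3))
```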
Slide 30
Network Effect and Precise Interest Targeting