On Data Intensive Computing and Exascale
Alok Choudhary, John G. Searle Professor, Dept. of Electrical Engineering and Computer Science, and Professor, Kellogg School of Management, Northwestern University. [email protected]
Slide 2
Science and Society Transformed by Data
Slide 3
[Figure: crowd sourcing, devices/sensors, and simulations all feeding a DATA DRIVEN process.]
Discovering Knowledge from Massive Data – Data Driven?
Slide 4
Data Intensive (DI)
● Depends on the perspective – processor, memory, application, storage?
● An application can be data intensive without (necessarily) being I/O intensive
Data Driven (DD)
● Operations are driven (and defined) by data
  – Massive transactions
  – BIG analytics: top-down query (well-defined operations) vs. bottom-up discovery (unpredictable time-to-result)
  – BIG data processing
  – Predictive computing
● Usage model further differentiates these
  – Single application and few users vs. large numbers of users, sharing, and historical/temporal data
“Data Intensive” vs. “Data Driven”
Very few large-scale applications of practical importance are NOT Data Intensive
Slide 5
It is a Process: Transactional to Temporal
[Figure: the discovery pipeline – Raw Data → Target Data → Transformed Data → Patterns → Knowledge/Understanding, with integration throughout.]
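As a rough illustration of these stages, the sketch below (all data, sizes, and parameter choices are invented) pushes raw data through selection, transformation, and pattern discovery with off-the-shelf tools; the final knowledge step remains human interpretation.

```python
# Minimal sketch of the transactional-to-temporal process above.
# All data, sizes, and parameter choices are invented for illustration.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
raw = rng.normal(size=(10_000, 8))        # raw data, e.g. a sensor stream

target = raw[:, :4]                       # target data: select relevant variables
transformed = StandardScaler().fit_transform(target)   # transformed data

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(transformed)
patterns = model.cluster_centers_         # patterns: cluster prototypes

# Knowledge/understanding is the human step: interpret and validate the patterns.
print(patterns.round(2))
```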
Slide 6
Process Illustration: Predictive Insights: Extreme Events
Steps for Discovery of Multivariate Non-linear Interactions / Steps for Predictive Modeling of Hurricanes (e.g., the impact of SST on hurricane frequency & intensity).
Input: IPCC AR4 models (CMIP3 datasets) – monthly mean sea surface temperature, monthly mean atmospheric temperature, daily horizontal wind at 250/850 hPa.
1. Pre-process ancillary climate model outputs
2. Construct a multivariate nonlinear climate network
3. Detect & track communities; find non-linear relationships
4. Validate with hindcasts
5. Determine non-stationary, non-i.i.d. climate states & build hurricane models
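Steps 2–3 can be sketched roughly as below, using plain Pearson correlation and networkx community detection as stand-ins; the actual work used multivariate non-linear dependence measures, and all sizes and thresholds here are invented.

```python
# Rough sketch of steps 2-3: build a climate network from gridded series and
# detect communities. Pearson correlation with an arbitrary threshold is a
# stand-in for the multivariate non-linear measures used in the real analysis.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

rng = np.random.default_rng(1)
n_cells, n_months = 200, 600
sst = rng.normal(size=(n_cells, n_months))   # stand-in for monthly SST anomalies

corr = np.corrcoef(sst)                      # pairwise dependence between cells
ii, jj = np.nonzero(np.triu(np.abs(corr) > 0.1, k=1))  # keep stronger links only
G = nx.Graph()
G.add_edges_from(zip(ii.tolist(), jj.tolist()))

communities = greedy_modularity_communities(G)
print(f"{G.number_of_edges()} edges, {len(communities)} communities")
```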
Slide 7
CMIP3 → CMIP5
● Coupled Model Intercomparison Project
● Spatial resolution: 1 – 0.25 degrees
● Temporal resolution: 6 hours – 3 hours
● Models: 24 – 37
● Simulation experiments: 10s – 100s
  – Control runs & hindcasts
  – Decadal & centennial-scale forecasts
● Covers 1000s of simulation years
● 100+ variables
● 10s of TBs to 10s of PBs
[Figure: summary of CMIP5 model experiments, grouped into three tiers.]
Slide 8
A “DATA DRIVEN DISCOVERY” WORTH A THOUSAND SIMULATIONS?
A different way of thinking?
Slide 9
Discovering Materials: Simulations → Analytics
!"#$%&'()"#*"+*,-*.&/01()"#*02%232$/*
! "#$%&%'%(#)(*#+,#-$.%(/&'0(1$#/$()#2+34#$(5$5267(89:;(! :+,&2&*(,52&#.&*('3<=5(&$)#2+34#$(3..5.(85>6>(5=5*'2#($5634?&'7@(+3%%@(3'#+&*(23.&&@(A(?3=5$*5(%@(,@(.@()(5=5*'2#$%;(
4&/01()5/*6"0/71#8*
! "#$%'2-*'(.3'3(+&$&$6(+#.5=%('#(,25.&*'()#2+34#$(5$5267(-%&$6(*05+&*3=()#2+-=3(3$.(.52&?3<=5(5+,&2&*3=(&$)#2+34#$(
6"0/7*-527'2)"#*
! B5%'(+#.5=(#$(-$%55$(.3'3(! CDE)#=.(*2#%%(?3=&.34#$(8.3'3(.&?&.5.(&$'#(CD(%56+5$'%@(+#.5=(<-&='(#$(F(%56+5$'%(3$.('5%'5.(#$(25+3&$&$6(C(%56+5$'G(,2#*5%%(25,53'5.(CD(4+5%(/&'0(.&H525$'('5%'(%56+5$';(
92&8/*$(27/*,-*.&/01()"#*
! I-$(*#+<&$3'#2&3=(=&%'(#)(*#+,#-$.%('02#-60('05(9:(+#.5=(
:(&//#1#8*
! B052+#.7$3+&*(%'3<&=&'7(3$.(05-2&%4*%(
;27102)"#*
! J'2-*'-25(,25.&*4#$(! K-3$'-+(+5*03$&*3=(+#.5=&$6(
!"#$%&'(")%'*+*%,(+"-+(.)&')/+0"#1"2&3,+
4%,(+"-+1).3%05"&,+
67")(*%,(.3+7%879
1"(.&5'*+0'&3%3'(.,+
:;+#"3.*+
6('$*.+3%,0"<.).3+,()20(2).,+
='>+
=$>+
Slide 10
The Data Driven Discovery Ecosystem
[Diagram: the ecosystem loops from Transactional (data generation) through Historical (data processing/organization) to Relational (discovery/predictive modeling), with feedback to refine models and drive new experiments/control. Historical data trains learning models; triggers/questions invoke prediction; data reduction, data management, and data query support each stage.]
Slide 11
COMPUTE CENTRIC TO DISCOVERY CENTRIC: HOW MANY FLOPS IS A LOOKUP WORTH?
Potential System Architecture (IESP)
Systems: 2011 (K computer) vs. projected 2019; difference between today & 2019 in parentheses:
● System peak: 10.5 Pflop/s → 1 Eflop/s (O(100))
● Power: 12.7 MW → ~20 MW
● System memory: 1.6 PB → 32–64 PB (O(10))
● Node performance: 128 GF → 1, 2, or 15 TF (O(10)–O(100))
● Node memory BW: 64 GB/s → 2–4 TB/s (O(100))
● Node concurrency: 8 → O(1k) or 10k (O(100)–O(1000))
● Total node interconnect BW: 20 GB/s → 200–400 GB/s (O(10))
● System size (nodes): 88,124 → O(100,000) or O(1M) (O(10)–O(100))
● Total concurrency: 705,024 → O(billion) (O(1,000))
● MTTI: days → O(1 day) (-O(10))
Potential System Architecture with a cap of $200M and 20 MW
Systems: 2011 (K computer) vs. projected 2019; the figures shown match the IESP table above, row for row (system peak through MTTI).
Slide 14
Balanced Approach to Architecture
[Diagram: a compute-centric stack ordered FLOPS → Memory → Storage, contrasted with a balanced stack ordered Storage → Memory → FLOPS/OPS.]
Slide 15
I/O software stack in HPC – Narrow Universe
[Diagram: many application/compute nodes (A) funnel I/O through a smaller set of I/O delegate/discovery nodes (D) to the I/O servers (S).]
The software stack on the compute nodes:
● Applications
● Parallel netCDF, HDF5, ...
● MPI-IO
● Client-side file system
Compute nodes reach the I/O servers over the network.
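A minimal sketch of a collective write against this stack, assuming h5py built with parallel HDF5 over mpi4py (the talk's own results use Parallel netCDF, for which the pattern is analogous; file and dataset names are invented):

```python
# Collective-write sketch against the stack above:
# application -> h5py (HDF5) -> MPI-IO -> parallel file system.
# Assumes h5py built against parallel HDF5.
# Run with e.g.: mpiexec -n 4 python checkpoint.py
import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.rank, comm.size
n_local = 1024                        # rows owned by this rank (invented size)

with h5py.File("checkpoint.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("field", (nprocs * n_local, 64), dtype="f8")
    start = rank * n_local
    # Each rank writes its own contiguous slab; in collective mode the
    # MPI-IO layer can aggregate the slabs into large, aligned requests.
    with dset.collective:
        dset[start:start + n_local, :] = np.full((n_local, 64), rank, "f8")
```

The point of the collective path is that the high-level library and MPI-IO can merge many per-rank slabs into a few large, well-formed requests to the parallel file system, which is where most of the scaling headroom comes from.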
Slide 16
Supercomputers (Current): Illustration of Simulation Dataset Sizes (dated!)
Application (On-Line Data / Off-Line Data):
● FLASH: Buoyancy-Driven Turbulent Nuclear Burning: 75 TB / 300 TB
● Reactor Core Hydrodynamics: 2 TB / 5 TB
● Computational Nuclear Structure: 4 TB / 40 TB
● Computational Protein Structure: 1 TB / 2 TB
● Performance Evaluation and Analysis: 1 TB / 1 TB
● Kinetics and Thermodynamics of Metal and Complex Hydride Nanoparticles: 5 TB / 100 TB
● Climate Science: 10 TB / 345 TB
● Parkinson's Disease: 2.5 TB / 50 TB
● Plasma Microturbulence: 2 TB / 10 TB
● Lattice QCD: 1 TB / 44 TB
● Thermal Striping in Sodium Cooled Reactors: 4 TB / 8 TB
● Gating Mechanisms of Membrane Proteins: 10 TB / 10 TB
Slide 17
Balanced Approach to Architecture: I/O + Analytics
[Diagram: the balanced stack again (FLOPS/Memory/Storage vs. Storage/Memory/FLOPS/OPS), with Parallel netCDF and a parallel file system shown between memory and storage.]
Slide 18
Scaling I/O and Analytics: Accelerating time to Discovery
I/O was previously a major bottleneck; execution at this scale became possible only because the I/O was scaled up.
● Global Cloud Resolving Model (GCRM)
  – Simulates the circulation associated with large convective clouds
  – Developed by David Randall (Colorado State U) & Karen Schuchardt (PNNL)
● Geodesic grid model
● 1.4 PB of data per simulation
  – 4 km resolution, 3-hourly output, 1 simulated year
  – 1.5 TB per checkpoint
Slide 19
Analytics and I/O : Accelerating time to Discovery
● Improved I/O throughput
  – PnetCDF optimizations provide massive scalability
  – For a 3.5 km grid resolution, the grid has 41.9M cells with 256 vertical layers
  – Covers both data-analysis reads and simulation checkpoint writes
Slide 20
ARCHITECTURE, ALGORITHMS, BENCHMARKING, CO-DESIGN
Slide 21
Large-scale Analytics: Data Analysis Kernels
Performance is typically dominated by a few kernels (illustrative figures below).
Application | Kernel 1 (%) | Kernel 2 (%) | Kernel 3 (%) | Top-3 Σ (%)
K-means | Distance (68) | Center (21) | minDist (10) | 99
Fuzzy K-means | Center (58) | Distance (39) | fuzzySum (1) | 98
BIRCH | Distance (54) | Variance (22) | Redist (10) | 86
HOP | Density (39) | Search (30) | Gather (23) | 92
Naïve Bayesian | probCal (49) | Variance (38) | dataRead (10) | 97
ScalParC | Classify (37) | giniCalc (36) | Compare (24) | 97
Apriori | Subset (58) | dataRead (14) | Increment (8) | 80
Eclat | Intersect (39) | addClass (23) | invertC (10) | 72
SVMlight | quotMatrix (57) | quadGrad (38) | quotUpdate (2) | 97
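The dominance of one or two kernels is easy to reproduce: in k-means, the Distance and minDist kernels above are a single dense, data-parallel computation. A toy sketch (sizes invented, timings machine-dependent):

```python
# Toy illustration of kernel dominance in k-means: the Distance/minDist
# kernels from the table are one dense, data-parallel computation.
# Sizes are arbitrary; timings vary by machine.
import time
import numpy as np

rng = np.random.default_rng(3)
points = rng.normal(size=(100_000, 16))
centers = rng.normal(size=(32, 16))

t0 = time.perf_counter()
# Distance kernel: squared Euclidean distance from every point to every center.
d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
labels = d2.argmin(axis=1)            # minDist kernel
t1 = time.perf_counter()
# Center-update kernel: mean of the points assigned to each cluster.
new_centers = np.array([points[labels == k].mean(axis=0) for k in range(32)])
t2 = time.perf_counter()

print(f"distance+assign: {t1 - t0:.3f}s, center update: {t2 - t1:.3f}s")
```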
Slide 22
Data mining Programs: Do they have different characteristics?
[Chart: cluster assignments (cluster numbers 0–11) of benchmark programs grouped by architectural characteristics. NU-MineBench: apriori, bayesian, birch, eclat, hop, scalparc, kMeans, fuzzy, rsearch, semphy, snp, genenet, svm-rfe. SPEC INT: gcc, bzip2, gzip, mcf, twolf, vortex, vpr, parser. SPEC FP: apsi, art, equake, lucas, mesa, mgrid, swim, wupwise. MediaBench: rawcaudio, epic, encode, cjpeg, mpeg2, pegwit, gs, toast. TPC-H: Q17, Q3, Q4, Q6.]
Slide 23
NU-MineBench versus Others
● The number of data references per instruction is significantly higher.
● L2 miss rates are considerably higher, due to the inherently streaming nature of data retrieval.
● ALU operations per instruction are high, indicating the extensive amount of computation performed in data mining applications.
Parameter† | SPEC INT | SPEC FP | MediaBench | TPC-H | MineBench
Data references | 0.81 | 0.55 | 0.56 | 0.48 | 1.10
Bus accesses | 0.030 | 0.034 | 0.002 | 0.010 | 0.037
Instruction decodes | 1.17 | 1.02 | 1.28 | 1.08 | 0.78
Resource-related stalls | 0.66 | 1.04 | 0.14 | 0.69 | 0.43
CPI | 1.43 | 1.66 | 1.16 | 1.36 | 1.54
ALU instructions | 0.25 | 0.29 | 0.27 | 0.30 | 0.31
L1 misses | 0.023 | 0.008 | 0.010 | 0.029 | 0.016
L2 misses | 0.003 | 0.003 | 0.0004 | 0.002 | 0.006
Branches | 0.13 | 0.03 | 0.16 | 0.11 | 0.14
Branch mispredictions | 0.009 | 0.0008 | 0.016 | 0.0006 | 0.006
† The numbers shown are values per instruction.
Slide 24
Scalable Data Analytics Kernels
● Parallel hierarchical clustering
– Speedup of 18,000 on 16k processors
– I/O significant at large scale
Slide 25
Power-aware Data Analytics: Approximation is a TOP Option in analytics (Co-design)
[Chart: K-means clustering, error vs. energy. X-axis: bits used to represent the input data (4, 8, 12, 16, 20, 32 bits); left axis: relative energy consumed (w.r.t. no compression); right axis: average clustering error (%) in log scale; series for data sets A and B.]
Power-aware analytics
● Reduced-bit fixed-point representations
● Pearson correlation: 2.5–3.5 times faster; 50–70% less energy
● K-means: ~44% less energy with an error of only 0.03% using a 12-bit representation
[Charts: energy consumption and speedup for correlations.]
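A minimal sketch of the reduced-bit idea, assuming uniform fixed-point quantization and scikit-learn's KMeans (data is synthetic, and real energy figures must come from hardware measurement, not from this code):

```python
# Sketch of the reduced-bit idea: quantize inputs to b bits of uniform
# fixed point, re-run k-means, and compare centers against full precision.
# Data is synthetic; energy figures require hardware measurement.
import numpy as np
from sklearn.cluster import KMeans

def quantize(x, bits):
    """Snap x onto a uniform grid with 2**bits levels over its range."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    return np.round((x - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

rng = np.random.default_rng(4)
data = rng.normal(size=(50_000, 8))
ref = KMeans(n_clusters=32, n_init=5, random_state=0).fit(data)

for bits in (4, 8, 12, 16):
    km = KMeans(n_clusters=32, n_init=5, random_state=0).fit(quantize(data, bits))
    # Crude column-wise alignment of centers, just to get a drift number.
    drift = np.abs(np.sort(km.cluster_centers_, axis=0)
                   - np.sort(ref.cluster_centers_, axis=0)).mean()
    print(f"{bits:2d} bits: mean center drift {drift:.4f}")
```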
Slide 26
In-memory K-means Clustering
● The data set fits into device memory, so only a small amount of data is transferred over the PCI bus
● Data set A: 5.7 million records; data set B: 8 million records; attributes = 60; clusters = 32
● Energy savings: 44% for 12-bit quantization
[Charts: (left) energy consumed (kJ) in data transfer for 4–32 bit representations of data sets A and B; (right) energy consumption of the in-order GPU K-means, broken into quantization energy, kernel energy, and GPU active energy, for the same bit widths.]
Slide 27
Accelerators: Principal Component Analysis / Data Reduction
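The slide gives no further detail; as a generic illustration of PCA used for data reduction (sizes and the 95%-variance target are arbitrary choices, not from the talk):

```python
# Generic PCA-for-data-reduction sketch; the slide itself gives no detail,
# so sizes and the 95%-variance target are arbitrary illustrative choices.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Correlated synthetic data: most variance lives in a few directions.
latent = rng.normal(size=(100_000, 5))
data = latent @ rng.normal(size=(5, 64)) + 0.01 * rng.normal(size=(100_000, 64))

pca = PCA(n_components=0.95)          # keep enough components for 95% variance
reduced = pca.fit_transform(data)
print(f"64 -> {reduced.shape[1]} dims, "
      f"{pca.explained_variance_ratio_.sum():.1%} variance retained")
```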
Slide 28
Co-Design: Analytics Algorithms at Scale, with Power Efficiency and New Memory Hierarchies
Large-scale data-driven science for complex, multivariate, spatio-temporal, non-linear, and dynamic systems combines:
● High Performance Computing: efficient analytics on future-generation exascale HPC platforms with complex memory hierarchies
● Relationship Mining: discovery of complex dependence structures such as non-linear relationships
● Predictive Modeling: modeling typical and extreme behavior from multivariate spatio-temporal data
● Complex Networks: studying the collective behavior of interacting subsystems
These components exchange kernels, features, dependencies, relationships, and community structure/function/dynamics.
The traditional model of developing algorithms on small data and then parallelizing them WILL NOT WORK in most cases, because the characteristics of the algorithm and of the solution itself depend on the data: you don't want to "remove" the needle from the haystack, you want to find it!
Slide 29
Data Analytics – Broad Impact
Feature, data reduction, or analytics task (illustrative applications): data analysis kernels
● Clustering (Chemistry, Climate, Combustion, Cosmology, Fusion, Materials science, Plasma): k-means, fuzzy k-means, BIRCH, MAFIA, DBSCAN, HOP, SNN, Dynamic Time Warping, Random Walk
● Statistics (Biology, Climate, Combustion, Cosmology, Plasma, Renewable energy): extrema, mean, quantiles, standard deviation, copulas, value-based extraction, sampling
● Feature selection (Biology, Climate, Fusion, Plasma): data slicing, LVF, SFG, SBG, ABB, RELIEF
● Data transformations (Chemistry, Materials science, Plasma, Climate): Fourier transform, wavelet transform, PCA/SVD/EOF analysis, multidimensional scaling, differentiation, integration
● Topology (Combustion, Earth science): Morse-Smale complexes, Reeb graphs, level set decomposition
● Geometry (Earth science): fractal dimension, curvature, torsion
● Classification (Biology, Climate, Cosmology, Fusion): ScalParC, decision trees, Naïve Bayes, SVMlight, RIPPER
● Data compression (Chemistry, Climate, Combustion, Cosmology, Fusion, Plasma): PPM, LZW, JPEG, wavelet compression, PCA, fixed-point representation
● Anomaly detection (Climate): entropy, LOF, GBAD
● Similarity/distance (Climate, Earth science): cosine similarity, correlation (TAPER), mutual information, Student's t-test, Euclidean distance, Mahalanobis distance, Jaccard coefficient, Tanimoto coefficient, shortest paths
● Halos and sub-halos (Cosmology): SUBFIND, AHF
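Most of these kernels reduce to dense linear algebra; for instance, the cosine-similarity kernel in the similarity/distance row is one matrix product over normalized rows (a minimal sketch, data invented):

```python
# One kernel from the table: all-pairs cosine similarity between time
# series, computed as a single dense matrix product over normalized rows.
# The data is a synthetic stand-in for gridded climate series.
import numpy as np

rng = np.random.default_rng(6)
series = rng.normal(size=(1_000, 360))         # 1000 locations x 360 months

unit = series / np.linalg.norm(series, axis=1, keepdims=True)
cosine = unit @ unit.T                          # (1000, 1000) similarity matrix
print(cosine.shape, cosine[0, :3].round(3))
```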
Slide 30
Network Effect and Precise Interest Targeting