The Other HPC: High Productivity Computing in Polystore Environments
Bill Howe, Ph.D.
Associate Professor, Information School
Adjunct Associate Professor, Computer Science & Engineering
Associate Director, eScience Institute
Director, Urbanalytics Group
8/7/2017 Bill Howe, UW 1
[Figure: two plots over time, one of the amount of data in the world and one of processing power]
What is the rate-limiting step in data understanding?
[Figure: curves over time labeled processing power (Moore’s Law), amount of data in the world, and human cognitive capacity]
Idea adapted from “Less is More” by Bill Buxton (2001)
slide src: Cecilia Aragon, UW HCDE
Productivity
How long I have to wait for results: months, weeks, days, hours, minutes, seconds, milliseconds
[Figure: HPC, Systems, and Databases placed along this axis, with a feasibility threshold and an interactivity threshold marked]
Claim: Only these two performance
thresholds are generally important;
other performance requirements
are application-specific
HPC
• priority is machine efficiency
• data manipulation considered pre-processing
• batch
DB / Dataflow
• priority is developer efficiency
• analysis considered post-processing
• batch and interactive
Observations
• Every interesting application has both a data manipulation component and an analytics component
• Different people like to express things in different ways
• Different systems offer better performance at different things
• …but in between people and systems, there is no real difference in expressiveness between linear and relational algebra
• So we want full “anything anywhere” rewrites
Matrix Multiply
select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
Matrix multiply in RA
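The query above can be tried end to end. This is a minimal sketch, assuming Python's built-in sqlite3 module as the SQL engine (the talk's experiments use HyPer and MyriaX, not SQLite) and two small hypothetical matrices stored as (row, column, value) triples with zeros left out.

```python
import sqlite3

def sql_matmul(a_triples, b_triples):
    """Multiply two sparse matrices stored as (row, col, val) triples using the
    join-and-aggregate query from the slide. SQLite is an assumption here."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE A (i INTEGER, j INTEGER, val REAL)")
    con.execute("CREATE TABLE B (j INTEGER, k INTEGER, val REAL)")
    con.executemany("INSERT INTO A VALUES (?, ?, ?)", a_triples)
    con.executemany("INSERT INTO B VALUES (?, ?, ?)", b_triples)
    rows = con.execute(
        "SELECT A.i, B.k, SUM(A.val * B.val) "
        "FROM A, B WHERE A.j = B.j "
        "GROUP BY A.i, B.k ORDER BY A.i, B.k"
    ).fetchall()
    con.close()
    return rows

# Hypothetical 2x2 inputs: A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]]
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]
product = sql_matmul(A, B)
```

Only non-zero entries are stored and joined, so the work done by the query tracks the number of matching non-zero pairs rather than the full matrix dimensions.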
Sparse means: |non-zero elements| < $|\text{rows}|^{\sim 1.2}$
Naïve sparse algorithm: |non-zero elements| * |rows|
Best-known dense algorithm: $|\text{rows}|^{2.38}$
Complexity of matrix multiply
n = number of rows
m = number of non-zeros
[Figure: complexity exponent plotted against the sparsity exponent r, where $m = n^r$]
• naïve sparse algorithm: $mn$
• best known sparse algorithm: $m^{0.7} n^{1.2} + n^2$
• best known dense algorithm: $n^{2.38}$
lots of room here
slide adapted from Zwick: R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication
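The three bounds can be compared by writing each as a power of n. The sketch below assumes m = n^r as on the slide, so mn becomes n^(r+1) and m^0.7 n^1.2 + n^2 is dominated by the larger of n^(0.7r+1.2) and n^2; the function names are hypothetical.

```python
def complexity_exponents(r):
    """Exponent of n in each bound when the matrix has m = n**r non-zeros."""
    return {
        "naive_sparse": r + 1.0,                  # m * n
        "best_sparse": max(0.7 * r + 1.2, 2.0),   # m^0.7 n^1.2 + n^2
        "dense": 2.38,                            # n^2.38
    }

def cheapest(r):
    """Name of the bound with the smallest exponent at sparsity exponent r."""
    exps = complexity_exponents(r)
    return min(exps, key=exps.get)

# Walk over a few sparsity exponents between 1 (about n non-zeros) and 2 (fully dense).
for r in (1.2, 1.5, 1.9):
    print(r, cheapest(r), complexity_exponents(r))
```

This only compares asymptotic exponents; constant factors, which the experiments later in the talk measure directly, are ignored.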
Single-Server Experiment
Top-shelf SQL (Hyper)
vs. Top-shelf Dense Library (MKL BLAS)
vs. Top-shelf Sparse Library (MKL SpBLAS)
Who wins? By how much?
BLAS vs. SpBLAS vs. SQL (10k)
off-the-shelf database (HyPer): 15X
Single Node Sparse Matrix Multiply: BLAS vs. SpBLAS vs. HyperDB (N=20k)
1. Dense is not competitive
2. 50X-100X gap between DB and library
Single Node Sparse Matrix Multiply: BLAS vs. SpBLAS vs. SQL (N=50k)
1. Dense is not competitive
2. 10X-50X gap between DB and library
Single Node Sparse Matrix Multiply: Relative Speedup of SpBLAS vs. HyperDB
• 100X on small data
• “Only” 5X on big data
Single Node Sparse Matrix Multiply: SpBLAS vs. SQL (Real Data)
• About 5X on Real Data
Distributed Experiment
MyriaX
vs. CombBLAS
Who wins? By how much?
CombBLAS vs. MyriaX (N=50k) on star
8X to 45X
CombBLAS vs. MyriaX (Real Data)
• CombBLAS 10X faster on one dataset
• MyriaX 1.5X faster on another!
A x B x C
Plan 1: group . join . group . join
select AB.i, C.m, sum(AB.val*C.val)
from
  (select A.i, B.k, sum(A.val*B.val) as val
   from A, B
   where A.j = B.j
   group by A.i, B.k
  ) AB,
  C
where AB.k = C.k
group by AB.i, C.m
Plan 2: group . join . join
select A.i, C.m, sum(A.val*B.val*C.val)
from A, B, C
where A.j = B.j
and B.k = C.k
group by A.i, C.m
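The two plans for A x B x C should return the same matrix and differ only in where the aggregation happens. A minimal sketch that checks this, assuming SQLite (via Python's sqlite3) as a stand-in engine and small hypothetical inputs; the inner query gets an explicit `AS val` alias so the outer query can refer to it.

```python
import sqlite3

# group . join . group . join: aggregate A x B first, then join with C.
NESTED = """
SELECT AB.i, C.m, SUM(AB.val * C.val)
FROM (SELECT A.i AS i, B.k AS k, SUM(A.val * B.val) AS val
      FROM A, B WHERE A.j = B.j GROUP BY A.i, B.k) AB, C
WHERE AB.k = C.k
GROUP BY AB.i, C.m ORDER BY AB.i, C.m
"""

# group . join . join: one three-way join, one aggregation at the end.
FLAT = """
SELECT A.i, C.m, SUM(A.val * B.val * C.val)
FROM A, B, C
WHERE A.j = B.j AND B.k = C.k
GROUP BY A.i, C.m ORDER BY A.i, C.m
"""

def run_both(a, b, c):
    """Load three sparse matrices as triples and run both query shapes."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE A (i INTEGER, j INTEGER, val REAL)")
    con.execute("CREATE TABLE B (j INTEGER, k INTEGER, val REAL)")
    con.execute("CREATE TABLE C (k INTEGER, m INTEGER, val REAL)")
    con.executemany("INSERT INTO A VALUES (?, ?, ?)", a)
    con.executemany("INSERT INTO B VALUES (?, ?, ?)", b)
    con.executemany("INSERT INTO C VALUES (?, ?, ?)", c)
    nested = con.execute(NESTED).fetchall()
    flat = con.execute(FLAT).fetchall()
    con.close()
    return nested, flat

# Hypothetical inputs: A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]], C = [[1, 1], [0, 2]]
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]
C = [(0, 0, 1.0), (0, 1, 1.0), (1, 1, 2.0)]
nested, flat = run_both(A, B, C)
```

Moving between these two shapes is an instance of pushing aggregation through a join, which relies on multiplication distributing over addition.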
Linear Algebra
Relational Algebra
Example: Combine measurements from sensors, compute means & covariances
Preprocessing (easier to express in RA)
Analysis (easier to express in LA)
Dylan Hutchison
Example: Sensor Difference Mean & Covariance
Array of Things sensor data, collected in CSV files (https://arrayofthings.github.io/)
t    c     v
466  temp  55.2
466  hum   40.1
492  temp  56.3
492  hum   35.0
528  temp  56.5
Steps: filter and bin onto common time buckets (for each of two inputs), subtract, compute mean, compute covariance
Preprocessing (easier to express in RA)
Analysis (easier to express in LA)
Bin query: easy in RA, harder in LA
Input:
t    c     v
466  temp  55.2
466  hum   40.1
492  temp  56.3
492  hum   35.0
528  temp  56.5
Input as a matrix:
      temp  hum
466   55.2  40.1
492   56.3  35.0
528   56.5
Output:
t'   c     v
460  temp  55.2
460  hum   40.1
520  temp  56.4
520  hum   35.0
Output as a matrix:
      temp  hum
460   55.2  40.1
520   56.4  35.0
$\mathrm{bin}(t) = t - (t \bmod 60) + 60 \left\lfloor \frac{t \bmod 60}{60} + .5 \right\rfloor$
RA: SELECT bin(t) AS t', c, avg(v) AS v GROUP BY t', c
LA: Multiply, using avg on added elements:
      466  492  528
460   1
520        1    1
(bin matrix) * (input matrix) = (output matrix)
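The LA formulation of the bin query can be sketched as a matrix product in which averaging replaces summation. The sketch assumes sparse matrices held as nested dictionaries, and takes the bin assignment and sensor readings from the tables on this slide; the function name is hypothetical.

```python
def avg_matmul(assign, data):
    """Generalized sparse matrix product: for each (bin, column) pair, average
    the data values picked out by the non-zero entries of the assignment
    matrix instead of summing them. Missing entries are simply absent."""
    out = {}
    for bin_label, timestamps in assign.items():
        collected = {}
        for t in timestamps:
            for col, v in data.get(t, {}).items():
                collected.setdefault(col, []).append(v)
        out[bin_label] = {col: sum(vs) / len(vs) for col, vs in collected.items()}
    return out

# Assignment matrix from the slide: row = bin, non-zero columns = timestamps.
assign = {460: [466], 520: [492, 528]}

# Sensor readings from the slide as a sparse time-by-class matrix
# (there is no humidity reading at t = 528).
data = {
    466: {"temp": 55.2, "hum": 40.1},
    492: {"temp": 56.3, "hum": 35.0},
    528: {"temp": 56.5},
}

binned = avg_matmul(assign, data)
```

The RA version gets the same grouping from `GROUP BY`, which is why the slide calls binning easy in RA: in LA the assignment matrix has to be built first.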
Covariance query: easy in LA, harder in RA
$X$ is an $n \times d$ matrix (n points, d attributes), e.g. for n = 3, d = 2:
x11 x12
x21 x22
x31 x32
$M = \frac{1}{n} \mathbf{1}^T X$ is a $1 \times d$ matrix
$C = \frac{1}{n} X^T X - M^T M$ is a $d \times d$ matrix
LA
N = size(X, 1);
M = mean(X, 1);
C = X'*X / N - M'*M;
RA
(Generated SQL statements for each entry)
T = SELECT sum(1.0) AS N,
      avg(X1) AS M1, avg(X2) AS M2, …, avg(Xd) AS Md,
      sum(X1*X1) AS Q11, sum(X1*X2) AS Q12, …,
      sum(Xd-1*Xd) AS Q(d-1)d, sum(Xd*Xd) AS Qdd
    FROM X
C = (SELECT 1 AS i, 1 AS j, Q11/N - M1*M1 AS v FROM T) UNION
    (SELECT 1 AS i, 2 AS j, Q12/N - M1*M2 AS v FROM T) UNION
    …
Carlos Ordonez. Building Statistical Models and Scoring with UDFs. SIGMOD 2007.
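Both formulations can be checked against each other on a small input. This is a sketch under stated assumptions: d = 2, plain Python lists stand in for the LA side, SQLite (via sqlite3) stands in for the RA side, M is taken as the column means to match the LA definition, and the data points are hypothetical.

```python
import sqlite3

def cov_la(X):
    """C = X'X / N - M'M, with plain lists standing in for matrices."""
    n, d = len(X), len(X[0])
    M = [sum(row[j] for row in X) / n for j in range(d)]
    return [[sum(row[i] * row[j] for row in X) / n - M[i] * M[j]
             for j in range(d)] for i in range(d)]

def cov_ra(X):
    """Same matrix from one aggregate query over a two-column table X(X1, X2)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE X (X1 REAL, X2 REAL)")
    con.executemany("INSERT INTO X VALUES (?, ?)", X)
    n, m1, m2, q11, q12, q22 = con.execute(
        "SELECT sum(1.0), avg(X1), avg(X2), "
        "sum(X1*X1), sum(X1*X2), sum(X2*X2) FROM X"
    ).fetchone()
    con.close()
    # One entry of C per (i, j) pair, as in the generated SQL on the slide.
    return [[q11 / n - m1 * m1, q12 / n - m1 * m2],
            [q12 / n - m2 * m1, q22 / n - m2 * m2]]

# Hypothetical data: n = 3 points, d = 2 attributes.
X = [(1.0, 2.0), (3.0, 6.0), (5.0, 7.0)]
C_la = cov_la(X)
C_ra = cov_ra(X)
```

The RA side needs one aggregate column per matrix entry, so the query text grows with d squared, while the LA expression stays three lines for any d.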
LARA: COMPREHENSIVE UNIFIED
LINEAR AND RELATIONAL ALGEBRA
[Figure: Myria polystore architecture]
• MyriaL
• Logical Algebra (Myria Algebra), with rewrite rules
• Target algebras: Array Algebra, Graph Algebra, KeyVal Algebra, Parallel Algebra, Serial Algebra
• Back-end APIs: Spark API, MyriaAPI, CombBLAS API, GEMS API, Accumulo API
• Back ends: Spark, Myria, CombBLAS, GEMS, Accumulo, Serial C
• Myria Middleware: Orchestration and Execution of the Polystore Plan
• Services: visualization, logging, discovery, history, browsing
LARA Algebra
LARA API
LARA Physical Plans
LaraDB (Accumulo)
Objects: Associative Tables
Total functions from keys to values with finite support
Keys     | Values
k1  k2   | [0] v1   [''] v2
a   37   | 7        'dan'
a   20   | 0        ''
b   25   | 0        'dylan'
b   20   | 2        'bill'
Labels on the figure: Keys, Values, Attributes, Default Values, Support
Operators:
• Join ⋈⊗: “horizontal concat”
• Union ⨁: “vertical concat”
• Ext ext f: “flatmap”
UDFs: ⊗, ⨁, f. Think “Semiring”
Join and Union adapted from: M. Spight and V. Tropashko. First steps in relational lattice. 2006.
Ext is a restricted form of monadic bind
Join: Horizontal Concat
A:
a   c   | [0] x   [0] z
a1  c1  | 11      1
a1  c2  | 12      2
a2  c1  | 13      3
a3  c3  | 14      4
B:
c   b   | [0'] z  [0'] y
c1  b1  | 5       15
c2  b1  | 6       16
c2  b2  | 7       17
c4  b1  | 8       18
A ⋈⊗ B:
a   c   b   | [0 ⊗ 0'] z
a1  c1  b1  | 1 ⊗ 5
a2  c1  b1  | 3 ⊗ 5
a1  c2  b1  | 2 ⊗ 6
a1  c2  b2  | 2 ⊗ 7
(a3  c3  b1 | 4 ⊗ 0' = 0 ⊗ 0')
Requires:
vA ⊗ 0' = 0 ⊗ vB = 0 ⊗ 0'
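The join can be sketched over dictionaries that store only the support of each table. This is an illustrative model, not LaraDB's implementation: it assumes a single shared value attribute z, ordinary multiplication for ⊗, and 0 as the default on both sides, and the function name is hypothetical.

```python
def join(keys_a, a, keys_b, b, times, default=0):
    """Join sketch: key attributes are unioned, rows that agree on the shared
    keys are paired, and their values are combined with `times`. Only stored
    (non-default) rows are visited, which is valid because anything combined
    with a default must yield the default."""
    shared = [k for k in keys_a if k in keys_b]
    keys_out = list(keys_a) + [k for k in keys_b if k not in keys_a]
    out = {}
    for ka, va in a.items():
        row_a = dict(zip(keys_a, ka))
        for kb, vb in b.items():
            row_b = dict(zip(keys_b, kb))
            if all(row_a[k] == row_b[k] for k in shared):
                v = times(va, vb)
                if v != default:  # keep only the support of the result
                    merged = {**row_a, **row_b}
                    out[tuple(merged[k] for k in keys_out)] = v
    return keys_out, out

# z columns of the two tables on the slide.
A = {("a1", "c1"): 1, ("a1", "c2"): 2, ("a2", "c1"): 3, ("a3", "c3"): 4}
B = {("c1", "b1"): 5, ("c2", "b1"): 6, ("c2", "b2"): 7, ("c4", "b1"): 8}

keys, AB = join(["a", "c"], A, ["c", "b"], B, lambda u, v: u * v)
```

The row (a3, c3) finds no partner in B, so it pairs with B's default and drops out of the support, matching the parenthesized row on the slide.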
Union: Vertical Concat
A:
a   c   | [0] x   [0] z
a1  c1  | 11      1
a1  c2  | 12      2
a2  c1  | 13      3
a3  c3  | 14      4
B:
c   b   | [0] z   [0] y
c1  b1  | 5       15
c2  b1  | 6       16
c2  b2  | 7       17
c4  b1  | 8       18
A ⨁ B:
c   | [0] x     [0 ⨁ 0 = 0] z   [0] y
c1  | 11 ⨁ 13   1 ⨁ 3 ⨁ 5       15
c2  | 12        2 ⨁ 6 ⨁ 7       16 ⨁ 17
c3  | 14        4               0
c4  | 0         8               18
Requires:
v ⨁ 0 = 0 ⨁ v = v
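The union can be sketched in the same style. This is an illustrative model rather than LaraDB's implementation: rows are (key, value) dictionary pairs, ordinary addition stands in for ⨁, 0 is the default, and the function name is hypothetical.

```python
def union(keys_a, rows_a, keys_b, rows_b, plus, default=0):
    """Union sketch: keep only the key attributes both tables share, take the
    union of the value attributes, and combine values that land on the same
    output key with `plus`. Attributes a key never received get the default."""
    shared = [k for k in keys_a if k in keys_b]
    all_rows = list(rows_a) + list(rows_b)
    attrs = sorted({attr for _, vals in all_rows for attr in vals})
    out = {}
    for key, vals in all_rows:
        slot = out.setdefault(tuple(key[k] for k in shared), {})
        for attr, v in vals.items():
            slot[attr] = plus(slot[attr], v) if attr in slot else v
    return {k: {a: slot.get(a, default) for a in attrs} for k, slot in out.items()}

# The two tables from the slide.
A = [
    ({"a": "a1", "c": "c1"}, {"x": 11, "z": 1}),
    ({"a": "a1", "c": "c2"}, {"x": 12, "z": 2}),
    ({"a": "a2", "c": "c1"}, {"x": 13, "z": 3}),
    ({"a": "a3", "c": "c3"}, {"x": 14, "z": 4}),
]
B = [
    ({"c": "c1", "b": "b1"}, {"z": 5, "y": 15}),
    ({"c": "c2", "b": "b1"}, {"z": 6, "y": 16}),
    ({"c": "c2", "b": "b2"}, {"z": 7, "y": 17}),
    ({"c": "c4", "b": "b1"}, {"z": 8, "y": 18}),
]

result = union(["a", "c"], A, ["c", "b"], B, lambda u, v: u + v)
```

Dropping the non-shared keys a and b is what forces the aggregation: several input rows collapse onto each value of c.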
Ext: Flatmap
A:
a   c   | [0] x   [0] z
a1  c1  | 11      1
a1  c2  | 12      2
a2  c1  | 13      3
f(a, c, x, z) =
k'  | v'
ac  | x – z
ca  | z – x
ext f A:
a   c   k'    | [0 – 0 = 0] v'
a1  c1  a1c1  | 11 – 1
a1  c1  c1a1  | 1 – 11
a1  c2  a1c2  | 12 – 2
a1  c2  c2a1  | 2 – 12
a2  c1  a2c1  | 13 – 3
a2  c1  c1a2  | 3 – 13
Requires:
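Ext can be sketched as a flatmap over the stored rows. This is an illustrative model, not LaraDB's implementation: f returns a small dictionary from new key to new value, the output values are computed from the input table shown on the slide, and the function names are hypothetical.

```python
def ext(keys, rows, f):
    """Ext sketch (flatmap): apply f to every stored row. f returns a small
    table {new_key: new_value}, and each of its entries becomes an output row
    whose key is the input key extended by new_key."""
    out = {}
    for key, vals in rows.items():
        for new_key, new_val in f(*key, **vals).items():
            out[key + (new_key,)] = new_val
    return list(keys) + ["k'"], out

def f(a, c, x, z):
    """The example function from the slide: two output rows per input row."""
    return {a + c: x - z, c + a: z - x}

A = {
    ("a1", "c1"): {"x": 11, "z": 1},
    ("a1", "c2"): {"x": 12, "z": 2},
    ("a2", "c1"): {"x": 13, "z": 3},
}

keys, result = ext(["a", "c"], A, f)
```

Because f may return any number of entries, ext can both reshape keys and change how many rows a table has, which is why the slide compares it to a restricted monadic bind.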
Summary: Union, Join, Ext
                 Key Types       Value Types    Support
Union (A ⊕ B)    = K_A ∩ K_B     = V_A ∪ V_B    ⊆ S_A ∪ S_B
Join (A ⋈⊗ B)    = K_A ∪ K_B     = V_A ∩ V_B    ⊆ S_A ∩ S_B
Ext (ext f A)    extended by f   set by f       ⊆ S_A × S_f
For Support, ‘⊆’ becomes ‘=’ if ⊕ is zero-sum-free or ⊗ has zero-product-property
Duality
LARA Properties
> If ⨁ or ⊗ is associative, commutative, or idempotent, then so is Union or Join, respectively
> (Push Aggregation into Join) If ⊗ distributes over ⨁, …
> (Distribute Join over Union) If …, then …
… = sum(AB ⊗ CT)
RADISH: COMPILING QUERIES TO HPC ARCHITECTURES
Query compilation for distributed processing
• pipeline as parallel code → parallel compiler → machine code [Myers ’14]
• pipeline fragment code → sequential compiler → machine code [Crotty ’14, Li ’14, Seo ’14, Murray ’11]
RADISH
ICS 16
Brandon Myers
1% selection microbenchmark, 20GB
Avoid long code paths
Q2 SP2Bench, 100M triples, multiple self-joins
Communication optimization
Graph Patterns
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then
compiled again by a low-level PGAS compiler
• One of Myria’s supported back ends
• Comparison with Shark/Spark, which itself has been shown to
be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
ICS 15
Recap
• Productivity is the new performance
• …but this doesn’t mean give up on orders of
magnitude performance difference by doing
everything on one system
• Everything interesting is LA + RA
• There is no difference except syntax and
systems
• We want to comprehensively optimize
across them, generate code anywhere
Other Productivity Work
• Workload Analytics for SQL Data Lakes
– Shrainik Jain
• AI for Scientific Data Curation
– Maxim Grechkin, Hoifung Poon (MSR)
• Visualization Recommendation
– Kanit “Ham” Wongsuphasawat, Dom Moritz, Jeff
Heer
• Information Extraction from Scientific Figures
– Poshen Lee, Sean Yang
• Scalable Approximate Community Detection
– Seung-Hee Bae (Western Michigan)
The SQLShare Corpus:
A multi-year log of hand-written SQL queries
Queries 24275
Views 4535
Tables 3891
Users 591
SIGMOD 2016
Shrainik Jain
https://uwescience.github.io/sqlshare
Workload Analytics for Data Lakes
lifetime = days between first and last access of table
Data “Grazing”: Short dataset lifetimes
Key idea: Embed queries as vectors
• Learn query embeddings; use them for
all workload analytics tasks:
– Query recommendation
– Workload summarization / index selection
– User behavior modeling
– Predicting heavy hitters
– Forensics
• Get rid of specialized feature
engineering
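The idea of comparing queries as vectors can be illustrated with a deliberately simple stand-in. The sketch below does not learn Doc2Vec embeddings; it swaps in plain token-count vectors and cosine similarity, and the tokenizer, function names, and example queries are all hypothetical.

```python
import math
import re
from collections import Counter

def embed(query):
    """Stand-in embedding: lowercase token counts, not a learned vector."""
    tokens = re.findall(r"[a-z_][a-z0-9_]*|\d+|[^\s\w]", query.lower())
    return Counter(tokens)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

q1 = embed("SELECT name FROM users WHERE age > 30")
q2 = embed("SELECT name FROM users WHERE age > 45")   # same template, new parameter
q3 = embed("SELECT sum(val) FROM A GROUP BY i")       # different template

same_template = cosine(q1, q2)
different_template = cosine(q1, q3)
```

Queries generated from the same template land close together even under this crude representation; a learned embedding aims to capture such structure without hand-built features.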
Doc2Vec on SQL
Can we recover known
patterns in the workload?
TPC-H queries,
generated with
different
parameters
Doc2Vec on Templatized Query Plans
Workload Summarization
and Index Selection
DEEP CURATION FOR
SCIENTIFIC DATA LAKES
Microarray experiments
Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the
bottleneck to data sharing
Maxim Grechkin
Hoifung Poon
No growth in number of
datasets used per paper!
Majority of samples are
one-time-use only!
color = labels supplied
as metadata
clusters = 1st two PCA
dimensions on the
gene expression data
itself
Can we curate algorithmically?
The expression data and the text labels appear to disagree
Better Tissue
Type Labels
Domain knowledge
(Ontology)
Expression data
Free-text Metadata
2 Deep Networks: text, expr
SVM
NIPS 18 (review)
Deep Curation
Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other’s results.
Free-text classifier, Expression classifier
Deep Curation: Our stuff wins, with ZERO training data
[Figure: comparison against the amount of training data used. Series: state of the art; our reimplementation of the state of the art; our dueling pianos NN]
Viziometrics: Analysis of Visualization
in the Scientific Literature
Proportion of
non-quantitative
figures in paper
Paper impact, grouped into 5% percentiles
Poshen Lee
Voyager
Kanit “Ham” Wongsuphasawat Dominik Moritz
InfoVis 15
Scalable Graph Clustering
Seung-Hee Bae
Version 1: Parallelize Best-known Serial Algorithm (ICDM 2013)
Version 2: Free 30% improvement for any algorithm (TKDD 2014)
Version 3: Distributed approx. algorithm, 1.5B edges (SC 2015)
RESPONSIBLE DATA SCIENCE
Propublica, May 2016
The Special Committee on Criminal Justice Reform's
hearing of reducing the pre-trial jail population.
Technical.ly, September 2016
Philadelphia is grappling with the prospect of a racist computer algorithm
Any background signal in the
data of institutional racism is
amplified by the algorithm
operationalized by the algorithm
legitimized by the algorithm
“Should I be afraid of risk assessment tools?”
“No, you gotta tell me a lot more about yourself.
At what age were you first arrested?
What is the date of your most recent crime?”
“And what’s the culture of policing in the
neighborhood in which I grew up in?”
Amazon Prime Now Delivery Area: Atlanta Bloomberg, 2016
Amazon Prime Now Delivery Area: Boston Bloomberg, 2016
Amazon Prime Now Delivery Area: Chicago Bloomberg, 2016
First decade of Data Science research and practice:
What can we do with massive, noisy, heterogeneous datasets?
Next decade of Data Science research and practice:
What should we do with massive, noisy, heterogeneous datasets?
The way I think about this… (1)
The way I think about this… (2)
Decisions are based on two sources of information:
1. Past examples, e.g., “prior arrests tend to increase likelihood of future arrests”
2. Societal constraints, e.g., “we must avoid racial discrimination”
8/7/2017 Data, Responsibly / SciTech NW 75
We’ve become very good at automating the use of past examples
We’ve only just started to think about incorporating societal constraints
The way I think about this… (3)
How do we apply societal constraints to algorithmic
decision-making?
Option 1: Rely on human oversight
Ex: EU General Data Protection Regulation requires that a
human be involved in legally binding algorithmic decision-making
Ex: Wisconsin Supreme Court says a human must review
algorithmic decisions made by recidivism models
Issues with scalability, prejudice
Option 2: Build systems to help enforce these constraints
This is the approach we are exploring
The way I think about this…(4)
On transparency vs. accountability:
• For human decision-making, sometimes explanations are
required, improving transparency
– Supreme court decisions
– Employee reprimands/termination
• But when transparency is difficult, accountability takes over
– medical emergencies, business decisions
• As we shift decisions to algorithms, we lose both
transparency AND accountability
• “The buck stops where?”
So what do we do about it?
Fides: A platform for responsible data science
joint with Stoyanovich [US], Abiteboul [FR], Miklau [US], Sahuguet [US], Weikum [DE]
Data Curation
novel features to support: Fairness, Accountability, Transparency, Privacy, Reproducibility