The Other HPC: High Productivity Computing in Polystore Environments


Bill Howe, Ph.D.

Associate Professor, Information School

Adjunct Associate Professor, Computer Science & Engineering

Associate Director, eScience Institute

Director, Urbanalytics Group

8/7/2017 Bill Howe, UW 1

[Figure: two plots over time, one of the amount of data in the world and one of processing power]

What is the rate-limiting step in data understanding?

[Figure: curves over time for processing power (Moore’s Law) and the amount of data in the world]

What is the rate-limiting step in data understanding?

[Figure: curves over time for processing power (Moore’s Law), the amount of data in the world, and human cognitive capacity]

Idea adapted from “Less is More” by Bill Buxton (2001)

slide src: Cecilia Aragon, UW HCDE

Productivity

[Figure: axis “How long I have to wait for results” running over months, weeks, days, hours, minutes, seconds, milliseconds; regions labeled HPC, Systems, and Databases; markers for the feasibility threshold and the interactivity threshold]

Claim: Only these two performance

thresholds are generally important;

other performance requirements

are application-specific


HPC: priority is machine efficiency; data manipulation considered pre-processing; batch

DB/Dataflow: priority is developer efficiency; analysis considered post-processing; batch and interactive

Observations

• Every interesting application has both a data manipulation component and an analytics component

• Different people like to express things different ways

• Different systems offer better performance at different things

• …but in between people and systems, there is no real difference in expressiveness between linear and relational algebra

• So we want full “anything anywhere” rewrites


Complexity of matrix multiply

n = number of rows, m = number of non-zeros; sparsity exponent r where $m = n^r$

• naïve sparse algorithm: $mn$

• best known sparse algorithm: $m^{0.7} n^{1.2} + n^2$

• best known dense algorithm: $n^{2.38}$

[Figure: complexity exponent plotted against the sparsity exponent r for the three algorithms, annotated “lots of room here”]

slide adapted from Zwick; R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication
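To make the comparison concrete, the three bounds can be rewritten as powers of n by substituting m = n^r. The following is a minimal sketch of that arithmetic; the function names are illustrative and not from the talk, and r is assumed to lie between 1 and 2.

```python
def naive_sparse_exponent(r: float) -> float:
    # m * n = n^r * n = n^(r + 1)
    return r + 1.0


def best_sparse_exponent(r: float) -> float:
    # m^0.7 * n^1.2 + n^2 = n^(0.7 r + 1.2) + n^2; the larger term dominates
    return max(0.7 * r + 1.2, 2.0)


def best_dense_exponent(r: float) -> float:
    # The dense bound n^2.38 does not depend on sparsity
    return 2.38


def exponents(r: float) -> dict:
    return {
        "naive sparse": naive_sparse_exponent(r),
        "best sparse": best_sparse_exponent(r),
        "best dense": best_dense_exponent(r),
    }


if __name__ == "__main__":
    for r in (1.0, 1.25, 1.5, 1.75, 2.0):
        print(r, exponents(r))
```

Under these formulas the naive sparse algorithm beats the dense bound only while r + 1 < 2.38, that is, for r below 1.38.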

Single-Server Experiment

Top-shelf SQL (HyPer)

vs. Top-shelf Dense Library (MKL BLAS)

vs. Top-shelf Sparse Library (MKL SpBLAS)

Who wins? By how much?

BLAS vs. SpBLAS vs. SQL (10k)

off the shelf database (HyPer)

15X

Single Node Sparse Matrix Multiply: BLAS vs. SpBLAS vs. HyperDB (N=20k)

1. Dense is not competitive

2. 50X-100X gap between DB and library

Single Node Sparse Matrix Multiply: BLAS vs. SpBLAS vs. SQL (N=50k)

1. Dense is not competitive

2. 10X-50X gap between DB and library

Single Node Sparse Matrix Multiply: Relative Speedup of SpBLAS vs. HyperDB

100X on small data

“Only” 5X on big data

Single Node Sparse Matrix Multiply: SpBLAS vs. SQL (Real Data)

About 5X on Real Data

Distributed Experiment

MyriaX

vs. CombBLAS

Who wins? By how much?

CombBLAS vs. MyriaX (N=50k) on star

8X to 45X

CombBLAS vs. MyriaX (Real Data)

• CombBLAS 10X faster on one dataset

• MyriaX 1.5X faster on another!

A x B x C

group . join . group . join:

  select AB.i, C.m, sum(AB.val*C.val)
  from (select A.i, B.k, sum(A.val*B.val) as val
        from A, B
        where A.j = B.j
        group by A.i, B.k
       ) AB,
       C
  where AB.k = C.k
  group by AB.i, C.m

group . join . join:

  select A.i, C.m, sum(A.val*B.val*C.val)
  from A, B, C
  where A.j = B.j
    and B.k = C.k
  group by A.i, C.m
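The two plans for A x B x C can be checked against each other on a toy input. The sketch below uses Python's built-in sqlite3 module rather than any of the systems benchmarked in the talk, and the three small matrices are made-up examples stored as (row, column, value) triples.

```python
import sqlite3


def load(conn, name, row_col, col_col, entries):
    """Store a sparse matrix as (row, column, value) triples."""
    conn.execute(f"create table {name} ({row_col} integer, {col_col} integer, val real)")
    conn.executemany(f"insert into {name} values (?, ?, ?)", entries)


conn = sqlite3.connect(":memory:")
# Hypothetical 2x2 matrices: A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]], C = [[7, 0], [8, 9]]
load(conn, "A", "i", "j", [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)])
load(conn, "B", "j", "k", [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)])
load(conn, "C", "k", "m", [(0, 0, 7.0), (1, 0, 8.0), (1, 1, 9.0)])

# group . join . group . join: aggregate A x B first, then join with C
NESTED = """
select AB.i, C.m, sum(AB.val * C.val)
from (select A.i as i, B.k as k, sum(A.val * B.val) as val
      from A, B
      where A.j = B.j
      group by A.i, B.k) AB,
     C
where AB.k = C.k
group by AB.i, C.m
order by AB.i, C.m
"""

# group . join . join: one three-way join, one aggregation at the end
FLAT = """
select A.i, C.m, sum(A.val * B.val * C.val)
from A, B, C
where A.j = B.j and B.k = C.k
group by A.i, C.m
order by A.i, C.m
"""

nested_result = conn.execute(NESTED).fetchall()
flat_result = conn.execute(FLAT).fetchall()

if __name__ == "__main__":
    print(nested_result)
    print(flat_result)
```

Both queries return the same product; which one runs faster depends on the sizes of the intermediate results, which is the kind of choice an optimizer would have to make.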

Observations

• Every interesting application has both a data manipulation component and an analytics component

• Different people like to express things different ways

• Different systems offer better performance at different things

• …but in between people and systems, there is no real difference in expressiveness between linear and relational algebra

• So we want full “anything anywhere” rewrites



Linear Algebra

Relational Algebra

Example: Combine measurements from sensors, compute means & covariances

Preprocessing (easier to express in RA)

Analysis (easier to express in LA)

Dylan Hutchison

Example: Sensor Difference Mean & Covariance

Array of Things sensor data, collected in CSV files (https://arrayofthings.github.io/)

t    c     v
466  temp  55.2
466  hum   40.1
492  temp  56.3
492  hum   35.0
528  temp  56.5

Pipeline:

1. Filter, bin onto common time buckets (two parallel inputs)

2. Subtract

3. Compute Mean

4. Compute Covariance

Preprocessing (easier to express in RA)

Analysis (easier to express in LA)

Dylan Hutchison

Bin query: easy in RA, harder in LA

Input, as a table:

t    c     v
466  temp  55.2
466  hum   40.1
492  temp  56.3
492  hum   35.0
528  temp  56.5

Input, as a matrix:

      temp  hum
466   55.2  40.1
492   56.3  35.0
528   56.5

Output, as a table:

t'   c     v
460  temp  55.2
460  hum   40.1
520  temp  56.4
520  hum   35.0

Output, as a matrix:

      temp  hum
460   55.2  40.1
520   56.4  35.0

$\mathrm{bin}(t) = t - t \bmod 60 + 60 \left\lfloor \frac{t \bmod 60}{60} + .5 \right\rfloor$

LA: Multiply, using avg on added elements:

      466  492  528
460    1
520         1    1

(bucket matrix) * (input matrix) = (output matrix)

RA: SELECT bin(t) AS t', c, avg(v) AS v GROUP BY t', c

Dylan Hutchison
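The two formulations of the bin query can be compared on the slide's sample readings. This is a minimal sketch using Python's sqlite3 module for the RA side and plain dictionaries for the LA side. It assumes the bucket assignment shown in the slide's 0/1 matrix (466 to 460, 492 and 528 to 520) rather than evaluating a bin formula, and all function names are illustrative.

```python
import sqlite3

# Sample readings from the slide: (time, sensor class, value)
READINGS = [
    (466, "temp", 55.2), (466, "hum", 40.1),
    (492, "temp", 56.3), (492, "hum", 35.0),
    (528, "temp", 56.5),
]
# Bucket assignment read off the slide's 0/1 matrix (assumption: taken as given)
BUCKET = {466: 460, 492: 520, 528: 520}


def bin_ra(readings):
    """RA version: group by bucket and class, average the values."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("bin", 1, lambda t: BUCKET[t])
    conn.execute("create table readings (t integer, c text, v real)")
    conn.executemany("insert into readings values (?, ?, ?)", readings)
    rows = conn.execute(
        "select bin(t) as t2, c, avg(v) from readings group by bin(t), c")
    return {(t2, c): v for t2, c, v in rows}


def bin_la(readings):
    """LA version: multiply a 0/1 bucket matrix by the data matrix,
    using avg instead of + to combine the products."""
    times = sorted({t for t, _, _ in readings})
    classes = sorted({c for _, c, _ in readings})
    buckets = sorted(set(BUCKET.values()))
    X = {(t, c): v for t, c, v in readings}      # sparse data matrix
    P = {(b, t): 1 for t, b in BUCKET.items()}   # sparse bucket matrix
    out = {}
    for b in buckets:
        for c in classes:
            terms = [P[(b, t)] * X[(t, c)]
                     for t in times if (b, t) in P and (t, c) in X]
            if terms:
                out[(b, c)] = sum(terms) / len(terms)
    return out


ra_result = bin_ra(READINGS)
la_result = bin_la(READINGS)

if __name__ == "__main__":
    print(ra_result)
    print(la_result)
```

The RA version is a single GROUP BY, while the LA version needs a non-standard "addition" inside the multiply, which is the sense in which the query is easy in one algebra and harder in the other.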

Covariance query: easy in LA, harder in RA

$X$ is an $n \times d$ matrix (n points as rows, d attributes as columns), e.g. with rows $(x_{11}, x_{12})$, $(x_{21}, x_{22})$, $(x_{31}, x_{32})$

$M = \frac{1}{n} \mathbf{1}^T X$ is a $1 \times d$ matrix

$C = \frac{1}{n} X^T X - M^T M$ is a $d \times d$ matrix

LA

  N = size(X, 1);
  M = mean(X, 1);
  C = X'*X / N - M'*M;

RA (Generated SQL statements for each entry)

  T = SELECT sum(1.0) AS N,
             avg(X1) AS M1, avg(X2) AS M2, …, avg(Xd) AS Md,
             sum(X1*X1) AS Q11, sum(X1*X2) AS Q12, …,
             sum(Xd-1*Xd) AS Q(d-1)d, sum(Xd*Xd) AS Qdd
      FROM X

  C = (SELECT 1 AS i, 1 AS j, Q11/N - M1*M1 AS v FROM T) UNION
      (SELECT 1 AS i, 2 AS j, Q12/N - M1*M2 AS v FROM T) UNION
      …

Carlos Ordonez. Building Statistical Models and Scoring with UDFs. SIGMOD 2007.

Dylan Hutchison
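The LA and RA formulations of the covariance query can be run side by side. The sketch below computes C = XᵀX/n − MᵀM directly on Python lists, then generates one SQL aggregate per entry and runs it through the sqlite3 module. The means are computed with avg so that Q/N − M·M matches the LA formula; the sample matrix, the `Qi_j` column naming, and the function names are assumptions made for illustration.

```python
import sqlite3


def covariance_la(X):
    """LA version: C = X^T X / n - M^T M, with M the row vector of column means."""
    n, d = len(X), len(X[0])
    M = [sum(row[j] for row in X) / n for j in range(d)]
    return [[sum(row[i] * row[j] for row in X) / n - M[i] * M[j]
             for j in range(d)] for i in range(d)]


def covariance_ra(X):
    """RA version: generate one aggregate per mean and per pairwise product."""
    d = len(X[0])
    cols = [f"X{j + 1}" for j in range(d)]
    conn = sqlite3.connect(":memory:")
    conn.execute(f"create table X ({', '.join(c + ' real' for c in cols)})")
    conn.executemany(f"insert into X values ({', '.join('?' * d)})", X)
    aggs = ["sum(1.0) as N"]
    aggs += [f"avg({c}) as M{j + 1}" for j, c in enumerate(cols)]
    aggs += [f"sum({a} * {b}) as Q{i + 1}_{j + 1}"
             for i, a in enumerate(cols) for j, b in enumerate(cols)]
    conn.execute(f"create table T as select {', '.join(aggs)} from X")
    # One SELECT per output entry (i, j), glued together with UNION ALL
    entries = " union all ".join(
        f"select {i} as i, {j} as j, Q{i}_{j} / N - M{i} * M{j} as v from T"
        for i in range(1, d + 1) for j in range(1, d + 1))
    C = [[0.0] * d for _ in range(d)]
    for i, j, v in conn.execute(entries):
        C[i - 1][j - 1] = v
    return C


# Hypothetical data: n = 3 points, d = 2 attributes
SAMPLE = [[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]]
c_la = covariance_la(SAMPLE)
c_ra = covariance_ra(SAMPLE)

if __name__ == "__main__":
    print(c_la)
    print(c_ra)
```

The generated SQL grows with d², which illustrates why the query is harder to express in RA than the three-line LA version.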

LARA: COMPREHENSIVE UNIFIED LINEAR AND RELATIONAL ALGEBRA

[Figure: Myria polystore architecture. Front end: MyriaL. Myria Middleware: Logical Algebra, connected by rewrite rules to Parallel Algebra, Array Algebra, Graph Algebra, KeyVal Algebra, and Serial Algebra. Back ends and their APIs: Spark (Spark API), Myria (MyriaAPI), CombBLAS (CombBLAS API), GEMS (GEMS API), Accumulo (Accumulo API), Serial C. Services: visualization, logging, discovery, history, browsing. Orchestration and Execution of the Polystore Plan.]

[Figure: the same architecture with LARA added: LARA Algebra, LARA API, LARA Physical Plans, LaraDB (Accumulo)]

Objects: Associative Tables

Total functions from keys to values with finite support

Example (attributes: keys k1, k2; values v1 with default value 0, v2 with default value ‘’; the stored rows are the support):

k1  k2  v1  v2
a   37  7   ‘dan’
a   20  0   ‘’
b   25  0   ‘dylan’
b   20  2   ‘bill’

Operators:

• Join (⋈ with ⊗): “horizontal concat”

• Union (with ⨁): “vertical concat”

• Extension (ext with f): “flatmap”

UDFs: ⊗, ⨁, f. Think “Semiring”

Join and Union adapted from: M. Spight and V. Tropashko. First steps in relational lattice. 2006.

Ext is a restricted form of monadic bind

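Join: Horizontal Concat

A (keys a, c; values x with default 0, z with default 0):

a   c   x   z
a1  c1  11  1
a1  c2  12  2
a2  c1  13  3
a3  c3  14  4

B (keys c, b; values z with default 0', y with default 0'):

c   b   z  y
c1  b1  5  15
c2  b1  6  16
c2  b2  7  17
c4  b1  8  18

A ⋈⊗ B (keys a, c, b; value z with default 0 ⊗ 0'):

a   c   b   z
a1  c1  b1  1 ⊗ 5
a1  c2  b1  2 ⊗ 6
a1  c2  b2  2 ⊗ 7
a2  c1  b1  3 ⊗ 5
(a3 c3 b1: 4 ⊗ 0' = 0 ⊗ 0')

Requires:

vA ⊗ 0' = 0 ⊗ vB = 0 ⊗ 0'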
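A minimal sketch of the join on the two example tables, assuming an associative table is represented as a Python dictionary from key tuples to value dictionaries, with rows outside the support simply absent. The representation and the `join` helper are illustrative assumptions, not the LARA API; ⊗ is instantiated as ordinary multiplication.

```python
import operator


def join(A, A_keys, B, B_keys, attr, otimes):
    """Match rows on the shared key attributes and combine the shared
    value attribute `attr` with otimes. Only stored rows are visited,
    which is safe when v (x) 0' = 0 (x) v = 0 (x) 0' as the slide requires."""
    shared = [k for k in A_keys if k in B_keys]
    out_keys = A_keys + [k for k in B_keys if k not in A_keys]
    out = {}
    for ka, va in A.items():
        ra = dict(zip(A_keys, ka))
        for kb, vb in B.items():
            rb = dict(zip(B_keys, kb))
            if all(ra[k] == rb[k] for k in shared):
                merged = {**ra, **rb}
                out[tuple(merged[k] for k in out_keys)] = otimes(va[attr], vb[attr])
    return out_keys, out


# The two tables from the slide
A = {("a1", "c1"): {"x": 11, "z": 1}, ("a1", "c2"): {"x": 12, "z": 2},
     ("a2", "c1"): {"x": 13, "z": 3}, ("a3", "c3"): {"x": 14, "z": 4}}
B = {("c1", "b1"): {"z": 5, "y": 15}, ("c2", "b1"): {"z": 6, "y": 16},
     ("c2", "b2"): {"z": 7, "y": 17}, ("c4", "b1"): {"z": 8, "y": 18}}

joined_keys, joined = join(A, ["a", "c"], B, ["c", "b"], "z", operator.mul)

if __name__ == "__main__":
    print(joined_keys)
    for key, value in sorted(joined.items()):
        print(key, value)
```

The row for a3 never appears because c3 has no partner in B, so its value would be the default.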

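Union: Vertical Concat

A (keys a, c; values x with default 0, z with default 0):

a   c   x   z
a1  c1  11  1
a1  c2  12  2
a2  c1  13  3
a3  c3  14  4

B (keys c, b; values z with default 0, y with default 0):

c   b   z  y
c1  b1  5  15
c2  b1  6  16
c2  b2  7  17
c4  b1  8  18

Union of A and B under ⨁ (key c; values x with default 0, z with default 0 ⨁ 0 = 0, y with default 0):

c   x        z          y
c1  11 ⨁ 13  1 ⨁ 3 ⨁ 5  15
c2  12       2 ⨁ 6 ⨁ 7  16 ⨁ 17
c3  14       4          0
c4  0        8          18

Requires:

v ⨁ 0 = 0 ⨁ v = v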
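A minimal sketch of the union on the same two tables, again assuming a dictionary representation in which rows and values outside the support are absent. The `union` helper is an illustrative assumption, not the LARA API; ⨁ is instantiated as ordinary addition with default 0.

```python
import operator


def union(A, A_keys, B, B_keys, oplus, default=0):
    """Keep only the shared key attributes and fold every value attribute
    with oplus. Starting each fold from `default` is safe when
    v (+) 0 = 0 (+) v = v, as the slide requires."""
    shared = [k for k in A_keys if k in B_keys]
    out = {}
    for table, keys in ((A, A_keys), (B, B_keys)):
        for key, values in table.items():
            row = dict(zip(keys, key))
            group = out.setdefault(tuple(row[k] for k in shared), {})
            for attr, v in values.items():
                group[attr] = oplus(group.get(attr, default), v)
    return shared, out


# The two tables from the slide
A = {("a1", "c1"): {"x": 11, "z": 1}, ("a1", "c2"): {"x": 12, "z": 2},
     ("a2", "c1"): {"x": 13, "z": 3}, ("a3", "c3"): {"x": 14, "z": 4}}
B = {("c1", "b1"): {"z": 5, "y": 15}, ("c2", "b1"): {"z": 6, "y": 16},
     ("c2", "b2"): {"z": 7, "y": 17}, ("c4", "b1"): {"z": 8, "y": 18}}

union_keys, unioned = union(A, ["a", "c"], B, ["c", "b"], operator.add)

if __name__ == "__main__":
    print(union_keys)
    for key, values in sorted(unioned.items()):
        print(key, values)
```

Attributes missing from a row in the output (for example y for c3) are read as the default value.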

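Ext: Flatmap

A (keys a, c; values x with default 0, z with default 0):

a   c   x   z
a1  c1  11  1
a1  c2  12  2
a2  c1  13  3

f(a, c, x, z) =

k'  v'
ac  x - z
ca  z - x

ext f A (keys a, c, k'; value v' with default 0 - 0 = 0):

a   c   k'    v'
a1  c1  a1c1  11 - 1
a1  c1  c1a1  1 - 11
a1  c2  a1c2  12 - 2
a1  c2  c2a1  2 - 12
a2  c1  a2c1  13 - 3
a2  c1  c1a2  3 - 13

Requires: [condition shown as an image on the slide; not recoverable from the text]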
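A minimal sketch of Ext as a flatmap over the example table, assuming the same dictionary representation of associative tables. The `ext` helper and the way f returns a small table as a dictionary are illustrative assumptions, not the LARA API.

```python
def ext(A, f):
    """Apply f to every stored row; f returns a small table keyed by k',
    and each of its rows is appended under the original key."""
    out = {}
    for key, values in A.items():
        for new_key, new_value in f(*key, **values).items():
            out[key + (new_key,)] = new_value
    return out


def f(a, c, x, z):
    # The slide's example function: two output rows per input row
    return {a + c: x - z, c + a: z - x}


# The input table from the slide
A = {("a1", "c1"): {"x": 11, "z": 1},
     ("a1", "c2"): {"x": 12, "z": 2},
     ("a2", "c1"): {"x": 13, "z": 3}}

extended = ext(A, f)

if __name__ == "__main__":
    for key, value in sorted(extended.items()):
        print(key, value)
```

Each input row produces two output rows, which is why the operator is described as a flatmap.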

Summary: Union, Join, Ext

                  Key Types      Value Types    Support
Union (A ⊕ B)     = K_A ∩ K_B    = V_A ∪ V_B    ⊆ S_A ∪ S_B
Join (A ⋈⊗ B)     = K_A ∪ K_B    = V_A ∩ V_B    ⊆ S_A ∩ S_B
Ext (ext f A)     extended by f  set by f       ⊆ S_A × S_f

For Support, ‘⊆’ becomes ‘=’ if ⊕ is zero-sum-free or ⊗ has zero-product-property

Duality

LARA Properties

> If ⨁ or ⊗ is associative, commutative, or idempotent, then so is Union or Join

> (Push Aggregation into Join) If ⊗ distributes over ⨁, … = sum(AB ⊗ CT)

> (Distribute Join over Union) If …, then …

RADISH: COMPILING QUERIES TO HPC ARCHITECTURES

Query compilation for distributed processing

[Crotty ’14, Li ’14, Seo ’14, Murray ’11]: pipeline → code per pipeline fragment → sequential compiler (one per fragment) → machine code

[Myers ’14]: pipeline as parallel code → parallel compiler → machine code

RADISH

ICS 16

Brandon Myers

1% selection microbenchmark, 20GB

Avoid long code paths

ICS 16

Brandon

Myers


Q2 SP2Bench, 100M triples, multiple self-joins

Communication optimization

ICS 16

Brandon

Myers

Graph Patterns

• SP2Bench, 100 million triples

• Queries compiled to a PGAS C++ language layer, then compiled again by a low-level PGAS compiler

• One of Myria’s supported back ends

• Comparison with Shark/Spark, which itself has been shown to be 100X faster than Hadoop-based systems

• …plus PageRank, Naïve Bayes, and more

RADISH

ICS 16

Brandon

Myers


ICS 15


Recap

• Productivity is the new performance

• …but this doesn’t mean give up on orders of magnitude performance difference by doing everything on one system

• Everything interesting is LA + RA

• There is no difference except syntax and systems

• We want to comprehensively optimize across them, generate code anywhere

Other Productivity Work

• Workload Analytics for SQL Data Lakes

– Shrainik Jain

• AI for Scientific Data Curation

– Maxim Grechkin, Hoifung Poon (MSR)

• Visualization Recommendation

– Kanit “Ham” Wongsuphasawat, Dom Moritz, Jeff Heer

• Information Extraction from Scientific Figures

– Poshen Lee, Sean Yang

• Scalable Approximate Community Detection

– Seung-Hee Bae (Western Michigan)

The SQLShare Corpus:

A multi-year log of hand-written SQL queries

Queries 24275

Views 4535

Tables 3891

Users 591

SIGMOD 2016

Shrainik Jain

https://uwescience.github.io/sqlshare

Workload Analytics for Data Lakes

https://uwescience.github.io/sqlshare/data_release.html

lifetime = days between first and last access of table
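The lifetime definition above can be computed directly from an access log. The sketch below assumes a log of (table name, access date) pairs; the log format, table names, and the `table_lifetimes` function are hypothetical and are not the SQLShare release schema.

```python
from datetime import date


def table_lifetimes(accesses):
    """Return days between the first and last access of each table."""
    first, last = {}, {}
    for table, day in accesses:
        first[table] = min(first.get(table, day), day)
        last[table] = max(last.get(table, day), day)
    return {table: (last[table] - first[table]).days for table in first}


# Hypothetical access log
LOG = [
    ("tides", date(2014, 3, 1)),
    ("tides", date(2014, 3, 4)),
    ("plankton", date(2014, 5, 10)),
]

lifetimes = table_lifetimes(LOG)

if __name__ == "__main__":
    print(lifetimes)
```

A table touched on a single day has a lifetime of zero under this definition.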


Data “Grazing”: Short dataset lifetimes

Key idea: Embed queries as vectors

• Learn query embeddings; use them for all workload analytics tasks:

– Query recommendation

– Workload summarization / index selection

– User behavior modeling

– Predicting heavy hitters

– Forensics

• Get rid of specialized feature engineering
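To illustrate what "queries as vectors" buys, the sketch below embeds SQL strings and compares them by cosine similarity. It deliberately swaps in a much simpler technique than the learned embeddings discussed in the talk: each query becomes a bag-of-tokens count vector. The tokenizer, the sample queries, and the function names are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def embed(sql):
    """Stand-in embedding: token counts (not a learned Doc2Vec vector)."""
    tokens = re.findall(r"[a-z_]+|[<>=*]", sql.lower())
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical workload: two instances of one template and an unrelated query
q1 = embed("select avg(v) from readings where c = 'temp'")
q2 = embed("select avg(v) from readings where c = 'hum'")
q3 = embed("select name from users")

if __name__ == "__main__":
    print(cosine(q1, q2), cosine(q1, q3))
```

Queries generated from the same template land close together, which is the property that recommendation, summarization, and similar tasks would rely on; a learned embedding would aim to capture this without hand-built tokenization rules.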

Doc2Vec on SQL

Can we recover known patterns in the workload?

TPC-H queries, generated with different parameters


Doc2Vec on Templatized Query Plans

Workload Summarization and Index Selection

DEEP CURATION FOR SCIENTIFIC DATA LAKES

Microarray experiments


Microarray samples submitted to the Gene Expression Omnibus

Curation is fast becoming the bottleneck to data sharing

Maxim Grechkin, Hoifung Poon


No growth in number of datasets used per paper!

Majority of samples are one-time-use only!

[Figure: color = labels supplied as metadata; clusters = first two PCA dimensions on the gene expression data itself]

Can we curate algorithmically?

The expression data and the text labels appear to disagree

[Figure: pipeline. Inputs: domain knowledge (ontology), expression data, free-text metadata. Components: 2 deep networks (text, expr) and an SVM. Output: better tissue type labels]

NIPS 18 (review)

Deep Curation

Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other’s results.

[Figure panels: free-text classifier; expression classifier]

Deep Curation: Our stuff wins, with ZERO training data

[Figure: curves for the state of the art, our reimplementation of the state of the art, and our “dueling pianos” NN; x-axis: amount of training data used]

Viziometrics: Analysis of Visualization in the Scientific Literature

[Figure: proportion of non-quantitative figures in paper vs. paper impact, grouped into 5% percentiles]

Poshen Lee

Voyager


Kanit “Ham” Wongsuphasawat Dominik Moritz

InfoVis 15

Scalable Graph Clustering

Seung-Hee Bae

• Version 1: Parallelize best-known serial algorithm (ICDM 2013)

• Version 2: Free 30% improvement for any algorithm (TKDD 2014)

• Version 3: Distributed approx. algorithm, 1.5B edges (SC 2015)

RESPONSIBLE DATA SCIENCE


ProPublica, May 2016

The Special Committee on Criminal Justice Reform's

hearing of reducing the pre-trial jail population.

Technical.ly, September 2016

Philadelphia is grappling with the prospect of a racist computer algorithm

Any background signal in the data of institutional racism is

amplified by the algorithm

operationalized by the algorithm

legitimized by the algorithm

“Should I be afraid of risk assessment tools?”

“No, you gotta tell me a lot more about yourself.

At what age were you first arrested?

What is the date of your most recent crime?”

“And what’s the culture of policing in the

neighborhood in which I grew up in?”


Amazon Prime Now Delivery Area: Atlanta Bloomberg, 2016


Amazon Prime Now Delivery Area: Boston Bloomberg, 2016


Amazon Prime Now Delivery Area: Chicago Bloomberg, 2016

First decade of Data Science research and practice:

What can we do with massive, noisy, heterogeneous datasets?

Next decade of Data Science research and practice:

What should we do with massive, noisy, heterogeneous datasets?

The way I think about this… (1)

The way I think about this… (2)

Decisions are based on two sources of information:

1. Past examples, e.g., “prior arrests tend to increase likelihood of future arrests”

2. Societal constraints, e.g., “we must avoid racial discrimination”

We’ve become very good at automating the use of past examples

We’ve only just started to think about incorporating societal constraints

The way I think about this… (3)

How do we apply societal constraints to algorithmic decision-making?

Option 1: Rely on human oversight

Ex: EU General Data Protection Regulation requires that a human be involved in legally binding algorithmic decision-making

Ex: Wisconsin Supreme Court says a human must review algorithmic decisions made by recidivism models

Issues with scalability, prejudice

Option 2: Build systems to help enforce these constraints

This is the approach we are exploring


The way I think about this… (4)

On transparency vs. accountability:

• For human decision-making, sometimes explanations are required, improving transparency

– Supreme Court decisions

– Employee reprimands/termination

• But when transparency is difficult, accountability takes over

– medical emergencies, business decisions

• As we shift decisions to algorithms, we lose both transparency AND accountability

• “The buck stops where?”


Fairness, Accountability, Transparency, Privacy, Reproducibility

Fides: A platform for responsible data science

joint with Stoyanovich [US], Abiteboul [FR], Miklau [US], Sahuguet [US], Weikum [DE]

Data Curation

novel features to support:

So what do we do about it?
