Approximation is Bliss - Yongjoo Park · 2020-06-07 · Resurgence of AQP research 0 20 40 60 80...

Approximate Computing in Database Systems

Yongjoo Park @Michigan

Approximation is Bliss

Approximate query processing is becoming more valuable

What is approximate query processing (AQP)?

Faster and Cheaper

[Figure: exact processing performs full I/O and full computation to produce an exact answer; approximate processing performs less I/O and less computation to produce an approximate answer]

[Figure: timeline of AQP research, 1985-2015. Labeled work: Approx count estimator, Double sampling, Online agg, AQUA, Ripple join, Dynamic sample, Opt stratified, DBO, STRAT, SMS, Bootstrap for AQP, SciBORQ, BlinkDB, MR Online, COSMOS, QuickR, Wander join]

35 years of research

AQP research has a long history

Resurgence of AQP research

[Figure: two timelines, 1998-2018. AQP research: Acharya (Aqua), Hellerstein (OLA), Jermaine (DBO), Cormode (Synopses). Big data systems/vendors: Hadoop, Cloudera, Hortonworks, Redshift]

Today’s big data ecosystems

[Figure: ecosystem components grouped as FLOW, STORE, FORMAT, COMPUTE, VIZ]

Can process a large volume of data

Slow (esp. for ad-hoc queries)

Costly

Big data analytics is slow

One of the biggest customer science companies in the UK

One of the largest retail corporations

Collects 70GB+ data/day

Ad-hoc queries with customer demographic filters

Basic statistics + ML

A location intelligence company

Billions of GPS points

Real-time responses required for its web-interface

Using commercial clusters (from MapR, Amazon, ...): 10-20 minute query latencies

Big data analytics is costly

$450K/month (pay per query)

The cost increases with more data & queries

Case: 80 GB/day, one-year data retention, 1000 queries/day

$48K/month (pay per node): HIGH

Big data: too much cost for its value?

“We generate woefully low amounts of value relative to the amount spent.”

Jesse Anderson, Director of Big Data Institute
https://www.jesse-anderson.com/2019/06/i-come-not-to-bury-cloudera-but-to-praise-it/

65% of Cloudera is gone in 4 months

Approximation is bliss

100x faster or cheaper by sacrificing 0.1% accuracy

$\mathrm{err} = f\left(\frac{1}{n} - \frac{1}{N}\right) \le f\left(\frac{1}{n}\right)$
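The slide does not define f. One common reading, assumed here, is the standard error of a sample mean under simple random sampling without replacement, which depends on 1/n − 1/N and is bounded by a term that depends only on the sample size n. A minimal sketch on synthetic data (the population values are made up for illustration):

```python
import math
import random
import statistics

def srs_standard_error(sample, population_size):
    """Standard error of the sample mean under simple random sampling
    without replacement: sqrt(S^2 * (1/n - 1/N))."""
    n = len(sample)
    s2 = statistics.variance(sample)  # unbiased sample variance S^2
    return math.sqrt(s2 * (1.0 / n - 1.0 / population_size))

def upper_bound(sample):
    """Bound that ignores the population size N: sqrt(S^2 / n)."""
    return math.sqrt(statistics.variance(sample) / len(sample))

random.seed(0)
# Synthetic population (assumption: roughly bell-shaped values around 80).
population = [random.gauss(80.0, 5.0) for _ in range(100_000)]
sample = random.sample(population, 1_000)

se = srs_standard_error(sample, len(population))
bound = upper_bound(sample)
print(f"estimate={statistics.fmean(sample):.3f} se={se:.4f} bound={bound:.4f}")
```

Under this reading the bound does not grow with N, so a fixed-size sample keeps roughly the same error as the data grows.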

Faster: less I/O, less computation

Cheaper: same latency with less resource

AQP can produce indistinguishable results

EXACT: 2 billion points, took 71 mins

APPROXIMATE: 1 million points, took 3 secs

[Park et al. ICDE’16]

Our contributions

1. 35 years of research, little industry adoption
Our effort: Universal AQP [Park et al. SIGMOD’18]

2. Limited to simple aggregation (sum, avg, count, count-distinct)
Our effort: AQP for ML [Park et al. SIGMOD’19]

<Universal AQP>...

Typical AQP systems

[Aqua ’99, JoinSynopses ’99, BlinkDB ’13, WanderJoin ’16, Quickr ’16, and more]

[Figure: a user/app sends avg(sensor_temp) to the AQP system, which returns answers at increasing accuracy: 82.3 ± 0.1, 82.28 ± 0.01, 82.2873]
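The shrinking error bars in the figure can be illustrated with a running average over randomly ordered rows. This is a sketch in the spirit of online aggregation, not the implementation of any system cited above; the data is synthetic, `progressive_avg` is a hypothetical helper, and the interval uses a plain normal approximation without a finite-population correction.

```python
import math
import random

def progressive_avg(rows, checkpoints, z=1.96):
    """Yield (rows_seen, running_avg, half_width) at each checkpoint.
    Rows must already be in random order for the interval to be meaningful."""
    count, mean, m2 = 0, 0.0, 0.0
    targets = set(checkpoints)
    for value in rows:
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)  # Welford's running sum of squares
        if count in targets:
            variance = m2 / (count - 1) if count > 1 else 0.0
            yield count, mean, z * math.sqrt(variance / count)

random.seed(1)
# Synthetic sensor readings (assumption: centered near 82.3).
sensor_temp = [random.gauss(82.3, 4.0) for _ in range(200_000)]
random.shuffle(sensor_temp)

results = list(progressive_avg(sensor_temp, [1_000, 100_000, 200_000]))
for seen, avg, half_width in results:
    print(f"{seen:>7} rows: {avg:.3f} ± {half_width:.3f}")
```

Each checkpoint reports a tighter interval than the previous one, so a user can stop as soon as the answer is accurate enough.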

Barriers in adopting AQP

Vendor resistance: AQP requires significant changes to DBMS internals
[BlinkDB ’13, G-OLA ’15, Join Synopses ’99, WanderJoin ’16, ABM ’14, ...]
• Newer SQL-on-Hadoop: busy catching up on standard features
• Traditional DBMS: stable codebases, reluctance to major changes

User resistance:
• users don’t typically abandon their existing systems
• vendor lock-in makes data migration almost impossible

One example of user resistance

“Since our organization is huge, it would be difficult to ask infrastructure team to apply patch on spark for supporting BlinkDB.”

Giridhar Addepalli, Walmart
https://github.com/sameeragarwal/blinkdb/issues/14

Our Approach: Universal AQP

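[Figure: the original query is rewritten before reaching the SQL engine; the engine returns the exact answer to the rewritten query, which becomes the approximate answer to the original query (+ error bound)]

The rewritten query runs faster because it uses a sample table instead of the original table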
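A toy illustration of the rewriting idea, using SQLite as a stand-in for the SQL engine. The table names `orders` and `orders_sample`, the 1% ratio, and the `rewrite` helper are all hypothetical; this is not VerdictDB's rewriting logic, and it omits the error bound.

```python
import random
import sqlite3

random.seed(42)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (price REAL)")
conn.executemany("INSERT INTO orders VALUES (?)",
                 [(random.uniform(1, 100),) for _ in range(100_000)])

# Offline step: materialize a ~1% uniform sample as an ordinary table.
SAMPLING_RATIO = 0.01
conn.execute("CREATE TABLE orders_sample (price REAL)")
rows = conn.execute("SELECT price FROM orders").fetchall()
conn.executemany("INSERT INTO orders_sample VALUES (?)",
                 [row for row in rows if random.random() < SAMPLING_RATIO])

def rewrite(query: str) -> str:
    """Toy rewriter: handles only 'SELECT sum(price) FROM orders'."""
    if query.strip().lower() != "select sum(price) from orders":
        raise ValueError("unsupported query in this sketch")
    # Scale the sample sum by 1/ratio to estimate the full-table sum.
    return f"SELECT sum(price) / {SAMPLING_RATIO} FROM orders_sample"

original = "SELECT sum(price) FROM orders"
exact = conn.execute(original).fetchone()[0]
approx = conn.execute(rewrite(original)).fetchone()[0]
rel_err = abs(approx - exact) / exact
print(f"exact={exact:.0f} approx={approx:.0f} rel_err={rel_err:.2%}")
```

The engine only ever sees standard SQL, which is why this style of AQP needs no changes to the database itself.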

Universal AQP: criteria

• Consistency: we focus on append-only data
• Accuracy guarantee: new error estimation logic
• Efficiency: comparable to built-in AQP

VerdictDB: https://github.com/mozafari/verdictdb

VerdictDB: offers large speedups

Datasets:
• 500GB TPC-H benchmark
• 200GB Instacart dataset

Workloads:
• TPC-H, microbenchmark

Data Systems:
• Redshift, Spark SQL, Impala
• 10+1 r4.xlarge cluster

[Figure: speedups on Redshift, on Spark SQL, and on Impala; labeled values 24.0×, 12.0×, 18.6×]

Including all overhead

Used columnar formats for all systems

2% relative errors

VerdictDB: comparable to built-in AQP

Datasets:
• 200GB Instacart dataset

Workloads:
• microbenchmark

Data Systems:
• Spark SQL
• 10+1 r4.xlarge cluster

[Figure: query latency. Original Spark: 17.3 sec; built-in AQP system: 0.70 sec; ours: 0.97 sec]

Built-in AQP system: SnappyData (a commercial version of BlinkDB)

Remember: it requires a wholesale replacement of your database systems

...</Universal AQP>

<AQP for ML>...

Machine learning can be time-consuming

More data, slower training

We train multiple models
• new training data
• feature engineering

[Anderson, CIDR’13]

[Figure: training time vs. number of examples; Criteo dataset, logistic regression (L-BFGS)]

Number of examples    Training time
50K                   79 s
100K                  95 s
500K                  166 s
1M                    263 s
10M                   1735 s
37M                   5727 s

Sampling may accelerate training

Training: iterative gradient computation (until convergence)
$\theta_{t+1} = \theta_t - \alpha \cdot \mathrm{grad}(\theta_t)$

Sampling: gradient computation becomes faster
$\mathrm{grad}(\theta_t) = \frac{1}{N} \sum_{i=1}^{N} f(x_i \mid \theta_t)$

Benefits if (savings from sampling) > (increase in # of iterations)

1. Ad-hoc approach: no accuracy guarantee
2. Biased sampling: not efficient for feature engineering
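The update rule and the sampled gradient can be sketched with plain gradient descent for logistic regression. This is mini-batch gradient descent on synthetic data, not BlinkML's algorithm; `TRUE_THETA`, the step size, the batch size, and the fixed iteration budget (instead of a convergence test) are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_gradient(theta, examples):
    """(1/len(examples)) * sum_i f(x_i | theta), where f is the per-example
    gradient of the logistic-regression negative log-likelihood."""
    grad = [0.0] * len(theta)
    for x, y in examples:
        residual = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
        for j, xi in enumerate(x):
            grad[j] += residual * xi
    return [g / len(examples) for g in grad]

def train(data, alpha=1.0, iterations=100, batch_size=None):
    """theta_{t+1} = theta_t - alpha * grad(theta_t), fixed iteration budget.
    With batch_size set, each gradient uses a fresh uniform sample."""
    theta = [0.0] * len(data[0][0])
    for _ in range(iterations):
        batch = data if batch_size is None else random.sample(data, batch_size)
        grad = mean_gradient(theta, batch)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

random.seed(7)
TRUE_THETA = [1.5, -2.0]  # synthetic ground truth, illustration only

def make_example():
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    p = sigmoid(sum(t * xi for t, xi in zip(TRUE_THETA, x)))
    return x, (1 if random.random() < p else 0)

training_set = [make_example() for _ in range(5_000)]  # size N
theta_full = train(training_set)                       # full gradient
theta_sampled = train(training_set, batch_size=500)    # 10x cheaper per step
print("full:   ", [round(t, 3) for t in theta_full])
print("sampled:", [round(t, 3) for t in theta_sampled])
```

Each sampled step touches a tenth of the data, but the gradients are noisy, which is the trade-off the slide describes: sampling pays off only if the per-step savings outweigh any extra iterations.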

BlinkML: uniform sampling w/ accuracy guarantee

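[Figure: BlinkML automatically draws a sample (size n) from the training set (size N); the ML trainer produces approx(x) from the sample, whose predictions are consistent with those of full(x), the model trained on the full set]

Consistency is expressed as $\mathbb{E}_x[\mathrm{full}(x) \ne \mathrm{approx}(x)] < \varepsilon$ with high probability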
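To make the consistency criterion concrete, the disagreement rate between two models can be estimated empirically on held-out points. This is only an after-the-fact check with made-up parameters; the point of BlinkML, per the slide, is to guarantee the bound without training the full model. `predict`, `disagreement_rate`, and both parameter vectors are hypothetical.

```python
import random

def predict(theta, x):
    """Binary prediction of a linear classifier."""
    return 1 if sum(t * xi for t, xi in zip(theta, x)) >= 0 else 0

def disagreement_rate(theta_full, theta_approx, holdout):
    """Empirical estimate of E_x[full(x) != approx(x)]."""
    differing = sum(predict(theta_full, x) != predict(theta_approx, x)
                    for x in holdout)
    return differing / len(holdout)

random.seed(3)
holdout = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(10_000)]
theta_full = [1.50, -2.00]    # hypothetical full-model parameters
theta_approx = [1.45, -2.05]  # hypothetical sample-trained parameters
epsilon = 0.05                # requested bound on disagreement

rate = disagreement_rate(theta_full, theta_approx, holdout)
print(f"disagreement={rate:.4f}, within epsilon: {rate < epsilon}")
```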

BlinkML supports commonly-used models

Supports convex MLE models:
• linear regression
• logistic regression
• probabilistic PCA
• generalized linear models

https://www.kaggle.com/surveys/2017/

Accuracy guarantee exploits the property of MLE models:
$\mathrm{grad}(\theta_{opt}) = \frac{1}{N} \sum_{i=1}^{N} f(x_i \mid \theta_{opt}) = 0$
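The zero-gradient property can be checked numerically on the simplest MLE model, a Gaussian with unknown mean, where the optimum has a closed form. The model choice and the synthetic data are assumptions for illustration; the slide's f is taken to be the per-example gradient of the negative log-likelihood.

```python
import random
import statistics

def mean_score(theta, xs, sigma=1.0):
    """(1/N) * sum_i f(x_i | theta), where f is the per-example gradient of
    the negative log-likelihood of a Gaussian with unknown mean theta."""
    return sum((theta - x) / sigma**2 for x in xs) / len(xs)

random.seed(5)
xs = [random.gauss(10.0, 1.0) for _ in range(10_000)]
theta_opt = statistics.fmean(xs)  # closed-form MLE of the mean

grad_at_opt = mean_score(theta_opt, xs)
grad_off_opt = mean_score(theta_opt + 0.5, xs)
print(f"grad at theta_opt       = {grad_at_opt:.2e}")
print(f"grad at theta_opt + 0.5 = {grad_off_opt:.2e}")
```

The averaged gradient vanishes at the optimum (up to floating-point error) and moves away from zero as theta moves away from it.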

BlinkML introduces computational optimization

BlinkML offers large speedups

[Figure: speedup vs. requested accuracy (90%, 95%, 96%, 97%, 98%, 99%); logistic regression, 46M examples; labeled speedups include 73×, 66×, 60×, 50×, and 25×]

Datasets:
• Size: 2.86 GB on disk
• # of features: 998K

Systems:
• Optimization: SciPy
• 5+1 m5.2xlarge

...</AQP for ML>

Summary

1. AQP: becoming more valuable

2. VerdictDB: enables AQP on any platform

3. BlinkML: trains MLE models with bounded errors

Thank you!
