Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner,...

46
Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica

Transcript of Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner,...

Page 1: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Approximate Queries on Very Large Data

UC Berkeley

Sameer AgarwalJoint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica

Page 2: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Our GoalSupport interactive SQL-like aggregate queries over massive sets of data

Page 3: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Our GoalSupport interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT AVG(jobtime)

FROM very_big_log AVG, COUNT, SUM, STDEV, PERCENTILE

etc.

Page 4: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT AVG(jobtime)

FROM very_big_log

WHERE src = ‘hadoop’

FILTERS, GROUP BY clauses

Our Goal

Page 5: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT AVG(jobtime)

FROM very_big_log

WHERE src = ‘hadoop’ LEFT OUTER JOIN logs2

ON very_big_log.id = logs.id

JOINS, Nested Queries etc.

Our Goal

Page 6: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT my_function(jobtime)

FROM very_big_log

WHERE src = ‘hadoop’ LEFT OUTER JOIN logs2

ON very_big_log.id = logs.id

ML Primitives,User Defined Functions

Our Goal

Page 7: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Hard Disks

½ - 1 Hour 1 - 5 Minutes 1 second

?Memory

100 TB on 1000 machines

Query Execution on Samples

Page 8: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

ID

City Salary

1 NYC 50,000

2 NYC 62,492

3 Berkeley 78,212

4 NYC 120,242

5 NYC 98,341

6 Berkeley 75,453

7 NYC 60,000

8 NYC 72,492

9 Berkeley 88,212

10

Berkeley 92,242

11

NYC 70,000

12

Berkeley 102,492

Query Execution on SamplesWhat is the average Salary of all the people in the table?

$80,848

Page 9: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

ID

City Salary

1 NYC 50,000

2 NYC 62,492

3 Berkeley 78,212

4 NYC 120,242

5 NYC 98,341

6 Berkeley 75,453

7 NYC 60,000

8 NYC 72,492

9 Berkeley 88,212

10

Berkeley 92,242

11

NYC 70,000

12

Berkeley 102,492

Query Execution on SamplesWhat is the average Salary of all the people in the table?

ID City Salary

Sampling Rate

2 NYC 62,492 1/4

6 Berkeley

75,453 1/4

8 NYC 72,492 1/4

UniformSample

$70,145$80,848

Page 10: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

ID

City Salary

1 NYC 50,000

2 NYC 62,492

3 Berkeley 78,212

4 NYC 120,242

5 NYC 98,341

6 Berkeley 75,453

7 NYC 60,000

8 NYC 72,492

9 Berkeley 88,212

10

Berkeley 92,242

11

NYC 70,000

12

Berkeley 102,492

Query Execution on SamplesWhat is the average Salary of all the people in the table?

ID City Salary

Sampling Rate

2 NYC 62,492 1/4

6 Berkeley

75,453 1/4

8 NYC 72,492 1/4

UniformSample

$70,145 +/- 10,815

$80,848

Page 11: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

ID

City Salary

1 NYC 50,000

2 NYC 62,492

3 Berkeley 78,212

4 NYC 120,242

5 NYC 98,341

6 Berkeley 75,453

7 NYC 60,000

8 NYC 72,492

9 Berkeley 88,212

10

Berkeley 92,242

11

NYC 70,000

12

Berkeley 102,492

Query Execution on SamplesWhat is the average Salary of all the people in the table?ID City Salar

ySampling Rate

2 NYC 62,492 1/2

3 Berkeley

78,212 1/2

5 NYC 60,000 1/2

6 Berkeley

75,453 1/2

8 NYC 72,492 1/2

12 Berkeley

102,492

1/2

UniformSample

$75,190 +/- 5,895

$80,848$70,145 +/- 10,815

Page 12: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Speed/Accuracy Trade-off

Execution Time

Erro

r

30 mins

Time to Execute on

Entire Dataset

InteractiveQueries

5 sec

Page 13: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Execution Time

Erro

r

30 mins

Time to Execute on

Entire Dataset

InteractiveQueries

5 sec

Speed/Accuracy Trade-off

Pre-ExistingNoise

Page 14: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

What is BlinkDB?A data analysis (warehouse) system that …

- builds on Shark and Spark

- returns fast, approximate answers with error bars by executing queries on small samples of data

- is compatible with Apache Hive (storage, serdes, UDFs, types, metadata) and supports Hive’s SQL-like query structure with minor modifications

Page 15: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Sampling Vs. No Sampling

0100200300400500600700800900

1000

1 10-1 10-2 10-3 10-4 10-5

Fraction of full data

Que

ry R

espo

nse

Tim

e (S

econ

ds)

103

1020

18 13 10 8

10x as response timeis dominated by I/O

Page 16: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Sampling Vs. No Sampling

0100200300400500600700800900

1000

1 10-1 10-2 10-3 10-4 10-5

Fraction of full data

Que

ry R

espo

nse

Tim

e (S

econ

ds)

103

1020

18 13 10 8

(0.02%)(0.07%) (1.1%) (3.4%) (11%)

Error Bars

Page 17: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Hive Architecture

Hadoop Storage (e.g., HDFS, HBase)

Metastore

MapReduce

SQL Parser

Query Optimize

r

Physical Plan

SerDes, UDFs

Execution

Driver

Command-line Shell Thrift/JDBC

Page 18: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Shark Architecture

Hadoop Storage (e.g., HDFS, HBase)

Metastore

Spark

SQL Parser

Query Optimize

r

Physical Plan

SerDes, UDFs

Execution

Driver

Command-line Shell Thrift/JDBC

Page 19: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

BlinkDB Architecture

Hadoop Storage (e.g., HDFS, HBase)

Metastore

Spark

SQL Parser

Query Optimize

r

Physical Plan

SerDes, UDFs

Execution

Driver

Command-line Shell Thrift/JDBC

Page 20: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

BlinkDB alpha-0.1.0

1. Released and available at http://blinkdb.org

2. Allows you to create random and stratified samples on native tables and materialized views

3. Adds approximate aggregate functions with statistical closed forms to HiveQL : approx_avg(), approx_sum(), approx_count() etc.

Page 21: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Example: Preparing the Datablinkdb>

Page 22: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

blinkdb> create external table logs (dt string, event string, bytes int) row format delimited fields terminated by ' ' location ’/tmp/logs’;

Referencing an external table logs in BlinkDB

Example: Preparing the Data

Page 23: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

blinkdb> create external table logs (dt string, event string, bytes int) row format delimited fields terminated by ' ' location ’/tmp/logs';

blinkdb> create table logs_sample as select * from logs samplewith 0.01;

Create a 1% random sample logs_sample from logs

Example: Preparing the Data

Page 24: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

blinkdb> create external table logs (dt string, event string, bytes int) row format delimited fields terminated by ' ' location ’/tmp/logs';

blinkdb> create table logs_sample as select * from logs samplewith 0.01;

blinkdb> create table logs_sample_cached as select * from logs_sample;

Supports all Shark primitives for caching samples in memory

Example: Preparing the Data

Page 25: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

blinkdb> set blinkdb.sample.size=32810

blinkdb> set blinkdb.dataset.size=3198910

Giving BlinkDB information about the size of sample you wish to operate on and the size of the original dataset

Example: Analyzing the Data

Page 26: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

blinkdb> set blinkdb.sample.size=32810

blinkdb> set blinkdb.dataset.size=3198910

blinkdb> select approx_count(1) from logs_sample_cached where event = “foo”;

Example: Analyzing the Data

Prefixing approx_ to an aggregate operator tells BlinkDB to return an approximate answer

Page 27: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

blinkdb> set blinkdb.sample.size=32810

blinkdb> set blinkdb.dataset.size=3198910

blinkdb> select approx_count(1) from logs_sample_cached where event = “foo”;

12810132 +/- 3423 (99% Confidence)

Example: Analyzing the Data

Returns an approximate answer with an error bar and confidence interval

Page 28: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

blinkdb> create table logs_sample as select * from [any subquery] samplewith 0.01;

Example: There’s more!

The sample operator can be anywhere in the query graph

Page 29: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

blinkdb> create table logs_sample as select * from [any subquery] samplewith 0.01;

blinkdb> select approx_count(1) from logs_sample_cached where event = “foo” GROUP BY dt ORDER BY dt;

Example: There’s more!

Retains remaining Hive Query Structure

Page 30: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

blinkdb> create table logs_sample as select * from [any subquery] samplewith 0.01;

blinkdb> select approx_count(1) from logs_sample_cached where event = “foo” GROUP BY dt ORDER BY dt;

12810132 +/- 3423 (99% Confidence)

Example: There’s more!

Note: The output is a String

Page 31: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

Feature Roadmap1. Integrating BlinkDB with Shark as an

experimental feature (coming soon!)

2. Automatic Sample Management

3. More Hive Aggregates, UDAF Support

4. Runtime Correctness Tests

Page 32: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’WITHIN 1 SECONDS 234.23 ± 15.32

Automatic Sample Management

Goal: The API should abstract the details of creating, deleting and managing samples from the user

Page 33: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’WITHIN 2 SECONDS 234.23 ± 15.32

Automatic Sample Management

239.46 ± 4.96

Goal: The API should abstract the details of creating, deleting and managing samples from the user

Page 34: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ERROR 0.1 CONFIDENCE 95.0%

Automatic Sample Management

Goal: The API should abstract the details of creating, deleting and managing samples from the user

Page 35: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

TABLE

Sam

plin

g M

odul

e

Original Data

Offline-sampling: Creates an optimal set of samples on native tables and materialized views based on query history and workload characteristics

Automatic Sample Management

Page 36: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

TABLE

Sam

plin

g M

odul

e

In-MemorySamples

On-DiskSamples

Original Data

Sample Placement: Samples striped over 100s or 1,000s of machines both on disks and in-memory.

Automatic Sample Management

Page 37: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

SELECT foo (*)FROM TABLE

WITHIN 2

Query Plan

HiveQL/SQLQuery

Sample Selection

TABLE

Sam

plin

g M

odul

e

In-MemorySamples

On-DiskSamples

Original Data

Automatic Sample Management

Page 38: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

SELECT foo (*)FROM TABLE

WITHIN 2

Query Plan

HiveQL/SQLQuery

Sample Selection

TABLE

Sam

plin

g M

odul

e

In-MemorySamples

On-DiskSamples

Original Data

Online sample selection to pick best sample(s) based on query latency and accuracy requirements

Automatic Sample Management

Page 39: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

TABLE

Sam

plin

g M

odul

e

In-MemorySamples

On-DiskSamples

Original Data

Shark

SELECT foo (*)FROM TABLE

WITHIN 2

New Query Plan

HiveQL/SQLQuery

Sample Selection

Error Bars & Confidence Intervals

Result182.23 ± 5.56

(95% confidence)

Parallel query execution on multiple samples striped across multiple machines.

Automatic Sample Management

Page 40: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

1. Using Bootstrap to estimate error

More Aggregates/ UDAFs Support

Sample

A

Page 41: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

1. Using Bootstrap to estimate error

More Aggregates/ UDAFs Support

Sample

A A1 A2 An

Bootstrap Operator

Page 42: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

1. Using Bootstrap to estimate error

More Aggregates/ UDAFs Support

Sample

A A1 A2 An

Placement of the Bootstrap Operator in the query graph is critical to performance

Page 43: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

1. Using Bootstrap to estimate error

More Aggregates/ UDAFs Support

Sample

A A1 A2 An

However, the bootstrap can fail

Page 44: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

1. Given a query,how do you know if it can be approximated at runtime?- Depends on the query, data distribution, and

sample size

2. Need for runtime diagnosis tests- Check whether error improves as sample size

increases- 30,000 extremely small query tasks

Runtime Correctness Tests

Page 45: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

1. BlinkDB alpha-0.1.0 released and available at http://blinkdb.org

2. Takes just 5-10 minutes to run it locally or to spin an EC2 cluster

3. Hands-on Exercises today at the AMPCamp

4. Designed to be a drop-in tool like Shark

Getting Started

Page 46: Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.

1. Approximate queries is an important means to achieve interactivity in processing large datasets

2. BlinkDB..- builds on Shark and Spark- approximate answers with error bars by executing queries on

small samples of data- supports existing Hive Query with minor modifications

3. For more information, please check out our EuroSys 2013 (http://bit.ly/blinkdb-1) and KDD 2014 (http://bit.ly/blinkdb-2) papers

Summary

Thanks!