1

Making Every Bit Count in Wide Area Analytics

Ariel Rabkin

Joint work with: Matvey Arye, Siddhartha Sen, Michael J. Freedman, and Vivek Pai

2

Global Systems Have Global Data

3

The Rise of Big Distributed Data

• CDNs:
  – Akamai has ~20 million requests per second
  – CloudFlare has about 300 MB/s of logs, volume doubles every 4 months

• Sensor data (e.g., power grid, highways)

• Smart camera networks

4

Trends

[Figure: amount per dollar over time, with curves labeled Data Volumes and Wide-area Bandwidth]

5

Analyzing Low-rate Events is Easy

Server Crashed!

Alert me when server crashes!

6

High-rate Events can be Costly

Every minute, compute request counts by URL

[Figure: two streams of Requests]

7

Backhaul has Bad Dynamics

Example: backhaul a count of events every 5 minutes

Choice of summaries is made upfront, statically

• Buyer’s remorse: chose to collect unnecessary and expensive data

• Analyst’s remorse: summaries insufficient for analysis, with no way to retroactively get more data

8

Local Storage!

Every minute, compute request counts by URL

[Figure: two streams of Requests, each feeding a box labeled Local Aggregation and Storage]

9

Challenge: Bandwidth Scarcity

I want the request count for every URL every second

I can’t do that, Ari. That costs 100 MB/sec. You only have 12 MB/sec. Want to impose a rank cutoff, value cutoff, or change frequency?

Can I get the top 1000 URLs every second?

I can do that for 900 KB/sec.

Great, do it!
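
The exchange above amounts to a cost estimate followed by a rank cutoff. A minimal sketch of that arithmetic, not the system's actual estimator: the function names are hypothetical, and the fixed row size of 900 bytes is an assumption (it is what 900 KB/sec for 1000 URLs per second would imply; the slide does not state a row size).

```python
import heapq

# Assumption: a fixed per-row size, implied by 900 KB/sec for 1000 rows/sec.
BYTES_PER_ROW = 900


def cost_bytes_per_sec(rows_per_report, period_sec, bytes_per_row=BYTES_PER_ROW):
    """Estimated bandwidth of sending `rows_per_report` rows every `period_sec`."""
    return rows_per_report * bytes_per_row / period_sec


def fits(rows_per_report, period_sec, budget_bytes_per_sec):
    """True if the estimated cost is within the available bandwidth."""
    return cost_bytes_per_sec(rows_per_report, period_sec) <= budget_bytes_per_sec


def top_k(counts, k):
    """Rank cutoff: keep only the k URLs with the highest counts."""
    return dict(heapq.nlargest(k, counts.items(), key=lambda kv: kv[1]))
```

Under these assumptions, `fits(1000, 1, 12_000_000)` holds, which matches the slide's "top 1000 URLs every second" fitting inside a 12 MB/sec link.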

10

Challenge: Varying Scarcity

[Figure: Bandwidth over Time, with curves labeled Needed and Available; question marks annotate the plot]

Can do

First aggregate over longer time periods, up to 30 seconds. Then only keep the top URLs.
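
The two-step policy above can be sketched as a small function. This is an illustration under stated assumptions, not the system's policy language: `choose_degradation` and its parameters are hypothetical, the 900-byte row size is assumed, and the number of distinct URLs is treated as independent of the window length, which is a simplification.

```python
def choose_degradation(distinct_urls, budget_bytes_per_sec,
                       bytes_per_row=900, max_window_sec=30):
    """Return (window_sec, rows_to_send) that fits the bandwidth budget.

    Step 1: lengthen the aggregation window, up to max_window_sec.
    Step 2: if that is still not enough, keep only the top rows.
    """
    for window in range(1, max_window_sec + 1):
        # One report of `distinct_urls` rows is sent per window.
        if distinct_urls * bytes_per_row / window <= budget_bytes_per_sec:
            return window, distinct_urls
    # Longest window is still too costly: apply a rank cutoff.
    rows = int(budget_bytes_per_sec * max_window_sec // bytes_per_row)
    return max_window_sec, min(rows, distinct_urls)
```

With a 900 KB/sec budget, 10,000 URLs fit once the window reaches 10 seconds; 100,000 URLs need the full 30-second window plus a cutoff to the top 30,000 rows.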

12

Data Processing Requirements

• Aggregatable

• Merge-able: Data + Data = Merged Representation

• Reducible

[Diagram labels: Data, Data; Stored Data += Update]
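
One way to read the three requirements is as three operations on a summary. A minimal sketch using per-URL counts: the function names are hypothetical, and treating "reducible" as "can be shrunk to a smaller summary" is an interpretation of the slide, not something it spells out.

```python
from collections import Counter


def update(stored, url, n=1):
    """Aggregatable: fold a new observation into stored data in place."""
    stored[url] += n


def merge(a, b):
    """Merge-able: combine two summaries into one of the same form."""
    return a + b


def reduce_to_top(data, k):
    """Reducible: shrink a summary, here by keeping the k largest counts."""
    return Counter(dict(data.most_common(k)))


site_a = Counter()
update(site_a, "/x")
update(site_a, "/x")
update(site_a, "/y")
site_b = Counter({"/x": 1, "/z": 5})

merged = merge(site_a, site_b)       # counts from both sites
reduced = reduce_to_top(merged, 2)   # smaller summary for a constrained link
```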

13

                                     High-level API   Merge + Aggregate   Predictable performance   Arbitrary Joins
Raw byte strings (e.g. MapReduce)    X                X                   √                         X
Database tables                      √                X                   X                         √

14

The Data Cube Model

Counts by URL       12:00   12:01   12:02
www.mysite.com      3       5       …
www.yoursite.com    5       4       …
www.hersite.com     8       12      …

Roll-up of mysite.com by time from 12:00 to 12:01: 8

Roll-up of sites at time 12:00: 16

Cube: A multidimensional array, with one or more aggregates, indexed by a set of dimensions

Aggregation function used for:
• Updates
• Roll-ups
• Merging cubes
• Degrading cubes
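
The roll-ups in the table can be reproduced with a toy cube. This is a sketch for illustration only: the `Cube` class and its methods are hypothetical, with a single aggregate (a sum) indexed by two dimensions, URL and time.

```python
from collections import defaultdict


class Cube:
    """Toy cube: one aggregate (a sum) indexed by (url, time)."""

    def __init__(self):
        self.cells = defaultdict(int)

    def update(self, url, time, count=1):
        """Update: add a count to one cell."""
        self.cells[(url, time)] += count

    def rollup(self, url=None, times=None):
        """Roll-up: sum cells, optionally restricted to a URL and/or times."""
        return sum(v for (u, t), v in self.cells.items()
                   if (url is None or u == url)
                   and (times is None or t in times))

    def merge(self, other):
        """Merge: cell-wise sum of two cubes, returned as a new cube."""
        merged = Cube()
        for cube in (self, other):
            for key, v in cube.cells.items():
                merged.cells[key] += v
        return merged


cube = Cube()
for url, row in {"www.mysite.com": (3, 5),
                 "www.yoursite.com": (5, 4),
                 "www.hersite.com": (8, 12)}.items():
    for time, count in zip(("12:00", "12:01"), row):
        cube.update(url, time, count)

mysite_total = cube.rollup(url="www.mysite.com", times={"12:00", "12:01"})  # 3 + 5
noon_total = cube.rollup(times={"12:00"})                                   # 3 + 5 + 8
```

The same sum serves for updates, roll-ups, and merges, which is the point of the slide's last list.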

15

                                     High-level API   Merge + Aggregate   Predictable performance   Arbitrary Joins
Raw byte strings (e.g. MapReduce)    X                X                   √                         X
Database tables                      √                X                   X                         √
Data Cube                            √                √                   √                         X

16

A Vision for Wide-Area Analytics

[Figure: boxes labeled Dataflow Operators and Local Cube at each of two sites, a Network bottleneck, and boxes labeled Dataflow Operators and Merged Cube on the other side]

Dataflow adapted to bandwidth

17

Adaptivity

[Figure: boxes labeled Dataflow Operators, Local Cube, and Dataflow Operators, followed by a Network bottleneck]

18

Adaptivity

[Figure: boxes labeled Dataflow Operators, Local Cube, Dataflow Operators, and Summarized Cube, followed by a Network bottleneck, with an element labeled Feedback control]

• Key ingredients:
  – Cube summarization as mechanism
  – User-defined policies
  – Feedback control
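
The slides do not describe the controller itself. As a purely hypothetical sketch of how feedback control might drive cube summarization, the function below picks a degradation level from how much of the sent data the network actually delivered; `next_level`, its thresholds, and the notion of numbered levels are all assumptions made for illustration.

```python
def next_level(level, sent_bytes, acked_bytes, max_level,
               congested_below=0.9, idle_above=0.99):
    """Pick the next degradation level (0 = full fidelity).

    If the network delivered much less than was sent, degrade more;
    if it kept up, try restoring fidelity by one step.
    """
    if sent_bytes == 0:
        return level  # nothing was sent, so there is no signal
    delivered = acked_bytes / sent_bytes
    if delivered < congested_below:
        return min(level + 1, max_level)
    if delivered >= idle_above:
        return max(level - 1, 0)
    return level
```

A user-defined policy would then map each level to a concrete summarization, for example a longer aggregation window or a rank cutoff.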

19

Backup Slides

20

Conclusions

• The hard problems in wide-area analysis:
  – Reasoning about bandwidth/data quality tradeoffs
  – Optimizing data quality under changing conditions
  – Jointly optimizing bandwidth and other resources

• We are building a system.
  – We call it JetStream. Stay tuned…

23

Bandwidth Costs do not Decline Smoothly

[TeleGeography's Global Bandwidth Research Service]

24

2012 Bandwidth Price Shifts

[Figure: price shifts with labels 20%, 20%, and Frankfurt-London]

[TeleGeography's Global Bandwidth Research Service]

25

Diurnal Load Makes Overprovisioning Expensive

• Leased lines waste capacity during off-peak

• Public internet gets congested during peak

29

Benefit: Iteration

Can iteratively pose different queries

[Figure: two streams of Requests, each feeding Local Aggregation and Storage; an arrow labeled "A revised query"]

30

Benefit: Adaptation

Can adapt data volume collected to available bandwidth

[Figure: two streams of Requests, each feeding Local Aggregation and Storage; link labeled "Limited Bandwidth"]

31

Benefit: Adaptation

Can adapt data volume collected to available bandwidth

[Figure: two streams of Requests, each feeding Local Aggregation and Storage; link labeled "Ample Bandwidth"]

32

A dataflow model for wide-area analytics

Operator: Defines data transformation on tuples. Can do input or output.

Cube: Structured storage of data
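
To make the two building blocks concrete, here is a small sketch in which operators are plain Python generators over tuples and the cube is a counter keyed by (URL, minute). None of this is the system's API: the function names and the "minute URL" log format are hypothetical.

```python
from collections import Counter


def parse(lines):
    """Input operator: turn raw log lines into (url, minute) tuples."""
    for line in lines:
        minute, url = line.split()
        yield url, minute


def only_site(tuples, site):
    """Operator: keep only tuples whose URL starts with `site`."""
    return (t for t in tuples if t[0].startswith(site))


def ingest(tuples, cube):
    """Output operator: fold tuples into a cube (structured storage)."""
    for key in tuples:
        cube[key] += 1
    return cube


log = ["12:00 www.mysite.com/a",
       "12:00 www.mysite.com/b",
       "12:01 www.yoursite.com/"]
cube = ingest(only_site(parse(log), "www.mysite.com"), Counter())
```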

33

Generated data ingested into local cubes

[Figure: at each of two sites, boxes labeled Source, Cube, and Processing; a Network bottleneck; a box labeled Processed Data]

34

[Figure labels: Processed Data, Processing]