Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics:...

27
Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor, CS Dept., UC Berkeley Founder and CTO, Truviso

Transcript of Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics:...

Page 1: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Continuous Analytics: Stream Query Processing in Practice

LBNL December 2, 2009

Michael J. Franklin Professor, CS Dept., UC Berkeley Founder and CTO, Truviso

Page 2: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Stream Query Processing: The Basic Idea

Static Results

Batch Load

Data

Queries Results

“Store First, Query Later”

Traditional

Data Warehouse

Process “On-the-Fly”

Stream Query Processing

Continuous, Visibility,

Alerts

Live Data

Results

Data Stream

Processor

Page 3: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

A “Hot” Research Topic

Early: Active DBs, ECA rules, Triggers, Data Broadcast Field took off in 2002 •  ~10 SIGMOD/VLDB/ICDE “stream” papers thru 2001 •  275+ since then (holding steady at 40-50/yr) Lots of Research Systems-Building Projects •  AT&T – Gigascope •  Berkeley – TelegraphCQ->HIFI •  Brandeis, Brown, MIT - Aurora->Borealis •  Purdue – Nile •  Stanford – STREAM •  Wisconsin, Portland State – NiagaraCQ

Lots of Technology Transfer:

Page 4: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Research Vision Meets Reality

Existing Technology

Existing Application

Existing Technology

New Application

New Technology

New Application

New Technology

Existing Application

Page 5: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

But There is a Fundamental Problem

5

http://www.b-eye-network.com/view/7188

Agg

rega

te D

ata

Volu

me

Hardware capacity

For data analytics, computers get slower every year.

Data volume (industry-wide)

Data volume (net-centric businesses)

Page 6: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

It’s Getting Harder to Get Timely Information Va

lue

of D

ata

to D

ecis

ion-

Mak

ing

Time

Pre

vent

ive/

P

redi

ctiv

e

Actionable

Reactive

Historical

Real- Time

Seconds Minutes Hours Days

Traditional Business

Intelligence

Information Half-Life In Decision-Making

6 Months

Time-critical Decisions

Page 7: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Transform

Extract

Index, Build Cubes,

Update Queries

Display

Time Slice Analysis

Batch Source Data

Load

DELAY

DELAY

DELAY

DELAY

Batch Data Window

Batch Processing: A Poor Fit for the Real-Time Web

7

The basic approach is the same for a

legacy data warehouse cluster or a

brand new Hadoop cluster.

Page 8: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Making the Database Elephant Dance

Precomputation •  Materialized Views •  Fancy Indexes •  Query Result Caching I/O Reduction •  Main memory DB •  Column databases Shared Processing •  Multi-Query Optimization •  Shared Scans Latency Reduction •  Mini-batch, Trickle-feed

And of course: Throwing hardware at it! (i.e., MPP)

Page 9: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

So what does Stream Query Processing have to do with the (batch-oriented) Data Overload

problem?

Page 10: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

10

Applications & Databases

(CDC)

“Real-Time” •  Continuous Monitoring •  Automated Actions/Alerts •  Real-Time Dashboards

Message Bus

Network/Firewall

Internet/Clickstream

Business Process/Events

Mobile Sensors

Transactions

Video Voice

On Demand/Interactive •  Instant ad-hoc Reports •  Visualization/Drill Down •  “Tivo” Replay

Decision Optimization

•  Periodic Reporting •  Data Mining/OLAP •  Simulation/Back-Testing

The Key Insight: Decouple Processing and Delivery

For Efficiency: Process Streams in Real-Time.

But let Business Processes drive the use of the results.

Stream-Relational Processing

All data starts as streams

Page 11: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Net-centric Analytics Workloads

•  80-90% of the workload is known ahead of time (i.e., not ad hoc) •  Reports, KPIs, key metrics

•  Applications are “additive” •  Append-mostly •  Time or order-oriented •  Computation applied to new data

•  Lots of Aggregation, often by time •  SQL is appreciated

Page 12: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Data Records

Update Queries

Source Data

Data Warehouse

Update Display

Continuous Analysis

Store

Pre-Aggregation

Continuous Analytics

12

Page 13: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Integration Framework

Application Data  Web Apps  SOA/Web Services  ERP/CRM/SCM  Legacy Applications

Staged Data  Databases  Documents/XML  Files/Logs

“Live” Data  Network Traffic  Message Buses  Log Entries  QoS Data

Data Sources

Actions  Alerts/Events  Algorithms  Data Services  Transactional Systems

Immediate Insight  TruView Web

Dashboards  On Demand Reports  Transactional

Drill Down

Output Streams  Pre-Aggregates  Message Buses  External Data Storage

Destinations

Persistent Data Store

Full SQL Interface

Algorithms + App Logic PostgreSQL

Raw Data

TruLink Source C

onnectors Tr

uAct

ion

Con

nect

ors

Context Requires Stream/Relational

13

Can also directly link TruCQ instances

Page 14: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

SQL Execution On Streaming Data

14

Goal: support the full SQL language and ecosystem •  A stream is an unbounded sequence of records •  A table is a set of records •  Window operators convert streams to tables •  SQL queries apply to tables

Window Operators

•  Each “window” produces a set of records (a table) •  The TruCQ engine applies standard SQL to the results of

window operators •  Results are continuously appended to the output stream

Page 15: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Example: Window Operators

15

Data Stream

Data Stream

Data Stream

Chunking Non-overlapping windows (e.g. 5 second / record windows) <SLICES ‘5 seconds’>

Sliding Overlapping windows (e.g. 5 sec. windows move every 2 sec.) <VISIBLE ‘5 seconds’ ADVANCE ‘2 seconds’>

Landmark Growing windows (e.g., extend window every 2 seconds) <LANDMARK RESET ‘1 day’ ADVANCE ‘2 seconds’>

Page 16: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Example: Partitioned Window User Session Tracking

16

Data Stream

2 5 6 7 8 9 10 11 12 13 14 15 3 4 1

User 1 Session

User 2 Session

User 3 Session

1

2

3

4

5

6

7 8

9

10

11

12

13

14

15

Partitioned Value-based “virtual” streams <PER UserID VISIBLE “2 rows” ADVANCE “2 rows”>

Page 17: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Creating and Querying Streams

CREATE STREAM url_stream ( url varchar(1024), atime timestamp CQTIME USER, client_ip varchar(50), );

SELECT url, count(*) as url_count FROM url_stream <VISIBLE '5 minutes' ADVANCE '1 minute'> GROUP by url ORDER by url_count desc LIMIT 10;

Page 18: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Streams, Derived Streams and Persistence

CREATE STREAM urls_now AS SELECT url, count(*) as scnt, cq_close(*)

FROM url_stream <VISIBLE '5 minutes' ADVANCE '1 minute'>

GROUP by url;

CREATE TABLE urls_archive (url varchar(1024), scnt integer, stime timestamp);

CREATE CHANNEL urls_channel FROM urls_now INTO urls_archive APPEND;

Stream/Table integration pushes Stream Processing beyond “real-time” applications

Derived stream (Can also define Views over streams)

Page 19: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Historical Comparions w Stream/Table Joins

SELECT c.scnt, h.scnt, c.stime

FROM (select sum(cnt) as scnt,

cq_close(*) as stime

from urls_now <slices 1 windows>) c,

urls_archive h

WHERE c.stime –‘1 week’::interval = h.stime

Added Benefit – you get to re-use nice features like subqueries, sophisticated date/time functions, fancy query operators, etc.

Page 20: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Shared CQ Query Processing

• Start with a regular optimized plan • New queries “folded” into the current shared plan

[Krishnamurthy et al SIGMOD 06] • Adaptive query plan – Tuple Router

(i.e., “CQ-eddy” [Madden et al SIGMOD 02]) • Eliminates redundant work

SELECT T.symbol, AVG(T.price*T.volume) FROM Trades T [VISIBLE ‘5 sec’ ADVANCE ‘3 sec’], SANDP500 S WHERE T.symbol = S.symbol AND T.volume > 5000 GROUP BY T.symbol

SELECT … FROM … WHERE …. GROUP BY …

SELECT … FROM … WHERE …. GROUP BY …

SELECT … FROM … WHERE …. GROUP BY …

Page 21: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Unique User Tracking

21

1 2 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

5 Minute Set

1 2 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

Dimension Set (18-24)

1 2 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

Dimension Set (males)

1 2 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

Query: How many unique users clicked this ad in the past 5 minutes who were males,18-24

Page 22: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

22

Meeting Data Volume and Latency Demands

Continuous Analytics software on a single commodity server

Major ad network’s 40+ server data warehouse system

Page 23: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

The Latency/Performance Tradeoff: Gone 8-node cluster 4-node cluster Continuous 16-node cluster

With Batch: reducing batch size reduces throughput. With Continuous Analytics this tradeoff goes away.

High Through

put

Low Latency

Page 24: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Revisiting the Dancing Lessons Recall the tools used to make store-first

systems perform in this environment: 1.  I/O Reduction 2.  Shared Processing 3.  Latency Reduction 4.  Precomputation 5.  Throwing Hardware at it

Stream Query Processing inherently does #1,2 and 3

An integrated Stream-Relational system enables #4

#5 still possible but much less necessary

Page 25: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

The Future of Analytics

•  “Barbarians at the Gate” •  Procedural cloud-based

approaches gaining interest

•  Proven scalability for massive data sets

•  But, we’ve seen this movie before! • What is the minimal extension to SQL to

support Continuous Analytics? • Can we make persistence an orthogonal

property of both queries and data?

Page 26: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Complex Event Processng (CEP)

•  Related Technology •  Distributed Simulation Background •  Some overlap with Stream Query

Proc. •  Oracle (BEA), TIBCO, and others

have products •  Focused on real-time reaction •  Proprietary Languages (some

SQL-ish) •  Currently suffering from the dual

innovation dilemma discussed earlier.

Page 27: Continuous Analytics: Stream Query Processing in …franklin/Talks/LBL...Continuous Analytics: Stream Query Processing in Practice LBNL December 2, 2009 Michael J. Franklin Professor,

Conclusions

•  Stream Query Processing enables “Continuous Analytics” •  Addresses a 30+ yr old legacy problem in

Database Systems. •  Crucial for coping with the ongoing Data Tsunami.

•  Key Insight: Decouple Processing from Delivery •  Process in real-time because it is orders of

magnitude more efficient.

•  Deliver results as dictated by users and business processes.

•  Leveraging existing standards, tools, and business processes is key to deploying new technology.