Continuous Queries over Data Streams

Continuous Queries over Data Streams. Vitaly Kroivets, Lyan Marina. Presentation for The Seminar on Database and Internet, The Hebrew University of Jerusalem, Fall 2002

description

Continuous Queries over Data Streams. Vitaly Kroivets, Lyan Marina. Presentation for The Seminar on Database and Internet, The Hebrew University of Jerusalem, Fall 2002. Contents of the lecture: Introduction; Proposed Architecture of Data Stream Management System; Research problems.

Transcript of Continuous Queries over Data Streams

Page 1: Continuous Queries  over  Data Streams

Continuous Queries over Data Streams

Vitaly Kroivets, Lyan Marina
Presentation for The Seminar on Database and Internet
The Hebrew University of Jerusalem, Fall 2002

Page 2: Continuous Queries  over  Data Streams

Contents of the lecture

• Introduction
• Proposed Architecture of Data Stream Management System
• Research problems
• Query Optimization
• Bibliography

Page 3: Continuous Queries  over  Data Streams

Data Streams vs. Data Sets

Data Sets:
• Updates infrequent
• Old data required many times
• Example: employees' personal data table

Data Streams:
• Data changed constantly (sometimes additions only)
• Mostly only freshest data used
• Examples: financial tickers, data feeds from sensors, network monitoring, etc.

Page 4: Continuous Queries  over  Data Streams

Using Traditional Database

[Diagram: User/Application, Loader; Query, Result, Query …, Result …]

Page 6: Continuous Queries  over  Data Streams

Data Streams Paradigm

[Diagram: User/Application, Register Query, Stream Query Processor, Result, Scratch Space (Memory and/or Disk); together: Data Stream Management System (DSMS)]

Page 7: Continuous Queries  over  Data Streams

What Is a Continuous Query?

A query which is issued once and logically runs continuously.

Page 8: Continuous Queries  over  Data Streams

What Is a Continuous Query?

A query which is issued once and runs continuously.

Example: detect abnormalities in network traffic behavior in real time, together with their cause, such as link congestion due to hardware failure.

Page 9: Continuous Queries  over  Data Streams

What Is a Continuous Query?

A query which is issued once and runs continuously.

More examples:

Continuous queries are used to support load balancing and online automatic trading at a stock exchange.

Page 10: Continuous Queries  over  Data Streams

Special Challenges

• Timely online answers even for rapid data streams
• Fast access to large portions of data
• Processing of multiple streams simultaneously

Page 11: Continuous Queries  over  Data Streams

Making Things Concrete

Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)

event = start or end

[Diagram: BOB, ALICE, two Central Offices, DSMS]

Page 12: Continuous Queries  over  Data Streams

Making Things Concrete

Database = two streams of mobile call records
  Outgoing(connectionID, caller, start, end)
  Incoming(connectionID, callee, start, end)

Query language = SQL

FROM clauses can refer to streams and/or relations

Page 13: Continuous Queries  over  Data Streams

Query 1 (self-join)

Find all outgoing calls longer than 2 minutes

  SELECT O1.call_ID, O1.caller
  FROM Outgoing O1, Outgoing O2
  WHERE (O2.time - O1.time > 2
         AND O1.call_ID = O2.call_ID
         AND O1.event = 'start'
         AND O2.event = 'end')

• Result requires unbounded storage
• Can provide result as data stream
• Can output after 2 min, without seeing end
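
The last point, that a result can be emitted after 2 minutes without waiting for the end tuple, can be sketched as follows. This is only an illustration, not the mechanism of any particular DSMS. It assumes that tuples arrive in timestamp order and that time is measured in minutes; the function name `long_calls` and the tuple layout are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical tuple layout: (call_ID, caller, time_in_minutes, event),
# where event is "start" or "end".
Event = Tuple[int, str, float, str]


def long_calls(outgoing: Iterable[Event],
               threshold: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Yield (call_ID, caller) once a call is known to exceed `threshold`."""
    open_calls: Dict[int, Tuple[str, float]] = {}  # call_ID -> (caller, start)
    for call_id, caller, time, event in outgoing:
        # Assumption: tuples arrive in time order, so each arrival acts as a
        # clock tick. Any open call that started more than `threshold`
        # minutes ago already qualifies, whether or not its end was seen.
        overdue = [cid for cid, (_, start) in open_calls.items()
                   if time - start > threshold]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        else:
            # An "end" for a call that is still open means the call was short.
            open_calls.pop(call_id, None)
```

Only calls that are currently open and younger than the threshold are kept, so the state stays small even though the output stream itself is unbounded.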

Page 14: Continuous Queries  over  Data Streams

Query 2 (join)

Pair up callers and callees

  SELECT O.caller, I.callee
  FROM Outgoing O, Incoming I
  WHERE O.call_ID = I.call_ID

• Can still provide result as data stream
• Requires unbounded temporary storage …
• … unless streams are near-synchronized
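
A minimal sketch of why near-synchronized streams keep the temporary storage small: each side buffers a tuple only until its partner arrives on the other stream. The function `pair_calls`, the "O"/"I" tags, and the simplification that each call_ID appears once per stream are assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple


def pair_calls(arrivals: Iterable[Tuple[str, int, str]]) -> Iterator[Tuple[str, str]]:
    """Join Outgoing and Incoming on call_ID, yielding (caller, callee).

    `arrivals` interleaves both streams in arrival order as
    (stream_tag, call_ID, party), with stream_tag "O" or "I".
    Assumption: each call_ID appears once per stream.
    """
    pending_out: Dict[int, str] = {}  # Outgoing tuples still waiting for a match
    pending_in: Dict[int, str] = {}   # Incoming tuples still waiting for a match
    for stream, call_id, party in arrivals:
        if stream == "O":
            if call_id in pending_in:
                yield party, pending_in.pop(call_id)
            else:
                pending_out[call_id] = party
        else:
            if call_id in pending_out:
                yield pending_out.pop(call_id), party
            else:
                pending_in[call_id] = party
```

If matching tuples arrive close together, both dictionaries stay nearly empty; if one stream lags arbitrarily far behind, the other side's buffer grows without bound.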

Page 15: Continuous Queries  over  Data Streams

Query 3 (group-by aggregation)

Total connection time for each caller

  SELECT O1.caller, sum(O2.time - O1.time)
  FROM Outgoing O1, Outgoing O2
  WHERE (O1.call_ID = O2.call_ID
         AND O1.event = 'start'
         AND O2.event = 'end')
  GROUP BY O1.caller

Cannot provide result in (append-only) stream.

Alternatives:
• Output stream with updates
• Provide current value on demand
• Keep answer in memory
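
The first alternative, an output stream with updates, might look like the sketch below: every finished call produces a new (caller, total) tuple that supersedes the previous one for that caller. The function name and tuple layout are hypothetical, and the totals dictionary doubles as the "answer kept in memory" for on-demand reads.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical tuple layout: (call_ID, caller, time, event)
Event = Tuple[int, str, float, str]


def total_connection_time(outgoing: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Yield an updated (caller, total_time) each time one of their calls ends."""
    start_time: Dict[int, float] = {}                # open calls: call_ID -> start
    totals: Dict[str, float] = defaultdict(float)    # current answer per caller
    for call_id, caller, time, event in outgoing:
        if event == "start":
            start_time[call_id] = time
        elif call_id in start_time:
            totals[caller] += time - start_time.pop(call_id)
            # This tuple replaces any earlier total emitted for the same
            # caller, so the output is not append-only.
            yield caller, totals[caller]
```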

Page 16: Continuous Queries  over  Data Streams

Conclusions

• Conventional DBMS technology is inadequate
• We need to reconsider all aspects of data management and processing in the presence of data streams

Page 21: Continuous Queries  over  Data Streams

DBMS versus DSMS

• DBMS: Persistent relations. DSMS: Transient streams (and persistent relations)
• DBMS: One-time queries. DSMS: Continuous queries
• DBMS: Random access. DSMS: Sequential access
• DBMS: Access plan determined by query processor and physical DB design. DSMS: Unpredictable data arrival and characteristics
• DBMS: "Unbounded" disk store. DSMS: Bounded main memory

Page 22: Continuous Queries  over  Data Streams

Related work

Tapestry system
• Content-based filtering of email messages
• Restricted subset of SQL
• Append-only query results

Chronicle data model
• Append-only ordered sequences of tuples
• Restricted view-definition language
• Doesn't store any chronicles

Alert system
• Event-Condition-Action triggers in a conventional SQL DB
• Continuous queries over append-only "active tables"

Page 23: Continuous Queries  over  Data Streams

Related work: Materialized Views

Materialized views are queries which need to be reevaluated whenever the database changes.

Materialized Views vs. Continuous Queries

Continuous queries:
• May stream rather than store the result
• May deal with append-only relations
• May provide approximate answers
• Processing strategy may adapt to characteristics of the data stream

Page 24: Continuous Queries  over  Data Streams

Architecture for continuous queries

Single stream of tuples D, single continuous query Q, and answer to the query A.
Q is issued once and operates continuously.

[Diagram: Data Stream of tuples <A,B>, <A,B>, <A,B>; Continuous Query Q; Answer A]

Page 25: Continuous Queries  over  Data Streams

Architecture for continuous queries

We consider data streams that adhere to the relational model (i.e. streams of tuples), although many of the ideas and techniques are independent of the data model being considered.

[Diagram: Data Stream of tuples <A,B>, <A,B>, <A,B>; Continuous Query Q; Answer A]

Page 26: Continuous Queries  over  Data Streams

Architecture for continuous queries

Scenario 1 (simplest):

Data stream D is append-only: no updates or deletions. How to handle Q?

1) Always store the current answer A to Q.
   D is of unbounded size => A may be too.

2) Do not store A, but make new tuples in A available as another continuous stream.
   No need for unbounded storage for A, but may need unbounded storage to determine new tuples in A.

Page 27: Continuous Queries  over  Data Streams

Architecture for continuous queries

Scenario 2
Input stream is append-only, but may cause updates and deletions in answer A.
=> May need to update/delete tuples in the output data stream.

Scenario 3 (most general)
Input stream D includes updates and deletions.
=> Much of the stream's data should be stored to determine the answer.

Page 28: Continuous Queries  over  Data Streams

Architecture for continuous queries

How to solve?

1) Restrict the expressiveness of Q.
2) Impose constraints on the data stream to guarantee that the answer to Q, and the amount of data needed to compute Q, are bounded.
3) Provide an approximate answer.

Page 29: Continuous Queries  over  Data Streams

Architecture for processing continuous queries

[Diagram: Stream 1, Stream 2, …, Stream N; Stream Query Processor; Throw, Scratch, Store, Stream]

Page 30: Continuous Queries  over  Data Streams

Architecture for continuous queries

STREAM is a data stream containing tuples appended to A. It is an append-only stream (it shouldn't include updates/deletions).

STREAM and STORE together define the current answer A.

Page 31: Continuous Queries  over  Data Streams

Architecture for continuous queries

When query Q is notified of a new tuple t in a relevant data stream, it can perform a number of actions, which are not mutually exclusive:

1) t causes new tuples in A. If tuple a will remain in A forever: send a to STREAM.
2) If a should be in A, but may be removed at some moment: add a to STORE.

[Diagram: Stream; Stream Query Processor; Throw, Scratch, Store, Stream]

Page 32: Continuous Queries  over  Data Streams

Architecture for continuous queries

When query Q is notified of a new tuple t in a relevant data stream, it can perform a number of actions, which are not mutually exclusive:

3) t may cause an update or deletion of answer tuples in STORE. Answer tuples may be moved from STORE to STREAM.
4) It may be necessary to save t or derived data so that the query result can be computed in the future: send t to SCRATCH.

[Diagram: Stream; Stream Query Processor; Throw, Scratch, Store, Stream]

Page 33: Continuous Queries  over  Data Streams

Architecture for continuous queries

When query Q is notified of a new tuple t in a relevant data stream, it can perform a number of actions, which are not mutually exclusive:

5) t is not needed and will not be needed: send it to THROW (unless we would like to archive it).
6) As a result of t, we may move data from STORE or SCRATCH to THROW.

[Diagram: Stream; Stream Query Processor; Throw, Scratch, Store, Stream]

Page 34: Continuous Queries  over  Data Streams

Architecture for continuous queries

Scenario 1

Data stream D is append-only: no updates or deletions. Always store the current answer A to Q.

• STREAM: empty
• STORE: always contains A
• SCRATCH: contains whatever is needed to keep the answer in STORE up to date

Page 35: Continuous Queries  over  Data Streams

Architecture for continuous queries

Scenario 2
Answer A is provided exclusively as a data stream.

• STREAM: streams answer A
• STORE: empty
• SCRATCH: contains whatever is needed to determine new tuples in A

Page 36: Continuous Queries  over  Data Streams

Architecture for continuous queries

Scenario 3

Input stream is append-only; answer A may have updates and deletions.

Example: Q is a group-by with the Min aggregation function.

• Answer A is maintained in STORE
• SCRATCH is empty

Page 37: Continuous Queries  over  Data Streams

Architecture for continuous queries

Scenario 4
Input streams may include updates and deletions.

• Unbounded storage is required for SCRATCH to ensure that Min can always be computed.
• In both 3 and 4: data is moved to STREAM only when it is known that no further updates/deletions etc. of tuples of this group will occur.
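
The difference between Scenarios 3 and 4 for a group-by Min query can be sketched as below. The class `MinPerGroup` and its method names are hypothetical. With insertions only, the `store` dictionary alone would suffice; once deletions are allowed, every live value has to be kept in `scratch`, because deleting the current minimum requires finding the next smallest value.

```python
from typing import Dict, List


class MinPerGroup:
    """Maintain Min per group under insertions and deletions (a sketch)."""

    def __init__(self) -> None:
        self.store: Dict[str, float] = {}          # STORE: group -> current Min
        self.scratch: Dict[str, List[float]] = {}  # SCRATCH: group -> live values

    def insert(self, group: str, value: float) -> None:
        self.scratch.setdefault(group, []).append(value)
        if group not in self.store or value < self.store[group]:
            self.store[group] = value

    def delete(self, group: str, value: float) -> None:
        values = self.scratch[group]
        values.remove(value)
        if values:
            # Recompute from SCRATCH; this is why the base data must be kept.
            self.store[group] = min(values)
        else:
            del self.scratch[group]
            del self.store[group]
```

In an append-only setting the `scratch` dictionary and the `delete` method could be dropped entirely, which matches the empty SCRATCH of Scenario 3.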

Page 38: Continuous Queries  over  Data Streams

The Architecture and Related Work

Implementing triggers in terms of the proposed architecture (for launching triggered actions, assume the actions are performed by SQL stored procedures):
• STREAM and STORE are empty.
• SCRATCH is used for data required to monitor complex events.

Benefits:
• Complex multi-table events and conditions can be monitored.
• Trigger processing benefits from efficient data management/processing techniques (see below).

Page 39: Continuous Queries  over  Data Streams

The Architecture and Related Work

Implementing materialized views in terms of the proposed architecture:
• The view itself is maintained in STORE.
• Base data: in SCRATCH.
• Data expiration: to expedite cleanup of SCRATCH.
• No way to ensure that the sizes of STORE and SCRATCH are bounded.

Page 40: Continuous Queries  over  Data Streams

End of Part I

Page 41: Continuous Queries  over  Data Streams

Research Problems

• Designing Query Language
• Online processing of rapid streams
  - Approximation techniques
  - Storage constraints vs. performance requirements
  - Summarization
• Query Planning / Optimization
  - Building good Query Plan
  - Scheduling
  - Sub-Plans Sharing
• Resource Management
• Adaptation

Page 42: Continuous Queries  over  Data Streams

Research Problems: Languages for Continuous Queries

• Bounding the size of scratch/store
• Open problem: to determine, for an arbitrary SQL query, whether these properties are satisfied

Page 43: Continuous Queries  over  Data Streams

Query Language

Query language allows both streams and relations

Assumptions:

Streams:
• Ordered
• Append-only
• Unbounded
• Multiple streams allowed

Relations:
• Unordered
• Support updates and deletions

Page 44: Continuous Queries  over  Data Streams

SQL Extensions for Continuous Queries

• FROM may refer to both streams and relations
• Sliding window in the FROM clause (for streams):
  - Optional "Partitioning" clause
  - Mandatory "Window size"
  - Optional "Filtering predicate"

Page 45: Continuous Queries  over  Data Streams

Window specification

Using ROWS:

  ROWS 50 PRECEDING

Using RANGE:

  RANGE 15 minutes PRECEDING
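
The two window kinds can be sketched as small buffers: a ROWS window keeps a fixed number of the most recent tuples, while a RANGE window keeps the tuples from the most recent span of time. The class names are hypothetical, and the boundary conventions (whether the current tuple counts toward the row limit, and whether the time bound is inclusive) are assumptions rather than part of the slides.

```python
from collections import deque
from typing import Any, Deque, Tuple


class RowsWindow:
    """Count-based window: keeps the n most recent tuples."""

    def __init__(self, n: int) -> None:
        # deque with maxlen silently evicts the oldest element when full.
        self.items: Deque[Any] = deque(maxlen=n)

    def add(self, tup: Any) -> None:
        self.items.append(tup)


class RangeWindow:
    """Time-based window: keeps tuples newer than (latest time - width)."""

    def __init__(self, width: float) -> None:
        self.width = width
        self.items: Deque[Tuple[float, Any]] = deque()

    def add(self, timestamp: float, tup: Any) -> None:
        # Assumption: timestamps arrive in non-decreasing order.
        self.items.append((timestamp, tup))
        while self.items and self.items[0][0] <= timestamp - self.width:
            self.items.popleft()
```

A ROWS window has a fixed memory bound, whereas the memory used by a RANGE window depends on the arrival rate of the stream.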

Page 46: Continuous Queries  over  Data Streams

Example 1

[Diagram: Clients (CL1, CL2, CL3, CL4, CL5, CL7) from domains .com, .il, .NF; Internet; Web Server (CS web, Math web); DSMS; stream S (Client_id, URL, domain, time)]

Page 47: Continuous Queries  over  Data Streams

Example 1 (CQL): "From" with "Range"

Stream "Requests" of requests to a web server, with attributes:

  (client_id, URL, domain, time)

Query counting the number of requests for pages from the domain "cs.huji.ac.il" in the last day:

  SELECT COUNT(*)
  FROM Requests S [RANGE 1 DAY PRECEDING]
  WHERE S.domain = 'cs.huji.ac.il'

Page 48: Continuous Queries  over  Data Streams

Partitioning Clause

• Partitions data into several groups
• Computes a separate window for each group
• Merges windows into a single result
• Is syntactically the same as the GROUP BY clause

Page 49: Continuous Queries  over  Data Streams

Example 2: "Partition By"

How many pages were served from the CS website to requests from the domain CS.HUJI.AC.IL, considering only each client's 10 most recent requests?

  SELECT COUNT(*)
  FROM Requests S [PARTITION BY S.Client_id
                   ROWS 10 PRECEDING
                   WHERE S.Domain = 'CS.HUJI.AC.IL']
  WHERE S.URL LIKE 'http://cs.huji.ac.il/%'
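
One possible reading of this query, sketched in Python: the filtering predicate selects requests from the domain first, a 10-row window is then kept per client, and the outer WHERE is applied to the merged windows. That evaluation order, the case-insensitive matching, and the function name `count_cs_pages` are assumptions made for illustration.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Tuple

# Hypothetical tuple layout: (client_id, url, domain)
Request = Tuple[int, str, str]


def count_cs_pages(requests: Iterable[Request], rows: int = 10) -> int:
    """Return the query answer after all given requests have arrived."""
    windows: Dict[int, Deque[str]] = defaultdict(lambda: deque(maxlen=rows))
    for client_id, url, domain in requests:
        # Filtering predicate of the window specification.
        if domain.lower() == "cs.huji.ac.il":
            # PARTITION BY client_id, ROWS 10: one bounded window per client.
            windows[client_id].append(url)
    # Outer WHERE ... LIKE 'http://cs.huji.ac.il/%' over the merged windows.
    return sum(
        url.lower().startswith("http://cs.huji.ac.il/")
        for window in windows.values()
        for url in window
    )
```

Memory use here grows with the number of distinct clients, since each partition holds up to `rows` tuples.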

Page 50: Continuous Queries  over  Data Streams

Example 3: Join with relation

Classify domains by the primary type of web content they serve:

  .ac.il   EDUCATION
  .gov.il  GOVERNMENT
  .co.il   COMMERCE
  .com     COMMERCE

Count the number of requests from "commerce" domains out of the last 10000 records.

A 10% sample of the requests stream is used.

Page 51: Continuous Queries  over  Data Streams

Example 3 (Cont.)

  SELECT COUNT(*)
  FROM (SELECT R.class
        FROM Requests S 10% SAMPLE, Domains R
        WHERE S.Domain = R.Domain) T [ROWS 10000 PRECEDING]
  WHERE T.class = 'commerce'

Note: the stream Requests is joined with the Domains relation, resulting in stream T, before the sliding window is applied.

Page 52: Continuous Queries  over  Data Streams

Performance Challenge:

• Multiple rapid incoming data streams
• Multiple complex queries with timeliness requirements
• Finite resources

Page 53: Continuous Queries  over  Data Streams

Solution: Approximation

• Approximate answers
• Graceful degradation
• Maximize precision based on available resources

Page 54: Continuous Queries  over  Data Streams

Approximation: Static vs. Dynamic

Static:
• Queries modified at submission time to use fewer resources
• User guaranteed certain query behavior
• User can configure approximation mechanism
• Adaptation mechanisms not needed

Dynamic:
• Queries modified at run time
• Not suitable for some applications

Page 55: Continuous Queries  over  Data Streams

Approximation Techniques

• Window reduction
• Sampling rate reduction
• Summarization (synopses)

Page 56: Continuous Queries  over  Data Streams

Window reduction

• Decrease the size of the window
• Introduce a window where none was specified originally

• Affects resources used by the operator
• May increase output rate (duplicate elimination, for example)
• Must detect bad cases statically

Page 57: Continuous Queries  over  Data Streams

Sampling rate reduction

• Introduce SAMPLE if not specified
• Reduce the sampling rate

• Will reduce output rate
• Will not influence resource requirements of the operation

Page 58: Continuous Queries  over  Data Streams

Summarization

Summaries (data synopses): concise representation at the expense of accuracy
• Sampling, histograms, wavelets

• How to make guarantees about query results based on summaries?
• How to maintain them efficiently in rapid data streams?
• Which summarization techniques are better?

Page 59: Continuous Queries  over  Data Streams

Dynamic approximation: Challenges

• Some apps will not tolerate unpredicted and variable accuracy
• Extend the language to specify tolerable imprecision

Page 60: Continuous Queries  over  Data Streams

Dynamic approximation techniques

• Synopses compression
• Sampling
• Load shedding

Page 61: Continuous Queries  over  Data Streams

Synopses compression

• Synopses: concise representation at the expense of accuracy
• Reducing memory overhead
• Methods: histograms, wavelets, etc.

Page 62: Continuous Queries  over  Data Streams

Load shedding

• Drop tuples from queues when they grow too large
• Drops chunks of tuples at a time; this differs from sampling, which eliminates tuples probabilistically
• Load shedding is biased, but easier to implement
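
The contrast between the two techniques can be sketched as follows: sampling decides independently for each tuple, while load shedding discards a contiguous chunk once a queue exceeds its capacity. The function names, the list-based queue, and the choice to drop the oldest tuples are assumptions for illustration only.

```python
import random
from typing import Any, List


def sample(tuples: List[Any], rate: float, rng: random.Random) -> List[Any]:
    """Sampling: keep each tuple independently with probability `rate`."""
    return [t for t in tuples if rng.random() < rate]


def enqueue_with_shedding(queue: List[Any], tup: Any,
                          capacity: int, chunk: int) -> None:
    """Load shedding: append `tup`; on overflow, drop a whole chunk at once.

    Assumption: the oldest `chunk` tuples are the ones discarded.
    """
    queue.append(tup)
    if len(queue) > capacity:
        del queue[:chunk]


q: List[int] = []
for i in range(7):
    enqueue_with_shedding(q, i, capacity=5, chunk=3)
# q is now [3, 4, 5, 6]: tuples 0, 1 and 2 were dropped together.
```

Because whole runs of consecutive tuples disappear, the surviving data is not a uniform sample of the stream, which is the bias the slide refers to.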

Page 63: Continuous Queries  over  Data Streams

Query Plans: How Does a DSMS Process a Query?

Issues to consider:
• Separate query plan for each continuous query vs. one mega-query plan for all computations for all users
• Plan components may be shared
• Query registers before streams start to produce data
• How about adding queries over existing streams?
• Queries over archived/discarded data

Page 64: Continuous Queries  over  Data Streams

STREAM System: Query Plans

Query Operators

An operator reads a stream of tuples from a set of input queues, processes them, and writes output tuples into a single output queue.

[Diagram: Input Queue, Input Queue; Operator; Output Queue]

Page 65: Continuous Queries  over  Data Streams

Query Plans (Cont.)

Inter-Operator Queues
Queues connect different operators and define the flow of tuples.

Synopses
A synopsis summarizes the tuples seen so far at an intermediate operator, as needed for future processing.

Page 66: Continuous Queries  over  Data Streams

When Are Synopses Needed?

Join operator
Must remember the tuples seen so far on each of its input streams: maintains a synopsis for each.

Filter operator (selection)
Does not maintain state: no need for synopses.

Page 67: Continuous Queries  over  Data Streams

67

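Example

[Diagram: Streams R and S enter through Queue1 and Queue2 into Operator O1 (join), which maintains Synop1 and Synop2 and writes to Queue3. Queue3 is shared: it feeds Operator O2 (select), which produces Query1 (selection over the join of R and S), and Operator O3 (join), which also reads Stream T through Queue 4, maintains Synop3 and Synop4, and produces Query2 (join of R, S, T). A scheduler controls the operators.]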

Page 68: Continuous Queries  over  Data Streams

68

Explanations to Example

Two plans (for Q1 and Q2) share a sub-plan joining streams R and S by sharing its output queue q3

Execution of operators is controlled by a global scheduler

When operator O is scheduled, control passes to O for a period determined by a number of tuples

Time-slice based scheduling is also possible

Page 69: Continuous Queries  over  Data Streams

69

Resource Sharing for Query Plans

When continuous queries share common sub-expressions

Similar to traditional DBMS
Resource sharing and approximation are considered separately
Do not share if sharing introduces approximation, e.g. merging sub-expressions with different window sizes

Page 70: Continuous Queries  over  Data Streams

70

Implementation of Shared Queue

The queue maintains a pointer to the first unread tuple for each operator

Tuples are discarded once they have been read by all operators

[Diagram: shared queue holding tuples t1 to t8, with a separate read pointer for each of Op1, Op2, Op3, Op4]
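A shared queue along these lines might look like the sketch below. It is an illustration under assumptions, not STREAM's implementation: the class name `SharedQueue`, its methods, and the use of `None` to signal "nothing unread" are invented for the example, and at least one reader must be registered.

```python
from typing import Dict, Hashable, Iterable, List, Optional


class SharedQueue:
    """One buffer, one read pointer per consuming operator."""

    def __init__(self, readers: Iterable[Hashable]) -> None:
        self._buf: List[object] = []  # tuples not yet read by every reader
        self._base = 0                # absolute position of self._buf[0]
        # Absolute position of the first unread tuple for each reader.
        self._next: Dict[Hashable, int] = {r: 0 for r in readers}

    def append(self, item: object) -> None:
        self._buf.append(item)

    def read(self, reader: Hashable) -> Optional[object]:
        """Return the reader's next unread tuple, or None if it has caught up."""
        pos = self._next[reader]
        if pos - self._base >= len(self._buf):
            return None
        item = self._buf[pos - self._base]
        self._next[reader] = pos + 1
        self._discard()
        return item

    def _discard(self) -> None:
        """Drop every tuple that all readers have already read."""
        slowest = min(self._next.values())
        drop = slowest - self._base
        if drop > 0:
            del self._buf[:drop]
            self._base = slowest

    def __len__(self) -> int:
        return len(self._buf)


if __name__ == "__main__":
    q = SharedQueue(["Op1", "Op2"])
    for t in ("t1", "t2", "t3"):
        q.append(t)
    q.read("Op1")
    q.read("Op1")
    print(len(q))  # Op2 has read nothing yet, so nothing can be discarded
```

The buffer length is governed by the slowest reader, which is the source of the shared-queue issues discussed on the following slides.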

Page 71: Continuous Queries  over  Data Streams

71

Resource Sharing (Cont.)

A base data stream accessed by multiple queries is shared as a common sub-expression

Number of tuples in a shared queue depends on:
Rate of addition to the queue
Rate at which the slowest operator consumes tuples

Common sub-expression of 2 queries with very different consumption rates

Page 72: Continuous Queries  over  Data Streams

72

Shared Queue Issues

P1, P2: parents of operator J
J will be scheduled frequently, for the sake of P1
J should be scheduled less frequently for P2 (to avoid proliferation of tuples in q)

[Diagram: two streams feed Operator J (join), which writes to Queue q; q is read by P1 (heavy consumer) and P2 (light consumer)]

Page 73: Continuous Queries  over  Data Streams

73

Sub-Plan Sharing

Formally proven:
Sub-plan sharing may be sub-optimal for common sub-expressions with joins
For common sub-expressions without joins, sharing is always preferable

Page 74: Continuous Queries  over  Data Streams

74

Synopses Sharing

Issues to consider:
Which operator is responsible for managing shared synopses?
When synopses are required by different operators, how to choose the size of the common synopsis?
If synopses are identical, how to cope with different consumption rates?

Page 75: Continuous Queries  over  Data Streams

75

Scheduling

Objective for Scheduler:
Stream-based variation of response time
Throughput
Weighted fairness among queues
Minimize intermediate queue sizes

Granularity for Scheduler:
Max number of tuples consumed by an operator
Time unit

Parallelism in the scheduling algorithm?

Page 76: Continuous Queries  over  Data Streams

76

Scheduling: Example

O1 takes 1 time unit to operate on n tuples from q1; with 20% selectivity, it produces n/5 tuples in q2

[Diagram: q1 feeds Op. O1, whose output queue q2 feeds Op. O2]

O2 takes 1 time unit to operate on n/5 tuples from q2, and it does not produce tuples.

Page 77: Continuous Queries  over  Data Streams

77

Scheduling Example (Cont.)

Assume the average arrival rate on q1 is no more than n tuples per 2 time units, so queues are bounded

Arrivals may be bursty

Possible scheduling strategies:

Algorithm 1 (time-slicing): tuples are processed for 1 time unit by each operator in turn. O1 consumes n tuples, O2 consumes n/5; O1, O2, ...

Algorithm 2: O1 operates until its queue is empty, afterwards O2

Page 78: Continuous Queries  over  Data Streams

78

Algorithm 1

[Chart: queue size vs. time (time units 1 to 8) under Algorithm 1. Orange: tuples in Q1; yellow: tuples in Q2. Arrivals marked on the time axis: 2n tuples, n tuples, n tuples.]

Page 79: Continuous Queries  over  Data Streams

79

Algorithm 2

[Chart: queue size vs. time (time units 1 to 8) under Algorithm 2. Orange: tuples in Q1; yellow: tuples in Q2. Arrivals marked on the time axis: 2n tuples, n tuples, n tuples.]

Page 80: Continuous Queries  over  Data Streams

80

Comparison. Which is better?

[Chart: total size of both queues vs. time (time units 1 to 8). Orange: Algorithm 1; yellow: Algorithm 2. Arrivals marked on the time axis: 2n tuples, n tuples, n tuples.]
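The two strategies can be compared with a small simulation. This is a sketch under assumptions, not a reproduction of the charts: the arrival pattern (2n, then n, then n in the first three time units) is a guess based on the chart labels, queue sizes are measured in units of n, exactly one operator runs per time unit, and the function names are invented for the example.

```python
from fractions import Fraction
from typing import Callable, List, Sequence

N = Fraction(1)               # one batch of n tuples, in units of n
SELECTIVITY = Fraction(1, 5)  # O1 passes 20% of its input on to q2

Picker = Callable[[int, Fraction, Fraction], str]


def simulate(arrivals: Sequence[int], pick: Picker) -> List[Fraction]:
    """Return the total size of q1 + q2 at the end of each time unit."""
    q1 = Fraction(0)
    q2 = Fraction(0)
    sizes = []
    for t, arrived in enumerate(arrivals):
        q1 += arrived
        if pick(t, q1, q2) == "O1":
            used = min(q1, N)              # O1 handles up to n tuples
            q1 -= used
            q2 += used * SELECTIVITY
        else:
            q2 -= min(q2, N * SELECTIVITY)  # O2 handles up to n/5 tuples
        sizes.append(q1 + q2)
    return sizes


def time_slicing(t: int, q1: Fraction, q2: Fraction) -> str:
    """Algorithm 1: alternate O1, O2, O1, O2, ..."""
    return "O1" if t % 2 == 0 else "O2"


def drain_first(t: int, q1: Fraction, q2: Fraction) -> str:
    """Algorithm 2: run O1 while q1 is non-empty, otherwise O2."""
    return "O1" if q1 > 0 else "O2"


if __name__ == "__main__":
    burst = [2, 1, 1, 0, 0, 0, 0, 0]  # assumed arrivals, in units of n
    for name, pick in (("Algorithm 1", time_slicing), ("Algorithm 2", drain_first)):
        print(name, [float(s) for s in simulate(burst, pick)])
```

Under this assumed burst, the total queue size peaks at 2.2n for Algorithm 1 and at 1.6n for Algorithm 2, and Algorithm 2 is never larger at any time step. The price is latency: O2 first runs in the second time unit under Algorithm 1 but only in the fifth under Algorithm 2.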

Page 81: Continuous Queries  over  Data Streams

81

Greedy Scheduler Rule

Schedule the operator that consumes the largest number of tuples per time unit and is the most selective (produces the fewest tuples)

Operators with full batches in their input queues are favored over high-priority operators with under-full inputs (better utilization of the time slice)

A high-priority operator may be underutilized if its feeders are low priority, so consider chains of operators
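One way to turn the greedy rule into a number is to rank operators by how many tuples they remove from the queues per time unit. The formula below, consumption rate times (1 - selectivity), is an assumption chosen for illustration and is not taken from the slides; the `Op` record and `pick_greedy` are likewise hypothetical, and the full-batch preference and operator chains are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Op:
    name: str
    rate: float         # tuples consumed per time unit
    selectivity: float  # output tuples produced per input tuple
    queued: int         # tuples currently waiting in the input queue


def net_reduction(op: Op) -> float:
    """Tuples removed from the system per time unit (assumed priority)."""
    return op.rate * (1.0 - op.selectivity)


def pick_greedy(ops: List[Op]) -> Optional[Op]:
    """Pick the ready operator with the highest net reduction, if any."""
    ready = [op for op in ops if op.queued > 0]
    if not ready:
        return None
    return max(ready, key=net_reduction)


if __name__ == "__main__":
    # Numbers follow the two-operator example with n = 100.
    o1 = Op("O1", rate=100, selectivity=0.2, queued=100)
    o2 = Op("O2", rate=20, selectivity=0.0, queued=20)
    print(pick_greedy([o1, o2]).name)
```

With the numbers of the two-operator example, O1 removes 0.8n tuples per time unit and O2 only 0.2n, so this priority prefers O1 whenever q1 is non-empty, which is the behavior of Algorithm 2.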

Page 82: Continuous Queries  over  Data Streams

82

Scheduling Algorithm Discussion

Queue size minimization
Increased time to initial results
Strategy 1 would produce initial results faster

Incorporate response time and weighted fairness into the algorithm
Flexible time slices
Taking context switching into account

Page 83: Continuous Queries  over  Data Streams

83

Resource Management

Relevant Resources:
Memory
CPU
I/O (if disk is used)
Network (in a distributed DSMS)

Our Goal
Maximize query precision by making the best use of available resources, with the capability to do that dynamically and adaptively

Page 84: Continuous Queries  over  Data Streams

84

Resource Management (Cont.)

Focus on memory used by synopses and queues

Algorithms developed in STREAM:
Allocating memory to a query plan
Incorporating known constraints on input streams to reduce synopses without compromising precision
Operator scheduling to minimize queue size

Page 85: Continuous Queries  over  Data Streams

85

Resource Management Approaches (Cont.)

Exploiting constraints over data streams

When additional information about streams is available (gathered statistics, constraint specifications), reduce resource utilization while keeping the same result precision

Page 86: Continuous Queries  over  Data Streams

86

Adaptation: Why?

Queries are long running

Parameters may vary:
Stream flow rate
Stream data characteristics
Environment (available RAM)

How to adapt?

Page 87: Continuous Queries  over  Data Streams

87

Exploiting Constraints over Data Streams

Query Q: join of streams O and F, to monitor fulfillment delays

Answering Q requires synopses of unbounded size!

[Diagram: Stream Orders (O) and Stream Fulfillments (F) feed a join on Order_ID and Item_ID, which maintains Synop-O and Synop-F]

Page 88: Continuous Queries  over  Data Streams

88

Constraints (Cont.)

Tuples for a given (orderID, itemID) arrive at stream O before the corresponding tuples arrive at F
No need to maintain a join synopsis for F!

Another constraint: tuples arrive at O clustered by orderID
We need only save tuples for a given orderID until the next orderID is seen

[Diagram: two example arrival sequences. First: Ord1 item 4, Ord1 item 2, Ord1 item 1, Ord1 item 3, Ord3 item 4, Ord3 item 1. Second: Ord1 item 3, Ord1 item 2, Ord1 item 1, Ord3 item 1, Ord3 item 4, Ord3 item 2. Annotation: more RAM needed for the synopsis.]
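The first constraint can be sketched as a join that stores only Orders tuples. This is an illustration, not STREAM's code: the class name `OrderedArrivalJoin`, the callback names, and the integer timestamps used to compute a delay are assumptions. The clustered-arrival constraint is not implemented here, so `synop_o` still grows without bound in this sketch.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, int]  # (order_id, item_id)


class OrderedArrivalJoin:
    """Join Orders with Fulfillments when every order precedes its fulfillment."""

    def __init__(self) -> None:
        self.synop_o: Dict[Key, int] = {}  # key -> time the order arrived
        self.results: List[Tuple[int, int, int]] = []  # (order, item, delay)

    def on_order(self, order_id: int, item_id: int, ts: int) -> None:
        # Orders must be remembered: their fulfillments have not arrived yet.
        self.synop_o[(order_id, item_id)] = ts

    def on_fulfillment(self, order_id: int, item_id: int, ts: int) -> None:
        # Probe Synop-O only. The fulfillment itself is never stored, because
        # no later order tuple can match it under the ordering constraint.
        placed = self.synop_o.get((order_id, item_id))
        if placed is not None:
            self.results.append((order_id, item_id, ts - placed))


if __name__ == "__main__":
    j = OrderedArrivalJoin()
    j.on_order(1, 4, ts=10)
    j.on_order(1, 2, ts=11)
    j.on_fulfillment(1, 4, ts=15)
    print(j.results)
```

No structure corresponding to Synop-F appears anywhere in the class, which is the memory saving the constraint buys.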

Page 89: Continuous Queries  over  Data Streams

89

Constraints (Cont.)

Referential integrity
Unique-value
Clustered-arrival
Ordered-arrival

Page 90: Continuous Queries  over  Data Streams

90

Summary

Architecture for DSMS
Query Language
Common Design Problems
Tradeoff: efficiency, accuracy, storage

Page 91: Continuous Queries  over  Data Streams

91

References

"Continuous Queries over Data Streams" by S. Babu, J. Widom (Stanford University)

"Query Processing, Approximation, and Resource Management in a Data Stream Management System" by R. Motwani, J. Widom and others (Stanford University)

http://www.db.stanford.edu/stream

Page 92: Continuous Queries  over  Data Streams

Questions?

Page 93: Continuous Queries  over  Data Streams

93