Data Stream Processing - Uni Konstanz
1
Data Stream Processing
Weiwei SUN
University of Konstanz
2
Data Stream Processing
STREAM: Stanford stream data manager
– Semantics and query language
– Query execution and optimization
YFilter: XML stream filtering engine
3
STREAM: Stanford stream data manager
STREAM is a general-purpose DSMS (Data Stream Management System) prototype
The motivation of this talk is to introduce the problems, solutions, and challenges of data stream processing
4
Data Streams
Continuous, unbounded, rapid, time-varying streams of data elements
Occur in a variety of modern applications
– Network monitoring and traffic engineering
– Sensor networks, RFID tags
– Telecom call records
– Financial applications
– Web logs and click-streams
– Manufacturing processes
DSMS = Data Stream Management System
5
Using Traditional Database
[Figure labels: User/Application, Loader, Query, Result, Table R, Table S]
6
New Approach for Data Streams
[Figure labels: User/Application, Register Continuous Query, Stream Query Processor, Result, Input streams]
7
DBMS versus DSMS
DBMS:
– Persistent relations
– One-time queries
– Random access
– Access plan determined by query processor and physical DB design
DSMS:
– Transient streams (and persistent relations)
– Continuous queries
– Sequential access
– Unpredictable data characteristics and arrival patterns
8
The (Simplified) Big Picture
[Figure labels: Input streams, Register Query, DSMS, Scratch Store, Streamed Result, Stored Result, Archive, Stored Relations]
9
A (Simplified) System Architecture of Network Monitoring
[Figure labels: Network measurements and packet traces, Register Monitoring Queries, DSMS, Scratch Store, Intrusion Warnings, Online Performance Metrics, Archive, Lookup Tables]
10
Using Conventional DBMS
Data streams as relation inserts, continuous queries as triggers or materialized views
Problems with this approach
– Inserts are typically batched, high overhead
– Expressiveness: simple conditions (triggers), no built-in notion of sequence (views)
– No notion of approximation
– Current systems don't scale to large numbers of triggers
– Views don't provide streamed results
11
The STREAM System
Data streams and stored relations
Declarative language for registering continuous queries
Flexible query plans and execution strategies
Textual, graphical, and application interfaces
Relational, centralized (for now)
14
STREAM System Challenges
Must cope with:
– Stream rates that may be high, variable, bursty
– Stream data that may be unpredictable, variable
– Continuous query loads that may be high, variable
Overload, changing conditions
15
STREAM System Features
Aggressive sharing of state and computation
Careful resource allocation and use
Continuous self-monitoring and reoptimization
Graceful approximation as necessary
16
We will mainly talk about
Query language
– Semantics of CQL
Query plans and execution issues
– Operator, Queue, and State
– State sharing
– Stream constraints
– Operator scheduling optimization
17
Query Language
CQL: Continuous Query Language
18
Aside on Semantics
The semantics of SQL queries is (relatively) easy to understand
– Even lots of SQL queries running together
The semantics of a single trigger is (relatively) easy to understand
– But lots of triggers together can be complex
The semantics of even a single continuous query may not be obvious
– But lots running together is no harder
19
A Nonobvious Continuous Query
Stream of stock quotes: Stocks(ticker, price)
Monitor last 10 minutes of quotes:
Select * From Stocks [Range 10 minutes]
Is result a relation, a stream, or something else?
If a relation, what exactly does it contain?
If a stream, how does query differ from:
Select * From Stocks [Range 1 minute]
or
Select * From Stocks [∞]
20
Another Nonobvious CQ
Stream of ordered items, table of item prices
Prices for five most recent ordered items:
Select P.price
From Items I [Rows 5], PriceTable P
Where I.itemID = P.itemID
Is result a stream or a relation?
What if item price changes?
21
Continuous Query Language (CQL)
Start with SQL
Then add…
Streams as new data type
Continuous instead of one-time semantics
Windows on streams (Stream-to-Relation)
Sampling on streams (Approximate results)
Relation-to-Stream operators
– Istream, Dstream, Rstream
22
Relations and Streams
Assume global, discrete, ordered time domain
Relation
– Maps time T to set-of-tuples R
– Differs from the traditional relation, which is a single set of tuples with no notion of time
Stream
– Set of pairs <tuple, timestamp>
– Unbounded, transient
23
Conversions
A relation-to-relation operator takes one or more relations as input and produces a relation as output.
A stream-to-relation operator takes a stream as input and produces a relation as output.
A relation-to-stream operator takes a relation as input and produces a stream as output.
[Figure: Streams to Relations via window specification; Relations to Streams via the special operators Istream, Dstream, Rstream; Relations to Relations via any relational query language]
24
The Relation-to-Relation Operators in CQL
CQL uses SQL constructs to express its relation-to-relation operators, and much of the data manipulation in a typical CQL query is performed using these constructs, exploiting the rich expressive power of SQL.
25
The Stream-to-Relation Operators in CQL
The stream-to-relation operators in CQL are based on the concept of a sliding window over a stream:
tuple-based sliding window
– Items [Rows 100]
time-based sliding window
– Items [Range 5 Minutes]
partitioned sliding window
– Fulfillments [Partition By clerk Rows 5]
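The three window types above can be sketched in a few lines of Python. This is only an illustration, not STREAM code: the function names are made up, a stream is assumed to be a list of (tuple, timestamp) pairs in timestamp order, the time-based window is assumed to include both endpoints, and the window is only recomputed when a new element arrives (a real time-based window also changes as time advances without arrivals).

```python
from collections import deque

def rows_window(stream, n):
    """Tuple-based window [Rows n]: the n most recent tuples after each arrival."""
    window = deque(maxlen=n)          # deque drops the oldest tuple automatically
    for tup, ts in stream:
        window.append(tup)
        yield ts, list(window)

def range_window(stream, width):
    """Time-based window [Range width]: tuples with timestamp in [ts - width, ts]."""
    window = deque()
    for tup, ts in stream:
        window.append((tup, ts))
        while window[0][1] < ts - width:   # expire tuples older than the window
            window.popleft()
        yield ts, [t for t, _ in window]

def partitioned_rows_window(stream, key, n):
    """[Partition By key Rows n]: the n most recent tuples per partition."""
    parts = {}
    for tup, ts in stream:
        parts.setdefault(key(tup), deque(maxlen=n)).append(tup)
        yield ts, [t for part in parts.values() for t in part]
```

Each generator yields the relation produced by the window after every arrival, which matches the idea that a window turns a stream into a time-varying relation.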
26
Three Relation-to-Stream Operators in CQL
Three relation-to-stream operators: Istream, Dstream, Rstream
– Istream(R) (insert stream) contains all (r, T) where r ∈ R at time T but r ∉ R at time T-1
– Dstream(R) (delete stream) contains all (r, T) where r ∈ R at time T-1 but r ∉ R at time T
– Rstream(R) (relation stream) contains all (r, T) where r ∈ R at time T
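A minimal sketch of the three definitions above, under simplifying assumptions: a time-varying relation is represented as a Python dict from integer time to a set of tuples (sets rather than bags), the relation is taken to be empty before the first time instant, and the helper name _changes is hypothetical.

```python
def _changes(relation, pick):
    """Walk the relation in time order and emit (r, T) pairs chosen by pick."""
    out, prev = [], set()
    for t in sorted(relation):
        cur = relation[t]
        out.extend((r, t) for r in sorted(pick(prev, cur)))
        prev = cur
    return out

def istream(relation):
    # r in R at T but not at T-1
    return _changes(relation, lambda prev, cur: cur - prev)

def dstream(relation):
    # r in R at T-1 but not at T
    return _changes(relation, lambda prev, cur: prev - cur)

def rstream(relation):
    # every r in R at T
    return _changes(relation, lambda prev, cur: cur)

R = {0: {"a"}, 1: {"a", "b"}, 2: {"b"}}
print(istream(R))  # [('a', 0), ('b', 1)]
print(dstream(R))  # [('a', 2)]
print(rstream(R))  # [('a', 0), ('a', 1), ('b', 1), ('b', 2)]
```

Note how Rstream repeats every tuple at every instant, while Istream and Dstream only report changes.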
27
Abstract Semantics
Take any relational query language
Can reference streams in place of relations
– But must convert to relations using any window specification language (default window = [∞])
Can convert relations to streams
– For streamed results
– For windows over relations (note: converts back to relation)
28
Query Result at Time T
Use all relations at time T
Use all streams up to T, converted to relations
Compute relational result
Convert result to streams if desired
30
CQL Example Query 1
Two streams, contrived for ease of examples:
Orders (orderID, customer, cost)
Fulfillments (orderID, clerk)
Total cost of orders fulfilled over the last day by clerk "Sue" for customer "Joe"
Select Sum(O.cost)
From Orders O, Fulfillments F [Range 1 Day]
Where O.orderID = F.orderID And F.clerk = "Sue" And O.customer = "Joe"
35
CQL Example Query 1
Syntactic shortcuts and defaults for convenience
Select Istream(Sum(O.cost))
From Orders O [∞], Fulfillments F [Range 1 Day]
Where O.orderID = F.orderID And F.clerk = "Sue" And O.customer = "Joe"
At time T:
Entire stream O and tuples of last day of F as relations
Evaluate query, update result relation at T
Streamed result: new element <Sum(O.cost), T> whenever Sum(O.cost) changes from T-1
36
CQL Example Query 2
Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost
Select F.clerk, Max(O.cost)
From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample
Where O.orderID = F.orderID
Group By F.clerk
41
CQL Example 3: Result Type
Simpler version of Example Query 2:
Select Istream(F.clerk, Max(O.cost))
From O, F [Rows 100]
Where O.orderID = F.orderID
Group By F.clerk
Streamed result: emits a <clerk, max> stream element whenever max changes for a clerk (or a new clerk appears)
42
CQL Example 3: Result Type
Simpler version of Example Query 2:
Select Rstream(F.clerk, Max(O.cost))
From O, F [Rows 100]
Where O.orderID = F.orderID
Group By F.clerk
Rstream streams the full result relation at each time instant, so the output tracks the relation as it is updated when stream elements arrive
43
CQL Example Query 4
Relation CurPrice(stock, price)
Select stock, Avg(price)
From Istream(CurPrice) [Range 1 Day]
Group By stock
Average price over last day for each stock
Istream provides history of CurPrice
Window on history (back to relation), group and aggregate
44
Any questions?
45
Query plans and execution issues
46
Query Execution
When a continuous query is registered, generate a query plan
– New plan merged with existing plans
– Users can also create & manipulate plans directly
Plans composed of three main components:
– Operators
– Queues
– Synopses/States (windows, operators requiring history)
Global scheduler for plan execution
47
Operators used in STREAM query plans
48
Queue
A queue in a query plan connects its "producing" plan operator OP to its "consuming" operator OC
The elements that OP produces are inserted into the queue and buffered there until they are processed by OC
Elements in a queue are ordered by nondecreasing timestamp
– To maintain the semantics of sliding windows
49
Synopsis/State
Logically, a synopsis belongs to a specific plan operator, storing state that may be required for future evaluation of that operator.
For example, to perform a windowed join of two streams, the join operator must be able to probe all tuples in the current window on each input stream. Thus, the join operator maintains one synopsis (e.g., a hash table) for each of its inputs. On the other hand, operators such as selection and duplicate-preserving union do not require any synopses.
[Figure labels: State1, State2]
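The windowed-join example can be illustrated with a small sketch in which the join keeps one hash-table synopsis per input. The class name BinaryJoin, the process method, and the "+"/"-" signs marking tuples that enter or leave an input window are assumptions made for this illustration, not STREAM's actual interfaces.

```python
from collections import defaultdict

class BinaryJoin:
    """Equi-join that keeps one hash-table synopsis per input (illustrative only)."""

    def __init__(self, left_key, right_key):
        self.keys = (left_key, right_key)
        # synopses[0] materializes the left input, synopses[1] the right input
        self.synopses = (defaultdict(list), defaultdict(list))

    def process(self, side, sign, tup):
        """side: 0 = left, 1 = right; sign: '+' (insert) or '-' (expire).
        Returns the signed join results caused by this element."""
        k = self.keys[side](tup)
        own = self.synopses[side][k]
        if sign == "+":
            own.append(tup)
        else:
            own.remove(tup)
        # probe the synopsis of the opposite input
        matches = self.synopses[1 - side].get(k, [])
        if side == 0:
            return [(sign, tup + other) for other in matches]
        return [(sign, other + tup) for other in matches]

join = BinaryJoin(lambda t: t[0], lambda t: t[0])
print(join.process(0, "+", (1, "x")))  # []
print(join.process(1, "+", (1, "y")))  # [('+', (1, 'x', 1, 'y'))]
print(join.process(0, "-", (1, "x")))  # [('-', (1, 'x', 1, 'y'))]
```

A selection operator, by contrast, would need no such state: it can decide on each element in isolation.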
50
A simple query plan illustrating operators, queues, and synopses
Query:
Select *
From S1 [Rows 1000], S2 [Range 2 Minutes]
Where S1.A = S2.A And S1.A > 10
q3 holds elements representing the relation "S1 [Rows 1000]"
q4 holds elements for "S2 [Range 2 Minutes]"
51
A simple query plan illustrating operators, queues, and synopses
Query:
Select *
From S1 [Rows 1000], S2 [Range 2 Minutes]
Where S1.A = S2.A And S1.A > 10
q5 holds elements of the joined relation "S1 [Rows 1000] ⋈ S2 [Range 2 Minutes]"
52
A simple query plan illustrating operators, queues, and synopses
The select operator can be pushed down into one or both branches below the binary-join operator, and also below the seq-window operator on S2.
However, tuple-based windows do not commute with filter conditions, and therefore the select operator cannot be pushed below the seq-window operator on S1.
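A tiny worked example makes the non-commutativity concrete. The helper names and the sample values are invented for illustration; the filter is the query's S1.A > 10, and the window sizes are shrunk to 2 rows and 2 time units to keep the data small.

```python
def select(tuples, pred):
    return [t for t in tuples if pred(t)]

def last_rows(tuples, n):
    """Tuple-based window: the n most recent tuples."""
    return tuples[-n:]

def last_range(pairs, now, width):
    """Time-based window over (value, timestamp) pairs: timestamps in [now - width, now]."""
    return [(a, ts) for a, ts in pairs if now - width <= ts <= now]

# Values of S1.A in arrival order.
s1 = [5, 20, 3, 30, 7]
window_then_filter = select(last_rows(s1, 2), lambda a: a > 10)   # [30]
filter_then_window = last_rows(select(s1, lambda a: a > 10), 2)   # [20, 30]

# The same values with timestamps, for a time-based window.
s2 = [(5, 1), (20, 2), (3, 3), (30, 4), (7, 5)]
range_then_filter = select(last_range(s2, 5, 2), lambda p: p[0] > 10)
filter_then_range = last_range(select(s2, lambda p: p[0] > 10), 5, 2)
```

Filtering first changes which tuples count as "the last 2 rows", so the two orders disagree for the tuple-based window. For the time-based window, membership depends only on the timestamp, so both orders give [(30, 4)].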
53
A simple query plan illustrating operators, queues, and synopses
Each seq-window operator maintains a synopsis so that it can generate "−" elements when tuples expire from the sliding window.
The binary-join operator maintains a synopsis materializing each of its relational inputs for use in performing joins with tuples on the opposite input.
54
A simple query plan illustrating operators, queues, and synopses
The contents of synopsis1 and synopsis3 are similar (as are the contents of synopsis2 and synopsis4)
– both maintain a materialization of the same window
– but at slightly different positions of stream S1.
55
A simple query plan illustrating operators, queues, and synopses
Query:
Select *
From S1 [Rows 1000], S2 [Range 2 Minutes]
Where S1.A = S2.A And S1.A > 10
Query Plan Execution
56
Any questions?
57
Memory Overhead in Query Processing
Queues + State
Continuous queries keep state indefinitely
Online requirements suggest using memory rather than disk
Goal: minimize memory use while providing timely, accurate answers
58
Reducing Memory Overhead
1) Enable state sharing within and across queries
2) Exploit constraints on streams to reduce state
3) Specialized operator scheduling to reduce queue sizes
59
State sharing in one query plan
Multiple synopses within a single query plan may materialize nearly identical relations
Select *
From S1 [Rows 1000], S2 [Range 2 Minutes]
Where S1.A = S2.A And S1.A > 10
The contents of synopsis1 and synopsis3 are similar (as are the contents of synopsis2 and synopsis4)
60
State sharing in one query plan
Use lightweight stubs to replace the synopses
– Implement the same interfaces as non-shared synopses
A single store to hold the actual tuples
61
State sharing in multiple plans
Q1:
Select *
From S1 [Rows 1000], S2 [Range 2 Minutes]
Where S1.A = S2.A And S1.A > 10
Q2:
Select A, Max(B)
From S1 [Rows 200]
Group By A
Clearly the store must contain the union of its corresponding stubs:
– A tuple is inserted into the store as soon as it is inserted by any one of the stubs
– A tuple is removed only when it has been removed from all of the stubs.
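The two rules above amount to keeping, for each stored tuple, the set of stubs that still contain it. The sketch below shows one possible way to do that; the class and method names (SharedStore, Stub, view) are hypothetical, and the real system may track stub contents differently, for example with positions into a shared buffer.

```python
class SharedStore:
    """Holds each tuple once, together with the stubs that currently contain it."""

    def __init__(self):
        self._holders = {}            # tuple -> set of stub ids

    def insert(self, stub_id, tup):
        # stored as soon as any one stub inserts it
        self._holders.setdefault(tup, set()).add(stub_id)

    def remove(self, stub_id, tup):
        holders = self._holders.get(tup)
        if holders is None:
            return
        holders.discard(stub_id)
        if not holders:               # dropped only when no stub holds it anymore
            del self._holders[tup]

    def contents(self):
        return set(self._holders)

    def view(self, stub_id):
        return {t for t, ids in self._holders.items() if stub_id in ids}


class Stub:
    """Lightweight stand-in for a synopsis; all tuples live in the shared store."""

    def __init__(self, store, stub_id):
        self.store, self.stub_id = store, stub_id

    def insert(self, tup):
        self.store.insert(self.stub_id, tup)

    def remove(self, tup):
        self.store.remove(self.stub_id, tup)

    def tuples(self):
        return self.store.view(self.stub_id)
```

With one stub for S1 [Rows 1000] and one for S1 [Rows 200], a tuple that slides out of the 200-row window stays in the store until it also leaves the 1000-row window.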
62
Any questions?
63
Stream Constraints
For many queries, large or unbounded state is required for arbitrary streams
Orders (orderID, customer, cost)
Fulfillments (orderID, portion, clerk)
Select Sum(O.cost)
From Orders O, Fulfillments F [Range 1 Day]
Where O.orderID = F.orderID And F.clerk = "Sue" And O.customer = "Joe"
If there are no constraints, we have to keep all Orders tuples.
64
k-constraints
But streams may exhibit constraints that reduce, bound, or even eliminate state
– Clustered
– Ordered
– Stream-based referential integrity
Relaxed version: k-constraints
65
Clustered-arrival k-constraint
A clustered-arrival k-constraint on a stream attribute S.A defines a bound k on the distance between any two elements that have the same value of S.A.
Orders (orderID, customer, cost)
Fulfillments (orderID, portion, clerk)
Select Sum(O.cost)
From Orders O, Fulfillments F [Range 1 Day]
Where O.orderID = F.orderID And F.clerk = "Sue" And O.customer = "Joe"
If Fulfillments is k-clustered on orderID, we can infer when to discard an Orders tuple
Once more than k tuples with orderID <> oID1 have arrived after the first Fulfillments tuple with orderID = oID1, we can discard the Orders tuple with orderID = oID1.
For the special case of k = 0 for this constraint, the Fulfillments stream is strictly clustered.
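The discard rule for a k-clustered Fulfillments stream can be sketched as follows. The function name and the list-of-orderIDs input are assumptions for illustration, and the sketch trusts that the stream really satisfies the constraint; it simply counts, for each orderID seen so far, how many tuples with a different orderID have arrived since its first occurrence.

```python
def clustered_discards(fulfillment_ids, k):
    """Return (position, order_id) pairs: at that position in the Fulfillments
    stream, the Orders tuple with that order_id can be discarded."""
    others_seen = {}   # order_id -> tuples with a different id since its first arrival
    discards = []
    for pos, oid in enumerate(fulfillment_ids):
        expired = []
        for seen in others_seen:
            if seen != oid:
                others_seen[seen] += 1
                if others_seen[seen] > k:      # more than k other tuples: cluster is over
                    expired.append(seen)
        for seen in expired:
            del others_seen[seen]
            discards.append((pos, seen))
        others_seen.setdefault(oid, 0)
    return discards

print(clustered_discards([1, 2, 1, 2, 3, 3, 4], k=1))  # [(3, 1), (4, 2)]
print(clustered_discards([1, 1, 2, 2, 3], k=0))        # [(2, 1), (4, 2)]
```

With k = 0 an order can be dropped as soon as the first tuple with a different orderID shows up, which is why smaller k means less retained state.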
66
Ordered-arrival k-constraint
An ordered-arrival k-constraint on a stream attribute S.A defines a bound k on the amount of reordering in values of S.A. Specifically, given any tuple s in stream S, for all tuples s' that arrive at least k + 1 elements after s, it must be true that s'.A >= s.A (or s'.A <= s.A)*.
Select Sum(O.cost)
From Orders O, Fulfillments F [Range 1 Day]
Where O.orderID = F.orderID And F.clerk = "Sue" And O.customer = "Joe"
If Fulfillments is k-ordered on orderID, we can infer when to discard Orders tuples
Once more than k tuples with orderID <> oID1 have arrived after the first Fulfillments tuple with orderID = oID1, we can discard the Orders tuples with orderID <= oID1 (or orderID >= oID1)*.
– We may discard a batch of Orders tuples at one time
For the special case of k = 0 for this constraint, the Fulfillments stream is strictly ordered.
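One way to read the ordered-arrival definition operationally, for the ascending case: after reading position p, every element at position p - k or earlier constrains all future arrivals, so the largest of those values is a lower bound on everything still to come. The sketch below computes that bound; the function name is hypothetical, the stream is assumed to satisfy the constraint, and the sketch takes the conservative reading that only Orders tuples with orderID strictly below the bound are safe to drop (an equal value may still arrive under s'.A >= s.A).

```python
def ordered_low_watermarks(values, k):
    """After each arrival, return the largest bound b such that all future
    values are guaranteed to be >= b (None while no bound is known yet)."""
    marks = []
    bound = None
    for pos in range(len(values)):
        if pos - k >= 0:
            # the element k positions back now constrains every future arrival
            old = values[pos - k]
            bound = old if bound is None else max(bound, old)
        marks.append(bound)
    return marks

print(ordered_low_watermarks([1, 3, 2, 4, 5], k=1))  # [None, 1, 3, 3, 4]
print(ordered_low_watermarks([1, 2, 3], k=0))        # [1, 2, 3]
```

Each time the bound rises, all buffered Orders tuples below it can be discarded together, which is the batch discard mentioned above.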
67
Referential integrity k-constraint
A referential integrity k-constraint on a many-one join between streams defines a bound k on the delay between the arrival of a tuple on the "many" stream and the arrival of its joining "one" tuple on the other stream.
Select Sum(O.cost)
From Orders O, Fulfillments F [Range 1 Day]
Where O.orderID = F.orderID And F.clerk = "Sue" And O.customer = "Joe"
If Fulfillments is k-referential-integrity on orderID, we can infer when to discard an Orders tuple
After we get an Orders tuple with orderID = oID1, all Fulfillments tuples with the same orderID will arrive within k tuples in the Fulfillments stream; so we can discard the Orders tuple with orderID = oID1 after we read k tuples in the Fulfillments stream.
For the special case of k = 0 for this constraint, termed strict referential integrity, the corresponding Fulfillments tuples will always arrive before the Orders tuple. (Though that is not logical in this example.)
68
Query execution plans reduce or eliminate state based on k-constraints
– The smaller the value of k for each constraint, the more state that can be discarded.
69
Exploiting Constraints
Stream data may be unpredictable and variable, so …
Continuously monitor streams to identify k-constraints relevant to queries
If constraints violated, get approximate results
Details in: "Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams", TODS 2004
70
Any questions?
71
Operator Scheduling
Many possible scheduling objectives: minimize computation, memory use, latency, inaccuracy, starvation, …
72
Operator Scheduling
Many possible scheduling objectives: minimize computation, memory use, latency, inaccuracy, starvation, …
If the operator sequence is not fixed
– Optimize the sequence, reorder the sequence
– Pipelined Filters, “A-Greedy”
If the operator sequence is fixed
– Optimize the operator scheduling at run time to minimize the memory use
– “Chain”
73
Pipelined Filters
[Figure: pipeline of Filter1, Filter2, Filter3; labels: Packets, Bad packets]
Commutative filters over a stream
Example: Track HTTP packets with destination address matching a prefix in given table and content matching
Simple to complex filters
– Boolean predicates
– Table lookups
– Pattern matching
– User-defined functions
74
Pipelined Filters: Problem Definition
Commutative filters: F1, F2, …, Fn
Plan: Tuples → Fπ(1) → Fπ(2) → … → Fπ(n)
Goal: Minimize expected cost to process a tuple
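The cost being minimized can be made concrete with a small Python sketch. It assumes, hypothetically, that we have a sample of tuples for which the outcome of every filter is known, and that each filter has a fixed per-tuple cost; the function names are made up for illustration. The brute-force search is only practical for a handful of filters.

```python
from itertools import permutations


def expected_cost(order, costs, sample):
    """Average per-tuple cost of running the filters in `order` over `sample`.

    costs[i] -- cost of evaluating filter i on one tuple
    sample   -- list of rows; row[i] is True if filter i drops that tuple
    """
    total = 0.0
    for row in sample:
        for i in order:
            total += costs[i]   # the tuple reaches filter i, so we pay for it
            if row[i]:          # filter i drops the tuple: nothing more to pay
                break
    return total / len(sample)


def best_order(costs, sample):
    """Try every permutation pi and keep the cheapest one."""
    return min(permutations(range(len(costs))),
               key=lambda order: expected_cost(order, costs, sample))


costs = [1.0, 1.0, 5.0]
sample = [
    [True, False, False],
    [True, True, False],
    [False, True, False],
    [False, False, False],
]
cheap_first = expected_cost((0, 1, 2), costs, sample)       # 2.75
expensive_first = expected_cost((2, 1, 0), costs, sample)   # 6.5
```

Running the expensive filter last pays its cost only for tuples that no cheaper filter drops.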
75
Pipelined Filters: Example
[Figure: input tuples passing through filters F1, F2, F3, F4 to output tuples]
Informal Goal: If a tuple will be dropped, then drop it as cheaply as possible
76
Why is Our Problem Hard?
High drop-rate first
Low cost first
High drop-rate/cost first
Filter drop-rates and costs can change over time
Filters can be correlated
– E.g., Protocol = HTTP and DestPort = 80
77
Metrics for an Adaptive Algorithm
Speed of adaptivity
– Detecting changes and finding new plan
Run-time overhead
– Re-optimization, collecting statistics, plan switching
Convergence properties
– Plan properties under stable statistics
[Figure: StreaMon components: Profiler, Re-optimizer, Executor]
78
Pipelined Filters: Stable Statistics
Assume statistics are not changing
– Order filters by decreasing drop-rate/cost
– Correlations → NP-Hard
Greedy algorithm: Use conditional statistics
1. Fπ(1) has maximum drop-rate/cost
2. Fπ(2) has maximum drop-rate/cost ratio for tuples not dropped by Fπ(1)
3. And so on …
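The Greedy rule with conditional statistics can be sketched as follows. This is an illustrative Python version under the assumption that the statistics come from a sample of tuples with known filter outcomes; the function name and the tie-breaking rule (lower filter index wins) are choices made here, not taken from the slides.

```python
def greedy_order(costs, sample):
    """Order filters by conditional drop-rate/cost.

    costs[i] -- per-tuple cost of filter i
    sample   -- rows of 0/1 flags; row[i] == 1 if filter i drops that tuple
    """
    remaining = set(range(len(costs)))
    survivors = list(sample)     # tuples not dropped by the filters chosen so far
    order = []
    while remaining:
        def ratio(i):
            if not survivors:
                return 0.0
            drop_rate = sum(row[i] for row in survivors) / len(survivors)
            return drop_rate / costs[i]

        # max() keeps the first maximum, so ties go to the lower filter index.
        best = max(sorted(remaining), key=ratio)
        order.append(best)
        remaining.remove(best)
        survivors = [row for row in survivors if not row[best]]
    return order


# Filters 0 and 1 are correlated: filter 1 only drops tuples filter 0 also drops.
sample = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
]
order = greedy_order([1.0, 1.0, 1.0], sample)   # [0, 2, 1]
```

Filters 1 and 2 have the same unconditional drop-rate, but once filter 0 has run, filter 1 drops nothing more, so the conditional statistics place filter 2 ahead of it.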
79
Adaptive Version of Greedy
Challenge:
– Online algorithm
– Fast adaptivity to Greedy ordering
– Low run-time overhead
A-Greedy: Adaptive Greedy
80
A-Greedy
Profiler: Maintains conditional filter drop-rates and costs over recent tuples
Executor: Processes tuples with current Greedy ordering
Re-optimizer: Ensures that filter ordering is Greedy for current statistics
[Figure labels: “Estimated statistics”, “Which statistics are required”, “Combined in part for efficiency”, “Changes in filter ordering”]
81
Main innovation: A-Greedy’s Profiler
Responsible for maintaining current statistics
– Filter costs
– Conditional filter drop-rates: exponential!
Profile Window: Sampled statistics from which required conditional drop-rates can be estimated
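A profile window of this kind could look roughly like the Python sketch below. It is an assumption-laden illustration rather than A-Greedy's actual profiler: the class and parameter names are hypothetical, and sampling every m-th tuple is used here only to keep the example deterministic.

```python
from collections import deque


class ProfileWindow:
    """Sliding window of sampled tuples, one drop flag per filter."""

    def __init__(self, filters, window_size, sample_every):
        self.filters = filters                  # predicates; True means "tuple passes"
        self.rows = deque(maxlen=window_size)   # oldest rows fall out automatically
        self.sample_every = sample_every
        self.seen = 0

    def observe(self, tup):
        """Profile every `sample_every`-th tuple by running all filters on it."""
        self.seen += 1
        if self.seen % self.sample_every == 0:
            self.rows.append([not f(tup) for f in self.filters])

    def conditional_drop_rate(self, i, earlier):
        """Estimate P(filter i drops | tuple passed every filter in `earlier`)."""
        survivors = [r for r in self.rows if not any(r[j] for j in earlier)]
        if not survivors:
            return 0.0
        return sum(r[i] for r in survivors) / len(survivors)


filters = [lambda x: x % 2 == 0, lambda x: x % 4 == 0]
window = ProfileWindow(filters, window_size=8, sample_every=1)
for value in range(8):
    window.observe(value)

unconditional = window.conditional_drop_rate(1, [])    # 0.75
after_first = window.conditional_drop_rate(1, [0])     # 0.5
```

Storing raw per-tuple drop flags means any conditional drop-rate can be computed on demand, instead of maintaining the exponential number of conditional rates explicitly.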
82
Profile Window
[Figure: tuples flowing through filters F1, F2, F3, F4, with a profile window whose rows are 0 1 1 0, 0 0 1 1, 1 0 0 1, 1 0 0 1]
83
Greedy Ordering Using Profile Window
[Figure: matrix view of the profile window and the resulting Greedy ordering; filter orders shown: F1 F2 F3 F4; F3 F1 F2 F4; F3 F2 F4 F1]
84
Conclusions of A-Greedy
Fast adaptivity to Greedy ordering
– Running cost of A-Greedy itself is low
It finds the best plan in almost all cases.
– Running cost of the filters is almost the best possible
Low run-time overhead
Details in: “Adaptive Ordering of Pipelined Stream Filters”, SIGMOD 2004
85
Any questions?
86
Operator Scheduling at Run Time
Problem: The operator sequence is fixed.
Goal: Optimize the operator scheduling at run time to minimize memory use
87
A simple example
Two operators, O1 followed by O2
– O1 takes one time unit to process a batch of n elements, and it produces 0.2n output elements per input batch.
– O2 takes one time unit to operate on 0.2n elements, and it sends its output out of the system.
Consider the following arrival pattern: n elements arrive at every time instant from t = 0 to t = 6, then no elements arrive from time t = 7 through t = 13.
[Figure: operator O1 feeding operator O2]
88
FIFO scheduling & Greedy scheduling
FIFO scheduling: When batches of n elements have been accumulated, they are passed through both operators in two consecutive time units, during which no other element is processed.
Greedy scheduling: At any time instant, if there is a batch of n elements buffered before O1, it is processed in one time unit. Otherwise, if there are at least 0.2n elements buffered before O2, then 0.2n elements are processed using one time unit. This strategy is "greedy" since it gives preference to the operator that has the greatest rate of reduction in total queue size per unit time.
89
FIFO scheduling vs Greedy scheduling
(Mem. in units of n)
Time           0     1     2     3     4     5     6
Greedy Op.           O1    O1    O1    O1    O1    O1
Greedy Mem.    1.0   1.2   1.4   1.6   1.8   2.0   2.2
FIFO Op.             O1    O2    O1    O2    O1    O2
FIFO Mem.      1.0   1.2   2.0   2.2   3.0   3.2   4.0

Time           7     8     9     10    11    12    13    14    Avg. (0~13)
Greedy Op.     O1    O2    O2    O2    O2    O2    O2    O2
Greedy Mem.    1.4   1.2   1.0   0.8   0.6   0.4   0.2   0.0   1.2
FIFO Op.       O1    O2    O1    O2    O1    O2    O1    O2
FIFO Mem.      3.2   3.0   2.2   2.0   1.2   1.0   0.2   0.0   2.1
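The memory figures for the two-operator example can be reproduced with a short simulation. The Python sketch below is an illustration under stated assumptions: memory is sampled right after each arrival, queue sizes are tracked in tenths of n to keep the arithmetic exact, and on this particular plan FIFO is modelled as "finish the batch in flight first", which means preferring O2 over O1. The function name is hypothetical.

```python
def simulate(prefer, arrivals_until=6, horizon=14):
    """Total queued elements (in units of n) at each time 0..horizon.

    prefer -- "O1" models Greedy, "O2" models FIFO on this two-operator plan.
    """
    q1 = q2 = 0    # tenths of n waiting before O1 and before O2
    memory = []
    for t in range(horizon + 1):
        if t <= arrivals_until:
            q1 += 10                     # a batch of n elements arrives
        memory.append((q1 + q2) / 10)
        can_o1, can_o2 = q1 >= 10, q2 >= 2
        if can_o1 and (prefer == "O1" or not can_o2):
            q1 -= 10                     # O1 consumes n elements ...
            q2 += 2                      # ... and emits 0.2n
        elif can_o2:
            q2 -= 2                      # O2 sends 0.2n out of the system
    return memory


greedy = simulate("O1")
fifo = simulate("O2")
greedy_avg = round(sum(greedy[:14]) / 14, 1)   # 1.2
fifo_avg = round(sum(fifo[:14]) / 14, 1)       # 2.1
```

Greedy drains the big reduction at O1 first, so its queues stay smaller throughout the burst.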
90
Is Greedy scheduling good enough?
In the above example, Greedy scheduling seems to be a good approach.
Is it good enough?
[Figure: operator O1 feeding operator O2]
91
Another example
O1 produces 0.9n elements per n input elements in one time unit
O2 processes 0.9n elements in one time unit without changing the input size
O3 processes 0.9n elements in one time unit and sends its output out of the system
Priority: O3 > O1 > O2
[Figure: operator O1 feeding O2, which feeds O3]
92
FIFO scheduling vs Greedy scheduling
In this case, FIFO scheduling is better than Greedy scheduling
(Mem. in units of n)
Time           0     1     2     3     4     5     6
Greedy Op.           O1    O1    O1    O1    O1    O1
Greedy Mem.    1.0   1.9   2.8   3.7   4.6   5.5   6.4
FIFO Op.             O1    O2    O3    O1    O2    O3
FIFO Mem.      1.0   1.9   2.9   3.0   3.9   4.9   5.0

Time           7     8     9     10    11    …     21    Avg. (0~20)
Greedy Op.     O1    O2    O3    O2    O3    …     O3
Greedy Mem.    6.3   6.3   5.4   5.4   4.5   …     0.0   3.6
FIFO Op.       O1    O2    O3    O1    O2    …     O3
FIFO Mem.      4.9   4.9   4.0   3.9   3.9   …     0.0   2.9
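The three-operator example can be checked the same way with a priority-driven simulation. In this Python sketch, Greedy uses the priority O3 > O1 > O2 given in the example, and FIFO is modelled as the priority O3 > O2 > O1, since finishing the batch in flight first behaves like FIFO on this plan. That modelling choice, the arrival pattern (n elements at t = 0 to 6, inferred from the memory figures), and the function name are assumptions made for illustration.

```python
def simulate(priority, arrivals_until=6, horizon=21):
    """Memory (in units of n) at times 0..horizon under a static priority.

    priority -- operator indices, highest priority first (0 = O1, 1 = O2, 2 = O3).
    Queues are kept in tenths of n so that all arithmetic is exact.
    """
    consume = [10, 9, 9]   # input needed for one run: n, 0.9n, 0.9n
    produce = [9, 9, 0]    # output of one run: 0.9n, 0.9n, nothing (leaves system)
    queues = [0, 0, 0]
    memory = []
    for t in range(horizon + 1):
        if t <= arrivals_until:
            queues[0] += 10                  # a batch of n elements arrives
        memory.append(sum(queues) / 10)
        for op in priority:                  # run the first operator that has input
            if queues[op] >= consume[op]:
                queues[op] -= consume[op]
                if op + 1 < len(queues):
                    queues[op + 1] += produce[op]
                break
    return memory


greedy = simulate([2, 0, 1])   # O3 > O1 > O2
fifo = simulate([2, 1, 0])     # behaves like FIFO on this plan
greedy_avg = round(sum(greedy[:21]) / 21, 1)   # 3.6
fifo_avg = round(sum(fifo[:21]) / 21, 1)       # 2.9
```

Under Greedy, O1 keeps running during the burst even though it removes only 0.1n per time unit, so the queue before O2 builds up.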
93
Greedy scheduling
Under the greedy strategy, although O3 has highest priority, sometimes it is "blocked" from running because it is preceded by O2, the operator with the lowest priority.
If O1, O2 and O3 are viewed as a single block, then together they reduce n elements to zero elements over three units of time, for an average reduction of 0.33n elements per unit time, which is better than the reduction rate of 0.1n elements per unit time that O1 provides.
– Since the greedy algorithm considers individual operators only, it does not take advantage of this fact.
94
Chain Scheduling
The chain scheduling algorithm forms blocks (“chains”) of operators as follows:
1. Start by marking the first operator in the plan as the “current” operator.
2. Next, find the block of consecutive operators starting at the "current" operator that maximizes the reduction in total queue size per unit time.
3. Mark the first operator following this block as the "current" operator and repeat the previous step until all operators have been assigned to chains.
Chains are scheduled according to the greedy algorithm, but within a chain, execution proceeds in FIFO order.
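The chain-forming steps can be sketched in Python as follows. The sketch assumes each operator is described by a (time, selectivity) pair with a positive time, where time is what the operator spends on whatever remains of one unit of input; the function name is hypothetical, and breaking ties in favour of the shorter block is a choice made here, not something stated on the slides.

```python
def build_chains(operators):
    """Partition a fixed operator sequence into chains.

    operators -- list of (time, selectivity) pairs, in plan order
    Returns a list of chains, each a list of operator indices.
    """
    # Cumulative (elapsed time, remaining size) after each operator.
    points = [(0.0, 1.0)]
    for time, selectivity in operators:
        t, size = points[-1]
        points.append((t + time, size * selectivity))

    chains, current = [], 0
    while current < len(operators):
        t0, s0 = points[current]

        def rate(j):
            # Queue-size reduction per unit time if the block ends after operator j.
            t1, s1 = points[j + 1]
            return (s0 - s1) / (t1 - t0)

        end = max(range(current, len(operators)), key=rate)
        chains.append(list(range(current, end + 1)))
        current = end + 1
    return chains


two_ops = build_chains([(1, 0.2), (1, 0.0)])               # [[0], [1]]
three_ops = build_chains([(1, 0.9), (1, 1.0), (1, 0.0)])   # [[0, 1, 2]]
```

For the two-operator example each operator forms its own chain, so Chain coincides with Greedy there; for the three-operator example all operators fall into one chain, which is then run in FIFO order.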
95
Implied assumption
We assume that the selectivities and per-tuple processing times are known for each operator. We use these to construct the progress chart as explained above.
96
Gather statistics
Selectivities and processing times could be learned during query execution by gathering statistics over a period of time. If we expect these values to change over time, we could use the following strategy:
1. divide time into fixed windows and collect statistics independently in each window;
2. use the statistics from the ith window to compute the progress chart for the (i + 1)st window.
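The two-step strategy can be sketched for a single operator as below. This Python version is a minimal illustration with hypothetical names; it assumes the scheduler reports, for each operator run, when it finished, how many tuples went in and out, and how long it took.

```python
class WindowedStats:
    """Selectivity and per-tuple time for one operator, one window behind."""

    def __init__(self, window_length):
        self.window_length = window_length   # fixed window size, in time units
        self.window_index = 0
        self.previous = None                 # (selectivity, time per tuple) of the
                                             # last completed non-empty window
        self._reset()

    def _reset(self):
        self.tuples_in = 0
        self.tuples_out = 0
        self.busy_time = 0.0

    def record(self, now, tuples_in, tuples_out, busy_time):
        """Log one operator run that finished at time `now`."""
        index = int(now // self.window_length)
        if index != self.window_index:       # crossed a window boundary
            if self.tuples_in > 0:
                self.previous = (self.tuples_out / self.tuples_in,
                                 self.busy_time / self.tuples_in)
            self._reset()                    # statistics are independent per window
            self.window_index = index
        self.tuples_in += tuples_in
        self.tuples_out += tuples_out
        self.busy_time += busy_time


stats = WindowedStats(window_length=10)
stats.record(now=1, tuples_in=100, tuples_out=20, busy_time=2.0)
stats.record(now=5, tuples_in=100, tuples_out=30, busy_time=2.0)
before = stats.previous                      # None: window 0 is still open
stats.record(now=12, tuples_in=50, tuples_out=50, busy_time=1.0)
after = stats.previous                       # (0.25, 0.02), from window 0
```

While window i + 1 is running, `previous` holds the estimates from window i, which are the values the progress chart would be built from.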
97
Multi-stream queries
The query plan is a tree instead of a queue
98
Any questions?
99
Omitted topics
Coping with Overload
– “Load-shedding” ≈ discarding tuples
– What is the definition of “best”?
100
Omitted topics
Coping with Changing Conditions
Continuous queries are long-running; conditions may change
– Data characteristics, arrival characteristics, query load, available resources, system conditions, …
– Solution: self-monitoring and adaptivity
Other results:
– Adaptive operator reordering
– Adaptive caching
101
References
http://www-db.stanford.edu/stream/
STREAM: The Stanford Data Stream Management System, to appear in a book on data stream management
STREAM: The Stanford Stream Data Manager, IEEE Data Engineering Bulletin 2003
Models and Issues in Data Stream Systems, PODS 2002
The CQL Continuous Query Language: Semantic Foundations and Query Execution, to appear in VLDB Journal
Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams, TODS 2004
Adaptive Ordering of Pipelined Stream Filters, SIGMOD 2004
Operator Scheduling in Data Stream Systems, to appear in VLDB Journal
102
Thanks