Query Processing, Resource Management, and Approximation in a Data Stream Management System
description
Transcript of Query Processing, Resource Management, and Approximation in a Data Stream Management System
Query Processing, Resource Query Processing, Resource Management, and Approximation in a Management, and Approximation in a
Data Stream Management SystemData Stream Management System
Selected subset of slides taken from talk by Jennifer Widom at NEDS.
stanfordstreamdatamanager
stanfordstreamdatamanager 2
Data StreamsData Streams• Stream = Continuous, unbounded, rapid, time-
varying streams of data elements
• DSMSDSMS = Data Stream Management System
stanfordstreamdatamanager 3
The STREAM SystemThe STREAM System
• Declarative language for registering continuous queries; considering data streams and stored relations
• Formal semantics; more theoretical team
stanfordstreamdatamanager 4
Contributions to DateContributions to Date
• Semantics for continuous queries
• Query plans
• Exploiting stream constraints
• Operator scheduling
• Approximation techniques
stanfordstreamdatamanager 5
DSMS
Scratch Store
The (Simplified) Big PictureThe (Simplified) Big Picture
Input streams
RegisterQuery
StreamedResult
StoredResult
ArchiveStored
Relations
stanfordstreamdatamanager 6
(Simplified) Network Monitoring(Simplified) Network Monitoring
RegisterMonitoring
Queries
DSMS
Scratch Store
Network measurements,Packet traces
IntrusionWarnings
OnlinePerformance
Metrics
ArchiveLookupTables
stanfordstreamdatamanager 7
Declarative Language for Declarative Language for Continuous QueriesContinuous Queries
• A distinction between STREAM and Aurora :– Aurora users directly manipulate one large
execution plan– STREAM compiles declarative queries into
individual plans, system may merge plans
• Syntax based on SQL, additional constructs for sliding windows and sampling
stanfordstreamdatamanager 8
Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)
stanfordstreamdatamanager 9
Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)
Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”
Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”
stanfordstreamdatamanager 10
Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)
Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”
Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”
stanfordstreamdatamanager 11
Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)
Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”
Select Sum(O.cost)From Orders O Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderIDWhere O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”
stanfordstreamdatamanager 12
Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)
Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”
Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue”And F.clerk = “Sue” And O.customer = “Joe”And O.customer = “Joe”
stanfordstreamdatamanager 13
Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)
Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”
Select Sum(O.cost)Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”
stanfordstreamdatamanager 14
Example Query 2Example Query 2Using a 10% sample of the Fulfillments stream,
take the 5 most recent fulfillments for each clerk and return the maximum cost
Select F.clerk, Max(O.cost)From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerk
stanfordstreamdatamanager 15
Example Query 2Example Query 2Using a 10% sample of the Fulfillments stream,
take the 5 most recent fulfillments for each clerk and return the maximum cost
Select F.clerk, Max(O.cost)From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleFulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerk
stanfordstreamdatamanager 16
Example Query 2Example Query 2Using a 10% sample of the Fulfillments stream,
take the 5 most recent fulfillments for each clerk and return the maximum cost
Select F.clerk, Max(O.cost)From Orders O, From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDWhere O.orderID = F.orderIDGroup By F.clerk
stanfordstreamdatamanager 17
Example Query 2Example Query 2Using a 10% sample of the Fulfillments stream,
take the 5 most recent fulfillments for each clerk and return the maximum cost
Select F.clerk, Max(O.cost)Select F.clerk, Max(O.cost)From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerkGroup By F.clerk
stanfordstreamdatamanager 18
Semantics of Database LanguagesSemantics of Database Languages• An often neglected topic
• Traditional relational databases are in reasonable shape– Relational algebra SQL
• But triggers were a mess
• The semantics of an innocent-looking continuous query over data streams may not be obvious
stanfordstreamdatamanager 19
A Nonobvious Continuous QueryA Nonobvious Continuous Query• Stream of stock quotes: Stocks(ticker,price)
• Monitor last 10 minutes of quotes:Select From Stocks [Range 10 minutes]
• Is result a relation, a stream, or something else?
• If a relation, what exactly does it contain?
• If a stream, how does query differ from:Select From Stocks [Range 1 minute]or Select From Stocks []
stanfordstreamdatamanager 20
Our Semantics and Language Our Semantics and Language for Continuous Queries for Continuous Queries
• Abstract: Abstract: interpretation for CQs based on certain “black boxes”
• Concrete: Concrete: SQL-based instantiation for our system; includes syntactic shortcuts, defaults, equivalences
• Goals– CQs over multiple streams and relations– Exploit relational semantics to the extent possible– Easy queries should be easy to write, simple
queries should do what you expect
stanfordstreamdatamanager 21
Relations and StreamsRelations and Streams• Assume global, discrete, ordered time domain
• Relation– Maps time T to set-of-tuples R
• Stream– Set of (tuple,timestamp) elements
stanfordstreamdatamanager 22
ConversionsConversions
Streams Relations
Window specification
Special operators:Istream, Dstream, Rstream
Any relational query language
stanfordstreamdatamanager 23
Conversion DefinitionsConversion Definitions• Stream-to-relation
– S [W] is a relation — at time T it contains all tuples in window W applied to stream S up to T
– When W = , contains all tuples in stream S up to T
• Relation-to-stream– Istream(R) contains all (r,T ) where rR at time T
but rR at time T–1– Dstream(R) contains all (r,T ) where rR at time
T–1 but rR at time T– Rstream(R) contains all (r,T ) where rR at time T
stanfordstreamdatamanager 24
Abstract SemanticsAbstract Semantics• Take any relational query language
• Can reference streams in place of relations– But must convert to relations using any window
specification language ( default window = [] )
• Can convert relations to streams– For streamed results– For windows over relations
(note: converts back to relation)
stanfordstreamdatamanager 25
Query Result at Time Query Result at Time TT• Use all relations at time T
• Use all streams up to T, converted to relations
• Compute relational result
• Convert result to streams if desired
stanfordstreamdatamanager 26
TimeTime• Easiest: global system clock
– Stream elements and relation updates timestamped on entry to system
• Application-defined time– Streams and relation updates contain application
timestamps, may be out of order– Application generates “heartbeat”
• Or deduce heartbeat from parameters: stream skew, scrambling, latency, and clock progress
– Query results in application time
stanfordstreamdatamanager 27
Abstract Semantics – Example 1Abstract Semantics – Example 1Select F.clerk, Max(O.cost)From O [], F [Rows 1000]Where O.orderID = F.orderIDGroup By F.clerk
• Maximum-cost order fulfilled by each clerk in last 1000 fulfillments
stanfordstreamdatamanager 28
Abstract Semantics – Example 1Abstract Semantics – Example 1Select F.clerk, Max(O.cost)From O [], F [Rows 1000]Where O.orderID = F.orderIDGroup By F.clerk
• At time T: entire stream O and last 1000 tuples of F as relations
• Evaluate query, update result relation at T
stanfordstreamdatamanager 29
Abstract Semantics – Example 1Abstract Semantics – Example 1Select IstreamIstream(F.clerk, Max(O.cost))From O [], F [Rows 1000]Where O.orderID = F.orderIDGroup By F.clerk
• At time T: entire stream O and last 1000 tuples of F as relations
• Evaluate query, update result relation at T• Streamed result:Streamed result: New element
(<clerk,max>,T) whenever <clerk,max> changes from T–1
stanfordstreamdatamanager 30
Abstract Semantics – Example 2Abstract Semantics – Example 2Relation CurPrice(stock, price)
Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock
• Average price over last day for each stock
stanfordstreamdatamanager 31
Abstract Semantics – Example 2Abstract Semantics – Example 2Relation CurPrice(stock, price)
Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock
• Istream provides history of CurPrice
• Window on history, back to relation, group and aggregate
stanfordstreamdatamanager 32
Concrete Language – CQLConcrete Language – CQL• Relational query language: SQL
• Window spec. language derived from SQL-99– Tuple-based, time-based, partitioned
• Syntactic shortcuts and defaultsSyntactic shortcuts and defaults– So easy queries are easy to write and simple
queries do what you expect
• EquivalencesEquivalences– Basis for query-rewrite optimizations– Includes all relational equivalences, plus new
stream-based ones
stanfordstreamdatamanager 33
Two Extremely Simple Two Extremely Simple CQL ExamplesCQL Examples
Select From Strm• Had better return Strm (It does)
– Default window for Strm– Default Istream for result
Select From Strm, Rel Where Strm.A = Rel.B• Often want “NOW” window for Strm• But may not want as default
stanfordstreamdatamanager 34
Query ExecutionQuery Execution• When a continuous query is registered,
generation a query planquery plan– Users can also register plans directly
• Plans composed of three main components:– OperatorsOperators (as in most conventional DBMS’s)– Inter-operator QueuesQueues (as in many conventional
DBMS’s)– State (synopses)State (synopses)
• Global schedulerscheduler for plan execution
stanfordstreamdatamanager 35
Operators and StateOperators and State• State (synopsessynopses)
– Summarize tuples seen so far (exact or approximate) for operators requiring history
– To implement windows
• Example: synopsis join– Sliding-window join– Approximation of full join
State1 State2⋈
stanfordstreamdatamanager 36
Simple Query PlanSimple Query Plan
State4⋈State3
Stream1 Stream2
Stream3
Q1 Q2
State1 State2⋈
Scheduler
stanfordstreamdatamanager 37
Some Issues in Query Plan Generation Some Issues in Query Plan Generation • Compatibility and conversions for streams and
relations (+/- streams+/- streams)
• State sharing, incremental computation
• Windowed joinsWindowed joins: Multiway versus 2-way
• Windows in general: push down, pull up, split, merge, …
• Time coordination, operator-level heartbeats
stanfordstreamdatamanager 38
Memory Overhead in Query ProcessingMemory Overhead in Query Processing• Queues + State
• Continuous queries keep state indefinitely
• Online requirements suggest using memory rather than disk– But we realize this assumption is shaky
stanfordstreamdatamanager 39
Memory Overhead in Query ProcessingMemory Overhead in Query Processing• Queues + State
• Continuous queries keep state indefinitely
• Online requirements suggest using memory rather than disk– But we realize this assumption is shaky
• Goal: minimize memory use while providing Goal: minimize memory use while providing timely, accurate answerstimely, accurate answers
stanfordstreamdatamanager 40
Reducing Memory OverheadReducing Memory Overhead
Two main techniques to date
1) Exploit constraints on streamsconstraints on streams to reduce statereduce state
2) Clever operator schedulingoperator scheduling to reduce queue sizesreduce queue sizes
stanfordstreamdatamanager 41
Exploiting Stream ConstraintsExploiting Stream Constraints• For most queries, unbounded memory is
required for arbitrary streams [PODS ’01]
stanfordstreamdatamanager 42
Exploiting Stream ConstraintsExploiting Stream Constraints• For most queries, unbounded memory is
required for arbitraryarbitrary streams [PODS ’01]
• But streams may exhibit constraints that reduce, bound, or even eliminate state
stanfordstreamdatamanager 43
Exploiting Stream ConstraintsExploiting Stream Constraints• For most queries, unbounded memory is
required for arbitrary streams [PODS ’01]
• But streams may exhibit constraints that reduce, bound, or even eliminate state
• Conventional database constraints– Redefined for streams– “Relaxed” for stream environment
stanfordstreamdatamanager 44
Stream ConstraintsStream Constraints• Each constraint type defines adherence adherence
parameterparameter k
• Clustered(k) for attribute S.A
• Ordered(k) for attribute S.A
• Referential-Integrity(k) for join S1 S2
stanfordstreamdatamanager 45
Algorithm for Exploiting ConstraintsAlgorithm for Exploiting Constraints• Input
– Any Select-Project-Join query over streams– Any set of k-constraints
• Output– Query execution plan that reduces or eliminates
state based on k-constraints– If constraints violated, get approximate resultIf constraints violated, get approximate result
stanfordstreamdatamanager 46
Constraint ExamplesConstraint ExamplesOrders (orderID, cost)Fulfillments (orderID, portion, clerk)Query: Many-one join F O
stanfordstreamdatamanager 47
Constraint ExamplesConstraint ExamplesOrders (orderID, cost)Fulfillments (orderID, portion, clerk)Query: Many-one join F O
• Clustered(k) on F.orderID Matched O tuples discarded after k arrivals of non-
matching F’s
stanfordstreamdatamanager 48
Constraint ExamplesConstraint ExamplesOrders (orderID, cost)Fulfillments (orderID, portion, clerk)Query: Many-one join F O
• Clustered(k) on F.orderID Matched O tuples discarded after k arrivals of non-
matching F’s
• Referential-Integrity(k) F tuples retained for at most k arrivals of O tuples
stanfordstreamdatamanager 49
Operator SchedulingOperator Scheduling• Global scheduler invokes run method of query
plan operators with “timeslice” parameter
• Many possible scheduling objectives: minimize latency, inaccuracy, memory use, computation, starvation, …– First scheduler: round-robinSecond scheduler: minimize queue sizes– Third scheduler: minimize combination of queue
sizes and latency
stanfordstreamdatamanager 50
ApproximationApproximation• Why approximate?
– Memory requirement too high, even with constraints and clever scheduling
– Can’t process streams fast enough for query load
stanfordstreamdatamanager 51
Approximation (cont’d)Approximation (cont’d)• Static:Static: rewrite queries to add (or shrink)
sampling or windows+ User can participate, predictable behavior– Doesn’t consider dynamic conditions
• Dynamic:Dynamic: modify query plan – insert sampling operators, shrink windows, load shedding+ Adapts to current resource availability– How to convey to user? (major open issue)(major open issue)
stanfordstreamdatamanager 52
The Holy GrailThe Holy Grail• Given:
– Declarative query– Resources– Constraints on streams
Generate plan and resource allocation that takes advantage of constraints and maximizes precision
• Do it for multiple (weighted) queries, dynamically and adaptively, and convey what’s happening to the user
stanfordstreamdatamanager 53
http://www-db.stanford.edu/streamhttp://www-db.stanford.edu/stream
Contributors: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Justin Rosenstein, Rohit Varma