Query Processing, Resource Management, and Approximation in a Data Stream Management System

53
Query Processing, Resource Query Processing, Resource Management, and Management, and Approximation in a Data Approximation in a Data Stream Management System Stream Management System Selected subset of slides taken from talk by Jennifer Widom at NEDS. stanfordstreamdatamanager

description

Query Processing, Resource Management, and Approximation in a Data Stream Management System. Selected subset of slides taken from talk by Jennifer Widom at NEDS. st anfordst re amdat am anager. Data Streams. Stream = Continuous, unbounded, rapid, time-varying streams of data elements - PowerPoint PPT Presentation

Transcript of Query Processing, Resource Management, and Approximation in a Data Stream Management System

Page 1: Query Processing, Resource Management, and Approximation in a Data Stream Management System

Query Processing, Resource Query Processing, Resource Management, and Approximation in a Management, and Approximation in a

Data Stream Management SystemData Stream Management System

Selected subset of slides taken from talk by Jennifer Widom at NEDS.

stanfordstreamdatamanager

Page 2: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 2

Data StreamsData Streams• Stream = Continuous, unbounded, rapid, time-

varying streams of data elements

• DSMSDSMS = Data Stream Management System

Page 3: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 3

The STREAM SystemThe STREAM System

• Declarative language for registering continuous queries; considering data streams and stored relations

• Formal semantics; more theoretical team

Page 4: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 4

Contributions to DateContributions to Date

• Semantics for continuous queries

• Query plans

• Exploiting stream constraints

• Operator scheduling

• Approximation techniques

Page 5: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 5

DSMS

Scratch Store

The (Simplified) Big PictureThe (Simplified) Big Picture

Input streams

RegisterQuery

StreamedResult

StoredResult

ArchiveStored

Relations

Page 6: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 6

(Simplified) Network Monitoring(Simplified) Network Monitoring

RegisterMonitoring

Queries

DSMS

Scratch Store

Network measurements,Packet traces

IntrusionWarnings

OnlinePerformance

Metrics

ArchiveLookupTables

Page 7: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 7

Declarative Language for Declarative Language for Continuous QueriesContinuous Queries

• A distinction between STREAM and Aurora :– Aurora users directly manipulate one large

execution plan– STREAM compiles declarative queries into

individual plans, system may merge plans

• Syntax based on SQL, additional constructs for sliding windows and sampling

Page 8: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 8

Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Page 9: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 9

Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

Page 10: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 10

Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

Page 11: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 11

Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)From Orders O Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderIDWhere O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

Page 12: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 12

Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue”And F.clerk = “Sue” And O.customer = “Joe”And O.customer = “Joe”

Page 13: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 13

Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

Page 14: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 14

Example Query 2Example Query 2Using a 10% sample of the Fulfillments stream,

take the 5 most recent fulfillments for each clerk and return the maximum cost

Select F.clerk, Max(O.cost)From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerk

Page 15: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 15

Example Query 2Example Query 2Using a 10% sample of the Fulfillments stream,

take the 5 most recent fulfillments for each clerk and return the maximum cost

Select F.clerk, Max(O.cost)From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleFulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerk

Page 16: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 16

Example Query 2Example Query 2Using a 10% sample of the Fulfillments stream,

take the 5 most recent fulfillments for each clerk and return the maximum cost

Select F.clerk, Max(O.cost)From Orders O, From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDWhere O.orderID = F.orderIDGroup By F.clerk

Page 17: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 17

Example Query 2Example Query 2Using a 10% sample of the Fulfillments stream,

take the 5 most recent fulfillments for each clerk and return the maximum cost

Select F.clerk, Max(O.cost)Select F.clerk, Max(O.cost)From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerkGroup By F.clerk

Page 18: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 18

Semantics of Database LanguagesSemantics of Database Languages• An often neglected topic

• Traditional relational databases are in reasonable shape– Relational algebra SQL

• But triggers were a mess

• The semantics of an innocent-looking continuous query over data streams may not be obvious

Page 19: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 19

A Nonobvious Continuous QueryA Nonobvious Continuous Query• Stream of stock quotes: Stocks(ticker,price)

• Monitor last 10 minutes of quotes:Select From Stocks [Range 10 minutes]

• Is result a relation, a stream, or something else?

• If a relation, what exactly does it contain?

• If a stream, how does query differ from:Select From Stocks [Range 1 minute]or Select From Stocks []

Page 20: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 20

Our Semantics and Language Our Semantics and Language for Continuous Queries for Continuous Queries

• Abstract: Abstract: interpretation for CQs based on certain “black boxes”

• Concrete: Concrete: SQL-based instantiation for our system; includes syntactic shortcuts, defaults, equivalences

• Goals– CQs over multiple streams and relations– Exploit relational semantics to the extent possible– Easy queries should be easy to write, simple

queries should do what you expect

Page 21: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 21

Relations and StreamsRelations and Streams• Assume global, discrete, ordered time domain

(more on this later)

• Relation– Maps time T to set-of-tuples R

• Stream– Set of (tuple,timestamp) elements

Page 22: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 22

ConversionsConversions

Streams Relations

Window specification

Special operators:Istream, Dstream, Rstream

Any relational query language

Page 23: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 23

Conversion DefinitionsConversion Definitions• Stream-to-relation

– S [W] is a relation — at time T it contains all tuples in window W applied to stream S up to T

– When W = , contains all tuples in stream S up to T

• Relation-to-stream– Istream(R) contains all (r,T ) where rR at time T

but rR at time T–1– Dstream(R) contains all (r,T ) where rR at time

T–1 but rR at time T– Rstream(R) contains all (r,T ) where rR at time T

Page 24: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 24

Abstract SemanticsAbstract Semantics• Take any relational query language

• Can reference streams in place of relations– But must convert to relations using any window

specification language ( default window = [] )

• Can convert relations to streams– For streamed results– For windows over relations

(note: converts back to relation)

Page 25: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 25

Query Result at Time Query Result at Time TT• Use all relations at time T

• Use all streams up to T, converted to relations

• Compute relational result

• Convert result to streams if desired

Page 26: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 26

TimeTime• Easiest: global system clock

– Stream elements and relation updates timestamped on entry to system

• Application-defined time– Streams and relation updates contain application

timestamps, may be out of order– Application generates “heartbeat”

• Or deduce heartbeat from parameters: stream skew, scrambling, latency, and clock progress

– Query results in application time

Page 27: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 27

Abstract Semantics – Example 1Abstract Semantics – Example 1Select F.clerk, Max(O.cost)From O [], F [Rows 1000]Where O.orderID = F.orderIDGroup By F.clerk

• Maximum-cost order fulfilled by each clerk in last 1000 fulfillments

Page 28: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 28

Abstract Semantics – Example 1Abstract Semantics – Example 1Select F.clerk, Max(O.cost)From O [], F [Rows 1000]Where O.orderID = F.orderIDGroup By F.clerk

• At time T: entire stream O and last 1000 tuples of F as relations

• Evaluate query, update result relation at T

Page 29: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 29

Abstract Semantics – Example 1Abstract Semantics – Example 1Select IstreamIstream(F.clerk, Max(O.cost))From O [], F [Rows 1000]Where O.orderID = F.orderIDGroup By F.clerk

• At time T: entire stream O and last 1000 tuples of F as relations

• Evaluate query, update result relation at T• Streamed result:Streamed result: New element

(<clerk,max>,T) whenever <clerk,max> changes from T–1

Page 30: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 30

Abstract Semantics – Example 2Abstract Semantics – Example 2Relation CurPrice(stock, price)

Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock

• Average price over last day for each stock

Page 31: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 31

Abstract Semantics – Example 2Abstract Semantics – Example 2Relation CurPrice(stock, price)

Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock

• Istream provides history of CurPrice

• Window on history, back to relation, group and aggregate

Page 32: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 32

Concrete Language – CQLConcrete Language – CQL• Relational query language: SQL

• Window spec. language derived from SQL-99– Tuple-based, time-based, partitioned

• Syntactic shortcuts and defaultsSyntactic shortcuts and defaults– So easy queries are easy to write and simple

queries do what you expect

• EquivalencesEquivalences– Basis for query-rewrite optimizations– Includes all relational equivalences, plus new

stream-based ones

Page 33: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 33

Two Extremely Simple Two Extremely Simple CQL ExamplesCQL Examples

Select From Strm• Had better return Strm (It does)

– Default window for Strm– Default Istream for result

Select From Strm, Rel Where Strm.A = Rel.B• Often want “NOW” window for Strm• But may not want as default

Page 34: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 34

Query ExecutionQuery Execution• When a continuous query is registered,

generation a query planquery plan– Users can also register plans directly

• Plans composed of three main components:– OperatorsOperators (as in most conventional DBMS’s)– Inter-operator QueuesQueues (as in many conventional

DBMS’s)– State (synopses)State (synopses)

• Global schedulerscheduler for plan execution

Page 35: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 35

Operators and StateOperators and State• State (synopsessynopses)

– Summarize tuples seen so far (exact or approximate) for operators requiring history

– To implement windows

• Example: synopsis join– Sliding-window join– Approximation of full join

State1 State2⋈

Page 36: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 36

Simple Query PlanSimple Query Plan

State4⋈State3

Stream1 Stream2

Stream3

Q1 Q2

State1 State2⋈

Scheduler

Page 37: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 37

Some Issues in Query Plan Generation Some Issues in Query Plan Generation • Compatibility and conversions for streams and

relations (+/- streams+/- streams)

• State sharing, incremental computation

• Windowed joinsWindowed joins: Multiway versus 2-way

• Windows in general: push down, pull up, split, merge, …

• Time coordination, operator-level heartbeats

Page 38: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 38

Memory Overhead in Query ProcessingMemory Overhead in Query Processing• Queues + State

• Continuous queries keep state indefinitely

• Online requirements suggest using memory rather than disk– But we realize this assumption is shaky

Page 39: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 39

Memory Overhead in Query ProcessingMemory Overhead in Query Processing• Queues + State

• Continuous queries keep state indefinitely

• Online requirements suggest using memory rather than disk– But we realize this assumption is shaky

• Goal: minimize memory use while providing Goal: minimize memory use while providing timely, accurate answerstimely, accurate answers

Page 40: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 40

Reducing Memory OverheadReducing Memory Overhead

Two main techniques to date

1) Exploit constraints on streamsconstraints on streams to reduce statereduce state

2) Clever operator schedulingoperator scheduling to reduce queue sizesreduce queue sizes

Page 41: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 41

Exploiting Stream ConstraintsExploiting Stream Constraints• For most queries, unbounded memory is

required for arbitrary streams [PODS ’01]

Page 42: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 42

Exploiting Stream ConstraintsExploiting Stream Constraints• For most queries, unbounded memory is

required for arbitraryarbitrary streams [PODS ’01]

• But streams may exhibit constraints that reduce, bound, or even eliminate state

Page 43: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 43

Exploiting Stream ConstraintsExploiting Stream Constraints• For most queries, unbounded memory is

required for arbitrary streams [PODS ’01]

• But streams may exhibit constraints that reduce, bound, or even eliminate state

• Conventional database constraints– Redefined for streams– “Relaxed” for stream environment

Page 44: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 44

Stream ConstraintsStream Constraints• Each constraint type defines adherence adherence

parameterparameter k

• Clustered(k) for attribute S.A

• Ordered(k) for attribute S.A

• Referential-Integrity(k) for join S1 S2

Page 45: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 45

Algorithm for Exploiting ConstraintsAlgorithm for Exploiting Constraints• Input

– Any Select-Project-Join query over streams– Any set of k-constraints

• Output– Query execution plan that reduces or eliminates

state based on k-constraints– If constraints violated, get approximate resultIf constraints violated, get approximate result

Page 46: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 46

Constraint ExamplesConstraint ExamplesOrders (orderID, cost)Fulfillments (orderID, portion, clerk)Query: Many-one join F O

Page 47: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 47

Constraint ExamplesConstraint ExamplesOrders (orderID, cost)Fulfillments (orderID, portion, clerk)Query: Many-one join F O

• Clustered(k) on F.orderID Matched O tuples discarded after k arrivals of non-

matching F’s

Page 48: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 48

Constraint ExamplesConstraint ExamplesOrders (orderID, cost)Fulfillments (orderID, portion, clerk)Query: Many-one join F O

• Clustered(k) on F.orderID Matched O tuples discarded after k arrivals of non-

matching F’s

• Referential-Integrity(k) F tuples retained for at most k arrivals of O tuples

Page 49: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 49

Operator SchedulingOperator Scheduling• Global scheduler invokes run method of query

plan operators with “timeslice” parameter

• Many possible scheduling objectives: minimize latency, inaccuracy, memory use, computation, starvation, …– First scheduler: round-robinSecond scheduler: minimize queue sizes– Third scheduler: minimize combination of queue

sizes and latency

Page 50: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 50

ApproximationApproximation• Why approximate?

– Memory requirement too high, even with constraints and clever scheduling

– Can’t process streams fast enough for query load

Page 51: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 51

Approximation (cont’d)Approximation (cont’d)• Static:Static: rewrite queries to add (or shrink)

sampling or windows+ User can participate, predictable behavior– Doesn’t consider dynamic conditions

• Dynamic:Dynamic: modify query plan – insert sampling operators, shrink windows, load shedding+ Adapts to current resource availability– How to convey to user? (major open issue)(major open issue)

Page 52: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 52

The Holy GrailThe Holy Grail• Given:

– Declarative query– Resources– Constraints on streams

Generate plan and resource allocation that takes advantage of constraints and maximizes precision

• Do it for multiple (weighted) queries, dynamically and adaptively, and convey what’s happening to the user

Page 53: Query Processing, Resource Management, and Approximation in a Data Stream Management System

stanfordstreamdatamanager 53

http://www-db.stanford.edu/streamhttp://www-db.stanford.edu/stream

Contributors: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Justin Rosenstein, Rohit Varma