Query Processing, Resource Management, and Approximation in a Data Stream Management System

Post on 25-Feb-2016

50 views 0 download

description

Query Processing, Resource Management, and Approximation in a Data Stream Management System. Selected subset of slides taken from talk by Jennifer Widom at NEDS. st anfordst re amdat am anager. Data Streams. Stream = Continuous, unbounded, rapid, time-varying streams of data elements - PowerPoint PPT Presentation

Transcript of Query Processing, Resource Management, and Approximation in a Data Stream Management System

Query Processing, Resource Query Processing, Resource Management, and Approximation in a Management, and Approximation in a

Data Stream Management SystemData Stream Management System

Selected subset of slides taken from talk by Jennifer Widom at NEDS.

stanfordstreamdatamanager

stanfordstreamdatamanager 2

Data StreamsData Streams• Stream = Continuous, unbounded, rapid, time-

varying streams of data elements

• DSMSDSMS = Data Stream Management System

stanfordstreamdatamanager 3

The STREAM SystemThe STREAM System

• Declarative language for registering continuous queries; considering data streams and stored relations

• Formal semantics; more theoretical team

stanfordstreamdatamanager 4

Contributions to DateContributions to Date

• Semantics for continuous queries

• Query plans

• Exploiting stream constraints

• Operator scheduling

• Approximation techniques

stanfordstreamdatamanager 5

DSMS

Scratch Store

The (Simplified) Big PictureThe (Simplified) Big Picture

Input streams

RegisterQuery

StreamedResult

StoredResult

ArchiveStored

Relations

stanfordstreamdatamanager 6

(Simplified) Network Monitoring(Simplified) Network Monitoring

RegisterMonitoring

Queries

DSMS

Scratch Store

Network measurements,Packet traces

IntrusionWarnings

OnlinePerformance

Metrics

ArchiveLookupTables

stanfordstreamdatamanager 7

Declarative Language for Declarative Language for Continuous QueriesContinuous Queries

• A distinction between STREAM and Aurora :– Aurora users directly manipulate one large

execution plan– STREAM compiles declarative queries into

individual plans, system may merge plans

• Syntax based on SQL, additional constructs for sliding windows and sampling

stanfordstreamdatamanager 8

Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

stanfordstreamdatamanager 9

Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

stanfordstreamdatamanager 10

Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

stanfordstreamdatamanager 11

Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)From Orders O Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderIDWhere O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

stanfordstreamdatamanager 12

Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue”And F.clerk = “Sue” And O.customer = “Joe”And O.customer = “Joe”

stanfordstreamdatamanager 13

Example Query 1Example Query 1Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

stanfordstreamdatamanager 14

Example Query 2Example Query 2Using a 10% sample of the Fulfillments stream,

take the 5 most recent fulfillments for each clerk and return the maximum cost

Select F.clerk, Max(O.cost)From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerk

stanfordstreamdatamanager 15

Example Query 2Example Query 2Using a 10% sample of the Fulfillments stream,

take the 5 most recent fulfillments for each clerk and return the maximum cost

Select F.clerk, Max(O.cost)From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleFulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerk

stanfordstreamdatamanager 16

Example Query 2Example Query 2Using a 10% sample of the Fulfillments stream,

take the 5 most recent fulfillments for each clerk and return the maximum cost

Select F.clerk, Max(O.cost)From Orders O, From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDWhere O.orderID = F.orderIDGroup By F.clerk

stanfordstreamdatamanager 17

Example Query 2Example Query 2Using a 10% sample of the Fulfillments stream,

take the 5 most recent fulfillments for each clerk and return the maximum cost

Select F.clerk, Max(O.cost)Select F.clerk, Max(O.cost)From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerkGroup By F.clerk

stanfordstreamdatamanager 18

Semantics of Database LanguagesSemantics of Database Languages• An often neglected topic

• Traditional relational databases are in reasonable shape– Relational algebra SQL

• But triggers were a mess

• The semantics of an innocent-looking continuous query over data streams may not be obvious

stanfordstreamdatamanager 19

A Nonobvious Continuous QueryA Nonobvious Continuous Query• Stream of stock quotes: Stocks(ticker,price)

• Monitor last 10 minutes of quotes:Select From Stocks [Range 10 minutes]

• Is result a relation, a stream, or something else?

• If a relation, what exactly does it contain?

• If a stream, how does query differ from:Select From Stocks [Range 1 minute]or Select From Stocks []

stanfordstreamdatamanager 20

Our Semantics and Language Our Semantics and Language for Continuous Queries for Continuous Queries

• Abstract: Abstract: interpretation for CQs based on certain “black boxes”

• Concrete: Concrete: SQL-based instantiation for our system; includes syntactic shortcuts, defaults, equivalences

• Goals– CQs over multiple streams and relations– Exploit relational semantics to the extent possible– Easy queries should be easy to write, simple

queries should do what you expect

stanfordstreamdatamanager 21

Relations and StreamsRelations and Streams• Assume global, discrete, ordered time domain

(more on this later)

• Relation– Maps time T to set-of-tuples R

• Stream– Set of (tuple,timestamp) elements

stanfordstreamdatamanager 22

ConversionsConversions

Streams Relations

Window specification

Special operators:Istream, Dstream, Rstream

Any relational query language

stanfordstreamdatamanager 23

Conversion DefinitionsConversion Definitions• Stream-to-relation

– S [W] is a relation — at time T it contains all tuples in window W applied to stream S up to T

– When W = , contains all tuples in stream S up to T

• Relation-to-stream– Istream(R) contains all (r,T ) where rR at time T

but rR at time T–1– Dstream(R) contains all (r,T ) where rR at time

T–1 but rR at time T– Rstream(R) contains all (r,T ) where rR at time T

stanfordstreamdatamanager 24

Abstract SemanticsAbstract Semantics• Take any relational query language

• Can reference streams in place of relations– But must convert to relations using any window

specification language ( default window = [] )

• Can convert relations to streams– For streamed results– For windows over relations

(note: converts back to relation)

stanfordstreamdatamanager 25

Query Result at Time Query Result at Time TT• Use all relations at time T

• Use all streams up to T, converted to relations

• Compute relational result

• Convert result to streams if desired

stanfordstreamdatamanager 26

TimeTime• Easiest: global system clock

– Stream elements and relation updates timestamped on entry to system

• Application-defined time– Streams and relation updates contain application

timestamps, may be out of order– Application generates “heartbeat”

• Or deduce heartbeat from parameters: stream skew, scrambling, latency, and clock progress

– Query results in application time

stanfordstreamdatamanager 27

Abstract Semantics – Example 1Abstract Semantics – Example 1Select F.clerk, Max(O.cost)From O [], F [Rows 1000]Where O.orderID = F.orderIDGroup By F.clerk

• Maximum-cost order fulfilled by each clerk in last 1000 fulfillments

stanfordstreamdatamanager 28

Abstract Semantics – Example 1Abstract Semantics – Example 1Select F.clerk, Max(O.cost)From O [], F [Rows 1000]Where O.orderID = F.orderIDGroup By F.clerk

• At time T: entire stream O and last 1000 tuples of F as relations

• Evaluate query, update result relation at T

stanfordstreamdatamanager 29

Abstract Semantics – Example 1Abstract Semantics – Example 1Select IstreamIstream(F.clerk, Max(O.cost))From O [], F [Rows 1000]Where O.orderID = F.orderIDGroup By F.clerk

• At time T: entire stream O and last 1000 tuples of F as relations

• Evaluate query, update result relation at T• Streamed result:Streamed result: New element

(<clerk,max>,T) whenever <clerk,max> changes from T–1

stanfordstreamdatamanager 30

Abstract Semantics – Example 2Abstract Semantics – Example 2Relation CurPrice(stock, price)

Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock

• Average price over last day for each stock

stanfordstreamdatamanager 31

Abstract Semantics – Example 2Abstract Semantics – Example 2Relation CurPrice(stock, price)

Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock

• Istream provides history of CurPrice

• Window on history, back to relation, group and aggregate

stanfordstreamdatamanager 32

Concrete Language – CQLConcrete Language – CQL• Relational query language: SQL

• Window spec. language derived from SQL-99– Tuple-based, time-based, partitioned

• Syntactic shortcuts and defaultsSyntactic shortcuts and defaults– So easy queries are easy to write and simple

queries do what you expect

• EquivalencesEquivalences– Basis for query-rewrite optimizations– Includes all relational equivalences, plus new

stream-based ones

stanfordstreamdatamanager 33

Two Extremely Simple Two Extremely Simple CQL ExamplesCQL Examples

Select From Strm• Had better return Strm (It does)

– Default window for Strm– Default Istream for result

Select From Strm, Rel Where Strm.A = Rel.B• Often want “NOW” window for Strm• But may not want as default

stanfordstreamdatamanager 34

Query ExecutionQuery Execution• When a continuous query is registered,

generation a query planquery plan– Users can also register plans directly

• Plans composed of three main components:– OperatorsOperators (as in most conventional DBMS’s)– Inter-operator QueuesQueues (as in many conventional

DBMS’s)– State (synopses)State (synopses)

• Global schedulerscheduler for plan execution

stanfordstreamdatamanager 35

Operators and StateOperators and State• State (synopsessynopses)

– Summarize tuples seen so far (exact or approximate) for operators requiring history

– To implement windows

• Example: synopsis join– Sliding-window join– Approximation of full join

State1 State2⋈

stanfordstreamdatamanager 36

Simple Query PlanSimple Query Plan

State4⋈State3

Stream1 Stream2

Stream3

Q1 Q2

State1 State2⋈

Scheduler

stanfordstreamdatamanager 37

Some Issues in Query Plan Generation Some Issues in Query Plan Generation • Compatibility and conversions for streams and

relations (+/- streams+/- streams)

• State sharing, incremental computation

• Windowed joinsWindowed joins: Multiway versus 2-way

• Windows in general: push down, pull up, split, merge, …

• Time coordination, operator-level heartbeats

stanfordstreamdatamanager 38

Memory Overhead in Query ProcessingMemory Overhead in Query Processing• Queues + State

• Continuous queries keep state indefinitely

• Online requirements suggest using memory rather than disk– But we realize this assumption is shaky

stanfordstreamdatamanager 39

Memory Overhead in Query ProcessingMemory Overhead in Query Processing• Queues + State

• Continuous queries keep state indefinitely

• Online requirements suggest using memory rather than disk– But we realize this assumption is shaky

• Goal: minimize memory use while providing Goal: minimize memory use while providing timely, accurate answerstimely, accurate answers

stanfordstreamdatamanager 40

Reducing Memory OverheadReducing Memory Overhead

Two main techniques to date

1) Exploit constraints on streamsconstraints on streams to reduce statereduce state

2) Clever operator schedulingoperator scheduling to reduce queue sizesreduce queue sizes

stanfordstreamdatamanager 41

Exploiting Stream ConstraintsExploiting Stream Constraints• For most queries, unbounded memory is

required for arbitrary streams [PODS ’01]

stanfordstreamdatamanager 42

Exploiting Stream ConstraintsExploiting Stream Constraints• For most queries, unbounded memory is

required for arbitraryarbitrary streams [PODS ’01]

• But streams may exhibit constraints that reduce, bound, or even eliminate state

stanfordstreamdatamanager 43

Exploiting Stream ConstraintsExploiting Stream Constraints• For most queries, unbounded memory is

required for arbitrary streams [PODS ’01]

• But streams may exhibit constraints that reduce, bound, or even eliminate state

• Conventional database constraints– Redefined for streams– “Relaxed” for stream environment

stanfordstreamdatamanager 44

Stream ConstraintsStream Constraints• Each constraint type defines adherence adherence

parameterparameter k

• Clustered(k) for attribute S.A

• Ordered(k) for attribute S.A

• Referential-Integrity(k) for join S1 S2

stanfordstreamdatamanager 45

Algorithm for Exploiting ConstraintsAlgorithm for Exploiting Constraints• Input

– Any Select-Project-Join query over streams– Any set of k-constraints

• Output– Query execution plan that reduces or eliminates

state based on k-constraints– If constraints violated, get approximate resultIf constraints violated, get approximate result

stanfordstreamdatamanager 46

Constraint ExamplesConstraint ExamplesOrders (orderID, cost)Fulfillments (orderID, portion, clerk)Query: Many-one join F O

stanfordstreamdatamanager 47

Constraint ExamplesConstraint ExamplesOrders (orderID, cost)Fulfillments (orderID, portion, clerk)Query: Many-one join F O

• Clustered(k) on F.orderID Matched O tuples discarded after k arrivals of non-

matching F’s

stanfordstreamdatamanager 48

Constraint ExamplesConstraint ExamplesOrders (orderID, cost)Fulfillments (orderID, portion, clerk)Query: Many-one join F O

• Clustered(k) on F.orderID Matched O tuples discarded after k arrivals of non-

matching F’s

• Referential-Integrity(k) F tuples retained for at most k arrivals of O tuples

stanfordstreamdatamanager 49

Operator SchedulingOperator Scheduling• Global scheduler invokes run method of query

plan operators with “timeslice” parameter

• Many possible scheduling objectives: minimize latency, inaccuracy, memory use, computation, starvation, …– First scheduler: round-robinSecond scheduler: minimize queue sizes– Third scheduler: minimize combination of queue

sizes and latency

stanfordstreamdatamanager 50

ApproximationApproximation• Why approximate?

– Memory requirement too high, even with constraints and clever scheduling

– Can’t process streams fast enough for query load

stanfordstreamdatamanager 51

Approximation (cont’d)Approximation (cont’d)• Static:Static: rewrite queries to add (or shrink)

sampling or windows+ User can participate, predictable behavior– Doesn’t consider dynamic conditions

• Dynamic:Dynamic: modify query plan – insert sampling operators, shrink windows, load shedding+ Adapts to current resource availability– How to convey to user? (major open issue)(major open issue)

stanfordstreamdatamanager 52

The Holy GrailThe Holy Grail• Given:

– Declarative query– Resources– Constraints on streams

Generate plan and resource allocation that takes advantage of constraints and maximizes precision

• Do it for multiple (weighted) queries, dynamically and adaptively, and convey what’s happening to the user

stanfordstreamdatamanager 53

http://www-db.stanford.edu/streamhttp://www-db.stanford.edu/stream

Contributors: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Justin Rosenstein, Rohit Varma