Analysis of : Operator Scheduling in a Data Stream Manager CS561 – Advanced Database Systems By Eric Bloom


Analysis of: Operator Scheduling in a Data Stream Manager

CS561 – Advanced Database Systems

By

Eric Bloom


Agenda

• Overview of Stream Processing

• Aurora Project Goals

• Aurora Processing Example

• Aurora Architecture

• Multi-Thread Vs. Single-Thread processing

• Important Definitions

• Superbox Scheduling and Processing

• Tuple Batching

• Experimental Evaluation

• Quality of Service (QoS) Scheduling

• QoS Scheduling Scalability

• Related Work


Overview of Stream Processing

Stream Processing is the processing of potentially unbounded, continuous streams of data

• Data streams are created via micro-sensors, GPS devices, monitoring devices

• Examples include: soldier location tracking, traffic sensors, stock market exchanges, heart monitors

• Data may be received evenly or in bursts


Aurora Project Goals

• To build a data stream manager that addresses the performance and processing requirements of stream-based applications

• To support multiple concurrent continuous queries on one or more application data streams

• To use Quality-of-Service (QoS) based criteria to make resource allocation decisions


Aurora Processing Example

[Figure: Aurora processing model. Input data streams flow through operator boxes (continuous and ad hoc queries) to application outputs; historical storage is attached to the network.]


Aurora Architecture

[Figure: Aurora architecture. Components shown: Inputs, Router, Scheduler, Box Processors (B1 to B4), Buffer Manager, Persistent Store, Catalogs, Load Shedder, QoS Monitor, Outputs.]


Multi-Thread Vs. Single Thread Processing

• Multi-Thread Processing

  – Each query is processed in its own thread

  – The operating system manages resource allocation

  – Advantages

    • Processing can take advantage of efficient operating system algorithms

    • Easier to program

  – Disadvantages

    • Software has limited control of resource management

    • Additional overhead due to cache misses, lock contention and context switching


Multi-Thread Vs. Single Thread Processing

• Single-Thread Processing

  – All operations are processed within a single thread

  – All resource allocation decisions are made by the scheduler

  – Advantages

    • Allows processing to be scheduled based on latency and other Quality of Service factors derived from query needs

    • Avoids the limitations of multi-thread processing

  – Disadvantages

    • More complex to program

• Aurora has chosen to implement a single-threaded scheduling model


Important Definitions

• Quality of Service (QoS) – Specific requirements that represent the needs of a specific query. In Aurora, the primary QoS factor is latency

• Query Tree – The set of operators (boxes) and data streams that represent a query.

• Superbox – A sequence of operators that are scheduled and executed as an atomic group. Aurora treats each query as a separate superbox.

• Two-Level Scheduling – Scheduling is done at two levels: first, deciding which superbox to process, and second, deciding in what order to execute the operators within the selected superbox.


Important Definitions (Cont.)

• Scheduling Plan – The combination of dynamic superbox selection and algorithm-based operator execution order within the selected superbox.

• Application-at-a-time (AAAT) is a term used in Aurora that statically defines each query (application) as a superbox

• Box-at-a-time (BAAT) refers to scheduling at the box level rather than the superbox level

• Static and dynamic scheduling approaches – Static approaches to scheduling are defined prior to runtime. Dynamic scheduling approaches use runtime information and statistics to adjust and prioritize scheduling order during execution

• Traversing a superbox – This refers to how the operators within a superbox should be scheduled and executed
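The two-level idea from these definitions can be sketched in a few lines. This is only an illustration: the function names `scheduling_plan`, `longest_first` and `in_given_order` are hypothetical, and the two placeholder policies stand in for Aurora's real superbox selection and traversal algorithms, which the slides do not spell out.

```python
from typing import Callable, Dict, List

# Superbox name -> the operator boxes it contains (hypothetical layout).
Superboxes = Dict[str, List[str]]


def scheduling_plan(
    superboxes: Superboxes,
    pick_superbox: Callable[[Superboxes], str],
    traverse: Callable[[List[str]], List[str]],
) -> List[str]:
    """Return the operator execution order for one scheduling step."""
    chosen = pick_superbox(superboxes)    # level 1: which superbox to run
    return traverse(superboxes[chosen])   # level 2: operator order inside it


def longest_first(superboxes: Superboxes) -> str:
    """Placeholder level-1 policy: pick the superbox with the most boxes."""
    return max(superboxes, key=lambda name: len(superboxes[name]))


def in_given_order(boxes: List[str]) -> List[str]:
    """Placeholder level-2 policy: run the boxes in the order listed."""
    return list(boxes)
```

Swapping in a different `pick_superbox` or `traverse` function changes one level of the plan without touching the other, which is the point of separating the two decisions.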


Non-Superbox Processing

[Figure: a query network of 16 boxes, numbered 1 to 16, with each box scheduled individually.]


Superbox Processing

[Figure: three queries drawn as superboxes: query A (five boxes, labeled A1 to A5), query B (boxes B1 to B6) and query C (boxes C1 to C5).]


Superbox Traversal

• Min-Cost (MC) – Attempts to optimize per-output-tuple processing costs by minimizing the number of operator calls per output tuple

• Min-Latency (ML) – Attempts to produce initial tuples as soon as possible

• Min-Memory (MM) – Attempts to minimize memory usage

Superbox traversal refers to how the operators within a superbox should be executed


Superbox Traversal Processing

• Min-Cost (MC)

  – B4 > B5 > B3 > B2 > B6 > B1

• Min-Latency (ML)

  – B1 > B2 > B1 > B6 > B1 > B4 > B2 > B1 > B3 > B2 > B1 > B5 > B3 > B2 > B1

• Min-Memory (MM)

  – B3 > B6 > B2 > B5 > B3 > B2 > B1 > B4 > B2 > B1

[Figure: the example superbox, consisting of boxes B1 to B6.]
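The MC and ML sequences above can be reproduced with two simple tree walks. This is a sketch under assumptions, not Aurora's implementation: the tree shape in `CHILDREN` is inferred from the sequences on this slide (the figure itself did not survive), and breadth-first distance from the output box is used as a stand-in for the latency estimates a real Min-Latency traversal would rely on. All names are hypothetical.

```python
from collections import deque

# Assumed superbox tree: output box -> the boxes that feed it.
# B1 is the box closest to the application output.
CHILDREN = {
    "B1": ["B2", "B6"],
    "B2": ["B4", "B3"],
    "B3": ["B5"],
    "B4": [],
    "B5": [],
    "B6": [],
}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}


def min_cost_order(root):
    """Post-order walk: every box runs exactly once, after its inputs."""
    order = []

    def visit(box):
        for child in CHILDREN[box]:
            visit(child)
        order.append(box)

    visit(root)
    return order


def path_to_root(box):
    """Boxes a tuple passes through from `box` down to the output."""
    path = [box]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def min_latency_order(root):
    """Push tuples to the output as early as possible, nearest boxes first."""
    order, queue = [], deque([root])
    while queue:
        box = queue.popleft()
        order.extend(path_to_root(box))
        queue.extend(CHILDREN[box])
    return order
```

With this tree, `min_cost_order("B1")` makes 6 operator calls while `min_latency_order("B1")` makes 15, which illustrates the trade-off: Min-Latency produces output sooner at the price of many more box executions.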


Tuple Batching (Train Processing)

• A Tuple Train is a batch of tuples that is processed within a single operator call.

• The goal of Tuple Train processing is to reduce overall processing cost per tuple processed

• Advantages of Tuple Train processing are:

– Decreased number of total operator executions

– Cuts down on low level overhead such as context switching, scheduling, memory management and execution queue maintenance

– Some windowing and merge-join operators work efficiently when batching tuples
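A back-of-the-envelope model shows why trains reduce the cost per tuple. The sketch below assumes a fixed overhead per operator call plus a fixed amount of work per tuple; both cost parameters and the function names are hypothetical and are not measurements from Aurora.

```python
import math


def operator_calls(num_tuples, train_size):
    """Number of operator calls needed when tuples are grouped into trains."""
    return math.ceil(num_tuples / train_size)


def cost_per_tuple(num_tuples, train_size, per_call_overhead, per_tuple_work):
    """Average cost per tuple: call overhead is shared by the whole train."""
    calls = operator_calls(num_tuples, train_size)
    total = calls * per_call_overhead + num_tuples * per_tuple_work
    return total / num_tuples
```

For example, with 1000 tuples, an overhead of 5 units per call and 1 unit of work per tuple, a train size of 1 needs 1000 calls, while a train size of 50 needs only 20, so the per-call overhead is spread over 50 tuples instead of one.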


Experimental Evaluation Definitions

• Stream-based applications do not currently have a standardized benchmark

• Aurora modeled queries as a rooted tree structure from a stream input box to an application output box

• Trees are categorized based on depth and fan-out

– Depth is the number of box levels from input to output

– Fan-out is the average number of children of each box


Experimental Evaluation Results

• At low volumes, “Round Robin Box-At-A-Time (RR-BAAT)” scheduling was almost as efficient as “Minimum Cost Application-At-A-Time (MC-AAAT)”, but it was much less efficient at higher volumes

  – At low volumes, the efficiencies of MC-AAAT were offset by its more complex scheduling overhead

  – As volumes increased, the efficiencies of MC-AAAT became more apparent as scheduling overhead became a lower percentage of total processing

• Experimentation was also done to compare the ML, MC and MM scheduling techniques

  – As expected, each technique minimized its specified attribute (latency, cost and memory respectively)

  – However, at very low processing levels the simplest algorithms tended to do the best (but who cares :)


Quality of Service (QoS) Scheduling

• Definitions

  – Utility – how useful the tuple will be when it exits the query

  – Urgency – the steepness of the downward slope of the utility (QoS) curve; in other words, how fast the utility deteriorates

• Approach

  – Keep track of the latency of the tuples that reside in the queues, and pick for processing the tuples whose execution will provide the highest aggregate QoS delivered to the applications.


Latency-Utility Relationship

[Figure: QoS graph. Horizontal axis: Latency; vertical axis: Quality of Service, from 0 to 1; critical points are marked along the curve.]

The older the data gets, the less it is worth and the lower the quality of service.

Aurora combines the QoS charts of each query being executed with the average latency of the tuples in each box to decide which superbox to execute next.

The idea is to, on average, maintain the highest quality of service.
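One way to picture the latency-utility relationship is as a piecewise-linear curve through the critical points. The sketch below assumes that shape; the critical-point values and the function names are made up for illustration and do not come from Aurora. Utility is read off the curve at a given latency, and urgency is the steepness of the downward slope at that point.

```python
from bisect import bisect_right

# Hypothetical critical points (latency, utility) for one query's QoS graph:
# full utility up to latency 2, then a linear decline to zero at latency 6.
CRITICAL_POINTS = [(0.0, 1.0), (2.0, 1.0), (6.0, 0.0)]


def _segment(latency, points):
    """Return the two critical points whose interval contains `latency`."""
    xs = [x for x, _ in points]
    i = bisect_right(xs, latency) - 1
    i = max(0, min(i, len(points) - 2))
    return points[i], points[i + 1]


def utility(latency, points=CRITICAL_POINTS):
    """Utility of a tuple that leaves the query after `latency` time units."""
    if latency >= points[-1][0]:
        return points[-1][1]
    (x0, y0), (x1, y1) = _segment(latency, points)
    return y0 + (y1 - y0) * (latency - x0) / (x1 - x0)


def urgency(latency, points=CRITICAL_POINTS):
    """How fast utility is being lost at `latency` (downward slope, >= 0)."""
    if latency >= points[-1][0]:
        return 0.0
    (x0, y0), (x1, y1) = _segment(latency, points)
    return max(0.0, (y0 - y1) / (x1 - x0))
```

Under these assumed points, a tuple with latency 1 still has full utility and no urgency, while a tuple with latency 4 has lost half its utility and keeps losing a quarter of a unit per time unit.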


QoS Scheduling Scalability

• Problem

  – A per-tuple approach to QoS-based scheduling will not scale because of the amount of processing needed to maintain it

• Solution

  – Latency is not calculated at the tuple level; rather, it is calculated as the average latency of the tuples in the box input queue

  – Priority is given based on the combination of utility and urgency

  – Once a box’s priority (priority tuple or “p-tuple”) is calculated, the boxes are placed in logical buckets based on their priority value

  – Scheduling is then done based on the priority of the bucket

  – All boxes in a given bucket are considered equal
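The bucketing step can be sketched as follows. This is a simplification under stated assumptions: each box's priority is reduced to a single number between 0 and 1 (the slides describe a p-tuple combining utility and urgency but do not give the formula), the number of buckets is arbitrary, and the function names are hypothetical.

```python
from collections import defaultdict


def bucket_boxes(priorities, num_buckets=4):
    """Group boxes into coarse priority buckets.

    `priorities` maps a box name to a priority in [0, 1]; a higher
    bucket index means a higher priority.
    """
    buckets = defaultdict(list)
    for box, priority in priorities.items():
        index = min(int(priority * num_buckets), num_buckets - 1)
        buckets[index].append(box)
    return buckets


def next_boxes(buckets):
    """Boxes in the highest non-empty bucket; all are treated as equal."""
    if not buckets:
        return []
    return buckets[max(buckets)]
```

Because boxes only need to be assigned to a bucket rather than fully sorted, small changes in priority usually leave the bucket assignment unchanged, which is what keeps the scheduling overhead low.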


Related Work

• Eddies – has a tuple-at-a-time scheduler providing adaptability, but does not scale well

• Urhan – works on rate-based pipeline scheduling of data between operators

• NiagaraCQ – query optimization for streaming data from wide-area information sources

• STREAM – provides comprehensive data stream management using chain scheduling algorithms

• Note that none of the above projects has a notion of QoS