
Analysis of: Operator Scheduling in a Data Stream Manager

CS561 – Advanced Database Systems

By

Eric Bloom

Agenda

• Overview of Stream Processing

• Aurora Project Goals

• Aurora Processing Example

• Aurora Architecture

• Multi-Thread Vs. Single-Thread processing

• Important Definitions

• Superbox Scheduling and Processing

• Tuple Batching

• Experimental Evaluation

• Quality of Service (QoS) Scheduling

• QoS Scheduling Scalability

• Related Work

Overview of Stream Processing

Stream Processing is the processing of potentially unbounded, continuous streams of data

• Data streams are created via micro-sensors, GPS devices, monitoring devices

• Examples include: soldier location tracking, traffic sensors, stock market exchanges, heart monitors

• Data may be received evenly or in bursts

Aurora Project Goals

• To build a data stream manager that addresses the performance and processing requirements of stream-based applications

• To support multiple concurrent continuous queries on one or more application data streams

• To use Quality-of-Service (QoS) based criteria to make resource allocation decisions

Aurora Processing Example

[Figure: Aurora processing example. Labels: input data streams, operator boxes, continuous and ad hoc queries, historical storage, output to applications.]

Aurora Architecture

[Figure: Aurora architecture. Labeled components: Persistent Store, Buffer Manager, Scheduler, Catalogs, Load Shedder, QoS Monitor, Router, and Box Processors (B1 to B4), with inputs and outputs.]

Multi-Thread Vs. Single Thread Processing

• Multi-Thread Processing

– Each query is processed in its own thread

– The operating system manages resource allocation

– Advantages

• Processing can take advantage of efficient operating system algorithms

• Easier to program

– Disadvantages

• Software has limited control of resource management

• Additional overhead due to cache misses, lock contention and context switching

Multi-Thread Vs. Single Thread Processing

• Single-Thread Processing

– All operations are processed within a single thread

– All resource allocation decisions are made by the scheduler

– Advantages

• Allows processing to be scheduled based on latency and other Quality of Service factors, according to query needs

• Avoids the limitations of multi-thread processing

– Disadvantages

• More complex to program

• Aurora has chosen to implement a single-threaded scheduling model
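The single-threaded model can be illustrated with a minimal sketch. The `Box` class, the round-robin policy, and every name below are assumptions made for illustration; they are not Aurora's actual code or scheduling policy. The point is only that one loop, rather than the operating system, decides which operator runs next.

```python
from collections import deque

class Box:
    """Hypothetical operator box: applies `fn` to each queued tuple."""
    def __init__(self, name, fn, downstream=None):
        self.name = name
        self.fn = fn
        self.downstream = downstream  # next Box, or None for an output box
        self.queue = deque()

def run_single_threaded(boxes, outputs):
    """One thread picks the next box to run (round-robin here, as an assumption)."""
    calls = 0
    while any(b.queue for b in boxes):
        for box in boxes:
            if not box.queue:
                continue
            result = box.fn(box.queue.popleft())
            calls += 1
            if box.downstream is not None:
                box.downstream.queue.append(result)
            else:
                outputs.append(result)
    return calls

# Two chained boxes: double each value, then add one.
add_one = Box("add_one", lambda x: x + 1)
double = Box("double", lambda x: x * 2, downstream=add_one)
double.queue.extend([1, 2, 3])

results = []
total_calls = run_single_threaded([double, add_one], results)
print(results, total_calls)  # [3, 5, 7] 6
```

Because the loop owns every scheduling decision, a smarter policy (for example one driven by latency) could replace the round-robin `for` loop without touching the boxes.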

Important Definitions

• Quality of Service (QoS) – Specific requirements that represent the needs of a specific query. In Aurora, the primary QoS factor is latency

• Query Tree – The set of operators (boxes) and data streams that represent a query.

• Superbox – A sequence of operators that are scheduled and executed as an atomic group. Aurora treats each query as a separate superbox.

• Two-Level Scheduling – Scheduling is done at two levels: first at the superbox level (deciding which superbox to process), and then within the selected superbox (deciding the order in which its operators execute).

Important Definitions (Cont.)

• Scheduling Plan – The combination of dynamic superbox scheduling and an algorithm-based operator execution order within the superbox.

• Application-at-a-time (AAAT) is a term used in Aurora that statically defines each query (application) as a superbox

• Box-at-a-time (BAAT) refers to scheduling at the box level rather than the superbox level

• Static and dynamic scheduling approaches – Static approaches to scheduling are defined prior to runtime. Dynamic scheduling approaches use runtime information and statistics to adjust and prioritize scheduling order during execution

• Traversing a superbox – This refers to how the operators within a superbox should be scheduled and executed
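The two-level idea in these definitions can be sketched as follows. The dictionary layout and the "most queued tuples first" rule are placeholders chosen for illustration, not Aurora's policy: the first level picks a superbox from runtime state, and the second level simply follows that superbox's fixed traversal order.

```python
def pick_superbox(superboxes):
    """Level 1 (dynamic): choose the superbox with the most queued tuples.
    This rule is a placeholder policy, assumed for the sketch."""
    ready = [s for s in superboxes if s["queued"] > 0]
    return max(ready, key=lambda s: s["queued"]) if ready else None

def scheduling_plan(superboxes):
    """Level 2 (static): run the chosen superbox's boxes in its fixed order."""
    chosen = pick_superbox(superboxes)
    if chosen is None:
        return None, []
    return chosen["name"], list(chosen["traversal"])

# Hypothetical superboxes, one per query (application-at-a-time).
superboxes = [
    {"name": "query_A", "queued": 2, "traversal": ["A1", "A2", "A3"]},
    {"name": "query_B", "queued": 5, "traversal": ["B1", "B2"]},
]
plan = scheduling_plan(superboxes)
print(plan)  # ('query_B', ['B1', 'B2'])
```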

Non-Superbox Processing

[Figure: non-superbox processing, with steps numbered 1 to 16.]

Superbox Processing

[Figure: superbox processing. Boxes B1 to B6, with step labels in an A series and a C series (C1 to C5).]

Superbox Traversal

• Min-Cost (MC) – Attempts to optimize per-output-tuple processing costs by minimizing the number of operator calls per output tuple

• Min-Latency (ML) – Attempts to produce initial tuples as soon as possible

• Min-Memory (MM) – Attempts to minimize memory usage

Superbox traversal refers to how the operators within a superbox should be executed

Superbox Traversal Processing

• Min-Cost (MC)

– B4 > B5 > B3 > B2 > B6 > B1

• Min-Latency (ML)

– B1 > B2 > B1 > B6 > B1 > B4 > B2 > B1 > B3 > B2 > B1 > B5 > B3 > B2 > B1

• Min-Memory (MM)

– B3 > B6 > B2 > B5 > B3 > B2 > B1 > B4 > B2 > B1
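The MC and ML sequences can be reproduced with a short sketch. The tree shape in `CHILDREN` is an assumption inferred from the sequences themselves (B1 nearest the output, with B2 and B6 feeding it), and the breadth-first ordering used for ML assumes every box costs the same; neither detail is stated on the slide. MM is omitted because its order depends on box selectivities that are not given here.

```python
from collections import deque

# Assumed tree: each box maps to the upstream boxes that feed it.
CHILDREN = {"B1": ["B2", "B6"], "B2": ["B4", "B3"], "B3": ["B5"],
            "B4": [], "B5": [], "B6": []}
PARENT = {c: p for p, cs in CHILDREN.items() for c in cs}

def min_cost_order(root):
    """Post-order: each box runs once, after all of its upstream boxes."""
    order = []
    def visit(box):
        for child in CHILDREN[box]:
            visit(child)
        order.append(box)
    visit(root)
    return order

def path_to_root(box):
    """The box followed by every box between it and the output."""
    path = [box]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def min_latency_order(root):
    """Visit boxes nearest the output first and push each box's tuples
    all the way to the root before moving on."""
    order, frontier = [], deque([root])
    while frontier:
        box = frontier.popleft()
        order.extend(path_to_root(box))
        frontier.extend(CHILDREN[box])
    return order

print(" > ".join(min_cost_order("B1")))
# B4 > B5 > B3 > B2 > B6 > B1
print(len(min_cost_order("B1")), len(min_latency_order("B1")))  # 6 15
```

The call counts show the trade-off: the min-cost order makes 6 operator calls, while the min-latency order makes 15 in exchange for delivering output tuples sooner.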

[Figure: example superbox containing boxes B1 to B6.]

Tuple Batching (Train Processing)

• A Tuple Train is the process of executing tuples as a batch within a single operator call.

• The goal of Tuple Train processing is to reduce overall processing cost per tuple processed

• Advantages of Tuple Train processing are:

– Decreased number of total operator executions

– Reduced low-level overhead such as context switching, scheduling, memory management and execution queue maintenance

– Some windowing and merge-join operators work efficiently when batching tuples
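A minimal sketch of train processing, assuming a hypothetical even-number filter as the operator: the same tuples are processed one at a time and in trains of five, and the function counts how many operator calls each mode needs.

```python
def run_operator(tuples, train_size):
    """Apply a filter in trains of `train_size` tuples; count operator calls."""
    calls, out = 0, []
    for start in range(0, len(tuples), train_size):
        train = tuples[start:start + train_size]
        calls += 1  # one operator invocation handles the whole train
        out.extend(t for t in train if t % 2 == 0)
    return out, calls

data = list(range(10))
one_at_a_time = run_operator(data, train_size=1)
batched = run_operator(data, train_size=5)
print(one_at_a_time[1], batched[1])  # 10 2
```

Both runs produce the same output tuples; only the number of operator invocations, and with it the per-call overhead, changes.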

Experimental Evaluation Definitions

• Stream-based applications do not currently have a standardized benchmark

• Aurora modeled queries as a rooted tree structure from a stream input box to an application output box

• Trees are categorized based on depth and fan-out

– Depth is the number of box levels from input to output

– Fan-out is the average number of children of each box
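To get a feel for how depth and fan-out determine query size, the sketch below counts the boxes in a full tree. It assumes every non-leaf box has exactly the same integer number of children, which simplifies the average fan-out described above; the function name is hypothetical.

```python
def box_count(depth, fan_out):
    """Boxes in a full tree with `depth` levels where every non-leaf box
    has exactly `fan_out` children (a simplifying assumption)."""
    return sum(fan_out ** level for level in range(depth))

print(box_count(3, 2))  # 7  (1 + 2 + 4)
print(box_count(4, 1))  # 4  (a simple chain of boxes)
```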

Experimental Evaluation Results

• At low volumes, “Round Robin Box-At-A-Time (RR-BAAT)” scheduling was almost as efficient as “Minimum Cost Application-At-A-Time (MC-AAAT)”, but it was much less efficient at higher volumes

– At low volumes, the efficiencies of MC-AAAT were reduced by more complex scheduling overhead

– As volumes increased, the efficiencies of MC-AAAT became more apparent as scheduling overhead became a lower percentage of total processing

• Experimentation was also done to compare the ML, MC and MM scheduling techniques

– As expected, each technique minimized its specified attribute (latency, cost and memory respectively)

– However, at very low processing levels the simplest algorithms tended to do the best (but who cares :)

Quality of Service (QoS) Scheduling

• Definitions

– Utility – how useful the tuple will be when it exits the query

– Urgency – represented by the angle of the downward slope of the utility QoS parameter; in other words, how fast the utility deteriorates

• Approach

– Keep track of the latency of tuples that reside in the queues, and pick for processing the tuples whose execution will provide the highest aggregate QoS delivered to the applications.

Latency-Utility Relationship

[Figure: latency-utility graph. X-axis: latency; y-axis: quality of service, from 0 to 1; critical points are marked on the curve.]

The older the data gets, the less it is worth, and the lower the quality of service.

Aurora combines the QoS charts of each query being executed with the average latency of the tuples in each box to decide which superbox to execute next.

The idea is to maintain, on average, the highest quality of service.
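The latency-utility relationship can be sketched as a piecewise-linear function through a list of critical points. The points, the millisecond unit, and the function names below are illustrative assumptions, not values from Aurora. `utility` interpolates between critical points, and `urgency` returns the steepness of the downward slope at a given latency.

```python
# Hypothetical critical points: (latency in ms, utility in [0, 1]).
CRITICAL_POINTS = [(0, 1.0), (100, 1.0), (300, 0.4), (500, 0.0)]

def utility(latency, points=CRITICAL_POINTS):
    """Linearly interpolate utility between critical points."""
    if latency <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if latency <= x1:
            return y0 + (y1 - y0) * (latency - x0) / (x1 - x0)
    return points[-1][1]  # past the last critical point

def urgency(latency, points=CRITICAL_POINTS):
    """Steepness of the downward slope at this latency (0 on flat segments)."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= latency < x1:
            return max(0.0, (y0 - y1) / (x1 - x0))
    return 0.0

print(round(utility(200), 3), round(urgency(200), 4))  # 0.7 0.003
```

With these sample points, a tuple that is 50 ms old still has full utility and no urgency, while one that is 200 ms old has already lost utility and keeps losing it with every further millisecond of delay.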

QoS Scheduling Scalability

• Problem

– A per-tuple approach to QoS-based scheduling will not scale because of the amount of processing needed to maintain it

• Solution

– Latency is not calculated at the tuple level; rather, it is calculated as the average latency of the tuples in the box input queue

– Priority is given based on the combination of utility and urgency

– Once a box’s priority (priority tuple or “p-tuple”) is calculated, the boxes are placed in logical buckets based on their priority value

– Scheduling is then done based on the priority of the bucket

– All boxes in a given bucket are considered equal
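The bucketing step can be sketched as follows. The sketch assumes each box's priority has already been reduced to a single number in [0, 1], which simplifies the p-tuple of utility and urgency described above, and it uses equal-width buckets; the function names and sample values are hypothetical.

```python
from collections import defaultdict

def bucket_boxes(priorities, num_buckets):
    """Group boxes into equal-width buckets over priority values in [0, 1]."""
    buckets = defaultdict(list)
    for box, p in priorities.items():
        index = min(int(p * num_buckets), num_buckets - 1)
        buckets[index].append(box)
    return buckets

def schedule(priorities, num_buckets=4):
    """Serve the highest bucket first; boxes sharing a bucket are treated
    as equal and keep their insertion order."""
    buckets = bucket_boxes(priorities, num_buckets)
    return [box for index in sorted(buckets, reverse=True)
            for box in buckets[index]]

priorities = {"B1": 0.91, "B2": 0.35, "B3": 0.88, "B4": 0.10}
order = schedule(priorities)
print(order)  # ['B1', 'B3', 'B2', 'B4']
```

B1 and B3 land in the same bucket and are not ordered against each other, which is what lets the scheduler avoid a full sort of all boxes by exact priority.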

Related Work

• Eddies – has a tuple-at-a-time scheduler providing adaptability, but does not scale well

• Urhan – works on rate-based pipeline scheduling of data between operators

• NiagaraCQ – query optimization for streaming data from wide-area information sources

• STREAM – provides comprehensive data stream management using chain scheduling algorithms

• Note that none of the above projects has a notion of QoS
