Real Time Data Stream Processing Engine

8/3/2019 Real Time Data Stream Processing Engine


REAL TIME DATA STREAM PROCESSING ENGINE

Ashish Kumar Gupta1 and Abhinav Rastogi2

1Nibble Computer Society, JSS Academy Of Technical Education, Noida 201301, India

[email protected]

2Quanta Electronics Society

JSS Academy Of Technical Education, Noida 201301, [email protected]

Abstract:

The rapid growth of information science and technology in general, and of the complexity and volume of data in particular, has introduced new challenges for the research community. Databases grow incessantly and many sources produce data continuously. In many cases we need to extract knowledge from this continuous stream of data. Examples include customer click streams, telephone records, large sets of web pages, multimedia data, and retail chain transactions. These sources are called data streams. If the underlying process is not strictly stationary (as in most real-world applications), the target concept can change gradually over time.

Keywords

Internet traffic monitoring, on-line stream analysis, sliding windows, frequent item queries

1. Introduction

A Stream Processing Engine is a computing platform for capturing, integrating,

understanding, and reacting to business events as they occur.

Data streaming systems are increasingly used as infrastructure for critical monitoring applications such as financial alerts and network intrusion detection. These monitoring applications often have many concurrent users asking similar but different queries over a common data stream. For example, a system that monitors stock market trades might have multiple users interested in the total value of trades in a sliding window. While some of these users might care about stocks of a particular sector, or only about high-volume trades, others might compute complex user-defined predicates on fluctuating quantities like stock price. Similarly, the aggregation window that different users are interested in can vary widely. Money managers in financial institutions who run algorithmic trading systems might want aggregates over 5-10 minute windows reported every 60-90 seconds, depending on the specific financial models they use. In contrast, day traders with individual investing strategies might only need these results every 5-10 minutes. Clearly such a system has to support hundreds of queries. Hence the need for a system that can process all of these queries with minimal delay.
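The sliding-window aggregate in the stock market example can be sketched as follows. This is a minimal illustration only, assuming trades arrive in timestamp order as (timestamp in seconds, value) pairs; the class name SlidingWindowSum and its methods are hypothetical and are not part of any engine discussed in this paper.

```python
from collections import deque


class SlidingWindowSum:
    """Total trade value over the last `window` seconds (illustrative sketch)."""

    def __init__(self, window):
        self.window = window
        self.trades = deque()   # (timestamp, value) pairs in arrival order
        self.total = 0

    def add(self, timestamp, value):
        # Assumes timestamps are non-decreasing.
        self.trades.append((timestamp, value))
        self.total += value

    def query(self, now):
        # Evict trades that fell out of the window (now - window, now].
        while self.trades and self.trades[0][0] <= now - self.window:
            _, old_value = self.trades.popleft()
            self.total -= old_value
        return self.total


# A 5-minute window; a real engine would call query() every 60-90 seconds.
window = SlidingWindowSum(window=300)
window.add(0, 100)
window.add(120, 50)
window.add(290, 25)
print(window.query(299))
print(window.query(300))
```

Each user query with its own window length or filter would need its own instance under this naive design, which is exactly the duplication that sharing work between similar queries tries to avoid.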

Recent years have witnessed an increasing interest in designing algorithms for querying and analyzing streaming data (i.e., data that is seen only once in a fixed order) with only limited memory. Providing (perhaps approximate) answers to queries over such continuous data streams is a crucial requirement for many application environments; examples include large telecom and IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed.

In India, more systems are being automated every day, and traditional database systems are not sufficient for data that arrives on the fly. A stream processing engine can help improve performance in such settings.

Examples:

Financial services, telecommunications, stock markets, medical departments, the military and industrial process control are some of the areas that will benefit from stream processing. These sectors receive large amounts of data in a very short time, so storing the data and then querying the stored data would take too long and incoming data would queue up. A stream processing engine instead processes and queries the data as it arrives.

2. Objective

The emerging real-time information environment is being fueled by an unprecedented increase in the amount of live data that needs to be understood and reacted to instantaneously. The traditional store-and-query model cannot address the needs of a world where, in many cases, information's value may exist for only a moment. A Stream Processing Engine provides the infrastructure to support this growing class of problems.

Description

The major issues:

- What will this engine do?
- Why is it required at the moment?
- What is the underlying technique?

The answers are not far to seek. As stated above, streams of data have gained relevance over conventional data that is first stored in a database and then processed. Stored data grows redundant over time and comes to be considered obsolete, which led to the idea of a mechanism that processes the data in the stream itself and stores only the data that is relevant.

Techniques:

The challenges faced by Stream Processing Engines are manifold.

- They relate to the size of the data set and sometimes to the size of the sliding windows and of the intermediate results of queries.
- Often exact data are not needed for aggregate queries.
- In real life, data streams are not continuous but often arrive in bursts (e.g. network traffic). Processing these bursts of data without compromising system performance is a key challenge.
- Implementing join processing with minimal resources in data streams is also a major challenge.

Recent research on these problems has given birth to the following major contemporary techniques:

- The performance of aggregate queries benefits greatly from the computation of "Sketch Summaries" (i.e. summaries which are representative of the overall data) to provide approximate answers. [2]
- Adaptive load-aware scheduling of query operators can be used to minimize resource consumption during peak loads. [3]
- Join processing can be made approximate for sliding windows of data. [4]
- Exploiting similarities between incoming queries can lead to better resource utilization.
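To make the idea of a sketch summary concrete, the following is a minimal sketch of one well-known summary structure, the Count-Min sketch. The text above does not name a particular sketch, so this choice, the class name CountMinSketch, and the default width and depth are assumptions made for illustration. The structure answers frequency queries approximately in fixed memory and never underestimates a count.

```python
import hashlib


class CountMinSketch:
    """Approximate item counts in fixed memory (illustrative sketch)."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth          # must stay below 256 for the salt used here
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One salted hash per row, so rows act like independent hash functions.
        data = repr(item).encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([row])).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate cells, so the minimum is the tightest bound.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for address in ["10.0.0.1"] * 50 + ["10.0.0.2"] * 7:
    sketch.add(address)
print(sketch.estimate("10.0.0.1"), sketch.estimate("10.0.0.2"))
```

Memory use depends only on width and depth, not on the number of distinct items in the stream, which is what makes such summaries attractive for aggregate queries that tolerate approximate answers.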

[Figure: Stream processing engine architecture. Labels: Real-time Feeds, Alerts, Actions, Embedded local storage, Remote Data store.]

3. Algorithms used or proposed

3.1 Chain: Operator Scheduling for Memory Minimization in Data Stream Systems

Theorem: Given a system with k queries in which all operator selectivities are at most 1, let C(t) be the number of blocks of memory used by Chain at time t. Then at every time t, any algorithm must use at least C(t) - k blocks of memory.

Chain scheduling works as follows:

- Calculate the lower envelope.
- Priority of an operator = slope of its lower envelope segment.
- Always schedule the highest-priority available operator.
- Break ties using operator order in the pipeline, favoring later operators.
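The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the reference implementation of Chain: operators are given as (execution time, selectivity) pairs, and the example values are read off the chart points (0,1), (1,0.5), (4,0.25) and (6,0) that appear later in this section. The function names progress_chart, chain_priorities and pick_operator are hypothetical.

```python
def progress_chart(operators):
    """Points (cumulative time, remaining block size) along one operator path."""
    points = [(0.0, 1.0)]
    for time, selectivity in operators:
        prev_time, prev_size = points[-1]
        points.append((prev_time + time, prev_size * selectivity))
    return points


def chain_priorities(operators):
    """Priority of each operator = steepness of its lower envelope segment."""
    points = progress_chart(operators)
    priorities = [0.0] * len(operators)
    start = 0
    while start < len(operators):
        t0, size0 = points[start]
        best_end, best_slope = None, None
        # Find the later point reached by the steepest descent from `start`.
        for end in range(start + 1, len(points)):
            t1, size1 = points[end]
            slope = (size0 - size1) / (t1 - t0)
            if best_slope is None or slope >= best_slope:
                best_end, best_slope = end, slope
        # Operators start .. best_end-1 share one envelope segment.
        for op in range(start, best_end):
            priorities[op] = best_slope
        start = best_end
    return priorities


def pick_operator(priorities, queued_blocks):
    """Highest-priority operator with input waiting; ties favor later operators."""
    ready = [op for op, blocks in enumerate(queued_blocks) if blocks > 0]
    if not ready:
        return None
    return max(ready, key=lambda op: (priorities[op], op))


operators = [(1, 0.5), (3, 0.5), (2, 0.0)]   # Opt1, Opt2, Opt3
print(progress_chart(operators))
print(chain_priorities(operators))
```

In this example Opt1 forms a segment of its own, while Opt2 and Opt3 share a segment and therefore receive the same priority; pick_operator then prefers Opt3 over Opt2 when both have input waiting.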

In many applications involving continuous data streams, data arrival is bursty and the data rate fluctuates over time. Systems that seek to give rapid or real-time query responses in such an environment must be prepared to deal gracefully with bursts in data arrival without compromising system performance. We discuss one strategy for processing bursty streams: adaptive, load-aware scheduling of query operators to minimize resource consumption during times of peak load. We show that the choice of an operator scheduling strategy can have a significant impact on run-time system memory usage. We then present Chain scheduling, an operator scheduling strategy for data stream systems that is near-optimal in minimizing run-time memory usage for any collection of single-stream queries involving selections, projections and foreign-key joins with stored relations. Chain scheduling also performs well for queries with sliding-window joins over multiple streams, and for multiple queries of the above types.

Run-time resource allocation and optimization are important components of systems research that have received less attention to date. In this paper we focus on one aspect of run-time resource allocation, namely operator scheduling. Adaptivity therefore becomes more critical to a data stream system than to a traditional DBMS.

[Figure: Chart of Block Size for operators Opt1, Opt2 and Opt3, with the lower envelope marked.]

Various approaches to adaptive query processing are possible, given that the data may exhibit different types of variability. For example, a system could modify the structure of query plans, or dynamically reallocate memory among query operators in response to changing conditions, as suggested in earlier work, or take a holistic approach to adaptivity and do away with fixed query plans altogether, as in the Eddies architecture. While these approaches focus primarily on adapting to changing characteristics of the data itself (e.g., changing selectivities), we focus on adaptivity towards changing arrival characteristics of the data.

As mentioned earlier, most data streams exhibit considerable burstiness and arrival-rate variation. It is crucial for any stream system to adapt gracefully to such variations in data arrival, making sure that we do not run out of critical resources such as main memory during the bursts. The focus of this paper is to design techniques for such adaptivity.

Query execution can be captured by a data flow diagram, where every tuple passes through a unique operator path. Thus queries can be represented as rooted trees. Every operator is a filter that operates on a tuple and produces s tuples, where s is the operator selectivity. Obviously, the selectivity assumption does not hold at the granularity of a single tuple but is merely a convenient abstraction to capture the average behavior of the operator. For example, we assume that a select operator with selectivity 0.5 will select about 500 tuples of every 1000 tuples that it processes. Henceforth, a tuple should not be thought of as an individual tuple, but should be viewed as a convenient abstraction of a memory unit, such as a page, that contains multiple tuples. Over adequately large memory units, we can assume that if an operator with selectivity s operates on inputs that require one unit of memory, its output will require s units of memory.

How it works:

Inputs:
- Data flow path(s) consisting of sequences of operators
- For each operator we know:
  - Execution time (per block)
  - Selectivity

Greedy algorithm:

- Operator priority = reduction in data size per unit time, (1 - si)/ti, where si is the selectivity and ti the execution time of operator i
- Always schedule the highest-priority available operator

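[Figure: Chart of Block Size against Time for operators Opt1, Opt2 and Opt3, passing through the points (0,1), (1,0.5), (4,0.25) and (6,0), with axis corners at (0,0) and (6,0). Operators are annotated with their time and selectivity (Time: t1, Selectivity: s1). A second diagram shows Query #1 and Query #2, each reading from a stream.]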

[Figure: Memory Usage. Block size (0 to 3) against time (0 to 18) for FIFO and Chain scheduling.]

3.2 Sliding Window Algorithms:

Many infinite stream algorithms do not have obvious counterparts in the sliding window model. For example, one counter suffices to maintain the minimum element in an infinite stream, but keeping track of the minimum element in a sliding window of size N takes Ω(N) space: consider an increasing sequence of values, in which the oldest item in any window is the minimum and must be replaced whenever the window moves forward. The fundamental problem is that as new items arrive, old items must be simultaneously evicted from the window, meaning that we need to store some information about the order of the packets in the window.

Zhu and Shasha introduce Basic Windows to incrementally compute simple windowed aggregates in [5]. The window is divided into equally-sized Basic Windows, and only a synopsis and a timestamp are stored for each Basic Window. When the timestamp of the oldest Basic Window expires, that window is dropped and a fresh Basic Window is added. This method does not require the storage of the entire sliding window, but results are refreshed only after the stream fills the current Basic Window. If the available memory is small, then the number of synopses that may be stored is small and hence the refresh interval is large.

Exponential Histograms (EH) have been introduced by Datar et al. [6] and recently expanded in [7] to provide approximate answers to simple window aggregates at all times. The idea is to build Basic Windows with various sizes and maintain a bound on the error caused by counting those elements in the oldest Basic Window which may have already expired. The algorithm guarantees an error of at most ε while using O((1/ε) log^2 N) space.
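The Basic Window method described above can be illustrated for a windowed sum. This is a minimal sketch under simplifying assumptions: windows are count-based rather than time-based, so the position of a synopsis in the queue stands in for its timestamp, and the class name BasicWindowSum is hypothetical.

```python
from collections import deque


class BasicWindowSum:
    """Windowed sum kept as one partial sum (synopsis) per Basic Window."""

    def __init__(self, window_size, basic_size):
        if window_size % basic_size != 0:
            raise ValueError("window_size must be a multiple of basic_size")
        self.basic_size = basic_size
        self.max_synopses = window_size // basic_size
        self.synopses = deque()     # oldest Basic Window at the front
        self.current_sum = 0        # Basic Window still being filled
        self.current_count = 0
        self.answer = None          # last refreshed result

    def add(self, value):
        self.current_sum += value
        self.current_count += 1
        if self.current_count == self.basic_size:
            # The current Basic Window is full: store its synopsis,
            # drop the expired one, and refresh the answer.
            self.synopses.append(self.current_sum)
            if len(self.synopses) > self.max_synopses:
                self.synopses.popleft()
            self.current_sum = 0
            self.current_count = 0
            self.answer = sum(self.synopses)
        return self.answer


window = BasicWindowSum(window_size=4, basic_size=2)
print([window.add(v) for v in [1, 2, 3, 4, 5, 6]])
```

The answer changes only on every second element here, which shows the trade-off stated in the text: fewer, larger Basic Windows save memory but lengthen the refresh interval.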

3.3 Proposed algorithm

We propose the following simple algorithm, Frequent, which employs the Basic Window approach (i.e. the jumping window model) and stores a top-k synopsis in each Basic Window. We fix an integer k and, for each Basic Window, maintain a list of the k most frequent items in this window. We assume that a single Basic Window fits in main memory, within which we may count item frequencies exactly. Let δ_i be the frequency of the kth most frequent item in the ith Basic Window. Then δ = Σ_i δ_i is the upper limit on the frequency of an item type that does not appear on any of the top-k lists. Now, we sum the reported frequencies for each item present in at least one top-k synopsis, and if there exists a category whose reported frequency exceeds δ, we are certain that this category has a true frequency of at least δ. The pseudocode is given below, assuming that N is the sliding window size, b is the number of elements per Basic Window, and N/b is the total number of Basic Windows. An updated answer is generated whenever the window slides forward by b packets.

Implementation

Repeat:
1. For each element e in the next b elements:
     If a local counter exists for the type of element e:
       Increment the local counter.
     Otherwise:
       Create a new local counter for this element type and set it equal to 1.
2. Add a summary S containing identities and counts of the k most frequent items to the back of queue Q.
3. Delete all local counters.
4. For each type named in S:
     If a global counter exists for this type:
       Add to it the count recorded in S.
     Otherwise:
       Create a new global counter for this element type and set it equal to the count recorded in S.
5. Add the count of the kth largest type in S to δ.
6. If sizeOf(Q) > N/b:
   (a) Remove the summary S' from the front of Q and subtract the count of the kth largest type in S' from δ.
   (b) For all element types named in S':
         Subtract from their global counters the counts recorded in S'.
         If a counter is decremented to zero:
           Delete it.
   (c) Output the identity and value of each global counter > δ.
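The pseudocode above translates fairly directly into Python. The following is a sketch rather than a tuned implementation; the function name frequent and its parameter names are hypothetical, and the step numbers in the comments refer to the pseudocode. One detail is an assumption not spelled out in the text: if a Basic Window holds fewer than k distinct items, its contribution to δ is taken as 0, since any item missing from that summary did not occur in that window at all.

```python
from collections import Counter, deque


def frequent(stream, window_size, basic_size, k):
    """Yield {item: reported count} for items above delta, as in step 6(c)."""
    max_summaries = window_size // basic_size    # N/b
    queue = deque()             # Q: (summary, kth count) per Basic Window
    global_counts = Counter()
    delta = 0
    local = Counter()
    filled = 0
    for element in stream:
        local[element] += 1                      # step 1
        filled += 1
        if filled < basic_size:
            continue
        summary = local.most_common(k)           # step 2
        kth_count = summary[-1][1] if len(summary) == k else 0
        queue.append((summary, kth_count))
        local = Counter()                        # step 3
        filled = 0
        for item, count in summary:              # step 4
            global_counts[item] += count
        delta += kth_count                       # step 5
        if len(queue) > max_summaries:           # step 6
            old_summary, old_kth = queue.popleft()
            delta -= old_kth                     # step 6(a)
            for item, count in old_summary:      # step 6(b)
                global_counts[item] -= count
                if global_counts[item] == 0:
                    del global_counts[item]
            yield {item: count                   # step 6(c)
                   for item, count in global_counts.items() if count > delta}


# Window of N = 8 packets, Basic Windows of b = 4 packets, top-2 synopses.
for answer in frequent("aaabaacaabaaccdc", window_size=8, basic_size=4, k=2):
    print(answer)
```

Following step 6 literally, output starts only once the oldest summary has been evicted for the first time, so the sixteen-packet example produces two answers.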

3.4 Processing Complex Aggregate Queries over Data Streams

Continuous queries are defined once and run until the user terminates them. Maintaining the answer A of a continuous query Q raises several issues:

- Q is a selection: size(A) may be unbounded, so we cannot guarantee that we can store it.
- Q is a self-join: if we want to report only NEW results, we need unlimited storage to guarantee that no duplicates exist in the result.
- Q contains aggregation: tuples in A might be deleted by newly observed tuples. For example, what if B < 0 in the following query?

    Select A, sum(B)
    From Stream X
    Group by A
    Having sum(B) > 100

- What if we can delete tuples in the Stream?
- What if Q contains a blocking operator near the top (example: aggregation)? Online aggregation techniques are useful here.
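The effect of negative values of B on the Having query can be seen in a small sketch. It assumes the stream delivers (A, B) pairs one at a time, and it reports only changes in answer membership (not updates to the sum of a group that stays in the answer); the function name having_sum_over is hypothetical.

```python
from collections import defaultdict


def having_sum_over(stream, threshold=100):
    """Yield answer changes for: Select A, sum(B) ... Having sum(B) > threshold."""
    sums = defaultdict(int)
    answer = set()              # groups currently satisfying the Having clause
    for a, b in stream:
        sums[a] += b
        if sums[a] > threshold and a not in answer:
            answer.add(a)
            yield ("insert", a, sums[a])
        elif sums[a] <= threshold and a in answer:
            # A negative B pulled the group back under the threshold,
            # so a tuple already in the answer has to be retracted.
            answer.remove(a)
            yield ("delete", a, sums[a])


stream = [("x", 60), ("x", 50), ("y", 10), ("x", -20)]
for change in having_sum_over(stream):
    print(change)
```

If B is never negative, the sums only grow and the answer is append-only; with negative values the engine must keep per-group state so that it can retract results it has already produced.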


- Work on self-maintenance: important to limit the size of Scratch. If a view is self-maintainable, any auxiliary storage must occupy bounded space.
- Work on data expiration: important for knowing when to move elements from Scratch to Throw.

Goal: group similar queries over data sources, to eliminate common processing and minimize the response time and storage needed.

Related systems: Niagara (joint work of Wisconsin and Oregon), Tukwila (Washington), Telegraph (Berkeley).

Eddy:
- Knows the state of tuples.
- Passes tuples by reference to operators (avoids copying).
- When the Eddy does not have any more input tuples, it polls the sources for more input.

Tuples need to be augmented with additional information:
- Ready bits: which operators need to be applied.
- Done bits: which operators have been applied.
- Queries completed: signals whether the tuple has been output or rejected by the query.
- Completion mask (per query): used to know when a tuple can be output for a query (completion mask & done bits = mask).

Queries with no joins are partitioned per data source (to save space in the bits required).

Queries with disjunctions (ORs) are transformed into conjunctive normal form (an AND of ORs).

Range/exact predicates are handled in a grouped filter.
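The per-tuple bookkeeping listed above can be sketched with integer bitmasks. This is a simplified illustration: a real Eddy chooses the routing order adaptively, whereas this sketch applies the ready operators in index order, and the names RoutedTuple and route are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RoutedTuple:
    values: dict
    ready: int                  # bit i set: operator i still needs to be applied
    done: int = 0               # bit i set: operator i has been applied
    queries_completed: int = 0  # bit q set: query q has output or rejected the tuple


def route(tup, operators, completion_masks):
    """Apply all ready operators; return the queries that output the tuple."""
    for i, predicate in enumerate(operators):
        bit = 1 << i
        if not tup.ready & bit:
            continue
        tup.ready &= ~bit
        tup.done |= bit
        if not predicate(tup.values):
            # The tuple is rejected by every query that needs this operator.
            for q, mask in enumerate(completion_masks):
                if mask & bit:
                    tup.queries_completed |= 1 << q
    outputs = []
    for q, mask in enumerate(completion_masks):
        finished = tup.queries_completed & (1 << q)
        # Output test from the text: completion mask & done bits = mask.
        if not finished and (mask & tup.done) == mask:
            tup.queries_completed |= 1 << q
            outputs.append(q)
    return outputs


operators = [lambda v: v["price"] > 100, lambda v: v["volume"] > 1000]
completion_masks = [0b01, 0b11]   # query 0 needs operator 0; query 1 needs both
trade = RoutedTuple({"price": 150, "volume": 500}, ready=0b11)
print(route(trade, operators, completion_masks))
```

Because both queries share the price filter, it is evaluated once per tuple and its result is reused by every query whose completion mask includes that bit.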


SteMs and Joins

- SteMs: multiway pipelined joins.
- Double-pipelined joins maintain a hash index on each relation.
- When N relations are joined, at least N-2 in-flight indices are needed for intermediate results, even for left-deep trees.
- The previous approach cannot change the query plan without re-computing intermediate indices.
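A double-pipelined (symmetric) hash join between two streams can be sketched as follows: each arriving row is inserted into the hash index of its own side and immediately probes the index of the other side. This is a minimal two-way illustration with unbounded state (no window or expiry), and the class name SymmetricHashJoin is hypothetical.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Two-way pipelined equi-join keeping a hash index per input."""

    def __init__(self):
        self.index = {"left": defaultdict(list), "right": defaultdict(list)}

    def insert(self, side, key, row):
        """Add a row arriving on `side`; return the join results it produces."""
        other = "right" if side == "left" else "left"
        self.index[side][key].append(row)           # build on own side
        matches = self.index[other].get(key, [])    # probe the other side
        if side == "left":
            return [(row, match) for match in matches]
        return [(match, row) for match in matches]


join = SymmetricHashJoin()
print(join.insert("left", 1, "L1"))
print(join.insert("right", 1, "R1"))
print(join.insert("left", 1, "L2"))
```

Results are produced as soon as both matching rows have arrived, regardless of which stream delivers first, which is what makes this style of join suitable for streams.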

4. Conclusion and Future Work:

In this paper we have examined the practical implementation of a real-time data stream processing engine by analysing the algorithms above and their feasibility, and found them to be applicable to the test scenario.

Next we will develop our own engine, which captures data packets over a socket, interprets them into meaningful data, processes them, stores the relevant information in the database and drops the rest of the data once it has been processed and is no longer required.

Real-time stream data processing engines are the need of the hour, since data mining is becoming more critical day by day for the development of the individual, the economic growth of society and the advancement of technology.

5. References

Research Projects:

- Aurora (supports cq, ad-hoc query, and materialized view): aims to better support monitoring applications
- Borealis (distributed SPE, QoS-based techniques): a distributed stream processing engine based on Aurora and Medusa

Shivnath Babu and Rajeev Motwani, Department of Computer Science, Stanford University, Stanford, CA 94305

Cornell University, [email protected]

Minos Garofalakis, Bell Labs, Lucent, [email protected]

Johannes Gehrke, Cornell University, [email protected]

Rajeev Rastogi, Bell Labs, Lucent, [email protected]

[1] A. Arasu et al. Resource sharing in continuous sliding-window aggregates. In VLDB. 2004.

[2] A. Arasu et al. The CQL continuous query language: Semantic foundations and query execution. VLDB Journal, (To appear).

[3] F. Bancilhon et al. FAD, a powerful and simple database language. In VLDB. 1987.

[4] D. Carney et al. Monitoring streams - a new class of data management applications. In VLDB. 2002.

[5] S. Chandrasekaran et al. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR. 2003.

[6] S. Chandrasekaran et al. Streaming queries over streaming data. In VLDB. 2002.

[7] J. Chen et al. NiagaraCQ: a scalable continuous query system for Internet databases. In SIGMOD. 2000.

[8] C. D. Cranor et al. Gigascope: A stream database for network applications. In SIGMOD. 2003.

[9] M. Denny et al. Predicate result range caching for continuous queries. In SIGMOD. 2005.

[10] P. M. Deshpande et al. Caching multidimensional queries using chunks. In SIGMOD. 1998.

[11] C. L. Forgy. Rete: A fast algorithm for the many pattern/many object match problem. Artificial Intelligence, 19(1):17-37, September 1982.

[12] M. J. Franklin et al. Design considerations for high fan-in systems: The HiFi approach. In CIDR. 2005.

[13] L. Golab et al. Update-pattern-aware modeling and processing of continuous queries. In SIGMOD. 2005.

[14] G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73-170, June 1993.

[15] J. Gray et al. Data Cube: a relational aggregation operator generalizing group-by, cross-tab and sub-total. In ICDE. February 1996.
