Real Time Data Stream Processing Engine
8/3/2019 Real Time Data Stream Processing Engine
REAL TIME DATA STREAM PROCESSING ENGINE
Ashish Kumar Gupta1 and Abhinav Rastogi2
1Nibble Computer Society, JSS Academy Of Technical Education, Noida 201301, India
2Quanta Electronics Society, JSS Academy Of Technical Education, Noida 201301, [email protected]
Abstract:
The rapid growth of information science and technology in general, and of the
complexity and volume of data in particular, has introduced new challenges for
the research community. Databases are growing incessantly and many sources
produce data continuously. In many cases, we need to extract some sort of
knowledge from this continuous stream of data. Examples include customer click
streams, telephone records, large sets of web pages, multimedia data, and sets of
retail chain transactions. These sources are called data streams. If the process is
not strictly stationary (as in most real-world applications), the target concept
could gradually change over time.
Keywords
Internet traffic monitoring, on-line stream analysis, sliding windows, frequent item queries
1. Introduction
A Stream Processing Engine is a computing platform for capturing, integrating,
understanding, and reacting to business events as they occur.
Data streaming systems are increasingly used as infrastructure for critical monitoring
applications such as financial alerts and network intrusion detection. These monitoring
applications often have many concurrent users asking similar but different queries over a
common data stream. For example, a system that monitors stock market trades might
have multiple users interested in the total value of trades in a sliding window. While
some of these users might care about stocks of a particular sector, or only about high
volume trades, others might compute complex user-defined predicates on fluctuating
quantities like stock price. Similarly, the aggregation window that different users are
interested in can vary widely. Money managers in financial institutions who run
algorithmic trading systems might want aggregates over 5-10 minute windows reported
every 60-90 seconds depending on the specific financial models they use. In contrast, day
traders with individual investing strategies might only need these results every 5-10
minutes. Clearly such a system will have to support hundreds of queries. Hence the need
for a system that can process all of these queries with minimal delay.
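The scenario above, several users asking for the total value of trades over different windows and with different predicates on one shared stream, can be sketched as follows. This is only an illustrative sketch: the class name `WindowedTotal`, the trade fields (`sector`, `price`, `volume`) and the sample numbers are assumptions made for the example, not part of any system described in this paper.

```python
from collections import deque


class WindowedTotal:
    """Total traded value over the last `window_seconds`, for trades matching a predicate."""

    def __init__(self, window_seconds, predicate):
        self.window_seconds = window_seconds
        self.predicate = predicate      # hypothetical per-user filter
        self.items = deque()            # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def observe(self, timestamp, trade):
        # Only trades that satisfy this user's predicate enter the window.
        if self.predicate(trade):
            value = trade["price"] * trade["volume"]
            self.items.append((timestamp, value))
            self.total += value

    def report(self, now):
        # Evict everything that has fallen out of the window, then answer.
        while self.items and self.items[0][0] <= now - self.window_seconds:
            _, old_value = self.items.popleft()
            self.total -= old_value
        return self.total


# Two users share one stream but use different windows and predicates.
sector_total = WindowedTotal(300, lambda t: t["sector"] == "TECH")
volume_total = WindowedTotal(600, lambda t: t["volume"] >= 50)

stream = [
    (0, {"sector": "TECH", "price": 10, "volume": 100}),
    (200, {"sector": "BANK", "price": 20, "volume": 50}),
    (400, {"sector": "TECH", "price": 5, "volume": 10}),
]
for timestamp, trade in stream:
    for query in (sector_total, volume_total):
        query.observe(timestamp, trade)
```

Each query here keeps its own window, which is the naive approach; with hundreds of similar queries, the shared work (reading and filtering the same stream) is what a stream processing engine tries to do only once.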
Recent years have witnessed an increasing interest in designing algorithms
for querying and analyzing streaming data (i.e., data that is seen only once in a fixed
order) with only limited memory. Providing (perhaps approximate) answers to queries
over such continuous data streams is a crucial requirement for many application
environments; examples include large telecom and IP network installations where
performance data from different parts of the network needs to be continuously collected
and analyzed.
In India, more systems are being automated every day, so traditional database
systems are no longer sufficient for data that arrives on the fly. A stream processing
system helps to improve performance in this setting.
Examples:
Financial services, telecommunication, stock markets, medicine, the military and
industrial process control are some of the areas that will benefit from stream processing.
These sectors receive large amounts of data in very short periods of time, so storing the
data first and then querying the stored data would take too long and leave a backlog of
data queuing. A stream processing engine instead processes and queries the data as it arrives.
2. Objective
The emerging real-time information environment is being fueled by an unprecedented
increase in the amount of live data that needs to be understood and reacted to
instantaneously. The traditional store-and-query model cannot address the needs of a
world in which, in many cases, the value of information exists for only a moment. A Stream
Processing Engine provides the infrastructure to support this growing class of
problems.
Description
The major issues:
What will this engine do?
Why is it required at the moment?
What is the underlying technique?
These questions are not far-fetched. As stated above, streams of data have nowadays
gained relevance over conventional data that is first stored in databases and then
processed. Stored data grew redundant over time and came to be considered obsolete,
which led to the idea of a mechanism that processes the data in the stream itself and
stores only the data that is relevant.
Techniques:
The challenges faced by Stream Processing Engines are manifold.
- They relate to the size of the data set and sometimes to the size of the sliding
  windows and intermediate results of queries.
- Often exact data are not needed for aggregate queries.
- In real life, data streams are not continuous but often have bursts of data (e.g.
  network traffic). Processing these bursts of data without compromising system
  performance is a key challenge.
- Implementing Join processing with minimal resources in data streams is also a
  major challenge.
Recent research on these problems has given birth to the following major contemporary
techniques:
- The performance of aggregate queries benefits greatly from the computation of "Sketch
  Summaries" (i.e. summaries which are representative of the overall data) to
  provide approximate answers. [2]
- Adaptive load-aware scheduling of query operators can be used to minimize
  resource consumption during peak loads. [3]
- Join processing can be made approximate for sliding windows of data. [4]
- Exploiting similarities between incoming queries can lead to better resource utilization.
[Figure: Stream Processing Engine architecture. Labels: Real-time Feeds, Alerts, Actions, Embedded local storage, Remote Data store]
3. Algorithms used or proposed
3.1 Chain: Operator Scheduling for Memory Minimization in Data Stream Systems
Theorem: Given a system with k queries and all operator selectivities ≤ 1,
let C(t) be the number of blocks of memory used by Chain at time t.
Then at every time t, any algorithm must use at least C(t) - k blocks of memory.
The Chain strategy:
- Calculate the lower envelope of the block-size-versus-time chart.
- Priority = slope of the lower envelope segment.
- Always schedule the highest-priority available operator.
- Break ties using operator order in the pipeline, favoring later operators.
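The priority computation listed above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: each operator is described by a hypothetical `(execution_time, selectivity)` pair with a positive execution time, a path starts with one block of data, and the priority of an operator is taken as the steepness (rate of block-size reduction per unit time) of the lower-envelope segment it belongs to. The function names are illustrative.

```python
def progress_points(operators):
    """Points (cumulative time, remaining block size) along one operator path."""
    points = [(0.0, 1.0)]
    for exec_time, selectivity in operators:
        t, size = points[-1]
        points.append((t + exec_time, size * selectivity))
    return points


def chain_priorities(operators):
    """Give each operator the steepness of its lower-envelope segment."""
    points = progress_points(operators)
    priorities = [0.0] * len(operators)
    start = 0
    while start < len(operators):
        t0, s0 = points[start]
        best_end, best_drop = start + 1, float("-inf")
        # The envelope segment from `start` ends at the later point that
        # gives the steepest descent; ties extend the segment further.
        for end in range(start + 1, len(points)):
            t1, s1 = points[end]
            drop = (s0 - s1) / (t1 - t0)
            if drop >= best_drop:
                best_end, best_drop = end, drop
        for i in range(start, best_end):
            priorities[i] = best_drop
        start = best_end
    return priorities


def pick_next(queued, priorities):
    """Index of the operator to run next, or None if nothing is queued."""
    candidates = [i for i, n in enumerate(queued) if n > 0]
    if not candidates:
        return None
    # Highest priority first; ties favor the later operator in the pipeline.
    return max(candidates, key=lambda i: (priorities[i], i))


# Three operators: (1, 0.5), (3, 0.5), (2, 0.0).
example = chain_priorities([(1, 0.5), (3, 0.5), (2, 0.0)])
```

For the three example operators the chart passes through (0,1), (1,0.5), (4,0.25) and (6,0); the first operator forms its own envelope segment, while the second and third share one segment and therefore the same priority.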
In many applications involving continuous data streams, data arrival is bursty
and the data rate fluctuates over time. Systems that seek to give rapid or
real-time query responses in such an environment must be prepared to deal
gracefully with bursts in data arrival without compromising system performance.
We discuss one strategy for processing bursty streams: adaptive,
load-aware scheduling of query operators to minimize resource consumption
during times of peak load. We show that the choice of an operator scheduling
strategy can have significant impact on the run-time system memory usage.
We then present Chain scheduling, an operator scheduling strategy for
data stream systems that is near-optimal in minimizing run-time memory usage
for any collection of single-stream queries involving selections, projections and
foreign-key joins with stored relations. Chain scheduling also performs well
for queries with sliding-window joins over multiple streams, and for multiple queries
of the above types.
Important components of systems research that have received less attention to
date are run-time resource allocation and optimization. In this paper we focus on
one aspect of run-time resource allocation, namely operator scheduling.
Adaptivity is therefore more critical to a data stream system than to a
traditional DBMS.
[Figure: block size versus time for operators Opt1, Opt2 and Opt3, with the lower envelope marked]
Various approaches to adaptive query processing are possible given that the data
may exhibit different types of variability. For example, a system could modify the
structure of query plans, or dynamically reallocate memory among query
operators in response to changing conditions, or take a holistic
approach to adaptivity and do away with fixed query plans altogether, as in the
Eddies architecture. While these approaches focus primarily on adapting to changing
characteristics of the data itself (e.g., changing selectivities), we focus on
adaptivity towards changing arrival characteristics of the data.
As mentioned earlier, most data streams exhibit considerable burstiness
and arrival-rate variation. It is crucial for any stream system to adapt gracefully to
such variations in data arrival, making sure that we do not run out of critical
resources such as main memory during the bursts. The focus of this paper is to
design techniques for such adaptivity.
Query execution can be captured by a data flow diagram, where every tuple
passes through a unique operator path. Thus queries can be represented as rooted
trees. Every operator is a filter that operates on a tuple and produces s tuples, where s is the
operator selectivity. Obviously, the selectivity assumption does not hold at the
granularity of a single tuple but is merely a convenient abstraction to capture the
average behavior of the operator. For example, we assume that a select operator with
selectivity 0.5 will select about 500 tuples of every 1000 tuples that it processes. Henceforth, a
tuple should not be thought of as an individual tuple, but should be viewed as a
convenient abstraction of a memory unit, such as a page, that contains multiple
tuples. Over adequately large memory units, we can assume that if an operator with
selectivity s operates on inputs that require one unit of memory, its output will require
s units of memory.
How it works:
Inputs:
- Data flow path(s) consisting of sequences of operators
- For each operator we know:
  - Execution time (per block)
  - Selectivity
Greedy algorithm:
Operator priority = selectivity per unit time (s_i/t_i)
Always schedule the highest-priority available operator
[Figure: block size versus time for operators Opt1, Opt2 and Opt3, passing through the points (0,1), (1,0.5), (4,0.25) and (6,0)]
[Figure: Query #1 and Query #2 over two streams; each operator i is annotated with its execution time t_i and selectivity s_i, for i = 1 to 4]
3.2 Sliding Window Algorithms:
Many infinite stream algorithms do not have obvious counterparts in the sliding
window model. For example, one counter suffices to maintain the minimum
element in an infinite stream, but keeping track of the minimum element in a
sliding window of size N takes $\Omega(N)$ space: consider an increasing sequence of
values, in which the oldest item in any window is the minimum and must be
replaced whenever the window moves forward. The fundamental problem is that as
new items arrive, old items must be simultaneously evicted from the window, meaning
that we need to store some information about the order of the packets in the window.
Zhu and Shasha introduce Basic Windows to incrementally compute simple
windowed aggregates in [5]. The window is divided into equally-sized Basic
Windows and only a synopsis and a timestamp are stored for each Basic Window.
When the timestamp of the oldest Basic Window expires, that window is dropped and a
fresh Basic Window is added. This method does not require the storage of the entire sliding
window, but results are refreshed only after the stream fills the current Basic
Window. If the available memory is small, then the number of synopses that may
be stored is small and hence the refresh interval is large.
Exponential Histograms (EH) have been introduced by Datar et al. [6] and recently
expanded in [7] to provide approximate answers to simple window aggregates at all
times. The idea is to build Basic Windows with various sizes and maintain a bound on
the error caused by counting those elements in the oldest Basic Window which may have
already expired. The algorithm guarantees an error of at most $\epsilon$ while using
$O((1/\epsilon)\log^2 N)$ space.
[Figure: Memory Usage. Block size versus time for FIFO and Chain scheduling]
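The Basic Window scheme described in this section can be illustrated with a windowed sum. The sketch below is an assumption-laden simplification: it uses a count-based window (a fixed number of elements per Basic Window) instead of timestamps, the synopsis is just a partial sum, and the class name `BasicWindowSum` is made up for the example.

```python
from collections import deque


class BasicWindowSum:
    """Sliding-window sum that stores one partial sum per Basic Window."""

    def __init__(self, window_size, basic_size):
        if window_size % basic_size != 0:
            raise ValueError("window_size must be a multiple of basic_size")
        self.basic_size = basic_size
        self.max_windows = window_size // basic_size
        self.synopses = deque()   # partial sums of completed Basic Windows
        self.current_sum = 0      # partial sum of the Basic Window being filled
        self.current_count = 0
        self.window_total = 0     # sum over all stored synopses

    def add(self, value):
        """Consume one element; return the refreshed sum when a Basic Window fills, else None."""
        self.current_sum += value
        self.current_count += 1
        if self.current_count < self.basic_size:
            return None           # results are refreshed only per Basic Window
        self.synopses.append(self.current_sum)
        self.window_total += self.current_sum
        self.current_sum = 0
        self.current_count = 0
        if len(self.synopses) > self.max_windows:
            # The oldest Basic Window has expired: drop its synopsis.
            self.window_total -= self.synopses.popleft()
        return self.window_total


window = BasicWindowSum(window_size=4, basic_size=2)
answers = [window.add(v) for v in range(1, 9)]
```

Only `window_size / basic_size` partial sums are kept instead of every element, at the cost of an answer that is refreshed once per Basic Window rather than once per element.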
3.3 Proposed algorithm
We propose the following simple algorithm, Frequent, that employs the Basic
Window approach (i.e. the jumping window model) and stores a top-k synopsis in
each Basic Window. We fix an integer k and, for each Basic Window, maintain a list of
the k most frequent items in this window. We assume that a single Basic Window fits
in main memory, within which we may count item frequencies exactly. Let $\delta_i$
be the frequency of the kth most frequent item in the ith Basic Window. Then
$\delta = \sum_i \delta_i$ is the upper limit on the frequency of an item type that does
not appear on any of the top-k lists. Now, we sum the reported frequencies for each item
present in at least one top-k synopsis and if there exists a category whose reported
frequency exceeds $\delta$, we are certain that this category has a true frequency of at
least $\delta$. The pseudo code is given below, assuming that N is the sliding window
size, b is the number of elements per Basic Window, and N/b is the total number of
Basic Windows. An updated answer is generated whenever the window slides forward by
b packets.
Implementation
Repeat:
1. For each element e in the next b elements:
       If a local counter exists for the type of element e:
           Increment the local counter.
       Otherwise:
           Create a new local counter for this element type
           and set it equal to 1.
2. Add a summary S containing identities and counts
   of the k most frequent items to the back of queue Q.
3. Delete all local counters.
4. For each type named in S:
       If a global counter exists for this type:
           Add to it the count recorded in S.
       Otherwise:
           Create a new global counter for this element type
           and set it equal to the count recorded in S.
5. Add the count of the kth largest type in S to $\delta$.
6. If sizeOf(Q) > N/b:
   (a) Remove the summary S' from the front of Q and
       subtract the count of the kth largest type in S' from $\delta$.
   (b) For all element types named in S':
           Subtract from their global counters the counts recorded in S'.
           If a counter is decremented to zero:
               Delete it.
   (c) Output the identity and value of each global
       counter > $\delta$.
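The pseudo code above translates fairly directly into Python. The following is a sketch rather than a reference implementation: the function name `frequent` and its parameters are illustrative, the threshold accumulated in steps 5 and 6(a) is called `delta`, and two details are assumptions of this sketch: a Basic Window with fewer than k distinct item types contributes 0 to the threshold, and ties among equally frequent items are broken by first appearance.

```python
from collections import Counter, deque


def frequent(stream, window_size, basic_size, k):
    """Yield {item: reported count} once per Basic Window after the window is full."""
    max_summaries = window_size // basic_size   # N/b
    queue = deque()            # Q: (summary, kth count) per Basic Window
    global_counts = Counter()  # global counters
    delta = 0                  # running sum of the kth-largest counts
    local = Counter()          # local counters for the current Basic Window
    seen = 0
    for element in stream:
        local[element] += 1                                   # step 1
        seen += 1
        if seen < basic_size:
            continue
        summary = local.most_common(k)                        # step 2
        kth_count = summary[k - 1][1] if len(summary) >= k else 0
        queue.append((summary, kth_count))
        local = Counter()                                     # step 3
        seen = 0
        for item, count in summary:                           # step 4
            global_counts[item] += count
        delta += kth_count                                    # step 5
        if len(queue) > max_summaries:                        # step 6
            old_summary, old_kth = queue.popleft()            # 6(a)
            delta -= old_kth
            for item, count in old_summary:                   # 6(b)
                global_counts[item] -= count
                if global_counts[item] == 0:
                    del global_counts[item]
            # 6(c): report every global counter that exceeds delta
            yield {i: c for i, c in global_counts.items() if c > delta}


# Window of N = 8 elements, Basic Windows of b = 4 elements, top-2 synopses.
results = list(frequent("aaabaacdaaac", window_size=8, basic_size=4, k=2))
```

As in the pseudo code, nothing is reported until the queue first exceeds N/b summaries, so the first answer appears only after N/b + 1 Basic Windows have been read.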
3.4 Processing Complex Aggregate Queries over Data Streams
Queries are defined once, and run until the user terminates them.
- Q is a selection: then size(A) may be unbounded. Thus, we cannot guarantee
  that we can store it.
- Q is a self-join: if we want to provide only NEW results, then we need
  unlimited storage to guarantee that no duplicates exist in the result.
- Q contains aggregation: then tuples in A might be deleted by newly observed
  tuples.
  Ex:
      Select A, sum(B)
      From Stream X
      Group by A
      Having sum(B) > 100
  What if B < 0?
- What if we can delete tuples in the Stream?
- What if Q contains a blocking operator near the top (example: aggregation)?
  Online Aggregation techniques are useful here.
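The "What if B < 0?" question can be made concrete with a small simulation of the example query. This is a hypothetical sketch: the function name and the sample tuples are invented for illustration, and the answer is simply recomputed from running sums after every arriving tuple.

```python
from collections import defaultdict


def having_sum_answers(tuples, threshold=100):
    """After each (A, B) tuple, yield the groups A whose running sum(B) exceeds the threshold."""
    sums = defaultdict(int)
    for a, b in tuples:
        sums[a] += b
        yield sorted(group for group, total in sums.items() if total > threshold)


stream_x = [("x", 60), ("x", 50), ("y", 120), ("x", -20)]
answers_over_time = list(having_sum_answers(stream_x))
```

Group x enters the answer once its sum reaches 110 and leaves it again when the negative value arrives, so a tuple that was already part of the answer has to be retracted. With only non-negative B, the answer could only grow.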
- Work on Self-Maintenance: important to limit the size of Scratch. If a view is
  self-maintainable, any auxiliary storage must occupy bounded space.
- Work on Data Expiration: important for knowing when to move elements
  from Scratch to Throw.
Goal: Group similar queries over data sources, to eliminate common
processing and minimize the response time and storage needed.
- Niagara (joint work of Wisconsin, Oregon)
- Tukwila (Washington)
- Telegraph (Berkeley)
Eddy: knowing the state of tuples
- Passes tuples by reference to operators (avoids copying).
- When the Eddy does not have any more input tuples, it polls the sources for more
  input.
Tuples need to be augmented with additional information:
- Ready Bits: which operators need to be applied
- Done Bits: which operators have been applied
- Queries Completed: signals if the tuple has been output or rejected by the
  query
- Completion Mask (per query): to know when a tuple can be output for
  a query (completion mask & done bits = mask)
Queries with no joins are partitioned per data source (to save space in the bits
required).
Queries with disjunctions (ORs) are transformed into conjunctive normal
form (an AND of ORs).
Range/exact predicates are found in the Grouped filter.
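The ready bits, done bits and per-query completion mask listed above can be sketched with plain integer bitmasks. The class and function names below are illustrative assumptions, not the Telegraph API; operator i is mapped to bit i, and a tuple that fails an operator's predicate simply never gets that done bit.

```python
from dataclasses import dataclass


@dataclass
class RoutedTuple:
    values: dict
    ready: int      # bit i set: operator i still needs to be applied
    done: int = 0   # bit i set: operator i has been applied successfully


def apply_operator(tup, op_index, predicate):
    """Run one operator on the tuple and update its ready/done bits."""
    bit = 1 << op_index
    tup.ready &= ~bit                 # the operator no longer needs to run
    passed = predicate(tup.values)
    if passed:
        tup.done |= bit
    return passed


def can_output(tup, completion_mask):
    """A tuple can be output for a query when: completion mask & done bits = mask."""
    return (completion_mask & tup.done) == completion_mask


# Query 1 needs operators 0 and 1; query 2 needs only operator 2.
QUERY_1_MASK = 0b011
QUERY_2_MASK = 0b100

t = RoutedTuple(values={"price": 120, "volume": 40}, ready=0b111)
apply_operator(t, 0, lambda v: v["price"] > 100)
apply_operator(t, 2, lambda v: v["volume"] < 50)
ready_for_q1_early = can_output(t, QUERY_1_MASK)
ready_for_q2_early = can_output(t, QUERY_2_MASK)
apply_operator(t, 1, lambda v: v["volume"] > 10)
```

Because the state travels with the tuple, operators can be applied in any order, which is what lets an Eddy reorder them at run time.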
SteMs and Joins
- SteMs: Multiway-Pipelined Joins
- Double-Pipelined Joins maintain a hash index on each relation.
- When N relations are joined, at least N-2 in-flight indices are needed for
  intermediate results, even for left-deep trees.
- The previous approach cannot change the query plan without re-computing
  intermediate indices.
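A double-pipelined join over two inputs, with one hash index per relation as described above, can be sketched as follows. The class name and the `insert` signature are assumptions for illustration only; this is not the SteM interface.

```python
from collections import defaultdict


class DoublePipelinedJoin:
    """Two-way equi-join that keeps a hash index on each input."""

    def __init__(self):
        self.index = {"left": defaultdict(list), "right": defaultdict(list)}

    def insert(self, side, key, row):
        """Index the new row on its own side, then probe the other side's index."""
        other = "right" if side == "left" else "left"
        self.index[side][key].append(row)
        matches = self.index[other].get(key, [])
        # Always emit pairs as (left row, right row).
        if side == "left":
            return [(row, match) for match in matches]
        return [(match, row) for match in matches]


join = DoublePipelinedJoin()
out_1 = join.insert("left", 1, "a")
out_2 = join.insert("right", 1, "x")
out_3 = join.insert("left", 1, "b")
out_4 = join.insert("right", 2, "y")
```

Results are produced as soon as a matching pair has arrived from either side, so neither input blocks the other. Chaining such joins for N relations is where the intermediate (in-flight) indices mentioned above come from.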
4. Conclusion and Future Work:
In this paper we have examined the practical implementation of a real-time
data stream processing engine by analyzing the algorithms and their feasibility, and
found them to be applicable to the test scenario.
Next we will develop our own engine, which captures data packets over a
socket, interprets them into meaningful data, processes them, stores the relevant
information in the database and drops the rest of the data once it has been processed
and is no longer required.
Real-time stream data processing engines are the need of the hour, since
data mining is growing more critical day by day for the development of the
individual, the economic growth of society and the advancement of technology.
5. References
Research Projects:
- Aurora (supports cq, ad-hoc query, and materialized view): aims to better support monitoring applications
- Borealis (distributed SPE, QoS based techniques): a distributed stream processing engine based on Aurora and Medusa
Shivnath Babu and Rajeev Motwani,
Department of Computer Science, Stanford University, Stanford, CA 94305
Cornell University, [email protected]
Minos Garofalakis, Bell Labs, Lucent, [email protected]
Johannes Gehrke, Cornell [email protected]
Rajeev Rastogi, Bell Labs, Lucent, [email protected]
[1] A. Arasu et al. Resource sharing in continuous sliding-window aggregates. In VLDB, 2004.
[2] A. Arasu et al. The CQL continuous query language: Semantic foundations and query execution. VLDB Journal, (To appear).
[3] F. Bancilhon et al. FAD, a powerful and simple database language. In VLDB, 1987.
[4] D. Carney et al. Monitoring streams - a new class of data management applications. In VLDB, 2002.
[5] S. Chandrasekaran et al. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR, 2003.
[6] S. Chandrasekaran et al. Streaming queries over streaming data. In VLDB, 2002.
[7] J. Chen et al. NiagaraCQ: a scalable continuous query system for Internet databases. In SIGMOD, 2000.
[8] C. D. Cranor et al. Gigascope: A stream database for network applications. In SIGMOD, 2003.
[9] M. Denny et al. Predicate result range caching for continuous queries. In SIGMOD, 2005.
[10] P. M. Deshpande et al. Caching multidimensional queries using chunks. In SIGMOD, 1998.
[11] C. L. Forgy. Rete: A fast algorithm for the many pattern/many object match problem. Artificial Intelligence, 19(1):17-37, September 1982.
[12] M. J. Franklin et al. Design considerations for high fan-in systems: The HiFi approach. In CIDR, 2005.
[13] L. Golab et al. Update-pattern-aware modeling and processing of continuous queries. In SIGMOD, 2005.
[14] G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73-170, June 1993.
[15] J. Gray et al. Data Cube: a relational aggregation operator generalizing group-by, cross-tab and sub-total. In ICDE, February 1996.