Real Time Data Stream Processing Engine


  • 8/3/2019 Real Time Data Stream Processing Engine


    REAL TIME DATA STREAM PROCESSING ENGINE

    Ashish Kumar Gupta1 and Abhinav Rastogi2

    1Nibble Computer Society, JSS Academy Of Technical Education, Noida 201301, India
    [email protected]

    2Quanta Electronics Society, JSS Academy Of Technical Education, Noida 201301
    [email protected]

    Abstract:

    The rapid growth of information science and technology in general, and of the
    complexity and volume of data in particular, has introduced new challenges for
    the research community. Databases are growing incessantly and many sources
    produce data continuously. In many cases, we need to extract some sort of
    knowledge from this continuous stream of data. Examples include customer click
    streams, telephone records, large sets of web pages, multimedia data, and sets of
    retail chain transactions. These sources are called data streams. If the process is
    not strictly stationary (as in most real-world applications), the target concept
    could gradually change over time.

    Keywords

    Internet traffic monitoring, on-line stream analysis, sliding windows, frequent item queries

    1. Introduction

    A Stream Processing Engine is a computing platform for capturing, integrating,

    understanding, and reacting to business events as they occur.

    Data streaming systems are increasingly used as infrastructure for critical monitoring

    applications such as financial alerts and network intrusion detection. These monitoring

    applications often have many concurrent users asking similar but different queries over a common data stream. For example, a system that monitors stock market trades might

    have multiple users interested in the total value of trades in a sliding window. While

    some of these users might care about stocks of a particular sector, or only about high


    volume trades, others might compute complex user-defined predicates on fluctuating

    quantities like stock price. Similarly, the aggregation window that different users are

    interested in can vary widely. Money managers in financial institutions who run
    algorithmic trading systems might want aggregates over 5-10 minute windows reported
    every 60-90 seconds, depending on the specific financial models they use. In contrast, day
    traders with individual investing strategies might only need these results every 5-10
    minutes. Clearly, such a system will have to support hundreds of queries. Hence the need
    for a system that can process all of these queries with minimal delay.

    Recent years have witnessed an increasing interest in designing algorithms
    for querying and analyzing streaming data (i.e., data that is seen only once in a fixed
    order) with only limited memory. Providing (perhaps approximate) answers to queries
    over such continuous data streams is a crucial requirement for many application
    environments; examples include large telecom and IP network installations where
    performance data from different parts of the network needs to be continuously collected
    and analyzed.

    In India, systems are being automated at a rapid pace, and traditional database
    systems are not sufficient for data that must be processed on the fly. A stream
    processing engine can help improve performance in such settings.

    Examples:

    Financial services, telecommunications, stock markets, medicine, the military and
    industrial process control are some of the areas that will benefit from stream processing.
    These sectors receive large amounts of data in a very short time, so storing the data and
    then querying the stored data would take too long and would leave a backlog of data queuing.
    A stream processing engine instead processes and queries the data as it arrives.

    2. Objective

    The emerging real-time information environment is being fueled by an unprecedented
    increase in the amount of live data that needs to be understood and reacted to
    instantaneously. The traditional store-and-query model cannot address the needs of a
    world where, in many cases, the value of information may exist for only a moment. A Stream
    Processing Engine provides the infrastructure to support this growing class of
    problems.

    Description

    The major issues:

    What will this engine do?

    Why is it required at the moment?

    What is the underlying technique?

    These questions are easily answered. As stated above, streaming data has gained
    importance relative to conventional data that is first stored in a database and then
    processed. Stored data grows redundant over time and eventually becomes obsolete.
    This motivated the idea of a mechanism that processes the data within the stream
    itself and stores only the data that is relevant.

    Techniques:

    The challenges faced by Stream Processing Engines are manifold:

    - They relate to the size of the data set, and sometimes to the size of the sliding windows and of the intermediate results of queries.
    - Often, exact answers are not needed for aggregate queries.
    - In real life, data streams do not arrive at a steady rate but often contain bursts of data (e.g. network traffic). Processing these bursts without compromising system performance is a key challenge.
    - Implementing join processing with minimal resources in data streams is also a major challenge.

    Recent research on these problems has given rise to the following major contemporary
    techniques:

    - The performance of aggregate queries benefits greatly from the computation of "Sketch Summaries" (i.e. summaries that are representative of the overall data) to provide approximate answers. [2]
    - Adaptive load-aware scheduling of query operators can be used to minimize resource consumption during peak loads. [3]
    - Join processing can be made approximate for sliding windows of data. [4]
    - Exploiting similarities between incoming queries can lead to better resource utilization.

    [Figure: stream processing engine overview. Labels: Real-time Feeds, Alerts, Actions, Embedded local storage, Remote data store.]


    3. Algorithms used or proposed

    3.1 Chain: Operator Scheduling for Memory Minimization in Data Stream Systems

    Theorem: Given a system with k queries in which all operator selectivities are at most 1,
    let C(t) be the number of blocks of memory used by Chain at time t.
    At every time t, any algorithm must use at least C(t) - k blocks of memory.

    The Chain strategy:

    - Calculate the lower envelope (of block size plotted against processing time).
    - Priority of an operator = slope of its lower envelope segment.
    - Always schedule the highest-priority available operator.
    - Break ties using operator order in the pipeline, favoring later operators.
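    The priority rule above can be illustrated with a short Python sketch. It is not the authors' implementation: the function name chain_priorities is hypothetical, and the sketch assumes the chart is given as (cumulative time, block size) points with strictly increasing times. The example points are the ones that appear on the chart in this section.

```python
def chain_priorities(points):
    """Return one priority per operator from a block-size-versus-time chart.

    points[i] is (cumulative time, block size) after i operators have run,
    so operator i corresponds to the segment points[i] -> points[i + 1].
    Assumes times are strictly increasing (an assumption of this sketch).
    """
    priorities = [0.0] * (len(points) - 1)
    start = 0
    while start < len(points) - 1:
        t0, s0 = points[start]
        best, best_slope = start + 1, None
        # Find the later point reached by the steepest downward line.
        for j in range(start + 1, len(points)):
            tj, sj = points[j]
            slope = (s0 - sj) / (tj - t0)  # block size released per unit time
            # ">=" prefers the farthest point among equally steep lines.
            if best_slope is None or slope >= best_slope:
                best, best_slope = j, slope
        # Every operator under this envelope segment shares its slope.
        for op in range(start, best):
            priorities[op] = best_slope
        start = best
    return priorities


chart = [(0, 1.0), (1, 0.5), (4, 0.25), (6, 0.0)]
print(chain_priorities(chart))
```

    For these points the first operator forms an envelope segment of its own, while the second and third operators fall under a single, flatter segment and therefore share one priority.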

    In many applications involving continuous data streams, data arrival is bursty
    and the data rate fluctuates over time. Systems that seek to give rapid or
    real-time query responses in such an environment must be prepared to deal
    gracefully with bursts in data arrival without compromising system performance.
    We discuss one strategy for processing bursty streams: adaptive,
    load-aware scheduling of query operators to minimize resource consumption
    during times of peak load. We show that the choice of an operator scheduling
    strategy can have a significant impact on run-time system memory usage.
    We then present Chain scheduling, an operator scheduling strategy for
    data stream systems that is near-optimal in minimizing run-time memory usage
    for any collection of single-stream queries involving selections, projections and
    foreign-key joins with stored relations. Chain scheduling also performs well
    for queries with sliding-window joins over multiple streams, and for multiple queries
    of the above types.

    Important components of systems research that have received less attention to
    date are run-time resource allocation and optimization. In this paper we focus on
    one aspect of run-time resource allocation, namely operator scheduling.
    Therefore, adaptivity is more critical to a data stream system than to a
    traditional DBMS.

    [Figure: lower envelope of block size for operators Opt1, Opt2 and Opt3.]


    Various approaches to adaptive query processing are possible given that the data
    may exhibit different types of variability. For example, a system could modify the
    structure of query plans, or dynamically reallocate memory among query
    operators in response to changing conditions, as has been suggested previously, or take a holistic
    approach to adaptivity and do away with fixed query plans altogether, as in the
    Eddies architecture. While these approaches focus primarily on adapting to changing
    characteristics of the data itself (e.g., changing selectivities), we focus on
    adaptivity towards changing arrival characteristics of the data.

    As mentioned earlier, most data streams exhibit considerable burstiness
    and arrival-rate variation. It is crucial for any stream system to adapt gracefully to
    such variations in data arrival, making sure that we do not run out of critical
    resources such as main memory during the bursts. The focus of this paper is to
    design techniques for such adaptivity.

    Query execution can be captured by a data flow diagram, where every tuple
    passes through a unique operator path. Thus queries can be represented as rooted
    trees. Every operator is a filter that operates on a tuple and produces s tuples, where s is the
    operator selectivity. Obviously, the selectivity assumption does not hold at the
    granularity of a single tuple but is merely a convenient abstraction to capture the
    average behavior of the operator. For example, we assume that a select operator with selectivity 0.5
    will select about 500 tuples of every 1000 tuples that it processes. Henceforth, a
    tuple should not be thought of as an individual tuple, but should be viewed as a
    convenient abstraction of a memory unit, such as a page, that contains multiple
    tuples. Over adequately large memory units, we can assume that if an operator with
    selectivity s operates on inputs that require one unit of memory, its output will require s
    units of memory.
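    The selectivity model above can be made concrete with a small Python sketch. It assumes, for illustration only, that each operator on a path is described by a (time per block, selectivity) pair, and it traces how much memory one unit-sized block occupies after each operator has run. The function name size_chart and the example operators are hypothetical.

```python
def size_chart(operators):
    """Trace one unit-sized block along a path of operators.

    operators: list of (time_per_block, selectivity) pairs, in pipeline order.
    Returns (cumulative time, remaining block size) after each operator.
    """
    time, size = 0.0, 1.0
    points = [(time, size)]
    for op_time, selectivity in operators:
        time += op_time
        size *= selectivity  # output needs `selectivity` units per input unit
        points.append((time, size))
    return points


print(size_chart([(1, 0.5), (3, 0.5), (2, 0.0)]))
# [(0.0, 1.0), (1.0, 0.5), (4.0, 0.25), (6.0, 0.0)]
```

    A selectivity of 0 for the last operator models a block that leaves the system once its result has been output.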

    How it works:

    Inputs:
    - Data flow path(s) consisting of sequences of operators
    - For each operator we know:
      - Execution time (per block)
      - Selectivity


    Greedy algorithm:

    Operator priority = reduction in data size per unit time, (1 - si)/ti

    Always schedule the highest-priority available operator
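    To see why the scheduling order matters for memory, the following Python sketch compares a FIFO order with a greedy order during a burst. It is a toy model under stated assumptions, not the authors' experiment: every operator takes one tick per block, the greedy priority is taken to be the block size released per tick, and the two-operator pipeline (releasing 0.8 and then 0.2 of each block) and the five-block burst are invented for illustration. The function name simulate is hypothetical.

```python
def simulate(drops, arrival_ticks, ticks, use_priority):
    """Return queued memory at each tick for a chain of unit-time operators.

    drops[i] is the fraction of the ORIGINAL block released by operator i;
    the drops sum to 1, so a block vanishes after the last operator.
    """
    blocks = []  # each block is [arrival order, current size, next operator]
    memory = []
    arrived = 0
    for tick in range(ticks):
        if tick in arrival_ticks:
            blocks.append([arrived, 1.0, 0])
            arrived += 1
        # Memory is sampled after arrivals and before this tick's work.
        memory.append(round(sum(b[1] for b in blocks), 6))
        if not blocks:
            continue
        if use_priority:
            # Greedy: largest size drop per tick first, oldest block on ties.
            block = max(blocks, key=lambda b: (drops[b[2]], -b[0]))
        else:
            # FIFO: finish the oldest block before starting the next one.
            block = min(blocks, key=lambda b: b[0])
        block[1] -= drops[block[2]]
        block[2] += 1
        if block[2] == len(drops):
            blocks.remove(block)
    return memory


burst = {0, 1, 2, 3, 4}
fifo = simulate([0.8, 0.2], burst, 10, use_priority=False)
greedy = simulate([0.8, 0.2], burst, 10, use_priority=True)
print(max(fifo), max(greedy))  # 3.0 1.8
```

    In this toy run the greedy order keeps shrinking newly arrived blocks during the burst and postpones the cheap final operator, so its peak memory stays well below that of FIFO.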

    [Figure: block size against time, passing through the points (0,1), (1,0.5), (4,0.25) and (6,0), for operators Opt1, Opt2 and Opt3.]

    [Figure: Query #1 and Query #2 over input streams; the four operators are annotated with times t1 to t4 and selectivities s1 to s4.]


    3.2 Sliding Window Algorithms:

    Many infinite stream algorithms do not have obvious counterparts in the sliding
    window model. For example, one counter suffices to maintain the minimum
    element in an infinite stream, but keeping track of the minimum element in a
    sliding window of size N takes $\Omega(N)$ space: consider an increasing sequence of
    values, in which the oldest item in any window is the minimum and must be
    replaced whenever the window moves forward. The fundamental problem is that as new items arrive,
    old items must be simultaneously evicted from the window, meaning that we need to
    store some information about the order of the packets in the window.

    Zhu and Shasha introduce Basic Windows to incrementally compute simple
    windowed aggregates in [5]. The window is divided into equally-sized Basic
    Windows, and only a synopsis and a timestamp are stored for each Basic Window.
    When the timestamp of the oldest Basic Window expires, that window is dropped and a fresh Basic
    Window is added. This method does not require the storage of the entire sliding
    window, but results are refreshed only after the stream fills the current Basic
    Window. If the available memory is small, then the number of synopses that may
    be stored is small and hence the refresh interval is large.

    [Figure: Memory Usage. Block size against time for the FIFO and Chain scheduling strategies.]

    Exponential Histograms (EH) have
    been introduced by Datar et al. [6] and recently expanded in [7] to provide
    approximate answers to simple window aggregates at all times. The idea is to build Basic
    Windows with various sizes and maintain a bound on the error caused by
    counting those elements in the oldest Basic Window which may have already expired. The algorithm
    guarantees an error of at most $\epsilon$ while using $O((1/\epsilon)\log^2 N)$ space.
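    As an illustration of the Basic Window idea described in this section, here is a minimal Python sketch of a windowed sum. It makes simplifying assumptions that are not in the original text: windows are measured in element counts rather than timestamps, the synopsis is a single partial sum, and the class name BasicWindowSum is hypothetical.

```python
from collections import deque


class BasicWindowSum:
    """Sliding-window sum that is refreshed once per Basic Window."""

    def __init__(self, window_size, basic_size):
        if window_size % basic_size != 0:
            raise ValueError("window_size must be a multiple of basic_size")
        self.basic_size = basic_size
        # One synopsis (a partial sum) per Basic Window; maxlen makes the
        # deque drop the oldest synopsis automatically when a new one is added.
        self.synopses = deque(maxlen=window_size // basic_size)
        self.current_sum = 0
        self.current_count = 0
        self.answer = None  # last reported windowed sum

    def add(self, value):
        """Consume one element and return the most recently refreshed answer."""
        self.current_sum += value
        self.current_count += 1
        if self.current_count == self.basic_size:
            self.synopses.append(self.current_sum)
            self.current_sum = 0
            self.current_count = 0
            self.answer = sum(self.synopses)
        return self.answer


window = BasicWindowSum(window_size=6, basic_size=2)
print([window.add(v) for v in range(1, 9)])
# [None, 3, 3, 10, 10, 21, 21, 33]
```

    Only three partial sums are stored for a window of six elements, at the cost of an answer that changes only every second element. This is the trade-off between memory and refresh interval noted above.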

    3.3 Proposed algorithm

    We propose the following simple algorithm, Frequent, which employs the Basic
    Window approach (i.e. the jumping window model) and stores a top-k synopsis in
    each Basic Window. We fix an integer k and, for each Basic Window, maintain a list of the k
    most frequent items in this window. We assume that a single Basic Window fits
    in main memory, within which we may count item frequencies exactly. Let $\delta_i$
    be the frequency of the kth most frequent item in the ith Basic Window. Then
    $\delta = \sum_i \delta_i$ is the upper limit on the frequency of an item type that does not appear on
    any of the top-k lists. Now, we sum the reported frequencies for each item present in at
    least one top-k synopsis, and if there exists a category whose reported frequency
    exceeds $\delta$, we are certain that this category has a true frequency of at least $\delta$.
    The pseudocode is given below, assuming that N is the sliding window size, b is the number of
    elements per Basic Window, and N/b is the total number of Basic Windows. An updated answer is
    generated whenever the window slides forward by b packets.

    Implementation

    Repeat:
    1. For each element e in the next b elements:
         If a local counter exists for the type of element e:
           Increment the local counter.
         Otherwise:
           Create a new local counter for this element type
           and set it equal to 1.
    2. Add a summary S containing the identities and counts
       of the k most frequent items to the back of queue Q.
    3. Delete all local counters.
    4. For each type named in S:
         If a global counter exists for this type:
           Add to it the count recorded in S.
         Otherwise:
           Create a new global counter for this element type
           and set it equal to the count recorded in S.
    5. Add the count of the kth largest type in S to $\delta$.
    6. If sizeOf(Q) > N/b:
       (a) Remove the summary S' from the front of Q and
           subtract the count of the kth largest type in S' from $\delta$.
       (b) For all element types named in S':
             Subtract from their global counters the counts recorded in S'.
             If a counter is decremented to zero:
               Delete it.
       (c) Output the identity and value of each global
           counter > $\delta$.
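    The pseudocode above can be turned into a runnable Python sketch along the following lines. This is one possible reading rather than a reference implementation: the function name frequent and its parameter names are hypothetical, the running threshold is stored in a variable called delta, and a Basic Window with fewer than k distinct items is assumed to contribute 0 to that threshold.

```python
from collections import Counter, deque


def frequent(stream, window_size, basic_size, k):
    """Yield, once the window is full, items whose summed top-k counts exceed delta."""
    max_summaries = window_size // basic_size
    queue = deque()            # one (summary, kth count) pair per Basic Window
    global_counts = Counter()  # summed counts over the summaries in the queue
    delta = 0                  # sum of the kth-largest counts in the queue
    local = Counter()
    seen = 0
    for element in stream:
        # Step 1: exact counting inside the current Basic Window.
        local[element] += 1
        seen += 1
        if seen < basic_size:
            continue
        # Steps 2-3: summarise the Basic Window and reset the local counters.
        summary = local.most_common(k)
        # Assumption: with fewer than k distinct items, unseen items have count 0.
        kth = summary[-1][1] if len(summary) == k else 0
        queue.append((summary, kth))
        local = Counter()
        seen = 0
        # Steps 4-5: fold the summary into the global counters and delta.
        for item, count in summary:
            global_counts[item] += count
        delta += kth
        # Step 6: expire the oldest summary, then report.
        if len(queue) > max_summaries:
            old_summary, old_kth = queue.popleft()
            delta -= old_kth
            for item, count in old_summary:
                global_counts[item] -= count
                if global_counts[item] == 0:
                    del global_counts[item]
            yield {item: c for item, c in global_counts.items() if c > delta}


print(list(frequent("aabaacaadbbd", window_size=6, basic_size=3, k=2)))
# [{'a': 4}, {}]
```

    In the example, the first report covers the six elements "aacaad", in which "a" really does occur four times; after the next Basic Window no item clears the threshold, so the report is empty.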

    3.4 Processing Complex Aggregate Queries over Data Streams

    Queries are defined once and run until the user terminates them.

    - Q is a selection: size(A) may be unbounded, so we cannot guarantee that we can store it.
    - Q is a self-join: if we want to provide only NEW results, then we need unlimited storage to guarantee that no duplicates exist in the result.
    - Q contains aggregation: tuples in A might be deleted by newly observed tuples.

    Ex: Select A, sum(B)
        From Stream X
        Group by A
        Having sum(B) > 100

    What if B < 0?

    - What if we can delete tuples in the Stream?
    - What if Q contains a blocking operator near the top (example: aggregation)? Online Aggregation techniques are useful here.
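    The "What if B < 0?" question can be illustrated with a small Python sketch of the example query evaluated continuously. The function name having_sum_over and the sample tuples are hypothetical; the sketch simply recomputes the qualifying groups after each arriving tuple.

```python
from collections import defaultdict


def having_sum_over(stream, threshold):
    """Yield the set of qualifying groups after each (a, b) tuple arrives."""
    sums = defaultdict(int)
    for a, b in stream:
        sums[a] += b
        yield {group for group, total in sums.items() if total > threshold}


results = list(having_sum_over([("x", 80), ("x", 30), ("x", -20)], 100))
print(results)  # [set(), {'x'}, set()]
```

    Group "x" enters the answer when its sum reaches 110 and has to be withdrawn again when a negative value brings the sum back to 90, which is why tuples already in the answer may need to be deleted.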


    - Work on Self-Maintenance: important to limit the size of Scratch. If a view is self-maintainable, any auxiliary storage must occupy bounded space.
    - Work on Data Expiration: important for knowing when to move elements from Scratch to Throw.

    Goal: group similar queries over data sources, to eliminate common processing and to minimize response time and storage.

    Niagara (joint work of Wisconsin and Oregon), Tukwila (Washington), Telegraph (Berkeley)

    Eddy: Knowing the State of Tuples

    - Passes tuples by reference to operators (avoids copying).
    - When the Eddy does not have any more input tuples, it polls the sources for more input.

    Tuples need to be augmented with additional information:

    - Ready Bits: which operators need to be applied.
    - Done Bits: which operators have been applied.
    - Queries Completed: signals whether the tuple has been output or rejected by the query.
    - Completion Mask (per query): used to know when a tuple can be output for a query (completion mask & done bits = mask).

    Further points:

    - Queries with no joins are partitioned per data source (to save space in the bits required).
    - Queries with disjunctions (ORs) are transformed into conjunctive normal form (an AND of ORs).
    - Range/exact predicates are found in the Grouped Filter.
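    The completion-mask test mentioned above (completion mask & done bits = mask) is a plain bitwise check. The Python sketch below shows it under assumed conventions: bit i of each integer stands for operator i, and the masks for the two example queries are invented for illustration.

```python
def can_output(done_bits, completion_mask):
    """True when every operator required by the query has been applied."""
    return (done_bits & completion_mask) == completion_mask


# Hypothetical setup: three operators, numbered by bit position 0, 1, 2.
query1_mask = 0b011  # query 1 needs operators 0 and 1
query2_mask = 0b101  # query 2 needs operators 0 and 2

done = 0b011         # operators 0 and 1 have been applied to this tuple
print(can_output(done, query1_mask))  # True
print(can_output(done, query2_mask))  # False
```

    Because the state is a single integer per tuple, one tuple can be shared by many queries and checked against each query's mask without copying.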


    SteMs and Joins

    - SteMs: multiway pipelined joins.
    - Double-pipelined joins maintain a hash index on each relation.
    - When n relations are joined, at least n-2 in-flight indices are needed for intermediate results, even for left-deep trees.
    - The previous approach cannot change the query plan without re-computing intermediate indices.
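    A double-pipelined join, which keeps a hash index on each input as noted above, can be sketched in Python as follows. This is a generic two-input symmetric hash join written for illustration, not code from Telegraph or any other system named here; the function name and the ("L"/"R", key, value) event format are assumptions.

```python
from collections import defaultdict


def symmetric_hash_join(events):
    """Join two interleaved inputs on key, emitting matches as they appear.

    events: iterable of (side, key, value), where side is "L" or "R".
    Yields (key, left_value, right_value) tuples.
    """
    index = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, key, value in events:
        other = "R" if side == "L" else "L"
        index[side][key].append(value)           # build into this side's index
        for match in index[other].get(key, []):  # probe the other side's index
            if side == "L":
                yield (key, value, match)
            else:
                yield (key, match, value)


events = [("L", 1, "a"), ("R", 1, "x"), ("R", 2, "y"), ("L", 2, "b"), ("L", 1, "c")]
print(list(symmetric_hash_join(events)))
# [(1, 'a', 'x'), (2, 'b', 'y'), (1, 'c', 'x')]
```

    Each arriving tuple is inserted into its own index and immediately probed against the other one, so results stream out regardless of which input arrives first.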

    4. Conclusion and Future Work:

    In this paper we have examined the practical implementation of a real-time
    data stream processing engine by analysing the algorithms and their feasibility, and
    found them to be applicable to the test scenario.

    Next, we will develop our own engine, which captures data packets over a
    socket, interprets them into meaningful data, processes them, stores the relevant
    information in the database, and drops the rest of the data once it has been processed
    and is no longer required.

    Real-time stream data processing engines are the need of the hour, since
    data mining is growing more critical day by day for the development of the
    individual, the economic growth of society and the advancement of technology.

    5. References

    Research Projects:

    - Aurora (supports continuous queries, ad-hoc queries, and materialized views): aims to better support monitoring applications
    - Borealis (distributed SPE, QoS-based techniques): a distributed stream processing engine based on Aurora and Medusa

    Shivnath Babu and Rajeev Motwani, Department of Computer Science, Stanford University, Stanford, CA 94305

    Cornell University, [email protected]

    Minos Garofalakis, Bell Labs, Lucent, [email protected]

    Johannes Gehrke, Cornell University, [email protected]

    Rajeev Rastogi, Bell Labs, Lucent, [email protected]

    [1] A. Arasu et al. Resource sharing in continuous sliding-window aggregates. In VLDB. 2004.

    [2] A. Arasu et al. The CQL continuous query language: Semantic foundations and query execution. VLDB Journal, (To appear).

    [3] F. Bancilhon et al. FAD, a powerful and simple database language. In VLDB. 1987.

    [4] D. Carney et al. Monitoring streams - a new class of data management applications. In VLDB. 2002.

    [5] S. Chandrasekaran et al. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR. 2003.

    [6] S. Chandrasekaran et al. Streaming queries over streaming data. In VLDB. 2002.

    [7] J. Chen et al. NiagaraCQ: a scalable continuous query system for Internet databases. In SIGMOD. 2000.

    [8] C. D. Cranor et al. Gigascope: A stream database for network applications. In SIGMOD. 2003.

    [9] M. Denny et al. Predicate result range caching for continuous queries. In SIGMOD. 2005.

    [10] P. M. Deshpande et al. Caching multidimensional queries using chunks. In SIGMOD. 1998.

    [11] C. L. Forgy. Rete: A fast algorithm for the many pattern/many object match problem. Artificial Intelligence, 19(1):17-37, September 1982.

    [12] M. J. Franklin et al. Design considerations for high fan-in systems: The HiFi approach. In CIDR. 2005.


    [13] L. Golab et al. Update-pattern-aware modeling and processing of continuous queries. In SIGMOD. 2005.

    [14] G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73-170, June 1993.

    [15] J. Gray et al. Data Cube: a relational aggregation operator generalizing group-by, cross-tab and sub-total. In ICDE. February 1996.