Boğaziçi University – Computer Engineering Dept. CMPE 521
An Efficient and Resilient Approach to Filtering & Disseminating Streaming Data
PAPER PRESENTATION on
An Efficient and Resilient Approach to Filtering & Disseminating Streaming Data
CMPE 521
Database Systems
Prepared by:
Mürsel Taşgın
Onur Kardeş
Introduction
The internet and the web are increasingly used to disseminate fast-changing data.
Some examples of fast-changing data:
sensor readings,
traffic and weather information,
stock prices,
sports scores,
health monitoring information
Introduction
The properties of this data:
highly dynamic,
streaming,
aperiodic.
Users are interested not only in monitoring streaming data but also in using it for on-line decision making.
Introduction
[Figure: Replicating the Source. The SOURCE is replicated at Repository 1, Repository 2 and Repository 3.]
Introduction
Services like Akamai.net and IBM’s edge server technology are exemplars of such networks of repositories, which aim to provide better services by shifting most of the work to the edge of the network (closer to the end users).
Although such systems scale quite well, when the data changes at a fast rate the quality of service at a repository farther from the data source deteriorates.
Introduction
In general:
Replication can reduce the load on the sources.
However, replication of time-varying data introduces new challenges:
coherency,
delays and scalability.
Introduction
Coherency requirement (cr): users specify the bound on the tolerable imprecision associated with each requested data item.
[Figure: the same data item as seen at different points in the network.
SOURCE: Microsoft: $60.85 at time 11:43
Repository 2: Microsoft: $60.86 at time 11:41
Repository 1: Microsoft: $60.89 at time 11:36
USER 1 and USER 2 obtain the value from the repositories.]
Introduction
Coherency-preserving system:
the delivered data must preserve the associated coherency requirements,
it must be resilient to failures,
it must be efficient.
Necessary changes are pushed to the users, instead of each user polling the source independently.
Introduction
Construction of an effective dissemination network of repositories
A logical overlay network of repositories is created according to:
the coherency needs of the users attached to each repository,
the expected delays at each repository.
This network is called the dynamic data dissemination graph (d3g).
Introduction
Construction of an effective dissemination network of repositories
The previous algorithm for building the d3g, called LeLA, was unable to cope with a large number of data items.
A new algorithm (DiTA) is introduced to build dissemination networks that are scalable and resilient.
Introduction
Construction of an effective dissemination network of repositories
In DiTA, repositories with more stringent coherency requirements are placed closer to the source in the network, as they are likely to get more updates than the ones with looser coherency requirements.
In DiTA, a dynamic data dissemination tree (d3t) is created for each data item x.
Introduction
Construction of an effective dissemination network of repositories
[Figure: the SOURCE with Repository 1 (c = 0.2), Repository 2 (c = 0.3), Repository 3 (c = 0.8), Repository 4 (c = 0.7), Repository 5 (c = 0.9) and Repository 6 (c = 0.7).]
Introduction
Provision for the dissemination of dynamic data in spite of failures in the overlay network
To handle repository and communication link failures, back-up parents are used.
A back-up parent is asked to deliver data with a coherency that is less stringent than that associated with the parent.
Introduction
Provision for the dissemination of dynamic data in spite of failures in the overlay network
[Figure: a Parent and a Back-up Parent serving a dependent; the nodes are labelled with data items (x,y,z,t; a,b,c,x; z; y,z,t; x,t).]
Introduction
Efficient filtering and scheduling techniques for repositories
Normally a repository receives updates and selectively disseminates them to its downstream dependents.
It is not always necessary to disseminate the exact values of the most recent updates, as long as the values presented preserve the coherency of the data.
The Basic Framework: Data Coherency and Overlay Network
A coherency requirement (c) is associated with a data item to denote the maximum permissible deviation of the user's view from the value of the data item x at the source.
c can be specified in terms of:
time (values should never be out of sync by more than 5 sec.),
value (e.g. weather information where the temperature value should never be out of sync by more than 2 degrees).
The Basic Framework: Data Coherency and Overlay Network
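Each data item in the repository from which a user obtains data must be refreshed in such a way that the user-specified coherency requirements are maintained:
|U_x(t) - S_x(t)| ≤ c1
|P_x(t) - S_x(t)| ≤ c2
where S_x(t), P_x(t) and U_x(t) denote the value of x at the source, at the repository and at the user at time t.
The fidelity f observed by a user can be defined as the total length of time for which the above inequality holds, normalized by the total length of the observation.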
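The fidelity definition can be made concrete with a small sketch. It assumes the source value and the observed view are sampled at the same, evenly spaced instants, so that counting in-sync samples approximates the length of time; the function name `fidelity` and the sample data are illustrative only and do not come from the paper.

```python
def fidelity(source, view, c):
    """Percentage of sampled instants at which |view - source| <= c.

    Assumption: both lists are sampled at the same, evenly spaced instants,
    so the share of in-sync samples approximates the share of in-sync time.
    """
    if not source or len(source) != len(view):
        raise ValueError("need two non-empty traces of equal length")
    in_sync = sum(1 for s, v in zip(source, view) if abs(v - s) <= c)
    return 100.0 * in_sync / len(source)


# Hypothetical traces: the view lags behind the source at the third sample.
source_trace = [10.0, 10.2, 10.5, 10.4]
user_view = [10.0, 10.0, 10.0, 10.4]
print(fidelity(source_trace, user_view, c=0.3))  # 75.0
```

The loss of fidelity for this view is then 100% minus the returned value.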
The Basic Framework: Data Coherency and Overlay Network
Assume x is served by a single source.
Repositories R1, ..., Rn are interested in x.
The source directly serves some of these repositories. These repositories in turn serve a subset of the remaining repositories, such that the resulting network is in the form of a tree rooted at the source and consisting of repositories R1, ..., Rn.
This defines a parent-dependent relationship.
The Basic Framework: Data Coherency and Overlay Network
Since the repository disseminates updates to its users and dependents, the coherency requirement of a repository should be the most stringent requirement that it has to serve.
When a data change occurs at the source, it checks which of its direct and indirect dependents are interested in the change and pushes the change to them.
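As a minimal illustration of the first point: since a smaller c means a more stringent requirement, the coherency requirement a repository must ask of its own parent is the minimum over everything it serves. The function name and the numbers below are hypothetical.

```python
def repository_cr(user_crs, dependent_crs):
    """Most stringent (smallest) coherency requirement a repository must meet.

    user_crs: requirements of the users attached to the repository.
    dependent_crs: requirements of the repositories it serves.
    """
    return min(list(user_crs) + list(dependent_crs))


print(repository_cr([0.5, 0.3], [0.7, 0.2]))  # 0.2
```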
Building a d3t
Start with a physical layout of the communication network in the form of a graph, where the graph consists of a set of sources, repositories and the underlying network.
Try to build a d3t for a data item x.
The root of the d3t will be the source, which serves x.
A repository P serving repository Q with data item x, is called the parent of Q; and Q is called the dependent of P for x.
Building a d3t
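[Figure: a d3t for data item x. The source for data item x is at Level 0, repositories R1 and R2 are at Level 1 and further repositories at Level 2; the Parent and Dependents roles are marked, and USERS are attached to each repository.]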
Building a d3t
A repository should ideally serve at most as many unique (dependent, data item) pairs as the number of data items served to it.
If a repository is currently serving fewer than this fixed number, then we say that the repository has the resources to serve a new dependent.
Pairs currently served by R1:
Dependent   Data Item
R7          x
R11         y
R18         x
R9          z
R10         t
R21         x
?
Building a d3t
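[Figure: inserting R10 (c = 0.3) into the d3t for x. Existing nodes: SOURCE, R4 (c = 0.1), R5 (c = 0.4), R6 (c = 0.5), R7 (c = 0.8), R8 (c = 0.6), R9 (c = 0.7); subtrees are annotated with Max(c) values of 0.8, 0.7, 0.8, 0.6 and 0.7. On the way down from the source the question "Enough resources?" is answered NO, NO and then YES. Since cR6 > cR10, R10 takes the place of R6 and R6 is pushed down.]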
Building a d3t
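[Figure: the d3t after the insertion. Nodes: SOURCE, R4 (c = 0.1), R5 (c = 0.4), R6 (c = 0.5), R8 (c = 0.6), R10 (c = 0.3), R9 (c = 0.7), R7 (c = 0.8); subtrees are annotated with Max(c) values of 0.6, 0.8, 0.8, 0.7, 0.7 and 0.5.]
This algorithm is called the Data-item-at-a-Time Algorithm (DiTA).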
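The placement idea can be sketched in a few lines. This is a simplified illustration, not the paper's exact procedure: it only keeps the property that a repository is never more stringent than its parent, it models "enough resources" as a fixed child capacity, and when a node is full it descends into the child with the loosest c. The capacity value and the child-selection rule are both assumptions made for the sketch, as are all class and function names.

```python
class Repo:
    def __init__(self, name, c, capacity=2):
        self.name = name
        self.c = c                # coherency requirement (smaller = more stringent)
        self.capacity = capacity  # assumed resource limit: max number of dependents
        self.children = []

    def has_resources(self):
        return len(self.children) < self.capacity


def insert(node, q):
    """Place repository q below node; the caller guarantees node.c <= q.c."""
    if node.has_resources():
        node.children.append(q)
        return
    # Assumption: descend towards the loosest child of a full node.
    r = max(node.children, key=lambda child: child.c)
    if q.c < r.c:
        # q is more stringent than r: q takes r's place and r is pushed down.
        node.children[node.children.index(r)] = q
        q.children.append(r)
    else:
        insert(r, q)


def is_ordered(node):
    """True if no repository is more stringent than its parent."""
    return all(node.c <= ch.c and is_ordered(ch) for ch in node.children)


def size(node):
    return 1 + sum(size(ch) for ch in node.children)


source = Repo("SOURCE", 0.0)
for name, c in [("R4", 0.1), ("R7", 0.8), ("R5", 0.4), ("R9", 0.7),
                ("R8", 0.6), ("R6", 0.5), ("R10", 0.3)]:
    insert(source, Repo(name, c))
print(is_ordered(source), size(source))  # True 8
```

The repository names and c values are taken from the slide figure; the resulting tree shape depends on the assumed capacity and insertion order and is not expected to match the figure.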
Building a d3t
Traces – Collection procedure and characteristics
Real-world stock price streams from http://finance.yahoo.com are used.
10,000 values are polled during 1,000 traces; a new data value is obtained approximately once per second.
Building a d3t
Repositories – Data, Coherency and Cooperation characteristics
A coherency requirement c is associated with each of the chosen data items.
The c's associated with the data in a repository are a mix of stringent tolerances (varying from $0.01 to $0.05) and less stringent tolerances (varying from $0.5 to $0.99).
T% of the data items have stringent coherency requirements at each repository (the remaining (100 - T)% of the data items have less stringent coherency requirements).
Building a d3t
Physical Network – topology and delays
The router topology was generated using BRITE (http://www.cs.bu.edu/brite).
The repositories and the sources are selected randomly.
Node-to-node communication delays are derived from a Pareto distribution: x → 1 / x^(1/α) + x1, where α = x’ / (x’ - 1) and
Building a d3t
Physical Network – topology and delays
x’ is the mean and x1 is the minimum delay a link can have.
In the experiments, x’ = 15 ms and x1 = 2 ms.
The computational delay for dissemination is taken to be 12.5 ms.
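A link delay could be drawn from this Pareto model roughly as follows. The sketch reads the x in the delay formula as a uniform random number in (0, 1], which the slides do not state explicitly; the function and constant names are illustrative.

```python
import random

MEAN_DELAY_MS = 15.0  # x' in the slides
MIN_DELAY_MS = 2.0    # x1 in the slides


def link_delay_ms(rng, mean=MEAN_DELAY_MS, minimum=MIN_DELAY_MS):
    """Sample one node-to-node delay as 1 / x^(1/alpha) + x1."""
    alpha = mean / (mean - 1.0)
    x = 1.0 - rng.random()  # uniform in (0, 1], avoids division by zero
    return 1.0 / x ** (1.0 / alpha) + minimum


rng = random.Random(42)  # fixed seed only to make the run repeatable
delays = [link_delay_ms(rng) for _ in range(1000)]
print(min(delays) >= MIN_DELAY_MS + 1.0)  # True: 1 / x^(1/alpha) is at least 1
```

Because the distribution is heavy-tailed, a few sampled delays are far larger than the mean.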
Building a d3t
Metrics
The key metric is the loss in fidelity of the data.
Fidelity is the total length of time for which the inequality
|P(t) - S(t)| < c
holds, normalized by the total length of the observation.
The fidelity of a repository is the mean fidelity over all data items stored in that repository.
The fidelity of the system is the mean fidelity of all repositories.
The loss of fidelity is (100% - fidelity).
Another metric is the number of messages in the system (system load).
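The aggregation of these metrics can be written down directly. The sketch assumes per-item fidelities are already available as percentages; the repository names and numbers are made up for illustration.

```python
from statistics import mean


def repository_fidelity(item_fidelities):
    """Mean fidelity (in %) over the data items stored at one repository."""
    return mean(item_fidelities)


def system_fidelity(repositories):
    """Mean of the repository fidelities; maps repository name -> item fidelities."""
    return mean(repository_fidelity(f) for f in repositories.values())


def loss_of_fidelity(fidelity_percent):
    return 100.0 - fidelity_percent


example = {"R1": [100.0, 90.0], "R2": [80.0, 60.0]}  # hypothetical values
print(loss_of_fidelity(system_fidelity(example)))  # 17.5
```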
Building a d3t
Performance Evaluation
For the base performance measurement, 600 routers, 100 repositories and 4 servers were used.
The total number of data items served by the servers was varied from 100 to 1000.
The T parameter was varied from 20 to 80.
A previous algorithm, LeLA, was used as a benchmark.
Building a d3t
Performance Evaluation
Each node in DiTA does less work than in LeLA.
Thus, the dissemination tree is taller in DiTA.
So, when computational delays are low but link delays are large, LeLA may perform better.
However, this happens only for negligible computational delays (0.5 ms) and very high link delays (110 ms).
Enhancing the Resiliency of the Repository Network
Active backups vs. passive backups
Passive backups may increase the load, which causes a loss in fidelity.
So active backup parents are used.
A backup parent B serves data to a dependent Q with a coherency cB > c, where c is the coherency at which the parent P serves Q.
Enhancing the Resiliency of the Repository Network
If all changes are smaller than cB, the dependent cannot know when the parent P fails. So P should send periodic “I’m alive” messages.
Once P fails, Q requests B to serve it the data at c. When P recovers from the failure, Q requests B to serve the data item at cB again.
In this approach, there is no backup for backups, so when both P and B fail, Q cannot get any updates.
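The switching behaviour of the dependent Q can be sketched as a small state holder. The heartbeat timeout, the class name and the method names are assumptions for illustration; the slides only say that P sends periodic “I’m alive” messages and that Q asks B for c or cB depending on whether P is up.

```python
class Dependent:
    """Q's view of its parent P and backup parent B for one data item."""

    def __init__(self, c, k, heartbeat_timeout):
        self.c = c                        # coherency Q gets from P
        self.c_backup = k * c             # looser coherency cB asked of B
        self.timeout = heartbeat_timeout  # assumed failure-detection timeout
        self.last_heard = 0.0
        self.parent_alive = True
        self.backup_coherency = self.c_backup  # what Q currently asks B for

    def on_parent_message(self, now):
        """Called for any update or "I'm alive" message from P."""
        self.last_heard = now
        if not self.parent_alive:
            # P has recovered: relax the request to B back to cB.
            self.parent_alive = True
            self.backup_coherency = self.c_backup

    def on_tick(self, now):
        """Called periodically; detects a silent parent."""
        if self.parent_alive and now - self.last_heard > self.timeout:
            # P is presumed failed: ask B to serve the data at c.
            self.parent_alive = False
            self.backup_coherency = self.c


q = Dependent(c=0.1, k=2, heartbeat_timeout=5.0)
q.on_parent_message(now=1.0)
q.on_tick(now=7.0)             # no message for 6 time units
print(q.backup_coherency)      # 0.1
q.on_parent_message(now=8.0)   # P is back
print(q.backup_coherency)      # 0.2
```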
Enhancing the Resiliency of the Repository Network
Choice of cB Using a Probabilistic Model
For the sake of simplicity, cB = k * c. Here, the choice of k is important:
small k: the backup will send updates frequently, which incurs high computational and communication overheads;
large k: the dependent will miss a large number of changes during a failure of the parent.
Enhancing the Resiliency of the Repository Network
Choice of cB Using a Probabilistic Model
Assuming that the data values change with uniform probability, and using a Markov chain model:
# Misses = 2k^2 - 2
is the number of updates a dependent will miss before it detects that there is a failure.
According to the experiments, this number is rather pessimistic; it is nearly an upper limit.
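To get a feel for how quickly the estimate grows with k, the formula 2k^2 - 2 can be tabulated for a few values. The function name is illustrative, and the formula is used as read from the slide.

```python
def expected_misses(k):
    """Model estimate of updates missed before a failure is detected: 2k^2 - 2."""
    return 2 * k * k - 2


for k in (1, 2, 3, 4):
    print(k, expected_misses(k))
# 1 0
# 2 6
# 3 16
# 4 30
```

With k = 1 the backup serves at the same coherency as the parent, so the model predicts no misses; the cost is that the backup sends every update the parent does.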
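Enhancing the Resiliency of the Repository Network
Choice of backup parents
[Figure: choosing a backup parent for Q (nodes R, B, P, Q and C). The question "Any siblings?" is asked; the NO case is shown first, and in the YES case (siblings B and C) one of them is chosen randomly.]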
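The sibling rule can be sketched as follows, with the tree stored as a child-to-parent map. What happens when the parent has no siblings is not recoverable from the slide, so this sketch simply returns None in that case; that choice, like the function name, is an assumption.

```python
import random


def choose_backup_parent(parent_of, q, rng=random):
    """Pick a random sibling of q's parent as q's backup parent.

    parent_of maps each repository to its parent; the source has no entry.
    Returns None when q's parent has no siblings (assumed behaviour).
    """
    p = parent_of.get(q)
    grandparent = parent_of.get(p)
    if grandparent is None:
        return None  # p is the source or unknown, so it has no siblings
    siblings = sorted(n for n, par in parent_of.items()
                      if par == grandparent and n != p)
    return rng.choice(siblings) if siblings else None


tree = {"B": "R", "P": "R", "C": "R", "Q": "P"}  # node names from the slide figure
print(choose_backup_parent(tree, "Q", random.Random(0)) in {"B", "C"})  # True
```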
Enhancing the Resiliency of the Repository Network
Choice of backup parents
In case the coherency at which Q wants x from B is less than (i.e. more stringent than) the coherency at which B itself wants x, the parent of B is asked to serve x to Q with the required tighter coherency.
An advantage of choosing a sibling is that the change in coherency requirement is not percolated all the way to the source.
However, if an ancestor of P and B is heavily loaded, then the delay due to the load will be reflected in the updates of both P and B. This might result in additional loss in fidelity.
Enhancing the Resiliency of the Repository Network
Effect of Repository Failures on Loss of Fidelity
Because these kinds of failures are memoryless, an exponential probability distribution is used for simulating them:
Pr(X > t) = e^(-λt)
λ = λ1 for the time to failure,
λ = λ2 for the time to recover (a large λ2 means fast recovery, a small λ2 means slow recovery).
In this approach link failures are not taken into account, so the model is incomplete.
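A failure and recovery timeline under this model could be generated as below. The sketch uses Python's `random.expovariate`, whose argument is the rate λ; the time unit, the horizon and the function name are assumptions, since the slides do not give them.

```python
import random


def simulate_outages(lambda1, lambda2, horizon, rng):
    """Return (fail_time, recover_time) intervals for one repository.

    Time to failure ~ Exp(lambda1), time to recover ~ Exp(lambda2).
    Intervals are clipped to the horizon.
    """
    t, outages = 0.0, []
    while True:
        t += rng.expovariate(lambda1)     # repository is up until it fails
        if t >= horizon:
            break
        down = rng.expovariate(lambda2)   # repository stays down this long
        outages.append((t, min(t + down, horizon)))
        t += down
        if t >= horizon:
            break
    return outages


rng = random.Random(1)  # fixed seed only to make the run repeatable
outages = simulate_outages(lambda1=0.0001, lambda2=0.01, horizon=100000.0, rng=rng)
print(len(outages))
```

The mean time to recover is 1/λ2, which is why a larger λ2 corresponds to faster recovery.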
Enhancing the Resiliency of the Repository Network
Performance Evaluation
The effect of adding resiliency is shown.
k = 2 is used.
When 100 data items are used, 23% of the updates sent by backups are disseminated.
Some updates sent by backups arrived before the parents' updates.
Enhancing the Resiliency of the Repository Network
Performance Evaluation
But when backup parents are loaded (> 400), their updates are of no use and increase the loss of fidelity.
The dependent should control them by time-stamping the updates.
Enhancing the Resiliency of the Repository Network
Performance Evaluation
During the experiment, about 80-90% of the repositories experienced at least one failure, and the maximum number of failures in the system at any given time for λ2 = 0.001 was around 12.
For λ2 = 0.01, the maximum number of failures was 5, and for λ2 = 0.1 it was 2.
Enhancing the Resiliency of the Repository Network
Performance Evaluation
The effect of quick recovery is shown.
λ1 = 0.0001 and λ2 = 2
For stringent coherency requirements, resiliency improves fidelity even for transient failures.
Enhancing the Resiliency of the Repository Network
Performance Evaluation
However, with resiliency and a very large number of data items (e.g., 1000), fidelity drops.
This is because, at this point, the cost of resiliency exceeds the benefits obtained from it, which increases the loss in fidelity.
Reducing the Delay at a Repository
Delays
1) Queueing delay: the time between the arrival of an update and the time its processing starts.
2) Processing delay: check delay (deciding whether the update should be processed) + computation delay (computing the update and pushing the data to the dependents).
[Figure: updates of x and y arrive and are queued (queueing delay); each update is checked to see if it is needed, then processed and disseminated (processing delay).]
Reducing the Delay at a Repository
Question: How can we reduce the average delays to improve fidelity?
This can be done by:
a) Better filtering, i.e. reducing the processing delay in determining whether an update needs to be disseminated to one or more dependents
b) Better scheduling of disseminations
Reducing the Delay at a Repository
Better Filtering
For each dependent, a repository maintains the coherency requirement (cr) and the last value pushed to it.
Upper bound = last pushed value + cr
Lower bound = last pushed value - cr
Dependent ordering: the cr values of the dependents reside at the repository as sorted cr values:
C1 = 0.7, C2 = 0.6, C3 = 0.5, C4 = 0.3, C5 = 0.1, C6 = 0.05
Algorithm to find the dependents to disseminate data to: find the first dependent, i.e. the one with the largest cr, to which the update needs to be disseminated.
For every window the rule below is valid.
If an update violates the above rule, a pseudo value is generated as the actual value.
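The window check itself is easy to sketch. The code below is the straightforward per-dependent test (every dependent is checked for every update), not the ordered scan with pseudo values described on the slide, whose window rule is not recoverable from this transcript. The class and function names and the sample values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class DependentState:
    name: str
    cr: float           # coherency requirement of this dependent
    last_pushed: float  # last value pushed to this dependent


def push(dependents, new_value):
    """Push new_value to every dependent whose window it falls outside.

    The window of a dependent is [last_pushed - cr, last_pushed + cr].
    Returns the names of the dependents that received the update.
    """
    updated = []
    for d in dependents:
        lower, upper = d.last_pushed - d.cr, d.last_pushed + d.cr
        if not (lower <= new_value <= upper):
            d.last_pushed = new_value
            updated.append(d.name)
    return updated


deps = [DependentState("D1", 0.7, 10.0),
        DependentState("D2", 0.3, 10.0),
        DependentState("D3", 0.05, 10.0)]
print(push(deps, 10.2))  # ['D3']
print(push(deps, 10.4))  # ['D2', 'D3']
```

Keeping the dependents sorted by cr, as the slide proposes, aims to avoid checking every dependent for every update.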
Reducing the Delay at a Repository
Better Filtering
Better filtering provides:
• Updates of dynamic data are sent only to the end users who are actually interested in them.
• With filtering, no unnecessary data flows over the network (no flooding of data). This improves communication time in the network and provides better response times.
• With filtering, a more scalable system can be established, one that withstands unexpected heavy loads.
Reducing the Delay at a Repository
Better scheduling of disseminations
[Figure: two queued updates u1 and u2, each with a cost C(u) (the delay of the update) and a benefit b(u) (the beneficiaries of the update); the total delay of processing ui is shown.]
Approach:
Instead of processing the update requests in standard queue order, a kind of prioritization gives better performance: b(u)/C(u) scoring.
Each update request is scheduled according to this score. b(u) is the number of dependents that will receive the update, and C(u) is the cost of dissemination to all dependents. The b(u) values are stored at each repository, so they are precomputed automatically.
Advantages:
• Update requests that are important to many dependents are processed earlier (BUSINESS IMPORTANCE).
• Updates with a low ratio get delayed, and if a new update arrives the older ones are dropped, which improves performance, especially in heavily loaded environments (SCALABILITY).
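The scoring rule can be sketched with a small scheduler. Two assumptions are made here: a newer update supersedes a pending one only when it is for the same data item, and costs are positive numbers supplied by the caller. The class name, method names and sample values are illustrative.

```python
class UpdateScheduler:
    """Serve pending updates in decreasing order of b(u) / C(u)."""

    def __init__(self):
        self.pending = {}  # data item -> (value, benefit, cost)

    def enqueue(self, item, value, benefit, cost):
        # Assumption: a newer update for the same item replaces the older
        # one, which is thereby dropped without being processed.
        self.pending[item] = (value, benefit, cost)

    def next_update(self):
        """Remove and return (item, value) with the highest score, or None."""
        if not self.pending:
            return None
        best = max(self.pending,
                   key=lambda i: self.pending[i][1] / self.pending[i][2])
        value, _, _ = self.pending.pop(best)
        return best, value


s = UpdateScheduler()
s.enqueue("x", 10.0, benefit=2, cost=4.0)  # score 0.5
s.enqueue("y", 5.0, benefit=6, cost=3.0)   # score 2.0
s.enqueue("x", 10.5, benefit=2, cost=4.0)  # replaces the older update of x
print(s.next_update())  # ('y', 5.0)
print(s.next_update())  # ('x', 10.5)
```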
Reducing the Delay at a Repository
Better scheduling of disseminations
Scheduling provides:
• A priority scheme and a business-importance approach that achieve better results.
• Like filtering, it improves scalability: out-of-date update requests are discarded from the queue, which saves unnecessary computation and queueing delays.
Reducing the Delay at a Repository
Experimental Results
“Dependent ordering” has a lower loss of fidelity than the “simple algorithm”. However, “Scheduling” performs better than both (by up to 15%).
“Dependent ordering” has fewer pushes than the “simple algorithm”.
The “Scheduling” algorithm decreases computation delays, because some updates are dropped from the queue when newer updates arrive and the older ones become out of date.
Fidelity loss with “Scheduling” is shown with some numbers. Fidelity drops as the number of data items increases. Even with large increases in the number of data items and high update rates, the loss of fidelity stays within 10%.
This provides better scalability.
Reducing the Delay at a Repository
Advantages of the better performance approaches
Approach 1: Maintaining the dependents ordered by cr values
Reduces the number of checks required for processing each update
Reduces the number of pushes
Approach 2: Scheduling
Reduces the overall delay to the end clients by processing updates which provide a higher benefit at a lower cost
Gives a better choice in dropping updates, as low-score updates are dropped
Due to the lower propagation delay, it provides better scalability and degrades gracefully under unexpected heavy loads
Related Work
A simple decision procedure is superior, because there are many complex algorithms and database systems that take much computation time to keep a data repository up to date.
Some dynamic web data dissemination algorithms also use a push-based scheme. However, using coherency requirements improves scalability, and another important feature is that data repositories don't need to cooperate with each other to maintain coherence information (it is already up to date).
This approach deals with rapidly changing dynamic data, while some similar approaches focus on web content that changes at slower time-scales.
The most powerful side of this approach is that it deals with the problem of failures and forms a resilient dissemination network.
Conclusion
The key points in this architecture are:
Design of a push-based dissemination scheme for time-varying data. Not all updates are disseminated to each repository; only the updates needed to meet the coherency requirements are pushed. EFFICIENT
Design of a cooperative dissemination network. This provides a resilient network; even if a failure occurs in the network, data coherency is not completely lost. RESILIENT
Intelligent filtering, scheduling and selective dissemination reduce the overhead in the network. This provides better scalability and is a good alternative for dynamic data publishing. SCALABLE