On Unified Stream Reasoning - The RDF Stream Processing realm
Data Stream Processing - University of Edinburgh
Transcript of Data Stream Processing - University of Edinburgh
![Page 1: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/1.jpg)
Data Stream ProcessingPart II
Stream Windows Lossy Counting Sticky Sampling1
![Page 2: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/2.jpg)
Data Streams (recap)
continuous, unbounded sequence of itemsunpredictable arrival timestoo large to store locallyone pass real time processing required
Stream Windows Lossy Counting Sticky Sampling2
![Page 3: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/3.jpg)
Reservoir Sampling (recap)
r/N r/N r/N r/N r/N
Reservoir
Stream
create representative sample of incoming data items Nuniformly sample into reservoir of size r
Stream Windows Lossy Counting Sticky Sampling3
![Page 4: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/4.jpg)
Today
Counting algorithms based on streamwindows
Lossy Counting
Sticky Sampling
Stream Windows Lossy Counting Sticky Sampling4
![Page 5: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/5.jpg)
Stream Windows
Mechanism for extracting a finite relation from aninfinite stream.
Stream Windows Lossy Counting Sticky Sampling5
![Page 6: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/6.jpg)
Window Example
a d j u w s w y u j g d e d l
stream
past future
a d j u w s w y u j g d e d l
stream
a d j u w s w y u j g d e d l
stream
Sliding Window
Stream Windows Lossy Counting Sticky Sampling6
![Page 7: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/7.jpg)
Window Example
a d j u w s w y u j g d e d l
stream
past future
a d j u w s w y u j g d e d l
stream
a d j u w s w y u j g d e d l
stream
Sliding Window
Stream Windows Lossy Counting Sticky Sampling7
![Page 8: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/8.jpg)
Window Example
a d j u w s w y u j g d e d l
stream
past future
a d j u w s w y u j g d e d l
stream
a d j u w s w y u j g d e d l
stream
Sliding Window
Stream Windows Lossy Counting Sticky Sampling8
![Page 9: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/9.jpg)
Window Example
a d j u w s w y u j g d e d l
stream
past future
a d j u w s w y u j g d e d l
stream
a d j u w s w y u j g d e d l
stream
Sliding WindowStream Windows Lossy Counting Sticky Sampling
9
![Page 10: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/10.jpg)
Window Types
assumes existences of some attribute that defines the order ofthe stream elements (e.g. time)w is the window length (size) expressed in units of theordering attribute (e.g. seconds)
t1 t2 t3 t4 t1' t2' t3' t4'Sliding Window
ti' - ti = w
t1 t2 t3Tumbling Window
ti+1 - ti = w
Stream Windows Lossy Counting Sticky Sampling10
![Page 11: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/11.jpg)
Window Types
assumes existences of some attribute that defines the order ofthe stream elements (e.g. time)w is the window length (size) expressed in units of theordering attribute (e.g. seconds)
t1 t2 t3 t4 t1' t2' t3' t4'Sliding Window
ti' - ti = w
t1 t2 t3Tumbling Window
ti+1 - ti = w
Stream Windows Lossy Counting Sticky Sampling11
![Page 12: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/12.jpg)
Count based Windows
Ordering attribute can cause problems for duplicates(e.g. same time stamps)
Use count based windows instead
t1 t2 t3t1' t2' t3'Count based Window
Count based windows are potentially unpredicatable with respect tofluctuation in input rates.
Stream Windows Lossy Counting Sticky Sampling12
![Page 13: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/13.jpg)
Count based Windows
Ordering attribute can cause problems for duplicates(e.g. same time stamps)
Use count based windows instead
t1 t2 t3t1' t2' t3'Count based Window
Count based windows are potentially unpredicatable with respect tofluctuation in input rates.
Stream Windows Lossy Counting Sticky Sampling13
![Page 14: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/14.jpg)
Punctuation based Windows
Split windows based on punctuations in the data
Punctuation based Window\n \n \n
Potentially problematic if windows grow too large or too small.
Stream Windows Lossy Counting Sticky Sampling14
![Page 15: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/15.jpg)
Punctuation based Windows
Split windows based on punctuations in the data
Punctuation based Window\n \n \n
Potentially problematic if windows grow too large or too small.
Stream Windows Lossy Counting Sticky Sampling15
![Page 16: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/16.jpg)
Window Standing Query Example
What is the average of the integers in the window?
Stream of integersWindow of size w = 4
Count based sliding windowfor the first w inputs, sum and countafterwards change average by adding (i − j)/w to the previouswindow average
Stream Windows Lossy Counting Sticky Sampling16
![Page 17: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/17.jpg)
Window Standing Query Example
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1+3+5+44
= 3.25
3.25 + i−jw
with i newest value, joldest value
1+3+5+44
+ 8−14
= 5
5 + 9−34
= 6.5
Datastructure?
Stream Windows Lossy Counting Sticky Sampling17
![Page 18: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/18.jpg)
Window Standing Query Example
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1+3+5+44
= 3.25
3.25 + i−jw
with i newest value, joldest value
1+3+5+44
+ 8−14
= 5
5 + 9−34
= 6.5
Datastructure?
Stream Windows Lossy Counting Sticky Sampling18
![Page 19: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/19.jpg)
Window Standing Query Example
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1+3+5+44
= 3.25
3.25 + i−jw
with i newest value, joldest value
1+3+5+44
+ 8−14
= 5
5 + 9−34
= 6.5
Datastructure?
Stream Windows Lossy Counting Sticky Sampling19
![Page 20: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/20.jpg)
Window Standing Query Example
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1+3+5+44
= 3.25
3.25 + i−jw
with i newest value, joldest value
1+3+5+44
+ 8−14
= 5
5 + 9−34
= 6.5
Datastructure?
Stream Windows Lossy Counting Sticky Sampling20
![Page 21: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/21.jpg)
Window Standing Query Example
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1+3+5+44
= 3.25
3.25 + i−jw
with i newest value, joldest value
1+3+5+44
+ 8−14
= 5
5 + 9−34
= 6.5
Datastructure?
Stream Windows Lossy Counting Sticky Sampling21
![Page 22: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/22.jpg)
Window Standing Query Example
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1 3 5 4 8 9 3 1 4 2 7 5 6 8 7
stream
1+3+5+44
= 3.25
3.25 + i−jw
with i newest value, joldest value
1+3+5+44
+ 8−14
= 5
5 + 9−34
= 6.5
Datastructure?
Stream Windows Lossy Counting Sticky Sampling22
![Page 23: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/23.jpg)
Window Average
#!/usr/bin/env python2import sysimport QueueWINDOW = 4elems = Queue.Queue()elem_sum = 0for i in range(WINDOW): # initial average
val = int(sys.stdin.readline().strip())elems.put(val)elem_sum += val
avg = float(elem_sum) / WINDOWfor line in sys.stdin:
print(avg)val = int(line.strip())avg = avg + (val - elems.get())/float(WINDOW)elems.put(val)
Allows calculation in a single pass of each element.
Stream Windows Lossy Counting Sticky Sampling23
![Page 24: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/24.jpg)
Window Average
#!/usr/bin/env python2import sysimport QueueWINDOW = 4elems = Queue.Queue()elem_sum = 0for i in range(WINDOW): # initial average
val = int(sys.stdin.readline().strip())elems.put(val)elem_sum += val
avg = float(elem_sum) / WINDOWfor line in sys.stdin:
print(avg)val = int(line.strip())avg = avg + (val - elems.get())/float(WINDOW)elems.put(val)
Allows calculation in a single pass of each element.
Stream Windows Lossy Counting Sticky Sampling24
![Page 25: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/25.jpg)
Window based Algorithm
Lossy Counting
Stream Windows Lossy Counting Sticky Sampling25
![Page 26: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/26.jpg)
Problem Description
Maintain a count of distinctelements seen so far
Examples:Google web crawler counting URL encounters.Detecting spam pages through content analysis.User login rankings to web services.
Straight forward solution: Hashtable
Too large for memory, too slow on disk
Stream Windows Lossy Counting Sticky Sampling26
![Page 27: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/27.jpg)
Problem Description
Maintain a count of distinctelements seen so far
Examples:Google web crawler counting URL encounters.Detecting spam pages through content analysis.User login rankings to web services.
Straight forward solution: Hashtable
Too large for memory, too slow on disk
Stream Windows Lossy Counting Sticky Sampling27
![Page 28: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/28.jpg)
Problem Description
Maintain a count of distinctelements seen so far
Examples:Google web crawler counting URL encounters.Detecting spam pages through content analysis.User login rankings to web services.
Straight forward solution: Hashtable
Too large for memory, too slow on disk
Stream Windows Lossy Counting Sticky Sampling28
![Page 29: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/29.jpg)
Problem Description
Maintain a count of distinctelements seen so far
Examples:Google web crawler counting URL encounters.Detecting spam pages through content analysis.User login rankings to web services.
Straight forward solution: Hashtable
Too large for memory, too slow on disk
Stream Windows Lossy Counting Sticky Sampling29
![Page 30: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/30.jpg)
Algorithm Parameters
Environment ParametersElements seen so far N
User-specified Parameterssupport threshold s ∈ (0, 1)
error parameter ε ∈ (0, 1)
Stream Windows Lossy Counting Sticky Sampling30
![Page 31: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/31.jpg)
Algorithm Guarantees
1 All items whose true frequency exceeds sN are output. Thereare no false negatives.
2 No items whose true frequency is less than (s − ε)N is output.3 Estimated frequencies are less than the true frequencies by at
most εN.
Stream Windows Lossy Counting Sticky Sampling31
![Page 32: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/32.jpg)
Example
With s = 10%, ε = 1%,N = 1000
1 All elements exceeding frequency sN = 100 will be output.2 No elements with frequencies below (s − ε)N = 90 are output.
False positives between 90 and 100 might or might not beoutput.
3 All estimated frequencies diverge from their true frequencies byat most εN = 10 instances.
Rule of thumb: ε = 0.1s
Stream Windows Lossy Counting Sticky Sampling32
![Page 33: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/33.jpg)
Example
With s = 10%, ε = 1%,N = 1000
1 All elements exceeding frequency sN = 100 will be output.
2 No elements with frequencies below (s − ε)N = 90 are output.False positives between 90 and 100 might or might not beoutput.
3 All estimated frequencies diverge from their true frequencies byat most εN = 10 instances.
Rule of thumb: ε = 0.1s
Stream Windows Lossy Counting Sticky Sampling33
![Page 34: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/34.jpg)
Example
With s = 10%, ε = 1%,N = 1000
1 All elements exceeding frequency sN = 100 will be output.2 No elements with frequencies below (s − ε)N = 90 are output.
False positives between 90 and 100 might or might not beoutput.
3 All estimated frequencies diverge from their true frequencies byat most εN = 10 instances.
Rule of thumb: ε = 0.1s
Stream Windows Lossy Counting Sticky Sampling34
![Page 35: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/35.jpg)
Example
With s = 10%, ε = 1%,N = 1000
1 All elements exceeding frequency sN = 100 will be output.2 No elements with frequencies below (s − ε)N = 90 are output.
False positives between 90 and 100 might or might not beoutput.
3 All estimated frequencies diverge from their true frequencies byat most εN = 10 instances.
Rule of thumb: ε = 0.1s
Stream Windows Lossy Counting Sticky Sampling35
![Page 36: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/36.jpg)
Example
With s = 10%, ε = 1%,N = 1000
1 All elements exceeding frequency sN = 100 will be output.2 No elements with frequencies below (s − ε)N = 90 are output.
False positives between 90 and 100 might or might not beoutput.
3 All estimated frequencies diverge from their true frequencies byat most εN = 10 instances.
Rule of thumb: ε = 0.1s
Stream Windows Lossy Counting Sticky Sampling36
![Page 37: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/37.jpg)
Expected Errors
1 high frequency false positives2 small errors in frequency estimations
Acceptable for high numbers of N
Stream Windows Lossy Counting Sticky Sampling37
![Page 38: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/38.jpg)
Expected Errors
1 high frequency false positives2 small errors in frequency estimations
Acceptable for high numbers of N
Stream Windows Lossy Counting Sticky Sampling38
![Page 39: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/39.jpg)
Lossy Counting in Action
Incoming Stream of Colours
Stream Windows Lossy Counting Sticky Sampling39
![Page 40: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/40.jpg)
Divide into Windows/Buckets
w w w
Window Size w =⌈1ε
⌉=
⌈1
0.01
⌉= 100
Stream Windows Lossy Counting Sticky Sampling40
![Page 41: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/41.jpg)
First Window Comes In
Empty Counts Frequency Counts
First Window
Go through elements. If counter exists, increase by one, if notcreate one and initialise it to one.
Stream Windows Lossy Counting Sticky Sampling41
![Page 42: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/42.jpg)
Adjust Counts at Window Boundaries
Frequency Counts Frequency Counts
Reduce all counts by one. If counter is zero for a specificelement, drop it.
Stream Windows Lossy Counting Sticky Sampling42
![Page 43: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/43.jpg)
Next Window Comes In
Next Window
Frequency CountsFrequency Counts
Count elements and adjust counts afterwards.
Stream Windows Lossy Counting Sticky Sampling43
![Page 44: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/44.jpg)
Lossy Counting Summary
Split Stream into WindowsFor each window: Count elements, if no counter exists, createone.At window boundaries: Reduce all frequencies by one. Iffrequency goes to zero, drop counter.Process next window ...
Data structure to save counters:Hashtable<Color,Integer>
Stream Windows Lossy Counting Sticky Sampling44
![Page 45: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/45.jpg)
Lossy Counting Summary
Split Stream into WindowsFor each window: Count elements, if no counter exists, createone.At window boundaries: Reduce all frequencies by one. Iffrequency goes to zero, drop counter.Process next window ...
Data structure to save counters:
Hashtable<Color,Integer>
Stream Windows Lossy Counting Sticky Sampling45
![Page 46: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/46.jpg)
Lossy Counting Summary
Split Stream into WindowsFor each window: Count elements, if no counter exists, createone.At window boundaries: Reduce all frequencies by one. Iffrequency goes to zero, drop counter.Process next window ...
Data structure to save counters:Hashtable<Color,Integer>
Stream Windows Lossy Counting Sticky Sampling46
![Page 47: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/47.jpg)
Output
With s = 10%, ε = 1%,N = 200
Frequency Counts
outputthreshold
Output
24
22
19
27
False Positive
To reduce false positives to acceptable amount, onlyoutput counters with frequency f ≥ (s − ε)N = 18.
Stream Windows Lossy Counting Sticky Sampling47
![Page 48: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/48.jpg)
Accuracy Improvement
Reduction step of counters follows the approach of reducing allcounters by one. An improved version maintains exact
frequencies and remembers for each counter at which windowid it was created. At window boundaries, counters are onlyremoved when their frequency falls below a certain level in
relation to their window id.
(Color,Integer,WindowID)
See paper for details.
G. S. Manku, R. Motwani. Approximate Frequency Counts overData Streams, VLDB, 2002.
Stream Windows Lossy Counting Sticky Sampling48
![Page 49: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/49.jpg)
Window based Algorithm
Sticky Sampling
Stream Windows Lossy Counting Sticky Sampling49
![Page 50: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/50.jpg)
Problem Description
Counting algorithm using asampling approach.
Probabilistic sampling decides if a counter for adistinct element is created.
If a counter exists for a certain element, every futureinstance of this element will be counted.
Stream Windows Lossy Counting Sticky Sampling50
![Page 51: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/51.jpg)
Algorithm Parameters
Environment ParametersElements seen so far N
User-specified Parameterssupport threshold s ∈ (0, 1)
error parameter ε ∈ (0, 1)
probability of failure δ ∈ (0, 1)
The algorithm is probabilistic and fails if any of thethree guarantees is not satisfied.
Stream Windows Lossy Counting Sticky Sampling51
![Page 52: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/52.jpg)
Algorithm Guarantees
1 All items whose true frequency exceeds sN are output. Thereare no false negatives.
2 No items whose true frequency is less than (s − ε)N is output.3 Estimated frequencies are less than the true frequencies by at
most εN.
Guarantees and thereby expected errors are the sameas for Lossy Counting. Except for the small
probability that it might fail to provide correctanswers.
Stream Windows Lossy Counting Sticky Sampling52
![Page 53: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/53.jpg)
Sticky Sampling in Action
Incoming Stream of Colours
Stream Windows Lossy Counting Sticky Sampling53
![Page 54: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/54.jpg)
Divide into Windows/Buckets
w = t w = 2t w = 4t
window 1 window 2 window 3
Dynamic window size with t = 1ε log
(1sδ
)With s = 10%, ε = 1%, δ = 0.1%
t ≈ 921
Stream Windows Lossy Counting Sticky Sampling54
![Page 55: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/55.jpg)
A Window Comes in
window 1 window 2 window 3
w = tr = 1
w = 2tr = 2
w = 4tr = 4
Go through elements. If counter exists, increase it. If not,create a counter with probability 1
rand initialise it to one.
Sampling rate r grows in proportion to window size.
Stream Windows Lossy Counting Sticky Sampling55
![Page 56: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/56.jpg)
Adjust Counts at Window Boundaries
Frequency Counts Frequency Counts
Go through elements ofeach counter. Toss coin, ifunsuccessful removeelement, otherwise moveon to next counter. Ifcounter becomes zero,drop it.
Ensures uniformsampling
Stream Windows Lossy Counting Sticky Sampling56
![Page 57: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/57.jpg)
Sticky Sampling Summary
Split stream into windows, doubling window size of each newwindowFor each window: Go through elements if counter exists,increase it. If not, create one with probability 1
r with r growingat the same rate as window size.At window boundaries: Reduce all frequencies by tossing anunbiased coin for each counted element. Remove element ifcoin toss unsuccessful, otherwise move on to next counter. Iffrequency goes to zero, drop counter.Process next window ...
Stream Windows Lossy Counting Sticky Sampling57
![Page 58: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/58.jpg)
Output
Same principle as Lossy Counting
To reduce false positives to acceptable amount, onlyoutput counters with frequency f ≥ (s − ε)N .
Stream Windows Lossy Counting Sticky Sampling58
![Page 59: Data Stream Processing - University of Edinburgh](https://reader034.fdocuments.in/reader034/viewer/2022051403/627c98806bde764cf748e6e0/html5/thumbnails/59.jpg)
Lossy Counting vs. Sticky Sampling
Feature Lossy Counting Sticky SamplingResults deterministic probabilisticMemory grows with N static (independent of N)Theory performs worse performs betterPractice performs better performs worse
performance in terms of memory and accuracy
Stream Windows Lossy Counting Sticky Sampling59