Counting Distinct Objects over Sliding Windows
Presented by:
Muhammad Aamir Cheema
Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin
University of New South Wales, Australia
Introduction
Counting distinct objects: Given a dataset D, return the number of distinct objects in D.
Counting distinct objects over sliding windows: Given a data stream, return the number of distinct objects that arrive at or after timestamp t.
Applications: traffic management, call centers, wireless communication, stock markets, etc.
Introduction
Approximate counting:
Let n be the actual number of distinct objects and n’ be the reported answer. Build a sketch such that every query is answered with the following guarantee:
|n - n’|/n ≤ ε with confidence (1 - δ)
Contribution:
FM-based algorithms
• SE-FM (accuracy guarantee + space usage guarantee)
• PCSA-based algorithm (no accuracy guarantee, although practical + more efficient)
k-Skyband
• (accuracy guarantee + efficient + no space usage guarantee)
FM Algorithm
FM sketch:
• Let h(x) be a uniform hash function
• Let the "pivot" p(x) be the position of the leftmost 1-bit of h(x)
• Let FM be an array of size k initialized to zero
• For each record x in the dataset: FM[p(x)] = 1
• Let B = FMmin be the position of the leftmost 0-bit of FM
• Number of distinct elements = α * 2^B, where α = 1.2897385
Each bit i of h(x) has probability 1/2 of being one.
Example (k = 4), stream: r1 r2 r1 r3 r1
h(r1) = 0 0 1 0
h(r2) = 1 1 0 1
h(r3) = 1 0 0 0
FM = 1 0 1 0 (initially 0 0 0 0)
FMmin = 1
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS 1985.
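The single-sketch procedure on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the hash values are hard-coded from the slide's example instead of being computed by a real hash function, positions are counted from 0, and the helper names (`pivot`, `build_fm`, `fm_min`, `estimate`) are made up for the example.

```python
ALPHA = 1.2897385  # correction constant as given on the slide

def pivot(bits):
    """0-based position of the leftmost 1-bit, or None if every bit is 0."""
    for pos, bit in enumerate(bits):
        if bit == 1:
            return pos
    return None

def build_fm(stream, hash_bits, k):
    """Build one FM sketch of size k; hash_bits maps each record to its bit list."""
    fm = [0] * k
    for x in stream:
        p = pivot(hash_bits[x])
        if p is not None:
            fm[p] = 1
    return fm

def fm_min(fm):
    """0-based position of the leftmost 0-bit (len(fm) if there is none)."""
    for pos, bit in enumerate(fm):
        if bit == 0:
            return pos
    return len(fm)

def estimate(fm):
    """Estimated number of distinct elements: ALPHA * 2^B."""
    return ALPHA * 2 ** fm_min(fm)

# Hash values and stream taken from the slide's example (k = 4).
HASH_BITS = {"r1": [0, 0, 1, 0], "r2": [1, 1, 0, 1], "r3": [1, 0, 0, 0]}
STREAM = ["r1", "r2", "r1", "r3", "r1"]

fm = build_fm(STREAM, HASH_BITS, k=4)
print(fm, fm_min(fm), estimate(fm))
```

With only three distinct records and a single sketch the estimate is coarse; the slides address this by averaging over r sketches.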
FM Algorithm
[Figure: example stream r1 r2 r1 r3 r1 with FM = 1 0 1 0 and FMmin = 1]
• Each bit i of h(x) has probability 1/2 of being one
• An h(x) whose first i bits are zero and whose (i+1)-th bit is one has probability 1/2^{i+1}
• Let n be the number of distinct elements
  FM[0] is accessed approx. n/2 times
  FM[1] is accessed approx. n/4 times
  …
  FM[i] is accessed approx. n/2^{i+1} times
• If i >> log_2 n, FM[i] will almost certainly be zero
• If i << log_2 n, FM[i] will almost certainly be one
• If i ≈ log_2 n, FM[i] may be zero or one
Hence, the first i for which FM[i] is zero may be used to approximate the number of distinct elements n.
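The access counts above can be checked with a small simulation. This is only a sketch under an assumption: each of the n distinct elements is given an independent random k-bit hash value, with Python's `random` module standing in for the uniform hash function. The function name `pivot_counts` is made up for the example.

```python
import random

def pivot_counts(n, k, seed=0):
    """Draw n random k-bit hash values and count how often each position is the pivot."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = [0] * k
    for _ in range(n):
        for pos in range(k):
            if rng.random() < 0.5:  # the bit at `pos` is 1 with probability 1/2
                counts[pos] += 1
                break               # the leftmost 1-bit is the pivot
    return counts

n = 100_000
for i, observed in enumerate(pivot_counts(n, k=8)):
    # observed hits of FM[i] next to the predicted n / 2^(i+1)
    print(i, observed, round(n / 2 ** (i + 1)))
```

Each printed row pairs the simulated count for FM[i] with the predicted n/2^{i+1}.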
FM Algorithm
• Use r hash functions to create r FM sketches
• Initialize each FM to zero
• For each record x in the dataset
    For each hash function h_i(x)
      FM_i[pivot] = 1
• Let B_i be the position of the leftmost 0-bit of FM_i
• B = (B_1 + B_2 + … + B_r) / r
• Number of distinct elements = α * 2^B, where α = 1.2897385
Example (r = 3):
FM_1 = 1 0 1 0, B_1 = 1
FM_2 = 1 1 0 0, B_2 = 2
FM_3 = 1 1 0 1, B_3 = 2
B = (1 + 2 + 2)/3 = 1.67
Performance Guarantee: Let n be the actual number of distinct objects, n’ be the reported answer and m be the size of the element domain. Then
P( |n’ - n|/n ≤ ε ) ≥ 1 - δ
if n > 1/ε
and k = O(log m + log 1/ε + log 1/δ)
and r = O(1/ε^2 log 1/δ)
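The averaging step can be reproduced with the three example sketches from this slide (1 0 1 0, 1 1 0 0, 1 1 0 1). The snippet below is a minimal sketch; the helper names are illustrative, and positions are counted from 0.

```python
ALPHA = 1.2897385  # correction constant as given on the slide

def leftmost_zero(fm):
    """0-based position of the leftmost 0-bit (len(fm) if there is none)."""
    for pos, bit in enumerate(fm):
        if bit == 0:
            return pos
    return len(fm)

def estimate_from_sketches(sketches):
    """Average the B_i values over all sketches and return (B, ALPHA * 2^B)."""
    b = sum(leftmost_zero(fm) for fm in sketches) / len(sketches)
    return b, ALPHA * 2 ** b

# The three sketches from the slide's example.
SKETCHES = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 1, 0, 1]]

b, n_est = estimate_from_sketches(SKETCHES)
print(round(b, 2), round(n_est, 2))
```

Averaging the exponents B_i, not the individual estimates, is what the slide prescribes; the final estimate is α * 2^B for the averaged B.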
FM-based Algorithm
Maintaining one FM sketch:
• For each record (x,t) in the dataset: FM[pivot] = t
Answering a query:
• For any t, let B = FMmin(t) be the position of the leftmost entry of FM with value less than t
• Number of distinct elements that arrived at or after t = α * 2^B, where α = 1.2897385
[Figure: example stream r1 r2 r3 r2 r2 arriving at timestamps 1 2 3 4 5; each FM entry holds the latest timestamp at which it was set; FMmin(4) = 0]
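A timestamped sketch of this kind might look as follows in Python. This is a minimal sketch with assumptions: timestamps start at 1 so that 0 can mean "never set", and the pivot of each record is given directly in a hypothetical table `PIVOT`, chosen so that the run ends with FMmin(4) = 0 as on the slide. The class name `SlidingFM` is made up for the example.

```python
ALPHA = 1.2897385  # correction constant as given on the slide

class SlidingFM:
    """One FM sketch whose entries store the latest timestamp instead of a bit."""

    def __init__(self, k):
        self.slots = [0] * k  # 0 means "never set"; timestamps are assumed to start at 1

    def update(self, pivot, t):
        """Record that an element with this pivot arrived at time t."""
        self.slots[pivot] = t

    def fm_min(self, t):
        """0-based position of the leftmost entry with a timestamp smaller than t."""
        for pos, stamp in enumerate(self.slots):
            if stamp < t:
                return pos
        return len(self.slots)

    def estimate(self, t):
        """Estimated number of distinct elements that arrived at or after t."""
        return ALPHA * 2 ** self.fm_min(t)

# Hypothetical pivots (not taken from the slide's hash values).
PIVOT = {"r1": 0, "r2": 2, "r3": 0}
STREAM = [("r1", 1), ("r2", 2), ("r3", 3), ("r2", 4), ("r2", 5)]

sketch = SlidingFM(k=4)
for x, t in STREAM:
    sketch.update(PIVOT[x], t)

print(sketch.slots, sketch.fm_min(4), sketch.estimate(4))
```

Because every entry keeps only its latest timestamp, one sketch can answer a query for any start time t without being rebuilt.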
FM-based Algorithm
Maintaining r FM sketches:
• Initialize each FM to zero
• For each record (x,t) in the dataset
    For each hash function h_i(x)
      FM_i[pivot] = t
Answering a query:
• For any t, let B_i(t) be the position of the leftmost entry smaller than t in the i-th FM
• Let B = (B_1(t) + B_2(t) + … + B_r(t)) / r
• Number of distinct elements that arrived at or after t = α * 2^B, where α = 1.2897385
Performance Analysis
Let n be the actual number of distinct objects arriving not before time t, n’ be the reported answer and m be the size of the element domain. Then
P( |n’ - n|/n ≤ ε ) ≥ 1 - δ
if n > 1/ε
and k = O(log m + log 1/ε + log 1/δ)
and r = O(1/ε^2 log 1/δ)
Total space: O(1/ε^2 log 1/δ log m)
Total maintenance cost for one record: O(1/ε^2 log 1/δ log log m)
Total query cost: O(1/ε^2 log 1/δ log log m)
PCSA-based Algorithm
• Maintain r FM sketches but update only j < r sketches per record
• Generate j hash functions H(x) that map x to [1,r]
• Initialize each FM to zero
• For each record (x,t) in the dataset
    For each of the j hash functions H()
      i = H(x)
      Update the i-th FM sketch
Answering a query:
• For any t, let B_i(t) be the position of the leftmost entry smaller than t in the i-th FM
• Let B = (B_1(t) + B_2(t) + … + B_r(t)) / r
• Number of distinct elements that arrived at or after t = (α * 2^B) / j, where α = 1.2897385
Inspired by the PCSA technique in "P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS 1985".
NOTE: No accuracy guarantee, but performs well in practice.
BJKST Algorithm
Main Idea
• Let h() be a hash function that hashes D to [1, m^3], where m = |D|
• For each record x, generate its hash value h(x)
• Maintain the k-th smallest distinct hash value k_min
Number of distinct elements = n = k m^3 / k_min
Improved algorithm
• Use r hash functions
• Compute n_i for each hash function h_i() as above
• Report the final answer as the median of the n_i values
Performance guarantee:
P( |n’ - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ
and n > k
and k = O(1/ε^2)
and r = O(log 1/δ)
Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM'02.
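The k-th-smallest-hash estimator and its median-of-r refinement can be sketched as below. This is an illustration under assumptions, not the algorithm as published: the hash family is a hypothetical one built from SHA-256 with a seed, and when fewer than k distinct hash values have been seen the function simply returns their count. All function names are made up for the example.

```python
import hashlib
import heapq
import statistics

def make_hash(seed, hash_range):
    """Hypothetical hash family: SHA-256 of (seed, x) reduced to [1, hash_range]."""
    def h(x):
        digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % hash_range + 1
    return h

def bjkst_estimate(stream, h, k, hash_range):
    """Estimate the distinct count as k * hash_range / (k-th smallest distinct hash)."""
    heap = []        # negated values: a max-heap over the k smallest distinct hashes
    members = set()  # the same values, kept for duplicate detection
    for x in stream:
        v = h(x)
        if v in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            members.add(v)
        elif v < -heap[0]:
            # v replaces the current k-th smallest value
            evicted = -heapq.heappushpop(heap, -v)
            members.discard(evicted)
            members.add(v)
    if len(heap) < k:
        return float(len(heap))  # assumption: fewer than k distinct values, report the count
    k_min = -heap[0]
    return k * hash_range / k_min

def median_estimate(stream, k, r, hash_range):
    """Run the estimator with r hash functions and report the median."""
    stream = list(stream)
    return statistics.median(
        bjkst_estimate(stream, make_hash(seed, hash_range), k, hash_range)
        for seed in range(r)
    )

m = 5000                                  # domain size; hash range is m^3 as on the slide
stream = [i % m for i in range(20000)]    # 5000 distinct values, each seen four times
print(median_estimate(stream, k=400, r=5, hash_range=m ** 3))
```

Only the k smallest distinct hash values are kept per hash function, so the memory used does not grow with the length of the stream.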
K-Skyband Technique
Main Idea
• Let h() be a hash function that hashes D to [1, m^3], where m = |D|
• For each record (x,t’), generate h(x) and store the record (x, h(x), t’)
Answering a query q(t):
• Retrieve all records (x, h(x), t’) with timestamp t’ ≥ t
• Get the k-th smallest distinct hash value and apply the BJKST algorithm
Limitation: Requires storing all records
K-Skyband Technique
For any time t, we need to find the k-th smallest hash value among records arriving no earlier than t
A record x dominates another record y if x arrives after y and has a smaller hash value
The k-skyband keeps only the records that are dominated by at most (k-1) records
Maintaining the k-skyband:
• Keep a counter for each record
• When a new element (x,t) arrives, increment the counter of all records dominated by it
• Remove the records whose counter is at least k
We increment the counters of groups to improve efficiency (domination aggregation search tree)
[Figure: records a, b, c, d, e plotted by hash value h(x) against arrival time t; k = 2]
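The counter-based maintenance can be sketched naively in Python, scanning every stored record on each arrival instead of using the aggregation tree mentioned on the slide. Two points are assumptions of this sketch and not statements from the slides: records arrive in increasing timestamp order, and a re-arrival of the same hash value replaces the older copy without counting again against records that the older copy already dominated. The class name `KSkyband` is made up for the example.

```python
class KSkyband:
    """Naive k-skyband over (hash value, timestamp) records."""

    def __init__(self, k):
        self.k = k
        self.records = {}  # hash value -> [timestamp, dominance counter]

    def insert(self, hash_value, t):
        """Add a record; assumes t is larger than every timestamp seen so far."""
        old = self.records.pop(hash_value, None)
        # Assumption: records older than a previous copy were already counted by it.
        old_t = old[0] if old is not None else float("-inf")
        for h in list(self.records):
            entry = self.records[h]
            if h > hash_value and entry[0] > old_t:
                entry[1] += 1              # the new record dominates this one
                if entry[1] >= self.k:
                    del self.records[h]    # dominated by k records: drop it
        self.records[hash_value] = [t, 0]

    def hashes(self):
        """Hash values currently kept, in increasing order."""
        return sorted(self.records)

sky = KSkyband(k=2)
for h, t in [(50, 1), (30, 2), (40, 3), (10, 4)]:
    sky.insert(h, t)
print(sky.hashes())   # 50 was dominated by 30 and 40 and has been removed
sky.insert(20, 5)
print(sky.hashes())   # 30 and 40 are now dominated by both 10 and 20
```

A record that has been dropped can never be the k-th smallest hash value for any later query, because at least k newer records have smaller hash values.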
K-Skyband Technique
Answering a query:
Find k_min (the k-th smallest hash value among elements arriving no earlier than t)
• Let z be the number of stored elements that arrived before t
• k_min is the (z+k)-th overall smallest hash value
Algorithm:
• Maintain a binary search tree eT that stores elements according to t
• Maintain a binary search tree eH that stores elements according to h(x)
When a query q(t) arrives
• Compute z by using eT
• Find the (z+k)-th overall smallest hash value from eH
[Figure: records a, b, c, d, e, f plotted by h(x) against t; k = 2, z = 3, k_min = 5th smallest h(x)]
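The query step can be illustrated with sorted lists and binary search standing in for the two search trees eT and eH. This is a sketch under assumptions: the input must already be a valid k-skyband, and the six records below are hypothetical values chosen so that k = 2 and z = 3, as in the figure. The function name `k_min` mirrors the slide's symbol.

```python
from bisect import bisect_left

def k_min(skyband, t, k):
    """k-th smallest hash value among records with timestamp >= t.

    skyband: list of (timestamp, hash_value) pairs forming a valid k-skyband.
    Returns None when fewer than k records arrived at or after t.
    """
    by_time = sorted(ts for ts, _ in skyband)  # stands in for eT
    by_hash = sorted(h for _, h in skyband)    # stands in for eH
    z = bisect_left(by_time, t)                # records that arrived before t
    rank = z + k
    if rank > len(by_hash):
        return None
    return by_hash[rank - 1]

# Hypothetical 2-skyband: every record has at most one later record with a smaller hash.
SKYBAND = [(1, 10), (2, 25), (3, 20), (4, 35), (5, 30), (6, 40)]

print(k_min(SKYBAND, t=4, k=2))  # z = 3, so this is the 5th smallest hash value overall
```

The shortcut works because a stored record older than t has fewer than k newer records with smaller hash values, so its own hash value must lie below k_min. The sketch re-sorts on every call for brevity; with balanced trees kept up to date, each lookup is logarithmic.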
Performance Analysis
Let n be the actual number of distinct objects arriving not before time t, n’ be the reported answer and m be the size of the element domain. Then
P( |n’ - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ
and n > k
and k = O(1/ε^2)
and r = O(log 1/δ)
Expected total space: O(1/ε^2 log 1/δ log n)
Expected time complexity: O(log 1/δ (log 1/ε + log n))
Experiments
• Synthetic datasets following Uniform and Zipf distributions
• Real dataset: WorldCup 98 HTTP requests (20 M records)
[Figure: Space Efficiency (two slides)]
[Figure: Time Efficiency, maintenance cost]
[Figure: Time Efficiency, query response time]
[Figure: Accuracy]
Thanks
P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, 2001.
Space usage: 1/ε^2 log 1/δ m^{1/2}
Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias. Spatio-temporal aggregation using sketches. In ICDE, 2004.
Space usage: O(N/ε^2 log 1/δ log m)
Space Requirement (SE-FM)
To guarantee the performance we require the following:
k = O(log m + log 1/ε + log 1/δ)
r = O(1/ε^2 log 1/δ)
Let m > 1/ε and m > 1/δ; then k = O(log m)
Size of one sketch is k = O(log m)
Size of r sketches is O(r log m) = O(1/ε^2 log 1/δ log m)
Total space: O(1/ε^2 log 1/δ log m)
Time Complexity (SE-FM)
To guarantee the performance we require the following:
k = O(log m + log 1/ε + log 1/δ)
r = O(1/ε^2 log 1/δ)
The elements in a sketch are stored in a min-heap to support logarithmic search/update.
Hence, the cost of one search/update operation is O(log k) = O(log log m)
To maintain the sketches, we update r sketches for each record x
Total maintenance cost for one record: O(r log log m) = O(1/ε^2 log 1/δ log log m)
To answer a query, we search in r sketches
Total query cost: O(r log log m) = O(1/ε^2 log 1/δ log log m)
Space Usage (K-Skyband)
Performance guarantee:
P( |n’ - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ
and n > k
and k = O(1/ε^2)
and r = O(log 1/δ)
Expected size of one k-skyband = O(k ln (n/k))
Expected size of r k-skybands = O(r k log (n/k)) = O(1/ε^2 log 1/δ log n)
Time Complexity (K-Skyband)
Performance guarantee:
P( |n’ - n|/n ≤ ε ) ≥ 1 - δ
if m > 1/δ
and n > k
and k = O(1/ε^2)
and r = O(log 1/δ)
Answering a query q(t):
Search eT to compute z: O(log (k log n)) = O(log k + log n)
Search eH to find the (z+k)-th element: O(log k + log n)
We require this for all r sketches: O(r (log k + log n)) = O(log 1/δ (log 1/ε + log n))