Counting Distinct Objects over Sliding Windows

Presented by:

Muhammad Aamir Cheema

Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin

University of New South Wales, Australia

Introduction

Counting distinct objects: Given a dataset D, return the number of distinct objects in D.

Counting distinct objects against sliding windows: Given a data stream, return the number of distinct objects that arrive at or after timestamp t.

Applications: traffic management, call centers, wireless communication, stock markets, etc.

Introduction

Approximate counting:

Let n be the actual number of distinct objects and n' be the reported answer. Build a sketch such that every query is answered with the following guarantee:

$|n - n'| / n \le \epsilon$ with confidence $(1 - \delta)$

Contribution:

• FM-based algorithms
  • SE-FM (accuracy guarantee + space usage guarantee)
  • PCSA-based algorithm (no accuracy guarantee, although practical + more efficient)
• k-Skyband (accuracy guarantee + efficient + no space usage guarantee)

FM Algorithm

FM sketch:

• Let h(x) be a uniform hash function
• Let the "pivot" p(x) be the position of the leftmost 1-bit of h(x) (positions are counted from 0)
• Let FM be an array of size k initialized to zero
• For each record x in the dataset: FM[p(x)] = 1
• Let B = FMmin be the position of the leftmost 0-bit of FM
• Number of distinct elements = $\alpha \cdot 2^{B}$, where $\alpha = 1.2897385$

Each bit i of h(x) has probability 1/2 of being one.

Example (k = 4), stream: r1 r2 r1 r3 r1

h(r1) = 0 0 1 0
h(r2) = 1 1 0 1
h(r3) = 1 0 0 0

FM = 0 0 0 0 initially, FM = 1 0 1 0 after the stream

FMmin = 1

P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS 1985
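
The steps on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are made up here, and the hash function is passed in as a lookup table so that the slide's example can be replayed exactly.

```python
ALPHA = 1.2897385  # constant quoted on the slide


def pivot(bits):
    """Position (from 0) of the leftmost 1-bit, or None if all bits are 0."""
    for i, b in enumerate(bits):
        if b == 1:
            return i
    return None


def fm_sketch(records, k, h):
    """Build a k-entry FM sketch; h maps a record to a list of k bits."""
    fm = [0] * k
    for x in records:
        p = pivot(h(x))
        if p is not None and p < k:
            fm[p] = 1
    return fm


def fm_min(fm):
    """Position of the leftmost 0-bit (len(fm) if every bit is set)."""
    for i, b in enumerate(fm):
        if b == 0:
            return i
    return len(fm)


def fm_estimate(fm):
    return ALPHA * 2 ** fm_min(fm)


# Hash values copied from the slide's example (k = 4).
SLIDE_HASHES = {"r1": [0, 0, 1, 0], "r2": [1, 1, 0, 1], "r3": [1, 0, 0, 0]}
stream = ["r1", "r2", "r1", "r3", "r1"]
fm = fm_sketch(stream, 4, SLIDE_HASHES.__getitem__)
```

On the slide's stream this yields FM = 1 0 1 0 and FMmin = 1, so the single-sketch estimate is α · 2. Repeated arrivals of r1 do not change the sketch, which is why it counts distinct objects.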

FM Algorithm

[Figure: FM sketch for the stream r1 r2 r1 r3 r1; FM = 1 0 1 0, FMmin = 1]

Each bit i of h(x) has probability 1/2 of being one. An h(x) whose first i bits are zero and whose (i+1)-th bit is one has probability $1/2^{i+1}$.

Let n be the number of distinct elements:

• FM[0] is accessed approx. $n/2$ times
• FM[1] is accessed approx. $n/4$ times
• ...
• FM[i] is accessed approx. $n/2^{i+1}$ times

If $i \gg \log_2 n$, FM[i] will almost certainly be zero.

If $i \ll \log_2 n$, FM[i] will almost certainly be one.

If $i \approx \log_2 n$, FM[i] may be zero or one.

Hence, the first i for which FM[i] is zero may be used to approximate the number of distinct elements n.
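
The three regimes above can be checked numerically. Assuming the n distinct elements hash independently and each one sets FM[i] with probability $1/2^{i+1}$, entry i stays zero with probability $(1 - 1/2^{i+1})^{n}$. The helper name below is hypothetical.

```python
def prob_zero(i, n):
    """Probability that FM[i] is still 0 after n distinct elements,
    assuming each element independently sets it with probability 2^-(i+1)."""
    return (1.0 - 2.0 ** -(i + 1)) ** n


n = 1024  # log2(n) = 10
for i in (0, 5, 9, 10, 15, 30):
    print(i, prob_zero(i, n))
```

For n = 1024 the probability is essentially 0 for small i, close to 1 for i = 30, and strictly in between around i ≈ 10, which matches the argument on the slide.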

FM Algorithm

Use r hash functions to create r FM sketches:

• Initialize each FM sketch to zero
• For each record x in the dataset, for each hash function h_i(x): FM_i[pivot] = 1
• Let B_i be the position of the leftmost 0-bit of FM_i
• B = (B_1 + B_2 + ... + B_r) / r
• Number of distinct elements = $\alpha \cdot 2^{B}$, where $\alpha = 1.2897385$

Example (r = 3):

FM1 = 1 0 1 0, B1 = 1
FM2 = 1 1 0 0, B2 = 2
FM3 = 1 1 0 1, B3 = 2

B = (1 + 2 + 2)/3 = 1.67
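
As a quick check of the arithmetic in this example, the averaging step can be written out as follows. The function name is an assumption made for illustration; the three sketches are the ones shown on the slide.

```python
ALPHA = 1.2897385  # constant quoted on the slide


def averaged_estimate(sketches):
    """Average the leftmost-0 positions of several FM sketches and
    return (B, alpha * 2^B)."""
    def leftmost_zero(fm):
        for i, bit in enumerate(fm):
            if bit == 0:
                return i
        return len(fm)

    b = sum(leftmost_zero(fm) for fm in sketches) / len(sketches)
    return b, ALPHA * 2 ** b


b, estimate = averaged_estimate([[1, 0, 1, 0], [1, 1, 0, 0], [1, 1, 0, 1]])
```

Here B = 5/3 ≈ 1.67 and the estimate comes out at roughly 4.1 distinct elements.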

Performance guarantee: Let n be the actual number of distinct objects, n' be the reported answer and m be the size of the element domain. Then

$P(|n' - n| / n \le \epsilon) \ge 1 - \delta$

if $n > 1/\epsilon$,

$k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$

and $r = O((1/\epsilon^{2}) \log(1/\delta))$.

FM-based Algorithm

Maintaining one FM sketch:

• For each record (x, t) in the dataset: FM[pivot] = t

Answering a query:

• For any t, let B = FMmin(t) be the position of the leftmost entry of FM with value less than t
• Number of distinct elements that arrived at or after t = $\alpha \cdot 2^{B}$, where $\alpha = 1.2897385$

Example: stream r1 r2 r3 r2 r2 arriving at timestamps 1 2 3 4 5

[Figure: hash values h(r1), h(r2), h(r3) shown as bit strings]

FM after each arrival: 1 0 0 0, then 1 0 2 0, then 3 0 2 0, then 3 0 4 0, then 3 0 5 0

FMmin(4) = 0
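
A minimal sketch of the timestamped variant follows. The pivot positions used below (r1 and r3 at position 0, r2 at position 2) are an assumption inferred from the FM states in the figure; the function names are hypothetical.

```python
ALPHA = 1.2897385  # constant quoted on the slide


def update(fm, pivot, t):
    """Record that an element with this pivot arrived at time t."""
    fm[pivot] = t


def fm_min_at(fm, t):
    """Position of the leftmost entry whose timestamp is less than t."""
    for i, ts in enumerate(fm):
        if ts < t:
            return i
    return len(fm)


def estimate_since(fm, t):
    """Estimated number of distinct elements that arrived at or after t."""
    return ALPHA * 2 ** fm_min_at(fm, t)


# Assumed pivots, chosen to reproduce the FM states shown in the figure.
PIVOTS = {"r1": 0, "r2": 2, "r3": 0}
fm = [0, 0, 0, 0]
for t, x in enumerate(["r1", "r2", "r3", "r2", "r2"], start=1):
    update(fm, PIVOTS[x], t)
```

After the stream the sketch is 3 0 5 0 and FMmin(4) = 0, as on the slide. Timestamps are assumed to start at 1 so that 0 can mean "never set".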

FM-based Algorithm

Maintain r FM sketches:

• Initialize each FM sketch to zero
• For each record (x, t) in the dataset, for each hash function h_i(x): FM_i[pivot] = t

Answering a query:

• For any t, let B_i(t) be the position of the leftmost entry smaller than t in the i-th FM sketch
• Let B = (B_1(t) + B_2(t) + ... + B_r(t)) / r
• Number of distinct elements that arrived at or after t = $\alpha \cdot 2^{B}$, where $\alpha = 1.2897385$
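
Putting the r sketches together might look like the class below. It is a plain-array sketch of the idea, not the heap-based SE-FM structure. The class name, the use of seeded SHA-256 as the r hash functions, and treating the lowest set bit as the "leftmost" position are all assumptions made for illustration.

```python
import hashlib

ALPHA = 1.2897385  # constant quoted on the slide


def pivot(x, seed, k):
    """Index of the lowest set bit of a seeded hash of x (k if none below k)."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    value = int.from_bytes(digest, "big")
    if value == 0:
        return k
    return min((value & -value).bit_length() - 1, k)


class WindowedFM:
    """r timestamped FM sketches of k entries each (0 means 'never set')."""

    def __init__(self, r, k):
        self.r, self.k = r, k
        self.sketches = [[0] * k for _ in range(r)]

    def add(self, x, t):
        # Timestamps are assumed to be positive and non-decreasing.
        for seed, fm in enumerate(self.sketches):
            p = pivot(x, seed, self.k)
            if p < self.k:
                fm[p] = t

    def estimate(self, t):
        """Estimated number of distinct objects that arrived at or after t."""
        total = 0
        for fm in self.sketches:
            total += next((i for i, ts in enumerate(fm) if ts < t), self.k)
        return ALPHA * 2 ** (total / self.r)


w = WindowedFM(r=64, k=32)
for t in range(1, 1001):
    w.add(f"obj{t}", t)
```

Calling `w.estimate(t)` for different t answers any window from the same structure; a later t can never give a larger estimate than an earlier one, because each B_i(t) can only move left as t grows.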

Performance Analysis

Let n be the actual number of distinct objects arriving not before time t, n' be the reported answer and m be the size of the element domain. Then

$P(|n' - n| / n \le \epsilon) \ge 1 - \delta$

if $n > 1/\epsilon$,

$k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$

and $r = O((1/\epsilon^{2}) \log(1/\delta))$.

Total space: $O((1/\epsilon^{2}) \log(1/\delta) \log m)$

Total maintenance cost for one record: $O((1/\epsilon^{2}) \log(1/\delta) \log\log m)$

Total query cost: $O((1/\epsilon^{2}) \log(1/\delta) \log\log m)$

PCSA-based Algorithm

• Maintain r FM sketches but update only j < r sketches per record
• Generate j hash functions H(x) that map x to [1, r]
• Initialize each FM sketch to zero
• For each record (x, t) in the dataset, for each of the j hash functions H(): i = H(x); update the i-th FM sketch

Answering a query:

• For any t, let B_i(t) be the position of the leftmost entry smaller than t in the i-th FM sketch
• Let B = (B_1(t) + B_2(t) + ... + B_r(t)) / r
• Number of distinct elements that arrived at or after t = $(\alpha \cdot 2^{B}) / j$, where $\alpha = 1.2897385$

Inspired by the PCSA technique in "P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS 1985".

NOTE: No accuracy guarantee, but performs well in practice.

BJKST Algorithm

Main idea:

• Let h() be a hash function that hashes D to $[1, m^{3}]$, where m = |D|
• For each record x, generate its hash value h(x)
• Maintain the k-th smallest distinct hash value k_min

Number of distinct elements: $n = k m^{3} / k_{\min}$

Improved algorithm:

• Use r hash functions
• Compute n_i for each hash function h_i() as above
• Report the final answer as the median of the n_i values

Performance guarantee:

$P(|n' - n| / n \le \epsilon) \ge 1 - \delta$

if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^{2})$ and $r = O(\log(1/\delta))$.

Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM'02.
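
The estimator above can be sketched as follows. This is a batch version for readability (it sorts all distinct hash values instead of maintaining only the k smallest while streaming), and the seeded SHA-256 hash and the function names are assumptions made for illustration.

```python
import hashlib
import statistics


def hash_to_range(x, seed, upper):
    """Map x to an integer in [1, upper] using a seeded SHA-256 hash."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return 1 + int.from_bytes(digest, "big") % upper


def kmin_estimate(records, k, m, seed=0):
    """n = k * m^3 / k_min, where k_min is the k-th smallest distinct hash."""
    upper = m ** 3
    smallest = sorted({hash_to_range(x, seed, upper) for x in records})[:k]
    if len(smallest) < k:
        # Fewer than k distinct hash values: the count is known exactly.
        return len(smallest)
    return k * upper / smallest[-1]


def median_estimate(records, k, m, r):
    """Improved algorithm: median over r independent hash functions."""
    return statistics.median(
        kmin_estimate(records, k, m, seed) for seed in range(r)
    )


data = [f"obj{i}" for i in range(5000)]
estimate = median_estimate(data, k=256, m=10_000, r=5)
```

With k = 256 the estimate for 5000 distinct objects should land within a few percent of the true count, and feeding the same records twice leaves it unchanged.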

K-Skyband Technique

Main idea:

• Let h() be a hash function that hashes D to $[1, m^{3}]$, where m = |D|
• For each record (x, t'), generate h(x) and store the record (x, h(x), t')

Answering a query q(t):

• Retrieve all records (x, h(x), t') with timestamp t' ≥ t
• Get the k-th smallest distinct hash value and apply the BJKST algorithm

Limitation: requires storing all records

K-Skyband Technique

For any time t, we need to find the k-th smallest hash value among the records arriving no earlier than t.

A record x dominates another record y if x arrives after y and has a smaller hash value.

The k-skyband keeps only the records that are dominated by at most (k-1) records.

Maintaining the k-skyband:

• Keep a counter for each record
• When a new element (x, t) arrives, increment the counter of all records dominated by it
• Remove the records whose counter is at least k

We increment the counters of groups to improve efficiency (domination aggregation search tree).

[Figure: records a to e plotted by arrival time t and hash value h(x); k = 2]
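
The counter-based maintenance can be sketched with a plain list. This naive version touches every stored record on each arrival and does not implement the domination aggregation search tree. The record layout, the function name, and the rule that a re-arriving object replaces its older copy are assumptions made for illustration.

```python
def skyband_insert(skyband, x, hx, t, k):
    """Insert record (x, hx, t) and return the updated k-skyband.

    Each stored record is a list [object, hash, time, dominated_count].
    """
    kept = []
    for rec in skyband:
        if rec[0] == x:
            continue  # assumption: a re-arrival replaces the older copy
        if hx < rec[1]:
            rec[3] += 1  # the new record is later and has a smaller hash
        if rec[3] < k:
            kept.append(rec)
    kept.append([x, hx, t, 0])
    return kept


# Hypothetical stream of (object, hash value, timestamp) with k = 2.
skyband = []
for x, hx, t in [("a", 5, 1), ("b", 3, 2), ("c", 1, 3), ("d", 4, 4), ("e", 2, 5)]:
    skyband = skyband_insert(skyband, x, hx, t, k=2)
```

In this made-up stream, a is dropped when c arrives and b is dropped when e arrives, leaving c, d and e in the 2-skyband.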

K-Skyband Technique

Answering a query:

Find k_min (the k-th smallest hash value among the elements arriving no earlier than t)

• Let z be the number of stored elements that arrived before t
• k_min is the (z+k)-th smallest hash value overall

Algorithm:

• Maintain a binary search tree eT that stores the elements according to t
• Maintain a binary search tree eH that stores the elements according to h(x)

When a query q(t) arrives:

• Compute z by using eT
• Find the (z+k)-th smallest hash value overall from eH

[Figure: records a to f plotted by arrival time t and hash value h(x); k = 2, z = 3, k_min = 5th smallest h(x)]
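
The query step can be illustrated with linear scans standing in for the two search trees eT and eH. The function name and the sample records are hypothetical; the input is assumed to be a valid k-skyband whose records carry the hash at index 1 and the timestamp at index 2.

```python
def k_min(skyband, t, k):
    """k-th smallest hash among records arriving at or after t, computed as
    the (z + k)-th smallest hash overall, or None if fewer than k exist."""
    # z: stored records that arrived before t (the role of eT).
    z = sum(1 for rec in skyband if rec[2] < t)
    # Hash values in ascending order (the role of eH).
    hashes = sorted(rec[1] for rec in skyband)
    if z + k > len(hashes):
        return None
    return hashes[z + k - 1]


# A hypothetical 2-skyband of (object, hash value, timestamp) records.
skyband = [("c", 1, 3), ("d", 4, 4), ("e", 2, 5)]
```

For t = 4 one stored record precedes the window (z = 1), so the 3rd smallest hash overall, 4, is the 2nd smallest hash inside the window. The BJKST formula $k m^{3} / k_{\min}$ then turns k_min into an estimate.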

Performance Analysis

Let n be the actual number of distinct objects arriving not before time t, n' be the reported answer and m be the size of the element domain. Then

$P(|n' - n| / n \le \epsilon) \ge 1 - \delta$

if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^{2})$ and $r = O(\log(1/\delta))$.

Expected total space: $O((1/\epsilon^{2}) \log(1/\delta) \log n)$

Expected time complexity: $O(\log(1/\delta) (\log(1/\epsilon) + \log n))$

Experiments

• Synthetic datasets following Uniform and Zipf distributions
• Real dataset: WorldCup 98 HTTP requests (20 M records)

Space Efficiency

Time Efficiency

Maintenance cost

Time Efficiency

Query response time

Accuracy

Thanks

P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, 2001.

Space usage: $(1/\epsilon^{2}) \log(1/\delta)\, m^{1/2}$

Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias. Spatio-temporal aggregation using sketches. In ICDE 2004.

Space usage: $O((N/\epsilon^{2}) \log(1/\delta) \log m)$

Space Requirement (SE-FM)

To guarantee the performance we require the following:

$k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$

$r = O((1/\epsilon^{2}) \log(1/\delta))$

Let $m > 1/\epsilon$ and $m > 1/\delta$; then $k = O(\log m)$.

Size of one sketch: $k = O(\log m)$

Size of r sketches: $O(r \log m) = O((1/\epsilon^{2}) \log(1/\delta) \log m)$

Total space: $O((1/\epsilon^{2}) \log(1/\delta) \log m)$

Time Complexity (SE-FM)

To guarantee the performance we require the following:

$k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$

$r = O((1/\epsilon^{2}) \log(1/\delta))$

The elements in a sketch are stored in a min-heap to support logarithmic search/update. Hence, the cost of one search/update operation is $O(\log k) = O(\log\log m)$.

To maintain the sketches, we update r sketches for each record x.

Total maintenance cost for one record: $O(r \log\log m) = O((1/\epsilon^{2}) \log(1/\delta) \log\log m)$

To answer a query, we search in r sketches.

Total query cost: $O(r \log\log m) = O((1/\epsilon^{2}) \log(1/\delta) \log\log m)$

Space Usage (K-Skyband)

Performance guarantee:

$P(|n' - n| / n \le \epsilon) \ge 1 - \delta$

if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^{2})$ and $r = O(\log(1/\delta))$.

Expected size of one k-skyband: $O(k \ln(n/k))$

Expected size of r k-skybands: $O(r k \log(n/k)) = O((1/\epsilon^{2}) \log(1/\delta) \log n)$

Time Complexity (K-Skyband)

Performance guarantee:

$P(|n' - n| / n \le \epsilon) \ge 1 - \delta$

if $m > 1/\delta$, $n > k$, $k = O(1/\epsilon^{2})$ and $r = O(\log(1/\delta))$.

Answering a query q(t):

Search eT to compute z: $O(\log(k \log n)) = O(\log k + \log n)$

Search eH to find the (z+k)-th element: $O(\log k + \log n)$

We require this for all r sketches: $O(r (\log k + \log n)) = O(\log(1/\delta) (\log(1/\epsilon) + \log n))$
