Approximate Frequency Counts over Data Streams. Loo Kin Kong, 4th Oct., 2002.
Approximate Frequency Counts over Data Streams
Loo Kin Kong
4th Oct., 2002
Plan
- Motivation
- Paper review: Approximate Frequency Counts over Data Streams
- Finding frequent items
- Finding frequent itemsets
- Performance
- Conclusion
Motivation
- In some new applications, data come as a continuous "stream"
- The sheer volume of a stream over its lifetime is huge
- Queries require timely answers
- Examples: stock ticks, network traffic measurements
Frequent itemset mining on offline databases vs data streams
- Often, level-wise algorithms are used to mine offline databases
  - E.g., the Apriori algorithm and its variants
  - At least 2 database scans are needed
- Level-wise algorithms cannot be applied to mine data streams
  - Cannot go through the data stream multiple times
Paper review: Approximate Frequency Counts over Data Streams
- By G. S. Manku and R. Motwani
- Published in VLDB 2002
- Main contributions of the paper:
  - Proposed 2 algorithms to find frequent items appearing in a data stream of items
  - Extended the algorithms to find frequent itemsets
Notations
- Let N denote the current length of the stream
- Let s ∈ (0,1) denote the support threshold
- Let ε ∈ (0,1) denote the error tolerance
Goals of the paper
- The algorithm ensures that:
  - All itemsets whose true frequency exceeds sN are reported
  - No itemset whose true frequency is less than (s - ε)N is output
  - Estimated frequencies are less than the true frequencies by at most εN
The simple case: finding frequent items
- Each transaction in the stream contains only 1 item
- 2 algorithms were proposed, namely:
  - Sticky Sampling Algorithm
  - Lossy Counting Algorithm
- Features of the algorithms:
  - Sampling techniques are used
  - Frequency counts found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level
  - For Lossy Counting, all frequent items are reported
Sticky Sampling Algorithm
- User input includes 3 values, namely:
  - Support threshold s
  - Error tolerance ε
  - Probability of failure δ
- Counts are kept in a data structure S
- Each entry in S is of the form (e, f), where:
  - e is the item
  - f is the frequency of e in the stream since the entry was inserted in S
- When queried about the frequent items, all entries (e, f) such that f ≥ (s - ε)N are output
Sticky Sampling Algorithm (cont'd)
1. S ← ∅; N ← 0; t ← (1/ε) log(1/(sδ)); r ← 1
2. e ← next transaction; N ← N + 1
3. if (e, f) exists in S do
4.   increment the count f
5. else if random(0,1) < 1/r do
6.   insert (e, 1) to S
7. endif
8. if N = 2t·2^n for some integer n do
9.   r ← 2r
10.  halfSampRate(S)
11. endif
12. Goto 2

S: the set of all counts
e: transaction (item)
N: current length of stream
r: sampling rate (new items are selected with probability 1/r)
t: (1/ε) log(1/(sδ))
Sticky Sampling Algorithm: halfSampRate()
1. function halfSampRate(S)
2. for every entry (e, f) in S do
3.   while random(0,1) < 0.5 and f > 0 do
4.     f ← f - 1
5.   if f = 0 do
6.     remove the entry from S
7.   endif
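The two pseudocode slides above can be combined into a short runnable sketch. This is a minimal illustration, not the paper's implementation; the function name, the `seed` parameter, and the `next_doubling` bookkeeping are conveniences introduced here.

```python
import math
import random

def sticky_sampling(stream, s, eps, delta, seed=0):
    """Sketch of Sticky Sampling for single items.

    s: support threshold, eps: error tolerance, delta: failure probability.
    Returns items whose estimated count f >= (s - eps) * N.
    """
    rng = random.Random(seed)
    S = {}                                   # item -> estimated count f
    t = (1 / eps) * math.log(1 / (s * delta))
    r = 1                                    # sampling rate: select with probability 1/r
    N = 0
    next_doubling = 2 * t                    # r doubles when N reaches 2t, 4t, 8t, ...
    for e in stream:
        N += 1
        if e in S:
            S[e] += 1                        # already tracked: always count
        elif rng.random() < 1 / r:           # new item: sample with probability 1/r
            S[e] = 1
        if N >= next_doubling:
            r *= 2
            next_doubling *= 2
            # halfSampRate: per entry, decrement f once per coin-toss tail until heads
            for item in list(S):
                f = S[item]
                while f > 0 and rng.random() < 0.5:
                    f -= 1
                if f == 0:
                    del S[item]
                else:
                    S[item] = f
    return {e: f for e, f in S.items() if f >= (s - eps) * N}
```

For example, on a stream of 10,000 transactions where one item makes up 30% of the stream, querying with s = 0.1 and ε = 0.01 reports that item, since its estimate stays well above (s - ε)N = 900.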
Lossy Counting Algorithm
- Incoming data stream is conceptually divided into buckets of ⌈1/ε⌉ transactions
- Counts are kept in a data structure D
- Each entry in D is of the form (e, f, Δ), where:
  - e is the item
  - f is the frequency of e in the stream since the entry was inserted in D
  - Δ is the maximum count of e in the stream before e was added to D
Lossy Counting Algorithm (cont'd)
1. D ← ∅; N ← 0
2. w ← ⌈1/ε⌉; b ← 1
3. e ← next transaction; N ← N + 1
4. if (e, f, Δ) exists in D do
5.   f ← f + 1
6. else do
7.   insert (e, 1, b - 1) to D
8. endif
9. if N mod w = 0 do
10.  prune(D, b); b ← b + 1
11. endif
12. Goto 3

D: the set of all counts
N: current length of stream
e: transaction (item)
w: bucket width
b: current bucket id
Lossy Counting Algorithm – prune()
1. function prune(D, b)
2. for each entry (e, f, Δ) in D do
3.   if f + Δ ≤ b do
4.     remove the entry from D
5.   endif
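The Lossy Counting pseudocode can likewise be sketched in a few lines of Python. The class and method names (`LossyCounter`, `add`, `frequent`) are illustrative choices, not from the paper.

```python
import math

class LossyCounter:
    """Sketch of Lossy Counting for single items (eps = error tolerance)."""

    def __init__(self, eps):
        self.eps = eps
        self.w = math.ceil(1 / eps)   # bucket width
        self.b = 1                    # current bucket id
        self.N = 0                    # stream length so far
        self.D = {}                   # item -> (f, delta)

    def add(self, e):
        self.N += 1
        if e in self.D:
            f, delta = self.D[e]
            self.D[e] = (f + 1, delta)
        else:
            self.D[e] = (1, self.b - 1)   # delta = max undercount before insertion
        if self.N % self.w == 0:          # bucket boundary: prune entries with f + delta <= b
            self.D = {k: (f, d) for k, (f, d) in self.D.items() if f + d > self.b}
            self.b += 1

    def frequent(self, s):
        """All items with estimated frequency f >= (s - eps) * N."""
        return {e: f for e, (f, d) in self.D.items() if f >= (s - self.eps) * self.N}
```

Note the algorithm is fully deterministic: the same stream always yields the same counts, in contrast to Sticky Sampling.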
Lossy Counting
- Lossy Counting guarantees that:
  - When deletion occurs, b ≤ εN
  - If an entry (e, f, Δ) is deleted, fe ≤ b, where fe is the actual frequency count of e
  - Hence, if an entry (e, f, Δ) is deleted, fe ≤ εN
  - Finally, f ≤ fe ≤ f + εN
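These bounds can be checked empirically. The self-contained sketch below runs Lossy Counting on a synthetic stream, computes exact counts alongside, and asserts both that tracked estimates satisfy f ≤ fe ≤ f + εN and that anything pruned had true frequency at most εN. The stream composition and the helper name `lossy_counts` are illustrative.

```python
import math
import random

def lossy_counts(stream, eps):
    """Run Lossy Counting; returns (D, N) where D maps item -> (f, delta)."""
    w, b, N, D = math.ceil(1 / eps), 1, 0, {}
    for e in stream:
        N += 1
        if e in D:
            f, d = D[e]
            D[e] = (f + 1, d)
        else:
            D[e] = (1, b - 1)
        if N % w == 0:                        # bucket boundary: prune
            D = {k: (f, d) for k, (f, d) in D.items() if f + d > b}
            b += 1
    return D, N

# A stream mixing five heavy items with one-off fillers that get pruned.
rng = random.Random(7)
eps = 0.005
stream = [rng.choice("abcde") for _ in range(10000)] + [f"x{i}" for i in range(10000)]
rng.shuffle(stream)

true = {}
for e in stream:
    true[e] = true.get(e, 0) + 1

D, N = lossy_counts(stream, eps)
for e, (f, d) in D.items():
    assert f <= true[e] <= f + eps * N        # estimates are within eps*N of the truth
for e, c in true.items():
    if e not in D:
        assert c <= eps * N                   # anything dropped was infrequent
```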
Sticky Sampling vs Lossy Counting
- Sticky Sampling is non-deterministic, while Lossy Counting is deterministic
- Experimental results show that Lossy Counting requires fewer entries than Sticky Sampling
The more complex case: finding frequent itemsets
- The Lossy Counting algorithm is extended to find frequent itemsets
- Transactions in the data stream may contain any number of items
Overview of the algorithm
- Incoming data stream is conceptually divided into buckets of ⌈1/ε⌉ transactions
- Counts are kept in a data structure D
- Multiple buckets (β of them, say) are processed in a batch
- Each entry in D is of the form (set, f, Δ), where:
  - set is the itemset
  - f is the frequency of set in the stream since the entry was inserted in D
  - Δ is the maximum count of set in the stream before set was added to D
Overview of the algorithm (cont'd)
- D is updated by the operations UPDATE_SET and NEW_SET
- UPDATE_SET updates and deletes entries in D
  - For each entry (set, f, Δ), count occurrences of set in the batch and update the entry
  - If an updated entry satisfies f + Δ ≤ b_current, the entry is removed from D
- NEW_SET inserts new entries into D
  - If a set set has frequency f in the batch and set does not occur in D, create a new entry (set, f, b_current - β)
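The two operations above can be sketched naively for small transactions, where every non-empty subset can simply be enumerated (the paper's Buffer/Trie/SetGen modules exist precisely to avoid this). The function names are illustrative, and the f ≥ β filter in NEW_SET (a set seen fewer than β times in a batch of β buckets cannot be frequent) follows the VLDB paper rather than the slide.

```python
from itertools import combinations

def subsets(txn):
    """All non-empty subsets of a (small) transaction, as frozensets."""
    items = sorted(txn)
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            yield frozenset(c)

def process_batch(D, batch, b_current, beta):
    """One batch of beta buckets: UPDATE_SET then NEW_SET (naive sketch).

    D maps frozenset -> (f, delta); batch is a list of transactions (sets).
    """
    # Count subset occurrences in this batch (naive enumeration).
    batch_count = {}
    for txn in batch:
        for s in subsets(txn):
            batch_count[s] = batch_count.get(s, 0) + 1
    # UPDATE_SET: refresh existing entries; delete those with f + delta <= b_current.
    for s in list(D):
        f, d = D[s]
        f += batch_count.pop(s, 0)
        if f + d <= b_current:
            del D[s]
        else:
            D[s] = (f, d)
    # NEW_SET: insert sets seen in the batch that are absent from D.
    for s, f in batch_count.items():
        if f >= beta:                  # sets seen < beta times cannot still be frequent
            D[s] = (f, b_current - beta)
    return D
```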
Implementation
- Challenges:
  - Not to enumerate all subsets of a transaction
  - Data structure must be compact for better space efficiency
- 3 major modules:
  - Buffer
  - Trie
  - SetGen
Implementation (cont'd)
- Buffer: repeatedly reads in a batch of buckets of transactions, where each transaction is a set of item-ids, into available main memory
- Trie: maintains the data structure D
- SetGen: generates subsets of item-ids along with their frequency counts in the current batch
  - Not all possible subsets need to be generated
  - If a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered
Performance
- IBM dataset (T10.I4.D1000K / 10K items)
Performance (cont'd)
- Compared with Apriori
- IBM dataset (T10.I4.D1000K / 10K items)
Conclusion
- Sticky Sampling and Lossy Counting are 2 approximate algorithms that can find frequent items
- Both algorithms produce frequency counts within a user-specified error tolerance level, though Sticky Sampling is non-deterministic
- Lossy Counting can be extended to find frequent itemsets
Reference
- G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In VLDB 2002, Hong Kong, 2002.
Q & A