May 21, 2003 CS Forum Annual Meeting 1
Randomization for Massive and Streaming Data Sets
Rajeev Motwani
2
Data Stream Management Systems
Traditional DBMS – data stored in finite, persistent data sets
Data Streams – distributed, continuous, unbounded, rapid, time-varying, noisy, …
Emerging DSMS – variety of modern applications
  Network monitoring and traffic engineering
  Telecom call records
  Network security
  Financial applications
  Sensor networks
  Manufacturing processes
  Web logs and clickstreams
  Massive data sets
3
DSMS – Big Picture

[Diagram: input streams flow into the DSMS, which holds a scratch store, an archive, and stored relations; users register queries and receive streamed or stored results.]
4
Algorithmic Issues
Computational Model
  Streaming data (or, secondary memory)
  Bounded main memory

Techniques
  New paradigms
  Negative results and approximation
  Randomization

Complexity Measures
  Memory
  Time per item (online, real-time)
  # Passes (linear scan in secondary memory)
5
Stream Model of Computation
[Diagram: a stream of bits (e.g., 1 0 1 1 0 0 1 …) arrives over increasing time and is processed in bounded main memory holding synopsis data structures.]
Memory: poly(1/ε, log N)
Query/Update Time: poly(1/ε, log N)
N: # items so far, or window size
ε: error parameter
6
“Toy” Example – Network Monitoring
[Diagram: network measurements and packet traces stream into the DSMS (scratch store, archive, lookup tables); registered monitoring queries yield intrusion warnings and online performance metrics.]
7
Frequency Related Problems
[Histogram: frequencies of elements 1 through 20.]
Find all elements with frequency > 0.1%
Top-k most frequent elements
What is the frequency of element 3?
What is the total frequency of elements between 8 and 14?
Find elements that occupy 0.1% of the tail.
Mean + Variance?
Median?
How many elements have non-zero frequency?
Analytics on Packet Headers – IP Addresses
8
Example 1 – Distinct Values
Input Sequence X = x1, x2, …, xn, …
Domain U = {0, 1, 2, …, u−1}
Compute D(X) = number of distinct values
Remarks
  Assume stream size n is finite/known (generally, n is window size)
  Domain could be arbitrary (e.g., text, tuples)
9
Naïve Approach
Counter C(i) for each domain value i
Initialize counters C(i) ← 0
Scan X, incrementing appropriate counters

Problem
  Memory size M << n
  Space O(u) – possibly u >> n (e.g., when counting distinct words in a web crawl)
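A minimal Python sketch of this naïve scheme (names are ours; the list plays the role of the counters C(i)):

```python
# Naive distinct-values counting: one counter per domain value.
# Space is O(u), which is infeasible when the domain size u >> memory M.

def naive_distinct(stream, u):
    C = [0] * u                  # initialize counters C(i) <- 0
    for x in stream:
        C[x] += 1                # scan X, incrementing the matching counter
    return sum(1 for c in C if c > 0)

print(naive_distinct([3, 1, 3, 7, 1], u=10))  # -> 3
```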
10
Negative ResultNegative Result
Theorem: Deterministic algorithms need M = Ω(n log u) bits
Proof: Information-theoretic arguments
Note: Leaves open randomization/approximation
11
Randomized Algorithm

[Diagram: input stream hashed by h: U → [1..t] into a hash table with chaining.]

Analysis
  Random h ⇒ few collisions & avg list-size O(n/t)
  Thus
    Space: O(n) – since we need t = Ω(n)
    Time: O(1) per item [Expected]
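A sketch of this exact hash-table algorithm, assuming a toy hash function of our own choosing (names are illustrative, not from the slides):

```python
import random

# Exact distinct counting with chaining: with t = Theta(n) buckets the
# expected chain length is O(n/t) = O(1), giving O(1) expected time per
# item -- but the table still stores every distinct value, so space is O(n).

def distinct_via_hashing(stream, t, seed=0):
    rng = random.Random(seed)
    p = 2**31 - 1                                   # a Mersenne prime
    a, b = rng.randrange(1, p), rng.randrange(p)
    h = lambda x: ((a * x + b) % p) % t             # h: U -> [0..t-1]
    buckets = [[] for _ in range(t)]
    count = 0
    for x in stream:
        chain = buckets[h(x)]
        if x not in chain:                          # expected O(1) chain scan
            chain.append(x)
            count += 1
    return count

print(distinct_via_hashing([3, 1, 3, 7, 1], t=8))   # -> 3
```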
12
Improvement via Sampling?
Sample-based Estimation
  Random Sample R (of size r) of n values in X
  Compute D(R)
  Estimator E = D(R) × n/r

Benefit – sublinear space

Cost – estimation error is high
Why? – low-frequency values are underrepresented
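A sketch of this estimator, with an input chosen to show how badly a skewed stream can mislead it (stream and names are our own illustration):

```python
import random

# Estimate D(X) from a random sample R of size r: E = D(R) * n / r.
# Sublinear space, but values that occur rarely are usually missed
# entirely, so the scaled-up estimate can be wildly off.

def sampled_distinct_estimate(X, r, seed=0):
    R = random.Random(seed).sample(X, r)     # random sample of r of the n values
    return len(set(R)) * len(X) / r          # E = D(R) * n/r

X = [0] * 9995 + [1, 2, 3, 4, 5]             # D(X) = 6, but 5 values are rare
print(sampled_distinct_estimate(X, r=100))   # usually ~100: far from 6
```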
13
Negative Result for Sampling
Consider estimator E of D(X) examining r items in X, possibly in an adaptive/randomized fashion.

Theorem: For any δ > e^(−r), E has relative error at least √( ((n−r)/(2r)) · ln(1/δ) ) with probability at least δ.

Remarks
  For r = n/10: error 75% with probability ½
  Leaves open randomization/approximation on full scans
14
Randomized Approximation

Simplified Problem – For fixed t, is D(X) >> t?

Choose hash function h: U → [1..t]
Initialize answer to NO
For each xi, if h(xi) = t, set answer to YES

Observe – need 1 bit of memory only!

Theorem:
  If D(X) < t, P[output NO] > 0.25
  If D(X) > 2t, P[output NO] < 0.14

[Diagram: input stream hashed by h: U → [1..t]; a Boolean flag records whether any item hashed to bucket t, giving the YES/NO answer.]
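A sketch of this one-bit test with a toy hash function (function and stream names are ours):

```python
import random

# The 1-bit test "is D(X) >> t?": hash each item into [1..t]; the single
# bit of state records whether any item ever hit bucket t.

def one_bit_test(stream, t, seed=0):
    rng = random.Random(seed)
    p = 2**31 - 1
    a, b = rng.randrange(1, p), rng.randrange(p)
    h = lambda x: ((a * x + b) % p) % t + 1     # h: U -> [1..t]
    answer = False                              # the Boolean flag: NO
    for x in stream:
        if h(x) == t:
            answer = True                       # some item hashed to t: YES
    return answer

print(one_bit_test(range(100), t=16))           # D(X) = 100 >> 16: likely YES
print(one_bit_test(range(3), t=16))             # D(X) = 3 << 16: likely NO
```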
15
Analysis
Let Y be the set of distinct elements of X
Output NO ⇔ no element of Y hashes to t

P[element hashes to t] = 1/t
Thus – P[output NO] = (1 − 1/t)^|Y|

Since |Y| = D(X):
  D(X) < t ⇒ P[output NO] > (1 − 1/t)^t > 0.25
  D(X) > 2t ⇒ P[output NO] < (1 − 1/t)^(2t) < 1/e² ≈ 0.14
16
Boosting Accuracy
With 1 bit we can distinguish D(X) < t from D(X) > 2t

Running O(log 1/δ) instances in parallel reduces the error probability to any δ > 0

Running O(log n) instances in parallel for t = 1, 2, 4, 8, …, n estimates D(X) within factor 2

Choice of multiplier 2 is arbitrary – can use factor (1+ε) to reduce error to ε

Theorem: Can estimate D(X) within factor (1±ε) with probability (1−δ) using space O( (log n / ε²) · log(1/δ) )
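A sketch of the boosted estimator: one-bit tests for t = 1, 2, 4, … with repeated copies per t and a majority vote. The constants, seeding, and stopping rule here are our own illustration, not from the slides:

```python
import random

def one_bit_test(stream, t, seed):
    rng = random.Random(seed)
    p = 2**31 - 1
    a, b = rng.randrange(1, p), rng.randrange(p)
    h = lambda x: ((a * x + b) % p) % t + 1    # h: U -> [1..t]
    return any(h(x) == t for x in stream)      # the 1-bit flag

def estimate_distinct(stream, n, copies=15):
    data = list(stream)        # sketch only; the real algorithm runs all
    t = 1                      # copies in parallel over a single pass
    while t <= 2 * n:
        yes = sum(one_bit_test(data, t, seed=t * 1000 + s)
                  for s in range(copies))
        if yes < copies / 2:   # majority of copies says D(X) is not >> t
            return t           # so D(X) = Theta(t) with high probability
        t *= 2
    return t

print(estimate_distinct(range(100), n=1000))   # within a small factor of 100
```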
17
Example 2 – Elephants-and-Ants
Identify items whose current frequency exceeds support threshold s = 0.1%.
[Jacobson 2000, Estan-Verghese 2001]
18
Algorithm 1: Lossy Counting
Step 1: Divide the stream into ‘windows’
Window-size W is function of support s – specify later…
[Stream divided into windows: Window 1 | Window 2 | Window 3 | …]
19
Lossy Counting in Action ...

[Diagram: frequency counts start empty; the first window's items are added to the counters; at the window boundary, all counters are decremented by 1.]
20
Lossy Counting continued ...

[Diagram: the next window's items are added to the surviving counters; at each window boundary, all counters are again decremented by 1, and counters that reach 0 are dropped.]
21
Error Analysis
If current size of stream = N and window-size W = 1/ε,
then # windows = εN

Rule of thumb: Set ε = 10% of support s
Example: Given support frequency s = 1%, set error frequency ε = 0.1%

How much do we undercount?
Each counter loses at most 1 per window, so the frequency error is at most # windows = εN
22
Putting it all together…

Output: Elements with counter values exceeding (s−ε)N

Approximation guarantees
  Frequencies underestimated by at most εN
  No false negatives
  False positives have true frequency at least (s−ε)N

How many counters do we need?
Worst-case bound: (1/ε) log εN counters

Implementation details…
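A compact sketch of Lossy Counting exactly as described on these slides (decrement every counter at each window boundary, drop zeros, report counters above (s−ε)N); the example stream is our own:

```python
import math

# Lossy Counting: window size W = 1/eps; at each window boundary every
# counter is decremented by 1 and zero counters are dropped. Counts are
# under-estimates by at most eps*N (one decrement per window), so
# reporting counters above (s - eps)*N gives no false negatives.

def lossy_counting(stream, s, eps):
    W = math.ceil(1 / eps)
    counts, N = {}, 0
    for x in stream:
        counts[x] = counts.get(x, 0) + 1
        N += 1
        if N % W == 0:                       # window boundary
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return {x: c for x, c in counts.items() if c > (s - eps) * N}

stream = [1] * 500 + [2] * 300 + list(range(3, 203))   # N = 1000
print(lossy_counting(stream, s=0.1, eps=0.01))          # finds 1 and 2
```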
23
Algorithm 2: Sticky Sampling

[Figure: stream of items; sampled items (e.g., 34, 15, 30) get counters and are thereafter counted exactly.]

Create counters by sampling
Maintain exact counts thereafter

What is the sampling rate?
24
Sticky Sampling contd...

For finite stream of length N
Sampling rate = 2/(εN) · log(1/(sδ)), where δ = probability of failure

Same rule of thumb: Set ε = 10% of support s
Example: Given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%

Output: Elements with counter values exceeding (s−ε)N

Same error guarantees as Lossy Counting, but probabilistic

Approximation guarantees (probabilistic)
  Frequencies underestimated by at most εN
  No false negatives
  False positives have true frequency at least (s−ε)N
25
Number of counters?
Finite stream of length N
Sampling rate: 2/(εN) · log(1/(sδ))

Infinite stream with unknown N
Gradually adjust sampling rate

In either case,
Expected number of counters = (2/ε) · log(1/(sδ)) — independent of N
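A sketch of Sticky Sampling following the Manku-Motwani description (segment lengths 2t, 2t, 4t, 8t, … and the coin-toss diminishing step at each rate change); variable names and the example stream are ours:

```python
import math
import random

# Sticky Sampling: new items enter the counter table with probability
# 1/rate; once counted, counts are exact. The rate doubles as the stream
# grows, and at each rate change every counter is diminished by one per
# unsuccessful coin toss (counters hitting zero are dropped).

def sticky_sampling(stream, s, eps, delta, seed=0):
    rng = random.Random(seed)
    t = (1 / eps) * math.log(1 / (s * delta))
    counts, rate, N, next_change = {}, 1, 0, 2 * t
    for x in stream:
        N += 1
        if N > next_change:                  # segments: 2t, 2t, 4t, 8t, ...
            rate *= 2
            next_change += t * rate
            for key in list(counts):         # diminish each counter
                while counts[key] > 0 and rng.random() < 0.5:
                    counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
        if x in counts:
            counts[x] += 1                   # exact counting thereafter
        elif rng.random() < 1 / rate:        # sample new item w.p. 1/rate
            counts[x] = 1
    return {x: c for x, c in counts.items() if c > (s - eps) * N}

stream = [1] * 500 + [2] * 300 + list(range(3, 203))
print(sticky_sampling(stream, s=0.1, eps=0.01, delta=0.01))
```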
26
Example 3 – Correlated Attributes
      C1 C2 C3 C4 C5
R1     1  1  1  1  0
R2     1  1  0  1  0
R3     1  0  0  1  0
R4     0  0  1  0  1
R5     1  1  1  0  1
R6     1  1  1  1  1
R7     0  1  1  1  1
R8     0  1  1  1  0
…      …

Input Stream – items with boolean attributes
Matrix – M(r,c) = 1 ⇔ Row r has Attribute c
Identify – Highly-correlated column-pairs
27
Correlation ⇒ Similarity
View column as set of row-indexes (where it has 1's)

Set Similarity (Jaccard measure)
  sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|

Example
  Ci Cj
   0  1
   1  0
   1  1
   0  0
   1  1
   0  1
  sim(Ci, Cj) = 2/5 = 0.40
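A direct transcription of the Jaccard measure, using the example columns above as sets of row indexes:

```python
# Jaccard similarity of two columns viewed as sets of row indexes
# (the rows where the column has a 1).

def sim(Ci, Cj):
    return len(Ci & Cj) / len(Ci | Cj)    # |intersection| / |union|

Ci = {2, 3, 5}        # column Ci has 1s in rows 2, 3, 5
Cj = {1, 3, 5, 6}     # column Cj has 1s in rows 1, 3, 5, 6
print(sim(Ci, Cj))    # -> 0.4  (= 2/5, as in the example)
```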
28
Identifying Similar Columns?
Goal – finding candidate pairs in small memory

Signature Idea
  Hash columns Ci to small signature sig(Ci)
  Set of signatures fits in memory
  sim(Ci, Cj) approximated by sim(sig(Ci), sig(Cj))

Naïve Approach
  Sample P rows uniformly at random
  Define sig(Ci) as the P bits of Ci in the sample
  Problem: columns are sparse, so the sample would get only 0's and miss the interesting part of the columns
29
Key Observation
For columns Ci, Cj, four types of rows:
       Ci Cj
  A     1  1
  B     1  0
  C     0  1
  D     0  0

Overload notation: A = # rows of type A

Observation: sim(Ci, Cj) = A / (A + B + C)
30
Min Hashing
Randomly permute rows
Hash h(Ci) = index of first row with 1 in column Ci

Surprising Property
  P[h(Ci) = h(Cj)] = sim(Ci, Cj)

Why? Both equal A/(A+B+C)
  Look down columns Ci, Cj until the first non-Type-D row
  h(Ci) = h(Cj) exactly when that row is Type A
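A quick Monte Carlo check of this property (our own illustrative harness): over many random row permutations, the fraction of agreements between h(Ci) and h(Cj) approaches sim(Ci, Cj).

```python
import random

# Min-Hash property check: h(C) = position (under a random permutation)
# of the first row with a 1 in column C; then P[h(Ci)=h(Cj)] = sim(Ci,Cj).

def min_hash(column, pos):
    return min(pos[row] for row in column)     # first 1-row under the permutation

Ci, Cj = {2, 3, 5}, {1, 3, 5, 6}               # the earlier example: sim = 0.40
rows = list(range(1, 7))
trials, agree = 100_000, 0
rng = random.Random(0)
for _ in range(trials):
    order = rows[:]
    rng.shuffle(order)                          # random row permutation
    pos = {row: i for i, row in enumerate(order)}
    agree += min_hash(Ci, pos) == min_hash(Cj, pos)
print(agree / trials)                           # approx 0.40
```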
31
Min-Hash Signatures
Pick k random row permutations

Min-Hash Signature
  sig(C) = k indexes of first rows with 1 in column C

Similarity of signatures
  Define: sim(sig(Ci), sig(Cj)) = fraction of permutations where the Min-Hash values agree

Lemma: E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj)
32
Example
      C1 C2 C3
R1     1  0  1
R2     0  1  1
R3     1  0  0
R4     1  0  1
R5     0  1  0

Signatures            S1 S2 S3
Perm 1 = (12345)       1  2  1
Perm 2 = (54321)       4  5  4
Perm 3 = (34512)       3  5  4

Similarities        1-2    1-3    2-3
Col-Col            0.00   0.50   0.25
Sig-Sig            0.00   0.67   0.00
33
Implementation Trick
Permuting rows even once is prohibitive

Row Hashing
  Pick k hash functions hk: {1,…,n} → {1,…,O(n)}
  Ordering under hk gives a random row permutation

One-pass implementation
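A one-pass sketch of this trick (hash-function family and names are our own): each of the k hash functions induces a row ordering, and each column's signature keeps the minimum hash value over its 1-rows.

```python
import random

# One-pass Min-Hash signatures via row hashing: k hash functions replace
# k row permutations. Scanning the matrix once, signature entry i of each
# column is the minimum h_i(r) over rows r where the column has a 1.

def minhash_signatures(rows, num_cols, k, seed=0):
    rng = random.Random(seed)
    p = 2**31 - 1
    hs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    sig = [[float("inf")] * num_cols for _ in range(k)]
    for r, row in enumerate(rows):              # single pass over the rows
        hvals = [(a * r + b) % p for a, b in hs]
        for c, bit in enumerate(row):
            if bit:
                for i in range(k):
                    sig[i][c] = min(sig[i][c], hvals[i])
    return sig

def sig_sim(sig, ci, cj):                       # fraction of agreeing minima
    return sum(row[ci] == row[cj] for row in sig) / len(sig)

rows = [[1,0,1],[0,1,1],[1,0,0],[1,0,1],[0,1,0]]   # the slide's example matrix
sig = minhash_signatures(rows, num_cols=3, k=200)
print(sig_sim(sig, 0, 2))                           # approx sim(C1,C3) = 0.50
```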
34
Comparing Signatures
Signature Matrix S
  Rows = Hash Functions
  Columns = Columns
  Entries = Signatures

Need – pair-wise similarity of signature columns

Problem
  MinHash fits the column signatures in memory
  But comparing all signature-pairs takes too much time

Limiting candidate pairs – Locality Sensitive Hashing
35
Summary
New algorithmic paradigms needed for streams and massive data sets
Negative results abound
Need to approximate
Power of randomization
36
Thank You!
37
References
Rajeev Motwani (http://theory.stanford.edu/~rajeev)
STREAM Project (http://www-db.stanford.edu/stream)
STREAM: The Stanford Stream Data Manager. Bulletin of the Technical Committee on Data Engineering 2003.
Motwani et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System. CIDR 2003.
Babcock-Babu-Datar-Motwani-Widom. Models and Issues in Data Stream Systems. PODS 2002.
Manku-Motwani. Approximate Frequency Counts over Streaming Data. VLDB 2003.
Babcock-Datar-Motwani-O’Callaghan. Maintaining Variance and K-Medians over Data Stream Windows. PODS 2003.
Guha-Meyerson-Mishra-Motwani-O’Callaghan. Clustering Data Streams: Theory and Practice. IEEE TKDE 2003.
38
References (contd)

Datar-Gionis-Indyk-Motwani. Maintaining Stream Statistics over Sliding Windows. SIAM Journal on Computing 2002.
Babcock-Datar-Motwani. Sampling From a Moving Window Over Streaming Data. SODA 2002.
O’Callaghan-Guha-Mishra-Meyerson-Motwani. High-Performance Clustering of Streams and Large Data Sets. ICDE 2003.
Guha-Mishra-Motwani-O’Callaghan. Clustering Data Streams. FOCS 2000.
Cohen et al. Finding Interesting Associations without Support Pruning. ICDE 2000.
Charikar-Chaudhuri-Motwani-Narasayya. Towards Estimation Error Guarantees for Distinct Values. PODS 2000.
Gionis-Indyk-Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.
Indyk-Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.