Page 1

Algorithms for data streams

Foundations of Data Science, 2014

Indian Institute of Science
Navin Goyal

Page 2

Introduction

• Data streams: very large input data arriving sequentially, too large to fit in memory

• Examples:
  – networks (traffic passing through a router)
  – databases (transaction logs)
  – scientific data (satellites, sensors, LHC, …)
  – financial data

• What can we compute about the data in such situations?

• Today’s lecture: Start with an illustrative example problem, and then some generalities about the streaming model and problems

Page 3

Example: Counting

• We get a sequential stream of bits, e.g. 0 1 1 0 1 0 0 1 …

• Want to maintain the total number of 1’s

• If the count is at most $n$, what is the number of bits of memory needed?

• Can we use smaller memory (fewer than about $\log_2 n$ bits) to count up to $n$? (Could be useful when the stream is really long)

• No (Proof?)

Page 4

Example: Counting

• Need about $\log_2 n$ bits to count exactly up to $n$

• To do better, need to relax the rules of the game

• Two new ingredients:
  – Allow approximate answers
  – Allow randomization (so sometimes the output could be wrong, but only with small probability)

• In the type of problems we will look at, we often need both in order to do much better

• In these lectures, we will almost always use both

Page 5

Counting

• Trick #1: approximation

• Divide the range of possible counts into buckets of geometrically increasing sizes: bucket $k$ is $[2^k, 2^{k+1})$

• Instead of maintaining the count $n$ in full precision, just maintain the index $k$ of the bucket in which $n$ lies (see the sketch below)

• Note that $2^k \le n < 2^{k+1}$
• If we have to output the count at some point, we just output $2^k$
• This makes at most a factor of $2$ error, as $n/2 < 2^k \le n$
• #buckets $\approx \log_2 n$, so we only need to keep about $\log_2 \log_2 n$ bits
• Question: How can we reduce the approximation factor?
• Problem with approximation: When do we go to the next bucket?
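A minimal sketch of the bucket-index idea (my own illustration, not code from the lecture): store only $k = \lfloor \log_2 n \rfloor$ and report $2^k$, which is within a factor of 2 of the true count.

    def bucket_index(count):
        """Index k of the power-of-two bucket [2^k, 2^{k+1}) containing count >= 1."""
        return count.bit_length() - 1   # equals floor(log2(count))

    count = 1000
    k = bucket_index(count)             # k = 9, storable in about log2(log2(n)) bits
    print(k, 2 ** k)                    # 9 512 -- within a factor of 2 of 1000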

Page 6

Counting

• Trick #2: randomization

• Recall the following fact from probability (worked out below):
  – We have a coin that comes up heads with probability $p$
  – How many trials in expectation before we see heads? Answer: $1/p$

• Can use this to solve the problem of deciding when to update the bucket index:
  – When the count is in bucket $k$ (so this bucket has size $2^k$), we want to increase $k$ by 1 after about $2^k$ further increments to the count
  – Whenever we see a 1 (so the count should be incremented by 1), we toss a coin that comes up heads with probability $2^{-k}$ and increase the index of the bucket when we see heads; then in expectation it takes $2^k$ increments before we increase the index
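For completeness, the expected waiting time used above: if each trial is heads independently with probability $p$, the number of trials $T$ until the first heads satisfies

    \mathbb{E}[T] \;=\; \sum_{k \ge 1} k \, p (1-p)^{k-1} \;=\; \frac{1}{p},

so with $p = 2^{-k}$ the bucket index is advanced after $2^k$ increments in expectation.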

Page 7

Counting

• Morris counter: an approximate randomized counter with $O(\log \log n)$ bits (a short code sketch follows below):
  – Initialize $X \leftarrow 0$
  – If the next bit is 1, increment $X$ with probability $2^{-X}$, else do nothing
  – At the end, output $2^{X} - 1$

• Denote by $X_n$ the (random) value of the Morris counter after $n$ increments

• Would like to claim: $2^{X_n} - 1$ is a “good” approximation of $n$
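A short Python sketch of the counter just described (the class and variable names are my own, not the lecture’s):

    import random

    class MorrisCounter:
        def __init__(self):
            self.x = 0                      # only X is stored: O(log log n) bits

        def increment(self):
            # on each arriving 1, increment X with probability 2^{-X}
            if random.random() < 2.0 ** (-self.x):
                self.x += 1

        def estimate(self):
            return 2 ** self.x - 1          # output 2^X - 1

    c = MorrisCounter()
    for _ in range(100_000):
        c.increment()
    print(c.x, c.estimate())                # X is typically around 17 for n = 100000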

Page 8

Performance of Morris counter

• Lemma: $\mathbb{E}\big[2^{X_n}\big] = n + 1$ (so $2^{X_n} - 1$ is an unbiased estimator of $n$)

• Proof: condition on the value of the counter after $n$ increments (as always, $X_0 = 0$, so $2^{X_0} = 1$); the one-step calculation is written out below
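The induction step behind the lemma, written out under the setup above:

    \mathbb{E}\big[2^{X_{n+1}} \mid X_n\big]
      = 2^{X_n}\bigl(1 - 2^{-X_n}\bigr) + 2^{X_n + 1}\cdot 2^{-X_n}
      = 2^{X_n} + 1 .

Taking expectations gives $\mathbb{E}[2^{X_{n+1}}] = \mathbb{E}[2^{X_n}] + 1$, and since $2^{X_0} = 1$, induction yields $\mathbb{E}[2^{X_n}] = n + 1$.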

Page 9

Boosting the success probability I

• Lemma: $\mathbb{E}\big[2^{2X_n}\big] = \tfrac{3}{2}n^2 + \tfrac{3}{2}n + 1$, so $\mathrm{Var}\big(2^{X_n}\big) = \tfrac{n(n-1)}{2}$, which is of order $n^2$

• Application of Chebyshev’s inequality:
  $\Pr\big[\,|2^{X_n} - 1 - n| \ge \epsilon n\,\big] \le \dfrac{\mathrm{Var}(2^{X_n})}{\epsilon^2 n^2} \le \dfrac{1}{2\epsilon^2}$

• $2^{X_n} - 1$ is not a directly useful estimator of $n$ because of this high variance

Page 10

Boosting the success probability I

• Constructing an estimator with smaller variance but still the same expectation: take $s$ independent samples $Y_1, \dots, Y_s$ of $Y = 2^{X_n} - 1$. The new estimator is the mean $\bar{Y} = \frac{1}{s}\sum_{i=1}^{s} Y_i$

• $\mathbb{E}[\bar{Y}] = \mathbb{E}[Y] = n$

• $\mathrm{Var}(\bar{Y}) = \mathrm{Var}(Y)/s$ (because the $Y_i$ are independent); see below
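Written out, the two claims above (using independence of the copies):

    \mathbb{E}[\bar Y] \;=\; \frac{1}{s}\sum_{i=1}^{s}\mathbb{E}[Y_i] \;=\; n,
    \qquad
    \mathrm{Var}(\bar Y) \;=\; \frac{1}{s^2}\sum_{i=1}^{s}\mathrm{Var}(Y_i) \;=\; \frac{\mathrm{Var}(Y)}{s}.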

Page 11

Performance of Morris counter

• For $s$ sufficiently large, $\bar{Y}$ is a useful estimator for $n$:
  $\Pr\big[\,|\bar{Y} - n| \ge \epsilon n\,\big] \le \dfrac{\mathrm{Var}(Y)}{s\,\epsilon^2 n^2} \le \dfrac{1}{2 s \epsilon^2}$

• Taking $s = \Theta\!\big(\tfrac{1}{\epsilon^2 \delta}\big)$ makes this at most $\delta$

• Theorem: For any fixed $\epsilon, \delta > 0$, the Morris counter with the estimator $\bar{Y}$ achieves, with probability at least $1 - \delta$:
  – additive error at most $\epsilon n$
  – number of bits used $O\!\big(\tfrac{1}{\epsilon^2 \delta} \log \log n\big)$
  (proof along the above lines)

• Dependence on $\delta$ not good

Page 12

Boosting the success probability II

• An abstract formulation of the problem of boosting success probability:

• We have a random variable $X$ with the property that $\Pr\big[\,|X - n| \le \epsilon n\,\big] \ge 2/3$ (say)

  – Estimate $n$ within additive error $\epsilon n$ with probability at least $1 - \delta$
  – You can use independent samples of $X$, but minimize the number of samples

• Test your understanding: Why didn’t we apply the Chernoff bound to the sum of estimators in the previous analysis?

(Figure: a number line marking $n - \epsilon n$, $n$, and $n + \epsilon n$ — the interval in which the estimate should land.)

Page 13

Boosting success probability II

• Use the median of independent samples of the estimator

• The median is a more robust estimator than the mean, with weak dependence on the behavior of the random variable far from the median

• Suppose we have $t$ independent samples $X_1, \dots, X_t$ of $X$ (assume $t$ odd)

• Define indicator random variables $Y_i$ for $i = 1, \dots, t$:
  $Y_i = 1$ if $|X_i - n| > \epsilon n$, $Y_i = 0$ otherwise. Then $\mathbb{E}[Y_i] \le 1/3$

• If $\mathrm{median}(X_1, \dots, X_t) \notin [n - \epsilon n,\, n + \epsilon n]$, then for at least half of the $i$ we have $|X_i - n| > \epsilon n$, i.e. $\sum_i Y_i \ge t/2$

• By the Chernoff bound, $\Pr\big[\sum_i Y_i \ge t/2\big] \le e^{-ct}$ for a constant $c > 0$ (for example, using Theorem 11.4 in the textbook)

• Choosing $t = O(\log(1/\delta))$ now gives $\Pr\big[\,\mathrm{median}(X_1, \dots, X_t) \notin [n - \epsilon n,\, n + \epsilon n]\,\big] \le \delta$

Page 14

• Test your understanding: Why don’t we just use the median all the time for boosting the probability of success instead of the mean?

Page 15

Recap

• Median of the means estimator: use means to get a constant probability of the estimate being good, and then take the median to boost the probability efficiently

• Final algorithm for counting (a sketch follows below):
  – Let $s = O(1/\epsilon^2)$ and $t = O(\log(1/\delta))$
  – Run $s \cdot t$ independent copies of the basic Morris counter, arranged in $t$ groups of $s$ copies each
  – Let $\bar{Y}_j$ be the mean of the estimates in group $j$, for $j = 1, \dots, t$
  – When asked for the count, output the median of the means:
    $\mathrm{median}(\bar{Y}_1, \dots, \bar{Y}_t)$
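A rough end-to-end sketch of this median-of-means counter, reusing the MorrisCounter class from the earlier sketch; the particular values of s and t below are illustrative choices, not the lecture’s constants:

    import statistics

    def median_of_means_count(stream_length, s=50, t=9):
        groups = [[MorrisCounter() for _ in range(s)] for _ in range(t)]
        for _ in range(stream_length):           # every arriving 1 updates every copy
            for group in groups:
                for c in group:
                    c.increment()
        means = [sum(c.estimate() for c in group) / s for group in groups]
        return statistics.median(means)

    print(median_of_means_count(100_000))        # typically close to 100000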

Page 16

Questions to ponder

(Some of the following questions have been answered in the literature or are easy; in other cases I don’t know the answer)

• Can you do with fewer bits than $O(\log \log n)$?

• What happens if we allow increments bigger than $1$? How fast can you update the counter?

• If you have two Morris counters how do you add them?

• What happens if you allow decrements? Does the Morris counter work? If not, can you devise a different scheme that’s as efficient?

• What is the randomness requirement of the Morris counter? How small can you make it?

Page 17

Streaming data: models and problems

Page 18

Models for streaming data

• The data is presented as a stream of elements $a_1, a_2, \dots, a_n$. Depending on the type of the elements, we get different models

• Universe $[m] = \{1, 2, \dots, m\}$

• Cash register model: $a_i = (j_i, c_i)$ where $j_i \in [m]$ and $c_i$ is a small positive integer (the length of the stream may or may not be known in advance, but usually we will assume that it is known)

• Turnstile model: $a_i = (j_i, c_i)$ where $j_i \in [m]$ and $c_i$ is an integer (possibly negative)

Page 19

Models for streaming data

• Turnstile model: $a_i = (j_i, c_i)$ where $j_i \in [m]$ and $c_i$ is an integer (possibly negative)

• Frequency vector for stream $a_1, \dots, a_n$ (see the reference sketch below):
  – For $j \in [m]$ define $f_j = \sum_{i\,:\, j_i = j} c_i$
  – The frequency vector is $f = (f_1, \dots, f_m)$

• Strict turnstile model: $a_i = (j_i, c_i)$ where $j_i \in [m]$ and $c_i$ is an integer, with the condition that all components of $f$ are non-negative at all times

• Various other models: sliding window, random order, cloud, distributed streams, …
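A small reference sketch of these definitions (my own toy example): maintain the frequency vector $f$ explicitly under turnstile updates $(j, c)$. Streaming algorithms avoid exactly this, since it takes space proportional to the universe size $m$.

    m = 8
    f = [0] * m
    stream = [(3, 2), (5, 1), (3, -1), (7, 4)]   # turnstile updates (j, c), 0-indexed j
    for j, c in stream:
        f[j] += c
        # in the strict turnstile model we would additionally require f[j] >= 0 here
    print(f)                                     # [0, 0, 0, 1, 0, 1, 0, 4]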

Page 20

Restrictions on the algorithm

• The goal is to compute some function of the stream

• No random access to the input: we just get to make one unidirectional pass. Occasionally we will be interested in more than one pass

• Space: much smaller compared to the size of the data, typically $\mathrm{polylog}(m, n)$ bits

• Update/query time: process each new stream element quickly, typically in time $\mathrm{polylog}(m, n)$. Answer queries in time $\mathrm{polylog}(m, n)$

• Randomness and approximation: often essential. Want guarantees of one of the following forms, where $A$ is the algorithm’s output and $Q$ is the true value:
  – Relative approximation: $\Pr\big[\,|A - Q| \le \epsilon Q\,\big] \ge 1 - \delta$
  – Additive approximation: $\Pr\big[\,|A - Q| \le \epsilon\,\big] \ge 1 - \delta$

• Probability here is over the randomness of the algorithm; the data is chosen arbitrarily

Page 21

Some streaming problems: frequency moments

• Given a stream with frequency vector $f = (f_1, \dots, f_m)$

• Frequency moments: for $p \ge 0$, $F_p = \sum_{j=1}^{m} f_j^{\,p}$ (a reference computation follows below)

• Note that $F_1 = \sum_j f_j$ is just the number of items in the stream. The Morris counter solves this efficiently

• But only in the cash register model

• $F_0$ = the number of distinct $j$ with $f_j \ne 0$ (in the strict turnstile model)

• Many applications of frequency moments: query optimization in databases, data mining, internet routing
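For concreteness, a non-streaming reference computation of these moments (this is the quantity a streaming algorithm must approximate without storing the whole frequency vector; the function name is mine):

    from collections import Counter

    def frequency_moment(stream, p):
        f = Counter(stream)          # frequency vector, cash register model with unit increments
        if p == 0:
            return sum(1 for v in f.values() if v != 0)   # F_0 = number of distinct items
        return sum(v ** p for v in f.values())

    stream = [3, 5, 7, 4, 3, 4, 3, 4, 7, 5, 9]
    print(frequency_moment(stream, 0))   # 5  distinct items
    print(frequency_moment(stream, 1))   # 11 items in total
    print(frequency_moment(stream, 2))   # 27 = 3^2 + 2^2 + 2^2 + 3^2 + 1^2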

Page 22

A general template for many streaming algorithms

• Come up with a basic random estimator for the quantity of interest (usually the non-trivial part)

• Give an efficient algorithm to compute the estimator (may need the use of hashing or some other way of reducing randomness requirements)

• Improve the probability of success by some trick such as the median of means estimator

Page 23

Plan for next few lectures

• Algorithms for frequency moments (such as $F_0$ and $F_2$)

• Algorithms for heavy hitters, point query

• Document sketching

• Matrix problems in the streaming model

Page 24

Distinct items: $F_0$

• Given a stream $a_1, \dots, a_n$, where each $a_i \in \{1, \dots, m\}$, count the number of distinct items (so we are in the cash register model)

• Example: 3 5 7 4 3 4 3 4 7 5 9
• 5 distinct elements: 3 4 5 7 9 (we only want the count of distinct elements, and not the set of distinct elements)

• In terms of frequency moment estimation, this is the problem of estimating $F_0$

• The easy deterministic solutions use space $m$ bits (one bit per universe element) or $O(d \log m)$ bits (store the distinct elements seen), where $d$ is the number of distinct elements; see the sketch after this slide

• A deterministic exact solution requires $\Omega(m)$ bits of space in the worst case
• How about deterministic approximate solutions? And exact randomized?

• Can we do better with randomization and approximation?
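The two easy deterministic solutions mentioned above, written out as plain (illustrative) Python: a bitmap over the universe $\{1, \dots, m\}$, and an explicit set of the distinct elements seen so far.

    def distinct_bitmap(stream, m):
        seen = [False] * (m + 1)        # m bits of space (index 0 unused)
        for a in stream:
            seen[a] = True
        return sum(seen)

    def distinct_set(stream):
        return len(set(stream))         # about d * log(m) bits for d distinct elements

    stream = [3, 5, 7, 4, 3, 4, 3, 4, 7, 5, 9]
    print(distinct_bitmap(stream, 9), distinct_set(stream))   # 5 5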