Page 1

Towards a Synopsis Warehouse

Peter J. Haas

IBM Almaden Research Center

San Jose, CA

Page 2

Acknowledgements:

Kevin Beyer

Paul Brown

Rainer Gemulla (TU Dresden)

Wolfgang Lehner (TU Dresden)

Berthold Reinwald

Yannis Sismanis

Page 3

Information Discovery for the Enterprise

(Architecture diagram.) Data sources: syndicated data providers, the crawlable/deep Web, and company data; content ranges from structured (ERP (SAP), CRM, WBI, BPM, SCM) through semi-structured (ECM reports and spreadsheets, financial docs (XBRL)) to unstructured (office documents, e-mail, product manuals). Crawl/ETL feeds an enterprise repository of content and metadata; analysis and integration yield business objects (Order, Account, Customer) that support search, business intelligence, and data analysis & similarity.

Query: "Explain the product movement, buyer behavior, maximize the ROI on my product campaigns."

Query: "The sales team is visiting company XYZ next week. What do they need to know about XYZ?"

Page 4

Motivation, Continued

• Challenge: scalability
  – Massive amounts of data at high speed: batches and/or streams
  – Structured, semi-structured, and unstructured data
• Want quick approximate analyses
  – Automated data integration and schema discovery
  – "Business object" identification
  – Quick approximate answers to queries
  – Data browsing/auditing
• Our approach: a warehouse of synopses
  – For scalability and flexibility

Page 5

A Synopsis Warehouse

(Architecture diagram.) A full-scale warehouse of data partitions feeds a warehouse of synopses: each partition is summarized by a synopsis S_{i,j}, and synopses can be merged into coarser summaries (e.g., S_{1-2,3-7}, or the all-partition summary S_{*,*}).

Page 6

Outline

• Synopsis 1: Uniform samples
  – Background
  – Creating and combining samples
    • Hybrid Bernoulli and Hybrid Reservoir algorithms
  – Updating samples
    • Stable datasets: random pairing
    • Growing datasets: resizing algorithms
    • Maintaining Bernoulli samples of multisets
• Synopsis 2: AKMV samples for DV estimation
  – Base partitions: KMV synopses
    • DV estimator and properties
  – Compound partitions: augmentation
    • DV estimator and closure properties

Page 7

Synopsis 1: Uniform Samples

• Design goals
  – True uniformity
  – Bounded memory
  – Keep sample full
  – Support for compressed samples
    • 80% of 1000 customer datasets had < 4000 distinct values

(Diagram.) The uniform sample of each partition feeds statistical procedures, mining algorithms, and other synopses (stratified samples, etc.).

Page 8

Classical Uniform Methods

• Bernoulli sampling
  – Coin flip: includes each element with probability q
  – Random, unbounded (binomial) sample size
  – Easy to merge: Bern(q) ∪ Bern(q) = Bern(q)
• Reservoir sampling
  – Creates a uniform sample of fixed size k
    • Insert the first k elements into the sample
    • Then insert the i-th element with probability p_i = k / i
  – Variants and optimizations (e.g., Vitter)
  – Merging is harder

(Diagram.) Example: stream x1,…,x6 with sample size 3; the current sample is {x1, x2, x4}, and the next element is included with probability 3/5. Sketches of both methods follow.
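A minimal sketch of both classical methods in Python (illustrative, not from the talk; the function names are mine):

```python
import random

def bernoulli_sample(stream, q):
    """Bern(q): include each element independently with probability q.
    The sample size is random (binomial)."""
    return [x for x in stream if random.random() < q]

def reservoir_sample(stream, k):
    """Uniform sample of fixed size k: keep the first k elements,
    then include the i-th element with probability k / i."""
    sample = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            sample.append(x)
        elif random.random() < k / i:
            sample[random.randrange(k)] = x  # overwrite a uniform victim
    return sample
```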

Page 9

Drawback of Basic Methods

• Neither method is very compact
  – Ex: dataset = (<A,500>, <B,300>)
  – Stored as (A,A,…,A,B,B,…,B): 800 characters
• Concise sampling (GM 98)
  – Compact: purge a Bern(q) sample S if it grows too large
    • A Bern(q′/q) subsample of S is a Bern(q′) sample (sketched below)
  – Not uniform (rare items under-represented)
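A sketch of the subsampling trick, assuming the input is a Bern(q) sample held as a list (illustrative names; the product q · (q′/q) = q′ gives the claimed inclusion probability):

```python
import random

def purge(sample, q, q_new):
    """Downsample a Bern(q) sample to a Bern(q_new) sample (q_new < q):
    each element survives independently with probability q_new / q."""
    assert 0 < q_new < q <= 1
    return [x for x in sample if random.random() < q_new / q]
```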

Page 10

New Sampling Methods (ICDE ’06)

• Two flavors:
  – Hybrid reservoir (HR)
  – Hybrid Bernoulli (HB)
• Properties
  – Truly uniform
  – Bounded footprint at all times
  – Will store the exact distribution if possible
  – Samples stored in compressed form
  – Merging algorithms available

Page 11

Hybrid Reservoir (HR) Sampling

Example: sample capacity = two <value,#> pairs or three values.

Phase 1 (maintain exact frequency distribution):
  +a, +a → {<a,2>}; +a → {a,a,a}, compressed to {<a,3>}
  +b → {<a,3>,<b,1>}; +b → {<a,3>,<b,2>}
  +c → capacity exceeded: subsample to {a,<b,2>}, expand to {a,b,b}

Phase 2 (reservoir sampling):
  c replaces an element → {c,b,b}; +d → {c,b,d}; … done

Page 12

Hybrid Bernoulli

• Similar to Hybrid Reservoir, except
  – Expand into a Bernoulli sample in Phase 2
  – Revert to a reservoir sample in Phase 3
• If termination in Phase 2
  – Uniform sample
  – "Almost" a Bernoulli sample (controllable engineering approximation)

Page 13

Merging Samples

• Both samples in Phase 2 (the usual case)
  – Bernoulli: equalize the q's and take the union
    • Take a subsample to equalize the q's
  – Reservoir: take subsamples and merge
    • Random (hypergeometric) subsample size
• Corner cases
  – One sample in Phase 1, etc.
  – See the ICDE '06 paper for details; a sketch of the Bernoulli case follows
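A minimal sketch of the Bernoulli merge, assuming two Bern(q1) and Bern(q2) samples over disjoint partitions (illustrative; reuses the Bern(q′/q) subsampling trick):

```python
import random

def merge_bernoulli(s1, q1, s2, q2):
    """Merge Bern(q1) and Bern(q2) samples of disjoint partitions:
    subsample each down to q = min(q1, q2), then take the union."""
    q = min(q1, q2)
    keep = lambda s, qi: [x for x in s if random.random() < q / qi]
    return keep(s1, q1) + keep(s2, q2), q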

Page 14

HB versus HR

• Advantages:
  – HB samples are cheaper to merge
• Disadvantages:
  – HR sampling controls the sample size better
  – HB needs to know the partition size in advance, for subsampling during sample creation
  – An engineering approximation is required

Page 15

Speedup: HB Sampling

You derive “speed-up” advantages from parallelism with up to about 100 partitions.

Page 16

Speedup: HR Sampling

Similar results to the previous slide, but merging HR samples is more complex than merging HB samples.

Page 17

Linear Scale-Up

(Charts showing linear scale-up for both HB and HR sampling.)

Page 18

Updates Within a Partition

• Arbitrary inserts/deletes (updates are trivial)
• Previous goals still hold
  – True uniformity
  – Bounded sample size
  – Keep the sample size close to the upper bound
• Also: minimize/avoid base-data access

(Diagram.) Inserts, deletes, and updates flow from a partition in the full-scale warehouse to its sample in the synopsis warehouse; going back to the base data is expensive.

Page 19

New Algorithms (VLDB '06+)

• Stable datasets: random pairing (a sketch follows this list)
  – Generalizes reservoir/stream sampling
    • Handles deletions
    • Avoids base-data accesses
  – Dataset insertions are paired randomly with "uncompensated deletions" (UDs)
    • Only requires counters (cg, cb) of "good" and "bad" UDs
    • Insert into the sample with probability cb / (cb + cg)
  – Extended sample-merging algorithm (VLDBJ '07)
• Growing datasets: resizing
  – Theorem: can't avoid base-data access
  – Main ideas:
    • Temporarily convert to Bern(q): may require base-data access
    • Drift up to the new size (stay within the new footprint at all times)
    • Choose q optimally to reduce the overall resizing time
  – Approximate and Monte Carlo methods
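A simplified random-pairing sketch, assuming distinct items (the real algorithm handles multisets and compressed samples):

```python
import random

class RandomPairing:
    """Bounded uniform sample under inserts/deletes, no base access."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.sample = []   # current sample
        self.size = 0      # |D|: current dataset size
        self.c_bad = 0     # uncompensated deletions that hit the sample
        self.c_good = 0    # uncompensated deletions that missed it

    def insert(self, item):
        self.size += 1
        if self.c_bad + self.c_good == 0:
            # nothing to compensate: plain reservoir sampling
            if len(self.sample) < self.capacity:
                self.sample.append(item)
            elif random.random() < self.capacity / self.size:
                self.sample[random.randrange(self.capacity)] = item
        else:
            # pair the insertion with a random uncompensated deletion
            if random.random() < self.c_bad / (self.c_bad + self.c_good):
                self.sample.append(item)
                self.c_bad -= 1
            else:
                self.c_good -= 1

    def delete(self, item):
        self.size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c_bad += 1
        else:
            self.c_good += 1
```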

Page 20

Bernoulli Samples of Multisets (PODS ’07)

• Bernoulli samples over multisets (with deletions)
  – When boundedness is not an issue
  – Compact, easy to parallelize
  – Problem: how to handle deletions (pairing?)
• Idea: maintain a "tracking counter"
  – Number of insertions into the dataset since the first insertion into the sample (GM98)
• Can exploit the tracking counter
  – To estimate frequencies, sums, averages
    • Unbiased (except avg) and low variance
  – To estimate the number of distinct values (!)
• Maintaining the tracking counter
  – Subsampling: new algorithm
  – Merging: negative result

Page 21

Outline

• Synopsis 1: Uniform samples
  – Background
  – Creating and combining samples
    • Hybrid Bernoulli and Hybrid Reservoir algorithms
  – Updating samples
    • Stable datasets: random pairing
    • Growing datasets: resizing algorithms
    • Maintaining Bernoulli samples of multisets
• Synopsis 2: AKMV samples for DV estimation
  – Base partitions: KMV synopses
    • DV estimator and properties
  – Compound partitions: augmentation
    • DV estimator and closure properties

Page 22

AKMV Samples (SIGMOD ’07)

• Goal: estimate the number of distinct values (DVs)
  – Dataset similarity (Jaccard distance)
  – Key detection
  – Data cleansing
• Within the warehouse framework
  – Must handle multiset union, intersection, and difference

Page 23

KMV Synopsis

• Used for a base partition
• Synopsis: the k smallest hashed values
  – vs. bitmaps (e.g., logarithmic counting)
    • Need inclusion/exclusion to handle intersection
    • Less accuracy, poor scaling
  – vs. sample counting
    • Random size K (between k/2 and k)
  – vs. Bellman [DJMS02]
    • minHash for k independent hash functions
    • O(k) time per arriving value, vs. O(log k)
• Can be viewed as a uniform sample of DVs (a sketch follows)
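A minimal KMV sketch (illustrative; uses SHA-1 normalized to (0,1) as a stand-in for the hash functions studied in the paper):

```python
import heapq, hashlib

class KMV:
    """Keep the k smallest normalized hashed values seen so far."""
    def __init__(self, k):
        self.k = k
        self.heap = []       # max-heap of kept values, via negation
        self.values = set()  # hashed values currently kept

    @staticmethod
    def _hash(item):
        h = hashlib.sha1(str(item).encode()).hexdigest()
        return int(h, 16) / float(1 << 160)  # normalize to (0,1)

    def add(self, item):
        u = self._hash(item)
        if u in self.values:
            return                       # duplicates hash identically
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -u)
            self.values.add(u)
        elif u < -self.heap[0]:          # smaller than the k-th value
            evicted = -heapq.heappushpop(self.heap, -u)
            self.values.discard(evicted)
            self.values.add(u)

    def kth_value(self):
        return -self.heap[0]             # U_(k)
```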

Page 24

The Basic Estimator

• Estimate: $\hat{D} = (k-1)/U_{(k)}$
  – U_(k) = the k-th smallest (normalized) hashed value
• Properties (theory of uniform order statistics)
  – Normalized hashed values "look like" i.i.d. uniform[0,1] RVs
• Large-D scenario (simpler formulas)
  – Theorem: U_(k) approx.= sum of k i.i.d. exp(D) random variables
  – Analysis coincides with [Cohen97]
  – Can use the simpler formulas to choose the synopsis size

$E[\hat D] = D \quad\text{and}\quad E[\hat D^r] = \frac{(k-1)^r\, D(D-1)\cdots(D-r+1)}{(k-1)(k-2)\cdots(k-r)}$

$E|\hat D - D|/D$ is bounded in $D$ (can also get confidence bounds)

Asymptotically efficient as $k \to \infty$ (minimal variance)
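Illustrative usage of the basic estimator with the KMV sketch above (the numbers are hypothetical):

```python
def estimate_dv(kmv):
    """Basic estimator D-hat = (k - 1) / U_(k); assumes at least k
    distinct values have been added to the KMV synopsis."""
    return (kmv.k - 1) / kmv.kth_value()

# e.g., if k = 4 and the 4th-smallest normalized hash is U_(4) = 0.03,
# the estimate is (4 - 1) / 0.03 = 100 distinct values.
```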

Page 25

Compound Partitions

• Given a multiset expression E
  – In terms of base partitions A1,…,An
  – Union, intersection, multiset difference
• Augmented KMV (AKMV) synopsis
  – KMV synopsis for $A_1 \cup \cdots \cup A_n$
  – Counters: c_E(v) = multiplicity of value v in E
  – AKMV synopses are closed under multiset operations (a sketch of the union case follows)
• Estimator (unbiased) for the number of DVs in E:

$\hat D_E = \frac{K_E}{k} \cdot \frac{k-1}{U_{(k)}}$, where $K_E$ = the number of positive counters
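A sketch of the union case, assuming each synopsis is a list of at least k hashed values plus a counter map (illustrative; intersection and difference would combine the counters differently):

```python
def akmv_union(hashes1, counts1, hashes2, counts2, k):
    """AKMV closure under multiset union: keep the k smallest hashed
    values of the combined synopses and add the multiplicity counters."""
    merged = sorted(set(hashes1) | set(hashes2))[:k]
    counts = {h: counts1.get(h, 0) + counts2.get(h, 0) for h in merged}
    return merged, counts

def estimate_dv_compound(merged, counts, k):
    """Unbiased estimator (K_E / k) * (k - 1) / U_(k),
    where K_E is the number of positive counters."""
    k_e = sum(1 for h in merged if counts[h] > 0)
    return (k_e / k) * (k - 1) / merged[-1]   # merged[-1] = U_(k)
```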

Page 26

Experimental Comparison

(Chart.) Absolute relative error (0 to 0.1) of the Unbiased, SDLogLog, Sample-Counting, and Unbiased-baseline estimators.

Page 27

For More Details

• "Toward automated large scale information integration and discovery." P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. In Data Management in a Connected World, T. Härder and W. Lehner, eds. Springer-Verlag, 2005.
• "Techniques for warehousing of sample data." P. G. Brown and P. J. Haas. ICDE '06.
• "A dip in the reservoir: maintaining sample synopses of evolving datasets." R. Gemulla, W. Lehner, and P. J. Haas. VLDB '06.
• "Maintaining Bernoulli samples over evolving multisets." R. Gemulla, W. Lehner, and P. J. Haas. PODS '07.
• "On synopses for distinct-value estimation under multiset operations." K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. SIGMOD '07.
• "Maintaining bounded-size sample synopses of evolving multisets." R. Gemulla, W. Lehner, and P. J. Haas. VLDB Journal, 2007.

Page 28

Backup Slides

Page 29

Bernoulli Sampling

– Bern(q) independently includes each element with probability q
– Random, uncontrollable sample size
– Easy to merge Bernoulli samples: the union of two Bern(q) samples is a Bern(q) sample

(Diagram.) Probability tree for q = 1/3 over insertions +t1, +t2, +t3: each possible sample occurs with the product of its branch probabilities, e.g., the empty sample with probability (2/3)^3 ≈ 30%, each singleton with ≈ 15%, each pair with ≈ 7%, and {1,2,3} with (1/3)^3 ≈ 4%.

Page 30

Reservoir Sampling (Example)

• Sample size M = 2

(Diagram.) Probability tree: after +t1, +t2 the sample is {1,2} with probability 100%; t3 enters with probability 2/3, giving samples {1,2}, {3,2}, {1,3} each with probability 1/3 (33%); t4 then enters with probability 2/4, and each of the six possible size-2 samples of {1,2,3,4} ends up with probability 1/6 (the nine leaf probabilities of 16% and 8% combine accordingly).

Page 31

Concise-Sampling Example

• Dataset: D = { a, a, a, b, b, b }
• Footprint: F = one <value, #> pair
• Three possible samples of size 3:
  – S1 = { a, a, a }, S2 = { b, b, b }, S3 = { a, a, b }
  – Stored as S1 = {<a,3>}, S2 = {<b,3>}, S3 = {<a,2>, <b,1>}
• The three samples should be equally likely
  – But S3 needs two pairs and exceeds the footprint, so Prob(S1) = Prob(S2) > 0 while Prob(S3) = 0
• In general: concise sampling under-represents "rare" population elements

Page 32

Hybrid Bernoulli Algorithm

• Phase 1
  – Start by storing a 100% sample compactly
  – Termination in Phase 1 ⇒ exact distribution
• Abandon Phase 1 if the footprint gets too big
  – Take a subsample and expand
  – Fall back to Bernoulli sampling (Phase 2)
  – If the footprint is exceeded again: revert to reservoir sampling (Phase 3)
• Compress the sample upon termination
  – Phase 2 termination ⇒ (almost) a Bernoulli sample
  – Phase 3 termination ⇒ a bounded reservoir sample
  – Stay within the footprint at all times
  – Messy details (a much-simplified sketch follows)
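A much-simplified sketch of the three phases, assuming k ≤ footprint and a footprint measured in stored values (the real algorithm works with compressed <value,#> pairs and handles the corner cases). It uses the fact that a uniform size-k subsample of a Bernoulli sample of the first n items is a uniform size-k sample of those n items, so the reservoir can continue from position n:

```python
import random
from collections import Counter

def hybrid_bernoulli(stream, footprint, q, k):
    """Phase 1: exact compressed distribution; Phase 2: Bern(q) after
    subsampling; Phase 3: reservoir of size k after an overflow."""
    phase, n, exact, sample = 1, 0, Counter(), []
    for x in stream:
        n += 1
        if phase == 1:
            exact[x] += 1
            if len(exact) > footprint:             # footprint too big
                expanded = list(exact.elements())  # subsample + expand
                sample = [y for y in expanded if random.random() < q]
                exact, phase = None, 2
        elif phase == 2:
            if random.random() < q:
                sample.append(x)
            if len(sample) > footprint:            # overflow: Phase 3
                sample = random.sample(sample, k)  # uniform subsample
                phase = 3
        else:                                      # reservoir sampling
            if random.random() < k / n:
                sample[random.randrange(k)] = x
    return exact if phase == 1 else sample
```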

Page 33

Subsampling in HB Algorithm

• Goal: find q such that P{|S| > n_F} = p

• Solve numerically:

$\sum_{j=n_F+1}^{|D|} \binom{|D|}{j}\, q^j (1-q)^{|D|-j} = p$

• Approximate solution (< 3% error), with z_p the standard normal quantile of upper-tail probability p:

$q = \frac{2|D|\,n_F + |D|\,z_p^2 - z_p\sqrt{|D|\,\big(z_p^2\,|D| + 4|D|\,n_F - 4 n_F^2\big)}}{2\,|D|\,\big(|D| + z_p^2\big)}$

Both routes are sketched in code below.
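A sketch of both routes to q (illustrative; the exact route evaluates the binomial tail directly and bisects, which is slow for large |D|):

```python
import math

def q_exact(D, n_F, p, tol=1e-9):
    """Bisection on q so that P{Binomial(D, q) > n_F} = p."""
    def tail(q):
        return 1.0 - sum(math.comb(D, j) * q**j * (1 - q)**(D - j)
                         for j in range(n_F + 1))
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if tail(mid) < p else (lo, mid)
    return (lo + hi) / 2

def q_approx(D, n_F, p, z_p):
    """Closed-form normal approximation (< 3% error per the talk)."""
    disc = z_p * math.sqrt(D * (z_p**2 * D + 4 * D * n_F - 4 * n_F**2))
    return (2 * D * n_F + D * z_p**2 - disc) / (2 * D * (D + z_p**2))
```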

Page 34

Merging HB Samples

• If both samples are in Phase 2
  – Choose q as before (w.r.t. |D1 ∪ D2|)
  – Convert both samples to compressed Bern(q) [use the Bern(q′/q) trick as in concise sampling]
  – If the union of the compressed samples fits in memory, then join and exit; else use reservoir sampling (unlikely)

Page 35

Merging a Pair of HR Samples

• If both samples are in Phase 2
  – Set k = min(|S1|, |S2|)
  – Select L elements from S1 and k − L from S2
    • L has a hypergeometric distribution on {0, 1, …, k}
    • The distribution depends on |D1| and |D2|
  – Take (compressed) reservoir subsamples of S1 and S2
  – Join (compressed union) and exit

Page 36

Generating Realizations of L

L is a random variable with probability mass function P(l) = P{L = l} given by

$P(l) = \frac{\binom{|D_1|}{l}\,\binom{|D_2|}{k-l}}{\binom{|D_1|+|D_2|}{k}}$ for $l = 0, 1, \ldots, k$

• Simplest implementation
  – Compute P recursively
  – Use the inversion method (probe the cumulative distribution at each merge; see the sketch below)
• Optimizations when the |D|'s and |S|'s are unchanging
  – Use alias methods to generate L from cached distributions in O(1) time
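A sketch of the recursive-inversion method (illustrative; d1 = |D1|, d2 = |D2|, and k ≤ min(d1, d2) is assumed):

```python
import random

def sample_L(k, d1, d2):
    """Draw L ~ Hypergeometric by inversion, using the pmf recurrence
    P(l+1)/P(l) = ((d1 - l)(k - l)) / ((l + 1)(d2 - k + l + 1))."""
    p = 1.0
    for j in range(k):                 # P(0) = C(d2, k) / C(d1 + d2, k)
        p *= (d2 - j) / (d1 + d2 - j)
    u, l = random.random(), 0
    while u > p and l < k:             # probe the cumulative distribution
        u -= p
        p *= ((d1 - l) * (k - l)) / ((l + 1) * (d2 - k + l + 1))
        l += 1
    return l
```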

Page 37

Naïve/Prior Approaches

Algorithm (technique): comments
• Naïve (use insertions to immediately refill the sample): not uniform
• RS with deletions (conduct deletions, continue with a smaller sample): unstable
• RS with resampling (let the sample size decrease, but occasionally recompute): expensive, unstable
• CAR(WOR) (immediately sample from the base data to refill the sample): stable but expensive
• Bernoulli sampling with purging ("coin flip" sampling with deletions, purge if too large): not uniform (!)
• Passive sampling (developed for data streams, sliding windows only): special case of our RP algorithm
• Distinct-value sampling (tailored for multiset populations): expensive, low space efficiency in our setting
• Counting samples (modification of concise sampling): not uniform

Page 38

Random Pairing

(Diagram.) Worked example: insertions +t1, +t2, +t3 followed by deletions of t2 and t3, then insertions +t4 and +t5. Random pairing pairs t4 and t5 with the two uncompensated deletions, and every size-2 sample of the resulting dataset ends up with equal probability (1/6 ≈ 16% each).

Page 39

Performance

(Chart.) Cost (log scale, 0.1 to 100) vs. dataset size (0 to 10 million), for a 100K sample and 10 million operations, comparing RP against the prior approaches (CAR, CARWOR, RS with resampling, and Bernoulli sampling with purging).

Page 40

A Negative Result

• Theorem
  – Any resizing algorithm MUST access the base data
• Example
  – Dataset {1, 2, 3, 4} with samples of size 2: each of the six samples {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4} has probability 1/6 (≈16%)
  – New dataset {1, 2, 3, 4, 5, 6, …} with samples of size 3: without base-data access, a sample can grow only with newly inserted items, so {1,2,3} has probability 0 while {1,2,5} has probability > 0. Not uniform!

Page 41

Resizing: Phase 1

Conversion to a Bernoulli sample
– Given q, randomly determine the sample size
  • U = Binomial(|D|, q)
– Reuse S to create the Bernoulli sample
  • Subsample if U < |S|
  • Else sample additional tuples (base-data access)
– Choice of q
  • small q ⇒ fewer base-data accesses
  • large q ⇒ more base-data accesses

Page 42

Resizing: Phase 2

Run Bernoulli sampling
– Include new tuples with probability q
– Delete from the sample as necessary
– Eventually reach the new sample size
– Then revert to reservoir sampling
– Choice of q
  • small q ⇒ long drift time
  • large q ⇒ short drift time

Page 43

Choosing q (Inserts Only)

• Expected Phase 1 (conversion) time, with M the current sample size, M′ the new target size, and cost constants t_a, t_b:

$E[T_1] = t_a\,|D|\,\ln\!\left(\frac{|D| - M}{|D| - \min(|D|q,\, M')}\right)$

• Expected Phase 2 (drifting) time:

$E[T_2] = t_b\,\frac{M' - |D|q}{q}$

• Choose q to minimize E[T1] + E[T2] (a grid-search sketch follows)
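A sketch of the optimization by simple grid search (illustrative; assumes M < M′ < |D| and the cost model above):

```python
import math

def choose_q(D, M, M_new, t_a, t_b, steps=10_000):
    """Minimize E[T1] + E[T2] over q in (M/|D|, M_new/|D|)."""
    def cost(q):
        t1 = t_a * D * math.log((D - M) / (D - min(D * q, M_new)))
        t2 = t_b * (M_new - D * q) / q
        return t1 + t2
    lo, hi = M / D, M_new / D
    qs = (lo + (hi - lo) * i / steps for i in range(1, steps))
    return min(qs, key=cost)
```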

Page 44

Resizing Behavior

• Example (dependence on base-access cost):
  – Resize by 30% if the sampling fraction drops below 9%
  – Behavior depends on the cost of accessing the base data:
    • Low costs ⇒ immediate resizing
    • Moderate costs ⇒ combined solution
    • High costs ⇒ degenerates to Bernoulli sampling

Page 45

Choosing q (w. Deletes)

• Simple approach (insert probability p > 0.5)
  – Expected change in partition size per transaction (Phase 2):
    • (p)(+1) + (1 − p)(−1) = 2p − 1
  – So scale the Phase 2 cost by 1/(2p − 1)
• More sophisticated approach
  – Hitting time of a Markov chain to the boundary
  – Stochastic approximation algorithm (modified Kiefer-Wolfowitz)

Page 46

The RPMerge Algorithm

• Conceptually: defer deletions until after the merge
• Generate the Yi's directly
  – Can assume that the deletions happen after the insertions

(Diagram.) Worked example of RPMerge: the uncompensated deletions d1 and d2 of the two inputs are deferred, the insertion-only samples S1+ and S2+ of the random-pairing runs RP(M1) and RP(M2) are merged, and the pairing counts Xi, Yi for the deferred deletions are generated directly to produce the merged sample S.

Page 47

New Maintenance Method

• Idea: use tracking counters
  – After the j-th transaction, the augmented sample Sj is
    Sj = { (Xj(t), Yj(t)) : t ∈ T and Xj(t) > 0 }
    • Xj(t) = frequency of item t in the sample
    • Yj(t) = net number of insertions of t into R since t joined the sample

(Diagram.) The dataset holds Nj(t) copies of item t; the sample holds Xj(t) copies. Insertion of t: insert t into the sample with probability q. Deletion of t: delete t from the sample with probability (Xj(t) − 1) / (Yj(t) − 1). A sketch of these rules follows.
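A literal sketch of the two transition rules above (illustrative; corner cases such as the last tracked copy are handled naively here, and the paper's subsampling and merging results are not covered):

```python
import random

def process(transactions, q):
    """Maintain (X[t], Y[t]) tracking-counter pairs for a Bernoulli(q)
    multiset sample: X[t] = copies of t in the sample, Y[t] = net
    insertions of t since t joined the sample."""
    X, Y = {}, {}
    for op, t in transactions:          # e.g., ('+', 'a') or ('-', 'a')
        if op == '+':
            if t in X:                  # t already tracked
                Y[t] += 1
                if random.random() < q:
                    X[t] += 1
            elif random.random() < q:   # t joins the sample
                X[t], Y[t] = 1, 1
        elif t in X:                    # deletion of a tracked item
            if Y[t] > 1:
                if random.random() < (X[t] - 1) / (Y[t] - 1):
                    X[t] -= 1
                Y[t] -= 1
            else:                       # last tracked copy removed
                del X[t], Y[t]
    return X, Y
```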

Page 48

Frequency Estimation

• Naïve (Horvitz-Thompson) unbiased estimator: $\hat N_i^X = X_i / q$
• Exploit the tracking counter:

$\hat N_i^Y = \begin{cases} Y_i - 1 + 1/q & \text{if } Y_i > 0 \\ 0 & \text{if } Y_i = 0 \end{cases}$

• Theorem: $E[\hat N_i^Y] = N_i$ and $V[\hat N_i^Y] \le V[\hat N_i^X]$
• Can extend to other aggregates (see paper)

Page 49

Estimating Distinct-Value Counts

• If the usual DV estimators are unavailable (BH+07)
• Obtain S′ from S: insert t ∈ D(S) with probability

$p(t) = \begin{cases} 1 & \text{if } Y(t) = 1 \\ q & \text{if } Y(t) > 1 \end{cases}$

• Can show: P(t ∈ S′) = q for t ∈ D(R)
• HT unbiased estimator: $\hat D_{HT} = |S'| / q$
• Improve via conditioning (Var[E[U|V]] ≤ Var[U]):

$\hat D_Y = E[\hat D_{HT} \mid S] = \sum_{t \in D(S)} p(t) / q$

Page 50

Estimating the DV Count

• Exact computation via sorting
  – Usually infeasible
• Sampling-based estimation
  – A very hard problem (needs large samples)
• Probabilistic counting schemes
  – Single-pass, bounded memory
  – Several flavors (mostly bit-vector synopses)
    • Linear counting (ASW87)
    • Logarithmic counting (FM85, WVT90, AMS, DF03)
    • Sample counting (ASW87, Gi01, BJKST02)

Page 51

Intuition

• Look at spacings
  – Example with k = 4 and D = 7: seven hashed values on [0,1], with spacings V1, V2, V3, V4 between the four smallest
  – E[V] ≈ 1/D, so that D ≈ 1/E[V]
  – Estimate D as 1 / Avg(V1,…,Vk)
  – I.e., as k / Sum(V1,…,Vk)
  – I.e., as k / U(k)
  – Upward bias (Jensen's inequality), so change k to k − 1