Helping Kinsey Compute

Cynthia Dwork, Microsoft Research


Transcript of Helping Kinsey Compute

Page 1: Helping Kinsey Compute

Helping Kinsey Compute

Cynthia Dwork

Microsoft Research


Page 2: Helping Kinsey Compute

The Problem

• Exploit Data, e.g., Medical Insurance Database

— Does smoking contribute to heart disease?

— Was there a rise in asthma emergency room cases this month?

— What fraction of the admissions during 2004 were men 25-35?

• …while preserving privacy of individuals

Page 3: Helping Kinsey Compute

Holistic Statistics

• Is the dataset well clustered?

• What is the single best predictor for risk of stroke?

• How are attributes X and Y correlated; what is the cov(X,Y)?

• Are the data inherently low-dimensional?

Page 4: Helping Kinsey Compute

Statistical Database

Query (f, S)

f: row → [0,1]

S ⊆ [n]

Database (d1, …, dn)

Exact answer: ∑r ∈ S f( row r )

Response: exact answer + noise

Page 5: Helping Kinsey Compute

Statistical Database

Response: exact answer + noise

Under control of interlocutor: noise generation; number of queries T permitted

Page 6: Helping Kinsey Compute

Why Bother With Noise?

Limiting interface to queries about large sets is insufficient:

A = {1, … , n} and B = {2, … , n}

∑a ∈ A f(row a) − ∑b ∈ B f(row b) = f(row 1)
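The differencing attack on the two large sets takes only a few lines; a toy sketch with made-up data, where `sum_query` is a hypothetical noiseless interface:

```python
import numpy as np

# Toy database of secret bits, one per individual (made-up values).
rows = np.array([1, 0, 1, 1, 0, 1, 0, 1])
n = len(rows)

def sum_query(subset):
    """Exact sum query over a subset of row indices (no noise added)."""
    return int(rows[list(subset)].sum())

A = set(range(n))     # {1, ..., n} in 0-based indexing
B = set(range(1, n))  # {2, ..., n}

# Both sets are large, yet the difference of the answers is row 1's bit.
leaked = sum_query(A) - sum_query(B)
```

This is why the response must carry noise: a noiseless interface, however restricted to large sets, admits such subtractions.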

Page 7: Helping Kinsey Compute

Previous (Modern) Work in this Model

• Dinur, Nissim [2003]

Single binary attribute (query function f = identity)

Non-privacy: whp adversary guesses a 1 − o(1) fraction of the rows

— Theorem: Polytime non-privacy if whp |noise| is o(√n)

— Theorem: Privacy with o(√n) noise if #queries is << n

• Privacy “for free” !

Rows ≈ samples from underlying distribution: Pr[row i = 1] = p

E[# 1’s] = pn, Var = Θ(n)

Actual # 1’s ≈ pn ± Θ(√n)

|Privacy-preserving noise| is o(sampling error)

Page 8: Helping Kinsey Compute

Real Power in this Model

• Dwork, Nissim [2004]: Multiple binary attributes

q = (S, f), f: {0,1}^d → {0,1}

—Definition of privacy appropriate to enriched query set

—Theorem: Privacy with o(√n) noise if #queries is << n

—Coined term SuLQ

• Vertically Partitioned Databases

— Learn joint statistics from independently operated SuLQ databases:

• Given SuLQ_A and SuLQ_B, learn if A implies B in probability

• Eg, heart disease risk increases with smoking

• Enables learning statistics for all Boolean fns of attributes

Page 9: Helping Kinsey Compute

Still More Power [Blum, Dwork, McSherry, Nissim 05]

• Extend Privacy Proofs

— Real-valued functions f: [0,1]^d → [0,1]

— Per row analysis: drop dependence on n!

• How many queries has THIS row participated in?

• Our Data, Ourselves

• Holistic Statistics: A Calculus of Noisy Computation

— Beyond statistics:

• (not too) noisy versions of k-means, perceptron, ID3 algs

• (not too) noisy optimal projections SVD, PCA

• All of STAT learning

Page 10: Helping Kinsey Compute

Towards Defining Privacy: “Facts of Life” vs Privacy Breach

• Diabetes is more likely in obese persons

— Does not imply THIS obese person has or will have diabetes

• Sneaker color preference is correlated with political party

— Does not imply THIS person in red sneakers is a Republican

• Half of all marriages result in divorce

— Does not imply Pr [ THIS marriage will fail ] = ½

Page 11: Helping Kinsey Compute

(ε, δ, T)-Privacy

Power of adversary:

• Phase 0: Specify a goal function g: row → {0,1} (actually, a polynomial number of functions)

Adversary will try to learn this information about someone

• Phase 1: Adaptively make T queries

• Phase 2: Choose a row i to attack; get entire database except for row i

Privacy Breach: Occurs if adversary’s “confidence” in g( row i ) changes by more than ε

Notes:

• Adversary chooses goal

• My privacy is preserved even if everybody else tells their secrets to the adversary

Page 12: Helping Kinsey Compute

Flavor of Privacy Proofs

• Define confidence in value of g( row i )

— c0 = log [p0/(1-p0)]

— 0 when p0 = ½; skyrockets as p0 moves toward 0 or 1

• Model evolution of confidence as a martingale

— Argue expected difference at each step is small

— Compute absolute upper bound on difference

— Plug these two parameters into Azuma’s inequality

Obtain probabilistic statement regarding change in confidence, equivalently, change from prior to posterior probabilities about value of g( row i )
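The confidence measure driving the martingale argument is just the log-odds of the adversary's belief; a minimal numeric illustration with made-up probabilities:

```python
import math

def confidence(p):
    """Log-odds log(p / (1 - p)): 0 at p = 1/2, unbounded as p -> 0 or 1."""
    return math.log(p / (1 - p))

# Confidence skyrockets as the adversary's belief approaches certainty.
values = [confidence(p) for p in (0.5, 0.9, 0.99, 0.999)]
```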


Page 13: Helping Kinsey Compute

Remainder of This Talk

• Description of SuLQ Algorithm + Statement of Main Theorem

• Examples

— k means

— SVD, PCA

— Perceptron

— STAT learning

• Vertically Partitioned Data

— Determining if α ⇒ β in probability: Pr[β|α] ≥ Pr[β] + Δ, when α and β are in different SuLQ databases

• Summary

Page 14: Helping Kinsey Compute

The SuLQ Algorithm

• Algorithm:

— Input: query (S ⊆ [n], f: [0,1]^d → [0,1])

— Output: ∑i ∈ S f( row i ) + N(0, R)

• Theorem: ∀ ε, δ, with probability at least 1 − δ, choosing

R > 32 log(2/δ) log(T/δ) T / ε² ensures that for each (target, predicate) pair, after T queries the probability that the confidence has increased by more than ε is at most δ.

• R is independent of n. Bigger n means better stats.
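A toy sketch of the mechanism (the data and noise variance R are illustrative stand-ins, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sulq_query(db, subset, f, R):
    """Noisy-sum primitive: sum of f(row) over the subset plus N(0, R).

    R is the noise variance; per the theorem it scales with the query
    budget T, not with the database size n.
    """
    exact = sum(f(db[i]) for i in subset)
    return exact + rng.normal(0.0, np.sqrt(R))

# Toy database with a single attribute in [0,1] per row.
db = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
answer = sulq_query(db, range(len(db)), lambda row: row, R=2.0)
```

With R fixed and n growing, the relative error of such counts shrinks, which is the sense in which "bigger n means better stats."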

Page 15: Helping Kinsey Compute

k Means Clustering

Used across physics, OR, machine learning, data mining, etc.

Page 16: Helping Kinsey Compute

SuLQ k Means

• Estimate size of each cluster

• Estimate average of points in cluster

— Estimate their sum; and

— Divide estimated sum by estimated size

Page 17: Helping Kinsey Compute

Side by Side: k Means and SuLQ k-Means

Basic step:

• Input: data points p1,…,pn and k ‘centers’ c1,…,ck in [0,1]d

• Sj = points for which cj is the closest center

• Output: c’j = average of points in Sj, j=1, … k

Basic step:

• Input: data points p1,…,pn and k ‘centers’ c1,…,ck in [0,1]d

• sj = SuLQ( f(di) := 1 if j = arg minj’ ||cj’ − di||, 0 otherwise )

• µ’j = SuLQ( f(di) := di if j = arg minj’ ||cj’ − di||, 0 otherwise ) / sj

k(1+d) queries total
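One noisy basic step can be sketched as follows (toy data; `sulq` is a stand-in for the noisy-sum primitive, with the k size queries and k·d coordinate-sum queries of the slide):

```python
import numpy as np

rng = np.random.default_rng(1)

def sulq(values, R):
    """Stand-in noisy-sum primitive: per-row values summed plus N(0, R)."""
    total = np.sum(values, axis=0)
    return total + rng.normal(0.0, np.sqrt(R), size=np.shape(total))

def sulq_kmeans_step(points, centers, R):
    """One noisy update: k size queries plus k*d coordinate-sum queries."""
    k = centers.shape[0]
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    assign = np.argmin(dist, axis=1)         # nearest center per point
    new_centers = np.empty_like(centers)
    for j in range(k):
        mask = (assign == j).astype(float)
        s_j = sulq(mask, R)                      # noisy cluster size
        sum_j = sulq(points * mask[:, None], R)  # noisy coordinate sums
        new_centers[j] = sum_j / s_j             # estimated cluster mean
    return new_centers

points = rng.random((200, 2))                    # made-up data in [0,1]^2
centers = np.array([[0.25, 0.25], [0.75, 0.75]])
updated = sulq_kmeans_step(points, centers, R=0.5)
```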

Page 18: Helping Kinsey Compute

Small Error!

For each 1 ≤ j ≤ k, if |Sj| >> R^1/2 then with high probability ||µ’j − c’j|| is O( (||µj|| + d^1/2) · R^1/2 / |Sj| ).

• Inaccuracies:

— Estimating |Sj|

— Summing points in Sj

• Even with just the first:

(1/sj − 1/|Sj|) ∑i ∈ Sj di

= (1/sj − 1/|Sj|) (µj |Sj|)

= ((|Sj| − sj)/sj) µj ≈ (noise/size) µj

Page 19: Helping Kinsey Compute

Reducing Dimensionality

• Reduce Dimensionality in a dataset while retaining those characteristics that contribute most to its variance

• Find Optimal Linear Projections

— Latent semantic indexing, spectral clustering, etc., employ best rank k approximations to A

• Singular Value Decomposition uses top k eigenvectors of ATA

• Principal Component Analysis uses top k eigenvectors of cov(A)

• Approach

— Approximate ATA and cov(A) using SuLQ, then compute eigenvectors

Page 20: Helping Kinsey Compute

Optimal Projections

• A^T A = ∑i di^T di

• µ = (∑i di)/n

• cov(A) = ∑i (di − µ)^T (di − µ)

• SuLQ( f(i) = di^T di ) = A^T A + N(0, R)^{d × d}

• µ’ = SuLQ( f(i) = di )/n

• SuLQ( f(i) = (di − µ’)^T (di − µ’) )

d² and d² + d queries, respectively
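A sketch of the Gram-matrix route (toy data; the entrywise noise stands in for the d² SuLQ queries, one per matrix entry):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data: 500 rows in d = 3 dimensions.
A = rng.random((500, 3))

# Noisy A^T A: each of the d^2 entries gets independent N(0, R) noise,
# standing in for one SuLQ query per matrix entry.
R = 1.0
outer = np.einsum('ni,nj->nij', A, A)      # per-row outer products d_i^T d_i
noisy_gram = outer.sum(axis=0) + rng.normal(0.0, np.sqrt(R), size=(3, 3))

# Symmetrize, then read off the top-k eigenvectors (SVD directions).
sym = (noisy_gram + noisy_gram.T) / 2
eigvals, eigvecs = np.linalg.eigh(sym)     # ascending eigenvalue order
top2 = eigvecs[:, ::-1][:, :2]             # top-2 principal directions
```

Because the entries of A^T A grow with n while the noise variance does not, the noisy eigenvectors approach the true ones as n grows.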

Page 21: Helping Kinsey Compute

Perceptron [Rosenblatt 57]

• Input: n points p1, …, pn in [−1,1]^d, and labels b1, …, bn in {−1,1}

— Assumed linearly separable, with a plane through the origin

• Initialize w randomly

• ⟨w, p⟩ b > 0 iff label b agrees with sign of ⟨w, p⟩

• While ∃ labeled point (pi, bi) s.t. ⟨w, pi⟩ bi ≤ 0, set w = w + pi bi

• Output: w


Page 22: Helping Kinsey Compute

SuLQ Perceptron

• Initialize w = 0^d and s = n.

Repeat while s >> R^1/2:

• Count the misclassified rows (1 query):

s = SuLQ( f(di) := 1 if ⟨di, w⟩ bi ≤ 0 and 0 otherwise )

• Synthesize a misclassified vector (d queries):

v = SuLQ( f(di) := bi di if ⟨di, w⟩ bi ≤ 0 and 0 otherwise ) / s

• Update w:

Set w = w + v

Return the final value of w.
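An end-to-end toy sketch (made-up separable data with a comfortable margin; `sulq` stands in for the noisy-sum primitive, and the stopping threshold is an illustrative stand-in for "while s >> R^1/2"):

```python
import numpy as np

rng = np.random.default_rng(3)

def sulq(values, R):
    """Stand-in noisy-sum primitive: per-row values summed plus N(0, R)."""
    total = np.sum(values, axis=0)
    return total + rng.normal(0.0, np.sqrt(R), size=np.shape(total))

def sulq_perceptron(X, b, R, max_rounds=50):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_rounds):
        mis = ((X @ w) * b <= 0).astype(float)   # misclassified indicator
        s = sulq(mis, R)                         # noisy misclassified count
        if s <= 5 * np.sqrt(R):                  # count is at noise level: stop
            break
        # Average misclassified direction, synthesized from d noisy sums.
        v = sulq(X * (b * mis)[:, None], R) / s
        w = w + v
    return w

# Made-up linearly separable data: label = sign of first coordinate,
# with a margin enforced by discarding points near the boundary.
X = rng.uniform(-1, 1, size=(800, 2))
X = X[np.abs(X[:, 0]) > 0.3]
b = np.sign(X[:, 0])
w = sulq_perceptron(X, b, R=0.25)
```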

Page 23: Helping Kinsey Compute

How Many Rounds?

Theorem: If there exists a unit vector w’ and scalar γ such that for all i, ⟨w’, di⟩ bi ≥ γ, and for all j, γ >> (dR)^1/2 / |Sj|, then with high probability the algorithm terminates in at most 32 maxi |di|² / γ² rounds.

|Sj| = number of misclassified vectors at iteration j

In each round j, ⟨w’, w⟩ increases by more than |w| does. Since ⟨w’, w⟩ ≤ |w’| · |w| = |w|, this must stop; otherwise ⟨w’, w⟩ would overtake |w|.

Page 24: Helping Kinsey Compute

The Statistical Queries Learning Model

[Kearns93]

• Concept c: {0,1}^d → {0,1}

• Distribution D on {0,1}^d

• STAT(c, D) Oracle

— Query: (p, τ) where p: {0,1}^{d+1} → {0,1} and τ = 1/poly(d)

— Answer: Pr x∼D [ p(x, c(x)) ] + η for |η| ≤ τ

Page 25: Helping Kinsey Compute

Capturing STAT

Each row contains a labeled example (x, c(x))

Input: predicate p and accuracy τ

• Initialize tally = 0.

• Reduce variance:

Repeat t ≥ R / (τn)² times

tally = tally + SuLQ(f(di) := p(di))

Output: tally / tn
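The variance-reduction loop can be sketched directly (the data and the parameters R and τ are illustrative; `sulq_count` stands in for one noisy count query):

```python
import numpy as np

rng = np.random.default_rng(4)

def sulq_count(bits, R):
    """Stand-in noisy count: true sum plus N(0, R)."""
    return bits.sum() + rng.normal(0.0, np.sqrt(R))

# Rows hold the bit p(x, c(x)); we estimate Pr[p] to accuracy ~ tau.
bits = rng.integers(0, 2, size=1000).astype(float)
n, R, tau = len(bits), 100.0, 0.05

# Averaging t >= R / (tau * n)^2 repeated queries drives the standard
# deviation of the noise on the final estimate below tau.
t = max(1, int(np.ceil(R / (tau * n) ** 2)))
tally = sum(sulq_count(bits, R) for _ in range(t))
estimate = tally / (t * n)
```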

Page 26: Helping Kinsey Compute

Capturing STAT

Theorem: For any algorithm that δ-learns a class C using at most q statistical queries of accuracies {τ1, …, τq}, the adapted algorithm can δ-learn C on a SuLQ database of n elements, provided that

n² ≥ ( R log(q/δ) / (T − q) ) × ∑j ≤ q 1/τj²

Page 27: Helping Kinsey Compute

Probabilistic Implication: Two SuLQ Databases

α implies β in probability: Pr[β|α] ≥ Pr[β] + Δ

• Construct a tester for distinguishing Δ < Δ1 from Δ > Δ2 (for constants Δ1 < Δ2)

— Estimate Δ by binary search

• In the analysis we consider deviations from an expected value, of magnitude Θ(√n)

— As the perturbation is << √n, it does not mask out these deviations

• Results generalize to functions α and β of attributes in two distinct SuLQ databases

Page 28: Helping Kinsey Compute

Key Insight: Test for Pr[β|α] ≥ Pr[β] + Δ

Assume T chosen so that noise = o(√n).

1. Find a “heavy” set S for α: a subset of rows that have more than |S| a + [a(1 − a)|S|]^1/2 ones in the α column. Here, a = Pr[α] and |S| = Θ(n).

Find S s.t. aS,α > |S| a + √( |S| a(1 − a) ).

Let excess = aS,α − |S| a. Note that excess is Ω(n^1/2).

2. Query the SuLQ database for β on S:

If aS,β ≥ |S| Pr[β] + excess · ( Δ / (1 − a) ) then return 1 else return 0

If Δ is constant then the noise is too small to hide the correlation.
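A toy simulation of the two-step tester (the attribute frequencies, the lift, the heavy-set construction, and the threshold constant Δ1 = 0.1 are all made up for illustration; Pr[α] and Pr[β] are taken as known):

```python
import numpy as np

rng = np.random.default_rng(5)

n = 10_000
# Synthetic attributes: alpha lifts beta's rate from 0.3 to 0.6.
alpha = rng.random(n) < 0.4
beta = np.where(alpha, rng.random(n) < 0.6, rng.random(n) < 0.3)

def noisy_sum(bits, subset, R=25.0):
    """Stand-in SuLQ query with noise far below sqrt(n)."""
    return bits[subset].sum() + rng.normal(0.0, np.sqrt(R))

a = alpha.mean()        # stand-in for Pr[alpha]
b = beta.mean()         # stand-in for Pr[beta]

# Step 1: a set S that is heavy for alpha (here simply all alpha rows
# plus filler), with excess alpha-count far above sqrt(|S|).
S = np.argsort(alpha)[-(n // 2):]
excess = noisy_sum(alpha.astype(float), S) - len(S) * a

# Step 2: is S correspondingly heavy for beta? Threshold uses Delta_1 = 0.1.
threshold = len(S) * b + excess * 0.1 / (1 - a)
heavy_for_beta = bool(noisy_sum(beta.astype(float), S) >= threshold)
```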

Page 29: Helping Kinsey Compute

Summary

• SuLQ framework for privacy-preserving statistical databases

— real-valued query functions

— Variance for noise depends (roughly linearly) on number of queries, not size of database

• Examples of power of SuLQ calculus

• Vertically Partitioned Databases

Page 30: Helping Kinsey Compute

Sources

• C. Dwork and K. Nissim,

Privacy-Preserving Datamining on Vertically Partitioned Databases

• A. Blum, C. Dwork, F. McSherry, and K. Nissim,

Practical Privacy: The SuLQ Framework

• See http://research.microsoft.com/research/sv/DabasePrivacy