Calibrating Noise to Sensitivity in Private Data Analysis


Page 1: Calibrating Noise to Sensitivity in Private Data Analysis

Kobbi Nissim (BGU)

With Cynthia Dwork, Frank McSherry, Adam Smith, Enav Weinreb

Page 2: The Setting

[Diagram: a database x = (x₁, x₂, …, xₙ) ∈ Dⁿ (n rows, each drawn from a domain D) sits behind a sanitizer San. Users (government, researchers, marketers, …) send queries and receive answers.]

Honest user: "I just want to learn a few harmless global statistics."
The worry: "Can I combine these to learn some private info?"

Page 3: What is Privacy?

Clearly we cannot undo the harm done by others. Can we minimize the additional harm while providing utility?

Goal: whether or not I contribute my data does not affect my privacy.

Page 4: Output Perturbation

[Diagram: the database x = (x₁, …, xₙ) and a function f; San returns f(x) + noise, using random coins.]

San controls:
- which functions f
- the kind of perturbation

Page 5: When Can I Release f(x) Accurately?

Intuition: global information is "insensitive" to individual data and is safe.

f(x₁,…,xₙ) is sensitive if changing a few entries can drastically change its value.

Page 6: Talk Outline

- A framework for output perturbation based on "sensitivity"
  - Formalize "sensitivity" and relate it to privacy definitions
  - Examples of sensitivity-based analysis
  - New ideas
- Basic models for privacy
  - Local vs. global
  - Noninteractive vs. interactive

Page 7: Related Work

Relevant work in statistics, data mining, computer security, and databases. Largely: no precise definitions or analysis of privacy.

Recently: a foundational approach [DN03, EGS03, DN04, BDMN05, KMN05, CDMSW05, CDMT05, MS06, CM06, …]

This work extends [DN03, DN04, BDMN05].

Page 8: Privacy as Indistinguishability

[Diagram: two databases, x = (x₁, x₂, x₃, …, xₙ) and x' = (x₁, x₂', x₃, …, xₙ), differing in one row. Each is run through San with its random coins, yielding a transcript of queries and answers: transcript T(x) and transcript T(x').]

Requirement: when x and x' differ in 1 row, the two transcript distributions are at "distance" < ε.

Page 9: ε-Indistinguishability

A sanitizer is ε-indistinguishable if for all pairs x, x' ∈ Dⁿ which differ on at most one entry, for all adversaries A, and for all transcripts t:

    Pr[T_A(x) = t] / Pr[T_A(x') = t] ≤ e^ε

Page 10: Semantically Flavored Definitions

Indistinguishability is easy to work with, but it does not directly say what the adversary can do or learn.

"Ideal" semantic definition: the adversary does not change his beliefs about me. Problem: dependencies, e.g. in the form of side information. Say you know that I am 20 pounds heavier than the average Israeli: you will learn my weight from the census results, whether or not I participate.

Ways to get around this:
- Assume "independence" of X₁,…,Xₙ [DN03, DN04, BDMN05]
- Compare "what A knows now" vs. "what A would have learned anyway" [DM]

Page 11: Incremental Risk

Suppose the adversary has prior "beliefs" about x: a probability distribution, r.v. X = (X₁,…,Xₙ).

Given transcript t, the adversary updates his "beliefs" according to Bayes' rule, obtaining a new distribution Xᵢ' | T(X) = t.

Page 12: Incremental Risk

Two options:
- I participate in the census (input = X)
- I do not participate (input Yᵢ = X₁,…,Xᵢ₋₁, *, Xᵢ₊₁,…,Xₙ)

Privacy: whether I participate or not does not significantly influence the adversary's posterior beliefs. For all transcripts t, for all i:

    Xᵢ' | T(X) = t  ≈  Xᵢ' | T(Yᵢ) = t

"Proof": indistinguishability guarantees that the two Bayesian updates are the same to within a 1 ± ε factor.

Adversary: "Bugger! It's the same whether you participate or not."

Page 13: Recall – ε-Indistinguishability

For all pairs x, x' ∈ Dⁿ s.t. dist(x, x') = 1, and for all transcripts t:

    Pr[T_A(x) = t] / Pr[T_A(x') = t] ≤ e^ε

Page 14: An Example – Sum Queries

[Diagram: the database x ∈ [0,1]ⁿ behind San (with random coins). A user asks "Please let me know f_A(x) = Σ_{i∈A} xᵢ" and receives f_A(x) + noise.]

Page 15: Sum Queries – Answering a Query

x ∈ [0,1]ⁿ, f_A(x) = Σ_{i∈A} xᵢ. Sum queries can be used as a basis for other tasks: clustering, learning, classification, … [BDMN05]

Answer: Σ_{i∈A} xᵢ + Y, where Y ∼ Lap(1/ε). The Laplace distribution has density h(y) ∝ e^{−ε|y|}.

Note: |f_A(x) − f_A(x')| ≤ 1 for x, x' differing in one entry.
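To make the mechanism concrete, here is a minimal Python sketch of answering a sum query with Laplace noise; the helper name noisy_sum and the NumPy-based sampling are illustrative choices, not part of the talk.

```python
import numpy as np

def noisy_sum(x, eps, subset=None, rng=None):
    """Answer the sum query f_A(x) = sum_{i in A} x_i with Lap(1/eps) noise.

    Assumes every x_i lies in [0,1], so the query has sensitivity 1."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    total = x.sum() if subset is None else x[list(subset)].sum()
    # Laplace noise with scale 1/eps: density h(y) proportional to exp(-eps*|y|).
    return total + rng.laplace(loc=0.0, scale=1.0 / eps)

# Example: 1000 values in [0,1], answered with eps = 0.1.
x = np.random.default_rng(0).uniform(size=1000)
print(noisy_sum(x, eps=0.1))  # true sum plus noise of typical size ~10
```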

Page 16: Sum Queries – Proof of ε-Indistinguishability

Property of Lap: for all x, y: h(x)/h(y) ≤ e^{ε|x−y|}.

    Pr[T(x) = t]  ∝ e^{−ε|f_A(x) − t|}
    Pr[T(x') = t] ∝ e^{−ε|f_A(x') − t|}

    Pr[T(x) = t] / Pr[T(x') = t] ≤ e^{ε|f_A(x) − f_A(x')|} ≤ e^ε

since max_{dist(x,x')=1} |f_A(x) − f_A(x')| = 1.

[Diagram: the two Laplace densities, centered at f(x) and f(x').]

Page 17: Calibrating Noise to Sensitivity in Private Data Analysis

We chose noise magnitude to cover for max |f(x)-f(x’)|

Sensitivity Sf = max ||f(x)-f(x’)||1

Local Sensitivity LSf(x) = max ||f(x)-f(x’)||1

Sensitivity

f

xn

xn-1

xn

xn-1x3

x2

x1

Sanx=

¢¢ ¢

x3 San

x’=¢ ¢ ¢

x2’x1

f(x) + noise

f

f(x’) + noise

dist(x,x’)=1

dist(x,x’)=1

Page 18: Calibrating Noise to Sensitivity

[Diagram: the database x ∈ Dⁿ behind San (with random coins). The user asks "Please let me know f(x)" and receives f(x) + Lap(S_f/ε).]

Noise density: h(y) ∝ e^{−(ε/S_f)||y||₁}.
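A minimal sketch of the general mechanism, assuming the global sensitivity S_f is already known; the function name laplace_mechanism is an illustrative choice.

```python
import numpy as np

def laplace_mechanism(f, x, sensitivity, eps, rng=None):
    """Release f(x) + Lap(sensitivity/eps) noise, added per coordinate.

    `sensitivity` must bound ||f(x) - f(x')||_1 over all pairs of
    databases x, x' that differ in a single entry."""
    rng = np.random.default_rng() if rng is None else rng
    value = np.asarray(f(x), dtype=float)
    return value + rng.laplace(loc=0.0, scale=sensitivity / eps, size=value.shape)

# Example: the mean of n values in [0,1] has sensitivity 1/n.
x = np.random.default_rng(1).uniform(size=1000)
print(laplace_mechanism(lambda v: v.mean(), x, sensitivity=1 / len(x), eps=0.1))
```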

Page 19: Calibrating Noise to Sensitivity – Why Does it Work?

S_f = max_{dist(x,x')=1} ||f(x) − f(x')||₁, and the noise density is h(y) ∝ e^{−(ε/S_f)||y||₁}.

Property of Lap: for all x, y: h(x)/h(y) ≤ e^{(ε/S_f)||x−y||₁}. Hence for dist(x, x') = 1:

    Pr[T(x) = t] / Pr[T(x') = t] ≤ e^{(ε/S_f)||f(x) − f(x')||₁} ≤ e^ε

Page 20: Main Result

Theorem: If a user U is limited to T adaptive queries, each of sensitivity S_f, then ε-indistinguishability holds if iid noise Lap(S_f·T/ε) is added to the query answers. (A sketch follows below.)

The same idea works with other metrics and noise distributions.

Which useful functions are insensitive? Arguably, all useful functions should be insensitive: statistical conclusions should not depend on small variations in the data.
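A small sketch of the theorem in use, for sum queries over [0,1]ⁿ (sensitivity 1 each): with a budget of T adaptive queries, each answer gets Lap(T/ε) noise. The closure-based interface below is an illustrative assumption, not the talk's.

```python
import numpy as np

def interactive_sanitizer(x, eps, T, rng=None):
    """Return a query-answering function for up to T adaptive sum queries.

    Each sum query over x in [0,1]^n has sensitivity 1, so by the
    theorem, adding Lap(T/eps) noise per answer gives
    eps-indistinguishability for the whole transcript."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    budget = {"answered": 0}
    def answer(subset):
        if budget["answered"] >= T:
            raise RuntimeError("query budget exhausted")
        budget["answered"] += 1
        return x[list(subset)].sum() + rng.laplace(scale=T / eps)
    return answer

ask = interactive_sanitizer(np.ones(100), eps=0.1, T=3)
print(ask(range(50)), ask(range(25)))  # each answer carries Lap(30) noise
```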

Page 21: Using Insensitive Functions

Strategies:
- Use the theorem: output f(x) + Lap(S_f/ε)
  - But S_f may be hard to analyze/compute
  - And S_f may be high even for functions considered 'insensitive'
- Express f in terms of insensitive functions
  - The resulting noise then depends on the input (in form and magnitude)

Page 22: Example – Expressing f in Terms of Insensitive Functions

x ∈ {0,1}ⁿ, f(x) = (Σᵢ xᵢ)².

Querying f directly: S_f = n² − (n−1)² = 2n − 1, so a_f = (Σᵢ xᵢ)² + Lap(2n/ε). If f(x) ≪ n, the noise dominates.

However, f(x) = (g(x))² where g(x) = Σᵢ xᵢ and S_g = 1. Better to query for g:
- Get a_g = Σᵢ xᵢ + Lap(1/ε)
- Estimate f(x) as (a_g)² − (1/ε)²
- Taking ε constant results in stddev O(Σᵢ xᵢ)
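A short sketch contrasting the two strategies under the slide's assumptions (x ∈ {0,1}ⁿ; variable names are illustrative). Querying g and squaring is far more accurate when Σᵢ xᵢ ≪ n:

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 0.5
n = 10_000
x = np.zeros(n); x[:100] = 1        # sum = 100
true_f = x.sum() ** 2               # f(x) = (sum x_i)^2 = 10_000

# Strategy 1: query f directly; sensitivity 2n - 1, noise ~ Lap(2n/eps).
a_f = true_f + rng.laplace(scale=(2 * n - 1) / eps)

# Strategy 2: query g(x) = sum x_i (sensitivity 1), then square and debias.
a_g = x.sum() + rng.laplace(scale=1 / eps)
f_est = a_g ** 2 - (1 / eps) ** 2

print(true_f, round(a_f), round(f_est))  # a_f is swamped by noise of scale ~40_000
```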

Page 23: Useful Insensitive Functions

- Means, variances, … (with appropriate assumptions on the data)
- Histograms & contingency tables
- Singular value decomposition
- Distance to a property
- Functions with low query complexity

Page 24: Histograms/Contingency Tables

x₁,…,xₙ ∈ D, where D is partitioned into d disjoint bins b₁,…,b_d.

h(x) = (v₁,…,v_d) where v_j = |{i : xᵢ ∈ b_j}|. Then S_h = 2: changing one value xᵢ changes the count vector by ≤ 2 in L₁ norm, irrespective of d.

Add Laplace noise Lap(2/ε) to each count. (One could answer each count as a sum query, but treating h as a single sensitivity-2 query keeps the noise independent of d; see the sketch below.)
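A minimal sketch of the histogram mechanism, assuming integer-coded data; the helper name private_histogram is illustrative.

```python
import numpy as np

def private_histogram(data, num_bins, eps, rng=None):
    """Release all bin counts, each with Lap(2/eps) noise.

    Changing one record moves one unit of count between two bins, so the
    count vector has L1 sensitivity 2, independent of num_bins."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.bincount(data, minlength=num_bins).astype(float)
    return counts + rng.laplace(scale=2.0 / eps, size=num_bins)

# Example: 1000 records falling into 10 bins, eps = 0.5.
data = np.random.default_rng(3).integers(0, 10, size=1000)
print(np.round(private_histogram(data, 10, eps=0.5), 1))
```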

Page 25: Distance to a Property

Say P = a set of "good" databases. Distance to P = the minimum number of points in x that must be changed to bring x into P.

This always has sensitivity 1, so add Laplace noise Lap(1/ε).

Examples:
- Distance to being clusterable
- Weight of the minimum cut in a graph

[Diagram: a point x and its distance to the set P.]
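A tiny sketch with an illustrative property (not one from the talk): P = databases over {0,1}ⁿ containing at least k ones. Its distance function changes by at most 1 when one entry changes, so Lap(1/ε) noise suffices.

```python
import numpy as np

def noisy_distance_to_P(x, k, eps, rng=None):
    """Noisy distance to the illustrative P = {y in {0,1}^n : sum(y) >= k}.

    dist(x, P) = max(0, k - sum(x)) moves by at most 1 when a single
    entry of x changes, i.e. the query has sensitivity 1."""
    rng = np.random.default_rng() if rng is None else rng
    dist = max(0, k - int(np.sum(x)))
    return dist + rng.laplace(scale=1.0 / eps)

x = np.random.default_rng(4).integers(0, 2, size=100)
print(noisy_distance_to_P(x, k=60, eps=0.5))
```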

Page 26: Approximations with Low Query Complexity

Lemma: Assume algorithm A randomly samples δn points and Pr[A(x) ∈ f(x) ± λ] > (1+δ)/2. Then S_f ≤ 2λ.

Proof: Consider x, x' that differ on point i, and let Aᵢ be A conditioned on not choosing point i. Then:

    Pr[Aᵢ(x) ∈ f(x) ± λ] > 1/2
    Pr[Aᵢ(x') ∈ f(x') ± λ] > 1/2

Since x and x' agree on every point Aᵢ may sample, Aᵢ(x) and Aᵢ(x') are identically distributed; two events of probability > 1/2 under one distribution must overlap, so ∃ a point p within λ of both f(x) and f(x'). Hence S_f ≤ 2λ.

Page 27: Calibrating Noise to Sensitivity in Private Data Analysis

Local Sensitivity Median – typically insensitive, large (global)

sensitivity LSf(x) = max ||f(x)-f(x’)||1

Example: f(x) = min(xi, 10) where xi{0,1} LSf(x) = 1 if xi 10 and 0 otherwise

dist(x,x’)=1

10 n xi

Page 28: Local Sensitivity – First Attempt

Calibrate noise to LS_f(x): answer query f by f(x) + Lap(LS_f(x)/ε).

- If x₁ = … = x₁₀ = 1 and x₁₁ = … = xₙ = 0: answer = 10 + Lap(1/ε)
- If x₁ = … = x₁₁ = 1 and x₁₂ = … = xₙ = 0: answer = exactly 10

The noise magnitude itself may be disclosive! (A sketch of the leak follows.)
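A small sketch of the failure, under the slide's setup (helper names illustrative): the two neighboring databases yield visibly different answer distributions, so the scheme is not ε-indistinguishable.

```python
import numpy as np

def f(x):                         # f(x) = min(sum x_i, 10)
    return min(int(np.sum(x)), 10)

def local_sensitivity(x):         # 1 if sum(x) <= 10, else 0
    return 1 if np.sum(x) <= 10 else 0

def naive_answer(x, eps, rng):
    scale = local_sensitivity(x) / eps
    return f(x) + (rng.laplace(scale=scale) if scale > 0 else 0.0)

rng = np.random.default_rng(5)
n = 100
x1 = np.r_[np.ones(10), np.zeros(n - 10)]   # sum = 10: answers are noisy
x2 = np.r_[np.ones(11), np.zeros(n - 11)]   # sum = 11: answer is always exactly 10
print([round(naive_answer(x1, 0.5, rng), 2) for _ in range(3)])
print([round(naive_answer(x2, 0.5, rng), 2) for _ in range(3)])
# A single non-integer answer rules out x2, although dist(x1, x2) = 1.
```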

Page 29: How to Calibrate Noise to Local Sensitivity?

The noise magnitude at a point x must depend on LS_f(y) for all y ∈ Dⁿ:

    N*_f(x) = max_y ( LS_f(y) · e^{−ε·dist(x,y)} )

[Plot: the smoothed bound for the median, as a function of Σᵢ xᵢ; it decays gradually instead of dropping abruptly.]
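A sketch of computing N*_f for the running example f(x) = min(Σᵢ xᵢ, 10) over {0,1}ⁿ. Here LS_f(y) depends only on j = Σᵢ yᵢ, and a database with sum j is |k − j| single-entry changes away from one with sum k, so the max over all y collapses to a max over j (this reduction is specific to this example).

```python
import numpy as np

def smooth_noise_bound(k, n, eps):
    """N*_f(x) = max_y LS_f(y) * exp(-eps * dist(x, y)) for
    f(x) = min(sum x_i, 10), where k = sum(x).

    LS_f(y) = 1 iff sum(y) <= 10, and a database with sum j is
    reachable from x in |k - j| single-entry changes."""
    js = np.arange(n + 1)
    ls = (js <= 10).astype(float)
    return float(np.max(ls * np.exp(-eps * np.abs(k - js))))

for k in [5, 10, 11, 15, 30]:
    print(k, round(smooth_noise_bound(k, n=100, eps=0.5), 4))
# The bound fades as e^{-eps*(k-10)} for k > 10 instead of jumping to 0.
```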

Page 30: Talk Outline

- A framework for output perturbation based on "sensitivity"
  - Formalize "sensitivity" and relate it to privacy definitions
  - Examples of sensitivity-based analysis
  - New ideas
- Basic models for privacy
  - Local vs. global
  - Noninteractive vs. interactive

Page 31: Models for Data Privacy

[Diagram: individuals (You, Bob, Alice) contribute data via collection and sanitization, serving users (government, researchers, marketers, …).]

Page 32: Calibrating Noise to Sensitivity in Private Data Analysis

San

Models for Data Privacy – Local vs. Global

Local:

Global:

You

Bob

AliceCollection

and sanitization

Collection and

sanitization

San

San

San

You

Bob

Alice

Including “SFE”

Page 33: Models for Data Privacy – Interactive vs. Noninteractive

[Diagram:
- Interactive: users exchange queries and answers with the collection-and-sanitization service.
- Noninteractive: the sanitized data is published in one shot.]

Page 34: Models for Data Privacy – Summary

Local (vs. global):
- No central trusted party
- Individuals interact directly with the (untrusted) user
- Individuals control their own privacy

Noninteractive (vs. interactive):
- Easier distribution: web site, book, CD, …
- More secure: the data can be erased once it is processed

Almost all work in statistics and data mining is noninteractive!

Page 35: Four Basic Models

[Diagram: the four combinations – local/global × interactive/noninteractive – arranged by power. "Local, interactive" is marked "??", and some pairs are incomparable.]

Page 36: Interactive vs. Noninteractive

[Diagram: the four models again, highlighting the interactive vs. noninteractive comparisons.]

Page 37: Separating Interactive from Noninteractive

Random samples: one can compute estimates for many statistics, with (essentially) no need to decide upon the queries ahead of time. But a random sample is not private (unless the domain and the sample are small [CM06]).

Interaction gets the power of random samples, with privacy! E.g., sum queries f(x) = Σᵢ fᵢ(xᵢ), even chosen adaptively.

Noninteractive schemes seem weaker:
- Intuition: with privacy, one cannot answer all questions ahead of time (e.g. [DN03])
- Intuition: the sanitization must be tailored to specific functions

Page 38: Separating Interactive from Noninteractive

Theorem: If D = {0,1}^d, then for any private, noninteractive scheme, many sum queries cannot be learned, unless d = o(log n).

So the noninteractive model is weaker than the interactive one: it cannot emulate a random sample if the data is complex.

Page 39: Local vs. Global

[Diagram: the four models again, highlighting the local vs. global comparisons.]

Page 40: Separating Local from Global

Let D = {0,1}^d for d = Θ(log n), and view x as an n×d matrix.

- Global: rank(x) has sensitivity 1, so it can be released with low noise.
- Local: one cannot distinguish whether rank(x) = k or is much larger than k (for a suitable choice of d, n, k).

Page 41: To Sum Up

- Defined privacy in terms of indistinguishability
- Considered semantic versions of the definitions: "crypto" with non-negligible error
- Showed how to calibrate noise to sensitivity and to the number of queries
  - It seems that useful statistics should be insensitive
  - Some commonly used functions have low sensitivity
  - For the others – local sensitivity?
- Began to explore the relationships between the basic models

Page 42: Questions

- Which useful functions are insensitive? What would you like to compute?
- Can we get stronger results using:
  - Local sensitivity?
  - Computational assumptions? [MS06]
  - Entropy in the data?
- How to deal with small databases?
- Privacy in a broader context:
  - Rationalizing privacy and privacy-related decisions
  - Which types of privacy? How to decide upon the privacy parameters? …
- Handling rich data: audio, video, pictures, text, …