Foundations of Privacy Lecture 4 Lecturer: Moni Naor.

31
Foundations of Privacy Lecture 4 Lecturer: Moni Naor

Transcript of Foundations of Privacy Lecture 4 Lecturer: Moni Naor.

Foundations of Privacy

Lecture 4

Lecturer: Moni Naor

Recap of last week’s lecture• Differential Privacy• Sensitivity:

– Global sensitivity of query q:Un→Rd

GSq = maxD,D’ ||q(D) – q(D’)||1

– Local sensitivity of query q at point DLSq(D)= maxD’ |q(D) – q(D’)|

– Smooth sensitivity

Sf*(X)= maxY {LSf(Y)e- dist(x,y) }

• Histograms• Differential privacy of median

• Exponential Mechanism

Histograms• Inputs x1, x2, ..., xn in domain U

Domain U partitioned into d disjoint bins S1,…,Sd

q(x1, x2, ..., xn) = (n1, n2, ..., nd) where

nj = #{i : xi in j-th bin}

Can view as d queries: qi counts # spoints in set Si

For adjacent D,D’, only one answer can change - it can change by 1

Global sensitivity of answer vector is 1Sufficient to add Lap(1/ε) noise to each

query, still get ε-privacy

The Exponential Mechanism [McSherry Talwar]

A general mechanism that yields • Differential privacy• May yield utility/approximation• Is defined and evaluated by considering all possible answers

The definition does not yield an efficient way of evaluating it

Application/original motivation: Approximate truthfulness of auctions

• Collusion resistance• Compatibility

Side bar: Digital Goods Auction

• Some product with 0 cost of production• n individuals with valuation v1, v2, … vn

• Auctioneer wants to maximize profit

Example of the Exponential Mechanism• Data: xi = website visited by student i today• Range: Y = {website names}• For each name y, let q(y, X) = #{i : xi = y}

Goal: output the most frequently visited site• Procedure: Given X, Output website y with probability

prop to eq(y,X)

• Popular sites exponentially more likely than rare ones Website scores don’t change too quickly

Size of subset

Setting• For input D 2 UUnn want to find r2RR• Base measure on R - R - usually uniform• Score function q’: UUn n ££ R R R

assigns any pair (D,r) a real value– Want to maximize it (approximately)

The exponential mechanism– Assign output r2RR with probability proportional to

eq’(D,r) (r)

Normalizing factor r eq’(D,r) (r)

The exponential mechanism is private

• Let = maxD,D’,r |q(D,r)-q(D’,r)|

Claim: The exponential mechanism yields a 2¢¢ differentially private solution

• Prob [output = r on input D]

= eq’(D,r) (r)/r eq’(D,r) (r)• Prob [output = r on input D’]

= eq’(D’,r) (r)/r eq’(D’,r) (r)

adjacent

Ratio isbounded by

e e

Laplace Noise as Exponential Mechanism

• On query q:Un→R let q’(D,r) = -|q(D)-r|

• Prob noise = y e-y / 2 y e-y = /2 e-y

Laplace distribution Y=Lap(b) has density function

Pr[Y=y] =1/2b e-|y|/b

y

0 1 2 3 4 5-1-2-3-4

Any Differentially Private Mechanism is an instance of the Exponential Mechanism

• Let M be a differentially private mechanism

Take q’(D,r) to be log Prob[M(D) =r]

Remaining issue: Accuracy

Private Ranking• Each element i 2 {1, … n} has a real valued score

SD(i) based on a data set D.• Goal: Output k elements with highest scores.• Privacy• Data set D consists of n entries in domain D.

– Differential privacy: Protects privacy of entries in D.

• Condition: Insensitive Scores– for any element i, for any data sets D, D’ that differ in one

entry:|SD(i)- SD’(i)| · 1

Approximate ranking

• Let Sk be the kth highest score based on data set D.• An output list is -useful if:

Soundness: No element in the output has score less than Sk -

Completeness: Every element with score greater than Sk + is in the output.

Score · Sk -

Sk + · Score

Sk - · Score · Sk +

Two Approaches• Score perturbation

– Perturb the scores of the elements with noise – Pick the top k elements in terms of noisy scores.– Fast and simple implementation Question: what sort of noise should be added?

What sort of guarantees?

• Exponential sampling– Run the exponential mechanism k times.– more complicated and slower implementationWhat sort of guarantees?

Each input affects all scores

Homework

Exponential Mechanism: Simple Example (almost free) private lunch

Database of n individuals, lunch options {1…k},each individual likes or dislikes each option (1 or 0)

Goal: output a lunch option that many likeFor each lunch option j2 [k], ℓ(j) is # of ind. who like jExponential Mechanism:

Output j with probability eεℓ(j)

Actual probability: eεℓ(j)/(∑i eεℓ(i))

Normalizer

query 1,query 2,. . .

Synthetic DB: Output is a DB

Database

answer 1answer 3

answer 2

?

Sanitizer

Synthetic DB: output also a DB (of entries from same universe X), user reconstructs answers by evaluating query on output DB

Software and people compatibleConsistent answers

Answering More QueriesUsing exponential mechanism

Differential Privacy for every set C of counting queries• Error is Õ(n2/3 log|C|)

Remarkable

Hope for rich private analysis of small DBs!• Quantitative: #queries >> DB size,

• Qualitative: output of sanitizer -synthetic DB-output is a DB itself

Counting Queries

• Queries with low sensitivity

Counting-queriesC is a set of predicates c: U {0,1}Query: how many D participants satisfy c ?

Relaxed accuracy: answer query within α additive error w.h.pNot so bad: error anyway inherent in statistical analysis

Assume all queries given in advance

U

Database D of size n

Query c

Non-interactive

Utility and Privacy Can’t Always Be Achieved Simultaneously

Impossibility results for counting queries:DB with n participants can’t have o(√n) error, O(n) queries

[DiNi, DwMcTa07,DwYe08]In all these cases, strong privacy violation

What can we do? almost entire DB compromised

Huge DBs [Dwork Nissim]

DB of size n >> # queries |C|:

Add independent noise to answer on every query

Noise per query ~ #queriesFor accuracy, need #queries ≤ n

May be reasonable for hugehuge internet-scale DBs,Privacy “for free”

What about smaller DBs?

DB of size n < #queries |C|, impossibility results:

can’t have o(√n) error

Error must be Ω(√n)

The BLR Algorithm

For DBs F and Ddist(F,D) = maxq2C |q(F) – q(D)|

Intuition: far away DBs get smaller probability

Algorithm on input DB D:Sample from a distribution on DBs of size m: (m < n)

DB F gets picked w.p. / e-ε·dist(F,D)

Blum Ligett Roth08

The BLR Algorithm

Idea:• In general: Do not use large DB

– Sample and answer accordingly• DB of size m guaranteeing hitting each query with

sufficient accuracy

The BLR Algorithm: 2ε-Privacy

For adjacent D,D’ for every F|dist(F,D) – dist(F,D’)| ≤ 1

Probability of F by D: e-ε·dist(F,D)/∑G of size m e-ε·dist(G,D)

Probability of F by D’:numerator and denominator can change by eε-factor 2ε-privacy

Algorithm on input DB D:Sample from a distribution on DBs of size m: (m < n)

DB F gets picked w.p. / e-ε·dist(F,D)

The BLR Algorithm: Error Õ(n2/3 log|C|)

There exists Fgood of size m =Õ((n\α)2·log|C|) s.t. dist(Fgood,D) ≤ α

Pr [Fgood] ~ e-εα

For any Fbad with dist 2α, Pr [Fbad] ~ e-2εα

Union bound: ∑ bad DB Fbad Pr [Fbad] ~ |U|me-2εα

For α=Õ(n2/3log|C|), Pr [Fgood] >> ∑ Pr [Fbad]

Algorithm on input DB D:Sample from a distribution on DBs of size m: (m < n)

DB F gets picked w.p. / e-ε·dist(F,D)

The BLR Algorithm: Running Time

Generating the distribution by enumeration:Need to enumerate every size-m database,where m = Õ((n\α)2·log|C|)

Running time ≈ |U|Õ((n\α)2·log|c|)

Algorithm on input DB D:Sample from a distribution on DBs of size m: (m < n)

DB F gets picked w.p. / e-ε·dist(F,D)

Conclusion

Offline algorithm, 2ε-Differential Privacy for anyset C of counting queries

• Error α is Õ(n2/3 log|C|/ε)

• Super-poly running time: |U|Õ((n\α)2·log|C|)

Can we Efficiently Sanitize?

The good news

If the universe is small, Can sanitize EFFICIENTLY

The bad news cannot do much better, namely sanitize in time:

sub-poly(|C|) AND sub-poly(|U|)

Time poly(|C|,|U|)

How Efficiently Can We Sanitize?

|C|

|U|subpol

ypoly

subpoly

poly

?

Good news!

?

? ?

The Good News: Can Sanitize When Universe is Small

Efficient Sanitizer for query set C• DB size n ¸ Õ(|C|o(1) log|U|)• error is ~ n2/3 • Runtime poly(|C|,|U|)

Output is a synthetic database

Compare to [Blum Ligget Roth]:

n ¸ Õ(log|C| log|U|), runtime super-poly(|C|,|U|)

Recursive Algorithm

C0=C C1 C2 Cb

Start with DB D and large query set CRepeatedly choose random subset Ci+1 of Ci:

shrink query set by (small) factor

Recursive Algorithm

Start with DB D and large query set CRepeatedly choose random subset Ci+1 of Ci:

shrink query set by (small) factorEnd recursion: sanitize D w.r.t. small query set Cb

Output is good for all queries in small set Ci+1

Extract utility on almost-all queries in large set Ci

Fix remaining “underprivileged” queries in large set Ci

C0=C C1 C2 Cb