Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB=...

45
1 Privacy in Statistical Databases CSE 598D/STAT 598B Fall 2007 Lecture 2, 9/13/2007 Aleksandra Slavkovic Office hours: MW 3:30-4:30 Office: Thomas 412 Phone: x3-4918 Adam Smith Office hours: Mondays 3-5pm Office: IST 338K Phone: x3-0076

Transcript of Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB=...

Page 1: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

1

Privacy in Statistical DatabasesCSE 598D/STAT 598B Fall 2007

Lecture 2, 9/13/2007

Aleksandra SlavkovicOffice hours: MW 3:30-4:30

Office: Thomas 412Phone: x3-4918

Adam Smith Office hours: Mondays 3-5pm

Office: IST 338KPhone: x3-0076

Page 2: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

Reading Papers• You will have to read several different types of papers

For next week: 2 non-technical papers, 1 technical, T.B.D.

We will post the papers on Angel and email you

• You have to submit 2 or 3 questions/critiques about each

paper on Angel, before noon (12 p.m.) the day of lecture

Questions are meant to encourage discussion (and to get you to read the papers), e.g.:

• “This paper seems to assume A. What if ...?”

• “What happens if we apply this technique to problem X? I foresee problem Y and cool outcome Z.”

• “The methodology does not support claim W because of reasons B,C,D.”

Good questions ⇒ better discussions ⇒ more fun (and better grades) 2

Page 3: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

When you read a paper...• Read the abstract, intro, skim through the paper

What’s the topic, high-level structure?

• Start again, more slowly Look for the main thesis / claims / theorems

How are these supported? (arguments, experiments, simulation, proofs)

Read through arguments / proofs / etc

• Think critically Try to reformulate claims, arguments, etc in your own words

Are the proofs correct?

Do the experiments support the claims?

Do the results/claims support the conclusions?

Are they solving the right problem?

(Papers are not always good...)

• Take notes! Scribble in the margins! I find I read more carefully with printed paper.

3

Page 4: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

4

Privacy in Statistical DatabasesServer/agencyIndividuals Users

x1

x2

xn

... Aqueries

answers

)( Government,researchers,businesses

(or) Maliciousadversary

• What information can be released?

• Two conflicting goals

Utility: Users can extract “global” statistics

Privacy: Individual information stays hidden

• How can these be made precise?

(How context-dependent must they be?)

Page 5: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

5

Two Important ModelsServer/agencyIndividuals Users

x1

x2

xn

... Aqueries

answers

)( Government,researchers,businesses

(or) Maliciousadversary

• Noninteractive (“Anonymization”)

Publish data once and for all

Tables, graphs, study results

• Interactive (“Query auditing”)

Same thing, but prompted by queries from users

User can get tailored information

Understanding exactly what is leaked is difficult!

Page 6: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

What distinguishes the “CS” approach?• Not a single approach, but...

1. Automation

2. Large data sets

3. Abstraction

6

Page 7: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

This class / course

• Some methods people use

• Some definitions of what they are trying to achieve

7

Page 8: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

Terminology

8

• Algorithm

Well-defined procedure: recipe, computer program,

method for solving a Rubik’s cube

Running time: How many steps does the algorithm require?

• Typically we care only about how the running time scales with the

input size

Algorithmic Problem = inputs + acceptable outputs

• Polynomial-time

n = Input “size”, T(n) = running time on inputs of size n

Polynomial time: T(n) < nc for some constant c

Page 9: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

This class / course

• Some methods people use

• Some definitions of what they are trying to achieve

9

Page 10: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

Naive anonymization• Just remove obviously identifying information

• Names, SSN, street address, etc

• Problem: combining innocuous attributes, e.g. [Sweeney] 5-digit Zip, date of birth, and gender uniquely

identify 80+% of US population

[NY Times] AOL search results released for academic research (www.aolpsycho.com)

[Narayanan & Shmatikov] Netflix users re-identified based on movie tastes

[BDK’07] Re-identifiying users of social networks

10

Page 11: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

Naive anonymization

11

Page 12: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

Naive anonymization

12

Page 13: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

Examples of sanitization methods• Summary statistics

Means, variances Marginal totals (# people with blue eyes and brown hair)

• Generalization & suppression For some individuals, release only high-order bits, e.g. Age=22 vs

Age ∈ [20-29]

Different ways of grouping people• Clustering-based approaches

Break dataset into groups (of some minimal size) Release summary statistics for each group (e.g. k-anonymity)

13

Page 14: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

Examples of sanitization methods• Input perturbation: Change data before processing

E.g. Randomized response• flip each bit of table with probability p

Add Gaussian noise Data swapping

• Output perturbation Summary statistics with noise

• Interactive versions of above: “Auditor” decides which queries are OK and/or type of noise

• Cryptographic techniques Distribute computations among untrusted parties Given a program A, everyone learns output A(x1,x2, ..., xn) but nothing else These techniques do NOT answer the question “What algorithms A

preserve privacy”?

14

Page 15: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

15

How can we formalize “privacy”?• Different people mean different things

• Pin it down mathematically?

Goal #1: Rigor Prove clear theorems about privacy

• Few exist in literature

Make clear (and refutable) conjectures Sleep better at night

Goal #2: Interesting science (New) Computational phenomenon Algorithmic problems Statistical problems

Page 16: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

15

How can we formalize “privacy”?• Different people mean different things

• Pin it down mathematically?

Goal #1: Rigor Prove clear theorems about privacy

• Few exist in literature

Make clear (and refutable) conjectures Sleep better at night

Goal #2: Interesting science (New) Computational phenomenon Algorithmic problems Statistical problems

Can we approach privacy scientifically?

• Pin down social concept• No perfect definition?• But lots of place for rigor• Too late?

Page 17: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

Two Intuitions for Data Privacy• “If the release of statistics S makes it possible to

determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place.” [Dalenius] Learning more about me should be hard

• Privacy is “protection from being brought to the attention of others.” [Gavison] Safety is blending into a crowd

16

Page 18: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

17

Why not use crypto definitions?

Page 19: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

17

Why not use crypto definitions?• Cryptography successfully defined concepts

such asencryption “zero-knowledge” proofssecure function evaluation

Page 20: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

17

Why not use crypto definitions?• Cryptography successfully defined concepts

such asencryption “zero-knowledge” proofssecure function evaluation

Page 21: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

17

Why not use crypto definitions?• Cryptography successfully defined concepts

such asencryption “zero-knowledge” proofssecure function evaluation

• We will cover some of this in a later class

Page 22: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

17

Why not use crypto definitions?• Cryptography successfully defined concepts

such asencryption “zero-knowledge” proofssecure function evaluation

• We will cover some of this in a later class

Page 23: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

17

Why not use crypto definitions?• Cryptography successfully defined concepts

such asencryption “zero-knowledge” proofssecure function evaluation

• We will cover some of this in a later class

• Can we just use these definitions?

Page 24: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

18

Why not use crypto definitions?

Page 25: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

18

Why not use crypto definitions?• Attempt #1:

Def’n: For every entry i, no information about xi is leaked (as if encrypted)

Problem: no information at all is revealed!Tradeoff privacy vs utility

Page 26: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

18

Why not use crypto definitions?• Attempt #1:

Def’n: For every entry i, no information about xi is leaked (as if encrypted)

Problem: no information at all is revealed!Tradeoff privacy vs utility

• Attempt #2:Agree on summary statistics f(DB) that are safeDef’n: No information except f(DB)Problem: why is f(DB) safe to release?Tautology trap (Also: how do you figure out what f is?)

C

C

CC

C

C

Page 27: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

19

Why not use crypto definitions?

Page 28: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

19

Why not use crypto definitions?• Problem: Crypto makes sense in settings where the

line between “inside” and “outside” is well-definedE.g. psychologist:

• “inside” = psychologist and patient• “outside” = everyone else

Page 29: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

19

Why not use crypto definitions?• Problem: Crypto makes sense in settings where the

line between “inside” and “outside” is well-definedE.g. psychologist:

• “inside” = psychologist and patient• “outside” = everyone else

Page 30: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

19

Why not use crypto definitions?• Problem: Crypto makes sense in settings where the

line between “inside” and “outside” is well-definedE.g. psychologist:

• “inside” = psychologist and patient• “outside” = everyone else

• Statistical databases: fuzzy line between inside and outside

Page 31: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

20

xn

xn-1

Mx3

x2

x1

DB=

Adversary A

San

query 1answer 1

query Tanswer T

M

random coins¢ ¢ ¢

Straw man #1: Exact Disclosure

Page 32: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

20

xn

xn-1

Mx3

x2

x1

DB=

Adversary A

San

query 1answer 1

query Tanswer T

M

random coins¢ ¢ ¢

Straw man #1: Exact Disclosure

• Def’n: safe if adversary cannot learn any entry exactly

Page 33: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

20

xn

xn-1

Mx3

x2

x1

DB=

Adversary A

San

query 1answer 1

query Tanswer T

M

random coins¢ ¢ ¢

Straw man #1: Exact Disclosure

• Def’n: safe if adversary cannot learn any entry exactly

Page 34: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

20

xn

xn-1

Mx3

x2

x1

DB=

Adversary A

San

query 1answer 1

query Tanswer T

M

random coins¢ ¢ ¢

Straw man #1: Exact Disclosure

• Def’n: safe if adversary cannot learn any entry exactly leads to nice (but hard) combinatorial problems

Page 35: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

20

xn

xn-1

Mx3

x2

x1

DB=

Adversary A

San

query 1answer 1

query Tanswer T

M

random coins¢ ¢ ¢

Straw man #1: Exact Disclosure

• Def’n: safe if adversary cannot learn any entry exactly leads to nice (but hard) combinatorial problems Does not preclude learning value with 99% certainty or narrowing

down to a small interval

Page 36: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

20

xn

xn-1

Mx3

x2

x1

DB=

Adversary A

San

query 1answer 1

query Tanswer T

M

random coins¢ ¢ ¢

Straw man #1: Exact Disclosure

• Def’n: safe if adversary cannot learn any entry exactly leads to nice (but hard) combinatorial problems Does not preclude learning value with 99% certainty or narrowing

down to a small interval

• Historically:

Page 37: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

20

xn

xn-1

Mx3

x2

x1

DB=

Adversary A

San

query 1answer 1

query Tanswer T

M

random coins¢ ¢ ¢

Straw man #1: Exact Disclosure

• Def’n: safe if adversary cannot learn any entry exactly leads to nice (but hard) combinatorial problems Does not preclude learning value with 99% certainty or narrowing

down to a small interval

• Historically: Focus: auditing interactive queries

Page 38: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

20

xn

xn-1

Mx3

x2

x1

DB=

Adversary A

San

query 1answer 1

query Tanswer T

M

random coins¢ ¢ ¢

Straw man #1: Exact Disclosure

• Def’n: safe if adversary cannot learn any entry exactly leads to nice (but hard) combinatorial problems Does not preclude learning value with 99% certainty or narrowing

down to a small interval

• Historically: Focus: auditing interactive queries Difficulty: understanding relationships between queries

Page 39: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

20

xn

xn-1

Mx3

x2

x1

DB=

Adversary A

San

query 1answer 1

query Tanswer T

M

random coins¢ ¢ ¢

Straw man #1: Exact Disclosure

• Def’n: safe if adversary cannot learn any entry exactly leads to nice (but hard) combinatorial problems Does not preclude learning value with 99% certainty or narrowing

down to a small interval

• Historically: Focus: auditing interactive queries Difficulty: understanding relationships between queries E.g. two queries with small difference

Page 40: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

21

Straw man #2: Learning the distribution

Page 41: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

21

Straw man #2: Learning the distribution

• Assume x1,…,xn are drawn i.i.d. from unknown

distribution

Page 42: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

21

Straw man #2: Learning the distribution

• Assume x1,…,xn are drawn i.i.d. from unknown

distribution

• Def’n: San is safe if it only reveals distribution

Page 43: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

21

Straw man #2: Learning the distribution

• Assume x1,…,xn are drawn i.i.d. from unknown

distribution

• Def’n: San is safe if it only reveals distribution• Implied approach:

learn the distribution

release description of distrib

or re-sample points from distrib

Page 44: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

21

Straw man #2: Learning the distribution

• Assume x1,…,xn are drawn i.i.d. from unknown

distribution

• Def’n: San is safe if it only reveals distribution• Implied approach:

learn the distribution

release description of distrib

or re-sample points from distrib

• Problem: tautology trap estimate of distrib. depends on data… why is it safe?

Page 45: Privacy in Statistical Databasesads22/courses/privacy598d/www/lec... · n x n-1 M x 3 x 2 x 1 DB= Adversary A San query answer 11 query answer T T M random coins ¢ ¢ ¢ Straw man

More to come in future lectures!

22