Transcript of Privacy in Statistical Databases
Privacy in Statistical Databases
CSE 598D/STAT 598B, Fall 2007
Lecture 2, 9/13/2007

Aleksandra Slavkovic. Office hours: MW 3:30-4:30. Office: Thomas 412. Phone: x3-4918.
Adam Smith. Office hours: Mondays 3-5pm. Office: IST 338K. Phone: x3-0076.
Reading Papers
• You will have to read several different types of papers
  For next week: 2 non-technical papers, 1 technical, T.B.D.
  We will post the papers on Angel and email you
• You have to submit 2 or 3 questions/critiques about each paper on Angel, before noon (12 p.m.) on the day of lecture
  Questions are meant to encourage discussion (and to get you to read the papers), e.g.:
  • "This paper seems to assume A. What if ...?"
  • "What happens if we apply this technique to problem X? I foresee problem Y and cool outcome Z."
  • "The methodology does not support claim W because of reasons B, C, D."
Good questions ⇒ better discussions ⇒ more fun (and better grades)
When you read a paper...
• Read the abstract and intro, then skim through the paper
  What is the topic? What is the high-level structure?
• Start again, more slowly
  Look for the main thesis / claims / theorems
  How are these supported? (arguments, experiments, simulations, proofs)
  Read through the arguments / proofs / etc.
• Think critically
  Try to reformulate claims, arguments, etc. in your own words
  Are the proofs correct?
  Do the experiments support the claims?
  Do the results/claims support the conclusions?
  Are they solving the right problem?
  (Papers are not always good...)
• Take notes! Scribble in the margins! I find I read more carefully with printed paper.
Privacy in Statistical Databases
[Diagram: individuals contribute records x1, x2, ..., xn to a server/agency, which answers queries from users: government, researchers, businesses (or a malicious adversary)]
• What information can be released?
• Two conflicting goals
  Utility: users can extract "global" statistics
  Privacy: individual information stays hidden
• How can these be made precise?
  (How context-dependent must they be?)
Two Important Models
[Diagram: same setting as above: individuals x1, ..., xn, a server/agency answering queries from users or a malicious adversary]
• Noninteractive ("Anonymization")
  Publish data once and for all
  Tables, graphs, study results
• Interactive ("Query auditing")
  Same thing, but prompted by queries from users
  Users can get tailored information
  Understanding exactly what is leaked is difficult!
What distinguishes the "CS" approach?
• Not a single approach, but...
  1. Automation
  2. Large data sets
  3. Abstraction
This class / course
• Some methods people use
• Some definitions of what they are trying to achieve
Terminology
• Algorithm
  A well-defined procedure: a recipe, a computer program, a method for solving a Rubik's cube
  Running time: how many steps does the algorithm require?
  • Typically we care only about how the running time scales with the input size
  Algorithmic problem = inputs + acceptable outputs
• Polynomial time
  n = input "size"; T(n) = running time on inputs of size n
  Polynomial time: T(n) < n^c for some constant c
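As a rough illustration of "how running time scales with input size" (the function names and step counts below are my own, not from the slides), here is a sketch comparing a linear scan with a quadratic all-pairs comparison; both are polynomial time, with c = 1 and c = 2 respectively:

```python
def linear_scan_steps(n):
    """Count the comparisons made by a single pass over n items: T(n) = n."""
    steps = 0
    for _ in range(n):
        steps += 1
    return steps

def pairwise_steps(n):
    """Count the comparisons made by checking every pair: T(n) = n*(n-1)/2."""
    steps = 0
    for i in range(n):
        for j in range(i + 1, n):
            steps += 1
    return steps

# Doubling n doubles the first count but roughly quadruples the second.
for n in (10, 100, 1000):
    print(n, linear_scan_steps(n), pairwise_steps(n))
```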
Naive anonymization
• Just remove obviously identifying information
  Names, SSN, street address, etc.
• Problem: combining innocuous attributes can identify people, e.g.:
  [Sweeney] 5-digit ZIP code, date of birth, and gender uniquely identify 80+% of the US population
  [NY Times] AOL search results released for academic research (www.aolpsycho.com)
  [Narayanan & Shmatikov] Netflix users re-identified based on movie tastes
  [BDK'07] Re-identifying users of social networks
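The attacks above are linkage attacks: join a "de-identified" table with a public one on shared quasi-identifiers. A minimal sketch with made-up records (the data, field names, and the `link` helper are hypothetical illustrations, not from any of the cited studies):

```python
# Hypothetical "anonymized" medical records: names removed, quasi-identifiers kept.
medical = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1962-01-12", "sex": "M", "diagnosis": "asthma"},
]

# Hypothetical public voter roll: names *and* the same quasi-identifiers.
voters = [
    {"name": "Alice Smith", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "Bob Jones",   "zip": "02139", "dob": "1962-01-12", "sex": "M"},
]

def link(medical, voters):
    """Re-identify records by joining on the (zip, dob, sex) quasi-identifier."""
    key = lambda r: (r["zip"], r["dob"], r["sex"])
    by_key = {key(v): v["name"] for v in voters}
    return {by_key[key(m)]: m["diagnosis"] for m in medical if key(m) in by_key}

# Every "anonymized" record is re-identified, despite names being removed.
print(link(medical, voters))
```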
Examples of sanitization methods
• Summary statistics
  Means, variances
  Marginal totals (e.g., # of people with blue eyes and brown hair)
• Generalization & suppression
  For some individuals, release only high-order bits, e.g. Age = 22 vs. Age ∈ [20-29]
  Different ways of grouping people
• Clustering-based approaches
  Break the dataset into groups (of some minimal size)
  Release summary statistics for each group (e.g. k-anonymity)
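Generalization can be sketched as mapping each exact value to a coarser range, i.e. keeping only the "high-order bits". The bucket width and the ZIP-suppression helper below are my own illustration:

```python
def generalize_age(age, width=10):
    """Replace an exact age with a decade bucket, e.g. 22 -> '[20-29]'."""
    low = (age // width) * width
    return f"[{low}-{low + width - 1}]"

def generalize_zip(zip_code, keep=3):
    """Suppress the low-order digits of a ZIP code, e.g. '02138' -> '021**'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

print(generalize_age(22))       # '[20-29]'
print(generalize_zip("02138"))  # '021**'
```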
Examples of sanitization methods
• Input perturbation: change the data before processing
  E.g. randomized response: flip each bit of the table with probability p
  Add Gaussian noise
  Data swapping
• Output perturbation
  Summary statistics with noise
• Interactive versions of the above: an "auditor" decides which queries are OK and/or the type of noise
• Cryptographic techniques
  Distribute computations among untrusted parties
  Given a program A, everyone learns the output A(x1, x2, ..., xn) but nothing else
  These techniques do NOT answer the question "What algorithms A preserve privacy?"
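Randomized response, mentioned above, can be sketched as follows. Each respondent flips their true bit with probability p, and the analyst can still recover the population proportion: if q is the observed mean, then E[q] = true·(1-p) + (1-true)·p, so true = (q - p)/(1 - 2p). The code itself is my own illustration, not from the slides:

```python
import random

def randomized_response(bit, p=0.25):
    """Flip the true bit with probability p before it leaves the respondent."""
    return 1 - bit if random.random() < p else bit

def estimate_true_mean(noisy_bits, p=0.25):
    """Unbiased estimate of the true proportion of 1s from the noisy bits."""
    q = sum(noisy_bits) / len(noisy_bits)
    return (q - p) / (1 - 2 * p)

random.seed(0)
true_bits = [1] * 300 + [0] * 700           # true proportion of 1s: 0.3
noisy = [randomized_response(b) for b in true_bits]
print(round(estimate_true_mean(noisy), 2))  # close to 0.3
```

No single noisy bit reveals its owner's true value with certainty, yet the aggregate statistic is still recoverable: a first concrete instance of the privacy/utility tradeoff.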
How can we formalize "privacy"?
• Different people mean different things
• Pin it down mathematically?
  Goal #1: Rigor
    Prove clear theorems about privacy
    • Few exist in the literature
    Make clear (and refutable) conjectures
    Sleep better at night
  Goal #2: Interesting science
    (New) computational phenomena
    Algorithmic problems
    Statistical problems
Can we approach privacy scientifically?
• Pin down a social concept
• No perfect definition?
• But lots of room for rigor
• Too late?
Two Intuitions for Data Privacy
• "If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place." [Dalenius]
  Learning more about me should be hard
• Privacy is "protection from being brought to the attention of others." [Gavison]
  Safety is blending into a crowd
Why not use crypto definitions?
• Cryptography successfully defined concepts such as encryption, "zero-knowledge" proofs, and secure function evaluation
• We will cover some of this in a later class
• Can we just use these definitions?
Why not use crypto definitions?
• Attempt #1:
  Def'n: for every entry i, no information about xi is leaked (as if encrypted)
  Problem: no information at all is revealed!
  Tradeoff: privacy vs. utility
• Attempt #2:
  Agree on summary statistics f(DB) that are safe
  Def'n: no information except f(DB) is leaked
  Problem: why is f(DB) safe to release?
  Tautology trap (also: how do you figure out what f is?)
Why not use crypto definitions?
• Problem: crypto makes sense in settings where the line between "inside" and "outside" is well-defined
  E.g. a psychologist:
  • "inside" = psychologist and patient
  • "outside" = everyone else
• Statistical databases: fuzzy line between inside and outside
Straw man #1: Exact Disclosure
[Diagram: database DB = (x1, ..., xn); a sanitizer "San" with random coins sits between DB and adversary A, who issues queries 1, ..., T and receives answers 1, ..., T]
• Def'n: safe if the adversary cannot learn any entry exactly
  Leads to nice (but hard) combinatorial problems
  Does not preclude learning a value with 99% certainty, or narrowing it down to a small interval
• Historically:
  Focus: auditing interactive queries
  Difficulty: understanding relationships between queries
  E.g. two queries with small difference
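"Two queries with small difference" is the classic differencing attack: each answer is an innocuous-looking aggregate, yet their difference pins down one entry exactly. A minimal sketch (the data and query sets are my own illustration):

```python
# Hypothetical database: each entry is one person's private bit.
db = {"alice": 1, "bob": 0, "carol": 1, "dave": 0, "eve": 1}

def sum_query(db, names):
    """Answer an aggregate query exactly: how many people in `names` have bit 1."""
    return sum(db[n] for n in names)

# Each query covers several people, and neither reveals any entry by itself...
q1 = sum_query(db, ["alice", "bob", "carol", "dave", "eve"])  # everyone
q2 = sum_query(db, ["alice", "bob", "carol", "dave"])         # everyone but Eve

# ...but their difference is exactly Eve's private value.
print(q1 - q2)  # 1
```

This is why auditing must reason about relationships *between* queries, not just each query in isolation.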
Straw man #2: Learning the distribution
• Assume x1, ..., xn are drawn i.i.d. from an unknown distribution
• Def'n: San is safe if it only reveals the distribution
• Implied approach:
  Learn the distribution
  Release a description of the distribution, or re-sample points from it
• Problem: tautology trap
  The estimate of the distribution depends on the data... why is it safe?
More to come in future lectures!
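The implied approach can be sketched concretely: fit a parametric model to the private data, then release fresh samples from it. The Gaussian model and the data below are my own illustration; note that the fitted parameters are themselves functions of the private records, which is exactly the tautology-trap worry on the slide:

```python
import random
import statistics

def sanitize_by_resampling(xs, m):
    """Fit a Gaussian to the private data, then release m fresh samples.

    Caveat (the "tautology trap"): mu and sigma are computed from the
    private data, so the released samples still carry information about it.
    """
    mu = statistics.fmean(xs)
    sigma = statistics.stdev(xs)
    return [random.gauss(mu, sigma) for _ in range(m)]

random.seed(0)
private = [41.0, 52.5, 47.2, 39.8, 55.1, 48.6]
released = sanitize_by_resampling(private, 100)
print(len(released))  # 100 synthetic points, none of them a real record
```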