Transcript of Privacy in Statistical Databases
Privacy in Statistical Databases
CSE 598D/STAT 598B, Fall 2007
Lecture 2, 9/13/2007

Aleksandra Slavkovic. Office hours: MW 3:30-4:30. Office: Thomas 412. Phone: x3-4918.
Adam Smith. Office hours: Mondays 3-5pm. Office: IST 338K. Phone: x3-0076.
Reading Papers
• You will have to read several different types of papers
  For next week: 2 non-technical papers, 1 technical, T.B.D.
  We will post the papers on Angel and email you
• You have to submit 2 or 3 questions/critiques about each paper on Angel, before noon (12 p.m.) on the day of lecture
  Questions are meant to encourage discussion (and to get you to read the papers), e.g.:
  • "This paper seems to assume A. What if ...?"
  • "What happens if we apply this technique to problem X? I foresee problem Y and cool outcome Z."
  • "The methodology does not support claim W because of reasons B, C, D."
Good questions ⇒ better discussions ⇒ more fun (and better grades)
When you read a paper...
• Read the abstract and intro, then skim through the paper
  What is the topic? What is the high-level structure?
• Start again, more slowly
  Look for the main thesis / claims / theorems
  How are these supported? (arguments, experiments, simulations, proofs)
  Read through the arguments / proofs / etc.
• Think critically
  Try to reformulate claims, arguments, etc. in your own words
  Are the proofs correct?
  Do the experiments support the claims?
  Do the results/claims support the conclusions?
  Are they solving the right problem?
  (Papers are not always good...)
• Take notes! Scribble in the margins! I find I read more carefully with printed paper.
Privacy in Statistical Databases
[Diagram: individuals contribute records x1, x2, ..., xn to a server/agency, which answers queries from users: government, researchers, businesses (or a malicious adversary)]
• What information can be released?
• Two conflicting goals
  Utility: users can extract "global" statistics
  Privacy: individual information stays hidden
• How can these be made precise?
  (How context-dependent must they be?)
Two Important Models
[Diagram: same setting as above: individuals x1, ..., xn, a server/agency answering queries from users or a malicious adversary]
• Noninteractive ("Anonymization")
  Publish data once and for all
  Tables, graphs, study results
• Interactive ("Query auditing")
  Same thing, but prompted by queries from users
  Users can get tailored information
  Understanding exactly what is leaked is difficult!
What distinguishes the "CS" approach?
• Not a single approach, but...
  1. Automation
  2. Large data sets
  3. Abstraction
This class / course
• Some methods people use
• Some definitions of what they are trying to achieve
Terminology
• Algorithm
  A well-defined procedure: a recipe, a computer program, a method for solving a Rubik's cube
  Running time: how many steps does the algorithm require?
  • Typically we care only about how the running time scales with the input size
  Algorithmic problem = inputs + acceptable outputs
• Polynomial time
  n = input "size"; T(n) = running time on inputs of size n
  Polynomial time: T(n) < n^c for some constant c
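As a rough illustration of "how running time scales with input size" (the function names and step counts below are my own, not from the slides), here is a sketch comparing a linear scan with a quadratic all-pairs comparison; both are polynomial time, with c = 1 and c = 2 respectively:

```python
def linear_scan_steps(n):
    """Count the comparisons made by a single pass over n items: T(n) = n."""
    steps = 0
    for _ in range(n):
        steps += 1
    return steps

def pairwise_steps(n):
    """Count the comparisons made by checking every pair: T(n) = n*(n-1)/2."""
    steps = 0
    for i in range(n):
        for j in range(i + 1, n):
            steps += 1
    return steps

# Doubling n doubles the first count but roughly quadruples the second.
for n in (10, 100, 1000):
    print(n, linear_scan_steps(n), pairwise_steps(n))
```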
Naive anonymization
• Just remove obviously identifying information
  Names, SSN, street address, etc.
• Problem: combining innocuous attributes can identify people, e.g.:
  [Sweeney] 5-digit ZIP code, date of birth, and gender uniquely identify 80+% of the US population
  [NY Times] AOL search results released for academic research (www.aolpsycho.com)
  [Narayanan & Shmatikov] Netflix users re-identified based on movie tastes
  [BDK'07] Re-identifying users of social networks
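The attacks above are linkage attacks: join a "de-identified" table with a public one on shared quasi-identifiers. A minimal sketch with made-up records (the data, field names, and the `link` helper are hypothetical illustrations, not from any of the cited studies):

```python
# Hypothetical "anonymized" medical records: names removed, quasi-identifiers kept.
medical = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1962-01-12", "sex": "M", "diagnosis": "asthma"},
]

# Hypothetical public voter roll: names *and* the same quasi-identifiers.
voters = [
    {"name": "Alice Smith", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "Bob Jones",   "zip": "02139", "dob": "1962-01-12", "sex": "M"},
]

def link(medical, voters):
    """Re-identify records by joining on the (zip, dob, sex) quasi-identifier."""
    key = lambda r: (r["zip"], r["dob"], r["sex"])
    by_key = {key(v): v["name"] for v in voters}
    return {by_key[key(m)]: m["diagnosis"] for m in medical if key(m) in by_key}

# Every "anonymized" record is re-identified, despite names being removed.
print(link(medical, voters))
```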
Examples of sanitization methods
• Summary statistics
  Means, variances
  Marginal totals (e.g., # of people with blue eyes and brown hair)
• Generalization & suppression
  For some individuals, release only high-order bits, e.g. Age = 22 vs. Age ∈ [20-29]
  Different ways of grouping people
• Clustering-based approaches
  Break the dataset into groups (of some minimal size)
  Release summary statistics for each group (e.g. k-anonymity)
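Generalization can be sketched as mapping each exact value to a coarser range, i.e. keeping only the "high-order bits". The bucket width and the ZIP-suppression helper below are my own illustration:

```python
def generalize_age(age, width=10):
    """Replace an exact age with a decade bucket, e.g. 22 -> '[20-29]'."""
    low = (age // width) * width
    return f"[{low}-{low + width - 1}]"

def generalize_zip(zip_code, keep=3):
    """Suppress the low-order digits of a ZIP code, e.g. '02138' -> '021**'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

print(generalize_age(22))       # '[20-29]'
print(generalize_zip("02138"))  # '021**'
```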
Examples of sanitization methods
• Input perturbation: change the data before processing
  E.g. randomized response: flip each bit of the table with probability p
  Add Gaussian noise
  Data swapping
• Output perturbation
  Summary statistics with noise
• Interactive versions of the above: an "auditor" decides which queries are OK and/or the type of noise
• Cryptographic techniques
  Distribute computations among untrusted parties
  Given a program A, everyone learns the output A(x1, x2, ..., xn) but nothing else
  These techniques do NOT answer the question "What algorithms A preserve privacy?"
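Randomized response, mentioned above, can be sketched as follows. Each respondent flips their true bit with probability p, and the analyst can still recover the population proportion: if q is the observed mean, then E[q] = true·(1-p) + (1-true)·p, so true = (q - p)/(1 - 2p). The code itself is my own illustration, not from the slides:

```python
import random

def randomized_response(bit, p=0.25):
    """Flip the true bit with probability p before it leaves the respondent."""
    return 1 - bit if random.random() < p else bit

def estimate_true_mean(noisy_bits, p=0.25):
    """Unbiased estimate of the true proportion of 1s from the noisy bits."""
    q = sum(noisy_bits) / len(noisy_bits)
    return (q - p) / (1 - 2 * p)

random.seed(0)
true_bits = [1] * 300 + [0] * 700           # true proportion of 1s: 0.3
noisy = [randomized_response(b) for b in true_bits]
print(round(estimate_true_mean(noisy), 2))  # close to 0.3
```

No single noisy bit reveals its owner's true value with certainty, yet the aggregate statistic is still recoverable: a first concrete instance of the privacy/utility tradeoff.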
How can we formalize "privacy"?
• Different people mean different things
• Pin it down mathematically?
  Goal #1: Rigor
    Prove clear theorems about privacy
    • Few exist in the literature
    Make clear (and refutable) conjectures
    Sleep better at night
  Goal #2: Interesting science
    (New) computational phenomena
    Algorithmic problems
    Statistical problems
Can we approach privacy scientifically?
• Pin down a social concept
• No perfect definition?
• But lots of room for rigor
• Too late?
Two Intuitions for Data Privacy
• "If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place." [Dalenius]
  Learning more about me should be hard
• Privacy is "protection from being brought to the attention of others." [Gavison]
  Safety is blending into a crowd
Why not use crypto definitions?
• Cryptography successfully defined concepts such as encryption, "zero-knowledge" proofs, and secure function evaluation
• We will cover some of this in a later class
• Can we just use these definitions?
Why not use crypto definitions?
• Attempt #1:
  Def'n: for every entry i, no information about xi is leaked (as if encrypted)
  Problem: no information at all is revealed!
  Tradeoff: privacy vs. utility
• Attempt #2:
  Agree on summary statistics f(DB) that are safe
  Def'n: no information except f(DB) is leaked
  Problem: why is f(DB) safe to release?
  Tautology trap (also: how do you figure out what f is?)
Why not use crypto definitions?
• Problem: crypto makes sense in settings where the line between "inside" and "outside" is well-defined
  E.g. a psychologist:
  • "inside" = psychologist and patient
  • "outside" = everyone else
• Statistical databases: fuzzy line between inside and outside
Straw man #1: Exact Disclosure
[Diagram: database DB = (x1, ..., xn); a sanitizer "San" with random coins sits between DB and adversary A, who issues queries 1, ..., T and receives answers 1, ..., T]
• Def'n: safe if the adversary cannot learn any entry exactly
  Leads to nice (but hard) combinatorial problems
  Does not preclude learning a value with 99% certainty, or narrowing it down to a small interval
• Historically:
  Focus: auditing interactive queries
  Difficulty: understanding relationships between queries
  E.g. two queries with small difference
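"Two queries with small difference" is the classic differencing attack: each answer is an innocuous-looking aggregate, yet their difference pins down one entry exactly. A minimal sketch (the data and query sets are my own illustration):

```python
# Hypothetical database: each entry is one person's private bit.
db = {"alice": 1, "bob": 0, "carol": 1, "dave": 0, "eve": 1}

def sum_query(db, names):
    """Answer an aggregate query exactly: how many people in `names` have bit 1."""
    return sum(db[n] for n in names)

# Each query covers several people, and neither reveals any entry by itself...
q1 = sum_query(db, ["alice", "bob", "carol", "dave", "eve"])  # everyone
q2 = sum_query(db, ["alice", "bob", "carol", "dave"])         # everyone but Eve

# ...but their difference is exactly Eve's private value.
print(q1 - q2)  # 1
```

This is why auditing must reason about relationships *between* queries, not just each query in isolation.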
Straw man #2: Learning the distribution
• Assume x1, ..., xn are drawn i.i.d. from an unknown distribution
• Def'n: San is safe if it only reveals the distribution
• Implied approach:
  Learn the distribution
  Release a description of the distribution, or re-sample points from it
• Problem: tautology trap
  The estimate of the distribution depends on the data... why is it safe?
More to come in future lectures!
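The implied approach can be sketched concretely: fit a parametric model to the private data, then release fresh samples from it. The Gaussian model and the data below are my own illustration; note that the fitted parameters are themselves functions of the private records, which is exactly the tautology-trap worry on the slide:

```python
import random
import statistics

def sanitize_by_resampling(xs, m):
    """Fit a Gaussian to the private data, then release m fresh samples.

    Caveat (the "tautology trap"): mu and sigma are computed from the
    private data, so the released samples still carry information about it.
    """
    mu = statistics.fmean(xs)
    sigma = statistics.stdev(xs)
    return [random.gauss(mu, sigma) for _ in range(m)]

random.seed(0)
private = [41.0, 52.5, 47.2, 39.8, 55.1, 48.6]
released = sanitize_by_resampling(private, 100)
print(len(released))  # 100 synthetic points, none of them a real record
```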