1
Pinning Down “Privacy”: Defining Privacy in Statistical Databases
Adam Smith, Weizmann Institute of Science
http://theory.csail.mit.edu/~asmith
2
Database Privacy
[Diagram: Alice, Bob, …, You → collection and “sanitization” → users (government, researchers, marketers, …)]
“Census problem”
Two conflicting goals:
• Utility: users can extract “global” statistics
• Privacy: individual information stays hidden
• How can these be formalized?
3
Database Privacy
[Diagram: Alice, Bob, …, You → collection and “sanitization” → users (government, researchers, marketers, …)]
“Census problem”
Why privacy?
• Ethical & legal obligation
• Honest answers require respondents’ trust
4
Trust is important
5
Database Privacy
[Diagram: Alice, Bob, …, You → collection and “sanitization” → users (government, researchers, marketers, …)]
• Trusted collection agency
• Published statistics may be tables, graphs, microdata, etc.
• May have noise or other distortions
• May be interactive
6
Database Privacy
[Diagram: Alice, Bob, …, You → collection and “sanitization” → users (government, researchers, marketers, …)]
Variations on this model are studied in:
• Statistics
• Data mining
• Theoretical CS
• Cryptography
Different traditions for what “privacy” means.
7
How can we formalize “privacy”?
• Different people mean different things
• Pin it down mathematically?
8
I ask them to take a poem and hold it up to the light like a color slide
or press an ear against its hive.
[…]
But all they want to do is tie the poem to a chair with rope and torture a confession out of it.
They begin beating it with a hose to find out what it really means.
- Billy Collins, “Introduction to Poetry”
Can we approach privacy scientifically?
• Pin down a social concept
• No perfect definition?
• But lots of room for rigor
• Too late? (see Adi’s talk)
9
How can we formalize “privacy”?
Goal #1: Rigor
• Prove clear theorems about privacy (few exist in the literature)
• Make clear (and refutable) conjectures
• Sleep better at night
Goal #2: Interesting science
• (New) computational phenomena
• Algorithmic problems
• Statistical problems
10
Overview
• Examples
• Intuitions for privacy
Why crypto def’s don’t apply
• A Partial* Selection of Definitions
• Conclusions
* “partial” = “incomplete” and “biased”
11
Basic Setting
• Database DB = table of n rows, each in a domain D. D can be numbers, categories, tax forms, etc.
• This talk: D = {0,1}^d, e.g.: Married?, Employed?, Over 18?, …
[Diagram: DB = (x_1, …, x_n) → San (random coins) ↔ users (government, researchers, marketers, …), via query 1 / answer 1, …, query T / answer T]
12
Examples of sanitization methods
• Input perturbation: change the data before processing. E.g., randomized response: flip each bit of the table with probability p (see the sketch after this list).
• Summary statistics: means, variances, marginal totals (# people with blue eyes and brown hair), regression coefficients.
• Output perturbation: summary statistics with noise.
• Interactive versions of the above: an auditor decides which queries are OK and what type of noise to add.
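A minimal sketch of randomized response, assuming one-bit attributes; the debiasing estimator is a standard companion step, not something the slide specifies:

    import random

    def randomized_response(table, p=0.25):
        # Flip each bit of the table independently with probability p.
        return [[bit ^ (random.random() < p) for bit in row] for row in table]

    def debiased_count(noisy_table, col, p=0.25):
        # E[noisy count] = t*(1 - p) + (n - t)*p, so invert to estimate the
        # true count t from the noisy one (requires p != 1/2).
        n = len(noisy_table)
        noisy = sum(row[col] for row in noisy_table)
        return (noisy - p * n) / (1 - 2 * p)

Individual rows become deniable, yet aggregate counts remain estimable: privacy and utility trade off through p.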
13
Two Intuitions for Privacy
“If the release of statistics S makes it possible to determine the
value [of private information] more accurately than is possible
without access to S, a disclosure has taken place.” [Dalenius]
• Learning more about me should be hard
Privacy is “protection from being brought to the attention of
others.” [Gavison]
• Safety is blending into a crowd
14
Why not use crypto definitions?
• Attempt #1. Def’n: for every entry i, no information about x_i is leaked (as if encrypted).
  Problem: then no information at all is revealed! Privacy must be traded off against utility.
• Attempt #2. Agree on summary statistics f(DB) that are safe. Def’n: no information about DB is revealed except f(DB).
  Problem: how do we decide that f is safe? The tautology trap.
  (Also: how do you figure out what f is? --Yosi)
15
Overview
• Examples
• Intuitions for privacy
Why crypto def’s don’t apply
• A Partial* Selection of Definitions
Two straw men
Blending into the Crowd
An impossibility result
Attribute Disclosure and Differential Privacy
• Conclusions
* “partial” = “incomplete” and “biased”
Criteria:
• Understandable
• Clear adversary’s goals & prior knowledge / side information
• I am a co-author...
16
[Diagram: DB = (x_1, …, x_n) → San (random coins) ↔ Adversary A, via query 1 / answer 1, …, query T / answer T]
Straw man #1: Exact Disclosure
• Def’n: safe if the adversary cannot learn any entry exactly.
  Leads to nice (but hard) combinatorial problems.
  Does not preclude learning a value with 99% certainty, or narrowing it down to a small interval.
• Historically:
  Focus: auditing interactive queries.
  Difficulty: understanding the relationships between queries, e.g. two queries with a small difference (see the sketch below).
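A hypothetical differencing attack makes “two queries with a small difference” concrete; each answer alone passes an exact-disclosure audit, but together they reveal a row (the data and queries below are illustrative):

    def sum_query(db, rows):
        # An innocuous-looking aggregate over a subset of rows.
        return sum(db[i] for i in rows)

    db = [1, 0, 1, 1, 0]                  # bit i = a private attribute of person i
    everyone  = set(range(len(db)))
    all_but_3 = everyone - {3}

    a1 = sum_query(db, everyone)          # a "global" statistic
    a2 = sum_query(db, all_but_3)         # also looks global
    print(a1 - a2)                        # ...but the difference is exactly db[3]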
17
Straw man #2: Learning the distribution
• Assume x_1, …, x_n are drawn i.i.d. from an unknown distribution.
• Def’n: San is safe if it only reveals the distribution.
• Implied approach: learn the distribution, then release a description of it, or re-sample points from it.
• Problem: the tautology trap. The estimate of the distribution depends on the data… why is releasing it safe?
18
Blending into a Crowd
• Intuition: I am safe in a group of k or more; k varies (3… 6… 100… 10,000?).
• Many variations on the theme: the adversary wants a predicate g such that 0 < #{ i | g(x_i) = true } < k. Such a g is called a breach of privacy.
• Why? It is fundamental:
  R. Gavison: “protection from being brought to the attention of others”;
  a rare property helps me re-identify someone.
  Implicit: information about a large group is public, e.g. liver problems are more prevalent among diabetics.
19
Blending into a Crowd
How can we capture this?
• Syntactic definitions
• Bayesian adversary
• “Crypto-flavored” definitions
Two variants: frequency in the DB vs. frequency in the underlying population.
20
“Syntactic” Definitions
• Given a sanitization S, look at the set of all databases consistent with S.
• Def’n: safe if no predicate is a breach for all consistent databases.
k-anonymity [L. Sweeney]
• Sanitization is a histogram of the data: partition D into bins B_1 ∪ B_2 ∪ … ∪ B_t and output the cardinalities f_j = #( DB ∩ B_j ).
• Safe if for all j, either f_j ≥ k or f_j = 0.
Cell bound methods [statistics, 1990s]
• Sanitization consists of marginal sums: let f_z = #{ i : x_i = z }; then San(DB) = various sums of the f_z.
• Safe if for all z, either ∃ a consistent DB with f_z ≥ k, or for ∀ consistent DBs, f_z = 0.
• Large literature using algebraic and combinatorial techniques. (A sketch of both rules follows the tables below.)
Example (hair color × eye color), actual counts:

           brown   blue
  blond      2      10   | 12
  brown     12       6   | 18
            14      16

From the marginal sums alone, each cell is only bounded:

           brown    blue
  blond   [0,12]  [0,12]  | 12
  brown   [0,14]  [0,16]  | 18
            14      16
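A sketch of both rules, under assumptions: the histogram release withholds any bin with 0 < f_j < k, and the cell bounds are the simple [0, min(row sum, column sum)] intervals, which reproduce the second table (the statistics literature computes tighter bounds):

    from collections import Counter

    def k_anonymous_histogram(db, k):
        # Release f_j = #(DB ∩ B_j) only when f_j >= k; nonempty bins with
        # fewer than k rows are withheld.
        counts = Counter(db)
        return {b: f for b, f in counts.items() if f >= k}

    def naive_cell_bounds(row_sums, col_sums):
        # Given only the marginals, each cell count lies in [0, min(r_i, c_j)].
        return [[(0, min(r, c)) for c in col_sums] for r in row_sums]

    rows = (["blond-brown"] * 2 + ["blond-blue"] * 10 +
            ["brown-brown"] * 12 + ["brown-blue"] * 6)
    print(k_anonymous_histogram(rows, k=5))
    # {'blond-blue': 10, 'brown-brown': 12, 'brown-blue': 6} -- the 2-person cell is gone
    print(naive_cell_bounds([12, 18], [14, 16]))
    # [[(0, 12), (0, 12)], [(0, 14), (0, 16)]] -- the bounds in the table above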
21
“Syntactic” Definitions
Issues:
• If k is small: “all three Canadians at Weizmann sing in a choir.”
• Semantics?
  Probability is not considered: what if I have side information?
  The algorithm for making decisions is not considered: what adversary does this apply to?
22
Security for “Bayesian” adversaries
Goal:
• The adversary outputs a point z ∈ D.
• Score = 1/f_z if f_z > 0, and 0 otherwise.
• Def’n: the sanitization is safe if E(score) ≤ ε.
Procedure (a toy version follows the issues below):
• Assume you know the adversary’s prior distribution over databases.
• Given a candidate output (e.g. a set of marginal sums):
  update the prior conditioned on the output (via Bayes’ rule);
  if max_z E( score | output ) < ε, then release;
  else consider a new set of marginal sums.
• Extensive literature on computing the expected value (see Yosi’s talk).
Issues:
• Restricts the type of predicates the adversary can choose.
• Must know the prior distribution: can one scheme work for many distributions? The sanitizer works harder than the adversary.
• Conditional probabilities don’t consider previous iterations; cf. “simulatability” [KMN’05]. Can this be fixed (with efficient computations)?
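A toy instantiation of this procedure; all choices below (one-bit rows, a uniform prior, the overall sum as the candidate output, the value of ε) are mine:

    import itertools

    N = 3
    ALL_DBS = list(itertools.product([0, 1], repeat=N))   # every database of N bits
    prior = {db: 1.0 / len(ALL_DBS) for db in ALL_DBS}    # assumed uniform prior

    def output(db):                # candidate release: a single marginal sum
        return sum(db)

    def posterior(s):              # update the prior via Bayes' rule
        cond = {db: p for db, p in prior.items() if output(db) == s}
        total = sum(cond.values())
        return {db: p / total for db, p in cond.items()}

    def max_expected_score(s):     # adversary picks z maximizing E[1/f_z]
        post = posterior(s)
        return max(
            sum(p / db.count(z) for db, p in post.items() if db.count(z) > 0)
            for z in [0, 1]
        )

    eps = 0.5
    for s in range(N + 1):
        e = max_expected_score(s)
        print(s, round(e, 3), "release" if e < eps else "try another release")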
23
Crypto-flavored Approach [CDMSW,CDMT,NS]
“If the release of statistics S makes it possible to determine the
value [of private information] more accurately than is possible
without access to S, a disclosure has taken place.” [Dalenius]
24
Crypto-flavored Approach [CDMSW,CDMT,NS]
• [CDMSW]: compare to a “simulator”:
  ∀ distributions on databases DB,
  ∀ adversaries A, ∃ A′ such that
  ∀ subsets J ⊆ DB:
  Pr_{DB,S}[ A(S) = breach in J ] − Pr_{DB}[ A′() = breach in J ] ≤ ε
• The definition says nothing if the adversary knows x_1; hence the requirement that it hold for all subsets J of DB.
• No non-trivial examples satisfy this definition, so restrict the family of distributions to some class C and try to make C as large as possible. Sufficient: i.i.d. rows from a “smooth” distribution.
25
Crypto-flavored Approach [CDMSW,CDMT,NS]
• [CDMSW, CDMT]: geometric data. Assume x_i ∈ R^d and relax the definition to ball predicates:
  g_{z,r} = { x : ||x − z|| ≤ r } and g′_{z,r} = { x : ||x − z|| ≤ C·r };
  breach if #( DB ∩ g_{z,r} ) > 0 and #( DB ∩ g′_{z,r} ) < k (see the sketch below).
  Several types of histograms can be released; sufficient for “metric” problems: clustering, minimum spanning tree, …
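A sketch of the relaxed breach check for geometric data; the data, radii, and constants are illustrative:

    import numpy as np

    def is_breach(db, z, r, C, k):
        # Breach: the r-ball around z is nonempty, yet even the C-times-larger
        # ball contains fewer than k points, i.e. a small isolated group.
        dist = np.linalg.norm(db - z, axis=1)
        return (dist <= r).sum() > 0 and (dist <= C * r).sum() < k

    rng = np.random.default_rng(0)
    db = rng.random((100, 2))      # 100 points in the unit square (x_i in R^2)
    print(is_breach(db, np.array([0.5, 0.5]), r=0.02, C=3.0, k=10))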
26
Crypto-flavored Approach [CDMSW,CDMT,NS]
• [NS]: no geometric restrictions, but a lot of noise (almost erase the data!): a strong privacy statement with very weak utility.
• [CDMSW, CDMT, NS]: proven statements!
Issues:
• Works for a large class of prior distributions and side information, but not for all.
• Not clear whether it helps with “ordinary” statistical calculations.
• Interesting utility requires geometric restrictions.
• Too messy?
27
Blending into a Crowd
• Intuition: I am safe in a group of k or more.
• Pros:
  appealing intuition for privacy;
  seems fundamental;
  mathematically interesting;
  meaningful statements are possible!
• Cons:
  does it rule out learning facts about a particular individual?
  all results seem to make strong assumptions on the adversary’s prior distribution;
  is this necessary? (yes…)
28
Overview
• Examples
• Intuitions for privacy
Why crypto def’s don’t apply
• A Partial* Selection of Definitions
Two Straw men
Blending into the Crowd
An impossibility result
Attribute Disclosure and Differential Privacy
• Conclusions
29
An impossibility result
• An abstract schema: define a privacy breach, then require that
  ∀ distributions on databases, ∀ adversaries A, ∃ A′ such that
  Pr( A(San) = breach ) − Pr( A′() = breach ) ≤ ε
• Theorem [Dwork-Naor]: for any reasonable notion of “breach”, if San(DB) contains any information about DB, then some adversary breaks the definition.
• Example: the adversary knows Alice is 2 inches shorter than the average Lithuanian, but how tall are Lithuanians? With the sanitized database, the probability of guessing Alice’s height goes up.
• Theorem: this is unavoidable.
30
Proof sketch
• Suppose:
  if DB is uniform, then the mutual information I( DB ; San(DB) ) > 0;
  a “breach” means predicting a predicate g(DB).
• Pick a hash function h : {databases} → {0,1}^H(DB | San).
  Let the prior distribution be uniform, conditioned on h(DB) = z.
• Then:
  h(DB) = z alone gives no information about g(DB);
  but San(DB) and h(DB) = z together determine DB.
• [DN] vastly generalize this.
31
[Diagram: DB = (x_1, …, x_n) → San (random coins) ↔ Adversary A, via query 1 / answer 1, …, query T / answer T]
Preventing Attribute Disclosure
• A large class of definitions: safe if the adversary can’t learn “too much” about any single entry. E.g.:
  cannot narrow X_i down to a small interval;
  for uniform X_i, the mutual information I( X_i ; San(DB) ) ≤ ε.
• How can we decide among these definitions?
32
Differential Privacy
• Lithuanians example: the adversary learns Alice’s height even if Alice is not in the DB.
• Intuition [DM]: “Whatever is learned would be learned regardless of whether or not Alice participates.”
  Dual: whatever is already known, the situation won’t get worse.
[Diagram: DB = (x_1, …, x_n) → San (random coins) ↔ Adversary A]
33
Differential Privacy
[Diagram: DB with one entry replaced by 0 → San (random coins) ↔ Adversary A]
• Define n+1 games:
  “Game 0”: the adversary interacts with San(DB).
  For each i, let DB_{-i} = (x_1, …, x_{i−1}, 0, x_{i+1}, …, x_n).
  “Game i”: the adversary interacts with San(DB_{-i}).
• Bayesian adversary: given a transcript S and a prior distribution p(·) on DB, this defines n+1 posterior distributions p_0(· | S), …, p_n(· | S).
34
Differential Privacy
[Diagram: DB with one entry replaced by 0 → San (random coins) ↔ Adversary A]
• Definition: San is safe if
  ∀ prior distributions p(·) on DB,
  ∀ transcripts S, ∀ i = 1, …, n:
  StatDiff( p_0(· | S) , p_i(· | S) ) ≤ ε
• Note that the prior distribution may be far from both posteriors.
• How can we satisfy this?
35
Approach: Indistinguishability [DiNi,EGS,BDMN]
[Diagram: two pipelines. DB = (x_1, x_2, x_3, …, x_n) → San (random coins) → transcript S, and DB′ = (x_1, x_2′, x_3, …, x_n) → San (random coins) → transcript S′. The two databases differ in one row; the transcript distributions are at “distance” ≤ ε.]
Choice of distance measure is important.
37
Approach: Indistinguishability [DiNi,EGS,BDMN]
[Diagram: DB → San → S and DB′ → San → S′, with the transcript distributions at distance ≤ ε.]
Problem: ε must be large.
• By a hybrid argument, any two databases induce transcripts at distance ≤ nε.
• To get utility we need ε > 1/n, and statistical difference 1/n is not meaningful.
• Example: release a random point in the database, San(x_1, …, x_n) = ( j, x_j ) for a random j.
  For every i, changing x_i induces statistical difference 1/n, but some x_i is revealed with probability 1 (see the check below).
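The example can be checked exactly by enumeration; a small sketch, assuming the transcript is just the released pair (j, x_j):

    from fractions import Fraction

    def transcript_dist(db):
        # San(x_1, ..., x_n) = (j, x_j) for a uniformly random index j.
        n = len(db)
        dist = {}
        for j, xj in enumerate(db):
            dist[(j, xj)] = dist.get((j, xj), Fraction(0)) + Fraction(1, n)
        return dist

    def stat_diff(p, q):
        support = set(p) | set(q)
        return sum(abs(p.get(s, Fraction(0)) - q.get(s, Fraction(0))) for s in support) / 2

    db, db_prime = (0, 0, 0, 0), (1, 0, 0, 0)    # neighbors: differ in row 0 only
    print(stat_diff(transcript_dist(db), transcript_dist(db_prime)))   # 1/4, i.e. 1/n

So the per-row statistical difference is exactly 1/n, even though whichever row is sampled is published in the clear.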
38
Formalizing Indistinguishability
Definition: San is ε-indistinguishable if
∀ adversaries A, ∀ DB, DB′ which differ in one row, ∀ sets E of transcripts:
  p( San(DB) ∈ E ) ∈ e^(±ε) · p( San(DB′) ∈ E ) ≈ (1 ± ε) · p( San(DB′) ∈ E )
• Equivalently, ∀ transcripts S:
  p( San(DB) = S ) / p( San(DB′) = S ) ∈ 1 ± ε
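For intuition, here is a brute-force check (my example, not from the slide) that bitwise randomized response with flip probability p satisfies this with ε = ln((1 − p)/p), by enumerating all transcripts for a pair of neighboring three-bit databases:

    import itertools, math

    p = 0.25                               # flip probability

    def pr_transcript(db, s):
        # Probability that flipping each bit of db independently w.p. p yields s.
        return math.prod((1 - p) if a == b else p for a, b in zip(db, s))

    db, db_prime = (0, 1, 1), (1, 1, 1)    # differ in exactly one row
    worst = max(pr_transcript(db, s) / pr_transcript(db_prime, s)
                for s in itertools.product([0, 1], repeat=3))
    print(worst)                           # 3.0 = (1 - p)/p, so eps = ln 3 ≈ 1.1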
39
Indistinguishability ⇒ Differential Privacy
• Definition: San is safe if ∀ prior distributions p(·) on DB, ∀ transcripts S, ∀ i = 1, …, n:
  StatDiff( p_0(· | S) , p_i(· | S) ) ≤ ε
• We can use indistinguishability: for every transcript S and database DB, the probabilities p( S | DB ) and p( S | DB_{-i} ) differ by a factor of at most e^(±ε), so by Bayes’ rule the posteriors p_0(· | S) and p_i(· | S) are correspondingly close. This implies StatDiff( p_0(· | S) , p_i(· | S) ) ≤ ε.
40
Why does this help?
With relatively little noise:
• Averages
• Histograms
• Matrix decompositions
• Certain types of clustering
• …
See Kobbi’s talk
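As a concrete, hedged instance of the first item: one standard way to release an average is to add noise scaled to how much a single row can move the answer. The use of the Laplace distribution here is my assumption; the slides do not name a noise distribution:

    import numpy as np

    def noisy_average(x, eps, lo=0.0, hi=1.0):
        # Changing one of the n rows moves the clipped average by at most
        # (hi - lo)/n, so Laplace noise with that scale divided by eps gives
        # an eps-indistinguishable release.
        x = np.clip(np.asarray(x, dtype=float), lo, hi)
        scale = (hi - lo) / (len(x) * eps)
        return x.mean() + np.random.laplace(0.0, scale)

    print(noisy_average(np.random.rand(10_000), eps=0.1))   # noise std ≈ 0.0014

For large n the added noise is far smaller than the sampling error in the average itself, which is why relatively little noise suffices.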
41
Preventing Attribute Disclosure
• Various ways to capture “no particular value should be revealed”.
• Differential criterion: “Whatever is learned would be learned regardless of whether or not person i participates.”
• Satisfied by indistinguishability. Does it also imply protection from re-identification?
• Two interpretations:
  a given release won’t make privacy worse;
  a rational respondent will answer if there is some gain.
• Can we preserve enough utility?
42
Overview
• Examples
• Intuitions for privacy
Why crypto def’s don’t apply
• A Partial* Selection of Definitions
Two Straw men
Blending into the Crowd
An impossibility result
Attribute Disclosure and Differential Privacy
* “partial” = “incomplete” and “biased”
43
Things I Didn’t Talk About
• Economic Perspective [KPR]
Utility of providing data = value – cost
May depend on whether others participate
When is it worth my while?
• Specific methods for re-identification
• Various other frameworks (e.g. “L-diversity”)
• Other pieces of the big “data privacy” picture
Access Control
Implementing trusted collection center
44
Conclusions
• Pinning down a social notion in a particular context.
• A biased survey of approaches to definitions, with a taste of techniques along the way. (Didn’t talk about utility.)
• The question has a different flavor from usual crypto problems and from statisticians’ traditional conceptions.
• Meaningful statements are possible! Practical? Do they cover everything? No.
45
Conclusions
• How close are we to converging (as happened for, e.g., secure function evaluation, encryption, Turing machines, …)? But we’re after a social concept: is there a silver bullet?
• What are the big challenges?
• Need “cryptanalysis” of these systems (Adi…?)