Towards Privacy in Public Databases Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry...
Towards Privacy in Public Databases
Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith,
Larry Stockmeyer, Hoeteck Wee
Work Done at Microsoft Research
2
Database Privacy
Think “Census”
  Individuals provide information
  Census Bureau publishes sanitized records
Privacy is legally mandated; what utility can we achieve?
Inherent privacy vs. utility tension
  One extreme: complete privacy, no information
  Other extreme: complete information, no privacy
Goals: find a middle path
  preserve macroscopic properties
  “disguise” individual identifying information
Change the nature of discourse
  Establish a framework for meaningful comparison of techniques
3
Outline
Definitions
  privacy, defined in the breach
  sanitization requirements
  utility goals
Example: Recursive Histogram Sanitizations
  description of technique
  a robust proof of privacy
Example: “Round” Sanitizations
  nice learning properties
  privacy via cross-training
Setting the Real World Context
  dealing with auxiliary information
5
What do WE mean by privacy?
[Ruth Gavison] Protection from being brought to the attention of others
  inherently valuable
  attention invites further privacy loss
Privacy is assured to the extent that one blends in with the crowd
Appealing definition; can be converted into a precise mathematical statement…
6
A geometric view
Abstraction: Database consists of points in high-dimensional space $\mathbb{R}^d$
  Points are unlabeled: you are your collection of attributes
  Distance is everything: points are more similar if and only if they are closer
Real Database (RDB), private
  n unlabeled points in d-dimensional space
  think of d as number of sensitive attributes
Sanitized Database (SDB), public
  n’ new points, possibly in a different space
7
The adversary or Isolator - Intuition
On input SDB and auxiliary information, adversary outputs a point $q \in \mathbb{R}^d$
q “isolates” a real DB point x if it is much closer to x than to x’s near neighbors
  q fails to isolate x if q looks roughly as much like everyone in x’s neighborhood as it looks like x itself
Tightly clustered points have a smaller radius of isolation
[Figure: points of the RDB]
8
(c,T)-Isolation – the definition
$I(\mathrm{SDB}, \mathrm{aux}) = q$
x is (c,T)-isolated if $B(q, c\delta)$ contains fewer than T other points from RDB, where $\delta = \|q - x\|$
c: privacy parameter; e.g., 4
[Figure: q, x, and the ball around q]
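The isolation test can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes $\delta$ is the distance from q to x and that "other points" means RDB points other than x. The function name `isolates` and the toy data are hypothetical.

```python
import math


def isolates(q, x_index, rdb, c, T):
    """Return True if q (c,T)-isolates rdb[x_index].

    delta is the distance from q to x; q isolates x when the ball
    B(q, c * delta) holds fewer than T RDB points other than x.
    """
    delta = math.dist(q, rdb[x_index])
    others_in_ball = sum(
        1
        for i, p in enumerate(rdb)
        if i != x_index and math.dist(q, p) <= c * delta
    )
    return others_in_ball < T


# Toy RDB (hypothetical): one outlier at the origin, a tight cluster near (10, 10).
rdb = [(0.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# A guess close to the outlier isolates it; a guess inside the cluster does not,
# because the enlarged ball sweeps in the cluster's other members.
outlier_isolated = isolates((0.1, 0.0), 0, rdb, c=4, T=2)
cluster_isolated = isolates((10.5, 10.5), 1, rdb, c=4, T=2)
```

The cluster case is the "blending in with the crowd" intuition: the guess is as close to x's neighbors as it is to x.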
9
Requirements for the sanitizer
No way of obtaining privacy if AUX already reveals too much!
Sanitization procedure compromises privacy if giving the adversary access to the SDB considerably increases its probability of success
Definition of “considerably” can be forgiving
Formally, quantify over distributions, adversaries, choice of database, auxiliary information:
  $\forall \mathcal{D}\ \forall I\ \exists I'$, w.h.p. over $\mathcal{D}$, $\forall \mathrm{aux}\ \forall x$:
  $|\Pr[I(\mathrm{SDB}, \mathrm{aux}) \text{ isolates } x] - \Pr[I'(\mathrm{aux}) \text{ isolates } x]|$ is small
  probabilities over choices made by sanitizer and I, I’
Provides a framework for describing the power of a sanitization method, and hence for comparisons
Aux is going to cause trouble. Ignore it for now.
10
Utility Goals
Pointwise proofs of specific utilities
  averages, medians, clusters, regressions, …
Prove there is a large class of interesting utilities for which there are good approximation procedures using sanitized data
12
Recursive Histogram Sanitization
U = d-dim cube, side = 2
Cut into $2^d$ subcubes
  split along each axis
  subcube has side = 1
For each subcube
  if number of RDB points > 2T then recurse
Output: list of cells and counts
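The recursion above can be sketched as follows. This is an illustrative version under stated assumptions, not the authors' implementation: only non-empty cells are listed (empty subcubes are implicitly count 0, which avoids enumerating all $2^d$ of them), and the `max_depth` guard is a hypothetical addition so that many identical points cannot recurse forever.

```python
from collections import defaultdict


def recursive_histogram(points, corner, side, T, max_depth=32):
    """Return a list of (corner, side, count) cells for the given points.

    The cube with lower corner `corner` and edge `side` is split in half
    along every axis; a subcube holding more than 2T points is split again.
    """
    d = len(corner)
    half = side / 2.0
    buckets = defaultdict(list)
    for p in points:
        # Bit i records whether p lies in the upper half along axis i.
        key = tuple(int(p[i] >= corner[i] + half) for i in range(d))
        buckets[key].append(p)

    cells = []
    for key, bucket in buckets.items():
        sub_corner = tuple(corner[i] + half * key[i] for i in range(d))
        if len(bucket) > 2 * T and max_depth > 1:
            cells.extend(
                recursive_histogram(bucket, sub_corner, half, T, max_depth - 1)
            )
        else:
            cells.append((sub_corner, half, len(bucket)))
    return cells


# Toy 2-d example (hypothetical data): U = [-1, 1]^2, T = 1.
points = [(-0.5, -0.5), (0.5, -0.5), (0.25, 0.25), (0.75, 0.25), (0.25, 0.75)]
cells = recursive_histogram(points, corner=(-1.0, -1.0), side=2.0, T=1)
```

In the toy run, only the upper-right quadrant holds more than 2T = 2 points, so it alone is refined into cells of side 0.5.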
13
Recursive Histogram Sanitization
Theorem: $\exists c$ s.t. if n points are drawn uniformly from U, then recursive histogram sanitizations are safe with respect to c-isolation: $\Pr[I(\mathrm{SDB}) \text{ succeeds}] \le \exp(-d)$.
14
Safety of Recursive Histogram Sanitization
Rough Intuition
  Expected distance ||q-x|| is ≈ diameter of cell.
  Distances tightly concentrated around mean.
  Multiplying radius by c captures almost all the parent cell, which contains at least 2T points.
15
For Very Large Values of n
Wlog can switch to ball adversaries: I outputs (q, r)
  I wins if B(q,r) contains at least one RDB point and B(q,cr) contains fewer than T RDB points
Define a probability density f(x) that captures adversary’s view of the RDB
To win with probability $\varepsilon$, I needs:
  $\Pr_f[B(q,r)] \ge \varepsilon/n$
  $\Pr_f[B(q,cr)] \le (2T + O(\log \varepsilon^{-1}))/n$
  $\Pr_f[B(q,r)] / \Pr_f[B(q,cr)] \ge \varepsilon / (2T + O(\log \varepsilon^{-1}))$
Bound $\varepsilon$ by bounding the ratio: ratio $\le 2^{-\gamma d}$, $\gamma < 1$
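The ratio condition is the quotient of the two requirements before it. A worked version of that step follows. The symbol $\varepsilon$ for the adversary's success probability is an assumption (the transcript lost its Greek letters), and $p_r$, $p_{cr}$, $\beta$ are shorthand introduced only for this sketch.

```latex
\[
p_r := \Pr_f[B(q,r)] \;\ge\; \frac{\varepsilon}{n},
\qquad
p_{cr} := \Pr_f[B(q,cr)] \;\le\; \frac{2T + O(\log \varepsilon^{-1})}{n}
\]
\[
\frac{p_r}{p_{cr}}
\;\ge\; \frac{\varepsilon / n}{\bigl(2T + O(\log \varepsilon^{-1})\bigr) / n}
\;=\; \frac{\varepsilon}{2T + O(\log \varepsilon^{-1})}
\]
\[
\frac{p_r}{p_{cr}} \le \beta
\quad\Longrightarrow\quad
\varepsilon \;\le\; \beta \,\bigl(2T + O(\log \varepsilon^{-1})\bigr)
\]
```

So any upper bound on the ratio that is exponentially small in d forces the success probability to be exponentially small as well, up to the factor involving T.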
16
$\Pr_f[B(q,r)] / \Pr_f[B(q,cr)]$
$f(x) = (n_C/n)\,(1/\mathrm{Vol}(C))$
  fraction of RDB points landing in cell C, spread uniformly within C
If r is sufficiently small, the bigger ball captures exp(d) more mass in each subcube than does the smaller ball
  yields ratio $< 2^{-\Omega(d)}$
17
$\Pr_f[B(q,r)] / \Pr_f[B(q,cr)]$
$f(x) = (n_C/n)\,(1/\mathrm{Vol}(C))$
  fraction of RDB points landing in cell C, spread uniformly within C
If r is sufficiently small, the bigger ball captures exp(d) more mass in each subcube than does the smaller ball
If r is large, the small ball captures nothing or the bigger ball captures parent cube
Either way isolation cannot occur (c = 16)
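As a sanity check on the small-r case, here is the idealized computation when f is constant on the whole larger ball. That constancy is an assumption made only for this sketch; the density on the slide is piecewise uniform over cells, so the actual proof argues subcube by subcube.

```latex
\[
\frac{\Pr_f[B(q,r)]}{\Pr_f[B(q,cr)]}
= \frac{\mathrm{Vol}(B(q,r))}{\mathrm{Vol}(B(q,cr))}
= \frac{r^d}{(cr)^d}
= c^{-d},
\qquad
c = 16 \;\Longrightarrow\; c^{-d} = 2^{-4d}.
\]
```

The exponential dependence on d is why the argument exploits high dimensionality.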
18
Proof is Very Robust
Extends to many interesting cases
  non-uniform but bounded-ratio density fns
  isolator knows constant fraction of attribute vals
  isolator knows lots of RDB points
  isolation in few attributes
    very weak bounds
Can be adapted to “round” distributions
  balls, spheres, mixtures of Gaussians, with effort [work in progress w/ K. Talwar]
More General Distributions
  “good” islands in a sea of zero probability
20
Round Sanitizations
The privacy of x is linked to its T-radius
  Randomly perturb it in proportion to its T-radius
$x' = \mathrm{San}(x) \in_R B(x, \text{T-rad}(x))$
  alternatively: $S(x, \text{T-rad}(x))$ or d-dim Gaussian
Intuition: We are blending x in with its crowd
We are adding to x random noise with mean zero, so several macroscopic properties should be preserved.
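A minimal sketch of the ball variant, under two labeled assumptions: the T-radius is taken to be the distance to the T-th nearest other RDB point, and the uniform draw from the ball uses the standard trick of a random Gaussian direction scaled by $U^{1/d}$. The names `t_radius` and `round_sanitize` and the toy data are hypothetical.

```python
import math
import random


def t_radius(x_index, rdb, T):
    """Distance from rdb[x_index] to its T-th nearest other RDB point (assumed definition)."""
    x = rdb[x_index]
    dists = sorted(math.dist(x, p) for i, p in enumerate(rdb) if i != x_index)
    return dists[T - 1]


def round_sanitize(x_index, rdb, T, rng):
    """Return x' drawn uniformly at random from the ball B(x, T-rad(x))."""
    x = rdb[x_index]
    d = len(x)
    radius = t_radius(x_index, rdb, T)
    # A normalized Gaussian vector gives a uniformly random direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(g * g for g in direction))
    # Scaling by U^(1/d) makes the point uniform in the ball, not just on rays.
    length = radius * rng.random() ** (1.0 / d)
    return tuple(xi + length * g / norm for xi, g in zip(x, direction))


# Toy RDB (hypothetical): three nearby points and one far away.
rdb = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
rng = random.Random(0)
sanitized = [round_sanitize(i, rdb, T=2, rng=rng) for i in range(len(rdb))]
```

Isolated points have a large T-radius and so receive large perturbations, while points in dense regions move only slightly.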
21
Nice Learning Properties
Known algorithm for learning mixtures of Gaussians works for clustering sanitized Gaussian data
  Original distribution (mixture of Gaussians) is recovered
  Technical issue: added noise is a function of the data
  Subject of another talk
Diameter increases by at most a factor of 3 when finding k clusters minimizing the largest diameter
22
Privacy for n Sanitized Points?
Given n-1 points in the clear, the probability of isolating the nth is O(exp(-d))
Intuition for extension to n points is wrong!
  Privacy of $x_n$ given $x_n'$ and all the other points in the clear does not imply privacy of $x_n$ given $x_n'$ and sanitizations of others!
  Sanitization of other points reveals information about $x_n$
  Worry is for safety of the reference point (the neighbor defining the T-radius), not the principal
23
Combining the Two Sanitizations
Partition RDB into two sets A and B
Cross-training
  Compute histogram sanitization for B
  $\forall v \in A$: $\rho_v = f(\text{side length of } C \text{ containing } v)$
  Output $\mathrm{GSan}(v, \rho_v)$
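The cross-training step can be sketched as below, taking the histogram of B as given. Several things here are assumptions rather than the authors' definitions: cells are `(corner, side, count)` triples, GSan is read as adding independent Gaussian noise with standard deviation $\rho_v$ to every coordinate, and the function f from side length to noise scale is passed in as `noise_scale` (the identity in the example). All names and data are hypothetical.

```python
import random


def find_cell(v, cells):
    """Return the first (corner, side, count) cell whose cube contains v."""
    for corner, side, count in cells:
        if all(corner[i] <= v[i] <= corner[i] + side for i in range(len(v))):
            return corner, side, count
    raise ValueError("no histogram cell contains this point")


def cross_train_sanitize(A, cells_of_B, noise_scale, rng):
    """Perturb each v in A with Gaussian noise scaled by the B-cell containing v."""
    sanitized = []
    for v in A:
        _, side, _ = find_cell(v, cells_of_B)
        rho = noise_scale(side)  # assumed form of rho_v = f(side length)
        sanitized.append(tuple(vi + rng.gauss(0.0, rho) for vi in v))
    return sanitized


# Hypothetical histogram of B over [-1, 1]^2: three coarse cells, four fine ones.
cells_of_B = [
    ((-1.0, -1.0), 1.0, 2), ((0.0, -1.0), 1.0, 1), ((-1.0, 0.0), 1.0, 0),
    ((0.0, 0.0), 0.5, 1), ((0.5, 0.0), 0.5, 1),
    ((0.0, 0.5), 0.5, 1), ((0.5, 0.5), 0.5, 0),
]
A = [(-0.5, -0.5), (0.7, 0.2)]
rng = random.Random(1)
sanitized_A = cross_train_sanitize(A, cells_of_B, noise_scale=lambda side: side, rng=rng)
```

Points of A that fall in finely divided (dense) regions of B get less noise, and no point of A influences the scale applied to any other point of A.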
24
Cross-Training Privacy
Privacy for B: only histogram information about B is used
Privacy for A: enough variance for enough coordinates of v, even given C containing v and sanitization v’ of v
  current proof works only for $|A| = 2^{o(d)}$
25
Additional Results*
Impossibility Results
  $\exists$ interesting utilities that have no sanitization protecting against isolation (cf. SFE)
  Impossibility of all-purpose sanitizers
    There is always a choice of aux that defeats a certain natural version of privacy
    Contrived, but places a limit on what can be proved
    Poly-time bounded adversary? Connection to obfuscation.
Utility
  Exploit literature on power of randomized histograms for algorithms for data streams (eg, Indyk)
* with assorted collaborators, eg, N, N, S, T
27
A Standard Technique: Cell Suppression
Gestalt: Tabular Data (many, possibly linked, tables)
  entries are cells
  frequency (count) data
  magnitude data (income, sales, etc.)
Disclosure = small counts
  Provides key for population unique, or almost-unique
  Can be used as a key into a different database
Enormous literature on suppressing “safely”
Example table (last column and last row are totals):
  16    8    5    2  |  31
   1    5   20    3  |  29
  -----------------------
  17   13   25    5  |  60
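One reason suppressing "safely" is hard: hiding a single small cell is useless when its row total is published. The sketch below recovers the suppressed count 1 in the second row of the example table. The function name `recover_suppressed` is hypothetical, and the example assumes the last column holds row totals, which the arithmetic of the table supports.

```python
def recover_suppressed(row, row_total):
    """Fill in the single suppressed (None) cell of a row from its published total."""
    hidden = [i for i, cell in enumerate(row) if cell is None]
    if len(hidden) != 1:
        raise ValueError("expected exactly one suppressed cell")
    known_sum = sum(cell for cell in row if cell is not None)
    filled = list(row)
    filled[hidden[0]] = row_total - known_sum
    return filled


# Second row of the example table with its small count suppressed.
published_row = [None, 5, 20, 3]
recovered = recover_suppressed(published_row, row_total=29)
```

Avoiding this requires suppressing further cells in the same row and column, so that no single equation pins down the hidden value.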
28
Connection to Our Definitions
Protection against isolation yields protection against learning a key for a population unique
  isolation on a subspace does not imply isolation in the full-dimensional space …
  … but aux may contain other DBs that can be queried to learn remaining attributes
  definition mandates protection against all possible aux
  satisfy def ⇒ can’t learn key
29
Connection to Our Definitions
Seems very hard to provide good sanitization in the presence of arbitrary aux
  Provably impossible in general
  Anyway, can probably already isolate people based solely on aux
Suggests we need to control aux
How should we redesign the world?
30
Two Tools
Secure Function Evaluation [Yao, GMW]
  Technique permitting Alice, Bob, Carol, and their friends to collaboratively compute a function f of their private inputs, f(a,b,c,…), e.g., sum(a,b,c,…)
  Each player learns only what can be deduced from the output of f and her own input to f
SuLQ databases [Dwork, Nissim]
  Provably preserves privacy of attributes when the rows of the database are mutually independent
  Powerful [DwNi; Blum, Dwork, McSherry, Nissim]
31
Statistical Database
Query (S, f)
  $S \subseteq [n]$
  $f : \{0,1\}^d \to \{0,1\}$
Exact answer: $\sum_{r \in S} f(\text{row } r)$
[Figure: database DB with n rows (persons) and d attributes; f is applied to each selected row]
  0 0 1 1 0
  1 0 1 0 0
  1 1 0 1 1
  0 0 1 0 1
  1 1 0 0 1
  0 0 0 1 0
Row distribution $D = (D_1, D_2, \ldots, D_n)$
Sub-Linear Query (SuLQ) Databases
[Figure: database of n persons by d attributes; f is applied to the selected rows and the answer is released with added noise]
If the number of queries is << n, then privacy can be protected with little noise (per query):
  E(noise) = 0; standard dev << √n
  Much less than sampling error!
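A toy version of the noisy query interface, using the 6 × 5 example database from the figure. This is a sketch under assumptions, not the SuLQ construction itself: the noise is taken to be Gaussian with mean 0, and the class name, `sigma`, and `max_queries` are hypothetical parameters standing in for "standard dev << √n" and "number of queries << n".

```python
import random


class NoisyCountingDatabase:
    """Answers queries (S, f) with the exact sum plus zero-mean noise."""

    def __init__(self, rows, sigma, max_queries, rng):
        self.rows = rows
        self.sigma = sigma              # noise standard deviation (assumed Gaussian)
        self.queries_left = max_queries  # query budget, meant to be far below n
        self.rng = rng

    def exact_answer(self, S, f):
        """Sum of f(row r) over r in S; never released directly."""
        return sum(f(self.rows[r]) for r in S)

    def query(self, S, f):
        """Release the exact answer perturbed by noise, while budget remains."""
        if self.queries_left <= 0:
            raise RuntimeError("query budget exhausted")
        self.queries_left -= 1
        return self.exact_answer(S, f) + self.rng.gauss(0.0, self.sigma)


def third_attribute(row):
    """Example predicate f: {0,1}^d -> {0,1}."""
    return row[2]


rows = [
    (0, 0, 1, 1, 0), (1, 0, 1, 0, 0), (1, 1, 0, 1, 1),
    (0, 0, 1, 0, 1), (1, 1, 0, 0, 1), (0, 0, 0, 1, 0),
]
db = NoisyCountingDatabase(rows, sigma=0.5, max_queries=2, rng=random.Random(0))
exact = db.exact_answer(range(len(rows)), third_attribute)
noisy = db.query(range(len(rows)), third_attribute)
```

With only six rows the noise is large relative to the answer; the claim on the slide concerns large n, where noise well below √n is smaller than the sampling error.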
34
Our Data, Ourselves
Individuals maintain their own data records
  join a DB by setting an appropriate attribute
Statistical queries via a SFE(SuLQ)
  privacy of SuLQ query ⇒ this SFE is “safe”
Individuals ensure
  data take part in sufficiently few queries
  sufficient random noise is added
0 4 6 3 … 1 0 …
35
Summary
Definitions
  defined isolation and sanitization
Recursive Histogram Sanitizations
  described approach and sketched a robust proof of privacy for a special distribution
  proof exploits high dimensionality (# columns)
Sanitization via perturbations
  utility and privacy via cross-training
Setting the Real World Context
  discussed a radical view of how data might be organized to prevent a powerful class of attacks based on auxiliary data
  SuLQ tool exploits large membership (# rows)
37
Larry Stockmeyer Commemoration
May 21-22, 2005
Baltimore, Maryland
(in conjunction with STOC 2005)
May 21:
  Tutorial by Nick Pippenger (Princeton) on some of Stockmeyer's fundamental results in complexity theory
  Lectures by Miki Ajtai (IBM), Anne Condon (UBC), Cynthia Dwork (Microsoft), Richard Karp (UC Berkeley), Albert Meyer (MIT), and Chris Umans (CalTech).
  Some time will be reserved for personal remarks. Contact Cynthia Dwork if you want to participate in this part of the commemoration.
May 22: Lance Fortnow gives first keynote address to STOC.