
Towards Privacy in Public Databases

Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith,

Larry Stockmeyer, Hoeteck Wee

Work Done at Microsoft Research

2

Database Privacy

Think “Census”: individuals provide information; the Census Bureau publishes sanitized records.

Privacy is legally mandated; what utility can we achieve?

Inherent privacy vs. utility tension: one extreme – complete privacy, no information; the other extreme – complete information, no privacy.

Goals:
Find a middle path: preserve macroscopic properties; “disguise” individual identifying information.
Change the nature of discourse: establish a framework for meaningful comparison of techniques.

3

Outline

Definitions: privacy, defined in the breach; sanitization requirements; utility goals

Example: Recursive Histogram Sanitizations – description of the technique; a robust proof of privacy

Example: “Round” Sanitizations – nice learning properties; privacy via cross-training

Setting the Real World Context – dealing with auxiliary information

4

Outline

Definitions: privacy, defined in the breach; sanitization requirements; utility goals

Example: Recursive Histogram Sanitizations – description of the technique; a robust proof of privacy

Example: “Round” Sanitizations – nice learning properties; privacy via cross-training

Setting the Real World Context – dealing with auxiliary information

5

What do WE mean by privacy?

[Ruth Gavison] Protection from being brought to the attention of others

inherently valuable; attention invites further privacy loss

Privacy is assured to the extent that one blends in with the crowd

Appealing definition; can be converted into a precise mathematical statement…

6

A geometric view

Abstraction: the database consists of points in high-dimensional space R^d.

Points are unlabeled: you are your collection of attributes.

Distance is everything: points are more similar if and only if they are closer.

Real Database (RDB), private: n unlabeled points in d-dimensional space; think of d as the number of sensitive attributes.

Sanitized Database (SDB), public: n’ new points, possibly in a different space.

7

The adversary or Isolator - Intuition

On input SDB and auxiliary information, the adversary outputs a point q ∈ R^d.

q “isolates” a real DB point x if it is much closer to x than to x’s near neighbors; q fails to isolate x if q looks roughly as much like everyone in x’s neighborhood as it looks like x itself.

Tightly clustered points have a smaller radius of isolation.


8

(c,T)-Isolation – the definition

I(SDB, aux) = q. Let δ = ||q − x||: then x is (c, T)-isolated if B(q, cδ) contains fewer than T other points from the RDB.

c – privacy parameter; e.g., c = 4.

[Figure: points q, x, and a neighbor p, with the isolation ball around q.]
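To make the definition concrete, here is a minimal Python sketch, not from the talk, of checking whether a guess q (c, T)-isolates a point x; the function name and the parameter defaults (c = 4 as on the slide, T = 10) are illustrative assumptions.

```python
# Illustrative sketch only: testing the (c,T)-isolation condition from the slide.
# Assumes x is itself a row of rdb.
import numpy as np

def isolates(q, x, rdb, c=4.0, T=10):
    """True if q (c,T)-isolates x: with delta = ||q - x||, the ball
    B(q, c*delta) contains fewer than T RDB points other than x."""
    delta = np.linalg.norm(q - x)
    dists = np.linalg.norm(rdb - q, axis=1)             # distances from q to every RDB point
    inside = dists < c * delta                          # points strictly inside B(q, c*delta)
    others = int(inside.sum()) - int(delta < c * delta) # exclude x itself when it lies inside
    return others < T

# Toy usage: 1000 random points in d = 50 dimensions; q is a near-copy of rdb[0].
rdb = np.random.default_rng(1).random((1000, 50))
print(isolates(rdb[0] + 0.01, rdb[0], rdb))
```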

9

Requirements for the sanitizer

No way of obtaining privacy if AUX already reveals too much!

Sanitization procedure compromises privacy if giving the adversary access to the SDB considerably increases its probability of success

The definition of “considerably” can be forgiving. Formally, quantify over distributions, adversaries, choice of database, and auxiliary information: ∀ D, ∀ I, ∃ I’ such that w.h.p. over the choice of RDB from D and over aux, ∀ x:

|Pr[I(SDB, aux) isolates x] − Pr[I’(aux) isolates x]| is small,

where the probabilities are over the choices made by the sanitizer and by I, I’.
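In LaTeX, the requirement as reconstructed here (the quantifier structure is restored from the garbled slide; writing the allowed gap as ε is our choice):

```latex
% Our reconstruction of the sanitizer requirement; \varepsilon is the allowed gap.
\forall \mathcal{D}\;\; \forall I\;\; \exists I' \text{ s.t., w.h.p.\ over } \mathrm{RDB}\sim\mathcal{D} \text{ and } \mathrm{aux},\;\; \forall x\in\mathrm{RDB}:
\qquad \bigl|\,\Pr[I(\mathrm{SDB},\mathrm{aux})\text{ isolates }x] \;-\; \Pr[I'(\mathrm{aux})\text{ isolates }x]\,\bigr| \;\le\; \varepsilon .
```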

Provides a framework for describing the power of a sanitization method, and hence for comparisons

Aux is going to cause trouble. Ignore it for now.

10

Utility Goals

Pointwise proofs of specific utilities: averages, medians, clusters, regressions, …

Prove there is a large class of interesting utilities for which there are good approximation procedures using sanitized data

11

Outline

Definitions: privacy, defined in the breach; sanitization requirements; utility goals

Example: Recursive Histogram Sanitizations – description of the technique; a robust proof of privacy

Example: “Round” Sanitizations – nice learning properties; privacy via cross-training

Setting the Real World Context – dealing with auxiliary information

12

Recursive Histogram Sanitization

U = d-dimensional cube of side 2. Cut it into 2^d subcubes by splitting along each axis; each subcube has side 1.

For each subcube: if the number of RDB points it contains exceeds 2T, then recurse.

Output: a list of cells and their counts (a sketch follows below).
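A compact Python sketch of the recursion just described; the cell representation (corner, side, count), the parameter T, and the assumption that the data lie in the cube [0, 2)^d are our choices.

```python
# Illustrative sketch of recursive histogram sanitization: cells holding more
# than 2T points are split into 2^d subcubes; the output is a list of
# (corner, side, count) cells. Empty subcubes are simply omitted here.
import numpy as np

def recursive_histogram(points, corner, side, T, cells=None):
    cells = [] if cells is None else cells
    if len(points) <= 2 * T or side < 1e-9:          # publish this cell's count and stop
        cells.append((tuple(corner), side, len(points)))
        return cells
    half = side / 2.0
    codes = ((points - corner) >= half).astype(int)  # 0/1 per axis: which half each point is in
    for code in np.unique(codes, axis=0):            # recurse into each non-empty subcube
        mask = np.all(codes == code, axis=1)
        recursive_histogram(points[mask], corner + half * code, half, T, cells)
    return cells

# Toy usage: 2000 points in the cube of side 2 in d = 3 dimensions, threshold T = 5.
pts = np.random.default_rng(0).random((2000, 3)) * 2.0
print(len(recursive_histogram(pts, np.zeros(3), 2.0, T=5)), "cells published")
```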

13

Recursive Histogram Sanitization

Theorem: ∃ c such that if n points are drawn uniformly from U, then recursive histogram sanitizations are safe with respect to c-isolation: Pr[I(SDB) succeeds] ≤ exp(−Ω(d)).

14

Safety of Recursive Histogram Sanitization

Rough intuition: the expected distance ||q − x|| is ≈ the diameter of the cell, and distances are tightly concentrated around the mean. Multiplying the radius by c therefore captures almost all of the parent cell, which contains at least 2T points.

15

For Very Large Values of n

WLOG we can switch to ball adversaries: the adversary outputs (q, r) and wins if B(q, r) contains at least one RDB point and B(q, cr) contains fewer than T RDB points.

Define a probability density f(x) that captures the adversary’s view of the RDB.

To win with probability ε, I needs:
Pr_f[B(q, r)] ≥ ε/n
Pr_f[B(q, cr)] ≤ (2T + O(log ε^{−1}))/n
so Pr_f[B(q, r)] / Pr_f[B(q, cr)] ≥ ε / (2T + O(log ε^{−1}))

Bound ε by bounding this ratio: it is ≤ 2^{−Ω(d)}, forcing ε ≪ 1 (the combined inequality is shown below).
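Combining the two displayed bounds, with ε written for the adversary's success probability (the specific symbol was lost in transcription and is our reconstruction):

```latex
\frac{\varepsilon}{2T + O(\log \varepsilon^{-1})}
  \;\le\;
\frac{\Pr_f[B(q,r)]}{\Pr_f[B(q,cr)]}
  \;\le\; 2^{-\Omega(d)}
\quad\Longrightarrow\quad
\varepsilon \;\le\; \bigl(2T + O(\log \varepsilon^{-1})\bigr)\, 2^{-\Omega(d)} .
```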

16

Pr_f[B(q,r)] / Pr_f[B(q,cr)]

f(x) = (n_C / n) · (1 / Vol(C)) for x in cell C: the fraction of RDB points landing in cell C, spread uniformly within C.

If r is sufficiently small, the bigger ball captures exp(Ω(d)) times more mass in each subcube than does the smaller ball, which yields ε < 2^{−Ω(d)}.

17

Pr_f[B(q,r)] / Pr_f[B(q,cr)]

f(x) = (n_C / n) · (1 / Vol(C)) for x in cell C: the fraction of RDB points landing in cell C, spread uniformly within C.

If r is sufficiently small, the bigger ball captures exp(Ω(d)) times more mass in each subcube than does the smaller ball.

If r is large, either the small ball captures nothing or the bigger ball captures the entire parent cube.

Either way, isolation cannot occur (c = 16).

18

Proof is Very Robust

Extends to many interesting cases:
non-uniform but bounded-ratio density functions
the isolator knows a constant fraction of the attribute values
the isolator knows lots of RDB points
isolation in few attributes (very weak bounds)

Can be adapted to “round” distributions (balls, spheres, mixtures of Gaussians), with effort [work in progress with K. Talwar].

More general distributions: “good” islands in a sea of zero probability.

19

Outline

Definitions: privacy, defined in the breach; sanitization requirements; utility goals

Example: Recursive Histogram Sanitizations – description of the technique; a robust proof of privacy

Example: “Round” Sanitizations – nice learning properties; privacy via cross-training

Setting the Real World Context – dealing with auxiliary information

20

Round Sanitizations

The privacy of x is linked to its T-radius (the distance from x to its T-th nearest neighbor): randomly perturb x in proportion to its T-radius.

x’ = San(x) ∈_R B(x, T-rad(x)); alternatively, the sphere S(x, T-rad(x)) or a d-dimensional Gaussian.

Intuition: we are blending x in with its crowd.

We are adding to x random noise with mean zero, so several macroscopic properties should be preserved (a sketch follows below).
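A minimal Python sketch of this perturbation, using the uniform-in-ball variant; the T-radius helper, the variable names, and the choice among the three noise options listed on the slide are our assumptions.

```python
# Illustrative sketch: perturb each point x to x' chosen uniformly at random
# from the ball B(x, T-radius(x)), where the T-radius is the distance from x
# to its T-th nearest neighbor in the RDB.
import numpy as np

def t_radius(rdb, i, T):
    dists = np.sort(np.linalg.norm(rdb - rdb[i], axis=1))
    return dists[T]                                    # dists[0] = 0 is the point itself

def round_sanitize(rdb, T, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = rdb.shape
    sdb = np.empty_like(rdb)
    for i in range(n):
        r = t_radius(rdb, i, T)
        direction = rng.normal(size=d)
        direction /= np.linalg.norm(direction)         # uniform random direction
        radius = r * rng.uniform() ** (1.0 / d)        # uniform radius within the ball
        sdb[i] = rdb[i] + radius * direction           # zero-mean perturbation
    return sdb

rdb = np.random.default_rng(1).random((500, 20))
sdb = round_sanitize(rdb, T=10)
print(np.max(np.abs(sdb.mean(axis=0) - rdb.mean(axis=0))))   # means are roughly preserved
```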

21

Nice Learning Properties

A known algorithm for learning mixtures of Gaussians works for clustering sanitized Gaussian data: the original distribution (a mixture of Gaussians) is recovered. Technical issue: the added noise is a function of the data. (Subject of another talk.)

The diameter increases by a factor of at most 3 when finding k clusters that minimize the largest diameter.

22

Privacy for n Sanitized Points?

Given n − 1 points in the clear, the probability of isolating the n-th is O(exp(−d)).

The intuition for extending this to n points is wrong! Privacy of x_n given x_n’ and all the other points in the clear does not imply privacy of x_n given x_n’ and the sanitizations of the others!

Sanitization of the other points reveals information about x_n.

The worry is for the safety of the reference point (the neighbor defining the T-radius), not the principal.

23

Combining the Two Sanitizations

Partition the RDB into two sets, A and B (cross-training):
Compute the recursive histogram sanitization of B.
For each v ∈ A: set σ_v = f(side length of the histogram cell C containing v) and output GSan(v, σ_v), a Gaussian perturbation of v with scale σ_v (see the sketch below).
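A hedged Python sketch of the cross-training step; it reuses the (corner, side, count) cell format of the histogram sketch above, and it assumes f is the identity (σ_v = cell side length), which is our reading of the slide.

```python
# Illustrative sketch of cross-training: B is published only via its recursive
# histogram; each v in A is released as GSan(v, sigma_v), i.e. v plus Gaussian
# noise whose standard deviation is the side of the histogram cell containing v.
import numpy as np

def cell_side(v, cells, default=2.0):
    """Side length of the (corner, side, count) cell containing v."""
    for corner, side, _count in cells:
        if np.all(v >= np.asarray(corner)) and np.all(v < np.asarray(corner) + side):
            return side
    return default                                    # e.g. v falls in an unpublished cell

def cross_train_sanitize(A, cells_of_B, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    return np.array([v + rng.normal(scale=cell_side(v, cells_of_B), size=v.shape)
                     for v in A])

# Toy usage: pretend B's histogram published a single cell covering [0, 2)^3.
A = np.random.default_rng(2).random((100, 3)) * 2.0
print(cross_train_sanitize(A, cells_of_B=[((0.0, 0.0, 0.0), 2.0, 100)]).shape)
```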

24

Cross-Training Privacy

Privacy for B: only histogram information about B is used.

Privacy for A: there is enough variance in enough coordinates of v, even given the cell C containing v and the sanitization v’ of v. The current proof works only for |A| = 2^{o(d)}.

25

Additional Results*

Impossibility results:
∃ interesting utilities that have no sanitization protecting against isolation (cf. SFE).
Impossibility of all-purpose sanitizers: there is always a choice of aux that defeats a certain natural version of privacy. Contrived, but it places a limit on what can be proved. Poly-time-bounded adversary? Connection to obfuscation.

Utility:
Exploit the literature on the power of randomized histograms for data-stream algorithms (e.g., Indyk).

* with assorted collaborators, eg, N, N, S, T

26

Outline

Definitions: privacy, defined in the breach; sanitization requirements; utility goals

Example: Recursive Histogram Sanitizations – description of the technique; a robust proof of privacy

Example: “Round” Sanitizations – nice learning properties; privacy via cross-training

Setting the Real World Context – dealing with auxiliary information

27

A Standard Technique: Cell Suppression

Gestalt: tabular data (many, possibly linked, tables); the entries are cells of
frequency (count) data, or
magnitude data (income, sales, etc.).

Disclosure = small counts: a small count provides a key for a population unique (or almost-unique) individual, and can be used as a key into a different database.

Enormous literature on suppressing “safely”.

Example frequency table with marginal totals:

16  8  5  2 | 31
 1  5 20  3 | 29
17 13 25  5 | 60

28

Connection to Our Definitions

Protection against isolation yields protection against learning a key for a population unique:
isolation in a subspace does not imply isolation in the full-dimensional space …
… but aux may contain other DBs that can be queried to learn the remaining attributes;
the definition mandates protection against all possible aux.
Satisfying the definition ⇒ a key cannot be learned.

29

Connection to Our Definitions

It seems very hard to provide good sanitization in the presence of arbitrary aux:
provably impossible in general;
in any case, one can probably already isolate people based solely on aux.
This suggests we need to control aux.

How should we redesign the world?

30

Two Tools

Secure Function Evaluation (SFE) [Yao, GMW]: a technique permitting Alice, Bob, Carol, and their friends to collaboratively compute a function f of their private inputs, e.g., f(a, b, c, …) = sum(a, b, c, …). Each player learns only what can be deduced from the output and her own input to f.

SuLQ databases [Dwork, Nissim]: provably preserve the privacy of attributes when the rows of the database are mutually independent. Powerful [DwNi; Blum, Dwork, McSherry, Nissim].

31

Statistical Database

Query (S, f): S ⊆ [n], f : {0,1}^d → {0,1}

Exact answer: Σ_{r ∈ S} f(row r)

[Figure: database DB of n persons (rows) by d attributes (columns) with 0/1 entries; the row distribution is D = (D_1, D_2, …, D_n).]

32

Sub-Linear Query (SuLQ) Databases

[Figure: the same n-person, d-attribute 0/1 database; each query answer Σ_{r ∈ S} f(row r) is released with added noise.]

If the number of queries is << n, then privacy can be protected with little noise (per query):
E[noise] = 0; standard deviation << √n, much less than the sampling error! (A sketch follows below.)
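A small Python sketch of this query interface; the Gaussian noise and the particular scale sigma = sqrt(n)/10 are illustrative assumptions, not the calibration from the SuLQ papers.

```python
# Illustrative sketch of a noisy statistical query: the exact answer
# sum_{r in S} f(row r) is released with zero-mean noise whose standard
# deviation is well below sqrt(n).
import numpy as np

def noisy_query(db, S, f, sigma, rng=None):
    """db: n x d 0/1 matrix; S: iterable of row indices; f: maps a row to 0 or 1."""
    rng = np.random.default_rng() if rng is None else rng
    exact = sum(f(db[r]) for r in S)
    return exact + rng.normal(scale=sigma)

n, d = 10_000, 5
db = np.random.default_rng(3).integers(0, 2, size=(n, d))
f = lambda row: int(row[0] and row[2])                 # an example predicate on the attributes
print(noisy_query(db, range(n), f, sigma=np.sqrt(n) / 10))
```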

34

Our Data, Ourselves

Individuals maintain their own data records and join a DB by setting an appropriate attribute.

Statistical queries are made via an SFE of a SuLQ query: privacy of the SuLQ query ⇒ this SFE is “safe”.

Individuals ensure that their data take part in sufficiently few queries and that sufficient random noise is added.

[Figure: an individual’s data record, e.g., 0 4 6 3 … 1 0 …]

35

Summary

Definitions: defined isolation and sanitization.

Recursive histogram sanitizations: described the approach and sketched a robust proof of privacy for a special distribution; the proof exploits high dimensionality (# columns).

Sanitization via perturbations: utility, and privacy via cross-training.

Setting the real-world context: discussed a radical view of how data might be organized to prevent a powerful class of attacks based on auxiliary data; the SuLQ tool exploits large membership (# rows).

36

Larry Joseph Stockmeyer November 13, 1948 - July 31, 2004

37

Larry Stockmeyer Commemoration

May 21-22, 2005, Baltimore, Maryland (in conjunction with STOC 2005)

May 21:
Tutorial by Nick Pippenger (Princeton) on some of Stockmeyer's fundamental results in complexity theory.
Lectures by Miki Ajtai (IBM), Anne Condon (UBC), Cynthia Dwork (Microsoft), Richard Karp (UC Berkeley), Albert Meyer (MIT), and Chris Umans (Caltech).
Some time will be reserved for personal remarks. Contact Cynthia Dwork if you want to participate in this part of the commemoration.

May 22: Lance Fortnow delivers the first keynote address at STOC.