Differential Privacy
Xintao Wu, Oct 31, 2012
Sanitization approaches
• Input perturbation
  – Add noise to data
  – Generalize data
• Summary statistics
  – Means, variances
  – Marginal totals
  – Model parameters
• Output perturbation
  – Add noise to summary statistics
Blending/hiding into a crowd
• K-anonymity based approaches
• Adversary may have various background knowledge to breach privacy
• Privacy models often assume “the adversary’s background knowledge is given”
Classic intuition for privacy
• Privacy means that anything that can be learned about a respondent from the statistical database can be learned without access to the database.
• Security of encryption
  – Anything about the plaintext that can be learned from a ciphertext can be learned without the ciphertext.
• Prior and posterior views about an individual should not change much
Motivation
• Publicly release statistical information about a dataset without compromising the privacy of any individual
Requirement
• Anything that can be learned about a respondent from a statistical database should be learnable without access to the database
• Reduce the knowledge gain of joining the database
• Require that the probability distribution on the public results is essentially the same, independent of whether any individual opts in to, or opts out of, the dataset
Definition
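A standard statement of the definition (a sketch following Dwork et al., 2006): a randomized mechanism M gives ε-differential privacy if for all datasets D_1 and D_2 differing in at most one record, and for all sets of outcomes S ⊆ Range(M),

\[ \Pr[M(D_1) \in S] \;\le\; e^{\epsilon} \cdot \Pr[M(D_2) \in S] \]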
Sensitivity function
• Captures how great a difference must be hidden by the additive noise
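In symbols, for a query f mapping datasets to real vectors, the standard global L1 sensitivity is

\[ \Delta f = \max_{D_1, D_2} \, \lVert f(D_1) - f(D_2) \rVert_1 \]

taken over all pairs of datasets D_1, D_2 differing in at most one record. A count query, for instance, has Δf = 1.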
Laplace distribution noise
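The Laplace density with scale b (the mechanism below uses b = Δf/ε):

\[ \mathrm{Lap}(b): \quad p(x) = \frac{1}{2b} \, e^{-|x|/b} \]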
Gaussian noise
Adding Laplace noise
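A minimal Python sketch of the mechanism (function and variable names here are illustrative, not from the slides):

import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # Noise scale b = Delta_f / epsilon calibrates the noise to the
    # worst-case change one individual can cause in the answer.
    scale = sensitivity / epsilon
    return true_answer + np.random.laplace(loc=0.0, scale=scale)

# A count query has sensitivity 1; with epsilon = 0.1 the noise is Lap(10).
noisy_count = laplace_mechanism(true_answer=42, sensitivity=1.0, epsilon=0.1)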
Proof sketch
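The usual argument: for neighboring datasets D_1, D_2 and any output z, the ratio of the output densities of f(D) + Lap(Δf/ε) is

\[ \frac{p_{D_1}(z)}{p_{D_2}(z)} = \exp\!\left( \frac{\epsilon \left( |z - f(D_2)| - |z - f(D_1)| \right)}{\Delta f} \right) \le \exp\!\left( \frac{\epsilon \, |f(D_1) - f(D_2)|}{\Delta f} \right) \le e^{\epsilon} \]

by the triangle inequality and the definition of Δf.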
[Plots: Laplace density Lap(Δf/ε) for Δf = 1 with ε = 0.01, 0.1, 1, 2, 10, and for Δf = 2, 3, 10000 with ε varying; larger Δf or smaller ε gives a wider, noisier distribution.]
Composition
• Sequential composition -- for a sequence of analyses on the same data, the privacy costs (the ε values) add up.
• Parallel composition -- for disjoint sets, the ultimate privacy guarantee depends only on the worst of the guarantees of each analysis, not the sum.
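In symbols (standard composition results, stated as a sketch): if analyses M_1, ..., M_k are each ε_i-differentially private, then

\[ \text{sequential (same data): } \Big(\textstyle\sum_i \epsilon_i\Big)\text{-DP}, \qquad \text{parallel (disjoint data): } \Big(\max_i \epsilon_i\Big)\text{-DP} \]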
Example
• Let us assume a table with 1000 customers, where each record has the attributes: name, gender, city, cancer, salary.
  – For attribute city, we assume the domain size is 10;
  – for attribute cancer, we only record Yes or No for each customer;
  – for attribute salary, the domain range is 0-10k.
  – The privacy threshold ε is a constant 0.1 set by the data owner.
• Consider one single query: "How many customers got cancer?"
• The adversary is allowed to ask the above query three times (see the sketch below).
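A Python sketch of the arithmetic (assuming the total budget ε = 0.1 is split evenly across the three repetitions; names are illustrative):

epsilon_total = 0.1
sensitivity = 1.0                    # a count changes by at most 1 per customer

# Single cancer-count query: noise scale = 1 / 0.1 = 10.
single_query_scale = sensitivity / epsilon_total

# Three repetitions of the same query: sequential composition, so each
# query may spend only epsilon/3, and the per-query noise scale grows to 30.
per_query_scale = sensitivity / (epsilon_total / 3)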
Example (continued)
• “How many customers got cancer in each city?”
• For one single query “What is the sum of salaries across all customers?”
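Continuing the sketch (assuming salary is capped at 10,000, so one customer changes the sum by at most 10,000; the concrete numbers below are made up for illustration):

import numpy as np

epsilon = 0.1

# Per-city cancer counts: the 10 cities partition the customers, so
# parallel composition charges epsilon only once for the whole histogram;
# each cell gets independent Lap(1/0.1) = Lap(10) noise.
city_counts = np.array([120, 80, 95, 110, 100, 90, 105, 115, 85, 100])
noisy_histogram = city_counts + np.random.laplace(scale=1/epsilon, size=10)

# Salary sum: sensitivity = 10,000 (the domain maximum), so the noise
# scale is 10,000 / 0.1 = 100,000, a far noisier answer than a count.
true_salary_sum = 5_000_000          # made-up true answer
noisy_salary_sum = true_salary_sum + np.random.laplace(scale=10_000/epsilon)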
Type of computation (query)
• some are very sensitive, others are not
• single query vs. query sequence
• query on disjoint sets or not
• outcome expected: number vs. arbitrary
• interactive vs. not interactive
Sensitivity
• Global sensitivity
• Local sensitivity
• Smooth sensitivity
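In symbols (the first two notions; smooth sensitivity, due to Nissim, Raskhodnikova, and Smith, is a smooth upper bound on the local one):

\[ GS_f = \max_{D \sim D'} \lVert f(D) - f(D') \rVert_1, \qquad LS_f(D) = \max_{D' \sim D} \lVert f(D) - f(D') \rVert_1 \]

where D ∼ D' denotes datasets differing in one record: global sensitivity maximizes over all dataset pairs, local sensitivity only over neighbors of the actual input D.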
Different areas of DP
• PINQ
• DM with DP
• Optimizing linear counting queries under differential privacy
  – Matrix mechanism for answering a workload of predicate counting queries
PPDM interface: PINQ
• A programmable privacy preserving layer
• Add calibrated noise to each query
• Need to assign privacy cost budget
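PINQ itself is McSherry's C#/LINQ layer; the following is a hypothetical Python sketch of the same idea (a budget-charging wrapper, not the real PINQ API):

import numpy as np

class PrivateDataset:
    """Hypothetical PINQ-style layer: every released statistic must
    first pay from a fixed privacy budget, then gets calibrated noise."""

    def __init__(self, records, budget):
        self._records = list(records)
        self._budget = budget

    def noisy_count(self, epsilon):
        if epsilon > self._budget:
            raise ValueError("privacy budget exhausted")
        self._budget -= epsilon              # charge the budget first
        # A count has sensitivity 1, so the noise scale is 1/epsilon.
        return len(self._records) + np.random.laplace(scale=1.0 / epsilon)

# Usage: two queries at epsilon = 0.05 each exhaust a 0.1 budget.
ds = PrivateDataset(records=range(1000), budget=0.1)
print(ds.noisy_count(epsilon=0.05))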
Data Mining with DP
• Previous study: the privacy preserving interface ensures everything about DP
• Problem: inferior results if the interface is simply utilized during data mining
• Solution: consider both together
• DP ID3
  – noisy counts
  – evaluate all attributes in one exponential mechanism query using the entire budget, instead of splitting the budget among multiple queries (sketched below)
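A Python sketch of that single selection step (the exponential mechanism; the quality function, e.g. information gain, and its sensitivity are assumed to be supplied by the caller):

import numpy as np

def choose_split_attribute(attributes, quality, epsilon, sensitivity):
    # Exponential mechanism: sample one attribute with probability
    # proportional to exp(epsilon * score / (2 * sensitivity)), spending
    # the whole budget on a single selection rather than on separate
    # noisy evaluations of every attribute.
    scores = np.array([quality(a) for a in attributes])
    weights = np.exp(epsilon * scores / (2 * sensitivity))
    return np.random.choice(attributes, p=weights / weights.sum())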
DP in Social Networks
• See pages 97-120 of the PAKDD'11 tutorial