Differential Privacy
Xintao Wu, Oct 31, 2012
Sanitization approaches
• Input perturbation
  – Add noise to data
  – Generalize data
• Summary statistics
  – Means, variances
  – Marginal totals
  – Model parameters
• Output perturbation
  – Add noise to summary statistics
Blending/hiding into a crowd
• K-anonymity based approaches
• Adversary may have various background knowledge to breach privacy
• Privacy models often assume “the adversary’s background knowledge is given”
Classic intuition for privacy
• Privacy means that anything that can be learned about a respondent from the statistical database can be learned without access to the database.
• Security of encryption
  – Anything about the plaintext that can be learned from a ciphertext can be learned without the ciphertext.
• Prior and posterior views about an individual should not change much
Motivation
• Publicly release statistical information about a dataset without compromising the privacy of any individual
Requirement
• Anything that can be learned about a respondent from a statistical database should be learnable without access to the database
• Reduce the knowledge gain of joining the database
• Require that the probability distribution on the public results is essentially the same, independent of whether any individual opts in to, or opts out of, the dataset
Definition
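A standard statement of the definition (a sketch following Dwork et al., 2006): a randomized mechanism M gives ε-differential privacy if for all datasets D_1 and D_2 differing in at most one record, and for all sets of outcomes S ⊆ Range(M),

\[ \Pr[M(D_1) \in S] \;\le\; e^{\epsilon} \cdot \Pr[M(D_2) \in S] \]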
Sensitivity function
• Captures how great a difference must be hidden by the additive noise
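In symbols, for a query f mapping datasets to real vectors, the standard global L1 sensitivity is

\[ \Delta f = \max_{D_1, D_2} \, \lVert f(D_1) - f(D_2) \rVert_1 \]

taken over all pairs of datasets D_1, D_2 differing in at most one record. A count query, for instance, has Δf = 1.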
Laplace distribution noise
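The Laplace density with scale b (the mechanism below uses b = Δf/ε):

\[ \mathrm{Lap}(b): \quad p(x) = \frac{1}{2b} \, e^{-|x|/b} \]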
Gaussian noise
Adding Laplace noise
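A minimal Python sketch of the mechanism (function and variable names here are illustrative, not from the slides):

import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # Noise scale b = Delta_f / epsilon calibrates the noise to the
    # worst-case change one individual can cause in the answer.
    scale = sensitivity / epsilon
    return true_answer + np.random.laplace(loc=0.0, scale=scale)

# A count query has sensitivity 1; with epsilon = 0.1 the noise is Lap(10).
noisy_count = laplace_mechanism(true_answer=42, sensitivity=1.0, epsilon=0.1)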
Proof sketch
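The usual argument: for neighboring datasets D_1, D_2 and any output z, the ratio of the output densities of f(D) + Lap(Δf/ε) is

\[ \frac{p_{D_1}(z)}{p_{D_2}(z)} = \exp\!\left( \frac{\epsilon \left( |z - f(D_2)| - |z - f(D_1)| \right)}{\Delta f} \right) \le \exp\!\left( \frac{\epsilon \, |f(D_1) - f(D_2)|}{\Delta f} \right) \le e^{\epsilon} \]

by the triangle inequality and the definition of Δf.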
[Plots: Laplace density Lap(Δf/ε) for Δf = 1 with ε = 0.01, 0.1, 1, 2, 10, and for Δf = 2, 3, 10000 with ε varying; larger Δf or smaller ε gives a wider, noisier distribution.]
Composition
• Sequential composition -- for a sequence of analyses on the same data, the privacy costs (the ε values) add up.
• Parallel composition -- for disjoint sets, the ultimate privacy guarantee depends only on the worst of the guarantees of each analysis, not the sum.
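In symbols (standard composition results, stated as a sketch): if analyses M_1, ..., M_k are each ε_i-differentially private, then

\[ \text{sequential (same data): } \Big(\textstyle\sum_i \epsilon_i\Big)\text{-DP}, \qquad \text{parallel (disjoint data): } \Big(\max_i \epsilon_i\Big)\text{-DP} \]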
Example
• Let us assume a table with 1000 customers, where each record has the attributes: name, gender, city, cancer, salary.
  – For attribute city, we assume the domain size is 10;
  – for attribute cancer, we only record Yes or No for each customer;
  – for attribute salary, the domain range is 0-10k.
  – The privacy threshold ε is a constant 0.1 set by the data owner.
• Consider one single query: "How many customers got cancer?"
• The adversary is allowed to ask the above query three times (see the sketch below).
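A Python sketch of the arithmetic (assuming the total budget ε = 0.1 is split evenly across the three repetitions; names are illustrative):

epsilon_total = 0.1
sensitivity = 1.0                    # a count changes by at most 1 per customer

# Single cancer-count query: noise scale = 1 / 0.1 = 10.
single_query_scale = sensitivity / epsilon_total

# Three repetitions of the same query: sequential composition, so each
# query may spend only epsilon/3, and the per-query noise scale grows to 30.
per_query_scale = sensitivity / (epsilon_total / 3)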
Example (continued)
• “How many customers got cancer in each city?”
• For one single query “What is the sum of salaries across all customers?”
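Continuing the sketch (assuming salary is capped at 10,000, so one customer changes the sum by at most 10,000; the concrete numbers below are made up for illustration):

import numpy as np

epsilon = 0.1

# Per-city cancer counts: the 10 cities partition the customers, so
# parallel composition charges epsilon only once for the whole histogram;
# each cell gets independent Lap(1/0.1) = Lap(10) noise.
city_counts = np.array([120, 80, 95, 110, 100, 90, 105, 115, 85, 100])
noisy_histogram = city_counts + np.random.laplace(scale=1/epsilon, size=10)

# Salary sum: sensitivity = 10,000 (the domain maximum), so the noise
# scale is 10,000 / 0.1 = 100,000, a far noisier answer than a count.
true_salary_sum = 5_000_000          # made-up true answer
noisy_salary_sum = true_salary_sum + np.random.laplace(scale=10_000/epsilon)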
Type of computation (query)
• some are very sensitive, others are not
• single query vs. query sequence
• query on disjoint sets or not
• outcome expected: number vs. arbitrary
• interactive vs. not interactive
Sensitivity
• Global sensitivity
• Local sensitivity
• Smooth sensitivity
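In symbols (the first two notions; smooth sensitivity, due to Nissim, Raskhodnikova, and Smith, is a smooth upper bound on the local one):

\[ GS_f = \max_{D \sim D'} \lVert f(D) - f(D') \rVert_1, \qquad LS_f(D) = \max_{D' \sim D} \lVert f(D) - f(D') \rVert_1 \]

where D ∼ D' denotes datasets differing in one record: global sensitivity maximizes over all dataset pairs, local sensitivity only over neighbors of the actual input D.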
Different areas of DP
• PINQ
• DM with DP
• Optimizing linear counting queries under differential privacy
  – Matrix mechanism for answering a workload of predicate counting queries
PPDM interface: PINQ
• A programmable privacy preserving layer
• Add calibrated noise to each query
• Need to assign privacy cost budget
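PINQ itself is McSherry's C#/LINQ layer; the following is a hypothetical Python sketch of the same idea (a budget-charging wrapper, not the real PINQ API):

import numpy as np

class PrivateDataset:
    """Hypothetical PINQ-style layer: every released statistic must
    first pay from a fixed privacy budget, then gets calibrated noise."""

    def __init__(self, records, budget):
        self._records = list(records)
        self._budget = budget

    def noisy_count(self, epsilon):
        if epsilon > self._budget:
            raise ValueError("privacy budget exhausted")
        self._budget -= epsilon              # charge the budget first
        # A count has sensitivity 1, so the noise scale is 1/epsilon.
        return len(self._records) + np.random.laplace(scale=1.0 / epsilon)

# Usage: two queries at epsilon = 0.05 each exhaust a 0.1 budget.
ds = PrivateDataset(records=range(1000), budget=0.1)
print(ds.noisy_count(epsilon=0.05))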
Data Mining with DP
• Previous study: the privacy preserving interface ensures everything about DP
• Problem: inferior results if the interface is simply utilized during data mining
• Solution: consider both together
• DP ID3
  – noisy counts
  – evaluate all attributes in one exponential mechanism query using the entire budget, instead of splitting the budget among multiple queries (sketched below)
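A Python sketch of that single selection step (the exponential mechanism; the quality function, e.g. information gain, and its sensitivity are assumed to be supplied by the caller):

import numpy as np

def choose_split_attribute(attributes, quality, epsilon, sensitivity):
    # Exponential mechanism: sample one attribute with probability
    # proportional to exp(epsilon * score / (2 * sensitivity)), spending
    # the whole budget on a single selection rather than on separate
    # noisy evaluations of every attribute.
    scores = np.array([quality(a) for a in attributes])
    weights = np.exp(epsilon * scores / (2 * sensitivity))
    return np.random.choice(attributes, p=weights / weights.sum())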
DP in Social Networks
• See pages 97-120 of the PAKDD'11 tutorial