Making Private User Data Accessible
for Information Retrieval Research:
Data Sharing with Differential Privacy
Li Xiong
Department of Mathematics and Computer Science
Department of Biomedical Informatics
Emory University
Privacy Preserving IR Workshop (PIR)
Santiago, Chile, August 13, 2015
Using user data for IR research
• Query logs, web browsing sessions, location/context data
• Trend detection, web traffic monitoring, anomaly detection, location/context-aware IR
• Privacy and confidentiality constraints
Traditional De-identification and Anonymization
[Figure: original data records are de-identified/anonymized into sanitized records]
• Attribute suppression, encoding, perturbation, generalization
• Subject to re-identification and disclosure attacks
Data Sharing with Differential Privacy
[Figure: original data is released as differentially private statistics, models, or synthetic records]
• More rigorous privacy guarantee
• Macro data (as opposed to micro data)
• Output perturbation (as opposed to input perturbation)
Outline
• Preliminaries: differential privacy
• Sharing web browsing data with differential privacy for IR research
  • Aggregated web visits
  • Sequential patterns
• Challenges and discussion
Statistical Data Sharing: Differential Privacy with Output Perturbation
[Figure: original records → original histogram → histogram perturbed with differential privacy]
Preliminaries: Differential Privacy
A privacy mechanism A gives ε-differential privacy if for all neighboring databases D, D′, and for any possible output S ⊆ Range(A),
  Pr[A(D) = S] ≤ exp(ε) × Pr[A(D′) = S]
• D and D′ are neighboring databases if they differ in one record
• Laplace Mechanism
  For example, for a single counting query Q over a dataset D, returning Q(D) + Laplace(1/ε) gives ε-differential privacy; the noise is calibrated to the global sensitivity of Q, which is 1 for a counting query.
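The Laplace mechanism for a counting query can be sketched in a few lines. This is a minimal illustration, not the talk's implementation; `dp_count` and `laplace_noise` are names of my own, and the inverse-CDF sampler stands in for a library routine.

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from Laplace(0, scale) by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(data, predicate, epsilon):
    """epsilon-DP counting query: global sensitivity is 1, so the true
    count gets Laplace(1/epsilon) noise."""
    true_count = sum(1 for record in data if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical sessions: noisy count of sessions that visited "news".
sessions = [["news", "sports"], ["weather"], ["news"]]
noisy = dp_count(sessions, lambda s: "news" in s, epsilon=0.5)
```

Averaged over many runs the noisy count concentrates around the true count of 2, while any single release stays ε-differentially private.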
Preliminaries: Differential Privacy
• Composition properties
  • Sequential composition: Σᵢ εᵢ-differential privacy
  • Parallel composition: max(εᵢ)-differential privacy
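As a quick illustration of the two composition rules (function names are mine, not from the talk): mechanisms that touch the same records add their budgets, while mechanisms on disjoint partitions of the data only cost the largest one.

```python
def sequential_budget(epsilons):
    """Sequential composition: mechanisms run over the same data
    compose to sum(epsilon_i)-differential privacy."""
    return sum(epsilons)

def parallel_budget(epsilons):
    """Parallel composition: mechanisms run over disjoint partitions
    of the data compose to max(epsilon_i)-differential privacy."""
    return max(epsilons)

# Three queries over the same log: total budget 0.6.
# The same three queries over three disjoint partitions: budget 0.3.
total_seq = sequential_budget([0.1, 0.2, 0.3])
total_par = parallel_budget([0.1, 0.2, 0.3])
```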
Outline
• Preliminaries: differential privacy
• Sharing user data with differential privacy for IR research
  • Aggregated user behavior
  • Sequential user behavior patterns
• Challenges
Problem: Aggregated Web Visits
• Individual browsing sessions, e.g. visits to "msnbc.com"
• Aggregation of web visits at time k: x_k^msn, x_k^news, …
  • x_k^i: the number of requests of page i at time k
  • N: total number of web pages (N aggregates)
• Privacy goal: protect the presence of individual sessions
Liyue Fan, Li Xiong, Vaidy Sunderam. Monitoring web browsing behavior with differential privacy. WWW 2014
Problem: Single Aggregated Time-series
• A univariate, discrete time series X = {x_k} with 0 ≤ k < T
  • x_k: number of web page requests at time k
  • Other examples: hourly traffic counts at an intersection, daily flu counts
• Problem: given time series X and differential privacy budget α, release an α-differentially private series R with high utility
• Utility: relative error between the released series R and the original series X
[Figure: original series X vs. released series R over time k, with the error between them]
Baseline: Laplace Perturbation Algorithm (LPA)
• At each time point k, perturb the aggregate:
  r_k = x_k + ñ, ñ ~ Lap(1/(α/T)) = Lap(T/α)
• Aggregate time series X → released time series R
• High perturbation error: O(T)
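A minimal sketch of the LPA baseline, assuming the budget α is split uniformly over the T timestamps so each count gets Lap(T/α) noise; `lpa` and the sampler are illustrative names, not the paper's code.

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from Laplace(0, scale) by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def lpa(series, alpha):
    """Laplace Perturbation Algorithm: each of the T counts receives an
    alpha/T share of the budget, i.e. Lap(T/alpha) noise per count.
    The per-point noise scale grows linearly with T."""
    T = len(series)
    return [x + laplace_noise(T / alpha) for x in series]

released = lpa([120, 135, 150, 90], alpha=1.0)
```

Note how a longer series directly inflates the noise scale T/α, which is the O(T) perturbation error the slide points out.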
FAST: Filtering and Adaptive Sampling for Aggregate Time-series Monitoring
• Filtering: model-based posterior estimation
• Adaptive Sampling: feedback-based sampling
• L. Fan, L. Xiong. Real-Time Aggregate Monitoring with Differential Privacy. CIKM 2012
• L. Fan, L. Xiong, V. Sunderam. FAST: Differentially Private Real-Time Aggregate Monitor with Filtering and Adaptive Sampling (demo track). SIGMOD 2013
Filtering: State-Space Model
• Process model: x_{k+1} = x_k + ω, with process noise ω ~ N(0, Q)
• Measurement model: z_k = x_k + ν, with measurement (perturbation) noise ν ~ Lap(λ)
• Given the noisy measurement z_k, how do we estimate the true state x_k?
[Figure: hidden states x_k → x_{k+1} evolve under the process model; perturbation produces the observed measurements z_k, z_{k+1}]
Posterior Estimation
• Denote Z_k = {z_1, …, z_k}: the noisy observations up to time k
• Posterior estimate: x̂_k = E(x_k | Z_k)
• Posterior distribution:
  p(x_k | Z_k) = p(x_k | Z_{k−1}) p(z_k | x_k) / p(z_k | Z_{k−1})
• Challenge: p(z_k | Z_{k−1}) and p(x_k | Z_{k−1}) are difficult to compute when p(z_k | x_k) is not Gaussian (here it is Laplace)
Filtering: Solutions
• Option 1: approximate the measurement noise with a Gaussian, ν ~ N(0, R)
  → the Kalman filter
• Option 2: estimate the posterior density by a Monte Carlo method,
  p(x_k | Z_k) ≈ Σ_{m=1}^{M} ω_k^m δ(x_k − x_k^m)
  where {x_k^m, ω_k^m}_{m=1}^{M} is a set of weighted samples/particles
  → particle filters
• Liyue Fan, Li Xiong. Adaptively Sharing Real-Time Aggregates with Differential Privacy. IEEE TKDE, 2014
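Option 1 can be sketched as a one-dimensional Kalman filter: the Laplace perturbation noise is approximated by N(0, R), and the constant process model from the state-space slide drives the prediction. The Q and R values below are hypothetical, chosen only to illustrate the smoothing.

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from Laplace(0, scale) by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def kalman_denoise(noisy, Q, R, P0=1e6):
    """1-D Kalman filter for the constant process model x_{k+1} = x_k + w,
    w ~ N(0, Q), with measurements z_k = x_k + v, v approximated as
    N(0, R). Returns the posterior estimates E(x_k | z_1..z_k)."""
    x, P = noisy[0], P0
    estimates = []
    for z in noisy:
        P = P + Q                  # predict: uncertainty grows by Q
        K = P / (P + R)            # Kalman gain
        x = x + K * (z - x)        # update: blend prediction and measurement
        P = (1.0 - K) * P
        estimates.append(x)
    return estimates

# Hypothetical pipeline: perturb a flat series, then filter.
true_series = [100.0] * 50
noisy_series = [x + laplace_noise(5.0) for x in true_series]
posterior = kalman_denoise(noisy_series, Q=1.0, R=2 * 5.0**2)  # Lap(b) variance = 2b^2
```

Because the released value is a function of the already-perturbed measurements, the filtering step consumes no extra privacy budget; it only improves utility.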
FAST: Filtering and Adaptive Sampling for Aggregate Time-series Monitoring
• Filtering: model-based posterior estimation
• Adaptive Sampling: feedback-based sampling
Adaptive Sampling
• Fixed sampling: difficult to select the sampling rate a priori
• Adaptive sampling: adjust the sampling rate based on feedback from observed data dynamics
Adaptive Sampling: PID Control
• Feedback error: measures how well the data model describes the current trend
• PID error (Δ): compound of proportional, integral, and derivative errors
  • Proportional: current error
  • Integral: integral of errors in a recent time window
  • Derivative: change rate of errors
• Determines a new sampling interval:
  I′ = I + θ(1 − e^{(Δ−ξ)/ξ})
  where θ represents the magnitude of change and ξ is the set point for the sampling process
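The PID error and the interval update above can be sketched as follows; the gains kp, ki, kd and the values of θ and ξ are hypothetical illustration choices, not the paper's tuned parameters.

```python
import math

def pid_error(errors, kp=0.8, ki=0.1, kd=0.1):
    """Compound PID error over a window of recent feedback errors:
    proportional (latest error), integral (window average),
    derivative (change rate across the window)."""
    p = errors[-1]
    i = sum(errors) / len(errors)
    d = (errors[-1] - errors[0]) / max(len(errors) - 1, 1)
    return kp * p + ki * i + kd * d

def next_interval(I, delta, theta=5.0, xi=0.1):
    """I' = I + theta * (1 - e^((delta - xi)/xi)): shrink the sampling
    interval when the PID error delta exceeds the set point xi, grow it
    when the model tracks the data well."""
    return max(1, round(I + theta * (1.0 - math.exp((delta - xi) / xi))))
```

At delta = ξ the interval is unchanged; a small error lengthens the interval (fewer samples, less perturbation), a large error shortens it so the model can catch up with the trend.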
Evaluation: Data Sets
• Synthetic data with 1,000 data points:
  • Linear: the process model
  • Logistic: x_k = A(1 + e^{−k})^{−1}
  • Sinusoidal: x_k = A · sin(ωk + φ)
• Flu: CDC flu data 2006-2010, 209 data points
• Traffic: UW/intelligent transportation systems research 2003-2004, 540 data points
• Unemployment: St. Louis Federal Reserve Bank, 478 data points
Illustration: Original data stream vs. released data stream
[Figure: original vs. FAST-released streams on the Flu and Traffic data sets]
• FAST releases fewer data points yet achieves higher data utility/integrity, with a formal privacy guarantee
Fixed Sampling vs. Adaptive Sampling
• Tradeoff between sampling error and perturbation error
• Adaptive sampling achieves close-to-optimal results without a priori knowledge
Aggregated Browsing Behavior
• Individual browsing sessions, e.g. visits to "msnbc.com"
• Aggregation of web visits at time k: x_k^msn, x_k^news, …
  • x_k^i: the number of requests of page i at time k
  • N: total number of web pages (N aggregates)
Univariate Time Series Models
• Individual models for each web page i
• Little prior knowledge: constant model
  • Process model: x_{k+1}^i = x_k^i + ω_k^i, ω_k^i ~ N(0, Q_i)
  • Measurement model: z_k^i = x_k^i + ν_k^i, ν_k^i ~ Lap(l_max/α), where l_max bounds each session's contribution
  • Gaussian approximation: ν_k^i ~ N(0, R)
Multivariate Time Series Models
• Web browsing behavior → first-order Markov chain [Cadez et al. 2000]
• X_k = (x_k^1, x_k^2, … x_k^N)^T, ω_k = (ω_k^1, ω_k^2, … ω_k^N)^T
• Process model:
  X_{k+1} = A X_k + ω_k, ω_k ~ N(0, Σ)
  • A is the N×N transition matrix; a_{i,j}: transition probability from page j to page i
  • Σ is the diagonal N×N covariance matrix; Σ_{i,i}: process noise variance
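To make the process model concrete, here is a tiny simulation step; the 2-page transition matrix A and the noise levels are made up for illustration (columns of A sum to 1, since a_{i,j} is the probability of moving from page j to page i).

```python
import random

def step(A, x, sigma):
    """One step of the process model X_{k+1} = A X_k + omega, with
    independent Gaussian process noise per page (diagonal covariance)."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) + random.gauss(0.0, sigma[i])
            for i in range(n)]

# Hypothetical 2-page chain: column j holds the outgoing probabilities
# of page j, so multiplying by A redistributes the page-visit counts.
A = [[0.9, 0.2],
     [0.1, 0.8]]
x = [100.0, 50.0]
x_next = step(A, x, sigma=[1.0, 1.0])
```

With zero noise this matrix leaves (100, 50) fixed, which shows how A encodes the expected flow of visits between pages from one timestamp to the next.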
Evaluation: Web Browsing Data
• MSNBC data set from the UCI repository
  • 989,818 browsing sessions
  • 17 web page categories
  • Longest session: 14,975 page requests
  • Average session length: 4.7
• Simulated dynamic browsing sessions
  • Poisson arrival model
  • Randomly sampled sessions, bounded by a maximum length l_max; longer sessions are truncated
Outline
• Preliminaries: differential privacy
• Sharing user data with differential privacy for IR research
  • Aggregated user behavior
  • Sequential behavior patterns
• Challenges
Sequential Pattern Release: Prefix Tree Approach
Original records (pages visited at t1, t2, t3, …):
  Name     t1  t2  t3  …
  Alice    F   B   B   …
  Bob      F   S   N   …
  Charlie  S   H   H   …
  …
DP prefix tree (noisy counts per prefix):
  All: 100
    F: 30
    S: 70
      SH: 40
      SN: 30
      SF: 0
• Accurate for prefix patterns
• Large aggregated error for substring patterns
Luca Bonomi, Li Xiong, Rui Chen, and Benjamin C. M. Fung. Frequent grams based Embedding for Privacy Preserving Record Linkage. CIKM 2012
Sequential Pattern Release: Two-Phase Approach
• Prefix tree miner for prefix patterns
  • Compute the top-k′ frequent patterns (k′ > k)
• Transform the dataset to an optimal length-constrained fingerprint
  • Refine the frequency counts of the top-k′ frequent patterns to obtain the top-k patterns
A Two-Phase Algorithm for Mining Sequential Patterns with Differential Privacy. CIKM 2013
Differentially Private Sequential Pattern Sharing
• Prefix tree based approach
  • Retains sequence information, both frequent and infrequent
  • Price: not accurate for frequent (substring) sequences
• Differentially private frequent pattern mining
  • Only cares about frequent sequences given a threshold
Non-private FSM — An Example
Database D:
  ID   Record
  100  a→c→d
  200  b→c→d
  300  a→b→c→e→d
  400  d→b
  500  a→d→c→d

C1 (candidate 1-seqs): {a}: 3, {b}: 3, {c}: 4, {d}: 4, {e}: 1
  → scan D → F1 (frequent 1-seqs): {a}: 3, {b}: 3, {c}: 4, {d}: 4

C2 (candidate 2-seqs):
  {a→a}: 0  {a→b}: 1  {a→c}: 3  {a→d}: 3
  {b→a}: 0  {b→b}: 2  {b→c}: 2  {b→d}: 1
  {c→a}: 0  {c→b}: 0  {c→c}: 0  {c→d}: 4
  {d→a}: 0  {d→b}: 1  {d→c}: 1  {d→d}: 0
  → scan D → F2 (frequent 2-seqs): {a→c}: 3, {a→d}: 3, {c→d}: 4

C3 (candidate 3-seqs): {a→c→d}
  → scan D → F3 (frequent 3-seqs): {a→c→d}: 3
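The candidate-generate-then-scan loop above can be sketched as a small Apriori-style miner (function names are mine); on the example database it recovers the frequent 2- and 3-sequences shown on the slide.

```python
from itertools import product

def support(db, pattern):
    """Count records containing pattern as a (non-contiguous) subsequence."""
    def contains(record):
        it = iter(record)
        return all(item in it for item in pattern)  # consumes the iterator
    return sum(1 for record in db if contains(record))

def mine(db, min_sup):
    """Apriori-style frequent sequence mining (non-private baseline):
    join frequent (k-1)-sequences into length-k candidates, scan the
    database for their supports, keep those meeting the threshold."""
    items = sorted({x for record in db for x in record})
    freq = {(i,): s for i in items if (s := support(db, (i,))) >= min_sup}
    result = dict(freq)
    while freq:
        # Join step: p and q overlap on all but their first/last item.
        cands = {p + (q[-1],) for p, q in product(freq, freq) if p[1:] == q[:-1]}
        freq = {c: s for c in cands if (s := support(db, c)) >= min_sup}
        result.update(freq)
    return result

db = [list("acd"), list("bcd"), list("abced"), list("db"), list("adcd")]
patterns = mine(db, min_sup=3)
```

With min_sup = 3 the miner returns {a→c}: 3, {a→d}: 3, {c→d}: 4 and the single 3-sequence {a→c→d}: 3, matching the tables above.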
Naïve Private FSM
Database D (as before):
  ID   Record
  100  a→c→d
  200  b→c→d
  300  a→b→c→e→d
  400  d→b
  500  a→d→c→d

C1 (candidate 1-seqs), noise ~ Lap(|C1| / ε1):
  {a}: 3 + 0.2   {b}: 3 − 0.4   {c}: 4 + 0.4   {d}: 4 − 0.5   {e}: 1 + 0.8
  → scan D → F1 (noisy support): {a}: 3.2, {c}: 4.4, {d}: 3.5

C2 (candidate 2-seqs over {a, c, d}), noise ~ Lap(|C2| / ε2):
  {a→a}: 0 + 0.2   {a→c}: 3 + 0.3   {a→d}: 3 + 0.2
  {c→a}: 0 − 0.5   {c→c}: 0 + 0.8   {c→d}: 4 + 0.2
  {d→a}: 0 + 0.3   {d→c}: 1 + 2.1   {d→d}: 0 − 0.5
  → scan D → F2 (noisy support): {a→c}: 3.3, {a→d}: 3.2, {c→d}: 4.2, {d→c}: 3.1

C3 (candidate 3-seqs), noise ~ Lap(|C3| / ε3):
  {a→c→d}: 3 + 0   {a→d→c}: 1 + 0.3
  → scan D → F3 (noisy support): {a→c→d}: 3
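One pass of the naïve scheme can be sketched as follows (names are illustrative): every candidate support gets Laplace(|C_k| / ε_k) noise, because a single record can contribute to all |C_k| counts at once, and only candidates whose noisy support clears the threshold survive.

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from Laplace(0, scale) by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_frequent(candidates, supports, epsilon_k, threshold):
    """Naive private FSM, level k: perturb each candidate's support with
    Laplace(|C_k| / epsilon_k) noise and keep the noisy-frequent ones."""
    scale = len(candidates) / epsilon_k
    noisy = {c: supports[c] + laplace_noise(scale) for c in candidates}
    return {c: s for c, s in noisy.items() if s >= threshold}
```

The weakness is visible in the scale: with thousands of candidates, |C_k| / ε_k swamps the true counts, which motivates the candidate pruning on the next slide.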
Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning
• Observation: most candidate sequences are not frequent
• Partition the original database into m disjoint sample databases
• For k-sequences:
  1. Generate candidate k-sequences Ck
  2. Use the k-th sample database to prune Ck down to Ck′
  3. Compute noisy supports of the remaining candidates Ck′ over the original database (Laplace mechanism) to obtain Fk
Shengzhi Xu, Sen Su, Xiang Cheng, Zhengyi Li, Li Xiong. Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning. ICDE 2015
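The partition-then-prune idea can be sketched like this (a simplified illustration of the scheme, with names of my own; thresholds and budgets are left to the caller): disjoint samples mean each record pays its pruning cost only once, and a smaller surviving candidate set Ck′ shrinks the Laplace noise scale in the final counting step.

```python
import random

def support(db, pattern):
    """Count records containing pattern as a (non-contiguous) subsequence."""
    def contains(record):
        it = iter(record)
        return all(x in it for x in pattern)
    return sum(1 for r in db if contains(r))

def partition(db, m, seed=0):
    """Disjoint random partition of the database into m sample DBs."""
    rng = random.Random(seed)
    shuffled = db[:]
    rng.shuffle(shuffled)
    return [shuffled[i::m] for i in range(m)]

def prune(candidates, sample_db, sample_threshold):
    """Keep only candidates that look frequent in the k-th sample DB;
    only the survivors Ck' go on to the noisy counting step, so the
    |Ck'|/epsilon_k noise scale shrinks accordingly."""
    return [c for c in candidates
            if support(sample_db, c) >= sample_threshold]
```

Usage: generate Ck as before, call `prune(Ck, samples[k-1], t)` on the k-th sample, then run the Laplace-noised support computation only on the pruned set.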
Experiments
• Mining results vs. threshold
[Figure: F-score and relative error (RE) vs. threshold on the MSNBC, BIBLE, and House_Power data sets]
Discussions
• Domain knowledge is important
  • Web browsing patterns using Markov models
• It is easier if you know what you want
  • Aggregated counts vs. sequential patterns
  • Frequent sequential patterns vs. all sequential patterns
• Need user-friendly metrics for understanding the utility and privacy tradeoff
• Need more data and use-case studies to evaluate the feasibility and utility of using DP data for IR research
Q: The synthetic data has noise; how can I trust the data?
• In some cases, we can give reasonable utility guarantees
• It will not replace the raw data, but it can speed up data access
Q: How to set epsilon?
• Understand the privacy risks of the data
• Understand the value of the data: pricing theory
Figure source: A Theory of Pricing Private Data, ICDT 2013 / TODS 2014
Acknowledgement
• Research support
  • Center for Comprehensive Informatics
  • Woodrow Wilson Foundation
  • Cisco research award
• Students
  • James Gardner
  • Yonghui Xiao
• Collaborators
  • Andrew Post, CCI
  • Fusheng Wang, CCI
  • Tyrone Grandison, IBM
  • Chun Yuan, Tsinghua
Location Cloaking: Problem Setting
• Users with true (unobservable) locations share perturbed locations
• Need to guarantee the true location is not disclosed even if an adversary knows the moving patterns and previously released perturbed locations
Location Cloaking: Proposed Solutions
• Extended differential privacy:
  • Compute prior probabilities of the locations based on a Markov model
  • Hide the true location among the probable locations (indistinguishability set)
• Perturbation mechanism:
  • Planar Isotropic Mechanism