The Role of History and Prediction in Data Privacy
Kristen LeFevre
University of Michigan
May 13, 2009
Data Privacy
• Personal information collected every day
– Healthcare, insurance information
– Supermarket transaction data
– RFID, GPS data
– E-mail
– Employment history
– Web search / clickstream
Data Privacy
• Legal, ethical, and technical issues surrounding
– Data ownership
– Data collection
– Data dissemination and use
• Considerable recent interest from the technical community
– High-profile mishaps and lawsuits
– Compliance with data-sharing mandates
Privacy Protection Technologies for Public Datasets
• Goal: Protect sensitive personal information while preserving data utility
• Privacy policies and mechanisms
• Example policies:
– Protect individual identities
– Protect the values of sensitive attributes
– Differential privacy [Dwork 06]
• Example mechanisms:
– Generalize ("coarsen") the data
– Aggregate the data
– Add random noise to the data
– Add random noise to query results
Observations
• Much work has focused on static data
– One-time snapshot publishing
– Disclosure by composing multiple different snapshots of a static database [Xiao 07, Ganta 08]
– Auditing queries on a static database [Chin 81, Kenthapadi 06, …]
• What are the unique challenges when the data evolves over time?
Outline
• Sample problem: Continuously publishing privacy-sensitive GPS traces
– Motivation & problem setup
– Framework for reasoning about privacy
– Algorithms for continuous publishing
– Experimental results
• Applications to other dynamic data (speculation)
GPS Traces (ongoing work w/ Wen Jin, Jignesh Patel)
• GPS devices attached to phones, cars
• Interest in collecting and distributing location traces in real time
– Real-time traffic reporting
– Adaptive pricing / placement of outdoor ads
• Simultaneous concern for personal privacy
• Challenge: Can we continuously collect and publish location traces without compromising individual privacy?
Problem Setting
[Figure: GPS users report their locations at each epoch (7:00 AM, 7:05 AM, …) to a central trace repository; governed by a privacy policy, the repository releases a "sanitized" location snapshot per epoch to data recipients.]
Problem Setting
• Finite population of n users with unique identifiers {u1,…,un}
• Assume users' locations are reported and published in discrete epochs t1, t2, …
• Location snapshot D(tj) associates each user with a location during epoch tj
• Publish a sanitized version D*(tj)
Threat Model
• Attacker wants to determine the location of a target user ui during epoch tj
• Auxiliary Information: Attacker knows location information during some other epochs (e.g., Yellow Pages)
Some Naïve Solutions
• Strawman 1: Replace users’ identifiers ({u1,…,un}) with pseudonyms ({p1,…,pn})
– Problem: Once attacker “unmasks” user pi, he can track her location forever
• Strawman 2: Assign new pseudonyms ({p1j,…,pnj}) at each epoch tj
– Problem: Users can still be tracked using multi-target tracking tools [Gruteser 05, Krumm 07]
Key Problem: Motion Prediction
[Figure: A cluster {Alice, Bob, Charlie} is released on road segments 1–3 at one epoch and on segments 4–6 at the next. If the speed limit is 60 mph, motion prediction rules out most pairings between the two epochs, so Alice's position can be linked across them.]
Threat Model
• Attacker wants to determine the location of a target user ui during epoch tj
• Auxiliary Information: Attacker knows location information during some other epochs (e.g., Yellow Pages)
• Motion prediction: Given one or more locations for ui, attacker can predict (probabilistically) ui’s location during following and preceding epochs
Privacy Principle: Temporal Unlinkability
• Consider an attacker who is able to identify (locate) target user uj during m sequential epochs
• Under reasonable assumptions, he should not be able to locate uj with high confidence during any other epochs*
*Similar in spirit to “mix zones” [Beresford 03], which addressed a related problem in a less-formal way.
Sanitization Mechanism
• Chose a cluster-based sanitization mechanism for maximum flexibility
• Assign each user ui a consistent pseudonym pi
• Divide users into clusters
– Within each cluster, break the association between pseudonyms and locations
• Release candidate for D(tj):
D*(tj) = {(C1(tj), L1(tj)), …, (CB(tj), LB(tj))}
– ∪i=1..B Ci(tj) = {p1,…,pn} (the clusters cover all users)
– Ci(tj) ∩ Ch(tj) = ∅ for i ≠ h (the clusters are disjoint)
– Each Li(tj) contains the (unordered) locations of the users in Ci(tj)
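The cluster-based mechanism above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the `sanitize` function name, the dict-of-pseudonyms layout, and the coordinate values are all assumptions.

```python
import random

def sanitize(snapshot, clusters):
    """Build a release candidate D*(tj) from a location snapshot D(tj).

    snapshot: dict mapping pseudonym -> location at epoch tj
    clusters: disjoint lists of pseudonyms covering all users
    Within each cluster, the pseudonym/location association is broken
    by releasing the cluster's locations as an unordered (shuffled) list.
    """
    release = []
    for cluster in clusters:
        locations = [snapshot[p] for p in cluster]
        random.shuffle(locations)  # break the pseudonym/location association
        release.append((sorted(cluster), locations))
    return release

# Hypothetical epoch with four users in two clusters
D_tj = {"p1": (1.0, 2.0), "p2": (1.5, 2.1), "p3": (9.0, 9.5), "p4": (9.2, 9.9)}
D_star = sanitize(D_tj, [["p1", "p2"], ["p3", "p4"]])
```

An attacker who sees `D_star` learns which locations belong to each cluster, but not which cluster member is at which location.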
Sanitization Mechanism: Example
• Pseudonyms {p1, p2, p3, p4}
[Figure: At t0, clusters {p1,p2} and {p3,p4} are released with anonymized locations 1–2 and 3–4; at t1, the same clusters with locations 5–6 and 7–8; at t2, the re-formed clusters {p1,p3} and {p2,p4} with locations 9–10 and 11–12.]
Reasoning about Privacy
• How can we guarantee temporal unlinkability under the threats of auxiliary information and motion prediction (using the cluster-based sanitization mechanism)?
• Novel framework with two key components
– Motion model: describes location correlations between epochs
– Breach probability function: describes an attacker's ability to compromise temporal unlinkability
Motion Models
• Model motion using an h-step Markov chain
– Conditional probability of a user's location, given his location during the h prior (or future) epochs
– Same motion model used by attacker and publisher
• Forward motion model template
– Pr[Loc(P,Tj) = Lj | Loc(P,Tj-1) = Lj-1, …, Loc(P,Tj-h) = Lj-h]
• Backward motion model template
– Pr[Loc(P,Tj) = Lj | Loc(P,Tj+1) = Lj+1, …, Loc(P,Tj+h) = Lj+h]
• Independent and replaceable component
– For this work, used a 1-step motion model based on a velocity distribution (speed and direction)
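A 1-step velocity-based forward model of this kind might be sketched as follows. The Gaussian speed density, its parameters, and the function name are illustrative assumptions; the slide only says the model is based on a speed/direction distribution.

```python
import math

def forward_prob(loc_prev, loc_curr, epoch_len, speed_mean, speed_std):
    """Illustrative 1-step forward model:
    scores Pr[Loc(P,Tj) = loc_curr | Loc(P,Tj-1) = loc_prev]
    by the implied speed under an assumed Gaussian speed distribution
    (direction taken as uniform, so it contributes no factor here).
    """
    dx = loc_curr[0] - loc_prev[0]
    dy = loc_curr[1] - loc_prev[1]
    speed = math.hypot(dx, dy) / epoch_len  # distance per epoch
    z = (speed - speed_mean) / speed_std
    # Gaussian density of the implied speed (unnormalized over locations)
    return math.exp(-0.5 * z * z) / (speed_std * math.sqrt(2 * math.pi))

# A plausible move scores far higher than one implying an impossible speed
p_slow = forward_prob((0, 0), (1, 0), 1.0, 1.0, 0.3)
p_fast = forward_prob((0, 0), (50, 0), 1.0, 1.0, 0.3)
```

This is the "speed limit" intuition from the earlier slide: a transition that implies driving far above the distribution's mean speed receives negligible probability.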
Motion Models: Example
• Pseudonyms {p1, p2, p3, p4}
• Epochs t0, t1, t2
[Figure: At t0, clusters {p1,p2} and {p3,p4} with users p1–p4 at known positions; at t1, anonymized locations a, b, c, d; at t2, the positions of p1–p4 again. The forward model supplies Pr[Loc(p1,t1) = a | Loc(p1,t0) = x] and Pr[Loc(p1,t1) = b | Loc(p1,t0) = x]; the backward model supplies Pr[Loc(p1,t1) = a | Loc(p1,t2) = y].]
Privacy Breaches
• Forward breach probability
– Pr[Loc(P,Tj) = Lj | D(Tj-1), …, D(Tj-h), D*(Tj)]
• Backward breach probability
– Pr[Loc(P,Tj) = Lj | D(Tj+1), …, D(Tj+h), D*(Tj)]
• Privacy breach: Release candidate D*(Tj) causes a breach iff either of the following holds for confidence threshold C:
– maxP,Lj Pr[Loc(P,Tj) = Lj | D(Tj-1), …, D(Tj-h), D*(Tj)] > C
– maxP,Lj Pr[Loc(P,Tj) = Lj | D(Tj+1), …, D(Tj+h), D*(Tj)] > C
Privacy Breaches: Example
[Figure: At t0, p1 is at location x and p2 at location y in cluster {p1,p2}; {p3,p4} forms the other cluster. At t1, the clusters are released with anonymized locations a, b, c, d.]
e1 = Pr[Loc(p1,t1) = a | Loc(p1,t0) = x]
e2 = Pr[Loc(p1,t1) = b | Loc(p1,t0) = x]
e3 = Pr[Loc(p2,t1) = a | Loc(p2,t0) = y]
e4 = Pr[Loc(p2,t1) = b | Loc(p2,t0) = y]
Pr[Loc(p1,t1) = a | D(t0), D*(t1)] = e1·e4 / (e1·e4 + e2·e3)
Goal: Verify that all (forward and backward) breach probabilities stay below the threshold C.
Checking for Breaches
• Does release candidate D*(Tj) cause a breach?
• Brute-force algorithm
– Exponential in the release candidate's cluster size
• Heuristic pruning tools
– Reduce the search space considerably in practice
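For a single cluster under a 1-step forward model, the brute-force check can be sketched as below. The function names and the `trans` probability table are assumptions for illustration; the enumeration over all user-to-location assignments is what makes the check exponential in cluster size. For a two-user cluster, it reproduces the posterior e1·e4 / (e1·e4 + e2·e3) from the earlier example.

```python
from itertools import permutations

def forward_breach_probs(cluster, locations, trans):
    """Brute-force posterior Pr[Loc(p,Tj) = l | D(Tj-1), D*(Tj)] for one cluster.

    cluster:   list of pseudonyms
    locations: the cluster's released (unordered) locations at Tj
    trans[p][l]: motion-model probability that user p moved to location l,
                 given p's known location at Tj-1
    """
    totals = {(p, l): 0.0 for p in cluster for l in locations}
    norm = 0.0
    for perm in permutations(locations):  # every assignment of users to locations
        w = 1.0
        for p, l in zip(cluster, perm):
            w *= trans[p][l]
        norm += w
        for p, l in zip(cluster, perm):
            totals[(p, l)] += w
    return {k: v / norm for k, v in totals.items()}

def causes_breach(posterior, C):
    """True iff some user/location posterior exceeds the threshold C."""
    return max(posterior.values()) > C

# Two-user example with e1=0.8, e2=0.2, e3=0.3, e4=0.7
post = forward_breach_probs(
    ["p1", "p2"], ["a", "b"],
    {"p1": {"a": 0.8, "b": 0.2}, "p2": {"a": 0.3, "b": 0.7}})
```

Here `post[("p1", "a")]` equals 0.8·0.7 / (0.8·0.7 + 0.2·0.3) ≈ 0.903, so this release would breach any threshold C below that value.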
Publishing Algorithms
• How to publish useful data, without causing a privacy breach?
• The cluster-based sanitization mechanism offers two main options
– Increase cluster size (or change cluster composition)
– Reduce publication frequency
Publishing Algorithms
• General case
– At each epoch Tj, publish the most compact release candidate D*(Tj) that does not cause a breach
– Need to delay publishing until epoch Tj+h to check for backward breaches
– NP-hard optimization problem; proposed alternative heuristics
• Special case
– Durable clusters (same individuals at each epoch)
– Motion model satisfies a symmetry property
– No need to delay publishing
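One way to combine the two options (growing clusters, or skipping an epoch) is a greedy loop like the following sketch. This is an assumed illustration, not the paper's heuristic: `breach_test` is an externally supplied predicate (e.g., a brute-force breach check), and the merge-smallest policy is a simplification.

```python
def publish_epoch(initial_clusters, breach_test, max_cluster_size):
    """Greedy sketch: merge clusters until none causes a breach,
    or suppress the epoch entirely (i.e., reduce publication frequency).

    breach_test(cluster) -> True if releasing `cluster` would push some
    breach probability over the confidence threshold C.
    """
    clusters = [list(c) for c in initial_clusters]
    while any(breach_test(c) for c in clusters):
        if len(clusters) == 1 or max(len(c) for c in clusters) >= max_cluster_size:
            return None  # give up: skip publishing this epoch
        bad = next(c for c in clusters if breach_test(c))
        clusters.remove(bad)
        partner = min(clusters, key=len)  # merge with the smallest other cluster
        clusters.remove(partner)
        clusters.append(bad + partner)
    return clusters
```

With a breach predicate that flags clusters smaller than two users, four singletons get merged pairwise; with a predicate that always flags, the epoch is suppressed.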
Experimental Study
• Used real highway traffic data from the UM Transportation Research Institute
– GPS data sampled from the cars of 72 volunteers
– Sampling rate (epoch) = 0.01 seconds
– Speed range 0–170 km/hour
• Also synthetic data
– Able to control the generative motion distribution
Experimental Study
• All static "snapshot" anonymization mechanisms are vulnerable to motion prediction attacks
– Applied two representative algorithms (r-Gather [Aggarwal 06] and k-Condense [Aggarwal 04])
– Each produces a set of clusters with k users each
[Figure: Experimental results for r-Gather and k-Condense]
Speculation / Future Work
• The GPS example illustrates the importance, for privacy, of reasoning about data dynamics, history, and predictable patterns of change
• Dynamic private data arises in other applications
– E.g., longitudinal social science data
• Study subjects age predictably
• Most people don't move very far
• Income changes predictably
• Hypothesis: History and prediction are important in these settings, too!