HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION
Presented by: Michael Cheng
Supervisor: Dr. William Cheung
Co-Supervisor: Dr. Byron Choi
Presentation Flow
  Privacy-Preserving Data Publishing
  Introduction to Emerging Patterns (EPs)
  Introduction to Equivalence Class
  Introduction to Generalization
  Proposed Problem and Motivation
  Heuristic for the Problem
  Experimental Results
  Future research plan
Privacy-Preserving Data Publishing - Introduction
  Organizations often need to publish or share their data for legitimate reasons
  Sensitive information (e.g. personal identities, restrictive patterns) may be inferred from the published data
Privacy-Preserving Data Publishing - Objective
  Transform the dataset before publishing, such that:
  1. Sensitive information is hidden
     In our case: Emerging Patterns (EPs)
  2. Subsequent analysis remains possible
     In our case: Frequent Itemset (FIS) mining
Introduction to Emerging Patterns (EPs)
  Emerging Patterns (EPs) are itemsets, defined over a pair of datasets, whose support is significant in one dataset but insignificant in the other

  Example (classes: Income >= 50k and Income < 50k):

  Edu  Occup    Marital
  BA   Exec     Married
  BA   Exec     Married
  BA   Exec     Married
  BA   Exec     Married
  MSE  Worker   Never

  Edu  Occup    Marital
  MSE  Exec     Married
  MSE  Exec     Married
  BA   Exec     Married
  BA   Manager  Married
  BA   Repair   Never

  {MSE, Exec} is an Emerging Pattern: it occurs in 2 of the 5 tuples of the second table and in none of the first
Introduction to Emerging Patterns (EPs)
  Formally, growth rate and EPs are defined as follows:
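A minimal sketch of the standard growth-rate definition from the EP literature, applied to the two example tables of the previous slide. It is an illustration under assumptions, not the slide's own formula: the names `support`, `growth_rate`, `is_emerging_pattern` and the threshold `rho` are hypothetical, and the partially shown rows of the second table are assumed to read (MSE, Exec, Married).

```python
import math

def support(itemset, dataset):
    """Fraction of tuples in `dataset` that contain every item of `itemset`."""
    return sum(1 for row in dataset if itemset <= set(row)) / len(dataset)

def growth_rate(itemset, d1, d2):
    """Growth rate from d1 to d2: supp2 / supp1, with the usual edge cases."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0:
        # Absent from both datasets: 0. Present only in d2: infinite growth.
        return 0.0 if s2 == 0 else math.inf
    return s2 / s1

def is_emerging_pattern(itemset, d1, d2, rho):
    """An itemset is an EP from d1 to d2 if its growth rate reaches rho."""
    return growth_rate(itemset, d1, d2) >= rho

# Rows are (Edu, Occup, Marital); items are treated as bare values (assumption).
d1 = [("BA", "Exec", "Married")] * 4 + [("MSE", "Worker", "Never")]
d2 = [("MSE", "Exec", "Married")] * 2 + [
    ("BA", "Exec", "Married"),
    ("BA", "Manager", "Married"),
    ("BA", "Repair", "Never"),
]

print(growth_rate({"MSE", "Exec"}, d1, d2))
print(is_emerging_pattern({"MSE", "Exec"}, d1, d2, rho=3))
```

{MSE, Exec} has support 0 in the first table and 2/5 in the second, so its growth rate is unbounded and it qualifies as an EP for any finite threshold.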
Introduction to Equivalence Class
  Tuples are said to be in the same Equivalence Class w.r.t. a set of attributes A if they take the same values on A

  ID  Edu  Occup    Marital
  1   MSE  Exec     Married
  2   MSE  Exec     Married
  3   BA   Exec     Married
  4   BA   Manager  Married
  5   BA   Repair   Never

  Tuples {1, 2, 3} are in the same Equivalence Class w.r.t. {Occup, Marital}
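A minimal sketch of how equivalence classes can be computed by grouping tuples on their values for the chosen attributes. The function name `equivalence_classes` and the dictionary layout are assumptions for illustration, and tuple 4's occupation is taken to be Manager.

```python
from collections import defaultdict

def equivalence_classes(rows, attrs):
    """Group tuple IDs by the values they take on `attrs`."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in attrs)
        classes[key].append(row["ID"])
    return dict(classes)

rows = [
    {"ID": 1, "Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    {"ID": 2, "Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    {"ID": 3, "Edu": "BA", "Occup": "Exec", "Marital": "Married"},
    {"ID": 4, "Edu": "BA", "Occup": "Manager", "Marital": "Married"},
    {"ID": 5, "Edu": "BA", "Occup": "Repair", "Marital": "Never"},
]

classes = equivalence_classes(rows, ["Occup", "Marital"])
for key, ids in classes.items():
    print(key, ids)
```

Grouping on a different attribute set, for example `["Edu"]`, yields a different partition of the same tuples.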
Introduction to Generalization
  Extensively studied for achieving k-Anonymity
  Not studied before for hiding itemsets
  Modifies the original values in the dataset into more general values according to a user-given hierarchy, so that more tuples share the same set of attribute values
  Example: in Adult, “BA” and “MSE” may be generalized to “Degree Holder”
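A minimal sketch of generalization along a user-given hierarchy, represented here as a child-to-parent mapping. Only the step from BA and MSE to Degree Holder comes from the slide; the root value "Any Education", the name `HIERARCHY` and the function `generalize` are hypothetical.

```python
# Child -> parent mapping. "Any Education" is a made-up root for illustration.
HIERARCHY = {
    "BA": "Degree Holder",
    "MSE": "Degree Holder",
    "Degree Holder": "Any Education",
}

def generalize(value, levels=1, hierarchy=HIERARCHY):
    """Climb `levels` steps up the hierarchy, stopping at the root."""
    for _ in range(levels):
        if value not in hierarchy:
            break
        value = hierarchy[value]
    return value

print(generalize("BA"), generalize("MSE"))
```

After one step, tuples that differed only in BA versus MSE carry the same value and can fall into the same equivalence class.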
Types of Generalization
  Single Dimensional Global Recoding
  Multi Dimensional Global Recoding
  Multi Dimensional Local Recoding
Single Dimensional Global Recoding
  If we decide to generalize some values to a single value, all tuples which contain these values will be affected

  Occup    Occup (after Single Dimensional Global Recoding)
  Exec     Occupation
  Exec     Occupation
  Exec     Occupation
  Manager  Occupation
  Repair   Occupation
Multi Dimensional Global Recoding
  If we decide to generalize some values to a single value, all tuples in the same equivalence class which contain those values will be affected

  Occup    Occup (after Multi Dimensional Global Recoding)
  Exec     Occupation
  Exec     Occupation
  Exec     Occupation
  Manager  Manager
  Repair   Repair
Multi Dimensional Local Recoding
  Same as Multi Dimensional Global Recoding, except without the Equivalence Class constraint: tuples in the same equivalence class may be generalized differently

  Occup    Occup (after Multi Dimensional Local Recoding)
  Exec     Occupation
  Exec     Occupation
  Exec     Exec
  Manager  Manager
  Repair   Repair
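The three recoding types can be contrasted in a short sketch. It is illustrative only: the function names are hypothetical, and the Marital column is an assumption borrowed from the earlier example table so that equivalence classes can be formed (the recoding slides show only Occup).

```python
def single_dim_global(rows, attr, values, general):
    """Every tuple whose `attr` value is in `values` is generalized."""
    return [{**r, attr: general} if r[attr] in values else dict(r) for r in rows]

def multi_dim_global(rows, attrs, class_key, attr, general):
    """Every tuple of the equivalence class `class_key` (w.r.t. `attrs`) is generalized."""
    return [
        {**r, attr: general} if tuple(r[a] for a in attrs) == class_key else dict(r)
        for r in rows
    ]

def multi_dim_local(rows, attrs, class_key, attr, general, count):
    """Only the first `count` tuples of the equivalence class are generalized."""
    out, remaining = [], count
    for r in rows:
        if remaining > 0 and tuple(r[a] for a in attrs) == class_key:
            out.append({**r, attr: general})
            remaining -= 1
        else:
            out.append(dict(r))
    return out

rows = [
    {"Occup": "Exec", "Marital": "Married"},
    {"Occup": "Exec", "Marital": "Married"},
    {"Occup": "Exec", "Marital": "Married"},
    {"Occup": "Manager", "Marital": "Married"},
    {"Occup": "Repair", "Marital": "Never"},
]
key_attrs = ["Occup", "Marital"]
exec_married = ("Exec", "Married")

g1 = single_dim_global(rows, "Occup", {"Exec", "Manager", "Repair"}, "Occupation")
g2 = multi_dim_global(rows, key_attrs, exec_married, "Occup", "Occupation")
g3 = multi_dim_local(rows, key_attrs, exec_married, "Occup", "Occupation", count=2)

for name, result in (("single/global", g1), ("multi/global", g2), ("multi/local", g3)):
    print(name, [r["Occup"] for r in result])
```

Local recoding offers the finest control, since it can generalize just enough tuples to reach a target while leaving the rest of the class untouched.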
Proposed Problem - Why EP and FIS?
  Emerging Patterns may reveal sensitive information
    E.g. in the Adult dataset from the UCI Repository, we found that {Never-Married, Own-Child} is an EP from the class “Income < 50k” to the class “Income >= 50k”, with growth rate 35
  Frequent Itemset mining is a popular data mining task and is supported by commercial data-mining software
Proposed Problem - Why Generalization?
  Other methods studied in PPDP
    For example: adding unknowns, removing tuples, adding fake tuples randomly
    They yield either incomplete information or fake information
  In some applications, completeness and truthfulness of data are important
  By using generalization, we can preserve the completeness and truthfulness of the data
Proposed Problem - Problem Illustration
  [Figure: D -> Transformation (Local Recoding) -> D’; labels: Emerging Patterns, Frequent Itemsets]
Intuition of Local Recoding
  Support of FIS = 40%, Growth Rate of EP = 3
  Frequent Itemset = {Exec, Married}, Emerging Pattern = {MSE, Exec}
  Classes: Income >= 50k and Income < 50k

  Before local recoding:

  Edu  Occup    Marital
  MSE  Exec     Married
  MSE  Exec     Married
  BA   Exec     Married
  BA   Manager  Married
  BA   Repair   Never

  Edu  Occup    Marital
  BA   Exec     Married
  BA   Exec     Married
  BA   Exec     Married
  BA   Worker   Married
  MSE  Manager  Never

  After local recoding:

  Edu  Occup      Marital
  MSE  White col  Married
  MSE  White col  Married
  BA   Exec       Married
  BA   Manager    Married
  BA   Repair     Never

  Edu  Occup      Marital
  BA   Exec       Married
  BA   Exec       Married
  BA   Exec       Married
  BA   Worker     Married
  MSE  White col  Never
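The effect of this local recoding on the EP can be checked with a short sketch. It assumes the standard ratio-of-supports growth rate and that the recoding generalizes Occup to "White col" only for MSE tuples whose occupation is Exec or Manager, as the tables suggest; the names `recode`, `first` and `second` are hypothetical.

```python
from fractions import Fraction

def support(itemset, dataset):
    """Exact fraction of tuples containing every item of `itemset`."""
    return Fraction(sum(itemset <= set(r) for r in dataset), len(dataset))

def growth_rate(itemset, d_from, d_to):
    """Ratio of supports, with the usual conventions when the base support is 0."""
    s_from, s_to = support(itemset, d_from), support(itemset, d_to)
    if s_from == 0:
        return 0 if s_to == 0 else float("inf")
    return s_to / s_from

def recode(dataset, edu, occups, general):
    """Local recoding: only tuples with Edu == edu and Occup in occups change."""
    return [(e, general, m) if e == edu and o in occups else (e, o, m)
            for e, o, m in dataset]

# Rows are (Edu, Occup, Marital), in the order the tables are listed.
first = [("MSE", "Exec", "Married")] * 2 + [
    ("BA", "Exec", "Married"), ("BA", "Manager", "Married"), ("BA", "Repair", "Never")]
second = [("BA", "Exec", "Married")] * 3 + [
    ("BA", "Worker", "Married"), ("MSE", "Manager", "Never")]

before = growth_rate({"MSE", "Exec"}, second, first)

first_r = recode(first, "MSE", {"Exec", "Manager"}, "White col")
second_r = recode(second, "MSE", {"Exec", "Manager"}, "White col")
after = growth_rate({"MSE", "White col"}, second_r, first_r)

print(before, after)
```

Before recoding, {MSE, Exec} appears in only one of the two tables. After recoding, the generalized itemset {MSE, White col} has support 2/5 against 1/5, a growth rate of 2, which is below the threshold of 3.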
Heuristic for the Problem - Greedy Approach
  Repeat:
    D -> Emerging Patterns Mining -> EPs (EP 1, EP 2, EP 3, EP 4)
    Equivalence Classes and their Utility Gain:

      Equivalence Class  Utility Gain
      Class 1            40
      Class 2            90
      Class 3            60
      Class 4            20
      Class 5            15

    Applying the generalization
  Until: all Emerging Patterns are removed
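A minimal sketch of the greedy selection step, assuming that "greedy" means picking the equivalence class with the highest utility gain in each iteration. The function name `greedy_choice` is hypothetical; the gains are the ones listed in the table above.

```python
def greedy_choice(gains):
    """Return the equivalence class with the largest utility gain."""
    return max(gains, key=gains.get)

utility_gain = {
    "Class 1": 40,
    "Class 2": 90,
    "Class 3": 60,
    "Class 4": 20,
    "Class 5": 15,
}

print(greedy_choice(utility_gain))
```

Because the choice is deterministic, the same sequence of recodings is produced on every run.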
Heuristic for the Problem - Greedy Approach
  Drawback: it may get trapped in a local minimum
  Solution: a Simulated Annealing style approach for choosing the equivalence class
Heuristic for the Problem - Simulated Annealing Style Approach
  Choose the Equivalence Class probabilistically
  Two parameters: initial temperature (T0) and cooling rate (α)
  Acceptance probability: proportional to exp(Utility Gain / Temperature), normalized over the candidate classes
  Temperature updating: T_n = α T_{n-1}

  Utility Gain  T=1000  T=100  T=10
  90            0.209   0.302  0.945
  60            0.203   0.223  0.047
  40            0.199   0.183  0.006
  20            0.195   0.150  0.0009
  15            0.194   0.142  0.0005

  Acceptance probability for different utility gains and temperatures
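The table values are consistent with normalizing exp(Utility Gain / Temperature) over the five candidate classes. The sketch below reproduces them under that assumption; the function names are hypothetical, and subtracting the maximum gain before exponentiating is a numerical-stability detail added here that does not change the probabilities.

```python
import math
import random

def acceptance_probabilities(gains, temperature):
    """Normalized exp(gain / T) over all candidates."""
    top = max(gains)
    # Shifting by the maximum avoids overflow at low temperatures.
    weights = [math.exp((g - top) / temperature) for g in gains]
    total = sum(weights)
    return [w / total for w in weights]

def cool(temperature, alpha):
    """Temperature update T_n = alpha * T_{n-1}."""
    return alpha * temperature

def choose_index(gains, temperature, rng=random):
    """Sample one candidate according to its acceptance probability."""
    probs = acceptance_probabilities(gains, temperature)
    return rng.choices(range(len(gains)), weights=probs, k=1)[0]

gains = [90, 60, 40, 20, 15]
for t in (1000, 100, 10):
    print(t, [round(p, 4) for p in acceptance_probabilities(gains, t)])
```

At high temperature the choice is close to uniform, which lets the search escape poor early decisions. As the temperature drops, the distribution concentrates on the highest gain and the behaviour approaches the greedy approach.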
Heuristic for the Problem - Simulated Annealing Style Approach
  Repeat:
    D -> Emerging Patterns Mining -> EPs (EP 1, EP 2, EP 3, EP 4)
    Equivalence Classes and their Probability:

      Equivalence Class  Probability
      Class 1            0.2
      Class 2            0.4
      Class 3            0.1
      Class 4            0.25
      Class 5            0.05

    Applying the generalization and decreasing the temperature
  Until: all Emerging Patterns are removed
Two questions
  How to choose an EP for generalization?
  How to calculate the utility gain?
How to choose an EP for generalization?
  Choose the EP which overlaps the most with the remaining EPs
  It is more likely to hide other EPs simultaneously

  Emerging Patterns
  MSE, Never Married
  BA, Divorced
  BA, Divorced, Worker
  BA, Divorced, Repairman
  BA, Divorced, Own-Child
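A minimal sketch of the selection rule above. The slide does not say how overlap is measured, so the score used here (the total number of items shared with the other EPs) is an assumption, and the names `overlap_score` and `choose_ep` are hypothetical.

```python
def overlap_score(ep, others):
    """Total number of items `ep` shares with the other EPs (assumed measure)."""
    return sum(len(ep & other) for other in others)

def choose_ep(eps):
    """Pick the EP with the highest overlap score; ties go to the first one."""
    return max(eps, key=lambda ep: overlap_score(ep, [o for o in eps if o is not ep]))

eps = [
    frozenset({"MSE", "Never Married"}),
    frozenset({"BA", "Divorced"}),
    frozenset({"BA", "Divorced", "Worker"}),
    frozenset({"BA", "Divorced", "Repairman"}),
    frozenset({"BA", "Divorced", "Own-Child"}),
]

chosen = choose_ep(eps)
print(sorted(chosen))
```

Under this measure the four EPs containing BA and Divorced tie, so any of them is a reasonable choice, while the EP with MSE and Never Married shares nothing with the rest and is never picked first.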
How to calculate utility gain?
  Utility gain is a function of:
    Recoding Distance (RD)
    Reduction of Growth Rate (RG)
How to calculate utility gain? - Recoding Distance (RD)
  The detailed derivation is given in the paper
  Intuitively, it measures:
    How many FIS have been generalized, and by how much?
    How many FIS have disappeared?
  High-level definition of RD:
    θq x (generalized FIS) + (1 - θq) x (disappeared FIS)
    where θq is a user-defined parameter
  The larger the value of RD, the more distortion is generated on the Frequent Itemsets
How to calculate utility gain? - Reduction of Growth Rate (RG)
  After a local recoding is applied, RG is defined as the reduction in the growth rate of all EPs

  Before local recoding:

  Emerging Pattern    Growth Rate
  Executive, Married  10
  BA, Divorced        20
  Executive           30
  Sum of Growth Rate  60

  After local recoding:

  Emerging Pattern    Growth Rate
  White col, Married  5
  BA, Divorced        20
  Sum of Growth Rate  25

  RG = 60 - 25 = 35
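The arithmetic of this example can be written as a small sketch. The function name `reduction_of_growth_rate` and the dictionary layout are assumptions for illustration; the growth rates are those of the example.

```python
def reduction_of_growth_rate(before, after):
    """Sum of EP growth rates before the recoding minus the sum after it."""
    return sum(before.values()) - sum(after.values())

before = {
    ("Executive", "Married"): 10,
    ("BA", "Divorced"): 20,
    ("Executive",): 30,
}
after = {
    ("White col", "Married"): 5,
    ("BA", "Divorced"): 20,
}

rg = reduction_of_growth_rate(before, after)
print(rg)
```

An EP that is no longer an EP after the recoding simply drops out of the second sum, so its whole growth rate counts toward RG.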
How to calculate utility gain?
  Putting all these together, utility gain is defined as:
    θp x RG - (1 - θp) x RD
    where θp is a user-defined parameter
  It favors local recodings which reduce the growth rate substantially
  It penalizes local recodings which generate large distortion on the FIS
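A minimal sketch of the two high-level formulas, θq x (generalized FIS) + (1 - θq) x (disappeared FIS) for RD and θp x RG - (1 - θp) x RD for the utility gain. The function and argument names are hypothetical, and the numeric inputs other than RG = 35 are made-up values; the exact measures behind "generalized FIS" and "disappeared FIS" are defined in the paper and are treated here as plain numbers.

```python
def recoding_distance(generalized_fis, disappeared_fis, theta_q):
    """High-level RD: weighted sum of FIS generalization and FIS loss."""
    return theta_q * generalized_fis + (1 - theta_q) * disappeared_fis

def utility_gain(rg, rd, theta_p):
    """Reward growth-rate reduction (RG), penalize distortion (RD)."""
    return theta_p * rg - (1 - theta_p) * rd

rd = recoding_distance(generalized_fis=4, disappeared_fis=2, theta_q=0.5)
gain = utility_gain(rg=35, rd=rd, theta_p=0.5)
print(rd, gain)
```

Raising θp toward 1 makes the heuristic favor hiding EPs quickly, while lowering it puts more weight on keeping the Frequent Itemsets intact.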
Experimental Setup
  Dataset: Adult dataset from the UCI Repository
    Popular benchmark dataset used for generalization
    Total number of records: 30162
      Income > 50k: 7508
      Income <= 50k: 22654
    Only 8 categorical attributes are used in the experiment
    A well-accepted hierarchy is defined
  Parameters:
    Support of FIS: 40%
    Growth rate of EP: 5
    Initial Temperature: 10
    Cooling Rate: 0.4
Performance
  [Figure: RD / No. of FIS disappeared of the Greedy Approach]
  [Figure: RD / No. of FIS disappeared of the Simulated Annealing Style Approach (Best of 5)]
  Maximum RD: 623.1
  [Figure: Runtime (in minutes) of the Greedy Approach and the Simulated Annealing Style Approach (Best of 5)]
Future Research Plan
  Hide EPs in temporal datasets
  Consider multi-level FIS
  Hide a group of emerging patterns at a time
Q & A
Any Questions?