Data Transformation for Privacy-Preserving Data Mining
Stanley R. M. Oliveira
Database Systems Laboratory
Computing Science Department
University of Alberta, Canada
PhD Thesis - Final Examination
November 29th, 2004
Data Transformation for Privacy-Preserving Data Mining – PhD Final Examination, Nov. 29, 2004 – Stanley Oliveira
Introduction Framework PP-Clustering PP-Assoc. Rules Conclusions

Motivation
• Scenario 1: A collaboration between an Internet marketing company and an on-line retail company.
Objective: find optimal customer targets.
• Scenario 2: Companies sharing their transactions to build a recommender system.
Objective: provide recommendations to their customers.
A Taxonomy of the Existing Solutions
[Fig. 1: A Taxonomy of PPDM Techniques. The existing solutions, and the contributions of this thesis among them, fall into four families: Data Modification, Data Restriction, Data Partitioning, and Data Ownership.]
Problem Definition
• To transform a database into a new one that conceals sensitive information while preserving general patterns and trends from the original database.
[Figure: the transformation process (data sanitization or dimensionality reduction) converts the Original Database into a Transformed Database; data mining over the transformed database reveals only non-sensitive patterns and trends.]
Problem Definition (cont.)
• Sub-Problem 1: Privacy-Preserving Association Rule Mining
Here I do not address the privacy of individuals, but rather the problem of protecting sensitive knowledge.
• Sub-Problem 2: Privacy-Preserving Clustering
I protect the underlying attribute values of objects subjected to clustering analysis.
A Framework for Privacy-Preserving Data Mining

[Figure: a schematic view of the framework for privacy preservation. Original data is transformed either individually (client side) or collectively (server side, via sanitization) into the transformed database; the privacy-preserving framework comprises a library of algorithms, PPDT methods, metrics, and retrieval facilities.]
Privacy-Preserving Association Rule Mining
[Figure: a taxonomy of sanitizing algorithms, organized into three branches (Heuristic 1, Heuristic 2, Heuristic 3).]
Data Sharing-Based Algorithms: Problems

1. Hiding Failure:   HF = #S_R(D') / #S_R(D)

2. Misses Cost:   MC = (#~S_R(D) - #~S_R(D')) / #~S_R(D)

3. Artifactual Patterns:   AP = (|R'| - |R ∩ R'|) / |R'|

4. Difference between D and D':
   Dif(D, D') = (1 / Σ_{i=1}^{n} f_D(i)) × Σ_{i=1}^{n} [f_D(i) - f_D'(i)]

where S_R denotes the sensitive rules and ~S_R the non-sensitive rules, #S_R(X) is the number of such rules discovered in database X, R and R' are the rule sets mined from D and D', and f_X(i) is the frequency of the i-th item in database X.
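These four metrics can be computed directly from the rule sets mined from D and D'. The sketch below is illustrative: the function names and the rule/transaction representations are my own, not the thesis's.

```python
from collections import Counter

def hiding_failure(rules_d, rules_dp, sensitive):
    """HF = #S_R(D') / #S_R(D): fraction of sensitive rules still mined from D'."""
    return len(rules_dp & sensitive) / len(rules_d & sensitive)

def misses_cost(rules_d, rules_dp, sensitive):
    """MC = (#~S_R(D) - #~S_R(D')) / #~S_R(D): legitimate rules lost by sanitization."""
    non_d, non_dp = rules_d - sensitive, rules_dp - sensitive
    return (len(non_d) - len(non_d & non_dp)) / len(non_d)

def artifactual_patterns(rules_d, rules_dp):
    """AP = (|R'| - |R intersect R'|) / |R'|: rules of D' that never held in D."""
    return (len(rules_dp) - len(rules_d & rules_dp)) / len(rules_dp)

def difference(d, d_prime):
    """Dif(D, D') = sum_i |f_D(i) - f_D'(i)| / sum_i f_D(i), over item frequencies."""
    f_d = Counter(i for t in d for i in t)
    f_dp = Counter(i for t in d_prime for i in t)
    return sum(abs(f_d[i] - f_dp[i]) for i in f_d) / sum(f_d.values())
```

For example, if D' hides both sensitive rules but loses one of two non-sensitive rules, then HF = 0 and MC = 0.5; hiding failure and misses cost pull in opposite directions, which is the core trade-off of data sanitization.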
Data Sharing-Based Algorithms

1. Scan the database and identify the sensitive transactions for each sensitive rule;
2. Based on the disclosure threshold ψ, compute the number of sensitive transactions to be sanitized;
3. For each sensitive rule, identify a candidate item that should be eliminated (the victim item);
4. Based on the number found in step 2, remove the victim items from the sensitive transactions.
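The four steps above can be sketched compactly. This is a simplification, not a reproduction of the thesis algorithms (such as IGA): it assumes ψ ∈ [0, 1] is the fraction of sensitive transactions allowed to keep supporting a rule, and it uses one illustrative victim-item heuristic (remove the rule item with the lowest overall frequency, to disturb fewer non-sensitive patterns).

```python
from collections import Counter
from math import ceil

def sanitize(db, sensitive_rules, psi):
    """Return a sanitized copy of db (a list of transactions, each a set of items)."""
    db = [set(t) for t in db]
    freq = Counter(item for t in db for item in t)
    for rule in sensitive_rules:                  # each rule is a set of items
        # Step 1: the sensitive transactions are those supporting the whole rule.
        idx = [i for i, t in enumerate(db) if rule <= t]
        # Step 2: the disclosure threshold psi says how many may keep supporting it.
        n_sanitize = ceil(len(idx) * (1 - psi))
        # Step 3: victim item = the rule item with the lowest overall frequency
        # (illustrative heuristic; ties broken alphabetically for determinism).
        victim = min(sorted(rule), key=lambda item: freq[item])
        # Step 4: remove the victim item from the chosen sensitive transactions.
        for i in idx[:n_sanitize]:
            db[i].discard(victim)
    return db
```

With ψ = 0 every sensitive transaction is altered and the rule disappears entirely; with ψ = 0.5 half of them may survive, trading privacy for lower misses cost.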
Measuring Effectiveness

Misses cost under condition C1: the best algorithm (ψ = 0%, 6 sensitive rules):

Dataset   S1    S2    S3    S4
Kosarak   IGA   IGA   IGA   IGA
Retail    IGA   SWA   RA    RA
Reuters   IGA   IGA   IGA   IGA
BMS-1     IGA   IGA   IGA   IGA

Misses cost under condition C3: the best algorithm (ψ = 0%, varying threshold values):

Dataset   S1    S2    S3    S4
Kosarak   IGA   IGA   IGA   IGA
Retail    IGA   SWA   RA    RRA
Reuters   IGA   IGA   IGA   IGA
BMS-1     IGA   IGA   IGA   IGA

Dif(D, D') under conditions C1 and C3 (ψ = 0%, 6 sensitive rules): SWA is the best algorithm for every dataset and scenario, tied with IGA (IGA/SWA) in one Retail case.

Misses cost under condition C2 (ψ = 0%, varying the number of rules): IGA is the best in nearly every case, with RA best in two Retail scenarios and Algo2a/IGA tied in one case.
Pattern Sharing-Based Algorithm

[Figure: the two approaches compared.
Data Sharing-Based Approach (e.g. IGA): D -> sanitization -> D' -> share -> AR generation -> Association Rules(D').
Pattern Sharing-Based Approach (DSA): D -> AR generation -> Association Rules(D) -> sanitization -> share -> Discovered Patterns(D').]
Pattern Sharing-Based Algorithms: Problems

Notation: R is the set of all rules mined from D; R' is the set of rules to share; S_R are the sensitive rules (rules to hide) and ~S_R the non-sensitive rules.

Problem 1: side effect (non-sensitive rules hidden by accident).
• 1. Side Effect Factor (SEF):   SEF = (|R| - (|R'| + |S_R|)) / (|R| - |S_R|)

Problem 2: inference (recovery of hidden rules).
• 2. Recovery Factor (RF):   RF ∈ {0, 1}, indicating whether a hidden rule can be recovered from the shared rules.
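A sketch of the two measures follows. The recovery-factor implementation encodes one plausible reading (a hidden rule counts as recoverable when all of its proper subset patterns survive among the shared itemsets); this is an assumption, not the thesis's exact procedure.

```python
from itertools import combinations

def side_effect_factor(n_rules, n_shared, n_sensitive):
    """SEF = (|R| - (|R'| + |S_R|)) / (|R| - |S_R|): the share of
    non-sensitive rules accidentally removed during rule sanitization."""
    return (n_rules - (n_shared + n_sensitive)) / (n_rules - n_sensitive)

def recovery_factor(hidden_itemset, shared_itemsets):
    """RF in {0, 1}: 1 if every proper subset of the hidden itemset still
    appears among the shared itemsets, so an adversary may infer it."""
    subsets = {frozenset(c)
               for r in range(1, len(hidden_itemset))
               for c in combinations(hidden_itemset, r)}
    return 1 if subsets <= shared_itemsets else 0
```

For example, with |R| = 100 rules, |S_R| = 10 sensitive rules, and |R'| = 85 shared rules, SEF = 5/90 ≈ 0.056: five non-sensitive rules were lost as a side effect.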
Measuring Effectiveness

The best algorithm in terms of misses cost (ψ = 0%, 6 sensitive rules):

Dataset   S1        S2    S3        S4
Kosarak   IGA       DSA   DSA       DSA
Retail    DSA       DSA   DSA       IGA
Reuters   DSA       DSA   DSA       IGA
BMS-1     DSA       DSA   DSA       DSA

The best algorithm in terms of misses cost, varying the number of rules to sanitize (ψ = 0%):

Dataset   S1        S2    S3        S4
Kosarak   IGA       DSA   DSA       IGA/DSA
Retail    DSA       DSA   IGA/DSA   IGA
Reuters   IGA/DSA   DSA   DSA       IGA
BMS-1     DSA       DSA   DSA       DSA

The best algorithm in terms of side effect factor (ψ = 0%, 6 sensitive rules):

Dataset   S1    S2    S3    S4
Kosarak   IGA   DSA   DSA   DSA
Retail    DSA   DSA   DSA   IGA
Reuters   DSA   DSA   DSA   IGA
BMS-1     DSA   DSA   DSA   DSA
Lessons Learned

• Large datasets are our friends.
• The benefit of index: at most two scans to sanitize a dataset.
• The data sanitization paradox.
• The outstanding performance of IGA and DSA.
• Rule sanitization does not change the support and confidence of the shared rules.
• DSA reduces the flexibility of information sharing.
Privacy-Preserving Clustering (PPC)
• PPC over Centralized Data:
The attribute values subjected to clustering are available in a central repository.
• PPC over Vertically Partitioned Data:
There are k parties sharing data for clustering, where k ≥ 2;
The attribute values of the objects are split across the k parties.
Objects IDs are revealed for join purposes only. The values of the associated attributes are private.
Object Similarity-Based Representation (OSBR)

Example 1: Sharing data for research purposes (OSBR).

Original Data: a sample of the cardiac arrhythmia database (UCI Machine Learning Repository)

ID    age   weight   heart rate   Int_def   QRS   PR_int
123   75    80       63           32        91    193
342   56    64       53           24        81    174
254   40    52       70           24        77    129
446   28    58       76           40        183   25
286   44    90       68           44        109   128

Transformed Data: the corresponding dissimilarity matrix DM

       123     342     254     446     286
123    0
342    2.342   0
254    3.843   2.774   0
446    3.096   3.488   3.671   0
286    3.020   4.280   4.031   3.599   0
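The shared representation can be produced without disclosing attribute values: only pairwise distances leave the data owner. A minimal sketch, assuming plain Euclidean distance (the slide's matrix may additionally involve attribute normalization):

```python
import math

def dissimilarity_matrix(objects):
    """Lower-triangular matrix of pairwise Euclidean distances between objects
    (each object is a tuple of numeric attribute values)."""
    n = len(objects)
    dm = [[0.0] * (i + 1) for i in range(n)]
    for i in range(n):
        for j in range(i):
            dm[i][j] = math.dist(objects[i], objects[j])
    return dm
```

Any distance-based clustering method (e.g. hierarchical clustering) can then run on DM alone, which is exactly what makes the representation privacy-preserving for centralized data.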
Object Similarity-Based Representation (OSBR)

• Limitations of the OSBR:
Expensive in terms of communication cost: O(m²), where m is the number of objects under analysis.
Vulnerable to attacks in the case of vertically partitioned data.
• Conclusion: The OSBR is effective for PPC over centralized data only.
Dimensionality Reduction Transformation (DRBT)
• Random projection from d to k dimensions:

D'_{n×k} = D_{n×d} · R_{d×k}   (a linear transformation),

where D is the original data, D' is the reduced data, and R is a d × k random matrix.

• R is generated by setting each entry r_ij as follows:

(R1): r_ij is drawn i.i.d. from N(0, 1), and the columns of R are then normalized to unit length;

(R2): r_ij = √3 × (+1 with probability 1/6; 0 with probability 2/3; −1 with probability 1/6).
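Both constructions and the projection itself fit in a few lines of pure Python; the function names here are illustrative, not from the thesis.

```python
import math
import random

def random_matrix_r1(d, k, rng):
    """(R1): entries drawn i.i.d. from N(0, 1), columns then scaled to unit length."""
    r = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]
    for j in range(k):
        norm = math.sqrt(sum(r[i][j] ** 2 for i in range(d)))
        for i in range(d):
            r[i][j] /= norm
    return r

def random_matrix_r2(d, k, rng):
    """(R2): r_ij = sqrt(3) * (+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6)."""
    s3 = math.sqrt(3.0)
    return [[rng.choices([s3, 0.0, -s3], weights=(1, 4, 1))[0]
             for _ in range(k)] for _ in range(d)]

def project(data, r):
    """D'_{n x k} = D_{n x d} . R_{d x k}: each row of D is mapped to k dimensions."""
    d, k = len(r), len(r[0])
    return [[sum(row[i] * r[i][j] for i in range(d)) for j in range(k)]
            for row in data]
```

Random projections approximately preserve pairwise distances (the Johnson–Lindenstrauss lemma), which is why clustering quality survives the transformation.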
Dimensionality Reduction Transformation (DRBT)
Original Data: a sample of the cardiac arrhythmia database (UCI Machine Learning Repository)

ID    age   weight   heart rate   Int_def   QRS   PR_int
123   75    80       63           32        91    193
342   56    64       53           24        81    174
254   40    52       70           24        77    129
446   28    58       76           40        183   25
286   44    90       68           44        109   128

Transformed Data:

             RP1                         RP2
ID     Att1     Att2     Att3     Att1     Att2      Att3
123    -50.40   17.33    12.31    -55.50   -95.26    -107.93
342    -37.08   6.27     12.22    -51.00   -84.29    -83.13
254    -55.86   20.69    -0.66    -65.50   -70.43    -66.97
446    -37.61   -31.66   -17.58   -85.50   -140.87   -72.74
286    -62.72   37.64    18.16    -88.50   -50.22    -102.76

RP1: The random matrix is based on the Normal distribution.
RP2: The random matrix is based on a much simpler distribution.
Dimensionality Reduction Transformation (DRBT)
• Security: A random projection from d to k dimensions, where k < d, is a non-invertible linear transformation.
• Space requirement is of the order O(m), where m is the number of objects.
• Communication cost is of the order O(mkl), where l represents the size (in bits) required to transmit a dataset from one party to a central or third party.
• Conclusion: The DRBT is effective for PPC over both centralized data and vertically partitioned data.
DRBT: PPC over Centralized Data

The error produced on the dataset Chess (do = 37):

Transformation   dr = 37   dr = 34   dr = 31   dr = 28   dr = 25   dr = 22   dr = 16
RP1              0.00      0.015     0.024     0.033     0.045     0.072     0.141
RP2              0.00      0.014     0.019     0.032     0.041     0.067     0.131

Average of F-measure (10 trials) for the dataset Accidents (do = 18, dr = 12):

                  K = 2           K = 3           K = 4           K = 5
Transformation    Avg     Std     Avg     Std     Avg     Std     Avg     Std
RP2               0.941   0.014   0.912   0.009   0.881   0.010   0.885   0.006

Average of F-measure (10 trials) for the dataset Iris (do = 5, dr = 3):

                  K = 2           K = 3           K = 4           K = 5
Transformation    Avg     Std     Avg     Std     Avg     Std     Avg     Std
RP2               1.000   0.000   0.948   0.010   0.858   0.089   0.833   0.072
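The "error produced" by reducing dr can be quantified by how much the projection distorts pairwise distances. A common choice is a stress-style measure; the thesis's exact error function is not reproduced here, so treat this sketch as an assumption.

```python
import math

def pairwise_distances(points):
    """Euclidean distances for all pairs i < j."""
    n = len(points)
    return [math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)]

def stress_error(original, reduced):
    """sqrt( sum (d'_ij - d_ij)^2 / sum d_ij^2 ): 0 means pairwise distances
    are perfectly preserved by the dimensionality reduction."""
    d_o = pairwise_distances(original)
    d_r = pairwise_distances(reduced)
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(d_o, d_r)) /
                     sum(a ** 2 for a in d_o))
```

Under such a measure, an identical embedding scores 0, and the score grows with the distortion, matching the trend in the Chess table: the error rises as dr falls.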
DRBT: PPC over Vertically Partitioned Data

The error produced on the dataset Pumsb (do = 74):

No. of Parties   RP1      RP2
1                0.0762   0.0558
2                0.0798   0.0591
3                0.0870   0.0720
4                0.0923   0.0733

Average of F-measure (10 trials) for the dataset Pumsb (do = 74, dr = 38):

                  K = 2           K = 3           K = 4           K = 5
No. of Parties    Avg     Std     Avg     Std     Avg     Std     Avg     Std
1                 0.909   0.140   0.965   0.081   0.891   0.028   0.838   0.041
2                 0.904   0.117   0.931   0.101   0.894   0.059   0.840   0.047
3                 0.874   0.168   0.887   0.095   0.873   0.081   0.801   0.073
4                 0.802   0.155   0.812   0.117   0.866   0.088   0.831   0.078
Contributions of this Research
• Foundations for further research in PPDM.
• A taxonomy of PPDM techniques.
• A family of privacy-preserving methods.
• A library of sanitizing algorithms.
• Retrieval facilities.
• A set of metrics.
Future Research
• Privacy definition in data mining.
• Combining sanitization and randomization for PPARM.
• Transforming data using one-way functions and learning from the distorted data.
• Privacy preservation in spoken language databases.
• Sanitization of document repositories on the Web.