Data Transformation for Privacy-Preserving Data Mining


Transcript of Data Transformation for Privacy-Preserving Data Mining

Page 1: Data Transformation for  Privacy-Preserving Data Mining

Data Transformation for Privacy-Preserving Data Mining

Stanley R. M. Oliveira

Database Systems Laboratory

Computing Science Department

University of Alberta, Canada

PhD Thesis - Final Examination

November 29th, 2004

Database Laboratory

Page 2: Data Transformation for  Privacy-Preserving Data Mining

Motivation

• Scenario 1: A collaboration between an Internet marketing company and an on-line retail company.

Objective: find optimal customer targets.

• Scenario 2: Companies sharing their transactions to build a recommender system.

Objective: provide recommendations to their customers.


Page 3: Data Transformation for  Privacy-Preserving Data Mining


A Taxonomy of the Existing Solutions

Fig. 1: A taxonomy of PPDM techniques. The existing solutions are grouped into four categories: Data Modification, Data Restriction, Data Partitioning, and Data Ownership; the contributions of this thesis are positioned within this taxonomy.

Page 4: Data Transformation for  Privacy-Preserving Data Mining


Problem Definition

• To transform a database into a new one that conceals sensitive information while preserving general patterns and trends from the original database.

The transformation process: the original database is transformed into a new database, and data mining over the transformed database reveals only non-sensitive patterns and trends. Two kinds of transformation are studied:

• Data sanitization

• Dimensionality reduction

Page 5: Data Transformation for  Privacy-Preserving Data Mining


Problem Definition (cont.)

• Sub-Problem 1: Privacy-Preserving Association Rule Mining

I do not address the privacy of individuals here, but rather the problem of protecting sensitive knowledge.

• Sub-Problem 2: Privacy-Preserving Clustering

I protect the underlying attribute values of objects subjected to clustering analysis.


Page 6: Data Transformation for  Privacy-Preserving Data Mining


A Framework for Privacy-Preserving Data Mining (PPDM)

A schematic view of the framework for privacy preservation

The framework couples a client and a server: the original data undergoes either an individual or a collective transformation (sanitization), and the framework's components, a library of algorithms, PPDT methods, metrics, and retrieval facilities, produce the transformed database.

Page 7: Data Transformation for  Privacy-Preserving Data Mining


Privacy-Preserving Association Rule Mining

A taxonomy of sanitizing algorithms: the algorithms are organized under three heuristics (Heuristic 1, Heuristic 2, and Heuristic 3).

Page 8: Data Transformation for  Privacy-Preserving Data Mining


Data Sharing-Based Algorithms: Problems

Notation: S_R is the set of sensitive rules, ~S_R the set of non-sensitive rules, R (R') the set of rules mined from the original database D (the sanitized database D'), and f_X(i) the frequency of item i in database X.

1. Hiding Failure: HF = #S_R(D') / #S_R(D)

2. Misses Cost: MC = (#~S_R(D) - #~S_R(D')) / #~S_R(D)

3. Artifactual Patterns: AP = (|R'| - |R ∩ R'|) / |R'|

4. Difference between D and D': Dif(D, D') = (1 / Σ_{i=1}^{n} f_D(i)) × Σ_{i=1}^{n} [f_D(i) - f_{D'}(i)], where n is the number of distinct items in D.
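To make these measures concrete, here is a minimal sketch in Python (illustrative code, not from the thesis; the rule sets and item frequencies are hypothetical stand-ins produced by whatever mining step precedes this) of how HF, MC, AP, and Dif(D, D') can be computed.

```python
def hiding_failure(sensitive_in_d, sensitive_in_d_prime):
    """HF: fraction of sensitive rules that are still discoverable in D'."""
    return len(sensitive_in_d_prime) / len(sensitive_in_d)

def misses_cost(nonsensitive_in_d, nonsensitive_in_d_prime):
    """MC: fraction of legitimate (non-sensitive) rules accidentally hidden."""
    return (len(nonsensitive_in_d) - len(nonsensitive_in_d_prime)) / len(nonsensitive_in_d)

def artifactual_patterns(rules_d, rules_d_prime):
    """AP: fraction of rules in D' that were not present in D (artifacts)."""
    r, r_prime = set(rules_d), set(rules_d_prime)
    return (len(r_prime) - len(r & r_prime)) / len(r_prime)

def difference(freq_d, freq_d_prime):
    """Dif(D, D'): relative loss of item frequencies caused by sanitization."""
    total = sum(freq_d.values())
    return sum(freq_d[i] - freq_d_prime.get(i, 0) for i in freq_d) / total

# Hypothetical example: rules are ("antecedent", "consequent") pairs.
rules_d = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}
sensitive = {("A", "B")}
rules_d_prime = {("B", "C"), ("C", "D")}           # rules still mined from D'
print(hiding_failure(sensitive, sensitive & rules_d_prime))          # 0.0: fully hidden
print(misses_cost(rules_d - sensitive, rules_d_prime - sensitive))   # ~0.33: one rule lost
```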

Page 9: Data Transformation for  Privacy-Preserving Data Mining

Data Sharing-Based Algorithms

1. Scan a database and identify the sensitive transactions for each sensitive rule;

2. Based on the disclosure threshold, compute the number of sensitive transactions to be sanitized;

3. For each sensitive rule, identify a candidate item that should be eliminated (victim item);

4. Based on the number computed in step 2, remove the victim items from the sensitive transactions (see the sketch below).

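A minimal sketch of these four steps (my own illustrative code, not the thesis implementation; rules are modeled simply as itemsets, the victim item is chosen as the most frequent rule item in the sensitive transactions, and transactions are sanitized in the order they appear, whereas the thesis algorithms make these choices more carefully):

```python
from collections import Counter

def sanitize(database, sensitive_rules, disclosure_threshold):
    """database: list of transactions (sets of items).
    sensitive_rules: list of itemsets to hide.
    disclosure_threshold: fraction of sensitive transactions allowed to survive (0.0 = hide fully).
    """
    db = [set(t) for t in database]
    for rule in sensitive_rules:
        rule = set(rule)
        # Step 1: identify the sensitive transactions for this rule.
        sensitive_tx = [i for i, t in enumerate(db) if rule <= t]
        # Step 2: number of sensitive transactions that must be sanitized.
        num_to_sanitize = round(len(sensitive_tx) * (1 - disclosure_threshold))
        # Step 3: choose a victim item (here: the rule item most frequent in the sensitive transactions).
        counts = Counter(item for i in sensitive_tx for item in db[i] & rule)
        victim = counts.most_common(1)[0][0] if counts else None
        # Step 4: remove the victim item from the selected sensitive transactions.
        for i in sensitive_tx[:num_to_sanitize]:
            db[i].discard(victim)
    return db

# Hypothetical example: hide the rule {A, B} completely (disclosure threshold = 0).
D = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C", "D"}]
print(sanitize(D, [{"A", "B"}], 0.0))
```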

Page 10: Data Transformation for  Privacy-Preserving Data Mining


Measuring Effectiveness

Best-performing algorithm per dataset and scenario (S1-S4), with a disclosure threshold of 0%:

Misses cost under condition C1 (6 sensitive rules):

Dataset   S1    S2    S3    S4
Kosarak   IGA   IGA   IGA   IGA
Retail    IGA   SWA   RA    RA
Reuters   IGA   IGA   IGA   IGA
BMS-1     IGA   IGA   IGA   IGA

Misses cost under condition C3:

Dataset   S1    S2    S3    S4
Kosarak   IGA   IGA   IGA   IGA
Retail    IGA   SWA   RA    RA
Reuters   IGA   IGA   IGA   IGA
BMS-1     IGA   IGA   IGA   IGA

Misses cost under condition C2 (varying the number of sensitive rules): IGA is the best algorithm for nearly all dataset/scenario combinations, with RA and Algo2a/IGA appearing as the best choice in a few cases.

Dif(D, D') under conditions C1 and C3 (6 sensitive rules): SWA is the best algorithm for nearly all dataset/scenario combinations.

Page 11: Data Transformation for  Privacy-Preserving Data Mining


Pattern Sharing-Based Algorithm

Data sharing-based approach (e.g., IGA): the original database D is sanitized into D'; D' is shared, and association rule generation over D' yields Association Rules (D').

Pattern sharing-based approach (DSA): association rule generation is applied to D first, producing Association Rules (D); these rules are then sanitized, and only the sanitized rule set, the discovered patterns of D', is shared.

Page 12: Data Transformation for  Privacy-Preserving Data Mining


Pattern Sharing-Based Algorithms: Problems

Notation: R is the set of all rules mined from D; R' is the set of rules to share; S_R is the set of sensitive rules (the rules to hide); ~S_R is the set of non-sensitive rules; the rules hidden are those removed during rule sanitization.

• Problem 1 (side effect): 1. Side Effect Factor (SEF): SEF = (|R| - (|R'| + |S_R|)) / (|R| - |S_R|)

• Problem 2 (inference): 2. Recovery Factor (RF): RF takes the value 1 when a hidden sensitive rule could be recovered (inferred) from the shared rules, and 0 otherwise.
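As a rough illustration (assuming R, R', and S_R are available as sets of rules; the recovery check is only a naive stand-in for the inference analysis in the thesis), the two measures could be computed as follows:

```python
def side_effect_factor(all_rules, shared_rules, sensitive_rules):
    """SEF = (|R| - (|R'| + |S_R|)) / (|R| - |S_R|): fraction of the non-sensitive
    rules that were removed as a side effect of the rule sanitization."""
    r, r_prime, s_r = len(all_rules), len(shared_rules), len(sensitive_rules)
    return (r - (r_prime + s_r)) / (r - s_r)

def recovery_flag(sensitive_rule, shared_rules):
    """RF in {0, 1}: naive check -- returns 1 if the proper sub-rules still present
    among the shared rules jointly cover every item of the sensitive rule, which
    would let an adversary suspect the hidden rule; 0 otherwise."""
    items = set(sensitive_rule)
    subsets_shared = [set(r) for r in shared_rules if set(r) < items]
    covered = set().union(*subsets_shared) if subsets_shared else set()
    return 1 if covered == items else 0
```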

Page 13: Data Transformation for  Privacy-Preserving Data Mining


Measuring Effectiveness

All results are for a disclosure threshold of 0%.

The best algorithm in terms of misses cost (6 sensitive rules):

Dataset   S1          S2    S3          S4
Kosarak   IGA         DSA   DSA         DSA
Retail    DSA         DSA   DSA         IGA
Reuters   DSA         DSA   DSA         IGA
BMS-1     DSA         DSA   DSA         DSA

The best algorithm in terms of misses cost, varying the number of rules to sanitize:

Dataset   S1          S2    S3          S4
Kosarak   IGA         DSA   DSA         IGA / DSA
Retail    DSA         DSA   IGA / DSA   IGA
Reuters   IGA / DSA   DSA   DSA         IGA
BMS-1     DSA         DSA   DSA         DSA

The best algorithm in terms of side effect factor (6 sensitive rules):

Dataset   S1    S2    S3    S4
Kosarak   IGA   DSA   DSA   DSA
Retail    DSA   DSA   DSA   IGA
Reuters   DSA   DSA   DSA   IGA
BMS-1     DSA   DSA   DSA   DSA

Page 14: Data Transformation for  Privacy-Preserving Data Mining


Lessons Learned

• Large datasets are our friends.

• The benefit of an index: at most two scans to sanitize a dataset.

• The data sanitization paradox.

• The outstanding performance of IGA and DSA.

• Rule sanitization does not change the support and confidence of the shared rules.

• DSA reduces the flexibility of information sharing.


Page 15: Data Transformation for  Privacy-Preserving Data Mining


Privacy-Preserving Clustering (PPC)

• PPC over Centralized Data:

The attribute values subjected to clustering are available in a central repository.

• PPC over Vertically Partitioned Data:

There are k parties sharing data for clustering, where k ≥ 2;

The attribute values of the objects are split across the k parties.

Object IDs are revealed for join purposes only; the values of the associated attributes are private.

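A hypothetical toy example of the vertically partitioned setting (not thesis code): each party holds a different subset of attributes for the same objects, and only the object IDs line the records up; in an actual PPC protocol the attribute values would be transformed before leaving each party.

```python
# Party 1 holds demographic attributes, Party 2 holds clinical attributes.
# Only the IDs are revealed (for the join); attribute values stay with each party.
party1 = {123: {"age": 75, "weight": 80},
          342: {"age": 56, "weight": 64}}
party2 = {123: {"heart_rate": 63, "qrs": 91},
          342: {"heart_rate": 53, "qrs": 81}}

def join_on_ids(*parties):
    """Conceptual join on object IDs: in a PPC protocol the raw values below
    would first be transformed (e.g., by random projection) at each party."""
    ids = set.intersection(*(set(p) for p in parties))
    return {i: {k: v for p in parties for k, v in p[i].items()} for i in sorted(ids)}

print(join_on_ids(party1, party2))
```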

Page 16: Data Transformation for  Privacy-Preserving Data Mining


Object Similarity-Based Representation (OSBR)

Example 1: Sharing data for research purposes (OSBR).

Original data: a sample of the cardiac arrhythmia database (UCI Machine Learning Repository).

ID    age   weight   heart rate   Int_def   QRS   PR_int
123   75    80       63           32        91    193
342   56    64       53           24        81    174
254   40    52       70           24        77    129
446   28    58       76           40        83    251
286   44    90       68           44        109   128

Transformed data: the corresponding dissimilarity matrix DM (pairwise distances between the five records above).

DM     123     342     254     446     286
123    0
342    2.342   0
254    3.843   2.774   0
446    3.096   3.488   3.671   0
286    3.020   4.280   4.031   3.599   0
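A minimal sketch of how such a dissimilarity matrix can be produced (illustrative only; it standardizes the attributes before computing Euclidean distances, while the slide does not show which normalization the thesis uses):

```python
import numpy as np

def dissimilarity_matrix(data):
    """Pairwise Euclidean distances between the rows of `data` (m x d),
    returned as an m x m symmetric matrix with a zero diagonal."""
    X = np.asarray(data, dtype=float)
    # Standardize each attribute so that no single attribute dominates the distance.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Attribute values only (age, weight, heart rate, ...); IDs take no part in the distance.
records = [[75, 80, 63, 32, 91, 193],
           [56, 64, 53, 24, 81, 174],
           [40, 52, 70, 24, 77, 129]]
DM = dissimilarity_matrix(records)
print(np.round(DM, 3))
```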

Page 17: Data Transformation for  Privacy-Preserving Data Mining


• Limitations of the OSBR:

Expensive in terms of communication cost: O(m²), where m is the number of objects under analysis.

Vulnerable to attacks in the case of vertically partitioned data.

Conclusion: The OSBR is effective for PPC over centralized data only.


Page 18: Data Transformation for  Privacy-Preserving Data Mining


Dimensionality Reduction Transformation (DRBT)

• Random projection from d to k dimensions:

D'_{n×k} = D_{n×d} · R_{d×k} (a linear transformation), where D is the original data, D' is the reduced data, and R is a random matrix.

• R is generated by first setting each entry r_ij, as follows:

(R1): r_ij is drawn i.i.d. from N(0, 1), and the columns of R are then normalized to unit length;

(R2): r_ij = √3 × (+1 with probability 1/6, 0 with probability 2/3, −1 with probability 1/6).
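A minimal sketch of the two ways of generating R and of the projection itself (illustrative code; in the DRBT the projected data D' would stand in for the original attribute values before clustering):

```python
import numpy as np

def random_matrix_r1(d, k, rng):
    """R1: entries i.i.d. N(0, 1), then each column normalized to unit length."""
    R = rng.standard_normal((d, k))
    return R / np.linalg.norm(R, axis=0)

def random_matrix_r2(d, k, rng):
    """R2: sqrt(3) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6} (the simpler distribution)."""
    return np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])

def project(D, R):
    """D'_{n x k} = D_{n x d} . R_{d x k}."""
    return np.asarray(D) @ R

rng = np.random.default_rng(0)
D = np.array([[75, 80, 63], [56, 64, 53], [40, 52, 70]], dtype=float)  # n = 3, d = 3
R = random_matrix_r2(d=3, k=2, rng=rng)
print(project(D, R))   # 3 x 2 reduced representation
```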

Page 19: Data Transformation for  Privacy-Preserving Data Mining


Dimensionality Reduction Transformation (DRBT)

Original data: the same sample of the cardiac arrhythmia database (UCI Machine Learning Repository) shown earlier.

Transformed data, after random projection to three attributes:

       RP1                             RP2
ID     Att1     Att2     Att3          Att1     Att2      Att3
123    -50.40   17.33    12.31         -55.50   -95.26    -107.93
342    -37.08   6.27     12.22         -51.00   -84.29    -83.13
254    -55.86   20.69    -0.66         -65.50   -70.43    -66.97
446    -37.61   -31.66   -17.58        -85.50   -140.87   -72.74
286    -62.72   37.64    18.16         -88.50   -50.22    -102.76

RP1: The random matrix is based on the Normal distribution.
RP2: The random matrix is based on a much simpler distribution.

Page 20: Data Transformation for  Privacy-Preserving Data Mining


Dimensionality Reduction Transformation (DRBT)

• Security: A random projection from d to k dimensions, where k ≤ d, is a non-invertible linear transformation.

• Space requirement is of the order O(m), where m is the number of objects.

• Communication cost is of the order O(mkl), where l represents the size (in bits) required to transmit a dataset from one party to a central or third party.

• Conclusion: The DRBT is effective for PPC over centralized data and vertically partitioned data.


Page 21: Data Transformation for  Privacy-Preserving Data Mining


DRBT: PPC over Centralized Data

The error produced on the dataset Chess (do = 37):

Transformation   dr = 37   dr = 34   dr = 31   dr = 28   dr = 25   dr = 22   dr = 16
RP1              0.00      0.015     0.024     0.033     0.045     0.072     0.141
RP2              0.00      0.014     0.019     0.032     0.041     0.067     0.131

Average of F-measure (10 trials) for the dataset Accidents (do = 18, dr = 12):

Data Transformation   K = 2 (Avg, Std)   K = 3 (Avg, Std)   K = 4 (Avg, Std)   K = 5 (Avg, Std)
RP2                   0.941, 0.014       0.912, 0.009       0.881, 0.010       0.885, 0.006

Average of F-measure (10 trials) for the dataset Iris (do = 5, dr = 3):

Data Transformation   K = 2 (Avg, Std)   K = 3 (Avg, Std)   K = 4 (Avg, Std)   K = 5 (Avg, Std)
RP2                   1.000, 0.000       0.948, 0.010       0.858, 0.089       0.833, 0.072

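For reference, a sketch of how an F-measure can compare the clusters found on the original data with those found after the transformation (a simplified version of my own: each original cluster is matched to the transformed cluster that maximizes F, and the per-cluster scores are averaged weighted by cluster size; the thesis may define the matching and weighting differently):

```python
def f_measure(original_clusters, transformed_clusters):
    """original_clusters / transformed_clusters: lists of sets of object IDs."""
    total = sum(len(c) for c in original_clusters)
    score = 0.0
    for c in original_clusters:
        best = 0.0
        for c2 in transformed_clusters:
            overlap = len(c & c2)
            if overlap == 0:
                continue
            precision = overlap / len(c2)
            recall = overlap / len(c)
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (len(c) / total) * best
    return score

# Hypothetical example: clustering preserved except that object 446 changed cluster.
before = [{123, 342, 254}, {446, 286}]
after = [{123, 342, 254, 446}, {286}]
print(round(f_measure(before, after), 3))
```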

Page 22: Data Transformation for  Privacy-Preserving Data Mining


DRBT: PPC over Vertically Partitioned Data

The error produced on the dataset Pumsb (do = 74):

No. of parties   RP1      RP2
1                0.0762   0.0558
2                0.0798   0.0591
3                0.0870   0.0720
4                0.0923   0.0733

Average of F-measure (10 trials) for the dataset Pumsb (do = 74, dr = 38):

No. of parties   K = 2 (Avg, Std)   K = 3 (Avg, Std)   K = 4 (Avg, Std)   K = 5 (Avg, Std)
1                0.909, 0.140       0.965, 0.081       0.891, 0.028       0.838, 0.041
2                0.904, 0.117       0.931, 0.101       0.894, 0.059       0.840, 0.047
3                0.874, 0.168       0.887, 0.095       0.873, 0.081       0.801, 0.073
4                0.802, 0.155       0.812, 0.117       0.866, 0.088       0.831, 0.078

Page 23: Data Transformation for  Privacy-Preserving Data Mining


Contributions of this Research

• Foundations for further research in PPDM.

• A taxonomy of PPDM techniques.

• A family of privacy-preserving methods.

• A library of sanitizing algorithms.

• Retrieval facilities.

• A set of metrics.


Page 24: Data Transformation for  Privacy-Preserving Data Mining


Future Research

• Privacy definition in data mining.

• Combining sanitization and randomization for PPARM.

• Transforming data using one-way functions and learning from the distorted data.

• Privacy preservation in spoken language databases.

• Sanitization of document repositories on the Web.
