Data Transformation for Privacy-Preserving Data Mining
Stanley R. M. Oliveira
Database Systems Laboratory
Computing Science Department
University of Alberta, Canada
PhD Thesis - Final Examination
November 29th, 2004
Data Transformation for Privacy-Preserving Data Mining – PhD Final Examination, Nov. 29, 2004 – Stanley Oliveira
Introduction Framework PP-Clustering PP-Assoc. Rules Conclusions

Motivation
• Scenario 1: A collaboration between an Internet marketing company and an on-line retail company.
Objective: find optimal customer targets.
• Scenario 2: Companies sharing their transactions to build a recommender system.
Objective: provide recommendations to their customers.
A Taxonomy of the Existing Solutions
[Fig. 1: A Taxonomy of PPDM Techniques. The existing solutions, and the contributions of this thesis among them, fall into four families: Data Modification, Data Restriction, Data Partitioning, and Data Ownership.]
Problem Definition
• To transform a database into a new one that conceals sensitive information while preserving general patterns and trends from the original database.
[Figure: the transformation process (data sanitization or dimensionality reduction) converts the Original Database into a Transformed Database; data mining over the transformed database reveals only non-sensitive patterns and trends.]
Problem Definition (cont.)
• Sub-Problem 1: Privacy-Preserving Association Rule Mining
Here I do not address the privacy of individuals, but rather the problem of protecting sensitive knowledge.
• Sub-Problem 2: Privacy-Preserving Clustering
I protect the underlying attribute values of objects subjected to clustering analysis.
A Framework for Privacy-Preserving Data Mining

[Figure: a schematic view of the framework for privacy preservation. Original data is transformed either individually (client side) or collectively (server side, via sanitization) into the transformed database; the privacy-preserving framework comprises a library of algorithms, PPDT methods, metrics, and retrieval facilities.]
Privacy-Preserving Association Rule Mining
[Figure: a taxonomy of sanitizing algorithms, organized into three branches (Heuristic 1, Heuristic 2, Heuristic 3).]
Data Sharing-Based Algorithms: Problems

1. Hiding Failure:   HF = #S_R(D') / #S_R(D)

2. Misses Cost:   MC = (#~S_R(D) - #~S_R(D')) / #~S_R(D)

3. Artifactual Patterns:   AP = (|R'| - |R ∩ R'|) / |R'|

4. Difference between D and D':
   Dif(D, D') = (1 / Σ_{i=1}^{n} f_D(i)) × Σ_{i=1}^{n} [f_D(i) - f_D'(i)]

where S_R denotes the sensitive rules and ~S_R the non-sensitive rules, #S_R(X) is the number of such rules discovered in database X, R and R' are the rule sets mined from D and D', and f_X(i) is the frequency of the i-th item in database X.
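These four metrics can be computed directly from the rule sets mined from D and D'. The sketch below is illustrative: the function names and the rule/transaction representations are my own, not the thesis's.

```python
from collections import Counter

def hiding_failure(rules_d, rules_dp, sensitive):
    """HF = #S_R(D') / #S_R(D): fraction of sensitive rules still mined from D'."""
    return len(rules_dp & sensitive) / len(rules_d & sensitive)

def misses_cost(rules_d, rules_dp, sensitive):
    """MC = (#~S_R(D) - #~S_R(D')) / #~S_R(D): legitimate rules lost by sanitization."""
    non_d, non_dp = rules_d - sensitive, rules_dp - sensitive
    return (len(non_d) - len(non_d & non_dp)) / len(non_d)

def artifactual_patterns(rules_d, rules_dp):
    """AP = (|R'| - |R intersect R'|) / |R'|: rules of D' that never held in D."""
    return (len(rules_dp) - len(rules_d & rules_dp)) / len(rules_dp)

def difference(d, d_prime):
    """Dif(D, D') = sum_i |f_D(i) - f_D'(i)| / sum_i f_D(i), over item frequencies."""
    f_d = Counter(i for t in d for i in t)
    f_dp = Counter(i for t in d_prime for i in t)
    return sum(abs(f_d[i] - f_dp[i]) for i in f_d) / sum(f_d.values())
```

For example, if D' hides both sensitive rules but loses one of two non-sensitive rules, then HF = 0 and MC = 0.5; hiding failure and misses cost pull in opposite directions, which is the core trade-off of data sanitization.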
Data Sharing-Based Algorithms

1. Scan the database and identify the sensitive transactions for each sensitive rule;
2. Based on the disclosure threshold ψ, compute the number of sensitive transactions to be sanitized;
3. For each sensitive rule, identify a candidate item that should be eliminated (the victim item);
4. Based on the number found in step 2, remove the victim items from the sensitive transactions.
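The four steps above can be sketched compactly. This is a simplification, not a reproduction of the thesis algorithms (such as IGA): it assumes ψ ∈ [0, 1] is the fraction of sensitive transactions allowed to keep supporting a rule, and it uses one illustrative victim-item heuristic (remove the rule item with the lowest overall frequency, to disturb fewer non-sensitive patterns).

```python
from collections import Counter
from math import ceil

def sanitize(db, sensitive_rules, psi):
    """Return a sanitized copy of db (a list of transactions, each a set of items)."""
    db = [set(t) for t in db]
    freq = Counter(item for t in db for item in t)
    for rule in sensitive_rules:                  # each rule is a set of items
        # Step 1: the sensitive transactions are those supporting the whole rule.
        idx = [i for i, t in enumerate(db) if rule <= t]
        # Step 2: the disclosure threshold psi says how many may keep supporting it.
        n_sanitize = ceil(len(idx) * (1 - psi))
        # Step 3: victim item = the rule item with the lowest overall frequency
        # (illustrative heuristic; ties broken alphabetically for determinism).
        victim = min(sorted(rule), key=lambda item: freq[item])
        # Step 4: remove the victim item from the chosen sensitive transactions.
        for i in idx[:n_sanitize]:
            db[i].discard(victim)
    return db
```

With ψ = 0 every sensitive transaction is altered and the rule disappears entirely; with ψ = 0.5 half of them may survive, trading privacy for lower misses cost.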
Measuring Effectiveness

Misses cost under condition C1: the best algorithm (ψ = 0%, 6 sensitive rules):

Dataset   S1    S2    S3    S4
Kosarak   IGA   IGA   IGA   IGA
Retail    IGA   SWA   RA    RA
Reuters   IGA   IGA   IGA   IGA
BMS-1     IGA   IGA   IGA   IGA

Misses cost under condition C3: the best algorithm (ψ = 0%, varying threshold values):

Dataset   S1    S2    S3    S4
Kosarak   IGA   IGA   IGA   IGA
Retail    IGA   SWA   RA    RRA
Reuters   IGA   IGA   IGA   IGA
BMS-1     IGA   IGA   IGA   IGA

Dif(D, D') under conditions C1 and C3 (ψ = 0%, 6 sensitive rules): SWA is the best algorithm for every dataset and scenario, tied with IGA (IGA/SWA) in one Retail case.

Misses cost under condition C2 (ψ = 0%, varying the number of rules): IGA is the best in nearly every case, with RA best in two Retail scenarios and Algo2a/IGA tied in one case.
Pattern Sharing-Based Algorithm

[Figure: the two approaches compared.
Data Sharing-Based Approach (e.g. IGA): D -> sanitization -> D' -> share -> AR generation -> Association Rules(D').
Pattern Sharing-Based Approach (DSA): D -> AR generation -> Association Rules(D) -> sanitization -> share -> Discovered Patterns(D').]
Pattern Sharing-Based Algorithms: Problems

Notation: R is the set of all rules mined from D; R' is the set of rules to share; S_R are the sensitive rules (rules to hide) and ~S_R the non-sensitive rules.

Problem 1: side effect (non-sensitive rules hidden by accident).
• 1. Side Effect Factor (SEF):   SEF = (|R| - (|R'| + |S_R|)) / (|R| - |S_R|)

Problem 2: inference (recovery of hidden rules).
• 2. Recovery Factor (RF):   RF ∈ {0, 1}, indicating whether a hidden rule can be recovered from the shared rules.
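A sketch of the two measures follows. The recovery-factor implementation encodes one plausible reading (a hidden rule counts as recoverable when all of its proper subset patterns survive among the shared itemsets); this is an assumption, not the thesis's exact procedure.

```python
from itertools import combinations

def side_effect_factor(n_rules, n_shared, n_sensitive):
    """SEF = (|R| - (|R'| + |S_R|)) / (|R| - |S_R|): the share of
    non-sensitive rules accidentally removed during rule sanitization."""
    return (n_rules - (n_shared + n_sensitive)) / (n_rules - n_sensitive)

def recovery_factor(hidden_itemset, shared_itemsets):
    """RF in {0, 1}: 1 if every proper subset of the hidden itemset still
    appears among the shared itemsets, so an adversary may infer it."""
    subsets = {frozenset(c)
               for r in range(1, len(hidden_itemset))
               for c in combinations(hidden_itemset, r)}
    return 1 if subsets <= shared_itemsets else 0
```

For example, with |R| = 100 rules, |S_R| = 10 sensitive rules, and |R'| = 85 shared rules, SEF = 5/90 ≈ 0.056: five non-sensitive rules were lost as a side effect.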
Measuring Effectiveness

The best algorithm in terms of misses cost (ψ = 0%, 6 sensitive rules):

Dataset   S1        S2    S3        S4
Kosarak   IGA       DSA   DSA       DSA
Retail    DSA       DSA   DSA       IGA
Reuters   DSA       DSA   DSA       IGA
BMS-1     DSA       DSA   DSA       DSA

The best algorithm in terms of misses cost, varying the number of rules to sanitize (ψ = 0%):

Dataset   S1        S2    S3        S4
Kosarak   IGA       DSA   DSA       IGA/DSA
Retail    DSA       DSA   IGA/DSA   IGA
Reuters   IGA/DSA   DSA   DSA       IGA
BMS-1     DSA       DSA   DSA       DSA

The best algorithm in terms of side effect factor (ψ = 0%, 6 sensitive rules):

Dataset   S1    S2    S3    S4
Kosarak   IGA   DSA   DSA   DSA
Retail    DSA   DSA   DSA   IGA
Reuters   DSA   DSA   DSA   IGA
BMS-1     DSA   DSA   DSA   DSA
Lessons Learned

• Large datasets are our friends.
• The benefit of index: at most two scans to sanitize a dataset.
• The data sanitization paradox.
• The outstanding performance of IGA and DSA.
• Rule sanitization does not change the support and confidence of the shared rules.
• DSA reduces the flexibility of information sharing.
Privacy-Preserving Clustering (PPC)
• PPC over Centralized Data:
The attribute values subjected to clustering are available in a central repository.
• PPC over Vertically Partitioned Data:
There are k parties sharing data for clustering, where k ≥ 2;
The attribute values of the objects are split across the k parties.
Objects IDs are revealed for join purposes only. The values of the associated attributes are private.
Object Similarity-Based Representation (OSBR)

Example 1: Sharing data for research purposes (OSBR).

Original Data: a sample of the cardiac arrhythmia database (UCI Machine Learning Repository)

ID    age   weight   heart rate   Int_def   QRS   PR_int
123   75    80       63           32        91    193
342   56    64       53           24        81    174
254   40    52       70           24        77    129
446   28    58       76           40        183   25
286   44    90       68           44        109   128

Transformed Data: the corresponding dissimilarity matrix DM

       123     342     254     446     286
123    0
342    2.342   0
254    3.843   2.774   0
446    3.096   3.488   3.671   0
286    3.020   4.280   4.031   3.599   0
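The shared representation can be produced without disclosing attribute values: only pairwise distances leave the data owner. A minimal sketch, assuming plain Euclidean distance (the slide's matrix may additionally involve attribute normalization):

```python
import math

def dissimilarity_matrix(objects):
    """Lower-triangular matrix of pairwise Euclidean distances between objects
    (each object is a tuple of numeric attribute values)."""
    n = len(objects)
    dm = [[0.0] * (i + 1) for i in range(n)]
    for i in range(n):
        for j in range(i):
            dm[i][j] = math.dist(objects[i], objects[j])
    return dm
```

Any distance-based clustering method (e.g. hierarchical clustering) can then run on DM alone, which is exactly what makes the representation privacy-preserving for centralized data.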
Object Similarity-Based Representation (OSBR)

• Limitations of the OSBR:
Expensive in terms of communication cost: O(m²), where m is the number of objects under analysis.
Vulnerable to attacks in the case of vertically partitioned data.
• Conclusion: The OSBR is effective for PPC over centralized data only.
Dimensionality Reduction Transformation (DRBT)
• Random projection from d to k dimensions:

D'_{n×k} = D_{n×d} · R_{d×k}   (a linear transformation),

where D is the original data, D' is the reduced data, and R is a d × k random matrix.

• R is generated by setting each entry r_ij as follows:

(R1): r_ij is drawn i.i.d. from N(0, 1), and the columns of R are then normalized to unit length;

(R2): r_ij = √3 × (+1 with probability 1/6; 0 with probability 2/3; −1 with probability 1/6).
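Both constructions and the projection itself fit in a few lines of pure Python; the function names here are illustrative, not from the thesis.

```python
import math
import random

def random_matrix_r1(d, k, rng):
    """(R1): entries drawn i.i.d. from N(0, 1), columns then scaled to unit length."""
    r = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]
    for j in range(k):
        norm = math.sqrt(sum(r[i][j] ** 2 for i in range(d)))
        for i in range(d):
            r[i][j] /= norm
    return r

def random_matrix_r2(d, k, rng):
    """(R2): r_ij = sqrt(3) * (+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6)."""
    s3 = math.sqrt(3.0)
    return [[rng.choices([s3, 0.0, -s3], weights=(1, 4, 1))[0]
             for _ in range(k)] for _ in range(d)]

def project(data, r):
    """D'_{n x k} = D_{n x d} . R_{d x k}: each row of D is mapped to k dimensions."""
    d, k = len(r), len(r[0])
    return [[sum(row[i] * r[i][j] for i in range(d)) for j in range(k)]
            for row in data]
```

Random projections approximately preserve pairwise distances (the Johnson–Lindenstrauss lemma), which is why clustering quality survives the transformation.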
Dimensionality Reduction Transformation (DRBT)
Original Data: a sample of the cardiac arrhythmia database (UCI Machine Learning Repository)

ID    age   weight   heart rate   Int_def   QRS   PR_int
123   75    80       63           32        91    193
342   56    64       53           24        81    174
254   40    52       70           24        77    129
446   28    58       76           40        183   25
286   44    90       68           44        109   128

Transformed Data:

             RP1                         RP2
ID     Att1     Att2     Att3     Att1     Att2      Att3
123    -50.40   17.33    12.31    -55.50   -95.26    -107.93
342    -37.08   6.27     12.22    -51.00   -84.29    -83.13
254    -55.86   20.69    -0.66    -65.50   -70.43    -66.97
446    -37.61   -31.66   -17.58   -85.50   -140.87   -72.74
286    -62.72   37.64    18.16    -88.50   -50.22    -102.76

RP1: The random matrix is based on the Normal distribution.
RP2: The random matrix is based on a much simpler distribution.
Dimensionality Reduction Transformation (DRBT)
• Security: A random projection from d to k dimensions, where k < d, is a non-invertible linear transformation.
• Space requirement is of the order O(m), where m is the number of objects.
• Communication cost is of the order O(mkl), where l represents the size (in bits) required to transmit a dataset from one party to a central or third party.
• Conclusion: The DRBT is effective for PPC over both centralized data and vertically partitioned data.
DRBT: PPC over Centralized Data

The error produced on the dataset Chess (do = 37):

Transformation   dr = 37   dr = 34   dr = 31   dr = 28   dr = 25   dr = 22   dr = 16
RP1              0.00      0.015     0.024     0.033     0.045     0.072     0.141
RP2              0.00      0.014     0.019     0.032     0.041     0.067     0.131

Average of F-measure (10 trials) for the dataset Accidents (do = 18, dr = 12):

                  K = 2           K = 3           K = 4           K = 5
Transformation    Avg     Std     Avg     Std     Avg     Std     Avg     Std
RP2               0.941   0.014   0.912   0.009   0.881   0.010   0.885   0.006

Average of F-measure (10 trials) for the dataset Iris (do = 5, dr = 3):

                  K = 2           K = 3           K = 4           K = 5
Transformation    Avg     Std     Avg     Std     Avg     Std     Avg     Std
RP2               1.000   0.000   0.948   0.010   0.858   0.089   0.833   0.072
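The "error produced" by reducing dr can be quantified by how much the projection distorts pairwise distances. A common choice is a stress-style measure; the thesis's exact error function is not reproduced here, so treat this sketch as an assumption.

```python
import math

def pairwise_distances(points):
    """Euclidean distances for all pairs i < j."""
    n = len(points)
    return [math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)]

def stress_error(original, reduced):
    """sqrt( sum (d'_ij - d_ij)^2 / sum d_ij^2 ): 0 means pairwise distances
    are perfectly preserved by the dimensionality reduction."""
    d_o = pairwise_distances(original)
    d_r = pairwise_distances(reduced)
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(d_o, d_r)) /
                     sum(a ** 2 for a in d_o))
```

Under such a measure, an identical embedding scores 0, and the score grows with the distortion, matching the trend in the Chess table: the error rises as dr falls.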
DRBT: PPC over Vertically Partitioned Data

The error produced on the dataset Pumsb (do = 74):

No. of Parties   RP1      RP2
1                0.0762   0.0558
2                0.0798   0.0591
3                0.0870   0.0720
4                0.0923   0.0733

Average of F-measure (10 trials) for the dataset Pumsb (do = 74, dr = 38):

                  K = 2           K = 3           K = 4           K = 5
No. of Parties    Avg     Std     Avg     Std     Avg     Std     Avg     Std
1                 0.909   0.140   0.965   0.081   0.891   0.028   0.838   0.041
2                 0.904   0.117   0.931   0.101   0.894   0.059   0.840   0.047
3                 0.874   0.168   0.887   0.095   0.873   0.081   0.801   0.073
4                 0.802   0.155   0.812   0.117   0.866   0.088   0.831   0.078
Contributions of this Research
• Foundations for further research in PPDM.
• A taxonomy of PPDM techniques.
• A family of privacy-preserving methods.
• A library of sanitizing algorithms.
• Retrieval facilities.
• A set of metrics.
Future Research
• Privacy definition in data mining.
• Combining sanitization and randomization for PPARM.
• Transforming data using one-way functions and learning from the distorted data.
• Privacy preservation in spoken language databases.
• Sanitization of document repositories on the Web.