Privacy preserving data mining – randomized response and association rule hiding


Page 1: Privacy preserving data mining – randomized response and association rule hiding

Privacy preserving data mining – randomized response and association rule hiding

Li Xiong

CS573 Data Privacy and Anonymity

Partial slides credit: W. Du, Syracuse University, Y. Gao, Peking University

Page 2: Privacy preserving data mining – randomized response and association rule hiding

Privacy Preserving Data Mining Techniques

Protecting sensitive raw data
- Randomization (additive noise)
- Geometric perturbation and projection (multiplicative noise)
- Randomized response technique
  - Categorical data perturbation in data collection model

Protecting sensitive knowledge (knowledge hiding)

Page 3: Privacy preserving data mining – randomized response and association rule hiding

Data Collection Model

[Figure: data flow between Individual Data, Data Publisher, and Data Miner. Step 1: Data Collection. Step 2: Data Publishing.]

Data cannot be shared directly because of privacy concerns.

Page 4: Privacy preserving data mining – randomized response and association rule hiding

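Background: Randomized Response

Do you smoke?

Biased coin: $P(\text{Head}) = \theta$, with $\theta \neq 0.5$

- Head: answer truthfully (the true answer is “Yes”, so report “Yes”)
- Tail: answer the opposite (the true answer is “Yes”, so report “No”)

$$P'(\text{Yes}) = P(\text{Yes})\,\theta + P(\text{No})\,(1-\theta)$$

$$P'(\text{No}) = P(\text{Yes})\,(1-\theta) + P(\text{No})\,\theta$$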
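A minimal simulation sketch of this scheme, assuming each respondent tells the truth with probability theta (coin lands heads) and gives the opposite answer otherwise. The function names, the value theta = 0.7, and the 30% true rate are illustrative assumptions, not taken from the slides. The estimator solves the P'(Yes) relation for P(Yes), using P(No) = 1 - P(Yes).

```python
import random

def randomize(truth: bool, theta: float, rng: random.Random) -> bool:
    """Report the truth with probability theta, the opposite otherwise."""
    return truth if rng.random() < theta else not truth

def estimate_p_yes(reports, theta: float) -> float:
    """Solve P'(Yes) = P(Yes)*theta + (1 - P(Yes))*(1 - theta) for P(Yes)."""
    if abs(theta - 0.5) < 1e-12:
        raise ValueError("theta = 0.5 makes the estimate undefined")
    p_observed = sum(reports) / len(reports)
    return (p_observed - (1 - theta)) / (2 * theta - 1)

rng = random.Random(0)          # fixed seed so the run is repeatable
theta = 0.7                     # assumed coin bias
truths = [rng.random() < 0.3 for _ in range(100_000)]   # assumed 30% true "Yes"
reports = [randomize(t, theta, rng) for t in truths]
estimate = estimate_p_yes(reports, theta)
```

The division by (2*theta - 1) shows why the coin must not be fair: at theta = 0.5 the reports carry no information about P(Yes).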

Page 5: Privacy preserving data mining – randomized response and association rule hiding

Decision Tree Mining using Randomized Response

Multiple attributes encoded in bits

Biased coin: $P(\text{Head}) = \theta$, with $\theta \neq 0.5$

- Head: true answer E: 110
- Tail: false answer !E: 001

Column distribution can be estimated for learning a decision tree!

Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003

Page 6: Privacy preserving data mining – randomized response and association rule hiding

Accuracy of Decision tree built on randomized response

Page 7: Privacy preserving data mining – randomized response and association rule hiding

Generalization for Multi-Valued Categorical Data

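True value: $S_i$. The reported value is $S_i$ with probability $q_1$, $S_{i+1}$ with probability $q_2$, $S_{i+2}$ with probability $q_3$, and $S_{i+3}$ with probability $q_4$.

$$
\begin{pmatrix} P'(s_1) \\ P'(s_2) \\ P'(s_3) \\ P'(s_4) \end{pmatrix}
=
\underbrace{\begin{pmatrix}
q_1 & q_4 & q_3 & q_2 \\
q_2 & q_1 & q_4 & q_3 \\
q_3 & q_2 & q_1 & q_4 \\
q_4 & q_3 & q_2 & q_1
\end{pmatrix}}_{M}
\begin{pmatrix} P(s_1) \\ P(s_2) \\ P(s_3) \\ P(s_4) \end{pmatrix}
$$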
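The matrix relation on this slide can be sketched in a few lines, assuming M[i][j] is the probability of reporting value i when the true value is j and that indices wrap around. The values of q and the true distribution are made-up inputs, and the small Gauss-Jordan solver is only a stand-in for any linear solver.

```python
def rr_matrix(q):
    """Circulant RR matrix: M[i][j] = q[(i - j) % n] = P(report s_i | true s_j)."""
    n = len(q)
    return [[q[(i - j) % n] for j in range(n)] for i in range(n)]

def mat_vec(m, v):
    """Matrix-vector product, i.e. P' = M P."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def solve(m, b):
    """Solve m x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular RR matrix: distribution not recoverable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

q = [0.7, 0.1, 0.1, 0.1]             # assumed q1..q4, must sum to 1
M = rr_matrix(q)
p_true = [0.4, 0.3, 0.2, 0.1]        # assumed original distribution
p_reported = mat_vec(M, p_true)      # what the collector observes (in expectation)
p_recovered = solve(M, p_reported)   # estimate of the original distribution
```

In practice p_reported is an empirical frequency vector, so the recovered distribution is only an estimate; here the exact expectation is used to keep the sketch deterministic.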

Page 8: Privacy preserving data mining – randomized response and association rule hiding

A Generalization

RR Matrices [Warner 65], [R.Agrawal 05], [S. Agrawal 05]

RR Matrix can be arbitrary

Can we find optimal RR matrices?

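$$
M = \begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{pmatrix}
$$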

OptRR:Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008

Page 9: Privacy preserving data mining – randomized response and association rule hiding

What is an optimal matrix?

Which of the following is better?

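$$
M_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\qquad
M_2 = \begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}
$$

Privacy: M2 is better

Utility: M1 is better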

So, what is an optimal matrix?

Page 10: Privacy preserving data mining – randomized response and association rule hiding

Optimal RR Matrix

An RR matrix M is optimal if no other RR matrix is better than M in both privacy and utility (i.e., no other matrix dominates M).

- Privacy quantification
- Utility quantification

A number of privacy and utility metrics have been proposed.

- Privacy: how accurately one can estimate individual info.
- Utility: how accurately we can estimate aggregate info.

Page 11: Privacy preserving data mining – randomized response and association rule hiding

Metrics

Privacy: accuracy of estimate of individual values

Utility: difference between the original probability and the estimated probability
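As one possible way to make the privacy metric concrete (an assumption for illustration, not necessarily the metric used in the cited paper), the sketch below computes the probability that an adversary who knows the prior and the RR matrix guesses an individual's true value correctly from the reported value. The prior is a made-up input; the two matrices are the identity and the uniform 3x3 matrices compared on the earlier slide.

```python
def adversary_accuracy(m, prior):
    """Probability that a Bayes-optimal guess of the true value is correct.

    m[i][j] = P(report i | true j). For each report i the adversary picks
    the true value j that maximizes m[i][j] * prior[j].
    """
    n = len(prior)
    return sum(max(m[i][j] * prior[j] for j in range(n)) for i in range(len(m)))

M1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # no randomization
M2 = [[1 / 3] * 3 for _ in range(3)]                        # fully random report
prior = [0.5, 0.3, 0.2]                                     # assumed distribution

acc_m1 = adversary_accuracy(M1, prior)   # reports reveal the true value
acc_m2 = adversary_accuracy(M2, prior)   # no better than guessing the mode
```

Lower accuracy means better privacy under this reading, which matches the slide's verdict that M2 is better for privacy; M2 is also singular, so the aggregate distribution cannot be recovered from it.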

Page 12: Privacy preserving data mining – randomized response and association rule hiding

Optimization Methods

Approach 1: Weighted sum: w1 · Privacy + w2 · Utility

Approach 2:
- Fix Privacy, find M with the optimal Utility.
- Fix Utility, find M with the optimal Privacy.
- Challenge: it is difficult to generate M with a fixed privacy or utility.

Proposed approach: Multi-Objective Optimization

Page 13: Privacy preserving data mining – randomized response and association rule hiding

Optimization algorithm

Evolutionary Multi-Objective Optimization (EMOO)

The algorithm:
- Start with a set of initial RR matrices.
- Repeat the following steps in each iteration:
  - Mating: select two RR matrices from the pool.
  - Crossover: exchange several columns between the two RR matrices.
  - Mutation: change some values in an RR matrix.
  - Meet the privacy bound: filter the resultant matrices.
  - Evaluate the fitness value for the new RR matrices.

Note: the fitness value is defined in terms of privacy and utility metrics.
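The crossover and mutation steps can be sketched as below. This is a hedged illustration only: it assumes each column of an RR matrix is a probability distribution (so columns must keep summing to 1), and the helper names, the 50% column-swap chance, and the mutation scale are hypothetical choices, not details from the OptRR paper. Filtering by the privacy bound and fitness evaluation are left out.

```python
import random

def random_rr_matrix(n, rng):
    """Random n x n matrix whose columns each sum to 1."""
    cols = []
    for _ in range(n):
        raw = [rng.random() + 1e-9 for _ in range(n)]
        total = sum(raw)
        cols.append([x / total for x in raw])
    return [[cols[j][i] for j in range(n)] for i in range(n)]

def crossover(m1, m2, rng):
    """Exchange a random subset of whole columns between two matrices."""
    n = len(m1)
    c1 = [row[:] for row in m1]
    c2 = [row[:] for row in m2]
    for j in range(n):
        if rng.random() < 0.5:
            for i in range(n):
                c1[i][j], c2[i][j] = c2[i][j], c1[i][j]
    return c1, c2

def mutate(m, rng, scale=0.1):
    """Perturb one random column, then renormalize it to sum to 1."""
    n = len(m)
    out = [row[:] for row in m]
    j = rng.randrange(n)
    col = [max(out[i][j] + rng.uniform(-scale, scale), 1e-9) for i in range(n)]
    total = sum(col)
    for i in range(n):
        out[i][j] = col[i] / total
    return out

def column_sums(m):
    return [sum(row[j] for row in m) for j in range(len(m))]

rng = random.Random(1)
parent_a, parent_b = random_rr_matrix(4, rng), random_rr_matrix(4, rng)
child_a, child_b = crossover(parent_a, parent_b, rng)
child_a = mutate(child_a, rng)
```

Swapping whole columns keeps every child a valid RR matrix without any repair step, which is presumably why the slide describes crossover in terms of columns.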

Page 14: Privacy preserving data mining – randomized response and association rule hiding

Illustration

Page 15: Privacy preserving data mining – randomized response and association rule hiding

Output of Optimization

[Figure: RR matrices M1 to M8 plotted in the objective space; the axes are Privacy and Utility, with the Worse and Better directions marked.]

The optimal set is often plotted in the objective space as a Pareto front.
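A small sketch of how a Pareto front could be extracted from scored matrices, assuming each matrix has a (privacy, utility) pair where larger is better in both. The scores are invented for illustration and do not come from the figure.

```python
def dominates(a, b):
    """a dominates b: no worse in both objectives, strictly better in one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def pareto_front(scores):
    """Keep only the entries that no other entry dominates."""
    return {
        name: s
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    }

# Hypothetical (privacy, utility) scores, larger is better.
scores = {
    "M1": (0.9, 0.2),
    "M2": (0.7, 0.5),
    "M3": (0.6, 0.4),   # dominated by M2
    "M4": (0.3, 0.8),
    "M5": (0.2, 0.7),   # dominated by M4
}
front = pareto_front(scores)
```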

Page 16: Privacy preserving data mining – randomized response and association rule hiding

For First attribute of Adult data

Page 17: Privacy preserving data mining – randomized response and association rule hiding

Privacy Preserving Data Mining Techniques

Protecting sensitive raw data
- Randomization (additive noise)
- Geometric perturbation and projection (multiplicative noise)
- Randomized response technique

Protecting sensitive knowledge (knowledge hiding)
- Frequent itemset and association rule hiding
- Downgrading classifier effectiveness

Page 18: Privacy preserving data mining – randomized response and association rule hiding

Frequent Itemset Mining and Association Rule Mining

Frequent itemset mining: finding frequent sets of items in a transaction data set

Association rules: associations between items

Page 19: Privacy preserving data mining – randomized response and association rule hiding

Frequent Itemset Mining and Association Rule Mining

First proposed by Agrawal, Imielinski, and Swami in SIGMOD 1993
- SIGMOD Test of Time Award 2003

“This paper started a field of research. In addition to containing an innovative algorithm, its subject matter brought data mining to the attention of the database community … even led several years ago to an IBM commercial, featuring supermodels, that touted the importance of work such as that contained in this paper. ”

Apriori algorithm in VLDB 1994
- #4 in the top 10 data mining algorithms in ICDM 2006

R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD ’93.

Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.

Page 20: Privacy preserving data mining – randomized response and association rule hiding

Basic Concepts: Frequent Patterns and Association Rules

Itemset: X = {x1, …, xk} (k-itemset)

Frequent itemset: X with at least the minimum support count
- Support count (absolute support): count of transactions containing X

Association rule: A ⇒ B with minimum support and confidence
- Support: probability that a transaction contains A ∪ B: s = P(A ∪ B)
- Confidence: conditional probability that a transaction having A also contains B: c = P(B | A)

Association rule mining process
- Find all frequent patterns (more costly)
- Generate strong association rules

[Figure: Venn diagram with regions "Customer buys diaper", "Customer buys beer", and their overlap "Customer buys both".]

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

Page 21: Privacy preserving data mining – randomized response and association rule hiding

Illustration of Frequent Itemsets and Association Rules

Transaction-id Items bought

10 A, B, D

20 A, C, D

30 A, D, E

40 B, E, F

50 B, C, D, E, F

Frequent itemsets (minimum support count = 3)?

{A:3, B:3, D:4, E:3, AD:3}

Association rules (minimum support = 50%, minimum confidence = 50%)?

A ⇒ D (60%, 100%)

D ⇒ A (60%, 75%)
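The answers on this slide can be checked with a brute-force sketch over the five transactions. It enumerates every candidate itemset rather than using Apriori pruning, which is fine for six items but not meant as an efficient miner; the function names are illustrative.

```python
from itertools import combinations

transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}
ITEMS = sorted(set().union(*transactions.values()))

def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)

def frequent_itemsets(min_count):
    """All itemsets with support count >= min_count, keyed like 'AD'."""
    result = {}
    for k in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, k):
            count = support_count(set(combo))
            if count >= min_count:
                result["".join(combo)] = count
    return result

def association_rules(min_support, min_confidence):
    """Rules lhs => rhs mapped to (support, confidence)."""
    n = len(transactions)
    found = {}
    for k in range(2, len(ITEMS) + 1):
        for combo in combinations(ITEMS, k):
            both = support_count(set(combo))
            if both / n < min_support:
                continue
            for r in range(1, k):
                for lhs in combinations(combo, r):
                    confidence = both / support_count(set(lhs))
                    if confidence >= min_confidence:
                        rhs = tuple(x for x in combo if x not in lhs)
                        found[(lhs, rhs)] = (both / n, confidence)
    return found

frequent = frequent_itemsets(3)
rules = association_rules(0.5, 0.5)
```

Here confidence is computed as support(A ∪ B) / support(A), i.e. P(B | A), which is what yields 100% for A ⇒ D and 75% for D ⇒ A.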

Page 22: Privacy preserving data mining – randomized response and association rule hiding

SIGMOD Ph.D. Workshop IDAR’07

Association Rule Hiding: what? why??

Problem: hide sensitive association rules in data without losing non-sensitive rules

Motivations: confidential rules may have serious adverse effects

Page 23: Privacy preserving data mining – randomized response and association rule hiding

Problem statement

Given:
- a database D to be released
- minimum support threshold “MST” and minimum confidence threshold “MCT”
- a set of association rules R mined from D
- a set of sensitive rules Rh ⊆ R to be hidden

Find a new database D’ such that:
- the rules in Rh cannot be mined from D’
- as many of the rules in R-Rh as possible can still be mined

Page 24: Privacy preserving data mining – randomized response and association rule hiding

Solutions

Data modification approaches
- Basic idea: data sanitization D->D’
- Approaches: distortion, blocking
- Drawbacks: cannot control hiding effects intuitively, lots of I/O

Data reconstruction approaches
- Basic idea: knowledge sanitization D->K->D’
- Potential advantages: can easily control the availability of rules and control the hiding effects directly, intuitively, handily

Page 25: Privacy preserving data mining – randomized response and association rule hiding

Distortion-based Techniques

Sample Database

A B C D
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1

Rule A→C has:
Support(A→C) = 80%
Confidence(A→C) = 100%

Distorted Database (output of the distortion algorithm; C is changed from 1 to 0 in rows 2 and 5)

A B C D
1 1 1 0
1 0 0 1
0 0 0 1
1 1 1 0
1 0 0 1

Rule A→C has now:
Support(A→C) = 40%
Confidence(A→C) = 50%
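The support and confidence figures for rule A→C before and after distortion can be reproduced with a short sketch. The two row lists are transcribed from the slide's tables; the helper name and the column-letter lookup are illustrative.

```python
def support_and_confidence(rows, lhs, rhs, columns="ABCD"):
    """Support and confidence of the rule lhs -> rhs over binary rows."""
    a, c = columns.index(lhs), columns.index(rhs)
    both = sum(1 for r in rows if r[a] == 1 and r[c] == 1)
    lhs_count = sum(1 for r in rows if r[a] == 1)
    return both / len(rows), both / lhs_count

sample = [(1, 1, 1, 0), (1, 0, 1, 1), (0, 0, 0, 1), (1, 1, 1, 0), (1, 0, 1, 1)]
# Same rows with C flipped from 1 to 0 in the second and fifth transactions.
distorted = [(1, 1, 1, 0), (1, 0, 0, 1), (0, 0, 0, 1), (1, 1, 1, 0), (1, 0, 0, 1)]

before = support_and_confidence(sample, "A", "C")
after = support_and_confidence(distorted, "A", "C")
```

With, say, a 60% confidence threshold (a made-up value), the rule would be mined from the sample database but not from the distorted one.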

Page 26: Privacy preserving data mining – randomized response and association rule hiding

Side Effects

Before Hiding Process | After Hiding Process | Side Effect
Rule Ri has had conf(Ri) > MCT | Rule Ri has now conf(Ri) < MCT | Rule Eliminated (Undesirable Side Effect)
Rule Ri has had conf(Ri) < MCT | Rule Ri has now conf(Ri) > MCT | Ghost Rule (Undesirable Side Effect)
Large Itemset I has had sup(I) > MST | Itemset I has now sup(I) < MST | Itemset Eliminated (Undesirable Side Effect)

Page 27: Privacy preserving data mining – randomized response and association rule hiding

Distortion-based Techniques

Challenges/Goals:

To minimize the undesirable Side Effects that the hiding process causes to non-sensitive rules.

To minimize the number of 1’s that must be deleted in the database.

Algorithms must be linear in time as the database increases in size.

Page 28: Privacy preserving data mining – randomized response and association rule hiding

Sensitive itemsets: ABC

Page 29: Privacy preserving data mining – randomized response and association rule hiding

Data distortion [Atallah 99]

Hardness result: the distortion problem is NP-hard

Heuristic search: find items to remove and transactions to remove the items from

Disclosure Limitation of Sensitive Rules, M. Atallah, A. Elmagarmid, M. Ibrahim, E. Bertino, V. Verykios, 1999

Page 30: Privacy preserving data mining – randomized response and association rule hiding
Page 31: Privacy preserving data mining – randomized response and association rule hiding

Heuristic Approach

A greedy bottom-up search through the ancestors (subsets) of the sensitive itemset for the parent with maximum support (why?)

At the end of the search, a 1-itemset is selected

Search through the common transactions containing the item and the sensitive itemset for the transaction that affects minimum number of 2-itemsets

Delete the selected item from the identified transaction
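The three steps above might be sketched as follows. This is one reading of the slide's bullets, not the algorithm as published by Atallah et al.: tie-breaking rules, the loop until the sensitive itemset drops below the threshold, and the example database and threshold are all assumptions made for illustration.

```python
def support(db, itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def pick_item(db, sensitive):
    """Walk from the sensitive itemset down to one item, always moving to
    the subset (parent) with maximum support; ties broken alphabetically."""
    current = frozenset(sensitive)
    while len(current) > 1:
        parents = [current - {x} for x in current]
        current = max(parents, key=lambda p: (support(db, p), sorted(p)))
    return next(iter(current))

def pick_transaction(db, sensitive, item, min_count):
    """Among transactions containing the sensitive itemset, pick the one
    where deleting `item` touches the fewest frequent 2-itemsets."""
    def affected(t):
        return sum(1 for other in t - {item}
                   if support(db, {item, other}) >= min_count)
    candidates = [i for i, t in enumerate(db) if sensitive <= t]
    return min(candidates, key=lambda i: (affected(db[i]), i))

def hide(db, sensitive, min_count):
    """Delete items one at a time until the sensitive itemset is infrequent."""
    db = [set(t) for t in db]
    sensitive = set(sensitive)
    while support(db, sensitive) >= min_count:
        item = pick_item(db, sensitive)
        index = pick_transaction(db, sensitive, item, min_count)
        db[index].discard(item)
    return db

# Hypothetical example: hide itemset ABC with a support-count threshold of 3.
database = [set("ABCE"), set("ABC"), set("ABCD"), set("ABD"), set("AD"), set("ACD")]
sanitized = hide(database, "ABC", min_count=3)
```

In this example the search ends at item A, which is then removed from the first transaction, dropping the support of ABC from 3 to 2.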

Page 32: Privacy preserving data mining – randomized response and association rule hiding
Page 33: Privacy preserving data mining – randomized response and association rule hiding

Results comparison

Page 34: Privacy preserving data mining – randomized response and association rule hiding

Blocking-based Techniques

Initial Database

A B C D
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1

New Database (output of the blocking algorithm)

A B C D
1 1 1 0
1 0 ? 1
? 0 0 1
1 1 1 0
1 0 1 1

Support and confidence become marginal.

In New Database: 60% ≤ conf(A → C) ≤ 100%
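The confidence interval for A → C in the blocked database can be checked by trying every way of filling in the unknown cells. In this sketch `None` stands for a "?" entry, and exhaustive enumeration is an illustrative choice that only scales to a handful of unknowns.

```python
from itertools import product

def confidence_bounds(rows, lhs_col, rhs_col):
    """Min and max confidence of lhs -> rhs over all fillings of None cells."""
    unknowns = [(i, j) for i, row in enumerate(rows)
                for j, value in enumerate(row) if value is None]
    values = []
    for fill in product((0, 1), repeat=len(unknowns)):
        filled = [list(r) for r in rows]
        for (i, j), value in zip(unknowns, fill):
            filled[i][j] = value
        lhs = sum(1 for r in filled if r[lhs_col] == 1)
        both = sum(1 for r in filled if r[lhs_col] == 1 and r[rhs_col] == 1)
        if lhs:  # confidence is undefined when no row contains lhs
            values.append(both / lhs)
    return min(values), max(values)

blocked = [
    (1, 1, 1, 0),
    (1, 0, None, 1),   # C blocked
    (None, 0, 0, 1),   # A blocked
    (1, 1, 1, 0),
    (1, 0, 1, 1),
]
low, high = confidence_bounds(blocked, lhs_col=0, rhs_col=2)
```

The lower bound comes from setting the blocked C to 0 and the blocked A to 1 (3 of 5), the upper bound from the opposite choice (4 of 4).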

Page 35: Privacy preserving data mining – randomized response and association rule hiding

Data reconstruction approach

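[Figure: pipeline from D to D’. 1. Frequent set mining takes D to FS (from which R is derived). 2. Perform sanitization algorithm takes FS to FS’ (from which R-Rh is derived). 3. FP-tree-based inverse frequent set mining takes FS’, via an FP-tree, to D’.]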

Page 36: Privacy preserving data mining – randomized response and association rule hiding

2007-7-10 SIGMOD Ph.D. Workshop IDAR’07

The first two phases

1. Frequent set mining
- Generate all frequent itemsets FS, with their supports and support counts, from the original database D

2. Perform sanitization algorithm
- Input: FS (output of phase 1), R, Rh
- Output: sanitized frequent itemsets FS’
- Process: select hiding strategy, identify sensitive frequent sets, perform sanitization

In the best case, the sanitization algorithm can ensure that from FS’ we get exactly the non-sensitive rule set R-Rh.

[Figure: FS yields R; FS’ yields R-Rh.]

Page 37: Privacy preserving data mining – randomized response and association rule hiding

Example: the first two phases

Original Database: D

TID  Items
T1   ABCE
T2   ABC
T3   ABCD
T4   ABD
T5   AD
T6   ACD

σ = 4, MST = 66%, MCT = 75%

1. Frequent set mining

Frequent Itemsets: FS
A:6   100%
B:4   66%
C:4   66%
D:4   66%
AB:4  66%
AC:4  66%
AD:4  66%

Association Rules: R
rules   confidence  support
B ⇒ A   100%        66%
C ⇒ A   100%        66%
D ⇒ A   100%        66%

2. Perform sanitization algorithm

Frequent Itemsets: FS'
A:6   100%
C:4   66%
D:4   66%
AC:4  66%
AD:4  66%

Association Rules: R-Rh
rules   confidence  support
C ⇒ A   100%        66%
D ⇒ A   100%        66%
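The two phases of this example can be replayed on the support-count table alone. The sketch below assumes the sensitive rule is B ⇒ A (the one missing from R-Rh) and uses a deliberately simple hiding strategy, dropping every frequent itemset that contains the antecedent of a sensitive rule. That strategy happens to reproduce FS' on this slide but is an assumption, not the paper's sanitization algorithm.

```python
def rules_from(fs, n_transactions, mct):
    """Single-item rules lhs => rhs derived from the 2-itemsets in `fs`.

    fs maps itemset strings such as 'AB' to support counts.
    Returns {(lhs, rhs): (confidence, support)}.
    """
    found = {}
    for itemset, count in fs.items():
        if len(itemset) != 2:
            continue
        for lhs in itemset:
            rhs = itemset.replace(lhs, "")
            if lhs in fs and count / fs[lhs] >= mct:
                found[(lhs, rhs)] = (count / fs[lhs], count / n_transactions)
    return found

def sanitize(fs, sensitive_rules):
    """Assumed strategy: drop itemsets containing a sensitive antecedent."""
    blocked = {lhs for lhs, _ in sensitive_rules}
    return {s: c for s, c in fs.items() if not (set(s) & blocked)}

# Phase 1 output from the slide (sigma = 4 over 6 transactions).
FS = {"A": 6, "B": 4, "C": 4, "D": 4, "AB": 4, "AC": 4, "AD": 4}
R = rules_from(FS, n_transactions=6, mct=0.75)

# Phase 2: hide the assumed sensitive rule B => A.
FS_prime = sanitize(FS, [("B", "A")])
R_minus_Rh = rules_from(FS_prime, n_transactions=6, mct=0.75)
```

A ⇒ B, A ⇒ C and A ⇒ D are not in R because their confidence is 4/6, below the 75% threshold.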

Page 38: Privacy preserving data mining – randomized response and association rule hiding

Open research questions

Optimal solution

Itemsets sanitization
- The support and confidence of the rules in R-Rh should remain unchanged as much as possible

Integrating data protection and knowledge (rule) protection

Page 39: Privacy preserving data mining – randomized response and association rule hiding

Coming up

Cryptographic protocols for privacy preserving distributed data mining

Page 40: Privacy preserving data mining – randomized response and association rule hiding

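Classification of current algorithms

[Table: algorithms classified by hiding target (hide rules, hide large itemsets) and by approach; the cell boundaries were lost in extraction, so algorithms are listed per approach in their original order.]

Data modification
- Data-Distortion: Algo1a, Algo1b, Algo2a, WSDA, PDA, Algo2b, Algo2c, Naïve, MinFIA, MaxFIA, IGA, RRA, RA, SWA, Border-Based, Integer-Programming, Sanitization-Matrix
- Data-Blocking: CR, CR2, GIH

Data reconstruction
- CIILM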

Page 41: Privacy preserving data mining – randomized response and association rule hiding

Weight-based Sorting Distortion Algorithm (WSDA) [Pontikakis 03]

High Level Description:

Input:
- Initial Database
- Set of Sensitive Rules
- Safety Margin (for example 10%)

Output:
- Sanitized Database
- Sensitive Rules no longer hold in the Database

Page 42: Privacy preserving data mining – randomized response and association rule hiding

WSDA Algorithm

High Level Description:

1st step:
- Retrieve the set of transactions which support sensitive rule R_S
- For each sensitive rule R_S, find the number N_1 of transactions in which one item that supports the rule will be deleted

Page 43: Privacy preserving data mining – randomized response and association rule hiding

WSDA Algorithm

High Level Description:

2nd step:
- For each rule R_i in the Database that has items in common with R_S, compute a weight w that denotes how strong R_i is
- For each transaction that supports R_S, compute a priority P_i that denotes how many strong rules this transaction supports

Page 44: Privacy preserving data mining – randomized response and association rule hiding

WSDA Algorithm

High Level Description:

3rd step:
- Sort the N_1 transactions in ascending order according to their priority value P_i

4th step:
- For the first N_1 transactions, hide an item that is contained in R_S

Page 45: Privacy preserving data mining – randomized response and association rule hiding

WSDA Algorithm

High Level Description:

5th step:
- Update confidence and support values for other rules in the database

Page 46: Privacy preserving data mining – randomized response and association rule hiding

Discussion

Sanitization algorithm
- Compared with earlier popular data sanitization: performs sanitization directly on the knowledge level of data

Inverse frequent set mining algorithm
- Deals with frequent items and infrequent items separately: more efficient, a large number of outputs

Proposed Solution

Our solution provides the user with a knowledge-level window to perform sanitization handily and generates a number of secure databases