Privacy-preserving rule mining. Outline A brief introduction to association rule mining Privacy...

Privacy-preserving rule mining

Outline A brief introduction to association rule

mining Privacy preserving rule mining

Single party Perturbation - publishing Encryption - outsourcing

Distributed multiparty Cryptographic protocols – collaborative mining

Hiding sensitive rules

Association Rule Mining Transactional datasets

A transaction t = {a,b,c,…} a,b,c,… are called items The length of transaction = # of items Transaction length can vary

Equivalent representation The set of all items : I A transaction t can be transformed to a

boolean vector in length of |I|.

Association Rule Mining Rule mining

Goal: find the frequent itemset Some itemset, e.g.{a,b,c}, appears frequently,

higher than certain support. Rules can be derived from the itemset: AB

{a,b,c} is frequent, then abc, abc, …

Metrics Support = # of occurrences of itemset/ total #

of transactions i.e., prob (A,B)

Confidence = # of occurrences of itemset/# of occurrences of left (of the rule) I.e. the conditional prob: Pr (B|A)

Example E.g. a,b appears 100 times together,

while abc appears 50 times together in total 5000 transactions Support of abc = 50/5000 = 0.01 Confidence of abc = 50/100 = 0.5

Algorithms

Apriori observation: if itemset A is a part of B,

then support(A) >= support (B) Steps in finding frequent itemsets:

Starting from single-item set, pruning the itemsets that have support < threshold

When we have a set of k-itemsets, we expand it to k+1-itemsets, and check their supports.

Using data structures like hash tree to speed up the counting process

Algorithm Generating rules with confidence threshold

Confidence (AB) = P(B|A) = support(AB)/support(A)

Centralized approaches

Two types of methods (Categorical data) Perturbation Encryption

Perturbation Papers

1."Privacy Preserving Mining of Association Rules," SIGKDD 2002. 2."Maintaining data privacy in association rule mining”, VLDB 023.A framework for high-accuracy privacy-preserving mining, ICDE05

Basic ideas Consider a transaction as a boolean bit vector Perturb each bit with certain method

Paper 1: randomly select j items from t, then for rest of all items, with prob p to be selected

Paper 2: each bit has the prob p to be original, 1-p to be flipped

Paper 3: unify the methods with perturbation matrix

The key is…

After you perturb the data, you should still be able to find the supported rules correctly.

The accuracy is traded off by the intensity of perturbation (p)

Methods discovering the original support

Paper 1 using the correlation between “partial support” to find the original support

Concept of partial support

Prob of the length change of matched parts

N

ltATtAT

l

})(|#{#)(sup

])(|#')'([#]'[ lAtlAtpllpmk

The size of t : m, the size of itemset A: k

Some results

Let si be supi(A) and si’ be supi’ (A)

The matrix P and D are defined with only p[ll’]

1.

2.

From 1, we can estimate the original supportFrom 2, we can estimate the reliability (variance) of the support Estimation (which is related to perturbation rate p)

Privacy

Given an itemset A in perturbed transaction t’ What is the probability of an item a,

really in the itemset A, i.e., ]'|[ tAtap

]'|[ tAtap

Tradeoff between utility and privacy

Lowest discoverable support: distinguishable from zero (consider the variance of support estimation)

Encryption method Paper: Security in Outsourcing of Association Rule

Mining, VLDB2007 Substitution encryption

1-1 substitution a1, b2, … 1-n substitution a{1,10}, b{2,11,12},…

Problem: 1-1 substitution is weak Arbitrary 1-n substitution does not work

Cannot recover original rules from the rules from the substituted items.

The basic idea Fake items

Original n items, additional m fake items Define “admissible 1-n mapping”

Arbitrary 1-n mapping may result in irreversible results E.g., a{1,2}, b{2}, c{3} If we find frequent itemset {1,2,3} in the substituted set,

{ac} or {abc}, which one is the right original itemset?

Admissible 1-n mapping For each mapping, there should be at least one unique

substitute item in the mapped result, which does not appear in other mapping

E.g., a{1,2}, b{2}, c{3} breaks the definitionwhile a{1,2}, b{2,4}, c{3} is admissible

Recovering rules

When we use admissible mappings We are able to reverse the discovered

rules on substituted set. E.g., if we find {1,2,4} is a frequent set check all mappings:{1,2} a, {2,4} b {1,2,4} {ab}

cost

Additional cost Generating item mapping Generating transaction transformation

significant

Cost of rule mining Both the # of items and the average length

of transaction is increased, thus the total cost will increase

Features of encryption method

Rules can be accurately recovered A tradeoff between cost and privacy

Privacy is better preserved with more fake items

More fake items will result in higher additional cost.

The VLDB07 paper weaknesses No strong proof on security does not mention any attacks – there are

possibly attacks on this scheme

Distributed datasets

Perturbation works, but existing encryption protocols do not work might need some protocols to make the

encryption method applicable

Horizontally or vertically partitioned Paper1: "Privacy-Preserving Distributed Mining of

Association Rules on Horizontally Partitioned Data," Paper2: "Privacy preserving association rule mining in

vertically partitioned data," Shared features: using the cryptographic methods to

construct protocols

Hiding sensitive rules

When publishing data for rule mining, the rule itself can be sensitive too.

Basic methods Decrease the support

Support = # of occurrences of itemset/ total # of transactions

Decrease the confident Confidence = # of occurrences of itemset/#

of occurrences of left(rule)

discussion on rule hiding Need sufficient amount of

computational cost at the data owner side

You should know what rules are sensitive in advance! So only necessary for the case you have

to share the data

When hiding sensitive rules, other rules might be damaged

Summary Two methods for single party rule

mining Perturbation and encryption

Distributed multiparty can use protocols and perturbation method

Hiding sensitive rules is also important in some cases

Privacy-preserving rule mining. Outline A brief introduction to association rule mining Privacy...

Documents

Transcript of Privacy-preserving rule mining. Outline A brief introduction to association rule mining Privacy...