Preserving Privacy in Data Preparation for Association Rule Mining
Privacy-preserving rule mining. Outline A brief introduction to association rule mining Privacy...
-
Upload
raymond-dixon -
Category
Documents
-
view
220 -
download
1
Transcript of Privacy-preserving rule mining. Outline A brief introduction to association rule mining Privacy...
Privacy-preserving rule mining
Outline A brief introduction to association rule
mining Privacy preserving rule mining
Single party Perturbation - publishing Encryption - outsourcing
Distributed multiparty Cryptographic protocols – collaborative mining
Hiding sensitive rules
Association Rule Mining Transactional datasets
A transaction t = {a,b,c,…} a,b,c,… are called items The length of transaction = # of items Transaction length can vary
Equivalent representation The set of all items : I A transaction t can be transformed to a
boolean vector in length of |I|.
Association Rule Mining Rule mining
Goal: find the frequent itemset Some itemset, e.g.{a,b,c}, appears frequently,
higher than certain support. Rules can be derived from the itemset: AB
{a,b,c} is frequent, then abc, abc, …
Metrics Support = # of occurrences of itemset/ total #
of transactions i.e., prob (A,B)
Confidence = # of occurrences of itemset/# of occurrences of left (of the rule) I.e. the conditional prob: Pr (B|A)
Example E.g. a,b appears 100 times together,
while abc appears 50 times together in total 5000 transactions Support of abc = 50/5000 = 0.01 Confidence of abc = 50/100 = 0.5
Algorithms
Apriori observation: if itemset A is a part of B,
then support(A) >= support (B) Steps in finding frequent itemsets:
Starting from single-item set, pruning the itemsets that have support < threshold
When we have a set of k-itemsets, we expand it to k+1-itemsets, and check their supports.
Using data structures like hash tree to speed up the counting process
Algorithm Generating rules with confidence threshold
Confidence (AB) = P(B|A) = support(AB)/support(A)
Centralized approaches
Two types of methods (Categorical data) Perturbation Encryption
Perturbation Papers
1."Privacy Preserving Mining of Association Rules," SIGKDD 2002. 2."Maintaining data privacy in association rule mining”, VLDB 023.A framework for high-accuracy privacy-preserving mining, ICDE05
Basic ideas Consider a transaction as a boolean bit vector Perturb each bit with certain method
Paper 1: randomly select j items from t, then for rest of all items, with prob p to be selected
Paper 2: each bit has the prob p to be original, 1-p to be flipped
Paper 3: unify the methods with perturbation matrix
The key is…
After you perturb the data, you should still be able to find the supported rules correctly.
The accuracy is traded off by the intensity of perturbation (p)
Methods discovering the original support
Paper 1 using the correlation between “partial support” to find the original support
Concept of partial support
Prob of the length change of matched parts
N
ltATtAT
l
})(|#{#)(sup
])(|#')'([#]'[ lAtlAtpllpmk
The size of t : m, the size of itemset A: k
Some results
Let si be supi(A) and si’ be supi’ (A)
The matrix P and D are defined with only p[ll’]
1.
2.
From 1, we can estimate the original supportFrom 2, we can estimate the reliability (variance) of the support Estimation (which is related to perturbation rate p)
Privacy
Given an itemset A in perturbed transaction t’ What is the probability of an item a,
really in the itemset A, i.e., ]'|[ tAtap
]'|[ tAtap
Tradeoff between utility and privacy
Lowest discoverable support: distinguishable from zero (consider the variance of support estimation)
Encryption method Paper: Security in Outsourcing of Association Rule
Mining, VLDB2007 Substitution encryption
1-1 substitution a1, b2, … 1-n substitution a{1,10}, b{2,11,12},…
Problem: 1-1 substitution is weak Arbitrary 1-n substitution does not work
Cannot recover original rules from the rules from the substituted items.
The basic idea Fake items
Original n items, additional m fake items Define “admissible 1-n mapping”
Arbitrary 1-n mapping may result in irreversible results E.g., a{1,2}, b{2}, c{3} If we find frequent itemset {1,2,3} in the substituted set,
{ac} or {abc}, which one is the right original itemset?
Admissible 1-n mapping For each mapping, there should be at least one unique
substitute item in the mapped result, which does not appear in other mapping
E.g., a{1,2}, b{2}, c{3} breaks the definitionwhile a{1,2}, b{2,4}, c{3} is admissible
Recovering rules
When we use admissible mappings We are able to reverse the discovered
rules on substituted set. E.g., if we find {1,2,4} is a frequent set check all mappings:{1,2} a, {2,4} b {1,2,4} {ab}
cost
Additional cost Generating item mapping Generating transaction transformation
significant
Cost of rule mining Both the # of items and the average length
of transaction is increased, thus the total cost will increase
Features of encryption method
Rules can be accurately recovered A tradeoff between cost and privacy
Privacy is better preserved with more fake items
More fake items will result in higher additional cost.
The VLDB07 paper weaknesses No strong proof on security does not mention any attacks – there are
possibly attacks on this scheme
Distributed datasets
Perturbation works, but existing encryption protocols do not work might need some protocols to make the
encryption method applicable
Horizontally or vertically partitioned Paper1: "Privacy-Preserving Distributed Mining of
Association Rules on Horizontally Partitioned Data," Paper2: "Privacy preserving association rule mining in
vertically partitioned data," Shared features: using the cryptographic methods to
construct protocols
Hiding sensitive rules
When publishing data for rule mining, the rule itself can be sensitive too.
Basic methods Decrease the support
Support = # of occurrences of itemset/ total # of transactions
Decrease the confident Confidence = # of occurrences of itemset/#
of occurrences of left(rule)
discussion on rule hiding Need sufficient amount of
computational cost at the data owner side
You should know what rules are sensitive in advance! So only necessary for the case you have
to share the data
When hiding sensitive rules, other rules might be damaged
Summary Two methods for single party rule
mining Perturbation and encryption
Distributed multiparty can use protocols and perturbation method
Hiding sensitive rules is also important in some cases