Preserving Privacy in Data Preparation for
Association Rule Mining
Nan Zhang, Shengquan Wang, and Wei Zhao, Fellow, IEEE
Abstract
We address the privacy preserving association rule mining problem in a distributed system with one data miner
and multiple data providers, each of which holds one transaction. The literature has tacitly assumed that randomization is the
only effective approach to preserve privacy in such circumstances. We challenge this assumption by introducing
a scheme based on algebraic techniques in the data preparation phase. Compared to previous approaches, our new
scheme can identify association rules more accurately but disclose less private information. Furthermore, our new
scheme can be readily integrated as a middleware with existing systems.
Index Terms
Data mining; clustering, classification, and association rules; privacy; singular value decomposition.
I. INTRODUCTION
The goal of data mining is to extract interesting knowledge from large amounts of data [1]. Traditional
data mining algorithms deal with centralized data. Recently, a number of applications on the Internet lead
to a need for mining distributed data. In this circumstance, a privacy concern arises from the distributed
data providers. In this paper, we address issues related to producing accurate data mining results while
preserving the private information in the data being mined.

The authors are with the Department of Computer Science, Texas A&M University, College Station, TX 77840. E-mail: {nzhang, swang,
zhao}@cs.tamu.edu. A preliminary version of this paper is to be presented at the 8th European Conference on Principles and Practice of
Knowledge Discovery in Databases (PKDD), September 2004.
We will focus on association rule mining, which will be briefly reviewed in the next section. Since
Agrawal, Imielinski, and Swami addressed this problem in [2], association rule mining has been an active
research area due to its wide applications and the challenges it presents. Many algorithms have been
proposed and analyzed [3]–[5]. However, few of them have addressed the issue of privacy protection.
We can classify privacy preserving association rule mining systems into two classes based on their
infrastructures, namely Server-to-Server (S2S) and Client-to-Server (C2S). In the first category
(S2S), data are distributed across several autonomous entities (servers) [6], [7]. Each server holds a private
database that contains numerous data points (i.e., transactions). The servers collaborate with each other
to identify association rules spanning multiple databases. Since usually only a few servers are involved
in a system (e.g., fewer than 10), the problem can be modeled as a variation of secure multi-party
computation [8], which has been extensively studied in cryptography [9].
In the second category (C2S), a system consists of one data miner (server) and a large number of data
providers (clients) [10], [11]. Each data provider holds only one transaction. Association rule mining is
performed by the data miner on the aggregate transactions provided by the data providers. An online survey
is a typical example of this kind of system: it can be modeled as consisting of one data miner
(i.e., the survey analyzer) and thousands of data providers (i.e., the survey respondents). To ensure the
effectiveness of the survey results (e.g., to block multiple votes from a single IP address), the identity of the
data providers cannot be hidden from the data miner. Thus, privacy is of particular concern in this kind of
system. In fact, there has been wide media coverage of the public debate on protecting privacy in online
surveys [12].
Both S2S and C2S systems have wide applications. Nevertheless, we will focus on studying C2S
systems in this paper. Several studies have been carried out on privacy preserving association rule mining
in C2S systems. Most of them have tacitly assumed that an effective approach to preserving privacy is
randomization. If we consider data mining as a two-phase process, consisting of the data
preparation phase and the data mining phase, the randomization approach is involved in both phases. We
challenge this assumption by introducing a new scheme that operates only in the data preparation phase.

Fig. 1. Two-phase data mining
Our new scheme integrates algebraic techniques with random noise perturbation. It has the following
important features that distinguish it from previous approaches:
• Our scheme is easy to implement and flexible. Our scheme is involved only in the data preparation
phase and does not require a support recovery procedure in the data mining phase. Thus, our new
scheme is transparent to the data mining process. It can be readily integrated as a middleware with
existing systems.
• Our scheme can identify association rules more accurately but disclose less private information.
Roughly speaking, our scheme selects the most useful features for association rule mining from the
original data and transmits only these features to the data miner. Our simulation data show that, for
the same level of accuracy, our system discloses about five times less private information than the
randomization approach.
• We allow explicit negotiation between the data providers and the data miner in terms of the tradeoff
between the accuracy of data mining results and the privacy of data providers. Instead of following
the rules set by the data miner, a data provider can play a role in determining the tradeoff between
accuracy and privacy. This is an important feature because people have a wide variety of attitudes
towards privacy. According to survey results in [13], net users’ attitudes towards privacy can be
divided into three categories: 17% are privacy fundamentalists, who are extremely concerned about the use
of their private data; 56% are the pragmatic majority, who are generally willing to provide data if privacy
protection measures are offered; and 27% are marginally concerned, who barely care about privacy.
The negotiation feature in our scheme can help the data miner to collaborate with both hard-core
privacy fundamentalists and people comfortable with limited privacy disclosure.
The rest of this paper is organized as follows: In Section II, we briefly review previous approaches. We
present our models and introduce our new scheme in Section III. The communication protocol of our
new scheme and its basic components are also provided in this section. In Section IV, we present the
theoretical analysis of the tradeoff between accuracy and privacy in our scheme. The simulation results are
presented in Section V. In Section VI, we present the experimental results on the performance evaluation
of our scheme. Implementation and overhead issues are discussed in Section VII, followed by a final
remark in Section VIII.
II. APPROACHES
In this section, we first overview the definition of association rule mining. Then, we introduce our
models of the data miners. Based on the model, we review the randomization approach, which has been
widely used in the literature of privacy preserving data mining. We will address the problems associated
with the randomization approach, which motivates us to design a new privacy preserving scheme.
A. Association Rule Mining
A motivating example for association rule mining is a survey of hobbies. Each survey respondent
chooses an arbitrary number of hobbies from five options: football, soccer, beauty, video games, and PC
games. These options are called items. For a survey respondent, the answer to the survey is called a
transaction. As we can see, a transaction is a set of items (e.g., t ={football, soccer}). Given a number
of transactions, association rule mining finds interesting correlations (relationships) between items [1]. For
example, suppose that there are 5,000 survey respondents. The survey analyzer finds that among the 1,000
respondents who choose video games as a hobby, there are 800 respondents who also choose PC
games as their hobbies. Based on the survey results, the survey analyzer may infer an association rule
that can be represented as follows.
video games ⇒ PC games [support = 16%, confidence = 80%]. (1)
The support of this rule is the percentage of the respondents who choose both video games and PC
games. Roughly speaking, the confidence is the probability that a respondent who chooses video games
also chooses PC games. Both support and confidence are measures of the validity and trustworthiness of
the association rules. Technically speaking, the main task of association rule mining is to find all frequent
itemsets, which are itemsets that have support larger than a threshold determined by the data miner.
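The definitions above can be illustrated with a short sketch. The Python code below uses synthetic transaction data, constructed only to reproduce the counts in the example, and computes the support and confidence of rule (1):

```python
# Hedged sketch: support and confidence for the hobby-survey example.
# Transactions are sets of item names; the synthetic data reproduce the
# counts in the text (5,000 respondents, 1,000 choose video games,
# 800 of those also choose PC games).

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """supp(antecedent ∪ consequent) / supp(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Synthetic data reproducing the example's counts.
transactions = (
    [{"video games", "PC games"}] * 800
    + [{"video games"}] * 200
    + [{"football"}] * 4000
)

s = support(transactions, {"video games", "PC games"})       # 800/5000 = 0.16
c = confidence(transactions, {"video games"}, {"PC games"})  # 800/1000 = 0.8
```

Mining for frequent itemsets then amounts to keeping every itemset whose support clears the miner's threshold.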
B. Model of Data Miners
Because of the privacy concern introduced into the system, we classify the data miners into two categories.
One category is legal data miners. These data miners always act legally in that they only perform regular
data mining tasks and would never intentionally invade the privacy of the data providers. The other
category is illegal data miners. These data miners would purposely compromise the privacy of the data
providers.
Like adversaries in distributed systems, illegal data miners come in many forms. In most forms, their
behavior is restricted from arbitrarily deviating from the protocol. In this paper, we focus on a particular
sub-class of illegal miners. That is, in our system, illegal data miners are honest but curious: they follow
proper protocols (i.e., they are honest), but they analyze all intermediate communications and received
transactions (i.e., they are curious) to discover private information [9]. Even though it is a relaxation from
the Byzantine behavior, this kind of honest but curious (nevertheless illegal) behavior is common and has
been widely used as the adversary model in the literature.
C. Randomization Approach
To prevent invasion of privacy due to the existence of illegal data miners, countermeasures must be
implemented in the data mining system. We briefly review the randomization approach, which is currently
used to preserve privacy in association rule mining.
Based on the randomization approach, the entire data mining process is a two-step process. The first
step is in the data preparation phase. In this step, a data provider first applies the randomization algorithm
to the transaction it holds. Then, the data provider transmits the randomized transaction to the data
miner. In previous studies, several randomization algorithms have been proposed, including the cut-and-
paste operator [10] and the MASK operator [11]. For example, when the cut-and-paste operator is used, the
data provider first randomly chooses an integer j as the number of items that occur in both the original
transaction t and the randomized transaction R(t). After that, the data provider randomly chooses j items
from t and places these items into R(t). Then, for every item a ∉ t, the data provider tosses a coin
with probability ρ to place a into R(t).
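A minimal Python sketch of the cut-and-paste operator just described, under simplifying assumptions: the operator in [10] draws j from a more elaborate distribution (governed by a cutoff parameter), so here j is drawn uniformly for illustration only.

```python
import random

def cut_and_paste(t, items, rho, rng=random):
    """Simplified cut-and-paste randomization (after [10]).

    Keeps j randomly chosen items from the original transaction t ("cut"),
    then adds every item not in t independently with probability rho
    ("paste"). The uniform choice of j is an illustrative simplification.
    """
    t = list(t)
    j = rng.randint(0, len(t))       # number of "real" items to keep
    kept = set(rng.sample(t, j))
    for a in items:
        if a not in t and rng.random() < rho:
            kept.add(a)              # paste a fake item
    return kept

items = [f"a{i}" for i in range(10)]
r = cut_and_paste({"a1", "a2", "a3"}, items, rho=0.2)
```

Note that the same operator is applied to every transaction and every item, which is exactly the transaction- and item-invariance discussed later in this section.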
In the second step, the data miner performs association rule mining on the aggregate data. With the
randomization approach, the data miner must first employ a support recovery algorithm, which attempts to
reconstruct the support of candidate itemsets.
Also in the second step, an illegal data miner may invade privacy by using a privacy data recovery
algorithm on the randomized transactions supplied by the data providers.
Figure 2 depicts the privacy preserving association rule mining system with the randomization approach.
Clearly, any such system should be measured by its capability of both generating accurate association
rules and preventing invasion of privacy.
D. Problems of the Randomization Approach
While the randomization approach is intuitive, researchers have recently identified some problems of
the randomization approach as follows.
Fig. 2. The randomization approach
• In [10], the authors remarked that if the cut-and-paste operator is applied to a transaction with 10
or more items, it is difficult, if not impossible, for the data provider to contribute to the association
rule mining with its privacy preserved. Furthermore, large itemsets have exceedingly high variance
in their recovered support. Similar problems exist with other randomization operators, as they share
a similar scheme for randomizing the original data.
An approach was proposed in [10] to address this problem: all data providers that
hold transactions with 10 or more items simply do not transfer their randomized transactions to the data miner.
Unfortunately, this approach prevents many frequent itemsets that contain 4 or more items from being
discovered by the data miner.
• In [14], the authors showed that the spectral properties of randomized data could help curious
data miners to separate noise from private data. Based on random matrix theory, they proposed
a filtering method to reconstruct private data from the randomized data set. They demonstrated that
the randomization approach preserves very little privacy in many cases. Although their work is based
on the randomization approach for privacy preserving data classification, we believe that the similarity
between randomization operators in association rule mining and data classification suggests that the problem
is inherent in the randomization approach.
• The randomization approach also suffers in efficiency. Since the privacy preserving mechanism is
not restricted to the data preparation phase, it puts a heavy load on (legal) data miners at run time
(because of the support recovery) [15]. It has been shown that the cost of mining a randomized data set is
well within an order of magnitude of that of mining the original data set.
We explore the reasons behind these problems as given below.
• We note that previous randomization approaches are transaction-invariant. That is, the same perturbation
algorithm is applied to all data. Since more items in the original transaction always result in
more “real” items being included in the randomized transaction, privacy protection for transactions
of large size (e.g., |t| > 10) is doomed to failure.
• Previous randomization approaches are item-invariant. All items in the original transaction have the
same probability of being included in the perturbed transaction. No specific operation is performed
to preserve the correlation between different items. Thus, a lot of “real” items in the perturbed
transactions may not appear in any frequent itemset. That is, the disclosure of these items does not
contribute to the mining of association rules.
We remark that the transaction-invariant and item-invariant properties are inherent in the randomization
approach. The reason is that in a system using the randomization approach, the communication is one-way:
from the data providers to the data miner. As such, a data provider cannot obtain any specific guidance
on the perturbation of its transaction from the (legal) data miner, nor can the data providers learn the
correlation between the items. Thus, a data provider has no choice but to use a transaction-invariant and
item-invariant approach.
This observation motivates us to develop a new scheme that allows two-way communication between
the data miner and the data providers. The two-way communication helps preserve privacy while not
incurring too much overhead. Thereby, we significantly improve the performance in terms of accuracy,
privacy, and efficiency. We describe the new scheme in the next section.
III. COMMUNICATION PROTOCOL AND RELATED COMPONENTS
In this section, we introduce our new scheme including the communication protocol and its basic
components.
A. Description of Our New Scheme
Fig. 3. Our new scheme
Figure 3 depicts the infrastructure of a system using our new scheme. Our scheme has two key
components: perturbation guidance (PG) on the data miner (server) side and perturbation on the data provider
(client) side. Compared to the randomization approach, our scheme does not have the support recovery
component. Instead, the association rule mining is performed on the perturbed transactions (R(t)) directly.
Thus, our scheme is restricted to the data preparation stage and does not put a heavy load on the data
miner by recovering the support at run time.
Our scheme is a three-step process. In the first step, the data miner negotiates a perturbation level
k with each data provider. The larger k is, the more contribution R(t) will make to the association rule
mining task. The smaller k is, the more private information is preserved. Thus, a privacy fundamentalist
can choose a small k to preserve its privacy while a privacy unconcerned data provider can choose a large
k to contribute to the association rule mining.
The second step is to transmit the perturbed transactions from the data providers to the data miner.
Since each data provider arrives at a different time (e.g., different survey respondents take the survey at
different times), this step can be considered an iterative process. In each stage, the data miner dispatches
a reference (perturbation guidance) Vk to a data provider Pi. Here Vk depends on the perturbation level
k that is negotiated between the data miner and the data provider Pi in the first step. Based on the received Vk,
the perturbation component of Pi computes the perturbed transaction R(t) from its original transaction
t. Then, Pi transmits R(t) to the perturbation guidance (PG) component of the data miner. The PG
component then updates Vk based on R(t) and forwards R(t) to the association rule mining process. A
curious data miner can also obtain R(t); in this case, it uses a private data recovery
algorithm to try to compromise the privacy in R(t).
In the third step, the perturbed transactions received by the data miner are used by the association rule
mining process. Association rules are identified and delivered to the data miner.
The key here is to properly design Vk so that correct guidance can be given to the data providers on how
to perturb the transactions. In our scheme, we let Vk be an algebraic quantity derived from the currently
received, yet perturbed transactions. The details of computing Vk will be presented as a basic component
of our scheme.
B. Notions of Transactions
Before presenting the details of the communication protocol and its basic components, we first introduce
some notation for the data set. Let I be a set of n items (i.e., I = {a1, . . . , an}). Suppose that there are
m data providers in the system. Each data provider Ci holds a transaction ti, which is a subset of I. We
represent the data set by an m×n matrix T = [a1, . . . , an] = [t1; . . . ; tm] 1. An example of the transaction
matrix T is shown in Table I. Let 〈T 〉ij be the element of T with indices i and j. The element 〈T 〉ij
indicates whether item aj is in transaction ti. For example, if transaction t1 contains items a1 and a2, then the
first row of the matrix has 〈T 〉1,1 = 〈T 〉1,2 = 1.
TABLE I
AN EXAMPLE OF A TRANSACTION MATRIX

       a1   a2   · · ·   an
t1     1    1    · · ·   0
⋮      ⋮    ⋮    ⋱       ⋮
tm     1    0    · · ·   1
An itemset B ⊆ I is an h-itemset if and only if B contains h items (i.e., |B| = h). The support of B
is the percentage of the transactions in the data set that contain B. That is,

supp(B) = |{t ∈ T | B ⊆ t}| / m.    (2)
An h-itemset B is frequent if supp(B) ≥ min supp, where min supp is a predefined minimum threshold
of support. Referring to the “survey of hobbies” example in the above subsection, {video games, PC games}
is a frequent 2-itemset with support 0.16. The set of frequent h-itemsets is denoted by Lh.
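The matrix view of a data set and the support definition in Eq. (2) can be sketched as follows (Python with NumPy; the matrix entries are illustrative, not from any real data set):

```python
# Sketch: the transaction matrix T of Table I as a 0/1 array, with
# supp(B) computed per Eq. (2) as the fraction of rows containing B.
# Column j corresponds to item a_{j+1}; the data are illustrative.
import numpy as np

T = np.array([
    [1, 1, 0, 0],   # t1 = {a1, a2}
    [1, 0, 1, 0],   # t2 = {a1, a3}
    [1, 1, 0, 1],   # t3 = {a1, a2, a4}
    [0, 0, 1, 0],   # t4 = {a3}
])

def supp(T, cols):
    """Support of the itemset given by column indices `cols`."""
    return float(np.all(T[:, cols] == 1, axis=1).mean())

s12 = supp(T, [0, 1])   # supp({a1, a2}) = 2/4 = 0.5
```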
C. The Communication Protocol
We now describe the communication protocol of our new scheme. The negotiation between the data
miner and the data providers is shown in Protocol 1. On the data miner (server) side, two threads
perform the operations in Protocol 2 and Protocol 3 iteratively after the negotiation. A data provider
performs the operations in Protocol 4 to perturb and transmit its transaction to the data miner.
1Here ti and ai are used somewhat ambiguously. In the context of association rule mining, ti is a transaction and ai is an item. In the
context of matrix, ti represents a row vector in T and ai represents a column vector in T .
Protocol 1 Negotiation
NM1. Based on the SVD of T ∗ (T ∗ = U∗Σ∗V ∗′), the data miner calculates S = 〈Σ∗〉11² + · · · + 〈Σ∗〉nn²;
NM2. Find the smallest k ∈ [1, n] such that ∑_{i=1}^{k} 〈Σ∗〉ii² ≥ µ · S;
NM3. The data miner dispatches k to registered data providers;
NP1. For a data provider Ci,
if Ci receives k ≤ Kt (Kt is the threshold of truncation level set by Ci) then
Ci sends a ready message to the data miner;
end if
Protocol 2 Thread of registering data providers
R1. Negotiate on the truncation level k with a data provider;
R2. Wait for a ready message from a data provider;
R3. Upon receiving the ready message from a data provider,
• Register the data provider;
• Send the data provider current Vk;
R4. Go to Step R1;
D. Basic Components
There are three key components in the communication protocol of our scheme: (a) the method of
computing Vk, (b) the perturbation function R(·), and (c) the negotiation on truncation level k.
1) Computation of Vk: Recall that Vk carries information from the data miner to data providers on how
to perturb the original transactions to preserve privacy. In our scheme, Vk is an estimate of the eigenvectors
of A = T ′T (i.e., the right singular vectors of T ). The justification of Vk on providing accurate mining
results and preserving privacy is presented in Appendix I.
As we are considering dynamic cases where the perturbed transactions are dynamically fed to the data
miner, the data miner keeps a copy of all received (perturbed) transactions and updates it when a new
perturbed transaction is received. Assume that the initial set of received (perturbed) transactions T ∗ is
Protocol 3 Thread of receiving transactions
T1. Wait for a perturbed transaction R(t) from a data provider;
T2. Upon receiving the transaction from a registered data provider,
• Update Vk based on the recently received perturbed transaction;
• Deregister the data provider;
T3. Go to Step T1;
Protocol 4 Transaction perturbation
P1. Negotiate on the truncation level k with the data miner;
P2. Send the data miner a ready message indicating that this provider is ready to contribute to the mining
process;
P3. Wait for a message that contains Vk from the data miner;
P4. Upon receiving the message from the data miner,
• Compute R(t) based on t and Vk;
P5. Transfer R(t) to the data miner;
empty 2. Every time a perturbed transaction R(t) is received, T ∗ is updated by appending R(t)
to the bottom of T ∗. Thus, T ∗ is the matrix of currently received (perturbed) transactions. We derive Vk
from T ∗.
In particular, the computation of Vk is done in the following steps. Using singular value decomposition
(SVD) [16], we decompose T ∗ as in (3), where Σ∗ = diag(s1, . . . , sn) is a diagonal matrix with s1 ≥ · · · ≥ sn:

T ∗ = U∗Σ∗V ∗′.    (3)

The squared values s_i² are the eigenvalues of A∗ = T ∗′T ∗. V ∗ is an n × n unitary matrix composed of the
eigenvectors of A∗.
Vk is composed of the k eigenvectors of A∗ that are associated with the largest k eigenvalues of A∗. If
V ∗ = [v1, . . . , vn], we have

Vk = [v1, . . . , vk].    (4)

2 T ∗ may also contain some transactions provided by privacy-careless data providers.
Thus, we call Vk the k-truncation of V ∗. Several incremental algorithms have been proposed to update
Vk when a perturbed transaction is received by the data miner [17], [18]. The computing cost of updating
Vk is addressed in Sect. VII.
As we will see in Sect. IV, k plays a critical role in balancing accuracy and privacy. We will also show
that by using Vk in conjunction with R(·), which is to be discussed next, we can achieve both desired
accuracy and privacy protection.
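A sketch of how the PG component might compute Vk (Python with NumPy). It uses a batch SVD for clarity; in the actual scheme, the incremental algorithms of [17], [18] would update Vk without recomputing the full decomposition.

```python
# Sketch of the PG component: compute V_k, the k-truncation of the
# right singular vectors of the currently received matrix T*.
# A batch SVD is used for clarity; incremental updates ([17], [18])
# would replace the full recomputation in practice.
import numpy as np

def k_truncation(T_star, k):
    """Return V_k: the k right singular vectors of T* associated
    with the largest k singular values."""
    _, _, Vt = np.linalg.svd(T_star, full_matrices=False)
    return Vt[:k].T          # columns v_1, ..., v_k

T_star = np.array([[1., 1., 0.],
                   [1., 0., 1.],
                   [1., 1., 1.]])
Vk = k_truncation(T_star, 2)   # an n x k matrix (here 3 x 2)
```

Since `numpy.linalg.svd` returns singular values in descending order, taking the first k rows of V′ yields exactly the eigenvectors associated with the k largest eigenvalues of A∗.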
2) Perturbation function: Recall that once a data provider receives Vk from the data miner, the data
provider applies a perturbation function R(·) to its transaction t. The result is a perturbed transaction
R(t) that will be transmitted to the data miner. The computation of R(t) is defined as follows. First, for
a given Vk, the transaction t is transformed into t̄ = tVkVk′. Note that the elements of the vector t̄ may
not be integers. Algorithm 5 is employed to round t̄ to 0 or 1. In this algorithm, ρt is a pre-defined
parameter. Finally, for the completeness of our work, we introduce an optional procedure to enhance the
privacy preserving capability of the system, which is shown in Algorithm 6. A data provider may use
this procedure to insert additional noise into R(t). In it, ρm is a parameter determined by the
data provider. The higher ρm is, the more noise is inserted into the perturbed transaction. We remark that
this procedure is optional and is only needed by privacy fundamentalists. An example of the perturbation
process is provided in Appendix II.
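The projection and rounding steps can be sketched as follows (Python with NumPy; the toy Vk and threshold are illustrative, and the optional noise step of Algorithm 6 is omitted):

```python
# Sketch of the data provider's perturbation: project t onto the
# subspace spanned by V_k (t_bar = t V_k V_k'), then round each entry
# to 0/1 with threshold 1 - rho_t, as in Algorithm 5.
import numpy as np

def perturb(t, Vk, rho_t):
    """Return R(t) for a 0/1 transaction vector t."""
    t_bar = t @ Vk @ Vk.T              # projection; entries need not be integers
    return (t_bar >= 1.0 - rho_t).astype(int)

Vk = np.eye(4)[:, :2]                  # toy V_k keeping the first two coordinates
t = np.array([1, 0, 1, 1])
Rt = perturb(t, Vk, rho_t=0.4)         # -> [1, 0, 0, 0]
```

With this toy Vk, the projection zeroes out the last two coordinates, so the items outside the retained subspace are dropped from R(t).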
3) Negotiation on truncation level: In order to retain enough information for association rule mining
after the transformation from t to R(t), a textbook heuristic is to make the sum of the k eigenvalues
associated with the retained eigenvectors of A∗ larger than 85% of the sum of all eigenvalues of A∗ (i.e.,
µ = 85%) [16], [19]. The perturbation level k is usually large at the beginning but soon decreases and stabilizes
at a fairly small value (e.g., less than 1% of n). Thus, in the theoretical analysis of our scheme, we consider
the perturbation level k as a predetermined parameter rather than a variable updated throughout the data
Algorithm 5 Mapping
Let 〈t̄〉i be the element, with index i, of the transformed vector t̄ = tVkVk′. Similar notation applies to other vectors.
for every element 〈t̄〉i in t̄ do
if 〈t̄〉i ≥ 1 − ρt then
〈R(t)〉i = 1
else
〈R(t)〉i = 0
end if
end for
Algorithm 6 Random-noise perturbation
for every item ai ∉ t do
Choose a real number j uniformly at random on [0, 1]
if j ≥ 1 − ρm then
〈R(t)〉i = 1
end if
end for
preparation process.
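The heuristic above, i.e., step NM2 of Protocol 1, can be sketched as follows (Python with NumPy; the matrix is a toy example):

```python
# Sketch of the negotiation heuristic (Protocol 1): pick the smallest k
# whose top-k squared singular values capture a fraction mu of the total
# energy S = sum_i <Sigma*>_ii^2 (the text uses mu = 0.85).
import numpy as np

def choose_k(T_star, mu=0.85):
    s = np.linalg.svd(T_star, compute_uv=False)  # singular values, descending
    energy = s ** 2                               # eigenvalues of A* = T*'T*
    cum = np.cumsum(energy)
    # smallest k with cum[k-1] >= mu * S
    return int(np.searchsorted(cum, mu * energy.sum()) + 1)

T_star = np.array([[1., 1., 0.],
                   [1., 1., 0.],
                   [0., 0., 1.]])
k = choose_k(T_star, mu=0.85)
```

For this toy matrix the squared singular values are 4, 1, and 0, so the top eigenvalue alone captures only 80% of the energy and the heuristic returns k = 2.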
We have described the communication protocol of our scheme and its key components. We now discuss
the accuracy and privacy measures of our scheme.
IV. ANALYSIS ON ACCURACY AND PRIVACY
In this section, we analyze our new scheme. We define measures for accuracy and privacy and derive
their bounds, in order to provide guidelines for the tradeoff between these two measures and hence help
system managers set parameters. We also show the simulation and experimental results of our scheme
on real datasets. For simplicity of discussion, we do not consider the optional random noise
insertion procedure of Algorithm 6.
A. Accuracy Measure
An accuracy measure should reflect the capability of the system to correctly identify association
rules in a given dataset. We define the accuracy measure as the error in the support of frequent itemsets.
This is because the main task of association rule mining is to identify frequent itemsets with support
larger than a threshold min supp. There are two kinds of errors: a) false drops, which are unidentified
frequent itemsets, and b) false positives, which are itemsets incorrectly identified as frequent.
We now formally define our accuracy measure. Given itemset Ij , let supp(Ij) and supp′(Ij) be the
support of Ij in the original transactions T and the perturbed transactions R(T ), respectively. Recall that
the set of frequent h-itemsets in T is Lh. We define the errors on false drops and false positives as follows.
Definition 4.1: Given itemset size h, the error on false drops, ρ_1^h, and the error on false positives, ρ_2^h,
are defined as

ρ_1^h = max_{Ij ∈ Lh} ( supp(Ij) − supp′(Ij) ),    (5)

ρ_2^h = max_{Ij ∉ Lh} ( supp′(Ij) − supp(Ij) ).    (6)
We define our accuracy measure, the degree of accuracy, as the maximum value of ρ_1^h and ρ_2^h over all
itemset sizes.
Definition 4.2: The degree of accuracy in a privacy preserving association rule mining system is defined
as γ = max_{h≥1} max(ρ_1^h, ρ_2^h).
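A sketch of Definitions 4.1 and 4.2 in Python. The support values and the dictionary-based bookkeeping are illustrative assumptions, not part of the scheme itself:

```python
# Sketch of Definitions 4.1-4.2: given original and perturbed supports,
# compute the degree of accuracy gamma. Frequent itemsets contribute
# false-drop error supp - supp'; infrequent ones contribute false-positive
# error supp' - supp. gamma is the maximum error over all itemsets
# (equivalently, over all sizes h).

def degree_of_accuracy(supp_orig, supp_pert, min_supp):
    """supp_orig/supp_pert map itemset -> (size h, support)."""
    gamma = 0.0
    for itemset, (h, s) in supp_orig.items():
        s_pert = supp_pert[itemset][1]
        if s >= min_supp:                 # frequent: error is a false drop
            gamma = max(gamma, s - s_pert)
        else:                             # infrequent: error is a false positive
            gamma = max(gamma, s_pert - s)
    return gamma

supp_orig = {("a1",): (1, 0.60), ("a2",): (1, 0.05), ("a1", "a2"): (2, 0.04)}
supp_pert = {("a1",): (1, 0.58), ("a2",): (1, 0.07), ("a1", "a2"): (2, 0.03)}
g = degree_of_accuracy(supp_orig, supp_pert, min_supp=0.09)
```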
Based on the definition, we derive an upper bound on the accuracy measure.
Theorem 4.3: In our system, the degree of accuracy γ satisfies

γ ≤ (σ_{k+1}² / m) · ( 1 + max{ (1 − (1 − ρt)²) / (1 − ρt)², (1 − ρt) / ρt } ),    (7)

where σ_{k+1}² is the (k+1)-th largest eigenvalue of A = T ′T. In particular, when ρt = (3 − √5)/2, γ reaches
its lowest upper bound at γ ≤ 2.618 σ_{k+1}²/m.
The proof of Theorem 4.3 can be found in Appendix III. Our bound on the accuracy measure is fairly small
when the number of transactions (m) is sufficiently large, which is usually the case in reality. In fact, our
scheme tends to enlarge the support of frequent itemsets and reduce the support of infrequent itemsets.
Thus, the upper bound is not always tight. We can observe this trend in the experimental results, which
will be presented in Section VI.
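The constant in Theorem 4.3 can be checked numerically. The sketch below evaluates the bracketed factor f(ρt) = 1 + max{(1 − (1 − ρt)²)/(1 − ρt)², (1 − ρt)/ρt} and confirms that it is minimized near ρt = (3 − √5)/2, where it equals (3 + √5)/2 ≈ 2.618:

```python
# Numeric check of the constant in Theorem 4.3. At rho_t = (3 - sqrt(5))/2
# both branches of the max equal the golden ratio, and f = (3 + sqrt(5))/2.
import math

def f(rho_t):
    a = (1 - (1 - rho_t) ** 2) / (1 - rho_t) ** 2
    b = (1 - rho_t) / rho_t
    return 1 + max(a, b)

rho_star = (3 - math.sqrt(5)) / 2        # ~0.382
f_star = f(rho_star)                     # ~2.618
# f is larger on either side of rho_star:
assert f(rho_star - 0.05) > f_star and f(rho_star + 0.05) > f_star
```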
B. Privacy Measure
In our system, the data miner cannot infer the original transaction t from a perturbed transaction R(t)
deterministically, because VkVk′ is a singular matrix with determinant det(VkVk′) = 0 (i.e., it does not
have an inverse). To measure the probability that an item in t can be identified from R(t), we need
a privacy measure.
a privacy measure.
According to survey results in [12], a data provider (e.g., a net user) has a strong desire to filter out
“unwanted” data (i.e., data that do not contribute to the mining of association rules) before transmitting its private
data to the data miner. Given transaction t, an item ai in t is unwanted if ai does not appear in any frequent
itemset contained in t (i.e., for all h ≥ 1, ai belongs to no itemset in Lh that is contained in t). That is, the disclosure of ai (i.e., ai ∈ R(t)) does not contribute
to the mining of association rules. We measure privacy by the probability that an “unwanted” item is
included in the perturbed transaction. Formally speaking, our privacy measure, named level of privacy, is
defined as follows.
Definition 4.4: Given transaction t, an item ai ∈ t is unwanted if and only if there does not exist any
frequent itemset I ⊆ t such that ai ∈ I . We define the level of privacy as the probability that an infrequent
item in t is included in the perturbed transaction R(t). That is, the level of privacy is defined as
δ = Pr{ai ∈ R(t)|ai is unwanted in t}. (8)
As we can see from the definition, a higher value of the level of privacy corresponds to a higher probability of privacy
invasion. With an approximation, we derive an upper bound on the level of privacy in our scheme.
Theorem 4.5: With properly set parameters, the level of privacy in our system is bounded by

δ ≲ 1 − √( (σ_{k+1}² + · · · + σ_n²) / (σ_1² + · · · + σ_n²) ),    (9)

where σ_i² is the i-th largest eigenvalue of A = T ′T.
The proof of Theorem 4.5 can be found in Appendix IV. Because of the uncertainty introduced by the
rounding operation in Algorithm 5, this bound is not tight.
As we can see from Theorem 4.3 and Theorem 4.5, there is a tradeoff between accuracy and privacy in
our system. The upper bound on the degree of accuracy is proportional to σ_{k+1}², while the level of privacy is a
decreasing function of σ_{k+1}². The larger the perturbation level k is, the more unwanted items are disclosed
to the data miner, and the more frequent itemsets can be correctly identified.
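The bound of Theorem 4.5 can be evaluated directly from the singular values of a transaction matrix. The sketch below (Python with NumPy, on a toy matrix) shows how the bound grows with the truncation level k, illustrating the tradeoff:

```python
# Sketch of the bound in Theorem 4.5:
#   delta <~ 1 - sqrt((sigma_{k+1}^2 + ... + sigma_n^2)
#                     / (sigma_1^2 + ... + sigma_n^2)),
# evaluated from the singular values of a toy transaction matrix.
import numpy as np

def privacy_bound(T, k):
    energy = np.linalg.svd(T, compute_uv=False) ** 2   # sigma_i^2, descending
    return 1 - np.sqrt(energy[k:].sum() / energy.sum())

T = np.array([[1., 1., 0., 0.],
              [1., 1., 0., 0.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])
bounds = [privacy_bound(T, k) for k in range(5)]
# Larger k keeps more of the spectrum, so the bound increases with k.
```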
V. SIMULATION RESULTS
In this section, we present the simulation results of our scheme on a randomly generated dataset. The
experimental results on a real dataset will be presented in the next section.
The randomly generated dataset consists of 2,000 transactions over 20 items. Each
transaction is a subset of the 20 items. We represent the dataset by a 2,000 × 20 matrix T. The
first 19 columns (items) of T , a1, . . . , a19, are independently generated. The 20th column is set to be the
same as the 10th column (i.e., a10 = a20). The first item a1 has probability 0.6 of being included in a
transaction; every other item has probability 0.1 of being included. Thus, the expected
support of a1 is 0.6. The expected support of every other item is 0.1. Since a10 = a20, the expected
support of {a10, a20} is 0.1. We set min supp = 0.09 as the threshold (lower bound) on the support of a
frequent itemset.
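The construction above can be sketched in a few lines (a sketch, not the authors' generator; the random seed and the use of numpy are our own choices):

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 2000, 20

# First 19 items are independent Bernoulli columns; a20 copies a10.
T = np.zeros((m, n), dtype=int)
T[:, 0] = rng.random(m) < 0.6          # a1: inclusion probability 0.6
for j in range(1, 19):
    T[:, j] = rng.random(m) < 0.1      # a2..a19: probability 0.1
T[:, 19] = T[:, 9]                     # a20 = a10

supp_a1 = T[:, 0].mean()               # expected support 0.6
supp_pair = (T[:, 9] & T[:, 19]).mean()  # support of {a10, a20}, expected 0.1
min_supp = 0.09
```

With this construction the support of {a10, a20} equals the support of {a10}, so the pair clears the min_supp = 0.09 threshold in expectation.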
In the randomly generated dataset, the support of {a1} and {a10, a20} are listed as follows.
support of {a1} = 0.6085, support of {a10, a20} = 0.0985. (10)
As we can see, these two itemsets have much higher support than the other 1-itemsets and 2-itemsets,
respectively. The left part of Figure 3 shows the original dataset. In the original dataset, the total number
of item occurrences is 4,929, including 1,217 occurrences of a1, 197 of a10, 197 of a20, and 3,318 of
the other items.
In our simulation, we set the parameters as µ = 0.85 (in Protocol 1), ρt = 0.8 (in Algorithm 5),
and ρm = 0 (in Algorithm 6). The data miner updates Vk after every 10 transactions are received. The
truncation level k calculated from Protocol 1 is listed in Table II.
TABLE II
TRUNCATION LEVEL k
Transaction No.   1−20   21−50   51−90   91−120   121−200   201−2000
k                  7       6       5       4        3          2
The right part of Fig. 3 shows the dataset after perturbation. In the perturbed dataset, the total number
of item occurrences is 1,646, including 1,217 occurrences of a1, 197 of a10, 197 of a20, and 35 of the
other items.
The association rule mining results are stated as follows.
{a1}, with support 0.6085 and {a10, a20}, with support 0.0985. (11)
As we can see, the supports of the "interesting" itemsets are perfectly recovered, with degree of accuracy
γ = 0. Privacy is well preserved, with level of privacy δ = 1.05%.
Fig. 3. Comparison between original and perturbed transactions
VI. EXPERIMENTAL RESULTS ON A REAL DATASET
In this section, we make a comparison between our approach and the cut-and-paste randomization
approach [10] by experimental results on a real dataset. We use a real world dataset named “BMS
Webview 1” [20], which contains web click stream data of several months from the e-commerce website
of a leg-care company. The dataset contains 59,602 transactions and 497 items.
We randomly choose 10,871 transactions from the dataset as our test band. The maximum transaction
size is 181, and the average transaction size is 2.90. There are 325 transactions (2.74%) that contain 10 or
more items. We set the support threshold min supp to 0.2%. There are 798 frequent itemsets, including
259 1-itemsets, 350 2-itemsets, 150 3-itemsets, 37 4-itemsets, and two 5-itemsets.
As a compromise between privacy and accuracy, the cutoff parameter Km of the cut-and-paste random-
ization operator is set to 7, and the truncation level k of our approach is set to 6. Since both our approach
and the cut-and-paste operator use the same method to add random noise, we compare the results before
noise is added; thus we set ρm = 0 for both our approach and the cut-and-paste randomization operator.
Fig. 4. Degree of accuracy (%) versus ρt: our perturbation algorithm (solid line) vs. the previous cut-and-paste approach (dotted line)
The solid line in Figure 4 shows how the degree of accuracy (max{ρ1, ρ2}) of our approach changes
with the parameter ρt. The dotted line shows the degree of accuracy when the cut-and-paste randomization
operator is used. As we can see, our approach always identifies association rules more accurately than
the cut-and-paste approach.
Fig. 5. Level of privacy versus ρt: our perturbation algorithm vs. the previous cut-and-paste approach
The level of privacy in the same setting is presented in Fig. 5. As we can see, our approach preserves
privacy better than the cut-and-paste operator when ρt > 0.1. Thus, our approach is better on both privacy
and accuracy when 0.1 ≤ ρt ≤ 1. From the figures, we can offer a recommendation to data providers:
ρt ∈ [0.7, 0.8] is suitable for hard-core privacy fundamentalists, while ρt ∈ [0.2, 0.3] is recommended
for people only marginally concerned about privacy.
In particular, we analyze all 2-itemsets in the real dataset to show that our scheme tends to enlarge the
support of frequent itemsets and reduce the support of infrequent itemsets. The result is shown in Fig. 6.
The x-axis is the support of 2-itemsets in the original dataset. The y-axis is the support of itemsets in
the perturbed dataset. The figure intends to show how effectively our system blocks the unwanted items
from being divulged. If a system preserves privacy perfectly, we should have y equal to zero when x
is less than min supp. The data in Fig. 6 show that almost all 2-itemsets with support less than 0.2%
(233,368 unwanted 2-itemsets) have been blocked. That is, privacy has been successfully protected.
Meanwhile, the supports of frequent 2-itemsets are enlarged, which should help the data miner identify
frequent itemsets in spite of the added noise.
Fig. 6. Comparison of 2-itemset supports (×10⁻⁴) between the original transactions T and the perturbed transactions R(T), with k = 6 out of 497 and ρt = 0.4; the two annotated groups contain 233,368 and 13,641 2-itemsets
VII. IMPLEMENTATION
A prototypical system for privacy preserving mining of association rules has been implemented using
our new scheme. The system is built on web browsers and servers for the application of online
surveys. Visitors taking surveys through web browsers are considered to be data providers. The web server
conducting the surveys is considered to be the data miner. We implement the data perturbation algorithm
as custom code on the web browsers. The PG (perturbation guidance) part of the data miner is implemented
as custom code on the web server. All custom code consists of component-based plug-ins that can easily
be installed into existing systems. The components required for building the system are shown in Figure 7.
Fig. 7. System implementation
The infrastructure of our scheme is shown in Figure 8. There are three separate layers in our system:
user interface layer, perturbation layer, and web layer. The top layer, the user interface layer, provides
interfaces to data providers and the data miner. The middle layer, the perturbation layer, realizes our
privacy preserving scheme and exploits the bottom layer to transfer information. The bottom layer, named
web layer, consists of web servers and web browsers. As an important feature of our system, the details
of data perturbation on the middle layer are transparent to both data providers and the data miner.
Fig. 8. System infrastructure
The overhead of our implementation is substantially smaller than that of the randomization approach in the context of
online survey. The time-consuming part of the cut-and-paste randomization approach, which is the support
recovery procedure, occurs in the association rule mining process. The support recovery algorithm needs
to compute the partial support of all candidate itemsets for each transaction size, which will result in a
significant overhead.
In our system, the only overhead (possibly) incurred on the data miner is to update the perturbation
guidance Vk. Many SVD updating algorithms have been proposed, including SVD-updating, folding-in,
and recomputing the SVD [17], [18]. Since T ∗ is usually a sparse matrix, the complexity of updating the
SVD can be considerably reduced to O(n). Besides, this overhead is not on the critical path of the
mining process: it occurs during data collection rather than during mining. Note that the transmitted
"perturbation guidance" Vk is of size k · n. Since k is always a small number (e.g., k ≤ 10), the
communication overhead incurred by our introduction of two-way communication is not significant.
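The data-miner side can be sketched as follows (a sketch that simply recomputes the truncated SVD of the collected perturbed matrix T*; the incremental SVD-updating algorithms of [17], [18] would avoid the full recomputation):

```python
import numpy as np

def perturbation_guidance(T_star, k):
    """Return V_k, the top-k right singular vectors of the perturbed
    matrix T* collected so far.  The transmitted guidance has only
    k * n entries, which is small when k is small."""
    _, _, Vt = np.linalg.svd(T_star, full_matrices=False)
    return Vt[:k].T                    # n x k matrix V_k

# Toy sparse-ish 0/1 matrix: 200 perturbed transactions, 30 items.
rng = np.random.default_rng(1)
T_star = (rng.random((200, 30)) < 0.2).astype(float)
k = 5
Vk = perturbation_guidance(T_star, k)
```

The columns of Vk are orthonormal, so the guidance message is exactly k · n floating-point values, consistent with the communication-overhead argument above.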
VIII. FINAL REMARKS
In this paper, we propose a new scheme on privacy preserving mining of association rules. Compared
with previous approaches, we introduce a two-way communication mechanism between the data miner
and data providers with little overhead. In particular, we let the data miner send a perturbation guidance to
the data providers. Using this guidance, the data providers distort the data transactions to be transmitted
to the miner. As a result, our scheme identifies association rules more precisely than previous approaches
and at the same time protects privacy more effectively.
Our work is preliminary and many extensions can be made. In addition to using a similar approach in
data classification [21], we are currently investigating how to apply the approach to clustering problems.
We would like to investigate a new behavior model that is stronger than the honest-but-curious model,
and can be dealt with by our scheme.
REFERENCES
[1] J. Han and M. Kamber, Data Mining Concepts and Techniques. Morgan Kaufmann, 2001.
[2] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases,” in Proc. ACM SIGMOD
Int. Conf. on Management of Data, 1993, pp. 207–216.
[3] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in Proc. Int. Conf. on Very Large Data
Bases, 1994, pp. 487–499.
[4] J. S. Park, M. S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules,” in Proc. ACM SIGMOD Int.
Conf. on Management of Data, 1995, pp. 175–186.
[5] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman, “Computing Iceberg queries efficiently,” in Proc. Int. Conf.
on Very Large Data Bases, 1998, pp. 299–310.
[6] J. Vaidya and C. Clifton, “Privacy preserving association rule mining in vertically partitioned data,” in Proc. ACM SIGKDD Int. Conf.
on Knowledge discovery and data mining, 2002, pp. 639–644.
[7] M. Kantarcioglu and C. Clifton, “Privacy-preserving distributed mining of association rules on horizontally partitioned data,” in Proc.
ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 2002, pp. 24–31.
[8] Y. Lindell and B. Pinkas, “Privacy preserving data mining,” Advances in Cryptology, vol. 1880, pp. 36–54, 2000.
[9] O. Goldreich, Secure Multi-Party Computation. Cambridge University Press, 2004.
[10] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, “Privacy preserving mining of association rules,” in Proc. ACM SIGKDD Intl.
Conf. on Knowledge Discovery and Data Mining, 2002, pp. 217–228.
[11] S. J. Rizvi and J. R. Haritsa, “Maintaining data privacy in association rule mining,” in Proc. Int. Conf. on Very Large Data Bases,
2002, pp. 682–693.
[12] J. Hagel and M. Singer, Net Worth. Harvard Business School Press, 1999.
[13] L. F. Cranor, J. Reagle, and M. S. Ackerman, “Beyond concern: Understanding net users’ attitudes about online privacy,” AT&T
Labs-Research, Tech. Rep. TR 99.4.3, 1999.
[14] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, “On the privacy preserving properties of random data perturbation techniques,” in
Proceedings of the 3rd IEEE International Conference on Data Mining. IEEE Press, 2003, pp. 99–106.
[15] S. Agrawal, V. Krishnan, and J. R. Haritsa, “On addressing efficiency concerns in privacy-preserving mining,” in Proceedings of the
9th International Conference on Database Systems for Advanced Applications. Springer Verlag, 2004, pp. 439–450.
[16] G. H. Golub and C. F. V. Loan, Matrix Computations. Johns Hopkins University Press, 1996.
[17] J. R. Bunch and C. P. Nielsen, “Updating the singular value decomposition,” Numerische Mathematik, vol. 31, pp. 111–129, 1978.
[18] M. Gu and S. C. Eisenstat, “A stable and fast algorithm for updating the singular value decomposition,” Yale University, Tech. Rep.
YALEU/DCS/RR-966, 1993.
[19] I. T. Jolliffe, Principal Component Analysis. Springer Verlag, 1986.
[20] Z. Zheng, R. Kohavi, and L. Mason, “Real world performance of association rule algorithms,” in Proc. ACM SIGKDD Int. Conf. on
Knowledge Discovery and Data Mining, 2001, pp. 401–406.
[21] N. Zhang, S. Wang, and W. Zhao, “On a new scheme on privacy preserving data classification,” Texas A&M University, Tech. Rep.
TAMU/CS 2004-10-5, 2004.
[22] B. Nobel and J. W. Daniel, Applied Linear Algebra. Prentice-Hall, 1988.
APPENDIX I
APPENDIX: JUSTIFICATION OF Vk
The main part of a privacy preserving association rule mining system is the data perturbation mechanism
employed by the data providers. In current techniques, a randomization operator is used to perturb the
original transactions. As we described in Section II, an item-invariant randomization operator ignores the
correlation between different items. We exploit the item correlation to improve the accuracy of mining
results.
Specifically, our data perturbation algorithm preserves the support of 2-itemsets. The intuition behind
our approach can be stated as follows. The PG part of the data miner maintains an estimate of the
supports of all 1-itemsets and 2-itemsets. Based on this estimate, PG tells the data providers which
itemsets are more likely to be frequent. We know that no superset of an infrequent itemset can be
frequent (anti-monotonicity). A data provider may therefore safely remove all infrequent 1-itemsets and
2-itemsets, because they cannot appear in any frequent itemset of 3 or more items either.
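The pruning rationale can be sketched as follows (a hypothetical helper illustrating the anti-monotonicity argument, not the paper's Algorithm 5):

```python
def prune_transaction(t, frequent_items, frequent_pairs):
    """Drop items of a transaction that, by anti-monotonicity, cannot
    occur in any frequent itemset of 3 or more items."""
    kept = [a for a in t if a in frequent_items]
    # An item must also pair with some other kept item in a frequent
    # 2-itemset to contribute to a larger frequent itemset.
    if len(kept) > 1:
        kept = [a for a in kept
                if any(frozenset((a, b)) in frequent_pairs
                       for b in kept if b != a)]
    return kept

# Toy estimates, mirroring the simulation: {a10, a20} is a frequent pair.
freq_items = {"a1", "a10", "a20"}
freq_pairs = {frozenset(("a10", "a20"))}
pruned = prune_transaction(["a1", "a3", "a10", "a20"], freq_items, freq_pairs)
```

Here a3 is dropped because it is infrequent on its own, and a1 is dropped because it belongs to no frequent 2-itemset with the remaining items.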
Readers may ask why we choose to maintain the support of 1-itemsets and 2-itemsets, rather than 1-,
2-, and 3-itemsets, or 1-itemsets only. Against maintaining 3-itemsets as well, the main reason is that the
large number of 3-itemsets (on the order of n³) could impose a serious overhead on the system. Besides,
as we can see in Appendix III, our algorithm also preserves the support of itemsets with 3 or more items.
On the other hand, if we maintained the support of 1-itemsets only, we could not remove many
items from the original transactions. Roughly speaking, if {ai, aj}, {aj, ak}, and {ai, ak} are all frequent,
there is a fairly high probability that {ai, aj, ak} is also frequent. However, if only ai and aj are known
to be frequent, the probability that {ai, aj} is frequent is much smaller. Thus, in order to guarantee that
all frequent 2-itemsets are preserved, we could not remove many items from the original transactions.
Now we show that our data perturbation algorithm successfully preserves the support of 2-itemsets.
Here we consider the case where Vk is the k-truncation of the exact eigenvectors of T′T. In reality, the
value of Vk has to be estimated from the current copy of T∗. Fortunately, the estimated Vk converges
well to its exact value, and the convergence is fairly fast in most cases. The proof of convergence is
presented in Appendix V.
Consider $A = T'T$:
\[
A = T'T = \begin{bmatrix} a_1 \cdot a_1 & \cdots & a_1 \cdot a_n \\ \vdots & a_i \cdot a_j & \vdots \\ a_n \cdot a_1 & \cdots & a_n \cdot a_n \end{bmatrix}, \qquad (I.12)
\]
where $a_i \cdot a_j$ is the dot product of $a_i$ and $a_j$:
\[
a_i \cdot a_j = \sum_{h=1}^{m} \langle a_i \rangle_h \langle a_j \rangle_h. \qquad (I.13)
\]
Note that
\[
\frac{a_i \cdot a_j}{m} = \text{support of } \{a_i, a_j\}. \qquad (I.14)
\]
Thus, we may preserve the support of 2-itemsets by a precise approximation of $A$. In particular, we use
the truncated eigenvectors of $A$ to perturb the original transactions $T$ into $\bar{T}$.
Given the transaction matrix $T$, define the perturbed transactions by $\bar{T} = TV_kV_k'$, where
\[
T = [t_1, t_2, \ldots, t_m]', \qquad (I.15)
\]
\[
V_k = [v_1, v_2, \ldots, v_k], \qquad (I.16)
\]
\[
\bar{T} = TV_kV_k' = [t_1V_kV_k',\; t_2V_kV_k',\; \ldots,\; t_mV_kV_k']. \qquad (I.17)
\]
With some algebraic manipulation we have
\[
\bar{A} = \bar{T}'\bar{T} = V_kV_k'\,T'T\,V_kV_k' \qquad (I.18)
\]
\[
= V_kV_k'\,V(\Sigma'\Sigma)V'\,V_kV_k' \qquad (I.19)
\]
\[
= V_k\Sigma'\Sigma V_k' \qquad (I.20)
\]
\[
= \sigma_1^2 v_1v_1' + \cdots + \sigma_k^2 v_kv_k'. \qquad (I.21)
\]
Thus, $\bar{A}$ is the optimal rank-$k$ approximation of $A$ [16]. In other words, we preserve the support of
2-itemsets while cutting off the eigenvectors.
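The identity (I.18)–(I.21) is easy to check numerically (a sketch with random 0/1 data of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(7)
T = (rng.random((50, 8)) < 0.3).astype(float)
k = 3

U, s, Vt = np.linalg.svd(T, full_matrices=False)
Vk = Vt[:k].T                          # n x k truncated right singular vectors

T_bar = T @ Vk @ Vk.T                  # perturbed transactions T-bar
A_bar = T_bar.T @ T_bar                # A-bar = T-bar' T-bar

# Best rank-k approximation of A = T'T: sum over i <= k of sigma_i^2 v_i v_i'
A_k = sum(s[i] ** 2 * np.outer(Vt[i], Vt[i]) for i in range(k))
```

The two matrices agree to machine precision, so the pairwise-support information encoded in A is preserved up to the truncated tail.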
Our data perturbation algorithm also preserves the privacy of data providers. The data miner cannot
deduce the original $t$ from $\bar{t}$, because $V_kV_k'$ is a singular matrix with $\det(V_kV_k') = 0$, which means that
$V_kV_k'$ does not have an inverse.
APPENDIX II
APPENDIX: EXAMPLE OF TRANSACTION PERTURBATION
As we have shown in Appendix I, we can reach a precise approximation of $A = T'T$ by perturbing $t$
to $\bar{t} = tV_kV_k'$. Besides this "truncated eigenvectors" perturbation, we need a function that maps the values
of $\bar{t} \in [0, 1]$ to integer values $R(t) \in \{0, 1\}$. Although the truncation of eigenvectors has already inserted
some "noise" items into the transactions, we still need to add more "noise" items. These tasks are done
by Algorithm 5 and Algorithm 6. Here we use an example to illustrate the algorithms involved in the
perturbation process.
Example 2.1: A data provider $C_i$ holds a transaction $t = [1, 1, 0, 1, 1, 1, 0, 0]$. After negotiating with the
data miner, $C_i$ receives $V_k$ ($k = 2$) from the data miner as follows:
\[
V_k' = \begin{bmatrix} 0.30 & 0.43 & 0.38 & 0.28 & 0.32 & 0.43 & 0.35 & 0.27 \\ -0.50 & 0.08 & -0.01 & 0.37 & -0.03 & 0.05 & -0.45 & 0.61 \end{bmatrix}. \qquad (II.22)
\]
Using the truncated eigenvectors $V_k$ as perturbation guidance, $C_i$ first calculates $\bar{t} = tV_kV_k'$. The
perturbed transaction $\bar{t}$ is
\[
\bar{t} = [0.54, 0.75, 0.67, 0.48, 0.56, 0.76, 0.63, 0.46]. \qquad (II.23)
\]
The step of mapping $\bar{t}$ to integer values proceeds as follows:
1) As stated in Algorithm 5, $C_i$ rounds $\bar{t}$ off to integers. Let $\rho_t$ be 0.3. $C_i$ transforms $\bar{t}$ to
\[
R(t) = [1, 0, 0, 1, 0, 0, 0, 0] = \{a_1, a_4\}; \qquad (II.24)
\]
2) After that, as stated in Algorithm 6, $C_i$ inserts some items into $R(t)$ as random noise; $\rho_m$ is the
probability that a "false" item is placed into $R(t)$. $R(t)$ may now become
\[
R(t) = [1, 0, 0, 1, 0, 0, 1, 0] = \{a_1, a_4, a_7\}. \qquad (II.25)
\]
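The computation of $\bar{t}$ in Example 2.1 can be replayed directly (the randomized rounding of Algorithm 5 and the noise insertion of Algorithm 6 are omitted here, since they are probabilistic):

```python
import numpy as np

# V_k' from (II.22), k = 2.
Vk_T = np.array([
    [ 0.30, 0.43,  0.38, 0.28,  0.32, 0.43,  0.35, 0.27],
    [-0.50, 0.08, -0.01, 0.37, -0.03, 0.05, -0.45, 0.61],
])
Vk = Vk_T.T

t = np.array([1, 1, 0, 1, 1, 1, 0, 0], dtype=float)
t_bar = t @ Vk @ Vk_T                  # truncated-eigenvector perturbation

# Matches (II.23) after rounding to two decimals.
expected = np.array([0.54, 0.75, 0.67, 0.48, 0.56, 0.76, 0.63, 0.46])
```

Note that all entries of $\bar{t}$ land in $[0, 1]$, which is what makes the subsequent probabilistic rounding to $\{0, 1\}$ meaningful.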
TABLE III
MAPPING FUNCTION
$\langle t_1\rangle_i \cdot \langle t_2\rangle_j$    $\langle \bar{t}_1\rangle_i \cdot \langle \bar{t}_2\rangle_j$    $\langle R(t_1)\rangle_i \cdot \langle R(t_2)\rangle_j$
0 · 0        ≥ (1 − ρt)²        1 · 1
0 · 1        ≥ (1 − ρt)²        1 · 1
1 · 1        ≤ (1 − ρt)         1 · 0
1 · 1        ≤ (1 − ρt)²        0 · 0
As shown in the above example, there are three parameters, $k$, $\rho_t$, and $\rho_m$, in the perturbation process.
All of them give data providers control over preserving their privacy. The first parameter, $k$, is determined
by negotiation between the data provider and the data miner. The other parameters are determined by
each data provider according to its privacy sensitivity.
APPENDIX III
APPENDIX: PROOF OF THEOREM 4.3
We first consider the error on support of 2-itemsets. Then we will generalize the result to itemsets with
other sizes.
First, we consider the error introduced by the truncated-eigenvector perturbation (i.e., $\bar{t} = tV_kV_k'$). As
shown in (I.21) in Appendix I, $\bar{A} = \bar{T}'\bar{T}$ is a rank-$k$ approximation of $A = T'T$. Since no entry of a
matrix can exceed the 2-norm of the matrix [22], we have
\[
\max_{i,j} |\langle A - \bar{A}\rangle_{ij}| \le \|A - \bar{A}\|_2 = \sigma_{k+1}^2. \qquad (III.26)
\]
For 1-itemsets, let $\varepsilon^T_i$ be the error on the support of itemset $\{a_i\}$ introduced by the perturbation; we
have $\varepsilon^T_i = |\langle A - \bar{A}\rangle_{ii}|/m \le \sigma_{k+1}^2/m$. For 2-itemsets, let $\varepsilon^T_{ij}$ be the error on the support of itemset $\{a_i, a_j\}$
introduced by the perturbation; we have $\varepsilon^T_{ij} = |\langle A - \bar{A}\rangle_{ij}|/m \le \sigma_{k+1}^2/m$.
Second, we consider the error introduced by the rounding-off operation in Algorithm 5. Let $\varepsilon^M_{ij}$ be the
error on the support of $\{a_i, a_j\}$ introduced by the rounding-off operation. As we can see from Table III,
for every possible value of $t_i$ and $t_j$ we have
\[
\left| \frac{\langle R(t_1)\rangle_i \cdot \langle R(t_2)\rangle_j - \langle \bar{t}_1\rangle_i \cdot \langle \bar{t}_2\rangle_j}{\langle \bar{t}_1\rangle_i \cdot \langle \bar{t}_2\rangle_j - \langle t_1\rangle_i \cdot \langle t_2\rangle_j} \right| \le \max\left\{ \frac{1 - (1-\rho_t)^2}{(1-\rho_t)^2},\; \frac{1-\rho_t}{\rho_t} \right\}. \qquad (III.27)
\]
Thus, considering the ratio between $\varepsilon^M_{ij}$ and $\varepsilon^T_{ij}$, we have
\[
\frac{\varepsilon^M_{ij}}{\varepsilon^T_{ij}} \le \left| \frac{\langle R(t_1)\rangle_i \cdot \langle R(t_2)\rangle_j - \langle \bar{t}_1\rangle_i \cdot \langle \bar{t}_2\rangle_j}{\langle \bar{t}_1\rangle_i \cdot \langle \bar{t}_2\rangle_j - \langle t_1\rangle_i \cdot \langle t_2\rangle_j} \right| \le \max\left\{ \frac{1 - (1-\rho_t)^2}{(1-\rho_t)^2},\; \frac{1-\rho_t}{\rho_t} \right\}. \qquad (III.28)
\]
In other words, the error on the support of $\{a_i, a_j\}$ introduced by the rounding-off operation satisfies
\[
\varepsilon^M_{ij} \le \max\left\{ \frac{1 - (1-\rho_t)^2}{(1-\rho_t)^2},\; \frac{1-\rho_t}{\rho_t} \right\} \varepsilon^T_{ij}. \qquad (III.29)
\]
Thus, for any 2-itemset $\{a_i, a_j\}$, the error on its support introduced by the whole transformation (i.e.,
$t \to R(t)$) satisfies
\[
\max\{\rho_1, \rho_2\} \le \varepsilon^T_{ij} + \varepsilon^M_{ij} \le \left(1 + \max\left\{ \frac{1 - (1-\rho_t)^2}{(1-\rho_t)^2},\; \frac{1-\rho_t}{\rho_t} \right\}\right) \frac{\sigma_{k+1}^2}{m}. \qquad (III.30)
\]
The bound reaches its optimal (lowest) value, $\varepsilon^T_{ij} + \varepsilon^M_{ij} \le 2.618\,\sigma_{k+1}^2/m$, when $\rho_t = (3 - \sqrt{5})/2$. The error
on the support of a 1-itemset can be bounded by the same value following the same steps.
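The claim about the optimal $\rho_t$ can be verified numerically (a sketch; the grid search is our own check, not part of the proof):

```python
import numpy as np

def blowup(rho_t):
    """The factor 1 + max{(1-(1-p)^2)/(1-p)^2, (1-p)/p} from (III.30)."""
    p = rho_t
    return 1 + max((1 - (1 - p) ** 2) / (1 - p) ** 2, (1 - p) / p)

rho_star = (3 - np.sqrt(5)) / 2        # about 0.382
# At rho_star both branches of the max equal the golden ratio
# (about 1.618), so the factor bottoms out at about 2.618.
best = blowup(rho_star)
grid_min = min(blowup(p) for p in np.linspace(0.01, 0.99, 999))
```

The first branch of the max is increasing in $\rho_t$ and the second is decreasing, so the minimum of their maximum is attained where they intersect, which is exactly $\rho_t = (3 - \sqrt{5})/2$.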
Now we generalize the result to itemsets with 3 or more items.
Lemma 3.1: Let $\varepsilon$ be
\[
\varepsilon = \left(1 + \max\left\{ \frac{1 - (1-\rho_t)^2}{(1-\rho_t)^2},\; \frac{1-\rho_t}{\rho_t} \right\}\right) \frac{\sigma_{k+1}^2}{m}. \qquad (III.31)
\]
Then for all $h > 2$, we have $\max\{\rho^h_1, \rho^h_2\} \le \varepsilon$.
Proof: We only show that $\max\{\rho^3_1, \rho^3_2\} \le \varepsilon$; the cases $h > 3$ can easily be proved following the same
approach.
Suppose there exists a 3-itemset $I_0$ such that
\[
|\mathrm{supp}(I_0) - \mathrm{supp}'(I_0)| > \varepsilon. \qquad (III.32)
\]
Without loss of generality, let $I_0$ be $\{a_1, a_2, a_3\}$. Let $\varepsilon_i$ and $\varepsilon_{ij}$ be the errors on the support of $\{a_i\}$ and
$\{a_i, a_j\}$, respectively. Considering the error on the support of every itemset that is a subset of $I_0$, we have
\[
\varepsilon_{12} + \varepsilon_{23} + \varepsilon_{13} \ge |\mathrm{supp}(I_0) - \mathrm{supp}'(I_0)| > \varepsilon, \qquad (III.33)
\]
\[
\varepsilon_1 + \varepsilon_2 + \varepsilon_3 \ge |\mathrm{supp}(I_0) - \mathrm{supp}'(I_0)| > \varepsilon. \qquad (III.34)
\]
By Table III, we have
\[
\varepsilon^T_{12} + \varepsilon^T_{23} + \varepsilon^T_{13} > \frac{\sigma_{k+1}^2}{m}, \qquad (III.35)
\]
\[
\varepsilon^T_1 + \varepsilon^T_2 + \varepsilon^T_3 > \frac{\sigma_{k+1}^2}{m}. \qquad (III.36)
\]
Consider the 2-norm of $A - \bar{A}$:
\[
\|A - \bar{A}\|_2 = \max_{x \,:\, \|x\|_2 = 1} \|(A - \bar{A})x\|_2. \qquad (III.37)
\]
Let $x_0$ be $[\sqrt{3}/3, \sqrt{3}/3, \sqrt{3}/3, 0, \ldots, 0]'$. Since each entry of $A - \bar{A}$ equals $m$ times the corresponding
support error, we have
\[
\|A - \bar{A}\|_2 \ge \|(A - \bar{A})x_0\|_2 \qquad (III.38)
\]
\[
\ge \sqrt{ \left(\tfrac{\sqrt{3}}{3} m(\varepsilon^T_1 + \varepsilon^T_{12} + \varepsilon^T_{13})\right)^2 + \left(\tfrac{\sqrt{3}}{3} m(\varepsilon^T_{21} + \varepsilon^T_2 + \varepsilon^T_{23})\right)^2 + \left(\tfrac{\sqrt{3}}{3} m(\varepsilon^T_{31} + \varepsilon^T_{32} + \varepsilon^T_3)\right)^2 } \qquad (III.39)
\]
\[
> \sigma_{k+1}^2 = \|A - \bar{A}\|_2. \qquad (III.40)
\]
Here we reach a contradiction. Thus, we have $\max\{\rho^3_1, \rho^3_2\} \le \varepsilon$.
APPENDIX IV
APPENDIX: PROOF OF THEOREM 4.5
Consider the F-norm (Frobenius norm) of $T - \bar{T}$:
\[
\|T - \bar{T}\|_F \equiv \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{n} |\langle T - \bar{T}\rangle_{ij}|^2 } = \sqrt{ \sigma_{k+1}^2 + \cdots + \sigma_n^2 }. \qquad (IV.41)
\]
Although the rounding-off error of $\bar{t}$ is hard to bound, an overpessimistic estimate can be given as
\[
\|T - R(T)\|_F \ge \|T - \bar{T}\|_F. \qquad (IV.42)
\]
Note that $\|T - R(T)\|_F^2$ is equal to the total Hamming distance between all corresponding transactions
$t$ and $R(t)$, i.e.,
\[
\|T - R(T)\|_F^2 = \sum_{t \in T} \#\{\text{items at which } t \text{ and } R(t) \text{ differ}\}. \qquad (IV.43)
\]
In our system, the number of false positives ($a_i \in R(t)$, $a_i \notin t$) is much smaller than the number of items
removed from $T$. We may therefore estimate the number of removed items by $\|T - R(T)\|_F$. Recall that our
system tends to enlarge the support of frequent itemsets and reduce the support of infrequent itemsets. Thus,
the fraction of "unwanted" items divulged to the data miner can be estimated by
\[
\delta \approx \frac{\|R(T)\|_F}{\sum_{t \in T} |t|} \approx \frac{\|T\|_F - \|T - R(T)\|_F}{\sum_{t \in T} |t|} \le 1 - \sqrt{\frac{\sigma_{k+1}^2 + \cdots + \sigma_n^2}{\sigma_1^2 + \cdots + \sigma_n^2}}. \qquad (IV.44)
\]
APPENDIX V
APPENDIX: CONVERGENCE
Assume the data miner currently holds the latest transaction matrix $T$, whose $k$-truncated SVD is $U_k\Sigma_kV_k'$.
After a data provider receives $\Sigma_k$ and $V_k$ from the data miner, it transforms its own transaction $t$ into
$\bar{t}$ such that
\[
\begin{bmatrix} T \\ \bar{t} \end{bmatrix}_k = \begin{bmatrix} T \\ t \end{bmatrix}_k = \hat{U}_k\hat{\Sigma}_k\hat{V}_k', \qquad (V.45)
\]
and then sends $\bar{t}$ back to the data miner.
Lemma 5.1: For any matrix $T$ with $k$-truncated SVD $T_k = U_k\Sigma_kV_k'$, we always have
\[
T_k = U_kU_k'T = TV_kV_k'. \qquad (V.46)
\]
By Lemma 5.1 and (V.45), we have
\[
\begin{bmatrix} T\hat{V}_k\hat{V}_k' \\ \bar{t}\hat{V}_k\hat{V}_k' \end{bmatrix}_k = \begin{bmatrix} T\hat{V}_k\hat{V}_k' \\ t\hat{V}_k\hat{V}_k' \end{bmatrix}_k. \qquad (V.47)
\]
If we choose $\bar{t} = t\hat{V}_k\hat{V}_k'$, then $\bar{t}\hat{V}_k\hat{V}_k' = \bar{t}$, i.e., (V.47) and (V.45) always hold with the chosen $\bar{t}$.
The problem is how to obtain $\hat{V}_k$. Several updating algorithms exist that obtain $\hat{V}_k$ based on $\Sigma_k$
and $V_k$. For example, in [], if $T$ is a low-rank-plus-shift matrix, we have
\[
\begin{bmatrix} T \\ t \end{bmatrix}_k \approx \begin{bmatrix} T_k \\ t \end{bmatrix}_k. \qquad (V.48)
\]
By (V.45) and (V.48), we have
\[
\hat{V}_k\hat{\Sigma}_k^2\hat{V}_k' \approx (T_k'T_k + tt')_k = (V_k\Sigma_k^2V_k' + tt')_k. \qquad (V.49)
\]
Given $\Sigma_k$ and $V_k$, with the above equation, $\hat{V}_k$ can be computed as
\[
\hat{V}_k \approx V_k. \qquad (V.50)
\]
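A minimal sketch of this update, assuming we simply take the top-k eigenvectors of $V_k\Sigma_k^2V_k' + tt'$ per (V.49) (the cited incremental SVD-updating algorithms would be more efficient than this direct eigendecomposition):

```python
import numpy as np

def update_Vk(Vk, sigma, t, k):
    """Approximate the new guidance (V_k-hat) as the top-k
    eigenvectors of V_k Sigma_k^2 V_k' + t t', following (V.49)."""
    B = Vk @ np.diag(sigma ** 2) @ Vk.T + np.outer(t, t)
    w, Q = np.linalg.eigh(B)           # eigenvalues in ascending order
    return Q[:, ::-1][:, :k]           # top-k eigenvectors

# Toy data: 100 existing transactions over 12 items, then one new one.
rng = np.random.default_rng(3)
T = (rng.random((100, 12)) < 0.25).astype(float)
k = 4
_, s, Vt = np.linalg.svd(T, full_matrices=False)
Vk = Vt[:k].T

t_new = (rng.random(12) < 0.25).astype(float)
Vk_hat = update_Vk(Vk, s[:k], t_new, k)
```

The rank-one term $tt'$ keeps the update cheap relative to recomputing the SVD of the whole transaction matrix, which is the point of the low-rank-plus-shift approximation.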