
IEEE 2009 Second International Conference on Emerging Trends in Engineering & Technology, Nagpur, India, 16-18 December 2009

Wavelet Transform Based Data Perturbation Method for Privacy Protection

Vinod Patel
Research Scholar, M. Tech. CSE
Samrat Ashok Technological Institute
Vidisha (M. P.), India 464001
Email: [email protected]

Yogendra Kumar Jain
Head of the CSE Department
Samrat Ashok Technological Institute
Vidisha (M. P.), India 464001
Email: [email protected]

Abstract: Data mining techniques can derive highly sensitive knowledge from unclassified data, knowledge that may not even be known to the database holders. The data being mined often contain sensitive information such as financial and healthcare records. Handling such large private databases therefore requires data mining algorithms that preserve privacy, and privacy preservation becomes an important concern whenever security-related data are involved. Data perturbation is one of the well-known methods for avoiding this kind of privacy leakage. Its objective is to distort the individual data values while preserving the underlying statistical distribution properties. Data perturbation methods are assessed in terms of both their privacy parameters and their associated utility measures. Privacy parameters measure the degree of privacy protection, while data utility measures assess whether the dataset retains the performance of data mining techniques after the distortion. In this paper we present a wavelet transformation for data perturbation. The experimental results show that the wavelet transformation is a very promising data perturbation method.

I. INTRODUCTION

Data mining [1] is the extraction of implicit, previously unknown, and potentially useful information from data. Large collections of detailed personal data are regularly gathered and analyzed by applications that use data mining. Such data include shopping habits, criminal records, medical history, and credit records, among others [2]. If not done properly, however, analyzing such data opens new threats to the privacy and autonomy of the individual. The challenging problem that we address in this paper is: how can we protect against the abuse of knowledge discovered from secondary usage of data, while still meeting the needs of organizations and governments to support decision making or even to promote social benefits? We claim that a solution to this problem requires two vital techniques: anonymity [3] [4], to remove identifiers in the first phase of privacy protection (e.g. names, social insurance numbers, addresses), and data transformation, to protect sensitive attributes (e.g. salary, age). In this paper, we focus on the latter technique. Specifically, we consider the case in which confidential numerical attributes are distorted in order to meet privacy protection requirements in classification analysis.

We propose to use a wavelet transform based method as the data distortion method. We have performed experiments, and the results show that the wavelet method is effective in concealing the sensitive information while preserving the performance of the data mining techniques after the data distortion.

II. DATA MATRIX

The matrix representation (vector-space format) [5] is one of the most popular ways to encode the object-attribute relationships in many real-life datasets. In this format, a 2-dimensional (2D) matrix is used to store the dataset: each row of the matrix stands for an individual object, and each column represents a particular attribute of these objects. In this matrix, the private information consists of all confidential attributes, represented by columns, and all secret objects, represented by rows. We assume that every element of the matrix is fixed, discrete, and numerical. Missing elements are not allowed.
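As a small illustration of the assumptions above, the following Python sketch checks that a dataset stored as a list of rows is a complete numerical matrix. The function name `validate_data_matrix` and the sample values are hypothetical and are not part of the original work.

```python
def validate_data_matrix(rows):
    """Return (n_objects, n_attributes) if rows form a complete numerical matrix.

    Each row is one object and each column is one attribute. Raises ValueError
    for ragged rows, non-numerical entries, or missing values (None or NaN).
    """
    if not rows or not rows[0]:
        raise ValueError("the data matrix needs at least one row and one column")
    width = len(rows[0])
    for r, row in enumerate(rows):
        if len(row) != width:
            raise ValueError(f"row {r} has {len(row)} attributes, expected {width}")
        for c, value in enumerate(row):
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            # value != value is true only for NaN, which we treat as missing.
            if not is_number or value != value:
                raise ValueError(f"missing or non-numerical element at ({r}, {c})")
    return len(rows), width


# Hypothetical example: 3 objects (rows) described by 4 numerical attributes.
sample = [
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
]
shape = validate_data_matrix(sample)
```

Rejecting incomplete matrices up front keeps the later rank-based measures well defined, since every column then has exactly one value per object.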

III. RELATED WORK

A considerable amount of research has been done on this problem, and several classes of data transformation or data perturbation methods have already been discussed. One class focuses on data anonymization [10] [11] [12] [13]. In another class, the whole dataset or its confidential parts are perturbed using random noise drawn from a certain distribution [14] [15] [16] [17]. More recently, singular value decomposition (SVD) [7] [18] and nonnegative matrix factorization (NMF) [19] have been used to distort numerical-valued datasets. Fourier transform based strategies [20] [21] have also been used for this purpose.

Second International Conference on Emerging Trends in Engineering and Technology, ICETET-09
978-0-7695-3884-6/09 $26.00 © 2009 IEEE

IV. WAVELET TRANSFORM

Generally speaking, the wavelet transform is a tool that divides up data, functions, or operators into different frequency components and then studies each component with a resolution matched to its scale [6]. In mathematical terms, a discrete wavelet transformation (DWT) is a wavelet transformation in which the input discrete samples are divided into approximation coefficients and detail coefficients, which correspond to the low-frequency and high-frequency decompositions of the original samples, respectively. This decomposition is applied recursively: the approximation coefficients of the previous level are passed through high-pass and low-pass filters and then down-sampled.

Start with a function $\phi(x)$ that is made up of smaller versions of itself. This is the refinement (or 2-scale, dilation) equation

$$\phi(x) = \sum_{k=-\infty}^{\infty} a_k \, \phi(2x - k)$$

The $a_k$ are called filter coefficients or masks. The function $\phi(x)$ is called the scaling function (or father wavelet). Under certain conditions,

$$\psi(x) = \sum_{k=-\infty}^{\infty} (-1)^k b_k \, \phi(2x - k) = \sum_{k=-\infty}^{\infty} (-1)^k a_{1-k} \, \phi(2x - k)$$

gives a wavelet. First, the scaling function is chosen to preserve its area under each iteration, so that

$$\int_{-\infty}^{\infty} \phi(x) \, dx = 1$$

Integrating the refinement equation then gives

$$\int_{-\infty}^{\infty} \phi(x) \, dx = \sum_{k} a_k \int_{-\infty}^{\infty} \phi(2x - k) \, dx = \frac{1}{2} \sum_{k} a_k \int_{-\infty}^{\infty} \phi(u) \, du$$

where the last step uses the substitution $u = 2x - k$. Hence $\sum_k a_k = 2$, so the stability of the iteration forces a condition on the coefficients $a_k$. Second, the convergence of the wavelet expansion requires the condition

$$\sum_{k=0}^{N-1} (-1)^k k^m a_k = 0$$

where $m = 0, 1, 2, \ldots, \frac{N}{2} - 1$ (if a finite sum of wavelets is to represent the signal as accurately as possible). Third, requiring the orthogonality of wavelets forces the condition

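$$\sum_{k=0}^{N-1} a_k a_{k+2m} = 0$$

where $m = 1, 2, \ldots, \frac{N}{2} - 1$ and $a_k = 0$ for any index outside $0, \ldots, N-1$. The shift $m = 0$ is excluded, because it would contradict the next condition. Finally, if the scaling functions are required to be orthogonal,

$$\sum_{k=0}^{N-1} a_k^2 = 2$$

To summarize, the conditions are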

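$$\sum_{k=0}^{N-1} a_k = 2 \quad \text{(stability)}$$

$$\sum_{k=0}^{N-1} (-1)^k k^m a_k = 0 \quad \text{(convergence)}$$

$$\sum_{k=0}^{N-1} a_k a_{k+2m} = 0 \quad \text{(orthogonality of wavelets)}$$

$$\sum_{k=0}^{N-1} a_k^2 = 2 \quad \text{(orthogonality of scaling functions)}$$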
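The four conditions can be checked numerically for a candidate coefficient set. The sketch below is an illustration only: the helper `check_conditions`, the tolerance, and the choice of the Daubechies 4-tap coefficients (scaled so that they sum to 2) as a second example are assumptions, not taken from the paper. The wavelet-orthogonality check is applied to nonzero shifts $m$ only, and coefficients with an index outside $0, \ldots, N-1$ are treated as zero.

```python
import math


def check_conditions(a, tol=1e-9):
    """Report which filter-coefficient conditions hold for a_0 .. a_{N-1}.

    Assumes N = len(a) is even. Returns a dict of booleans.
    """
    n = len(a)
    half = n // 2

    # Stability: the coefficients sum to 2.
    stability = abs(sum(a) - 2.0) < tol

    # Convergence: sum_k (-1)^k k^m a_k = 0 for m = 0 .. N/2 - 1.
    # Python evaluates 0 ** 0 as 1, which is the convention needed for m = 0.
    convergence = all(
        abs(sum((-1) ** k * k ** m * a[k] for k in range(n))) < tol
        for m in range(half)
    )

    # Orthogonality of wavelets: sum_k a_k a_{k+2m} = 0 for m = 1 .. N/2 - 1.
    # Terms with k + 2m > N - 1 vanish, so the sum stops at k = N - 1 - 2m.
    wavelet_orthogonality = all(
        abs(sum(a[k] * a[k + 2 * m] for k in range(n - 2 * m))) < tol
        for m in range(1, half)
    )

    # Orthogonality of scaling functions: the squares sum to 2.
    scaling_orthogonality = abs(sum(x * x for x in a) - 2.0) < tol

    return {
        "stability": stability,
        "convergence": convergence,
        "wavelet_orthogonality": wavelet_orthogonality,
        "scaling_orthogonality": scaling_orthogonality,
    }


haar = [1.0, 1.0]
s3 = math.sqrt(3.0)
daub4 = [(1 + s3) / 4, (3 + s3) / 4, (3 - s3) / 4, (1 - s3) / 4]

haar_report = check_conditions(haar)
daub4_report = check_conditions(daub4)
bad_report = check_conditions([1.5, 0.5])  # sums to 2 but violates the rest
```

For $N = 2$ there is no nonzero shift to test, so the wavelet-orthogonality entry is trivially true and the remaining three conditions determine the coefficients.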

This class of wavelet functions is constrained, by definition, to be zero outside of a small interval. This is the property of compact support. Most wavelet functions, when plotted, appear to be extremely irregular. This is because the refinement equation assures that a wavelet function $\psi(x)$ is non-differentiable everywhere. The functions normally used for performing transforms consist of a few sets of well-chosen coefficients, resulting in a function that has a discernible shape.

Here we illustrate how to generate the Haar wavelet. First, consider the above constraints on the $a_k$ for $N = 2$. The stability condition enforces $a_0 + a_1 = 2$, the accuracy (convergence) condition implies $a_0 - a_1 = 0$, and the orthogonality of the scaling functions gives $a_0^2 + a_1^2 = 2$. The unique solution is $a_0 = a_1 = 1$. If $a_0 = a_1 = 1$, then $\phi(x) = \phi(2x) + \phi(2x - 1)$. The refinement equation is satisfied by a box function,

$$B(x) = \begin{cases} 1, & 0 \le x < 1 \\ 0, & \text{otherwise} \end{cases}$$

Once the box function is chosen as the scaling function, we get the simplest wavelet, the Haar wavelet:

$$H(x) = \begin{cases} 1, & 0 \le x < \frac{1}{2} \\ -1, & \frac{1}{2} \le x < 1 \\ 0, & \text{otherwise} \end{cases}$$
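To make the recursive decomposition concrete, here is a minimal Python sketch of a multi-level Haar DWT and its inverse. It is not the authors' implementation: the function names are hypothetical, and it uses the orthonormal scaling by $1/\sqrt{2}$, which is a common convention but differs from the normalization $\sum_k a_k = 2$ used in the derivation. The pairwise sum plays the role of the low-pass filter ($a_0 = a_1 = 1$) and the pairwise difference the role of the high-pass filter.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_step(x):
    """One DWT level: filter and down-sample an even-length sequence."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Undo one level: rebuild each sample pair from (approx, detail)."""
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return x


def haar_dwt(x, levels):
    """Apply haar_step recursively to the approximation coefficients.

    len(x) must be divisible by 2 ** levels. Returns the coarsest
    approximation and the detail coefficients, finest level first.
    """
    if len(x) % (2 ** levels) != 0:
        raise ValueError("length must be divisible by 2 ** levels")
    approx, details = list(x), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def haar_idwt(approx, details):
    """Invert haar_dwt by undoing the levels from coarsest to finest."""
    for detail in reversed(details):
        approx = haar_inverse_step(approx, detail)
    return approx


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = haar_dwt(signal, levels=2)
restored = haar_idwt(approx, details)
```

With this scaling the transform is orthogonal, so the sum of squares of all coefficients equals that of the input and the original samples are recovered exactly up to rounding. Any perturbation therefore has to come from modifying coefficients before inverting, a step this sketch deliberately leaves out.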

V. DATA DISTORTION MEASURES

We adopt the data distortion metrics used in [7] to measure the degree of data perturbation. The value difference (VD) of the datasets is represented by the relative value difference in the Frobenius norm. Let $S$ and $\bar{S}$ denote the original and distorted data matrices, respectively. In mathematical terms, VD is given by

$$VD = \frac{\| S - \bar{S} \|_F}{\| S \|_F}$$

where $\| \cdot \|_F$ indicates the Frobenius norm of the enclosed argument.

After data distortion, the rank of the magnitude of the data elements changes. RP denotes the average change of rank over all the attributes. For a dataset $S$ with $n$ data objects and $m$ attributes,

$$RP = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} \left| Ord_j^i - \overline{Ord}_j^i \right|}{m \cdot n}$$

where $Ord_j^i$ denotes the rank of the $j$th element in attribute $i$. Similarly, $\overline{Ord}_j^i$ denotes the rank of the corresponding distorted element.

RK represents the percentage of elements that keep their ranks of magnitude in each column after the distortion.

$$RK = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} RK_j^i}{m \cdot n}$$

where $RK_j^i = 1$ if an element keeps its position in the order of values, and $RK_j^i = 0$ otherwise.

The metric CP is defined to represent the change of rank of the average value of the attributes.

$$CP = \frac{\sum_{i=1}^{m} \left| OrdSS_i - \overline{OrdSS}_i \right|}{m}$$

where $OrdSS_i$ and $\overline{OrdSS}_i$ indicate the rank of the average value of the $i$th attribute before and after the data distortion, respectively. Similar to RK, CK is defined to measure the percentage of the attributes that keep the rank of their average value after distortion.

$$CK = \frac{\sum_{i=1}^{m} CK_i}{m}$$

where $CK_i = 1$ if $OrdSS_i = \overline{OrdSS}_i$, and $CK_i = 0$ otherwise.

From these definitions, the larger RP and CP are and the smaller RK and CK are, the more the original data matrix is distorted, which implies that the data distortion method is better at preserving privacy.
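The five measures can be computed directly from their definitions. The following Python sketch is one possible reading of them, not the implementation used in the experiments: the function names are hypothetical, matrices are plain lists of rows, and ties in a column are ranked by position, which is an assumption because the text does not say how ties are handled.

```python
import math


def ranks(values):
    """Rank of each element (1 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def column(matrix, i):
    """Values of attribute i over all objects."""
    return [row[i] for row in matrix]


def distortion_metrics(S, D):
    """Compute VD, RP, RK, CP, CK for original S and distorted D (n x m)."""
    n, m = len(S), len(S[0])

    # VD: relative difference in the Frobenius norm.
    diff_sq = sum((S[j][i] - D[j][i]) ** 2 for j in range(n) for i in range(m))
    norm_sq = sum(v * v for row in S for v in row)
    vd = math.sqrt(diff_sq / norm_sq)

    # RP and RK: per-element rank change within each attribute (column).
    rank_change, rank_kept = 0, 0
    for i in range(m):
        before, after = ranks(column(S, i)), ranks(column(D, i))
        rank_change += sum(abs(b - a) for b, a in zip(before, after))
        rank_kept += sum(1 for b, a in zip(before, after) if b == a)

    # CP and CK: rank change of the attribute averages.
    mean_before = ranks([sum(column(S, i)) / n for i in range(m)])
    mean_after = ranks([sum(column(D, i)) / n for i in range(m)])
    cp = sum(abs(b - a) for b, a in zip(mean_before, mean_after)) / m
    ck = sum(1 for b, a in zip(mean_before, mean_after) if b == a) / m

    return {
        "VD": vd,
        "RP": rank_change / (m * n),
        "RK": rank_kept / (m * n),
        "CP": cp,
        "CK": ck,
    }


# Hypothetical toy data: 3 objects, 2 attributes; only attribute 0 is distorted.
original = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
distorted = [[30.0, 10.0], [25.0, 20.0], [35.0, 30.0]]
metrics = distortion_metrics(original, distorted)
```

In this toy case the distortion swaps the order of the two attribute averages, so CP is 1 and CK is 0, while four of the six elements keep their rank within their column.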

VI. UTILITY MEASURE

The data utility metrics assess whether the distorted data can maintain the accuracy of the data mining techniques.

Throughout this work, we choose the accuracy in J48 [1] classification as the data utility metric.

VII. EXPERIMENTAL RESULTS

We have conducted experiments to evaluate the performance of the data distortion method. We chose a real-life database obtained from the University of California Irvine (UCI) Machine Learning Repository [8], namely the Iris dataset. A summary of the original database is given in Table 1.

Table 1: The summary of the database

Database   Number of Instances   Number of Features   Number of Classes
IRIS       150                   4                    3

In addition to the summary, the dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other. The attributes of the database have only numerical values and no missing values.

Table 2 shows the performance of the distortion method. We use the WEKA [9] software to test the classification accuracy on the distorted data. We constructed a J48 classifier and used 10-fold cross validation to obtain the classification results. In the wavelet transformation, we chose the Haar basis wavelet for distortion. The results of our experiments were obtained on an Intel desktop workstation with a P4 1.8 GHz CPU, a 40 GB hard disk, and 512 MB of memory, using Java 1.6 under the Windows XP operating system.

Table 2: How the privacy parameters and accuracy vary in IRIS data

Data              VD        RP         RK      CP    CK     Acc
IRIS (Original)   -         -          -       -     -      96%
Wavelet (Haar)    0.91276   29.62666   0.015   1.0   0.25   92%

VIII. CONCLUSIONS

In this paper, we proposed a class of new privacy-preserving data distortion methods based on wavelet transformation. The wavelet transformation based distortion method provides an effective data perturbation tool for privacy-preserving data mining. The privacy parameters used in this work provide some indication of the ability of these techniques to hide the original data values. As future work, it would be interesting to use other wavelet bases, such as the Daubechies basis, and to compare the results with those of the Haar basis.


REFERENCES

[1] Ian H. Witten, Eibe Frank, “Data Mining: Practical Machine Learning Tools and Techniques”, Second edition, 2005.

[2] L. Brankovic and V. Estivill-Castro, “Privacy Issues in Knowledge Discovery and Data Mining”, Proc. of Australian Institute of Computer Ethics Conference (AICEC99), Melbourne, Victoria, Australia, July 1999.

[3] W. Klösgen, “Anonymization Techniques for Knowledge Discovery in Databases”, Proc. of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), pp. 186–191, Montreal, Canada, August 1995.

[4] M. K. Reiter and A. D. Rubin, “Crowds: Anonymity for Web Transactions”, ACM Transactions on Information and System Security, Vol. 1, No. 1, pp. 66–92, 1998.

[5] W. Frakes and R. Baeza-Yates, “Information Retrieval: Data Structures and Algorithms”, Prentice–Hall, Englewood Cliffs, NJ, 1992.

[6] I. Daubechies, “Ten Lectures on Wavelets”, Society for Industrial and Applied Mathematics, Philadelphia, USA, 1992.

[7] Shuting Xu, Jun Zhang, Dianwei Han, and Jie Wang, “Data distortion for privacy protection in a terrorist analysis system”, P. Kantor et al. (Eds.): ISI 2005, LNCS 3495, pp. 459-464, 2005.

[8] UCI Machine Learning Repository http://www.ics.uci.edu/mlearn/mlsummary.html.

[9] The Weka Machine Learning Workbench. http://www.cs.waikato.ac.nz/ml/weka.

[10] A. Meyerson and R. Williams, “General k-anonymization is hard”, Carnegie Mellon University, School of Computer Science Tech Report, 03-113, 2003.

[11] L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression”, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):571-588, 2002.

[12] K. Wang, B. C. M. Fung and G. Dong, “Integrating private databases for data analysis”, Proceedings of the 2005 IEEE International Conference on Intelligence and Security Informatics (ISI 2005), pp. 171-182, Atlanta, GA, 2005.

[13] K. Wang, P. S. Yu, and S. Chakraborty, “Bottom-up generalization: a data mining solution to privacy protection”, Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), pp. 249-256, 2004.

[14] K. Chen and L. Liu, “Privacy preserving data classification with rotation perturbation”, Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), pp. 589-592, 2005.

[15] A. Evfimievski, “Randomization in privacy preserving data mining”, ACM SIGKDD Explorations Newsletter, vol. 4, no. 2, pp. 43-48, 2002.

[16] Z. Huang, W. Du and B. Chen, “Deriving private information from randomized data”, Proceedings of the 2005 ACM SIGMOD Conference, pp. 37-48, Baltimore, MD, 2005.

[17] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, “On the privacy preserving properties of random data perturbation techniques”, Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), pp. 99-106, 2003.

[18] S. Xu, J. Zhang, D. Han and J. Wang, “Singular value decomposition based data distortion strategy for privacy protection”, Journal of Knowledge and Information Systems, vol. 10, no. 3, pp. 383-397, 2006.

[19] J. Wang, W. J. Zhong and J. Zhang, “NNMF-based factorization techniques for high-accuracy privacy protection on non-negative-valued datasets”, Proceedings of the 2006 IEEE Conference on Data Mining, International Workshop on Privacy Aspects of Data Mining (PADM 2006), pp. 513-517, Hong Kong, China, 2006.

[20] S. Mukherjee, Z. Chen and A. Gangopadhyay, “A privacy preserving technique for Euclidean distance-based mining algorithms using Fourier-related transforms”, The VLDB Journal, vol. 15, no. 4, pp. 293-315, 2006.

[21] S. Xu and S. Lai, “Fast Fourier transform based data perturbation method for privacy protection”, Proceedings of the 2007 IEEE International Conference on Intelligence and Security Informatics, pp. 221-224, New Brunswick, NJ, 2007.
