FindingQuasi-identifiersforK-Anonymity ModelbytheSetofCut ...sensitive information of target...

11
Abstract—The rapid development of data publishing and information access technology bring a growing number of problems in privacy leakage. In order to avoid linking attacks happened between attributes, K-anonymity model was proposed and become the most widely used in privacy preserving data publishing. Identification of quasi-identifiers (QIs) is one of the primary problems which will directly affect the effectiveness of K-anonymity method. However, most of the existing methods ignored this problem or just choose QIs empirically. These will greatly reduce the validity of K-anonymity method as well as the utility of anonymous data. In this paper, we study the problem of finding QIs for privacy preserving data publishing method based on K-anonymity model. Firstly, we analyze the roles of QIs from the view of independence of sets, and define it as a collection of attributes that can separate sensitive attributes from the other non-sensitive attributes. Then, we propose a construction method for attribute graph based on relationship matrix, which can represent potential connectivity of publishing data, published data and external knowledge. Finally, we put forward an identification algorithm for QIs based on the concept of cut-vertex, which is aiming to find the necessary and minimum QIs. The proposed algorithm is useful to avoid inconvenience and inaccuracy caused by artificial partition of QIs, and can be applied in data publishing situations with multiple sensitive attributes after some extension. Experiments and analysis show that the proposed identification algorithm has better partition ability and lower computational complexity. Therefore, it has good practical value in the application environment of publishing of big data. Index Terms—privacy preserving data publishing; k-anonymity; quasi-identifiers; attributes graph; cut-vertex I. INTRODUCTION he rapid development of cloud computing, Internet of things and big data technology bring about great convenience for accessing of information. Companies, governments and organizations are publishing and sharing more and more micro-data for the purpose of research or business. However, improper collection, analysis and publication of data will lead to privacy leakage. In recent years, there have been many global information security Manuscript received June 16, 2017; revised October 1, 2017. This work was supported in part by the National Nature Science Foundation of China (61762059), Youth Science and Technology Project of Gansu Province, China (1310RJYA004). Yan Yan is with the School of Electrical and Information Engineering, Lanzhou University of Technology, Lanzhou, 730050, China (phone: +86-931-2976010; e-mail: yanyan@ lut.cn). Wanjun Wang is with Lanzhou University of Arts and Science, Lanzhou, 730000, China (e-mail: [email protected]). Xiaohong Hao and Lianxiu Zhang are with Lanzhou University of Technology, Lanzhou, 730050, China (e-mail: [email protected], [email protected]). incidents caused by data leakage. On October 2015, more than 4 million users’ data have been leaked by the broadband service provider TalkTalk in UK. Including user's name, address, date of birth, phone number, e-mail and other privacy information. In 2016, the largest cable TV company in the United States — Time Warner said that about 32 million users’ e-mail and password information have been stolen by hacker. In the same year, great data leakage event has broken out in Turkey. Nearly 50 million citizens’ personal information was hacked for people to download. In 2017, more than 220 million users’ information has been revealed (including name, e-mail address, phone number, home address, family coordinates and social profile links) in India, because the API endpoint of McDelivery Company has not been protected. In the same year, the database of business services giant Dun & Bradstreet was leaked, which included about 33.8 million e-mail addresses and other contact information from thousands of employees and government departments in the United States. The world's largest loopholes response platform — Butian published information that more than 30 provinces in China have high-risk loopholes in their social security, household registration, centers for disease control and many other systems. Only the data involved by social security vulnerabilities reached the amount of 5279.4 million, including identity cards, social security information, finance, income, housing and other sensitive information. Security incidents mentioned above have already sounded the alarm of information security. How to manage the privacy of data and prevent leakage of personal information under the environment of big data has become an issue of common concern for the whole world [1-6] . Generally speaking, privacy can be defined as some information that is not willing to be known by outside world. For individuals, privacy is related to their personal sensitive information, for example, salaries, medical records, investment situations, account and password of transactions, financial information, etc. Therefore, in most of the data publishing methods, attributes are usually partitioned into explicit identifiers, quasi-identifiers (QIs), and sensitive attributes. Explicit identifiers refer to attributes that can explicitly and uniquely identify an individual, such as Name or ID Number. QIs refer to a subset of attributes. None particular tuple can be uniquely identified when using the attribute alone, but when these kinds of attributes are taken together, it can potentially identify an individual. For example, the combination of Gender, Date of birth and Zip code can determine about 87% of the population in United States. Therefore, incautious publication of QIs will lead to privacy leakage. Traditional privacy preserving data publishing methods delete identifier attributes to protect personal privacy, but it still cannot stop attackers to infer Finding Quasi-identifiers for K-Anonymity Model by the Set of Cut-vertex Yan Yan, Wanjun Wang, Xiaohong Hao, Lianxiu Zhang T Engineering Letters, 26:1, EL_26_1_18 (Advance online publication: 10 February 2018) ______________________________________________________________________________________

Transcript of FindingQuasi-identifiersforK-Anonymity ModelbytheSetofCut ...sensitive information of target...

Page 1: FindingQuasi-identifiersforK-Anonymity ModelbytheSetofCut ...sensitive information of target individual by linking some QIstogether. In order to avoid such linking attacks, the concept

Abstract—The rapid development of data publishing and

information access technology bring a growing number ofproblems in privacy leakage. In order to avoid linking attackshappened between attributes, K-anonymity model wasproposed and become the most widely used in privacypreserving data publishing. Identification of quasi-identifiers(QIs) is one of the primary problems which will directly affectthe effectiveness of K-anonymity method. However, most of theexisting methods ignored this problem or just choose QIsempirically. These will greatly reduce the validity ofK-anonymity method as well as the utility of anonymous data.In this paper, we study the problem of finding QIs for privacypreserving data publishing method based on K-anonymitymodel. Firstly, we analyze the roles of QIs from the view ofindependence of sets, and define it as a collection of attributesthat can separate sensitive attributes from the othernon-sensitive attributes. Then, we propose a constructionmethod for attribute graph based on relationship matrix, whichcan represent potential connectivity of publishing data,published data and external knowledge. Finally, we put forwardan identification algorithm for QIs based on the concept ofcut-vertex, which is aiming to find the necessary and minimumQIs. The proposed algorithm is useful to avoid inconvenienceand inaccuracy caused by artificial partition of QIs, and can beapplied in data publishing situations with multiple sensitiveattributes after some extension. Experiments and analysis showthat the proposed identification algorithm has better partitionability and lower computational complexity. Therefore, it hasgood practical value in the application environment ofpublishing of big data.

Index Terms—privacy preserving data publishing;k-anonymity; quasi-identifiers; attributes graph; cut-vertex

I. INTRODUCTIONhe rapid development of cloud computing, Internet ofthings and big data technology bring about great

convenience for accessing of information. Companies,governments and organizations are publishing and sharingmore and more micro-data for the purpose of research orbusiness. However, improper collection, analysis andpublication of data will lead to privacy leakage. In recentyears, there have been many global information security

Manuscript received June 16, 2017; revised October 1, 2017. This workwas supported in part by the National Nature Science Foundation of China(61762059), Youth Science and Technology Project of Gansu Province,China (1310RJYA004).

Yan Yan is with the School of Electrical and Information Engineering,Lanzhou University of Technology, Lanzhou, 730050, China (phone:+86-931-2976010; e-mail: yanyan@ lut.cn).

Wanjun Wang is with Lanzhou University of Arts and Science, Lanzhou,730000, China (e-mail: [email protected]).

Xiaohong Hao and Lianxiu Zhang are with Lanzhou University ofTechnology, Lanzhou, 730050, China (e-mail: [email protected],[email protected]).

incidents caused by data leakage. On October 2015, morethan 4 million users’ data have been leaked by the broadbandservice provider TalkTalk in UK. Including user's name,address, date of birth, phone number, e-mail and otherprivacy information. In 2016, the largest cable TV companyin the United States — Time Warner said that about 32million users’ e-mail and password information have beenstolen by hacker. In the same year, great data leakage eventhas broken out in Turkey. Nearly 50 million citizens’personal information was hacked for people to download. In2017, more than 220 million users’ information has beenrevealed (including name, e-mail address, phone number,home address, family coordinates and social profile links) inIndia, because the API endpoint of McDelivery Company hasnot been protected. In the same year, the database of businessservices giant Dun & Bradstreet was leaked, which includedabout 33.8 million e-mail addresses and other contactinformation from thousands of employees and governmentdepartments in the United States. The world's largestloopholes response platform— Butian published informationthat more than 30 provinces in China have high-riskloopholes in their social security, household registration,centers for disease control and many other systems. Only thedata involved by social security vulnerabilities reached theamount of 5279.4 million, including identity cards, socialsecurity information, finance, income, housing and othersensitive information. Security incidents mentioned abovehave already sounded the alarm of information security. Howto manage the privacy of data and prevent leakage of personalinformation under the environment of big data has become anissue of common concern for the whole world [1-6].Generally speaking, privacy can be defined as some

information that is not willing to be known by outside world.For individuals, privacy is related to their personal sensitiveinformation, for example, salaries, medical records,investment situations, account and password of transactions,financial information, etc. Therefore, in most of the datapublishing methods, attributes are usually partitioned intoexplicit identifiers, quasi-identifiers (QIs), and sensitiveattributes. Explicit identifiers refer to attributes that canexplicitly and uniquely identify an individual, such as Nameor ID Number. QIs refer to a subset of attributes. Noneparticular tuple can be uniquely identified when using theattribute alone, but when these kinds of attributes are takentogether, it can potentially identify an individual. Forexample, the combination of Gender, Date of birth and Zipcode can determine about 87% of the population in UnitedStates. Therefore, incautious publication of QIs will lead toprivacy leakage. Traditional privacy preserving datapublishing methods delete identifier attributes to protectpersonal privacy, but it still cannot stop attackers to infer

Finding Quasi-identifiers for K-AnonymityModel by the Set of Cut-vertex

Yan Yan, Wanjun Wang, Xiaohong Hao, Lianxiu Zhang

T

Engineering Letters, 26:1, EL_26_1_18

(Advance online publication: 10 February 2018)

______________________________________________________________________________________

Page 2: FindingQuasi-identifiersforK-Anonymity ModelbytheSetofCut ...sensitive information of target individual by linking some QIstogether. In order to avoid such linking attacks, the concept

sensitive information of target individual by linking someQIs together.In order to avoid such linking attacks, the concept of

K-anonymity [7] [8] was proposed, and many algorithms havebeen developed based on K-anonymity rules. These kinds ofmethods use generalization techniques to transform thevalues of QIs into less specific forms. Therefore, tuples canbe divided into some QI-groups and none of them can bedistinguished by QI-values within the same QI-group. Fromthis way, privacy has been protected in a certain extent. Howto select proper QIs is the primary problem for privacypreserving data publishing method based on K-anonymitymodel. In principle, QIs should include all the attributes thatattackers may obtain from available databases and can beused for “linking attack”. But for data publisher, it is unableto grasp all the detailed characteristics of external data, andcannot speculate all the background knowledge possessed byattackers. Most of the existing methods assume that the datapublisher knows which are the QIs, or choose QIs accordingto some personal experiences. Some of the methods evenselect all the attributes except identifiers and sensitiveattributes. Such above methods have some defects indifferent degrees. If the set of QIs contains too manyattributes, the loss of information caused by generalizationwill be exacerbated. In some extreme cases, all the tuples inthe table are generalized into one QI-group, which willseriously affect the availability of anonymous data. If the QIsare selected just based on some personal experiences,accuracy is difficult to be ensured and “linking attack” mayoccur and lead to the failure of anonymity protection. Inaddition, for different publishing data, the set of attributesused to be associated with external information may bedifferent; even for the same data, external information can bedifferent according to different publishing time andenvironment. Therefore, identification method for QIs shouldbe a dynamic and changing process which is differentaccording to the publishing data and external information. Inthis paper, we consider the problem of finding proper QIs forprivacy preserving data publishing method based onK-anonymity model.The rest of this paper is organized as follows. Section II

reviews some previous research work related to the selectionof QIs. Section III introduces the main idea to determine QIsby finding the set of cut-vertexes. The underlying concept ofattribute graph and its generation method are proposed in thispart. Section IV discusses the partitioning method of QIs forundirected and directed attribute graph respectively, andextends the method to solve the case of data publishing withmultiple sensitive attributes. Section V compares andanalyzes the size of QIs and computational complexity of theproposed algorithm with some existing methods. Section VIis the conclusion of the paper.

II. RELATED WORK

The research of privacy protection for centralized datapublishing mainly focus on privacy rules [7-10], anonymityalgorithms [11-13], improving the utility of anonymous data[14-16], etc. Some new studies have greatly expanded andimproved applications of privacy preserving data publishingtechnology. Such as anonymity methods for multiplesensitive attributes [17][18], republish and dynamic update ofanonymous data [19][20], and protection methods for theprivacy of transactional information, social network, locationand track information [21-23]. However, most of these studies

ignored the problem of identification of QIs, or simplyassume that they can separate sensitive attributes from QIs.For example, Shi et al. [24] introduced a new type of attribute“quasi-sensitive attributes” which are not sensitive bythemselves, but may become sensitive when used incombination. Sei et al. [25] refine the classification ofattributes into “explicit identifiers”, “non-sensitive QIs”,“sensitive QIs”, “non-QI sensitive attributes”, and “non-QInon-sensitive attributes”. Although they have noticed thattreating attributes that have a feature of both QIs andsensitive attributes are important, no methods have been putforward to clearly distinguish these attributes.

Motwani and Xu [26] use the concept of “separation ratio”and “distinct ratio” to quantitatively describe the distinguishability of attributes. The “distinct ratio” measures distinguishability of certain attribute by the ratio of the number of tuplesof different values and the total number of tuples. The“separation ratio” describes the distinguish ability by theratio of numbers of tuple pairs that can be distinguished bycertain attribute and the number of all possible tuple pairs.Because it is provably hard to find QIs of the minimum size,they relaxed the problem and developed efficient algorithmsto find small QIs with provable size and separation/distinctratio guarantees. Experimental results show that the proposedalgorithms only require one pass over the table and havebetter space and time complexities than the traditional greedyalgorithm. However, for a single data set, QIs consist ofattributes that can uniquely identify a tuple, but for an opendata publishing environment with huge amount of big data,the situation will be quite different. When the publishing datacan be associated with external knowledge (published data orbackground knowledge that attackers may get fromsomewhere), the probability of linking attack will be greatlyenhanced, and some attributes which are not (or not fully)belong to QIs in a single data set are likely to become QIswhich will disclosure privacy information after combiningwith external knowledge. The method of finding QIsproposed by Motwani and Xu [26] only considered thepublishing data, but ignored potential data connection risks.Therefore, it cannot guarantee the accuracy of QIs and is notable to prevent linking attacks.

LEE et al [27] analyzed the factors and probability ofre-identification for medical records according to inferableQIs. They treated QIs as variables that allow re-identificationvia a connection to certain individual and point out that it ispossible to infer QIs from some background knowledge. Inorder to avoid the invasion of patient’s privacy, five factorshave been selected to be QIs which affected the probabilityof re-identification and could be inferred from backgroundknowledge. Simulation experiments adopted by this paperuse the probability of re-identification to determine the mosteffective factors among the combinations of QIs. The firstlimitation of this paper is that the number of inferable QIsfor re-identification of medical records may be more thanfive. The actual number of QIs may be different according todifferent time, situation, and types of publishing data. Foranother one, the paper only analyzed characteristics ofattributes from medical record related with patients’ privacyand it lacked a general method suitable for commonpublishing of data.

Kumar et al [28] put forward an assessment method for theclassification of patients’ QIs. Ensembles of several

Engineering Letters, 26:1, EL_26_1_18

(Advance online publication: 10 February 2018)

______________________________________________________________________________________

Page 3: FindingQuasi-identifiersforK-Anonymity ModelbytheSetofCut ...sensitive information of target individual by linking some QIstogether. In order to avoid such linking attacks, the concept

multi-label learning algorithms have been applied to predictthe accuracy of classification of QIs (race and gender ofpatients). Three stages of experimental method have beencarried out on the UCI diabetics dataset by using differentmulti-label classifiers and evaluation measures. Such asbinary relevance (BR), classifier chains (CC), Bayesianclassifier chains (BCC), etc. Although the experimentalresults showed that the best classifier achieves a high overallaccuracy, the transformation method which used to transformthe problem of multi-label classification into single labelclassifications in this paper has a high degree of complexity.What’s more, it can be used to predict the accuracy ofprediction of patient QIs, but how to get the set of QIs was notmentioned in this paper.

Song et al [29] discussed the issue of selecting QIs for thepublishing of views and analyzed the composingcharacteristics of QIs with/without functional dependencies.They suggested that if there have no functional dependenciesin the publishing views, the set of QIs should be composed bythe public attributes among the views; while if there havesome functional dependencies, the set of QIs should not onlycontain public attributes but also include the left attribute offunctional dependency. This method required to estimaterelevant views which were possible be associated with thepublishing view, as well as the functional dependenciesamong various attributes. For data publishers, it was difficultto identify relevant views and the set of QIs in some practicalapplications, because they had little information except thepublishing data and some part of published views.

[30] and [31] proposed new algorithms for finding the setof relevant views and QIs based on hypergraph. The methodmapped the publishing view and the set of published viewsinto a hypergraph, and converted the problem of finding theset of relevant views into the searching for all the pathsbetween two given nodes. Identification method for QIs stilluses the ideas proposed in [29]. Actual process of this methodneed to degenerate the hypergraph into a common graph andhas a high computational complexity. Besides, the set of QIsget from this method may include too many attributes, whichis not good for K-anonymity model based on generalizations(detiled analysis is given in the section of experiments andanalysis).

In this paper, we study the problem of finding QIs forprivacy preserving data publishing method based onK-anonymity model. We take into account the connectivityof QIs between publishing data and others, and put forward anew way to determine QIs from the independence of sets.Identification algorithm has been developed to get QIs byfinding cut-vertex from attribute graph, which represents allthe possible relevance of publishing data, published data andexternal knowledge. Compared with some existing methods,the proposed identification method for QIs has better effecton partition ability and lower computational complexity.

III. IDENTIFICATION OF QIS BASED ON CUT-VERTEX

For data publisher, it is relatively easy to determinesensitive attributes combines with the contents of publishingdata. For example, salary in income statistics, specific diseasein medical statistics, location or trajectory in traffic statistics,etc. However, it is very difficult to determine QIs onlyaccording to some intuitive experiences. Firstly, data

publishers are not able to make detailed predictions of whichpublished information and background knowledge may beassociated with the publishing data. Secondly, attackers maynot only get some QIs of the target individual according toexternal database, but also obtain some backgroundinformation by other means so as to infer the sensitiveinformation of target individual. Some existing privacypreserving methods [26-31] mainly focus on finding out asmany as possible QIs which may related with published dataand external information. But with the increase ofbackground knowledge and diversify of attacks, these kindsof methods will lead to increasing size of QIs, as well as theaggravation of generalization and the decline of data utility.

We noticed that in the field of Markov network, there aresome conclusions regard to independence [32] [33] (shown inFig.1). If the set of node XB “split” the set XA and XC into twodifferent sets, then given the set of node XB, the set XA isindependent from the set XC. This property can be alsodescribed as: if any path which begins from one node of setXA and ends in one node of set XC contains at least one nodeof set XB, it can be determined that when the set XB is given,the set XA and XC are independent from each other.

Fig.1 Example for the independence of Markov network

Based on the property of independence mentioned aboveand combine with the main idea of K-anonymity privacyprotection method, we put forward the following assumption.If all the attributes can be expressed as nodes and all therelationships between attributes can be abstracted by edges,the publishing data together with the published data andexternal information can be represented by a graph. In thisgraph, XC denotes the set of sensitive attributes that can beclearly determined. Then we can find out a set of node XB

from the rest of attributes so that the graph can be splited intothree independent part (the third part is the set XA). Therefore,take XA and XB to be the set of identifier attributes and QIs,any path from the node of XA to the node of XC shouldcontain at least one node of XB. For privacy preserving datapublishing, it means when treating the set of node XB as QIsand carrying out K-anonymity method, it can effectivelyprevent linking attacks and achieve the expectation ofK-anonymity, no matter how much external knowledge canbe associated with the nodes of XA and XB.

The identification method for QIs proposed in this paperfirstly constructs attribute graph according to relationshipsbetween attributes. Then, the set of QIs will be determinedby finding out cut-vertex on the paths from identifier attributeto sensitive attribute.

Engineering Letters, 26:1, EL_26_1_18

(Advance online publication: 10 February 2018)

______________________________________________________________________________________

Page 4: FindingQuasi-identifiersforK-Anonymity ModelbytheSetofCut ...sensitive information of target individual by linking some QIstogether. In order to avoid such linking attacks, the concept

A. Attribute GraphDefinition 1 (Attribute Graph). Suppose all the attributes

of data can be represented by nodes (denotes asm...,i},U{V i 21 ) , and all the relationships between

attributes can be represented by edges (denotes asjiUU)}e({E , m...,j,i 21 ), The resulting pattern is called

attribute graph, denotes as ),( EVGT .If there is no functional dependency between attributes, a

pair of unordered nodes ),( ji UU )( ji can be used to

indicate the relationship between this two attributes.Attribute graph formed by this way is called an undirectedattribute graph. If there is some functional dependencybetween attributes, a pair of ordered nodes ji UU , )( ji can be used to represent the dependency start from attribute

iU to attribute jU , so as to form a directed attribute graph.

Fig.2 Undirected attribute graph

Fig.3 Directed attribute graph

Fig.2 is an example of undirected attribute graphconstructed according to staff information and medicalinformation. Suppose the released views are V1={gender, age}and V2={gender, disease}, the publishing view can berepresented as V3={name, gender, occupation}. Since thereare no functional dependencies between disease, gender, ageand occupation, all of the attributes form an undirectedattribute graph. Fig.3 is the attribute graph for the sales ofsecond-hand car and its background knowledge. Suppose thereleased information is T1={brand, years, purchasing price,selling price} , and the publishing data can be expressed asT2={brand, condition, driven distance, selling price}.Attackers may know that "someone has a car of the brand X,and it has been used for N years", so the backgroundknowledge can be expressed as T3={owner, brand, years}.Since the brand of a car and its purchasing price may affect

the selling price to a certain extent, and the condition of a carwill be affected by its driven distance. Therefore, the aboveinformation formed a directed attribute graph.

B. Generation Method of Attribute Graph

How to generate and represent attribute graph accuratelyand uniquely with the set of attributes and their relationshipsis the key point to affect the accuracy of identification of QIs.In this section, we put forward the generation method ofattribute graph.

Attributes that appeared within one table have inherentcharacteristic of interconnection. )m...,i(U i 21 represents allthe attributes that included in one table. For undirectedattribute graph, if 2i , we can use 1U and

2U as well as theundirected edge ),( 21 UUe between them to represent theirinterrelated characteristics, shown as Fig.4 (a). If mi 3 , acircle of non-redundant nodes and edges mmeU...eUeU 2211 (inwhich )U,U(e 211 , )U,U(e 322 ... )U,U(e mm 1 ) can beused to connect all the nodes together, shown as Fig.4 (b).Any node within the circle is connected with its twoneighbors and the relationships between each other can betransferred through the circle, so that any two attributeswithin one table can be connected with each other. Fordirected attribute graph, if 2i , we can use

1U and2U as

well as the directed edge 21 U,Ue to represent theirinterrelated characteristics, shown as Fig.5 (a). If mi 3 ,we may firstly adjust the nodes with dependencies adjacent toeach other, and then use a circle of non-redundant nodes andedges mmeU...eUeU 2211 (in which 211 U,Ue ,

322 U,Ue ... 1U,Ue mm ) to connect all the nodestogether, shown as Fig.5 (b). Suppose the edges or circlesappeared within one table can be denoted as subgraphs

iSG )l...,i( 21 . When the subgraph uSG contains all thenodes and corresponding relationships of subgraph

vSG )ji( , we say thatvSG is included by uSG .

(a) Edge (b) Circle

Fig.4 Edge and circle for undirected attribute graph

(a) Edge (b) Circle

Fig.5 Edge and circle for directed attribute graph

Definition 2 (Relationship Matrix). Suppose the set ofnodes for attribute graph TG can be expressed as

Engineering Letters, 26:1, EL_26_1_18

(Advance online publication: 10 February 2018)

______________________________________________________________________________________

Page 5: FindingQuasi-identifiersforK-Anonymity ModelbytheSetofCut ...sensitive information of target individual by linking some QIstogether. In order to avoid such linking attacks, the concept

}U...U,U{V m21 , relationship matrix can be defined as amatrix composed by element )m...,j,i(aij 21 .

01

ija (1)

For directed attribute graphs, some edges are composedby ordered pairs of nodes ji UU , , so that only one-way

connectivityji UU can be formed. Therefore, compared

with undirected attribute graphs, connectivity of nodes fordirected attribute graphs is declined. Reflected in therelationship matrix, there are less locations with the value

1ija . For example, the undirected attribute graph and

directed attribute graph shown in Fig.2 and Fig.3 may beexpressed by relationship matrix GaR and GbR as follows.

0001000010000111110100110

GaR

0000011001100101010010110111000101110011011111110

GbR

Algorithm 1. Attribute GraphInitialization : publishing data, published data and externalknowledge have been represent by the form of subgraphs.Termination condition:all the subgraphs have been processed or allthe nodes have been included in the attribute graph.Input:relationship matrices

iSGR )l...,i( 21 of subgraphs.

Output:relationship matrixGR of attribute graph.

1: represent all tables by RSGi

2: n = number of subgraphs3: m = total number of attributes4: Flag = zeros(n,1)5: T ← select the subgraph with largest number of nodes6: Flag(T) =17: RG ←RSGT

8: NG=19: Node ← all the nodes within T10: for i = 1: number of unprocessed subgraphs11: if RSGi contains Node completely12: Flag(i) =113: NG= NG+114: end15: end16: While (NG≠n) or (Flag≠1)17: P ← select unprocessed subgraph with largest number of nodes18: if size(P)>119: P ← select the one with most repeated nodes with RG

20: end21: RG ←insert RSGP to RG

22: Flag(P) =123: NG= NG+124: Node ← all the nodes within RG

25: for i = 1: number of unprocessed subgraphs26: if RSGi contains Node completely27: Flag(i) =128: NG= NG+129: end30: end31: end32: return(RG)

Corollary: Under the premise of no changing of thenumber of attributes and their publishing relationships,different numbering order of attributes will not influence thestructure of attribute graph. Therefore, relationship matrixGR can be used to represent an attribute graph uniquely.Proof: The “edge” and “circle” in the generation method of

attribute graph have the property of symmetry. That meansattribute nodes within the same edge/circle have the sameimportance, and connection relationships are equal for all thejoint nodes. Therefore, the number of subgraphs and theirconnection relationships stay the same if there is no changingof the number of attributes and their publishing relationships.If the number of two nodes are swapped (denote as ji UU ),

the row i and j as well as the column i and j in the relationshipmatrix are all changed. Use GR and *

GR to represent theoriginal relationship matrix and the correspondingrelationship matrix after change. Elements of the row i and jare swapped, meanwhile, the elements of the column i and jare interchanged.

mm

jm

im

m

mj

jj

ij

j

mi

ji

ii

i

m

j

i

G

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

R

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...111

1

1

1

11

mm

im

jm

m

mi

ii

ji

i

mj

ij

jj

j

m

i

j

G

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

R

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...111

1

1

1

11

*

According to the definition of elementary transformationfor matrix, if a matrix A can be transformed into a matrix B bya finite number of elementary row transformations, then the

matrix A is equivalent to the matrix B (denoted as B~Ar

). Ifa matrix A can be transformed into a matrix B by a finitenumber of elementary column transformations, then thematrix A is also equivalent to the matrix B (denoted as

B~Ac

). Therefore, ifji UU , *

GG R~R .

*

111

1

1

1

11111

1

1

1

11

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

~

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

~ G

mm

im

jm

m

mi

ii

ji

i

mj

ij

jj

j

m

i

jc

mm

im

jm

m

mj

ij

jj

j

mi

ii

ji

i

m

i

jr

G R

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

a

Rijij

The above situation can be extended into the case wheremultiple numbers of node are changed. According to thetransitivity of elementary transformation, if B~A and C~B ,then C~A , it is easy to get *

GG R~R .

Suppose the attribute graph TG is composed by nsubgraphs and contains a total number of m nodes (attributes).Running time of Algorithm 1 is mainly spent on selecting asubgraph and determining whether the connectionrelationships of the unprocessed subgraphs have already beenincluded. In the worst situation, all the subgraphs have to beselected and judged one by one. Once a subgraph has beenselected, it has to be compared with the rest unprocessed ones.Therefore, the worst time complexity of Algorithm 1 is

)nm(O .

Uiand Uj are in the same tableUi and Uj are not in the same table or i=j

Engineering Letters, 26:1, EL_26_1_18

(Advance online publication: 10 February 2018)

______________________________________________________________________________________

Page 6: FindingQuasi-identifiersforK-Anonymity ModelbytheSetofCut ...sensitive information of target individual by linking some QIstogether. In order to avoid such linking attacks, the concept

IV. DEFINITION AND IDENTIFICATION OF QIS

Definition 3 (Set of QIs). For a given attribute graph TG ,the set of explicit identifiers can be expressed as EIU , andsensitive attributes SA

iU )d...,i( 21 can be represented bythe set SAV . If the rest attribute nodes can be divided into twonon-empty set NV and QIV , and the following conditions are

satisfied, then QIV is called the set of QIs for attribute graph

TG .

a) QIN VV , SAQI VV , SAN VV ,

)G(VVVV TSAQIN ; (2)

b) NEI VU , QI

SAi VU , for )U,U(e SA

iEI there is

QIji Vee . (3)

According to Definition 3, the set of QIs QIV is thenecessary and minimum “bridge” which connects explicitidentifier with sensitive attribute. Any path starts fromexplicit identifier EIU to sensitive attribute SA

iU must have atleast one node within the set QIV . Therefore, K-anonymity

operation carried out on the set QIV can effectively avoid the

leakage of privacy information caused by linking attacks.Therefore, the problem of finding QIs for attribute graph canbe transformed into the problem of determining the set ofcut-vertex. It should be note that, the concept “cut-vertex”mentioned here is the cutting point of connectivity for all thepaths from EIU to SAU , rather than the “cut point” of structurein the theory of graph.

A. Set of QIs for attribute graph with single sensitiveattribute

For the case of data publishing with single sensitiveattribute, use EIU and SAU to denote explicit identifier andsensitive attribute. The set of QIs for attribute graph TG canbe got through the following steps.

Step 1: generate attribute graph (represented byrelationship matrix GR ) according to the publishing data,published data and external knowledge by using Algorithm 1;

Step 2: find out all the possible paths (denoted asipath )d...,i( 21 ) start from EIU to SAU by using a

depth-first searching algorithm;Step 3: determine the set of cut-vertexes (denoted as QIV )

on every path by using Algorithm 2.

Algorithm 2. Cut-vertexInput: relationship matrix RG, explicit identifier UEI (denoted bynumber a), sensitive attribute USA (denoted by number b).Output: the set of cut-vertex VQI.1: VN=a; VSA=b; VQI=Φ2: m= size(RG)3: NV=24: P ← find all possible paths from a to b5: n=size(P)6: Flag=zeros(n,1)7: While (NV≠m) or (Flag≠1)8: L ← select unprocessed path with minimum number of nodes9: if size(L)=1

10: Node ← other nodes within path L except for a and b11: if size(Node)=112: VQI ←Node13: flag(L)=114: RL ← select unprocessed path which contains Node15: flag(RL)=116: NV = NV +size(Node)17: elseif size(Node)>118: for i=1:size(Node)19: VQI ←Node(i)20: if ∀ Li ∩ Lj = VQI

21: VQI ←Node(i)22: flag(L)=123: RL← select unprocessed path contains Node(i)24: flag(RL)=125: NV = NV +size(Node)26: end27: end28: end29: elseif size(L)>130: Node ← find the intersections of paths in L31: if size(Node)=032: VQI ← insert other nodes within L except for a and b33: flag(L)=134: NV = NV +size(other nodes)35: elseif size(Node)=136: VQI ←Node37: flag(L)=138: RL ← select unprocessed path which contains Node39: flag(RL)=140: NV = NV +size(Node)41: elseif size(Node)>142: for i=1:size(Node)43: VQI ←Node(i)44: if ∀ Li ∩ Lj = VQI

45: VQI ←Node(i)46: flag(L)=147: RL ← select unprocessed path contains Node(i)48: flag(RL)=149: NV = NV +size(Node)50: end51: end52: end53: end54: end55: return(VQI)

The time cost of Algorithm 2 mainly includes two parts.Firstly, it has to find out all the possible paths from explicitidentifier to sensitive attribute by using a depth-firstsearching algorithm. Suppose the attribute graph TG containsa total number of m nodes, the frequency of recursive callingthe depth-first searching is determined by the number ofnodes connected to the starting point. In the worst case, thestarting point connects to all the other nodes, so the recursivecalling need to be performed (m-1) times at most. Returnprocedure of the recursive calling has to judge the nodesincluded in the path, computational complexity of this part isalso related to the number of connected nodes. In the worstcase, there are (m-1) nodes connected with the current node,so that the worst time complexity for the depth-first searchingalgorithm is )m(O 2 . Secondly, the while loop is used todetermine QIs which satisfy the properties of cut-vertex. Inthe most complex case, the longest path may contain all thenodes within the attribute graph. At this moment, there maybe 11 1

41

3 ...CC mm paths at most, and the pathcontains the same nodes can only appear for once. In the

Engineering Letters, 26:1, EL_26_1_18

(Advance online publication: 10 February 2018)

______________________________________________________________________________________

Page 7: FindingQuasi-identifiersforK-Anonymity ModelbytheSetofCut ...sensitive information of target individual by linking some QIstogether. In order to avoid such linking attacks, the concept

worst case, it is required to determine cut-vertex for each pathseparately. So the worst time complexity for Algorithm 2 is

)m(O)mm(O 222 .

Example 1. Let's consider the undirected attribute graphin Fig.2, in which the explicit identifier attribute UEI={name}and sensitive attribute USA={disease}. To ensure that theprivacy of individual cannot be get by linking attack, someattributes should be selected to carry out K-anonymityprocessing in the publishing view V3={name, gender,occupation}. Relationship matrix of this undirected attributegraph can be expressed as RGa. According to theidentification method for the set of QIs shown in Algorithm 2,all the paths from explicit identifier to sensitive attribute areshown in Table I (attributes are represented by theircorresponding numbers). The obtained set of cut-vertexwhich satisfies the characteristics of QIs is VQI={gender}.

TABLE IPATHS FROM “NAME” TO “DISEASE”

NO. Nodes on the path

1 1 2 5

2 1 2 3 5

TABLE IIPATHS FROM “OWNER” TO “SELLING PRICE”

NO. Nodes on the path

1 7 1 4

2 7 1 2 4

3 7 1 2 3 4

4 7 1 3 4

5 7 1 3 2 4

6 7 1 5 4

7 7 1 5 6 4

8 7 1 6 4

9 7 1 6 5 4

10 7 2 4

11 7 2 1 4

12 7 2 1 3 4

13 7 2 1 5 4

14 7 2 1 5 6 4

15 7 2 1 6 4

16 7 2 1 6 5 4

17 7 2 3 4

18 7 2 3 1 4

19 7 2 3 1 5 4

20 7 2 3 1 5 6 4

21 7 2 3 1 6 4

22 7 2 3 1 6 5 4

Example 2. Considering the directed attribute graph forthe sales of second-hand cars shown in Fig.3, in which theexplicit identifier UEI={owner} and sensitive attributeUSA={selling price}. Data publisher do not want people to getthe inference that "someone's car has sold for some price".Therefore, some QIs should be selected to carry out

K-anonymity processing in the publishing table T2={brand,condition, driven distance, selling price}. Relationshipmatrix of this directed attribute graph can be expressed as RGb.According to the identification method for the set of QIsshown in Algorithm 2, all the paths from explicit identifier tothe sensitive attribute are shown in Table II (attributes arerepresented by their corresponding numbers). The obtainedset of cut-vertex is VQI={brand, years}. Combined with thepublished data T1 and background knowledge T3, it can befound that the quasi-identifier "years" has already beenexposed. Therefore, even all the attributes in table T2 areselected and protected by K-anonymity processing, attackersstill can obtain the real selling price by linking attack. Inorder to realize data publishing without privacy disclosure,K-anonymity processing should be carried out on attributebrand and years in the published data T1 and the publishingdata T2 at the same time.

B. Set of QIs for attribute graph with multiple sensitiveattributes

The above discussion assumes that there is only onesensitive attribute in publishing data. However, during theprocess of actual data publishing may be more than onesensitive attribute need to be protected. Take medicalinformation for example, it may include "diagnoses","expenses" and other sensitive attributes which are highlyassociated with privacy. Therefore, it is necessary to discussidentification method of QIs for attribute graph with multiplesensitive attributes.

Actually, identification method of QIs for attribute graphwith multiple sensitive attributes can be get by expanding themethod for single sensitive attribute. For undirected attributegraph, if the publishing table contains multiple sensitiveattributes }...,{ 21

SAn

SASA UUU , all the sensitive attributes shouldappear on the circle formed by attributes connected one byone according to the construction method of attribute graph.Any two nodes located on the circle can connect with eachother, therefore, the path which can reach sensitive attribute

SAiU will also be able to reach sensitive attribute SA

jU )( ji .

Based on this condition, multiple sensitive attributes}...,{ 21

SAn

SASA UUU can be merged into one node SAU0 , so thatall the relationships between sensitive attributes and othernodes will be converged on the node SAU0 . Thus, the problemof finding QIs for attribute graph with multiple sensitiveattributes can be transformed into the issue for singlesensitive attribute (shown in Fig.6 and Fig.7).

For directed attribute graph with multiple sensitiveattributes, different connection relationship can be formeddue to different dependencies between attribute nodes, suchas inclusion, subordination, derivation, etc. So it is unable toensure the path which can reach sensitive attribute SA

iU willalso be able to reach sensitive attribute SA

jU )( ji . However,

it may be possible to get the set of cut-vertex for eachsensitive attribute SA

iU )...2,1( ni . So the problem of findingQIs for attribute graph with multiple sensitive attributes canbe solved by using the method of single sensitive attribute formany times. Use i

QIV )...2,1( ni to denote the set of QIs for

single sensitive attribute SAiU , the final set of QIs can be

expressed as:

Engineering Letters, 26:1, EL_26_1_18

(Advance online publication: 10 February 2018)

______________________________________________________________________________________

Page 8: FindingQuasi-identifiersforK-Anonymity ModelbytheSetofCut ...sensitive information of target individual by linking some QIstogether. In order to avoid such linking attacks, the concept

iQI

n

iQI VV

1 (4)

Fig.6 Attribute graph with multiple sensitive attributes

Fig.7 Attribute graph after conversion

V. EXPERIMENTS AND ANALYSIS

In this section, the analysis and experiments of theproposed identification method for QIs is given. Moreover,its partition ability and computational complexity iscompared with some other related methods mentionedearlier.

A. Partition ability

We consider partition ability from two aspects. The firstis whether a method takes into account the potentialassociation relationships between publishing data andexternal knowledge. As mentioned in the former section,different publishing data can be associated with differentexternal information by different attributes, which leads todifferent extent of privacy leakage. External information ischanging according to different time and environment evenfor the same publishing data. When the publishing data isassociated with some external knowledge, some attributeswhich are not (or not fully) belong to QIs are likely to becomeQIs in the open environment. So the probability of linkingattack will be greatly enhanced. Therefore, it is important forthe identification method to have the ability of evaluatingpotential association relationships between publishing dataand external knowledge. For the second one, the size of QIshas a great impact on the performance of privacy preservingalgorithm. Too many QIs will aggravate the loss ofinformation caused by generalization in K-anonymity model,and drag the running time of algorithm at the same time. Insome extreme cases, except explicit identifiers and sensitiveattributes, all the rest attributes are selected to be QIs and allthe tuples are generalized into one QI-group. This will lead tothe failure of K-anonymity model directly. In this section, theseparation/distinct ratio method [26], the re-identificationmethod based on probability [27], and the method based onhypergraph [31] are selected and compared with the proposed

identification method based on cut-vertex from the abovementioned two aspects.

[26] discuss the problem of finding minimum QIs byusing the “distinct ratio” and “separation ratio”. Greedyalgorithms for ),( separation/distinct minimum QIs havebeen put forward. For a publishing table with n tuples and m

attributes, the algorithm chooses

m

k 2log11

pairs of

tuples randomly and each attribute maps to a subset of the kpairs separated by this attribute. After that, a greedy set coveralgorithm is applied to find an exact set cover for those kpairs of tuples, and output corresponding attributes as QIs.The algorithm outputs a )1( separation QIs with the size

of *

11 )2logln1( I

m

, where *I is the smallest key. For a

publishing table, the size of key may be different from oneattribute to m attributes. Therefore, the size of QIs identifiedby this method is about

m

logln 2111

to )logln(mm

2111

.

However, the proposed method didn’t take into account thepotential correlations between publishing data and externalknowledge. So the accuracy of the selected QIs will begreatly reduced.

[27] point out that de-identified data can be re-identifiedfrom inference by using background knowledge. Afteranalyzing some de-identification and re-identificationtechniques, five factors are selected as the QIs for medicalrecords, which will affect the probability of re-identificationand could be inferred from background knowledge. Althoughthe method considers much information related with medicalrecords, such as admit information, diagnosis, length of stay,provider license, etc., it is not enough for the openenvironment of data publishing with huge amount ofinformation and dozens of attributes. Many practicalincidents show that public information gathered fromconsumption, social network, education, social security, etc.can be linked together with medical records and thereforelead to the disclosure of people’s privacy. Besides, thenumber of QIs selected by this article is also limited. It willbe variant from only one attribute to all of the five attributes.Actually, the number of inferable QIs for re-identificationof medical records may be more than five. The coverageof the attributes may also beyond the five aspects mentionedin this paper.

The algorithm for finding QIs proposed in [31] mainlyincludes two steps. First of all, the set of relevant views haveto be found out, which can connect the publishing view withsome published views so as to infer secret information. Use

)(XV and )}U(V)...U(V),U(V{V nn2211 to denote thepublishing view and the set of published views, )(XV andV can be mapped into a hypergraph. All the paths fromexplicit identifier EIU to each of the node in publishing dataX can be find out (denoted as RE ) , as well as all the pathsfrom the sensitive attribute SAU to each of the node in thepublishing data X (denoted as ER ). So that the set ofrelevant views can be determined as X)'RERE( . Next,the set of QIs can be identified by selecting and merging thepublic attribute of the publishing view and each of therelevant view, so that the final result can be expressed as

Engineering Letters, 26:1, EL_26_1_18

(Advance online publication: 10 February 2018)

______________________________________________________________________________________

Page 9: FindingQuasi-identifiersforK-Anonymity ModelbytheSetofCut ...sensitive information of target individual by linking some QIstogether. In order to avoid such linking attacks, the concept

)VX(QI i

l

i'V

1

. Suppose the publishing view contains a

total number of m attributes, there will be at least one publicattribute and at most )1( m intersection attributes between

)(XV and each of the relevant view. In the worst case, allthe attributes in publishing view may be selected into the setof QIs. The more the number of attributes included inpublishing view, the more the number of QIs. This meansmore computational complexity for the algorithm thatachieves K-anonymity protection by generalizationoperations.

Identification method for QIs proposed in this paper alsoincludes two steps. Firstly, all the paths from explicitidentifier to sensitive attribute have to be found out by adepth-first searching algorithm carried out on the relationshipmatrix GR . Secondly, cut-vertex of all the paths need to beselected to get the final set of QIs. Generally, explicitidentifier EIU and sensitive attribute SAU are not allowed toappear within the same table at the same time. According tothe generation method of attribute graph proposed in thispaper, explicit identifier EIU and sensitive attribute SAU willnot appear in the same edge or circle. There will be at leastone intermediate node and at most )2( m intermediatenodes on the path from EIU to SAU (shown in Fig.8). Let’sconsider the same situation as [31], suppose there are l tables(including the background knowledge) besides the publishingdata X , and for each iVX )...2,1( li there are at most

)1( m intersections. According to the generation method ofattribute graph proposed in this paper, we can obtain a circlewith at most )1( m nodes, and there are l nodes have a

separate edge. If the explicit identifier EIU (or the sensitiveattribute SAU ) locates on the end of edge, the sensitiveattribute SAU (or the explicit identifier EIU ) will locate onthe circle or the end of other edge. There will be at least oneinterval node and at most

21m interval nodes between them

(shown in Fig.9). It is not difficult to find that compared withthe method used in [31], the proposed identificationalgorithm based on cut-vertex has smaller size of QIs underthe same circumstance.

Fig.8 Attribute graph formed by linking of edges

Fig.9 Attribute graph formed by circle and edges

Table III shows the partition ability of the methodsdiscussed above. Among the methods consider potentialconnectivity of publishing data and external knowledge,identification method based on cut-vertex proposed in thispaper has a smaller size of QIs. [27] uses the method toestimate the combination of QIs under a fixed number ofprerequisites. It lacks the flexibility applicable for variousdata publishing environments, and has poor accuracy in theset of QIs. The method used in [31] only noticed thephenomenon of attributes linking, which is probablyhappened in various data tables or views, and takes as muchassociated attributes as possible into the set of QIs. However,the roles of associated attributes and their importance are notfurther distinguished. Therefore, the more the tables or viewsare associated with others, the more the attributes included inthe table, the larger the final set of QIs. In this paper, wedefine QIs as the necessary and minimum "bridge" fromexplicit identifier EIU to sensitive attribute SAU . Theproposed identification algorithm based on cut-vertex aims tofind out the "must" rather than "all" associated attributes, byfurther differentiating the roles and importance of associatedattributes. Hence the number of QIs has been reducedeffectively. Fewer elements of QIs can significantly reducethe computational complexity of K-anonymity algorithm,and is also helpful to reduce the loss of information caused bygeneralizations, as well as improve the availability ofanonymous data. Therefore, the proposed method hassuperior accuracy and partition effect than others.

TABLE IIICOMPARISON OF PARTITION EFFECT

Methods Connectivity Minimum Sizeof QIs

Maximum Sizeof QIs

Ref.[26] NO

m

logln 2111

)logln(mm

2111

Ref.[27] Yes 1 5

Ref.[31] Yes 1 m

Ours Yes 1 m-2

B. Computational Complexity

It has been proved that the problem of finding minimumkey is an NP-hard problem, and the best-knownapproximation algorithm for this problem is a greedyalgorithm, which has an approximation ratio of )(ln nO(where n represents the number of tuples). In view of this, [26]relaxes the problem of finding minimum QIs to the ),( separation/ distinct minimum QIs problem, which aims tofind QIs with a small size such that, with probability at least

1 , the output QIs has separation/distinct ratio at least1 . By sacrificing some accuracy on the result, [26]

developed efficient algorithms that find small size of QIswith time complexities sublinear in the number of tuples.When taking and as constants, the approximation ratioof the proposed method will be )(lnmO (where mrepresents the number of attributes), which is smaller than

)(ln nO for the traditional greedy algorithm when mn .The running time of the above algorithm is )( 4mO .

Engineering Letters, 26:1, EL_26_1_18

(Advance online publication: 10 February 2018)

______________________________________________________________________________________

Page 10: FindingQuasi-identifiersforK-Anonymity ModelbytheSetofCut ...sensitive information of target individual by linking some QIstogether. In order to avoid such linking attacks, the concept

[27] uses a comparative analysis of the probability ofre-identification according to the type and the range ofinference. The method shown in formula (5) is used formeasuring the probability of re-identification, where f refersto the size of QI-group, j is the number of QI-group in dataset, represents the probability of re-identification. Forexample, if all the tuples can be separated into q QI-groupsaccording to a quasi-identifier attribute (or a certaincombination of QIs), the probability of re-identification willbe q/1 . When using this comparative method, we have tofirstly get a set of all the combinations composed by inferableattributes. In [27], the set contains

3255

45

35

25

15

05 CCCCCC possible combinations.

Then, the publishing data need to be scanned for many timesin order to get QI-groups according to certain combination ofattributes. For a publishing dataset with n tuples and mQIs,it needs at least an approximation ratio of )(ln nO for eachscan. Therefore, the total running time of this algorithm willbe )(2)(

0nOnOC m

m

i

im

.

jj f

1 (5)

The algorithm of finding QIs proposed in [31] firstlymaps all the publishing views and published views intohypergraph. In order to get all the paths between a pair ofpoints in the hypergraph, every two nodes within ahyper-edge are connected by a line where the number ofnodes is equal to or more than three. So the hypergraph isdegenerated into a common graph. Suppose the total numberof nodes is m , the algorithm for the set of relevant views hasa complexity of )( 5mO . Next, QIs are determined accordingto whether there have functional dependencies betweenattributes. Suppose the set of relevant views has a number ofl views, if there are no functional dependencies among theattributes, the algorithm for finding the intersection part ofviews will be looped for l times. That is to say, the totalcomplexity of the process for finding QIs is

)m(O)ml(O 55 under the circumstances with nofunctional dependencies. When there are some functionaldependencies between attributes, some judgments have to becarried out on the basis of previous algorithm, the overallcomplexity of the algorithm is also similar to

)m(O)ml(O 55 .

Identification algorithm for QIs based on cut-vertexproposed in this paper firstly represents all the publishingdata, published data and other available external informationby the form of undirected/directed attribute graph.Connections among attributes such as dependency,connectivity, subordination, etc are described by therelationship matrix. Suppose the attribute graph has a totalnumber of m nodes, the identification algorithm for QIsmainly includes two steps: Firstly, all the possible paths fromexplicit identifier EIU to the sensitive attribute SAU areobtained by a depth-first searching strategy according torelationship matrix. Its complexity is about )( 2mO . Secondly,the set of QIs is determined by selecting cut-vertex from theabove-mentioned paths, complexity of this part is also )( 2mO .

So the identification algorithm for QIs based on cut vertexproposed in this paper has a time complexity of

)m(O)mm(O 222 in the worst case.

TABLE IVCOMPARISON OF RUNNING TIME

Methods Running Time

Ref.[26] O(m4)

Ref.[27] 2mO(n)

Ref.[31] O(m5)

Ours O(m2)

Table IV compares the computational complexity of themethods mentioned above. The identification method for QIsbased on cut-vertex proposed in this paper has lowercomputational complexity than others. The main reason liesin the following aspects: Firstly, the methods proposed in [26]and [27] use greedy or exhaustive algorithm to get all thepossible QIs or combinations of QIs. It requires multiplescans of the publishing data and consumes a lot of computingtime. [31] and our method resolve the problem of finding QIsby using the theory of graph. There is no need to scan thepublishing data, therefore saving a lot of running timebecause the number of attributes is far less than the number oftuples. Secondly, compared with the method proposed by[31], we use circle instead of hyper-edge, which not onlyreserves the connectivity between attributes but also savesthe process of transforming hypergraph into common graph.Thirdly, in order to determine QIs for the publishing view,[31] needs to search all the possible paths from explicitidentifier and sensitive attribute to each of other nodes withinthe publishing view. In this paper, a depth-first searchingstrategy is used based on relationship matrix, whicheffectively reduced the complexity of path searching from

)m(O 4 to )m(O 2 . Finally, the partitioning method based oncut-vertex put forward in this paper selects attribute thatsatisfies the demand of cut-vertex and eliminates the path thatincludes the attribute at the same time, in order to avoid thetraversal of all the possible paths and speed up the selectionprocess.

VI. CONCLUSION

Automatic discovery and accurate classification of QIsare important factors for the success of privacy preservingdata publishing methods. In this paper, we aim to discuss therole of QIs and the problem of finding QIs for K-anonymitymodel. We are inspired by the concept of independence fromMarkov network, and our contributions can be summarizedas follows. Firstly, we put forward a new definition for QIsfrom the aspect of independence of sets. This new definitionestablishes the theoretical basis for the identification of QIs,that is, QIs should separate sensitive attributes from the othernon-sensitive attributes. Secondly, we design a method toform directed/undirected attribute graph according torelationship matrix, which helps to reflect potentialrelationships among attributes. Finally, we put forward anefficient algorithm to identify QIs based on cut-vertex.Compared with some existing methods, the proposedidentification algorithm has better ability to reflect potentialassociation relationships of publishing data and external

Engineering Letters, 26:1, EL_26_1_18

(Advance online publication: 10 February 2018)

______________________________________________________________________________________

Page 11: FindingQuasi-identifiersforK-Anonymity ModelbytheSetofCut ...sensitive information of target individual by linking some QIstogether. In order to avoid such linking attacks, the concept

knowledge. What’s more, it has smaller size of QIs and lowercomputational complexity under the same circumstance.

REFERENCES[1] The white house. “Consumer data privacy: in a networked world”,

https://www.whitehouse.gov/sites/default/files/privacy-final.pdf[2] Meng Xiaofeng, Zhang Xiaojian. “Big data privacy management”,

Journal of Computer Research and Development, vol.52, no.2,pp265-281, 2015.

[3] European Commission. “Proposal on general data protectionregulation”, http://ec.europa.eu/justice/data-protection/document/review2012/com_2012_11_en.pdf

[4] Carlos Costa, Maribel Yasmina Santos. “Big Data: State-of-the-artConcepts, Techniques, Technologies, Modeling Approaches andResearch Challenges”, IAENG International Journal of ComputerScience, vol.44, no.3, pp285-301, 2017.

[5] Yawar Abbas Bangash, Qamar ud Din Abid, Alshreef Abed Ali A, et al.“Security Issues and Challenges in Wireless Sensor Networks: ASurvey”, IAENG International Journal of Computer Science, vol.44,no.2, pp135-149, 2017.

[6] Hirofumi Miyajima, Noritaka Shigei, Hiromi Miyajima, et al. “NewPrivacy Preserving Back Propagation Learning for Secure MultipartyComputation”, IAENG International Journal of Computer Science,vol.43, no.3, pp270-276, 2016.

[7] Sweeney L. “K-Anonymity: A model for protecting privacy”,International Journal of Uncertainty, Fuzziness and Knowledge-BasedSystems, vol.10, no.5, pp 557-570, 2002.

[8] Sweeney L. “Achieving k-anonymity privacy protection usinggeneralization and suppression”, International Journal of Uncertainty,Fuzziness and Knowledge-Based Systems, vol.10, no.5, pp571-588,2002.

[9] Priyadarsini RP, Sivakumari S, Amudha P. “Enhanced l-DiversityAlgorithm for Privacy Preserving Data Mining”, Proc. 51st AnnualConvention of the Computer-Society-of-India, India, 2016, pp14-23.

[10] Yamaoka Y, Itoh K. “k-Presence-Secrecy: Practical Privacy Model asExtension of k-Anonymity”, IEICE Transactions on Information andSysyems, issue 4, pp730-740, 2017.

[11] Casas-Roma J, Herrera-Joancomarti J, Torra V. “A summary ofk-degree anonymous methods for privacy-preserving on networks”,Advanced Research in Data Privacy, issue 567, pp231-250, 2015.

[12] Amiri Fatemeh, Yazdani Nasser, Shakery Azadeh, et al. “Hierarchicalanonymization algorithms against background knowledge attack indata releasing”, Knowledge-based Systems, vol.101, pp71-89, 2016.

[13] Le Junqing, Liao Xiaofeng, Yang Bo. “Full autonomy: A novelindividualized anonymity model for privacy preserving”, Computers &Security, vol.66, pp204-217, 2017.

[14] Nayahi JJV, Kavitha V. “Privacy and utility preserving data clusteringfor data anonymization and distribution on Hadoop”, FutureGeneration Computer System-The International Journal of Escience,vol.74, pp393-408, 2017.

[15] B Palanisamy, L Liu. “Privacy-preserving Data Publishing in theCloud: A Multi-level Utility Controlled Approach”, Proceedings of theIEEE 8th International Conference on Cloud Computing, USA,pp130-137, 2015.

[16] Sanchez David, Domingo-Ferrer Josep, Martinez Sergio, et al.“Utility-preserving differentially private data releases via individualrankingmicroaggregation”, Information Fusion, vol.30, pp1-14, 2016.

[17] Qinghai Liu, Hong Shen, Yingpeng Sang. “Privacy-Preserving DataPublishing for Multiple Numerical Sensitive Attributes”, TsinghuaScience and Technology, vol. 20, no.3, pp246-254, 2015.

[18] Zhang L, Xuan J, Si RQ, et al. “An Improved Algorithm ofIndividuation K-Anonymity for Multiple Sensitive Attributes”,Wireless Personal Communications, vol.95, no.3, pp2003-2020, 2017.

[19] Li Jin, Liu Zheli, Chen Xiaofeng, et al. “L-EncDB: A lightweightframework for privacy-preserving data queries in cloud computing”,Knowledge-Based Systems, vol.79, pp18–26, 2015.

[20] Wang J, Zhang YH, Wang YY, et al. “RPRep: A Robust andPrivacy-Preserving Reputation Management Scheme forPseudonym-Enabled VANETs”, International Journal of DistributedSensor Networks, vol.12, issue 3, pp1-15, 2016.

[21] Dargahi Tooska, Ambrosin Moreno, Conti Mauro, et al. “ABAKA: Anovel attribute-based k-anonymous collaborative solution for LBSs”,Computer Communications, vol. 85, pp1-13, 2016.

[22] Terrovitis Manolis, Poulis Giorgos, Mamoulis Nikos, et al. “LocalSuppression and Splitting Techniques for Privacy PreservingPublication of Trajectories”, IEEE Transactions on Knowledge andData Engineering, vol. 29, no. 7, pp1466-1479, 2017.

[23] Dara E. Seidl, Piotr Jankowski, Ming-Hsiang Tsou. “Privacy andspatial pattern preservation in masked GPS trajectory data”,

International Journal of Geographical Information Science,vol.30, issue 4, pp785-800, 2016.

[24] P. Shi, L. Xiong, B. Fung. “Anonymizing data with quasi-sensitiveattribute values”, Proceedings of the 19th ACM internationalconference on Information and knowledge management, Canada,pp1389–1392, 2010.

[25] Yuichi Sei, Hiroshi Okumura, Takao Takenouchi, et al.“Anonymization of Sensitive Quasi-Identifiers for l-Diversity andt-Closeness”, IEEE Transactions on Dependable and SecureComputing, DOI 10.1109/TDSC.2017.2698472

[26] Rajeev Motwani, Ying Xu. “Efficient Algorithms for Masking andFinding Quasi-Identifiers”, Proceedings of VLDB, Vienna, Austria,pp758-769, 2007.

[27] Yong Ju LEE, Kyung Ho LEE. “Re-identification of medical recordsby optimum quasi-identifiers”, International Conference on AdvancedCommunication Technology (ICACT), Bongpyeong, South Korea,pp428-435, 2017.

[28] Naveen Kumar Parachur Cotha, Marina Sokolova. “Multi-labellearning in classification of patients’ quasi-identifiers”, Prog ArtifIntell, vol.4, pp37-48, 2015.

[29] Song Jinling, Huang Liming, Liu Guohua. “Algorithm for FindingQuasi-identifiers in the k-anonymity Method”, Journal of ChineseComputer Systems, vol.29, no.9, pp1688-1693, 2008.

[30] Song Jinling, Liu Guohua, Huang Liming, et al. “Algorithm to Find theSet of Relevant Views and Quasi-Identifiers for k-anonymity Method”,Journal of Computer Research and Development, vol.46, no.1,pp77-88, 2009.

[31] Huang L M, Song J L, Lu Q C, et al. “Hypergraph-based solution forselecting quasi-identifier”, International Journal of Digital ContentTechnology and its Applications, vol.6, no.20, pp597-606, 2012.

[32] Santana R, Shakya S. “Probabilistic Graphical Models and MarkovNetworks”, Markov Networks in Evolutionary Computation, vol.14,no.1, pp 3-19, 2012.

[33] Wang Feiyue, Han Suqing. “Probabilistic Graphical Models: theoryand technology”, Tsinghua University Press, 2015.

Engineering Letters, 26:1, EL_26_1_18

(Advance online publication: 10 February 2018)

______________________________________________________________________________________