
LUGS: A Scalable Non-parametric Data Synthesizer for Privacy-Preserving Health Data Publication

Abstract

This paper introduces a non-parametric data synthesizing algorithm to generate privacy-safe “realistic but not real” synthetic health data. The proposed algorithm synthesizes artificial records while preserving the statistical characteristics of the original data to the extent possible. The risk from a “database linking attack” is quantified by an l-diversified data generation process. Moreover, its algorithmic performance is optimized using Locality-Sensitive Hashing and parallel computation techniques to yield a linear-time algorithm that is suitable for Big Health data applications. We synthesize a public Medicare claims dataset using the proposed algorithm, and demonstrate multiple data mining applications and statistical analyses using the data. The synthetic dataset delivers results that are substantially identical to those obtained from the original dataset, without revealing the actual records.

1. Introduction

Synthetic data, generated from a certain random process, can address disclosure limitation issues in public-use health data. Many health datasets contain privacy-sensitive and sometimes confidential information such as disease, payment, and treatment records. Revealing such health information is clearly objectionable to many people, and raises severe ethical and fiduciary issues. When created carefully, however, synthetic data can provide the statistical information required for various analyses in the healthcare domain without revealing person-specific data or a person’s identity. Developing appropriate algorithms to generate synthetic data is critical to meeting the growing need for well-grounded health informatics research.

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.

Using traditional “parametric model-based” synthesizers, however, provides only a partial solution to such objectives (Reiter et al., 2006), as it introduces two open-ended issues: the model selection problem and unquantifiable privacy risk. The complexity of a synthetic model limits the answerable range of research questions; for example, synthetic data generated from a linear model cannot be used to identify quadratic relationships between covariates. However, most data mining research is based on retrospective analysis, where the research questions are posed post hoc and may be determined during the data exploration process. Thus, this type of synthetic data is not perfectly suited for data mining applications. Moreover, popular privacy metrics such as k-anonymity (Sweeney, 2002), l-diversity (Machanavajjhala et al., 2007), or ε-differential privacy (Dwork, 2006) cannot be directly applied to such model-based synthesizers (Abowd & Vilhuber, 2008). Recent reports on adversarial privacy attacks (Narayanan & Shmatikov, 2008; 2009) suggest that rigorous characterization of such risk is critical in data publishing.

Non-parametric synthetic data may be a remedy for the model selection issue. Moreover, if the generative process of such data adheres to a rigorous, state-of-the-art privacy metric, then the risk from the synthetic data can be well characterized. However, such non-parametric schemes are rarely adopted for generating synthetic data in practice, due to their computational cost and non-intuitive connections to privacy metrics. In this paper, we analyze the definition of privacy in the healthcare context, and derive its connection to probabilistic generative processes. We also propose a novel and practical algorithm for generating synthetic data, adapting and coupling parallel computation and Locality-Sensitive Hashing (LSH) techniques (Gionis et al., 1999) to address the computational challenges. Our framework targets categorical data, which is the dominant data type in many health records. Furthermore, for many numeric fields, binning, also known as histogramming or quantization, can be appropriately applied to transform the data into categorical format, supporting the generality of our framework.
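To make the binning step concrete, the following is a minimal sketch of quantile binning of a numeric field into five categories, in the spirit of the quantile coding used for the CMS variables later in the paper. The helper `quantile_bin` and the example values are illustrative, not part of the paper's implementation.

```python
import numpy as np

def quantile_bin(values, n_bins=5):
    """Map numeric values to 1..n_bins by empirical quantile (illustrative helper)."""
    # Interior quantile cut points only (n_bins - 1 edges).
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    # searchsorted returns 0..n_bins-1; shift to 1-based category codes.
    return np.searchsorted(edges, values, side="right") + 1

# Hypothetical length-of-stay values (days), already sorted for readability.
lengths_of_stay = np.array([1, 2, 2, 3, 4, 5, 7, 10, 14, 30])
codes = quantile_bin(lengths_of_stay, n_bins=5)
```

After binning, the numeric field behaves like any other categorical attribute, so it can participate in the contingency-table machinery described in Section 2.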


2. Preliminaries

The original table and synthetic table are denoted as D and S, respectively. n and m represent the number of rows and columns of table D, and x = (x1, x2, ..., xm) is a row of the table. x\j is a row with the value of xj undetermined, e.g. x\1 = (·, x2, ..., xm). Random variables are expressed using capital letters, for example X for a scalar and X for a vector. Pr(X) is the true probability distribution of the original data, and Pr_e(X | D) represents the empirical probability mass function given data D.

A categorical dataset D with indistinguishable rows, i.e. without any IDs, can be completely described by a contingency table. Thus, with a sufficient number of samples, the contingency table can be regarded as the most precise non-parametric approximation of the underlying joint distribution. We use the notations for the normalized contingency table1 and the empirical distribution Pr_e(X | D) interchangeably.

The difference between the empirical and the true distribution decays as the number of data points increases (Csiszar & Shields, 2004); the empirical distribution approaches the true distribution exponentially fast as n increases. In practice, this empirical distribution can be treated as the underlying true distribution when the size of the data is huge.

Obtaining a full contingency table is usually intractable, especially with a large number of attributes. However, we can mimic sampling from a full contingency table using Gibbs sampling, a prevalent estimation and sampling technique for cases where sampling directly from a joint distribution is intractable. A synthetic sample xs from the joint distribution Pr_e(X | D) can be Gibbs-sampled as follows:

x1 ∼ Pr_e(X1 | x\1, D)

x2 ∼ Pr_e(X2 | x\2, D)

...

xm ∼ Pr_e(Xm | x\m, D)

where Pr_e(Xi | x\i, D) is a conditional frequency table derived from data D. When the above cycle has been repeated for a sufficient number of iterations (the burn-in period), the last sample x is equivalent to a random sample from the joint distribution. It is important to note that, in Gibbs sampling, the information about the joint distribution is spread across its conditional distributions.
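The cycle above can be sketched as follows. This is a minimal brute-force version, assuming the table is a list of tuples; `gibbs_synthesize` and the toy table `D` are illustrative names, and each conditional is estimated on the fly by a linear scan (exactly the inefficiency criticized below).

```python
import random
from collections import Counter

def gibbs_synthesize(rows, n_iter=100, seed=0):
    """Brute-force Gibbs synthesizer sketch: resample each field from the
    empirical conditional Pr_e(X_i | x_{-i}, D), estimated by a linear scan."""
    rng = random.Random(seed)
    m = len(rows[0])
    x = list(rng.choice(rows))              # initialize from a random record
    for _ in range(n_iter):                 # burn-in iterations
        for i in range(m):
            rest = tuple(x[:i] + x[i + 1:])
            # Frequencies of X_i among records matching x on all other fields.
            counts = Counter(r[i] for r in rows if r[:i] + r[i + 1:] == rest)
            values, weights = zip(*counts.items())
            x[i] = rng.choices(values, weights=weights)[0]
    return tuple(x)

# Tiny illustrative table: two categorical fields, five rows.
D = [(1, 'a'), (1, 'a'), (1, 'b'), (2, 'b'), (2, 'b')]
sample = gibbs_synthesize(D, n_iter=50)
```

Note that every update keeps the sample inside the support of D, so the output is always an existing record: this is precisely the “realistic and real” limitation of the brute-force synthesizer.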

This brute-force Gibbs synthesizer, however, has three

1 Normalized by the total row counts.

critical issues: a lack of convergence guarantees, computational inefficiency, and a loose privacy guarantee. For a Gibbs sampler to converge, its Markov chain needs to be irreducible and aperiodic; in the brute-force Gibbs synthesizer, the Markov chain over the empirical conditional distributions must be verified to satisfy both conditions. Moreover, this brute-force Gibbs synthesizer is computationally expensive. The number of distinct conditional distributions grows exponentially with the number of features, so pre-computing Pr_e(Xi | X\i, D) is not desirable. Estimating Pr_e(Xi | X\i, D) on the fly is also a poor choice, since the estimation requires a linear scan for every Gibbs iteration. Finally, synthetic samples from an empirical distribution can always be found in the original data, as the support of the empirical distribution is the same as the support of the original data. In other words, such synthetic data are “realistic and real”, not “realistic but not real”.

3. LUGS Algorithm

We now present the l-diversified uniformly smoothed Gibbs synthesizer (LUGS). LUGS is an efficient non-parametric data synthesizer that meets the l-diversity principle; it is illustrated in Algorithm 1 and consists of three steps: estimation, perturbation, and sampling.

Algorithm 1 LUGS Algorithm

(Step 1) Estimate P_e(Xi | g(X\i), D)
(Step 2) Perturb P_e → Q s.t. −E[log Q] ≥ log l
(Step 3) Gibbs-sample synthetic data from the perturbed conditional distributions

3.1. Step 1: Estimation

Instead of Pr_e(Xi | X\i, D), LUGS estimates an approximated distribution, P_e(Xi | g(X\i), D), where g(X\i) is a hash function such as MinHash (Broder, 1997) or Locality-Sensitive Hashing (LSH) (Gionis et al., 1999; Indyk & Motwani, 1998). The hash function is employed to reduce the number of distinct conditional distributions and to access the conditional distributions faster.

We first reduce the hash key space using MinHashing, then perform a fast (approximate) nearest neighbor search using LSH. In this paper, we choose an LSH family F defined for the metric space M = (X, Hamming distance), but we note that other distance metrics are also applicable. The MinHashing collision probability can be adjusted through its parameters, and the number of MinHash keys can be made significantly smaller


than the size of the original database. Thus, even for a very large-scale database with high-dimensional features, LUGS can sample from its (approximated) conditional distributions. Note that there is a trade-off between fidelity to the exact conditional distributions and the required memory size.
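The estimation step can be sketched as follows: conditioning contexts X\i are grouped by a small MinHash signature, and one frequency table is kept per group. This is a toy sketch under simplifying assumptions (md5-based hash functions standing in for a tuned MinHash/LSH family; `estimate_conditionals` and the tiny table `D` are illustrative names, not the paper's implementation).

```python
import hashlib
from collections import defaultdict, Counter

def minhash_signature(items, n_hashes=4):
    """Tiny MinHash sketch: for each of n_hashes seeded hash functions, keep
    the minimum hash over the item set. Similar sets collide on signature
    entries with probability equal to their Jaccard similarity."""
    sig = []
    for seed in range(n_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{it}".encode()).hexdigest(), 16)
            for it in items))
    return tuple(sig)

def estimate_conditionals(rows, i, n_hashes=4):
    """Step 1 sketch: estimate P_e(X_i | g(X_{-i}), D) as one frequency
    table per MinHash bucket of the conditioning context."""
    tables = defaultdict(Counter)
    for r in rows:
        context = [(j, v) for j, v in enumerate(r) if j != i]  # set of (field, value)
        tables[minhash_signature(context, n_hashes)][r[i]] += 1
    return tables

D = [(1, 'a', 'x'), (1, 'a', 'y'), (2, 'b', 'x'), (2, 'b', 'y')]
tables = estimate_conditionals(D, i=2)
```

Because equal contexts always hash to the same signature, the number of stored tables is at most the number of distinct contexts, and typically far smaller once similar contexts collide.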

3.2. Step 2: Perturbation

The obtained conditional distributions are perturbed to meet a prescribed entropy l-diversity criterion for privacy:

Definition 1 (Entropy l-diversity (Machanavajjhala et al., 2007)). A probability mass function Q is “entropy l-diverse” if

−E[log Q] = −∑_x Q(x) log Q(x) ≥ log l

where log l > 0.

We perturb the approximated probability distribution P_e(Xi | g(X\i), D) to satisfy the l-diversity principle as follows:

Qα(Xi | g(X\i)) = (1 − α) P_e(Xi | g(X\i), D) + αU

where U is the uniform distribution over the range of Xi. As we want to keep the statistical properties of the original data to the extent possible, we choose the minimum α that satisfies the condition:

α∗ = arg min_α { α : −E[log Qα(Xi | g(X\i))] ≥ log l }

Q(Xi | g(X\i)) = Qα∗(Xi | g(X\i))
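The minimal-α search admits a simple numerical sketch: the entropy of the uniform mixture is monotone increasing in α, so α∗ can be found by bisection. The helper names below are illustrative; this is a sketch of the perturbation step, not the paper's implementation.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

def min_alpha(p, l, tol=1e-9):
    """Smallest alpha with H((1-alpha)*p + alpha*U) >= log(l), found by
    bisection; entropy of the uniform mixture increases monotonically in alpha."""
    k = len(p)
    target = math.log(l)
    assert target <= math.log(k), "log l cannot exceed log |range(X_i)|"
    if entropy(p) >= target:
        return 0.0
    lo, hi = 0.0, 1.0          # invariant: hi satisfies the constraint, lo does not
    while hi - lo > tol:
        mid = (lo + hi) / 2
        mixed = [(1 - mid) * q + mid / k for q in p]
        if entropy(mixed) >= target:
            hi = mid
        else:
            lo = mid
    return hi

p = [0.9, 0.05, 0.03, 0.02]    # a skewed conditional distribution
alpha = min_alpha(p, l=2.0)
q = [(1 - alpha) * v + alpha / len(p) for v in p]
```

Choosing the minimal α keeps the perturbed conditional as close as possible to the empirical one while still meeting the entropy l-diversity constraint.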

Although we do not show the details in this paper, the perturbed distributions can be modified to satisfy other privacy metrics such as ε-differential privacy or the Pufferfish framework (Kifer & Machanavajjhala, 2012). This paper focuses mainly on l-diversified synthetic data, and we leave further experiments and comparisons with other privacy metrics to future work.

3.3. Step 3: Sampling

LUGS samples are generated through a perturbed Gibbs sampler:

x1 ∼ Q(X1 | g(x\1))

x2 ∼ Q(X2 | g(x\2))

...

xm ∼ Q(Xm | g(x\m))

This Markov chain converges to a unique stationary distribution Q(X) if α > 0; see Theorem 1.

Theorem 1 (Existence of Q(X)). If Q(Xi | X\i) > 0 (positivity condition), then the Markov chain of the perturbed Gibbs sampler is irreducible, and thus there exists a unique stationary distribution Q(X).

Proof. For any two states x and x′, we have:

Pr(X(t+1) = x | X(t) = x′) > 0

Pr(X(t+1) = x′ | X(t) = x) > 0

where X(t+1) represents the (t+1)-th iteration sample of the Gibbs sampler. As x and x′ intercommunicate, one of the following is true (Grimmett & Stirzaker, 2001): x and x′ are both recurrent, or x and x′ are both transient. As the range of X is finite, there is at least one recurrent state, and thus all states of X are recurrent. Since the Markov chain is irreducible and all of its states are recurrent, the chain has a unique stationary distribution Q(X).

The difference between Pr_e(X) and Q(X) is upper-bounded by the parameter log l, and hence the difference between the true distribution Pr(X) and the LUGS distribution Q(X) is also upper-bounded. Note that the perturbed Gibbs sampler generates artificial samples that are not in the original data. For these artificial samples, we find the nearest neighboring perturbed conditional distribution as follows:

Q(Xi | g(X^ann_\i)) s.t. X^ann_\i ≈ X\i

using LSH, and then use this approximate-nearest-neighbor distribution in the Gibbs sampling iterations.

3.4. Additional Step: Parallelization

The LUGS algorithm is embarrassingly parallelizable with sub-sampling. Algorithm 2 illustrates the parallel LUGS (PLUGS) algorithm. If each process in PLUGS satisfies the l-diversity principle, the combined collection of synthetic data also satisfies the l-diversity principle.

Algorithm 2 PLUGS Algorithm

(Step 0) Partition D into {Dp} and distribute to parallel processes
(Step 1) Estimate P_e(Xi | g(X\i), Dp)
(Step 2) Perturb P_e → Q s.t. −E[log Q] ≥ log l
(Step 3) Gibbs-sample synthetic data from the perturbed conditional distributions
(Step 4) Combine and mix the synthetic data {Sp} from the parallel processes
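The partition-synthesize-combine flow of PLUGS can be sketched as below. This is a structural sketch only: `synthesize_partition` is a hypothetical stand-in for the per-partition Steps 1-3 (here it merely resamples rows with replacement), and a thread pool stands in for the distributed processes of a real deployment.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def synthesize_partition(args):
    """Stand-in for Steps 1-3 on one partition D_p: resample rows with
    replacement; a real worker would run the perturbed Gibbs sampler."""
    partition, seed = args
    rng = random.Random(seed)
    return [rng.choice(partition) for _ in partition]

def plugs(rows, n_parts=4, seed=0):
    # Step 0: partition D into {D_p}.
    parts = [rows[p::n_parts] for p in range(n_parts)]
    # Steps 1-3 run independently per partition (embarrassingly parallel).
    with ThreadPoolExecutor(max_workers=n_parts) as ex:
        results = list(ex.map(synthesize_partition,
                              [(p, seed + i) for i, p in enumerate(parts)]))
    # Step 4: combine and mix the synthetic partitions {S_p}.
    combined = [row for part in results for row in part]
    random.Random(seed).shuffle(combined)
    return combined

D = [(i % 2, i % 3) for i in range(20)]
S = plugs(D)
```

Because each worker only sees its own partition, the l-diversity guarantee enforced per partition carries over to the shuffled union.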


Figure 1. A link attack example. Two datasets can be linked based on Age and ZIP code. Sensitive values such as Diabetes and STD can be revealed using the newly linked variables Name and BMI.

4. Privacy Analyses

The “realistic but not real” principle can be achieved through rigorous analysis of the definition of “privacy” in the healthcare context. Unlike other scientific measurement datasets, healthcare datasets generally contain information about real people, such as patients and physicians. Thus, the definition of data privacy for healthcare datasets can be narrowed down: we define the goal of privacy protection in the healthcare domain as protecting the identity of the people and protecting their sensitive information.

The attacks on published datasets can be categorizedas follows:

• Identity attack: identifying “who”

• Feature attack: identifying “what”

Although many other types of attacks may threaten to uncover private information, such attacks are not directly related to our definition of privacy protection. For example, an attacker can link two datasets and find new information about a specific row without knowing the person’s identity. In this case, as the attacker has no clue about the identity of the records, the attack does not infringe on privacy by our definition. Still, such information can be useful for later identity and feature attacks (transitive attacks), and we note that the uncertainty level of such information may decrease as more linkable datasets become available.

In this paper, we analyze the effect of database linking on feature attacks. Consider a dataset with two features: X (non-sensitive) and Y (sensitive). When publishing the dataset, a typical strategy is to perturb Y into a noised version Ỹ. However, if another dataset is linked with the original dataset based on X, the linked information may reduce the uncertainty about the sensitive field. Figure 1 illustrates this linking scenario, and Theorem 2 quantifies the information gained by an attacker from linking two datasets:

Theorem 2 (Link Gain). Suppose that two datasets DZ = {(Z, X)} and DY = {(Ỹ, X)} are linked based on X. Then:

Pr(Y = Ỹ | X, Z) / Pr(Y = Ỹ | X) = Pr(Z | X, Y = Ỹ) / Pr(Z | X) = Λ    (1)

where Λ is the “link gain ratio” obtained by linking the two datasets.

Proof. Equation (1) is a direct result of applying Bayes’ theorem.

Theorem 2 exposes two competing objectives when publishing data: Λ needs to be low and Pr(Y = Ỹ | X) needs to be low. However, Λ increases as Pr(Y = Ỹ | X) decreases, and Pr(Y = Ỹ | X) increases as Λ decreases. Furthermore, Λ can be arbitrarily large when Pr(Z | X) ≈ 0, and we have no control over Z and DZ. Thus, directly lowering Λ is very difficult. LUGS addresses this issue by perturbing every variable in a dataset, including the linking variables: instead of directly lowering Λ, we inject uncertainty into the linking itself.
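Theorem 2 is a Bayes identity, which can be checked numerically on any joint distribution. The sketch below builds a random joint pmf over small discrete (X, Y, Z) and verifies that the posterior ratio on the left of Equation (1) equals the link gain ratio Λ on the right; the `marginal` helper and the pmf are illustrative constructions.

```python
import itertools
import random

rng = random.Random(0)
# A random joint pmf over small discrete (X, Y, Z).
states = list(itertools.product(range(2), range(3), range(2)))
weights = [rng.random() for _ in states]
total = sum(weights)
pxyz = {s: w / total for s, w in zip(states, weights)}

def marginal(cond):
    """Sum the joint pmf over all states matching a partial assignment
    {axis_index: value} (axis 0 = X, 1 = Y, 2 = Z)."""
    return sum(p for s, p in pxyz.items()
               if all(s[k] == v for k, v in cond.items()))

x, y, z = 0, 1, 1
# Left side of (1): how much observing the linked Z changes the posterior of Y.
lhs = (marginal({0: x, 1: y, 2: z}) / marginal({0: x, 2: z})) \
    / (marginal({0: x, 1: y}) / marginal({0: x}))
# Right side of (1): the link gain ratio Lambda.
rhs = (marginal({0: x, 1: y, 2: z}) / marginal({0: x, 1: y})) \
    / (marginal({0: x, 2: z}) / marginal({0: x}))
```

Both sides reduce algebraically to Pr(x, y, z) Pr(x) / (Pr(x, z) Pr(x, y)), which is why the identity holds for any joint distribution.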

The perturbation step in LUGS can also be interpreted in information-theoretic terms. The mutual information between Xi and X\i is:

I(Xi; X\i) = H(Xi) − H(Xi | X\i)    (2)

where I(Xi; X\i) is the Shannon mutual information between Xi and X\i. Uniformly smoothing Pr_e(Xi | X\i, D) increases the conditional entropy H(Xi | X\i), decreasing I(Xi; X\i) as a result. Thus, LUGS weakens the link between Xi and X\i by increasing the conditional entropy. Traditional data publishing methods such as k-anonymity and l-diversity, on the other hand, decrease the marginal entropy H(Xi) through generalization or suppression of feature values. In both cases, however, the effect is to reduce the mutual information between features.
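The effect of Equation (2) can be demonstrated on a toy joint distribution: uniformly smoothing each conditional row raises H(Xi | X\i) and therefore lowers I(Xi; X\i). The 2x2 table and helper names below are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(v * math.log(v) for v in p if v > 0)

# Joint pmf of (X_\i, X_i) as a 2x2 table with strong dependence.
joint = [[0.45, 0.05],
         [0.05, 0.45]]

def mutual_information(j):
    px = [sum(row) for row in j]
    py = [sum(col) for col in zip(*j)]
    return entropy(px) + entropy(py) - entropy([v for row in j for v in row])

def smooth(j, alpha):
    """Uniformly smooth each conditional Pr(X_i | X_\i = row): mix each row's
    conditional with the uniform distribution, preserving the row marginals."""
    out = []
    for row in j:
        total = sum(row)
        k = len(row)
        out.append([(1 - alpha) * v + alpha * total / k for v in row])
    return out

mi_before = mutual_information(joint)
mi_after = mutual_information(smooth(joint, alpha=0.5))
```

Since the row marginals are preserved, the drop in mutual information comes entirely from the increased conditional entropy, mirroring how LUGS weakens inter-feature links.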

5. Empirical Studies

In this section, we demonstrate the impact of the LUGS algorithm using Medicare claims records for both descriptive analysis and predictive modeling. The Centers for Medicare and Medicaid Services (CMS) provides several public-use data files (PUFs), such as inpatient


claims, line items, drug events, etc. For our experiment, we use the “BSA Inpatient Claims PUF”2, describing Basic Stand Alone inpatient records. The data file contains seven variables: ID, Gender, Age, DRG (diagnosis-related group code), ICD-9 (procedure code), Length (the length of stay), and Amount (payment), and has 15K rows. The Age, Length, and Amount variables are originally numeric, but CMS has categorized them into five quantiles. Note that data re-coding methods such as k-anonymity and l-diversity cannot be directly compared to LUGS, as their feature granularities are different. To the best of our knowledge, LUGS is the first privacy-safe data synthesizer that adheres to the rigorous privacy metric of “synthetic l-diversity”.

5.1. Sample Path and Marginal Distribution

We first show a sample path of the Gibbs samples. Table 1 illustrates a random sample from the synthetic process, and Figure 2 shows the first 300 Gibbs iterations of one synthetic sample. As can be seen, the Gibbs sample traverses all possible combinations of the feature space. From Figure 3, we observe that the autocorrelations of the Gibbs samples fall below 0.1 after 5 iterations, suggesting that the LUGS samples converge reasonably quickly to a stationary distribution.

Table 1. Gibbs Samples over Iterations

(t)  Gender  Age  DRG  ICD9  DAYS  AMT
1    2       4    147  81    3     4
2    1       1    158  39    1     2
3    2       6    88   45    2     4
...

Figures 4 and 5 show marginal histograms of the ICD-9 procedure and drug codes over different levels of the privacy metric (log l). We can observe the uniform smoothing effect in the LUGS-generated synthetic data: rare data points are amplified.

5.2. K-way Correlation

2 http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Inpatient Claims.html

Figure 2. Gibbs samples over iterations.

Figure 3. Autocorrelation functions of the Gibbs samples.

K-way correlation (K > 2) between categorical variables can be visualized using log-linear analysis (Simkiss et al., 2012), which can be viewed as a multi-way chi-square independence test. For example, if the Age (i), Gender (j), and Amount (k) variables are cross-tabulated, then each cell frequency F_ijk can be fully modeled as follows:

log F_ijk = λ0 + (λi + λj + λk) + (λij + λjk + λik) + λijk

where λ0 is an offset, λi, λj, λk are independent effects, λij, λjk, λik are 2-way interaction terms, and λijk is a 3-way interaction term. However, if the features are independent of each other, then the full description simplifies to:

log F_ijk = λ0 + λi + λj + λk    (3)

Log-linear analysis essentially compares the log-likelihood values of these two models. The Age, Gender, and Amount variables are cross-tabulated, and their Pearson chi-square (χ2) statistics for each l-diversified synthetic dataset are plotted in Figure 6. The l-diversified synthetic datasets show less correlation among the variables.
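The model comparison can be sketched with a likelihood-ratio statistic: compute expected cell counts under the independence model of Equation (3) and accumulate G² = 2 Σ O log(O/E) over the cells. The 2x2x2 table below is hypothetical, and the G² statistic is a stand-in for the Pearson χ² the paper plots (both test the same independence hypothesis).

```python
import math
import itertools

# Hypothetical 2x2x2 cross-tabulation of three categorical features.
F = {cell: count for cell, count in zip(
    itertools.product(range(2), repeat=3),
    [30, 10, 12, 8, 9, 11, 10, 30])}
n = sum(F.values())

def marg(axis, level):
    """One-way marginal count along the given axis."""
    return sum(c for cell, c in F.items() if cell[axis] == level)

# Likelihood-ratio statistic against the independence model
# log F_ijk = l0 + li + lj + lk (expected count = m_i * m_j * m_k / n^2).
G2 = 0.0
for (i, j, k), obs in F.items():
    exp = marg(0, i) * marg(1, j) * marg(2, k) / n**2
    if obs > 0:
        G2 += 2 * obs * math.log(obs / exp)
```

A large G² relative to a chi-square distribution with the appropriate degrees of freedom indicates that interaction terms are needed, i.e. the features are correlated.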


Figure 4. Marginal histogram of the ICD-9 procedure code vs. log l. The full range of procedure codes (top) and the first seven categories (bottom).

5.3. Missing Value Imputation

As a side effect of the l-diversified generative process, the generated synthetic data tend to have fewer missing values. Figure 6 (center) shows the missing value percentages for each synthetic dataset. While the original data (log l = 0.0) exhibit a missing rate of about 8%, the generated synthetic data have less than 6% missing values.

5.4. Purely Artificial Records

LUGS generates “realistic but not real” synthetic data (non-support-region records). The generated samples are “not real” because the l-diversified distributions assign positive probability to every possible record combination. Purely artificial records are records that cannot be found in the original data. Figure 6 (right) illustrates the ratio of these artificial records for each synthetic dataset. We emphasize that even with very small log l values, more than 60% of the synthetic records are purely artificial.

5.5. Effect on Predictive Models

Privacy-preserving synthetic data can be publicly disclosed to answer various types of data mining research questions. If the statistical properties of the original data are well preserved in the LUGS-generated data, then the data mining results from the synthetic data should be substantially identical to those from the original data. We demonstrate illustrative results of simple predictive modeling by arbitrarily setting Amount as the dependent variable:

AMT ∼ Age + Gender + DRG + ICD9

where AMT is the dependent variable y and the right-hand-side variables form the independent variables X.

Dummy coding is applied to the categorical variables such as DRG and ICD9, resulting in 352 independent variables. We apply Lasso regression (Friedman et al., 2010) to deal with this high dimensionality, then measure the mean squared error (MSE) on the original data:

β∗ = arg min_β ‖y_synth − X_synth β‖² + λ‖β‖₁

MSE = ‖y_orig − X_orig β∗‖²

Figure 7 (top) shows the measured MSE values over different λ and log l values. As can be seen, the MSE values increase with log l, but their deviation from the original MSE is negligible compared to the spread of values at each privacy setting log l. We next show a classification example, labeling AMT > 3 as “positive” and AMT ≤ 3 as “negative”, using logistic regression with the Lasso penalty. Figure 7 (bottom) shows the measured area under the receiver operating characteristic curve (AUC). As in the regression example, the classification results on the synthetic data are very robust and consistent with the original.
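The fit-on-synthetic, score-on-original protocol can be sketched as follows. The data here are simulated stand-ins for the claims tables, and ordinary least squares stands in for the paper's Lasso fit (the train/evaluate split across the two datasets is the point being illustrated, not the penalty).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrices: an "original" dataset and a perturbed
# "synthetic" copy standing in for D and S.
n, d = 200, 5
X_orig = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y_orig = X_orig @ beta_true + 0.1 * rng.normal(size=n)
X_synth = X_orig + 0.2 * rng.normal(size=(n, d))
y_synth = y_orig + 0.2 * rng.normal(size=n)

# Fit on the synthetic data (plain least squares as a stand-in for the
# Lasso fit), then score against the original data, as in the paper's
# evaluation protocol: MSE = ||y_orig - X_orig @ beta_star||^2 / n.
beta_star, *_ = np.linalg.lstsq(X_synth, y_synth, rcond=None)
mse = np.mean((y_orig - X_orig @ beta_star) ** 2)
```

If the synthetic data preserve the original covariate relationships, the coefficients fitted on the synthetic copy score nearly as well on the original data as coefficients fitted directly on it.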

Figure 5. Marginal histogram of the drug code vs. log l. The full range of drug codes (top) and the first 40 categories (bottom).

Figure 6. Pearson chi-square statistics from log-linear analysis (left), missing value ratio (center), and percentage of purely artificial entries (right) in the synthetic data, as a function of log l. The 3-way correlation decreases as log l increases.

6. Discussion

We proposed a novel data synthesizer that protects sensitive information by adhering to a prescribed l-diversity privacy metric. The proposed algorithm, LUGS, scales linearly with the size of the data. Furthermore, LUGS is a non-parametric, non-model-based technique; this property ensures that a wide range of data mining algorithms can be applied without accounting for the generative process of the synthetic data. Many health datasets are unavailable because of privacy restrictions such as the HIPAA privacy regulation. We believe the proposed solution offers an alternative way to facilitate collaborative healthcare research efforts.

Figure 7. MSE (top) and AUC (bottom) vs. log l and λ (regularization parameter).

References

Abowd, John M. and Vilhuber, Lars. How protective are synthetic data? Privacy in Statistical Databases, 5262:239–246, 2008.

Broder, Andrei Z. On the resemblance and containment of documents. In Compression and Complexity of Sequences, pp. 21–29, 1997.

Csiszar, Imre and Shields, Paul C. Information Theory and Statistics: A Tutorial. Now Publishers Inc., 2004.

Dwork, Cynthia. Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, volume 4052, pp. 1–12, 2006.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 2010.

Gionis, A., Indyk, P., and Motwani, R. Similarity search in high dimensions via hashing. In Proceedings of the 25th VLDB Conference, 1999.

Grimmett, Geoffrey and Stirzaker, David. Probability and Random Processes, chapter 3.7, p. 67. Oxford, third edition, 2001.

Indyk, P. and Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Symposium on Theory of Computing, 1998.

Kifer, Daniel and Machanavajjhala, Ashwin. A rigorous and customizable framework for privacy. In ACM Symposium on Principles of Database Systems, 2012.

Machanavajjhala, Ashwin, Kifer, Daniel, Gehrke, Johannes, and Venkitasubramaniam, Muthuramakrishnan. l-diversity: Privacy beyond k-anonymity. Transactions on Knowledge Discovery from Data, 1, 2007.

Narayanan, Arvind and Shmatikov, Vitaly. Robust de-anonymization of large sparse datasets. In Proceedings of the IEEE Symposium on Security and Privacy, pp. 111–125, 2008.

Narayanan, Arvind and Shmatikov, Vitaly. De-anonymizing social networks. In IEEE Symposium on Security and Privacy, pp. 173–187, 2009.

Reiter, Jerome P., Raghunathan, Trivellore E., and Kinney, Satkartar K. The importance of modeling the sampling design in multiple imputation for missing data. Survey Methodology, 32:143–149, 2006.

Simkiss, D., Ebrahim, G. J., and Waterson, A. J. R. Chapter 14: Analysing categorical data: Log-linear analysis. Journal of Tropical Pediatrics, 2012.

Sweeney, Latanya. k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10:557–570, October 2002.