PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are...

53
“Secure” Logistic Regression with Distributed Databases Aleksandra B. Slavkovic Department of Statistics, Penn State University [email protected] Joint work with Yuval Nardi, Stephen Fienberg, Alan Karr, and Matthew Tibbits PADM-ICDM, Omaha --- Oct. 28, 2007

Transcript of PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are...

Page 1: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

“Secure” Logistic Regressionwith Distributed Databases

Aleksandra B. SlavkovicDepartment of Statistics, Penn State University

[email protected]

Joint work with Yuval Nardi, Stephen Fienberg, Alan Karr, and Matthew Tibbits

PADM-ICDM, Omaha --- Oct. 28, 2007

Page 2: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

2

Problem Formulation Setting:

Data are collected separately and intimately by several parties (agencies),representing parts of an idealized unified database, D.

Agencies want to perform perform statistical analyses on the pooled(combined) data, without actually combining the data.

Pledges of confidentiality, laws & regulations, proprietary data, business practices

Goals: Share “valid” logistic regression on D, with parameter estimation & model

fit diagnostics but without any actual data integration Using “secure tools” each agency protects its own data from the other agencies Do not reveal identities or sensitive attributes

Agencies: statistical agencies, competing corporations, individual researchers

Our approach: combines ideas from SMC, PPDM, SDL, Record Linkage …

Page 3: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

3

Data protection approaches/communities Not sharing/releasing data Sharing data via restricted access or altering data in some way

Information Security Cryptography

Do NOT answer question what methods/algorithms preserve privacy!

Privacy-Preserving Data Mining (PPDM) Data mining algorithms and their security Aggregating multiple sources into a single warehouse or mining multiple

distributed databases Secure multi-party computation (SMC) Record linkage

Statistical Disclosure Limitation/Control (SDL) Data masking techniques Introduce bias and variance

Page 4: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

4

Statistical Disclosure Limitation (Control) methods Goal: “Valid” multivariate inference as with the original data (utility) vs. disclosure risk Differences & similarities with PPDM methods

Data masking: Transform the original data (matrix X) to the disseminated data (Y) Y=AXB + C A=record transformation, B=attribute transformation, C=noise addition

Traditional approaches Aggregation: Rounding, Topcoding & Tresholding Suppression, e.g., cell suppression Data Perturbations (both input and output) Data Swapping

Modern approaches: Sampling and Simulation techniques Synthetic data Remote access servers (e.g., results of regression, trusted “third” party) Partial information releases (e.g., margins for tabular data/contingency tables) Secure computation

Page 5: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

5

Secure Multi-party Computation (SMC) 1,…,K parties each has its own value x1, …, xk

Goal: compute a known function f(x1,…,xk) such that “Secure” -- Party j only knows xj and the result f(x1,…,xk) No third party involved No actual data sharing Secure summation and matrix manipulation

Assumption: semi-honesty No collusions Parties use correct data only for the agreed computation Retain intermediate results

Statistical Analysis – result itself may violate individual privacy

Ref: Yao(1982) – Yao’s Millionaire Protocol, Goldreich et al.(1987)

Page 6: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

6

“Secure” Maximum Likelihood Estimation

Exponential families

Global log-likelihood

where Dk is Agency’s k database.

Use secure summation on each of L terms

!

log f (",x) = cl (x)dl (")l=1

L

#

!

logL(",x) = dl(")

l=1

L

# cl(x

i)

xi$D

k

#k=1

K

#%

& ' '

(

) * *

Ref: Karr et al. (2006), Fienberg et al.(2006)

Page 7: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

7

Logistic Regression

Let be i.i.d. r.v.’s whose meansdepends on covariates through

where

and is design matrix. Parameter estimates via Newton-Raphson procedure

!

Y = (Y1,Y

2,K ,Yn )

!

" i = E(Yi )

!

xi " # p+1

!

logit(" i ) = xijj= 0

p

# $ j = (X$ )i

!

logit(" ) = log(" /(1# " ))

!

X

!

n " (1+ p )

Page 8: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

8

Partitioned Database Types

n

Vertically Partitioned dataVertically Partitioned data Horizontally Partitioned dataHorizontally Partitioned data

Party:1A

2A

3A

4A

5A

n

11 p+

2p

3p

4p

5p

1n

2n

3n

4n

5n

!=

=5

1k

knn

p+1

1A

2A

3A

4A

5A

Design Matrix X

Page 9: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

9

Logistic Regression Assume K parties

Horizontal Partitioning:

each is an matrix with

Vertical Partitioning:

each is an matrix except for which has columns. All agencies must have a global identifier (e.g., SSN) and first agency has Y.

!

X'= [X

1

', X

2

',K , XK

'], Y'= [Y

1,Y

2,K ,YK ]

!

Xk

!

nk " (1+ p )

!

n = nk

k= 1

K

"

!

[Y X] = [Y X1K X

K]

!

Xk

!

n " pk

!

X1

!

1+ p1

Page 10: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

10

Parameter Estimation: Horizontal Partitioning

Linear Regression

Secure sum

Logistic Regression

Secure Newton-Raphson

!

ˆ " = (X'X)

#1X

'Y

!

X'X = (Xk'

k= 1

K

" Xk )

!

!

ˆ " new = ˆ " old # H#1 ( ˆ " old )$ ( ˆ " old )

!

!

X'Y = (X

k

'

k=1

K

" Yk)

Page 11: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

11

“Secure” Newton-Raphson Algorithm

For the iterative procedure, agencies must either (i) agree on an initial estimate or (ii)compute them via “secure” linear regression.

At each iteration, s, we find a new estimate of ß via secure summation

When the model converged, the agencies can use secure summation to share thevalues of log-likelihood, and Pearson and Deviance statistics.

!

XtW

(s)X =

k=1

K

" (X(k ))t (W(k ))(s)X

(k ) and Xt (Y#µ(s)) = X(k )t (Y(k ) # (µ(k ))(s))

j=1

K

"

where

(W(k ))(s) = Diag(nl(k )

( ˆ $ j(k ))(s)(1# ( ˆ $ j

(k ))(s))), and (µ(k ))(s) = n j

(k )( ˆ $ j(k ))(s)

for agency k, k =1,...,K, j =1,...,n(k ),and for iteration s.

!

ˆ " (s+1) = ˆ " (s) + (XtW

(s)X)

#1X

t(Y#µ(s)

)

W(s) = Diag(n j$ j

(s)(1#$ j

(s))), µ(s) = n j$ j

(s)

Page 12: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

12

Diagnostics -- Goodness of fit Pearson X2 and deviance G2 statistics

Residuals, but increase the risk of disclosure

Model selection: compare log-likelihoods of the larger model and the parsimoniousmodel

!

X2 =

(yk ) j " (nk ) j (# k ) j

(nk ) j (# k ) j (1" (# k ) j )

$

% & &

'

( ) )

j=1

nk

*k=1

K

*2

G2 = 2 (yk ) j log

(yk ) j

( ˆ µ k ) j

$

% & &

'

( ) ) + ((nk ) j " (yk ) j )log

(nk ) j " (yk ) j

(nk ) j " ( ˆ µ k ) j

$

% & &

'

( ) )

+

, - -

.

/ 0 0 j=1

nk

*k=1

K

*

!

(yk ) j log(" k ) j + (1# (yk ) j (1# log(" k ) j )){ }j=1

nk

$k=1

K

$

Page 13: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

13

Parameter Estimation: Parameter Estimation: Vertical Partitioning

Linear RegressionLinear Regression Off-diagonal blocks of :

Secure matrix product (Karr et al.(2007)) -- requires QRdecomposition to mitigate leakage.

Logistic RegressionLogistic Regression Additivity across parties

Secure Newton-Raphson (secure sum, secure matrix product)

!

X

!

Xk'Xl , k " l

!

!

logit(" i ) = (Xk#kk= 1

K

$ )i

!

Page 14: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

14

Parameter Estimation: Parameter Estimation: Vertical Partitioning

Consider two agencies

Log-likelihood up to an additive constant & initial values:

and

Secure Newton-Raphson (secure sum, secure matrix product)

Where Hl has sub-block matrices

!

Let K = 2, X = [U,V ] and " = [#,$]

Un%1+ p1,Vn% p2

and p1

+ p2

= p

!

!

l(") = yt ( Xk

k=1

K

# "k ) $ log 1+ exp (Xk

k=1

K

# "k )i

% & '

( ) *

+

, -

.

/ 0

i=1

n

#

!

ˆ " new

= ˆ " old#H

l

#1( ˆ "

old)$

l( ˆ "

old)

!

l"" = #UtW$U l"% = #V t

W$U

l%" = #UtW$V l%% = #V t

W$V

!

" (0) = (#U

(o),$

V

(o))

Page 15: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

15

Vertically Partitioned, Partially Overlapping Data

More complex partitioning Assumptions:

Semi-honest model Parties share unique identifiers (SSN) Record matching is exact No overlapping of variables between

parties

! Missing data framework

--------++++3001-4000

++++++++----2001-3000

++++----++++1001-2000

----++++++++1000

----++++++++…

----++++++++1

X5 X6X3 X4X1 X2n

Party CParty BParty A

Page 16: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

16

Vertically Partitioned, Partially Overlapping DataVertically Partitioned, Partially Overlapping Data

As before,

but now each may contain missing valuesmissing values

We write: and

][ 1 KXXX K=

kX

),( iim

i

o

iixxx =

))(,),((

))(,),((

1

1

K

m

i

m

i

m

i

K

o

i

o

i

o

i

AxAxx

AxAxx

iii

iii

K

K

=

=

Page 17: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

17

Two cases: Continuous : Gaussian Mixture Model (GMM)Gaussian Mixture Model (GMM) Categorical : Multinomial Mixture Model (MMM)Multinomial Mixture Model (MMM)

Two steps: (1) estimate GMM(MMM) parameters, (2) Two steps: (1) estimate GMM(MMM) parameters, (2) estimate logistic regressionestimate logistic regressionparametersparameters

(step1) Secure (double) EM algorithm(step1) Secure (double) EM algorithm ““doubledouble””: the GMM (MMM) parameters: the GMM (MMM) parameters

thethe

Parameters are updated securely using SMC protocols (secure sum, secure matrixproduct). Parties only share summary statistics (or sufficient statistics).

(step2) Naïve way: upon convergence, parties may impute their missing values, andthen follow the vertical case to find the logistic parameter in a secure way.

!

xi

!

xi

!

!

xi

mi

Vertically Partitioned, Partially Overlapping DataVertically Partitioned, Partially Overlapping Data

Page 18: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

18

Simulation Study

n=100, p=6, 3 parties (2 attributes each) GMM :

!

f (xi ) = "1# $ (xi ;µ1 , I6 ) + " 2 # $ (xi ;µ2 , I6 )

--------++++61-80

++++++++++++81-100

++++++++----41-60

++++----++++21-40

----++++++++1-20

Party CParty BParty A

!

"1

= 0.25, "2

= 0.75, µ2

= #µ1

= (2,K , 2)

Page 19: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

19

Simulation Study --

Initialization: Secure EM parameters:

-------------------------------------------------------------------- Initialization: Secure EM parameters:

!

" = (0.5 , 0.5), µ2

= #µ1

= (2,K , 2)

!

ˆ " 1

= (0.2700, 0.7300)

ˆ µ 1

= #(1.9208, 1.7280, 1.4835, 2.0116, 1.9428, 1.7600)

ˆ µ 2

= (2.1434, 1.9317, 1.9981, 1.9186, 2.1776, 1.9769)

!

" = (0.1 , 0.9), µ2

= #µ1

= (1,K ,1)

!

ˆ " 1

= (0.2401, 0.7599)

ˆ µ 1

= #(1.9260, 1.9113, 1.8845, 2.0219, 1.8883, 1.8353)

ˆ µ 2

= (1.8479, 1.7490, 1.8608, 2.0578, 2.1636, 1.8924)

!

"1

= 0.25, "2

= 0.75, µ2

= #µ1

= (2,K , 2)

Page 20: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

20

Simulation Study

n=1,000, p=45 Linear increase

Still needlogistic regressioncoefficients. QR decomposition Expect O(N2)

Page 21: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

21

Issues and ongoing work Privacy issues

How to choose a model a priori & agree on initial estimates? Repeated cycles of calculations may produce leakage despite secure computation For pure vertical case there is increased risk of attribute disclosure

Need to consider the (most) general case Overlapping observations for variables by individuals

How to reconcile measurement errors? Secure record linkage (“name” or other forms of matching)

Missing data again!

Additional statistical issues we are currently considering Diagnostics – Goodness of Fit Model Selection Logistic vs. Log-linear models in categorical predictor case. Serious trade-offs in simple horizontal partitioned case.

Page 22: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

22

Issues and ongoing work Combine more ideas across Communities.

Rigorous definition of Privacy.

Quantification of privacy breach and leakage of information.

Related papers & references: Fienberg et al. (2006) In Privacy in Statistical Databases. LNCS, Springer-Verlag. Fienberg et al. (2007) Bull. Int. Statist. Inst.. http://www.stat.psu.edu/~sesa http://www.niss.org/dgii/techreports.html

Acknowledgments Joint work with Yuval Nardi, Stephen E. Fienberg, Alan Karr, & Matt Tibbits NSF grants

EIA9876619, IIS0131884 to the NISS SES-0532407 to the Department of Statistics, Penn State

Army contract: DAAD19-02-1-3-0389 to CyLab at CMU

Page 23: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of
Page 24: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

24

Acknowledgments

Joint work with Yuval Nardi, Stephen E. Fienberg, Alan Karr Computational support: Matt Tibbits

NSF grants EIA9876619, IIS0131884 to the NISS SES-0532407 to the Department of Statistics, Penn State

Army contract: DAAD19-02-1-3-0389 to CyLab at CMU

Page 25: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

25

Parameters are updated securely using SMC protocols(secure sum, secure matrix product).

Parties only share summary statistics (or sufficientstatistics).

Upon convergence, parties may impute their missingvalues, and then follow the vertical case to find thelogistic parameter in a secure way.

Vertically Partitioned, Partially Overlapping DataVertically Partitioned, Partially Overlapping Data

Page 26: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

26

Privacy problem: Repeated cycles of calculations produce leakage.

Information Leakage Despite Secure Computation

Assume database yields average heights of woman ofdifferent nationalities.

Adversary who has access to database and auxiliaryinformation, Aleksandra is one inch shorter than the averageSerbian woman, learns Aleksandra’s height.

Anyone learning only the auxiliary information, withoutaccess to the average heights, learns relatively little.

Issues and ongoing work

Page 27: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

27

Why do we care about privacy & confidentiality?

Expansion of government and company databases & growing use of web and mobiledevices had led to increase of sensitive information in the public domain.

“Engaging Privacy and Information Technology in a Digital Age” -- Committee onPrivacy in the Information Age, NRC (2007) http://nap.edu/catalog/11896.html Why are privacy issues often so intractable? Different aspects:

Societal Technological Legal Policy/political

Different drivers & contexts: Economic Medical Public safety Research

Page 28: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

28

Why do we care about privacy & confidentiality? Context dependent definitions of privacy: Consumer &

Workplace Privacy, Internet/ Computer Security &Privacy, Medical Privacy, Genetics, Surveillance &Wiretapping, and “Homeland” security

Ethical: Keeping promises

Legal: Required under law Privacy is the right to keep one’s personal

information out of the public view Confidentiality is the dissemination without

public identification

Pragmatic: respondents may not provide data respondents may provide inaccurate data

Ref: http://www.amstat.org/comm/cmtepc/index.cfm?fuseaction=main

Page 29: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

29

Privacy & confidentiality in statistical databases

Users

Researchers,Policy analysts,Decision makers,Respondents …

DISSEMINATIONAgency/Servers

Respondents/Individuals/Organizations

DataSnoopers

Agencies seek to release confidential data to thepublic, and to improve analyses by sharing theirconfidential data, but via strategies that

do not reveal identities or sensitive attributes, are useful for a wide range of analyses, are easy for analysts and agencies to use

Data Utility: Are released data useful for statisticalinference?

Risk of Disclosure: Are the data or results of theiranalysis safe to be released?

(QUERIES)

Page 30: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

30

PPDM - Three Approaches

Data Perturbation Input perturbation

Change data before processing with PPDM algorithms Perturb original data Data swapping Adding noise

Output perturbation Summary statistics with noise

‘Trusted’ Third Party

Secure Multi-party Computation (SMC)

Page 31: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

31

(But) What Is Privacy?

Not yet fully defined in our work. We can take differentperspectives: Individual (Personal)Privacy

Fear that data will be misused Measures for identity disclosure (SDL) Dalenius (1977), Differential Privacy (Dwork,2006)

Corporate (Owner) Privacy We want both!

R – U confidentiality map (Duncan et al., 2001) Metrics / Measures (Clifton et al., Agrawal, Cox, . . .)

Page 32: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

32

Defining Privacy

Dalenius (1977): Access to database should not enable one tolearn anything about an individual that couldn’t be learnedwithout access. Unachievable because of the auxiliary information available

to an adversary. Aleksandra is 1in taller than an average American woman.

Dwork (2006): Differential Privacy Substitute: Risk to one’s privacy should not substantially

increase as a result of participating in a statistical database. ε- differential privacy

Page 33: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

33

SMC - Secure Sum Goal: compute f(x1,…,xk) = ∑xk

Algorithm: Agency 1 (A1): Adds random noise e to x1, then passes x1 + e to A2

Agency 2 (A2): Adds x2, and passes x1+x2+e to A3

… Agency k (Ak): Adds xk, and passes x1+x2+…+ xk+e back to A1

A1 subtracts e from ∑ xk + e , and shares the result with ALL otheragencies.

Adversary Models: Semi-honest model – parties follow the protocol, but may record messages

and computations and use these to learn more. Malicious model – allows parties to deviate from the protocol in arbitrary

ways: lie, refuse to participate at any point in the protocol, . . .

Page 34: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

34

Partitioned Database Types

--------++++3001-4000

++++++++----2001-3000

++++----++++1001-2000

----++++++++1000

----++++++++…

----++++++++1

X5 X6X3 X4X1 X2n

Party CParty BParty A

AAABBB5

AAABBB4

AAABBB3

BBBAAA2

BBBAAA1

X4 X5 X6X1 X2 X3n

Party AParty B

Complex partitioning

Vertically Partitioned, Partially Overlapping dataVertically Partitioned, Partially Overlapping data

Page 35: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

35

Log-linear models Classical tool for analysis of multivariate discrete data By analogy with ANOVA, for a IxJxK table with observed counts {nijk}, and

expected counts {mijk}under some sampling model the saturated model is

Get unsaturated model by setting any u-term to zero No-second order interaction: u123=0, for all i,j,k

Key results: MSS are the marginal totals corresponding to the highest order u-terms Goodness-of-fit of a model:

!

logmijk = u + u1(i) + u2( j ) + u3(k ) + u12( ij ) + u13( ik ) + u23( jk ) + u123( ijk )

where each subscripted u - term

u123( ijk ) =i

" u123( ijk ) =j

" u123( ijk ) =k

" 0

!

"G2

= 2(ˆ l sat# ˆ l

M)

Page 36: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

36

Secure MLE for Log-linear models

Assume multinomial sampling, the kernel of log-likelihood:

Discrete exponential family, so the MSS are n-terms adjacent tou-terms that can be used to for MLE estimates of {mijk}

Agencies only need to securely pass the MSS, i.e., the relevantmarginal tables.

!

nijk log(mijk )i, j ,k

" = Nu + ni++u1( i)i

" + n+ j+u2( j )j

" + n++ku3(k )k

"

+ nij+u12(ij )i, j

" ni+ku13( ik ) + n+ jku23( jk )j,k

" + nijku123(ijk )i, j ,k

"i,k

"

Page 37: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

37

Secure Logistic regression Via log-linear modeling

differentiating the log expectations at two levels of the response variable

No-second order interaction log-linear model equivalent to no-interaction logisticregression model 2x2x2 example where variable 3 is a binary response, and X1 and X2 predictors

Via secure summation, agencies share the three marginal totals, {n++k}, {ni+k}, {n+jk} Each agency can fit the model and produce the model fit and diagnostics!

logitij = log(mij1

mij2

) = log(mij1) " log(mij2)

= 2u3(1) + 2u13(i1) + 2u23( j1)

= #0 + #1x1 + #2x2

Page 38: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

38

Secure Logistic regression Setting: Y=response, X=predictors, πj is the probability of “success” at the jth

observation, j=1, … N.

Fit the logistic regression which requires iterative methodto estimate the coefficients

Algorithm: Agencies must agree on an initial estimate for the iterative procedure At each iteration, s, we find a new estimate of ß via secure summation When the model converged, the agencies can use secure summation to share the

values of log-likelihood, and Pearson and Deviance statistics.

More computationally intensive than via log-linear model

!

log"

1#"

$

% &

'

( ) = X*

!

ˆ " (s+1) = ˆ " (s) + (XtW

(s)X)

#1X

t(Y#µ(s)

)

W(s) = Diag(n j$ j

(s)(1#$ j

(s))), µ(s) = n j$ j

(s)

Page 39: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

39

Example: Clinical trial data (Koch (1983))

Effectiveness of an analgesic drug measured at two different centers, with twotreatments (1=Active, 2=Placebo), and responses (1=Poor, 2=Not Poor).

Two centers would like to fit a logistic regression model with no interaction term overthe “integrated” dataset, but are unwilling to share their individual cell values.

18622

26312

221121

25311

21TreatmentStatus

ResponseAgency 1

12622

13312

101121

121211

21TreatmentStatus

ResponseAgency 2

Page 40: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

40

Example: Clinical trial data (Koch (1983))

Effectiveness of an analgesic drug measured at two different centers, with twotreatments (1=Active, 2=Placebo), and responses (1=Poor, 2=Not Poor).

Two centers would like to fit a logistic regression model with no interaction term overthe “integrated” dataset, but are unwilling to share their individual cell values.

They do NOT see the “integrated” data but only 12 marginal totals, i.e., MSSs,n11+=52, n1+1=37, n+11=21, etc…

18622

26312

221121

25311

21TreatmentStatus

ResponseAgency 1

12622

13312

101121

121211

21TreatmentStatus

ResponseAgency 2

301222

39612

222221

371511

21TreatmentStatus

Response“Integrated”

Page 41: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

41

Example: Clinical trial data (Koch (1983))

Estimated odds ratios nearly identical Computational difference

Log-linear model needs only one round of secure summation

Logistic RegressionLog-linear model

!

logˆ m 111

ˆ m 112

" # $

% & '

= (0.989228

logˆ m 121

ˆ m 122

" # $

% & '

= (0.305730

logˆ m 211

ˆ m 212

" # $

% & '

= (1.707879

logˆ m 221

ˆ m 222

" # $

% & '

= (1.0243181

!

logˆ " 11

1# ˆ " 11

$ % &

' ( )

= #0.989230

logˆ " 12

1# ˆ " 12

$ % &

' ( )

= #0.305717

logˆ " 21

1# ˆ " 21

$ % &

' ( )

= #1.707895

logˆ " 22

1# ˆ " 22

$ % &

' ( )

= #1.024382

Page 42: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

42

Work at NISS

Secure Computation System Data Integration Linear regression Contingency tables

www.niss.org/dgii/techreports.html A.F. Karr, W. J. Fulp, X. Lin, J.P. Reiter, F. Vera, and S.S.Young

(2005). “Secure Privacy Preserving Analysis of Distributed databases”Technometrics (under review)

Our focus on fully categorical data via log-linear models and logisticregression

Page 43: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

43

Recent methods: Simulation & Sampling Digital Governemnt Project I & II at NISS World “without original micordata”

Synthetic & Partially synthetic data uses Bayesian methodology Raghunathan, Reiter, and Rubin (2003, JOS ) Reiter (2003, Surv. Meth.; 2005, JRSSA)

Partial data releases for tabular data & logistic regression with algebraic tools Dobra et al. (2003) Slavkovic (2004), Fienberg & Slavkovic (2005), Fienberg at al. (2006)

Remote access servers Rowland (2003, NAS Panel on Data Access). Gomatam, Karr, Reiter, Sanil (2005, Stat. Science)

Secure computation Benaloh (1987, CRYPTO86 ) Karr, Lin, Sanil, and Reiter (2005, NISS tech. rep.)

Page 44: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

44

The Problem

Consider a distributed database, where parties possess differentparts but want to share results of computation.

We want to combine ideas from: Secure multi-party computation. Statistical analyses. Record linkage. Confidentiality/privacy protection.

Not privacy-preserving datamining: PPMD has focused on “simple” statistical calulations and ignored

the effect on individual privacy. Devil is in the details! “Partially protecting privacy” is an oxymoron!

Page 45: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

45

Consequence for SMPC

Some things simply cannot be learned while completely protectingprivacy. e.g., learning average net worth of small set of individuals may

reveal that at least one has very high net worth. A little extrainformation may allow adversary to identify her.

Secure function evaluation doesn’t address these difficulties. Does address which information is safe to release.

Assumes that certain facts are, by fiat, going to be released, e.g.,average income by block.

No information that can’t be inferred from these quantitieswill be leaked.

Page 46: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

46

ε-Differentiable Privacy

Definition: A randomized function κ gives ε-differentiableprivacy if for all data sets D1 and D2 differing on at most oneelement, and all S ⊆ Range(κ),

Pr[κ(D1) ∈ S] ≤ exp(ε)×Pr[κ(D2) ∈ S]. Leads to additive noise and limits on numbers of queries as

privacy protection mechanisms:

See work of Dwork and collaborators. Statistical alternative: Risk-utility trade-off.

Page 47: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

MaximumTolerable

Risk

Original Data

No Data

Released Data

Dis

clo

sure

Ris

k

Data Utility

R-U Confidentiality Map

(Duncan, et al. 2001, 2004)

Page 48: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

48

Record linkage: attempt to determine whether pairs of data recordsdescribe the same entity, i.e., find record pairs that are co-referent: Entities: usually people (or organizations or…). Data records: names, addresses, job titles, birthdays, …

PPDM and SMPC methods presume unique identifiers or thatentities can be linked through ordering or some othermechanism.

In our problem, identifiers are withheld and even if they werepassed to trusted 3rd party for labeling, we would link with error!

Record Linkage

Page 49: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

49

Partitioned Database Types

Horizontally Partitioned Agencies have different records but same variables.

Vertically Partitioned Agencies have same records but different variables.

Partially Overlapping, Vertically Partitioned Agencies have different records and different variables,

with some common records and variables.

Page 50: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

50

The Problem

Data are collected separately and intimately by several parties(agencies), representing parts of an idealized unified database, D.

Goal: To perform statistical analyses on the pooled (combined)data, without actually combining the data.

Tools: Secure analysis – parties learn ‘nothing’ about the otherparty’s data.

Why not ‘integrate’ the data themselves? Pledges of confidentiality, laws and regulations, business practices

Page 51: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

51

ACLU Pizza Ad

American Civil Liberties Union Ad: http://www.aclu.org/pizza/

Page 52: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

52

2006 AOL Data Release Fiasco

AOL search results released for academic research (www.aolpsycho.com)

Page 53: PADM-ICDM, Omaha --- Oct. 28, 2007 - ODUmukka/cs795dm/Lecturenotes/Day6/ICDM… · Data are collected separately and intimately by several parties (agencies), representing parts of

53

Overview

Background & Motivation Privacy & its breaches: myth or reality? Privacy & confidentiality in statistical databases Overview of SDL & PPDM

“Secure” Logistic Regression method Overview Simulated example Issues & future directions