
Page 1: Privacy-Preserving Data Mining, Jaideep Vaidya (jsvaidya@rbs.rutgers.edu), joint work with Chris Clifton (Purdue University)

Privacy-Preserving Data Mining

Jaideep Vaidya (jsvaidya@rbs.rutgers.edu)

Joint work with

Chris Clifton (Purdue University)

Page 2

Outline

• Introduction
  – Privacy-Preserving Data Mining
  – Horizontal / Vertical Partitioning of Data
  – Secure Multi-party Computation

• Privacy-Preserving Outlier Detection

• Privacy-Preserving Association Rule Mining

• Conclusion

Page 3

[Figure: panels titled “Back in the good ol’ days”, “Now”, and “Future”; labels shown: Safeway, Dominick’s, Jewel]

Page 4

A “real” example

• Ford / Firestone
  – Individual databases
  – Possible to join both databases (find corresponding transactions)
  – Commercial reasons to not share data
  – Valuable corporate information: cost structures / business structures

• Ford Explorers with Firestone tires → Tread Separation Problems (Accidents!)

• Might have been able to figure out a bit earlier (Tires from Decatur, Ill. Plant, certain situations)

Page 5

Public (mis)Perception of Data Mining: Attack on Privacy

• Fears of loss of privacy constrain data mining
  – Protests over a National Registry
    • In Japan
  – Data Mining Moratorium Act
    • Would stop all data mining R&D by DoD
• Terrorism Information Awareness ended
  – Data Mining could be key technology

Page 6

Is Data Mining a Threat?

• Data Mining summarizes data
  – (Possible?) exception: Anomaly / Outlier detection
• Summaries aren’t private
  – Or are they?
  – Does generating them raise issues?
• Data mining can be a privacy solution
  – Data mining enables safe use of private data

Page 7

Privacy-Preserving Data Mining

• How can we mine data if we cannot see it?
• Perturbation
  – Agrawal & Srikant, Evfimievski et al.
  – Extremely scalable, approximate results
  – Debate about security properties
• Cryptographic
  – Lindell & Pinkas, Vaidya & Clifton
  – Completely accurate, completely secure (tight bound on disclosure), appropriate for small number of parties
• Condensation/Hybrid

Page 8

Assumptions

• Data distributed
  – Each data set held by source authorized to see it
  – Nobody is allowed to see aggregate data
• Knowing all data about an individual violates privacy
• Data holders don’t want to disclose data
  – Won’t collude to violate privacy

Page 9

Gold Standard: Trusted Third Party

Page 10

Horizontal Partitioning of Data

Data holders: Bank of America, Chase Manhattan

CC#    Active?   Delinquent?   Amount
123    Yes       Yes           <$300
324    No        No            $300-500
919    Yes       No            >$1000
3450   Yes       Yes           <$300
4127   No        No            $300-500
8772   Yes       No            >$1000

Page 11

Vertical Partitioning of Data

Medical Records
TID   Brain Tumor?   Diabetes?
RPJ   Yes            Diabetic
CAC   No Tumor       No
PTR   No Tumor       Diabetic

Cell Phone Data
TID   Model   Battery
RPJ   5210    Li/Ion
CAC   none    none
PTR   3650    NiCd

Global Database View: TID, Brain Tumor?, Diabetes?, Model, Battery

Page 12

Secure Multi-Party Computation (SMC)

• Given a function $f$ and $n$ inputs $x_1, x_2, \ldots, x_n$, distributed at $n$ sites, compute the result $y = f(x_1, x_2, \ldots, x_n)$ while revealing nothing to any site except its own input(s) and the result.
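As a concrete instance of the definition, here is a minimal sketch for one simple choice of f, the sum of the inputs, using additive secret sharing. The modulus, the function names, and the single-process simulation of the n sites are assumptions made for illustration, and the sketch only makes sense in a semi-honest, non-colluding setting.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus, larger than any possible sum


def split_into_shares(value, n_parties, modulus=MODULUS):
    """Split value into n additive shares that sum to value mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares


def secure_sum(private_inputs, modulus=MODULUS):
    """Simulate n sites computing f(x_1..x_n) = sum(x_i) from shares only."""
    n = len(private_inputs)
    # Site i sends its j-th share to site j; a single share looks random.
    all_shares = [split_into_shares(x, n, modulus) for x in private_inputs]
    # Each site adds the shares it received and publishes only that total.
    partial_totals = [sum(all_shares[i][j] for i in range(n)) % modulus
                      for j in range(n)]
    return sum(partial_totals) % modulus


result = secure_sum([12, 7, 30])
```

Each site sees only random-looking shares and the published partial totals, which together determine nothing beyond the final sum.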

Page 13

Secure Multi-Party Computation: It can be done!

• Yao’s Millionaires’ problem (Yao ’86)
  – Secure computation possible if function can be represented as a circuit
  – Idea: Securely compute gate
    • Continue to evaluate circuit
• Extended to multiple parties (BGW/GMW ’87)
• Biggest Problem: Efficiency
  – Will not work for lots of parties / large quantities of data

Page 14

SMC – Models of Computation

• Semi-honest Model
  – Parties follow the protocol faithfully
• Malicious Model
  – Anything goes!
• Provably Secure
• In either case, input can always be modified

Page 15

What is an Outlier?

• An object O in a dataset T is a DB(p,dt)-outlier if at least fraction p of the objects in T lie at distance greater than dt from O
• Centralized solution from Knorr and Ng
  – Nested loop comparison
  – Maintain count of objects inside threshold
  – If count exceeds threshold, declare non-outlier and move to next
• Clever processing order minimizes I/O cost
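The nested-loop idea can be sketched as follows. This is a plain in-memory illustration of the DB(p,dt) definition with early termination, not the I/O-optimized algorithm of Knorr and Ng; the function name and the convention that an object counts as close to itself are assumptions.

```python
import math


def db_outliers(points, p, dt):
    """Return indices of DB(p, dt)-outliers using a nested-loop scan."""
    n = len(points)
    # At least fraction p must be farther than dt, so at most
    # n * (1 - p) objects (the object itself included) may be within dt.
    max_close = n * (1 - p)
    outliers = []
    for i, candidate in enumerate(points):
        close = 0
        is_outlier = True
        for other in points:
            if math.dist(candidate, other) <= dt:
                close += 1
                if close > max_close:
                    # Too many neighbours: non-outlier, move to next object.
                    is_outlier = False
                    break
        if is_outlier:
            outliers.append(i)
    return outliers


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.1, 0.1), (5.0, 5.0)]
found = db_outliers(points, p=0.7, dt=1.0)
```

The early `break` is what makes the centralized version fast. It is also the step a privacy-preserving variant cannot copy directly, because stopping early reveals which objects have many neighbours.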

Page 16

Privacy-Preserving Solution

• Key idea: share splitting
  – Computations leave results (randomly) split between parties
  – Only outcome is if the count of points within distance threshold exceeds outlier threshold
• Requires pairwise comparison of all points
  – But failure to compare all points reveals information about non-outliers
    • This alone makes it possible to cluster points
    • This is a privacy violation
  – Asymptotically equivalent to Knorr & Ng

Page 17

Solution: Horizontal Partition

• Compare locally with your own points
• For remote points, get random share of distance
  – Calculate random share of “exceeds threshold or doesn’t”
• Sum shares and test if enough “close” points

Page 18

Random share of distance

$$\mathrm{Distance}^2(X, Y) = \sum_{r=1}^{m} (x_r - y_r)^2$$

$$= x_1^2 - 2 x_1 y_1 + y_1^2 + \cdots$$

$$= \sum_{r=1}^{m} x_r^2 + \sum_{r=1}^{m} y_r^2 - 2 \sum_{r=1}^{m} x_r y_r$$

• The $x_r^2$ and $y_r^2$ terms are local; the sum of $x_r y_r$ is a scalar product
  – Several protocols for share-splitting scalar product (Du & Atallah ’01; Vaidya & Clifton ’02; Ioannidis, Grama, Atallah ’02)
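The decomposition can be checked numerically. In the sketch below the secure scalar product protocol is replaced by a stand-in that computes the dot product in the clear and then splits it into random additive shares. That stand-in, the modulus, and all names are assumptions for illustration; none of the cited protocols is implemented here.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus for the additive shares


def simulated_scalar_product_shares(x, y, modulus=MODULUS):
    """Stand-in for a secure scalar product protocol (hypothetical helper).

    Returns (a, b) with a + b = x.y (mod modulus). The dot product is
    computed in the clear here, which a real protocol would never do.
    """
    dot = sum(xr * yr for xr, yr in zip(x, y))
    a = secrets.randbelow(modulus)
    return a, (dot - a) % modulus


def distance_shares(x, y, modulus=MODULUS):
    """Each party combines its local squared terms with its product share."""
    a, b = simulated_scalar_product_shares(x, y, modulus)
    share_x = (sum(xr * xr for xr in x) - 2 * a) % modulus  # held by X's owner
    share_y = (sum(yr * yr for yr in y) - 2 * b) % modulus  # held by Y's owner
    return share_x, share_y


x = [3, 1, 4]
y = [1, 5, 9]
share_x, share_y = distance_shares(x, y)
squared_distance = (share_x + share_y) % MODULUS
```

Neither share on its own says anything about the distance; only their sum modulo the public modulus equals the squared distance.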

Page 19

Shares of “Within Threshold”

• Goal: is x + y ≤ dt ?
• Essentially Yao’s Millionaires’ problem (Yao ’86)
  – Represent function to be computed as circuit
  – Cryptographic protocol gives random shares of each wire
• Solves “sum of shares from within dt exceeds minimum” as well
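To show what the comparison step hands back, the sketch below simulates only its inputs and outputs: two parties hold shares x and y, and each receives a random share of the bit "x + y ≤ dt". The comparison itself runs in the clear inside a stand-in function, so this is not a secure protocol; the helper names, the modulus, and the sample values are assumptions.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus for additive shares


def split(value, modulus=MODULUS):
    """Split value into two additive shares modulo modulus."""
    r = secrets.randbelow(modulus)
    return r, (value - r) % modulus


def simulated_within_threshold(x_share, y_share, dt, modulus=MODULUS):
    """Stand-in for the secure comparison (hypothetical helper).

    Returns shares of 1 if x_share + y_share <= dt, else shares of 0.
    A circuit-based protocol would yield such shares without ever
    reconstructing the sum as is done here.
    """
    bit = 1 if (x_share + y_share) % modulus <= dt else 0
    return split(bit, modulus)


def simulated_count_exceeds(bit_shares, min_close, modulus=MODULUS):
    """Second comparison: does the number of close points exceed min_close?"""
    left = sum(a for a, _ in bit_shares) % modulus   # summed locally, party 1
    right = sum(b for _, b in bit_shares) % modulus  # summed locally, party 2
    return (left + right) % modulus > min_close


distances = [4, 100, 9]  # illustrative values to compare against dt
dt = 25
bit_shares = [simulated_within_threshold(*split(d), dt) for d in distances]
enough_close = simulated_count_exceeds(bit_shares, min_close=1)
```

Both comparisons have the same shape (is the sum of two shares above or below a bound), which is why one circuit design covers both.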

Page 20

Vertically Partitioned Data

• Each party computes its part of distance
  – $\mathrm{Distance}^2(X, Y) = \sum_{r=1}^{m} (x_r - y_r)^2 = \sum_{r=1}^{a} (x_r - y_r)^2 + \sum_{r=a+1}^{m} (x_r - y_r)^2$
• Secure comparison (circuit evaluation) gives each party shares of 1/0 (close/not)
• Sum and compare as with horizontal partitioning
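A short sketch of the split, assuming party 1 holds the first a attributes of every object and party 2 holds the rest; the variable names and sample values are made up for illustration.

```python
def partial_squared_distance(x_part, y_part):
    """Squared distance over the attributes one party holds."""
    return sum((xr - yr) ** 2 for xr, yr in zip(x_part, y_part))


x = [3, 1, 4, 1]
y = [1, 5, 9, 2]
a = 2  # party 1 holds attributes 1..a, party 2 holds a+1..m

part_1 = partial_squared_distance(x[:a], y[:a])  # computed locally by party 1
part_2 = partial_squared_distance(x[a:], y[a:])  # computed locally by party 2

# Added in the clear here only to check the identity; in the protocol the
# two parts would be private inputs to the secure comparison.
total = part_1 + part_2
```

The two partial sums already behave like the shares of the horizontal case, so no scalar product step is needed.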

Page 21

Why is this Secure?

• Random shares indistinguishable from random values
  – Contain no knowledge in isolation
  – Assuming no collusion, so shares viewed in isolation
• Number of values (= number of shares) known
  – Nothing new revealed
• Too few close points is outlier definition
  – This is the desired result
• No knowledge that can’t be discovered from one’s own input and the result!

Page 22

Conclusion (Outlier Detection)

• Outlier detection feasible without revealing anything but the outliers
  – Possibly expensive (quadratic)
  – But more efficient solution for this definition of outlier inherently reveals potential privacy-violating information
• Key: Privacy of non-outliers preserved
  – Reason why outliers are outliers also hidden
• Allows search for “unusual” entities without disclosing private information about entities

Page 23

Association Rules

• Association rules are a common data mining task
  – Find A, B, C such that AB ⇒ C holds frequently (e.g. Diapers ⇒ Beer)
• Fast algorithms for centralized and distributed computation
  – Basic idea: For AB ⇒ C to be frequent, AB, AC, and BC must all be frequent
  – Require sharing data
• Secure Multiparty Computation too expensive

Page 24

Association Rule Mining

• Find out if itemset {A1, B1} is frequent (i.e., if support of {A1, B1} ≥ k)
• Support of itemset is defined as number of transactions in which all attributes of the itemset are present
• For binary data, support = |A_i ∧ B_i|.

A              B
Key  A1        Key  B1
k1   1         k1   0
k2   0         k2   1
k3   0         k3   0
k4   1         k4   1
k5   1         k5   1

Page 25

Association Rule Mining

• Idea based on TID-list representation of data
  – Represent attribute A as TID-list Atid
  – Support of ABC is | Atid ∩ Btid ∩ Ctid |
• Use a secure protocol to find size of set intersection to find candidate sets
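The TID-list view can be illustrated with the A1 and B1 columns from the key tables on the previous slide. The sketch computes everything in the clear to show that the set intersection size equals the support from the binary columns; the helper name is an assumption, and the secure protocol for the intersection is not shown.

```python
def tid_list(keys, column):
    """Return the set of transaction keys where the attribute is 1."""
    return {key for key, bit in zip(keys, column) if bit == 1}


keys = ["k1", "k2", "k3", "k4", "k5"]
a1 = [1, 0, 0, 1, 1]  # column held by party A
b1 = [0, 1, 0, 1, 1]  # column held by party B

a_tid = tid_list(keys, a1)
b_tid = tid_list(keys, b1)

support_from_tids = len(a_tid & b_tid)
support_from_bits = sum(x & y for x, y in zip(a1, b1))
```

With a support threshold k of 2 the itemset {A1, B1} would count as frequent, since transactions k4 and k5 contain both attributes.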

Page 26

Cardinality of Set Intersection

• Use a secure commutative hash function
• Pohlig-Hellman Encryption
• Each party generates own encryption key
• All parties encrypt all the input sets
• Commutativity: encrypting a set X under all parties’ keys gives the same result whatever the order of encryption, e.g. $E_i(E_j(X)) = E_j(E_i(X))$
• Result = # common objects in all sets

Page 27

Cardinality of Set Intersection

• Hashing
  – All parties hash all sets with their key
• Initial intersection
  – Each party finds intersection of all sets (except its own)
• Final intersection
  – Parties exchange the final intersection set, and compute the intersection of all sets

Page 28

Computing Size of Intersection

[Figure: three parties arranged in a ring, each encrypting the sets it receives and passing them on]

• Party 1 holds X: α, λ, σ, β
• Party 2 holds Y: λ, σ, φ, υ, β
• Party 3 holds Z: α, β, κ, λ, γ
• First encryption: E1(X), E2(Y), E3(Z)
• Second encryption: E3(E1(X)), E1(E2(Y)), E2(E3(Z))
• Third encryption: E2(E3(E1(X))), E3(E1(E2(Y))), E1(E2(E3(Z)))
• Pairwise intersections: X∩Y: λ, σ, β; X∩Z: α, β, λ; Y∩Z: λ, β
• Final intersection: X∩Y∩Z: λ, β
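The three-party example can be reproduced with a toy commutative cipher based on modular exponentiation, in the style of the Pohlig-Hellman scheme named earlier. The prime, the hash-based item encoding, and the function names are assumptions chosen for illustration; all parties are simulated in one process and the parameters are not meant to be secure.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized parameter, not a vetted choice


def make_key():
    """Pick an exponent coprime to P - 1 so x -> x**key mod P is a bijection."""
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key


def encode(item):
    """Map an item to a nonzero residue mod P (hypothetical encoding)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def encrypt(values, key):
    """Encrypt every element of a set under one party's key."""
    return {pow(v, key, P) for v in values}


def intersection_size(sets_by_party):
    """Encrypt every set under every key, then intersect the ciphertexts."""
    keys = [make_key() for _ in sets_by_party]
    n = len(keys)
    fully_encrypted = []
    for owner, items in enumerate(sets_by_party):
        current = {encode(item) for item in items}
        # The owner encrypts first, then the set travels around the ring.
        for step in range(n):
            current = encrypt(current, keys[(owner - step) % n])
        fully_encrypted.append(current)
    return len(set.intersection(*fully_encrypted))


X = {"α", "λ", "σ", "β"}
Y = {"λ", "σ", "φ", "υ", "β"}
Z = {"α", "β", "κ", "λ", "γ"}
size = intersection_size([X, Y, Z])
```

Because exponentiation commutes, equal items end up as equal ciphertexts regardless of the order in which the keys were applied, so intersecting the fully encrypted sets yields the size of X∩Y∩Z without exposing which items matched.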

Page 29

Proof of Security

• Proof by Simulation
• What is known
  – The size of the intersection set, $\left| \bigcap_{p=0}^{k-1} S_p \right|$
  – Site i learns $\left| \bigcap_{p \in C} S_p \right|$ for subsets $C \subseteq \{0, \ldots, k-1\}$ with $|C| \ge 2$ [the condition relating $i$ and $C$ is garbled in the transcript]
• How it can be simulated
  – Protocol is symmetric, simulating view of one party is sufficient

Page 30

Proof of Security

• Hashing
  – Party i receives encrypted set from party i-1
  – Can use random numbers to simulate this
• Intersection
  – Party i receives fully hashed sets of all parties

Page 31

Simulating Fully Encrypted Sets

|ABC| = 2, |AB| = 3, |AC| = 4, |BC| = 2, |A| = 6, |B| = 7, |C| = 8

[Figure: lattice ABC / AB, AC, BC / A, B, C annotated with the number of elements exclusive to each region]

• ABC: 2
• AC only: 4-2 = 2
• AB only: 3-2 = 1
• BC only: 2-2 = 0
• C only: 8-2-2-0 = 4
• B only: 7-2-1-0 = 4
• A only: 6-2-1-2 = 1

Page 32

[Figure: simulated fully encrypted sets built from random values R1 to R14]

• A: R1, R2, R3, R4, R5, R6
• B: R1, R2, R3, R7, R8, R9, R10
• C: R1, R2, R4, R5, R11, R12, R13, R14
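The simulation step can be sketched as follows: from the known intersection sizes, work out how many elements are exclusive to each region, then fill the regions with fresh random values. The function names and the 64-bit token size are assumptions for illustration.

```python
import secrets


def exclusive_region_sizes(sizes):
    """Number of elements belonging to exactly the attributes in each subset."""
    regions = {}
    # Largest subsets first, so every proper superset is already resolved.
    for subset in sorted(sizes, key=len, reverse=True):
        covered = sum(n for other, n in regions.items() if subset < other)
        regions[subset] = sizes[subset] - covered
    return regions


def simulate_encrypted_sets(sizes, token_bits=64):
    """Build sets of random tokens whose intersection sizes match `sizes`."""
    attributes = sorted(set().union(*sizes))
    simulated = {a: set() for a in attributes}
    used = set()
    for subset, n in exclusive_region_sizes(sizes).items():
        for _ in range(n):
            token = secrets.randbits(token_bits)
            while token in used:  # keep tokens distinct
                token = secrets.randbits(token_bits)
            used.add(token)
            for a in subset:
                simulated[a].add(token)
    return simulated


sizes = {
    frozenset("ABC"): 2,
    frozenset("AB"): 3, frozenset("AC"): 4, frozenset("BC"): 2,
    frozenset("A"): 6, frozenset("B"): 7, frozenset("C"): 8,
}
regions = exclusive_region_sizes(sizes)
sets = simulate_encrypted_sets(sizes)
```

The region sizes reproduce the slide's subtractions (for example 6-2-1-2 = 1 for elements only in A), and the 14 random tokens play the role of R1 to R14.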

Page 33

Optimized version

Page 34

Association Rule Mining (Revisited)

• Naïve algorithm: simply use APRIORI. A single set intersection determines the frequency of a single candidate itemset
  – Thousands of itemsets
• Key intuition
  – Set Intersection algorithm developed also allows computation of intermediate sets
  – All parties get fully encrypted sets for all attributes
  – Local computation allows efficient discovery of all association rules
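Once a party holds a fully encrypted set for every attribute, the level-wise search needs no further communication. The sketch below runs an Apriori-style search over opaque tokens; the input mirrors the R1 to R14 example from the simulation slides, and the function name and structure are assumptions rather than the authors' implementation.

```python
from itertools import combinations


def frequent_itemsets(encrypted_tid_sets, min_support):
    """Level-wise search over opaque TID tokens; returns {itemset: support}."""
    frequent = {}
    current = {frozenset([a]): tids
               for a, tids in encrypted_tid_sets.items()
               if len(tids) >= min_support}
    while current:
        frequent.update({items: len(tids) for items, tids in current.items()})
        next_level = {}
        for (s1, t1), (s2, t2) in combinations(current.items(), 2):
            candidate = s1 | s2
            if len(candidate) != len(s1) + 1 or candidate in next_level:
                continue
            # Apriori pruning: all subsets one item smaller must be frequent.
            if any(candidate - {item} not in current for item in candidate):
                continue
            tids = t1 & t2  # equal tokens mean the same hidden transaction
            if len(tids) >= min_support:
                next_level[candidate] = tids
        current = next_level
    return frequent


encrypted = {
    "A": {"R1", "R2", "R3", "R4", "R5", "R6"},
    "B": {"R1", "R2", "R3", "R7", "R8", "R9", "R10"},
    "C": {"R1", "R2", "R4", "R5", "R11", "R12", "R13", "R14"},
}
found = frequent_itemsets(encrypted, min_support=3)
```

The tokens carry no meaning beyond equality, so the party learns the supports it needs while the underlying transaction identifiers stay hidden.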

Page 35

Communication Cost

• k parties, m set size, p frequent attributes
  – k*(2k-2) = O(k^2) messages
  – p*(2p-2)*m*encrypted message size = O(p^2 m) bits
  – k rounds
• Independent of number of itemsets found
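For a feel of the magnitudes, the formulas above can be evaluated directly. The function name, the assumption that the encrypted message size is given in bits, and the sample parameter values are all made up for illustration.

```python
def communication_cost(k, p, m, encrypted_bits):
    """Evaluate the slide's cost formulas for given parameters."""
    messages = k * (2 * k - 2)
    bits = p * (2 * p - 2) * m * encrypted_bits
    rounds = k
    return messages, bits, rounds


# Hypothetical setting: 3 parties, 10 frequent attributes,
# 1000 transactions per set, 1024-bit ciphertexts.
messages, bits, rounds = communication_cost(k=3, p=10, m=1000,
                                            encrypted_bits=1024)
```

None of the three quantities depends on how many frequent itemsets are eventually found, which matches the final bullet.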

Page 36

Other Results

• ID3 Decision Tree learning
  – Horizontal Partitioning: Lindell & Pinkas ’00
  – Also vertical partitioning (Du, Vaidya)
• Association Rules
  – Horizontal Partitioning: Kantarcıoğlu
• K-Means / EM Clustering
• K-Nearest Neighbor
• Naïve Bayes, Bayes network structure
• And many more

Page 37

Challenges

• What do the results reveal?

• A general approach (instead of per data mining technique)

• Experimental results

• Incentive Compatibility

• Note: Upcoming book in the Advances in Information Security series by Springer-Verlag

Page 38

Questions