Information Sharing across Private Databases Rakesh Agrawal Alexandre Evfimievski Ramakrishnan...
Information Sharing across Private Databases
Rakesh Agrawal
Alexandre Evfimievski
Ramakrishnan Srikant
IBM Almaden Research Center
Today’s Information Sharing Systems
Assumption: Information in each database can be freely shared.
[Figure: Mediator, Federated, and Centralized architectures, with Q and R labels.]
Selective Document Sharing
R is shopping for technology.
S has intellectual property it may want to license.
First find the specific technologies where there is a match, and then reveal further information about those.
[Figure: R with Shopping List; S with Technology List]
Example 2: Govt. agencies sharing information on a need-to-know basis.
Medical Research
Validate hypothesis between adverse reaction to a drug and a specific DNA sequence.
Researchers should not learn anything beyond 4 counts:

                    Adverse Reaction    No Adv. Reaction
Sequence Present           ?                   ?
Sequence Absent            ?                   ?

[Figure: Mayo Clinic; databases: DNA Sequences, Drug Reactions]
Minimal Necessary Information Sharing
Compute queries across databases so that no more information than necessary is revealed.
Need is driven by several trends:
– End-to-end integration of information systems across companies.
– Simultaneously compete and cooperate.
– Security: need-to-know information sharing.
– Privacy legislation & stated privacy policies.
Talk Outline
Motivation
Problem Definition
Protocols
Cost Analysis
Conclusions
Current Techniques
Trusted Third Party
– Has to be completely trusted, both wrt intent and competence against security breaches.
Secure Multi-Party Computation
– Given two parties with inputs x and y, compute f(x,y) such that the parties learn only f(x,y) and nothing else.
– Can be solved by building a combinatorial circuit, and simulating that circuit [Yao86].
Cost makes them impractical for database-size problems.
Our Security Model
No third party.
Main parties directly execute a protocol, which is designed to guarantee that they do not learn any more than they would have learnt had they given the data to a trusted third party and got back the answer.
Honest-but-curious behavior: Parties follow protocol properly, except that they can record all computation & received messages, and analyze them to learn additional information.
Problem Statement (Ideal)
Given:
– Two parties: R (receiver) and S (sender)
– Databases: D_R and D_S
– Query Q spanning the tables in D_R and D_S
Compute the answer to Q and return it to R without revealing any additional information to either party.
Anything R can learn from the answer to the query is fair game!
Example: If Q = V_R ∩ V_S, then for all v ∈ V_R − V_S, R knows v ∉ V_S.
Problem Statement (Minimal Sharing)
Given:
– Two parties: R (receiver) and S (sender)
– Databases: D_R and D_S
– Query Q spanning the tables in D_R and D_S
– Additional (pre-specified) categories of information I
Compute the answer to Q and return it to R without revealing any additional information to either party, except for the information contained in I.
Protocols
Protocols for four key operations:
– Intersection, Equijoin, Intersection Size & Equijoin Size
Notation:
– T_R, T_S: tables in D_R and D_S respectively.
– V_R, V_S: set of distinct values in T_R and T_S respectively.
Additional Information I:
– For intersection, intersection size & equijoin, I = { |V_S|, |V_R| }
– For equijoin size, I also includes the distribution of duplicates & some subset of information in V_S ∩ V_R
Related Work
[NP99]: Protocols for list intersection problem
– Oblivious evaluation of n polynomials of degree n each.
– Oblivious evaluation of n^2 polynomials.
[HFH99]: find people with common preferences, without revealing the preferences.
– Intersection protocols are similar to ours, but do not provide proofs of security.
Private Information Retrieval
Privacy Preserving Data Mining
Talk Outline
Motivation
Problem Definition
Protocols
– Intersection
– Intersection Size & Equijoin Size
– Joins
– Proof Methodology
Cost Analysis
Conclusions
A Simple, but Incorrect, Intersection Protocol
R holds V_R; S holds V_S.
R & S agree to use encryption function f_e (with key e).
S → R: f_e(V_S), shorthand for { f_e(x) | x ∈ V_S }
R computes V_R ∩ V_S := { v ∈ V_R | f_e(v) ∈ f_e(V_S) }
Problem: For any element x, R can check whether f_e(x) is in f_e(V_S).
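The problem can be made concrete with a small sketch. It assumes f_e is a deterministic keyed function (HMAC-SHA256 here, a stand-in the slides do not specify); the key and the value sets are hypothetical.

```python
import hashlib
import hmac

def f(key: bytes, value: str) -> bytes:
    """Deterministic keyed function shared by R and S (HMAC-SHA256 stand-in)."""
    return hmac.new(key, value.encode(), hashlib.sha256).digest()

shared_key = b"agreed-by-R-and-S"   # hypothetical shared key e
v_s = {"alice", "bob", "carol"}     # S's private values
v_r = {"bob", "dave"}               # R's private values

# S sends f_e(V_S) to R.
encrypted_v_s = {f(shared_key, v) for v in v_s}

# R computes the intersection as intended...
intersection = {v for v in v_r if f(shared_key, v) in encrypted_v_s}

# ...but because R also knows e, it can probe any value, not only those in V_R.
def r_can_test(guess: str) -> bool:
    return f(shared_key, guess) in encrypted_v_s

leaked = {g for g in ["alice", "carol", "erin"] if r_can_test(g)}
```

Here `leaked` holds values of V_S that lie outside V_R ∩ V_S, which is exactly the extra information the protocol was supposed to hide.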
Intersection Protocol: Intuition
Still want to encrypt the values in V_R and V_S and compare the encrypted values.
However, want an encryption function such that it can only be jointly computed by R and S, not separately.
Commutative Encryption
Pair of encryption functions f and g such that f(g(v)) = g(f(v)).
Assuming the Decisional Diffie-Hellman (DDH) hypothesis,
    f_e(x) = x^e mod p
where
– p: safe prime number, i.e., both p and q = (p-1)/2 are primes,
– Dom f: all quadratic residues modulo p, and
– encryption key e ∈ {1, 2, …, q-1}
is a commutative encryption.
Commutative Encryption (2)
The powers commute:
    (x^d mod p)^e mod p = x^(de) mod p = (x^e mod p)^d mod p
DDH hypothesis: The distribution of <g^a, g^b, g^(ab)> is computationally indistinguishable from the distribution of <g^a, g^b, g^c>, where a, b, c ∈_r Dom f.
– Implication: <x, x^e, y, y^e> is also indistinguishable from <x, x^e, y, z>, where x, y, z ∈_r Dom f.
– Note: DDH does not hold if adversary can select a, b, c.
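A minimal sketch of the power function, assuming a toy safe prime p = 10007 (q = 5003) in place of the 1024-bit primes a real deployment would need. The helper names `keygen`, `f` and `to_residue` are illustrative, and the modular inverse via `pow(e, -1, Q)` requires Python 3.8 or later.

```python
import secrets

P = 10007            # toy safe prime: P = 2*Q + 1
Q = (P - 1) // 2     # 5003, also prime

def keygen() -> int:
    """Pick an encryption key e from {1, ..., Q-1}."""
    return secrets.randbelow(Q - 1) + 1

def f(e: int, x: int) -> int:
    """f_e(x) = x^e mod P."""
    return pow(x, e, P)

def to_residue(v: int) -> int:
    """Map an integer 1 <= v <= Q to a quadratic residue mod P by squaring."""
    return pow(v, 2, P)

d, e = keygen(), keygen()
x = to_residue(42)

# The powers commute: (x^d)^e = x^(de) = (x^e)^d (all mod P).
commutes = f(e, f(d, x)) == f(d, f(e, x)) == pow(x, d * e, P)

# Q is prime, so every key is invertible mod Q and f_e can be undone on residues.
e_inv = pow(e, -1, Q)
round_trip = f(e_inv, f(e, x)) == x
```

The inverse key is what later lets a party strip its own layer of encryption from a doubly encrypted value.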
Intersection Protocol
R holds V_R and secret key e_R; S holds V_S and secret key e_S.
S → R: f_eS(V_S)
– To satisfy DDH, we apply f_eS on h(V_S), where h is a hash function, not directly on V_S.
R computes f_eR(f_eS(V_S)) = f_eS(f_eR(V_S)) (commutative property).
R → S: f_eR(V_R)
S → R: <y, f_eS(y)> for y ∈ f_eR(V_R)
Since R knows <x, f_eR(x)>, R obtains <x, f_eS(f_eR(x))> for x ∈ V_R.
R outputs as V_R ∩ V_S those x for which f_eS(f_eR(x)) ∈ f_eS(f_eR(V_S)).
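The message flow of the intersection protocol can be simulated in one process. This is a sketch under stated assumptions, not the authors' implementation: it uses a toy safe prime (10007), integer ids in 1..5003, and squaring as a stand-in for the hash h so that the example stays deterministic. A real run would use a cryptographic hash and 1024-bit parameters.

```python
import secrets

P = 10007                 # toy safe prime (P = 2*Q + 1, Q = 5003 prime)
Q = (P - 1) // 2

def keygen() -> int:
    return secrets.randbelow(Q - 1) + 1       # key in {1, ..., Q-1}

def f(e: int, x: int) -> int:
    return pow(x, e, P)                        # f_e(x) = x^e mod P

def h(v: int) -> int:
    # Toy stand-in for the hash h: squaring maps 1..Q injectively onto
    # the quadratic residues mod P.
    return pow(v, 2, P)

def intersection(v_r: set, v_s: set) -> set:
    e_r, e_s = keygen(), keygen()              # secret keys of R and S

    # S -> R: f_eS(h(V_S)), sorted so that order carries no information.
    y_s = sorted(f(e_s, h(v)) for v in v_s)
    # R applies its own key: f_eR(f_eS(h(V_S))).
    z_s = {f(e_r, y) for y in y_s}

    # R -> S: f_eR(h(V_R)); R remembers which y came from which x.
    y_r = {f(e_r, h(x)): x for x in v_r}
    # S -> R: pairs <y, f_eS(y)> for each y received.
    pairs = [(y, f(e_s, y)) for y in sorted(y_r)]

    # R: x is in the intersection iff f_eS(f_eR(h(x))) appears in z_s.
    return {y_r[y] for y, z in pairs if z in z_s}

result = intersection({1, 2, 3, 4}, {3, 4, 5})
```

Neither side ever sees the other's values in the clear; R only compares doubly encrypted values, which neither key holder could have produced alone.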
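Intersection Size Protocol
R holds V_R and secret key e_R; S holds V_S and secret key e_S.
R → S: f_eR(V_R)
S → R: f_eS(V_S)
R computes f_eR(f_eS(V_S)).
S → R: f_eS(f_eR(V_R)) = f_eR(f_eS(V_R)), sent as a set without the <y, f_eS(y)> pairing.
R cannot map z ∈ f_eR(f_eS(V_R)) back to x ∈ V_R, so R learns only |V_R ∩ V_S| = |f_eR(f_eS(V_R)) ∩ f_eR(f_eS(V_S))|.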
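A sketch of the size-only variant, under the same toy assumptions as before (safe prime 10007, integer ids in 1..5003, squaring as a stand-in for the hash h). Sorting S's reply is an assumption of this sketch; it models a reply whose order does not line up with the values R sent.

```python
import secrets

P = 10007                 # toy safe prime (P = 2*Q + 1, Q = 5003 prime)
Q = (P - 1) // 2

def keygen() -> int:
    return secrets.randbelow(Q - 1) + 1

def f(e: int, x: int) -> int:
    return pow(x, e, P)

def h(v: int) -> int:
    return pow(v, 2, P)   # toy stand-in for a hash into the quadratic residues

def intersection_size(v_r: set, v_s: set) -> int:
    e_r, e_s = keygen(), keygen()

    y_r = [f(e_r, h(x)) for x in v_r]          # R -> S
    y_s = [f(e_s, h(v)) for v in v_s]          # S -> R

    # R re-encrypts what it received from S.
    z_s = {f(e_r, y) for y in y_s}
    # S re-encrypts what it received from R and returns the values sorted,
    # without pairing each one with its input y.
    z_r = sorted(f(e_s, y) for y in y_r)       # S -> R

    # R can count the matches but cannot tell which x produced which z.
    return len(z_s & set(z_r))

size = intersection_size({1, 2, 3, 4}, {3, 4, 5})
```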
Equijoin Size Protocol
Same as intersection size protocol, but allows duplicates.
Can reveal some subset of information in V_R ∩ V_S based on distribution of duplicates.
– If each element in V_R ∩ V_S has same number of duplicates in V_R, does not reveal any additional information beyond the join size and the distribution of duplicates in V_S.
– If each element in V_R ∩ V_S has unique number of duplicates in V_R, reveals V_R ∩ V_S and the number of duplicates in V_S for elements in V_R ∩ V_S.
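A hedged sketch of how duplicates enter the count, reusing the toy parameters of the earlier examples (safe prime 10007, integer ids, squaring as a stand-in for h). The multiset handling with `Counter` is an illustrative choice, not something the slides prescribe.

```python
import secrets
from collections import Counter

P = 10007                 # toy safe prime (P = 2*Q + 1, Q = 5003 prime)
Q = (P - 1) // 2

def keygen() -> int:
    return secrets.randbelow(Q - 1) + 1

def f(e: int, x: int) -> int:
    return pow(x, e, P)

def h(v: int) -> int:
    return pow(v, 2, P)   # toy stand-in for a hash into the quadratic residues

def equijoin_size(t_r: list, t_s: list) -> int:
    e_r, e_s = keygen(), keygen()

    # Both sides send their encrypted join columns with duplicates kept.
    y_r = [f(e_r, h(x)) for x in t_r]          # R -> S
    y_s = [f(e_s, h(v)) for v in t_s]          # S -> R

    # S re-encrypts R's values and returns them sorted (unpaired).
    z_r = Counter(sorted(f(e_s, y) for y in y_r))
    # R re-encrypts S's values.
    z_s = Counter(f(e_r, y) for y in y_s)

    # Each shared value contributes (#duplicates in T_R) * (#duplicates in T_S).
    return sum(z_r[z] * z_s[z] for z in z_r.keys() & z_s.keys())

size = equijoin_size([1, 1, 2, 3], [1, 2, 2, 4])
```

R sees the duplicate count of every doubly encrypted value. In this example the value that occurs twice on R's side can only be 1, which is how a distinctive duplicate count lets R tie a match back to a specific value.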
Equijoin Protocol: Intuition
R needs some extra information ext(v) for values v ∈ V_R ∩ V_S.
– ext(v): information about the other attributes in T_S for those records where T_S.A = v
S has second secret key e_S’.
For each value v ∈ V_S,
– S generates an encryption key κ = f_eS’(v), and
– encrypts ext(v) using encryption function K with key κ.
S allows R to learn f_eS’(v) only for v ∈ V_R.
K need not be a commutative encryption.
Join Protocol
R holds V_R and secret key e_R; S holds V_S + ext(V_S) and secret keys e_S, e_S’.
R → S: f_eR(V_R)
S → R: <y, f_eS(y), f_eS’(y)> for y ∈ f_eR(V_R)
R now holds <x, f_eS(f_eR(x)), f_eS’(f_eR(x))> for x ∈ V_R.
R applies f_eR^-1 to get <x, f_eS(x), f_eS’(x)> for x ∈ V_R, since
    f_eR^-1(f_eS(f_eR(x))) = f_eR^-1(f_eR(f_eS(x))) = f_eS(x)
S → R: <f_eS(v), K(f_eS’(v), ext(v))> for v ∈ V_S
– K: encryption function; encrypts ext(v) using f_eS’(v) as the encryption key.
R matches on f_eS(x) to obtain <x, f_eS(x), f_eS’(x), K(f_eS’(x), ext(x))> for x ∈ V_R ∩ V_S, and decrypts ext(x) with f_eS’(x).
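The join steps can be sketched end to end as follows. All specifics are assumptions made for illustration: a toy safe prime (10007), integer join values in 2..5003, squaring as a stand-in for the hash h, and a SHA-256-based XOR pad standing in for the symmetric cipher K, which the slides leave unspecified. `pow(e, -1, Q)` requires Python 3.8 or later.

```python
import hashlib
import secrets

P = 10007                 # toy safe prime (P = 2*Q + 1, Q = 5003 prime)
Q = (P - 1) // 2

def keygen() -> int:
    return secrets.randbelow(Q - 1) + 1

def f(e: int, x: int) -> int:
    return pow(x, e, P)

def h(v: int) -> int:
    return pow(v, 2, P)   # toy stand-in for a hash into the quadratic residues

def K(key: int, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR with a SHA-256-derived pad (its own inverse)."""
    pad = hashlib.sha256(str(key).encode()).digest()
    return bytes(b ^ pad[i % len(pad)] for i, b in enumerate(data))

def equijoin(v_r: set, t_s: dict) -> dict:
    """t_s maps each v in V_S to ext(v); returns {v: ext(v)} for shared v."""
    e_r, e_s, e_s2 = keygen(), keygen(), keygen()   # e_s2 plays the role of e_S'

    # R -> S: f_eR(h(V_R)); R keeps the mapping y -> x.
    y_r = {f(e_r, h(x)): x for x in v_r}
    # S -> R: <y, f_eS(y), f_eS'(y)>.
    triples = [(y, f(e_s, y), f(e_s2, y)) for y in sorted(y_r)]
    # R strips its own key with f_eR^-1, leaving f_eS(h(x)) and f_eS'(h(x)).
    e_r_inv = pow(e_r, -1, Q)
    r_view = {f(e_r_inv, a): (y_r[y], f(e_r_inv, b)) for y, a, b in triples}

    # S -> R: <f_eS(h(v)), K(f_eS'(h(v)), ext(v))> for every v in V_S.
    s_msgs = [(f(e_s, h(v)), K(f(e_s2, h(v)), ext)) for v, ext in t_s.items()]

    # R can decrypt only the entries whose f_eS tag it also holds.
    joined = {}
    for tag, ciphertext in s_msgs:
        if tag in r_view:
            x, kappa = r_view[tag]
            joined[x] = K(kappa, ciphertext)
    return joined

joined = equijoin({10, 20, 30}, {20: b"bob", 30: b"carol", 40: b"dan"})
```

For values outside V_R, R never obtains f_eS’(v), so the corresponding ext(v) stays encrypted.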
Proof Methodology
Consider two distributions:
– S’s view of the protocol.
– a simulation of S’s view that only uses what S is supposed to have at the end of the protocol, e.g., V_S, V_S ∩ V_R, and |V_R| for intersection.
If for any V_S and V_R, these two distributions are computationally indistinguishable, then the protocol is secure.
– i.e., S cannot learn anything else from the protocol.
Proof Methodology (2)
Simulation only uses the knowledge S is supposed to have at the end of the protocol.
Distinguisher can also use the inputs of R, i.e., V_R, but not R’s secret keys.
– Implication: S doesn’t learn anything from the protocol even if S (correctly) guesses some of R’s inputs.
Proofs
We prove (for each protocol) that if the two distributions can be distinguished, the DDH hypothesis is false.
Easy to come up with protocols that look okay, but are flawed …
– Proof of security is important for real-world acceptance & use.
– The proofs are also fun!
Talk Outline
Motivation
Problem Statement
Protocols
Cost Analysis
Conclusions
Cost Analysis: Operations
Cost is dominated by exponentiations.
Let C_e = cost of x^e mod p
– x, e, p are all 1024-bit integers
– Roughly 0.02 seconds on a Pentium 3 (in 2001) [NP01], or 2 x 10^5 per hour
Intersection: 2 (|V_R| + |V_S|) C_e
Join: (2 |V_R| + 5 |V_S|) C_e
Algorithms are trivially parallelizable.
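The formulas above translate directly into a small cost calculator. The function names are illustrative; the default rate of 2 x 10^5 exponentiations per hour is the slide's rounded figure, and dividing by the processor count assumes perfect parallel speed-up.

```python
SECONDS_PER_EXP = 0.02                       # one 1024-bit x^e mod p (slide's figure)
exps_per_hour = 3600 / SECONDS_PER_EXP       # 180,000, i.e. roughly 2 x 10^5

def intersection_exps(n_r: int, n_s: int) -> int:
    """Exponentiations for intersection: 2 (|V_R| + |V_S|)."""
    return 2 * (n_r + n_s)

def join_exps(n_r: int, n_s: int) -> int:
    """Exponentiations for join: 2 |V_R| + 5 |V_S|."""
    return 2 * n_r + 5 * n_s

def hours(exps: int, processors: int = 1, rate: float = 2e5) -> float:
    """Wall-clock hours, assuming the work splits evenly across processors."""
    return exps / (rate * processors)

example_hours = hours(join_exps(100_000, 100_000), processors=10)
```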
Selective Document Sharing: Implementation
For each pair of documents d_R ∈ D_R and d_S ∈ D_S
– R and S execute the intersection protocol to get |d_R|, |d_S|, and |d_R ∩ d_S|.
– Then compute similarity function f between the documents.
Note: This protocol also reveals to R, for each document d_R ∈ D_R, the size |d_R ∩ d_S| for each d_S ∈ D_S.
Selective Document Sharing: Cost Analysis
If
– |D_R| = 10 documents, |D_S| = 100 docs,
– each document has 1000 words,
– 10 parallel processors,
then
– 2 hours computation time &
– 35 minutes communication time (on T1 line).
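The 2-hour computation figure can be reproduced with a few lines of arithmetic. This assumes each document contributes 1000 distinct words, that each document pair costs 2 (|d_R| + |d_S|) exponentiations as in the intersection formula, and that the work is spread evenly over the processors. The communication time is not modelled here.

```python
docs_r, docs_s = 10, 100
words_per_doc = 1000          # assumed to be distinct words
processors = 10
exps_per_hour = 2e5           # rounded rate quoted for a 2001 Pentium 3

pairs = docs_r * docs_s                               # document pairs to compare
exps_per_pair = 2 * (words_per_doc + words_per_doc)   # 2 (|d_R| + |d_S|)
total_exps = pairs * exps_per_pair
compute_hours = total_exps / (exps_per_hour * processors)
```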
Medical Research: Implementation
Let
– V_R = set of ids in R’s database that took the drug.
– V_R’ = subset of V_R with adverse reaction.
– V_S = set of ids in S’s database.
– V_S’ = subset of V_S with DNA sequence.
Execute intersection size protocol 4 times:
    (V_R − V_R’) ∩ (V_S − V_S’),   (V_R − V_R’) ∩ V_S’,
    V_R’ ∩ (V_S − V_S’),           V_R’ ∩ V_S’
– Modified version of protocol that sends results directly to researchers.
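To make the set algebra concrete, the sketch below computes the four counts with a pluggable `size` function. The default is a plain set intersection, standing in for the trusted-third-party answer; in the private setting each call would be one run of the intersection size protocol. The function name and sample ids are hypothetical.

```python
def contingency_counts(v_r, v_r_adv, v_s, v_s_seq, size=lambda a, b: len(a & b)):
    """Return the four counts of the sequence-vs-reaction table.

    v_r_adv is V_R' (adverse reaction); v_s_seq is V_S' (sequence present).
    """
    return {
        ("present", "adverse"): size(v_r_adv, v_s_seq),
        ("present", "none"): size(v_r - v_r_adv, v_s_seq),
        ("absent", "adverse"): size(v_r_adv, v_s - v_s_seq),
        ("absent", "none"): size(v_r - v_r_adv, v_s - v_s_seq),
    }

counts = contingency_counts(
    v_r={1, 2, 3, 4, 5, 6},
    v_r_adv={1, 2},
    v_s={1, 2, 3, 4, 5, 6, 7},
    v_s_seq={1, 3, 5},
)
```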
Medical Research: Cost Analysis
If |V_R| = |V_S| = 1 million ids, and 10 parallel processors:
– 4 hours computation time.
– 1.5 hours communication time.
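A rough check of the 4-hour figure, assuming the intersection size protocol costs the same 2 (|A| + |B|) exponentiations as intersection (the slides give the formula only for intersection). The subset sizes below are hypothetical; the total does not depend on them, because across the four runs each id of V_R and of V_S is involved exactly twice.

```python
def intersection_exps(n_a: int, n_b: int) -> int:
    return 2 * (n_a + n_b)

n_r = n_s = 1_000_000
n_r_adv = 100_000     # hypothetical |V_R'|
n_s_seq = 250_000     # hypothetical |V_S'|

runs = [
    (n_r - n_r_adv, n_s - n_s_seq),
    (n_r - n_r_adv, n_s_seq),
    (n_r_adv, n_s - n_s_seq),
    (n_r_adv, n_s_seq),
]
total_exps = sum(intersection_exps(a, b) for a, b in runs)   # 4 (|V_R| + |V_S|)
compute_hours = total_exps / (2e5 * 10)                      # 10 processors
```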
Talk Outline
Motivation
Problem Statement
Protocols
Cost Analysis
Conclusions
Summary
Identified information sharing across private databases as a new area for database research.
Developed novel protocols for intersection, intersection size & equijoin, and proved that these protocols disclose minimal information.
– Also gave protocol for equijoin size. This protocol reveals some information about which tuples joined, based on the distribution of duplicates.
Showed how new applications can be built using these protocols.
Future Work
What is the tradeoff between the additional information disclosed and efficiency?
– Will we be able to obtain much faster protocols if we are willing to disclose additional information?
Can we formalize models of minimal disclosure and discover corresponding protocols for higher-level database operations?
Backup
System Components
[Figure: system components: Operating System; Secure Communication; Cryptographic Protocol; Libraries (incl. Encryption Primitives); Database]
Lemma 1
For polynomial m, the distribution of the 2 × m tuple
$$\begin{pmatrix} x_1 & \cdots & x_{m-1} & x_m \\ f_e(x_1) & \cdots & f_e(x_{m-1}) & f_e(x_m) \end{pmatrix}$$
is indistinguishable from the distribution of the tuple
$$\begin{pmatrix} x_1 & \cdots & x_{m-1} & x_m \\ f_e(x_1) & \cdots & f_e(x_{m-1}) & z_m \end{pmatrix}$$
where $x_1, \ldots, x_m, z_m \in_r \mathrm{Dom}\, f$.
Lemma 2
For polynomial m and n, the distribution of the 2 × n tuple
$$\begin{pmatrix} x_1 & \cdots & x_m & x_{m+1} & \cdots & x_n \\ f_e(x_1) & \cdots & f_e(x_m) & f_e(x_{m+1}) & \cdots & f_e(x_n) \end{pmatrix}$$
is indistinguishable from the distribution of the tuple
$$\begin{pmatrix} x_1 & \cdots & x_m & x_{m+1} & \cdots & x_n \\ f_e(x_1) & \cdots & f_e(x_m) & z_{m+1} & \cdots & z_n \end{pmatrix}$$
where $x_1, \ldots, x_n, z_{m+1}, \ldots, z_n \in_r \mathrm{Dom}\, f$.
Lemma 3
For polynomial m and n, the distribution of the 3 × n tuple
$$\begin{pmatrix} x_1 & \cdots & x_m & x_{m+1} & \cdots & x_n \\ f_e(x_1) & \cdots & f_e(x_m) & f_e(x_{m+1}) & \cdots & f_e(x_n) \\ f_{e'}(x_1) & \cdots & f_{e'}(x_m) & f_{e'}(x_{m+1}) & \cdots & f_{e'}(x_n) \end{pmatrix}$$
is indistinguishable from the distribution of the tuple
$$\begin{pmatrix} x_1 & \cdots & x_m & x_{m+1} & \cdots & x_n \\ y_1 & \cdots & y_m & y_{m+1} & \cdots & y_n \\ z_1 & \cdots & z_m & z_{m+1} & \cdots & z_n \end{pmatrix}$$
where $x_i, y_i, z_i \in_r \mathrm{Dom}\, f$.