Chaoyue Niu, Zhenzhe Zheng, Fan Wu, Xiaofeng Gao, and ...

1
Trading Data in Good Faith: Integrating Truthfulness and Privacy Preservation in Data Markets Chaoyue Niu, Zhenzhe Zheng, Fan Wu, Xiaofeng Gao, and Guihai Chen Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China Email: {rvincency, zhengzhenzhe220}@gmail.com; {fwu, gao-xf, gchen}@cs.sjtu.edu.cn Trading Data in Good Faith: Integrating Truthfulness and Privacy Preservation in Data Markets Chaoyue Niu, Zhenzhe Zheng, Fan Wu, Xiaofeng Gao, and Guihai Chen Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China Email: {rvincency, zhengzhenzhe220}@gmail.com; {fwu, gao-xf, gchen}@cs.sjtu.edu.cn Introduction Figure 1. Emerging Data Tradings. Data Market Model Figure 2. System Architecture of DataSift [5]. Why trade data services rather than raw data? For data contributors: privacy concerns [1] For service provider: value-added data services [2] For data consumers: data copyright infringement [12] Two Security Issues Privacy Preservation Identity Preservation: real identity, e.g., SSN Data Confidentiality: mainly against data consumers Data Truthfulness Partial Data Collection Attack No/Partial Data Processing Attack Design Challenges Truthfulness of Data Collection Privacy preservation Digital signature (non-repudiation) Message authentication code? secret key sharing Data processing Semantic inconsistency [8] Data provenance/lineage [10] Truthfulness of Data Processing Information asymmetry due to data confidentiality Differs from verifiable/outsourced computing [7] Efficiency Requirements Large-scale data acquisition (e.g., on DataSift [5] and Gnip [9]) Data authentication and data integrity? sequential verification, PKI maintenance (computation and communication overheads) Profile Matching Data Service Attribute Level Movie 3 Sports Cooking 0 5 Attribute Level Movie 2 Sports Cooking 5 1 Alice Bob δ = 2 Attribute Level Movie 3 Sports Cooking 4 2 Charlie Figure 3. Motivating Example: An Illustration of Fine-grained Profile Matching [14]. 1. The service provider defines a public attribute set A = {A 1 ,A 2 , ··· ,A β }. 2. A data contributor o i , e.g., a Twitter or OkCupid user, selects an integer u ij to indicate her level of interest in A j , and thus forms her profile vector ~ U i =(u i1 ,u i2 , ··· ,u ). 3. To generate a customized friending strategy, the data consumer also needs to provide her profile vector ~ V =(v 1 ,v 2 , ··· ,v β ) and an acceptable similarity threshold δ . 4. Without loss of generality, we assume that the service provider employs Euclidean dis- tance f (·) to measure the similarity difference, where f ( ~ U i , ~ V )= q β j =1 (u ij - v j ) 2 . Our Approach: TPDM Online Database Revocation Ciphertext Data Encryption (e.g., BGN Scheme) Data Authentication Data Integrity Truthfulness of Data Collection Real Identity Raw Data Registration Homomorphic Properties Pseudo Identity Generation Identity-based Signature Tamper-proof Device 1 st -layer Batch Verification 2 nd -layer Batch Verification Data Processing Outcome Verification Figure 4. Design Overview of TPDM. Phase I: Initialization The registration center sets up the system parameters at the beginning of data trading: Three multiplicative cyclic groups G 1 , G 2 , G T with the same prime order q ; g 1 ,g 2 are generators of G 1 , G 2 , respectively; an admissible pairing ˆ e : G 1 × G 2 G T [3]. Master keys: s 1 ,s 2 Z * q ; public keys: P 0 = g 1 s 1 ,P 1 = g 2 s 1 ,P 2 = g 2 s 2 . BGN cryptosystem [4]: an encryption scheme E (·), a decryption scheme D(·). Registration for a realidentity RID i G 1 and a password PW i . Phase II: Signing Key Generation Generate a pair of pseudo identity PID i and secret key SK i for registered o i : PID i = hPID 1 i , PID 2 i i = hg 1 r , RID i P 0 r i, (1) SK i = hSK 1 i ,SK 2 i i = hPID 1 i s 1 ,H (PID 2 i ) s 2 i, (2) where r is a per-session random nonce, represents the Exclusive-OR (XOR) operation, and H (·) is a MapToPoint hash function [3], i.e., H (·): {0, 1} * G 1 . Phase III: Data Submission Data Encryption: ~ D i = ( E (u ij ),E (u ij 2 ) ) | j [1] . Encrypted Data Signing: σ i = SK 1 i · SK 2 i h(D i ) , (3) where ·denotes the group operation in G 1 , h(·) is a one-way hash function such as SHA-1 [6], and D i is derived by concatenating all the elements of ~ D i together. Phase IV: Data Processing and Verifications First-layer Batch Verification: ˆ e n Y i=1 σ i ,g 2 ! e n Y i=1 PID 1 i ,P 1 ! ˆ e n Y i=1 H (PID 2 i ) h(D i ) ,P 2 ! . (4) Data Submission by Data Consumer: ~ D 0 = ( E (v j 2 ),E (v j ) -2 = E (-2v j ) ) | j [1] . Similarity Evaluation via Homomorphic Multiplication and Homomorphic Addition: R ij = E (1) E (v j 2 ) E (u ij ) E (-2v j ) E (u ij 2 ) E (1) = E ( (u ij - v j ) 2 ) , R i = R i1 R i2 ⊕···⊕ R = E β X j =1 (u ij - v j ) 2 = E f ( ~ U i , ~ V ) 2 . (5) Signatures Aggregation: σ = Q m i=1 σ c i , where {o c 1 ,o c 2 , ··· ,o c m } are matched ones. Second-layer Batch Verification: ˆ e (σ, g 2 )=ˆ e m Y i=1 PID 1 c i ,P 1 ! ˆ e m Y i=1 H (PID 2 c i ) h(D c i ) ,P 2 ! . (6) Outcome Verification: Real identity recovery: PID 2 c i PID 1 c i s 1 = RID c i P 0 r g 1 s 1 ·r = RID c i . No need of homomorphic multiplications: R ij = E (u ij 2 ) E (u ij ) -2v j E (v j 2 ). Sampling, e.g., p(not evaluating each profile)= 20%, 26 checks, success rate = 99.70%. Evaluation Result Dataset: R1-Yahoo! Music User Ratings of Musical Artists Version 1.0 [13] 11,557,943 ratings of 98,211 artists given by 1,948,882 users Choose β common artists as the evaluating attributes Append each user’s ratings ranging from 0 to 10 Evaluation Settings: Pairing-Based Cryptography (PBC) library [11] Identity-based signature scheme * SS512: a supersingular curve with a base field size of 512 bits and an embedding degree of 2 * MNT159: a MNT curve with a base field size of 159 bits and an embedding degree of 6 * q is 160-bit long; all hashings are implemented in SHA1, con- sidering its digest size closely matches q . BGN cryptosystem: Type A1 pairing, in which the group order is a product of two 512-bit primes. OS: 64-bit Ubuntu 14.04, Intel(R) Core(TM) i53.10GHz Table I. Computation overhead of identity-based signature scheme per data contributor. Preparation Operation Setting Pseudo Identity Generation Secret Key Generation Signing SS512 4.698ms (39.40%) 6.023ms (50.53%) 1.201ms (10.07%) MNT159 1.958ms (57.33%) 1.028ms (30.10%) 0.429ms (12.57%) 0 0.5 1 1.5 2 2.5 3 3.5 4 5 10 15 20 25 30 35 40 Run Time (s) Number of Attributes Profile Encryption Similarity Evaluation 0 2 4 6 8 10 12 14 1 10 1 10 2 10 3 10 4 10 5 10 6 Per Signature (ms) Number of Data Contributors Single Signature Verification (MNT159) Single Signature Verification (SS512) Batch Verification (MNT159) Batch Verification (SS512) 10 -6 10 -5 10 -4 10 -3 10 -2 10 -1 10 0 10 1 10 2 10 3 1 10 1 10 2 10 3 10 4 10 5 10 6 Communicaton Overhead (MB) Number of Data Contributors Each Data Contributor Service Provider Data Consumer Figure 5. Performance of TPDM on Yahoo! Music Ratings Dataset. Table I and Figure 5 reveal that TPDM incurs affordable computa- tion and communication overheads, even when supporting as many as 1 million data contributors. Conclusions Consider both data truthfulness and privacy preservation in data markets, and propose TPDM. Instantiate TPDM with a profile-matching service, and evaluate its performance on a real-world dataset. Evaluation results have demonstrated its scalability, especially from computation and communication overheads. References [1] 2016 TRUSTe/NCSA Consumer Privacy Infographic - US Edition,https:// www.truste.com/resources/privacy-research/ncsa-consumer-privacy-index-us/. [2] M. Balazinska, B. Howe, and D. Suciu, Data markets in the cloud: An opportunity for the database community,in VLDB, 2011. [3] D. Boneh and M. Franklin, Identity-based encryption from the weil pairing,in CRYPTO, 2001. [4] D. Boneh, E. Goh, and K. Nissim, Evaluating 2-dnf formulas on ciphertexts,in TCC, 2005. [5] DataSift,http://datasift.com/. [6] D. Eastlake and P. Jones, US Secure Hash Algorithm 1 (SHA1),IETF RFC 3174, 2001. [7] D. Fiore, R. Gennaro, and V. Pastro, Efficiently verifiable computation on encrypted data,in CCS, 2014. [8] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, Privacy-preserving data publishing: A survey of recent developments,ACM Computing Surveys, vol. 42, no. 4, pp. 1–53, Jun. 2010. [9] Gnip,https://gnip.com/. [10] R. Ikeda, A. D. Sarma, and J. Widom, Logical provenance in data-oriented workflows?in ICDE, 2013. [11] PBC Library,https://crypto.stanford.edu/pbc/. [12] P. Upadhyaya, M. Balazinska, and D. Suciu, Automatic enforcement of data use policies with datalawyer,in SIGMOD, 2015. [13] Yahoo! Webscope datasets,http://webscope.sandbox.yahoo.com/. [14] R. Zhang, Y. Zhang, J. Sun, and G. Yan, Fine-grained private matching for proximity-based mobile social networking,in INFOCOM, 2012.

Transcript of Chaoyue Niu, Zhenzhe Zheng, Fan Wu, Xiaofeng Gao, and ...

Page 1: Chaoyue Niu, Zhenzhe Zheng, Fan Wu, Xiaofeng Gao, and ...

Trading Data in Good Faith: Integrating Truthfulness and PrivacyPreservation in Data Markets

Chaoyue Niu, Zhenzhe Zheng, Fan Wu, Xiaofeng Gao, and Guihai ChenShanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China

Email: {rvincency, zhengzhenzhe220}@gmail.com; {fwu, gao-xf, gchen}@cs.sjtu.edu.cn

Trading Data in Good Faith: Integrating Truthfulness and PrivacyPreservation in Data Markets

Chaoyue Niu, Zhenzhe Zheng, Fan Wu, Xiaofeng Gao, and Guihai ChenShanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China

Email: {rvincency, zhengzhenzhe220}@gmail.com; {fwu, gao-xf, gchen}@cs.sjtu.edu.cn

Introduction

Figure 1. Emerging Data Tradings.

Data Market Model

Figure 2. System Architecture of DataSift [5].

Why trade data services rather than raw data?

• For data contributors: privacy concerns [1]

• For service provider: value-added data services [2]

• For data consumers: data copyright infringement [12]

Two Security Issues

• Privacy Preservation

– Identity Preservation: real identity, e.g., SSN

– Data Confidentiality: mainly against data consumers

• Data Truthfulness

– Partial Data Collection Attack

– No/Partial Data Processing Attack

Design Challenges

Truthfulness of Data Collection

• Privacy preservation

– Digital signature (non-repudiation)

– Message authentication code? secret key sharing

• Data processing

– Semantic inconsistency [8]

– Data provenance/lineage [10]

Truthfulness of Data Processing

• Information asymmetry due to data confidentiality

• Differs from verifiable/outsourced computing [7]

Efficiency Requirements

• Large-scale data acquisition (e.g., on DataSift [5] and Gnip [9])

• Data authentication and data integrity? sequential verification,PKI maintenance (computation and communication overheads)

Profile Matching Data Service

Attribute Level

Movie 3

Sports

Cooking

0

5

Attribute Level

Movie 2

Sports

Cooking

5

1

Alice Bob δ = 2

Attribute Level

Movie 3

Sports

Cooking

4

2

Charlie

Figure 3. Motivating Example: An Illustration of Fine-grained Profile Matching [14].

1. The service provider defines a public attribute set A = {A1, A2, · · · , Aβ}.2. A data contributor oi, e.g., a Twitter or OkCupid user, selects an integer uij to indicate

her level of interest in Aj, and thus forms her profile vector ~Ui = (ui1, ui2, · · · , uiβ).

3. To generate a customized friending strategy, the data consumer also needs to provideher profile vector ~V = (v1, v2, · · · , vβ) and an acceptable similarity threshold δ.

4. Without loss of generality, we assume that the service provider employs Euclidean dis-

tance f (·) to measure the similarity difference, where f (~Ui, ~V ) =√∑β

j=1 (uij − vj)2.

Our Approach: TPDM

Online Database

Revocation

Ciphertext

Data Encryption

(e.g., BGN Scheme)

Data Authentication

Data Integrity

Truthfulness of

Data Collection

Real Identity

Raw Data

Registration

Homomorphic Properties

Pseudo Identity

Generation

Identity-based

Signature

Tamper-proof Device

1st-layer Batch

Verification

2nd-layer Batch

Verification

Data ProcessingOutcome

Verification

Figure 4. Design Overview of TPDM.

Phase I: Initialization

• The registration center sets up the system parameters at the beginning of data trading:

– Three multiplicative cyclic groups G1,G2,GT with the same prime order q; g1, g2 aregenerators of G1,G2, respectively; an admissible pairing e : G1 ×G2 → GT [3].

– Master keys: s1, s2 ∈ Z∗q; public keys: P0 = g1s1, P1 = g2

s1, P2 = g2s2.

– BGN cryptosystem [4]: an encryption scheme E(·), a decryption scheme D(·).– Registration for a “real” identity RIDi ∈ G1 and a password PWi.

Phase II: Signing Key Generation

• Generate a pair of pseudo identity PIDi and secret key SKi for registered oi:

PIDi = 〈PID1i ,PID

2i〉 = 〈g1

r,RIDi � P0r〉, (1)

SKi = 〈SK1i , SK

2i 〉 = 〈PID1

is1, H(PID2

i)s2〉, (2)

where r is a per-session random nonce,� represents the Exclusive-OR (XOR) operation,and H(·) is a MapToPoint hash function [3], i.e., H(·) : {0, 1}∗→ G1.

Phase III: Data Submission

• Data Encryption: ~Di =(E(uij), E(uij

2))|j∈[1,β].

• Encrypted Data Signing:

σi = SK1i · SK2

ih(Di), (3)

where “·” denotes the group operation in G1, h(·) is a one-way hash function such asSHA-1 [6], and Di is derived by concatenating all the elements of ~Di together.

Phase IV: Data Processing and Verifications

• First-layer Batch Verification:

e

(n∏i=1

σi, g2

)= e

(n∏i=1

PID1i , P1

)e

(n∏i=1

H(PID2i)h(Di), P2

). (4)

• Data Submission by Data Consumer: ~D0 =(E(vj

2), E(vj)−2 = E(−2vj)

)|j∈[1,β].

• Similarity Evaluation via Homomorphic Multiplication and Homomorphic Addition:

Rij = E(1)⊗ E(vj2)⊕ E(uij)⊗ E(−2vj)⊕ E(uij

2)⊗ E(1) = E((uij − vj)2

),

Ri = Ri1 ⊕Ri2 ⊕ · · · ⊕Riβ = E

β∑j=1

(uij − vj)2

= E(f (~Ui, ~V )2

). (5)

• Signatures Aggregation: σ =∏m

i=1 σci, where {oc1, oc2, · · · , ocm} are matched ones.

• Second-layer Batch Verification:

e (σ, g2) = e

(m∏i=1

PID1ci, P1

)e

(m∏i=1

H(PID2ci)h(Dci

), P2

). (6)

• Outcome Verification:

– Real identity recovery: PID2ci� PID1

ci

s1 = RIDci � P0r � g1

s1·r = RIDci.

– No need of homomorphic multiplications: Rij = E(uij2)⊕ E(uij)

−2vj ⊕ E(vj2).

– Sampling, e.g., p(not evaluating each profile)= 20%, 26 checks, success rate = 99.70%.

Evaluation Result

• Dataset:

– R1-Yahoo! Music User Ratings of Musical Artists Version 1.0 [13]

– 11,557,943 ratings of 98,211 artists given by 1,948,882 users

– Choose β common artists as the evaluating attributes

– Append each user’s ratings ranging from 0 to 10

• Evaluation Settings:

– Pairing-Based Cryptography (PBC) library [11]

– Identity-based signature scheme

∗ SS512: a supersingular curve with a base field size of 512 bitsand an embedding degree of 2

∗ MNT159: a MNT curve with a base field size of 159 bits andan embedding degree of 6

∗ q is 160-bit long; all hashings are implemented in SHA1, con-sidering its digest size closely matches q.

– BGN cryptosystem: Type A1 pairing, in which the group orderis a product of two 512-bit primes.

– OS: 64-bit Ubuntu 14.04, Intel(R) Core(TM) i5 3.10GHz

Table I. Computation overhead of identity-basedsignature scheme per data contributor.

Preparation Operation

SettingPseudo Identity

GenerationSecret KeyGeneration

Signing

SS512 4.698ms (39.40%) 6.023ms (50.53%) 1.201ms (10.07%)MNT159 1.958ms (57.33%) 1.028ms (30.10%) 0.429ms (12.57%)

0

0.5

1

1.5

2

2.5

3

3.5

4

5 10 15 20 25 30 35 40

Ru

n T

ime

(s)

Number of Attributes

Profile Encryption

Similarity Evaluation

0

2

4

6

8

10

12

14

1 101

102

103

104

105

106

Per

Sig

nat

ure

(m

s)

Number of Data Contributors

Single Signature Verification (MNT159)

Single Signature Verification (SS512)

Batch Verification (MNT159)

Batch Verification (SS512)

10-6

10-5

10-4

10-3

10-2

10-1

100

101

102

103

1 101

102

103

104

105

106

Co

mm

un

icat

on

Ov

erh

ead

(M

B)

Number of Data Contributors

Each Data Contributor

Service Provider

Data Consumer

Figure 5. Performance of TPDM on Yahoo! Music Ratings Dataset.

Table I and Figure 5 reveal that TPDM incurs affordable computa-tion and communication overheads, even when supporting as manyas 1 million data contributors.

Conclusions

• Consider both data truthfulness and privacy preservation in datamarkets, and propose TPDM.

• Instantiate TPDM with a profile-matching service, and evaluate itsperformance on a real-world dataset.

• Evaluation results have demonstrated its scalability, especially fromcomputation and communication overheads.

References

[1]“2016 TRUSTe/NCSA Consumer Privacy Infographic - US Edition,” https://

www.truste.com/resources/privacy-research/ncsa-consumer-privacy-index-us/.

[2] M. Balazinska, B. Howe, and D. Suciu, “Data markets in the cloud: An opportunity for thedatabase community,” in VLDB, 2011.

[3] D. Boneh and M. Franklin, “Identity-based encryption from the weil pairing,” in CRYPTO,2001.

[4] D. Boneh, E. Goh, and K. Nissim, “Evaluating 2-dnf formulas on ciphertexts,” in TCC, 2005.

[5]“DataSift,” http://datasift.com/.

[6] D. Eastlake and P. Jones, “US Secure Hash Algorithm 1 (SHA1),” IETF RFC 3174, 2001.

[7] D. Fiore, R. Gennaro, and V. Pastro, “Efficiently verifiable computation on encrypted data,”in CCS, 2014.

[8] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu,“Privacy-preserving data publishing: A surveyof recent developments,”ACM Computing Surveys, vol. 42, no. 4, pp. 1–53, Jun. 2010.

[9]“Gnip,” https://gnip.com/.

[10] R. Ikeda, A. D. Sarma, and J. Widom, “Logical provenance in data-oriented workflows?” inICDE, 2013.

[11]“PBC Library,” https://crypto.stanford.edu/pbc/.

[12] P. Upadhyaya, M. Balazinska, and D. Suciu, “Automatic enforcement of data use policies withdatalawyer,” in SIGMOD, 2015.

[13]“Yahoo! Webscope datasets,” http://webscope.sandbox.yahoo.com/.

[14] R. Zhang, Y. Zhang, J. Sun, and G. Yan, “Fine-grained private matching for proximity-basedmobile social networking,” in INFOCOM, 2012.