SplitX: High-Performance Private Analytics

Ruichuan Chen (Bell Labs / Alcatel-Lucent), Istemi Ekin Akkus (MPI-SWS), Paul Francis (MPI-SWS)


Page 1: SplitX: High-Performance       Private Analytics

SplitX: High-Performance Private Analytics

Ruichuan Chen (Bell Labs / Alcatel-Lucent), Istemi Ekin Akkus (MPI-SWS), Paul Francis (MPI-SWS)

Page 2: SplitX: High-Performance       Private Analytics

Data analytics is important

Evaluate system performance

Understand user behavior

Discover statistical patterns

Page 3: SplitX: High-Performance       Private Analytics

Data exposure has become a major concern

Third-party Trackers

Smart-phone Apps

Page 4: SplitX: High-Performance       Private Analytics

User-owned and operated

Data exposure has to be brought under control!

User-owned and operated principle: personal data should be stored in a local host under the user’s control.

Page 5: SplitX: High-Performance       Private Analytics

Motivation and problem

How to make aggregate queries over distributed private user data while still preserving user privacy?

[Figure: an analyst and several clients, each holding its own data]

Page 6: SplitX: High-Performance       Private Analytics

Outline

Related work

SplitX system: key insights, system design, performance comparison, implementation & deployment

Conclusion

Page 7: SplitX: High-Performance       Private Analytics

A general approach

Based on differential privacy.

Differential privacy adds noise to the output of a computation (i.e., query).

The noise hides the presence or absence of a user.

[Figure: data, database, query module (adds noise), analyst]
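To make "adds noise to the output of a query" concrete, here is a minimal sketch of a noisy count using the standard Laplace mechanism. This is textbook differential privacy, not the mechanism SplitX itself uses (the later slides add random bit-vectors instead). The function name `noisy_count` and its parameters are hypothetical.

```python
import random

def noisy_count(values, predicate, epsilon, rng):
    """Return a count of matching values plus Laplace noise.

    Assumption: a counting query has sensitivity 1, because adding or
    removing one user changes the count by at most 1, so the noise
    scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    rate = epsilon  # exponential rate = 1 / scale = epsilon / sensitivity
    # The difference of two i.i.d. exponential samples is Laplace-distributed.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise

if __name__ == "__main__":
    ages = [17, 25, 30, 42, 61, 33]
    rng = random.Random()
    print(noisy_count(ages, lambda a: 20 <= a <= 39, epsilon=1.0, rng=rng))
```

A smaller `epsilon` means more noise and stronger privacy; the analyst only ever sees the noisy value.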

Page 8: SplitX: High-Performance       Private Analytics

Previous systems

Servers aggregate answers without seeing individual user data.

Differentially private noise is added to the aggregate result.

[Figure: clients holding data, servers in the middle, and analysts]

Akkus et al., CCS’12; Chen et al., NSDI’12; Dwork et al., EUROCRYPT’06; Hardt et al., CCS’12; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11

Page 9: SplitX: High-Performance       Private Analytics

Primary technical problems

Scale poorly: require public-key operations or something even more expensive. (Akkus et al., CCS’12; Chen et al., NSDI’12; Dwork et al., EUROCRYPT’06; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11)

Suffer from answer pollution: even a single malicious user can substantially distort the aggregate result through a single answer. (Hardt et al., CCS’12; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11)

Page 10: SplitX: High-Performance       Private Analytics

Outline

Related work

SplitX system: key insights, system design, performance comparison, implementation & deployment

Conclusion

Page 11: SplitX: High-Performance       Private Analytics

SplitX

A high-performance private analytics system

2 to 3 orders of magnitude more efficient in bandwidth

3 to 5 orders of magnitude more efficient in computation

Resistant to answer pollution

Page 12: SplitX: High-Performance       Private Analytics

Components & assumptions

[Figure: clients holding data, servers (1 aggregator and 2 mixes), and analysts]

Analysts are potentially malicious (violating user privacy).

Clients are user devices. Clients are potentially malicious (distorting the final results).

Servers are honest but curious: 1) they follow the specified protocol; 2) they try to exploit additional information that can be learned in so doing.

Page 13: SplitX: High-Performance       Private Analytics

Outline

Related work

SplitX system: key insights, system design, performance comparison, implementation & deployment

Conclusion

Page 14: SplitX: High-Performance       Private Analytics

Key insights: XOR encryption

How to achieve high performance?

Client wants to send M to the aggregator.

Client splits M, and sends the split messages to the aggregator via the mixes.

Aggregator joins the split messages to recreate M.

[Figure: the client generates a random string R and sends M ⊕ R via Mix1 and R via Mix2; the aggregator joins the two to recreate M]
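The split and join steps above can be sketched as follows. This is a minimal illustration assuming the message is a byte string and that R is drawn from a cryptographically secure source; the function names `split_message` and `join_message` are hypothetical and not taken from the SplitX implementation.

```python
import secrets
from typing import Tuple

def split_message(message: bytes) -> Tuple[bytes, bytes]:
    """Split M into two shares: (M XOR R) and R, with R uniformly random."""
    r = secrets.token_bytes(len(message))
    masked = bytes(m ^ k for m, k in zip(message, r))
    return masked, r  # one share per mix

def join_message(share1: bytes, share2: bytes) -> bytes:
    """Recreate M by XOR-ing the two shares (done by the aggregator)."""
    if len(share1) != len(share2):
        raise ValueError("shares must have equal length")
    return bytes(a ^ b for a, b in zip(share1, share2))

if __name__ == "__main__":
    via_mix1, via_mix2 = split_message(b"answer")
    print(join_message(via_mix1, via_mix2))
```

Each share on its own is a uniformly random string, so a single mix learns nothing about M, and the only operations involved are random generation and XOR rather than public-key cryptography.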

Page 15: SplitX: High-Performance       Private Analytics

Key insights: XOR encryption

How to achieve high performance?

For clarity, M in the following figures denotes that the client sends two split messages of M to the aggregator via Mix1 and Mix2.

[Figure: the two-path diagram (M ⊕ R via Mix1, R via Mix2, generate R at the client, recreate M at the aggregator) shown next to its abbreviated form labeled M]

Page 16: SplitX: High-Performance       Private Analytics

Key insights: query buckets

How to limit answer pollution?

Solution: ensure that a client cannot arbitrarily manipulate answers.

Divide the answer’s value range into buckets.

Enforce a binary answer in each bucket.

Page 17: SplitX: High-Performance       Private Analytics

Key insights: query buckets

Query: “SELECT age FROM splitx”

4 buckets: 0~19, 20~39, 40~59, and ≥60.

Answers: a ‘1’ or ‘0’ per bucket.

Example: 30 years old → 0, 1, 0, 0

Answers are encoded in a bit-vector.

An answer from a malicious client cannot substantially distort the query result!
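The bucket encoding above can be sketched as follows, using the four age buckets from the slide. The names `BUCKETS`, `encode_answer`, and `is_valid_answer` are hypothetical; the validity check only illustrates why one answer can shift any bucket count by at most 1.

```python
from typing import List, Optional, Sequence, Tuple

# (low, high) inclusive bounds; high=None means the bucket is open-ended (>= low).
BUCKETS: List[Tuple[int, Optional[int]]] = [(0, 19), (20, 39), (40, 59), (60, None)]

def encode_answer(value: int, buckets: Sequence[Tuple[int, Optional[int]]] = BUCKETS) -> List[int]:
    """Encode a numeric answer as one bit per bucket."""
    return [
        1 if value >= low and (high is None or value <= high) else 0
        for low, high in buckets
    ]

def is_valid_answer(bits: Sequence[int], num_buckets: int) -> bool:
    """An answer is acceptable only if it has one binary value per bucket."""
    return len(bits) == num_buckets and all(b in (0, 1) for b in bits)

if __name__ == "__main__":
    print(encode_answer(30))  # [0, 1, 0, 0]
```

Because every position is restricted to 0 or 1, even a client that lies in every bucket changes each count by at most one.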

Page 18: SplitX: High-Performance       Private Analytics

Outline

Related work

SplitX system: key insights, system design, performance comparison, implementation & deployment

Conclusion

Page 19: SplitX: High-Performance       Private Analytics

System design

1) Query publish/subscribe: the analyst publishes its queries; the client subscribes to an analyst’s queries.

2) Query answering: the client answers queries; the mixes add differentially private noise; the mixes shuffle answers; the aggregator generates query results.

Page 20: SplitX: High-Performance       Private Analytics

1) Query publish/subscribe

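[Figure: aggregator, client, Mix1, Mix2, and analyst; labels: “Query1, Query2, …” (published by the analyst), “Analyst ID” (client subscription), “Query1, Query2, …” (delivered to the client)]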

Page 21: SplitX: High-Performance       Private Analytics

1) Query publish/subscribe

Query example: age distribution among male users?

QID: 123

SQL: SELECT age FROM splitx WHERE gender='male'

Buckets: 0~19, 20~39, 40~59, and ≥60

DP parameter (ε): 1.0

Tend: 11:59:59PM on Aug 16, 2013
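The query fields on this slide map naturally onto a small record type. The sketch below is only an illustration: the class name `Query`, its field names, and the reading of Tend as the last moment at which answers are accepted are assumptions, not details given in the slides.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple

@dataclass(frozen=True)
class Query:
    qid: int
    sql: str
    # (low, high) inclusive bounds; high=None means open-ended.
    buckets: Tuple[Tuple[int, Optional[int]], ...]
    epsilon: float      # DP parameter
    t_end: datetime     # assumed: deadline for accepting answers

    def accepts_answers(self, now: datetime) -> bool:
        """Assumed semantics: answers are accepted up to and including t_end."""
        return now <= self.t_end

# The example query from the slide.
EXAMPLE_QUERY = Query(
    qid=123,
    sql="SELECT age FROM splitx WHERE gender='male'",
    buckets=((0, 19), (20, 39), (40, 59), (60, None)),
    epsilon=1.0,
    t_end=datetime(2013, 8, 16, 23, 59, 59),
)
```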

Page 22: SplitX: High-Performance       Private Analytics

2) Query answering

Client answers queries

Mixes add differentially private noise

Mixes shuffle answers

Aggregator generates query results

Page 23: SplitX: High-Performance       Private Analytics

Step 1: client answers queries

Client executes query over its local data and generates an answer

‘1’ or ‘0’ per bucket

Encoded as a bit-vector

Page 24: SplitX: High-Performance       Private Analytics

Step 1: client answers queries

Client splits its answer, and sends the split answers with the query ID to the two mixes, respectively.

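[Figure: the client sends “QID, answer” to Mix1 and Mix2, which sit between the client and the aggregator; the analyst is connected to the aggregator]

Mix knows which query a client answered. Privacy violation!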

Page 25: SplitX: High-Performance       Private Analytics

Step 2: mixes add DP noise

Each mix individually adds some random bit-vectors as the differentially private noise

How many bit-vectors are needed? The number depends on c, the number of clients queried, and ε, the DP parameter.

[Figure: Mix1 holds split answers 0100, 1110, … and Mix2 holds 1101, 1001, …; each mix appends random bit-vectors as noise (e.g., 0111, … at Mix1 and 0101, … at Mix2)]
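The noise step can be sketched as follows. The number of noise vectors `n` is passed in as a parameter because the slide's formula for it is not reproduced here; the function names are hypothetical.

```python
import secrets
from typing import List

def random_bit_vector(width: int) -> List[int]:
    """One noise vector: an independent fair coin flip per bucket."""
    return [secrets.randbits(1) for _ in range(width)]

def add_noise(split_answers: List[List[int]], n: int, width: int) -> List[List[int]]:
    """Return the mix's c split answers followed by n random noise vectors.

    Assumption: n is computed elsewhere from c and the DP parameter.
    The input list is not modified.
    """
    return split_answers + [random_bit_vector(width) for _ in range(n)]

if __name__ == "__main__":
    mix1 = add_noise([[0, 1, 0, 0], [1, 1, 1, 0]], n=3, width=4)
    print(len(mix1))  # 5 = c + n
```

When the aggregator later XORs a noise vector from one mix with a noise vector from the other, each resulting bit is still a fair coin, so every bucket count receives random noise that neither mix can predict alone.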

Page 26: SplitX: High-Performance       Private Analytics

Step 3: mixes shuffle split answers

Each mix maintains c+n split answers.

Mixes shuffle the split answers for each column (i.e., bucket) in a synchronized way.

[Figure: before the shuffle, Mix1 holds 0100, 1110, …, 0111, … and Mix2 holds 1101, 1001, …, 0101, …; after the shuffle, Mix1 holds 1110, 0111, …, 0100, … and Mix2 holds 1101, 1101, …, 0001, …]
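A synchronized per-column shuffle can be sketched as below. The sketch assumes the two mixes have agreed on a shared seed out of band, which is an assumption made for illustration; `random.Random` is used only to keep the example short and is not a cryptographically secure generator. The name `shuffle_columns` is hypothetical.

```python
import random
from typing import List

def shuffle_columns(rows: List[List[int]], seed: int) -> List[List[int]]:
    """Shuffle each column (bucket) independently with a seeded generator.

    Two mixes calling this with the same seed and the same number of rows
    apply the same permutation to each column, so matching split bits
    stay aligned across the mixes.
    """
    if not rows:
        return []
    rng = random.Random(seed)
    width = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(width)]
    for column in columns:
        rng.shuffle(column)  # the permutation depends only on length and rng state
    return [[columns[j][i] for j in range(width)] for i in range(len(rows))]

if __name__ == "__main__":
    mix1 = [[0, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1]]
    mix2 = [[1, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]
    print(shuffle_columns(mix1, seed=7))
    print(shuffle_columns(mix2, seed=7))
```

Shuffling column by column breaks the link between the bits of one client's answer, while the XOR of aligned bits, and therefore each bucket's total, is unchanged.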

Page 27: SplitX: High-Performance       Private Analytics

Mixes transmit shuffled answers

Each mix transmits the shuffled split answers to the aggregator.

[Figure: Mix1 and Mix2 each send c+n shuffled split answers to the aggregator]

Page 28: SplitX: High-Performance       Private Analytics

Step 4: aggregator generates query result

Join each bit position in the two split answer arrays.

Sum up the values for each bucket.

Obtain the noisy count for each bucket.

Mix1: 1110, 0111, …, 0100, …

Mix2: 1101, 1101, …, 0001, …

Agg (Mix1 ⊕ Mix2): 0011, 1010, …, 0101, …
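The aggregator's join-and-sum step can be sketched as below, using the three bit-vectors visible on the slide. The function names are hypothetical, and the counts it returns still include the noise added by the mixes.

```python
from typing import List

def bits(s: str) -> List[int]:
    """Convert a string such as '1110' into a list of bits."""
    return [int(ch) for ch in s]

def join_and_count(mix1_rows: List[List[int]], mix2_rows: List[List[int]]) -> List[int]:
    """XOR aligned bits from the two mixes, then sum each column (bucket)."""
    if len(mix1_rows) != len(mix2_rows):
        raise ValueError("both mixes must send the same number of split answers")
    width = len(mix1_rows[0]) if mix1_rows else 0
    counts = [0] * width
    for row1, row2 in zip(mix1_rows, mix2_rows):
        for j in range(width):
            counts[j] += row1[j] ^ row2[j]
    return counts

if __name__ == "__main__":
    mix1 = [bits("1110"), bits("0111"), bits("0100")]
    mix2 = [bits("1101"), bits("1101"), bits("0001")]
    # Joined rows are 0011, 1010, 0101, as on the slide.
    print(join_and_count(mix1, mix2))  # [1, 1, 2, 2]
```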

Page 29: SplitX: High-Performance       Private Analytics

Privacy issue at the mixes

Client splits the answer, and sends the split answers with the query ID to the two mixes.

Mix knows which query a specific client answered!

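[Figure: the client sends “QID, answer” to Mix1 and Mix2, which sit between the client and the aggregator; the analyst is connected to the aggregator]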

Page 30: SplitX: High-Performance       Private Analytics

Solution: double-splitting

[Figure: double-splitting of “QID, answer” among the client, Mix1, Mix2, and the aggregator]

Page 31: SplitX: High-Performance       Private Analytics

Duplicate answer detection

A client can answer a query many times!

How to detect and remove duplicate answers?

Triple-splitting is needed

Section 5 in the paper.

Page 32: SplitX: High-Performance       Private Analytics

Outline

Related work

SplitX system: key insights, system design, performance comparison, implementation & deployment

Conclusion

Page 33: SplitX: High-Performance       Private Analytics

Computational overhead

Three to five orders of magnitude more efficient in computation than previous systems

Compared systems: PDDP [NSDI’12] and Akkus et al. [CCS’12]. “A” is the number of buckets that a client reports.

Page 34: SplitX: High-Performance       Private Analytics

Implementation

Client side: Google Chrome extension that captures webpages browsed, searches made, and extensions installed

Server side (mix + aggregator): web services on Jetty, with RPCs defined in the Thrift language

Page 35: SplitX: High-Performance       Private Analytics

Deployment

Query results from a 416-client deployment

Most visited websites: google, facebook, youtube

Most used apps: gmail, youtube, google drive

91% of clients made ≤50 searches / day

70% of clients visited >50 webpages / day

97% of clients visited ≤100 websites / day

Page 36: SplitX: High-Performance       Private Analytics

Conclusion

SplitX: a high-performance private analytics system

Orders of magnitude more efficient than previous systems

Resistant to answer pollution

Key insights: XOR-based encryption and query buckets