Dual Query: Practical Private Query Release for High Dimensional Data

Dual Query: Practical Private Query Release for High Dimensional Data. Speaker: Steven Wu, University of Pennsylvania. ICML 2014. Joint work with Marco Gaboardi, Emilio Jesús Gallego Arias, Justin Hsu, and Aaron Roth.

Transcript of Dual Query: Practical Private Query Release for High Dimensional Data

Dual Query: Practical Private Query Release for High Dimensional Data

Speaker: Steven Wu, University of Pennsylvania

ICML 2014

Joint work with Marco Gaboardi, Emilio Jesús Gallego Arias, Justin Hsu, Aaron Roth

Sensitive Database

(Medical Records)

Queries

Release answers that preserve privacy

Private Query Release

D

Differential Privacy

Algorithm

ratio bounded

Alice, Bob, Chris, Donna, Ernie, Xavier

Differential Privacy (DMNS06)

• An algorithm A over databases with records from a domain X and with range R satisfies ε-differential privacy if for every outcome r in R and every pair of databases D, D’ differing in one record:

Pr[ A(D) = r ] ≤ exp(ε) · Pr[ A(D’) = r ]   (for small ε, exp(ε) ≈ 1 + ε)

Useful Properties:

• Strong, worst-case notion of privacy

• Similar to stability for learning algorithms
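As a concrete check of the definition above, here is a minimal sketch using randomized response on a database holding a single one-bit record. Randomized response is a standard textbook mechanism and is not shown in the talk; the function names are hypothetical. The bound is checked in its standard exp(ε) form, which is close to 1 + ε when ε is small.

```python
import math

def rr_output_prob(true_bit, output_bit, epsilon):
    """Pr[mechanism outputs output_bit] when the record holds true_bit."""
    # Report the true bit with probability e^eps / (1 + e^eps), else flip it.
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return p_truth if output_bit == true_bit else 1.0 - p_truth

def worst_case_ratio(epsilon):
    """Largest Pr[A(D) = r] / Pr[A(D') = r] over outcomes r and neighbors D, D'."""
    return max(
        rr_output_prob(d, r, epsilon) / rr_output_prob(1 - d, r, epsilon)
        for d in (0, 1)   # the record in D; the neighbor D' holds 1 - d
        for r in (0, 1)   # every possible outcome
    )
```

The worst-case ratio equals exp(ε), so the mechanism meets the definition with no slack.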

More Formally

Release approximate answers to a large collection of queries with

Privacy and Accuracy

Answer Exponentially Many Queries

• Privately learn a distribution D’ approximating D

True Database Approximate Database

Learning Algorithm

Approximately Same Answers on the Queries

Learn from Learning Theory

• [DRV08]: query release via boosting

• [HR10]: use multiplicative weights (MW) update algorithm to learn a distribution

• [HLM12]: experimentally evaluated the MW algorithm, performs well for ≤ 80 attributes

What is the bottleneck?

The algorithm operates on a distribution over all possible data records:

Exponential in d, the number of attributes!
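The multiplicative weights approach and its bottleneck can be sketched as follows. This is a simplified, non-private illustration: it picks the worst query exactly and uses the exact answer, whereas a private version would select and measure with noise. The function name `mw_learn`, the binary-attribute encoding, and the update step size are assumptions made for the sketch. Note that the table of all records has 2**d rows, which is what stops this approach from scaling.

```python
import itertools
import numpy as np

def mw_learn(data, queries, rounds):
    """Non-private MW sketch: learn a distribution over all 2**d records."""
    data = np.asarray(data, dtype=int)
    d = data.shape[1]
    # Every possible record: 2**d rows. This table is the bottleneck.
    domain = np.array(list(itertools.product([0, 1], repeat=d)))
    # query_table[j, i] = value of query j on the i-th possible record.
    query_table = np.array([[q(x) for x in domain] for q in queries], dtype=float)
    # Histogram of the true database over the full domain.
    index = data @ (2 ** np.arange(d)[::-1])
    true_hist = np.bincount(index, minlength=2 ** d) / len(data)
    true_answers = query_table @ true_hist

    approx = np.full(2 ** d, 1.0 / 2 ** d)  # start from the uniform distribution
    errors = []
    for _ in range(rounds):
        gap = true_answers - query_table @ approx
        worst = int(np.argmax(np.abs(gap)))       # query with the largest error
        errors.append(float(abs(gap[worst])))
        # Reweight records in the direction that shrinks the error on that query.
        approx = approx * np.exp(query_table[worst] * gap[worst] / 2.0)
        approx /= approx.sum()
    return approx, errors
```

With d = 3 the distribution has 8 entries; with d = 80 it would need 2**80, which is why the approach only works for a modest number of attributes.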

Impossibility Result

• No private algorithm can answer an exponentially large collection of queries efficiently and accurately

• Shown by a line of lower bounds: [DNRRV09] [Ullman-Vadhan11] [Ullman13] [BUV14]

• Problem theoretically hard in the worst case

• But can we do something in practice? (not with exponential space)

Query Release as Zero-Sum Game

Query Release Game

Data Player actions

Query Player actions

Maximize Error

Minimize Error

Approximate Equilibrium Implies Accuracy
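To see why an approximate equilibrium implies accuracy, here is a small sketch of the game's payoff matrix. The payoff convention q(D) - q(x) and the helper names are assumptions for illustration; the absolute value stands in for closing the query class under negation. The expected payoff of a mixed data strategy against a query is exactly that strategy's error on the query, so a data strategy that holds the query player's payoff to at most α answers every query to within α.

```python
import numpy as np

def payoff_matrix(query_table, true_hist):
    """payoff[x, q] = q(D) - q(x) for each record x and query q."""
    true_answers = query_table @ true_hist          # q(D) for every query
    return true_answers[None, :] - query_table.T    # shape: records x queries

def query_player_value(data_strategy, payoff):
    """Best payoff the query player can get against a mixed data strategy."""
    # For each query this equals q(D) - q(data_strategy), i.e. the error.
    expected = data_strategy @ payoff
    return float(np.max(np.abs(expected)))
```

Playing the true database's own histogram gives the query player a payoff of zero, while the uniform distribution gives it the uniform distribution's worst-case error.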

Computing the Equilibrium

Multiplicative Weights vs. Best Response

Data Player Query Player

Converge to Approximate Equilibrium

exponential size distribution

Dual Approach

Multiplicative Weights vs. Best Response

Data Player Query Player

Solve an NP-Hard Problem
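The dual approach can be sketched as below: the query player runs multiplicative weights over the queries (a small object), and the data player best-responds with a single record per round. This is a hedged illustration only. The name `dual_sketch`, the step size `eta`, the sample count, and the brute-force best response over an explicit table of records are all assumptions; no privacy accounting is done, and a real implementation would not enumerate the records.

```python
import numpy as np

def dual_sketch(query_table, true_answers, rounds, samples, eta, seed=0):
    """MW over queries vs. best-response records; returns a synthetic histogram."""
    rng = np.random.default_rng(seed)
    # Close the query class under negation so both signs of error are penalized.
    table = np.vstack([query_table, 1.0 - query_table])
    answers = np.concatenate([true_answers, 1.0 - true_answers])
    n_queries, n_records = table.shape

    weights = np.full(n_queries, 1.0 / n_queries)   # query player's distribution
    counts = np.zeros(n_records)
    for _ in range(rounds):
        # Only a sample of queries is shown to the data player.
        sampled = rng.choice(n_queries, size=samples, p=weights)
        # Payoff of record x on sampled query q is q(D) - q(x).
        sample_payoff = answers[sampled][None, :] - table[sampled].T
        best = int(np.argmin(sample_payoff.sum(axis=1)))  # brute-force best response
        counts[best] += 1
        # MW: raise the weight of queries on which the chosen record errs.
        weights = weights * np.exp(eta * (answers - table[:, best]))
        weights /= weights.sum()
    return counts / rounds   # empirical distribution of the chosen records
```

The state kept across rounds is one weight per query plus the list of chosen records, so nothing of size 2**d is stored; the cost has moved into the best-response step.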

Best Response Problem

• Minimize error w.r.t. the query player’s distribution

• Concisely represented but NP-Hard

• Can be encoded as an integer program

Send it to CPLEX Solver

Don’t Need to Optimize Exactly

If the optimization problem is too hard, stop CPLEX and return the current solution
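The idea of stopping early and keeping the current solution can be illustrated with a simple anytime local search. This is a stand-in, not the integer program from the talk, and it does not call CPLEX. The query encoding (conjunctions of attribute literals), the objective (number of sampled queries the record satisfies), and the function names are assumptions for the sketch.

```python
import random

def satisfied(record, query):
    """A query is a conjunction of (attribute_index, required_bit) literals."""
    return all(record[i] == bit for i, bit in query)

def anytime_best_response(queries, d, budget, seed=0):
    """Search for a record satisfying many queries; stop when the budget runs out."""
    rng = random.Random(seed)
    record = [rng.randint(0, 1) for _ in range(d)]
    score = sum(satisfied(record, q) for q in queries)
    for _ in range(budget):      # the budget plays the role of a solver time limit
        i = rng.randrange(d)
        record[i] ^= 1           # try flipping one attribute
        new_score = sum(satisfied(record, q) for q in queries)
        if new_score >= score:
            score = new_score    # keep the flip
        else:
            record[i] ^= 1       # undo it
    return record, score         # the current solution, optimal or not
```

Whatever the budget, the function returns a valid record, and a larger budget never returns a worse score for the same seed.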

Accuracy

Accuracy versus ε

500,000 queries; 17,770 attributes

Scalability

Accuracy versus number of attributes

100,000 queries; up to 512,000 attributes

Scalability

Runtime (secs) versus Number of Attributes

100,000 queries; up to 512,000 attributes

Take-Away

• Private Query Release for High Dimensional Data is Hard

• Reconfigure Existing Algorithm to Isolate the Hard Part

• Dual Query: an algorithm that performs well in practice
