Data Privacy by Programming Language Design

Data Privacy by Programming Language Design, by David Darais (University of Vermont)

Transcript of Data Privacy by Programming Language Design

Page 1: Data Privacy by Programming Language Design

Data Privacy by Programming Language Design

David Darais, University of Vermont

Page 2: Data Privacy by Programming Language Design

Your Personal Data

Page 3: Data Privacy by Programming Language Design

Good uses of data

Improve a product

Enable better business decisions

Support fundamental research

Page 4: Data Privacy by Programming Language Design

Bad uses of data

Stalking and harassment

Unfair business advantages

Threats and blackmail

Page 5: Data Privacy by Programming Language Design

Good ⌁ Bad

Page 6: Data Privacy by Programming Language Design

Good ⌁ Bad

Page 7: Data Privacy by Programming Language Design

Good ⌁ Bad

Page 8: Data Privacy by Programming Language Design

Non-solution: Anonymization

Before anonymization:

First Name | Last Name | University
David      | Darais    | U Vermont
Éric       | Tanter    | U Chile
Federico   | Olmedo    | U Chile

After anonymization:

First Name | Last Name | University
#####      | #####     | U Vermont
#####      | #####     | U Chile
#####      | #####     | U Chile

Page 9: Data Privacy by Programming Language Design

Non-solution: Anonymization

Page 10: Data Privacy by Programming Language Design

Non-solution: Anonymization

Dataset                           | Visible                              | Auxiliary Data                                     | Attack
Anonymized Netflix Viewer Data    | ID + ratings + dates                 | IMDB                                               | Re-identification (what movies you watch)
NYC Taxi Data                     | ID + time + coordinates + fare + tip | Geotagged celebrity photos                         | Re-identification (celebrity trips + tip amounts)
Anonymized AOL Search Data        | ID + query text                      | Ad-hoc                                             | Re-identification (search history)
Massachusetts Hospital Visit Data | All except: name + address + SSN     | Public voting records (name, address, birth date)  | Re-identification (health records, diagnoses + prescriptions)

Page 11: Data Privacy by Programming Language Design
Page 12: Data Privacy by Programming Language Design

For any ‘anonymized’ dataset, either the data is useless, or there exists an auxiliary dataset that re-identifies it.

–Dwork & Roth (The Algorithmic Foundations of Differential Privacy)

Page 13: Data Privacy by Programming Language Design

Almost-solution: Aggregate statistics

Page 14: Data Privacy by Programming Language Design

artificial intelligence =

aggregate statistics ?

Page 15: Data Privacy by Programming Language Design
Page 16: Data Privacy by Programming Language Design
Page 17: Data Privacy by Programming Language Design
Page 18: Data Privacy by Programming Language Design
Page 19: Data Privacy by Programming Language Design
Page 20: Data Privacy by Programming Language Design
Page 21: Data Privacy by Programming Language Design

Page 22: Data Privacy by Programming Language Design

Page 23: Data Privacy by Programming Language Design
Page 24: Data Privacy by Programming Language Design

Almost-solution: Aggregate statistics

Clearly not acceptable for small datasets

Clearly acceptable for “well-behaved” massive datasets

Central idea behind modern interpretations of “privacy-sensitive data analysis”

Must be careful with artificial intelligence applications

Page 25: Data Privacy by Programming Language Design

EU and GDPR

Data breaches (security/access) = financial liability

Fines: MAX( €20 Million, 4% annual global turnover )

Sensitive vs aggregate data – only liable for sensitive
More sensitive data = more financial risk
Aggregate data = cannot be re-identified

Also: California CCPA modeled on GDPR

Page 26: Data Privacy by Programming Language Design

Security Privacy

Page 27: Data Privacy by Programming Language Design

Security Privacy

SENSITIVE DATA

Access

Page 28: Data Privacy by Programming Language Design

Security Privacy

SENSITIVE DATA

SENSITIVE DATA

AGGREGATE STATISTICS

Access

Computation

Page 29: Data Privacy by Programming Language Design

Differential Privacy =

Aggregate Statistics +

Random noise =

No re-identification guarantees =

0 Financial liability (GDPR)

Page 30: Data Privacy by Programming Language Design

Differential Privacy

Program Analysis

Duet

Deep Learning

Page 31: Data Privacy by Programming Language Design

Differential Privacy

How many people named Éric live in Chile?

Page 32: Data Privacy by Programming Language Design

How many people named Éric live in Chile?

1. How sensitive is this query?

+ Éric

= 60,000 = 60,001

Page 33: Data Privacy by Programming Language Design

How many people named Éric live in Chile?

1. How sensitive is this query?

+ <anyone>

= 60,000 = 60,001

Page 34: Data Privacy by Programming Language Design

How many people named Éric live in Chile?

1. How sensitive is this query?

+ <anyone>

= 60,000 = 60,001

sensitivity = 1

Page 35: Data Privacy by Programming Language Design

How many people named Éric live in Chile?

2. Add noise to the result with scale ~ sensitivity

+ <anyone>

= 60,000 + <noise>

= 60,001 + <noise>

Page 36: Data Privacy by Programming Language Design

How many people named Éric live in Chile?

2. Add noise to the result with scale ~ sensitivity

+ <anyone>

= 60,000 + <noise>

= 60,001 + <noise>

Page 37: Data Privacy by Programming Language Design

+ Éric ? or

Page 38: Data Privacy by Programming Language Design

How many people named Éric live in Chile?

How many people named Éric live at <specific address>?

Page 39: Data Privacy by Programming Language Design

How many people named Éric live in Chile?

How many people named Éric live at <specific address>?

Same sensitivity (= 1)

Same amount of noise

Very different utility

Page 40: Data Privacy by Programming Language Design


= 59,900 = 60,100

How many people named Éric live in Chile?

How many people named Éric live at <specific address>?

“roughly 60,020 people named Éric live in Chile”

Page 41: Data Privacy by Programming Language Design

= -100 = 100

How many people named Éric live in Chile?

How many people named Éric live at <specific address>?

“roughly 37 people named Éric live at <specific address>”


Page 42: Data Privacy by Programming Language Design

1 Sample

Page 43: Data Privacy by Programming Language Design

3 Samples

Page 44: Data Privacy by Programming Language Design

6 Samples

Page 45: Data Privacy by Programming Language Design

1,000,000 Samples

Page 46: Data Privacy by Programming Language Design

Privacy Cost (ε)

How many samples are needed to re-identify a participant?

Quantity = distance between distributions

Quantity = directly interpretable as privacy “budget”
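To make the 1/3/6/1,000,000-samples intuition concrete, here is a small simulation sketch (not from the talk; it assumes NumPy and a simple likelihood-ratio attacker): the more independent noisy releases of the same 1-sensitive query an attacker sees, the more reliably they can tell apart the two neighbouring databases.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, ans_a, ans_b = 0.5, 60_000, 60_001   # neighbouring true answers

def guesses_a(n_samples):
    # Likelihood-ratio test over n independent eps-DP Laplace releases of "a".
    samples = ans_a + rng.laplace(scale=1 / eps, size=n_samples)
    llr = eps * np.sum(np.abs(samples - ans_b) - np.abs(samples - ans_a))
    return llr > 0    # guesses database "a" when the evidence favours it

for n in (1, 3, 6, 1_000_000):
    accuracy = np.mean([guesses_a(n) for _ in range(50)])
    print(n, accuracy)   # accuracy climbs toward 1 as samples accumulate
```

The attacker's accuracy approaches 1 as the number of samples grows, which is exactly why repeated queries must be billed against a privacy budget.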

Page 47: Data Privacy by Programming Language Design

“Differential privacy describes a promise, made by a data holder, or curator, to a data subject: ‘You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.’”

–Dwork & Roth (The Algorithmic Foundations of Differential Privacy)

Page 48: Data Privacy by Programming Language Design

DP Theorems

Mechanism: Adding Laplace noise scaled by ~s/ε to an s-sensitive query achieves ε differential privacy

Post-processing: A differentially private result can be used any number of times, and for any purpose, including arbitrary linking with auxiliary data

Composition: An ε₁-DP query followed by an ε₂-DP query is (ε₁+ε₂)-DP

New data = fresh budget
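As a minimal sketch of the Mechanism and Composition theorems above (plain Python with NumPy, not Duet code; the numbers are illustrative):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon,
                      rng=np.random.default_rng()):
    # Laplace noise with scale = sensitivity / epsilon gives epsilon-DP.
    return true_answer + rng.laplace(scale=sensitivity / epsilon)

# Counting queries are 1-sensitive.
q1 = laplace_mechanism(60_000, sensitivity=1, epsilon=0.1)   # costs ε = 0.1
q2 = laplace_mechanism(12_345, sensitivity=1, epsilon=0.2)   # costs ε = 0.2
# Composition: releasing both answers costs ε = 0.1 + 0.2 = 0.3 in total.
# Post-processing: q1 and q2 can then be reused, published, or joined with
# auxiliary data without any further privacy cost.
```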

Page 49: Data Privacy by Programming Language Design

Who is using DP?

Apple

Google

US NIST

US Census

GDPR working documents

Page 50: Data Privacy by Programming Language Design

DP Challenges

How to achieve better utility/accuracy?

Privacy frameworks (hard proofs): (ε,δ), Rényi, ZC, TZC

Sensitivity frameworks (hard to compose): local sensitivity

Stronger composition (less expressive): advanced composition

Smarter “billing” (hard to use): independent budget for different sensitive attributes

Page 51: Data Privacy by Programming Language Design

DP Challenges

What if I don’t trust the computation provider?

Decentralized model: local differential privacy

Cryptographic techniques: secure multi-party computation, secure enclaves

Page 52: Data Privacy by Programming Language Design
Page 53: Data Privacy by Programming Language Design

Differential Privacy

Program Analysis

Duet

Deep Learning

Page 54: Data Privacy by Programming Language Design

Why Program Analysis

1. How sensitive is the query? (uncomputable in general)
2. Add noise
3. How private is the result? (uncomputable in general)

PA/PL literature about sensitivity analysis for programs (assumed: Laplace noise gives ε privacy) (focus: automation + proofs)

DP literature about privacy analysis for algorithms (assumed: count query is 1-sensitive) (focus: precision + proofs)

Page 55: Data Privacy by Programming Language Design

sensitivity → add noise (mechanism) → privacy

Page 56: Data Privacy by Programming Language Design

Operation        | Assumption                          | Sensitivity
f(x) = x         |                                     | 1-sensitive in x
f(x) = count(x)  |                                     | 1-sensitive in x
f(x,y) = x + y   |                                     | 1-sensitive in x, 1-sensitive in y
f(x,y) = x * y   |                                     | ∞-sensitive in x, ∞-sensitive in y
f(x) = g(h(x))   | g is α-sensitive, h is β-sensitive  | αβ-sensitive in x

Page 57: Data Privacy by Programming Language Design

f(x,y) = k(g(x) + h(y))

Page 58: Data Privacy by Programming Language Design

f(x,y) = k(g(x) + h(y))

g is α-sensitive, h is β-sensitive, k is γ-sensitive

⟹ f is γ(α+0)-sensitive in x
   f is γ(0+β)-sensitive in y
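To make the scaling rule concrete, here is a toy calculation with made-up sensitivities (illustrative Python, not Duet syntax):

```python
# f(x, y) = k(g(x) + h(y)), with g α-sensitive, h β-sensitive, k γ-sensitive.
alpha, beta, gamma = 2.0, 3.0, 0.5

sens_f_in_x = gamma * (alpha + 0)   # = 1.0: wiggling x by d moves f by at most 1.0·d
sens_f_in_y = gamma * (0 + beta)    # = 1.5: wiggling y by d moves f by at most 1.5·d
```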

Page 59: Data Privacy by Programming Language Design

f ∈ ℝ →ˢ ℝ

Page 60: Data Privacy by Programming Language Design

Sensitivity

f(x) is s-sensitive in x iff

when |v₁ - v₂| ≤ d

then |f(v₁) - f(v₂)| ≤ sd

When the input wiggles by some amount, how much does the output wiggle.
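A tiny numeric check of this definition (my own example, not from the slides): f(x) = 3x + 7 is 3-sensitive, because the output moves by exactly three times the input distance.

```python
def f(x):
    return 3 * x + 7

v1, v2 = 4.0, 5.0
d = abs(v1 - v2)                    # the input wiggles by d = 1
assert abs(f(v1) - f(v2)) <= 3 * d  # the output wiggles by at most s·d, with s = 3
```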

Page 61: Data Privacy by Programming Language Design

Sensitivity

|4 - 5| = 1 ∈ ℝ

Page 62: Data Privacy by Programming Language Design

Sensitivity

|4 - 5| = 1   ∈ ℝ

|db - (db + Éric)| = 1   ∈ 𝔻𝔹

Page 63: Data Privacy by Programming Language Design

Sensitivity

|4 - 5| = 1   ∈ ℝ

|db - (db + Éric)| = 1   ∈ 𝔻𝔹

Arbitrary metric space

Page 64: Data Privacy by Programming Language Design

Sensitivity

f(x) is s-sensitive in x iff

when |v₁ - v₂|τ₁ ≤ d

then |f(v₁) - f(v₂)|τ₂ ≤ ds

When the input wiggles by some amount, how much does the output wiggle.

Page 65: Data Privacy by Programming Language Design

Privacy

f(x) is ε-private in x iff

when |v₁ - v₂|τ₁ ≤ 1

then Pr[f(v₁)] ≤ eᵋ Pr[f(v₂)]

When the input wiggles by one, how close are the resulting distributions.

Page 66: Data Privacy by Programming Language Design

Privacy

f(x) is ε-private in x iff

when |v₁ - v₂|τ₁ ≤ 1

then max( Pr[f(v₁)] / Pr[f(v₂)] ) ≤ eᵋ

When the input wiggles by one, how close are the resulting distributions.

Page 67: Data Privacy by Programming Language Design

Privacy

f(x) is ε-private in x iff

when |v₁ - v₂|τ₁ ≤ 1

then max ㏑( Pr[f(v₁)] / Pr[f(v₂)] ) ≤ ε

When the input wiggles by one, how close are the resulting distributions.

Page 68: Data Privacy by Programming Language Design

Privacy

f(x) is ε-private in x iff

when |v₁ - v₂|τ₁ ≤ d

then max ㏑( Pr[f(v₁)] / Pr[f(v₂)] ) ≤ dε

When the input wiggles by one, how close are the resulting distributions.

Page 69: Data Privacy by Programming Language Design

Privacy

f(x) is ε-private in x iff

when |v₁ - v₂|τ₁ ≤ d

then |f(v₁) - f(v₂)|D ≤ dε

where |f(x) - f(y)|D ≜ max ㏑(Pr[f(x)]/Pr[f(y)])
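As a sanity check of this metric view (a sketch assuming SciPy, not part of the talk): for a 1-sensitive query released with Laplace noise of scale 1/ε, the pointwise log-ratio of the output densities on neighbouring inputs never exceeds ε, so |f(v₁) - f(v₂)|D ≤ ε.

```python
import numpy as np
from scipy.stats import laplace

eps = 0.5
xs = np.linspace(59_950, 60_050, 2001)
# Densities of the noisy answer for the two neighbouring true answers.
log_ratio = (laplace.logpdf(xs, loc=60_000, scale=1 / eps)
             - laplace.logpdf(xs, loc=60_001, scale=1 / eps))
assert np.max(np.abs(log_ratio)) <= eps + 1e-9   # max-divergence ≤ ε
```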

Page 70: Data Privacy by Programming Language Design

Privacy = Sensitivity
Privacy Analysis = Sensitivity Analysis

Page 71: Data Privacy by Programming Language Design

laplace ∈ ℝ →ᵋ 𝒟(ℝ)

release ∈ τ →∞ 𝒟(τ)

post-pr ∈ 𝒟(τ₁) , (τ₁ →∞ 𝒟(τ₂)) → 𝒟(τ₂)

Privacy = Sensitivity
Privacy Analysis = Sensitivity Analysis

Page 72: Data Privacy by Programming Language Design

PA Challenges

Complexity (hopefully linear)

Precision (hopefully good)

Expressiveness (objects, HO functions, abstraction)

Exotic DP definitions (no definable metric)

Trust (design? implementation?)

Page 73: Data Privacy by Programming Language Design

Differential Privacy

Program Analysis

Duet

Deep Learning

Dwork, McSherry, Nissim, Smith – 2006; Dwork, Roth – 2014

Reed, Pierce – 2010

Near, Darais, (+9) – 2019

Page 74: Data Privacy by Programming Language Design

Duet: Goals

Support stronger variants of DP (ε,δ)

Support machine learning algorithms

Precise analysis

Tractable algorithm

Trustworthy design

Page 75: Data Privacy by Programming Language Design
Page 76: Data Privacy by Programming Language Design

ε-DP

f(x) is ε-private in x iff

when |v₁ - v₂|τ₁ ≤ 1

then Pr[f(v₁)] ≤ eᵋ Pr[f(v₂)]

When the input wiggles by one, how close are the resulting distributions.

Page 77: Data Privacy by Programming Language Design

(ε,δ)-DP

f(x) is (ε,δ)-private in x iff

when |v₁ - v₂|τ₁ ≤ 1

then Pr[f(v₁)] ≤ eᵋ Pr[f(v₂)] + δ

When the input wiggles by one, how close are the resulting distributions, with high (1-δ) probability.
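The mechanism usually paired with (ε,δ)-DP is Gaussian noise. Here is a sketch of one standard calibration from Dwork & Roth (valid for ε < 1); this is plain Python/NumPy, not Duet's gauss/mgauss primitives.

```python
import math
import numpy as np

def gauss_mechanism(true_answer, sensitivity, epsilon, delta,
                    rng=np.random.default_rng()):
    # sigma = s · sqrt(2 · ln(1.25/δ)) / ε yields an (ε, δ)-DP release.
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return true_answer + rng.normal(0.0, sigma)

noisy = gauss_mechanism(60_000, sensitivity=1, epsilon=0.5, delta=1e-5)
```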

Page 78: Data Privacy by Programming Language Design

ε-DP:

f ∈ ℝ →² ℝ    laplace ∈ ℝ →ᵋ 𝒟(ℝ)
——————————————————————
laplace∘f ∈ ℝ →²ᵋ 𝒟(ℝ)

Page 79: Data Privacy by Programming Language Design

ε-DP:

f ∈ ℝ →² ℝ    laplace ∈ ℝ →ᵋ 𝒟(ℝ)
——————————————————————
laplace∘f ∈ ℝ →²ᵋ 𝒟(ℝ)

(ε,δ)-DP:

f ∈ ℝ →² ℝ    gauss ∈ ℝ →ε,δ 𝒟(ℝ)
——————————————————————
gauss∘f ∈ ℝ →2ε,2eᵋδ 𝒟(ℝ)

Page 80: Data Privacy by Programming Language Design

Duet Design

Scaling is *very* imprecise – the language should disallow it
In previous analyses, scaling is pervasive – no way out

We separate languages for sensitivity and privacy
Add APIs for data analysis and machine learning
Proofs of privacy for any “well-typed” program

Page 81: Data Privacy by Programming Language Design
Page 82: Data Privacy by Programming Language Design
Page 83: Data Privacy by Programming Language Design
Page 84: Data Privacy by Programming Language Design
Page 85: Data Privacy by Programming Language Design

θ X y

Page 86: Data Privacy by Programming Language Design

θ X y

Page 87: Data Privacy by Programming Language Design

θ X y

Page 88: Data Privacy by Programming Language Design

θ X y

Page 89: Data Privacy by Programming Language Design

θ X y

how to improve θ?

Page 90: Data Privacy by Programming Language Design

ℓ = L2-norm (required by the mgauss DP mechanism)

θ X y

Page 91: Data Privacy by Programming Language Design

model, data

labels

Page 92: Data Privacy by Programming Language Design

model, data

labels

how to improve the model

Page 93: Data Privacy by Programming Language Design

iters, data, labels, rate

Page 94: Data Privacy by Programming Language Design

iters, data, labels, rate

baby model

Page 95: Data Privacy by Programming Language Design

iters, data, labels, rate

baby model → smarter model
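A rough sketch of the training loop being described, written as ordinary Python/NumPy rather than Duet: each iteration computes per-example gradients, clips them to L2 norm 1 (the norm bound a Gaussian mechanism needs), adds noise, and updates the model; the privacy cost then composes across the iterations.

```python
import numpy as np

def dp_gradient_descent(X, y, iters, rate, eps, delta,
                        rng=np.random.default_rng()):
    theta = np.zeros(X.shape[1])                      # the "baby model"
    sigma = np.sqrt(2 * np.log(1.25 / delta)) / eps   # noise scale per iteration
    for _ in range(iters):
        grads = (X @ theta - y)[:, None] * X          # per-example gradients
        norms = np.maximum(np.linalg.norm(grads, axis=1), 1.0)
        clipped = grads / norms[:, None]              # each row has L2 norm <= 1
        noisy = clipped.sum(axis=0) + rng.normal(0.0, sigma, size=theta.shape)
        theta = theta - rate * noisy / len(X)         # a "smarter model"
    return theta
```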

Page 96: Data Privacy by Programming Language Design

Guaranteed Privacy =

Page 97: Data Privacy by Programming Language Design

Privacy =

Page 98: Data Privacy by Programming Language Design
Page 99: Data Privacy by Programming Language Design
Page 100: Data Privacy by Programming Language Design
Page 101: Data Privacy by Programming Language Design

Duet will be open source on GitHub (soon)

Page 102: Data Privacy by Programming Language Design

Differential Privacy

Program Analysis

Duet

Deep Learning

Page 103: Data Privacy by Programming Language Design

Deep Learning

Gradients:

Bounded sensitivity for convex systems

Unbounded sensitivity for non-convex systems

Deep Learning:

Non-convex

State of the art:

Aggressive clipping during training (to bound sensitivity)
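A sketch of what that clipping looks like for one example's gradient (DP-SGD-style clipping; illustrative, not the talk's code):

```python
import numpy as np

def clip_gradient(grad, clip_norm=1.0):
    # Rescale so the L2 norm is at most clip_norm; this caps each example's
    # influence and thereby bounds the sensitivity of the summed gradient.
    norm = np.linalg.norm(grad)
    return grad * (clip_norm / norm) if norm > clip_norm else grad
```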

Page 104: Data Privacy by Programming Language Design

Deep Learning

Recent results:

Local sensitivity + smoothness instead of global sensitivity (GS)

Analytical derivative can bound local sensitivity (LS) + smoothness

Hypothesis:

Local sensitivity + smoothness for neural networks

Gradient of the gradient via AD^2

Compositional smoothness analysis

Improved accuracy over naive clipping

Page 105: Data Privacy by Programming Language Design

Network diagram: inputs IN1, IN2 → hidden nodes N11, N12 → output node N21 (OUT)

Page 106: Data Privacy by Programming Language Design

n11 = relu(w111*in1 + w112*in2)
n12 = relu(w121*in1 + w122*in2)
n21 = sigm(w211*n11 + w212*n12)
return n21

Network diagram: inputs IN1, IN2 → hidden nodes N11, N12 → output node N21 (OUT)

Page 107: Data Privacy by Programming Language Design

Neural Networks

first order, stateless programs with free variables (weights)

no branching control flow

differentiable

n11 = relu(w111*in1 + w112*in2)
n12 = relu(w121*in1 + w122*in2)
n21 = sigm(w211*n11 + w212*n12)
return n21
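For reference, a directly runnable rendering of this network in Python (a sketch; the w… weights are the free variables the training loop learns):

```python
import math

def relu(x):
    return max(0.0, x)

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def net(in1, in2, w111, w112, w121, w122, w211, w212):
    n11 = relu(w111 * in1 + w112 * in2)
    n12 = relu(w121 * in1 + w122 * in2)
    n21 = sigm(w211 * n11 + w212 * n12)
    return n21
```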

Page 108: Data Privacy by Programming Language Design

NN Training

Analytic gradient used for training

Efficient automatic differentiation algorithms (backprop)

We need the gradient of the gradient (for local sensitivity)

Run backprop again – 2nd order gradient

Page 109: Data Privacy by Programming Language Design

AD

Forward mode (1st derivative): dual numbers <v,d>

Forward mode (2nd derivative): ternary numbers <v,d₁,d₂>

Reverse mode (1st derivative): forward and backward passes

Reverse mode (2nd derivative): FBFB passes

(+ smoothness analysis)
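A sketch of the forward-mode idea with "ternary numbers" ⟨v, d₁, d₂⟩ carrying the value, first, and second derivative (illustrative Python; the Jet name and its operations are mine, not Duet's):

```python
from dataclasses import dataclass

@dataclass
class Jet:
    v: float   # value
    d1: float  # first derivative
    d2: float  # second derivative

def jadd(a, b):
    return Jet(a.v + b.v, a.d1 + b.d1, a.d2 + b.d2)

def jmul(a, b):
    # Product rule for value, first, and second derivative: (ab)'' = a''b + 2a'b' + ab''
    return Jet(a.v * b.v,
               a.v * b.d1 + a.d1 * b.v,
               a.v * b.d2 + 2 * a.d1 * b.d1 + a.d2 * b.v)

# f(x) = x*x + x at x = 3: value 12, f'(3) = 7, f''(3) = 2
x = Jet(3.0, 1.0, 0.0)        # seed: dx/dx = 1, d²x/dx² = 0
y = jadd(jmul(x, x), x)
```

Propagating rules like these through the network body, or running reverse mode twice, yields the second-order gradients referred to above as AD².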

Page 110: Data Privacy by Programming Language Design

Duet Collaborators

JOSEPH P. NEAR, University of Vermont
CHIKE ABUAH, University of Vermont
TIM STEVENS, University of Vermont
PRANAV GADDAMADUGU, University of California, Berkeley
LUN WANG, University of California, Berkeley
NEEL SOMANI, University of California, Berkeley
MU ZHANG, Cornell University
NIKHIL SHARMA, University of California, Berkeley
ALEX SHAN, University of California, Berkeley
DAWN SONG, University of California, Berkeley

Page 111: Data Privacy by Programming Language Design

Duet: PL for DP

+ <me> ? or

Guaranteed Privacy =

Machine Learning Algorithm =

Page 112: Data Privacy by Programming Language Design

(END)