Genomic Privacy - Institute for Mathematics and its …...DTC and genomic privacy From the 23andMe...

Post on 19-Jun-2020

Genomic Privacy: Limits of Individual Detection in a Pool

Sriram Sankararaman (a), Guillaume Obozinski (b), Michael I. Jordan (c), Eran Halperin (d)

(a) Harvard Medical School, (b) INRIA, (c) UC Berkeley, (d) Tel Aviv University

GWAS: Genomewide Association Studies

Cases (rows: individuals, columns: SNPs)

0 1 1 0 0 0 1 0 0 1 0 1
0 1 1 1 0 0 0 0 1 0 1 0
0 1 1 1 0 0 1 0 0 1 0 0
1 1 0 0 0 0 1 1 0 0 0 1

Controls (rows: individuals, columns: SNPs)

0 1 0 0 0 0 1 0 0 1 1 0
0 1 0 1 0 0 0 0 1 1 0 0
0 1 0 1 1 1 1 1 0 0 0 1
0 1 0 0 1 0 1 1 0 0 0 1

GWAS facts

Looking for common SNPs

Frequency above 1%

Chosen to be correlated with unobserved causal variants.

Most of these SNPs have low effect sizes.

Testing about a million SNPs

Bottom line: a large number of samples is needed to have sufficient power.

GWAS so far

600 studies covering around 150 traits (Manolio, 2010)

Power can be increased by combining data from multiple studies.

Tens to hundreds of thousands of participants are common.

Rheumatoid arthritis (5K cases, 17K controls), Alzheimer's disease (7K, 14K), lipid levels and cholesterol (~100K).

Has led to the setting up of central data-sharing repositories such as dbGaP, EGA, and WTCCC.

What about individual privacy?

Some views on privacy and sharing

Give up privacy assurances, e.g. the Personal Genome Project (PGP).

Have streamlined procedures to regulate access to data.

The middle ground?

Separate individual-level and summary data.

Make summary data public.

DTC and genomic privacy

From the 23andMe website:

"23andMe may collaborate with external parties. Under this informed consent, external parties will only have access to pooled data stripped of identifying information. 23andMe will never release your individual-level data to any third party without asking for and receiving your explicit authorization to do so."

Do these measures guarantee privacy of participants?

Individual Detection in a Pool

Cases (rows: individuals, columns: SNPs)

0 1 1 0 0 0 1 0 0 1 0 1
0 1 1 1 0 0 0 0 1 0 1 0
0 1 1 1 0 0 1 0 0 1 0 0
1 1 0 0 0 0 1 1 0 0 0 1

Pooled allele frequencies of the cases: 0.25 1 0.75 0.5 0 ... 0.5 0.25 0.5

Controls (rows: individuals, columns: SNPs)

0 1 0 0 0 0 1 0 0 1 1 0
0 1 0 1 0 0 0 0 1 1 0 0
0 1 0 1 1 1 1 1 0 0 0 1
0 1 0 0 1 0 1 1 0 0 0 1

Pooled allele frequencies of the controls: 0 1 0 0.5 0.5 ... 0.5 0.25 0.5

Query genotype 0 1 1 1 0 0 0 0 1 0 1 0: is this individual among the cases?

High-density SNP arrays can be used to resolve DNA mixtures (Homer et al., PLoS Genetics, 2008).
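The pooled frequencies shown for the cases are simply column means of the genotype matrix. A minimal Python sketch of that step, using the case rows from the slide; the names `pooled_frequencies`, `cases`, and `query` are illustrative and not from the talk:

```python
# Case genotypes from the slide: one row per individual, one column per SNP.
cases = [
    [0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1],
    [0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1],
]


def pooled_frequencies(genotypes):
    """Return the per-SNP allele frequency (column mean) of a pool."""
    n = len(genotypes)
    return [sum(column) / n for column in zip(*genotypes)]


case_freq = pooled_frequencies(cases)
print(case_freq)

# The query genotype from the slide. With individual-level data the answer
# is a lookup; an attacker only sees case_freq and must test statistically.
query = [0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
print(query in cases)
```

The membership lookup on the last line is only possible with individual-level data. It is included to show the ground truth that a test based on the released frequencies tries to recover.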

Identification in Pools

NIH and others removed summary data.

Need a mathematical model of privacy.

Forensics vs Privacy

Forensics: Given data, choose a procedure to maximize power.

Privacy: Select data to expose such that the maximum power attained by an adversary is small.

Bounds matter.

Limits of Individual Detection

Formulate individual detection in a pool as a hypothesis testing problem.

The likelihood-ratio test (LR-test) is optimal for this hypothesis test (Neyman-Pearson lemma): reject the null when

$$L(x) = \frac{\Pr(x \mid H_1)}{\Pr(x \mid H_0)} \ge t(\alpha), \qquad \Pr(L(x) \ge t(\alpha) \mid H_0) = \alpha$$

The power of the LR-test provides an upper bound on the power of any method.

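Limits of Individual Detection

[Figure: the two hypotheses. Null: the individual x is drawn from the population (allele frequencies p) independently of the pool of n individuals X_i. Alternative: x is one of the n pool members, alongside n − 1 others.]

$$L = \sum_{j=1}^{m} \left[ x_j \log\frac{\hat{p}_j}{p_j} + (1 - x_j) \log\frac{1 - \hat{p}_j}{1 - p_j} \right]$$

Here $x_j$ is the tested individual's allele at SNP $j$, $\hat{p}_j$ is the pool frequency, and $p_j$ is the population frequency.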
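The statistic above can be evaluated directly from a genotype and two frequency vectors. A minimal Python sketch, assuming the two frequencies in each log term are the pool frequency and the population frequency (the distinguishing hats are lost in the transcript) and that all frequencies lie strictly between 0 and 1; the function name and the toy numbers are illustrative only:

```python
import math


def lr_statistic(x, pool_freq, pop_freq):
    """Sum over SNPs of x*log(ph/p) + (1-x)*log((1-ph)/(1-p)).

    x         : 0/1 alleles of the tested individual
    pool_freq : per-SNP frequencies in the pool (ph)
    pop_freq  : per-SNP frequencies in the population (p)
    All frequencies must lie strictly inside (0, 1).
    """
    total = 0.0
    for xj, ph, p in zip(x, pool_freq, pop_freq):
        total += xj * math.log(ph / p) + (1 - xj) * math.log((1 - ph) / (1 - p))
    return total


# Toy example (made-up numbers): the pool leans towards the alleles in x.
pop_freq = [0.5, 0.3, 0.2, 0.4]
pool_freq = [0.6, 0.25, 0.3, 0.4]
x = [1, 0, 1, 0]

print(lr_statistic(x, pool_freq, pop_freq))             # positive
print(lr_statistic([0, 1, 0, 1], pool_freq, pop_freq))  # negative
```

A genotype that the pool is shifted towards gives a positive value and the opposite genotype gives a negative one. The test thresholds exactly this quantity.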

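Likelihood-ratio test

What happens for large pools?

For a single SNP, with $Z$ approximately standard normal:

$$\begin{aligned}
L &= x \log\left(\frac{\hat{p}}{p}\right) + (1 - x) \log\left(\frac{1 - \hat{p}}{1 - p}\right) \\
  &\approx \frac{(x - p)(\hat{p} - p)}{p(1 - p)} - \frac{1}{2} \frac{(x - p)^2 (\hat{p} - p)^2}{p^2 (1 - p)^2} \\
  &\approx \frac{1}{\sqrt{n}} \frac{x - p}{\sqrt{p(1 - p)}} Z - \frac{1}{2n} \frac{(x - p)^2}{p(1 - p)} Z^2.
\end{aligned}$$

$$E_0[L] = -\frac{1}{2n}, \quad V_0(L) \approx \frac{1}{n}$$

$$E_1[L] \approx +\frac{1}{2n}, \quad V_1(L) \approx \frac{1}{n}$$

Need SNPs to be common: $a < p < 1 - a$, $a > 0$.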
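The per-SNP moments above can be carried over to the full statistic. A short worked step, assuming $m$ independent SNPs, a normal approximation for the sum, and writing $z_\alpha$ for the upper $\alpha$-quantile of the standard normal (this quantile convention is an assumption, not stated on the slide):

$$\mu_0 \approx -\frac{m}{2n}, \qquad \mu_1 \approx +\frac{m}{2n}, \qquad \sigma_0^2 \approx \sigma_1^2 \approx \frac{m}{n}$$

$$t = \mu_0 + z_\alpha \sigma_0 \;\Longrightarrow\; \Pr(L > t \mid H_1) \approx \Phi\left(\frac{\mu_1 - t}{\sigma_1}\right) = \Phi\left(\sqrt{\frac{m}{n}} - z_\alpha\right)$$

Under these assumptions the power at a fixed false positive rate depends on $m$ and $n$ only through the ratio $m/n$.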

Main Result

$$z_\alpha + z_{1-\beta} \approx \sqrt{\frac{m}{n}}$$

[Figure: densities of the LR statistic under the null (mean µ0) and the alternative (mean µ1), with the areas 1 − α and β marked and the distances z_α σ0 and z_{1−β} σ1 to the threshold.]

Values of m/n for β = 10α:

log10 α    m/n
-2         1.0916
-3         0.5835
-4         0.3954
-5         0.2980
-6         0.2387
-7         0.1988
-8         0.1703
-9         0.1488
-10        0.1322
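The tabulated values can be checked numerically. A minimal sketch using only Python's standard library, reading the table header as β = 10α and the second column as m/n. It assumes that z_q denotes the upper q-quantile of the standard normal and that the relation holds with equality. Both assumptions are a reading of the garbled slide, not something it states; the function names are illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def upper_quantile(q):
    """Return z_q such that Pr(Z > z_q) = q for a standard normal Z."""
    return -_STD_NORMAL.inv_cdf(q)


def snps_per_individual(alpha, beta):
    """Return (z_alpha + z_{1-beta})**2, read here as the ratio m/n."""
    # By symmetry z_{1-beta} = -z_beta; using z_beta avoids forming 1 - beta
    # in floating point when beta is tiny.
    return (upper_quantile(alpha) - upper_quantile(beta)) ** 2


for k in range(2, 11):
    alpha = 10.0 ** -k
    print(-k, round(snps_per_individual(alpha, 10 * alpha), 4))
```

Under this reading, each printed ratio is the number of independent SNPs per pooled individual at which the approximation predicts a detection rate ten times the false positive rate.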

Can we apply the LR-test in practice?

$$L = \sum_{j=1}^{m} \left[ x_j \log\frac{\hat{p}_j}{p_j} + (1 - x_j) \log\frac{1 - \hat{p}_j}{1 - p_j} \right]$$

Requires an estimate of the population allele frequencies.

Use an independent reference dataset.

Drop in power:

$$z_\alpha + z_{1-\beta} \approx \sqrt{\frac{m}{n}\left(1 - \frac{n}{\tilde{n}}\right)}$$

where $\tilde{n}$ is taken here to be the size of the reference dataset.

Use a leave-one-out procedure on a dataset to obtain empirical power estimates.

Analysis and empirical estimates agree for large pools.

[Figure: power versus false positive rate (log base 10), in two panels (WTCCC and simulated data), comparing LR, LR theory, and Homer et al. for m = 1000, m = 10000, and m = 33138.]

Why does our optimal test have lower power than Homer et al.?

The alternative hypothesis is the same: the tested individual is present in the pool.

The nulls differ.

Our null: the tested individual is sampled from the population and is not part of the reference dataset.

Null tested in Homer et al.: the tested individual is part of the reference dataset.

Does this difference in the nulls matter?

Population has 10 individuals of which 5 are in the pool and rest in the reference.

Easy to detect individual in pool or reference.

Population has 1 million individuals of which 5 are in the pool.

Harder to detect in reference.

Even harder if only 5 out of these 1 million are available in a reference dataset.

Null tested in Homer et al. more appropriate for forensics. Our null more appropriate for privacy.

The other null is indeed easier to test.

[Figure: two panels of power versus false positive rate (axis ticks from −3 to 0), each comparing Homer et al., LR, and LR theory.]

Our null requires 4 times more independent SNPs to achieve the same power.

Related questions

Dependent SNPs: slight decrease in power. A haplotype-based test can be more powerful.

Genotyping errors: reduce power.

Relatives: require more SNPs.

Population-independent.

1!2

An alternative framework

$$X_i \in \mathcal{X}, \qquad \mathcal{X} = \{0, 1\}^n$$

$$(X_1, \ldots, X_m) \mapsto f(X_1, \ldots, X_m)$$

Release a noisy version of f:

$$\tilde{f} = f + \eta$$

Must still be useful for a non-attacker.

An attacker cannot use this sanitized f to learn about X.

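Differential Privacy (Dwork et al., 2006)

$$\frac{\Pr(\tilde{f}(X) \in S)}{\Pr(\tilde{f}(Y) \in S)} \le \exp(\epsilon) \quad \text{whenever } \delta(X, Y) = 1$$

where $\tilde{f}$ is the released (noisy) version of f and $\delta(X, Y) = 1$ means that X and Y differ in a single individual.

Relates to the LR test.

Given a test with false positive rate $\alpha$, the power is at most $\exp(\epsilon)\,\alpha$.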

Exponential mechanism

$$\pi(\eta) \propto \exp\left(-\frac{\epsilon \, \|\eta\|_1}{S(f)}\right)$$

where $\eta$ is the added noise.

What is f? Say, the mean allele frequencies.

$$S(f) = \sup_{\{x, y \,:\, \delta(x, y) = 1\}} \|f(x) - f(y)\|_1$$

What is S(f)? O(number of SNPs).

Bad news: the standard deviation of the noise is proportional to the number of SNPs.
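The scaling of the noise with the number of SNPs can be made concrete. A minimal sketch, assuming f is the vector of pool allele frequencies over binary alleles. Replacing one of n individuals then moves each frequency by at most 1/n, so the L1 sensitivity is at most (number of SNPs)/n. Independent Laplace noise with scale S(f)/ε is added per SNP, which is the standard Laplace mechanism of Dwork et al. (2006). The function names and parameters are illustrative and not from the talk.

```python
import random


def l1_sensitivity(num_snps, pool_size):
    """Upper bound on S(f) when f is the vector of pool allele frequencies.

    Swapping one individual changes each 0/1-allele frequency by at most
    1/pool_size, so the L1 change over all SNPs is at most num_snps/pool_size.
    """
    return num_snps / pool_size


def release_noisy_frequencies(freqs, pool_size, epsilon, rng):
    """Add independent Laplace(0, S(f)/epsilon) noise to every frequency."""
    scale = l1_sensitivity(len(freqs), pool_size) / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace distributed with location 0 and that scale.
    return [
        f + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        for f in freqs
    ]


rng = random.Random(0)
freqs = [0.25, 1.0, 0.75, 0.5, 0.0]
noisy = release_noisy_frequencies(freqs, pool_size=4, epsilon=1.0, rng=rng)
print(l1_sensitivity(len(freqs), 4))  # noise scale when epsilon is 1
print(noisy)
```

With 5 SNPs and a pool of 4 the noise scale already exceeds 1, larger than any frequency being released. This is the practical meaning of the bad news on the slide.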

Conclusions

A statistical framework to analyze the limits of genotype detection in pools.

Provides guidelines on data sharing to researchers.

The analytical bound is valid for large pools and common SNPs.

Use in conjunction with the empirical test.

Future Directions

[Figure: diagram relating identity, phenotype, and genotype.]