Genome-wide association studies Usman Roshan. Recap Single nucleotide polymorphism Genome wide...

Date posted: 15-Jan-2016

Page 1

Genome-wide association studies

Usman Roshan

Page 2

Recap

• Single nucleotide polymorphism
• Genome wide association studies
  – Relative risk, odds risk (or odds ratio) as an approximation to relative risk
  – Chi-square test to determine significant SNPs
  – Logistic regression model for determining odds ratio

Page 3

SNP genotype representation

The example

F: AACACAATTAGTACAATTATGAC

M: AACAGAATTAGTACAATTATGAC

is represented as

CG CC GG …

Page 4

SNP genotype encoding

• If SNP is A/B (alphabetically ordered) then count number of times we see B.

• Previous example becomes

      A/T  C/T  G/T  …          A/T  C/T  G/T  …
H0:   AA   TT   GG   …           0    2    0   …
H1:   AT   CC   GT   …    =>     1    0    1   …
H2:   AA   CT   GT   …           0    1    1   …

Now we have data in numerical format
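The encoding rule can be sketched in a few lines of Python. This is only an illustration: the function name `encode_genotype` and the dictionary layout are assumptions made for the example, not part of the slides.

```python
def encode_genotype(genotype, snp):
    """Count copies of the alphabetically second allele B of an A/B SNP."""
    first, second = sorted(snp.split("/"))
    if any(allele not in (first, second) for allele in genotype):
        raise ValueError(f"genotype {genotype!r} does not match SNP {snp!r}")
    return genotype.count(second)


# The three individuals and three SNPs from the slide.
snps = ["A/T", "C/T", "G/T"]
individuals = {
    "H0": ["AA", "TT", "GG"],
    "H1": ["AT", "CC", "GT"],
    "H2": ["AA", "CT", "GT"],
}

encoded = {
    name: [encode_genotype(g, s) for g, s in zip(genotypes, snps)]
    for name, genotypes in individuals.items()
}
print(encoded)
```

The order of the two letters within a genotype does not matter here, so TA and AT both encode to 1.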

Page 5

Example GWAS

            A/T  C/G  A/G  …
Case 1      AA   CC   AA
Case 2      AT   CG   AA
Case 3      AA   CG   AA
Control 1   TT   GG   GG
Control 2   TT   CC   GG
Control 3   TA   CG   GG

Page 6

Encoded data

        A/T  C/G  A/G          A/T  C/G  A/G
Case1   AA   CC   AA            0    0    0
Case2   AT   CG   AA            1    1    0
Case3   AA   CG   AA     =>     0    1    0
Con1    TT   GG   GG            2    2    2
Con2    TT   CC   GG            2    0    2
Con3    TA   CG   GG            1    1    2

Page 7

Example

Odds of AA in case = (80/100)/(20/100) = 4
Odds of AA in control = (70/100)/(30/100) = 7/3
Odds ratio of AA = 4/(7/3) = 12/7

          AA  AC  CC
Case      80  15   5
Control   70  15  15
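The arithmetic above can be reproduced with exact fractions. The following is a small sketch; the helper name `odds` and the dictionary layout are assumptions made for the example.

```python
from fractions import Fraction


def odds(count, total):
    """Odds of an outcome seen `count` times out of `total`: p / (1 - p)."""
    p = Fraction(count, total)
    return p / (1 - p)


case = {"AA": 80, "AC": 15, "CC": 5}
control = {"AA": 70, "AC": 15, "CC": 15}

odds_case = odds(case["AA"], sum(case.values()))
odds_control = odds(control["AA"], sum(control.values()))
odds_ratio = odds_case / odds_control

print(odds_case, odds_control, odds_ratio)  # 4 7/3 12/7
```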

Page 8

Chi-square statistic

Define the statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(c_i - e_i)^2}{e_i}$$

where

c_i = observed frequency for ith outcome
e_i = expected frequency for ith outcome
n = total outcomes

The probability distribution of this statistic is given by the chi-square distribution with n-1 degrees of freedom. Proof can be found at http://ocw.mit.edu/NR/rdonlyres/Mathematics/18-443Fall2003/4226DF27-A1D0-4BB8-939A-B2A4167B5480/0/lec23.pdf

Great. But how do we use this to get a SNP p-value?

Page 9

Null hypothesis for case control contingency table

• We have two random variables:
  – D: disease status
  – G: allele type.

• Null hypothesis: the two variables are independent of each other (unrelated)

• Under independence
  – P(D,G) = P(D)P(G)
  – P(D=case) = (c1+c2+c3)/n
  – P(G=AA) = (c1+c4)/n

• Expected values
  – E(X1) = P(D=case)P(G=AA)n, where X1 is the count in the (case, AA) cell

• We can calculate the chi-square statistic for a given SNP and its p-value: the probability, if the SNP were independent of disease status, of observing a statistic at least this large.

• SNPs with very small p-values deviate significantly from the independence assumption and are therefore considered important.

          AA  AC  CC
Case      c1  c2  c3
Control   c4  c5  c6

n=c1+c2+c3+c4+c5+c6
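As a sketch of the expected-value step, the code below computes the expected count in every cell as (row total)(column total)/n, which is P(D)P(G)n under independence. The function name `expected_counts` is an assumption for the example, and the counts are the case/control table from the earlier odds ratio example.

```python
def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # E[cell] = P(row) * P(col) * n = row_total * col_total / n
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: case, control. Columns: AA, AC, CC.
observed = [[80, 15, 5], [70, 15, 15]]
expected = expected_counts(observed)
print(expected)
```

Because both rows have the same total here, the two rows of expected counts are identical.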

Page 10

Chi-square statistic exercise

• Compute expected values and chi-square statistic
• Compute the p-value by referring to the chi-square distribution

          AA  AC  CC
Case      80  15   5
Control   60  15  25
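One way to work the exercise is sketched below. It assumes the usual (rows - 1)(columns - 1) = 2 degrees of freedom for a test of independence on a 2 x 3 table. With 2 degrees of freedom the chi-square tail probability has the closed form exp(-x/2), so only the standard library is needed. The function name `chi_square_2x3` is hypothetical.

```python
import math


def chi_square_2x3(table):
    """Chi-square statistic and p-value for a 2 x 3 case/control table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    # Assumption: 2 degrees of freedom, where P(X > x) = exp(-x / 2).
    p_value = math.exp(-stat / 2)
    return stat, p_value


stat, p_value = chi_square_2x3([[80, 15, 5], [60, 15, 25]])
print(stat, p_value)
```

For tables of other shapes the closed form no longer applies, and a general chi-square survival function would be needed instead.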

Page 11

Logistic regression

• The odds ratio estimated from the contingency table directly has a skewed sampling distribution.

• A better (discriminative) approach is to model the log likelihood ratio log(Pr(G|D=case)/Pr(G|D=control)) as a linear function. In other words:

$$\log\frac{\Pr(G \mid D=\text{case})}{\Pr(G \mid D=\text{control})} = w^T G + w_0$$

• Why:
  – Log likelihood ratio is a powerful statistic
  – Modeling as a linear function yields a simple algorithm to estimate parameters

• G is number of copies of the risk allele

• With some manipulation this becomes

$$\Pr(D=\text{case} \mid G) = \frac{1}{1 + e^{-(w^T G + w_0)}}$$
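To make the logistic model concrete, here is a minimal sketch for a single SNP, so that w and G are scalars. The parameter values w = 0.5 and w0 = -1.0 are made up for illustration and are not estimates from any data.

```python
import math


def prob_case(g, w, w0):
    """Pr(D = case | G = g) under the logistic model, for one SNP."""
    return 1.0 / (1.0 + math.exp(-(w * g + w0)))


# Hypothetical parameters, chosen only to show the shape of the model.
w, w0 = 0.5, -1.0
risks = [prob_case(g, w, w0) for g in (0, 1, 2)]
print(risks)
```

With a positive w, the predicted probability of being a case rises with each additional copy of the risk allele.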

Page 12

How do we get the odds ratio from logistic regression? (I)

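• Using Bayes rule we have

$$\log\frac{\Pr(G \mid D=\text{case})}{\Pr(G \mid D=\text{control})} = w^T G + w_0$$

By exponentiating both sides we get

$$\frac{\Pr(G \mid D=\text{case})}{\Pr(G \mid D=\text{control})} = e^{w^T G + w_0}$$

And by taking the ratio with G=1 and G=0 we get

$$\frac{\Pr(G=1 \mid D=\text{case}) / \Pr(G=1 \mid D=\text{control})}{\Pr(G=0 \mid D=\text{case}) / \Pr(G=0 \mid D=\text{control})} = \frac{e^{w^T 1 + w_0}}{e^{w^T 0 + w_0}} = e^{w}$$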

Page 13

How do we get the odds ratio from logistic regression? (II)

Continued from previous slide: by rearranging the terms in the numerator and denominator we get

$$\frac{\Pr(G=1 \mid D=\text{case}) / \Pr(G=1 \mid D=\text{control})}{\Pr(G=0 \mid D=\text{case}) / \Pr(G=0 \mid D=\text{control})} = \frac{\Pr(G=1 \mid D=\text{case}) / \Pr(G=0 \mid D=\text{case})}{\Pr(G=1 \mid D=\text{control}) / \Pr(G=0 \mid D=\text{control})}$$

By symmetry of odds ratio this is

$$\frac{\Pr(G=1 \mid D=\text{case}) / \Pr(G=0 \mid D=\text{case})}{\Pr(G=1 \mid D=\text{control}) / \Pr(G=0 \mid D=\text{control})} = \frac{\Pr(D=\text{case} \mid G=1) / \Pr(D=\text{control} \mid G=1)}{\Pr(D=\text{case} \mid G=0) / \Pr(D=\text{control} \mid G=0)} = OR \text{ (odds ratio)}$$

Since the original ratio (see previous slide) is equal to e^w and is equal to the odds ratio we conclude that the odds ratio is given by this value.
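The conclusion can be checked numerically. The sketch below plugs hypothetical values of w and w0 into the logistic model, computes the odds of disease at G=1 and G=0, and compares their ratio with e^w. The parameter values are arbitrary; any choice should give the same agreement.

```python
import math


def prob_case(g, w, w0):
    """Pr(D = case | G = g) under the logistic model, for one SNP."""
    return 1.0 / (1.0 + math.exp(-(w * g + w0)))


def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)


# Hypothetical parameters, used only to check the identity.
w, w0 = 0.7, -1.2
odds_ratio = odds(prob_case(1, w, w0)) / odds(prob_case(0, w, w0))
print(odds_ratio, math.exp(w))
```

The intercept w0 cancels in the ratio, which is why only w determines the odds ratio.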

Page 14

How to find w and w0?

• And so e^w is our odds ratio. But how do we find w and w0?

  – We assume that one’s disease status D given their genotype G is a Bernoulli random variable.

  – Using this we form the sample likelihood

  – Differentiate the likelihood with respect to w and w0

  – Use gradient descent

Page 15

Today

• Basic classification problem
• Maximum likelihood
• Logistic regression likelihood
• Algorithm for logistic regression likelihood
• Support vector machine

Page 16

Supervised learning for two classes

• We are given n training samples (x_i, y_i) for i=1..n drawn i.i.d. from a probability distribution P(x,y).

• Each x_i is a d-dimensional vector (x_i ∈ R^d) and y_i is +1 or -1

• Our problem is to learn a function f(x) for predicting the labels of test samples x_i’ ∈ R^d for i=1..n’ also drawn i.i.d. from P(x,y)

Page 17

Loss function

• Loss function: c(x,y,f(x))

• Maps to [0,inf]

• Examples:

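$$c(x, y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ 1 & \text{otherwise} \end{cases}$$

$$c(x, y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ c'(x) & \text{otherwise} \end{cases}$$

$$c(x, y, f(x)) = (y - f(x))^2$$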
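The three example losses translate directly into code. The function names below, and the `cost` argument standing in for c'(x), are assumptions made for this sketch.

```python
def zero_one_loss(x, y, fx):
    """0 if the prediction is correct, 1 otherwise."""
    return 0 if y == fx else 1


def input_dependent_loss(x, y, fx, cost):
    """0 if correct, otherwise a penalty cost(x) that depends on the input."""
    return 0 if y == fx else cost(x)


def squared_loss(x, y, fx):
    """Squared difference between the label and the prediction."""
    return (y - fx) ** 2


print(zero_one_loss(0, 1, -1))                             # misclassified
print(input_dependent_loss(3, 1, -1, cost=lambda x: 2 * x))  # penalty 2 * x
print(squared_loss(0, 1.0, 0.5))
```

All three return values in [0, inf), as the slide requires, provided cost(x) is non-negative.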

Page 18

Test error

• We quantify the test error as the expected error on the test set (in other words the average test error). In the case of two classes:

$$R_{test}[f] = \frac{1}{n'} \sum_{i=1}^{n'} \sum_{j=1}^{2} c(x_i', y_j, f(x_i')) \, P(y_j \mid x_i')$$

• We’d like to find f that minimizes this but we need P(y|x) which we don’t have access to.

Page 19

Expected risk

• Suppose we didn’t have test data (x’). Then we average the test error over all possible data points x

$$R[f] = \sum_{x \in X} \sum_{j=1}^{2} c(x, y_j, f(x)) \, P(y_j, x)$$

• We want to find f that minimizes this but we don’t have all data points. We only have training data.

Page 20

Empirical risk

• Since we only have training data we can’t calculate the expected risk (we don’t even know P(x,y)).

• Solution: we approximate P(x,y) with the empirical distribution p_emp(x,y)

$$p_{emp}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x) \, \delta_{y_i}(y)$$

• The delta function δ_x(y)=1 if x=y and 0 otherwise.

Page 21

Empirical risk

• We can now define the empirical risk as

$$R_{emp}[f] = \sum_{x \in X} \sum_{j=1}^{2} c(x, y_j, f(x)) \, p_{emp}(y_j, x) = \frac{1}{n} \sum_{i=1}^{n} c(x_i, y_i, f(x_i))$$

• Once the loss function is defined and training data is given we can then find f that minimizes this.
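The empirical risk is just the average loss over the training samples. The sketch below uses the 0/1 loss and a made-up threshold classifier on allele counts. Both the data and the classifier are hypothetical and only serve to show the computation.

```python
def empirical_risk(loss, f, samples):
    """Average loss of predictor f over (x, y) training samples."""
    return sum(loss(x, y, f(x)) for x, y in samples) / len(samples)


def zero_one_loss(x, y, fx):
    return 0 if y == fx else 1


def threshold_classifier(x):
    """Hypothetical rule: predict +1 when there are 2 copies of the allele."""
    return 1 if x >= 2 else -1


# Hypothetical training data: (allele count, label in {+1, -1}).
samples = [(0, -1), (1, -1), (2, 1), (2, 1), (1, 1)]
risk = empirical_risk(zero_one_loss, threshold_classifier, samples)
print(risk)  # one of the five samples is misclassified
```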

Page 22

Example of minimizing empirical risk (least squares)

• Suppose we are given n data points (x_i, y_i) where each x_i ∈ R^d and y_i ∈ R^d. We want to determine a linear function f(x)=ax+b for predicting test points.

• Loss function c(x_i, y_i, f(x_i)) = (y_i − f(x_i))^2

• What is the empirical risk?

Page 23

Empirical risk for least squares

$$R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} c(x_i, y_i, f(x_i)) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - (a x_i + b))^2$$

Now finding f has reduced to finding a and b. Since this function is convex in a and b we know there is a global optimum which is easy to find by setting first derivatives to 0.
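For scalar x (an assumption made here to keep the example short), setting the derivatives with respect to a and b to zero gives the familiar closed form: a is the ratio of the sample covariance of x and y to the sample variance of x, and b = mean(y) - a * mean(x). A minimal sketch, with hypothetical function names and data:

```python
def fit_line(xs, ys):
    """Least-squares a, b for f(x) = a*x + b, from the zero-derivative conditions."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def empirical_risk(a, b, xs, ys):
    """Mean squared error of f(x) = a*x + b on the data."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(a, b, empirical_risk(a, b, xs, ys))
```

The sketch requires at least two distinct x values, since otherwise sxx is zero and the slope is undefined.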

Page 24

Empirical risk for logistic regression

• Recall the logistic regression model:

$$\Pr(D=\text{case} \mid G) = \frac{1}{1 + e^{-(w^T G + w_0)}}$$

• where G is number of copies of risk allele.
• In order to use this to predict one’s risk of disease or to determine the odds ratio we need to know w and w0.
• We use maximum likelihood to find w and w0. But it doesn’t yield a simple closed form solution like least squares.

Page 25

Maximum likelihood

• We can classify by simply selecting the model M that has the highest P(M|D) where D=data, M=model. Thus classification can also be framed as the problem of finding M that maximizes P(M|D)

• By Bayes rule:

$$P(M \mid D) = \frac{P(D \mid M) P(M)}{P(D)} = \frac{P(D \mid M) P(M)}{\sum_{M} P(D \mid M) P(M)}$$

Page 26

Maximum likelihood

• Suppose we have k models to consider and each has the same probability. In other words we have a uniform prior distribution P(M)=1/k. Then

$$P(M \mid D) = \frac{P(D \mid M) \frac{1}{k}}{\sum_{M} P(D \mid M) P(M)} \propto P(D \mid M)$$

• In this case we can solve the classification problem by finding the model that maximizes P(D|M). This is called the maximum likelihood optimization criterion.

Page 27

Maximum likelihood

• Suppose we have n i.i.d. samples (x_i, y_i) drawn from M. The likelihood P(D|M) is

$$P(D \mid M) = P((x_1,y_1),\ldots,(x_n,y_n) \mid M) = P(x_1,y_1 \mid M) \cdots P(x_n,y_n \mid M) = \prod_{i=1}^{n} P(x_i,y_i \mid M) = \prod_{i=1}^{n} P(y_i \mid x_i, M) P(x_i)$$

• Consequently the negative log likelihood is

$$-\log P(D \mid M) = -\sum_{i=1}^{n} \log P(y_i \mid x_i, M) - \sum_{i=1}^{n} \log P(x_i)$$

Page 28

Maximum likelihood and empirical risk

• Maximizing the likelihood P(D|M) is the same as maximizing log(P(D|M)) which is the same as minimizing -log(P(D|M))

• Set the loss function to

$$c(x_i, y_i, f(x_i)) = -\log(P(y_i \mid x_i, f))$$

• Now minimizing the empirical risk is the same as maximizing the likelihood

Page 29

Maximum likelihood example

• Consider a set of coin tosses produced by a coin with P(H)=p (P(T)=1-p)

• We want to determine the probability P(H) of the coin that produced HTHHHTHHHTHTH.

• Solution:
  – Form the log likelihood
  – Differentiate w.r.t. p
  – Set the derivative to 0 and solve for p

• How about the probability P(H) of the coin that produces k heads and n-k tails?
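A sketch of the coin example follows. For k heads and n-k tails the log likelihood is k log(p) + (n-k) log(1-p); its derivative k/p - (n-k)/(1-p) is zero at p = k/n. The code computes that estimate for the sequence on the slide and, as a sanity check, compares it with a brute-force search over a grid of candidate values. The grid search is only a check added for this example, not part of the method.

```python
import math


def log_likelihood(p, heads, tails):
    """Log likelihood of a coin with P(H) = p given the observed counts."""
    return heads * math.log(p) + tails * math.log(1.0 - p)


tosses = "HTHHHTHHHTHTH"
heads = tosses.count("H")
tails = tosses.count("T")

# Closed-form maximum likelihood estimate: p = k / n.
p_hat = heads / (heads + tails)

# Brute-force check over a grid of candidate probabilities.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, heads, tails))

print(heads, tails, p_hat, p_grid)
```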

Page 30

Classification by likelihood

• Suppose we have two classes C1 and C2.

• Compute the likelihoods P(D’|C1) and P(D’|C2) of the test data D’.

• Assign D’ to class C1 if P(D’|C1) is greater than P(D’|C2) and to C2 otherwise.
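As a toy illustration of classifying by likelihood, suppose the two classes are two coins with different probabilities of heads. The class parameters 0.7 and 0.4 below are invented for the example. A test sequence is assigned to whichever coin gives it the higher likelihood.

```python
def sequence_likelihood(tosses, p_heads):
    """Likelihood of an i.i.d. toss sequence under a coin with P(H) = p_heads."""
    likelihood = 1.0
    for toss in tosses:
        likelihood *= p_heads if toss == "H" else 1.0 - p_heads
    return likelihood


# Hypothetical class models: each class is a coin with its own P(H).
CLASS_P_HEADS = {"C1": 0.7, "C2": 0.4}


def classify(tosses):
    """Assign the test sequence to C1 if its likelihood there is greater, else C2."""
    like_c1 = sequence_likelihood(tosses, CLASS_P_HEADS["C1"])
    like_c2 = sequence_likelihood(tosses, CLASS_P_HEADS["C2"])
    return "C1" if like_c1 > like_c2 else "C2"


print(classify("HHTH"), classify("TTTH"))
```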

Page 31

Logistic regression (in light of disease risk prediction)

• We assume that the probability of disease given one’s genotype is

$$\Pr(D=\text{case} \mid G) = \frac{1}{1 + e^{-(w^T G + w_0)}}$$

• where G is the numeric format genotype.
• Problem: given training data we want to estimate w and w0 by maximum likelihood

Page 32

Maximum likelihood for logistic regression

• Assume that one’s disease status given their genotype is a Bernoulli random variable with probability P(D=case|G_i). In other words training sample i is case with probability P(D=case|G_i) and control with probability 1-P(D=case|G_i).

• Assume we have m cases and n-m controls.
• The likelihood is given by

$$P(\text{data} \mid w, w_0) = \prod_{i=1}^{m} P(D=\text{case} \mid G_i) \prod_{i=m+1}^{n} (1 - P(D=\text{case} \mid G_i))$$

Page 33

Maximum likelihood for logistic regression

• Likelihood:

$$P(\text{data} \mid w, w_0) = \prod_{i=1}^{m} P(D=\text{case} \mid G_i) \prod_{i=m+1}^{n} (1 - P(D=\text{case} \mid G_i))$$

• -Log likelihood

$$-\log P(\text{data} \mid w, w_0) = -\sum_{i=1}^{m} \log(P(D=\text{case} \mid G_i)) - \sum_{i=m+1}^{n} \log(1 - P(D=\text{case} \mid G_i))$$

$$= -\sum_{i=1}^{m} \log\left(\frac{1}{1 + e^{-(w^T G_i + w_0)}}\right) - \sum_{i=m+1}^{n} \log\left(1 - \frac{1}{1 + e^{-(w^T G_i + w_0)}}\right)$$

• Set first derivatives to 0 and solve for w and w0.
• No closed form therefore have to use gradient descent.
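The negative log likelihood can be written as a short function. The sketch below handles a single SNP, so w and each G_i are scalars, and takes the case and control genotypes as two separate lists. The function names and the example genotypes are assumptions made for illustration.

```python
import math


def prob_case(g, w, w0):
    """Pr(D = case | G = g) under the logistic model, for one SNP."""
    return 1.0 / (1.0 + math.exp(-(w * g + w0)))


def negative_log_likelihood(w, w0, case_genotypes, control_genotypes):
    """-log P(data | w, w0): cases contribute log p, controls log(1 - p)."""
    nll = 0.0
    for g in case_genotypes:
        nll -= math.log(prob_case(g, w, w0))
    for g in control_genotypes:
        nll -= math.log(1.0 - prob_case(g, w, w0))
    return nll


# Hypothetical encoded genotypes for three cases and three controls.
cases = [0, 1, 0]
controls = [2, 2, 1]
print(negative_log_likelihood(0.0, 0.0, cases, controls))
```

At w = w0 = 0 every sample has probability 1/2, so the value is n log 2, which gives a quick sanity check.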

Page 34

Gradient descent

• Given a convex function f(x,y) how can we find x and y that minimize it?

• Solution is gradient descent
  – The gradient vector points in the direction of greatest increase in the function.
  – Therefore solution is to move in small increments in the negative direction of the gradient until we reach the optimum.
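Below is a minimal sketch of gradient descent applied to the logistic regression negative log likelihood for one SNP. The learning rate, the number of steps, and the toy data are assumptions chosen for the example. The data are built so that the sample odds ratio is (40/20)/(20/40) = 4, and the fitted e^w should come out close to that value.

```python
import math


def prob_case(g, w, w0):
    """Pr(D = case | G = g) under the logistic model, for one SNP."""
    return 1.0 / (1.0 + math.exp(-(w * g + w0)))


def fit_logistic(samples, learning_rate=0.5, steps=2000):
    """Minimize the mean negative log likelihood by plain gradient descent.

    samples: list of (g, d) with d = 1 for case and d = 0 for control.
    """
    w, w0 = 0.0, 0.0
    n = len(samples)
    for _ in range(steps):
        grad_w = 0.0
        grad_w0 = 0.0
        for g, d in samples:
            error = prob_case(g, w, w0) - d  # derivative w.r.t. (w*g + w0)
            grad_w += error * g
            grad_w0 += error
        # Step against the gradient, averaged over the samples.
        w -= learning_rate * grad_w / n
        w0 -= learning_rate * grad_w0 / n
    return w, w0


# Hypothetical data: G=1 has 40 cases and 20 controls, G=0 has 20 and 40.
samples = [(1, 1)] * 40 + [(1, 0)] * 20 + [(0, 1)] * 20 + [(0, 0)] * 40
w, w0 = fit_logistic(samples)
print(w, w0, math.exp(w))
```

A fixed step size is the simplest choice. In practice one would usually monitor the change in the objective and stop once it becomes small.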

Page 35

Disease risk prediction

• How exactly do we predict risk?
  – Personal genomics companies: Composite odds ratio score
  – Academia: Composite odds ratio score and recently other classifiers as well
  – Still an open problem: depends upon classifier and the set of SNPs

Page 36

Composite odds ratio score

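• Recall that we can obtain the odds ratio from the logistic regression model

$$OR = \frac{\Pr(D=\text{case} \mid G=1) / (1 - \Pr(D=\text{case} \mid G=1))}{\Pr(D=\text{case} \mid G=0) / (1 - \Pr(D=\text{case} \mid G=0))} = \frac{e^{w^T 1 + w_0}}{e^{w^T 0 + w_0}} = e^{w}$$

• Define θ = e^w

• Now we can predict the risk with n alleles

$$R(G_1, G_2, \ldots, G_n) = \theta_1^{G_1} \theta_2^{G_2} \cdots \theta_n^{G_n}$$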
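The composite score is a product of per-SNP odds ratios, each raised to the number of risk alleles carried at that SNP. A minimal sketch follows; the function name and the example odds ratios are invented for illustration.

```python
def composite_odds_ratio_score(odds_ratios, genotypes):
    """Product over SNPs of (odds ratio of SNP i) ** (allele count G_i)."""
    if len(odds_ratios) != len(genotypes):
        raise ValueError("need one genotype per odds ratio")
    score = 1.0
    for odds_ratio, g in zip(odds_ratios, genotypes):
        score *= odds_ratio ** g
    return score


# Hypothetical per-SNP odds ratios (e^w for each SNP) and one person's genotype.
odds_ratios = [2.0, 0.5, 1.5]
genotype = [1, 2, 0]
print(composite_odds_ratio_score(odds_ratios, genotype))  # 2.0 * 0.25 * 1.0
```

A SNP with zero copies of the risk allele contributes a factor of 1, so it leaves the score unchanged.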

Page 37

Example of risk prediction study (type 1 diabetes)

Page 38

Example of risk prediction study (arthritis)