BOULDER WORKSHOP STATISTICS REVIEWED: LIKELIHOOD MODELS

BOULDER WORKSHOP

STATISTICS REVIEWED: LIKELIHOOD MODELS

Andrew C. Heath

PRE-HISTORY (STATISTICS 101)

Binomial distribution – gives the probabilities of various possible numbers of ‘successful’ outcomes in a fixed number of discrete trials, where all trials have the same probability of success.

Probability that X is equal to a particular value x (x= 0, 1, 2…n) is given by

Useful for genetics (e.g. transmission versus non-transmission of an allele)!

xnx ppx

nxXP

)1(

LIKELIHOOD

Focus on probability (“likelihood”) of the data, as a function of model parameters.

E.g. Sham (1998) uses Neel & Schull (1954) data on opalescent dentine in random sample of 112 offspring of an affected parent, found 52 affected, 60 normal. Compatible with hypothesis of rare autosomal dominant? Does the observed population (.464) differ from expected proportion of 0.5?

Likelihood function for segregation ratio p is

rnr ppr

npL

)1(

MAXIMUM LIKELIHOOD ESTIMATION

= Find the maximum value of the likelihood function in the range 0 p 1. In more difficult problems, usual to maximize the log-likelihood, since computationally more convenient, & this also maximizes the likelihood.

In this simple case, maximum likelihood estimate (MLE) of p, = r/n.

For the opalescent dentine data, ignoring the constant term involving n, r which does not vary as a function of p, log-likelihood function is

ln L(p) = 52 ln(p) + 60 ln(1–p)

p

Figure 2.1 Log-likelihood function of the segregation ratio for the opalescent dentine data (from Sham, 1998)

p

LIKELIHOOD-RATIO STATISTIC

Likelihood ratio statistic: twice the difference between the log likelihood of the data at the MLE (i.e. ), L1, and the log likelihood of the data at the hypothesized value of 0.5, L0: 2(ln L1 – ln L0).

For the segregation ratio example,

= 0.57 for the opalescent dentine example.

Likelihood-ratio statistic in this case is distributed as chi-square on one degree of freedom, hence non-significant.

21

nrn

nr

01 lnnlnrnlnr2LlnLln2

p

MATRIX ALGEBRA BASICS

A–1 Inverse of A

AT or A' Transpose of A

|A| Determinant of A

A B A postmultiplied by B: r x c c x r conformable for multiplication

since number of columns of A = number of rows of B. Resulting matrix is r x r.Tr (A) Trace of matrix A

HISTORY (MX INTRODUCTORY WORKSHOP)

Maximum-likelihood estimation using linear covariance structure models, e.g. fitting models to twin data:

Let p be the number of observed variables, the expected covariance matrix be E, and the expected vector of means be , where E and are functions of q free parameters to be estimated from the data. Let x1, x2…xn denote to observed variables. Assuming that the observed variables follow a multivariate normal distribution, the log-likelihood of the observed data is given by

This is the formula used for maximum likelihood model-fitting to raw continuous data, assuming a multivariate normal distribution. (Often, 2 ln L, is estimated). Requires that we provide:(a) model for expected covariance matrix – e.g. in terms of additive genetic, shared and non-shared environmental variance components – that will vary as a

function of relationship;(b) model for expected means – in simplest applications we might estimate a separate mean that might differ by gender, or possibly by twin pair zygosity.

n

1i i1T

i21

2n

1 2lnLln μxEμxE

EXAMPLE:

MZ Pairs:

E =

= [ m , m] T

DZ Pairs:

E =

= [ m , m]T

Where m is estimate of population mean, VA, VC, VE and are additive genetic, shared environmental and non-shared environmental variances, all estimated jointly from the data.

Compare e.g. ln L1 for VA VC VE model

ln L0 for VC VE model

2(ln L1 – ln L0) distributed as chi-square on one degree of freedom as before.

ECACA

CAECA

VVVVV

VVVVV

ECACA21

CA21

ECA

VVVVV

VVVVV

HISTORY (MX INTRODUCTORY WORKSHOP) - II

In practice most applications in the MX Introductory Workshop fitted models to summary covariance matrices.We can simplify (e.g. Sham, p. 238):

(1) (2)

Where m is the vector of sample means of the p observed variables, S is the observed covariance matrix. For the simple applications in the Introductory Workshop, we had no predictions for the means structure, so we can saturate that component of the model (i.e. estimate a separate mean for every observed mean), equivalent to deleting the term (m)T E1 (m) in (2). Thus the log-likelihood of the observed data becomes (3)

Under a saturated model, which equates every element of E to the corresponding element of S (i.e. a perfect fit model) we have for the log-likelihood

SEμmEμmE

μxEμxE

11T2n

n1i i

1Ti2

12n

1

trln2ln

2lnLln

SEE 12n

1 trln2lnLln

pln2ln

trln2lnLln

2n

12n

0

S

SSS

HISTORY (MX INTRODUCTORY WORKSHOP) - II

Thus the likelihood-radio test of the fitted model against the saturated model becomes

(4)

For multiple group problem, sum over groups.

ptrlnlnnLlnLln2 110 SESE

Analysis of Australian BMI data-young female MZ twins pairs-MX DYI version

Analysis of Australian BMI data-young female MZ twins pairs-MX DIY version !DA NG=1 NI=2 NO=0begin matrices;A LO 1 1 FR ! Additive genetic variance C LO 1 1 FR ! Shared environmental variance E LO 1 1 FR ! Non-shared environmental varianceM FU 2 2 ! This will be observed MZ covariance matrixD FU 2 2 ! This will be observed DZ covariance matrixg fu 1 1 ! coefficient of 0.5 for DZ pairsn fu 1 1 ! sample size for MZ pairs (female in this illustration)k fu 1 1 ! sample size for DZ pairs (female in this illustration)p fu 1 1 ! order of matrices (i.e. number of variables =2 in this case)end matrices;mat g 0.5mat p 2mat m0.7247 0.58910.5891 0.7915mat n 532mat d0.7786 0.24610.2461 0.8365mat k 326mat a 0.25mat c 0.25mat e 0.25

Analysis of Australian BMI data-young female MZ twins pairs-MX DYI version (ctd)

BEGIN ALGEBRA;t=n+k; ! total sample sizeU=(A+C+E | A+C _ A+C | A+C+E); ! Expected MZ covariance matrixV=(A+C+E | g*A+C _ g*A+C | A+C+E); ! Expected DZ covariance matrixH=n*(\ln (\det(U))-\ln (\det(M)) + \tr((U~ *M))-p); ! fit-function for MZ group J=k*(\ln (\det(V))-\ln (\det(D)) + \tr((V~ *D))-p); ! fit-function for DZ groupF=h+j;END ALGEBRA;bo 0.01 1.0 e(1,1)bo 0.0 1.0 c(1,1) a(1,1)CO F ; option user df=6end

PRE-HISTORY (STATISTICS 101) - II

LINEAR REGRESSION: requires weaker assumptions than linear covariance structure models. Does not assume multivariate normal distribution, only homoscedastic residuals. Flexible for handling selective sampling schemes where we oversample extreme values of predictor variable(s).

We can fit linear regression models by maximum likelihood e.g. using MX.

HISTORY! MX INTRODUCTORY WORKSHOP - III

“Definition variables” – an option in MX when fitting to raw data, which allows us to model effects of some variables as fixed effects, modeling their contribution to expected means.

Simple example: controlling for linear or polynomial regression of a quantitative measure on age. Don’t want to model covariance structure with age (which probably has rectangular distribution!)

Definition variables = variables whose values may vary from individual to individual, that can be read into matrices

Important example: genotypes at a given locus or set of loci.

BOULDER WORKSHOP STATISTICS REVIEWED: LIKELIHOOD MODELS

Documents

Transcript of BOULDER WORKSHOP STATISTICS REVIEWED: LIKELIHOOD MODELS