
Statistical modeling and machine learning for molecular biology
Alan Moses

Chapter 4 – Parameter estimation and multivariate statistics

1. Parameter estimation

Fitting a model to data: objective functions and estimation

Given a distribution or “model” for data, the next step is to “fit” the model to the data. Typical probability distributions will have unknown parameters, numbers that change the shape of the distribution. The technical term for the procedure of finding the values of the unknown parameters of a probability distribution from data is “estimation”. During estimation one seeks to find parameters that make the model “fit” the data the “best.” If this all sounds a bit subjective, that’s because it is. In order to proceed, we have to provide some kind of mathematical definition of what it means to fit data the best. The description of how well the model fits the data is called the “objective function.” Typically, statisticians will try to find “estimators” for parameters that maximize (or minimize) an objective function. And statisticians will disagree about which estimators or objective functions are the best.

In the case of the Gaussian distribution, these parameters are called the “mean” and “standard deviation” often written as mu and sigma. In using the Gaussian distribution as a model for some data, one seeks to find values of mu and sigma that fit the data. As we shall see, what we normally think of as the “average” is in fact an “estimator” for the mu parameter of the Gaussian that we refer to as the mean.

2. Maximum Likelihood Estimation

The most commonly used objective function is known as the “likelihood”, and the most well-understood estimation procedures seek to find parameters that maximize the likelihood. These maximum likelihood estimates (once found) are often referred to as MLEs.

The likelihood is defined as a conditional probability: P(data|model), the probability of the data given the model. Typically, the only part of the model that can change is the parameters, so the likelihood is often written as P(X|θ), where X is a data matrix and θ is a vector containing all the parameters of the distribution(s). This notation makes explicit that the likelihood depends on both the data and a choice of any free parameters in the model.

More formally, let’s start by saying we have some i.i.d. observations from a pool, say X1, X2 … Xn, which we will refer to as a vector X. We want to write down the likelihood, L, which is the conditional probability of the data given the model. In the case of independent observations, we can use the joint probability rule to write:

$$L = P(X|\theta) = P(X_1|\theta)\,P(X_2|\theta)\cdots P(X_n|\theta) = \prod_{i=1}^{n} P(X_i|\theta)$$


Maximum likelihood estimation says: “choose the parameters so the data is most probable given the model,” or find θ that maximizes L. In practice, this optimization problem can be very complicated, and there are many analytic and numerical techniques to solve it.

3. The likelihood for Gaussian data.

To make the likelihood specific we have to choose the model. If we assume that each observation is described by the Gaussian distribution, we have two parameters, the mean, μ and standard deviation σ.

$$L = \prod_{i=1}^{n} P(X_i|\theta) = \prod_{i=1}^{n} N(X_i|\mu,\sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X_i-\mu)^2}{2\sigma^2}}$$

Admittedly, the formula looks complicated. But in fact, this is a *very* simple likelihood function. I’ll illustrate this likelihood function by directly computing it for an example.

Observation (i)    Value (Xi)    P(Xi|θ) = N(Xi | μ = 6.5, σ = 1.5)
1                  5.2           0.18269
2                  9.1           0.059212
3                  8.2           0.13993
4                  7.3           0.23070
5                  7.8           0.18269
                                 L = 0.000063798

Notice that in this table I have chosen values for the parameters, μ and σ, which is necessary to calculate the likelihood. We will see momentarily that these parameters are *not* the maximum likelihood estimates (the parameters that maximize the likelihood), but rather just illustrative values. The likelihood (for these parameters) is simply the product of the 5 values in the right column of the table.
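
To make the calculation concrete, here is a minimal sketch in R that reproduces the table above; the observations and the illustrative parameter values μ = 6.5 and σ = 1.5 are taken directly from the table, and nothing else is assumed.

```r
# Likelihood of five observations under a Gaussian with illustrative
# (not maximum-likelihood) parameters mu = 6.5 and sigma = 1.5.
x <- c(5.2, 9.1, 8.2, 7.3, 7.8)
densities <- dnorm(x, mean = 6.5, sd = 1.5)  # the right-hand column of the table
L <- prod(densities)                          # product over i.i.d. observations
L                                             # approximately 6.4e-05
```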

It should be clear that as the dataset gets larger, the likelihood gets smaller and smaller, but it always remains greater than zero. Because the likelihood is a function of all the parameters, even in the simple case of the Gaussian, the likelihood is a function of two parameters (plotted in Figure X) and represents a surface in the parameter space. To make this figure I simply calculated the likelihood (just as in the table above) for a large number of pairs of mean and standard deviation parameters. The maximum likelihood can be read off (approximately) from this graph as the place where the likelihood surface has its peak. This is a totally reasonable way to calculate the likelihood if you have a model with one or two parameters. You can see that the standard deviation parameter I chose for the table above (1.5) is close to the value that maximizes the likelihood, but the value I chose for the mean is probably too small – the maximum looks to occur when the mean is around 7.5.
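
As a sketch of how such a surface can be computed, the grid evaluation described above might look like this in R; the data are the five observations from the table, and the grid ranges are assumptions chosen to roughly match the figure.

```r
# Evaluate the likelihood on a grid of (mu, sigma) pairs and locate the peak.
x <- c(5.2, 9.1, 8.2, 7.3, 7.8)
mu_grid    <- seq(5, 10, by = 0.05)
sigma_grid <- seq(1, 4, by = 0.05)
lik <- outer(mu_grid, sigma_grid,
             Vectorize(function(m, s) prod(dnorm(x, mean = m, sd = s))))
peak <- which(lik == max(lik), arr.ind = TRUE)   # grid point with highest likelihood
c(mu = mu_grid[peak[1]], sigma = sigma_grid[peak[2]])
# persp(mu_grid, sigma_grid, lik) draws the likelihood surface, as in Figure X
```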


[Figure: the likelihood, L, plotted as a surface over the mean, μ (roughly 5–10), and the standard deviation, σ (roughly 1–4).]

Figure X – numerical evaluation of the likelihood function

This figure also illustrates two potential problems with numerical methods to calculate the likelihood. If your model has many parameters, drawing a graph becomes hard, and, more importantly, the parameter space might become difficult to explore: you might think you have a maximum in one region, but you might miss another peak somewhere else in high dimensions. The second problem is that you have to choose individual points to numerically evaluate the likelihood – there might always be a point in between the points you evaluated that has a slightly higher likelihood.

4. How to maximize the likelihood analytically

Although numerical approaches are always possible nowadays, it’s still faster (and more fun!) to find the exact mathematical maximum of the likelihood function if you can. We’ll derive the MLEs for the univariate Gaussian likelihood introduced above. The problem is complicated enough to illustrate the major concepts of likelihood, as well as some important mathematical notations and tricks that are widely used to solve statistical modeling and machine learning problems. Once we have written down the likelihood function, the next step is to find the maximum of this function by taking the derivatives with respect to the parameters, setting them equal to zero and solving for the maximum likelihood estimators. Needless to say, this probably seems very daunting at this point. But if you make it through this book, you’ll look back at this problem with fondness because it was so *simple* to find the analytic solutions. The mathematical trick that makes this problem go from looking very hard to being relatively easy to solve is the following: take the logarithm. Instead of working with likelihoods, in practice, we’ll almost always use log-likelihoods because of their mathematical convenience. (Log-likelihoods are also easier to work with numerically because instead of very, very small positive numbers near zero, we can work with big negative numbers.) Because the logarithm is monotonic (it doesn’t change the ranks of numbers), the maximum of the log-likelihood is also the maximum of the likelihood. So here’s the mathematical magic:

$$\log L = \log \prod_{i=1}^{n} N(X_i|\mu,\sigma) = \sum_{i=1}^{n} \log N(X_i|\mu,\sigma) = \sum_{i=1}^{n} \left[ -\log \sigma - \frac{1}{2}\log(2\pi) - \frac{(X_i-\mu)^2}{2\sigma^2} \right]$$

In the preceding we have used many properties of the logarithm: log(1/x) = -log x, log(xy) = log(x) + log(y), and log(e^x) = x. This formula for the log-likelihood might not look much better, but remember that we are trying to find the parameters that maximize this function. To do so we want to take its derivative with respect to the parameters and set it equal to zero. To find the maximum likelihood estimate of the mean, μ, we will take derivatives with respect to μ. Using the linearity of the derivative operator, we have

$$\frac{\partial}{\partial\mu} \log L = \frac{\partial}{\partial\mu} \sum_{i=1}^{n}\left[ -\log\sigma - \frac{1}{2}\log(2\pi) - \frac{(X_i-\mu)^2}{2\sigma^2}\right] = \sum_{i=1}^{n}\left[ -\frac{\partial}{\partial\mu}\log\sigma - \frac{1}{2}\frac{\partial}{\partial\mu}\log(2\pi) - \frac{\partial}{\partial\mu}\frac{(X_i-\mu)^2}{2\sigma^2}\right] = 0$$

Since two of the terms have no dependence on µ, their derivatives are simply zero. Taking the derivatives we get

$$\frac{\partial}{\partial\mu}\log L = \sum_{i=1}^{n}\left[-0 - 0 + \frac{2(X_i-\mu)}{2\sigma^2}\right] = \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i-\mu) = 0$$

Where in the last step we took the σ² out of the sum, because it doesn’t depend on i. Since we can multiply both sides of this equation by σ², we are left with

$$\frac{\partial}{\partial\mu}\log L = \sum_{i=1}^{n}(X_i-\mu) = \sum_{i=1}^{n}X_i - \sum_{i=1}^{n}\mu = \sum_{i=1}^{n}X_i - n\mu = 0$$

Which we can actually solve

$$\mu = \mu_{MLE} = \frac{1}{n}\sum_{i=1}^{n}X_i = m_X$$

This equation tells us the value of μ that we should choose if we want to maximize the likelihood. I hope that it is clear that the suggestion is simply to choose the sum of the observations divided by the total number of observations – in other words, the average. I have written $\mu_{MLE}$ to remind us that this is the maximum likelihood estimator.

Notice that although the likelihood function (illustrated in figure X) depends on both parameters, the formula we obtained for the µMLE doesn’t. A similar (slightly more complicated) derivation is also possible for the standard deviation.

$$\frac{\partial}{\partial\sigma}\log L = 0 \;\Rightarrow\; \sigma_{MLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2} = s_X$$

In the MLE for the standard deviation, there is an explicit dependence on the mean. Because, in order to maximize the likelihood, the derivatives with respect to *all* the parameters must be zero, to get the maximum likelihood estimate for the standard deviation you need to first calculate the MLE for the mean and plug it into the formula above.

In general, setting the derivatives of the likelihood with respect to all the parameters to zero leads to a set of equations with as many equations and unknowns as the number of parameters. In practice there are very few problems of this kind that can be solved analytically.
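
As a sanity check, here is a short sketch comparing the analytic MLEs to a numerical maximization of the log-likelihood; the data are the five observations used earlier, and the use of optim here is just one of many possible numerical approaches.

```r
# Numerical maximization of the log-likelihood versus the analytic MLEs.
x <- c(5.2, 9.1, 8.2, 7.3, 7.8)
negloglik <- function(p) -sum(dnorm(x, mean = p[1], sd = p[2], log = TRUE))
fit <- optim(c(6.5, 1.5), negloglik,          # start from the illustrative values
             method = "L-BFGS-B", lower = c(-Inf, 1e-6))
fit$par                                        # numerical estimates of (mu, sigma)
c(mean(x), sqrt(mean((x - mean(x))^2)))        # analytic MLEs: the average and s_X
# note: R's built-in sd() divides by n - 1, so it is not exactly the MLE
```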

Mathematical box – The distribution of parameter estimates for MLEs

Assuming that you have managed to maximize the likelihood of your model using either analytic or numerical approaches, it is sometimes possible to take advantage of the very well-developed statistical theory in this area to do hypothesis testing on the parameters. The maximum likelihood estimator is a function of the data, and therefore it will not give the same answer if another random sample is taken from the distribution. However, it is known that (under certain assumptions) the maximum likelihood estimates will be Gaussian distributed, with means equal to the true means of the parameters, and variances related to the second derivatives of the likelihood at the maximum, which are summarized in the so-called Fisher Information matrix (which I abbreviate as FI).

$$\mathrm{Var}(\theta_{MLE}) = E\left[-FI^{-1}\right]_{\theta=\theta_{MLE}}$$

This formula says that the variance of the parameter estimates is the (1) expectation of the negative of (2) the inverse of the (3) Fisher information matrix, evaluated at the maximum of the likelihood (so that all parameters have been set equal to their MLEs). I’ve written the numbers to indicate that getting the variance of the parameter estimates is actually a tedious three-step process, and it’s rarely used in practice for that reason. However, if you have a simple model, and don’t mind a little math, it can be incredibly useful to have these variances. For example, in the case of the Gaussian distribution, there are two parameters (μ and σ), so the Fisher information matrix is a 2x2 matrix.

$$FI = \begin{bmatrix} \dfrac{\partial^2 \log L}{\partial\mu^2} & \dfrac{\partial^2 \log L}{\partial\mu\,\partial\sigma} \\[2ex] \dfrac{\partial^2 \log L}{\partial\mu\,\partial\sigma} & \dfrac{\partial^2 \log L}{\partial\sigma^2} \end{bmatrix} = \begin{bmatrix} -\dfrac{n}{\sigma^2} & 0 \\[2ex] 0 & -\dfrac{2n}{\sigma^2} \end{bmatrix}$$

The first step in getting the variance of your estimator is evaluating these derivatives. In most cases, this must be done numerically, but in textbook examples they can be evaluated analytically. For the Gaussian model, at the maximum of the likelihood they have the simple formulas that I’ve given above.

The second derivatives measure the change in the slope of the likelihood function, and it makes sense that they come up here because the variance of the maximum likelihood estimator is related intuitively to the shape of the likelihood function near the maximum. If the likelihood surface is very flat around the estimate, there is less certainty, whereas if the maximum likelihood estimate is at a very sharp peak in the likelihood surface, there is a lot of certainty – another sample from the same pool is likely to give nearly the same maximum. The second derivatives measure the local curvature of the likelihood surface near the maximum.

Once you have the derivatives (using the values of the parameters at the maximum), the next step is to invert this matrix. In practice, this is not possible to do analytically for all but the simplest statistical models. For the Gaussian case, the matrix is diagonal, so the inverse is just

$$FI^{-1} = \begin{bmatrix} -\dfrac{\sigma^2}{n} & 0 \\[2ex] 0 & -\dfrac{\sigma^2}{2n} \end{bmatrix}$$

Finally, once you have the inverse, you simply take the negative of the diagonal entry in the matrix that corresponds to the parameter you’re interested in, and then take the expectation. So the variance for the mean would be σ²/n. This means that the distribution of μ_MLE is Gaussian, with mean equal to the true mean, and standard deviation equal to the true standard deviation divided by the square root of n.
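
A quick simulation sketch (with assumed “true” parameter values) shows this result in action: the sampling variance of the estimated mean comes out close to σ²/n.

```r
# Sampling distribution of the MLE of the mean for repeated samples of size n.
set.seed(1)
n <- 25; mu <- 7.5; sigma <- 1.5
mu_hat <- replicate(10000, mean(rnorm(n, mu, sigma)))
var(mu_hat)      # empirical variance of the estimator
sigma^2 / n      # theoretical variance from the Fisher information: 0.09
```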

Figure X – the distribution of parameter estimates is related to the shape of the region near the maximum of the likelihood surface

5. Other objective functions

Despite the popularity, conceptual clarity and theoretical properties of maximum likelihood estimation, there are (many) other objective functions and corresponding estimators that are widely used.

Another simple, intuitive objective function is the “least squares” – simply adding up the squared differences between the model and the data. Minimizing the sum of squared differences leads to the maximum likelihood estimates in many cases, but not always. One good thing about least squares estimation is that it can be applied even when your model doesn’t actually conform to a probability distribution (or it’s very hard to write out or compute the probability distribution).

One of the most important objective functions is the so-called posterior probability and the corresponding maximum a posteriori probability (MAP) estimates/estimators. In contrast to ML estimation, MAP estimation says: “choose the parameters so the model is most probable given the data we observed.” Now the objective function is P(θ|X) and the equation to solve is

$$\frac{\partial}{\partial\theta} P(\theta|X) = 0$$

As you probably already guessed, the MAP and ML estimation problems are related via Bayes’ theorem, so that this can be written as

$$P(\theta|X) = \frac{P(\theta)}{P(X)}\, P(X|\theta) = \frac{P(\theta)}{P(X)}\, L$$


Once again, it is convenient to think about the optimization problem in log space, where the objective function breaks into three parts, only two of which actually depend on the parameters.

$$\frac{\partial}{\partial\theta}\log P(\theta|X) = \frac{\partial}{\partial\theta}\log P(\theta) + \frac{\partial}{\partial\theta}\log L - \frac{\partial}{\partial\theta}\log P(X) = \frac{\partial}{\partial\theta}\log P(\theta) + \frac{\partial}{\partial\theta}\log L = 0$$

Interestingly, optimizing the posterior probability therefore amounts to optimizing the likelihood function *plus* another term that depends only on the parameters.

The posterior probability objective function turns out to be one of a class of so-called “penalized” likelihood functions where the likelihood is combined with mathematical functions of the parameters to create a new objective function. As we shall see, these objective functions turn out to underlie several intuitive and powerful machine learning methods that we will see in later chapters.
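
As an illustration (not a recipe from the text), a minimal MAP sketch for the Gaussian mean might look like the following; the Gaussian prior on μ (prior mean 0, prior standard deviation 10) is an assumption chosen only for this example, and σ is held fixed for simplicity. The log prior is exactly the “penalty” added to the log-likelihood.

```r
# MAP estimate of the Gaussian mean under an assumed broad Gaussian prior.
x <- c(5.2, 9.1, 8.2, 7.3, 7.8)
sigma <- 1.5
neg_log_posterior <- function(mu) {
  -sum(dnorm(x, mu, sigma, log = TRUE)) -        # negative log-likelihood
    dnorm(mu, mean = 0, sd = 10, log = TRUE)     # negative log prior (the penalty)
}
optimize(neg_log_posterior, interval = c(0, 20))$minimum
# with such a broad prior the MAP estimate is nearly identical to the MLE, mean(x)
```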

No matter what objective function is chosen, estimation almost always involves solving a mathematical optimization problem, and in practice this is almost always done using a computer – either with a statistical software package such as R or Matlab, or using purpose-written code.

Box – bias, consistency and efficiency of estimators

In order to facilitate debates about which estimators are the best, statisticians developed several objective criteria that can be used to compare estimators. For example, it is commonly taught that the ML estimator of the standard deviation for the Gaussian is biased. This means that for a sample of data, the value of the standard deviation obtained using the formula given will tend to (in this case) underestimate the “true” standard deviation of the values in the pool. On the other hand the estimator is consistent, meaning that as the sample size drawn from the pool approaches infinity, the estimator does converge to the “true” value. Finally, the efficiency of the estimator describes how quickly the estimate approaches the truth as a function of the sample size. In modern molecular biology we are generally dealing with large enough sample sizes that we don’t really need to worry about these issues. In addition, they will almost always be taken care of by the computer statistics package used to do the calculations. This is a great example of a topic that is traditionally covered at length in introductory statistics courses, and is of academic interest to statisticians, but is of no practical use in modern data-rich science.

So how do we choose an objective function? In practice as biologists we usually choose the one that’s simplest to apply, where we can find a way to reliably optimize it. We’re not usually interested in debating with statisticians about whether the likelihood of the model is more important than the likelihood of the data. Instead, we want to know something about the parameters that are being estimated- testing our hypothesis about whether this experiment yields more of something than another experiment, whether something is an example of A and not B, whether there are two groups or three, or whether A predicts B better than random. So as long as we use the same method of estimation on all of our data, it’s probably not that important which estimation methods we used.


Box: Bayesian estimation and prior distributions

As we have seen, the MAP objective function and more generally penalized likelihood methods can be related to the ML objective function through the use of Bayes’ Theorem. For this reason these methods sometimes are given names with the word “Bayesian” in them. However, as long as a method results in a single estimator for parameters, it is not really Bayesian in spirit. Truly Bayesian estimation means that you don’t try to pin down a single value for your parameters. Instead, you embrace the fundamental uncertainty that any particular estimate for your parameters is just one possible estimate drawn from a pool. True Bayesian statistics means that you consider the entire distribution of your parameters given your data and your prior beliefs about what the parameters should be. In practice Bayesian estimation is rarely used in biology, because biologists want to know the values of their parameters. We don’t want to consider the whole distribution of expression levels of our gene that are compatible with the observed data: we want to know the level of the gene.

Although Bayesian estimation is rarely used, the closely associated concepts of prior and posterior distributions are very powerful and widely used. Because the Bayesian perspective is to think of the parameters as random variables that need to be estimated, models are used to describe the distributions of the parameters both before (prior) and after considering the observations (posterior). Although it might not seem intuitive to think that the parameters have a distribution *before* we consider the data, in fact it makes a lot of sense: we might require the parameters to be between 0 and infinity if we are using a Poisson model for our data, or ensure that they add up to 1 if we are using a multinomial or binomial model. The idea of prior distributions is that we can generalize this to quantitatively weight the values of the parameters by how likely the parameters might turn out to be. Of course, if we don’t have any prior beliefs about the parameters, we can always use uniform distributions, so that all possible values of the parameters are equally likely (in Bayesian jargon these are called uninformative priors). However, as we shall see, we will find it too convenient to resist putting prior knowledge into our models using priors.

6. Multivariate statistics

An important generalization of the statistical models that we’ve seen so far is to the case where multiple events are observed at the same time. In the models we’ve seen so far, observations were single events: yes or no, numbers or letters. In practice, a modern molecular biology experiment typically measures more than one thing, and a genomics experiment might yield measurements for thousands of things: all the genes in the genome.

A familiar example of an experiment of this kind might be a set of genome-wide expression level measurements. In the ImmGen data, for each gene, we have measurements of gene expression over ~200 different cell types. Although in the previous chapters we considered each of the cell types independently, a more comprehensive way to describe the data is that for each gene, the observation is actually a vector, X, of length ~200, where each element of the vector is the expression measurement for a specific cell type. Alternatively, it might be more convenient to think of each observation as a cell type, where the observation is now a vector of 24,000 gene expression measurements. This situation is known in statistics as “multivariate” to describe the idea that multiple variables are being measured simultaneously. Conveniently, the familiar Gaussian distribution generalizes to the multivariate case, except the single-number (scalar) mean and variance parameters are now replaced with a mean vector and (co-)variance matrix.

$$N(\vec{X}|\vec{\mu},\Sigma) = \frac{1}{\sqrt{|\Sigma|(2\pi)^d}}\, e^{-\frac{1}{2}(\vec{X}-\vec{\mu})^T \Sigma^{-1}(\vec{X}-\vec{\mu})}$$

Here I’ve used a small d to indicate the dimensionality of the data, so that d is the length of the vectors μ and X, and the covariance is a matrix of size d x d. In this formula, I’ve explicitly written small arrows above the vectors and bolded the matrices. In general the machine learning people will not do this (and I will adopt this convention), so it will be left to the reader to keep track of what are the scalars, vectors and matrices. If you’re hazy on your vector and matrix multiplications and transposes, you’ll have to review them in order to follow the rest of this section (and most of this book).
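
To make the formula concrete, here is a small sketch that evaluates the multivariate Gaussian density directly from the definition above using base R; packages such as mvtnorm provide equivalent ready-made functions, and the example mean and covariance below are assumptions for illustration only.

```r
# Multivariate Gaussian density evaluated straight from the formula.
dmvgauss <- function(x, mu, Sigma) {
  d    <- length(x)
  diff <- x - mu
  drop(exp(-0.5 * t(diff) %*% solve(Sigma) %*% diff) /
         sqrt(det(Sigma) * (2 * pi)^d))
}
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)   # 2-d example with correlated dimensions
dmvgauss(c(0.5, 0.3), mu, Sigma)
```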

Mathematical box: a quick review of vectors, matrices and linear algebra

As I’ve already mentioned, a convenient way to think about multiple observations at the same time is to think about them as lists of numbers, which are known as vectors in mathematical jargon. We will refer to the list of numbers X as a vector X = (x1, x2, x3, … xn), where n is the “length” or dimensionality of the vector (the number of numbers in the list). Once we’ve defined these lists of numbers, we can go ahead and define arithmetic and algebra on these lists. One interesting wrinkle to the mathematics of lists is that for any operation we define, we have to keep track of whether the result is actually a list or a number.

$$X - Y = (x_1, x_2, \ldots, x_n) - (y_1, y_2, \ldots, y_n) = (x_1 - y_1, x_2 - y_2, \ldots, x_n - y_n)$$

which turns out to be a vector, and

$$X \cdot Y = x_1 y_1 + x_2 y_2 + \ldots + x_n y_n = \sum_{i=1}^{n} x_i y_i$$

which is a number. The generalization of algebra means that we can write equations like

$$X - Y = 0$$

which means that

$$(x_1 - y_1, x_2 - y_2, \ldots, x_n - y_n) = (0, 0, \ldots, 0)$$

a shorthand way of writing n equations in one line.

Since mathematicians love generalizations, there’s no reason we can’t generalize the idea of lists to also include a list of lists, so that each element of the list is actually a vector. This type of object is what we call a matrix: A = (X1, X2, X3, …, Xm), where X1 = (x1, x2, x3, … xn). To refer to each element of A, we can write A11, A12, A13, … A21, A22, A23, … Amn. We can then go ahead and define some mathematical operations on matrices as well: if A and B are matrices, A – B = C means that for all i and j, Cij = Aij – Bij.

We can also do mixtures of matrices and vectors and numbers:

$$cx + Ay = \left(c x_1 + \sum_{j=1}^{m} A_{1j} y_j,\;\; c x_2 + \sum_{j=1}^{m} A_{2j} y_j,\;\; \ldots,\;\; c x_n + \sum_{j=1}^{m} A_{nj} y_j\right)$$

where c is a number, x and y are vectors and A is a matrix. This turns out to be a vector.

However, there’s one very inelegant issue with the generalization to matrices: what we mean when we refer to the value Aij depends on whether the i refers to the index in 1 through m or 1 through n. In other words, we have to keep track of the structure of the matrix. To deal with this issue, linear algebra has developed a set of internally consistent notations, which are referred to as the “row” or “column” conventions. So any time I write the vector, x, by default I mean the “column” vector

$$X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \qquad X^T = (x_1\;\; x_2\;\; \ldots\;\; x_n)$$

To indicate the “row” vector, I have to write the “transpose” of X, or X^T. The transpose is defined as the operation of switching all the rows and columns. So in fact there are two kinds of products that can be defined:

$$X \cdot Y = X^T Y = (x_1\;\; x_2\;\; \ldots\;\; x_n)\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \sum_{i=1}^{n} x_i y_i$$

which is the familiar dot product, also known as the “inner product”, and

$$X Y^T = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}(y_1\;\; y_2\;\; \ldots\;\; y_m) = \begin{pmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_m \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_m \\ \vdots & & \ddots & \vdots \\ x_n y_1 & x_n y_2 & \cdots & x_n y_m \end{pmatrix}$$

which is the so-called “outer” product that takes two vectors and produces a matrix.

This means that X·X = X^T X, which is a number, while X X^T works out to a matrix.

Although you don’t really have to worry about this stuff unless you are doing the calculations, I will try to use consistent notation, and you’ll have to get used to seeing these linear algebra notations as we go along.


Finally, an interesting point here is to consider the generalization beyond lists of lists: it’s quite reasonable to define a matrix where each element of the matrix is actually a vector. This object is called a tensor. Unfortunately, as you can imagine, when we get to objects with three indices, there’s no simple convention like “rows” and “columns” that we can use to keep track of the structure of the objects. I will at various times in this book introduce objects with more than two indices – especially when dealing with sequence data. However, in those cases, I won’t be able to use the generalizations of addition and subtraction from linear algebra, because we won’t be able to keep track of the indices. We’ll have to write out the sums explicitly when we get beyond two indices.

Matrices also have different types of multiplications: the matrix product produces a matrix, but there are also inner products and outer products that produce other objects. A related concept that we’ve already used is the “inverse” of a matrix. The inverse is the matrix that multiplies to give a matrix with 1’s along the diagonal (the so-called identity matrix, I).

$$A A^{-1} = I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$$
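
For readers who want to see these conventions in action, here is a tiny sketch in R; the vectors and the matrix are arbitrary examples.

```r
# Inner product, outer product, and matrix inverse.
x <- c(1, 2, 3); y <- c(4, 5, 6)
t(x) %*% y        # inner (dot) product: a 1x1 matrix holding a single number
x %*% t(y)        # outer product: a 3x3 matrix
A <- matrix(c(2, 1, 0, 1, 3, 0, 0, 0, 4), 3, 3)
solve(A)          # the inverse of A
A %*% solve(A)    # the identity matrix I (up to rounding error)
```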

Figure XA illustrates the idea of sampling lists of observations from a pool. Although this might all sound complicated, multivariate statistics is easy to understand because there’s a very straightforward, beautiful geometric interpretation to it. The idea is that we think of each component of the observation vector (say each gene’s expression level in a specific cell type) as a “dimension.” So if we measure the expression level of two genes in each cell type, we have two-dimensional data. If we measure three genes, then we have three-dimensional data. 24,000 genes, then we have … you get the idea. Of course, we won’t have an easy time making graphs of 24,000-dimensional space, so we’ll typically use 2 or 3-dimensional examples for illustrative purposes.


[Figure panels: observations x1, x2, …, x5 sampled from a pool, each written as a vector (e.g., x4 = (x41, x42, …, x4d)); example axes include Gene 1 level vs. Gene 3 level, genotype (AA/Aa/aa) vs. phenotype, and codon positions 1 and 3 (A, C, G, T), with a sequence observation such as x3 = CAG encoded as a matrix of 0s and 1s.]

Figure – multivariate observations as vectors

In biology there are lots of other types of multivariate data, and a few of these are illustrated in Figure XB. For example, one might have observations of genotypes and phenotypes for a sample of individuals. Another ubiquitous example is DNA sequences: the letter at each position can be thought of as one of the dimensions. In this view, each of our genomes represents a 3-billion-dimensional vector sampled from the pool of the human population. In an even more useful representation, each position in a DNA (or protein) sequence can be represented as a 4- (or 20-) dimensional vector, and the human genome can be thought of as a 3 billion x 4 matrix of 1s and 0s. In these cases, the components of observations are not all numbers, but this should not stop us from using the geometrical interpretation that each observation is a vector in a high-dimensional space.

A key generalization that becomes available in multivariate statistics is the idea of correlation. Although we will still assume that the observations are i.i.d., the dimensions are not necessarily independent. For example, in a multivariate Gaussian model for cell-type gene expression, the observation of a highly expressed gene X might make us more likely to observe a highly expressed gene Y. In the multivariate Gaussian model, the correlation between the dimensions is controlled by the off-diagonal elements in the covariance matrix, where each off-diagonal entry summarizes the correlation between a pair of dimensions. Intuitively, an off-diagonal term of zero implies that there is no correlation between two dimensions. In a multivariate Gaussian model where all the dimensions are independent, the off-diagonal terms of the covariance matrix are all zero, so the covariance is said to be diagonal. A diagonal covariance with equal variances in every dimension leads to a symmetric, isotropic or, most confusingly, “spherical” distribution.
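
A sketch of how samples like those in the right-hand panels of the figure below can be generated, using MASS::mvrnorm with a diagonal versus an off-diagonal covariance matrix (the specific covariance values follow the figure):

```r
# Bivariate Gaussian samples with independent versus correlated dimensions.
library(MASS)
ind  <- mvrnorm(n = 1000, mu = rep(0, 2), Sigma = matrix(c(1, 0,   0,   1), 2, 2))
corr <- mvrnorm(n = 1000, mu = rep(0, 2), Sigma = matrix(c(1, 0.8, 0.8, 1), 2, 2))
par(mfrow = c(1, 2))
plot(ind,  xlab = "X1", ylab = "X2", main = "diagonal covariance")
plot(corr, xlab = "X1", ylab = "X2", main = "off-diagonal terms = 0.8")
```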


[Figure panels: left – CD8 antigen alpha chain vs. CD8 antigen beta chain expression levels (ImmGen data, log-scale axes); right – 1000 samples each from four bivariate Gaussians generated with mvrnorm(n = 1000, rep(0, 2), Σ), using covariance matrices (1, 0; 0, 1), (1, 0.8; 0.8, 1), (4, 0; 0, 1) and (0.2, −0.6; −0.6, 3).]

Figure – multivariate Gaussians and correlation

The panel on the left shows real gene expression data for the CD8 antigen (from ImmGen). The panel on the right shows 4 parameterizations of the multivariate Gaussian in two dimensions. In each case the mean is at (0,0). Notice that none of the simple Gaussian models fit the observed CD8 expression data very well.

7. MLEs for multivariate distributions

The ideas that we’ve already introduced about optimizing objective functions can be transferred directly from the univariate case to the multivariate case. The only minor technical complication will arise because multivariate distributions have more parameters, and therefore the set of equations to solve can be larger and more complicated.

To illustrate the kind of mathematical tricks that we’ll need to use, we’ll consider two examples of multivariate distributions that are very commonly used. First, the multinomial distribution, which is the multivariate generalization of the binomial. This distribution describes the numbers of times we observe events from multiple categories. For example, the traditional use of this type of distribution would be to describe the number of times each face of a die (1 through 6) turned up. In bioinformatics it is often used to describe the numbers of each of the bases in DNA (A, C, G, T).

If we say that X is the vector of counts of the four bases, and f is a vector of probabilities of observing each base, such that $\sum_i f_i = 1$, where i indexes the four bases, then the multinomial probability for the counts of each base in DNA is given by

$$MN(X|f) = \frac{(X_A+X_C+X_G+X_T)!}{X_A!\,X_C!\,X_G!\,X_T!}\, f_A^{X_A} f_C^{X_C} f_G^{X_G} f_T^{X_T} = \frac{\left(\sum_{i\in\{A,C,G,T\}} X_i\right)!}{\prod_{i\in\{A,C,G,T\}} X_i!} \prod_{i\in\{A,C,G,T\}} f_i^{X_i}$$


The term in front (with the factorials over the product) has to do with the number of “ways” that you could have observed, say, X = (535 462 433 506) for A, C, G and T. For our purposes (to derive the MLEs) we don’t need to worry about this term because it doesn’t depend on the parameters, and when we take the log and then the derivative, it will disappear.

$$\log L = \log \frac{\left(\sum_{i\in\{A,C,G,T\}} X_i\right)!}{\prod_{i\in\{A,C,G,T\}} X_i!} + \log \prod_{i\in\{A,C,G,T\}} f_i^{X_i}$$

However, the parameters are the fs, so the equation to solve for each one is

$$\frac{\partial \log L}{\partial f_A} = \frac{\partial}{\partial f_A} \sum_{i\in\{A,C,G,T\}} X_i \log f_i = 0$$

If we solved this equation directly, we get

$$\frac{\partial \log L}{\partial f_A} = \frac{X_A}{f_A} = 0$$

Although this equation seems easy to solve, there is one tricky issue: the sum of the parameters has to be 1. If we solved it directly, we’d always set each of the parameters to infinity, in which case they would not sum up to one. The optimization (taking derivatives and setting them to zero) doesn’t know that we’re working on a probabilistic model – it’s just a straight optimization. In order to enforce that the parameters stay between 0 and 1, we need to add a constraint to the optimization. This is most easily done through the method of Lagrange multipliers, where we re-write the constraint as an equation that equals zero, e.g., $1 - \sum_i f_i = 0$, and add it to the function we are trying to optimize, multiplied by a constant, the so-called Lagrange multiplier, lambda (λ).

$$\frac{\partial \log L}{\partial f_A} = \frac{\partial}{\partial f_A}\left[\sum_{i\in\{A,C,G,T\}} X_i \log f_i + \lambda\left(1 - \sum_{i\in\{A,C,G,T\}} f_i\right)\right] = 0$$

Taking the derivatives gives

$$\frac{\partial \log L}{\partial f_A} = \frac{X_A}{f_A} - \lambda = 0$$

Which we can solve to give

$$(f_A)_{MLE} = \frac{X_A}{\lambda}$$

Where I have used MLE to indicate that we now have the maximum likelihood estimator for the parameter $f_A$. Of course, this is not very useful because it is in terms of the Lagrange multiplier. To figure out what the actual MLEs are, we have to think about the constraint $1 - \sum_i f_i = 0$. Since we need the derivatives with respect to all the parameters to be 0, we’ll get a similar equation for $f_C$, $f_G$ and $f_T$. Putting these together gives us:


$$\sum_{i\in\{A,C,G,T\}} (f_i)_{MLE} = \frac{\sum_{i\in\{A,C,G,T\}} X_i}{\lambda} = 1 \qquad \text{or} \qquad \lambda = \sum_{i\in\{A,C,G,T\}} X_i$$

Which says that lambda is just the total number of bases we observed. So the MLE for the parameter $f_A$ is just

$$(f_A)_{MLE} = \frac{X_A}{\sum_{i\in\{A,C,G,T\}} X_i}$$

Which is the intuitive result that the estimate for the probability of observing A is just the fraction of bases that were actually A.
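
For example, using the illustrative counts mentioned above (X = 535, 462, 433, 506 for A, C, G, T), the MLEs are just the observed fractions; a one-line sketch in R:

```r
# Multinomial MLEs: the observed base fractions.
counts <- c(A = 535, C = 462, G = 433, T = 506)
f_mle  <- counts / sum(counts)
f_mle          # e.g. f_A = 535 / 1936, roughly 0.276
```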

A more complicated example is to find the MLEs for the multivariate Gaussian. We’ll start by trying to find the MLEs for the mean. As before we can write the log likelihood

$$\log L = \log P(X|\theta) = \log\prod_{i=1}^{n} N(X_i|\mu,\Sigma) = \sum_{i=1}^{n}\left[-\frac{1}{2}\log\left[|\Sigma|(2\pi)^d\right] - \frac{1}{2}(\mu - X_i)^T\Sigma^{-1}(\mu - X_i)\right]$$

Since we are working in the multivariate case, we now need to take a derivative with respect to a vector μ. One way to do this is to simply take the derivative with respect to each component of the vector. So for the first component of the vector, we could write

$$\frac{\partial \log L}{\partial \mu_1} = \sum_{i=1}^{n} \frac{\partial}{\partial \mu_1}\left[ -\frac{1}{2}\sum_{j=1}^{d}(X_{ij}-\mu_j)\sum_{k=1}^{d}(\Sigma^{-1})_{jk}(X_{ik}-\mu_k)\right] = 0$$

$$\frac{\partial \log L}{\partial \mu_1} = \sum_{i=1}^{n} \frac{\partial}{\partial \mu_1}\left[ -\frac{1}{2}\sum_{j=1}^{d}\sum_{k=1}^{d}(X_{ij}-\mu_j)(\Sigma^{-1})_{jk}(X_{ik}-\mu_k)\right] = 0$$

$$\frac{\partial \log L}{\partial \mu_1} = \sum_{i=1}^{n} \frac{\partial}{\partial \mu_1}\left[ -\frac{1}{2}\sum_{j=1}^{d}\sum_{k=1}^{d}(\Sigma^{-1})_{jk}\left( X_{ij}X_{ik} - \mu_j X_{ik} - \mu_k X_{ij} + \mu_j\mu_k \right)\right] = 0$$

Where I have tried to write out the matrix and vector multiplications explicitly. Since the derivative will be zero for all terms that don’t depend on the first component of the mean, we have

$$\frac{\partial \log L}{\partial \mu_1} = -\frac{1}{2}\sum_{i=1}^{n} \frac{\partial}{\partial \mu_1}\left[ (\Sigma^{-1})_{11}\left(-\mu_1 X_{i1} - \mu_1 X_{i1} + \mu_1\mu_1\right) + \sum_{j=2}^{d}(\Sigma^{-1})_{j1}\left(-\mu_j X_{i1} - \mu_1 X_{ij} + \mu_j\mu_1\right) + \sum_{k=2}^{d}(\Sigma^{-1})_{1k}\left(-\mu_1 X_{ik} - \mu_k X_{i1} + \mu_1\mu_k\right)\right] = 0$$

Because of the symmetry of the covariance matrix, the last two terms are actually the same:

$$\frac{\partial \log L}{\partial \mu_1} = -\frac{1}{2}\sum_{i=1}^{n} \frac{\partial}{\partial \mu_1}\left[ (\Sigma^{-1})_{11}\left(-\mu_1 X_{i1} - \mu_1 X_{i1} + \mu_1\mu_1\right) + 2\sum_{j=2}^{d}(\Sigma^{-1})_{j1}\left(-\mu_j X_{i1} - \mu_1 X_{ij} + \mu_j\mu_1\right)\right] = 0$$

Differentiating the terms that do depend on the first component of the mean gives


$$\frac{\partial \log L}{\partial \mu_1} = -\frac{1}{2}\sum_{i=1}^{n}\left[ (\Sigma^{-1})_{11}\left(-2X_{i1} + 2\mu_1\right) + 2\sum_{j=2}^{d}(\Sigma^{-1})_{j1}\left(-X_{ij} + \mu_j\right)\right] = 0$$

$$\frac{\partial \log L}{\partial \mu_1} = -\sum_{i=1}^{n}\left[ (\Sigma^{-1})_{11}\left(-X_{i1} + \mu_1\right) + \sum_{j=2}^{d}(\Sigma^{-1})_{j1}\left(-X_{ij} + \mu_j\right)\right] = 0$$

Merging the first term back into the sum, we have

$$\frac{\partial \log L}{\partial \mu_1} = -\sum_{i=1}^{n}\left[\sum_{j=1}^{d}\left(-X_{ij} + \mu_j\right)(\Sigma^{-1})_{j1}\right] = -\sum_{i=1}^{n}(\mu - X_i)^T (\Sigma^{-1})_1 = 0$$

Where I have abused the notation somewhat to write the first column of the inverse covariance matrix as a vector, $(\Sigma^{-1})_1$.

[Alternative derivation]

Since the derivative will be zero for all terms that don’t depend on the first component of the mean, we have

$$\frac{\partial \log L}{\partial \mu_1} = -\frac{1}{2}\sum_{i=1}^{n}\left[ -\sum_{k=1}^{d}(\Sigma^{-1})_{1k} X_{ik} - \sum_{j=1}^{d}(\Sigma^{-1})_{j1} X_{ij} + \sum_{k=1}^{d}(\Sigma^{-1})_{1k}\mu_k + \sum_{j=1}^{d}(\Sigma^{-1})_{j1}\mu_j\right] = 0$$

Notice that all of these sums are over the same thing, so we can factor them and write

$$\frac{\partial \log L}{\partial \mu_1} = -\frac{1}{2}\sum_{i=1}^{n}\left[ -\sum_{k=1}^{d}\left[(\Sigma^{-1})_{1k} + (\Sigma^{-1})_{k1}\right] X_{ik} + \sum_{k=1}^{d}\left[(\Sigma^{-1})_{1k} + (\Sigma^{-1})_{k1}\right]\mu_k\right] = 0$$

$$\frac{\partial \log L}{\partial \mu_1} = -\frac{1}{2}\sum_{i=1}^{n}\left[\sum_{k=1}^{d}\left(-X_{ik} + \mu_k\right)\left[(\Sigma^{-1})_{1k} + (\Sigma^{-1})_{k1}\right]\right] = 0$$

Because the covariance is symmetric, $(\Sigma^{-1})_{1k} = (\Sigma^{-1})_{k1}$, and we have

$$\frac{\partial \log L}{\partial \mu_1} = -\sum_{i=1}^{n}\sum_{k=1}^{d}\left(-X_{ik} + \mu_k\right)(\Sigma^{-1})_{1k} = 0 \quad\Rightarrow\quad \sum_{i=1}^{n}(\mu - X_i)^T (\Sigma^{-1})_1 = 0$$

Where I have abused the notation somewhat to write the first column of the inverse covariance matrix as a vector, $(\Sigma^{-1})_1$, and multiplied both sides by -1, so as not to bother with the negative sign. Notice the problem: although we’re trying to find the MLE for the first component of the mean only, we have an equation that includes all the components of the mean through the off-diagonal elements of the covariance matrix. This means we have a single equation with d variables and unknowns, which obviously cannot be solved uniquely. However, when we try to find the maximum likelihood parameters, we have to set *all* the derivatives with respect to all the parameters to zero, and we will end up with an equation like this for each component of the mean. This implies that we will actually have a set of d equations, one for each of the d components of the mean, each involving a different row of the covariance matrix. We’ll get a set of equations like

$$\frac{\partial \log L}{\partial \mu_2} = \sum_{i=1}^{n}(\mu - X_i)^T (\Sigma^{-1})_2 = 0 \qquad \cdots \qquad \frac{\partial \log L}{\partial \mu_d} = \sum_{i=1}^{n}(\mu - X_i)^T (\Sigma^{-1})_d = 0$$

We can write the set of equations as

$$\frac{\partial \log L}{\partial \mu} = \sum_{i=1}^{n}(\mu - X_i)^T \Sigma^{-1} = 0$$

Where the 0 is now the vector of zeros for all the components of the mean. To solve this equation, we note that the covariance matrix does not depend on i, so we can simply multiply each term of the sum and the 0 vector by the covariance matrix. We get

$$\frac{\partial \log L}{\partial \mu} = \sum_{i=1}^{n}(\mu - X_i)^T = 0$$

The equation can be solved to give

$$\sum_{i=1}^{n} X_i^T = n\mu^T$$

Or the familiar

$$\mu^T = \frac{1}{n}\sum_{i=1}^{n} X_i^T$$

Which says simply that the MLEs for the components of the mean are simply the averages of the observations in each dimension.
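
A quick simulated check (with assumed true parameters) that the per-dimension averages are indeed the MLE of the multivariate mean:

```r
# The MLE of the mean vector is the vector of column averages.
library(MASS)
X <- mvrnorm(n = 1000, mu = c(2, -1), Sigma = matrix(c(1, 0.5, 0.5, 2), 2, 2))
colMeans(X)    # close to the true mean vector (2, -1)
```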

It turns out that there is a much faster way of solving these types of problems using so-called vector (or matrix) calculus. Instead of working on derivatives of each component of the mean individually, we will use clever linear algebra notation to write all of the equations in one line by using the following identity:

$$\frac{\partial}{\partial x}\left[x^T A x\right] = x^T(A + A^T)$$

Where A is any matrix, and the derivative is now a derivative with respect to a whole vector, x. Using this trick, we can proceed directly (remember that for a symmetric matrix like the covariance, $A = A^T$):


$$\frac{\partial \log L}{\partial \mu} = \frac{\partial}{\partial\mu}\sum_{i=1}^{n} -\frac{1}{2}(\mu - X_i)^T \Sigma^{-1}(\mu - X_i) = -\sum_{i=1}^{n}\frac{1}{2}(\mu - X_i)^T\, 2\,\Sigma^{-1} = -\sum_{i=1}^{n}(\mu - X_i)^T \Sigma^{-1} = 0$$

Similar matrix calculus tricks can be used to find the MLEs for (all the components of) the covariance matrix. If you know that $\frac{\partial}{\partial A}\log|A| = (A^{-1})^T$ and $\frac{\partial}{\partial A}\left[x^T A x\right] = x x^T$, where again A is a matrix and x is a vector, it’s not too hard to find the MLEs for the covariance matrix. In general, this matrix calculus is not something that biologists (or even bioinformaticians) will be familiar with, so if you ever have to differentiate your likelihood with respect to vectors or matrices, you’ll probably have to look up the necessary identities.

8. Hypothesis testing revisited – the problems with high dimensions.

Since we’ve just agreed that what biologists are usually doing is testing hypotheses, we usually think much more about our hypothesis tests than about our objective functions. Indeed, as we’ve seen already, it’s even possible to do hypothesis testing without specifying parameters or objective functions (non-parametric tests).

Although I said that statistics has a straightforward generalization to high dimensions, in practice the most powerful idea from hypothesis testing, namely, the P-value, does not generalize very well. This has to do with the key idea that the P-value is the probability of observing something as extreme *or more.* In high-dimensional space, it’s not clear which direction the “or more” is in. For example, if you observed 3 genes’ average expression levels (7.32, 4.67, 19.3) and you wanted to know whether this was the same as these genes’ average expression levels in another set of experiments (8.21, 5.49, 5.37), you could try to form a 3-dimensional test statistic, but it’s not clear how to sum up the values of the test statistic that are more extreme than the ones you observed – you have to decide which direction(s) to do the sum in. Even if you decide which direction you want to sum up each dimension, performing these multidimensional sums is practically difficult as the number of dimensions becomes large.

The simplest way to deal with hypothesis testing in multivariate statistics is just to do a univariate test on each dimension and pretend they are independent. If any dimension is significant, then (after correcting for the number of tests) the multivariate test must also be significant. In fact, that’s what we were doing in the previous chapter when we used Bonferroni to correct for the number of tests in the gene set enrichment analysis. Even if the tests are not independent this treatment is conservative, and in practice, we often want to know in which dimension the data differed. In the case of gene set enrichment analysis, we don’t really care whether ‘something’ is enriched – we want to know what exactly is the enriched category.

However, there are some cases where we might not want to simply treat all the dimensions independently. A good example of this might be a time-course of measurements, or measurements that are related in some natural way, like the length and width of an iris petal. If you want to test whether one sample of iris petals is bigger than another, you probably don’t want to test whether the length is bigger and then whether the width is bigger. You want to combine both into one test. Another example might be if you’ve made pairs of observations and you want to test if their ratios are different, but the data include a lot of zeros, so you can’t actually form the ratios. One possibility is to create a new test statistic and generate some type of empirical null distribution (as described in the first chapter). However, another powerful approach is to formulate a truly multivariate hypothesis test: a likelihood ratio test.

Formally:

• The observations (or data) are X1, X2, … XN , which we will write as a vector X

• H0, is the null hypothesis, and H1 is another hypothesis. The two hypotheses make specific claims about the parameters in each model. For example, H0 might state that θ = φ, some particular values of the parameters, while H1 might state that θ ≠ φ (i.e., that the parameters are anything but φ).

• The likelihood ratio test statistic is -2 log [ p(X| θ = φ)/ p(X| θ ≠ φ) ], where any parameters that are not specified by the hypotheses (i.e., free parameters) have been set to their maximum likelihood values. (This means that in order to perform a likelihood ratio test, it is necessary to be able to obtain maximum likelihood estimates, either numerically or analytically).

• Under the null hypothesis the likelihood ratio test statistic is chi-square distributed, with degrees of freedom equal to the difference in the number of free parameters between the two hypotheses.

Example of LRT for multinomial: GC content in genomes

The idea of the likelihood ratio test is that when two hypotheses (or models) describe the same data using different numbers of parameters, the one with more free parameters will always achieve a slightly higher likelihood because it can fit the data better. However, the amazing result is that the improvement in fit that is due simply to chance is predicted by the chi-square distribution (which is always positive). If the model with more free parameters fits the data better than the improvement expected by chance, then we should accept that model.
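
As a hedged sketch of a base-composition test in the spirit of the GC-content example named above (the counts are the illustrative numbers used earlier in this chapter, not real genome data): under H0 the four bases are equally likely (no free parameters), while under H1 the base frequencies are free (3 free parameters, since they must sum to 1).

```r
# Likelihood ratio test: equal base frequencies (H0) versus free frequencies (H1).
counts <- c(A = 535, C = 462, G = 433, T = 506)
n      <- sum(counts)
logL0  <- sum(counts * log(rep(0.25, 4)))   # H0: all frequencies fixed at 0.25
logL1  <- sum(counts * log(counts / n))     # H1: MLEs (observed fractions) plugged in
lrt    <- -2 * (logL0 - logL1)              # the likelihood ratio test statistic
# (the multinomial coefficient cancels in the ratio and is omitted)
pchisq(lrt, df = 3, lower.tail = FALSE)     # P-value; df = difference in free parameters
```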

The likelihood ratio test is an example of a class of techniques that are widely used in machine learning to decide if adding more parameters to make a more complex model is “worth it” or if it is “overfitting” the data with more parameters than are really needed. We will see other examples of techniques in this spirit later in this book.

Exercises:

1. What is the most probable value under a univariate Gaussian distribution? What is its probability?

2. Use the joint probability rule to argue that a multivariate Gaussian with diagonal co-variance is nothing but the product of univariate Gaussians.

3. Show that the average is also the MLE for the parameter of the Poisson distribution. Explain why this is consistent with what I said about the average of the Gaussian distribution in Chapter 1.


4. Fill in the components of the vectors and matrices for this part of the multivariate Gaussian distribution: $\frac{1}{2}(\mu - X_i)^T \Sigma^{-1}(\mu - X_i) = \begin{bmatrix}\cdots\end{bmatrix}\begin{bmatrix}\ddots\end{bmatrix}\begin{bmatrix}\vdots\end{bmatrix}$

5. Derive the MLE for the covariance matrix of the multivariate Gaussian (use the matrix calculus tricks I mentioned in the text).

6. Why did we need Lagrange multipliers for the multinomial MLEs, but not for the Gaussian MLEs?