An analysis of “Failure to Yield” by Doug Gurian-Sherman ...
R Introduction and Training Patrick Gurian, Drexel University CAMRA 1st QMRA Summer Institute August...
-
Upload
eunice-wright -
Category
Documents
-
view
216 -
download
0
Transcript of R Introduction and Training Patrick Gurian, Drexel University CAMRA 1st QMRA Summer Institute August...
R Introduction and Training
Patrick Gurian, Drexel UniversityCAMRA
1st QMRA Summer InstituteAugust 7, 2006
Objective
• Learn maximum likelihood estimation-a method of fitting statistical models
• Learn the R statistical programming language
• All in 2 hours!– Idea is to make you aware of techniques and
tools
Statistical Model Estimation
• Statistical models all contain parameters
• Parameters are constants that can be “tuned” to make a general class of models applicable to a specific dataset
Example: Normal distribution
PDF is:
f(x) = 1/{σ(2π)1/2} exp{(x-µ)2/ 2σ2}
µ and σ are constants that define a particular normal distribution
We want to pick values that match a given dataset
One simple way is arithmetic mean=µ
s=σ (method of moments)
But there are other ways including…
Maximum likelihood estimation
• Assume a probability model
• Calculate the probability (likelihood) or obtaining the observed data
• Now adjust parameter values until you find the values that maximize the probability of obtaining the observed data
Likelihood Function
• Observe data x, a vector of values
• Assume some pdf f(x|θ)
• Where θ is a vector of model parameters
• Probability of any particular value is f(xi| θ) where i is an index indicating a particular observation in the data
Likelihood of the data
• Generally assume data is independent and identically distributed
• Same f(x) for all data• For independent data:
prob [Event A ∩ Event B] = prob[A] Prob[B] • So multiple probability of individual observations
to get joint probability of data
• L=π f(xi| θ)
• Now find θ that maximizes L
MLE example
• A team plays 3 games: W L L
• Binomial model: what is p?
• L=(3:1)p(1-p)2
• Suppose we know sequence of wins and losses then we can say
• L=p(1-p)2
Answer
• L=0.3*0.7*0.7=0.147
• Likelihood of data is greater with p=0.3 than with p=0.5
• We prefer p=0.3 as a parameter estimate to 0.5
Example: Maximizing the likelihood
dL/dp= 3(1-p)2 - 6p(1-p)
Find maximum at dL/dp=0
0=3(1-p)2 - 6p(1-p)
0=1-p -2p
3p=1
p=1/3
MLE example: Conclusion
Can verify that this is a maxima by looking at second derivative
Note that method of moments would give us p=x/n=1/3
So we get the same result by both methods
Ln Likelihood
• Product of many numbers each <1 is quite small
• Often easiest to work with ln (L)• Since ln is a monotonic transformation of
L, the largest ln (L) will correspond to the largest L value
Ln L=ln π f(xi| θ) Applying log laws
Ln L=Σ ln f(xi| θ)
Parameter uncertainty in MLE
• MLE gives us a general way to estimate parameters of many types of models
• But how do we make inferences about these parameters?
• There is a general method for large sample sizes– Huge advantage of MLE approach
Information Matrix
I = {d2ln(L) / dθ12 d2ln(L)/dθ1dθ2 ….
d2ln(L)/dθ1dθ2 d2ln(L)/dθ2
2 …..
…
d2ln(L)/dθ1dθk … …. d2ln(L)/dθk2 }
k = number of parametersAll the second derivatives of ln(L) with respect
to parameters
Interpreting I• Note that at MLE dL/ dθi =0 for all i
– L(θ^) is at a maximum or peak• Large second derivative indicates a sharp peak
– Parameter value of θ^ + Δ is much less likely than θ^
• Small second derivative indicates a gradual slope– Parameter value of θ^ + Δ is almost as likely
as θ^
Uncertainty of MLE parameters
θ^ ~N(θ, I-1)
MLE parameter estimates are asymptotically normal and unbiased with variance given by the inverse of the information matrix
Once you have the variance of θ^i use a Z
distribution to test hypotheses about θi and set CI for true value of θi
MLE in practice
• Usually work with log likelihood
• Often work with models for which it is not tractable develop analytical solutions
• Instead use numerical analysis to identify maximum likelihood through gradient search methods and invert I to get standard errors of parameters
• Need a software tool…
R language
• Freely available software– http://www.r-project.org/
• R is a programming language, not a set of pre-programmed routines with drop down menus– It operates from a command line prompt and has
exact syntax requirements
• Large library of functions developed by users• Develop your own code specific to your needs
R Resources
• An Introduction to R– available from the help menu
• Searchable help menu
• From command line: help(<function>)
• Simple R
• R for Beginners
• R for Dummies-no such luck
Strategies for dealing with R
• Use another software for standard analyses (JMP, SPSS, etc.)
• Have your favorite text editor open when you are working, if you actually succeed in getting a command to work, save it for posterity!
• Remember R offers incredible flexibility and speed
• Remember you’re not the only one R is driving up a tree
Your assignment
• Data on Cryptosporidium oocyst concentrations from Table 6-5, page 176 of Haas, Rose, and Gerba
Assignment (cont)
• Assume data follows a Poisson distribution
• Use MLE to fit parameter estimates to this data
• For practice use numerical approach even thought closed form solution exists for this case
• Repeat with negative binomial distribution
Poisson Distribution
• Independently occurring discrete random data– Often used for microbes– Assumes they don’t systematically cluster or spread
out
• Think of each glass of water as a independent trial that may or may not have a bug in it
• Think of each sip from each glass of water as an independent trial that may or may not have a bug in it.
Negative Binomial Distribution
• Another distribution to describe the number of occurrences of discrete random events
• Negative binomial has two parameters which allows for more flexibility, ability to fit more complex data