Python Analysis - U of T Physics (phy225h/python) · 2014-09-24
Python Analysis
PHYS 224, September 25/26, 2014
Goals
• Two things to teach in this lecture:
  1. How to use Python to fit data
  2. How to interpret what Python gives you
• Some references:
  • http://nbviewer.ipython.org/url/media.usm.maine.edu/~pauln/ScipyScriptRepo/CurveFitting.ipynb
  • http://www.physics.utoronto.ca/~phy326/python/curve_fit_to_data.py
Fitting Experimental Data
The goal of the lab experiments is to determine a physical quantity y (the dependent variable) as a function of x (the independent variable). How?
• Measure the pair (xi, yi) a number (N) of times
• Find a fit function y = y(x) that describes the relationship between these two quantities
The Linear Case
• The simplest function relating the two variables is the linear function
  f(x) = y = ax + b
• This is valid for any (xi, yi) combination
• If a and b are known, the true value of yi can be calculated for any xi:
  yi,true = a*xi + b
Linear Regression
• Linear regression calculates the most probable values of a and b such that the linear equation
  yi,true = a*xi + b
  is valid
• When taking measurements of yi, these usually obey Gauss' distribution
An Example
• Ideal Gas Law: P*V = n*R*T
• Pressure * Volume = n * R * Temperature
• P = [(n*R)/V]*T
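The data file used in the later slides isn't distributed with the transcript. A minimal sketch of a synthetic stand-in, assuming (as the later comparison slide does) quantities chosen so that volume versus temperature has slope R, could look like this; the noise level of 20 matches the uncertainty the slides assign to each point:

```python
import numpy

# Hypothetical stand-in for the lecture's ideal_gas_law.txt:
# volumes generated from V = (n*R/P)*T with n = 1, P = 1, plus noise.
R = 8.3144621                                 # gas constant, J/(mol K)
temp_data = numpy.arange(270.0, 350.0, 5.0)   # 16 temperatures in K
rng = numpy.random.default_rng(0)
vol_data = R * temp_data + rng.normal(0.0, 20.0, temp_data.size)
```

Fitting a straight line to these (xi, yi) pairs should recover a slope close to R.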
Fitting in Python
• We're going to use the curve_fit function, which is part of the scipy.optimize package
• The usage is as follows:
  fit_parameters, fit_covariance = scipy.optimize.curve_fit(fit_function, x_data, y_data, p0=guess, sigma=uncertainty)
  fit_parameters - an array of the output fit parameters
  fit_covariance - an array of the covariance of the output fit parameters
  fit_function - the function used to do the fit
  sigma - the uncertainty associated with the data
  p0 - the initial guess input to the fit
Fitting with curve_fit
import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting
def linearFit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('ideal_gas_law.txt', unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters, fit_covariance = scipy.optimize.curve_fit(linearFit, temp_data, vol_data, p0=(1.0,8.0), sigma=uncertainty)
Fitting with curve_fit
The slide annotates the arguments of the curve_fit call above: linearFit is the fit function, temp_data is the x data, vol_data is the y data, p0=(1.0,8.0) is the initial guess for the parameters, and sigma=uncertainty is the uncertainty on the data.
Results
fit parameters = [0.21617647 8.33058824]
fit covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

So what does this mean?
We set up the function for the fit to be:
  y = p[0] + p[1]*x
So with the fit parameters, the function is:
  y = 0.216 + 8.33*x
Full Probability
• For a set of N measurements of the dependent variable y:
  y1, y2, y3, ..., yN
• The probability of obtaining these values is the product of the individual probabilities:

P_{a,b}(y_1, y_2, y_3, \ldots, y_N) = P_{a,b}(y_1)\,P_{a,b}(y_2)\,P_{a,b}(y_3) \cdots P_{a,b}(y_N) = \frac{1}{\sigma_y^N}\, e^{-\sum_{i=1}^{N} (y_i - a - b x_i)^2 / (2\sigma_y^2)}
The sum in the exponent is called the chi-squared (χ²).
Chi-Squared
• The numerator is the definition of the residuals, i.e. the true data (yi) minus the fit value (a + b*xi)
• Dividing this by the standard deviation (σ) tells us how many standard deviations the data point is away from the fit at that x
• The square ensures each term is always positive

\chi^2 = \sum_{i=1}^{N} \frac{(y_i - a - b x_i)^2}{\sigma_y^2}
Plotting the Residuals
#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/ideal_gas_law.txt', unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters, fit_covariance = scipy.optimize.curve_fit(linearFit, temp_data, vol_data, p0=(1.0,8.0), sigma=uncertainty)

#now generate the line of the best fit
#set up the temperature points for the full array
fit_temp = numpy.arange(270,355,5)
#make the data for the best fit values
fit_answer = linearFit(fit_temp, *fit_parameters)
#calculate the residuals
fit_resid = vol_data - linearFit(temp_data, *fit_parameters)
#make a line at zero
zero_line = numpy.zeros(len(vol_data))
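The slide computes the residuals and a zero line but does not show the plotting call itself. A minimal sketch of one way to plot them, using hypothetical stand-in data since the original file isn't available, might look like:

```python
import numpy
import matplotlib
matplotlib.use('Agg')            # render without a display
from matplotlib import pyplot

# hypothetical stand-in data and fit parameters (the real ones come from
# curve_fit on the lecture's ideal_gas_law.txt file)
temp_data = numpy.arange(270.0, 350.0, 5.0)
rng = numpy.random.default_rng(1)
vol_data = 0.216 + 8.33 * temp_data + rng.normal(0.0, 20.0, temp_data.size)
uncertainty = numpy.full(len(vol_data), 20.0)
fit_resid = vol_data - (0.216 + 8.33 * temp_data)

# residuals with error bars, plus a zero line for reference
pyplot.errorbar(temp_data, fit_resid, yerr=uncertainty, fmt='o')
pyplot.axhline(0.0, color='gray')
pyplot.xlabel('Temperature (K)')
pyplot.ylabel('Residual volume')
pyplot.savefig('residuals.png')
```

Plotting with error bars makes it easy to judge whether each residual is within its stated uncertainty of zero.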
How do the Residuals Look?
• The residuals are obviously a large component of the χ² value used by the minimizer
• They can be plotted to look for trends and to see if the fit function is appropriate
Interpreting the Covariance
• Elements in the covariance matrix represent the relationship between the two variables
• The diagonals are the squares of the standard deviations
  • we will use this in our interpretation of the answer
Covariance Matrix Elements
• Diagonal elements are the square of the standard deviation for that parameter
• The non-diagonal elements show the relationship between the parameters
fit parameters = [0.21617647 8.33058824]
fit covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

\mathrm{cov}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})
Fit Results
import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting, which is linear in this case
def linearFit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/ideal_gas_law.txt', unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters, fit_covariance = scipy.optimize.curve_fit(linearFit, temp_data, vol_data, p0=(1.0,8.0), sigma=uncertainty)

#determine the standard deviations for each parameter
sigma0 = numpy.sqrt(fit_covariance[0,0])
sigma1 = numpy.sqrt(fit_covariance[1,1])
Fit Results
• Calculate the standard deviation on the slope (p[1])
• This is the square root of the [1,1] entry of the covariance matrix
fit parameters = [0.21617647 8.33058824]
fit covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]
Fit Results
• Show the p[1] parameter with the standard deviation:
p1 = 8.33 ± 0.470
fit parameters = [0.21617647 8.33058824]
fit covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]
Comparison to Accepted Values
• We obtained the result p[1] = 8.33 ± 0.47
• We assume that there is 1 mole of gas at unit pressure (n = 1, P = 1), so the slope of volume versus temperature is numerically equal to R
• The accepted value (currently) is 8.3144621 ± 0.0000075
• The accepted value is contained within our uncertainty (our one-sigma range is from 7.86 to 8.80)
• These values agree "within their error"
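The agreement can be made quantitative by asking how many of our standard deviations separate the fitted slope from the accepted value; a quick check with the numbers above:

```python
# distance between fitted slope and accepted R, in units of our sigma
p1, sigma1 = 8.33, 0.47        # fit result from the slides
R_accepted = 8.3144621         # accepted value quoted above
n_sigma = abs(p1 - R_accepted) / sigma1
# n_sigma is about 0.03, far less than 1, so the values agree well
```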
Application to Non-linear Examples
• This method can also be applied to other examples
• Powers: y = b*√x
  • can be linearized as y² = b²*x
• Polynomials: y = a + b*x + c*x² + d*x³
  • This is just a case of using multiple regression, since the equation is linear in the coefficients
• Exponentials: y = a*e^(bx)
  • Can be linearized as ln(y) = ln(a) + b*x
• There are many other examples
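As an illustration of the exponential case (with made-up numbers, not lecture data), the same linearFit function from earlier can fit ln(y) against x, after which a and b are read off from the line's intercept and slope:

```python
import numpy
import scipy.optimize

def linearFit(x, *p):
    return p[0] + p[1]*x

# synthetic data following y = a*exp(b*x) with a = 2.0, b = 0.5
x = numpy.linspace(0.0, 4.0, 20)
y = 2.0 * numpy.exp(0.5 * x)

# fit the linearized form ln(y) = ln(a) + b*x
pars, cov = scipy.optimize.curve_fit(linearFit, x, numpy.log(y), p0=(0.0, 1.0))
a_fit = numpy.exp(pars[0])     # intercept gives ln(a)
b_fit = pars[1]                # slope gives b
```

Note that for real (noisy) data, taking the logarithm also reweights the uncertainties, which is one reason to fit the exponential form directly when the scatter is large.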
Return to Chi-Squared
• Here the definition of the residual has changed
• Instead of yi - a - b*xi, a more general term has been used
  • yi is still the data
  • y(xi) is the fit function evaluated at xi

\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}
Gauss’ Distribution
• The probability is described by

P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \bar{x})^2 / (2\sigma^2)}

• where the average (mean) value is x̄ and the spread in values is σ
Gauss’ Distribution
• We use the probabilities shown above to determine how probable a value is in this distribution
• When we take a measurement, we expect that 68.2% of the time it will be within 1σ of the mean value
• Another way of phrasing this is that we expect a value to be more than 3σ above the mean value only 0.1% of the time
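These percentages follow directly from integrating the Gaussian, and can be checked with the error function from the standard library:

```python
import math

# P(|x - mean| < k*sigma) for a Gaussian is erf(k/sqrt(2))
def prob_within(k):
    return math.erf(k / math.sqrt(2.0))

p_within_1sigma = prob_within(1.0)               # about 0.682, the 68.2% above
p_above_3sigma = (1.0 - prob_within(3.0)) / 2.0  # one-sided tail, about 0.0013
```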
Another example
Fitting the Gaussian
import numpy
import scipy.optimize
import matplotlib.pyplot as pyplot
import pylab as py

#define the function to be used in the fitting, which is a Gaussian (plus a constant offset) in this case
def gaussFit(x,*p):
    return p[0]+p[1]*numpy.exp(-1*(x-p[2])**2/(2*p[3]**2))

#read in the data (currently only located on my hard drive...)
day_num, rain_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/precip_2013.txt', unpack=True)

#get some (pretty good) guesses for the fitting parameters
data_mean = rain_data.mean()
data_std = rain_data.std()

#set up the histogram so that it can be fit
data_plot = py.hist(rain_data, range=(0.1,90), bins=100)
histx = [0.5 * (data_plot[1][i] + data_plot[1][i + 1]) for i in range(100)]
histy = data_plot[0]

#actually do the fitting
fit_parameters, fit_covariance = scipy.optimize.curve_fit(gaussFit, histx, histy, p0=(5.0,10.0,data_mean,data_std))
Another example

Fit mean: 7.06 mm
Fit standard deviation: 10.13 mm
(The slide's plot labels the fitted mean and standard deviation on the rainfall histogram.)
Another example

Fit mean: 7.06 mm
Fit standard deviation: 10.13 mm
A rainfall of 85.5 mm is 7.74 standard deviations above the mean (from this data), which is extremely unlikely.
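The 7.74 figure is simply the distance of that measurement from the fitted mean, in units of the fitted standard deviation:

```python
# number of standard deviations between the extreme rainfall and the fit mean
fit_mean = 7.06       # mm, from the Gaussian fit
fit_std = 10.13       # mm, from the Gaussian fit
rainfall = 85.5       # mm, the extreme measurement
z = (rainfall - fit_mean) / fit_std   # about 7.74
```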
Chi-Squared and Goodness of Fit
• This can then be used as a “goodness of fit” test
• If the function is a good approximation, then each residual will typically be about one standard deviation, so the sum will be approximately N

\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}
Chi-Squared
• We normally use the number of degrees of freedom of the experiment to determine the fit quality
• The number of DOF is the number of data points in the sample minus the number of parameters in the fit
• For a sample with 20 data points and a linear fit (2 parameters), DOF = 18
• This is used as the goodness of fit, since χ²/DOF ≈ 1 for a good fit

\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}
Revisit the First Example
import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting, which is linear in this case
def linearFit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/ideal_gas_law.txt', unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters, fit_covariance = scipy.optimize.curve_fit(linearFit, temp_data, vol_data, p0=(1.0,8.0), sigma=uncertainty)

#calculate the chi-squared value
chisq = sum(((vol_data-linearFit(temp_data,*fit_parameters))/uncertainty)**2)
print(chisq)

#calculate the number of degrees of freedom
dof = len(temp_data)-len(fit_parameters)
print(dof)
Revisit the First Example
• Is this a good fit?

\chi^2 = \sum_{i=1}^{16} \left( \frac{\mathrm{presData}_i - \mathrm{fit}_i}{\mathrm{uncertainty}} \right)^2 = 65.6

• Divide this by the DOF
• We have 16 data points, 2 parameters:

\frac{\chi^2}{\mathrm{DOF}} = \frac{65.6}{16 - 2} = 4.68

• This may not be a great fit...
Goodness of Fit
• The previous statements are only mostly true
• More accurately:
  • χ²/DOF >> 1 indicates a very poor fit, maybe even a fit model which doesn't match the data
  • χ²/DOF > 1 indicates a poor fit, or that the uncertainty is underestimated
  • χ²/DOF << 1 means the uncertainty could be overestimated
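These rules of thumb can be made precise with the chi-squared distribution: scipy can report the probability of getting a χ² at least as large as the observed one by chance, assuming the model and uncertainties are correct. This step isn't in the slides; it's an optional extension using the fit values from above:

```python
import scipy.stats

# probability of chi-squared >= 65.6 with 14 degrees of freedom,
# using the values from the linear fit above
chisq, dof = 65.6, 14
p_value = scipy.stats.chi2.sf(chisq, dof)
# p_value is tiny (far below 0.05), confirming this is not a great fit
```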
Summary
• You should now be well prepared to use Python to fit data
• Your practice with this starts with the next pendulum exercise, which you can begin now!