
Python Analysis

PHYS 224, U of T Physics, September 25/26, 2014

Fitting Experimental Data

The goal of the lab experiments is to determine a physical quantity y (the dependent variable) as a function of x (the independent variable). How?

• Measure the pair (xi, yi) a number of times (N)
• Find a fit function y = y(x) that describes the relationship between these two quantities

The Linear Case

• The simplest function relating the two variables is the linear function f(x) = y = ax + b
• This is valid for any (xi, yi) combination
• If a and b are known, the true value of yi can be calculated for any xi:

yi,true = a*xi + b

Linear Regression

• Linear regression calculates the most probable values of a and b such that the linear equation yi,true = a*xi + b is valid

• The measurements of yi usually follow Gauss’ distribution

An Example

• Ideal Gas Law: P*V = n*R*T
• Pressure * Volume = n * R * Temperature
• P = [(n*R)/V]*T, which is linear in the temperature T
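Since P = [(n*R)/V]*T is linear in T, the fit below needs (T, P) data pairs. A minimal sketch of what such data could look like (hypothetical values, assuming n = 1 mol in a V = 1 m³ volume so the slope is simply R, with made-up Gaussian noise):

```python
import numpy

# Hypothetical setup: n = 1 mol in V = 1 m^3, so P = (n*R/V)*T = R*T
R = 8.314                             # gas constant, J/(mol K)
temp = numpy.arange(270., 355., 5.)   # temperatures in K

# Ideal pressures plus made-up Gaussian measurement noise (sigma = 20)
rng = numpy.random.default_rng(0)
pressure = R * temp + rng.normal(0., 20., size=len(temp))
```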


Fitting in Python

• We’re going to use the curve_fit function, which is part of the scipy.optimize package

• The usage is as follows:

fit_parameters, fit_covariance = scipy.optimize.curve_fit(fit_function, x_data, y_data, p0=guess, sigma=uncertainty)

fit_parameters - an array of the output fit parameters
fit_covariance - an array of the covariance of the output fit parameters
fit_function - the function used to do the fit
p0 (guess) - the initial guess input to the fit
sigma (uncertainty) - the uncertainty associated with the data


Fitting with curve_fit


import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting
def linearFit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('ideal_gas_law.txt', unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters, fit_covariance = scipy.optimize.curve_fit(linearFit, temp_data, vol_data, p0=(1.0,8.0), sigma=uncertainty)

Fitting with curve_fit

The slide annotates the same code as above, mapping the arguments onto the call: linearFit is the fit function, temp_data is the x data, vol_data is the y data, p0=(1.0,8.0) is the initial guess for the parameters, and sigma=uncertainty is the uncertainty on the data.

Results


fit parameters = [0.21617647 8.33058824]

fit covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

So what does this mean?

We set up the function for the fit to be:

y = p[0] + p[1]*x

So with the fit parameters, the function is:

y = 0.216 + 8.33*x

Full Probability

• For a set of N measurements of the dependent variable y: y1, y2, y3, ..., yN
• The probability of obtaining these values is the product of the individual probabilities:

P_{a,b}(y_1, y_2, \ldots, y_N) = P_{a,b}(y_1) P_{a,b}(y_2) \cdots P_{a,b}(y_N) = \frac{1}{\sigma_y^N} \exp\left[ -\sum_{i=1}^{N} \frac{(y_i - a - b x_i)^2}{2\sigma_y^2} \right]

The sum appearing in the exponent is called the chi-squared (χ2).

Chi-Squared

• The numerator, yi − (a + b*xi), is the residual: the true data (yi) minus the fit data (a + b*xi)

• Dividing this by the standard deviation (σ) tells us how many standard deviations the data point is away from the fit at that x

• The square ensures each term is always positive

\chi^2 = \sum_{i=1}^{N} \frac{(y_i - a - b x_i)^2}{\sigma_y^2}

Plotting the Residuals


#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/ideal_gas_law.txt', unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters, fit_covariance = scipy.optimize.curve_fit(linearFit, temp_data, vol_data, p0=(1.0,8.0), sigma=uncertainty)

#now generate the line of the best fit
#set up the temperature points for the full array
fit_temp = numpy.arange(270,355,5)
#make the data for the best fit values
fit_answer = linearFit(fit_temp,*fit_parameters)
#calculate the residuals
fit_resid = vol_data-linearFit(temp_data,*fit_parameters)
#make a line at zero
zero_line = numpy.zeros(len(vol_data))

How do the Residuals Look?

• The residuals are obviously a large component of the χ2 value used by the minimizer

• They can be plotted to look for trends and see if the fit function is appropriate


Interpreting the Covariance

• Elements in the covariance matrix represent the relationship between the two variables
• The diagonals are the squares of the standard deviations
• We will use this in our interpretation of the answer

Covariance Matrix Elements

• Diagonal elements are the square of the standard deviation for that parameter

• The non-diagonal elements show the relationship between the parameters


fit parameters = [0.21617647 8.33058824]

fit covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

\mathrm{cov}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})
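As a quick numerical check of this definition (made-up numbers), the same 1/N normalization is available from numpy.cov with bias=True:

```python
import numpy

x = numpy.array([1., 2., 3., 4.])
y = numpy.array([2., 4., 5., 8.])

# Direct implementation of cov(x, y) = (1/N) * sum((x_i - xbar)*(y_i - ybar))
by_hand = numpy.mean((x - x.mean()) * (y - y.mean()))

# numpy.cov with bias=True uses the same 1/N normalization;
# the off-diagonal [0, 1] element is cov(x, y)
by_numpy = numpy.cov(x, y, bias=True)[0, 1]

print(by_hand, by_numpy)  # the two values match
```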

Fit Results


import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting, which is linear in this case
def linearFit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/ideal_gas_law.txt', unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters, fit_covariance = scipy.optimize.curve_fit(linearFit, temp_data, vol_data, p0=(1.0,8.0), sigma=uncertainty)

#determine the standard deviations for each parameter
sigma0 = numpy.sqrt(fit_covariance[0,0])
sigma1 = numpy.sqrt(fit_covariance[1,1])

Fit Results

• Calculate the standard deviation on the slope (p[1])

• This is the square root of the [1,1] entry of the covariance matrix


fit parameters = [0.21617647 8.33058824]

fit covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

Fit Results

• Show the p[1] parameter with the standard deviation:

p1 = 8.33 ± 0.470


fit parameters = [0.21617647 8.33058824]

fit covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

Comparison to Accepted Values

• We obtained the result p[1] = 8.33 ± 0.47
• We assume that there is 1 mole in a 1 m³ volume, so that n = V = 1
• The accepted value (currently) is 8.3144621 ± 0.0000075
• The accepted value IS contained within our uncertainty (our one-sigma range is from 7.86 to 8.80)
• These values agree “within their error”
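The comparison can be made quantitative by expressing the separation between the two values in units of our uncertainty (numbers taken from the fit above):

```python
# Our fit result for the slope and its standard deviation
fit_value = 8.33
fit_sigma = 0.47

# Accepted value of the gas constant R, J/(mol K)
accepted = 8.3144621

# Separation in units of our (much larger) uncertainty
n_sigma = abs(fit_value - accepted) / fit_sigma
print(n_sigma)  # well below 1, so the values agree within error
```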

Application to Non-linear Examples

• This method can also be applied to other examples
• Powers: y = b*√x can be linearized as y² = b²*x
• Polynomials: y = a + b*x + c*x² + d*x³ is just a case of using multiple regression, since the equation is linear in the coefficients
• Exponentials: y = a*e^(b*x) can be linearized as ln(y) = ln(a) + b*x
• There are many other examples
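As a sketch of the exponential case (entirely made-up, noiseless data, with numpy.polyfit standing in for the linear fit), fitting ln(y) against x returns b as the slope and ln(a) as the intercept:

```python
import numpy

# Made-up noiseless data from y = a*exp(b*x) with a = 2.0, b = 0.5
a_true, b_true = 2.0, 0.5
x = numpy.linspace(0., 4., 20)
y = a_true * numpy.exp(b_true * x)

# Linearize: ln(y) = ln(a) + b*x, then do an ordinary linear fit
slope, intercept = numpy.polyfit(x, numpy.log(y), 1)

b_fit = slope                 # recovers b
a_fit = numpy.exp(intercept)  # recovers a
print(a_fit, b_fit)
```

Note that linearizing also transforms the measurement uncertainties, so for real data the sigma passed to the fit should be adjusted accordingly.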

Return to Chi-Squared

• Here the definition of the residual has changed
• Instead of yi − a − b*xi, a more general term has been used
• yi is still the data
• y(xi) is the fit function evaluated at xi

\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}

Gauss’ Distribution

• The probability is described by

P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x - \bar{x})^2}{2\sigma^2} \right]

• where the average (mean) value is \bar{x} and the spread in values is σ

Gauss’ Distribution

• We use the probabilities shown above to determine how probable a value is in this distribution

• When we take a measurement, we expect that 68.2% of the time it will be within 1σ from the mean value

• Another way of phrasing this is that we expect a value to be more than 3σ above the mean value only 0.1% of the time

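These percentages can be checked directly with scipy.stats.norm; the slide's 68.2% and 0.1% are rounded values of 68.27% and about 0.13%:

```python
from scipy.stats import norm

# Probability of a measurement landing within 1 sigma of the mean
within_1sigma = norm.cdf(1) - norm.cdf(-1)

# Probability of a measurement more than 3 sigma ABOVE the mean
# (one-sided tail, via the survival function sf = 1 - cdf)
above_3sigma = norm.sf(3)

print(within_1sigma)  # ~0.6827
print(above_3sigma)   # ~0.00135
```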

Another example


Fitting the Gaussian


import numpy
import scipy.optimize
import matplotlib.pyplot as pyplot
import pylab as py

#define the function to be used in the fitting, which is a Gaussian in this case
def gaussFit(x,*p):
    return p[0]+p[1]*numpy.exp(-1*(x-p[2])**2/(2*p[3]**2))

#read in the data (currently only located on my hard drive...)
day_num, rain_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/precip_2013.txt', unpack=True)

#get some (pretty good) guesses for the fitting parameters
data_mean = rain_data.mean()
data_std = rain_data.std()

#set up the histogram so that it can be fit
data_plot = py.hist(rain_data, range=(0.1,90), bins=100)
histx = [0.5 * (data_plot[1][i] + data_plot[1][i + 1]) for i in range(100)]
histy = data_plot[0]

#actually do the fitting
fit_parameters, fit_covariance = scipy.optimize.curve_fit(gaussFit, histx, histy, p0=(5.0,10.0,data_mean,data_std))

Another example

Fit mean: 7.06 mm
Fit standard deviation: 10.13 mm

[Figure: histogram of the daily rainfall data with the fitted Gaussian; callouts mark the mean and the standard deviation.]

Rainfall of 85.5 mm is 7.74 standard deviations above the mean (from this data), which is extremely unlikely.

Chi-Squared and Goodness of Fit

• This can then be used as a “goodness of fit” test
• If the function is a good approximation, then each residual will typically be within one standard deviation, so this sum will be approximately N

\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}

Chi-Squared

• We normally use the number of degrees of freedom (DOF) of the experiment to determine the fit quality
• The number of DOF is the number of data points in the sample minus the number of parameters in the fit
• For a sample with 20 data points and a linear fit (2 parameters), DOF = 18
• This is used as the goodness of fit, since χ²/DOF ≅ 1 for a good fit

\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}

Revisit the First Example


import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting, which is linear in this case
def linearFit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/ideal_gas_law.txt', unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters, fit_covariance = scipy.optimize.curve_fit(linearFit, temp_data, vol_data, p0=(1.0,8.0), sigma=uncertainty)

#calculate the chi-squared value
chisq = sum(((vol_data-linearFit(temp_data,*fit_parameters))/uncertainty)**2)
print(chisq)

#calculate the number of degrees of freedom
dof = len(temp_data)-len(fit_parameters)
print(dof)

Revisit the First Example

• Is this a good fit?

\chi^2 = \sum_{i=1}^{16} \left( \frac{\text{presData}_i - \text{fit}_i}{\text{uncertainty}} \right)^2 = 65.6

• Divide this by the DOF
• We have 16 data points and 2 parameters:

\frac{\chi^2}{\text{DOF}} = \frac{65.6}{16 - 2} = 4.68

• This may not be a great fit...

Goodness of Fit

• The previous statements are only mostly true. More accurately:
• χ²/DOF ≫ 1 is a very poor fit, maybe even a fit model which doesn’t match
• χ²/DOF > 1 is not a good fit, or the uncertainty is underestimated
• χ²/DOF ≪ 1 means the uncertainty could be overestimated
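These rules of thumb can be backed by a probability: scipy.stats.chi2 gives the chance of obtaining a χ² at least this large if the model and uncertainties were correct (a sketch using the numbers from the example above):

```python
from scipy.stats import chi2

# Values from the example fit above
chisq = 65.6
dof = 14

# Survival function: probability of a chi-squared this large or larger,
# assuming the model is correct and the uncertainties are right
p_value = chi2.sf(chisq, dof)
print(p_value)  # a very small probability, consistent with a poor fit
```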

Summary

• You should now be well prepared to use Python to fit data

• Your practice with this starts with the next pendulum exercise, which you can begin now!