Python Crash Course Data Fitting

17
Python Crash Course Python Crash Course Data Fitting Data Fitting Sterrenkundig Practicum 2 V1.0 dd 07-01-2015 Hour 7

description

Python Crash Course Data Fitting. Sterrenkundig Practicum 2 V1.0 dd 08-01-2014 Hour 7. Data Fitting. We are given: observational data a model with a set of fit parameters Goal: Want to optimize some objective function. - PowerPoint PPT Presentation

Transcript of Python Crash Course Data Fitting

Page 1: Python Crash Course Data Fitting

Python Crash CoursePython Crash CourseData FittingData Fitting

Sterrenkundig Practicum 2

V1.0

dd 07-01-2015

Hour 7

Page 2: Python Crash Course Data Fitting

Data FittingData Fitting

We are given:– observational data

– a model with a set of fit parameters

Goal:– Want to optimize some objective function.

– E.g.: Want to minimize squared difference between some model and some given data

What we do not talk about:– How to choose your objective function.

– How to choose your model.Least squares fitting

Page 3: Python Crash Course Data Fitting

Data FittingData Fitting

Fitting Lines to Data– Define function to fit

– Define range and step size for all data points

– Calculate best fitting line

import numpy as npimport pylab as pl

# Data Fitting# First, we'll generate some fake data to usex = np.linspace(0,10,50) # 50 x points from 0 to 10

# Remember, you can look at the help for linspace too:# help(np.linspace)

# y = m x + by = 2.5 * x + 1.2

# let's plot thatpl.clf()pl.plot(x,y)

Page 4: Python Crash Course Data Fitting

Data FittingData Fitting

Noise

# looks like a simple line. But we want to see the individual data points

pl.plot(x,y,marker='s')

# We need to add noise first

noise = pl.randn(y.size)

# What's y.size?

print y.size

print len(y)

50

50

# looks like a simple line. But we want to see the individual data points

pl.plot(x,y,marker='s')

# looks like a simple line. But we want to see the individual data points

pl.plot(x,y,marker='s')

Page 5: Python Crash Course Data Fitting

Data FittingData Fitting

# We can add arrays in python just like in IDL

noisy_flux = y + noise

# We'll plot it too, but this time without any lines

# between the points, and we'll use black dots

# ('k' is a shortcut for 'black', '.' means 'point')

pl.clf() # clear the figure

pl.plot(x,noisy_flux,'k.')

# We need labels, of course

pl.xlabel("Time")

pl.ylabel("Flux")

Page 6: Python Crash Course Data Fitting

Data FittingData Fitting

# We'll use polyfit to find the values of the coefficients. The third

# parameter is the "order"

p = np.polyfit(x,noisy_flux,1)

# help(polyfit) if you want to find out more

# We'll use polyfit to find the values of the coefficients. The third

# parameter is the "order"

p = np.polyfit(x,noisy_flux,1)

# help(polyfit) if you want to find out more

[ 2.59289715 0.71443764]

Now we're onto the fitting stage. We're going to fit a function of the form

y=m x+b∗

which is the same as

f(x)=p[1] x+p[0]∗

to the data. This is called "linear regression", but it is also a special case of a more general concept: this is a first-order polynomial. "First Order" means that the highest exponent of x in the equation is 1

Page 7: Python Crash Course Data Fitting

Data FittingData Fitting

noisy_flux = y+noise*10

p = polyfit(x,noisy_flux,1)

print p

[ 3.4289715 -3.65562359]

# plot it

pl.clf() # clear the figure

pl.plot(x,noisy_flux,'k.') # repeated from above

pl.plot(x,np.polyval(p,x),'r-',label="Best fit") # A red solid line

pl.plot(x,2.5*x+1.2,'b--',label="Input") # a blue dashed line showing the REAL line

pl.legend(loc='best') # make a legend in the best location

pl.xlabel("Time") # labels again

pl.ylabel("Flux")

Let's do the same thing with a noisier data set. I'm going to leave out most of the comments this time.

Page 8: Python Crash Course Data Fitting

Data FittingData Fitting

pl.clf() # clear the figure

pl.errorbar(x,noisy_flux,yerr=10,marker='.',color='k',linestyle='none') # errorbar

requires some extras to look nice

pl.plot(x,np.polyval(p,x),'r-',label="Best fit") # A red solid line

pl.plot(x,2.5*x+1.2,'b--',label="Input") # a blue dashed line showing the REAL line

pl.legend(loc='best') # make a legend in the best location

pl.xlabel("Time") # labels again

pl.ylabel("Flux")

Despite the noisy data, our fit is still pretty good! One last plotting trick.

Page 9: Python Crash Course Data Fitting

Power Law FittingPower Law Fitting

t = np.linspace(0.1,10)

a = 1.5

b = 2.5

z = a*t**b

pl.clf()

pl.plot(t,z)

Power laws occur all the time in physis, so it's a good idea to learn how to use them.

What's a power law? Any function of the form:

f(t)=atb

where t is your independent variable, a is a scale parameter, and b is the exponent (the power).

When fitting power laws, it's very useful to take advantage of the fact that "a power law is linear in log-space". That means, if you take the log of both sides of the equation (which is allowed) and change variables, you get a linear equation!

ln(f(t))=ln(atb)=ln(a)+bln(t)

We'll use the substitutions y=ln(f(t)), A=ln(a), and x=ln(t), so that

y=a+bx

which looks just like our linear equation from before (albeit with different letters for the fit parameters).

Page 10: Python Crash Course Data Fitting

Power Law FittingPower Law Fitting

# Change the variables

# np.log is the natural log

y = np.log(z)

x = np.log(t)

pl.clf()

pl.plot(x,y)

pl.ylabel("log(z)")

pl.xlabel("log(t)")

Page 11: Python Crash Course Data Fitting

Power Law FittingPower Law Fitting

noisy_z = z + pl.randn(z.size)*10

pl.clf()

pl.plot(t,z)

pl.plot(t,noisy_z,'k.')

noisy_y = np.log(noisy_z)

pl.clf()

pl.plot(x,y)

pl.plot(x,noisy_y,'k.')

pl.ylabel("log(z)")

pl.xlabel("log(t)")

It's a straight line. Now, for our "fake data", we'll add the noise before transforming from "linear" to "log" space

Page 12: Python Crash Course Data Fitting

Power Law FittingPower Law Fitting

print noisy_y

[ 2.64567168 1.77226366 nan 1.49412348 nan 1.28855427

0.58257979 -0.50719229 1.84461177 2.72918637 nan -1.02747374

-0.65048639 3.4827123 2.60910913 3.48086587 3.84495668 3.90751514

4.10072678 3.76322421 4.06432005 4.23511474 4.26996586 4.25153639

4.4608377 4.5945862 4.66004888 4.8084364 4.81659305 4.97216304

5.05915226 5.02661678 5.05308181 5.26753675 5.23242563 5.36407125

5.41827503 5.486903 5.50263105 5.59005234 5.6588601 5.7546482

5.77382726 5.82299095 5.92653466 5.93264584 5.99879722 6.07711596

6.13466015 6.12248799]

Note how different this looks from the "noisy line" we plotted earlier. Power laws are much more sensitive to noise! In fact, there are some data points that don't even show up on this plot because you can't take the log of a negative number. Any points where the random noise was negative enough that the curve dropped below zero ended up being "NAN", or "Not a Number". Luckily, our plotter knows to ignore those numbers, but polyfit doesnt.

Page 13: Python Crash Course Data Fitting

Power Law FittingPower Law Fitting

# try to polyfit a line

pars = np.polyfit(x,noisy_y,1)

print pars

[ nan nan]

In order to get around this problem, we need to mask the data. That means we have to tell the code to ignore all the data points where noisy_y is nan.

nan == nan is False

OK = noisy_y == noisy_y

print OK

[ True True False True False True True True True True False True

True True True True True True True True True True True True

True True True True True True True True True True True True

True True True True True True True True True True True True

True True]

Page 14: Python Crash Course Data Fitting

Power Law FittingPower Law Fitting

print "There are %i OK values" % (OK.sum())

masked_noisy_y = noisy_y[OK]

masked_x = x[OK]

print "masked_noisy_y has length",len(masked_noisy_y)

There are 47 OK values

masked_noisy_y has length 47

# now polyfit again

pars = np.polyfit(masked_x,masked_noisy_y,1)

print pars

[ 1.50239281 1.9907672 ]

fitted_y = polyval(pars,x)

pl.plot(x, fitted_y, 'r--')

This OK array is a "boolean mask". We can use it as an "index array", which is pretty neat.

Page 15: Python Crash Course Data Fitting

Power Law FittingPower Law Fitting

# Convert bag to linear-space to see what it "really" looks like

fitted_z = np.exp(fitted_y)

pl.clf()

pl.plot(t,z)

pl.plot(t,noisy_z,'k.')

pl.plot(t,fitted_z,'r--')

pl.xlabel('t')

pl.ylabel('z')

The noise seems to have affected our fit.

Page 16: Python Crash Course Data Fitting

Power Law FittingPower Law Fitting

def powerlaw(x,a,b):

return a*(x**b)

pars,covar = curve_fit(powerlaw,t,noisy_z)

pl.clf()

pl.plot(t,z)

pl.plot(t,noisy_z,'k.')

pl.plot(t,powerlaw(t,*pars),'r--')

pl.xlabel('t')

pl.ylabel('z')

That's pretty bad. A "least-squares" approach, as with curve_fit, is probably going to be the better choice. However, in the absence of noise, this approach should work

Page 17: Python Crash Course Data Fitting

Introduction to languageIntroduction to language

End