AstroInformatics - Least squares method
M. Brescia - Astroinformatics, lecture 3

Page 1: Title slide

Page 2: Regression line

[Figure: x-y axes; x is the independent variable, y the dependent variable]

Straight lines

+ positive
- negative

A line is positive if increasing x values correspond to increasing y values.
A line is negative if increasing x values correspond to decreasing y values.

Page 3: Regression line

[Figure: observations plotted on the x (independent variable) vs y (dependent variable) axes]

Page 4: Regression line

[Figure: observations and a regression line ("R line") on the x-y axes]

A line that fits all the different points (observations) is called a regression line.

Generally speaking, given a set of observations about any phenomenon, the idea is to predict the expected relationship between the variables of the problem, that is, to find the best fit for the estimation. In these cases we want to evaluate the distances between the estimated and the actual observations, with the aim of minimizing those distances (or estimation errors).

Page 5: Regression line

[Figure: observations, regression line, and the estimated-vs-actual errors; some distances are positive, some negative]

A line that fits all the different points (observations) is called a regression line.

Note that the distances are sometimes positive and sometimes negative. Simply summing them up would give zero, so it would not carry any useful information about the prediction performance.

In fact, there could be infinitely many estimation lines whose sum of distances equals zero! The most useful approach is therefore to take the squared distances, because the negative ones change sign and can contribute to carry more information…

$e^2 > 0 \;\rightarrow\; \text{find the } \min e^2$

Page 6: Regression line

[Figure: observations, regression line, and the estimated-vs-actual errors]

A line that best fits all the different points (observations) is called a regression line.

A regression line is easily obtained by the least squares method, which tries to minimize the differences (errors) between the estimated values and the actual values.

Therefore we can define the regression line as the unique line which minimizes the sum of squared errors.

Page 7: Regression line

[Figure: observations, regression line, and the estimated-vs-actual errors]

A line that best fits all the different points (observations) is called a regression line.

A regression line is easily obtained by the least squares method, which tries to minimize the differences (errors) between the estimated values and the actual values:

$\hat{y} = b_0 + b_1 x$

$b_0$: y intercept
$b_1$: slope

Page 8: Regression line

[Figure: observations, regression line, and the estimated-vs-actual errors]

A line that best fits all the different points (observations) is called a regression line.

A regression line is easily obtained by the least squares method, which tries to minimize the differences (errors) between the estimated values and the actual values:

$\hat{y} = b_0 + b_1 x$

$b_0$: y intercept
$b_1$: slope

If the slope is positive then the R line is positive as well; it is negative in case of a $-b_1$ term.

Page 9: Least squares method

[Figure: scatter plot of the observations, x from 1 to 5 and y from 1 to 6]

Now we want to calculate regression lines using the least squares method.

Let’s consider some observations

Page 10: Least squares method

[Figure: observations with the vertical line mean(x) and the horizontal line mean(y)]

We can draw the means of the two variables.

Page 11: Least squares method

[Figure: observations, mean(x), mean(y), and the regression line]

$\hat{y} = b_0 + b_1 x$

Let’s trace the regression lines

Page 12: Least squares method

[Figure: observations, mean(x), mean(y), and candidate regression lines]

It turns out that all R lines have to go through the intersection of the two means.

$\hat{y} = b_0 + b_1 x$

Page 13: Least squares method

[Figure: observations, mean(x), mean(y), and the regression line]

x   y   x - x̄   y - ȳ
1   2    -2      -2
2   4    -1       0
3   5     0       1
4   4     1       0
5   5     2       1

mean: x̄ = 3, ȳ = 4

We calculate the distances between the actual values and their mean.

Page 14: Least squares method

[Figure: observations, mean(x), mean(y), and the regression line]

Take also the squared distances for x and the product between the two distances.

x   y   x - x̄   y - ȳ   (x - x̄)²   (x - x̄)(y - ȳ)
1   2    -2      -2         4              4
2   4    -1       0         1              0
3   5     0       1         0              0
4   4     1       0         1              0
5   5     2       1         4              2
mean: x̄ = 3, ȳ = 4      sum = 10       sum = 6

$\hat{y} = b_0 + b_1 x$

Page 15: Least squares method

[Figure: observations, mean(x), mean(y), and the regression line]

The slope of the R line is the ratio between the sums of the last two columns.

x   y   x - x̄   y - ȳ   (x - x̄)²   (x - x̄)(y - ȳ)
1   2    -2      -2         4              4
2   4    -1       0         1              0
3   5     0       1         0              0
4   4     1       0         1              0
5   5     2       1         4              2
mean: x̄ = 3, ȳ = 4      sum = 10       sum = 6

$\hat{y} = b_0 + b_1 x$

$b_1 = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \dfrac{6}{10} = 0.6$

Page 16: Least squares method

[Figure: observations, mean(x), mean(y), and the regression line]

To calculate the intercept term $b_0$, we exploit the fact that we know at least one value assumed by $\hat{y}$: $\hat{y} = 4$ at $x = 3$, because the R line must cross the point (x = 3, y = 4).

x   y   x - x̄   y - ȳ   (x - x̄)²   (x - x̄)(y - ȳ)
1   2    -2      -2         4              4
2   4    -1       0         1              0
3   5     0       1         0              0
4   4     1       0         1              0
5   5     2       1         4              2
mean: x̄ = 3, ȳ = 4      sum = 10       sum = 6

$b_1 = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \dfrac{6}{10} = 0.6$

$4 = \hat{y} = b_0 + 0.6 \cdot 3 \;\rightarrow\; b_0 = 2.2$

$\hat{y} = b_0 + b_1 x = 2.2 + 0.6\,x$
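As a quick sanity check, here is a minimal MATLAB sketch (using the five example observations above; the variable names are ours) that reproduces b1 = 0.6 and b0 = 2.2:

% Least squares fit of the straight line for the worked example
x = [1 2 3 4 5];
y = [2 4 5 4 5];
b1 = sum((x - mean(x)) .* (y - mean(y))) / sum((x - mean(x)).^2);  % slope: 0.6
b0 = mean(y) - b1 * mean(x);     % intercept: 2.2 (the line passes through the point of means)
yhat = b0 + b1 * x;              % estimated values: 2.8 3.4 4.0 4.6 5.2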

Page 17: Regression line vs mean line

The idea is to calculate the distances of the actual values from the mean and the distances of the estimated values from the mean, and then compare them.

[Figure: observations, mean(y), and the regression line, with the three distances annotated]

$y - \bar{y}$: distance of the observed value from the mean
$\hat{y} - \bar{y}$: distance of the estimated value from the mean
$y - \hat{y}$: distance between observed and estimated values (the error)

$SST = \sum (y - \bar{y})^2$  (Sum of Squares Total)
$SSR = \sum (\hat{y} - \bar{y})^2$  (Sum of Squares due to Regression)
$SSE = \sum (y - \hat{y})^2$  (Sum of Squares due to Errors)

$SST = SSR + SSE$

Page 18: R squared using regression analysis

The idea is to calculate the distances of the actual values from the mean and the distances of the estimated values from the mean, and then compare them.

[Figure: observations, mean(y), and the regression line]

The R squared (in Italian, "coefficiente di determinazione") is a statistical measure of how well a model explains and predicts future outcomes. It is indicative of the level of variability in the data set. R-squared is used as a guideline to measure the accuracy of the model.

$R^2 = \dfrac{SSR}{SST}$

$SSR = \sum (\hat{y} - \bar{y})^2$
$SST = \sum (y - \bar{y})^2$

Page 19: R squared using regression analysis

[Figure: two panels; high SSE gives a low R², low SSE gives a high R²]

$SSR = \sum (\hat{y} - \bar{y})^2$
$SSE = \sum (y - \hat{y})^2$
$SST = \sum (y - \bar{y})^2$

$R^2 = \dfrac{SSR}{SST}$

Page 20: R squared using regression analysis

The idea is to calculate the distances of the actual values from the mean and the distances of the estimated values from the mean, and then compare them.

[Figure: observations, mean(y), and the regression line]

x   y   y - ȳ   (y - ȳ)²    ŷ     ŷ - ȳ
1   2    -2        4        2.8    -1.2
2   4     0        0        3.4    -0.6
3   5     1        1        4.0     0
4   4     0        0        4.6     0.6
5   5     1        1        5.2     1.2
mean ȳ = 4      sum = 6

$\hat{y} = 2.2 + 0.6\,x$

Page 21: R squared using regression analysis

The idea is to calculate the distances of the actual values from the mean and the distances of the estimated values from the mean, and then compare them.

[Figure: observations, mean(y), and the regression line]

x   y   y - ȳ   (y - ȳ)²    ŷ     ŷ - ȳ   (ŷ - ȳ)²
1   2    -2        4        2.8    -1.2     1.44
2   4     0        0        3.4    -0.6     0.36
3   5     1        1        4.0     0       0
4   4     0        0        4.6     0.6     0.36
5   5     1        1        5.2     1.2     1.44
mean ȳ = 4      sum = 6            sum = 0  sum = 3.6

$\hat{y} = 2.2 + 0.6\,x$

Page 22: R squared using regression analysis

The idea is to calculate the distances of the actual values from the mean and the distances of the estimated values from the mean, and then compare them.

[Figure: observations, mean(y), and the regression line]

x   y   y - ȳ   (y - ȳ)²    ŷ     ŷ - ȳ   (ŷ - ȳ)²
1   2    -2        4        2.8    -1.2     1.44
2   4     0        0        3.4    -0.6     0.36
3   5     1        1        4.0     0       0
4   4     0        0        4.6     0.6     0.36
5   5     1        1        5.2     1.2     1.44
mean ȳ = 4      sum = 6                     sum = 3.6

$\hat{y} = 2.2 + 0.6\,x$

$R^2 = \dfrac{SSR}{SST} = \dfrac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2} = \dfrac{3.6}{6} = 0.6$
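A minimal MATLAB check of the numbers above, which also verifies the SST = SSR + SSE decomposition (same example data):

% R squared for the worked example
x = [1 2 3 4 5];  y = [2 4 5 4 5];
yhat = 2.2 + 0.6 * x;
SST = sum((y - mean(y)).^2);      % 6
SSR = sum((yhat - mean(y)).^2);   % 3.6
SSE = sum((y - yhat).^2);         % 2.4, and indeed SST = SSR + SSE
R2  = SSR / SST;                  % 0.6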

Page 23: R squared using regression analysis

The value 0.6 is quite a good result: it says that the estimated values are pretty close to the actual ones.

An R-squared of 1 indicates that the estimated values coincide perfectly with the actual ones.

An R-squared below 0.4 indicates a very poor estimation.

$R^2 = \dfrac{SSR}{SST} = \dfrac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2}$

R2 is the proportion of the variation in Y being explained by the variation in X

Page 24: R squared using regression analysis

R² = 0: complete random scatter of points (NO CORRELATION)

$R^2 = \dfrac{SSR}{SST} = \dfrac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2}$

R2 is the proportion of the variation in Y being explained by the variation in X

Page 25: R2 vs degrees of freedom

We are now interested in knowing the minimum number of observations required to estimate the regression.

$R^2 = \dfrac{SSR}{SST} = \dfrac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2}$

R2 is the proportion of the variation in Y being explained by the variation in X.

$Y_i = b_0 + b_1 X_i + \varepsilon_i$

[Figure: three example fits: R² = 1 with DF = 0, R² = 0.87 with DF = 1, R² = 0.77 with DF = 2]

The degrees of freedom (DF) let the system be free to vary its regression estimation. What happens if we increase the number of variables?

$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \varepsilon_i$

Page 26: R2 vs degrees of freedom

The degrees of freedom (DF) let the system be free to vary its regression estimation. What happens if we increase the number of variables?

$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \varepsilon_i$

The regression line becomes a regression plane.

[Figure: two 3D panels (axes y, x1, x2): R² = 1 with DF = 0, and R² = 0.80 with DF = 1]

By increasing the number of variables, DF decreases. The empirical law is:

$DF = n - k - 1$,  with n observations and k variables (explanatory X variables, the parameter space).

Note that before, with a single variable, 4 observations induced DF = 2.

Page 27: R2 vs degrees of freedom

So, what can DF do for us?

$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \varepsilon_i$

DF defines the number of relationships between the X variables of the problem and the result Y, i.e. the way in which variables may be cross-correlated to give a prediction on a system.

For example, how does DF relate to R2?

As DF decreases, i.e. as more variables are added to the model, R2 will only increase, essentially because you decrease the degrees of freedom of your problem!

We can consider the so-called "Adjusted R2", or A-R2:

$A\text{-}R^2 = 1 - (1 - R^2)\,\dfrac{n - 1}{n - k - 1} = 1 - \dfrac{SSE}{SST}\cdot\dfrac{n - 1}{n - k - 1}$

since

$1 - R^2 = 1 - \dfrac{SSR}{SST} = \dfrac{SST - SSR}{SST} = \dfrac{SSE}{SST}$

As k increases, A-R2 will tend to decrease, essentially reflecting the reduced power of the model. The only exception is when you add very useful variables to your model: in such cases A-R2 may increase as well, because you compensate for the lack of degrees of freedom.
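A minimal MATLAB sketch of the A-R2 formula, reproducing two rows of the table on the next page:

% Adjusted R squared: penalizes R2 for the degrees of freedom spent on k variables
adjR2 = @(R2, n, k) 1 - (1 - R2) * (n - 1) / (n - k - 1);
a = adjR2(0.71, 25, 4);   % 0.6520
b = adjR2(0.79, 10, 7);   % 0.0550: same R2, but far fewer observations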

Page 28: R2 vs degrees of freedom

Let's consider 8 different models of your problem, with different numbers of observations and variables of your parameter space.

n    k    R²     DF    A-R²
25   4   0.71    20   0.6520
25   5   0.76    19   0.6968
25   6   0.78    18   0.7067
25   7   0.79    17   0.7035
10   4   0.71     5   0.4780
10   5   0.76     4   0.4600
10   6   0.78     3   0.3400
10   7   0.79     2   0.0550

As DF decreases, i.e. as more variables are added to the model, R² will only increase.

In the first 3 models A-R² increases: OK, probably the added variables were carrying much information. But by adding the 7th variable, A-R² decreases: this means that this variable was carrying misleading information. With few observations (the n = 10 models), R² is not affected, but A-R² is reduced drastically.

Too few observations affect the prediction of the model; too many variables may decrease the prediction performance of the model; too few variables increase the occurrence of estimation errors. The red one (the n = 25, k = 6 row, with the highest A-R²) is the best model (best compromise).

Page 29: Standard error of the estimate

The standard error of the estimate is obtained by taking the actual values, drawing the regression line and calculating the estimated values, in order to measure the distance between actual and estimated values (which is the estimation error).

[Figure: observations and regression line]

x   y (actual)   ŷ (estimated)
1       2             2.8
2       4             3.4
3       5             4.0
4       4             4.6
5       5             5.2

Therefore, given n observations, we can calculate the Root Mean Square Error (RMSE):

$RMSE = \sqrt{\dfrac{\sum (\hat{y} - y)^2}{n - 2}} = 0.89$
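A minimal MATLAB check of the RMSE above (the n - 2 denominator accounts for the two fitted parameters b0 and b1):

% Standard error of the estimate for the worked example
x = [1 2 3 4 5];  y = [2 4 5 4 5];
yhat = 2.2 + 0.6 * x;
n = numel(x);
RMSE = sqrt(sum((yhat - y).^2) / (n - 2));   % 0.8944, i.e. ~0.89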

Page 30: Normal distribution

Most people believe that "everything in Nature is normally distributed". Is it true?

A Normal distribution is defined by its Probability Density Function (PDF), where each sample of a population is distributed following a relationship based on two parameters: μ, the mean (middle of the distribution), and σ, the standard deviation (spread of the distribution).
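For reference (the formula itself is not present in the extracted text, so it is restated here), the Normal PDF is:

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$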

As an example, let's consider a flight from Sydney to Los Angeles and try to describe in a probabilistic way how much time a flight would take.

Page 31: Normal distribution

Most people believe that "everything in Nature is normally distributed". Is it true?

As we see, any flight may take a time ranging between 13 and 16 hours, depending on some conditions (weather, plane model, weight, route, etc.). But the likelihood is spread over different levels, leading us to consider a more frequent time of about 14.5 hours (on average).

We have: μ = 14.5, σ = 0.5

Are we confident that the flight time is normally distributed?

The answer is: perhaps, it might be…

Page 32: Normal distribution

More in general, we know the Central Limit Theorem: given a population N, as N increases, the distribution of the sample mean (or sum) approaches a normal distribution.

Let's flip a coin and count the number of heads:
one or two flips: no (or only a pseudo) normal distribution;
more flips: a normal distribution emerges.
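A minimal MATLAB sketch of this experiment (parameters such as 10000 trials are our assumption):

% Central Limit Theorem with coin flips: distribution of the number of heads
trials = 10000;
heads2   = sum(rand(2,   trials) < 0.5, 1);   % 2 flips per trial: coarse, not normal
heads100 = sum(rand(100, trials) < 0.5, 1);   % 100 flips per trial: ~normal (mu = 50, sigma = 5)
histogram(heads100);                          % bell-shaped around 50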

Page 33: Standard Normal distribution

The standard Normal distribution is a particular case with μ = 0 and σ = 1:
50% of the probability lies below 0;
it is symmetrical;
the area under the curve is 1;
its range is (-∞, +∞).

[Figure: standard Normal curve with reference points marked at -1 and 1.645]

ONLY for the Standard Normal is there a table reporting all probabilities.

Z-score: $Z = \dfrac{X - \mu}{\sigma}$

Page 34: Normal distribution

In other cases, no table exists for an arbitrary combination of μ and σ.

Which is the probability that a flight takes less than 14 hours?

The idea is to convert 14.0 to the corresponding value (Z-score) of the Standard Normal distribution (with μ = 14.5, σ = 0.5):

$Z = \dfrac{X - \mu}{\sigma} = \dfrac{14.0 - 14.5}{0.5} = -1$

In fact, we can see that the area below 14.0 corresponds exactly to the area of the Standard Normal below -1. Therefore the probability is:

$P(X < 14.0) = P(Z < -1) = 0.159$
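A one-line MATLAB check (normcdf belongs to the Statistics and Machine Learning Toolbox):

% Probability that the flight takes less than 14 hours
p = normcdf(14.0, 14.5, 0.5);   % 0.1587, the same as normcdf(-1, 0, 1)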

Page 35: Normal distribution

The 68-95-99.7 rule

For a Normal distribution with mean μ and standard deviation σ:

68% of the observations fall within 1σ of the mean μ;
95% of the observations fall within 2σ of the mean μ;
99.7% of the observations fall within 3σ of the mean μ.

Page 36: Normal distribution

Usually a confidence level of 95% is sufficient, sometimes 99%. But this decision is up to you. Note that the more stringent your confidence level, the more likely an error becomes (i.e. you don't find a difference that is actually there…).

It can be demonstrated that about 68% of values drawn from a normal distribution are within one standard deviation σ of the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations. This fact is known as the 68-95-99.7 (empirical) rule.

In statistics, the so-called 68–95–99.7 rule is a shorthand used to remember the percentage of values that lie within a band around the mean in a normal distribution with a width of one, two and three standard deviations, respectively.

Page 37: Normal distribution

In the empirical sciences the so-called 3-σ rule of thumb expresses a conventional heuristic that "nearly all" values are taken to lie within three standard deviations of the mean, i.e. that it is empirically useful to treat 99.7% probability as "near certainty".

The usefulness of this heuristic depends significantly on the specific context: in the social sciences a result may be considered "significant" if its confidence level is of the order of a 2σ effect (95%), while in particle physics there is a convention of a 5σ effect (99.99994% confidence) being required to qualify as a "discovery". The 3-σ rule of thumb is related to the generalized 3-σ rule: even for non-normally distributed variables, at least 98% of cases should fall within properly-calculated 3σ intervals.

The 68–95–99.7 rule is often used to quickly get a rough probability estimate of something, given its standard deviation, if the population is assumed to be normal. It is also used as a simple test for outliers (if the population is assumed normal) and as a normality test (if the population is potentially not normal).

To use it as a test for outliers or as a normality test, one computes the size of the deviations in terms of standard deviations and compares this to the expected frequency. For example, the fraction of residuals > Xσ (with arbitrary X) could be assigned as the outlier fraction.
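A minimal MATLAB sketch of that outlier test (the data vector is a placeholder; X = 3 is our assumed threshold):

% Fraction of residuals beyond 3 sigma, to be compared with the normal expectation
res = randn(1, 10000);                 % stand-in for real residuals
z = (res - mean(res)) / std(res);      % standardized deviations
outlierFrac = mean(abs(z) > 3);        % ~0.003 if normal (by the 99.7% rule)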

Page 38: Statistical estimation

In reality, for example by generating a normal distribution from simulated data on a computer, this is what it looks like. It doesn't look perfect, but this is what you usually get in practice. And you may ask whether it is normally distributed at all…

Page 39: Statistical estimation

But, more in general, what happens when data are NOT taken from a normal distribution?

ASSUMPTION: THE REAL WORLD IS NEVER GAUSSIAN! Since in many cases the sample extracted from a population of data is randomly chosen, it is necessary to formulate statistical measures or estimates on that sample. The purpose of statistical inference is then the estimation of ALL the statistical parameters of a population. However, this can be unfeasible, possibly due to the infinite or too large extension of the population itself.

If we cannot assume that our data are at least approximately normally distributed (because there is a small number of elements in the sample, the distribution is unknown, or the data are ordinal, i.e. can only be ranked), then we must use non-parametric tests to evaluate hypotheses.

We can formulate the hypotheses in terms of any statistical estimator. Besides mean, variance and standard deviation, there are other important estimators, involving some non-parametric tests. We just recall, for example, the main differences among mean, median, mode and range.

Page 40: Statistical estimation

The Mean is computed by adding all of the numbers in the data set together and dividing by the number of elements contained in the data set.

The Median of a data set depends on whether the number of elements in the data set is odd or even. First reorder the data set from smallest to largest; if the number of elements is odd, the Median is the element in the middle of the data set; if the number of elements is even, the Median is the average of the two middle terms.

Page 41: Statistical estimation

The Mode of a data set is the element that occurs most often. It is not uncommon for a data set to have more than one mode: this happens when two or more elements occur with equal frequency. A data set with two modes is called bimodal; a data set with three modes is called trimodal.

The Range of a data set is the difference between the largest and the smallest value contained in the data set. First reorder the data set from smallest to largest, then subtract the first element from the last element.
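A minimal MATLAB illustration of the four estimators (the data vector is just an example of ours):

% Mean, median, mode and range of a small data set
d = [3 1 4 1 5 9 2 6];
m  = mean(d);            % 3.875
md = median(d);          % 3.5 (even count: average of the two middle terms)
mo = mode(d);            % 1 (most frequent element)
r  = max(d) - min(d);    % 8 (range)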

Page 42: Statistical estimation

Non-parametric statistics are statistics not based on parameterized families of probability distributions. The typical parameters are the mean, variance, median, mode, etc. Unlike parametric statistics, non-parametric statistics make no assumptions about the probability distributions of the variables being assessed. The difference between parametric and non-parametric models is that the former have a fixed number of parameters, while the latter grow the number of parameters with the amount of training data (i.e. parameters are determined by the training data, not by the model).

In statistics, the term non-parametric statistics has at least two different meanings:

The first meaning of non-parametric covers techniques that do not rely on data belonging to any particular distribution.

The second meaning of non-parametric covers techniques that do not assume that the structure of a model is fixed. Typically, the model grows in size to accommodate the complexity of the data. In these techniques, individual variables are typically assumed to belong to parametric distributions, and assumptions about the types of connections among variables are also made. A typical example is non-parametric regression, used to infer the estimation of photometric redshifts from the hidden correlation between spectroscopic redshifts (precise measures) and sky object photometry.

Page 43: Statistical estimation

Non-parametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from the data. The term non-parametric does not imply that such models completely lack parameters, but that the number and nature of the parameters are flexible and not fixed in advance.

Examples:

- A histogram is a simple non-parametric estimate of a probability distribution. Skewness, kurtosis, MAD, NMAD and RMSE are also simple non-parametric estimators.
- Kernel density estimation provides better estimates of the density than histograms.
- Non-parametric regression methods have been developed based on kernels, splines and wavelets.
- Data envelopment analysis provides efficiency coefficients similar to those obtained by multivariate analysis, without any distributional assumption.
- KNNs (K-Nearest Neighbors) classify the unseen instance based on the K points in the training set which are nearest to it.
- A Support Vector Machine (SVM with a Gaussian kernel) is a non-parametric large-margin classifier.

Page 44: Regression indicators

Supposing that the target (desired) value of any regression problem is t, while the output of the learning model is o, the following statistical operators are useful for quality evaluation:

$\Delta z = t - o$,  normalized:  $\Delta z' = \dfrac{t - o}{1 + t}$

$bias = \dfrac{\sum_{i=1}^{N} \Delta z_i}{N}$

$\text{standard deviation: } \sigma = \sqrt{\dfrac{\sum_{i=1}^{N}\left(\Delta z_i - \frac{1}{N}\sum_{i=1}^{N}\Delta z_i\right)^2}{N}}$

$MAD = Median\left(\left|\Delta z - Median(\Delta z)\right|\right)$,  $NMAD = 1.4826\,MAD$

$RMS = \sqrt{\dfrac{\sum_{i=1}^{N} (\Delta z_i)^2}{N}}$,  so that  $RMS = \sqrt{bias^2 + \sigma^2}$

Although the RMS and the standard deviation are in principle different, sometimes they differ very little when the errors are sufficiently small, though a little more in some error bins.

The MAD is the median of the absolute deviations from the data's median. Example: (1, 1, 2, 2, 4, 6, 9) has a median value of 2. The absolute deviations about 2 are (1, 1, 0, 0, 2, 4, 7), which in turn have a median value of 1 (the sorted absolute deviations are (0, 0, 1, 1, 2, 4, 7)). So the MAD for these data is 1.
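A minimal MATLAB check of the MAD example above:

% MAD: median of the absolute deviations from the median
dz = [1 1 2 2 4 6 9];
MAD  = median(abs(dz - median(dz)));   % 1
NMAD = 1.4826 * MAD;                   % ~1.48, comparable to a standard deviation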

Page 45: Outliers evaluation in regression

The MAD is a robust statistic, a measure of statistical dispersion. Moreover, the MAD is less sensitive to outliers in a data set than the standard deviation. In the standard deviation, the distances from the mean are squared, so large deviations are weighted more heavily and outliers can heavily influence it. In the MAD, the deviations of a small number of outliers are irrelevant.

Therefore the MAD can give more reliable information on the dispersion of the data, without being influenced by the outlier distribution.

Besides such considerations, the outlier percentage is very important, especially to identify anomalies within the data (serendipity):

% |Δz| > 1σ,  % |Δz| > 2σ,  % |Δz| > 3σ,  % |Δz| > 4σ

Page 46: Statistical estimation

Key features of Mean, standard deviation, Median and Mode: centered, fixed score distribution, unimodal, symmetrical.

Benefits: best measure for symmetrical distributions; influenced by all the data; most reliable; good for interval and ratio data.

Limitations: the presence of outliers may dramatically affect the estimation.

Page 47: Statistical estimation

[Figure-only slide]

Page 48: Statistical estimation

[Figure-only slide]

Page 49: Statistical estimation

k-th central moment:  $\mu_k = E\left[(X - \mu)^k\right]$

- $\mu$ = the mean (first moment)
- $\sigma^2$ = 2nd central moment = $E\left[(X - \mu)^2\right]$ (the variance)
- Skewness = 3rd central moment / $\sigma^3$ = $\dfrac{E\left[(X - \mu)^3\right]}{\sigma^3}$
- Kurtosis = 4th central moment / $\sigma^4$ = $\dfrac{E\left[(X - \mu)^4\right]}{\sigma^4}$

[Worked example table]
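A minimal MATLAB sketch of these definitions, computed directly from the moments (the data vector is a placeholder):

% Skewness and kurtosis from the central moments
X = randn(1, 10000);
mu = mean(X);
sigma = std(X, 1);                       % population std (normalized by N)
skew = mean((X - mu).^3) / sigma^3;      % ~0 for a normal distribution
kurt = mean((X - mu).^4) / sigma^4;      % ~3 for a normal distribution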

Page 50: Statistical estimation

MAD (Median Absolute Deviation) is a measure of the dispersion (or variation) of the data. NMAD (Normalized MAD) is the value close to the standard deviation: σ ≈ 1.4826 MAD. The MAD measures dispersion by finding the value such that half of the data is closer to the median, and half farther from it, than that value. It is obtained by finding the median of the absolute values of the deviations of the data values from the median.

The MAD is a robust measure of how spread out a set of data is. The variance and standard deviation are also measures of spread, but they are more affected by extremely high or extremely low values and by non-normality. If your data are normal, the standard deviation is usually the best choice for assessing spread. However, if your data aren't normal, the MAD (or NMAD) is the suitable statistic.

$MAD(X) = median\left(\left|X_i - median(X_i)\right|\right)$

- Normal: the MAD is slightly less than the standard deviation (SD), as it down-weights the tails. This is one reason not to use the MAD for normal distributions.
- Double exponential: the SD is usually too large, while the MAD is about the same as it would be for the normal.
- Cauchy: the SD is very large, while the MAD is only a little larger than normal.
- Tukey-Lambda distribution: this distribution has truncated (cut off) tails; the SD and MAD are about the same.

Page 51: Statistical estimation

Non-parametric (or distribution-free) inferential statistical methods are mathematical procedures for statistical hypothesis testing which, unlike parametric statistics, make no assumptions about the probability distributions of the variables being assessed.

Examples:

- Statistical bootstrap methods: estimate the accuracy/sampling distribution of a statistic.
- Kolmogorov–Smirnov test: tests whether a sample is drawn from a given distribution, or whether two samples are drawn from the same distribution.
- Median test: tests whether two samples are drawn from distributions with equal medians.
- Sign test: tests whether matched-pair samples are drawn from distributions with equal medians.
- Spearman's rank correlation coefficient: measures statistical dependence between two variables using a monotonic function.
- Squared ranks test: tests equality of variances in two or more samples.
- Wilcoxon signed-rank test: tests whether matched-pair samples are drawn from populations with different mean ranks.

Page 52: Statistical tests - recap

Inferential statistics are the set of statistical tests used to make inferences about data. They can tell us whether the pattern we are observing is real or just due to chance.

What kind of test should we use? The decision about the test to use depends on the problem, the data distribution and the type of variables. In general, if the data are normally distributed we use parametric tests, otherwise non-parametric ones.

https://cyfar.org/types-statistical-tests

Correlational: look for an association between variables.
- Pearson correlation: strength of the association between two continuous variables.
- Spearman correlation: strength of the association between two ordinal variables (does not rely on the assumption of normally distributed data).
- Chi-square: strength of the association between two categorical variables.

Comparison of means: look for the difference between the means of variables.
- Paired T-test: difference between two related variables.
- Independent T-test: difference between two independent variables.

Regression: assess whether change in one variable predicts change in another variable.
- Simple regression: tests how change in the predictor variable predicts the level of change in the outcome variable.
- Multiple regression: tests how change in the combination of two or more predictor variables predicts the level of change in the outcome variable.

Non-parametric: used when the data do not meet the assumptions required by parametric tests.
- Wilcoxon rank-sum: difference between two independent variables; takes into account the magnitude and direction of the difference.
- Wilcoxon signed-rank: difference between two related variables; takes into account the magnitude and direction of the difference.
- Sign test: tests if two related variables are different; ignores the magnitude of the change and only takes into account the direction.

Page 53: Statistical estimation

Common to non-parametric methods is a concept known in inferential statistics as the "null hypothesis": a general statement or default position that there is no relationship between two measured phenomena. Rejecting or disproving the null hypothesis thus leads to the conclusion that there is room for believing that there is a relationship between the two phenomena. The null hypothesis is generally assumed to be true until evidence indicates otherwise.

In the hypothesis testing approach, a null hypothesis is contrasted with an alternative hypothesis, and the two hypotheses are distinguished on the basis of data, with certain error rates.

Statistical tests can be significance tests or hypothesis tests. There are many types of significance tests for one, two or more samples; for means, variances and proportions; for paired or unpaired data; for different distributions; for large and small samples... All have null hypotheses. A significance test may be used to verify a hypothesis test.

In frequentist statistics, the p-value is a function of the observed sample results (a test statistic) relative to a statistical model, which measures how extreme the observation is. It is compared against a threshold used to set the significance level α of the test. If the p-value is ≤ the chosen α, then the test suggests that the observed data are inconsistent with the null hypothesis, so the null hypothesis must be rejected.

Page 54: Kolmogorov-Smirnov Test (1)

The Kolmogorov-Smirnov test is a statistical test for the equality of continuous probability distributions. It can either compare a sample with a reference probability distribution or directly compare two sample datasets. The first is referred to as the one-sample K-S test and serves as a goodness-of-fit test; the second is the two-sample K-S test.

The basis of the test is that it expresses the distance between the cumulative fraction functions of the two samples as a number, D, which is then compared to the critical-D value for that data distribution. If D is greater than critical-D, it can be concluded that the distributions are indeed different; otherwise there is not enough evidence to prove a difference between the two datasets. A P-value can also be calculated from the D-value and the sample sizes of the two data sets; this value answers the question: what is the probability that the D-value would be that large or larger if the two samples were randomly drawn from identical populations?

Page 55: Kolmogorov-Smirnov Test (2)

[Figures: an example empirical distribution function; a log-transformed empirical distribution function; an example of the calculated D-value for a 2-sample K-S test]

Procedure:
1. Order the data sets from smallest to largest.
2. For each value in the data set, calculate the percent of data strictly smaller than that value.
3. Plot all calculated percent values as steps on a cumulative fraction function, one for each data set if it is a two-sample K-S test.
4. If the steps are bunched close to one another on one side of the graph, you can take the log of all data points and plot the distribution function based on that instead. For the log, all data points must be nonzero and nonnegative.
5. Calculate the maximum vertical distance between the two functions to acquire the D-value. This value, along with the corresponding P-value, states whether the data sets differ significantly.

Page 56: Kolmogorov-Smirnov Test (3)

Matlab function kstest2: two-sample Kolmogorov-Smirnov test

[h,p,ks2stat] = kstest2(x1, x2)

h = kstest2(x1,x2) performs a two-sample Kolmogorov-Smirnov test to compare the distributions of the values in the two data vectors x1 and x2.

The null hypothesis is that x1 and x2 are from the same continuous distribution. The alternative hypothesis is that they are from different continuous distributions.

The result h is 1 if the test rejects the null hypothesis at the 5% significance level, 0 otherwise.

The test statistic is max(|F1(x) - F2(x)|), where F1(x) is the proportion of x1 values less than or equal to x, and F2(x) is the proportion of x2 values less than or equal to x.

p and ks2stat are optional outputs: respectively, the p-value and the test statistic (D-value).

Page 57: AstroInformatics - INAFbrescia/documents/ASTROINFOEDU/... · Least squares method M. Brescia - Astroinformatics - lezione 3 9 y x 1 2 3 4 5 1 2 3 4 5 6 Now we want to calculate regression

15 16 17 18 19 20 21 220

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

psfMag i

Cu

mu

lati

ve

Dis

trib

uti

on

Fu

nc

tio

n

CDF of complete distributions

MLPQNA photo

SDSS spectro

Kolmogorov-Smirnov Test (4)

M. Brescia - Data Mining - lezione 4 57

These are SDSS DR10 i-band distributions: one from the galaxy spectroscopic catalogue and the second a QSO photometric distribution obtained by a neural network classifier.

cfull = load('cfull.txt');
kbfull = load('kbfull.txt');
cfull = cfull.';
kbfull = kbfull.';

%%%% K-S test on the entire distributions
[htotfull,ptotfull,ktotfull] = kstest2(cfull,kbfull);

figure;
F1totfull = cdfplot(cfull); hold on
F2totfull = cdfplot(kbfull);
set(F1totfull,'LineWidth',3,'Color','r')
set(F2totfull,'LineWidth',3,'Color','k')
legend([F1totfull F2totfull],'MLPQNA photo','SDSS spectro','Location','NorthWest')
xlabel('psfMag i','fontsize',12,'fontweight','b')
ylabel('Cumulative Distribution Function','fontsize',12,'fontweight','b')
title('CDF of complete distributions','fontsize',12,'fontweight','b')
xlim([15.0 22.0])

Output: htotfull = 1, ptotfull = 0, ktotfull = 0.3266

Page 58: Kolmogorov-Smirnov Test (5)

Now we divide the data based on an i-band magnitude limit, used to discriminate between bright and faint sources.

%%%% Divide entire distributions into bright and faint parts:
%%%% bright for psfMag_i < 19.1
%%%% faint for psfMag_i >= 19.1
j1 = 0; j2 = 0;
for i = 1:length(cfull)
    if cfull(i) < 19.1
        j1 = j1 + 1;
        cfull_bright(j1) = cfull(i);
    else
        j2 = j2 + 1;
        cfull_faint(j2) = cfull(i);
    end
end

j1 = 0; j2 = 0;
for i = 1:length(kbfull)
    if kbfull(i) < 19.1
        j1 = j1 + 1;
        kbfull_bright(j1) = kbfull(i);
    else
        j2 = j2 + 1;
        kbfull_faint(j2) = kbfull(i);
    end
end

Page 59: Kolmogorov-Smirnov Test (6)

%%%% K-S test on the bright distributions
[hfullbright,pfullbright,kfullbright] = kstest2(cfull_bright, kbfull_bright);

figure;
F1brightfull = cdfplot(cfull_bright); hold on
F2brightfull = cdfplot(kbfull_bright);
set(F1brightfull,'LineWidth',3,'Color','r')
set(F2brightfull,'LineWidth',3,'Color','k')
legend([F1brightfull F2brightfull],'MLPQNA photo','SDSS spectro','Location','NorthWest')
xlabel('psfMag i','fontsize',12,'fontweight','b')
ylabel('Cumulative Distribution Function','fontsize',12,'fontweight','b')
title('CDF of bright distributions','fontsize',12,'fontweight','b')
xlim([16.0 19.5])

[Figure: CDF of the bright distributions; x axis: psfMag i (16 to 19.5); curves: MLPQNA photo, SDSS spectro]

Output: hfullbright = 0, pfullbright = 0, kfullbright = 0.0419

Page 60: Kolmogorov-Smirnov Test (7)

%%%% K-S test on the faint distributions
[hfullfaint,pfullfaint,kfullfaint] = kstest2(cfull_faint, kbfull_faint);

figure;
F1faintfull = cdfplot(cfull_faint); hold on
F2faintfull = cdfplot(kbfull_faint);
set(F1faintfull,'LineWidth',3,'Color','r')
set(F2faintfull,'LineWidth',3,'Color','k')
legend([F1faintfull F2faintfull],'MLPQNA photo','SDSS spectro','Location','NorthWest')
xlabel('psfMag i','fontsize',12,'fontweight','b')
ylabel('Cumulative Distribution Function','fontsize',12,'fontweight','b')
title('CDF of faint distributions','fontsize',12,'fontweight','b')
xlim([19.0 22.0])

[Figure: CDF of the faint distributions; x axis: psfMag i (19 to 22); curves: MLPQNA photo, SDSS spectro]

Output: hfullfaint = 1, pfullfaint = 0, kfullfaint = 0.2306

Page 61: Kolmogorov-Smirnov Test (8)

Strengths of the K-S test:

1. It is non-parametric.
2. The D-value will not change if the X values are transformed to logs or by any other transformation.
3. There is no restriction on sample size.
4. The D-value is easy to compute and the graph can be understood easily.
5. The one-sample K-S test can serve as a goodness-of-fit test and can link data and theory.

Drawbacks:

- The K-S test is less sensitive when the differences between the curves are greatest at the beginning or at the end of the distributions. It works best when they deviate the most near the center of the distribution.
- The K-S test cannot be applied in multi-dimensional cases.

Page 62: Covariance statistics

Quite often, statistical courses forget to mention covariance matrix analysis, which is strictly related to the so-called "bivariate relationships". Covariance is one of a family of statistical measures used to analyze the linear relationship between two variables. It is strictly connected to the concepts of linear regression and correlation. The main difference is that correlation deals much more with the strength of a relationship, while covariance determines the trend of the relationship.

Covariance measures the linear association between two variables: a positive value indicates a direct or increasing linear relationship; a negative value a decreasing relationship; a zero value means no relationship.

For example, in this diagram the two variables follow the same linear pattern. In this case we can say that the two variables have a positive linear relationship (when one grows, the other grows too).

This is called COVARIANCE! They CO-vary, and the measure of covariance tells us how they change together.

Page 63: Covariance statistics

There is a difference depending on the kind of mean chosen: sample means or population means?

This is an example of sample covariance
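For completeness (the formula is not present in the extracted text, so it is restated here), the usual sample covariance is

$cov(x, y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

with the population covariance using n in the denominator instead of n - 1.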

Page 64: Covariance statistics

A first easy way to look at their relationships is to generate the matrix of scatter plots, comparing each variable with every other. Of course the matrix is symmetrical.

An example: we have 4 variables, with 20 population samples each. We want to verify their covariance.

Page 65: Covariance statistics

From a quick look at the matrix, the x1 vs x2 relationship is obvious.

Page 66: Covariance statistics

The diagonal of a covariance matrix provides the variance of each individual variable (its covariance with itself).

The off-diagonal entries of the matrix provide the covariance between each variable pair.

Page 67: Covariance statistics

The diagonal of a covariance matrix provides the variance of each individual variable (its covariance with itself). The off-diagonal entries provide the covariance between each variable pair.

Page 68: Covariance statistics

Pay attention when using any software to calculate the covariance, since the result depends on whether the denominator is N or N-1! (For instance, EXCEL uses N instead of N-1.) Using N means calculating the "population covariance"; N-1 means calculating the "sample covariance".

Remember to multiply each cell by N/(N-1) to transform the population covariance into the sample one.
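In MATLAB the normalization is an explicit flag of cov; a minimal sketch (example vectors of ours):

% Sample vs population covariance in MATLAB
x = [2 4 5 4 5];  y = [1 3 4 4 5];
Cs = cov(x, y);      % default: normalized by N-1 (sample covariance)
Cp = cov(x, y, 1);   % flag 1: normalized by N (population covariance)
% Cp * N/(N-1) recovers the sample covariance, as stated above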

Page 69: Covariance/Correlation statistics

Covariance indicates a positive linear relationship (its exact value matters less: the most important thing is its sign).

It does not say anything about the strength of the relationship, only about its direction. Strength is a matter of CORRELATION.

Covariance provides the DIRECTION of the linear relationship; correlation provides its STRENGTH. The covariance result has no upper/lower bound and its size depends on the scale of the variables; correlation is always in [-1, +1] and is independent of the scale of the variables. Covariance is not standardized; correlation is standardized, just think of the z-score…

$Z = \dfrac{X - \mu}{\sigma}$  (Standard Normal)

Page 70: Correlation statistics

Don't drive yourself crazy by starting to compute correlations before having looked at the scatter plots of your data! Then approach only what actually looks like a correlation…

Correlation can be applied to linear relationships ONLY. Correlation is not CAUSATION: two elements unrelated in real life can still show a mathematical correlation (e.g. dog barks and moon phases…).

Correlation strength does not necessarily mean a statistically significant relationship: significance is mostly related to sample size.

The correlation formula is based on the so-called Pearson correlation coefficient r.

What's the important news? If you know the σ values and r, this is an easy way to calculate the covariance!
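In formulas (Pearson's r, stated here for completeness): $r = \dfrac{cov(X, Y)}{\sigma_X \sigma_Y}$, hence $cov(X, Y) = r\,\sigma_X\,\sigma_Y$. A minimal MATLAB check (example vectors of ours):

% Covariance recovered from the correlation coefficient and the sigmas
x = [2 4 5 4 5];  y = [1 3 4 4 5];
R = corrcoef(x, y);              % 2x2 matrix; r = R(1,2)
c = R(1,2) * std(x) * std(y);    % equals the (1,2) entry of cov(x,y)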

Page 71: Correlation statistics

That's why you always have to look at the scatter plots of your data first!

More practically, how can we state more objectively whether or not a relationship exists between two variables?

Rule of thumb: