FINDING RELATIONS BETWEEN VARIABLES - unibo.it · 2015. 10. 22.
FINDING RELATIONS BETWEEN VARIABLES
Correlation
Relation between coupled variables
Which pairs of variables are related?
Correlated variables
Uncorrelated variables
Definition. The covariance of two random variables X and Y is

Cov(X, Y) = E[(X - E[X])(Y - E[Y])]

Theorem. For any two random variables X and Y,

Var[X + Y] = Var[X] + Var[Y] + 2 Cov(X, Y).
Variance and Moments of a Random Variable
Independent variables ⇒ Cov = 0

Cov(X, Y) = E[XY] - E[X]E[Y]

E[XY] = Σi Σj xi yj P(xi, yj)

If X and Y are independent, P(xi, yj) = P(xi) P(yj), so

E[XY] = ( Σi xi P(xi) ) ( Σj yj P(yj) ) = E[X] E[Y]

and therefore Cov(X, Y) = E[X]E[Y] - E[X]E[Y] = 0.
For discrete variables (for continuous, integral instead of sum)
For independent variables:
X, Y independent ⇒ Cov(X, Y) = 0
The converse is not always true.
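As a quick numerical check (a minimal sketch using NumPy, not part of the original slides): two independent samples have covariance near zero, and X paired with X² shows the converse failing, since X² depends on X yet, for a symmetric distribution, Cov(X, X²) = E[X³] = 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent samples: their sample covariance should be near zero.
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)
cov_xy = np.cov(x, y)[0, 1]

# Converse counterexample: z = x**2 is fully determined by x,
# yet for a symmetric distribution Cov(X, X^2) = E[X^3] = 0.
z = x ** 2
cov_xz = np.cov(x, z)[0, 1]
```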
Covariance and Pearson’s Correlation index
        Variable 1   Variable 2
Item 1  x11          x21
Item 2  x12          x22
…
Item i  x1i          x2i
…
Item n  x1n          x2n
Mean    M1           M2
M1 = (1/n) Σi=1..n x1i        M2 = (1/n) Σi=1..n x2i

cov(x1, x2) = 1/(n-1) Σi=1..n (x1i - M1)(x2i - M2)

corr(x1, x2) = cov(x1, x2) / (s1 s2) = Σi (x1i - M1)(x2i - M2) / sqrt( Σi (x1i - M1)² · Σi (x2i - M2)² )

where s1 and s2 are the sample standard deviations of the two variables.
Correlation
-1 ≤ corr(x1, x2) ≤ 1
When is a correlation significant?
Given a correlation index

r(x1, x2) = Σi (x1i - M1)(x2i - M2) / sqrt( Σi (x1i - M1)² · Σi (x2i - M2)² )

a test variable can be computed under the null hypothesis that r = 0:

t = r · sqrt( (n - 2) / (1 - r²) )

t is distributed as a Student's t with n - 2 degrees of freedom. The test assumes normality of x.
Graph showing the minimum value of Pearson's correlation coefficient that is significantly different from zero at the 0.05 level, for a given sample size.
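A minimal Python sketch of this significance test (the helper names `pearson_r` and `t_statistic` are my own, not from the slides):

```python
import math
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation of two paired sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

def t_statistic(r, n):
    """Test statistic for H0: r = 0; Student's t with n - 2 dof."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Small illustrative sample (values chosen here, not from the slides).
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]
r = pearson_r(x, y)
t = t_statistic(r, len(x))
```

With the values from the misconduct example below, `t_statistic(0.978, 24)` gives approximately 22, consistent with the slide's t = 22.05 once rounding of r is taken into account.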
Example: discovery of a misconduct
Repeatability test: two different experimenters were asked to take the same solution and to perform 24 independent ELISA assays on a 6x4 plate.
They submitted to the assessor the following spectrophotometer readings, ordered by well:
Well  S1     S2
P1    0.481  0.496
P2    0.485  0.501
P3    0.479  0.495
P4    0.506  0.522
P5    0.467  0.480
P6    0.474  0.491
P7    0.469  0.480
P8    0.475  0.489
P9    0.514  0.520
P10   0.520  0.524
P11   0.526  0.531
P12   0.494  0.509
P13   0.535  0.540
P14   0.524  0.526
P15   0.481  0.492
P16   0.502  0.509
P17   0.479  0.484
P18   0.491  0.495
P19   0.503  0.515
P20   0.472  0.481
P21   0.481  0.486
P22   0.503  0.512
P23   0.448  0.454
P24   0.519  0.526
The assessor suspected that the experimenter had submitted two reads of the same plate.
How to prove it?
Example: discovery of a misconduct
[Figure: scatter plot of S2 vs S1 (axes 0.44-0.55); R = 0.978]
Example: discovery of a misconduct
R = 0.978, n = 24  ⇒  t = 22.05
Objection: the test is valid only when data are normally distributed and we cannot prove that.
Any other idea?
Use the data themselves to generate random experiments
(original S1/S2 table as above, compared with a shuffled pairing)

Well  S1     Random(S2)
P1    0.481  0.495
P2    0.485  0.522
P3    0.479  0.480
P4    0.506  0.491
P5    0.467  0.480
P6    0.474  0.489
P7    0.469  0.520
P8    0.475  0.524
P9    0.514  0.531
P10   0.520  0.509
P11   0.526  0.540
P12   0.494  0.526
P13   0.535  0.492
P14   0.524  0.509
P15   0.481  0.484
P16   0.502  0.495
P17   0.479  0.515
P18   0.491  0.481
P19   0.503  0.486
P20   0.472  0.512
P21   0.481  0.454
P22   0.503  0.526
P23   0.448  0.496
P24   0.519  0.501
[Figure: scatter plot of Random(S2) vs S1 (axes 0.44-0.56); R = 0.25]
Building a distribution by random resampling
Iterate the shuffle-and-correlate process many times (say 1000).
Compute a cumulative histogram counting the resamplings that score a correlation ≥ r.
[Figure: cumulative histogram, number of resamplings (0-1000) with correlation ≥ r, for r from 0.0 to 0.9]
Building a distribution by random resampling
The plot gives the probability (per thousand) of obtaining a given correlation with random pairings of the original data. This p-value is independent of any assumption on the data distribution.
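The shuffling procedure can be sketched in plain Python (the function names are mine; the data are the first 12 wells from the table above):

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation of two paired sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den

def permutation_pvalue(s1, s2, iters=1000, seed=0):
    """Estimate P(correlation >= observed) by repeatedly shuffling s2.
    Shuffling destroys any real pairing, so the resulting p-value makes
    no assumption about the data distribution."""
    r_obs = pearson_r(s1, s2)
    rng = random.Random(seed)
    s2 = list(s2)
    hits = 0
    for _ in range(iters):
        rng.shuffle(s2)
        if pearson_r(s1, s2) >= r_obs:
            hits += 1
    return hits / iters

# First 12 wells of the plate (from the table above).
s1 = [0.481, 0.485, 0.479, 0.506, 0.467, 0.474,
      0.469, 0.475, 0.514, 0.520, 0.526, 0.494]
s2 = [0.496, 0.501, 0.495, 0.522, 0.480, 0.491,
      0.480, 0.489, 0.520, 0.524, 0.531, 0.509]
p = permutation_pvalue(s1, s2)
```

For these wells the observed correlation is close to 1, so essentially no random pairing reaches it and the estimated p-value is near zero, supporting the assessor's suspicion.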
This is only an example plot: compute the corresponding plot yourself from the data available in the misconduct.xls file.
Bootstrapping
The term is often attributed to Rudolf Erich Raspe's story The Surprising Adventures of Baron Munchausen, where the main character pulls himself out of a swamp by his hair (specifically, his pigtail), but the Baron does not, in fact, pull himself out by his bootstraps
Correlation index assumes linear dependence
[Figure: data with a clearly nonlinear relation, yet R = 0.816]
Non parametric correlation: Spearman
Given a set of paired values (xi, yi), sort the two variables separately to obtain the ranks.
Spearman's correlation is Pearson's correlation of the ranked variables (Rxi, Ryi).
Under the null hypothesis (r = 0),

t = r · sqrt( (n - 2) / (1 - r²) )

is distributed as a Student's t with n - 2 degrees of freedom.
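The rank-then-correlate recipe can be sketched as follows (helper names are mine; ties get the average rank, which is the usual convention):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation of two paired sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den

def ranks(values):
    """1-based rank of each value; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Monotone but nonlinear relation: y = x**3.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
rho = spearman(x, y)
```

Because the relation is monotone, Spearman's rho is exactly 1 here, while the plain Pearson correlation of x and y is below 1, illustrating why a rank-based index is preferable for nonlinear dependence.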
Categorical data: Matthews correlation index
                        Secreted   Non secreted   Total
With signal peptide     a          b              a + b
Without signal peptide  c          d              c + d
Total                   a + c      b + d          n

MCC = (ad - bc) / sqrt( (a + b)(c + d)(a + c)(b + d) )
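The formula translates directly into code (a minimal sketch; the argument names follow the contingency table above):

```python
import math

def mcc(a, b, c, d):
    """Matthews correlation coefficient from a 2x2 contingency table:
    a = secreted with signal peptide,    b = non-secreted with signal peptide,
    c = secreted without signal peptide, d = non-secreted without it."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0
```

A perfect association (all counts on the diagonal) gives MCC = 1, and a table with no association gives MCC = 0.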
REGRESSION
Regression
Regression analysis: any technique for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables.
[Figure: plot of Height (4.6-7) vs Weight (100-260)]
Linear regression: which is the line that best fits the data?
Linear regression
Residuals (errors): εi = yi - (a xi + b), i.e. yi = a xi + b + εi
Interpolating line: y = a x + b
Data Points:
x y
1 6
2 1
3 9
4 5
5 17
6 12
Least squares line
y = a x + b.  Choose the line (a, b) that minimizes

E = Σi=1..m εi² = Σi=1..m (yi - a xi - b)²
Minimizing

E = Σi=1..m εi² = Σi=1..m (yi - a xi - b)²

Setting the partial derivatives to zero:

∂E/∂b = -2 Σi (yi - a xi - b) = 0   ⇒   m b = Σi yi - a Σi xi   ⇒   b = ȳ - a x̄

∂E/∂a = -2 Σi xi (yi - a xi - b) = 0   ⇒   Σi xi yi - a Σi xi² - b Σi xi = 0

Substituting b = ȳ - a x̄ into the second equation:

a = ( m Σi xi yi - Σi xi · Σi yi ) / ( m Σi xi² - (Σi xi)² ) = cov(x, y) / Var[x]

b = ȳ - a x̄
Data Points:
x y
1 6
2 1
3 9
4 5
5 17
6 12
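The closed-form formulas above can be applied directly to these data points (a small sketch; `least_squares_line` is my own helper name):

```python
def least_squares_line(xs, ys):
    """Closed-form least-squares slope and intercept:
    a = (m*Sxy - Sx*Sy) / (m*Sxx - Sx^2),  b = (Sy - a*Sx) / m."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    a = (m * sxy - sx * sy) / (m * sxx - sx ** 2)
    b = (sy - a * sx) / m
    return a, b

# The data points from the slide.
xs = [1, 2, 3, 4, 5, 6]
ys = [6, 1, 9, 5, 17, 12]
a, b = least_squares_line(xs, ys)
```

For these points the formulas give a = 222/105 ≈ 2.11 and b = 14/15 ≈ 0.93.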
Polynomial interpolation
The same technique can be applied by imposing a polynomial regression model:

y = P(x) = Σk=1..p ak x^k + b

where p is the degree of the polynomial and ak, b are the trainable coefficients.
Polynomials can perfectly interpolate a set of points
A set of data consisting of m points can be perfectly interpolated with a polynomial of degree m - 1: 2 points define a unique line, 3 points define a unique parabola (or a line, if they are aligned), and so on.
Increasing the degree of the polynomial corresponds to decreasing the error.
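This can be checked numerically with NumPy (a sketch, not from the slides), using the six data points from the linear-regression example: a degree m - 1 fit passes exactly through all m points.

```python
import numpy as np

# The six data points from the least-squares example.
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
ys = np.array([6.0, 1.0, 9.0, 5.0, 17.0, 12.0])

# A degree m-1 = 5 polynomial interpolates all m = 6 points exactly...
coeffs = np.polyfit(xs, ys, deg=len(xs) - 1)
fitted = np.polyval(coeffs, xs)
max_err = np.max(np.abs(fitted - ys))  # essentially zero

# ...but it can behave wildly between and outside the points (overfitting).
extrapolated = np.polyval(coeffs, 8.0)
```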
Bishop C, Pattern recognition and Machine Learning, Springer
Values for the parameters:

        M=0     M=1     M=3      M=9
b       0.19    0.82    0.31     0.35
a1              -1.27   7.99     232.37
a2                      -25.43   -5321
a3                      17.37    48568
a4                               -231639
a5                               640042
a6                               -1061800
a7                               1042400
a8                               -557682
a9                               125201
But…
The interpolation is useless and does not give a good predictive model for extrapolation: OVERFITTING.
We cannot use the model for predicting y in this region.
Overfitting
A low number of points increases the risk of overfitting.
Bishop C Pattern Recognition and Machine Learning
High values for parameters
(same coefficient table as above; note the huge values for M=9)
Bishop C Pattern Recognition and Machine Learning
Regularized Error function
Discouraging high values for coefficients
Ẽ = Σi=1..m ( yi - P(xi) )² + λ Σk=1..p ak²

λ is the parameter weighting the strength of the regularization.
Values for parameters for M=9
        ln λ=0   ln λ=-18   ln λ=-∞
w0      0.13     0.35       0.35
w1      -0.05    4.74       232.37
w2      -0.06    -0.77      -5321
w3      -0.05    -31.97     48568
w4      -0.03    -3.89      -231639
w5      -0.02    55.28      640042
w6      -0.01    41.32      -1061800
w7      0.00     -45.95     1042400
w8      0.00     -91.53     -557682
w9      0.01     72.68      125201
Bishop C Pattern Recognition and Machine Learning
[Figure: fit with λ = 0] Bishop C, Pattern Recognition and Machine Learning
Regularized regression avoids overfitting
Bishop C Pattern Recognition and Machine Learning
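A minimal sketch of a regularized (ridge-style) polynomial fit under the quadratic penalty above. The closed-form solution w = (ΦᵀΦ + λI)⁻¹ Φᵀ y is the standard one for this penalty; the data, λ values, and function name here are my own choices, not from the slides.

```python
import numpy as np

def ridge_polyfit(xs, ys, degree, lam):
    """Minimize sum (y - P(x))^2 + lam * sum a_k^2 in closed form:
    w = (Phi^T Phi + lam*I)^-1 Phi^T y, Phi = polynomial feature matrix."""
    phi = np.vander(xs, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    eye = np.eye(degree + 1)
    eye[0, 0] = 0.0  # conventionally leave the intercept unpenalized
    w = np.linalg.solve(phi.T @ phi + lam * eye, phi.T @ ys)
    return w

# Noisy samples of a sine, as in Bishop's running example.
rng = np.random.default_rng(1)
xs = np.linspace(0.0, 1.0, 10)
ys = np.sin(2 * np.pi * xs) + rng.normal(scale=0.2, size=10)

w_free = ridge_polyfit(xs, ys, degree=9, lam=0.0)   # unregularized: huge weights
w_reg = ridge_polyfit(xs, ys, degree=9, lam=1e-3)   # regularized: tame weights
```

Comparing the two weight vectors reproduces the effect shown in the table: regularization shrinks the coefficient magnitudes and so discourages the wild oscillations of the unregularized degree-9 fit.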