17 Correlation Chapter17 p399 Semimetric distance – Pearson correlation coefficient or Covariance...

  • date post

    20-Dec-2015

17 Correlation

Chapter 17, p. 399

Semimetric distance – Pearson correlation coefficient or Covariance

How about higher-dimensional data?

- It is useful to have a similar measure of how much the dimensions vary from the mean with respect to each other.

- Covariance is measured between 2 dimensions.

- Suppose one has a 3-dimensional data set (X, Y, Z); then one can calculate Cov(X,Y), Cov(X,Z) and Cov(Y,Z).

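- To compare heterogeneous pairs of variables, define the correlation coefficient, or Pearson correlation coefficient, $\rho_{XY}$, with $-1 \le \rho_{XY} \le 1$:

$$\rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}$$

$\rho_{XY} = -1$: perfect anticorrelation

$\rho_{XY} = 0$: uncorrelated (no linear relation)

$\rho_{XY} = +1$: perfect correlation

$$\mathrm{Var}(X) = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$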
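The definitions of variance, covariance and the Pearson correlation coefficient can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the chapter; the function names (`variance`, `covariance`, `pearson`) and the toy vectors are assumptions made for the example.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(x):
    """Sample variance s^2 with the n - 1 denominator."""
    x_bar = mean(x)
    return sum((xi - x_bar) ** 2 for xi in x) / (len(x) - 1)

def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

def pearson(x, y):
    """Pearson correlation: Cov(X, Y) / sqrt(var(X) * var(Y))."""
    return covariance(x, y) / math.sqrt(variance(x) * variance(y))

# Toy data (hypothetical): y_up rises with x, y_down falls with x.
x = [1.0, 2.0, 3.0, 4.0]
y_up = [2.0, 4.0, 6.0, 8.0]
y_down = [8.0, 6.0, 4.0, 2.0]

r_up = pearson(x, y_up)      # perfect correlation
r_down = pearson(x, y_down)  # perfect anticorrelation
print(r_up, r_down)
```

The two toy pairs sit at the extremes of the allowed range, +1 and -1.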

Semimetric distance – the squared Pearson correlation coefficient

• The Pearson correlation coefficient is useful for examining correlations in the data.

• One may imagine an instance, for example, in which the same transcription factor (TF) can cause both enhancement and repression of expression, so that anti-correlated genes are just as related as correlated ones.

• In that case a better alternative is the squared Pearson correlation coefficient (pcc),

$$\rho_{sq} = \rho_{XY}^2 = \frac{[\mathrm{Cov}(X,Y)]^2}{\mathrm{var}(X)\,\mathrm{var}(Y)}$$

The squared pcc takes values in the range $0 \le \rho_{sq} \le 1$.

0: uncorrelated vectors

1: perfectly correlated or anti-correlated vectors

Both pcc are measures of similarity. Similarity and distance have a reciprocal relationship: similarity↑, distance↓. $d = 1 - \rho$ is typically used as a measure of distance.
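A small sketch may help show why squaring matters for the distance. It assumes the distance is taken as 1 minus the plain coefficient in one case and 1 minus the squared coefficient in the other; the function names and the two toy expression profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; the n - 1 factors cancel, so plain sums suffice."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pcc_distance(x, y):
    """Distance from the plain coefficient (assumed form: 1 - rho)."""
    return 1.0 - pearson(x, y)

def squared_pcc_distance(x, y):
    """Distance from the squared coefficient (assumed form: 1 - rho^2)."""
    return 1.0 - pearson(x, y) ** 2

# Hypothetical profiles: gene_b is a mirror image of gene_a.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [4.0, 3.0, 2.0, 1.0]

d_plain = pcc_distance(gene_a, gene_b)            # far apart
d_squared = squared_pcc_distance(gene_a, gene_b)  # as close as possible
print(d_plain, d_squared)
```

With the plain coefficient the anti-correlated pair gets the largest possible distance, while with the squared coefficient it gets distance 0, which matches the idea that the same TF may enhance one gene and repress another.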

Semimetric distance – Pearson correlation coefficient or Covariance

- The resulting $\rho_{XY}$ value will be larger than 0 if X and Y tend to increase together, below 0 if one tends to decrease as the other increases, and 0 if they are independent.

Remark: $\rho_{XY}$ only tests whether there is a linear dependence, Y = aX + b.

- If two variables are independent, $\rho_{XY}$ is low.

- A low $\rho_{XY}$ does not guarantee independence; the relation may be non-linear.

- A high $\rho_{XY}$ is a sufficient but not necessary condition for variable dependence.
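The point that a low coefficient does not imply independence can be illustrated with a toy case. In this sketch (the data and names are made up for illustration) y is completely determined by x through y = x^2, yet the linear correlation vanishes because x is symmetric about 0.

```python
import math

def pearson(x, y):
    """Pearson correlation computed from centered sums."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# y depends on x exactly, but not linearly.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [xi ** 2 for xi in x]

r = pearson(x, y)
print(r)  # no linear correlation despite full dependence
```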

Semimetric distance – the squared Pearson correlation coefficient

• To test for a non-linear relation among the data, one could transform the data by variable substitution.

• Suppose one wants to test the relation $u(v) = a v^n$.

• Take the logarithm of both sides:

• $\log u = \log a + n \log v$

• Set Y = log u, b = log a, and X = log v. This gives a linear relation, Y = b + nX, so log u correlates (n > 0) or anti-correlates (n < 0) with log v.
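The substitution above can be tried on synthetic data. The sketch below assumes a made-up power law with a = 3 and n = -2 (values chosen only for illustration), compares the correlation before and after taking logarithms, and recovers n and a from a least-squares line through the log-transformed points.

```python
import math

def centered_sums(x, y):
    """Return (sxy, sxx, syy) computed about the means of x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((p - x_bar) * (q - y_bar) for p, q in zip(x, y))
    sxx = sum((p - x_bar) ** 2 for p in x)
    syy = sum((q - y_bar) ** 2 for q in y)
    return sxy, sxx, syy

def pearson(x, y):
    sxy, sxx, syy = centered_sums(x, y)
    return sxy / math.sqrt(sxx * syy)

# Synthetic data (assumed for the example): u = 3 * v**(-2).
v = [1.0, 2.0, 4.0, 8.0]
u = [3.0 * vi ** -2 for vi in v]

X = [math.log(vi) for vi in v]  # X = log v
Y = [math.log(ui) for ui in u]  # Y = log u

r_raw = pearson(v, u)  # weakened by the curvature of the power law
r_log = pearson(X, Y)  # close to -1 because n < 0

# Least-squares line Y = b + n X gives the exponent n and prefactor a = exp(b).
sxy, sxx, _ = centered_sums(X, Y)
n_est = sxy / sxx
b_est = sum(Y) / len(Y) - n_est * sum(X) / len(X)
a_est = math.exp(b_est)
print(r_raw, r_log, n_est, a_est)
```

On the raw values the coefficient understates the relation; after the log transform the points fall on a straight line and the anti-correlation is essentially perfect.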

Semimetric distance – Pearson correlation coefficient or Covariance matrix

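A covariance matrix is merely a collection of many covariances arranged in a d × d matrix, whose (i, j) entry is the covariance between dimensions i and j. For a 3-dimensional data set (X, Y, Z):

$$C = \begin{pmatrix} \mathrm{Cov}(X,X) & \mathrm{Cov}(X,Y) & \mathrm{Cov}(X,Z) \\ \mathrm{Cov}(Y,X) & \mathrm{Cov}(Y,Y) & \mathrm{Cov}(Y,Z) \\ \mathrm{Cov}(Z,X) & \mathrm{Cov}(Z,Y) & \mathrm{Cov}(Z,Z) \end{pmatrix}$$

The diagonal entries are the variances, since Cov(X,X) = Var(X), and the matrix is symmetric.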
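As a rough sketch, the d × d covariance matrix can be assembled by computing the covariance for every pair of dimensions. The helper names and the three toy columns below are assumptions for the example; in practice a numerical library routine would normally be used instead.

```python
def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

def covariance_matrix(columns):
    """Build the d x d matrix whose (i, j) entry is Cov(column i, column j)."""
    d = len(columns)
    return [[covariance(columns[i], columns[j]) for j in range(d)] for i in range(d)]

# Hypothetical 3-dimensional data set (X, Y, Z), four observations each.
X = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 4.0, 6.0, 8.0]
Z = [4.0, 3.0, 2.0, 1.0]

C = covariance_matrix([X, Y, Z])
for row in C:
    print(row)
```

The diagonal holds Var(X), Var(Y) and Var(Z), and the matrix is symmetric because Cov(X,Y) = Cov(Y,X).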

Spearman’s rank correlation (SRC)

• One of the problems with using the PCC is that it is susceptible to being skewed by outliers: a single data point can result in two genes appearing to be correlated, even when all the other data points suggest that they are not.

• Spearman’s rank correlation (SRC) is a non-parametric measure of correlation that is robust to outliers.

• SRC is a measure that ignores the magnitude of the changes. The idea of the rank correlation is to transform the original values into ranks, and then to compute the correlation between the series of ranks.

• First we sort the values of genes A and B in ascending order and assign rank 1 to the lowest value. The SRC between A and B is defined as the PCC between the ranked A and B.

• In case of ties, assign mid-ranks: if two values tie for ranks 5 and 6, assign both a rank of 5.5.
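The ranking step, including mid-ranks for ties, could be sketched as follows. The function name `mid_ranks` and the sample values are hypothetical; the logic simply averages the 1-based positions that a group of equal values would occupy.

```python
def mid_ranks(values):
    """Rank values in ascending order (lowest = 1); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the group while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Mean of the 1-based positions start+1 .. end+1.
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks

# Two values tie for positions 5 and 6, so both receive rank 5.5.
sample = [0.1, 0.2, 0.3, 0.4, 0.7, 0.7, 0.9]
print(mid_ranks(sample))
```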

Spearman’s rank correlation

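The SRC can be calculated by the following formula, where $x_i$ and $y_i$ denote the ranks of x and y respectively.

$$\rho_{SRC}(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]\left[\sum_{i=1}^{n} (y_i - \bar{y})^2\right]}}$$

An approximate formula in case of ties is given by

$$\rho_{SRC}(X,Y) = 1 - \frac{6 \sum_{i=1}^{n} (x_i - y_i)^2}{n(n^2 - 1)}$$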

SRC vs. PCC

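| Time | Gene A ratio | Gene B ratio | Gene A rank | Gene B rank |
|------|--------------|--------------|-------------|-------------|
| 0.5  | -0.76359     | -4.05957     | 1           | 1           |
| 2    | 2.276659     | -1.7788      | 6           | 2           |
| 5    | 2.137332     | -0.97433     | 5           | 4           |
| 7    | 1.900334     | -1.44114     | 4           | 3           |
| 9    | 0.932457     | -0.87574     | 3           | 5           |
| 11   | 0.761866     | -0.52328     | 2           | 6           |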

PCC(A, B) = 0.633

SRC(A,B) = -0.086
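The two values quoted for this table can be reproduced with a short sketch. The ratios are copied from the table; the helper names are assumptions, and the simple ranking function relies on there being no ties in this data set.

```python
import math

gene_a = [-0.76359, 2.276659, 2.137332, 1.900334, 0.932457, 0.761866]
gene_b = [-4.05957, -1.7788, -0.97433, -1.44114, -0.87574, -0.52328]

def pearson(x, y):
    """Pearson correlation computed from centered sums."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks_no_ties(values):
    """Rank 1 for the lowest value; assumes all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

rank_a = ranks_no_ties(gene_a)
rank_b = ranks_no_ties(gene_b)

pcc = pearson(gene_a, gene_b)  # PCC on the raw ratios
src = pearson(rank_a, rank_b)  # SRC defined as the PCC of the ranks

# Rank-difference formula; it agrees with the PCC of ranks when there are no ties.
n = len(rank_a)
d_squared_sum = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
src_shortcut = 1 - 6 * d_squared_sum / (n * (n * n - 1))

print(round(pcc, 3), round(src, 3), round(src_shortcut, 3))
```

The first time point, where both genes have their lowest ratio by a wide margin, accounts for most of the positive PCC; once magnitudes are replaced by ranks, the apparent correlation disappears.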

Chapter 17, p. 401

Chapter 17, p. 408