Tales of correlation inflation (2013 CADD GRC)

Tales of correlation inflation (Eu prefiro a minha comida cozida e meus dados brutos) Peter W Kenny, Universidade de São Paulo


Presented at the 2013 Computer-Aided Drug Design Gordon Research Conference and subtitled 'Eu prefiro a minha comida cozida e meus dados brutos' ('I prefer my food cooked and my data raw'). The chicken in the title slide photo had been walking around the day before the photo was taken.

Transcript of Tales of correlation inflation (2013 CADD GRC)

Page 1: Tales of correlation inflation (2013 CADD GRC)

Tales of correlation inflation (Eu prefiro a minha comida cozida e meus dados brutos)

Peter W Kenny, Universidade de São Paulo

Page 2: Tales of correlation inflation (2013 CADD GRC)

Correlation

• Strong correlation implies good predictivity

– I have observed a correlation so you must use my rule

• Multivariate data analysis (e.g. PCA) usually involves transformation to orthogonal basis

• Applying cutoffs (e.g. MW restriction) to data can distort correlations

Page 3: Tales of correlation inflation (2013 CADD GRC)

Quantifying strengths of relationships between continuous variables

• Correlation measures

– Pearson product-moment correlation coefficient (R)

– Spearman's rank correlation coefficient (ρ)

– Kendall rank correlation coefficient (τ)

• Quality of fit measures

– Coefficient of determination (R²) is the fraction of the variance in Y that is explained by the model

– Root mean square error (RMSE)
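The measures listed on this slide can be written out in a few lines each. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the Kendall variant shown is tau-a (no correction for ties), which is an assumption since the slide does not say which variant was used.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient R."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def r_squared(observed, predicted):
    """Fraction of the variance in the observations explained by the model."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def rmse(observed, predicted):
    """Root mean square error, in the units of Y."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
```

For a monotonic but non-linear relationship such as y = x², the two rank measures equal 1 while Pearson's R is slightly below 1.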

Page 4: Tales of correlation inflation (2013 CADD GRC)

Size of effect for categorical X

Difference in mean values of Y for X = A and X = B

– Scale by standard deviation: Cohen’s d (independent of sample size)

– Scale by standard error: Student’s t (depends on sample size)

R² can be seen as analogous to Cohen’s d
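The contrast between the two scalings can be illustrated with simulated data. This is a sketch under stated assumptions: two normally distributed groups with unit standard deviation and means 0.5 apart, the pooled-SD form of Cohen's d, and the equal-variance form of Student's t. None of these choices come from the slide.

```python
import math
import random
import statistics

def pooled_sd(a, b):
    """Pooled standard deviation of two samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return math.sqrt(pooled_var)

def cohens_d(a, b):
    """Difference in means scaled by the standard deviation."""
    return (statistics.mean(b) - statistics.mean(a)) / pooled_sd(a, b)

def students_t(a, b):
    """Difference in means scaled by the standard error of the difference."""
    se = pooled_sd(a, b) * math.sqrt(1 / len(a) + 1 / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se

rng = random.Random(0)  # fixed seed so the sketch is repeatable
results = {}
for n in (20, 2000):
    group_a = [rng.gauss(0.0, 1.0) for _ in range(n)]  # X = A
    group_b = [rng.gauss(0.5, 1.0) for _ in range(n)]  # X = B
    results[n] = (cohens_d(group_a, group_b), students_t(group_a, group_b))
    print(f"n per group = {n}: d = {results[n][0]:.2f}, t = {results[n][1]:.2f}")
```

With equal group sizes n, t equals d multiplied by sqrt(n/2), so t grows with sample size for a fixed effect size while d estimates the same quantity throughout.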

Page 5: Tales of correlation inflation (2013 CADD GRC)

Correlation Inflation in Flatland

See Lovering, Bikker & Humblet (2009) JMC 52:6752-6756

N      Pearson R (95% CI)         Spearman ρ (P)       Kendall τ (P)
1202   0.247 (0.193 to 0.299)     0.215 (< 0.0001)     0.148 (< 0.0001)
8      0.972 (0.846 to 0.995)     0.970 (< 0.0001)     0.909 (0.0018)

Page 6: Tales of correlation inflation (2013 CADD GRC)

Preparation of synthetic data sets

Kenny & Montanari (2013) JCAMD 27:1-13

Add Gaussian noise (SD=10) to Y
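A synthetic data set of this kind might be generated as follows. The slide specifies only the noise (Gaussian, SD = 10); the underlying relationship Y = X, the integer X values 1 to 10, the number of points per X value, and the function name `make_synthetic` are all assumptions made for illustration.

```python
import random

def make_synthetic(n_per_x=100, x_values=range(1, 11), slope=1.0,
                   noise_sd=10.0, seed=0):
    """Return (xs, ys) where Y = slope * X plus Gaussian noise.

    Only noise_sd=10 is taken from the slide; every other default is
    an assumption.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for x in x_values:
        for _ in range(n_per_x):
            xs.append(x)
            ys.append(slope * x + rng.gauss(0.0, noise_sd))
    return xs, ys

xs, ys = make_synthetic()
print(len(xs), "points; first five Y values:", [round(y, 1) for y in ys[:5]])
```

Because the noise SD is large compared with the spread of X, the relationship in the individual data points is weak by construction.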

Page 7: Tales of correlation inflation (2013 CADD GRC)

Correlation inflation by hiding variation

See Hopkins, Mason & Overington (2006) Curr Opin Struct Biol 16:127-136; Leeson & Springthorpe (2007) NRDD 6:881-890

Data is naturally binned (X is an integer) and mean value of Y is calculated for each value of X. In some studies, averaged data is only presented graphically and it is left to the reader to judge the strength of the correlation.

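R for the individual data points (three data sets): 0.34, 0.30, 0.31

R for the mean values of Y at each value of X (same three data sets): 0.67, 0.93, 0.996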
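The inflation described on this slide is easy to reproduce. The sketch below assumes Y = X with Gaussian noise of SD 10, integer X from 1 to 10, and 200 points per X value (all assumptions, not the data behind the slide), and compares R for the individual points with R for the per-X means.

```python
import math
import random

def pearson(x, y):
    """Pearson product-moment correlation coefficient R."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
xs = [x for x in range(1, 11) for _ in range(200)]
ys = [x + rng.gauss(0.0, 10.0) for x in xs]

# Average Y for each integer value of X: this hides the variation.
by_x = {}
for x, y in zip(xs, ys):
    by_x.setdefault(x, []).append(y)
bin_x = sorted(by_x)
bin_mean = [sum(by_x[x]) / len(by_x[x]) for x in bin_x]

r_raw = pearson(xs, ys)
r_binned = pearson(bin_x, bin_mean)
print(f"R (individual points) = {r_raw:.2f}")
print(f"R (bin means)         = {r_binned:.2f}")
```

Increasing the number of points per X value leaves R for the individual points essentially unchanged but pushes R for the bin means towards 1, because the noise in each mean shrinks with the number of points averaged.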

Page 8: Tales of correlation inflation (2013 CADD GRC)

Masking variation with standard error

See Gleeson (2008) JMC 51:817-834

Partition by value of X into 4 bins with equal numbers of data points and display 95% confidence interval for mean (green) and mean ± SD (blue) for each bin.

R = 0.12 R = 0.29 R = 0.28
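The difference between the two kinds of error bar can be sketched numerically. The example below assumes X uniform on 0 to 10, Y = X with Gaussian noise of SD 10, and a normal approximation (1.96 × standard error) for the 95% confidence interval; these are illustrative assumptions and `bin_summary` is a hypothetical helper name.

```python
import math
import random
import statistics

def bin_summary(xs, ys, n_bins=4):
    """Sort by X, split into equal-count bins, summarise Y in each bin."""
    pairs = sorted(zip(xs, ys), key=lambda p: p[0])
    size = len(pairs) // n_bins
    rows = []
    for b in range(n_bins):
        end = (b + 1) * size if b < n_bins - 1 else len(pairs)
        y_bin = [y for _, y in pairs[b * size:end]]
        sd = statistics.stdev(y_bin)
        se = sd / math.sqrt(len(y_bin))
        rows.append({"n": len(y_bin), "mean": statistics.mean(y_bin),
                     "sd": sd, "ci95": 1.96 * se})
    return rows

def simulate(n, seed=0):
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [x + rng.gauss(0.0, 10.0) for x in xs]
    return bin_summary(xs, ys)

rows_small = simulate(400)
rows_large = simulate(4000)
for label, rows in (("N = 400", rows_small), ("N = 4000", rows_large)):
    for row in rows:
        print(f"{label}: mean = {row['mean']:.2f}, "
              f"SD = {row['sd']:.2f}, 95% CI half-width = {row['ci95']:.2f}")
```

The SD in each bin stays close to the noise level whatever the sample size, whereas the confidence interval for the mean narrows as more points are added, so it says little about how much individual values vary.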

Page 9: Tales of correlation inflation (2013 CADD GRC)

The error of standard error

ANOVA for binned data sets

N      Bins   Degrees of freedom   F        P
40     4      3                    0.2596   0.8540
400    4      3                    12.855   < 0.0001
4000   4      3                    115.35   < 0.0001
4000   2      1                    270.91   < 0.0001
4000   8      7                    50.075   < 0.0001

“In each plot provided, the width of the errors bars and the difference in the mean values of the different categories are indicative of the strength of the relationship between the parameters.” Gleeson (2008) JMC 51:817-834
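The way the F statistic in a one-way ANOVA grows with sample size for binned data can be sketched as follows. The simulated data (X uniform on 0 to 10, Y = X plus Gaussian noise of SD 10) are an assumption and will not reproduce the values in the table on this slide. P values are omitted because they need the F distribution, which is not in the Python standard library.

```python
import random

def anova_f(groups):
    """One-way ANOVA: return (F, df_between, df_within)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2
                    for g, m in zip(groups, means) for y in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

def simulate_f(n, n_bins=4, seed=0):
    """Simulate n points, bin by X into equal-count bins, run the ANOVA."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        pairs.append((x, x + rng.gauss(0.0, 10.0)))
    pairs.sort(key=lambda p: p[0])
    size = n // n_bins  # any remainder points are dropped in this sketch
    groups = [[y for _, y in pairs[b * size:(b + 1) * size]]
              for b in range(n_bins)]
    return anova_f(groups)

for n in (40, 400, 4000):
    f, df1, df2 = simulate_f(n)
    print(f"N = {n}: F = {f:.2f} (df = {df1}, {df2})")
```

The underlying relationship is the same in every run; only the number of data points changes, so a large F (or a small P) reflects sample size as much as strength of relationship.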

Page 10: Tales of correlation inflation (2013 CADD GRC)

Know your data

• Assays are typically run in replicate making it possible to estimate assay variance

• Every assay has a finite dynamic range and it may not always be obvious what this is for a particular assay

• Dynamic range may have been sacrificed for throughput but this, by itself, does not make the assay bad

• We need to be able to analyse in-range and out-of-range data within a single unified framework

– See Lind (2010) QSAR analysis involving assay results which are only known to be greater than, or less than some cut-off limit. Mol Inf 29:845-852
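As an illustration of the first point on this slide, replicate measurements can be pooled across compounds to estimate assay variance. This is a minimal sketch, assuming the same variance for every compound and in-range values only; the function name and the example pIC50 values are hypothetical.

```python
def pooled_assay_variance(replicates):
    """Pooled within-compound variance from replicate measurements.

    replicates maps a compound identifier to its list of measured values.
    Compounds measured only once carry no information about variance
    and are skipped.
    """
    sum_squares = 0.0
    degrees_of_freedom = 0
    for values in replicates.values():
        if len(values) < 2:
            continue
        mean = sum(values) / len(values)
        sum_squares += sum((v - mean) ** 2 for v in values)
        degrees_of_freedom += len(values) - 1
    if degrees_of_freedom == 0:
        raise ValueError("need at least one compound with two or more replicates")
    return sum_squares / degrees_of_freedom

# Hypothetical pIC50 replicates for three compounds.
example = {"cpd_a": [6.0, 6.2], "cpd_b": [7.1, 7.3, 7.5], "cpd_c": [5.0]}
print(f"pooled assay variance = {pooled_assay_variance(example):.4f}")
```

The square root of the pooled variance gives a sense of the smallest RMSE a model of that assay could reasonably be expected to achieve.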

Page 11: Tales of correlation inflation (2013 CADD GRC)

Depicting variation with percentile plots

This graphical representation of data makes it easy to visualize variation and can be used with mixed in-range and out-of-range data. See Colclough et al (2008) BMCL 16:6611-6616
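One reason percentiles cope with out-of-range data is that a percentile depends only on rank order, so a value reported as "greater than the upper limit" can still be ranked. The sketch below is an illustration of that idea, not the procedure used in the cited paper: it encodes such values as `math.inf` and uses the nearest-rank definition of a percentile, both of which are assumptions.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100.

    Values above the assay's upper limit are encoded as math.inf.
    A result of math.inf means the percentile itself lies above the limit.
    """
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Eight in-range values and two results reported only as "> upper limit".
measured = [1, 2, 3, 4, 5, 6, 7, 8, math.inf, math.inf]

for q in (10, 50, 80, 90):
    print(f"{q}th percentile: {percentile(measured, q)}")
```

Computing a set of percentiles for each bin of X and plotting them against X gives a picture of the spread in Y that a plot of bin means with standard error bars would hide.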

Page 12: Tales of correlation inflation (2013 CADD GRC)

Binning continuous data restricts your options for analysis and places the burden of proof on you to show that your conclusions are independent of the binning scheme. Think before you bin!

Averaging the binned data was your idea so don’t try blaming me this time!

Page 13: Tales of correlation inflation (2013 CADD GRC)

Some stuff to think about

• Model continuous data as continuous data

– RMSE is most relevant to prediction but you still need R²

– Fitted parameters may provide insight (e.g. solubility is more sensitive than potency to lipophilicity)

• When selecting training data think in terms of Design of Experiments (e.g. evenly spaced values of X)

• Try to achieve normally distributed Y (e.g. use pIC50 rather than IC50)

• Never make statements about the strength of a relationship when you’ve hidden variation in the data (unless you want a starring role in Correlation Inflation 2)

• To be meaningful a measure of the spread of a distribution must be independent of sample size

• Reviewers/editors, mercilessly purge manuscripts of statements like, “A negative correlation was observed between X and Y” or “A and B are correlated/linked”
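The pIC50 transformation mentioned in the list above is the negative base-10 logarithm of IC50 expressed in mol/L. A minimal sketch, with hypothetical function names:

```python
import math

def pic50_from_molar(ic50_molar):
    """pIC50 = -log10(IC50), with IC50 in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)

def pic50_from_nanomolar(ic50_nm):
    """Convenience wrapper for IC50 values quoted in nmol/L."""
    return pic50_from_molar(ic50_nm * 1e-9)

for ic50_nm in (1, 10, 100, 1000):
    print(f"IC50 = {ic50_nm} nM -> pIC50 = {pic50_from_nanomolar(ic50_nm):.1f}")
```

Equal steps in pIC50 correspond to equal fold changes in IC50, which is why the logarithmic scale is usually closer to normally distributed.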