Multi-experiment Viewer Pipeline for percentage analysis Samuel GRANJEAUD (CRCM, Marseille) Nantes 2018-03-06

CRCM-INSERM-U1068 Marseille

• Facility "Integrative Bio-informatics"

– Ghislain Bidaut

• Team "Immunology and Cancer"

– Françoise Gondois-Rey

– Anne-Sophie Chrétien, Cyril Fauriat

– Daniel Olive, Jacques Nunès

• Centre d’Immunophénomique (Ciphe)

– Hervé Luche

– Quentin Barbier

– Camille Santa-Maria, Emilie Grégory

IMPACT Tools

http://impact.marseille.inserm.fr/

http://impact.marseille.inserm.fr/contexte/pourcentages/

INTERPRET POPULATIONS USING STATISTICS

Features analysis

Data analysis

Unsupervised methods: natural grouping, dimension reduction, data exploration

• Hierarchical clustering

• K-means (dynamic clustering)

• Principal component analysis

• Kohonen maps (SOM)

• Parallel coordinates

• Heatmaps

Supervised methods: guided grouping, answering a question, targeted search

• Statistical tests

• Correlation to a profile

• Discriminant score

• Neural networks, SVM

• Random Forest

• LDA, PLS, Lasso…

MeV capabilities

MeV history

• Transcriptomics

– years ~2000 to 2009

• Data are matrices of RNA expression level

• Columns are samples

• Rows are genes

• Numerical matrix is displayed as color heatmap

• Numeric to color mapping is global

Computational analysis of microarray data. Quackenbush J

TM4: a free, open-source system for microarray data management and analysis. Saeed AI, …, Quackenbush J

Microarray data normalization and transformation. Quackenbush J

TDMS file format

• Numerical measurements at the center

• Annotations at margins

• Export from Excel

– tab-delimited text format

– '.' as decimal separator

• Tips

– Set GB as your regional setting

– Inspect file with Notepad++
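The same kind of file can also be written from a script instead of Excel. The sketch below assumes a simple layout (one annotation column on the left, one header row of sample names, numbers in the center); the population names, sample names and the "Population" header are made up for illustration, and the exact header conventions MeV expects should be checked against its documentation.

```python
import csv
import io

def write_tdms(rows, sample_names, handle):
    """Write a tab-delimited matrix: one annotation column, then one column per sample."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Population"] + list(sample_names))
    for name, values in rows:
        # format(..., "f") always uses '.' as decimal separator,
        # whatever the regional settings of the machine
        writer.writerow([name] + [format(v, ".4f") for v in values])

# Hypothetical percentages for two populations in three samples
rows = [("CD4 T cells", [12.5, 14.0, 9.75]),
        ("NK cells", [3.2, 2.9, 4.1])]
buffer = io.StringIO()
write_tdms(rows, ["S1", "S2", "S3"], buffer)
text = buffer.getvalue()
print(text)
```

Writing to a real file only requires replacing the `io.StringIO()` buffer with `open(path, "w", newline="")`.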

TDMS file format

MeV does comparisons between columns

MeV for repetitive analyses

• MeV does a comparison between columns

• MeV repeats this comparison over all rows

• Templates for coloring groups

– sex to color mapping is always the same

• Templates for building identifiers

– treatment, mutation, sex merging is always the same

• Scale up for many panels, organs…

• Scale up for MFI (asinh transform) and Luminex

mev-pivot

• MeV does comparisons between columns

• Percentages arise from multiple panels, organs, time points…

– multiple dimensions

– different points of view

• mev-pivot reorganizes data

– keeps TDMS format

– avoids copy/paste errors

– allows safe reorganisations of data for MeV and Excel
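To make the idea of reorganizing concrete, here is a minimal sketch of a pivot in plain Python. It is not the mev-pivot implementation; the record fields (`panel`, `population`, `sample`, `percent`) and the `pivot` helper are hypothetical and only illustrate how the same measurements can be laid out from different points of view without copy/paste.

```python
from collections import defaultdict

def pivot(records, row_keys, col_key, value_key="percent"):
    """Reorganize long-format records into a {row: {column: value}} matrix."""
    matrix = defaultdict(dict)
    for rec in records:
        row = tuple(rec[k] for k in row_keys)
        col = rec[col_key]
        if col in matrix[row]:
            # Two values for the same cell would silently overwrite each other
            raise ValueError(f"duplicate value for {row} / {col}")
        matrix[row][col] = rec[value_key]
    return dict(matrix)

# Hypothetical measurements from two panels and two samples
records = [
    {"panel": "T", "population": "CD4", "sample": "S1", "percent": 12.5},
    {"panel": "T", "population": "CD4", "sample": "S2", "percent": 14.0},
    {"panel": "NK", "population": "CD56", "sample": "S1", "percent": 3.2},
    {"panel": "NK", "population": "CD56", "sample": "S2", "percent": 2.9},
]

# Populations as rows, samples as columns (the layout MeV compares)
by_population = pivot(records, ["panel", "population"], "sample")
# The transposed point of view: samples as rows, populations as columns
by_sample = pivot(records, ["sample"], "population")
```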

Go to mev-pivot, or view the Excel presentation

MeV installation

MeV interface

Analysis Pipeline

Adjust

• Log2 transform (or asinh)

• Center each row

• Filter rows

Explore

• Unsupervised analysis: HClustering, PCA

• Identify outliers

• Remove outliers

Identify

• Supervised analysis: statistical methods

• Identify difference between groups

• Remove small differences

Adjust

• Percentages are usually displayed in a log scale

• some MFI also

• Luminex concentrations also

• Centering focuses on differences between groups within populations
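The two adjustments can be sketched in a few lines of Python. This is an illustration, not what MeV runs internally; the `floor` value used to avoid log2(0) and the two example populations are assumptions.

```python
import math
from statistics import mean

def log2_transform(row, floor=0.01):
    """log2 of each value; values below `floor` are raised to it to avoid log(0)."""
    return [math.log2(max(v, floor)) for v in row]

def center(row):
    """Subtract the row mean so values read as deviations from the population average."""
    m = mean(row)
    return [v - m for v in row]

# Hypothetical percentages: an abundant and a rare population, four samples each
percentages = {
    "abundant": [40.0, 20.0, 10.0, 20.0],
    "rare": [0.4, 0.2, 0.1, 0.2],
}
adjusted = {name: center(log2_transform(row)) for name, row in percentages.items()}
for name, row in adjusted.items():
    print(name, [round(v, 3) for v in row])
```

After the log2 transform and centering, both populations show the same profile (+1, 0, -1, 0): the 100-fold difference in abundance is gone and only the differences between samples remain.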

Log2 properties

• log2 is proportional to log10

• 1 qRT-PCR cycle ~ x 2

• ratio => addition: x 2 => + 1

• is symmetric: +100% = x 2 = +1; -50% = / 2 = -1

• stabilizes the dispersion

• log2( a / b ) = log2( a ) - log2( b )
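These properties are easy to check numerically. The values below are arbitrary examples.

```python
import math

baseline = 8.0
up = math.log2((baseline * 2) / baseline)    # doubling, +100 %
down = math.log2((baseline / 2) / baseline)  # halving, -50 %

# A ratio becomes a difference of logs
a, b = 12.0, 3.0
ratio_as_difference = math.log2(a) - math.log2(b)

# log2 and log10 differ only by a constant factor, log2(10)
factor = math.log2(10)

print(up, down, ratio_as_difference, factor)
```

Doubling and halving land at the same distance from zero on the log2 scale, which is what makes up- and down-variations comparable on a heatmap.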

MeV menu

Heatmap with annotation

Tools for exploring

PCA of samples

Look for an outlier sample among biological groups of samples

Comprehensive PCA

[Figure: PCA of samples; legend: GA, GB, Mut, Ctrl]
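For readers who want to see what the PCA of samples computes, here is a rough sketch using only the standard library: rows are centered, the covariance between populations is built, and a power iteration finds the first principal component. It is not MeV's implementation, the data are invented, and the power iteration assumes the starting vector is not orthogonal to the first component.

```python
import math

def first_pc_scores(matrix, iterations=200):
    """Score of each sample (column) on the first principal component.

    `matrix` holds one list per population (row), already log-transformed.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Center each row so every population has mean zero across samples
    centered = []
    for row in matrix:
        m = sum(row) / n_cols
        centered.append([v - m for v in row])
    # Covariance between populations (n_rows x n_rows)
    cov = [[sum(x * y for x, y in zip(ri, rj)) / (n_cols - 1) for rj in centered]
           for ri in centered]
    # Power iteration converges to the leading eigenvector of the covariance
    vec = [1.0 / math.sqrt(n_rows)] * n_rows
    for _ in range(iterations):
        new = [sum(c * v for c, v in zip(cov_row, vec)) for cov_row in cov]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            break
        vec = [x / norm for x in new]
    # Project every sample onto that direction
    return [sum(vec[i] * centered[i][j] for i in range(n_rows))
            for j in range(n_cols)]

# Hypothetical log2 values: 3 populations x 5 samples, the 4th sample is off
matrix = [
    [1.0, 1.1, 0.9, 3.0, 1.0],
    [2.0, 2.1, 1.9, 4.2, 2.0],
    [0.5, 0.4, 0.6, 2.4, 0.5],
]
scores = first_pc_scores(matrix)
outlier = max(range(len(scores)), key=lambda j: abs(scores[j]))
print("candidate outlier: sample index", outlier)
```

A sample far from all the others on the first component is only a candidate outlier; whether to remove it remains a biological decision.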

Tools for identifying

t-test analysis

t-test results

Conclusions

• One analysis of matrices of percentages

• MeV, a graphical tool to explore and query

• Transform percentages and get informative color heatmaps

– Interpret difference between samples

• Highlight most interesting populations

– Exploratory analyses

• Apply statistical tests for each population

– statistical significance vs practical importance

• Cope with the question of multiple tests

Hands on workflow

• Import

• Log2 transform

• Center rows

• Adjust color scale

• HClustering

– Euclidean distance

• Import factor

• Find outlier sample

– using PCA

• Filter rows

• Apply stat. methods

– t-test, two-way ANOVA

– SAM

• Label rows

• Export images

• Save analysis

• Analysis pipeline fits any type of data: percentages, MFI, Luminex…

MORE ABOUT STATISTICS

Statistics

• Important, Unavoidable

• Abusive or abused

» Statistics does not tell us whether we are right. It tells us the chances of being wrong.

• Points of Significance in Nat. Methods

» http://mkweb.bcgsc.ca/pointsofsignificance/

» http://blogs.nature.com/methagora/2013/08/giving_statistics_the_attention_it_deserves.html

What is a statistical test?

• $\text{score} = \dfrac{\text{difference}}{\text{group dispersion} / \sqrt{\text{count}}} = \dfrac{\text{difference}}{\text{dispersion} / \sqrt{N}}$

• score => P-value ~ the chances of obtaining the observed (or more extreme) score if no real effect exists (that is, if the no-difference hypothesis is correct).

• the bigger the absolute score is, the more "significant" the result is.
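The score and its P-value can be illustrated with a small Python sketch. The score is the difference of means divided by its standard error (Welch form). The P-value is obtained here by an exact permutation test, swapped in because the standard library has no t distribution: it counts how often a random relabeling of the samples gives a score at least as extreme as the observed one. The group names and values are invented.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def welch_score(a, b):
    """Difference of means divided by its standard error (Welch t score)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def permutation_p_value(a, b):
    """Fraction of relabelings whose |score| is at least the observed one."""
    observed = abs(welch_score(a, b))
    pooled = a + b
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance so that ties with the observed score are counted
        if abs(welch_score(group_a, group_b)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical log2 values of one population in two groups of four samples
ctrl = [1.0, 1.2, 0.9, 1.1]
mut = [2.0, 2.3, 1.9, 2.2]
score = welch_score(ctrl, mut)
p_value = permutation_p_value(ctrl, mut)
print(round(score, 2), round(p_value, 4))
```

With four samples per group there are only 70 possible relabelings, so the smallest achievable two-sided P-value is 2/70: small groups limit how "significant" a result can ever be.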

Perhaps we are asking the wrong questions.

Agent Brown, Matrix.

What is "significance"?

• Statistical significance is not the same as practical importance.

• P-value does not tell whether the result is of practical importance.

• Statistics does not tell us whether we are right. It tells us the chances of being wrong.

• Any particular threshold for declaring significance is arbitrary.

A Difference, to Be a Difference, Must Make a Difference

Gertrude Stein

FC vs P: Volcano plot

Volcano Plots in Analyzing Differential Expressions with mRNA Microarrays, Wentian Li

[Figure: volcano plot; x axis: log fold change; y axis: -log10(P-value); regions: "P-value not significant", "Difference not important", and "important and significant" beyond the difference threshold on both sides]
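The regions of the volcano plot amount to two thresholds, one on the fold change and one on the P-value. A minimal sketch follows; the threshold values (a two-fold change and P ≤ 0.05) are assumptions chosen for illustration, not recommendations.

```python
import math

def volcano_class(log2_fc, p_value, fc_threshold=1.0, p_threshold=0.05):
    """Name the volcano-plot region a population falls into."""
    important = abs(log2_fc) >= fc_threshold
    significant = p_value <= p_threshold
    if important and significant:
        return "important and significant"
    if significant:
        return "difference not important"
    if important:
        return "P-value not significant"
    return "neither"

def volcano_point(log2_fc, p_value):
    """Coordinates on the plot: x = log fold change, y = -log10(P-value)."""
    return log2_fc, -math.log10(p_value)

# Hypothetical (log2 fold change, P-value) pairs
for fc, p in [(1.5, 0.001), (0.2, 0.001), (2.0, 0.3), (-1.2, 0.01)]:
    print(volcano_point(fc, p), volcano_class(fc, p))
```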

Annotate selection

Cluster propagation

The annotation is automatically added to all displays

Multiple testing

• p-value is the risk of a false positive

• 5% means that among 100 statistical tests where no real difference exists, about 5 will be called positive although they are in fact false positives

• 5% is the risk of being wrong when rejecting the null hypothesis

• 5% for 100 populations => about 5 false positives; a 5% threshold is misleading when computing many tests

• One must control this risk => FDR
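The arithmetic behind this slide, as a short sketch. Both functions assume that no test has a real effect; the second also assumes the tests are independent, which is rarely exactly true for cell populations.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false positives when no test has a real effect."""
    return n_tests * alpha

def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 10, 100):
    print(n, expected_false_positives(n), round(chance_of_any_false_positive(n), 3))
```

With 100 populations tested at 5%, at least one false positive is almost certain.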

Adjusting FDR with SAM

$\text{FDR} = \text{False Discovery Rate} = \dfrac{\text{False Discoveries}}{\text{Discoveries}}$

YOU control FDR using the slider
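SAM estimates the FDR from permutations of the samples. As a simpler illustration of what controlling the FDR means, the sketch below uses the Benjamini-Hochberg adjustment instead, which is a different procedure from SAM; the P-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotonic
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw P-values for five populations
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
q_values = benjamini_hochberg(p_values)
raw_hits = [p for p in p_values if p <= 0.05]
discoveries = [p for p, q in zip(p_values, q_values) if q <= 0.05]
print(len(raw_hits), "raw hits,", len(discoveries), "after FDR control")
```

Four populations pass the raw 5% threshold, but only two remain once the FDR is held at 5%.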

http://mkweb.bcgsc.ca/pointsofsignificance/

Nat. Methods

More about multiple testing

Statistical trade-off

• Importance: effect size

• Significance: P-value (raw P-value, corrected P-value)

• Volcano Plot: report important and significant

• False Discovery Rate: correct multiple tests