Multivariate Analysis of Pathways. Multivariate Approaches to Gene Set Selection.

Multivariate Approaches to Gene Set Selection

Key Multivariate Ideas

• PCA (Principal Components Analysis)

• SVD (Singular Value Decomposition)

• MDS (Multi-dimensional Scaling)

• Hotelling's T²

PCA

Three correlated variables: PC1 lies along the direction of maximal variation (the direction in which the correlated variables move together); PC2 lies at right angles to PC1, along the direction of next-highest variation.
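To make the PC1 idea concrete, here is a minimal sketch in plain Python. It assumes a toy data set of three variables that rise and fall together (the data and all function names are illustrative, not from the slides) and finds the leading direction of the sample covariance by power iteration rather than a full eigendecomposition.

```python
import math

def covariance_matrix(rows):
    """Sample covariance of a list of observations (each a list of p values)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def first_principal_component(cov, iterations=200):
    """Leading eigenvector of a covariance matrix, found by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p  # starting guess; must not be orthogonal to PC1
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical toy data: three variables driven by a shared factor t plus small wiggles.
data = [[t + 0.1 * math.sin(3 * t), t + 0.1 * math.cos(5 * t), t - 0.1 * math.sin(7 * t)]
        for t in [0.5 * k for k in range(20)]]
pc1 = first_principal_component(covariance_matrix(data))
```

For data like these, `pc1` should point close to the (1, 1, 1) diagonal, the direction in which all three variables vary together.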

Multivariate Representation of Pathways

• BAD pathway (figure legend: Normal, IBC, Other BC)

• Clear separation between groups

• Variation differences

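Hotelling's T²

• Compute distance between sample means using a (common) metric of covariation: $T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)^\top S^{-1} (\bar{x}_1 - \bar{x}_2)$

• Where $\bar{x}_1$, $\bar{x}_2$ are the group mean vectors and $S$ is the pooled (common) covariance matrix of the two groups

• Multidimensional analog of the t (actually F) statistic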
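As a worked illustration, the following sketch computes the standard two-sample Hotelling statistic (difference of group means measured against the pooled covariance) and its usual conversion to an F statistic. It is restricted to two variables so that the matrix inverse can be written out by hand; the function names and toy data are assumptions for illustration only.

```python
def mean_vector(rows):
    """Column means of a list of observations."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def scatter_matrix(rows, means):
    """Sums of squares and cross-products about the group mean."""
    p = len(means)
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)
             for j in range(p)] for i in range(p)]

def hotelling_t2_2d(group1, group2):
    """Two-sample Hotelling T^2 for exactly two variables, plus the F conversion."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean_vector(group1), mean_vector(group2)
    a1, a2 = scatter_matrix(group1, m1), scatter_matrix(group2, m2)
    # Pooled covariance: assumes both groups share one covariance matrix.
    s = [[(a1[i][j] + a2[i][j]) / (n1 + n2 - 2) for j in range(2)] for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    quad = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    t2 = (n1 * n2) / (n1 + n2) * quad
    p = 2
    # Under the usual normal-theory assumptions this is F(p, n1 + n2 - p - 1).
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f_stat

# Hypothetical toy groups: two unit squares, shifted by 2 in each coordinate.
normal = [[0, 0], [1, 0], [0, 1], [1, 1]]
tumor = [[2, 2], [3, 2], [2, 3], [3, 3]]
t2, f_stat = hotelling_t2_2d(normal, tumor)
```

With real gene sets p is usually larger than 2, and a general matrix inverse (or a regularized one) would be needed in place of the hand-written 2 x 2 inverse.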

Principles of the Kong et al. Method

• Normal covariation generally acts to preserve homeostasis

• The transcription of genes that participate in many processes will be changed

• The joint changes in genes will be most distinctive for those genes active in pathways that are working differently

Critiques of Hotelling's T²

• Small samples: unreliable estimates
– N < p (fewer samples than genes in the set)

• Estimates of $\bar{x}$ and $S$ are not robust to outliers

• Assumes same covariance in each sample
– $\Sigma_1 = \Sigma_2$? Usually not in disease
– Kong et al. propose an analog of the Welch t-test
– Permutation of samples for significance

Making it Stable

1. Insufficient information to capture all relationships – too much correlation!

– Power of Hotelling’s method comes from identifying directions of rare variation

– Many (spurious) directions of 0 variation

2. Random variation in data leads to random variation in PCA

• Regularization strategy: shrink the covariance estimate toward that of IID variables (equal variances, zero covariances)
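One common way to push a covariance estimate toward an IID structure is linear shrinkage toward a scaled identity matrix. The sketch below assumes that form; the slides do not say which regularizer was actually used, and the weight `lam` is a hypothetical tuning parameter.

```python
def shrink_covariance(cov, lam):
    """Blend a covariance matrix with an IID target (average variance on the diagonal).

    lam = 0 returns cov unchanged; lam = 1 returns the IID target.
    """
    p = len(cov)
    avg_var = sum(cov[i][i] for i in range(p)) / p
    return [[(1 - lam) * cov[i][j] + (lam * avg_var if i == j else 0.0)
             for j in range(p)] for i in range(p)]

# Hypothetical singular estimate: the two variables are perfectly correlated,
# so one direction has exactly zero variation and the matrix cannot be inverted.
singular = [[4.0, 2.0], [2.0, 1.0]]
stable = shrink_covariance(singular, 0.5)
det_before = singular[0][0] * singular[1][1] - singular[0][1] * singular[1][0]
det_after = stable[0][0] * stable[1][1] - stable[0][1] * stable[1][0]
```

Here `det_before` is zero while `det_after` is positive, so the shrunken matrix can be inverted; the spurious zero-variation direction no longer dominates the statistic.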

Making it Robust

• Microarray data has many outliers

• Multivariate methods are very much distorted by outliers

• Robust estimates of covariance could give robust PCA

• Simple approach: trim outliers
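A minimal sketch of the trimming idea, applied to a single gene's expression values: drop a fixed fraction of the smallest and largest observations before averaging. The trimming fraction and the example values are assumptions; the same trimmed observations could also feed a covariance estimate.

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping trim_fraction of observations from each tail."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical expression values with one gross outlier.
expression = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
plain = sum(expression) / len(expression)
robust = trimmed_mean(expression, 0.1)
```

The single outlier pulls the plain mean far from the bulk of the data, while the trimmed mean stays near the center.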

Handling Changes of Covariance

• Power of Hotelling’s method comes from identifying directions of rare variation

• If one group shows little covariation in one direction but the other does – how to test for changes?

• If one group is control then its rare covariance changes should be taken as standard
– Robust measure of means in both groups

Detecting changes of covariance

Meaning of Covariance Change

• Meaning of covariance across individuals
– Homeostasis in face of individual variation
– e.g. BAD pathway: largest loadings of PC1 on PRKARB & ADCY1
– PRKARB represses CREB1; ADCY activates CREB1

• Gene sets whose covariance diminishes may
– be responding to different inputs
– have escaped their usual regulatory control

• Characteristic of cancers

Testing Covariance Changes

• Idea: directions of small variation in one should match directions of small variation in other

• Mathematical approach
– Find solutions $\lambda_i$, $i = 1, \ldots, p$, of $\det(S_1 - \lambda S_2) = 0$, i.e. the eigenvalues of $S_2^{-1} S_1$
– Solutions should all be near 1, if no change
– Test statistic: easily computed

• Computational approach
– Ratio of largest to smallest: $\lambda_{\max} / \lambda_{\min}$
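The eigenvalue comparison can be sketched for two variables, where the eigenvalues of inv(S2) times S1 follow from the trace and determinant. This is an illustrative sketch under the assumption that both covariance matrices are symmetric and positive definite; the function names and example matrices are hypothetical.

```python
import math

def relative_eigenvalues_2d(s1, s2):
    """Eigenvalues of inv(S2) @ S1 for 2x2 symmetric positive-definite inputs."""
    det2 = s2[0][0] * s2[1][1] - s2[0][1] * s2[1][0]
    inv2 = [[s2[1][1] / det2, -s2[0][1] / det2],
            [-s2[1][0] / det2, s2[0][0] / det2]]
    m = [[sum(inv2[i][k] * s1[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    # Roots of lambda^2 - trace * lambda + det = 0; clamp tiny negative round-off.
    disc = math.sqrt(max(trace * trace - 4 * det, 0.0))
    return (trace + disc) / 2, (trace - disc) / 2

def eigenvalue_ratio(s1, s2):
    """Largest over smallest relative eigenvalue; near 1 when covariances match."""
    largest, smallest = relative_eigenvalues_2d(s1, s2)
    return largest / smallest

# Hypothetical covariances: identical in the first case, stretched in the second.
same = eigenvalue_ratio([[2.0, 1.0], [1.0, 2.0]], [[2.0, 1.0], [1.0, 2.0]])
changed = eigenvalue_ratio([[4.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

A ratio close to 1 indicates no change in covariance structure; significance would still need a reference distribution, for example from permuting sample labels.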

Network Connectivity Methods

Network Topology

• Connections represent interactions:
– Regulatory (one-way)
– Protein interaction (two-way)

• Hubs are genes with many connections

• Bottlenecks are single genes that connect two parts of a functional network
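As a small illustration of the hub idea, the sketch below counts connections per gene from an edge list and reports the genes at or above a chosen degree. The gene labels, the edge list, and the degree cutoff are all hypothetical; direction of regulatory edges is ignored here.

```python
from collections import Counter

def degree_counts(edges):
    """Number of connections per gene, treating every edge as undirected."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree

def hubs(edges, min_degree):
    """Genes with at least min_degree connections, in sorted order."""
    return sorted(g for g, d in degree_counts(edges).items() if d >= min_degree)

# Hypothetical network: gene A touches three partners, D links E to the rest.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]
hub_genes = hubs(edges, min_degree=3)
```

In this toy network `A` is the only hub, while `D` plays the bottleneck role because removing it disconnects `E`.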

Devising Tests Based on Topology

• Issues: how to weight more heavily the genes that are hubs

• How to assess directionality of change

• How to measure co-operativity (activation or repression changes in appropriate ways)

Draghici et al. Approach

• Overall measure

• Effective contribution (perturbation factor)

Analysis of Outliers

Outliers: Clues to Disease Process?

• Outliers usually reflect idiosyncratic events

• Recurrent outliers reflect rare events that are selected

• If a particular pathway is disrupted in disease, but by many different mechanisms, then the expression profiles should
– Lose healthy covariance
– Show recurrent outliers

• How to test for 'consistent' outliers?

• COPA: a method for flagging recurrent outliers in expression data
– Finds consistent fusion gene

A Test Statistic for Consistent Outliers

• Ratio of quantile differences to normal variation: $(q_{0.90} - q_{0.10})_{\text{tumor}} / \max\big((q_{0.90} - q_{0.10})_{\text{normal}},\ 0.4\big)$

• Compare to null distribution by permutation

• Many genes show much higher ratios
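The quantile-ratio score and its permutation null can be sketched for a single gene as follows. The quantile interpolation rule, the number of permutations, and the example values are assumptions; only the 0.90/0.10 quantiles and the 0.4 floor come from the slide.

```python
import random

def quantile(values, q):
    """Linearly interpolated sample quantile, 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

def quantile_ratio(tumor, normal, floor=0.4):
    """Tumor 10-90% spread over the normal spread, with the denominator floored."""
    spread_tumor = quantile(tumor, 0.9) - quantile(tumor, 0.1)
    spread_normal = quantile(normal, 0.9) - quantile(normal, 0.1)
    return spread_tumor / max(spread_normal, floor)

def permutation_null(tumor, normal, n_perm=1000, seed=0):
    """Scores obtained after randomly reassigning the tumor/normal labels."""
    rng = random.Random(seed)
    pooled = list(tumor) + list(normal)
    scores = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        scores.append(quantile_ratio(pooled[:len(tumor)], pooled[len(tumor):]))
    return scores

# Hypothetical expression values for one gene.
tumor = [float(k) for k in range(11)]    # wide spread
normal = [0.5 * k for k in range(11)]    # half the spread
score = quantile_ratio(tumor, normal)
null_scores = permutation_null(tumor, normal, n_perm=200)
```

The floor keeps genes with almost no variation among normals from producing enormous ratios on noise alone.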

Statistical Significance

• Find confidence limits on the number of false positives by permutation

• Several hundred genes appear significant at 10-20% FDR
– Actual scores: 267 scores are greater than 5, where 90% of permutations have fewer than 34 scores over 5 (34/267 ≈ 13%)

A Test for Functional Groups

• For each group G of genes

• sG <- sum(scores[G])/sqrt(length(G))

• Scores: t-scores or range ratios

• PAGE (BMC Bioinformatics, 2005)
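The group score above is written as an R expression; an equivalent Python sketch follows. The gene names and scores are hypothetical. Dividing by the square root of the group size keeps the score on a comparable scale across groups of different sizes: if the per-gene scores were independent with unit variance, the group score would also have unit variance.

```python
import math

def group_score(scores, group):
    """Sum of per-gene scores in the group, scaled by sqrt(group size)."""
    return sum(scores[g] for g in group) / math.sqrt(len(group))

# Hypothetical per-gene scores (t-scores or range ratios) and one gene group.
scores = {"g1": 2.0, "g2": 1.0, "g3": 3.0, "g4": -0.5, "g5": 0.2}
s_group = group_score(scores, ["g1", "g2", "g3", "g4"])
```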

Do Genes Make Sense?

• Quantile Ratio
– [1] "DNA replication"
– [2] "response to pathogenic fungi"
– [6] "cleavage of lamin"
– [7] "spindle organization and biogenesis"
– [15] "response to osmotic stress"
– [16] "nutrient import"
– [22] "response to mercury ion"

• T-test
– [2] "sodium ion homeostasis"
– [3] "leukocyte adhesive activation"
– [4] "positive regulation of calcium-independent cell-cell adhesion"
– [5] "oxytocin receptor activity"
– [6] "ADP biosynthesis"
– [7] "dADP biosynthesis"
– [10] "regulation of muscle contraction"
– [11] "caveolar membrane"
– [12] "response to cold"
– [16] "stress fiber formation"
– [18] "positive regulation of complement activation"
– [19] "astrocyte activation"
– [22] "regulation of long-term neuronal synaptic plasticity"
– [24] "positive regulation of endocytosis"
– [25] "embryonic hemopoiesis"

Cancer Functional Groups

• Do very probable cancer genes show high discrepancy in a few samples?

• Program: identify genes that might contribute to cancer processes: growth signaling, loss of cell-matrix adhesion, apoptosis

1. Do most samples from these categories show at least one gross mis-regulation?

2. Are they the same genes in most samples?

Example: Cell Growth

• Select genes in GO:0001558 'regulation of cell growth'

• Expect most samples to have at least one seriously mis-regulated gene from this category.

• Compute maximum aberration score across category

Aberrations

• Aberration score indicated by color: vanilla: 0; red: 4

• Nine normals at left

• No gene misregulated in even 50% of samples

• BUT: Only a few genes commonly misregulated

Simplest Summary

• Maximum aberration score for samples

Testing the Pathway for Outliers

• Many genes show aberrations in tumor group

• Null distribution: medians of maxima from randomly selected gene groups of size 37

• P < .01

NB. The results for cell-matrix interaction are very similar; angiogenesis not so strong
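The pathway outlier test can be sketched as follows: take the maximum aberration score per sample across the genes in the category, summarize by the median over samples, and compare with the same summary for randomly drawn gene groups of equal size (37 in the slides). The score table layout, function names, number of random groups, and toy data are assumptions for illustration.

```python
import random
import statistics

def median_of_maxima(score_table, genes):
    """Median over samples of the largest score among the given genes.

    score_table maps gene name -> list of per-sample aberration scores.
    """
    n_samples = len(next(iter(score_table.values())))
    maxima = [max(score_table[g][s] for g in genes) for s in range(n_samples)]
    return statistics.median(maxima)

def random_group_p_value(score_table, genes, n_random=1000, seed=0):
    """Share of random same-size gene groups scoring at least as high as observed."""
    rng = random.Random(seed)
    observed = median_of_maxima(score_table, genes)
    all_genes = sorted(score_table)
    hits = 0
    for _ in range(n_random):
        random_group = rng.sample(all_genes, len(genes))
        if median_of_maxima(score_table, random_group) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_random + 1)

# Hypothetical table: 50 quiet background genes plus three category genes,
# each aberrant (score 4) in different samples.
table = {"bg%d" % i: [0, 0, 0, 0, 0] for i in range(50)}
table.update({"c0": [4, 4, 0, 0, 0], "c1": [0, 0, 4, 4, 0], "c2": [0, 0, 0, 0, 4]})
observed, p_value = random_group_p_value(table, ["c0", "c1", "c2"])
```

In this toy case no single gene is aberrant in most samples, yet every sample has at least one aberrant gene in the category, which is the pattern the slides look for.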