Lecture 7 (D. Geman)
Transcript of Lecture 7 (D. Geman)
7/29/2019 Lecture 7 (D. Geman)
http://slidepdf.com/reader/full/lecture-7-d-geman 1/67
STATISTICAL LEARNING IN CANCER BIOLOGY: LECTURE 7
Donald Geman, Michael Ochs, Laurent Younes
Johns Hopkins University
ENS-Cachan
February 27, 2013
LECTURE SERIES
Lecture 1: Introduction (DG)
Lecture 2: Cancer Biology (MO)
Lecture 3: Cell Signaling Inference (MO)
Lecture 4: Genetic Variation (DG)
Lecture 5: Massive Testing (LY)
Lecture 6: Biomarker Discovery (LY)
Lecture 7: Phenotype Prediction (DG)
Lecture 8: Embedding Mechanism (DG)
OUTLINE
Biology and Statistical Learning
Predicting from Comparisons
Pathway De-regulation
Breast Cancer Prognosis
Metastatic Cancer
RECAP
Statistical methods for analyzing cancer data permeate the
literature.
Prominent examples examined in previous lectures include:
Modeling the accumulation of driver mutations during tumorigenesis;
Identifying perturbed signaling in tumor cells;
Discovering risk-bearing DNA sequence variation; and
Finding differentially expressed genes and gene products.
The final two lectures are about learning classifiers that can distinguish between cellular phenotypes from mRNA transcript levels collected from cells in assayed tissue.
BIOLOGICAL RATIONALE
In cancer, malignant phenotypes arise from the net effect of
interactions among multiple genes and other molecular
agents within biological networks.
The resulting perturbations in signaling pathways can be
detected and quantified with mRNA concentrations.
Statistical learning can serve as a basis for:
Detecting disease (e.g., “tumor” vs “normal”);
Discriminating among cancer sub-types (e.g., “GIST” vs “LMS” or “BRCA1 mutation” vs “no BRCA1 mutation”);
Predicting outcomes (e.g., “poor prognosis” vs “good prognosis”).
STATISTICAL LEARNING (I)
X : High-throughput genomic data.
The traditional approach – experimental and molecule-by-molecule – is not feasible at this scale.
A principled approach is required to extract knowledge from X.
Statistical learning has emerged as a core methodology for
the analysis of X.
STATISTICAL LEARNING (II)
Training set: L = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}.
x^(i) ∈ R^d: mRNA expression profile for sample i;
y^(i) ∈ {1, 2, ..., K}: cellular phenotype of sample i.
Standard Goals: Learn a predictor f : R^d → {1, ..., K} or a class-conditional model p(x|k) from L.
Less Standard: Develop statistical metrics and models for
regulation and mechanism.
BARRIERS (I)
Applications to biomedicine, specifically the implications for
clinical practice, are widely acknowledged to remain limited.
One major barrier is the study-to-study diversity in reported
prediction accuracies and “signatures” (lists of
discriminating genes).
Some of this variation can be attributed to the over-fitting
that results from the infamous “small n, large d” dilemma.
Typically, the number of samples (chips, profiles, patients) per class is n = 10–1000, whereas the number of features (exons, transcripts, genes) is d = 1000–50,000.
SOME PUBLIC MICROARRAY DATASETS
Study          Class 0 (size)       Class 1 (size)          Probes d   Reference
D1  Colon      Normal (22)          Tumor (40)              2000       [?]
D2  BRCA1      non-BRCA1 (93)       BRCA1 (25)              1658       [?]
D3  CNS        Classic (25)         Desmoplastic (9)        7129       [?]
D4  DLBCL      DLBCL (58)           FL (19)                 7129       [?]
D5  Lung       Mesothelioma (150)   ADCS (31)               12533      [?]
D6  Marfan     Normal (41)          Marfan (60)             4123       [?]
D7  Crohn’s    Normal (42)          Crohn’s (59)            22283      [?]
D8  Sarcoma    GIST (37)            LMS (31)                43931      [?]
D9  Squamous   Normal (22)          Head-Neck Cancer (22)   12625      [?]
D10 GCM        Normal (90)          Tumor (190)             16063      [?]
D11 Leukemia 1 ALL (25)             AML (47)                7129       [?]
D12 Leukemia 2 AML1 (24)            AML2 (24)               12564      [?]
D13 Leukemia 3 ALL (710)            AML (501)               19896      [?]
D14 Leukemia 4 Normal (138)         AML (403)               19896      [?]
D15 Prostate 1 Normal (50)          Tumor (52)              12600      [?]
D16 Prostate 2 Normal (38)          Tumor (50)              12625      [?]
D17 Prostate 3 Normal (9)           Tumor (24)              12626      [?]
D18 Prostate 4 Normal (25)          Primary (65)            12619      [?]
D19 Prostate 5 Primary (25)         Metastatic (65)         12558      [?]
D20 Breast 1   ER-positive (61)     ER-negative (36)        16278      [?]
D21 Breast 2   ER-positive (127)    ER-negative (80)        9760       [?]
BARRIERS (II)
However, complex decision rules are perhaps the central
obstacle to mature applications. The methods applied were
usually designed for other purposes and with little emphasis
on transparency.
Specifically, the rules generated by nearly all standard, off-the-shelf techniques applied to genomics data, such as boosting, neural networks, multiple decision trees, support vector machines, and linear discriminant analysis, usually involve nonlinear functions of hundreds or thousands of genes, and a great many parameters.
BARRIERS (IV)
Consequently, standard decision rules are too complex to characterize biologically.
Moreover, what is notably missing is a solid link with potential mechanism, which seems to be a necessary condition for “translational medicine”, i.e., drug development and clinical decision-making.
ACCURACY AND CONTEXT
Needless to say, accuracy is also necessary.
But the accuracy of many of the methods mentioned above
is already high enough to be of potential clinical value for
many important phenotype distinctions.
Also, it is now common to follow methodological development with a “biological story” about the genes appearing in the support (“signature”) of the classifier, e.g., an “enrichment analysis.”
However, this does not substitute for providing a potential mechanistic characterization of the decision rules in terms
of biochemical interactions or specific regulatory motifs.
PROPOSED FRAMEWORK
Translational objectives, and small-sample issues, argue for
limiting the number of parameters and introducing strong
biases.
The two principal objectives for the family of classifiers described below are:
Use elementary and parameter-free building blocks to assemble a classifier which is determined by its support.
Demonstrate that these can be as discriminating as those that emerge from the most powerful methods in statistical learning.
EXPRESSION ORDERING
The building blocks we choose are two-gene comparisons,
regarded as “biological switches” related to regulatory
“motifs” or other properties of transcriptional networks.
The decision rules are then determined by expression orderings.
However, explicitly connecting statistical classification and molecular mechanism for cancer is a major, largely open,
challenge.
A more modest goal is to propose a potential statistical
framework.
OUTLINE
Biology and Statistical Learning
Predicting from Comparisons
Pathway De-regulation
Breast Cancer Prognosis
Metastatic Cancer
STRATEGY
Use (within sample) ranks to enhance robustness.
Adapt models to sample size.
Introduce bias to control variance.
Bias towards potential mechanism.
Hypothesis-driven learning?
NOTATION (I)
G: list of d genes.
X = (X_1, ..., X_d): expression profile.
Y ∈ {1, 2, ..., K}: classes or phenotypes.
Data: d × n matrix of mRNA counts.
May restrict G to a network m with d_m genes.
NOTATION (II)
Order the expression values: x_{π_1} ≤ · · · ≤ x_{π_d}.
Let r_i be the rank of gene i in the ordering.
Then r = (r_1, ..., r_d) ∈ Ω_d, the set of permutations of {1, ..., d}, and r = π^{−1}.
Thus, x_i < x_j for two genes i, j if and only if r_i < r_j.
Replace x ∈ R^d by r ∈ Ω_d.
Define binary variables z_ij = δ(r_i < r_j).
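As a concrete illustration (a Python sketch, not from the lecture), the rank vector r = π^{−1} and the comparison string z can be computed directly from a hypothetical profile:

```python
# Minimal sketch: within-sample ranks r = pi^{-1} and the binary
# comparison variables z_ij = 1{r_i < r_j} for a hypothetical profile.
def ranks(x):
    order = sorted(range(len(x)), key=lambda i: x[i])  # order statistics pi
    r = [0] * len(x)
    for rank, gene in enumerate(order):
        r[gene] = rank                                 # r = pi^{-1} (0-based)
    return r

def comparison_string(x):
    r = ranks(x)
    d = len(x)
    # z[(i, j)] = 1 iff gene i is expressed below gene j
    return {(i, j): int(r[i] < r[j]) for i in range(d) for j in range(i + 1, d)}

x = [2.1, 0.4, 3.7]          # hypothetical expression profile
print(ranks(x))              # [1, 0, 2]
print(comparison_string(x))  # {(0, 1): 0, (0, 2): 1, (1, 2): 1}
```

Note that only the ordering of the values matters, which is what makes the representation robust to monotone normalization.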
NOTATION (III)
Since gene expression is inherently stochastic, consider x, r, z as realizations of r.v.s X, R, Z.
Clearly, R determines Z = (Z_ij) and vice-versa.
Z : Ω_d → {0, 1}^(d choose 2), with d! legitimate comparison strings.
Write p(r|k) = P(R = r|Y = k), r ∈ Ω_d, and p(z|k) = P(Z = z|Y = k).
EVEN ONE Z_ij CAN BE DISCRIMINATING
TSP: Differentiate between two phenotypes by finding a pair of genes whose ordering typically reverses (Stat. Appl. in Genetics and Molecular Biology, 3, 2004).
For each pair of genes i, j, define a score |∆_ij|, where
∆_ij = P(Z_ij = 1|Y = 1) − P(Z_ij = 1|Y = 0),
estimated from L.
Unique TSP: Ŷ = Z_{i*j*} (∆ > 0) or Ŷ = 1 − Z_{i*j*} (∆ < 0).
Maximizing the score minimizes the sum of the two class-conditional error rates:
1 − ∆_ij = P_L(Ŷ = 1|Y = 0) + P_L(Ŷ = 0|Y = 1).
For multiple TSPs, vote.
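The score computation above can be sketched in a few lines of Python (not the authors' code); the data layout — profiles as rows, labels in {0, 1} — is an assumption:

```python
# Sketch: empirical TSP scores Delta_ij and the top-scoring pair.
# Assumes X is a list of expression profiles (rows) and y holds labels in {0, 1}.
def tsp_scores(X, y):
    d = len(X[0])
    scores = {}
    for i in range(d):
        for j in range(i + 1, d):
            p = [0.0, 0.0]   # counts of X_i < X_j per class
            n = [0, 0]       # class sizes
            for x, label in zip(X, y):
                n[label] += 1
                p[label] += x[i] < x[j]
            # Delta_ij = P(Z_ij = 1 | Y = 1) - P(Z_ij = 1 | Y = 0)
            scores[(i, j)] = p[1] / n[1] - p[0] / n[0]
    return scores

def top_scoring_pair(X, y):
    scores = tsp_scores(X, y)
    return max(scores, key=lambda pair: abs(scores[pair]))
```

On a toy set where the ordering of genes 0 and 1 reverses perfectly between classes, the pair (0, 1) attains |∆| = 1 and is selected.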
A “NO FREE LUNCH” ERROR BOUND
L = (x^(1), y^(1)), ..., (x^(n_1+n_2), y^(n_1+n_2)): training set.
T_1 = {(i_1, j_1), ..., (i_M, j_M)}: TSPs for L.
E_m = {1 ≤ s ≤ n_1 + n_2 : TSP (i_m, j_m) errs on s}.
E = ∪_m E_m: samples incorrectly classified by at least one TSP.
e_cv: LOOCV error rate.
e_app(f): apparent error rate of the TSP classifier.
THEOREM: Any sample s ∈ E is erroneously classified during LOOCV. In particular,
e_app(f) ≤ |E| / (n_1 + n_2) ≤ e_cv.
K TOP SCORING PAIRS
Base prediction on the k highest scoring pairs:
Θ*_k = {(i_1, j_1), . . . , (i_k, j_k)}.
More generally, the natural discriminant is
g_k(X; Θ_k) = Σ_{(i,j)∈Θ_k} δ(X_i < X_j).
The k-TSP classifier is majority voting:
f(X) = δ(g_k(X; Θ_k) > k/2).
Varying the threshold allows for trading off sensitivity and specificity.
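The voting rule is simple enough to state in two lines; a sketch (assuming each pair is oriented so that X_i < X_j favours class 1):

```python
# Sketch of the k-TSP rule: g_k counts the pairs in Theta_k with X_i < X_j,
# and the classifier votes class 1 when g_k exceeds k/2. Each pair is assumed
# oriented so that X_i < X_j favours class 1.
def ktsp_predict(x, theta_k):
    votes = sum(x[i] < x[j] for (i, j) in theta_k)  # g_k(x; Theta_k)
    return int(votes > len(theta_k) / 2)            # majority vote
```

Replacing the threshold len(theta_k) / 2 by a tunable value gives the sensitivity/specificity trade-off mentioned above.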
CHOOSING k
Only crude measures of the separation between P(g_k(X)|Y = 0) and P(g_k(X)|Y = 1) can resist over-fitting.
In particular, resubstitution error is less effective than a simple mean-variance criterion:
T_k := |E(g_k(X)|Y = 0) − E(g_k(X)|Y = 1)| / [var(g_k(X)|Y = 0) + var(g_k(X)|Y = 1)]^{1/2}.
Given any Θ_k = {(i_1, j_1), . . . , (i_k, j_k)}, choose k to maximize T_k. The numerator is just Σ_{(i,j)∈Θ_k} ∆_ij, evidently maximized at Θ*_k. Since the denominator varies more slowly, our choice of k and the gene pairs is roughly equivalent to maximizing T_k.
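The mean-variance criterion can be computed directly from the class-conditional vote counts; a minimal sketch with hypothetical inputs:

```python
# Sketch of the mean-variance criterion T_k, computed from the empirical
# class-conditional votes g_k on the training set (inputs hypothetical).
def t_statistic(votes0, votes1):
    # votes0 / votes1: g_k values for training samples with Y = 0 / Y = 1
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((g - m) ** 2 for g in v) / len(v)
    num = abs(mean(votes1) - mean(votes0))
    return num / (var(votes0) + var(votes1)) ** 0.5
```

One would evaluate T_k for each candidate k and keep the maximizer, rather than minimizing resubstitution error.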
FURTHER HOMEGROWN DEVELOPMENTS
Comparisons with discriminative methods (SVM, PAM, k-NN, RF, naive Bayes) on “standard” cancer datasets:
“Simple decision rules for classifying human cancers from gene expression profiles,” Bioinformatics, 21, 3896-3904, 2005.
Specialized to prostate cancer: “Robust prostate cancer marker genes discovered from direct integration of inter-study microarray data,” Bioinformatics, 21, 3905-3911, 2005.
EXTERNAL VALIDATION
Highly accurate two-gene classifier for differentiating gastrointestinal stromal tumors and leiomyosarcomas. Nathan D. Price, Jonathan Trent, Adel K. El-Naggar, David Cogdell, Ellen Taylor, Kelly K. Hunt, Raphael E. Pollock, Leroy Hood, Ilya Shmulevich, and Wei Zhang. Institute for Systems Biology, Seattle; University of Texas M. D. Anderson Cancer Center, Houston.
A 2-gene classifier for predicting response to the farnesyltransferase inhibitor tipifarnib in acute myeloid leukemia. Mitch Raponi, Jeffrey E. Lancet, Hongtao Fan, Lesley Dossey, Grace Lee, Ivana Gojo, Eric J. Feldman, Jason Gotlib, Lawrence E. Morris, Peter L. Greenberg, John J. Wright, Jean-Luc Harousseau, Bob Lowenberg, Richard M. Stone, Peter De Porre, Yixin Wang, and Judith E. Karp.
EXTERNAL VALIDATION (CONT)
Usefulness of the top-scoring pairs of genes for prediction of prostate cancer progression. H Zhao, CJ Logothetis and IP Gorlov. Department of Genitourinary Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA. Prostate Cancer and Prostatic Diseases (2010), 1-8. www.nature.com/pcan
An interferon-related gene signature for DNA damage resistance is a predictive marker for chemotherapy and radiation for breast cancer. Ralph R. Weichselbaum, Hemant Ishwaran, Taewon Yoon, Dimitry S. A. Nuyten, Samuel W. Baker, Nikolai Khodarev, Andy W. Su, Arif Y. Shaikh, Paul Roach, Bas Kreike, Bernard Roizman, Jonas Bergh, Yudi Pawitan, Marc J. van de Vijver, and Andy J. Minn.
TOP-SCORING MEDIANS (TSM)
G_1, G_2: two disjoint sets of genes of size m, the “context”.
ν_{G_1}, ν_{G_2}: the median expression in G_1, G_2.
Classification rule: f(X) = δ(ν_{G_1} < ν_{G_2}).
Choose the “context” by maximizing the (apparent) accuracy P(f(X) = Y).
Let s(G_1, G_2) = |P(ν_{G_1} < ν_{G_2}|Y = 0) − P(ν_{G_1} < ν_{G_2}|Y = 1)|. Then choose the context to maximize s(G_1, G_2).
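The decision rule itself is a single median comparison; a minimal Python sketch:

```python
# Sketch of the TSM rule: compare median expression over two disjoint
# gene sets G1 and G2, predicting class 1 when med(G1) < med(G2).
import statistics

def tsm_predict(x, G1, G2):
    nu1 = statistics.median(x[i] for i in G1)
    nu2 = statistics.median(x[j] for j in G2)
    return int(nu1 < nu2)
```

As with TSP, the rule is parameter-free once the context (G1, G2) is fixed, so the classifier is fully determined by its support.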
FINDING THE CONTEXT (I)
Exact optimization (for m > 1) is computationally impossible and would lead to massive overfitting anyway.
Let ν_{G_1} = R_{π_1}, ν_{G_2} = R_{π_2} (ranks are computed in G_1 ∪ G_2).
Suppose:
(i) {X_i < X_j} ⊥ {π_1 = i, π_2 = j} | Y for each i ∈ G_1, j ∈ G_2;
(ii) (π_1, π_2) is uniformly distributed given Y.
Then
P(ν_{G_1} < ν_{G_2}|Y) = (1/m²) Σ_{i∈G_1, j∈G_2} P(X_i < X_j|Y).
FINDING THE CONTEXT (II)
Both assumptions are true in practice. Consequently,
s(G_1, G_2) ∝ |Σ_{i∈G_1, j∈G_2} ∆_ij|.
Finally,
(Ĝ_1, Ĝ_2) = arg max_{G_1, G_2} Σ_{i∈G_1, j∈G_2} ∆_ij.
This search is feasible either (i) exactly, but with gene filtering, for m ≈ 5; or (ii) greedily, adding one gene at a time, without gene filtering.
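Option (i), the exact search over a pre-filtered gene list, can be sketched with a double loop over candidate sets; the dict of ordered-pair scores `delta` is a hypothetical input, not part of the lecture:

```python
# Sketch of the exact context search on a pre-filtered gene list:
# maximize the sum of Delta_ij over i in G1, j in G2. The dict `delta`
# of ordered-pair scores Delta_ij is assumed given (hypothetical input).
from itertools import combinations

def best_context(delta, genes, m):
    best, best_score = None, float("-inf")
    for G1 in combinations(genes, m):
        rest = [g for g in genes if g not in G1]
        for G2 in combinations(rest, m):
            score = sum(delta.get((i, j), 0.0) for i in G1 for j in G2)
            if score > best_score:
                best, best_score = (G1, G2), score
    return best
```

The cost grows combinatorially in m, which is why filtering (or the greedy variant) is needed beyond m ≈ 5.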
CLASSIFICATION RESULTS
OUTLINE
Biology and Statistical Learning
Predicting from Comparisons
Pathway De-regulation
Breast Cancer Prognosis
Metastatic Cancer
PERTURBED NETWORKS
Diseased cells arise from aberrant activity in cellular
signaling, and pathways are the fundamental scale of many
cancer processes.
These aberrations cannot be identified from phenotypic
information typically measured in the clinic.
Moreover, they are the net effect of interactions among
multiple molecular agents.
Generally, network analyses do not account for combinatorial (multi-way) interactions among genes or gene products, and do not quantify de-regulation.
BABY ILLUSTRATION
SWAP DISTANCE
A distance between permutations π and π′ of {1, . . . , d}.
D(π, π′): the minimum number of adjacent swaps needed to transform π into π′.
Example: D((3, 1, 2, 4), (1, 2, 3, 4)) = 2.
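This distance equals the number of pairwise inversions between the two permutations (the Kendall tau distance); a direct sketch, checked against the example above:

```python
# The swap distance between two permutations equals the number of pairwise
# inversions; a direct O(d^2) count:
def swap_distance(pi, sigma):
    pos = {v: k for k, v in enumerate(sigma)}  # position of each item in sigma
    mapped = [pos[v] for v in pi]
    d = len(mapped)
    # count pairs that appear in opposite order in the two permutations
    return sum(mapped[a] > mapped[b] for a in range(d) for b in range(a + 1, d))

print(swap_distance((3, 1, 2, 4), (1, 2, 3, 4)))  # 2
```

An O(d log d) merge-sort count is possible, but the quadratic version suffices for pathway-sized gene sets.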
PATHWAY VARIABLES
Consider a network m with d_m genes.
Let π = (π_1, . . . , π_{d_m}) be the order statistics for x = (x_1, . . . , x_{d_m}): x_{π_1} < x_{π_2} < · · · < x_{π_{d_m}}.
Let D(x, x′) be the swap distance between π(x) and π(x′).
Then D(x, x′) is also the normalized Hamming distance between z(x) and z(x′), the corresponding comparison strings.
ORDER INDEX
Fix a phenotype k and let X and X′ be i.i.d. expression profiles under p(x|k).
Define the Order Index: µ(k,m) = 1 − (d_m choose 2)^{−1} E[D(X, X′)].
Then it is easy to show that
µ(k,m) = 1 − (d_m choose 2)^{−1} Σ_{i,j∈G_m} 2 P(Z_ij = 1|k) P(Z_ij = 0|k).
.5 ≤ µ ≤ 1, but generally µ ≫ .5 since there are many gene pairs expressed on different scales.
µ(k,m) ≪ 1: a highly disorganized system.
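The second expression suggests a plug-in estimator: estimate each pairwise probability empirically within one phenotype, then average. A sketch (not the lecture's code):

```python
# Sketch of a plug-in estimator for the order index: estimate each
# p_ij = P(Z_ij = 1 | k) from samples of one phenotype, then apply
# mu = 1 - (d choose 2)^{-1} * sum_{i<j} 2 p_ij (1 - p_ij).
def order_index(X):
    # X: list of expression profiles, all from the same phenotype k
    n, d = len(X), len(X[0])
    total = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            p = sum(x[i] < x[j] for x in X) / n  # hat p_ij
            total += 2 * p * (1 - p)             # expected pairwise disagreement
    return 1 - total / (d * (d - 1) / 2)
```

Perfectly stable orderings give µ = 1, while a pair whose ordering is a coin flip contributes the maximal disagreement and pulls µ toward .5.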
EXAMPLES
In the Death network, for prostate tissue, µ(normal) = 0.924 and µ(metastatic) = 0.823. The difference is highly significant (p < .001).
Overall, 75 networks have significant differences in µ, which
is usually smaller in metastatic tumors.
DE-REGULATION IN DISEASE
A general trend emerges: when pairs of phenotypes
represent gradations of disease, the order index is usually smaller in the more malignant one when there is a
significant difference.
In the following plots, each point represents a pair
(µ(A,m ), µ(B ,m )) for a network m , where A is more malignant
than B.
GLOBAL PICTURE
DISTANCE-BASED CLASSIFICATION
Fix a context G (set of genes).
Let D_G be the swap distance restricted to G.
Classify by nearest-neighbor in L.
Choose G so that the distance D_G(X, X′) between independent samples is:
Large if X, X′ are from different classes;
Small if from the same class.
This can be done in a similar fashion to kTSP and TSM.
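Once a context G is fixed, the classifier is an ordinary nearest-neighbour rule under the restricted swap distance; a sketch using the equivalence with Hamming distance between comparison strings (function names hypothetical):

```python
# Sketch of nearest-neighbour classification under the swap distance
# restricted to a context G, using the equivalence with the Hamming
# distance between comparison strings.
def restricted_swap_distance(x, x2, G):
    G = sorted(G)
    # number of gene pairs in G whose ordering differs between the profiles
    return sum((x[i] < x[j]) != (x2[i] < x2[j])
               for a, i in enumerate(G) for j in G[a + 1:])

def nn_classify(x, train, G):
    # train: list of (profile, label) pairs; predict the nearest label
    _, label = min(train, key=lambda t: restricted_swap_distance(x, t[0], G))
    return label
```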
OUTLINE
Biology and Statistical Learning
Predicting from Comparisons
Pathway De-regulation
Breast Cancer Prognosis
Metastatic Cancer
BREAST CANCER PROGNOSIS
Objective: separate BC microarray samples into “good” vs “poor” prognosis determined by recurrence within five years.
Mammaprint Signature: list of 70 genes and corresponding (correlation-based) decision rule.
One of three “signatures” approved by the FDA for clinical use.
Learned from a training set L with n = 162 samples (46 recurrent and 116 non-recurrent).
Achieves 89% sensitivity and 41% specificity on the Buyse test set of n = 302 samples (46 recurrent and 256 non-recurrent).
MAXIMUM ENTROPY MODELS ON PERMUTATIONS
Fix ten genes (e.g., the five top-scoring pairs).
Let x be the expression profile and r ∈ Ω_10 the rank vector.
Construct two distributions p(r|good) and p(r|poor) by maximizing entropy subject to fixing all (10 choose 2) = 45 pairwise comparison probabilities.
Use “Iterative Projection” to learn the parameters.
With d = 10, everything can be computed, including normalizing constants and entropies.
MORE FORMALLY
Let q be a prob. dist. on Ω_10, and let p_L be the empirical distribution on L.
For k ∈ {poor, good}:
p(r|k) = arg max_q H(q)
s.t. ∀ i < j : q(r : r_i < r_j) = p_L(r : r_i < r_j | k).
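One standard reading of “Iterative Projection” is iterative proportional fitting: cycle through the pairwise constraints, rescaling the current distribution on each event and its complement. A toy sketch with d = 3 (so Ω_d is enumerable) and hypothetical target probabilities:

```python
# Toy sketch of iterative projection (iterative proportional fitting) for
# the maximum-entropy distribution on rank vectors, matching hypothetical
# pairwise probabilities q(r_i < r_j); d = 3 keeps Omega_d enumerable.
from itertools import permutations

def fit_maxent(targets, d=3, sweeps=500):
    # targets[(i, j)]: desired q(r : r_i < r_j) for each pair i < j
    omega = list(permutations(range(d)))      # all rank vectors r
    q = {r: 1.0 / len(omega) for r in omega}  # start from the uniform law
    for _ in range(sweeps):
        for (i, j), t in targets.items():
            cur = sum(p for r, p in q.items() if r[i] < r[j])
            for r in q:  # rescale the event and its complement
                q[r] *= (t / cur) if r[i] < r[j] else ((1 - t) / (1 - cur))
    return q

targets = {(0, 1): 0.7, (0, 2): 0.6, (1, 2): 0.4}
q = fit_maxent(targets)
```

Starting from the uniform law, this procedure converges to the maximum-entropy distribution satisfying the constraints (when they are jointly feasible); with d = 10 the same idea applies over all 10! rank vectors.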
LIKELIHOOD RATIO TEST
Classify sample x as “poor” if
p(r(x)|poor) / p(r(x)|good) > τ.
For τ = 1, 70% sensitivity and 64% specificity (overall 66%).
Varying τ trades off sensitivity and specificity.
Entropies are H = 14.22 (“good”), H = 17.45 (“poor”), H = 21.79 (uniform).
OUTLINE
Biology and Statistical Learning
Predicting from Comparisons
Pathway De-regulation
Breast Cancer Prognosis
Metastatic Cancer
METASTATIC CANCER
Cancer is an acquired genetic disorder due to the
accumulation over time of DNA alterations that lead to
uncontrolled cell growth and proliferation.
Ninety percent of deaths result from metastasis, meaning
that cancer cells break away and migrate to distant organs.
By lodging in other organs they replace normal cells until
the organ no longer functions.
TUMOR SITE OF ORIGIN
In approximately 4% of cancers, a metastatic tumor is found whose primary origin is unknown (Hillen, 2000).
However, the appropriate treatment depends on the tissue
of origin.
The GEO or Gene Expression Omnibus (Barrett et al.,
2006) contains 16,715 tumor samples from 20 sites of
origin for the most popular platform.
Objective: Build a classifier for distinguishing among the 20 sites of origin and validate it with cross-study error estimation.
GENERIC PROBLEM: BATCH EFFECTS
Systematic variation across samples is highly correlated
with date, lab, etc.
Especially problematic when batch “labels” are confounded
with class label.
Affects not only the patterns of expression of individual
genes, but in fact the entire dependency structure, including
correlations.
BATCH EFFECTS
Samples from the same phenotype but different dates, labs, etc. display systematic differences in the distribution of individual genes and dependency structure.
BATCH EFFECTS: REVERSE CORRELATION
Figure: The fraction of significantly correlated gene pairs for which the sign reverses between pairs of batches.
STUDY EFFECTS
Within class, but across studies, there are differences due to age, location, etc., as well as platform and mRNA storage/extraction methods.
Combined with batch effects, samples from different studies are not even approximately identically distributed. Must take this into account in estimating generalization error.
The combination of confounding, batch, and study effects makes cross-study validation, as opposed to ordinary cross-validation, imperative.
UNBIASED VALIDATION: ACCURACY
Overall accuracy is a poor measure of utility with major
class imbalance in training.
Instead use Mean Class-Conditional Accuracy (MCCA).
Generalizes the average of sensitivity and specificity to multiclass.
Take the average of P(F(X) = y|Y = y) for y = 1, ..., K.
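A minimal sketch of the MCCA computation (not the lecture's code):

```python
# Sketch of mean class-conditional accuracy (MCCA): average the per-class
# accuracies P(F(X) = y | Y = y) instead of the overall accuracy.
def mcca(y_true, y_pred):
    per_class = []
    for c in set(y_true):
        idx = [k for k, y in enumerate(y_true) if y == c]
        per_class.append(sum(y_pred[k] == c for k in idx) / len(idx))
    return sum(per_class) / len(per_class)
```

For instance, with 4 of 5 samples in class 0, a constant class-0 predictor scores 0.8 overall accuracy but only 0.5 MCCA, which is exactly the imbalance MCCA is meant to expose.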
METHODS OF ESTIMATING ACCURACY
Resubstitution: Validate on L, the training data. Strong
optimistic bias.
Holdout: Randomly partition data into training and
validation. Still optimistic because training and validation
are identically distributed.
Cross-validation: Still optimistic for same reason.
Cross-Study Validation: Validate on a different study, done
in a different lab than the training study. Higher bar, but the
gold standard.
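The cross-study scheme amounts to holding out one whole study at a time; a sketch with placeholder `fit` and `predict` callables (hypothetical interface, not the lecture's code):

```python
# Sketch of leave-study-out (cross-study) validation: hold out one whole
# study at a time, train on all remaining studies, score the held-out study.
# `fit` and `predict` are placeholders for any learning algorithm.
def leave_study_out(data, fit, predict):
    # data: dict mapping study id -> list of (profile, label) samples
    results = {}
    for held_out in data:
        train = [s for sid, samples in data.items()
                 if sid != held_out for s in samples]
        model = fit(train)
        test = data[held_out]
        results[held_out] = sum(predict(model, x) == y for x, y in test) / len(test)
    return results
```

Unlike random cross-validation, the held-out samples here are never identically distributed with the training samples, which is the point of the higher bar.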
LEAVE-STUDY-OUT VALIDATION
DECISION TREES OF COMPARISONS
Goal: Generalize kTSP and related algorithms to multiclass
problems.
Build decision trees with comparison questions: “Is gene i more highly expressed than gene j?”
With the site-of-origin data, can build trees with depth up to fifteen queries.
TREE OF COMPARISONS
TSP TREES: RESULTS
One decision tree: 91.4% accuracy, 75.4% MCCA.
Random Forest with 10 trees and 10k gene pairs chosen at random for each tree: 95.8% accuracy, 84.2% MCCA.
Three trees with no common genes: 94.4% accuracy, 79.9% MCCA.
Lack of independence is problematic for ensembles, even if disjoint:

                Tree 1 Wrong   Tree 1 Correct
Tree 2 Wrong             741              868
Tree 2 Correct           690            14416
REDUCING DIVERSITY AND SAMPLE SIZE
Reducing Diversity: Train on largest study for each site.
Test on the rest. Accuracy = 85.8%, MCCA = 74.0%.
Reducing n: Keep only 10 samples per study-site of origin
pair. Notice that n is smaller for every site of origin.
EFFECTS OF REDUCING DIVERSITY AND SAMPLE SIZE
BREAST VS NON-BREAST: CROSS-STUDY VS HOLDOUT
An experiment to compare the performance of cross-study and (randomized) CV.
Breast vs all 19 other sites.
For non-breast samples, half for training and half for testing.
Randomly order the breast tumor studies. Let n_k be the sample size of study k.
Cross-study: Train on studies 1 through k and validate on study k + 1.
Cross-validation: Randomly choose n_{k+1} breast samples from studies 1, ..., k + 1 for testing, train on the rest, repeat.
RESULTS OF CROSS-STUDY VS CROSS-VALIDATION
RANDOMIZING STUDY LABELS (I)
Goal: Quantify how much batch/study effects reduce accuracy and MCCA.
Randomize study labels within each phenotype.
After shuffling study labels: Accuracy = 98.6%, MCCA = 96.1%.
∼8 points of MCCA lost to batch/study effects.
RANDOMIZING STUDY LABELS (II)
CONCLUSIONS