Tree Based Methods for Analyzing Tissue Microarray Data
Steve Horvath
Human Genetics and Biostatistics
University of California, Los Angeles
Acknowledgements
• Horvath Lab
– Yunda Huang
– Xueli Liu Ph.D.
– Zeke Fang Ph.D.
– Tuyen Hoang
• UCLA Tissue Microarray Core
– David Seligson
– Aarno Palotie
• Clinicians
– Hyung Kim
– Arie Belldegrun
Contents
• Statistical issues with tissue microarray (TMA) data
• Random forest (RF) predictors
• RF clustering
• Application of RF clustering to TMA data
• Supervised Learning Methods
Background TMA data
Description of TMA data
• TMAs are a high-throughput tool for validating newly identified biomarkers from genome-wide discovery studies
• Basic technique was summarized in Kononen et al. 1998
Tissue Microarray (TMA) Technology
Kononen et al. Nature Medicine 1998
[Figure: donor block, array block, slide]
• Hundreds of tiny (typically 0.6 mm diameter) cylindrical tissue cores are densely and precisely arrayed into a single histologic paraffin block.
• From this new array block, up to 300 serial 4-8 µm thick sections may be produced.
• The sections serve as targets for fluorescence in situ hybridization (FISH) and for protein expression studies by immunohistochemistry.
Pathologists score each spot by looking through a microscope.
(slide by David Seligson)
Non-normal and highly correlated
Several Spots per Pathology Case, Several “Scores” per Spot
• Maximum intensity = Max (1 – 4)
• Percent of cells staining = Pos (0 – 100)
• Percent of cells staining with the maximum intensity = PosMax (0 – 100)
• Spots have a spot grade: NL,1,2,..
• Indicator of informativeness
• Each case is usually represented by 4 or more spots
– >3 malignant lesions, 1 matched normal
Histogram of tumor marker expression scores: POS and MAX
[Figure: histograms for P53, CA9, and EpCam. One set of panels shows Percent of Cells Staining (POS, 0 to 100); the other shows Maximum Intensity (MAX, 0 to 3).]
P53 and Ki67: Max versus Pos
[Figure: scatter plots of maximum intensity versus percent of cells staining. Variables shown: KiNuclMax, KiPos, P5NuclMax, P5Pos.]
Characteristics of TMA data
• Non-normal, discrete, strongly correlated
• Mixed variable types
• Pooling (combining) spot measurements across every patient
– between 1 and 10 spots of different grade
– current strategy pools tumor spots and forms the median, mean, minimum or maximum
• Message: tumor marker intensity is measured by up to 12 highly correlated staining scores, which leads to multicollinearity
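The pooling strategy described above can be sketched in a few lines. This is only an illustration: the function name `pool_spot_scores`, the `(patient_id, score)` input layout, and the example scores are all assumptions, not part of the original analysis.

```python
from statistics import mean, median

def pool_spot_scores(spots, how="median"):
    """Pool spot-level scores into one value per patient.

    spots: iterable of (patient_id, score) pairs, one pair per tumor spot.
    how:   one of "median", "mean", "min", "max".
    """
    pool = {"median": median, "mean": mean, "min": min, "max": max}[how]
    by_patient = {}
    for patient_id, score in spots:
        by_patient.setdefault(patient_id, []).append(score)
    return {pid: pool(scores) for pid, scores in by_patient.items()}

# Hypothetical Pos scores (percent of cells staining) for two patients.
spots = [("A", 10), ("A", 40), ("A", 70), ("B", 0), ("B", 90)]
pooled_median = pool_spot_scores(spots, "median")
pooled_max = pool_spot_scores(spots, "max")
```

Each pooling choice gives a different summary of the same spots, which is one reason the pooled scores for a marker end up highly correlated.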
Our main tools are random forest predictors
• Unsupervised analysis of TMA data
– RF clustering
• Supervised analysis
– RF based pre-validation method
Background random forest predictors
L. Breiman 1999
Random Forests (RFs)
• RFs are a collection of tree predictors such that each tree depends on the values of an independently sampled random vector
Classification and Regression Trees (CART)
by
– Leo Breiman, UC Berkeley
– Jerry Friedman, Stanford University
– Charles J. Stone, UC Berkeley
– Richard Olshen, Stanford University
An example of CART
• Goal: for patients admitted to the ER, predict who is at higher risk of heart attack
• Training data set:
– # of subjects = 215
– Outcome variable = High/Low Risk determined
– 19 noninvasive clinical and lab variables were used as the predictors
[Figure: CART tree for the heart attack example. Splitting questions: Is BP <= 91?, Is age <= 62.5?, Is ST present? (branches labeled Yes/No). Node class proportions shown: High 17% / Low 83%; High 70% / Low 30%; High 12% / Low 88%; High 2% / Low 98%; High 23% / Low 77%; High 50% / Low 50%; High 11% / Low 89%. Terminal nodes are classified as high risk or low risk.]
CART Construction
BINARY RECURSIVE PARTITIONING
• Binary: split parent node into two child nodes
• Recursive: each child node can be treated as parent node
• Partitioning: data set is partitioned into mutually exclusive subsets in each split
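A single step of binary partitioning might look like the sketch below. It assumes Gini impurity as the splitting criterion (the slides do not name one) and searches thresholds on one numeric variable; the helper names and the toy data are hypothetical. Applying `best_split` again to each child node gives the recursive part.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Find the threshold t for the question 'is value <= t?' that
    minimizes the weighted Gini impurity of the two child nodes."""
    n = len(values)
    best_t, best_impurity = None, float("inf")
    # Skip the largest value: it would leave the right child empty.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_t, best_impurity = t, impurity
    return best_t, best_impurity

# Hypothetical toy data: blood pressure and risk label.
bp = [85, 88, 90, 95, 110, 120]
risk = ["high", "high", "high", "low", "low", "low"]
threshold, impurity = best_split(bp, risk)
```

The two child nodes are mutually exclusive by construction, since every observation satisfies exactly one of `v <= t` and `v > t`.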
RF Construction
…
Prediction by plurality voting
• The forest consists of N trees.
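• Class prediction:
– Each tree votes for a class; the predicted class for an observation x is the plurality class, $\hat{C}(x) = \arg\max_C \sum_{k=1}^{N} [f_k(x,T) = C]$, where $[\cdot]$ equals 1 if tree k votes for class C and 0 otherwise
• Regression random forest:
– predicted value is the average prediction across the trees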
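The voting rule can be illustrated with a small sketch in which each "tree" is just a function from an observation to a prediction. The stand-in trees and the example observation are hypothetical; a real forest would use fitted trees.

```python
from collections import Counter

def forest_classify(trees, x):
    """Plurality vote: return the class predicted by the most trees.
    Ties go to the class that was voted for first."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

def forest_regress(trees, x):
    """Regression forest: average the per-tree predictions."""
    predictions = [tree(x) for tree in trees]
    return sum(predictions) / len(predictions)

# Hypothetical stand-in classification trees (not fitted to any data).
class_trees = [
    lambda x: "high" if x["bp"] <= 91 else "low",
    lambda x: "high" if x["age"] > 62.5 else "low",
    lambda x: "low",
]
predicted = forest_classify(class_trees, {"bp": 85, "age": 70})

# Hypothetical stand-in regression trees with constant predictions.
reg_trees = [lambda x: 1.0, lambda x: 2.0, lambda x: 3.0]
averaged = forest_regress(reg_trees, {"bp": 85, "age": 70})
```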
Clustering with random forest predictors
Intrinsic Proximity Measure
• Terminal tree nodes contain few observations
• If case i and case j both land in the same terminal node, increase the proximity between i and j by 1.
• At the end of the run, divide by 2 × the number of trees.
• Dissimilarity = sqrt(1 − Proximity)
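A minimal sketch of the proximity computation, assuming the terminal node reached by each case in each tree is already known (`leaf_ids[t][i]` is a hypothetical input layout). Normalization conventions vary; this sketch divides by the number of trees so that each case has proximity 1 with itself, which is an assumption and not necessarily the exact scaling used on the slide.

```python
from math import sqrt

def rf_proximity(leaf_ids):
    """leaf_ids[t][i] is the terminal node that case i reaches in tree t.
    Returns an n x n matrix with the fraction of trees in which
    cases i and j land in the same terminal node."""
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for leaves in leaf_ids:
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1.0
    return [[p / n_trees for p in row] for row in prox]

def rf_dissimilarity(prox):
    """Dissimilarity = sqrt(1 - proximity), clipped at zero for safety."""
    return [[sqrt(max(0.0, 1.0 - p)) for p in row] for row in prox]

# Hypothetical terminal node ids for 3 cases in 4 trees.
leaf_ids = [
    [0, 0, 1],
    [2, 2, 2],
    [5, 6, 6],
    [1, 1, 3],
]
prox = rf_proximity(leaf_ids)
diss = rf_dissimilarity(prox)
```

Cases 0 and 1 share a terminal node in 3 of the 4 trees, so their proximity is 0.75 and their dissimilarity is 0.5.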
Casting an unsupervised problem into a supervised RF problem
• Key Idea (Breiman 1999)
– Label observed data as class 1
– Generate synthetic observations and label them as class 2
– Construct a RF predictor to distinguish class 1 from class 2
– Use the resulting dissimilarity measure in unsupervised analysis
How to generate synthetic observations
• Synthetic observations are simulated to contain no clusters
– e.g. randomly sampling from the product of empirical marginal distributions of the input.
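Sampling from the product of empirical marginals amounts to drawing each variable independently from its own observed values, which breaks the dependence between variables. The sketch below assumes the data are a list of rows; the function name, the seed, and the toy data are hypothetical.

```python
import random

def synthetic_from_marginals(data, seed=0):
    """data: list of rows, each a list of feature values.
    Returns the same number of synthetic rows, with each feature sampled
    independently (with replacement) from its observed values."""
    rng = random.Random(seed)
    n = len(data)
    columns = list(zip(*data))
    synth_columns = [[rng.choice(col) for _ in range(n)] for col in columns]
    return [list(row) for row in zip(*synth_columns)]

# Hypothetical observed data with two perfectly dependent features.
observed = [[1, 10], [2, 20], [3, 30], [4, 40]]
synthetic = synthetic_from_marginals(observed)

# Class 1 = observed, class 2 = synthetic, ready for a two-class RF.
features = observed + synthetic
labels = [1] * len(observed) + [2] * len(synthetic)
```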
RF clustering
• Compute distance matrix from RF
– distance matrix = sqrt(1-proximity matrix)
• Compute the first 2 to 3 classical multi-dimensional scaling coordinates based on the distance matrix
• Conduct partitioning around medoids (PAM) clustering analysis
– input parameter = no. of clusters k
– use the Euclidean distance between the resulting scaling points
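The PAM step can be sketched as a simplified k-medoids routine on a precomputed distance matrix. This is a reduced illustration with a greedy initialization and a plain swap loop, not the full published PAM algorithm; for the procedure above, `dist` would hold the Euclidean distances between the scaling points. The function names and toy coordinates are hypothetical.

```python
def total_cost(dist, medoids):
    """Sum over all cases of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(dist, k):
    """Simplified PAM: greedy initialization, then swap medoids with
    non-medoids for as long as a swap lowers the total cost."""
    n = len(dist)
    medoids = []
    for _ in range(k):  # greedy build step
        candidates = [c for c in range(n) if c not in medoids]
        medoids.append(
            min(candidates, key=lambda c: total_cost(dist, medoids + [c]))
        )
    improved = True
    while improved:  # swap step
        improved = False
        for pos in range(k):
            for c in range(n):
                if c in medoids:
                    continue
                trial = medoids[:pos] + [c] + medoids[pos + 1:]
                if total_cost(dist, trial) < total_cost(dist, medoids):
                    medoids, improved = trial, True
    # Assign each case to the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels

# Hypothetical one-dimensional scaling coordinates forming two groups.
coords = [0, 1, 2, 50, 51, 52]
dist = [[abs(a - b) for b in coords] for a in coords]
medoids, labels = pam(dist, k=2)
```

On this toy input the medoids settle on the middle point of each group, and the labels separate the first three cases from the last three.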
Theoretical Study of RF Clustering
Ref: Using random forest proximity for unsupervised learning, BIOKDD-CBGI'03, 7th Joint Conference on Information Sciences, Cary, North Carolina.
Applying Random Forest Clustering to Tissue Microarray Data--Application to Kidney Cancer
Tao Shi and Steve Horvath
Scientific Question: Can one discover cancer subtypes based on the protein expression patterns of tumor markers?
Why use RF clustering for TMA data?
• no need to transform the often highly skewed features
– based on ranks of features
• natural way of weighting tumor marker contributions to the dissimilarity
• elegant way to deal with missing covariates
• intrinsic proximity matrix handles mixed variable types well
Kidney Multi-marker Data
• 366 patients with Renal Cell Carcinoma (RCC) admitted to UCLA between 1989 and 2000.
• Immunohistological measures of a total of 8 tumor markers were obtained from tissue microarrays constructed from the tumor samples of these patients.
MDS plot of clear cell patients
• Labeled and colored by their RF cluster
[Figure: classical MDS plot ("cmd plot"), coordinate 1 versus coordinate 2; each point is labeled with its RF cluster (1, 2, or 3).]
Interpreting the clusters in terms of survival
[Figure: K-M curves, survival versus time to death (months). Log rank p value = 0.00037423.]

Clustering label   Non-clear cell patients   Clear cell patients
1                  0                         92
2                  20                        215
3                  30                        9
Hierarchical clustering with Euclidean distance leads to less satisfactory results
[Figure: cluster dendrogram, hclust (*, "average") on dist(KidneyRF); y-axis: Height; leaves labeled 0 or 1.]

Clustering label   Non-clear cell patients   Clear cell patients
1                  9 (20)                    286 (307)
2                  41 (30)                   30 (9)
* RF clustering grouping in red (the values in parentheses)
Euclidean vs. RF Distance
[Figure: scatter plot of RF distance (y-axis) versus Euclidean distance (x-axis).]
Molecular grouping vs. Pathological grouping
Message: molecular grouping is superior to pathological grouping
[Figure: two Kaplan-Meier panels (survival versus time to death in years), titled Molecular Grouping and Pathological Grouping. Molecular grouping legend: 327 patients in clusters 1 and 2, 39 patients in cluster 3. Pathological grouping legend: 316 clear cell patients, 50 non-clear cell patients. Reported p-values: p = 0.0229 and p = 9.03e-05.]
Identify “irregular” patients
Clustering label   Non-clear cell patients   Clear cell patients
1                  0                         92
2                  20                        215
3                  30                        9

Message: molecular grouping can be used to refine the clear cell definition.
[Figure: Kaplan-Meier curves, survival versus time to death (years), p = 0.00522. Groups: 9 irregular clear cell patients, 307 regular clear cell patients, 50 non-clear cell patients.]
Detect novel cancer subtypes
• Group clear cell grade 2 patients into two clusters with significantly different survival.
[Figure: K-M curves, survival versus time to death (years); p value = 0.0125.]
Results TMA clustering
• Clusters reproduce well known clinical subgroups
– Ex: global expression differences between clear cell and non-clear cell patients
– RF clustering works better than clustering based on the Euclidean distance for TMA data
• RF clustering allows one to identify “outlying” tumor samples.
• Can detect previously unknown sub-groups
Boxplots of tumor marker expression vs. cluster
[Figure: boxplots of marker expression by RF cluster (1, 2, 3), one panel per marker, with p-values:
– CA9MemPosMn: p = 9.95e-28
– CA12MemPosMn: p = 4.61e-15
– Ki67PosMn: p = 3.51e-13
– GePosHarriMn: p = 3.33e-21
– p53PosMn: p = 1.7e-10
– EpDctPosMn: p = 1.64e-14
– pTENPosMn: p = 1.43e-27
– VimPos: p = 7.97e-14]
Message: clusters can be explained in terms of tumor marker expression values, i.e., in terms of biological pathways.
Conclusions
• There is a need to develop tailor-made data mining methods for TMA data
– Major differences:
  • highly non-normal data
  • Euclidean distance metrics seem to be suboptimal for TMA data
• tree or forest based methods work well for kidney and prostate TMA data