23/4/18 DIMACS Workshop on Machine Learning Techniques in Bioinformatics
Cancer Classification with Data-dependent Kernels
Anne Ya Zhang
(with Xue-wen Chen & Huilin Xiong)
EECS & ITTC
University of Kansas
Outline
Introduction
Data-dependent Kernel
Results
Conclusion
Cancer facts
Cancer is a group of many related diseases
  Cells continue to grow and divide and do not die when they should.
  Changes in the genes that control normal cell growth and death.
Cancer is the second leading cause of death in the United States
  Cancer causes 1 of every 4 deaths
  NIH estimates overall costs for cancer in 2004 at $189.8 billion ($64.9 billion for direct medical costs)
Cancer types
  Breast cancer, lung cancer, colon cancer, …
  Death rates vary greatly by cancer type and stage at diagnosis
Motivation
Why do we need to classify cancers?
  The general way of treating cancer is to:
    Categorize the cancers into different classes
    Use a specific treatment for each class
Traditional ways to classify cancers
  Morphological appearance
    Not accurate!
  Enzyme-based histochemical analyses
  Immunophenotyping
  Cytogenetic analysis
    Complicated & needs highly specialized laboratories
Motivation
Why are traditional ways not enough?
  There exist tumors in the same class with completely different clinical courses
    A more accurate classification may be needed
  Assigning new tumors to known cancer classes is not easy
    e.g. assigning an acute leukemia tumor to one of
      AML (acute myeloid leukemia)
      ALL (acute lymphoblastic leukemia)
DNA Microarray-based Cancer Diagnosis
Cancer is caused by changes in the genes that control normal cell growth and death.
Molecular diagnostics offer the promise of precise, objective, and systematic cancer classification.
These tests are not widely applied because characteristic molecular markers for most solid tumors have yet to be identified.
Recently, microarray tumor gene expression profiles have been used for cancer diagnosis.
Microarray
A microarray experiment monitors the expression levels for thousands of genes simultaneously.
Microarray techniques will lead to a more complete understanding of the molecular variations among tumors, hence to a more reliable classification.
[Figure: expression heat map of genes (G1, G2, …) against conditions (C1 to C7); color scale: Low, Zero, High]
Microarray
Microarray analysis allows the monitoring of the activities of thousands of genes over many different conditions.
From a machine learning point of view…
Gene\Experiment    ex-1    ex-2    ……    ex-m
g-1
g-2
…….
…….
g-n
The large volume of data calls for computational aid in analyzing the expression data.
Machine learning tasks in cancer classification
There are three main types of machine learning problems associated with cancer classification:
  The identification of new cancer classes using gene expression profiles
  The classification of cancer into known classes
  The identification of “marker” genes that characterize the different cancer classes
In this presentation, we focus on the second type of problem.
Project Goals
To develop a more systematic machine learning approach to cancer classification using microarray gene expression profiles.
Use an initial collection of samples belonging to known classes of cancer to create a “class predictor” for new, unknown samples.
Challenges in cancer classification
Gene expression data are typically characterized by
  high dimensionality (i.e. a large number of genes)
  small sample size
Curse of dimensionality!
Methods
  Kernel techniques
  Data resampling
  Gene selection
Data-dependent kernel model
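$k(x, y) = q(x)\, q(y)\, k_0(x, y)$
$k_0(x, y)$ is a basic kernel function
$q(x) = \alpha_0 + \sum_{i=1}^{m} \alpha_i\, k_1(x, x_i)$
$k_1(x, x_i) = e^{-\gamma_1 \|x - x_i\|^2}$
Optimizing the data-dependent kernel is to choose the coefficient vector $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_l)^T$
Data dependent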
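A minimal sketch of how such a kernel could be evaluated: the basic kernel is rescaled by a factor q at each argument, where q is a constant plus a weighted sum of Gaussian bumps centered on chosen training points. The function names, the use of a Gaussian basic kernel with width `gamma0`, and the `centers` argument are assumptions made for illustration, not the authors' code.

```python
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian kernel exp(-gamma * ||x - y||^2) on two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def q_factor(x, centers, alpha, gamma1):
    """q(x) = alpha[0] + sum_i alpha[i] * k1(x, centers[i-1]).

    alpha must have len(centers) + 1 entries (hypothetical layout).
    """
    return alpha[0] + sum(
        a_i * gaussian_kernel(x, c, gamma1) for a_i, c in zip(alpha[1:], centers)
    )


def data_dependent_kernel(x, y, centers, alpha, gamma0, gamma1):
    """k(x, y) = q(x) * q(y) * k0(x, y), with a Gaussian k0 (assumed here)."""
    k0 = gaussian_kernel(x, y, gamma0)
    return q_factor(x, centers, alpha, gamma1) * q_factor(y, centers, alpha, gamma1) * k0
```

With `alpha = [1, 0, ..., 0]` the factor q is identically 1 and the kernel reduces to the basic kernel, which makes a convenient sanity check.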
Optimizing the kernel
Criterion for kernel optimization
Maximum class separability of the training data in the kernel-induced feature space
$\max J = \dfrac{\operatorname{tr}(S_b)}{\operatorname{tr}(S_w)}$
$S_w$: within-class scatter matrix
$S_b$: between-class scatter matrix
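The ratio of the traces of the between-class and within-class scatter matrices can be computed from a kernel matrix alone, without explicit feature vectors. The sketch below assumes the usual scatter definitions (class means weighted by class size, both matrices scaled by 1/m); the scaling cancels in the ratio. The function name and argument layout are hypothetical.

```python
def class_separability(K, labels):
    """Return tr(S_b) / tr(S_w) in the feature space induced by kernel matrix K.

    K is a square list of lists with K[i][j] = k(x_i, x_j);
    labels[i] is the class of sample i.
    """
    m = len(labels)
    total_sum = sum(sum(row) for row in K)
    diag_sum = sum(K[i][i] for i in range(m))

    # For each class, sum K over the class block and divide by the class size.
    block_term = 0.0
    for c in set(labels):
        idx = [i for i in range(m) if labels[i] == c]
        block_term += sum(K[i][j] for i in idx for j in idx) / len(idx)

    tr_sb = (block_term - total_sum / m) / m  # between-class scatter
    tr_sw = (diag_sum - block_term) / m       # within-class scatter
    return tr_sb / tr_sw
```

For a linear kernel this agrees with the scatter traces computed directly on the raw features, which is an easy way to check the bookkeeping.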
The Kernel Optimization
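$\max J = \dfrac{\operatorname{tr}(S_b)}{\operatorname{tr}(S_w)}$
$M_0$, $N_0$ are functions of $K_0$ and $K_1$
$\max J = \dfrac{\alpha^T M_0 \alpha}{\alpha^T N_0 \alpha}$
$M_0 \alpha = \lambda N_0 \alpha$ ($N_0$ is nonsingular)
In reality, the matrix $N_0$ is usually singular
$M_0 \alpha = \lambda (N_0 + \mu I)\, \alpha, \quad \mu > 0$
α: eigenvector corresponding to the largest eigenvalue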
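One way to obtain the eigenvector for the largest eigenvalue is power iteration, sketched here in plain Python. It assumes the singular case is handled by adding a multiple `mu` of the identity to N0 before solving; that regularization, the default values, and all function names are assumptions for illustration. M0 and N0 are taken as given symmetric matrices.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]  # work on a copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def top_generalized_eigvec(M, N, mu=1e-3, iters=200):
    """Power iteration on (N + mu*I)^{-1} M; returns (eigenvalue, unit eigenvector)."""
    n = len(M)
    N_reg = [[N[i][j] + (mu if i == j else 0.0) for j in range(n)] for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        w = solve(N_reg, matvec(M, v))
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    # Generalized Rayleigh quotient v^T M v / v^T (N + mu*I) v.
    num = sum(a * b for a, b in zip(v, matvec(M, v)))
    den = sum(a * b for a, b in zip(v, matvec(N_reg, v)))
    return num / den, v
```

Power iteration is chosen only to keep the sketch dependency-free; a library eigensolver would be the more usual choice for matrices of realistic size.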
Kernel optimization
[Figure: training data and test data, before kernel optimization and after kernel optimization]
Distributed resampling
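Original training data: $\{x_i, y_i\}\ (i = 1, 2, \ldots, m)$
Training data with resampling: $\{a_i, b_i\}\ (i = 1, 2, \ldots, 3m)$
$a_i = \begin{cases} x_i, & 1 \le i \le m \\ x_r + \varepsilon, & i > m \end{cases}$
$b_i = \begin{cases} y_i, & 1 \le i \le m \\ y_r, & i > m \end{cases}$
$\varepsilon \sim N(0, \sigma^2)$
$x_r$: random sample of $\{x_i\}$ with replacement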
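A hedged sketch of this resampling step: the original m training pairs are kept, and additional pairs are drawn with replacement from them, with Gaussian noise added to the features and the label copied unchanged, until the set is three times its original size. Applying independent noise to each gene, and the names `distributed_resample`, `factor`, and `seed`, are assumptions for illustration.

```python
import random


def distributed_resample(X, y, sigma, factor=3, seed=None):
    """Return (A, b): the original samples plus noisy resampled copies.

    X: list of feature vectors, y: list of labels, sigma: noise standard
    deviation. The result has factor * len(X) samples.
    """
    rng = random.Random(seed)
    m = len(X)
    A = [list(x) for x in X]  # first m samples are the originals
    b = list(y)
    for _ in range((factor - 1) * m):
        r = rng.randrange(m)  # draw an index with replacement
        A.append([v + rng.gauss(0.0, sigma) for v in X[r]])
        b.append(y[r])        # the label is copied without noise
    return A, b
```

Passing a fixed `seed` makes a run reproducible, which helps when the experiment is repeated many times.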
Gene selection
A filter method: class separability
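$g(j) = \dfrac{\sum_{k=1}^{2} m_k \left(\bar{x}_k(j) - \bar{x}(j)\right)^2}{\sum_{k=1}^{2} \sum_{i \in C_k} \left(x_i(j) - \bar{x}_k(j)\right)^2}$
$\bar{x}(j)$: average expression of all training samples
$\bar{x}_k(j)$: average expression across the k-th class
$m_k$: the number of samples in $C_k$
$C_k$: index set of the k-th class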
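The filter score can be sketched as follows: for each gene, the between-class variation of the class means is divided by the within-class variation, and the genes with the highest scores are kept. The function names and the handling of a zero denominator are assumptions made for this illustration.

```python
def gene_scores(X, labels):
    """Score each gene by between-class over within-class variation.

    X: list of samples, each a list of expression values (one per gene).
    labels: class label of each sample.
    """
    n_genes = len(X[0])
    classes = sorted(set(labels))
    scores = []
    for j in range(n_genes):
        col = [x[j] for x in X]
        overall_mean = sum(col) / len(col)
        between = 0.0
        within = 0.0
        for c in classes:
            vals = [v for v, lbl in zip(col, labels) if lbl == c]
            class_mean = sum(vals) / len(vals)
            between += len(vals) * (class_mean - overall_mean) ** 2
            within += sum((v - class_mean) ** 2 for v in vals)
        # Assumption: a gene with zero within-class variation gets an infinite score.
        scores.append(between / within if within > 0 else float("inf"))
    return scores


def select_top_genes(X, labels, n_keep):
    """Indices of the n_keep genes with the highest scores."""
    scores = gene_scores(X, labels)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:n_keep]
```

Because the score is computed per gene, it ignores interactions between genes; that is the usual trade-off of a filter method.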
Comparison with other methods
k-Nearest Neighbor (kNN)
Diagonal linear discriminant analysis (DLDA)
Uncorrelated linear discriminant analysis (ULDA)
Support vector machines (SVM)
Data sets
Subtypes: ALL vs. AML
Status of estrogen receptor
Status of lymph nodes
Outcome of treatment
Tumor vs. healthy tissue
Subtypes: MPM vs. ADCA
Different lymphoma cells
Cancer vs. non-cancer
Tumor vs. healthy tissue
Experimental setup
Data normalization
  Zero mean and unit variance in the gene direction
Randomly partition the data into two disjoint subsets of equal size: training data + test data
Repeat each experiment 100 times
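The setup can be sketched in a few lines: each gene is standardized to zero mean and unit variance, and the samples are then split at random into two halves, repeated a fixed number of times. Using the population standard deviation, normalizing over all samples before the split, and the function names are assumptions for illustration.

```python
import random
import statistics


def standardize_genes(X):
    """Scale each gene (column) of X to zero mean and unit variance."""
    n_genes = len(X[0])
    cols = [[x[j] for x in X] for j in range(n_genes)]
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    # Assumption: a constant gene is mapped to 0.0 instead of dividing by zero.
    return [
        [(x[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(n_genes)]
        for x in X
    ]


def random_half_split(n_samples, rng):
    """Return (train_idx, test_idx): two disjoint halves of range(n_samples)."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    half = n_samples // 2
    return idx[:half], idx[half:]


def repeated_half_splits(n_samples, n_repeats=100, seed=0):
    """Generate the index splits for n_repeats independent experiments."""
    rng = random.Random(seed)
    return [random_half_split(n_samples, rng) for _ in range(n_repeats)]
```

With an odd number of samples the test half receives the extra sample; the reported accuracy would then be the average over the repeats.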
Parameters
DLDA: no parameter
KNN: Euclidean distance, K=3
ULDA: K=3
SVM: Gaussian kernel; leave-one-out on the training data is used to tune the parameters
KerNN: Gaussian kernel for the basic kernel k0; γ0 and σ are set empirically. Leave-one-out on the training data is used to tune the remaining parameters. KNN is used for classification.
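As an illustration of the kNN setting and the leave-one-out tuning mentioned here, the sketch below classifies by majority vote among the K=3 nearest training samples under Euclidean distance and estimates accuracy by holding out one training sample at a time. The function names are hypothetical, and the same loop could score any candidate parameter value.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(X, y, k=3):
    """Fraction of samples classified correctly when each is held out in turn."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)
```

To tune a parameter, one would evaluate `leave_one_out_accuracy` on the training data for each candidate value and keep the value with the highest score.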
Effect of data resampling
[Figure: results for Prostate (102 samples) and Lung (181 samples)]
Effect of gene selection
[Figures: results for ALL-AML, Colon, and Prostate]
Comparison results
[Figures: results for ALL-AML, BreastER, BreastLN, Colon, CNS, Lung, Ovarian, and Prostate]
Conclusion
By maximizing the class separability of training data, the data-dependent kernel is also able to increase the separability of test data.
The kernel method is robust to high-dimensional microarray data.
The distributed resampling strategy helps to alleviate the problem of overfitting
Conclusion
The classifier assigns samples more accurately than the other approaches, so a more suitable treatment can be chosen for each class.
The method can be used for clarifying unusual cases, e.g. a patient who was diagnosed with AML but with atypical morphology.
The method can be applied to distinctions relating to future clinical outcomes.
Future work
How to estimate the parameters
Study the genes selected
References
H. Xiong, M.N.S. Swamy, and M.O. Ahmad. Optimizing the data-dependent kernel in the empirical feature space. IEEE Trans. on Neural Networks 2005, 16:460-474.
H. Xiong, Y. Zhang, and X. Chen. Data-dependent kernels for cancer classification. Under review.
A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. J. Computational Biology 2000, 7:559-584.
S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Statistical Assoc. 2002, 97:77-87.
T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16:906-914.
J. Ye, T. Li, T. Xiong, and R. Janardan. Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Trans. on Computational Biology and Bioinformatics 2004, 1:181-190.
Thanks! Questions?