23/4/18 DIMACS Workshop on Machine Learning Techniques in Bioinformatics


Cancer Classification with Data-dependent Kernels

Anne Ya Zhang

(with Xue-wen Chen & Huilin Xiong)

EECS & ITTC

University of Kansas


Outline

• Introduction
• Data-dependent Kernel
• Results
• Conclusion


Cancer facts

• Cancer is a group of many related diseases
  • Cells continue to grow and divide and do not die when they should.
  • Caused by changes in the genes that control normal cell growth and death.
• Cancer is the second leading cause of death in the United States
  • Cancer causes 1 of every 4 deaths
  • NIH estimates the overall cost of cancer in 2004 at $189.8 billion ($64.9 billion for direct medical costs)
• Cancer types
  • Breast cancer, lung cancer, colon cancer, …
  • Death rates vary greatly by cancer type and stage at diagnosis


Motivation

• Why do we need to classify cancers?
  • The general way of treating cancer is to:
    • Categorize the cancers into different classes
    • Use a specific treatment for each class
• Traditional ways to classify cancers
  • Morphological appearance
    • Not accurate!
  • Enzyme-based histochemical analyses, immunophenotyping, cytogenetic analysis
    • Complicated, and require highly specialized laboratories


Motivation

• Why are the traditional ways not enough?
  • Some tumors in the same class have completely different clinical courses
    • A more accurate classification may be needed
  • Assigning new tumors to known cancer classes is not easy
    • e.g., assigning an acute leukemia tumor to one of
      • AML (acute myeloid leukemia)
      • ALL (acute lymphoblastic leukemia)


DNA Microarray-based Cancer Diagnosis

• Cancer is caused by changes in the genes that control normal cell growth and death.
• Molecular diagnostics offer the promise of precise, objective, and systematic cancer classification
  • These tests are not widely applied because characteristic molecular markers for most solid tumors have yet to be identified.
• Recently, microarray tumor gene expression profiles have been used for cancer diagnosis.


Microarray

• A microarray experiment monitors the expression levels of thousands of genes simultaneously.
• Microarray techniques will lead to a more complete understanding of the molecular variations among tumors, and hence to a more reliable classification.

[Figure: expression heat map for genes G1 to G7 under conditions C1 to C7; color scale: Low, Zero, High]


Microarray

• Microarray analysis allows the monitoring of the activities of thousands of genes over many different conditions.
• From a machine learning point of view…

    Gene\Experiment    ex-1    ex-2    ……    ex-m
    g-1
    g-2
    …….
    g-n

• The large volume of the data requires computational aid in analyzing the expression data.
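The gene-by-experiment layout can be turned into the sample-by-feature layout that most classifiers expect by a transpose. The following is a minimal sketch; the function name `to_samples` and the toy values are illustrative assumptions, not part of the talk.

```python
def to_samples(expr):
    """Transpose a genes x experiments table into experiments x genes.

    expr[g][e] is the expression level of gene g in experiment e.
    Each returned row is one experiment (sample) described by all genes.
    """
    return [list(column) for column in zip(*expr)]


# Toy table: 2 genes (rows) measured in 3 experiments (columns).
expression = [
    [1.0, 2.0, 3.0],  # g-1
    [4.0, 5.0, 6.0],  # g-2
]
samples = to_samples(expression)  # 3 samples, each with 2 gene features
```

With n genes and m experiments, each sample becomes a point in an n-dimensional space, which is where the high dimensionality discussed later in the talk comes from.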


Machine learning tasks in cancer classification

• There are three main types of machine learning problems associated with cancer classification:
  • The identification of new cancer classes using gene expression profiles
  • The classification of cancer into known classes
  • The identification of “marker” genes that characterize the different cancer classes
• In this presentation, we focus on the second type of problem.


Project Goals

To develop a more systematic machine learning approach to cancer classification using microarray gene expression profiles.

Use an initial collection of samples belonging to the known classes of cancer to create a “class predictor” for new, unknown, samples.


Challenges in cancer classification

• Gene expression data are typically characterized by
  • high dimensionality (i.e., a large number of genes)
  • small sample size
• Curse of dimensionality!
• Methods
  • Kernel techniques
  • Data resampling
  • Gene selection


Data-dependent kernel model

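$$k(x, y) = q(x)\, q(y)\, k_0(x, y)$$

• $k_0(x, y)$ is a basic kernel function
• $q(x) = \alpha_0 + \sum_{i=1}^{m} \alpha_i\, k_1(x, x_i)$
• $k_1(x, x_i) = e^{-\gamma_1 \|x - x_i\|^2}$
• The factor $q(\cdot)$ is data dependent.
• Optimizing the data-dependent kernel is to choose the coefficient vector $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_l)^T$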
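The kernel model on this slide, k(x, y) = q(x) q(y) k0(x, y) with q(x) = alpha_0 + sum_i alpha_i k1(x, x_i), can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and using a Gaussian for both k0 and k1 with width parameters `gamma0` and `gamma1` is an assumption based on the Gaussian basic kernel mentioned later in the talk.

```python
import math


def gaussian_kernel(x, y, gamma):
    """exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def q(x, train, alpha, gamma1):
    """q(x) = alpha[0] + sum_i alpha[i] * k1(x, train[i-1])."""
    return alpha[0] + sum(
        a * gaussian_kernel(x, xi, gamma1) for a, xi in zip(alpha[1:], train)
    )


def data_dependent_kernel(x, y, train, alpha, gamma0, gamma1):
    """k(x, y) = q(x) * q(y) * k0(x, y), with a Gaussian k0 (assumed)."""
    k0 = gaussian_kernel(x, y, gamma0)
    return q(x, train, alpha, gamma1) * q(y, train, alpha, gamma1) * k0


train = [[0.0], [1.0]]
# With alpha = (1, 0, 0), q is identically 1 and k reduces to k0.
k_plain = data_dependent_kernel([0.0], [1.0], train, [1.0, 0.0, 0.0], 1.0, 1.0)
```

Choosing alpha = (1, 0, ..., 0) recovers the basic kernel, so the optimization over alpha can only reshape the kernel starting from k0.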


Optimizing the kernel

• Criterion for kernel optimization
  • Maximum class separability of the training data in the kernel-induced feature space

$$\max J = \frac{\operatorname{tr}(S_b)}{\operatorname{tr}(S_w)}$$

  • $S_b$: between-class scatter matrix
  • $S_w$: within-class scatter matrix
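The criterion J = tr(S_b) / tr(S_w) can be evaluated without ever forming the feature vectors, because both traces reduce to sums over entries of the kernel matrix. The sketch below assumes the unnormalized scatter definitions S_b = sum_k m_k (mu_k - mu)(mu_k - mu)^T and S_w = sum_k sum_{i in C_k} (phi_i - mu_k)(phi_i - mu_k)^T; the talk does not state its normalization, and the function name is hypothetical.

```python
def class_separability(K, labels):
    """J = tr(S_b) / tr(S_w) computed from a kernel matrix K (list of lists).

    K[i][j] is the kernel value between training samples i and j.
    """
    m = len(labels)
    total = sum(sum(row) for row in K)        # sum of all entries of K
    diag = sum(K[i][i] for i in range(m))     # sum_i ||phi_i||^2
    class_blocks = 0.0                        # sum_k m_k * ||mu_k||^2
    for c in set(labels):
        idx = [i for i, label in enumerate(labels) if label == c]
        block = sum(K[i][j] for i in idx for j in idx)
        class_blocks += block / len(idx)
    tr_sb = class_blocks - total / m          # between-class scatter trace
    tr_sw = diag - class_blocks               # within-class scatter trace
    return tr_sb / tr_sw


# Linear kernel on 1-D points: class 0 = {0, 1}, class 1 = {3, 4}.
points = [0.0, 1.0, 3.0, 4.0]
K = [[a * b for b in points] for a in points]
J = class_separability(K, [0, 0, 1, 1])
```

For this toy example the class means are 0.5 and 3.5 and the overall mean is 2, so tr(S_b) = 9, tr(S_w) = 1, and J = 9.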


The Kernel Optimization

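$$\max_{\alpha} J(\alpha) = \frac{\operatorname{tr}(S_b)}{\operatorname{tr}(S_w)} = \frac{\alpha^T M_0\, \alpha}{\alpha^T N_0\, \alpha}$$

• $M_0$ and $N_0$ are functions of $K_0$ and $K_1$
• If $N_0$ is nonsingular: $M_0\, \alpha = \lambda N_0\, \alpha$
• In reality, the matrix $N_0$ is usually singular: $M_0\, \alpha = \lambda (N_0 + \mu I)\, \alpha$, with $\mu > 0$
• $\alpha$: eigenvector corresponding to the largest eigenvalue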
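One way to obtain the eigenvector for the largest eigenvalue of the regularized problem M0 alpha = lambda (N0 + mu I) alpha is power iteration. The sketch below is a stand-in under stated assumptions: the talk does not say which eigen-solver was used, the regularization symbol `mu` is an assumed name, and the helper functions are hypothetical. It also assumes the dominant eigenvalue is unique and that M0 alpha never becomes the zero vector.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def top_generalized_eigenvector(M0, N0, mu, iters=500):
    """Power iteration on (N0 + mu*I)^{-1} M0; returns (alpha, lambda)."""
    n = len(M0)
    N_reg = [[N0[i][j] + (mu if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    alpha = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        z = solve(N_reg, matvec(M0, alpha))
        norm = math.sqrt(sum(v * v for v in z))
        alpha = [v / norm for v in z]
    num = sum(a * b for a, b in zip(alpha, matvec(M0, alpha)))
    den = sum(a * b for a, b in zip(alpha, matvec(N_reg, alpha)))
    return alpha, num / den


# N0 is singular here; mu = 0.5 makes the problem well posed.
alpha, lam = top_generalized_eigenvector(
    [[2.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 0.0]], mu=0.5
)
```

In this toy case the regularized operator is diag(2/1.5, 1/0.5), so the iteration converges to the second coordinate axis with eigenvalue 2.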


Kernel optimization

[Figure: sample distributions before and after kernel optimization, shown for the training data and the test data]


Distributed resampling

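• Original training data: $\{x_i, y_i\}\ (i = 1, 2, \ldots, m)$
• Training data with resampling: $\{a_i, b_i\}\ (i = 1, 2, \ldots, 3m)$

$$a_i = \begin{cases} x_i, & 1 \le i \le m \\ x_r + \varepsilon, & i > m \end{cases} \qquad b_i = \begin{cases} y_i, & 1 \le i \le m \\ y_r, & i > m \end{cases}$$

• $\varepsilon \sim N(0, \sigma^2)$
• $x_r$: random sample of $\{x_i\}$ with replacement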
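The resampling scheme on this slide keeps the m original training pairs and appends 2m noisy copies drawn with replacement, for 3m pairs in total. A minimal sketch follows; the function name, the `factor` parameter, and adding independent Gaussian noise to every gene are assumptions made for illustration.

```python
import random


def distributed_resample(X, y, sigma, rng=None, factor=3):
    """Return (A, B) with factor*m samples.

    The first m pairs are the originals. Each extra pair is a randomly
    chosen original x_r (with replacement) plus N(0, sigma^2) noise on
    every feature, labeled with the matching y_r.
    """
    rng = rng or random.Random()
    m = len(X)
    A = [list(x) for x in X]
    B = list(y)
    for _ in range((factor - 1) * m):
        r = rng.randrange(m)
        A.append([v + rng.gauss(0.0, sigma) for v in X[r]])
        B.append(y[r])
    return A, B


X = [[0.0, 1.0], [2.0, 3.0]]
y = [0, 1]
A, B = distributed_resample(X, y, sigma=0.1, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the augmented set reproducible, which helps when the experiment is repeated many times.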


Gene selection

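• A filter method: class separability

$$g(j) = \frac{\sum_{k=1}^{2} m_k \left(\bar{x}_k(j) - \bar{x}(j)\right)^2}{\sum_{k=1}^{2} \sum_{i \in C_k} \left(x_i(j) - \bar{x}_k(j)\right)^2}$$

  • $C_k$: index set of the k-th class
  • $m_k$: the number of samples in $C_k$
  • $\bar{x}_k(j)$: average expression across the k-th class
  • $\bar{x}(j)$: average expression of all training samples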
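The filter score g(j) is the ratio of between-class to within-class variation of gene j, so genes can be ranked by it and the top ones kept. The sketch below follows the formula on this slide; the function names and the handling of a zero denominator are illustrative assumptions.

```python
def separability_score(X, y, j):
    """g(j) for gene j, where X is samples x genes and y holds class labels."""
    values = [row[j] for row in X]
    overall = sum(values) / len(values)
    between = within = 0.0
    for c in set(y):
        cls = [row[j] for row, label in zip(X, y) if label == c]
        mean_c = sum(cls) / len(cls)
        between += len(cls) * (mean_c - overall) ** 2
        within += sum((v - mean_c) ** 2 for v in cls)
    if within == 0.0:
        # Assumption: a gene that is constant within every class scores
        # infinity if the class means differ, and 0 otherwise.
        return float("inf") if between > 0.0 else 0.0
    return between / within


def rank_genes(X, y):
    """Return (gene indices sorted by decreasing score, list of scores)."""
    n = len(X[0])
    scores = [separability_score(X, y, j) for j in range(n)]
    order = sorted(range(n), key=lambda j: scores[j], reverse=True)
    return order, scores


# Gene 0 separates the two classes; gene 1 does not.
X = [[0.0, 0.0], [1.0, 2.0], [3.0, 0.0], [4.0, 2.0]]
y = [0, 0, 1, 1]
order, scores = rank_genes(X, y)
```

In the toy data, gene 0 has class means 0.5 and 3.5, giving g(0) = 9 / 1 = 9, while gene 1 has identical class means and g(1) = 0.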


Comparison with other methods

• k-Nearest Neighbor (kNN)
• Diagonal linear discriminant analysis (DLDA)
• Uncorrelated linear discriminant analysis (ULDA)
• Support vector machines (SVM)


Data sets

Classification task of each data set:

• Subtypes: ALL vs. AML
• Status of estrogen receptor
• Lymph node status
• Outcome of treatment
• Tumor vs. healthy tissue
• Subtypes: MPM vs. ADCA
• Different lymphoma cells
• Cancer vs. non-cancer
• Tumor vs. healthy tissue


Experimental setup

• Data normalization
  • Zero mean and unit variance in the gene direction
• Randomly partition the data into two disjoint subsets of equal size (training data + test data)
• Repeat each experiment 100 times
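The protocol above (normalize, split in half at random, repeat) can be sketched as follows. Several details are assumptions: "gene direction" is read as standardizing each gene across samples, the population variance is used, normalization is applied once before splitting, and `evaluate` is a hypothetical callback that trains on one half and returns the accuracy on the other.

```python
import math
import random


def standardize_genes(X):
    """Scale each gene (column of samples x genes X) to zero mean, unit variance."""
    m, n = len(X), len(X[0])
    out = [row[:] for row in X]
    for j in range(n):
        col = [row[j] for row in X]
        mean = sum(col) / m
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / m)
        for i in range(m):
            # Constant genes carry no information; map them to 0.
            out[i][j] = (X[i][j] - mean) / std if std > 0 else 0.0
    return out


def random_half_split(m, rng):
    """Shuffle indices 0..m-1 and cut them into two disjoint halves."""
    idx = list(range(m))
    rng.shuffle(idx)
    return idx[: m // 2], idx[m // 2:]


def repeated_holdout(X, y, evaluate, repeats=100, seed=0):
    """Average evaluate(Xn, y, train_idx, test_idx) over random half splits."""
    rng = random.Random(seed)
    Xn = standardize_genes(X)
    accs = []
    for _ in range(repeats):
        train_idx, test_idx = random_half_split(len(Xn), rng)
        accs.append(evaluate(Xn, y, train_idx, test_idx))
    return sum(accs) / len(accs)


X = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0], [7.0, 10.0]]
Xn = standardize_genes(X)
```

Averaging over many random splits reduces the dependence of the reported accuracy on any single partition, which matters when only a few dozen samples are available.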


Parameters

• DLDA: no parameters
• KNN: Euclidean distance, K=3
• ULDA: K=3
• SVM: Gaussian kernel; leave-one-out on the training data is used to tune the parameters
• KerNN: Gaussian kernel for the basic kernel k0
  • γ0 and σ are set empirically
  • Leave-one-out on the training data is used to tune the remaining parameters
  • KNN is used for classification
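Leave-one-out tuning scores a parameter setting by holding out each training sample in turn and predicting it from the rest. The sketch below shows the procedure with a plain Euclidean kNN classifier (K=3, as on this slide); it is a generic illustration with hypothetical function names, not the tuning code used for SVM or KerNN in the talk.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(X, y, k=3):
    """Fraction of samples predicted correctly when each is held out once."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)


# Two well-separated 1-D clusters of four points each.
X = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
acc = leave_one_out_accuracy(X, y, k=3)
```

To tune a parameter, one would compute this score for each candidate value on the training data only and keep the value with the highest score.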


Effect of data resampling

[Figures: effect of data resampling on the Prostate data set (102 samples) and the Lung data set (181 samples)]


Effect of gene selection

[Figures: effect of gene selection on the ALL-AML, Colon, and Prostate data sets]


Comparison results

[Figures: comparison results on the ALL-AML, BreastER, BreastLN, Colon, CNS, Lung, Ovarian, and Prostate data sets]


Conclusion

• By maximizing the class separability of the training data, the data-dependent kernel is also able to increase the separability of the test data.
• The kernel method is robust to high-dimensional microarray data.
• The distributed resampling strategy helps to alleviate the problem of overfitting.
• The classifier assigns samples more accurately than the other approaches, so each class can receive a more suitable treatment.
• The method can be used for clarifying unusual cases
  • e.g., a patient who was diagnosed with AML but with atypical morphology.
• The method can be applied to distinctions relating to future clinical outcomes.


Future work

• How to estimate the parameters
• Study the genes selected


References

• H. Xiong, M.N.S. Swamy, and M.O. Ahmad. Optimizing the data-dependent kernel in the empirical feature space. IEEE Trans. on Neural Networks 2005, 16:460-474.
• H. Xiong, Y. Zhang, and X. Chen. Data-dependent Kernels for Cancer Classification. Under review.
• A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. J. Computational Biology 2000, 7:559-584.
• S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Statistical Assoc. 2002, 97:77-87.
• T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16:906-914.
• J. Ye, T. Li, T. Xiong, and R. Janardan. Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Trans. on Computational Biology and Bioinformatics 2004, 1:181-190.


Thanks! Questions?