An Evaluation of Gene Selection Methods for Multi-class
Microarray Data Classification
by Carlotta Domeniconi and
Hong Chai
Outline
• Introduction to microarray data
• Problem description
• Related work
• Our methods
• Experimental Analysis
• Results
• Conclusion and future work
Microarray
• Measures gene expression levels across different conditions, times or tissue samples
• Gene expression levels inform cell activity and disease status
• Microarray data distinguish between tumor types, define new subtypes, predict prognostic outcome, identify possible drugs, assess drug toxicity, etc.
Microarray Data
• A matrix of measurements: rows are gene expression levels; columns are samples/conditions.
Example – Lymphoma Dataset
Microarray data analysis
• Clustering is applied to genes, to identify genes with similar functions or genes that participate in similar biological processes, or to samples, to discover potential tumor subclasses.
• Classification builds a model to predict the class of diseased samples; it has diagnostic value.
Classification Problem
• Large number of genes (features): a dataset may contain up to 20,000 features.
• Small number of experiments (samples): at most a few hundred, usually fewer than 100.
• The need to identify “marker genes” that discriminate tissue types (e.g., to diagnose cancer): a feature selection problem.
Our Focus
• Binary classification and feature selection methods have been studied extensively; the multi-class case has received little attention.
• In practice, many microarray datasets have more than two categories of samples.
• We focus on multi-class gene ranking and selection.
Related Work
Some criteria used in feature ranking:
• Correlation coefficient
• Information gain
• Chi-squared
• SVM-RFE
Notation
• Given C classes
• m observations (samples or patients)
• n feature measurements (gene expressions)
• class labels y ∈ {1, ..., C}
• each sample is a vector x = (x_1, ..., x_n)^T ∈ R^n
Correlation Coefficient
• Two-class problem: y ∈ {-1, +1}
• Ranking criterion defined in Golub:

    w_j = (μ_j(+) − μ_j(−)) / (σ_j(+) + σ_j(−))

• where μ_j(±) and σ_j(±) are the mean and standard deviation along dimension j in the + and − classes; a large |w_j| indicates a discriminant feature
Fisher’s score
• Fisher’s criterion score in Pavlidis:
    w_j = (μ_j(+) − μ_j(−))² / (σ_j(+)² + σ_j(−)²)
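As a concrete illustration, the two binary criteria above can be computed per gene with a few lines of numpy (a minimal sketch; the function names are mine):

```python
import numpy as np

def golub_score(X, y):
    """Golub correlation coefficient per feature for a two-class problem.
    X: (m samples, n genes); y: labels in {-1, +1}."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0))

def fisher_score(X, y):
    """Fisher criterion: squared mean difference over the summed variances."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) ** 2 / (pos.var(axis=0) + neg.var(axis=0))
```

Ranking the genes then amounts to sorting them by |w_j| (Golub) or w_j (Fisher) in decreasing order.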
Assumption of above methods
• Features are analyzed in isolation; correlations between features are not considered.
• Assumption: features are independent of each other.
• Implication: redundant genes may be selected into the top subset.
Information Gain
• A measure of the effectiveness of a feature in classifying the training data.
• Expected reduction in entropy caused by partitioning the data according to this feature.
• V(A) is the set of all possible values of feature A, and S_v is the subset of S for which feature A has value v:

    I(S, A) = E(S) − Σ_{v ∈ V(A)} (|S_v| / |S|) · E(S_v)
Information Gain
• E(S) is the entropy of the entire set S:

    E(S) = − Σ_{i=1}^{C} (|C_i| / |S|) · log₂(|C_i| / |S|)

• where |C_i| is the number of training samples in class C_i, and |S| is the cardinality of the entire set S.
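The two quantities above can be sketched in a few lines of numpy (assuming the feature has already been discretized; the function names are mine):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """E(S) = -sum_i (|C_i|/|S|) * log2(|C_i|/|S|)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(feature_values, labels):
    """I(S, A) = E(S) - sum_v (|S_v|/|S|) * E(S_v) for a discretized feature A."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    gain = entropy(labels)
    for v in np.unique(feature_values):
        mask = feature_values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain
```

A feature that perfectly separates two balanced classes has gain 1 bit; a feature independent of the labels has gain 0.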
Chi-squared
• Measures features individually.
• Continuous-valued features are discretized into intervals.
• Form a matrix A, where A_ij is the number of samples of class C_i within the j-th interval.
• Let CI_j be the number of samples in the j-th interval.
Chi-squared
• The expected frequency of A_ij is

    E_ij = |C_i| · CI_j / m

• The Chi-squared statistic of a feature is defined as

    χ² = Σ_{i=1}^{C} Σ_{j=1}^{I} (A_ij − E_ij)² / E_ij

• where I is the number of intervals. The larger the statistic, the more informative the feature is.
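Given a feature's C × I contingency matrix, the statistic can be computed like this (a sketch; the function name is mine):

```python
import numpy as np

def chi2_score(A):
    """Chi-squared statistic from a C x I contingency matrix A, where
    A[i, j] counts samples of class C_i falling in interval j."""
    A = np.asarray(A, dtype=float)
    m = A.sum()                                   # total number of samples
    class_tot = A.sum(axis=1, keepdims=True)      # |C_i|
    interval_tot = A.sum(axis=0, keepdims=True)   # CI_j
    E = class_tot * interval_tot / m              # expected frequencies E_ij
    return float(((A - E) ** 2 / E).sum())
```

A diagonal table (each class confined to its own interval) maximizes the statistic; a table proportional to the marginals yields 0.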
SVM-RFE
• Recursive Feature Elimination using SVM.
• In the linear SVM model on the full feature set, sign(w·x + b), w is a vector of weights (one per feature), x is an input instance, and b a threshold.
• If w_i = 0, feature X_i does not influence classification and can be eliminated from the set of features.
SVM-RFE
1. Train a linear SVM on the full feature set to obtain w.
2. Sort the features in descending order of their weights; a percentage of the lowest-ranked features is eliminated.
3. A new linear SVM is built using the reduced set of features. Repeat the process.
4. The best feature subset is chosen.
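The elimination loop in the steps above can be sketched as follows. This is an illustration only: least-squares weights stand in for the SVM weights, and the function name is mine.

```python
import numpy as np

def rfe(X, y, drop_frac=0.5, min_features=1):
    """Recursive feature elimination sketch. Each round fits a linear model
    on the surviving features and drops the fraction with smallest |w|.
    Least-squares weights stand in for SVM weights here."""
    active = list(range(X.shape[1]))
    eliminated = []  # features in elimination order, worst first
    while len(active) > min_features:
        w, *_ = np.linalg.lstsq(X[:, active], y.astype(float), rcond=None)
        order = np.argsort(np.abs(w))                 # weakest weights first
        n_drop = max(1, int(len(active) * drop_frac))
        n_drop = min(n_drop, len(active) - min_features)
        drop = {int(k) for k in order[:n_drop]}       # positions within `active`
        eliminated.extend(active[k] for k in sorted(drop))
        active = [f for k, f in enumerate(active) if k not in drop]
    return active, eliminated
```

In the actual method, each round would retrain a linear SVM and read the weights off the fitted model.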
Other criteria
• The Brown-Forsythe, the Cochran, and the Welch test statistics used in Chen, et al.
(Extensions of the t-statistic used in the two-class classification problem.)
• PCA (Disadvantage: new dimensions are formed; none of the original features can be discarded, so marker genes cannot be identified.)
Our Ranking Methods
• BScatter
• MinMax
• bSum
• bMax
• bMin
• Combined
Notation
• For each class i and each feature j, define the mean value of feature j for class C_i:

    μ_{j,i} = (1 / |C_i|) Σ_{x ∈ C_i} x_j

• Define the total mean along feature j:

    m_j = (1 / m) Σ_{x} x_j
Notation
• Define the between-class scatter along feature j:

    B_j = Σ_{i=1}^{C} |C_i| · (μ_{j,i} − m_j)²
Function 1: BScatter
• Fisher discriminant analysis for multiple classes under the feature-independence assumption. It credits the largest score to the feature that maximizes the ratio of the between-class scatter to the within-class scatter:

    BScatter(j) = B_j / Σ_{i=1}^{C} σ_{j,i}

• where σ_{j,i} is the standard deviation of class i along feature j
Function 2: MinMax
• Favors features along which the farthest mean-class difference is large and the within-class variance is small:

    MinMax(j) = (max_i μ_{j,i} − min_i μ_{j,i}) / Σ_{i=1}^{C} σ_{j,i}
Function 3: bSum
• For each feature j, sort the C values μ_{j,i} in non-decreasing order: μ_{j,1} ≤ μ_{j,2} ≤ … ≤ μ_{j,C}
• Define b_{j,l} = μ_{j,l+1} − μ_{j,l}
• bSum rewards the features with large distances between adjacent mean class values:

    bSum(j) = Σ_{l=1}^{C−1} b_{j,l} / Σ_{i=1}^{C} σ_{j,i}
Function 4: bMax
• Rewards features j with a large maximum between-neighbor-class mean difference:

    bMax(j) = max_{l} b_{j,l} / Σ_{i=1}^{C} σ_{j,i}
Function 5: bMin
• Favors the features with a large smallest between-neighbor-class mean difference:

    bMin(j) = min_{l} b_{j,l} / Σ_{i=1}^{C} σ_{j,i}
Function 6: Comb
• Considers a score function which combines MinMax and bMin:

    Comb(j) = (min_{l} b_{j,l}) · (max_i μ_{j,i} − min_i μ_{j,i}) / Σ_{i=1}^{C} σ_{j,i}
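The six ranking functions above share the same ingredients (class means, class standard deviations, sorted-mean gaps), so they can be sketched together in one numpy routine (an illustration under the definitions above; the function name is mine):

```python
import numpy as np

def scores(X, y):
    """The six multi-class ranking scores, one value per feature.
    X: (m samples, n genes); y: integer class labels. Returns a dict of score vectors."""
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])    # (C, n) class means
    sigma = np.array([X[y == c].std(axis=0) for c in classes])  # (C, n) class stds
    sizes = np.array([(y == c).sum() for c in classes])         # |C_i|
    m_j = X.mean(axis=0)                                        # total mean per feature
    sig_sum = sigma.sum(axis=0)                                 # sum_i sigma_{j,i}

    B = (sizes[:, None] * (mu - m_j) ** 2).sum(axis=0)          # between-class scatter B_j
    b = np.diff(np.sort(mu, axis=0), axis=0)                    # gaps b_{j,l}, shape (C-1, n)
    spread = mu.max(axis=0) - mu.min(axis=0)                    # max_i mu - min_i mu

    return {
        "BScatter": B / sig_sum,
        "MinMax": spread / sig_sum,
        "bSum": b.sum(axis=0) / sig_sum,
        "bMax": b.max(axis=0) / sig_sum,
        "bMin": b.min(axis=0) / sig_sum,
        "Comb": b.min(axis=0) * spread / sig_sum,
    }
```

On a toy three-class dataset where one gene separates the classes and another is pure noise, all six scores rank the discriminant gene first.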
Datasets
Dataset    Samples  Genes  Classes  Comment
MLL        72       12582  3        Available at http://research.nhgri.nih.gov/microarray/Supplement
Lymphoma   88       4026   6        46 DLBCL, 11 CLL, 9 FL (malignant classes); 11 ABB, 6 RAT, 6 TCL (normal samples). Available at http://llmpp.nih.gov/lymphoma
Yeast      80       5775   3
NCI60      61       1155   8        Available at http://rana.lbl.gov/
Experiment Design
• Gene expression values scaled to [-1, 1].
• Nine feature selection methods compared (the 6 proposed scores, Chi-squared, Information Gain, and SVM-RFE).
• Subsets of top-ranked genes used to train an SVM classifier (3 kernel functions: linear, 2-degree polynomial, Gaussian; soft-margin parameter in [1, 100]; Gaussian kernel width in [0.001, 2]).
• Leave-one-out cross-validation, due to the small sample size.
• One-vs-one multi-class classification, implemented with LIBSVM.
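The leave-one-out protocol above can be sketched generically: each sample is held out once and predicted from a model trained on the remainder. A nearest-centroid rule stands in for the SVM here purely to keep the sketch self-contained; the function names are mine.

```python
import numpy as np

def loocv_accuracy(X, y, classify):
    """Leave-one-out cross-validation: hold out each sample once and
    predict it from a model trained on the remaining m-1 samples."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        hits += classify(X[mask], y[mask], X[i]) == y[i]
    return hits / len(y)

def nearest_centroid(X_tr, y_tr, x):
    """Stand-in classifier (the study uses SVMs via LIBSVM instead)."""
    classes = np.unique(y_tr)
    cents = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
    return classes[np.argmin(((cents - x) ** 2).sum(axis=1))]
```

With m samples this trains m models, which is affordable precisely because microarray sample sizes are small.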
Result – MLL Dataset
Result – Lymphoma Dataset
Conclusions
• SVM classification benefits from gene selection.
• Gene ranking with correlation scores gives higher accuracy than SVM-RFE in low dimensions on most data sets; the best-performing correlation score varies from problem to problem.
• Although SVM-RFE shows excellent performance in general, there is no clear winner: the performance of feature selection methods seems to be problem-dependent.
Conclusions
• For a given classification model, different gene selection methods reach the best performance for different feature set sizes;
• Very high accuracy was achieved on all the data sets studied here. In many cases perfect accuracy (based on leave-one-out error) was achieved;
• The NCI60 dataset [17] shows lower accuracy values. This dataset has the largest number of classes (eight) and smaller sample sizes per class. SVM-RFE handles this case well, achieving 96.72% accuracy with 100 selected genes and a linear kernel. The gap in accuracy between SVM-RFE and the other gene ranking methods is highest for this dataset (ca. 11.5%).
Limitations & Future Work
• Selecting features over the whole training set induces a bias in the results. Future experiments will study how to assess and correct this bias.
• The correlation between pairs of selected features will be taken into account: the ranking methods will be modified so that correlations among selected genes stay below a certain threshold.
• Evaluate top-ranked genes in our research against marker genes identified in other studies.