Machine Learning and its Applications in Bioinformatics
Yen-Jen Oyang
Dept. of Computer Science and Information Engineering
Observations and Challenges in the Information Age
• A huge volume of information has been and is being digitized and stored in the computer.
• Due to the volume of digitized information, effective exploitation of information is beyond the capability of human beings without the aid of intelligent computer software.
An Example of Supervised Machine Learning (Data Classification)
• Given the data set shown on next slide, can we figure out a set of rules that predict the classes of objects?
Data Set
Data      Class   Data      Class   Data      Class
(15,33)   O       (18,28)   ×       (16,31)   O
(9,23)    ×       (15,35)   O       (9,32)    ×
(8,15)    ×       (17,34)   O       (11,38)   ×
(11,31)   O       (18,39)   ×       (13,34)   O
(13,37)   ×       (14,32)   O       (19,36)   ×
(18,32)   O       (25,18)   ×       (10,34)   ×
(16,38)   ×       (23,33)   ×       (15,30)   O
(12,33)   O       (21,28)   ×       (13,22)   ×
Distribution of the Data Set
[Figure: scatter plot of the data set, with class-O points drawn as "。" and class-× points drawn as "×"; axis ticks at 10, 15, 20 (x) and 30 (y).]
Rule Based on Observation
$$\text{If } (x-15)^2 + (y-30)^2 \le 25 \text{ and } y \ge 30, \text{ then class} = \text{o}; \text{ else class} = \text{x}.$$
Rule Generated by a Kernel Density Estimation Based Algorithm
Let
$$f_o(v) = \sum_{i=1}^{10} \frac{1}{2\pi\sigma_{o_i}^2}\, e^{-\frac{\|v - c_{o_i}\|^2}{2\sigma_{o_i}^2}} \quad\text{and}\quad f_x(v) = \sum_{j=1}^{14} \frac{1}{2\pi\sigma_{x_j}^2}\, e^{-\frac{\|v - c_{x_j}\|^2}{2\sigma_{x_j}^2}}.$$
If $f_o(v) \ge f_x(v)$, then prediction = “O”.
Otherwise prediction = “X”.
c_oi       σ_oi
(15,33)    1.723
(11,31)    2.745
(18,32)    2.327
(12,33)    1.794
(15,35)    1.973
(17,34)    2.045
(14,32)    1.794
(16,31)    1.794
(13,34)    1.794
(15,30)    2.027
c_xj       σ_xj
(9,23)     6.458
(8,15)     10.08
(13,37)    2.939
(16,38)    2.745
(18,28)    5.451
(18,39)    3.287
(25,18)    10.86
(23,33)    5.322
(21,28)    5.070
(9,32)     4.562
(11,38)    3.463
(19,36)    3.587
(10,34)    3.232
(13,22)    6.260
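The kernel-density rule can be sketched in Python as follows. This is only an illustration: the centers and widths are the values listed on this slide, while the 2-D Gaussian normalization 1/(2πσ²), the tie-break in favor of “O”, and the names `density` and `predict` are assumptions made for the sketch.

```python
import math

# (center, sigma) pairs for each class, as listed on the slide.
O_KERNELS = [((15, 33), 1.723), ((11, 31), 2.745), ((18, 32), 2.327),
             ((12, 33), 1.794), ((15, 35), 1.973), ((17, 34), 2.045),
             ((14, 32), 1.794), ((16, 31), 1.794), ((13, 34), 1.794),
             ((15, 30), 2.027)]
X_KERNELS = [((9, 23), 6.458), ((8, 15), 10.08), ((13, 37), 2.939),
             ((16, 38), 2.745), ((18, 28), 5.451), ((18, 39), 3.287),
             ((25, 18), 10.86), ((23, 33), 5.322), ((21, 28), 5.070),
             ((9, 32), 4.562), ((11, 38), 3.463), ((19, 36), 3.587),
             ((10, 34), 3.232), ((13, 22), 6.260)]


def density(v, kernels):
    """Sum of 2-D Gaussian kernels (assumed normalization 1 / (2*pi*sigma^2))."""
    total = 0.0
    for (cx, cy), sigma in kernels:
        dist2 = (v[0] - cx) ** 2 + (v[1] - cy) ** 2
        total += math.exp(-dist2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
    return total


def predict(v):
    """Return "O" when the O-class density is at least the X-class density."""
    return "O" if density(v, O_KERNELS) >= density(v, X_KERNELS) else "X"
```

For instance, `predict((15, 33))` evaluates both sums at a training point of class O, and `predict((25, 18))` at an outlying training point of class ×.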
Identifying Boundary of Different Classes of Objects
Boundary Identified
Problem Definition of Data Classification
• In a data classification problem, each object is described by a set of attribute values and each object belongs to one of the predefined classes.
• The goal is to derive a set of rules that predicts which class a new object should belong to, based on a given set of training samples. Data classification is also called supervised learning.
The Vector Space Model
• In the vector space model, each object is described by a number of numerical attributes/features.
• For example, the physical appearance of a person can be described by height, weight, and age.
• It is typical that the objects are described by a large number of attributes/features.
Transformation of Categorical Attributes into Numerical Attributes
• Represent the attribute values of the object in a binary table form as exemplified in the following:
Objects   Male/Female   High School Education   College Education   Graduate School Education
1         0             0                       1                   0
2         1             1                       0                   0
3         0             0                       0                   1
• Assign appropriate weight to each column.
• Treat the weighted vector of each row as the feature vector of the corresponding object.
Objects   Male/Female   High School Education   College Education   Graduate School Education
1         0             0                       w3                  0
2         w1            w2                      0                   0
3         0             0                       0                   w4
Transformation of the Similarity/Dissimilarity Matrix Model
• In this model, a matrix records the similarity/dissimilarity scores between every pair of objects.
      P1    P2    P3    P4    P5    P6
P1    -     53    137   862   35    180
P2          -     46    72    816   606
P3                -     447   751   201
P4                      -     291   156
P5                            -     494
P6                                  -
• We may select P2, P5, P6 as representatives and use reciprocals of the similarity scores to these representatives to describe an object.
• For example, the feature vectors of P1 and P2 are <1/53, 1/35, 1/180> and <0, 1/816, 1/606>, respectively.
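The transformation can be sketched in Python as below. The scores are copied from the matrix on this slide; the helper names `score` and `feature_vector` are hypothetical, and setting the component of a representative with respect to itself to 0 simply follows the slide's example for P2.

```python
from fractions import Fraction

# Pairwise scores from the matrix on the slide (upper triangle only).
SCORES = {
    ("P1", "P2"): 53, ("P1", "P3"): 137, ("P1", "P4"): 862,
    ("P1", "P5"): 35, ("P1", "P6"): 180,
    ("P2", "P3"): 46, ("P2", "P4"): 72, ("P2", "P5"): 816, ("P2", "P6"): 606,
    ("P3", "P4"): 447, ("P3", "P5"): 751, ("P3", "P6"): 201,
    ("P4", "P5"): 291, ("P4", "P6"): 156,
    ("P5", "P6"): 494,
}


def score(a, b):
    """Look up the symmetric score between two distinct objects."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def feature_vector(obj, representatives=("P2", "P5", "P6")):
    """Reciprocal scores to each representative; 0 for the object itself."""
    return [Fraction(0) if obj == rep else Fraction(1, score(obj, rep))
            for rep in representatives]
```

With the representatives P2, P5, P6, `feature_vector("P1")` yields the fractions 1/53, 1/35, 1/180, matching the example above.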
Applications of Data Classification in Bioinformatics
• In microarray data analysis, data classification is employed to predict the class of a new sample based on the existing samples with known class.
• Data classification has also been widely employed in prediction of protein family, protein fold, and protein secondary structure.
• For example, in the Leukemia data set, there are 72 samples and 7129 genes.
• 25 Acute Myeloid Leukemia (AML) samples.
• 38 B-cell Acute Lymphoblastic Leukemia samples.
• 9 T-cell Acute Lymphoblastic Leukemia samples.
Model of Microarray Data Sets
            Gene 1   Gene 2   ‧‧‧   Gene n
Sample 1
Sample 2
   :                 M(i,j)
Sample m
where $M(i,j) \in R$.
Alternative Data Classification Algorithms
• Decision tree (C4.5 and C5.0);
• Instance-based learning (KNN);
• Naïve Bayesian classifier;
• RBF network;
• Support vector machine (SVM);
• Kernel Density Estimation (KDE) based classifier.
Instance-Based Learning
• In instance-based learning, we take the k nearest training samples of a new instance (v1, v2, …, vm) and assign the new instance to the class that has the most instances among those k nearest training samples.
• Classifiers that adopt instance-based learning are commonly called the KNN classifiers.
Example of the KNN Classifiers
• If a 1NN classifier is employed, then the prediction for the query instance is “X”.
• If a 3NN classifier is employed, then the prediction for the query instance is “O”.
Decision Function of the KNN Classifier
• Assume that there are two classes of samples, positive and negative.
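• The decision function of a KNN classifier is:
$$\operatorname{sgn}(v) = \operatorname{sgn}\!\left(\sum_{i=1}^{k} \operatorname{sgn}(s_i)\right),$$
where $s_1, s_2, \ldots, s_k$ are the $k$ nearest samples of query vector $v$, and $\operatorname{sgn}(s_i)$ is $+1$ for a positive sample and $-1$ for a negative sample.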
Extension of the KNN Classifier
• We may extend the KNN classifier by weighting the contribution of each neighbor with a term related to its distance to the query vector:
$$\operatorname{sgn}(v) = \operatorname{sgn}\!\left(\sum_{i=1}^{k} w_i\big(\|v - s_i\|\big)\,\operatorname{sgn}(s_i)\right),$$
where $s_1, s_2, \ldots, s_k$ are the $k$ nearest samples of query vector $v$.
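A minimal Python sketch of the plain and the distance-weighted KNN decision functions is given below. It assumes class labels encoded as +1/-1 and Euclidean distance; the function name `knn_predict`, the optional `weight` argument, and the tie-break toward +1 are illustrative choices, not taken from the slides.

```python
import math


def knn_predict(train, query, k, weight=None):
    """Classify `query` by a (possibly weighted) vote of its k nearest samples.

    train  -- list of (vector, label) pairs, label being +1 or -1
    weight -- optional function mapping a distance to a vote weight;
              None gives every neighbor a weight of 1 (plain KNN)
    """
    by_distance = sorted(train, key=lambda s: math.dist(s[0], query))
    vote = 0.0
    for vector, label in by_distance[:k]:
        w = 1.0 if weight is None else weight(math.dist(vector, query))
        vote += w * label
    return 1 if vote >= 0 else -1
```

With the hypothetical training set `[((0, 0), 1), ((1, 0), -1), ((1.2, 0), -1)]` and query `(0.4, 0)`, the 1NN vote is positive and the 3NN vote is negative, while a 3NN vote weighted by inverse squared distance is positive again because the single close neighbor dominates.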
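An RBF Network Based Classifier with Gaussian Kernels
• It is typical that all $\phi_1, \phi_2, \ldots, \phi_k$ are radial basis functions of the same form.
• With the popular Gaussian function, the decision function is of the following form:
$$\operatorname{sgn}(v) = \operatorname{sgn}\!\left(\sum_{i} w_i\, e^{-\frac{\|v - s_i\|^2}{2\sigma_i^2}} - \sum_{j} w_j\, e^{-\frac{\|v - s_j\|^2}{2\sigma_j^2}}\right),$$
with $i$ ranging over the positive samples and $j$ over the negative samples.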
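The Common Structure of the RBF Network Based Data Classifier
[Figure: network diagram in which the input $v$ feeds the basis functions $\phi_1(v), \phi_2(v), \ldots, \phi_k(v)$, whose outputs are weighted by $w_i$ (or $w_j$) and summed to produce $f(v)$.]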
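[Figure: taxonomy of RBF network based data classifiers, all built from kernels of the form $w_i \exp\!\big(-\|\nu - s_i\|^2 / (2\sigma_i^2)\big)$ and $w_j \exp\!\big(-\|\nu - s_j\|^2 / (2\sigma_j^2)\big)$.]
• SVM with the RBF kernel: place a kernel function at each sample and employ a global $\sigma$, i.e., $\sigma_i = \sigma_j$.
• Regularization networks: $\sigma_i$ and $\sigma_j$ are heuristically set or ...
• Fixed KDE and variable/adaptive KDE: place a kernel function at each sample, with the weight $w$ and the kernel width set from the bandwidth $h$ and the number of samples $m$.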
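Regularization of an RBF Network Based Classifier
• The conventional approaches proceed with either employing a constant σ for all kernel functions or employing a heuristic mechanism to set σi individually, e.g. a multiple of the average distance among samples, and attempt to minimize
$$E = \sum_{s_i} \left\| \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1k}\\ w_{21} & w_{22} & \cdots & w_{2k}\\ \vdots & \vdots & & \vdots\\ w_{m1} & w_{m2} & \cdots & w_{mk} \end{bmatrix} \begin{bmatrix} \phi_1(s_i)\\ \phi_2(s_i)\\ \vdots\\ \phi_k(s_i) \end{bmatrix} - \begin{bmatrix} f_1(s_i)\\ f_2(s_i)\\ \vdots\\ f_m(s_i) \end{bmatrix} \right\|^2 + \gamma \sum_{j=1}^{m}\sum_{h=1}^{k} w_{jh}^2,$$
where $s_i$ is a learning sample.
• The term $\gamma \sum_{j=1}^{m}\sum_{h=1}^{k} w_{jh}^2$ is included to avoid overfitting and γ is to be set through cross validation.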
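Setting the partial derivative of $E$ with respect to $w_{11}$ to zero gives
$$\frac{\partial E}{\partial w_{11}} = \sum_{s_i} 2\left(\sum_{h=1}^{k} w_{1h}\,\phi_h(s_i) - f_1(s_i)\right)\phi_1(s_i) + 2\gamma w_{11} = 0,$$
where $\sum_{h} w_{1h}\,\phi_h(s_i) - f_1(s_i)$ is $V(1)$, the first element of the column vector $V = W\phi(s_i) - f(s_i)$. Hence
$$\sum_{s_i}\sum_{h=1}^{k} w_{1h}\,\phi_h(s_i)\,\phi_1(s_i) + \gamma w_{11} = \sum_{s_i} f_1(s_i)\,\phi_1(s_i).$$
Similarly,
$$\frac{\partial E}{\partial w_{12}} = 0 \;\Rightarrow\; \sum_{s_i}\sum_{h=1}^{k} w_{1h}\,\phi_h(s_i)\,\phi_2(s_i) + \gamma w_{12} = \sum_{s_i} f_1(s_i)\,\phi_2(s_i),$$
$$\frac{\partial E}{\partial w_{21}} = 0 \;\Rightarrow\; \sum_{s_i}\sum_{h=1}^{k} w_{2h}\,\phi_h(s_i)\,\phi_1(s_i) + \gamma w_{21} = \sum_{s_i} f_2(s_i)\,\phi_1(s_i).$$
Let
$$W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1k}\\ w_{21} & w_{22} & \cdots & w_{2k}\\ \vdots & \vdots & & \vdots\\ w_{m1} & w_{m2} & \cdots & w_{mk} \end{bmatrix},\quad \Phi = \begin{bmatrix} \phi_1(s_1) & \phi_1(s_2) & \cdots & \phi_1(s_n)\\ \phi_2(s_1) & \phi_2(s_2) & \cdots & \phi_2(s_n)\\ \vdots & \vdots & & \vdots\\ \phi_k(s_1) & \phi_k(s_2) & \cdots & \phi_k(s_n) \end{bmatrix},\quad Z = \begin{bmatrix} f_1(s_1) & f_1(s_2) & \cdots & f_1(s_n)\\ f_2(s_1) & f_2(s_2) & \cdots & f_2(s_n)\\ \vdots & \vdots & & \vdots\\ f_m(s_1) & f_m(s_2) & \cdots & f_m(s_n) \end{bmatrix}.$$
Then, we have $W(\Phi\Phi^{T} + \gamma I) = Z\Phi^{T}$ and $W = Z\Phi^{T}(\Phi\Phi^{T} + \gamma I)^{-1}$.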
Decision Function of an SVM
• A prediction of the class of a new sample located at v in the vector space is based on the following rule:
$$\hat f(v) = \sum_{i} w_i\, e^{-\frac{\|v - s_i\|^2}{2\sigma^2}} - \sum_{j} w_j\, e^{-\frac{\|v - s_j\|^2}{2\sigma^2}} + b, \qquad 0 \le w_i, w_j \le C.$$
If $\hat f(v) \ge 0$, then the predicted class is “+”.
Otherwise, the predicted class is “−”.
Those instances $s_i$ and $s_j$ with non-zero weights are called the support vectors.
The Kernel Density Estimation (KDE) Based Classifier
• The KDE based learning algorithm constructs one approximate probability density function for one class of objects.
• Classification of a new object is conducted based on the likelihood function:
$$L_m(v) = \frac{|S_m|}{|S|}\,\hat f_m(v),$$
where $|S_m|$ and $|S|$ are the number of training samples of class $m$ and the total number of training samples of all classes, respectively, and $\hat f_m(v)$ is the approximate probability density function of class-$m$ objects.
Identifying Boundary of Different Classes of Objects
Boundary Identified
Problem Definition of Kernel Density Estimation
• Given a set of samples $S = \{s_1, s_2, \ldots, s_n\}$ randomly taken from a probability distribution $f$, we want to find a set of symmetric kernel functions $K(v; s_i, \sigma_i)$ and the corresponding weights $w_i$ such that
$$\hat f(v) = \sum_{i} w_i\, K(v; s_i, \sigma_i) \approx f(v).$$
The Proposed KDE Based Classifier
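• We chose to employ the Gaussian function and to set the width of each Gaussian function to a multiple of the average distance among neighboring samples:
$$\hat f(v) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{\sqrt{2\pi}\,\sigma_i}\right)^{d}\exp\!\left(-\frac{\|v - s_i\|^2}{2\sigma_i^2}\right),$$
where $\sigma_i = \beta\,\bar\delta_i$ and $\bar\delta_i$ is the average distance among the neighboring samples of $s_i$.
• $\bar\delta_i$ can be estimated as follows:
$$\bar\delta_i \approx \frac{R_k(s_i)\,\sqrt{\pi}}{\sqrt[d]{(k+1)\,\Gamma\!\left(\frac{d}{2}+1\right)}},$$
where $\frac{\pi^{d/2}\,R_k(s_i)^{d}}{\Gamma\left(\frac{d}{2}+1\right)}$ is the volume of a sphere with radius $R_k(s_i)$ in a $d$-dimensional vector space and $R_k(s_i)$ is the distance from sample $s_i$ to its $k$-th nearest sample.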
Accuracy of Different Classification Algorithms
Data set                 Classification algorithms
                         KDE     SVM     1NN     3NN
Satimage (4335,2000)     92.30   91.30   89.35   90.6
Letter (15000,5000)      97.12   97.98   95.26   95.46
Shuttle (43500,14500)    99.94   99.92   99.91   99.92
Average                  96.45   96.40   94.84   95.33
Comparison of Execution Time (in seconds)
                   KDE without data reduction   KDE with data reduction   SVM
Cross validation
  Satimage         670                          265                       64622
  Letter           2825                         1724                      386814
  Shuttle          96795                        59.9                      467825
Make classifier
  Satimage         5.91                         0.85                      21.66
  Letter           17.05                        6.48                      282.05
  Shuttle          1745                         0.69                      129.84
Test
  Satimage         21.3                         7.4                       11.53
  Letter           128.6                        51.74                     94.91
  Shuttle          996.1                        5.85                      2.13
Parameter Setting through Cross Validation
• When carrying out data classification, we normally need to set one or more parameters associated with the data classification algorithm.
• For example, we need to set the value of k with the KNN classifier.
• The typical approach is to conduct cross validation to find out the optimal value.
• In the cross validation process, we set the parameters of the classifier to a particular combination of values that we are interested in and then evaluate how good the combination is based on alternative schemes.
• With the leave-one-out cross validation scheme, we attempt to predict the class of each sample using the remaining samples as the training data set.
• With 10-fold cross validation, we evenly divide the training data set into 10 subsets. Each time, we test the prediction accuracy of one of the 10 subsets using the other 9 subsets as the training set.
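As an illustration of parameter setting, the sketch below uses leave-one-out cross validation to choose k for a KNN classifier. It is a simplified example: the helper names are hypothetical, distance is Euclidean, and ties between equally accurate values of k are resolved in favor of the first candidate.

```python
import math
from collections import Counter


def knn_label(train, query, k):
    """Majority label among the k training samples nearest to `query`."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(samples, k):
    """Predict each sample from all the others; return the fraction correct."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if knn_label(rest, vector, k) == label:
            correct += 1
    return correct / len(samples)


def best_k(samples, candidates):
    """Candidate k with the highest leave-one-out accuracy (first wins ties)."""
    return max(candidates, key=lambda k: leave_one_out_accuracy(samples, k))
```

A 10-fold variant would replace the inner loop with ten held-out subsets instead of one held-out sample at a time.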
Overfitting
• Overfitting occurs when we construct a classifier based on an insufficient quantity of samples.
• As a result, the classifier may work well for the training dataset but fail to deliver an acceptable accuracy in the real world.
• For example, if we toss a fair coin two times, there is a 50% chance that we will observe the same side up in both tosses.
• Therefore, if we draw our conclusion on how fair the coin is with just two tosses, we may end up with overfitting the dataset.
• Overfitting is a serious problem in analyzing high-dimensional datasets, e.g. the microarray datasets.
Alternative Similarity Functions
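• Let $\langle v_{r,1}, v_{r,2}, \ldots, v_{r,n}\rangle$ and $\langle v_{t,1}, v_{t,2}, \ldots, v_{t,n}\rangle$ be the gene expression vectors, i.e. the feature vectors, of samples $S_r$ and $S_t$, respectively. Then, the following alternative similarity functions can be employed:
• Euclidean distance:
$$\text{dissimilarity} = \sqrt{\sum_{h=1}^{n} (v_{r,h} - v_{t,h})^2}$$
• Cosine:
$$\text{Similarity} = \frac{\sum_{h=1}^{n} v_{r,h}\,v_{t,h}}{\sqrt{\sum_{h=1}^{n} v_{r,h}^2}\;\sqrt{\sum_{h=1}^{n} v_{t,h}^2}}$$
• Correlation coefficient:
$$\text{Similarity} = \frac{\frac{1}{n-1}\sum_{h=1}^{n} (v_{r,h} - \hat\mu_r)(v_{t,h} - \hat\mu_t)}{\hat\sigma_r\,\hat\sigma_t},$$
where
$$\hat\mu_r = \frac{1}{n}\sum_{h=1}^{n} v_{r,h},\quad \hat\mu_t = \frac{1}{n}\sum_{h=1}^{n} v_{t,h},\quad \hat\sigma_r = \sqrt{\frac{1}{n-1}\sum_{h=1}^{n} (v_{r,h} - \hat\mu_r)^2},\quad \hat\sigma_t = \sqrt{\frac{1}{n-1}\sum_{h=1}^{n} (v_{t,h} - \hat\mu_t)^2}.$$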
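The three measures can be sketched in Python as follows. The function names are illustrative, and the correlation coefficient is written with the sample (n - 1) normalization in both the covariance and the standard deviations, which is an assumption; any consistent normalization gives the same value.

```python
import math


def euclidean_distance(r, t):
    """Dissimilarity: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r, t)))


def cosine_similarity(r, t):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(a * b for a, b in zip(r, t))
    norm_r = math.sqrt(sum(a * a for a in r))
    norm_t = math.sqrt(sum(b * b for b in t))
    return dot / (norm_r * norm_t)


def correlation(r, t):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(r)
    mean_r, mean_t = sum(r) / n, sum(t) / n
    cov = sum((a - mean_r) * (b - mean_t) for a, b in zip(r, t)) / (n - 1)
    sd_r = math.sqrt(sum((a - mean_r) ** 2 for a in r) / (n - 1))
    sd_t = math.sqrt(sum((b - mean_t) ** 2 for b in t) / (n - 1))
    return cov / (sd_r * sd_t)
```

Two expression vectors that differ only by a positive scale factor have cosine similarity and correlation close to 1 even though their Euclidean distance is non-zero, which is one reason the choice of measure matters for expression data.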
Importance of Feature Selection
• Inclusion of features that are not correlated to the classification decision may make the problem even more complicated.
• For example, in the data set shown on the following page, inclusion of the feature corresponding to the y-axis causes incorrect prediction of the marked test instance if a 3NN classifier is employed.
• It is apparent that the “o”s and “x”s are separated by x=10. If only the attribute corresponding to the x-axis were selected, then the 3NN classifier would predict the class of the test instance correctly.
[Figure: scatter plot with axes x and y and the separating line x=10.]
Linearly Separable and Non-Linearly Separable
• Some datasets are linearly separable.
• However, there are more datasets that are non-linearly separable.
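An Example of a Linearly Separable Data Set
[Figure: scatter plot with axes x and y.]
An Example of a Non-Linearly Separable Data Set
[Figure: scatter plot with axes x and y and the line x=10.]
A Simplest Case of Linearly Separable Data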
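Feature Selection Based on the Univariate Analysis
           Sample   A_i
Class 1    S11      v11
           S12      v12
           S13      v13
           :        :
Class 2    S21      v21
           S22      v22
           S23      v23
           :        :
Class 3    S31      v31
           S32      v32
           S33      v33
Let
$$\bar v_{i\cdot} = \frac{1}{n_i}\sum_{j=1}^{n_i} v_{ij};\qquad \bar v_{\cdot\cdot} = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} v_{ij};\qquad s^2 = \frac{1}{n-k}\sum_{i=1}^{k}\sum_{j=1}^{n_i} (v_{ij} - \bar v_{i\cdot})^2,$$
where $n = \sum_{i=1}^{k} n_i$.
Then, based on the F-test, the attribute is included if and only if
$$\frac{\frac{1}{k-1}\sum_{i=1}^{k} n_i\,(\bar v_{i\cdot} - \bar v_{\cdot\cdot})^2}{s^2} > F_{\text{threshold}}$$
(e.g. 3.07 for $n = 123$ and $k = 3$).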
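A small Python sketch of the F statistic used for this univariate test is shown below. It assumes the usual one-way ANOVA form (between-class variation divided by k - 1, over the pooled within-class variance divided by n - k); the function name `f_statistic` and the example values are hypothetical.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a single attribute.

    groups -- one list of attribute values per class
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-class variation, divided by its k - 1 degrees of freedom.
    between = sum(len(g) * (m - grand_mean) ** 2
                  for g, m in zip(groups, means)) / (k - 1)
    # Pooled within-class variance s^2, with n - k degrees of freedom.
    within = sum((v - m) ** 2
                 for g, m in zip(groups, means) for v in g) / (n - k)
    return between / within
```

The attribute is kept when the returned value exceeds the tabulated threshold for the chosen significance level; an attribute whose class means coincide gives a statistic of 0 and is dropped.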
An Example of Univariate Analysis
Sample X Y Class Sample X Y Class
1 7.1 9.1 1 11 10.9 8.8 2
2 6.7 10.2 1 12 10.8 10.3 2
3 7.5 10.6 1 13 11.1 11 2
4 7.6 8.8 1 14 12.3 9.1 2
5 8.1 10.3 1 15 12.1 9.7 2
6 8.0 11.0 1 16 12 10.9 2
7 8.6 8.9 1 17 13.1 8.9 2
8 8.7 9.8 1 18 12.8 10.1 2
9 9.2 11.2 1 19 13.2 11.3 2
10 6.5 10.1 1 20 13.7 9.9 2
Average 7.8 10.0 - Average 12.2 10.0 -
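Joint p.m.f. of X, Y, and C
[Figure: scatter plot in which each sample is marked with its class label (1 or 2); x-axis from 6 to 14, y-axis from 8 to 12.]
$$F_X = \frac{10\sum_{i=1}^{2} (\bar x_{i\cdot} - \bar x_{\cdot\cdot})^2}{s_X^2} = 24.16 > 4.35 \quad (\text{threshold for the } F \text{ statistic with } \alpha = 0.05).$$
$$F_Y = \frac{10\sum_{i=1}^{2} (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2}{s_Y^2} = 0.$$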
Blind Spot of the Univariate Analysis
• The univariate analysis is not able to identify crucial features in the following example:
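[Figure: scatter plot with axes x and y.]
The Demonstrative Data Set
Joint p.m.f. of X, Y, and C
[Figure: scatter plot with points labeled by class (1 or 2); x-axis from 0 to 6, y-axis from 0 to 12.]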
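• For Gene X,
$$\bar x_{\cdot\cdot} = \frac{10 \times 3.47 + 10 \times 2.84}{20} = 3.155$$
$$10\,(3.47 - 3.155)^2 + 10\,(2.84 - 3.155)^2 = 1.9845$$
$$F = \frac{1.9845}{1.84} = 1.0785 < 4.35\ (\text{threshold})$$
• For Gene Y,
$$\bar y_{\cdot\cdot} = \frac{10 \times 5.51 + 10 \times 6.57}{20} = 6.04$$
$$10\,(5.51 - 6.04)^2 + 10\,(6.57 - 6.04)^2 = 5.618$$
$$F = \frac{5.618}{5.33} = 1.054 < 4.35$$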
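• However, if we apply the following linear transformation, then we will be able to identify the significance of these two genes:
$$2\,(\text{Gene X}) - (\text{Gene Y})$$
• For “2 Gene X – Gene Y”,
$$\frac{10 \times 1.43 + 10 \times (-0.89)}{20} = 0.27$$
$$10\,(1.43 - 0.27)^2 + 10\,(-0.89 - 0.27)^2 = 26.912$$
$$F = \frac{26.912}{0.60} = 44.853 > 4.35$$
• On the other hand, if we employ linear operator (x+2y), then we obtain
$$\frac{10\,(\bar w_{1\cdot} - \bar w_{\cdot\cdot})^2 + 10\,(\bar w_{2\cdot} - \bar w_{\cdot\cdot})^2}{s_w^2} = 0.3267.$$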
• Accordingly, the issue now is how to determine the optimal linear operator of the form αx+βy for the projection.
• In the 2-D case, given a set of samples
{(x1,y1), (x2,y2),…, (xn,yn)},
then $v_i = x_i\cos\theta + y_i\sin\theta$ is the value obtained by projecting $(x_i, y_i)$ onto the line $x\sin\theta - y\cos\theta = 0$, i.e. the component of $(x_i, y_i)$ along the vector $(\cos\theta, \sin\theta)$, as shown on the following page.
Feature Selection with Independent Component Analysis (ICA)
• In recent years, ICA has emerged as a promising approach for carrying out multivariate analysis.
Basic Idea
• The ICA algorithm attempts to identify a plane such that, when we project the data set onto the plane, the resulting distribution is as non-Gaussian as possible.
A Measure of Non-Gaussianity
• The kurtosis is commonly employed to measure the non-Gaussianity of a data set.
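• The kurtosis of a dataset {v1, v2 ,… , vn} is
$$\frac{\sum_{i=1}^{n} (v_i - \bar v)^4}{(n-1)\,s^4} - 3,\quad\text{where } \bar v = \frac{1}{n}\sum_{i=1}^{n} v_i \text{ and } s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (v_i - \bar v)^2.$$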
• The expected value of the kurtosis of a set of random samples taken from a standard normal distribution is 0.
• If the kurtosis of a set of random samples is larger than 0, then the p.d.f. of the distribution is sharper than the standard normal distribution.
• If the kurtosis of a set of random samples is smaller than 0, then the p.d.f. of the distribution is flatter than the standard normal distribution.
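• Let kurt(θ) denote the kurtosis of {v1, v2 ,… , vn}, i.e.
$$\text{kurt}(\theta) = \frac{\sum_{i=1}^{n} (x_i\cos\theta + y_i\sin\theta - \bar v)^4}{(n-1)\,s^4} - 3,$$
where
$$\bar v = \frac{1}{n}\sum_{i=1}^{n} (x_i\cos\theta + y_i\sin\theta) \quad\text{and}\quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i\cos\theta + y_i\sin\theta - \bar v)^2.$$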
• The issue now is to find the value of θ that minimizes kurt(θ).
• This is an optimization problem.
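To make the objective concrete, the sketch below computes the kurtosis of the projected values $x_i\cos\theta + y_i\sin\theta$ and scans a grid of angles for the smallest value. It assumes the sample kurtosis $\sum_i (v_i - \bar v)^4 / ((n-1)s^4) - 3$ with $s^2$ the (n - 1)-normalized variance; the exhaustive grid search and the names `kurt_theta` and `best_theta` are illustrative stand-ins for a proper optimizer.

```python
import math


def kurtosis(values):
    """Sample kurtosis: sum((v - mean)^4) / ((n - 1) * s^4) - 3."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)
    return sum((v - mean) ** 4 for v in values) / ((n - 1) * s2 ** 2) - 3


def kurt_theta(points, theta):
    """Kurtosis of the 2-D points projected onto (cos theta, sin theta)."""
    projected = [x * math.cos(theta) + y * math.sin(theta) for x, y in points]
    return kurtosis(projected)


def best_theta(points, steps=180):
    """Angle in [0, pi) on a regular grid that minimizes kurt(theta)."""
    thetas = [math.pi * i / steps for i in range(steps)]
    return min(thetas, key=lambda t: kurt_theta(points, t))
```

For two groups of points separated along the x-axis, the projection onto θ = 0 is bimodal and therefore has strongly negative kurtosis, so the scan returns an angle at or near 0.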
The Optimization Problem
• The optimization problem is to find the global maximum/minimum of a function.
• There are several heuristic algorithms designed for solving the optimization problem, e.g.
• gradient descent;
• genetic algorithms;
• simulated annealing.
The Gradient Descent Algorithm
• In the gradient descent algorithm, a number of random samples are taken as the starting points.
• Then, we compute the gradient at each point and move in the direction of steepest descent, i.e. opposite to the gradient.
• This process is repeated a number of times until the convergence criterion is met.
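A 1-D Example
$$\theta_1:\ \text{move to the right to } \theta_1 - \delta\left.\frac{d\,\text{kurt}(\theta)}{d\theta}\right|_{\theta=\theta_1}$$
$$\theta_2:\ \text{move to the left to } \theta_2 - \delta\left.\frac{d\,\text{kurt}(\theta)}{d\theta}\right|_{\theta=\theta_2}$$
$$\theta_3:\ \text{move to the right to } \theta_3 - \delta\left.\frac{d\,\text{kurt}(\theta)}{d\theta}\right|_{\theta=\theta_3}$$
δ is a parameter that controls the step size.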
• The gradient descent algorithm can be applied to multidimensional functions. In such cases, partial differentiation is involved.
• If the gradient descent algorithm is to be employed, then we must be able to compute the gradient of the function at any point in the vector space.
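The 1-D update can be sketched in Python as below. This is a minimal illustration: the derivative is approximated by a central difference (an assumption, for when no analytic gradient is at hand), `delta` plays the role of the step-size parameter δ, and the stopping rule on the size of the move is one possible convergence criterion.

```python
def gradient_descent(f, theta, delta=0.1, eps=1e-6, tol=1e-8, max_iter=1000):
    """Minimize a 1-D function f starting from `theta`.

    delta -- step size
    eps   -- half-width of the central difference used for the derivative
    tol   -- stop when a move is smaller than this
    """
    for _ in range(max_iter):
        # Numerical approximation of df/dtheta at the current point.
        grad = (f(theta + eps) - f(theta - eps)) / (2 * eps)
        new_theta = theta - delta * grad  # step against the slope
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta
```

Because the procedure only finds a local minimum, it is normally run from several starting points, as the slide describes, and the best result is kept.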
Blind Spot of ICA
• However, ICA may fail on the following non-linearly separable dataset:
Data Clustering
• Data clustering concerns how to group a set of objects based on their similarity of attributes and/or their proximity in the vector space.
Model of Microarray Data Sets
           Sample   Gene 1   Gene 2   ‧‧‧   Gene n
Class 1    S11
           S12
           S13
           :
Class 2    S21               v_{i,j}
           S22
           S23
           :
Class 3    S31
           S32
           S33
Applications of Data Clustering in Microarray Data Analysis
• Data clustering has been employed in microarray data analysis for
• identifying the genes with similar expressions;
• identifying the subtypes of samples.
• For cluster analysis of samples, we can employ the feature selection mechanism developed for classification of samples.
• For cluster analysis of genes, each column of gene expression data is regarded as the feature vector of one gene.
The Agglomerative Hierarchical Clustering Algorithms
• The agglomerative hierarchical clustering algorithms operate by maintaining a sorted list of inter-cluster distances/similarities.
• Initially, each data instance forms a cluster.
• The clustering algorithm repetitively merges the two clusters with the minimum inter-cluster distance or the maximum inter-cluster similarity.
• Upon merging two clusters, the clustering algorithm computes the distances between the newly-formed cluster and the remaining clusters and maintains the sorted list of inter-cluster distances accordingly.
• There are a number of ways to define the inter-cluster distance:
• minimum distance/maximum similarity (single-link);
• maximum distance/minimum similarity (complete-link);
• average distance/average similarity;
• mean distance (applicable only with the vector space model).
• Given the following similarity matrix, we can apply the complete-link algorithm to obtain the dendrogram shown on the next slide.
g1 g2 g3 g4 g5 g6
g1 - 0.053 0.137 0.862 0.035 0.018
g2 - 0.046 0.072 0.816 0.606
g3 - 0.447 0.751 0.201
g4 - 0.291 0.156
g5 - 0.494
g6 -
• Assume that the complete-link algorithm is employed.
• If those similarity scores that are less than 0.3 are excluded, then we obtain 3 clusters {g1, g4}, {g2, g5, g6}, {g3}.
[Figure: complete-link dendrogram; merges occur at similarities 0.862, 0.816, 0.494, 0.137, and 0.018.]
• If the single-link algorithm is employed, then we obtain the following result.
[Figure: single-link dendrogram with leaves ordered g1, g4, g2, g5, g3, g6; merges occur at similarities 0.862, 0.816, 0.751, 0.606, and 0.447.]
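The merging procedure can be sketched in Python using the similarity matrix above. For brevity this sketch recomputes all inter-cluster similarities in every round instead of maintaining a sorted list, and the names `agglomerate` and `cluster_similarity` are hypothetical; merging stops once the best remaining similarity falls below the threshold.

```python
# Pairwise similarities from the matrix on the slide (upper triangle only).
SIM = {
    ("g1", "g2"): 0.053, ("g1", "g3"): 0.137, ("g1", "g4"): 0.862,
    ("g1", "g5"): 0.035, ("g1", "g6"): 0.018,
    ("g2", "g3"): 0.046, ("g2", "g4"): 0.072, ("g2", "g5"): 0.816,
    ("g2", "g6"): 0.606,
    ("g3", "g4"): 0.447, ("g3", "g5"): 0.751, ("g3", "g6"): 0.201,
    ("g4", "g5"): 0.291, ("g4", "g6"): 0.156,
    ("g5", "g6"): 0.494,
}


def sim(a, b):
    """Symmetric lookup of the similarity between two distinct items."""
    return SIM[(a, b)] if (a, b) in SIM else SIM[(b, a)]


def cluster_similarity(c1, c2, linkage):
    """Complete-link uses the minimum similarity, single-link the maximum."""
    scores = [sim(a, b) for a in c1 for b in c2]
    return min(scores) if linkage == "complete" else max(scores)


def agglomerate(items, threshold, linkage="complete"):
    """Merge the most similar pair of clusters until none reaches `threshold`."""
    clusters = [frozenset([item]) for item in items]
    while len(clusters) > 1:
        pairs = [(cluster_similarity(a, b, linkage), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        best, i, j = max(pairs)
        if best < threshold:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return set(clusters)
```

Calling `agglomerate(["g1", "g2", "g3", "g4", "g5", "g6"], 0.3)` performs the complete-link merges at 0.862, 0.816, and 0.494 and then stops, since the next best similarity is 0.137.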
Example of the Chaining Effect
Single-link (10 clusters)
Complete-link (2 clusters)
Effect of Bias towards Spherical Clusters
Single-link (2 clusters) Complete-link (2 clusters)
K-Means: A Partitional Data Clustering Algorithm
• The k-means algorithm is probably the most commonly used partitional clustering algorithm.
• The k-means algorithm begins with selecting k data instances as the means or centers of k clusters.
• The k-means algorithm then executes the following loop iteratively until the convergence criterion is met.
• repeat {
• assign every data instance to the closest cluster based on the distance between the data instance and the center of the cluster;
• compute the new centers of the k clusters;
• } until (the convergence criterion is met);
• A commonly-used convergence criterion is
$$E = \sum_{i}\sum_{p \in C_i} \|p - m_i\|^2,$$
where $m_i$ is the center of cluster $C_i$.
If the difference between the values of $E$ in two consecutive iterations is smaller than a threshold, then the algorithm terminates.
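The loop can be sketched in Python as follows. This is a simplified illustration: the initial centers are passed in by the caller, an empty cluster keeps its previous center (an assumption, since the slides do not cover that case), and the function name `k_means` and the default threshold are hypothetical.

```python
import math


def k_means(points, centers, threshold=1e-9, max_iter=100):
    """Lloyd-style k-means; returns (centers, clusters).

    Stops when the squared-error criterion E changes by less than
    `threshold` between two consecutive iterations.
    """
    prev_error = None
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster.
        centers = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else ctr
                   for pts, ctr in zip(clusters, centers)]
        # E = sum over clusters of squared distances to the cluster center.
        error = sum(math.dist(p, ctr) ** 2
                    for pts, ctr in zip(clusters, centers) for p in pts)
        if prev_error is not None and abs(prev_error - error) < threshold:
            break
        prev_error = error
    return centers, clusters
```

Since the result depends on the initial centers, a common precaution is to run the procedure from several initial selections and keep the clustering with the smallest E.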
Illustration of the K-Means Algorithm---(I)
[Figure: the data set with the three initial centers marked.]
Illustration of the K-Means Algorithm---(II)
[Figure: the three new centers after the 1st iteration, each marked "x".]
Illustration of the K-Means Algorithm---(III)
[Figure: the three new centers after the 2nd iteration.]
A Case in which the K-Means Algorithm Fails
• The K-means algorithm may converge to a local optimum, as the following example demonstrates:
[Figure: a data set with an initial selection of centers that leads to a poor clustering.]
Conclusions
• Machine learning algorithms have been widely exploited to tackle many important bioinformatics problems.