Machine Learning and its Applications in Bioinformatics

Yen-Jen Oyang

Dept. of Computer Science and Information Engineering

Observations and Challenges in the Information Age

• A huge volume of information has been and is being digitized and stored in the computer.

• Due to the volume of digitized information, effective exploitation of this information is beyond the capability of human beings without the aid of intelligent computer software.

An Example of Supervised Machine Learning (Data Classification)

• Given the data set shown on the next slide, can we figure out a set of rules that predict the classes of objects?

Data Set

Data      Class   Data      Class   Data      Class
(15,33)   O       (18,28)   ×       (16,31)   O
(9,23)    ×       (15,35)   O       (9,32)    ×
(8,15)    ×       (17,34)   O       (11,38)   ×
(11,31)   O       (18,39)   ×       (13,34)   O
(13,37)   ×       (14,32)   O       (19,36)   ×
(18,32)   O       (25,18)   ×       (10,34)   ×
(16,38)   ×       (23,33)   ×       (15,30)   O
(12,33)   O       (21,28)   ×       (13,22)   ×

Distribution of the Data Set

[Figure: scatter plot of the data set, with x-axis ticks at 10, 15, and 20 and a y-axis tick at 30; the two classes are drawn as "o" and "×".]

Rule Based on Observation

$$\text{If } (x - 15)^2 + (y - 30)^2 \le 25 \text{ and } y \ge 30, \text{ then class} = o; \text{ else class} = \times.$$
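The observation-based rule can be checked directly against the 24 training samples. The sketch below is only an illustration: it assumes the rule reads "inside the circle of radius 5 centered at (15, 30) and not below y = 30", and the names `DATA`, `rule`, and `misclassified` are hypothetical.

```python
# Training set from the "Data Set" slide: ((x, y), class).
DATA = [
    ((15, 33), "O"), ((18, 28), "X"), ((16, 31), "O"),
    ((9, 23), "X"), ((15, 35), "O"), ((9, 32), "X"),
    ((8, 15), "X"), ((17, 34), "O"), ((11, 38), "X"),
    ((11, 31), "O"), ((18, 39), "X"), ((13, 34), "O"),
    ((13, 37), "X"), ((14, 32), "O"), ((19, 36), "X"),
    ((18, 32), "O"), ((25, 18), "X"), ((10, 34), "X"),
    ((16, 38), "X"), ((23, 33), "X"), ((15, 30), "O"),
    ((12, 33), "O"), ((21, 28), "X"), ((13, 22), "X"),
]


def rule(x, y):
    """Assumed reading of the slide's rule: upper half of a circle around (15, 30)."""
    inside_circle = (x - 15) ** 2 + (y - 30) ** 2 <= 25
    return "O" if inside_circle and y >= 30 else "X"


# Collect every training sample on which the rule disagrees with the label.
misclassified = [(point, label) for point, label in DATA if rule(*point) != label]
print(len(misclassified), "of", len(DATA), "samples misclassified")
```

A rule that fits every training sample is not necessarily a good predictor for new objects, which is the point taken up later under overfitting.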

Rule Generated by a Kernel Density Estimation Based Algorithm

Let

$$f_o(v) = \sum_{i=1}^{10} \frac{1}{2\pi\sigma_{o_i}^2} \exp\left(-\frac{\|v - c_{o_i}\|^2}{2\sigma_{o_i}^2}\right) \quad\text{and}\quad f_x(v) = \sum_{j=1}^{14} \frac{1}{2\pi\sigma_{x_j}^2} \exp\left(-\frac{\|v - c_{x_j}\|^2}{2\sigma_{x_j}^2}\right).$$

If $f_o(v) \ge f_x(v)$, then prediction = "O".

Otherwise prediction = "X".

i     c_oi      σ_oi
1     (15,33)   1.723
2     (11,31)   2.745
3     (18,32)   2.327
4     (12,33)   1.794
5     (15,35)   1.973
6     (17,34)   2.045
7     (14,32)   1.794
8     (16,31)   1.794
9     (13,34)   1.794
10    (15,30)   2.027

j     c_xj      σ_xj
1     (9,23)    6.458
2     (8,15)    10.08
3     (13,37)   2.939
4     (16,38)   2.745
5     (18,28)   5.451
6     (18,39)   3.287
7     (25,18)   10.86
8     (23,33)   5.322
9     (21,28)   5.070
10    (9,32)    4.562
11    (11,38)   3.463
12    (19,36)   3.587
13    (10,34)   3.232
14    (13,22)   6.260
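The KDE-generated rule can be evaluated numerically from the tabulated centers and widths. The following is a minimal sketch, assuming each class density is a plain sum of two-dimensional Gaussians and that ties go to "O"; the names `O_KERNELS`, `X_KERNELS`, `density`, and `predict` are hypothetical.

```python
import math

# (center, sigma) pairs read from the slide, one kernel per training sample.
O_KERNELS = [
    ((15, 33), 1.723), ((11, 31), 2.745), ((18, 32), 2.327), ((12, 33), 1.794),
    ((15, 35), 1.973), ((17, 34), 2.045), ((14, 32), 1.794), ((16, 31), 1.794),
    ((13, 34), 1.794), ((15, 30), 2.027),
]
X_KERNELS = [
    ((9, 23), 6.458), ((8, 15), 10.08), ((13, 37), 2.939), ((16, 38), 2.745),
    ((18, 28), 5.451), ((18, 39), 3.287), ((25, 18), 10.86), ((23, 33), 5.322),
    ((21, 28), 5.070), ((9, 32), 4.562), ((11, 38), 3.463), ((19, 36), 3.587),
    ((10, 34), 3.232), ((13, 22), 6.260),
]


def density(v, kernels):
    """Sum of 2-D Gaussian kernels evaluated at point v."""
    total = 0.0
    for (cx, cy), sigma in kernels:
        dist_sq = (v[0] - cx) ** 2 + (v[1] - cy) ** 2
        total += math.exp(-dist_sq / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
    return total


def predict(v):
    """Predict "O" when the O-density is at least the X-density (assumed tie rule)."""
    return "O" if density(v, O_KERNELS) >= density(v, X_KERNELS) else "X"


print(predict((15, 33)), predict((8, 15)))
```

Isolated samples such as (8,15) and (25,18) carry large widths, so their kernels are broad and low, while samples in the dense "O" region carry narrow, tall kernels.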

Identifying Boundary of Different Classes of Objects

Boundary Identified

Problem Definition of Data Classification

• In a data classification problem, each object is described by a set of attribute values and each object belongs to one of the predefined classes.

• The goal is to derive a set of rules that predicts which class a new object should belong to, based on a given set of training samples. Data classification is also called supervised learning.

The Vector Space Model

• In the vector space model, each object is described by a number of numerical attributes/features.

• For example, the appearance of a person can be described by height, weight, and age.

• It is typical that the objects are described by a large number of attributes/features.

Transformation of Categorical Attributes into Numerical Attributes

• Represent the attribute values of the object in a binary table form as exemplified in the following:

Objects   Male/Female   High School Education   College Education   Graduate School Education
1         0             0                       1                   0
2         1             1                       0                   0
3         0             0                       0                   1

• Assign an appropriate weight to each column.

• Treat the weighted vector of each row as the feature vector of the corresponding object.

Objects   Male/Female   High School Education   College Education   Graduate School Education
1         0             0                       w3                  0
2         w1            w2                      0                   0
3         0             0                       0                   w4

Transformation of the Similarity/Dissimilarity Matrix Model

• In this model, a matrix records the similarity/dissimilarity scores between every pair of objects.

      P1    P2    P3    P4    P5    P6
P1    -     53    137   862   35    180
P2          -     46    72    816   606
P3                -     447   751   201
P4                      -     291   156
P5                            -     494
P6                                  -

• We may select P2, P5, P6 as representatives and use reciprocals of the similarity scores to these representatives to describe an object.

• For example, the feature vectors of P1 and P2 are <1/53, 1/35, 1/180> and <0, 1/816, 1/606>, respectively.
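The transformation from the score matrix to feature vectors can be sketched as follows. This is an illustration only: the names `SCORES`, `score`, and `feature_vector` are hypothetical, and the component for a representative compared with itself is set to 0, as in the P2 example.

```python
# Upper triangle of the score matrix from the slide.
SCORES = {
    ("P1", "P2"): 53, ("P1", "P3"): 137, ("P1", "P4"): 862, ("P1", "P5"): 35,
    ("P1", "P6"): 180, ("P2", "P3"): 46, ("P2", "P4"): 72, ("P2", "P5"): 816,
    ("P2", "P6"): 606, ("P3", "P4"): 447, ("P3", "P5"): 751, ("P3", "P6"): 201,
    ("P4", "P5"): 291, ("P4", "P6"): 156, ("P5", "P6"): 494,
}


def score(a, b):
    """Symmetric lookup into the upper-triangular score table."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def feature_vector(obj, representatives=("P2", "P5", "P6")):
    """Reciprocal scores to each representative; 0 for a representative itself."""
    return [0.0 if obj == rep else 1.0 / score(obj, rep) for rep in representatives]


print(feature_vector("P1"))
print(feature_vector("P2"))
```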

Applications of Data Classification in Bioinformatics

• In microarray data analysis, data classification is employed to predict the class of a new sample based on the existing samples with known class.

• Data classification has also been widely employed in prediction of protein family, protein fold, and protein secondary structure.

• For example, in the Leukemia data set, there are 72 samples and 7129 genes.

• 25 Acute Myeloid Leukemia (AML) samples.

• 38 B-cell Acute Lymphoblastic Leukemia samples.

• 9 T-cell Acute Lymphoblastic Leukemia samples.

Model of Microarray Data Sets

            Gene 1   Gene 2   ...   Gene n
Sample 1
Sample 2
   :                 M(i, j)
Sample m

$$M(i, j) \in \mathbb{R}.$$

Alternative Data Classification Algorithms

• Decision tree (C4.5 and C5.0);

• Instance-based learning(KNN);

• Naïve Bayesian classifier;

• RBF network;

• Support vector machine(SVM);

• Kernel Density Estimation (KDE) based classifier.

Instance-Based Learning

• In instance-based learning, we take k nearest training samples of a new instance (v1, v2, …, vm) and assign the new instance to the class that has most instances in the k nearest training samples.

• Classifiers that adopt instance-based learning are commonly called the KNN classifiers.

Example of the KNN Classifiers

• If a 1NN classifier is employed, then the predicted class of the query instance is "X".

• If a 3NN classifier is employed, then the predicted class of the query instance is "O".

Decision Function of the KNN Classifier

• Assume that there are two classes of samples, positive and negative.

• The decision function of a KNN classifier is:

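$$f(v) = \operatorname{sgn}\left(\sum_{i=1}^{k} \operatorname{sgn}(s_i)\right),$$

where $s_1, s_2, \ldots, s_k$ are the k nearest samples of query vector $v$, and $\operatorname{sgn}(s_i)$ is $+1$ for a positive sample and $-1$ for a negative sample.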
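The decision function amounts to a majority vote over signed labels. Below is a minimal sketch, assuming labels are encoded as +1 and -1, Euclidean distance is used, and a zero vote (possible only for even k) is resolved as -1; `knn_predict` and `SAMPLES` are hypothetical names, and the toy samples are made up for illustration.

```python
import math


def knn_predict(query, samples, k):
    """Return +1 or -1 from the summed labels of the k nearest samples."""
    nearest = sorted(samples, key=lambda s: math.dist(query, s[0]))[:k]
    vote = sum(label for _, label in nearest)
    return 1 if vote > 0 else -1  # assumed tie rule: a zero vote counts as -1


# Toy data: one positive sample sits closest to the query, two negatives come next.
SAMPLES = [((0, 0), +1), ((1, 0), -1), ((0, 1), -1), ((5, 5), +1)]
QUERY = (0.1, 0.1)

print(knn_predict(QUERY, SAMPLES, 1), knn_predict(QUERY, SAMPLES, 3))
```

As in the slide's example, the 1NN and 3NN predictions for the same query can differ, which is why k has to be chosen with care.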

Extension of the KNN Classifier

• We may extend the KNN classifier by weighting the contribution of each neighbor with a term related to its distance to the query vector:

$$f(v) = \operatorname{sgn}\left(\sum_{i=1}^{k} w_i \, \varphi_i(\|v - s_i\|) \operatorname{sgn}(s_i)\right),$$

where $s_1, s_2, \ldots, s_k$ are the k nearest samples of query vector $v$.

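An RBF Network Based Classifier with Gaussian Kernels

• It is typical that all $\varphi_1, \varphi_2, \ldots, \varphi_k$ are radial basis functions of the same form.

• With the popular Gaussian function, the decision function is of the following form:

$$f(v) = \operatorname{sgn}\left(\sum_{s_i} w_i \, e^{-\frac{\|v - s_i\|^2}{2\sigma_i^2}} - \sum_{s_j} w_j \, e^{-\frac{\|v - s_j\|^2}{2\sigma_j^2}}\right),$$

where $s_i$ ranges over the positive samples and $s_j$ over the negative samples.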

The Common Structure of the RBF Network Based Data Classifier

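[Figure: the query vector v feeds the kernel units φ1(v), φ2(v), ..., φk(v), whose outputs are combined with weights w_i and w_j to produce f(v).]

RBF Network Based Data Classifier:

$$\sum_{s_i} w_i \exp\left(-\frac{\|v - s_i\|^2}{2\sigma_i^2}\right) - \sum_{s_j} w_j \exp\left(-\frac{\|v - s_j\|^2}{2\sigma_j^2}\right)$$

• SVM with the RBF kernel: place a kernel function at each sample and employ a global σ, i.e., $\sigma_i = \sigma_j = \sigma$.

• Regularization Networks: $\sigma_i$ and $\sigma_j$ are set heuristically or $\sigma_i = \sigma_j$.

• Variable and adaptive KDE: place a kernel function at each sample, with a width $\sigma_h$ that varies from sample to sample.

• Fixed KDE: place a kernel function at each sample, with the same width $\sigma_h = \sigma$ for all samples.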

Regularization of a RBF Network Based Classifier

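• The conventional approaches proceed by either employing a constant σ for all kernel functions or employing a heuristic mechanism to set each $\sigma_i$ individually, e.g. a multiple of the average distance among samples, and attempt to minimize

$$E = \sum_{s_i} \left\| \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1k} \\ w_{21} & w_{22} & \cdots & w_{2k} \\ \vdots & \vdots & & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mk} \end{bmatrix} \begin{bmatrix} \varphi_1(s_i) \\ \varphi_2(s_i) \\ \vdots \\ \varphi_k(s_i) \end{bmatrix} - \begin{bmatrix} f_1(s_i) \\ f_2(s_i) \\ \vdots \\ f_m(s_i) \end{bmatrix} \right\|^2 + \gamma \sum_{j=1}^{m} \sum_{h=1}^{k} w_{jh}^2,$$

where $s_i$ is a learning sample.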

• The term $\gamma \sum_{j=1}^{m} \sum_{h=1}^{k} w_{jh}^2$ is included to avoid overfitting, and γ is to be set through cross validation.

2

ii

ii

ii

i

i

ii

k

h

hihi

s

ii

k

h

hihi

ii

k

h

hihi

i

k

h

iihh

ii

m

j

k

h

jh

k

h

iihh

fwww

E

fwww

E

fww

wfw

VV

w

wfw

w

E

ss

s

ss

s

s

ssss

ssss

ssss

sss

ss

)()()()(0

)()()()(0

Similarly,

)()()()(

02)()()(2

.tor column vec ofelement first theis )1( where

,0

)()(

0

1121

1

2121

2212

1

1212

1111

1

11

111

1

11

11

1 1

2

2

1

11

11

. and have weThen,

)()()(

)()()(

)()()(

Let W

1

21

22212

12111

21

22212

12111

21

22212

12111

T

ZWZW

fff

fff

fff

Z

www

www

www

TT

ikninin

ikii

ikii

kkkk

k

k

mkkk

m

m

sss

sss

sss

Decision Function of an SVM

• A prediction of the class of a new sample located at v in the vector space is based on the following rule:

$$\hat{f}(v) = \sum_{s_i} w_i \, e^{-\frac{\|v - s_i\|^2}{2\sigma^2}} - \sum_{s_j} w_j \, e^{-\frac{\|v - s_j\|^2}{2\sigma^2}} + b, \quad 0 \le w_i, w_j \le C.$$

If $\hat{f}(v) \ge 0$, then predicted class = "+".

Otherwise, predicted class = "−".

Those instances $s_i$ and $s_j$ with non-zero weights are called the support vectors.

The Kernel Density Estimation (KDE) Based Classifier

• The KDE based learning algorithm constructs one approximate probability density function for each class of objects.

• Classification of a new object is conducted based on the likelihood function:

$$L_m(v) = \frac{|S_m|}{|S|} \hat{f}_m(v),$$

where $|S_m|$ and $|S|$ are the number of training samples of class m and the total number of training samples of all classes, respectively, and $\hat{f}_m(v)$ is the approximate probability density function of class-m objects.

Identifying Boundary of Different Classes of Objects

Boundary Identified

Problem Definition of Kernel Density Estimation

• Given a set of samples $S = \{s_1, s_2, \ldots, s_n\}$ randomly taken from a probability distribution $f$, we want to find a set of symmetric kernel functions $K(\cdot\,; \mu_i, \sigma_i)$ and the corresponding weights $w_i$ such that

$$\hat{f}(v) = \sum_i w_i \, K(v; \mu_i, \sigma_i) \approx f(v).$$

The Proposed KDE Based Classifier

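• We employ the Gaussian function and set the width of each Gaussian function to a multiple of the average distance among neighboring samples:

$$\hat{f}(v) = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{1}{\sqrt{2\pi}\,\sigma_i}\right)^{d} \exp\left(-\frac{\|v - s_i\|^2}{2\sigma_i^2}\right),$$

where $\sigma_i = \beta\bar{\delta}_i$ and $\bar{\delta}_i$ is the average distance among the neighboring samples of $s_i$.

• $\bar{\delta}_i$ can be estimated as follows:

$$\bar{\delta}_i \approx \frac{R(s_i)\sqrt{\pi}}{\left((k+1)\,\Gamma\!\left(\frac{d}{2}+1\right)\right)^{1/d}},$$

where $\frac{\pi^{d/2} R(s_i)^d}{\Gamma\left(\frac{d}{2}+1\right)}$ is the volume of a sphere with radius $R(s_i)$ in a d-dimensional vector space and $R(s_i)$ is the distance from sample $s_i$ to its k-th nearest sample.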

Accuracy of Different Classification Algorithms

Accuracy (%) by classification algorithm:

Data set (training, test)    KDE      SVM      1NN      3NN
Satimage (4335, 2000)        92.30    91.30    89.35    90.6
Letter (15000, 5000)         97.12    97.98    95.26    95.46
Shuttle (43500, 14500)       99.94    99.92    99.91    99.92
Average                      96.45    96.40    94.84    95.33

Comparison of Execution Time (in seconds)

                     KDE without       KDE with          SVM
                     data reduction    data reduction
Cross validation
  Satimage           670               265               64622
  Letter             2825              1724              386814
  Shuttle            96795             59.9              467825
Make classifier
  Satimage           5.91              0.85              21.66
  Letter             17.05             6.48              282.05
  Shuttle            1745              0.69              129.84
Test
  Satimage           21.3              7.4               11.53
  Letter             128.6             51.74             94.91
  Shuttle            996.1             5.85              2.13

Parameter Setting through Cross Validation

• When carrying out data classification, we normally need to set one or more parameters associated with the data classification algorithm.

• For example, we need to set the value of k with the KNN classifier.

• The typical approach is to conduct cross validation to find out the optimal value.

• In the cross validation process, we set the parameters of the classifier to a particular combination of values that we are interested in and then evaluate how good the combination is based on alternative schemes.

• With the leave-one-out cross validation scheme, we attempt to predict the class of each sample using the remaining samples as the training data set.

• With 10-fold cross validation, we evenly divide the training data set into 10 subsets. Each time, we test the prediction accuracy of one of the 10 subsets using the other 9 subsets as the training set.
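The parameter search described above can be sketched for the KNN classifier. This is a minimal illustration under stated assumptions: folds are formed by striding through the sample list (index modulo the number of folds), setting the number of folds to the number of samples gives leave-one-out, and `knn_predict`, `cross_validation_accuracy`, `best_k`, and the toy `SAMPLES` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(query, train, k):
    """Majority class among the k training samples nearest to the query."""
    nearest = sorted(train, key=lambda s: math.dist(query, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_validation_accuracy(samples, k, folds):
    """Fraction of samples predicted correctly when each fold is held out in turn."""
    correct = 0
    for f in range(folds):
        test = [s for i, s in enumerate(samples) if i % folds == f]
        train = [s for i, s in enumerate(samples) if i % folds != f]
        correct += sum(knn_predict(p, train, k) == label for p, label in test)
    return correct / len(samples)


def best_k(samples, candidates, folds):
    """Candidate value of k with the highest cross-validation accuracy."""
    return max(candidates, key=lambda k: cross_validation_accuracy(samples, k, folds))


# Two well-separated toy classes on a line.
SAMPLES = [((float(i), 0.0), "a") for i in range(5)] + \
          [((100.0 + i, 0.0), "b") for i in range(5)]

print(cross_validation_accuracy(SAMPLES, 1, len(SAMPLES)))  # leave-one-out
print(cross_validation_accuracy(SAMPLES, 3, 5))             # 5-fold
print(best_k(SAMPLES, [1, 3, 9], len(SAMPLES)))
```

Leave-one-out uses the training data most fully but needs one training round per sample, so 10-fold cross validation is the usual compromise on larger data sets.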

Overfitting

• Overfitting occurs when we construct a classifier based on an insufficient quantity of samples.

• As a result, the classifier may work well for the training dataset but fail to deliver an acceptable accuracy in the real world.

• For example, if we toss a fair coin two times, there is a 50% chance that we will observe the same side up in both tosses.

• Therefore, if we draw our conclusion on how fair the coin is with just two tosses, we may end up with overfitting the dataset.

• Overfitting is a serious problem in analyzing high-dimensional datasets, e.g. the microarray datasets.

Alternative Similarity Functions

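• Let $\langle v_{r,1}, v_{r,2}, \ldots, v_{r,n} \rangle$ and $\langle v_{t,1}, v_{t,2}, \ldots, v_{t,n} \rangle$ be the gene expression vectors, i.e. the feature vectors, of samples $S_r$ and $S_t$, respectively. Then, the following alternative similarity functions can be employed:

• Euclidean distance:

$$\text{dissimilarity} = \sqrt{\sum_{h=1}^{n} (v_{r,h} - v_{t,h})^2}$$

• Cosine:

$$\text{similarity} = \frac{\sum_{h=1}^{n} v_{r,h} v_{t,h}}{\sqrt{\sum_{h=1}^{n} v_{r,h}^2}\,\sqrt{\sum_{h=1}^{n} v_{t,h}^2}}$$

• Correlation coefficient:

$$\text{similarity} = \frac{\frac{1}{n-1}\sum_{h=1}^{n} (v_{r,h} - \hat{\mu}_r)(v_{t,h} - \hat{\mu}_t)}{\hat{\sigma}_r \hat{\sigma}_t},$$

where

$$\hat{\mu}_r = \frac{1}{n}\sum_{h=1}^{n} v_{r,h}, \quad \hat{\mu}_t = \frac{1}{n}\sum_{h=1}^{n} v_{t,h}, \quad \hat{\sigma}_r = \sqrt{\frac{1}{n-1}\sum_{h=1}^{n} (v_{r,h} - \hat{\mu}_r)^2}, \quad \hat{\sigma}_t = \sqrt{\frac{1}{n-1}\sum_{h=1}^{n} (v_{t,h} - \hat{\mu}_t)^2}.$$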
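The three functions can be written out directly. The sketch below is illustrative only: it assumes the sample (n − 1) normalization for the correlation coefficient, and the function names `euclidean`, `cosine`, and `correlation` are hypothetical.

```python
import math


def euclidean(r, t):
    """Euclidean distance between two feature vectors (a dissimilarity)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r, t)))


def cosine(r, t):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(r, t))
    return dot / (math.sqrt(sum(a * a for a in r)) * math.sqrt(sum(b * b for b in t)))


def correlation(r, t):
    """Pearson correlation coefficient using sample standard deviations."""
    n = len(r)
    mean_r, mean_t = sum(r) / n, sum(t) / n
    std_r = math.sqrt(sum((a - mean_r) ** 2 for a in r) / (n - 1))
    std_t = math.sqrt(sum((b - mean_t) ** 2 for b in t) / (n - 1))
    covariance = sum((a - mean_r) * (b - mean_t) for a, b in zip(r, t)) / (n - 1)
    return covariance / (std_r * std_t)


print(euclidean([0, 0], [3, 4]))
print(cosine([1, 0], [0, 1]))
print(correlation([1, 2, 3], [2, 4, 6]))
```

Cosine and correlation ignore the overall magnitude of the expression vectors, whereas Euclidean distance does not, so the choice matters when samples differ mainly in scale.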

Importance of Feature Selection

• Inclusion of features that are not correlated to the classification decision may make the problem even more complicated.

• For example, in the data set shown on the following page, inclusion of the feature corresponding to the y-axis causes incorrect prediction of the marked test instance, if a 3NN classifier is employed.

• It is apparent that the "o"s and "x"s are separated by x = 10. If only the attribute corresponding to the x-axis were selected, then the 3NN classifier would predict the class of the marked test instance correctly.

[Figure: scatter plot with axes x and y; the vertical line x = 10 separates the two classes.]

Linearly Separable and Non-Linearly Separable

• Some datasets are linearly separable.

• However, there are more datasets that are non-linearly separable.

An Example of Linearly Separable

[Figure: scatter plot with axes x and y.]

An Example of Non-Linearly Separable

[Figure: scatter plot with axes x and y; the line x = 10 is marked.]

A Simplest Case of Linearly Separable

Feature Selection Based on the Univariate Analysis

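Class      Sample   A_i
Class 1    S11      v11
           S12      v12
           S13      v13
           :        :
Class 2    S21      v21
           S22      v22
           S23      v23
           :        :
Class 3    S31      v31
           S32      v32
           S33      v33

Let

$$\bar{v}_{i\cdot} = \frac{1}{n_i}\sum_{j=1}^{n_i} v_{ij}; \quad \bar{v}_{\cdot\cdot} = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} v_{ij}; \quad s^2 = \frac{1}{n-k}\sum_{i=1}^{k}\sum_{j=1}^{n_i} (v_{ij} - \bar{v}_{i\cdot})^2,$$

where $n = \sum_{i=1}^{k} n_i$.

Then, based on the F-test, the attribute is included if and only if

$$\frac{\frac{1}{k-1}\sum_{i=1}^{k} n_i (\bar{v}_{i\cdot} - \bar{v}_{\cdot\cdot})^2}{s^2} > \text{threshold}$$

(e.g. 3.07 for n = 123 and k = 3).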

An Example of Univariate Analysis

Sample X Y Class Sample X Y Class

1 7.1 9.1 1 11 10.9 8.8 2

2 6.7 10.2 1 12 10.8 10.3 2

3 7.5 10.6 1 13 11.1 11 2

4 7.6 8.8 1 14 12.3 9.1 2

5 8.1 10.3 1 15 12.1 9.7 2

6 8.0 11.0 1 16 12 10.9 2

7 8.6 8.9 1 17 13.1 8.9 2

8 8.7 9.8 1 18 12.8 10.1 2

9 9.2 11.2 1 19 13.2 11.3 2

10 6.5 10.1 1 20 13.7 9.9 2

Average 7.8 10.0 - Average 12.2 10.0 -

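Joint p.m.f. of X, Y, and C

[Figure: scatter plot of the 20 samples labelled by class (1 or 2); horizontal axis from 6 to 14, vertical axis from 8 to 12.]

For X:

$$\frac{10\sum_{i=1}^{2} (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})^2}{s_X^2} = 24.16 > 4.35 \quad (\text{threshold for the } F \text{ statistic with } \alpha = 0.05).$$

For Y:

$$\frac{10\sum_{i=1}^{2} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2}{s_Y^2} = 0.$$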
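The F statistic can be computed directly from the values in the example table. The sketch below is a hedged illustration that assumes the pooled within-class variance with n − k degrees of freedom, as in the general definition; the names `f_statistic`, `X1`, `X2`, `Y1`, and `Y2` are hypothetical. Under that assumption the statistic for X comes out far above the 4.35 threshold and the statistic for Y is essentially zero, so only X would be selected.

```python
def f_statistic(groups):
    """One-way F statistic: between-class variance over pooled within-class variance."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)) / (k - 1)
    within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g) / (n - k)
    return between / within


# Values of X and Y for class 1 (samples 1-10) and class 2 (samples 11-20).
X1 = [7.1, 6.7, 7.5, 7.6, 8.1, 8.0, 8.6, 8.7, 9.2, 6.5]
X2 = [10.9, 10.8, 11.1, 12.3, 12.1, 12.0, 13.1, 12.8, 13.2, 13.7]
Y1 = [9.1, 10.2, 10.6, 8.8, 10.3, 11.0, 8.9, 9.8, 11.2, 10.1]
Y2 = [8.8, 10.3, 11.0, 9.1, 9.7, 10.9, 8.9, 10.1, 11.3, 9.9]

f_x = f_statistic([X1, X2])
f_y = f_statistic([Y1, Y2])
print(round(f_x, 2), round(f_y, 2))
```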

Blind Spot of the Univariate Analysis

• The univariate analysis is not able to identify crucial features in the following example:

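[Figure: scatter plot with axes x and y.]

The Demonstrative Data Set

[Figure: joint p.m.f. of X, Y, and C; samples labelled by class (1 or 2), horizontal axis from 0 to 6, vertical axis from 0 to 12.]

• For Gene X,

$$\frac{10 \times 3.47 + 10 \times 2.84}{20} = 3.155,$$

$$10(3.47 - 3.155)^2 + 10(2.84 - 3.155)^2 = 1.9845,$$

$$\frac{1.9845}{1.84} = 1.0785 < 4.35 \ (\text{threshold}).$$

• For Gene Y,

$$\frac{10 \times 5.51 + 10 \times 6.57}{20} = 6.04,$$

$$10(5.51 - 6.04)^2 + 10(6.57 - 6.04)^2 = 5.618,$$

$$\frac{5.618}{5.33} = 1.054 < 4.35.$$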

• However, if we apply the following linear transformation, then we will be able to identify the significance of these two genes:

2(Gene X) − (Gene Y)

• For "2 Gene X − Gene Y",

$$\frac{10 \times 1.43 + 10 \times (-0.89)}{20} = 0.27,$$

$$10(1.43 - 0.27)^2 + 10(-0.89 - 0.27)^2 = 26.912,$$

$$\frac{26.912}{0.60} = 44.853 > 4.35.$$

• On the other hand, if we employ the linear operator (x + 2y), then we obtain

$$\frac{10(\bar{w}_{1\cdot} - \bar{w}_{\cdot\cdot})^2 + 10(\bar{w}_{2\cdot} - \bar{w}_{\cdot\cdot})^2}{s_w^2} = 0.3267.$$

• Accordingly, the question now is how to determine the optimal linear operator of the form αx + βy for the projection.

• In the 2-D case, given a set of samples {(x1, y1), (x2, y2), …, (xn, yn)}, the value v_i = cos θ · x_i + sin θ · y_i is obtained by projecting (x_i, y_i) onto the line sin θ · x − cos θ · y = 0, i.e. it is the component of (x_i, y_i) along the vector (cos θ, sin θ), as shown on the following page.

Feature Selection with Independent Components Analysis (ICA)

• In recent years, ICA has emerged as a promising approach for carrying out multivariate analysis.

Basic Idea

• The ICA algorithm attempts to identify a plane so that when we project the data set onto the plane, the distribution is most non-Gaussian.

A Measure of Non-Gaussianity

• The kurtosis is commonly employed to measure the non-Gaussianity of a data set.

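• The kurtosis of a dataset {v1, v2, …, vn} is

$$\frac{\sum_{i=1}^{n} (v_i - \bar{v})^4}{(n-1)s^4} - 3,$$

where

$$\bar{v} = \frac{1}{n}\sum_{i=1}^{n} v_i \quad\text{and}\quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (v_i - \bar{v})^2.$$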

• The expected value of the kurtosis of a set of random samples taken from a standard normal distribution is 0.

• If the kurtosis of a set of random samples is larger than 0, then the p.d.f. of the distribution is sharper than the standard normal distribution.

• If the kurtosis of a set of random samples is smaller than 0, then the p.d.f. of the distribution is flatter than the standard normal distribution.

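• Let kurt(θ) denote the kurtosis of {v1, v2, …, vn}, where $v_i = \cos\theta \, x_i + \sin\theta \, y_i$:

$$\mathrm{kurt}(\theta) = \frac{\sum_{i=1}^{n} (\cos\theta \, x_i + \sin\theta \, y_i - \bar{v})^4}{(n-1)s^4} - 3,$$

where

$$\bar{v} = \frac{1}{n}\sum_{i=1}^{n} (\cos\theta \, x_i + \sin\theta \, y_i) \quad\text{and}\quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (\cos\theta \, x_i + \sin\theta \, y_i - \bar{v})^2.$$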

• The issue now is to find the value of θ that minimizes kurt(θ).

• This is an optimization problem.
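The kurtosis of a projection can be evaluated for any angle θ. The sketch below follows the definition with the (n − 1) normalization; the names `kurtosis`, `kurt`, and the toy `POINTS` (two groups separated along the x-axis) are hypothetical.

```python
import math


def kurtosis(values):
    """Sample kurtosis: fourth central moment over (n - 1) * s^4, minus 3."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)
    return sum((v - mean) ** 4 for v in values) / ((n - 1) * s2 ** 2) - 3


def kurt(theta, points):
    """Kurtosis of the points projected onto the direction (cos theta, sin theta)."""
    return kurtosis([math.cos(theta) * x + math.sin(theta) * y for x, y in points])


# Two groups at x = -1 and x = +1, each spread over five y values.
POINTS = [(x, y) for x in (-1, 1) for y in (-2, -1, 0, 1, 2)]

print(kurt(0.0, POINTS), kurt(math.pi / 2, POINTS))
```

Projecting onto the x-axis (θ = 0) exposes the two-group structure and gives the lower kurtosis, which is why a minimizing θ is of interest here.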

The Optimization Problem

• The optimization problem is to find the global maximum/minimum of a function.

• There are several heuristic algorithms designed for solving the optimization problem, e.g.

• gradient descent;

• genetic algorithms;

• simulated annealing.

The Gradient Descent Algorithm

• In the gradient descent algorithm, a number of random samples are taken as the starting points.

• Then, we compute the gradient at each point and make a move in the direction in which the function decreases most steeply.

• This process is repeated a number of times until the convergence criterion is met.

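A 1-D Example

$$\theta_1: \text{move to the right, to } \theta_1 - \delta \, \frac{d\,\mathrm{kurt}(\theta_1)}{d\theta}$$

$$\theta_2: \text{move to the left, to } \theta_2 - \delta \, \frac{d\,\mathrm{kurt}(\theta_2)}{d\theta}$$

$$\theta_3: \text{move to the right, to } \theta_3 - \delta \, \frac{d\,\mathrm{kurt}(\theta_3)}{d\theta}$$

δ is a parameter that controls the step size.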

• The gradient descent algorithm can be applied to multidimensional functions. In such cases, partial differentiation is involved.

• If the gradient descent algorithm is to be employed, then we must be able to compute the gradient of the function at any point in the vector space.
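A one-dimensional version of the procedure can be sketched as follows. It is an illustration under stated assumptions: the slope is approximated by a central difference rather than an analytic derivative, the starting points are fixed rather than random so that the run is repeatable, and `gradient_descent`, `multi_start`, and `double_well` are hypothetical names.

```python
def gradient_descent(f, start, step=0.05, max_iter=1000, tol=1e-10, h=1e-6):
    """Minimize a 1-D function by repeatedly moving against the slope."""
    theta = start
    for _ in range(max_iter):
        slope = (f(theta + h) - f(theta - h)) / (2 * h)  # numerical derivative
        new_theta = theta - step * slope
        if abs(new_theta - theta) < tol:  # convergence criterion: tiny move
            return new_theta
        theta = new_theta
    return theta


def multi_start(f, starts, **kwargs):
    """Run gradient descent from several starting points and keep the best result."""
    return min((gradient_descent(f, s, **kwargs) for s in starts), key=f)


def double_well(t):
    """Test function with a lower minimum near t = -1 and a higher one near t = +1."""
    return (t * t - 1) ** 2 + 0.3 * t


best = multi_start(double_well, starts=[-2.0, 2.0])
print(gradient_descent(double_well, 2.0), best)
```

Starting from 2.0 alone ends in the higher of the two minima, which is why several starting points are used.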

Blind Spot of ICA

• However, ICA may fail on the following non-linearly separable dataset:

Data Clustering

• Data clustering concerns how to group a set of objects based on their similarity of attributes and/or their proximity in the vector space.

Model of Microarray Data Sets

Class      Sample   Gene 1   Gene 2   ...   Gene n
Class 1    S11
           S12
           S13
           :
Class 2    S21               v_{i,j}
           S22
           S23
           :
Class 3    S31
           S32
           S33

Applications of Data Clustering in Microarray Data Analysis

• Data clustering has been employed in microarray data analysis for

• identifying the genes with similar expressions;

• identifying the subtypes of samples.

• For cluster analysis of samples, we can employ the feature selection mechanism developed for classification of samples.

• For cluster analysis of genes, each column of gene expression data is regarded as the feature vector of one gene.

The Agglomerative Hierarchical Clustering Algorithms

• The agglomerative hierarchical clustering algorithms operate by maintaining a sorted list of inter-cluster distances/similarities.

• Initially, each data instance forms a cluster.

• The clustering algorithm repetitively merges the two clusters with the minimum inter-cluster distance or the maximum inter-cluster similarity.

• Upon merging two clusters, the clustering algorithm computes the distances between the newly-formed cluster and the remaining clusters and maintains the sorted list of inter-cluster distances accordingly.

• There are a number of ways to define the inter-cluster distance:

• minimum distance/maximum similarity (single-link);

• maximum distance/minimum similarity (complete-link);

• average distance/average similarity;

• mean distance (applicable only with the vector space model).

• Given the following similarity matrix, we can apply the complete-link algorithm to obtain the dendrogram shown on the next slide.

g1 g2 g3 g4 g5 g6

g1 - 0.053 0.137 0.862 0.035 0.018

g2 - 0.046 0.072 0.816 0.606

g3 - 0.447 0.751 0.201

g4 - 0.291 0.156

g5 - 0.494

g6 -

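• Assume that the complete-link algorithm is employed.

[Dendrogram, complete-link: g1 and g4 merge at 0.862; g2 and g5 merge at 0.816; g6 joins {g2, g5} at 0.494; g3 joins {g1, g4} at 0.137; the two remaining clusters merge at 0.018.]

• If those similarity scores that are less than 0.3 are excluded, then we obtain 3 clusters: {g1, g4}, {g2, g5, g6}, {g3}.

• If the single-link algorithm is employed, then we obtain the following result.

[Dendrogram, single-link: g1 and g4 merge at 0.862; g2 and g5 merge at 0.816; g3 joins {g2, g5} at 0.751; g6 joins {g2, g3, g5} at 0.606; the two remaining clusters merge at 0.447.]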
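Both dendrograms can be reproduced from the similarity matrix. The sketch below is a simplified illustration: it rescans all cluster pairs at every step instead of maintaining a sorted list, takes the inter-cluster similarity as the minimum (complete-link) or maximum (single-link) pairwise score, and uses the hypothetical names `ITEMS`, `ROWS`, `SIM`, and `agglomerate`.

```python
from itertools import combinations

ITEMS = ["g1", "g2", "g3", "g4", "g5", "g6"]
# Upper triangle of the similarity matrix, row by row.
ROWS = {
    "g1": [0.053, 0.137, 0.862, 0.035, 0.018],
    "g2": [0.046, 0.072, 0.816, 0.606],
    "g3": [0.447, 0.751, 0.201],
    "g4": [0.291, 0.156],
    "g5": [0.494],
}
SIM = {}
for idx, name in enumerate(ITEMS[:-1]):
    for other, value in zip(ITEMS[idx + 1:], ROWS[name]):
        SIM[frozenset((name, other))] = value


def agglomerate(items, sim, linkage, threshold=0.0):
    """Merge the most similar pair of clusters until none reaches the threshold.

    linkage=min gives complete-link, linkage=max gives single-link.
    Returns the merge similarities in order and the final clusters.
    """
    clusters = [frozenset([item]) for item in items]
    heights = []

    def cluster_sim(a, b):
        return linkage(sim[frozenset((p, q))] for p in a for q in b)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: cluster_sim(*pair))
        best = cluster_sim(a, b)
        if best < threshold:
            break
        heights.append(best)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return heights, clusters


complete_heights, _ = agglomerate(ITEMS, SIM, min)
single_heights, _ = agglomerate(ITEMS, SIM, max)
_, cut_clusters = agglomerate(ITEMS, SIM, min, threshold=0.3)
print(complete_heights)
print(single_heights)
print(sorted(sorted(c) for c in cut_clusters))
```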

Example of the Chaining Effect

Single-link (10 clusters)

Complete-link (2 clusters)

Effect of Bias towards Spherical Clusters

Single-link (2 clusters) Complete-link (2 clusters)

K-Means: A Partitional Data Clustering Algorithm

• The k-means algorithm is probably the most commonly used partitional clustering algorithm.

• The k-means algorithm begins with selecting k data instances as the means or centers of k clusters.

• The k-means algorithm then executes the following loop iteratively until the convergence criterion is met.

• repeat {

• assign every data instance to the closest cluster based on the distance between the data instance and the center of the cluster;

• compute the new centers of the k clusters;

• } until (the convergence criterion is met);

• A commonly-used convergence criterion is based on

$$E = \sum_{i} \sum_{p \in C_i} \|p - m_i\|^2,$$

where $m_i$ is the center of cluster $C_i$. If the difference between the values of E in two consecutive iterations is smaller than a threshold, then the algorithm terminates.
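The loop and its convergence criterion can be sketched as follows. This is a minimal illustration: the initial centers are passed in explicitly, an empty cluster simply keeps its previous center, and the names `kmeans` and `POINTS` as well as the toy data are hypothetical.

```python
import math


def kmeans(points, centers, threshold=1e-9, max_iter=100):
    """Alternate assignment and center updates until E stops decreasing."""
    prev_error = math.inf
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster.
        centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else old
            for members, old in zip(clusters, centers)
        ]
        # E: sum of squared distances from every point to its cluster center.
        error = sum(math.dist(p, m) ** 2 for members, m in zip(clusters, centers) for p in members)
        if prev_error - error < threshold:
            break
        prev_error = error
    return centers, clusters


POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(POINTS, centers=[(0, 0), (10, 10)])
print(centers)
```

The result depends on the initial centers, so in practice the algorithm is usually run from several initial selections.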

Illustration of the K-Means Algorithm---(I)

initial center

initial center initial center

Illustration of the K-Means Algorithm---(II)

x

x

x

new center after 1st iteration

Illustration of the K-Means Algorithm---(III)

new center after 2nd iteration

A Case in which the K-Means Algorithm Fails

• The K-means algorithm may converge to a locally optimal state, as the following example demonstrates:

Initial Selection

Conclusions

• Machine learning algorithms have been widely exploited to tackle many important bioinformatics problems.
