A Gentle Introduction to Support Vector Machines in Biomedicine · 2010-10-27
Transcript of "A Gentle Introduction to Support Vector Machines in Biomedicine"
A Gentle Introduction to Support Vector Machines in Biomedicine
Alexander Statnikov*, Douglas Hardin#, Isabelle Guyon†, Constantin F. Aliferis*
(Materials about SVM clustering were contributed by Nikita Lytkin*)
*New York University, #Vanderbilt University, †ClopiNet
Part I
• Introduction
• Necessary mathematical concepts
• Support vector machines (SVMs) for binary classification: classical formulation
• Basic principles of statistical machine learning
Introduction
About this tutorial
• Main goal: Fully understand support vector machines (and important extensions) with a modicum of mathematics knowledge.
• This tutorial is both modest (it does not invent anything new) and ambitious (support vector machines are generally considered mathematically quite difficult to grasp).
• Our educational approach: start from the basic problem, present the main idea of the SVM solution with its geometrical interpretation and the necessary math/theory and algorithms, and then cover extensions and case studies.
Data-analysis problems of interest
1. Build computational classification models (or "classifiers") that assign objects (patients/samples) into two or more classes.
‐ Classifiers can be used for diagnosis, outcome prediction, and other classification tasks.
‐ E.g., build a decision-support system to diagnose primary and metastatic cancers from gene expression profiles of the patients:
[Figure: a biopsy from a patient with lung cancer yields a gene expression profile, which a classifier model labels as primary lung cancer or metastatic lung cancer.]
Data-analysis problems of interest
[Figure: further classification examples. A model classifies articles as relevant or irrelevant to clinical trials. A patient's blood sample is analyzed by mass spectrometry to obtain a proteomics profile, and a model predicts "will respond to treatment", "will not respond to treatment", or "cannot make decision".]
Data-analysis problems of interest
2. Build computational regression models to predict values of some continuous response variable or outcome.
‐ Regression models can be used to predict survival, length of stay in the hospital, laboratory test values, etc.
‐ E.g., build a decision-support system to predict the optimal dosage of a drug to be administered to the patient. This dosage is determined by the values of patient biomarkers, and clinical and demographics data:
[Figure: a patient's biomarkers, clinical and demographics data (e.g., 1, 2.2, 3, 423, 2, 3, 92, 2, 1, 8) are fed into a regression model, which outputs "optimal dosage is 5 IU/Kg/week".]
Data-analysis problems of interest
3. Out of all measured variables in the dataset, select the smallest subset of variables that is necessary for the most accurate prediction (classification or regression) of some variable of interest (e.g., phenotypic response variable).
‐ E.g., find the most compact panel of breast cancer biomarkers from microarray gene expression data for 20,000 genes:
[Figure: heatmap of gene expression in breast cancer tissues vs. normal tissues.]
Data-analysis problems of interest
4. Build a computational model to identify novel or outlier objects (patients/samples).
‐ Such models can be used to discover deviations in sample handling protocol when doing quality control of assays, etc.
‐ E.g., build a decision-support system to identify aliens.
[Figure: a group of humans and three outliers labeled Alien #1, Alien #2, and Alien #3.]
Data-analysis problems of interest
5. Group objects (patients/samples) into several clusters based on their similarity.
‐ These methods can be used to discover disease sub-types and for other tasks.
‐ E.g., consider clustering of brain tumor patients into 4 clusters based on their gene expression profiles. All patients have the same pathological sub-type of the disease, and clustering discovers new molecular disease subtypes that happen to have different characteristics in terms of patient survival and time to recurrence after treatment.
[Figure: patients grouped into Cluster #1 through Cluster #4.]
Basic principles of classification
• Want to classify objects as boats and houses.
[Figure: a scene containing boats and houses.]
Basic principles of classification
• All objects before the coast line are boats and all objects after the coast line are houses.
• The coast line serves as a decision surface that separates the two classes.
Basic principles of classification
[Figure: a poorly placed decision surface; these boats will be misclassified as houses.]
Basic principles of classification
[Figure: another poorly placed decision surface; this house will be misclassified as a boat.]
Basic principles of classification
[Figure: boats and houses plotted by longitude and latitude.]
• The methods that build classification models ("classification algorithms") operate very similarly to the previous example.
• First, all objects are represented geometrically.
Basic principles of classification
[Figure: boats and houses plotted by longitude and latitude with a separating line.]
• Then the algorithm seeks to find a decision surface that separates the classes of objects.
Basic principles of classification
[Figure: new, unlabeled objects ("?") plotted by longitude and latitude relative to the decision surface; those above it are classified as houses, those below it as boats.]
• Unseen (new) objects are classified as "boats" if they fall below the decision surface and as "houses" if they fall above it.
The Support Vector Machine (SVM) approach
• Support vector machines (SVMs) is a binary classification algorithm that offers a solution to problem #1.
• Extensions of the basic SVM algorithm can be applied to solve problems #1-#5.
• SVMs are important because of (a) theoretical reasons:
‐ Robust to very large number of variables and small samples
‐ Can learn both simple and highly complex classification models
‐ Employ sophisticated mathematical principles to avoid overfitting
and (b) superior empirical results.
Main ideas of SVMs
[Figure: cancer patients and normal patients plotted by the expression of gene X and gene Y.]
• Consider an example dataset described by 2 genes, gene X and gene Y.
• Represent patients geometrically (by "vectors" or "points").
Main ideas of SVMs
• Find a linear decision surface ("hyperplane") that can separate patient classes and has the largest distance (i.e., largest "gap" or "margin") between border-line patients (i.e., "support vectors").
Main ideas of SVMs
• If such a linear decision surface does not exist, the data is mapped into a much higher-dimensional space ("feature space") where the separating decision surface is found;
• The feature space is constructed via a very clever mathematical projection ("kernel trick").
[Figure: a kernel maps the gene X / gene Y data into a feature space where cancer and normal patients are separated by a decision surface.]
History of SVMs and usage in the literature
• Support vector machine classifiers have a long history of development starting from the 1950's.
• The most important milestone for development of modern SVMs is the 1992 paper by Boser, Guyon, and Vapnik ("A training algorithm for optimal margin classifiers").
[Timeline figure listing key contributors from 1950 to 1995: Nachman Aronszajn, Frank Rosenblatt, Alexander Y. Lerner, Mark A. Aizerman, Emanuel M. Braverman, Lev I. Rozonoer, Thomas M. Cover, Fred W. Smith, Aleksey Y. Chervonenkis, Grace Wahba, Tomaso Poggio, Federico Girosi, Bernard E. Boser, Isabelle Guyon, Corina Cortes, and Vladimir N. Vapnik (years shown: 1950, 1957, 1963, 1964, 1965, 1968, 1974-1979, 1975, 1990, 1992, 1995).]
History of SVMs and usage in the literature
[Bar chart: "Use of Support Vector Machines in the Literature", publications per year 1998-2007. General sciences: 359, 621, 906, 1,430, 2,330, 3,530, 4,950, 6,660, 8,180, 8,860. Biomedicine: 4, 12, 46, 99, 201, 351, 521, 726, 917, 1,190.]
History of SVMs and usage in the literature
[Bar chart: "Use of Linear Regression in the Literature", publications per year 1998-2007 in general sciences vs. biomedicine; both series range from roughly 9,770-14,900 papers in 1998 to roughly 19,000-24,100 papers in 2007.]
Necessary mathematical concepts
How to represent samples geometrically? Vectors & points in n-dimensional space (Rn)
• Assume that a sample/patient is described by n characteristics ("features" or "variables").
• Vector representation: Every sample/patient is a vector in Rn with tail at the point with 0 coordinates and arrow-head at the point with the feature values.
• Example: Consider a patient described by 2 features: Systolic BP = 110 and Age = 29. This patient can be represented as a vector in R2:
[Figure: an arrow from the origin (0, 0) to the point (110, 29) in the Systolic BP / Age plane.]
How to represent samples geometrically? Vectors & points in n-dimensional space (Rn)
• Point representation: Every sample/patient is represented as a point in the n-dimensional space (Rn) with coordinates given by the values of its features.
• Example: Consider a patient described by 2 features: Systolic BP = 110 and Age = 29. This patient can be represented as a point in R2:
[Figure: the point (110, 29) in the Systolic BP / Age plane.]
How to represent samples geometrically? Vectors & points in n-dimensional space (Rn)
Patient id | Cholesterol (mg/dl) | Systolic BP (mmHg) | Age (years) | Tail of the vector | Arrow-head of the vector
1          | 150                 | 110                | 35          | (0,0,0)            | (150, 110, 35)
2          | 250                 | 120                | 30          | (0,0,0)            | (250, 120, 30)
3          | 140                 | 160                | 65          | (0,0,0)            | (140, 160, 65)
4          | 300                 | 180                | 45          | (0,0,0)            | (300, 180, 45)
Purpose of vector representation
• Having represented each sample/patient as a vector allows us now to geometrically represent the decision surface that separates two groups of samples/patients.
[Figure: a decision surface in R2 (a line) and a decision surface in R3 (a plane).]
• In order to define the decision surface, we need to introduce some basic math elements...
Basic operations on vectors in Rn
1. Multiplication by a scalar
Consider a vector a = (a1, a2, ..., an) and a scalar c.
Define: c·a = (c·a1, c·a2, ..., c·an)
When you multiply a vector by a scalar, you "stretch" it in the same or opposite direction depending on whether the scalar is positive or negative.
Example: for a = (1, 2) and c = 2, c·a = (2, 4); for c = -1, c·a = (-1, -2).
[Figure: the vectors a, 2a, and -a plotted in the plane.]
Basic operations on vectors in Rn
2. Addition
Consider vectors a = (a1, a2, ..., an) and b = (b1, b2, ..., bn).
Define: a + b = (a1 + b1, a2 + b2, ..., an + bn)
Recall addition of forces in classical mechanics.
Example: a = (1, 2), b = (3, 0), a + b = (4, 2).
[Figure: the parallelogram rule for a + b.]
Basic operations on vectors in Rn
3. Subtraction
Consider vectors a = (a1, a2, ..., an) and b = (b1, b2, ..., bn).
Define: a - b = (a1 - b1, a2 - b2, ..., an - bn)
What vector do we need to add to b to get a? I.e., similar to subtraction of real numbers.
Example: a = (1, 2), b = (3, 0), a - b = (-2, 2).
[Figure: the vectors a, b, and a - b plotted in the plane.]
Basic operations on vectors in Rn
4. Euclidian length or L2-norm
Consider a vector a = (a1, a2, ..., an).
Define the L2-norm: ||a||_2 = sqrt(a1^2 + a2^2 + ... + an^2)
We often denote the L2-norm without the subscript, i.e. ||a||.
Example: a = (1, 2), ||a||_2 = sqrt(5) ≈ 2.24; the length of this vector is ≈ 2.24.
The L2-norm is a typical way to measure the length of a vector; other methods to measure length also exist.
Basic operations on vectors in Rn
5. Dot product
Consider vectors a = (a1, a2, ..., an) and b = (b1, b2, ..., bn).
Define the dot product: a · b = a1·b1 + a2·b2 + ... + an·bn = Σ_i ai·bi
The law of cosines says that a · b = ||a||_2 ||b||_2 cos θ, where θ is the angle between a and b. Therefore, when the vectors are perpendicular, a · b = 0.
Examples:
a = (1, 2), b = (3, 0): a · b = 1·3 + 2·0 = 3
a = (0, 2), b = (3, 0): a · b = 0·3 + 2·0 = 0 (perpendicular vectors)
Basic operations on vectors in Rn
5. Dot product (continued)
a · b = a1·b1 + a2·b2 + ... + an·bn = Σ_i ai·bi
• Property: a · a = a1·a1 + a2·a2 + ... + an·an = ||a||_2^2
• In the classical regression equation y = w · x + b, the response variable y is just a dot product of the vector representing patient characteristics (x) and the regression weights vector (w), which is common across all patients, plus an offset b.
Hyperplanes as decision surfaces
• A hyperplane is a linear decision surface that splits the space into two parts;
• It is obvious that a hyperplane is a binary classifier.
[Figure: a hyperplane in R2 is a line; a hyperplane in R3 is a plane.]
A hyperplane in Rn is an n-1 dimensional subspace.
Equation of a hyperplane
First we show the definition of a hyperplane by an interactive demonstration: http://www.dsl-lab.org/svm_tutorial/planedemo.html
Source: http://www.math.umn.edu/~nykamp/
Equation of a hyperplane
Consider the case of R3:
An equation of a hyperplane is defined by a point (P0) and a vector (w) perpendicular to the plane at that point.
Define vectors: x0 = OP0 and x = OP, where P is an arbitrary point on the hyperplane.
A condition for P to be on the plane is that the vector x - x0 is perpendicular to w:
w · (x - x0) = 0, or equivalently w · x - w · x0 = 0
Define b = -w · x0. Then the hyperplane is: w · x + b = 0
The above equations also hold for Rn when n > 3.
Equation of a hyperplane
Example: w = (4, 1, 6), P0 = (0, 1, 7)
b = -w · x0 = -(0 + 1 + 42) = -43
The hyperplane is: w · x - 43 = 0, i.e. (4, 1, 6) · (x(1), x(2), x(3)) - 43 = 0, i.e. 4x(1) + x(2) + 6x(3) - 43 = 0
[Figure: the hyperplane w · x - 43 = 0, with the parallel hyperplanes w · x - 50 = 0 in the + direction and w · x - 10 = 0 in the - direction.]
What happens if the b coefficient changes? The hyperplane moves along the direction of w. We obtain "parallel hyperplanes".
The distance between two parallel hyperplanes w · x + b1 = 0 and w · x + b2 = 0 is equal to D = |b1 - b2| / ||w||.
(Derivation of the distance between two parallel hyperplanes)
Consider two parallel hyperplanes: w · x + b1 = 0 and w · x + b2 = 0.
Take a point x1 on the first hyperplane and move along the direction of w until reaching the second hyperplane at x2 = x1 + t·w, for some scalar t. The distance between the hyperplanes is then D = ||t·w|| = |t| ||w||.
Since x2 lies on the second hyperplane:
w · x2 + b2 = 0
w · (x1 + t·w) + b2 = 0
w · x1 + t·||w||^2 + b2 = 0
-b1 + t·||w||^2 + b2 = 0          (because w · x1 + b1 = 0)
t = (b1 - b2) / ||w||^2
Therefore D = |t| ||w|| = |b1 - b2| / ||w||.
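A minimal numerical check of this formula (an illustrative sketch, not from the original slides), using the example hyperplane above:

```python
import numpy as np

w = np.array([4.0, 1.0, 6.0])
b1, b2 = -43.0, -50.0

# Closed-form distance between the parallel hyperplanes w·x + b1 = 0 and w·x + b2 = 0
D = abs(b1 - b2) / np.linalg.norm(w)

# Numerical check: start from a point x1 on the first hyperplane and move along w
x1 = np.array([0.0, 1.0, 7.0])              # w·x1 - 43 = 0
t = (b1 - b2) / np.dot(w, w)
x2 = x1 + t * w                             # lands on the second hyperplane
print(D, np.linalg.norm(x2 - x1))           # both ≈ 0.96
print(np.dot(w, x2) + b2)                   # ≈ 0, confirming x2 is on the second hyperplane
```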
Recap
We know...
• How to represent patients (as "vectors")
• How to define a linear decision surface ("hyperplane")
We need to know...
• How to efficiently compute the hyperplane that separates two classes with the largest "gap"?
Need to introduce basics of relevant optimization theory.
[Figure: cancer patients and normal patients plotted by gene X and gene Y with a maximum-margin separating line.]
Basics of optimization: Convex functions
• A function is called convex if the function lies below the straight line segment connecting two points, for any two points in the interval.
• Property: Any local minimum is a global minimum!
[Figure: a convex function with a single global minimum vs. a non-convex function with a local minimum and a global minimum.]
Basics of optimization: Quadratic programming (QP)
• Quadratic programming (QP) is a special optimization problem: the function to optimize ("objective") is quadratic, subject to linear constraints.
• Convex QP problems have convex objective functions.
• These problems can be solved easily and efficiently by greedy algorithms (because every local minimum is a global minimum).
Basics of optimization: Example QP problem
Consider x = (x1, x2).
Minimize (1/2)||x||^2 subject to x1 + x2 - 1 ≥ 0
(quadratic objective, linear constraints)
This is a QP problem, and it is a convex QP, as we will see later.
We can rewrite it as:
Minimize (1/2)(x1^2 + x2^2) subject to x1 + x2 - 1 ≥ 0
(quadratic objective, linear constraints)
Basics of optimization: Example QP problem
[Figure: the surface f(x1, x2) = (1/2)(x1^2 + x2^2) together with the constraint x1 + x2 - 1 ≥ 0.]
The solution is x1 = 1/2 and x2 = 1/2.
Congratulations! You have mastered all math elements needed to understand support vector machines.
Now, let us strengthen your knowledge by a quiz.
Quiz
1) Consider the hyperplane shown in white. It is defined by the equation w · x - 10 = 0. Which of the three other hyperplanes can be defined by the equation w · x - 3 = 0?
‐ Orange
‐ Green
‐ Yellow
[Figure: four parallel hyperplanes and the vector w.]
2) What is the dot product between vectors a = (3, 3) and b = (-1, 1)?
[Figure: the vectors a and b plotted in the plane.]
Quiz
3) What is the dot product between vectors a = (3, 3) and b = (-1, 0)?
[Figure: the vectors a and b plotted in the plane.]
4) What is the length of the vector a = (2, 0), and what is the length of all other red vectors in the figure?
[Figure: several red vectors of the same length pointing in different directions.]
Quiz
5) Which of the four functions is/are convex?
[Figure: four functions labeled 1, 2, 3, and 4.]
Support vector machines (SVMs) for binary classification: classical formulation
Case 1: Linearly separable data; "Hard-margin" linear SVM
Given training data: x1, x2, ..., xN ∈ Rn and y1, y2, ..., yN ∈ {-1, +1}
• Want to find a classifier (hyperplane) to separate negative objects (y = -1) from positive ones (y = +1).
• An infinite number of such hyperplanes exist.
• SVMs find the hyperplane that maximizes the gap between data points on the boundaries (so-called "support vectors").
• If the points on the boundaries are not informative (e.g., due to noise), SVMs may not do well.
Statement of linear SVM classifier
[Figure: positive objects (y = +1) and negative objects (y = -1) separated by the hyperplane w · x + b = 0, with the parallel hyperplanes w · x + b = -1 and w · x + b = +1 passing through the support vectors.]
The gap is the distance between the parallel hyperplanes w · x + b = -1 and w · x + b = +1,
or equivalently: w · x + (b + 1) = 0 and w · x + (b - 1) = 0.
We know that the distance between parallel hyperplanes is D = |b1 - b2| / ||w||.
Therefore: D = 2 / ||w||
Since we want to maximize the gap, we need to minimize ||w||, or equivalently minimize (1/2)||w||^2 (the factor 1/2 is convenient for taking the derivative later on).
Statement of linear SVM classifier
In addition we need to impose constraints that all objects are correctly classified. In our case:
w · xi + b ≥ +1 if yi = +1
w · xi + b ≤ -1 if yi = -1
Equivalently: yi (w · xi + b) ≥ 1
In summary:
Want to minimize (1/2)||w||^2 subject to yi (w · xi + b) ≥ 1 for i = 1,...,N
Then, given a new object x, the classifier is f(x) = sign(w · x + b).
SVM optimization problem: Primal formulation
Minimize (1/2) Σ_i wi^2 subject to yi (w · xi + b) - 1 ≥ 0 for i = 1,...,N
(quadratic objective function, linear constraints)
• This is called the "primal formulation of linear SVMs".
• It is a convex quadratic programming (QP) optimization problem with n variables (wi, i = 1,...,n), where n is the number of features in the dataset.
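For readers who want to try this on a computer, the sketch below (not part of the original slides) approximates a hard-margin linear SVM with scikit-learn by using a very large C, and recovers the learned w and b:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable 2-D data (hypothetical values for illustration)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # negative class
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # positive class
y = np.array([-1, -1, -1, +1, +1, +1])

# A very large C approximates the hard-margin SVM (no margin violations allowed)
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
print("prediction for (3, 3):", clf.predict([[3.0, 3.0]]))  # sign(w·x + b)
```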
SVM optimization problem: Dual formulation
• The previous problem can be recast in the so-called "dual form", giving rise to the "dual formulation of linear SVMs".
• It is also a convex quadratic programming problem, but with N variables (αi, i = 1,...,N), where N is the number of samples.
Maximize Σ_i αi - (1/2) Σ_{i,j} αi αj yi yj (xi · xj) subject to αi ≥ 0 and Σ_i αi yi = 0.
(quadratic objective function, linear constraints)
Then the w-vector is defined in terms of αi: w = Σ_i αi yi xi
And the solution becomes: f(x) = sign(Σ_i αi yi (xi · x) + b)
SVM optimization problem: Benefits of using dual formulation
1) No need to access original data, need to access only dot products.
Objective function: Σ_i αi - (1/2) Σ_{i,j} αi αj yi yj (xi · xj)
Solution: f(x) = sign(Σ_i αi yi (xi · x) + b)
2) Number of free parameters is bounded by the number of support vectors and not by the number of variables (beneficial for high-dimensional problems!).
E.g., if a microarray dataset contains 20,000 genes and 100 patients, then we need to find only up to 100 parameters!
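As an illustration of the dual solution (again a sketch with hypothetical data, not from the original slides), scikit-learn exposes the products αi·yi for the support vectors as `dual_coef_`, so the dual decision function can be reproduced from dot products alone:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

x_new = np.array([3.0, 3.0])
# dual_coef_[0] holds alpha_i * y_i for each support vector
dual_decision = np.sum(clf.dual_coef_[0] * (clf.support_vectors_ @ x_new)) + clf.intercept_[0]

print(np.sign(dual_decision))           # prediction from the dual form
print(clf.decision_function([x_new]))   # matches the library's decision value
```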
(Derivation of dual formulation)
Minimize (1/2) Σ_i wi^2 subject to yi (w · xi + b) - 1 ≥ 0 for i = 1,...,N
(quadratic objective function, linear constraints)
Apply the method of Lagrange multipliers.
Define the Lagrangian:
Λ_P(w, b, α) = (1/2) Σ_i wi^2 - Σ_i αi [yi (w · xi + b) - 1]
where w is a vector with n elements and α is a vector with N elements.
We need to minimize this Lagrangian with respect to w, b and simultaneously require that the derivative with respect to α vanishes, all subject to the constraints that αi ≥ 0.
(Derivation of dual formulation)
If we set the derivatives with respect to w, b to 0, we obtain:
∂Λ_P(w, b, α)/∂b = 0  ⇒  Σ_i αi yi = 0
∂Λ_P(w, b, α)/∂w = 0  ⇒  w = Σ_i αi yi xi
We substitute the above into the equation for Λ_P(w, b, α) and obtain the "dual formulation of linear SVMs":
Λ_D(α) = Σ_i αi - (1/2) Σ_{i,j} αi αj yi yj (xi · xj)
We seek to maximize the above Lagrangian with respect to α, subject to the constraints that αi ≥ 0 and Σ_i αi yi = 0.
Case 2: Not linearly separable data; "Soft-margin" linear SVM
What if the data is not linearly separable? E.g., there are outliers or noisy measurements, or the data is slightly non-linear.
Want to handle this case without changing the family of decision functions.
Approach:
Assign a "slack variable" ξi ≥ 0 to each object, which can be thought of as the distance from the separating hyperplane if an object is misclassified, and 0 otherwise.
Want to minimize (1/2)||w||^2 + C Σ_i ξi subject to yi (w · xi + b) ≥ 1 - ξi for i = 1,...,N
Then, given a new object x, the classifier is f(x) = sign(w · x + b).
Two formulations of soft-margin linear SVM
Primal formulation:
Minimize (1/2) Σ_i wi^2 + C Σ_i ξi subject to yi (w · xi + b) ≥ 1 - ξi for i = 1,...,N
(objective function, constraints)
Dual formulation:
Maximize Σ_i αi - (1/2) Σ_{i,j} αi αj yi yj (xi · xj) subject to 0 ≤ αi ≤ C and Σ_i αi yi = 0 for i = 1,...,N.
(objective function, constraints)
Parameter C in soft-margin SVM
Minimize (1/2)||w||^2 + C Σ_i ξi subject to yi (w · xi + b) ≥ 1 - ξi for i = 1,...,N
• When C is very large, the soft-margin SVM is equivalent to the hard-margin SVM;
• When C is very small, we admit misclassifications in the training data at the expense of having a w-vector with small norm;
• C has to be selected for the distribution at hand, as will be discussed later in this tutorial.
[Figure: decision boundaries for C = 100, C = 1, C = 0.15, and C = 0.1.]
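The effect of C can be reproduced with a few lines of scikit-learn; this is an illustrative sketch on synthetic data (the C values mirror the figure above, everything else is assumed):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data with some overlap
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in [100, 1, 0.15, 0.1]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: ||w|| = {np.linalg.norm(w):.3f}, "
          f"margin = {2 / np.linalg.norm(w):.3f}, "
          f"support vectors = {len(clf.support_)}")
# Smaller C -> smaller ||w|| (wider margin) and more margin violations / support vectors.
```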
Case 3: Not linearly separable data; Kernel trick
[Figure: tumor and normal samples plotted by gene X and gene Y; the data is not linearly separable in the input space, but it is linearly separable in the feature space obtained by a kernel.]
Kernel trick
Original data x (in input space)            →  Data Φ(x) in a higher-dimensional feature space
f(x) = sign(w · x + b)                      →  f(x) = sign(w · Φ(x) + b)
w = Σ_i αi yi xi                            →  w = Σ_i αi yi Φ(xi)
f(x) = sign(Σ_i αi yi (xi · x) + b)         →  f(x) = sign(Σ_i αi yi (Φ(xi) · Φ(x)) + b) = sign(Σ_i αi yi K(xi, x) + b)
Therefore, we do not need to know Φ explicitly, we just need to define the function K(·, ·): Rn × Rn → R.
Not every function Rn × Rn → R can be a valid kernel; it has to satisfy so-called Mercer conditions. Otherwise, the underlying quadratic program may not be solvable.
Popular kernels
A kernel is a dot product in some feature space: K(xi, xj) = Φ(xi) · Φ(xj)
Examples:
Linear kernel:       K(xi, xj) = xi · xj
Gaussian kernel:     K(xi, xj) = exp(-γ ||xi - xj||^2)
Exponential kernel:  K(xi, xj) = exp(-γ ||xi - xj||)
Polynomial kernel:   K(xi, xj) = (p + xi · xj)^q
Hybrid kernel:       K(xi, xj) = (p + xi · xj)^q exp(-γ ||xi - xj||^2)
Sigmoidal kernel:    K(xi, xj) = tanh(k (xi · xj) - δ)
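A brief sketch (not from the original slides) showing how several of these kernels are selected in scikit-learn; the data and parameter values are placeholders:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Concentric-circles data: not linearly separable in the input space
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

kernels = {
    "linear":            SVC(kernel="linear", C=1.0),
    "polynomial (q=3)":  SVC(kernel="poly", degree=3, coef0=1.0, C=1.0),
    "Gaussian (RBF)":    SVC(kernel="rbf", gamma=1.0, C=1.0),
    "sigmoidal":         SVC(kernel="sigmoid", coef0=-1.0, C=1.0),
}
for name, clf in kernels.items():
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:18s} cross-validated accuracy = {acc:.2f}")
# The Gaussian kernel separates the circles well; the linear kernel cannot.
```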
Understanding the Gaussian kernel
Consider the Gaussian kernel: K(xi, xj) = exp(-γ ||xi - xj||^2)
[Figure: geometric interpretation of the mapping induced by the Gaussian kernel.]
• All data points in the input space are mapped to a multi-dimensional sphere.
• The parameter γ defines the relative location of the points on the sphere.
Understanding the polynomial kernel
Consider the polynomial kernel: K(xi, xj) = (1 + xi · xj)^3
Assume that we are dealing with 2-dimensional data (i.e., in R2). Where will this kernel map the data?
The kernel maps the 2-dimensional space (x(1), x(2)) to a 10-dimensional space whose coordinates are built from the monomials:
1, x(1), x(2), x(1)x(2), x(1)^2, x(2)^2, x(1)^2 x(2), x(1) x(2)^2, x(1)^3, x(2)^3
Example of benefits of using a kernel
• Data is not linearly separable in the input space (R2).
• Apply the kernel K(x, z) = (x · z)^2 to map the data to a higher-dimensional space (3-dimensional) where it is linearly separable.
K(x, z) = (x · z)^2 = (x(1) z(1) + x(2) z(2))^2
        = x(1)^2 z(1)^2 + 2 x(1) x(2) z(1) z(2) + x(2)^2 z(2)^2
        = (x(1)^2, √2 x(1) x(2), x(2)^2) · (z(1)^2, √2 z(1) z(2), z(2)^2)
        = Φ(x) · Φ(z)
Example of benefits of using a kernel
Therefore, the explicit mapping is Φ(x) = (x(1)^2, √2 x(1) x(2), x(2)^2).
[Figure: four points x1, x2, x3, x4 that are not linearly separable in the input space (x(1), x(2)) become linearly separable after the kernel mapping to the space (x(1)^2, √2 x(1) x(2), x(2)^2).]
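The identity K(x, z) = Φ(x) · Φ(z) for this kernel is easy to verify numerically; here is a short sketch (illustrative, with arbitrary example points):

```python
import numpy as np

def phi(v):
    """Explicit feature map for the kernel K(x, z) = (x · z)^2 on R^2."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def K(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(K(x, z))                  # kernel value computed in the input space: 1.0
print(np.dot(phi(x), phi(z)))   # dot product in the feature space: also 1.0
```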
Comparison with methods from classical statistics & regression
• Need ≥ 5 samples for each parameter of the regression model to be estimated:
Number of variables | Polynomial degree | Number of parameters | Required sample
2                   | 3                 | 10                   | 50
10                  | 3                 | 286                  | 1,430
10                  | 5                 | 3,003                | 15,015
100                 | 3                 | 176,851              | 884,255
100                 | 5                 | 96,560,646           | 482,803,230
• SVMs do not have such a requirement and often require a much smaller sample than the number of variables, even when a high-degree polynomial kernel is used.
Basic principles of statistical machine learning
Generalization and overfitting
• Generalization: A classifier or a regression algorithm learns to correctly predict output from given inputs not only in previously seen objects but also in previously unseen objects.
• Overfitting: A classifier or a regression algorithm learns to correctly predict output from given inputs in previously seen objects but fails to do so in previously unseen objects.
• Overfitting ⇒ poor generalization.
Example of overfitting and generalization
There is a linear relationship between predictor and outcome (plus some Gaussian noise).
[Figure: training data and test data plotted as outcome of interest Y vs. predictor X, with the fits of Algorithm 1 (a wiggly curve) and Algorithm 2 (a straight line).]
• Algorithm 1 learned non-reproducible peculiarities of the specific sample available for learning but did not learn the general characteristics of the function that generated the data. Thus, it is overfitted and has poor generalization.
• Algorithm 2 learned general characteristics of the function that produced the data. Thus, it generalizes.
Example of overfitting and generalizationp g g
True relationshipTraining data and decision
f f l ith #1Training data and decision
f f l ith #2p
surface of algorithm #1 surface of algorithm #2
A general principle here (Occam’s razor) is that the likelihood of overfitting increases with the “complexity” of an algorithm
73
overfitting increases with the complexity of an algorithm
"Loss + penalty" paradigm for learning to avoid overfitting and ensure generalization
• Many statistical learning algorithms (including SVMs) search for a decision function by solving the following optimization problem:
Minimize (Loss + λ Penalty)
• Loss measures the error of fitting the data
• Penalty penalizes the complexity of the learned function
• λ is the regularization parameter that balances Loss and Penalty
"Loss + penalty" paradigm for learning to avoid overfitting and ensure generalization
[Figure: cancer and normal samples plotted by gene X and gene Y, separated either by a straight line or by a wiggled line.]
The classifier that is represented by a straight line is considered simpler than the classifier represented by a wiggled line (because we need less information to represent it).
SVMs in "loss + penalty" form
SVMs build the following classifiers: f(x) = sign(w · x + b)
Consider the soft-margin linear SVM formulation:
Find w and b that
Minimize (1/2)||w||^2 + C Σ_i ξi subject to yi (w · xi + b) ≥ 1 - ξi for i = 1,...,N
This can also be stated as:
Find w and b that
Minimize Σ_i [1 - yi f(xi)]_+ + λ ||w||^2
(Loss: "hinge loss"; Penalty: ||w||^2)
(in fact, one can show that λ = 1/(2C)).
Meaning of SVM loss function
Consider the loss function: Σ_i [1 - yi f(xi)]_+
• Recall that [...]_+ indicates the positive part
• For a given sample/patient i, the loss is non-zero if 1 - yi f(xi) > 0
• In other words, yi f(xi) < 1
• Since yi ∈ {-1, +1}, this means that the loss is non-zero if
f(xi) < 1 for yi = +1
f(xi) > -1 for yi = -1
• In other words, the loss is non-zero if
w · xi + b < 1 for yi = +1
w · xi + b > -1 for yi = -1
Meaning of SVM loss function
[Figure: positive objects (y = +1) and negative objects (y = -1) with the hyperplanes w · x + b = -1, w · x + b = 0, and w · x + b = +1 dividing the space into regions 1, 2, 3, 4.]
• If the object is negative, it is penalized only in regions 2, 3, 4
• If the object is positive, it is penalized only in regions 1, 2, 3
Flexibility of "loss + penalty" framework
Minimize (Loss + λ Penalty)
Part 2
• Model selection for SVMs
• Extensions to the basic SVM model:
1. SVMs for multicategory data
2. Support vector regression
3. Novelty detection with SVM-based methods
4. Support vector clustering
5. SVM-based variable selection
6. Computing posterior class probabilities for SVM classifiers
Model selection for SVMs
Need for model selection for SVMs
[Figure: two gene expression datasets plotted by gene 1 and gene 2.]
• In the first dataset it is impossible to find a linear SVM classifier that separates tumors from normals! Need a non-linear SVM classifier, e.g. SVM with polynomial kernel of degree 2 solves this problem without errors.
• In the second dataset we should not apply a non-linear SVM classifier, since we can perfectly solve this problem using a linear SVM classifier!
A data-driven approach for model selection for SVMs
• Do not know a priori what type of SVM kernel and what kernel parameter(s) to use for a given dataset?
• Need to examine various combinations of parameters, e.g. consider searching the following grid of (parameter C, polynomial degree d) pairs:
(0.1, 1) (1, 1) (10, 1) (100, 1) (1000, 1)
(0.1, 2) (1, 2) (10, 2) (100, 2) (1000, 2)
(0.1, 3) (1, 3) (10, 3) (100, 3) (1000, 3)
(0.1, 4) (1, 4) (10, 4) (100, 4) (1000, 4)
(0.1, 5) (1, 5) (10, 5) (100, 5) (1000, 5)
• How to search this grid while producing an unbiased estimate of classification performance?
Nested cross-validation
Recall the main idea of cross-validation: the data is repeatedly split into training and testing sets, and performance is averaged over the testing sets.
What combination of SVM parameters to apply on the training data? Perform a "grid search" using another, nested loop of cross-validation, splitting each training set further into training and validation parts.
Example of nested cross-validation
Consider that we use 3-fold cross-validation and we want to optimize parameter C, which takes values "1" and "2".
Outer loop (data split into parts P1, P2, P3):
Training set | Testing set | C | Accuracy | Average accuracy
P1, P2       | P3          | 1 | 89%      |
P1, P3       | P2          | 2 | 84%      | 83%
P2, P3       | P1          | 1 | 76%      |
Inner loop (for one outer training set, e.g. P1 and P2):
Training set | Validation set | C | Accuracy | Average accuracy
P1           | P2             | 1 | 86%      | 85%  ← choose C = 1
P2           | P1             | 1 | 84%      |
P1           | P2             | 2 | 70%      | 80%
P2           | P1             | 2 | 90%      |
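In scikit-learn this nested scheme can be written compactly; the sketch below (not from the original slides, parameters are placeholders) wraps a grid search over SVM parameters inside an outer cross-validation loop so that the reported accuracy stays unbiased:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: grid search over SVM parameters on the training folds only
param_grid = {"C": [0.1, 1, 10, 100, 1000], "degree": [1, 2, 3, 4, 5]}
inner = GridSearchCV(SVC(kernel="poly", coef0=1.0), param_grid, cv=3)

# Outer loop: estimates performance of the whole "grid search + SVM" procedure
outer_scores = cross_val_score(inner, X, y, cv=3)
print("Nested cross-validation accuracy: %.3f" % outer_scores.mean())
```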
On use of cross-validation
• Empirically we found that cross-validation works well for model selection for SVMs in many biomedical problem domains;
• Many other approaches that can be used for model selection for SVMs exist, e.g.:
 Virtual leave-one-out cross-validation
 Generalized cross-validation
 Bayesian information criterion (BIC)
 Minimum description length (MDL)
 Vapnik-Chervonenkis (VC) dimension
 Bootstrap
SVMs for multicategory data
One-versus-rest multicategory SVM method
[Figure: three classes of samples (Tumor I, Tumor II, Tumor III) plotted by gene 1 and gene 2; a binary SVM separates each class from the rest, and new samples ("?") are classified using these hyperplanes.]
One-versus-one multicategory SVM method
[Figure: the same three classes; a binary SVM is built for every pair of classes (Tumor I vs. II, I vs. III, II vs. III), and new samples ("?") are classified using the pairwise classifiers.]
DAGSVM multicategory SVM method
[Figure: a decision DAG for classifying leukemia samples. The root node applies the binary SVM "AML vs. ALL T-cell"; depending on the outcome ("not AML" or "not ALL T-cell") the sample is passed to "ALL B-cell vs. ALL T-cell" or "AML vs. ALL B-cell", and the leaves assign the final class: ALL T-cell, ALL B-cell, or AML.]
SVM multicategory methods by Weston and Watkins and by Crammer and Singer
[Figure: the three tumor classes plotted by gene 1 and gene 2 with decision boundaries obtained by solving a single joint optimization problem; new samples ("?") are classified by these boundaries.]
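scikit-learn exposes the one-versus-rest and one-versus-one strategies directly; the following sketch (illustrative, not from the original slides) compares them on a three-class problem. DAGSVM and the Weston-Watkins / Crammer-Singer formulations are not built into scikit-learn, so they are not shown.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # three classes, standing in for Tumor I/II/III

ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0))  # one binary SVM per class
ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0))   # one binary SVM per pair of classes

print("one-versus-rest accuracy:", cross_val_score(ovr, X, y, cv=5).mean())
print("one-versus-one accuracy: ", cross_val_score(ovo, X, y, cv=5).mean())
```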
Support vector regression
ε-Support vector regression (ε-SVR)
Given training data: x1, x2, ..., xN ∈ Rn and y1, y2, ..., yN ∈ R
Main idea:
Find a function f(x) = w · x + b that approximates y1,...,yN:
• it has at most ε deviation from the true values yi
• it is as "flat" as possible (to avoid overfitting)
E.g., build a model to predict survival of cancer patients that can admit a one month error (= ε).
[Figure: data points and the fitted line y = f(x) with a tube of width ±ε around it.]
ε-Support vector regression (ε-SVR)
[Figure: the same data with the fitted function y = f(x) and the ε-tube around it.]
Formulation of "hard-margin" ε-SVR
Find f(x) = w · x + b
by minimizing (1/2)||w||^2 subject to constraints:
yi - (w · xi + b) ≤ ε
(w · xi + b) - yi ≤ ε
for i = 1,...,N.
I.e., the difference between yi and the fitted function should be smaller than ε and larger than -ε, i.e. the fitted function should go through the ε-neighborhood of all points yi.
Influence of parameter ε
[Figure: the fitted function y = f(x) and its ε-tube for ε = 0.05, 0.10, 0.15, 0.20, 0.25, 0.30.]
Formulation of "soft-margin" ε-SVR
If we have points like this (e.g., outliers or noise) we can either:
a) increase ε to ensure that these points are within the new ε-neighborhood, or
b) assign a penalty ("slack" variable) to each of these points (as was done for "soft-margin" SVMs)
[Figure: data points, several of which lie outside the ε-tube around the fitted line.]
Formulation of "soft-margin" ε-SVR
Find f(x) = w · x + b
by minimizing (1/2)||w||^2 + C Σ_i (ξi + ξi*)
subject to constraints:
yi - (w · xi + b) ≤ ε + ξi
(w · xi + b) - yi ≤ ε + ξi*
ξi, ξi* ≥ 0
for i = 1,...,N.
[Figure: data points labeled with their slack values, e.g. (0, 0) for points inside the ε-tube and values such as (0.2, 0) or (0, 0.3) for points outside it.]
• ξi = 0 if the linear function is above the lower boundary of the ε-neighborhood; otherwise it denotes the distance from the linear function to the lower boundary.
• ξi* = 0 if the linear function is below the upper boundary of the ε-neighborhood; otherwise it denotes the distance from the linear function to the upper boundary.
• Notice that only points outside the ε-neighborhood are penalized!
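A small sketch of ε-SVR with scikit-learn (illustrative only; the data is synthetic and the parameter values are placeholders), showing how ε and C enter the call:

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic 1-D regression data: a linear trend plus Gaussian noise
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, size=(60, 1)), axis=0)
y = 0.8 * X.ravel() + 2.0 + rng.normal(scale=0.5, size=60)

# Linear epsilon-SVR: epsilon sets the width of the tube, C penalizes the slack
svr = SVR(kernel="linear", C=1.0, epsilon=0.5).fit(X, y)

print("w =", svr.coef_[0], "b =", svr.intercept_[0])
print("fraction of points outside the eps-tube:",
      np.mean(np.abs(svr.predict(X) - y) > 0.5))
```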
Nonlinear ε-SVR
[Figure: in the input space no linear relation exists between x and y; after mapping the data with Φ(x), a linear relation exists in the feature space and is found there; it corresponds to a non-linear relation y = f(x) back in the input space.]
ε-Support vector regression in "loss + penalty" form
Build a decision function of the form: f(x) = w · x + b
Find w and b that
Minimize Σ_i max(0, |yi - f(xi)| - ε) + λ ||w||^2
(Loss: "linear ε-insensitive loss"; Penalty: ||w||^2)
[Figure: the linear ε-insensitive loss as a function of the error in approximation; the loss is zero for errors within ±ε and grows linearly outside.]
Comparing loss functions of regression methods
[Figure: four loss functions plotted against the error in approximation: linear ε-insensitive loss, quadratic ε-insensitive loss, mean squared error, and mean linear error.]
Comparing ε-SVR with popular regression methods
Applying ε-SVR to real data
In the absence of domain knowledge about decision functions, it is recommended to optimize the following parameters (e.g., by cross-validation using grid-search):
• parameter C
• parameter ε
• kernel parameters (e.g., degree of polynomial)
Notice that parameter ε depends on the ranges of variables in the dataset; therefore it is recommended to normalize/re-scale data prior to applying ε-SVR.
Novelty detection with SVM-based methods
What is it about?
• Find the simplest and most compact region in the space of predictors where the majority of data samples "live" (i.e., with the highest density of samples).
• Build a decision function that takes value +1 in this region and -1 elsewhere.
• Once we have such a decision function, we can identify novel or outlier samples/patients in the data.
[Figure: samples plotted by predictor X and predictor Y; the decision function equals +1 inside the dense region and -1 outside it.]
Key assumptions
• We do not know the classes/labels of samples (positive or negative) in the data available for learning ⇒ this is not a classification problem.
• All positive samples are similar, but each negative sample can be different in its own way.
Thus, we do not need to collect data for negative samples!
Sample applications
[Figure: examples of "normal" samples and several "novel" ones.]
Modified from: www.cs.huji.ac.il/course/2004/learns/NoveltyDetection.ppt
Sample applications
Discover deviations in sample handling protocol when doing quality control of assays.
[Figure: samples plotted by protein X and protein Y. The dense central region contains samples with high quality of processing; outlying groups correspond to samples with low quality of processing from the lab of Dr. Smith, from infants, from ICU patients, and from patients with lung cancer.]
Sample applications
Identify websites that discuss benefits of proven cancer treatments.
[Figure: websites plotted by weighted frequency of word X and word Y. The dense central region contains websites that discuss benefits of proven cancer treatments; outlying groups correspond to websites that discuss cancer prevention methods, side-effects of proven cancer treatments, unproven cancer treatments, and blogs of cancer patients.]
On presentation of one-class SVM...
When we previously discussed SVM classification and regression, we first introduced the version of the method for linear data, then discussed the non-linear version. We will follow a similar course in the discussion of one-class SVMs, but with one fundamental difference. Unlike SVMs and SVR, only the non-linear one-class SVM is designed to be applied and makes practical sense. Furthermore, one-class SVM should be used with special types of kernel functions. So, even though we will describe the linear one-class SVM, it is presented here only for educational purposes and its practical meaning may be limited.
Linear one-class SVM
Main idea: Find the maximal-gap hyperplane that separates the data from the origin (i.e., the only member of the second class is the origin).
[Figure: positive objects (y = +1) plotted by gene X and gene Y, with the hyperplanes w · x + b = -1, w · x + b = 0, and w · x + b = +1 separating them from the origin, which is the only member of the second class (y = -1).]
Linear one-class SVM
One-class SVM seeks the most compact region where the majority of data points live; so we are interested in another hyperplane, w · x + b = 0, placed so that the data lies on its positive side.
[Figure: the hyperplane w · x + b = 0 separating the data from the origin, which is the only member of the second class (y = -1).]
Formulation of one-class SVM: linear case
Given training data: x1, x2, ..., xN ∈ Rn
Find f(x) = sign(w · x + b)
by minimizing (1/2)||w||^2 + (1/(νN)) Σ_i ξi + b
subject to constraints:
w · xi ≥ -b - ξi
ξi ≥ 0
for i = 1,...,N.
ν is an upper bound on the fraction of outliers (i.e., points outside the decision surface) allowed in the data.
I.e., the decision function should be positive in all training samples except for small deviations.
Formulation of one-class SVM: linear and non-linear cases
Linear case:
Find f(x) = sign(w · x + b)
by minimizing (1/2)||w||^2 + (1/(νN)) Σ_i ξi + b
subject to constraints: w · xi ≥ -b - ξi and ξi ≥ 0 for i = 1,...,N.
Non-linear case (use "kernel trick"):
Find f(x) = sign(w · Φ(x) + b)
by minimizing (1/2)||w||^2 + (1/(νN)) Σ_i ξi + b
subject to constraints: w · Φ(xi) ≥ -b - ξi and ξi ≥ 0 for i = 1,...,N.
Non-linear one-class SVM
[Figure: data in the input space (gene X, gene Y). Use a kernel to map all points to a sphere with unit radius (e.g., use a Gaussian kernel). We want to find the simplest and most compact region enclosing the majority of data points.]
Non-linear one-class SVM
[Figure: in the feature space, solve the linear one-class SVM problem, i.e. find the hyperplane w · Φ(x) + b = 0 that separates the mapped data (positive objects, y = +1) from the origin (negative object, y = -1).]
Non-linear one-class SVM
[Figure: the corresponding decision surface (image of the hyperplane) in the input space encloses the dense region of the data; it is exactly what we were looking for!]
More about one-class SVM
• One-class SVMs inherit most of the properties of SVMs for binary classification (e.g., "kernel trick", sample efficiency, ease of finding a solution by efficient optimization methods, etc.);
• The choice of parameter ν significantly affects the resulting decision surface.
• The choice of origin is arbitrary and also significantly affects the decision surface returned by the algorithm.
[Figure: decision surfaces obtained for ν = 0.5, 0.3, 0.2, and 0.1.]
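A short sketch (not from the original slides; data and parameter values are placeholders) of novelty detection with scikit-learn's one-class SVM, including the effect of ν:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# "Normal" training samples drawn from one dense cluster
rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Fit a non-linear (Gaussian-kernel) one-class SVM; nu bounds the fraction of outliers
for nu in [0.5, 0.3, 0.2, 0.1]:
    oc = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X_train)
    frac_out = np.mean(oc.predict(X_train) == -1)   # -1 = outlier, +1 = normal
    print(f"nu={nu}: fraction of training points flagged as outliers = {frac_out:.2f}")

# Score clearly novel points
X_new = np.array([[0.1, -0.2], [4.0, 4.0]])
oc = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X_train)
print(oc.predict(X_new))   # expected: [ 1 -1 ]  (first point normal, second novel)
```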
Support vector clustering
Contributed by Nikita Lytkin
Goal of clustering (aka class discovery)
Given a heterogeneous set of data points x1, x2, ..., xN ∈ Rn, assign labels y1, y2, ..., yN ∈ {1, 2, ..., K} such that points with the same label are highly "similar" to each other and are distinctly different from the rest.
[Figure: unlabeled points before clustering and the same points after the clustering process, colored by cluster.]
Support vector domain description
• The Support Vector Domain Description (SVDD) of the data is the set of vectors lying on the surface of the smallest hyper-sphere enclosing all data points in a feature space.
‐ These surface points, called Support Vectors, lie in low-density regions in the input space and are good indicators of cluster boundaries.
[Figure: data in the input space mapped by a kernel into a feature space, where it is enclosed by the smallest hyper-sphere of radius R centered at a.]
Outline of Support Vector Clustering
• Data points that are support vectors of the minimal enclosing hyper-sphere are taken to be the cluster boundary points.
• All other points are assigned to clusters using the boundary points.
[Figure: the minimal enclosing hyper-sphere with radius R and center a, and the corresponding cluster boundaries in the input space.]
Cluster assignment in SVC
• Two points xi and xj belong to the same cluster (i.e., have the same label) if every point of the line segment (xi, xj) projected into the feature space lies within the hyper-sphere.
[Figure: for a segment between points in different clusters, some points lie outside the hyper-sphere; for a segment within one cluster, every point is within the hyper-sphere in the feature space.]
Cluster assignment in SVC (continued)
• A point-wise adjacency matrix is constructed by testing the line segments between every pair of points.
• Connected components are extracted.
• Points belonging to the same connected component are assigned to the same cluster.
[Figure: five points A-E with the (upper-triangular) adjacency matrix
   A B C D E
A  1 1 1 0 0
B    1 1 0 0
C      1 0 0
D        1 1
E          1
yielding two connected components, {A, B, C} and {D, E}.]
126
Understanding the weight vector wRecall standard SVM formulation:
g g
0 bxw
Find and that minimize2
w
b
subject to
for i = 1,…,N.
2
21 w
1)( bxwy ii
, ,
Use classifier: Positive objects (y=+1)Negative objects (y=‐1))()( bxwsignxf
• The weight vector contains as many elements as there are input variables in the dataset, i.e. .h i d f h l d i f h
w
nw R
• The magnitude of each element denotes importance of the corresponding variable for classification task.
127
Understanding the weight vector w
Examples:
• w = (1, 1): hyperplane x1 + x2 + b = 0; X1 and X2 are equally important.
• w = (1, 0): hyperplane x1 + b = 0; X1 is important, X2 is not.
• w = (0, 1): hyperplane x2 + b = 0; X2 is important, X1 is not.
• w = (1, 1, 0): hyperplane x1 + x2 + b = 0 in R3; X1 and X2 are equally important, X3 is not.
Understanding the weight vector w
[Figure: melanoma and nevi samples plotted by gene X1 and gene X2. The SVM decision surface corresponds to w = (1, 1), i.e. x1 + x2 + b = 0; the decision surface of another classifier corresponds to w = (1, 0), i.e. x1 + b = 0. In the true causal model, X1 causes the phenotype.]
• In the true model, X1 is causal and X2 is redundant.
• The SVM decision surface implies that X1 and X2 are equally important; thus it is locally causally inconsistent.
• There exists a causally consistent decision surface for this example.
• Causal discovery algorithms can identify that X1 is causal and X2 is redundant.
Simple SVM-based variable selection algorithm
Algorithm:
1. Train SVM classifier using data for all variables to estimate vector w.
2. Rank each variable based on the magnitude of the corresponding element in vector w.
3. Using the above ranking of variables, select the smallest nested subset of variables that achieves the best SVM prediction accuracy.
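The ranking step can be written in a few lines with a linear SVM; this sketch (illustrative, not from the original slides) ranks variables by |wi|:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)     # put variables on comparable scales

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_[0]                          # one weight per input variable

ranking = np.argsort(-np.abs(w))          # variables ordered by decreasing |w_i|
print("top 5 variables by |w|:", ranking[:5])
# Step 3 would then evaluate nested subsets ranking[:1], ranking[:2], ... by cross-validation.
```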
Simple SVM-based variable selection algorithm
Consider that we have 7 variables: X1, X2, X3, X4, X5, X6, X7.
The vector w is: (0.1, 0.3, 0.4, 0.01, 0.9, -0.99, 0.2)
The ranking of variables is: X6, X5, X3, X2, X7, X1, X4
Subset of variables     | Classification accuracy
X6 X5 X3 X2 X7 X1 X4    | 0.920  ← best classification accuracy
X6 X5 X3 X2 X7 X1       | 0.920
X6 X5 X3 X2 X7          | 0.919  ← statistically indistinguishable from the best one
X6 X5 X3 X2             | 0.852
X6 X5 X3                | 0.843
X6 X5                   | 0.832
X6                      | 0.821
Select the following variable subset: X6, X5, X3, X2, X7
Simple SVM-based variable selection algorithm
• SVM weights are not locally causally consistent ⇒ we may end up with a variable subset that is not causal and not necessarily the most compact one.
• The magnitude of a variable in vector w estimates the effect of removing that variable on the objective function of SVM (e.g., the function that we want to minimize). However, this algorithm becomes sub-optimal when considering the effect of removing several variables at a time... This pitfall is addressed in the SVM-RFE algorithm that is presented next.
SVM-RFE variable selection algorithm
Algorithm:
1. Initialize V to all variables in the data
2. Repeat
3. Train SVM classifier using data for variables in V to estimate vector w
4. Estimate prediction accuracy of variables in V using the above SVM classifier (e.g., by cross-validation)
5. Remove from V a variable (or a subset of variables) with the smallest magnitude of the corresponding element in vector w
6. Until there are no variables in V
7. Select the smallest subset of variables with the best prediction accuracy
SVM-RFE variable selection algorithm
[Figure: starting from 10,000 genes, an SVM model is trained and its prediction accuracy estimated; the genes least important for classification are discarded, leaving 5,000 genes, then 2,500 genes, and so on.]
• Unlike the simple SVM-based variable selection algorithm, SVM-RFE estimates vector w many times to establish the ranking of the variables.
• Notice that the prediction accuracy should be estimated at each step in an unbiased fashion, e.g. by cross-validation.
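scikit-learn provides recursive feature elimination that follows this same scheme; the sketch below (illustrative, with placeholder settings) halves the variable set at each step and uses cross-validation to pick the final subset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Linear SVM supplies the weight vector w used to rank and eliminate variables
svm = SVC(kernel="linear", C=1.0)
rfe = RFECV(estimator=svm, step=0.5, cv=5)   # drop half of the remaining variables per step
rfe.fit(X, y)

print("number of variables selected:", rfe.n_features_)
print("selected variable indices:", [i for i, keep in enumerate(rfe.support_) if keep])
```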
SVM variable selection in feature space
The real power of SVMs comes with application of the kernel trick, which maps data to a much higher-dimensional space ("feature space") where the data is linearly separable.
[Figure: tumor and normal samples plotted by gene 1 and gene 2 in the input space, mapped by a kernel into a feature space where they are linearly separable.]
SVM variable selection in feature space
• We have data for 100 SNPs (X1,...,X100) and some phenotype.
• We allow up to 3rd order interactions, e.g. we consider:
 • X1,...,X100
 • X1^2, X1X2, X1X3,...,X1X100,...,X100^2
 • X1^3, X1X2X3, X1X2X4,...,X1X99X100,...,X100^3
• Task: find the smallest subset of features (either SNPs or their interactions) that achieves the best predictive accuracy of the phenotype.
• Challenge: If we have a limited sample, we cannot explicitly construct and evaluate all SNPs and their interactions (176,851 features in total) as it is done in classical statistics.
Heuristic solution: Apply algorithm SVM‐FSMB that:pp y g1. Uses SVMs with polynomial kernel of degree 3 and
selects M features (not necessarily input variables!) ( y p )that have largest weights in the feature space.
E th l ith l t f t lik XE.g., the algorithm can select features like: X10, (X1X2), (X9X2X22), (X7
2X98), and so on.
2. Apply HITON‐MB Markov blanket algorithm to find the Markov blanket of the phenotype using Mthe Markov blanket of the phenotype using M features from step 1.
137
Computing posterior class probabilities for SVM classifiers
Output of SVM classifier
1. SVMs output a class label (positive or negative) for each sample: sign(w · x + b)
2. One can also compute the distance from the hyperplane w · x + b = 0 that separates the classes, e.g. w · x + b. These distances can be used to compute performance metrics like area under the ROC curve.
[Figure: positive samples (y = +1) and negative samples (y = -1) on either side of the hyperplane w · x + b = 0.]
Question: How can one use SVMs to estimate posterior class probabilities, i.e., P(class positive | sample x)?
Simple binning method
1. Train SVM classifier in the Training set.
2. Apply it to the Validation set and compute distances from the hyperplane to each sample.
Sample # | 1 | 2  | 3 | 4 | 5 | ... | 98 | 99  | 100
Distance | 2 | -1 | 8 | 3 | 4 | ... | -2 | 0.3 | 0.8
3. Create a histogram with Q (e.g., say 10) bins using the above distances. Each bin has an upper and lower value in terms of distance.
[Figure: the data is split into Training, Validation and Testing sets; histogram of the number of samples in the validation set as a function of distance.]
Simple binning method
4. Given a new sample from the Testing set, place it in the corresponding bin.
E.g., sample #382 has distance to the hyperplane = 1, so it is placed in the bin [0, 2.5].
5. Compute probability P(positive class | sample #382) as the fraction of true positives in this bin.
E.g., this bin has 22 samples (from the Validation set), out of which 17 are positive ones, so we compute P(positive class | sample #382) = 17/22 = 0.77.
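A compact sketch of this binning procedure (illustrative only; the split and bin count are assumptions) using NumPy on top of a fitted scikit-learn SVM:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)           # step 1

d_val = clf.decision_function(X_val)                         # step 2: distances on validation set
edges = np.histogram_bin_edges(d_val, bins=10)               # step 3: Q = 10 bins

def posterior(sample):                                       # steps 4-5
    d = clf.decision_function(sample.reshape(1, -1))[0]
    b = np.clip(np.digitize(d, edges) - 1, 0, len(edges) - 2)
    in_bin = (d_val >= edges[b]) & (d_val <= edges[b + 1])
    return y_val[in_bin].mean() if in_bin.any() else np.nan  # fraction of positives in the bin

print("P(positive | first test sample) =", posterior(X_te[0]))
```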
Platt's method
Convert distances output by SVM to probabilities by passing them through the sigmoid filter:
P(positive class | sample) = 1 / (1 + exp(A·d + B))
where d is the distance from the hyperplane and A and B are parameters.
[Figure: P(positive class | sample) as a sigmoid function of the distance.]
Platt's method
1. Train SVM classifier in the Training set.
2. Apply it to the Validation set and compute distances from the hyperplane to each sample.
Sample # | 1 | 2  | 3 | 4 | 5 | ... | 98 | 99  | 100
Distance | 2 | -1 | 8 | 3 | 4 | ... | -2 | 0.3 | 0.8
3. Determine parameters A and B of the sigmoid function by minimizing the negative log likelihood of the data from the Validation set.
4. Given a new sample from the Testing set, compute its posterior probability using the sigmoid function.
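In scikit-learn, Platt-style sigmoid calibration is available off the shelf; this sketch (illustrative, settings are placeholders) fits the sigmoid on held-out folds and returns posterior probabilities:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# method="sigmoid" fits Platt's sigmoid to the SVM decision values on held-out folds
svm = SVC(kernel="linear", C=1.0)
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=3).fit(X_tr, y_tr)

proba = calibrated.predict_proba(X_te)[:, 1]   # P(positive class | sample)
print("posterior probabilities for the first 5 test samples:", proba[:5].round(3))
```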
Part 3
• Case studies (taken from our research)
1. Classification of cancer gene expression microarray data
2. Text categorization in biomedicine
3. Prediction of clinical laboratory values
4. Modeling clinical judgment
5. Using SVMs for feature selection
6. Outlier detection in ovarian cancer proteomics data
• Software
• Conclusions
• Bibliography
1. Classification of cancer gene expression microarray data
Comprehensive evaluation of algorithms for classification of cancer microarray data
Main goals:
• Find the best performing decision support algorithms for cancer diagnosis from microarray gene expression data;
• Investigate benefits of using gene selection and ensemble classification methods.
Classifiers
• K-Nearest Neighbors (KNN) — instance-based
• Backpropagation Neural Networks (NN) — neural networks
• Probabilistic Neural Networks (PNN) — neural networks
• Multi-Class SVM: One-Versus-Rest (OVR) — kernel-based
• Multi-Class SVM: One-Versus-One (OVO) — kernel-based
• Multi-Class SVM: DAGSVM — kernel-based
• Multi-Class SVM by Weston & Watkins (WW) — kernel-based
• Multi-Class SVM by Crammer & Singer (CS) — kernel-based
• Weighted Voting: One-Versus-Rest — voting
• Weighted Voting: One-Versus-One — voting
• Decision Trees: CART — decision trees
Ensemble classifiers
[Figure: a dataset is given to Classifier 1, Classifier 2, ..., Classifier N; their predictions are combined by an ensemble classifier to produce the final prediction.]
Gene selection methods
1. Signal-to-noise (S2N) ratio in one-versus-rest (OVR) fashion;
2. Signal-to-noise (S2N) ratio in one-versus-one (OVO) fashion;
3. Kruskal-Wallis nonparametric one-way ANOVA (KW);
4. Ratio of genes between-categories to within-category sum of squares (BW).
[Figure: genes ranked from highly discriminatory to uninformative.]
Performance metrics and statistical comparison
1. Accuracy
   + can compare to previous studies
   + easy to interpret & simplifies statistical comparison
2. Relative classifier information (RCI)
   + easy to interpret & simplifies statistical comparison
   + not sensitive to distribution of classes
   + accounts for difficulty of a decision problem
• Randomized permutation testing to compare accuracies of the classifiers (α = 0.05)
150
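A minimal sketch of a randomized permutation test for comparing two classifiers' accuracies on the same test samples (the exact permutation scheme used in the study may differ; the per-sample label swapping below is one common construction):

```python
import numpy as np

def permutation_test(correct_a, correct_b, n_perm=10000, seed=0):
    """Two-sided permutation test for the difference in accuracy between two
    classifiers evaluated on the same samples. correct_a / correct_b are 0/1
    vectors indicating whether each test sample was classified correctly."""
    rng = np.random.RandomState(seed)
    observed = abs(correct_a.mean() - correct_b.mean())
    count = 0
    for _ in range(n_perm):
        swap = rng.rand(len(correct_a)) < 0.5      # randomly swap the two classifiers' outcomes per sample
        a = np.where(swap, correct_b, correct_a)
        b = np.where(swap, correct_a, correct_b)
        if abs(a.mean() - b.mean()) >= observed:
            count += 1
    return count / n_perm

# Illustrative usage: reject the null hypothesis of equal accuracy if p < 0.05
rng = np.random.RandomState(1)
correct_svm = (rng.rand(100) < 0.9).astype(int)
correct_knn = (rng.rand(100) < 0.8).astype(int)
print("p-value:", permutation_test(correct_svm, correct_knn))
```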
Microarray datasets

Dataset name     Samples  Variables (genes)  Categories  Reference
9_Tumors           60        5726               9        Staunton, 2001
11_Tumors         174       12533              11        Su, 2001
14_Tumors         308       15009              26        Ramaswamy, 2001
Brain_Tumor1       90        5920               5        Pomeroy, 2002
Brain_Tumor2       50       10367               4        Nutt, 2003
Leukemia1          72        5327               3        Golub, 1999
Leukemia2          72       11225               3        Armstrong, 2002
Lung_Cancer       203       12600               5        Bhattacherjee, 2001
SRBCT              83        2308               4        Khan, 2001
Prostate_Tumor    102       10509               2        Singh, 2002
DLBCL              77        5469               2        Shipp, 2002

Total: ~1300 samples, 74 diagnostic categories, 41 cancer types and 12 normal tissue types.
151
Summary of methods and datasets
[Diagram summarizing the experimental design:
• Gene selection methods (4): S2N One-Versus-Rest, S2N One-Versus-One, Non-parametric ANOVA (KW), BW ratio
• Cross-validation designs (2): 10-fold CV, LOOCV
• Performance metrics (2): Accuracy, RCI
• Statistical comparison: randomized permutation testing
• Classifiers (11): MC-SVM OVR, MC-SVM OVO, MC-SVM DAGSVM, method by WW, method by CS, KNN, Backprop. NN, Prob. NN, Decision Trees, WV One-Versus-Rest, WV One-Versus-One
• Ensemble classifiers (7): based on outputs of the MC-SVM methods or of all classifiers, combined via decision trees or majority voting, for multicategory and binary diagnostic tasks
• Gene expression datasets (11): 9_Tumors, 11_Tumors, 14_Tumors, Brain_Tumor1, Brain_Tumor2, Leukemia1, Leukemia2, Lung_Cancer, SRBCT, Prostate_Tumor, DLBCL]
152
Results without gene selection
[Figure: classification accuracy (%) on each of the 11 datasets (9_Tumors, 14_Tumors, Brain_Tumor2, Brain_Tumor1, 11_Tumors, Leukemia1, Leukemia2, Lung_Cancer, SRBCT, Prostate_Tumor, DLBCL) for the MC-SVM methods (OVR, OVO, DAGSVM, WW, CS) and the non-SVM methods (KNN, NN, PNN)]
153
Results with gene selection
[Figure, left: improvement in accuracy (%) due to gene selection, averaged over four datasets, for the SVM methods (OVR, OVO, DAGSVM, WW, CS) and the non-SVM methods (KNN, NN, PNN). Figure, right: diagnostic accuracy (%) before and after gene selection for 9_Tumors, 14_Tumors, Brain_Tumor1, and Brain_Tumor2.]
Average reduction of genes is 10-30 times.
154
Comparison with previously published results
[Figure: accuracy (%) of multiclass SVMs (this study) versus multiple specialized classification methods (original primary studies) on the 11 datasets]
155
Summary of results
• Multi-class SVMs are the best family among the tested algorithms, outperforming KNN, NN, PNN, DT, and WV;
• Gene selection in some cases improves classification performance of all classifiers, especially of non-SVM algorithms;
• Ensemble classification does not improve performance of SVMs and other classifiers;
• Results obtained by SVMs compare favorably with the literature.
156
Random Forest (RF) classifiers
• Appealing properties:
  - Work when # of predictors > # of samples
  - Embedded gene selection
  - Incorporate interactions
  - Based on theory of ensemble learning
  - Can work with binary & multiclass tasks
  - Do not require much fine-tuning of parameters
• Strong theoretical claims
• Empirical evidence: (Diaz-Uriarte and Alvarez de Andres, BMC Bioinformatics, 2006) reported superior classification performance of RFs compared to SVMs and other methods
157
Key principles of RF classifiers
[Diagram: 1) generate bootstrap samples from the training data; 2) random gene selection; 3) fit unpruned decision trees; 4) apply the trees to the testing data and combine their predictions]
158
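A minimal sketch showing how the four principles above map onto scikit-learn's RandomForestClassifier (the dataset and parameter values are illustrative, not those used in the cited comparison):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # stand-in for a microarray dataset

# 1) bootstrap samples (bootstrap=True), 2) random gene selection at each split
# (max_features), 3) unpruned trees (max_depth=None), 4) predictions combined
# by voting across the trees of the forest.
rf = RandomForestClassifier(
    n_estimators=500, max_features="sqrt", max_depth=None, bootstrap=True, random_state=0
)
print("RF 10-fold CV accuracy:", cross_val_score(rf, X, y, cv=10).mean())
```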
Results without gene selection
• SVMs nominally outperform RFs in 15 datasets, RFs outperform SVMs in 4 datasets, and the algorithms perform exactly the same in 3 datasets.
• In 7 datasets SVMs outperform RFs statistically significantly.
• On average, the performance advantage of SVMs is 0.033 AUC and 0.057 RCI.
159
Results with gene selection
• SVMs nominally outperform RFs in 17 datasets, RFs outperform SVMs in 3 datasets, and the algorithms perform exactly the same in 2 datasets.
• In 1 dataset SVMs outperform RFs statistically significantly.
• On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI.
160
2. Text categorization in biomedicine
161
Models to categorize content and quality: Main idea
1. Utilize existing (or easy to build) training corpora.
2. Simple document representations (i.e., typically stemmed and weighted words in title and abstract, MeSH terms if available; occasionally addition of Metamap CUIs, author info) as a "bag-of-words".
162
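A minimal sketch of such a bag-of-words representation feeding a linear SVM, using scikit-learn (the toy documents and labels are invented for illustration; TF-IDF weighting is used here, and stemming or MeSH terms would require additional preprocessing not shown):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus; in the studies, documents were PubMed titles/abstracts labeled
# by content or quality category.
docs = [
    "randomized controlled trial of statin therapy",
    "case report of a rare dermatologic condition",
    "meta-analysis of beta blockers after myocardial infarction",
    "editorial opinion on healthcare policy",
]
labels = [1, 0, 1, 0]   # e.g., 1 = high-quality treatment article

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),  # weighted bag-of-words representation
    LinearSVC(C=1.0),
)
model.fit(docs, labels)
print(model.predict(["double blind randomized trial of a new drug"]))
```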
Models to categorize content and quality: Main idea
3. Train SVM models that capture implicit categories of meaning or quality criteria.
4. Evaluate models’ performances:
   - with nested cross-validation or other appropriate error estimators;
   - use primarily AUC as well as other metrics (sensitivity, specificity, PPV, Precision/Recall curves, HIT curves, etc.).
5. Evaluate performance prospectively & compare to prior cross-validation estimates.
[Figure: labeled examples used to train models that are applied to unseen examples; bar chart of estimated performance vs. 2005 prospective performance for the Txmt, Diag, Prog, and Etio categories]
163
Models to categorize content and quality: Some notable results

1. SVM models have excellent ability to identify high-quality PubMed documents according to the ACPJ gold standard:

Category    Average AUC   Range over n folds
Treatment   0.97*         0.96 - 0.98
Etiology    0.94*         0.89 - 0.95
Prognosis   0.95*         0.92 - 0.97
Diagnosis   0.95*         0.93 - 0.98

2. SVM models have better classification performance than PageRank, Yahoo ranks, Impact Factor, Web page hit counts, and bibliometric citation counts on the Web according to the ACPJ gold standard:

Method                       Treatment AUC  Etiology AUC  Prognosis AUC  Diagnosis AUC
Google Pagerank              0.54           0.54          0.43           0.46
Yahoo Webranks               0.56           0.49          0.52           0.52
Impact Factor 2005           0.67           0.62          0.51           0.52
Web page hit count           0.63           0.63          0.58           0.57
Bibliometric Citation Count  0.76           0.69          0.67           0.60
Machine Learning Models      0.96           0.95          0.95           0.95

164
Models to categorize content and quality: Some notable results

3. SVM models have better classification performance than PageRank, Impact Factor, and citation count in Medline for the SSOAB gold standard:

Gold standard: SSOAB        Area under the ROC curve*
SSOAB-specific filters      0.893
Citation Count              0.791
ACPJ Txmt-specific filters  0.548
Impact Factor (2001)        0.549
Impact Factor (2005)        0.558

4. SVM models have better sensitivity/specificity in PubMed than CQFs at comparable thresholds according to the ACPJ gold standard.
[Figure: sensitivity at fixed specificity and specificity at fixed sensitivity of query filters vs. learning models for the Treatment, Etiology, Diagnosis, and Prognosis categories]

165
Other applications of SVMs to text categorization

1. Identifying Web pages with misleading treatment information according to a special-purpose gold standard (Quack Watch). SVM models outperform Quackometer and Google ranks in the tested domain of cancer treatment:

Model                    Area Under the Curve
Machine Learning Models  0.93
Quackometer*             0.67
Google                   0.63

2. Prediction of future paper citation counts and instrumentality of citations.
166
3. Prediction of clinical laboratory values
167
Dataset generation and experimental design
• StarPanel database contains ~8×10^6 lab measurements of ~100,000 in-patients from Vanderbilt University Medical Center.
• Lab measurements were taken between 01/1998 and 10/2002.
• For each combination of lab test and normal range, we generated the following datasets:
  - 01/1998 to 05/2001: Training set and Validation set (25% of Training)
  - 06/2001 to 10/2002: Testing set
168
Query-based approach for prediction of clinical lab values
[Diagram: for every candidate data model, an SVM classifier is trained on the Training data and evaluated on the Validation data; the best-performing data model and SVM classifier are selected; for every Testing sample, the optimal SVM classifier is applied to produce a prediction]
169
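A minimal sketch of the validation-based selection step in this scheme: train an SVM for each candidate data model, score it on the validation data, and keep the best one (the data models, random data, and AUC metric below are illustrative placeholders, not the study's actual featurizations):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

def select_best_data_model(data_models, y_train, y_val):
    """For every candidate data model (a particular featurization of the patient
    history), train an SVM on the training data, score it on the validation
    data, and return the best-performing (name, classifier, AUC) triple."""
    best = None
    for name, (X_tr, X_val) in data_models.items():
        clf = SVC(kernel="linear").fit(X_tr, y_train)
        auc = roc_auc_score(y_val, clf.decision_function(X_val))
        if best is None or auc > best[2]:
            best = (name, clf, auc)
    return best

# Illustrative usage with two hypothetical data models built from random data
rng = np.random.RandomState(0)
y_tr, y_val = rng.randint(0, 2, 200), rng.randint(0, 2, 80)
data_models = {
    "last_1_measurement": (rng.randn(200, 10), rng.randn(80, 10)),
    "last_5_measurements": (rng.randn(200, 50), rng.randn(80, 50)),
}
name, clf, auc = select_best_data_model(data_models, y_tr, y_val)
print(f"Selected data model: {name} (validation AUC = {auc:.2f})")
```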
Classification results
Area under ROC curve (without feature selection), by laboratory test and range of normal values.

Including cases with K=0 (i.e., samples with no prior lab measurements):

Test    >1      <99     [1, 99]  >2.5    <97.5   [2.5, 97.5]
BUN     80.4%   99.1%   76.6%    87.1%   98.2%   70.7%
Ca      72.8%   93.4%   55.6%    81.4%   81.4%   63.4%
CaIo    74.1%   60.0%   50.1%    64.7%   72.3%   57.7%
CO2     82.0%   93.6%   59.8%    84.4%   94.5%   56.3%
Creat   62.8%   97.7%   89.1%    91.5%   98.1%   87.7%
Mg      56.9%   70.0%   49.1%    58.6%   76.9%   59.1%
Osmol   50.9%   60.8%   60.8%    91.0%   90.5%   68.0%
PCV     74.9%   99.2%   66.3%    80.9%   80.6%   67.1%
Phos    74.5%   93.6%   64.4%    71.7%   92.2%   69.7%

Excluding cases with K=0 (i.e., samples with no prior lab measurements):

Test    >1      <99     [1, 99]  >2.5    <97.5   [2.5, 97.5]
BUN     75.9%   93.4%   68.5%    81.8%   92.2%   66.9%
Ca      67.5%   80.4%   55.0%    77.4%   70.8%   60.0%
CaIo    63.5%   52.9%   58.8%    46.4%   66.3%   58.7%
CO2     77.3%   88.0%   53.4%    77.5%   90.5%   58.1%
Creat   62.2%   88.4%   83.5%    88.4%   94.9%   83.8%
Mg      58.4%   71.8%   64.2%    67.0%   72.5%   62.1%
Osmol   77.9%   64.8%   65.2%    79.2%   82.4%   71.5%
PCV     62.3%   91.6%   69.7%    76.5%   84.6%   70.2%
Phos    70.8%   75.4%   60.4%    68.0%   81.8%   65.9%

A total of 84,240 SVM classifiers were built for 16,848 possible data models.
170
Improving predictive power and parsimony of a BUN model using feature selection

Model description:
  Test name: BUN
  Range of normal values: < 99th percentile
  Data modeling: SRT
  Number of previous measurements: 5
  Use variables corresponding to hospitalization units? Yes
  Number of prior hospitalizations used: 2

Dataset description:
  Training set: 3749 samples (78 abnormal), 3442 variables
  Validation set: 1251 samples (27 abnormal)
  Testing set: 836 samples (16 abnormal)

[Figure: histogram of test BUN values, split into normal and abnormal values]

Classification performance (area under ROC curve):

                   All      RFE_Linear  RFE_Poly  HITON_PC  HITON_MB
Validation set     95.29%   98.78%      98.76%    99.12%    98.90%
Testing set        94.72%   99.66%      99.63%    99.16%    99.05%
Number of features 3442     26          3         11        17

171
Classification performance (area under ROC curve):

                   All      RFE_Linear  RFE_Poly  HITON_PC  HITON_MB
Validation set     95.29%   98.78%      98.76%    99.12%    98.90%
Testing set        94.72%   99.66%      99.63%    99.16%    99.05%
Number of features 3442     26          3         11        17

Selected features:

RFE_Linear (26): LAB: PM_1(BUN); PM_2(Cl); DT(PM_3(K)); DT(PM_3(Creat)); Test Unit J018 (Test Ca, PM 3); DT(PM_4(Cl)); DT(PM_3(Mg)); PM_1(Cl); PM_3(Gluc); DT(PM_1(CO2)); DT(PM_4(Gluc)); PM_3(Mg); DT(PM_5(Mg)); PM_1(PCV); PM_2(BUN); Test Unit 11NM (Test PCV, PM 2); Test Unit 7SCC (Test Mg, PM 3); DT(PM_2(Phos)); DT(PM_3(CO2)); DT(PM_2(Gluc)); DT(PM_5(CaIo)); DEMO: Hospitalization Unit TVOS; PM_1(Phos); PM_2(Phos); Test Unit 11NM (Test K, PM 5); Test Unit VHR (Test CaIo, PM 1)

RFE_Poly (3): LAB: PM_1(BUN); Indicator(PM_1(Mg)); Test Unit NO_TEST_MEASUREMENT (Test CaIo, PM 1)

HITON_PC (11): LAB: PM_1(BUN); PM_5(Creat); PM_1(Phos); Indicator(PM_1(BUN)); Indicator(PM_5(Creat)); Indicator(PM_1(Mg)); DT(PM_4(Creat)); Test Unit 7SCC (Test Ca, PM 1); Test Unit RADR (Test Ca, PM 5); Test Unit 7SMI (Test PCV, PM 4); DEMO: Gender

HITON_MB (17): LAB: PM_1(BUN); PM_5(Creat); PM_3(PCV); PM_1(Mg); PM_1(Phos); Indicator(PM_4(Creat)); Indicator(PM_5(Creat)); Indicator(PM_3(PCV)); Indicator(PM_1(Phos)); DT(PM_4(Creat)); Test Unit 11NM (Test BUN, PM 2); Test Unit 7SCC (Test Ca, PM 1); Test Unit RADR (Test Ca, PM 5); Test Unit 7SMI (Test PCV, PM 4); Test Unit CCL (Test Phos, PM 1); DEMO: Gender; DEMO: Age

172
4. Modeling clinical judgment
173
Methodological framework and study outline
[Diagram: patients and guidelines are presented to six physicians; for each physician, a table of patient features (f1...fm), the physician's clinical diagnosis (cd), and the gold-standard histological diagnosis (hd) is assembled. Features are the same across physicians; diagnoses differ across physicians. The models are used to: predict clinical decisions; identify predictors ignored by physicians; explain each physician's diagnostic model; and compare physicians with each other and with guidelines.]
Clinical context of experiment
Malignant melanoma is the most dangerous form of skin cancer. Incidence and mortality have been constantly increasing in the last decades.
Physicians and patients
• Patients: N = 177 (76 melanomas, 101 nevi)
• Dermatologists: N = 6 (3 experts, 3 non-experts)
• Data collection: patients seen prospectively from 1999 to 2002 at the Department of Dermatology, S. Chiara Hospital, Trento, Italy; inclusion criteria: histological diagnosis and >1 digital image available. Diagnoses made in 2004.
Features
Lesion location, max-diameter, min-diameter, evolution, age, gender, family history of melanoma, Fitzpatrick's photo-type, sunburn, ephelis, lentigos, asymmetry, irregular border, number of colors, atypical pigmented network, abrupt network cut-off, regression-erythema, hypo-pigmentation, streaks (radial streaming, pseudopods), slate-blue veil, whitish veil, globular elements, comedo-like openings/milia-like cysts, telangiectasia.
Method to explain physician-specific SVM models
[Diagram: regular learning - feature selection (FS), then build an SVM ("black box"); meta-learning - apply the SVM to the data and build a decision tree (DT) on the SVM outputs to explain the model]
Results: Predicting physicians’ judgments
AUC (number of features in parentheses):

Physician     All        HITON_PC   HITON_MB   RFE
Expert 1      0.94 (24)  0.92 (4)   0.92 (5)   0.95 (14)
Expert 2      0.92 (24)  0.89 (7)   0.90 (7)   0.90 (12)
Expert 3      0.98 (24)  0.95 (4)   0.95 (4)   0.97 (19)
NonExpert 1   0.92 (24)  0.89 (5)   0.89 (6)   0.90 (22)
NonExpert 2   1.00 (24)  0.99 (6)   0.99 (6)   0.98 (11)
NonExpert 3   0.89 (24)  0.89 (4)   0.89 (6)   0.87 (10)
Results: Physician-specific models
Results: Explaining physician agreement
[Example: Patient 001 (blue veil: yes; irregular border: no; streaks: yes) traced through the explanatory models of Expert 1 (AUC = 0.92, R2 = 99%) and Expert 3 (AUC = 0.95, R2 = 99%)]
Results: Explaining physician disagreement
[Example: Patient 002 (blue veil: no; irregular border: no; streaks: yes; number of colors: 3; evolution: no) traced through the explanatory models of Expert 1 (AUC = 0.92, R2 = 99%) and Expert 3 (AUC = 0.95, R2 = 99%)]
Results: Guideline compliance

Physician                         Reported guidelines                      Compliance
Experts 1, 2, 3 and non-expert 1  Pattern analysis                         Non-compliant: they ignore the majority of features (17 to 20) recommended by pattern analysis.
Non-expert 2                      ABCDE rule                               Non-compliant: asymmetry, irregular border and evolution are ignored.
Non-expert 3                      Non-standard; reports using 7 features   Non-compliant: 2 out of 7 reported features are ignored, while some non-reported ones are used.

On the contrary: in all guidelines, the more predictors present, the higher the likelihood of melanoma. All physicians were compliant with this principle.
5. Using SVMs for feature selection
183
Feature selection methods (non-causal)
• SVM-RFE (an SVM-based feature selection method; see the sketch after this list)
• Univariate + wrapper
• Random forest-based
• LARS-Elastic Net
• RELIEF + wrapper
• L0-norm
• Forward stepwise feature selection
• No feature selection

Causal feature selection methods
• HITON-PC
• HITON-MB (outputs a Markov blanket of the response variable, under assumptions)
• IAMB
• BLCD
• K2MB
• ...
184
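A minimal SVM-RFE sketch using scikit-learn's RFE wrapper around a linear SVM (the synthetic data and the choice to drop 10% of the remaining features per iteration are illustrative):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# SVM-RFE: repeatedly train a linear SVM and discard the features with the
# smallest absolute weights until the desired number of features remains.
rng = np.random.RandomState(0)
X = rng.randn(100, 1000)                       # e.g., 1000 candidate features
y = (X[:, 0] + X[:, 1] + 0.5 * rng.randn(100) > 0).astype(int)

rfe = RFE(
    estimator=SVC(kernel="linear", C=1.0),     # the linear SVM's weight vector drives elimination
    n_features_to_select=10,
    step=0.1,                                  # drop 10% of remaining features per iteration
)
rfe.fit(X, y)
print("Selected features:", np.where(rfe.support_)[0])
```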
13 real datasets were used to evaluate feature selection methods

Dataset name      Domain             Variables  Samples  Target                                     Data type
Infant_Mortality  Clinical           86         5,337    Died within the first year                 discrete
Ohsumed           Text               14,373     5,000    Relevant to neonatal diseases              continuous
ACPJ_Etiology     Text               28,228     15,779   Relevant to etiology                       continuous
Lymphoma          Gene expression    7,399      227      3-year survival: dead vs. alive            continuous
Gisette           Digit recognition  5,000      7,000    Separate 4 from 9                          continuous
Dexter            Text               19,999     600      Relevant to corporate acquisitions         continuous
Sylva             Ecology            216        14,394   Ponderosa pine vs. everything else         continuous & discrete
Ovarian_Cancer    Proteomics         2,190      216      Cancer vs. normals                         continuous
Thrombin          Drug discovery     139,351    2,543    Binding to thrombin                        discrete (binary)
Breast_Cancer     Gene expression    17,816     286      Estrogen-receptor positive (ER+) vs. ER-   continuous
Hiva              Drug discovery     1,617      4,229    Activity to AIDS HIV infection             discrete (binary)
Nova              Text               16,969     1,929    Separate politics from religion topics     discrete (binary)
Bankruptcy        Financial          147        7,063    Personal bankruptcy                        continuous & discrete

185
Classification performance vs. proportion of selected features
[Figure: classification performance (AUC) as a function of the proportion of selected features for HITON-PC with G2 test and SVM-RFE; original view and magnified view]
186
Statistical comparison of predictivity and reduction of features

SVM-RFE (4 variants)   Predictivity                   Reduction
                       P-value   Nominal winner       P-value   Nominal winner
                       0.9754    SVM-RFE              0.0046    HITON-PC
                       0.8030    SVM-RFE              0.0042    HITON-PC
                       0.1312    HITON-PC             0.3634    HITON-PC
                       0.1008    HITON-PC             0.6816    SVM-RFE

• Null hypothesis: SVM-RFE and HITON-PC perform the same;
• Permutation-based statistical test with alpha = 0.05.
187
Simulated datasets with known causal structure used to compare algorithms
188
Comparison of SVM-RFE and HITON-PC
189
Comparison of all methods in terms of causal graph distance
[Figure: causal graph distance of SVM-RFE vs. HITON-PC based causal methods]
190
Summary results
[Figure: graph distance of HITON-PC based causal methods, HITON-PC-FDR methods, and SVM-RFE]
191
Statistical comparison of graph distance

Comparison                          Sample size = 200         Sample size = 500         Sample size = 5000
                                    P-value   Nominal winner  P-value   Nominal winner  P-value   Nominal winner
average HITON-PC-FDR with G2 test
vs. average SVM-RFE                 <0.0001   HITON-PC-FDR    0.0028    HITON-PC-FDR    <0.0001   HITON-PC-FDR

• Null hypothesis: SVM-RFE and HITON-PC-FDR perform the same;
• Permutation-based statistical test with alpha = 0.05.
192
6. Outlier detection in ovarian cancer proteomics data
193
Ovarian cancer data
[Figure: Data Set 1 (top) and Data Set 2 (bottom); intensity profiles of Cancer, Normal, and Other samples over clock ticks 4000-12000. Same set of 216 patients, obtained using the Ciphergen H4 ProteinChip array (dataset 1) and the Ciphergen WCX2 ProteinChip array (dataset 2).]
The gross break at the "benign disease" juncture in dataset 1 and the similarity of the profiles to those in dataset 2 suggest a change of protocol in the middle of the first experiment.
Experiments with one-class SVM
Assume that sets {A, B} are normal and {C, D, E, F} are outliers. Also, assume that we do not know which samples are normal and which are outliers.
• Experiment 1: Train one-class SVM on {A, B, C} and test on {A, B, C}: area under ROC curve = 0.98
• Experiment 2: Train one-class SVM on {A, C} and test on {B, D, E, F}: area under ROC curve = 0.98
[Figure: Data Set 1 (top) and Data Set 2 (bottom) with Cancer, Normal, and Other sample groups over clock ticks 4000-12000]
195
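A minimal sketch of one-class SVM novelty detection in the spirit of these experiments, using scikit-learn (the random "spectra", the nu and gamma values, and the shifted outlier group are illustrative placeholders, not the actual proteomics data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import OneClassSVM

# Train a one-class SVM on mostly "normal" profiles and score held-out samples;
# large negative decision values indicate likely outliers.
rng = np.random.RandomState(0)
X_train = rng.randn(150, 50)                       # stand-in for spectra from a mostly normal group
X_test = np.vstack([rng.randn(60, 50),             # normal-like test samples
                    rng.randn(40, 50) + 3.0])      # shifted "outlier" samples
y_test = np.r_[np.ones(60), np.zeros(40)]          # 1 = normal, 0 = outlier

oc_svm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_train)
scores = oc_svm.decision_function(X_test)          # higher = more "normal"
print("AUC separating normal from outlier samples:", roc_auc_score(y_test, scores))
```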
Software
196
Interactive media and animations

SVM Applets
• http://www.csie.ntu.edu.tw/~cjlin/libsvm/
• http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
• http://www.smartlab.dibe.unige.it/Files/sw/Applet%20SVM/svmapplet.html
• http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html
• http://www.dsl-lab.org/svm_tutorial/demo.html (requires Java 3D)

Animations
• Support Vector Machines:
  http://www.cs.ust.hk/irproj/Regularization%20Path/svmKernelpath/2moons.avi
  http://www.cs.ust.hk/irproj/Regularization%20Path/svmKernelpath/2Gauss.avi
  http://www.youtube.com/watch?v=3liCbRZPrZA
  http://www.youtube.com/watch?v=AC7afKLgGTs
• Support Vector Regression:
  http://www.cs.ust.hk/irproj/Regularization%20Path/movie/ga0.5lam1.avi
197
Several SVM implementations for beginners
• GEMS: http://www.gems‐system.org
• Weka: http://www.cs.waikato.ac.nz/ml/weka/
• Spider (for Matlab): http://www.kyb.mpg.de/bs/people/spider/
• CLOP (for Matlab): http://clopinet.com/CLOP/
198
Several SVM implementations for intermediate users
• LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  - General purpose
  - Implements binary SVM, multiclass SVM, SVR, one-class SVM
  - Command-line interface
  - Code/interface for C/C++/C#, Java, Matlab, R, Python, Perl
• SVMLight: http://svmlight.joachims.org/
  - General purpose (designed for text categorization)
  - Implements binary SVM, multiclass SVM, SVR
  - Command-line interface
  - Code/interface for C/C++, Java, Matlab, Python, Perl

More software links at http://www.support-vector-machines.org/SVM_soft.html and http://www.kernel-machines.org/software
199
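For readers who prefer Python, scikit-learn's SVC wraps LibSVM; a minimal sketch of a typical workflow (scaling, cross-validated grid search over C and gamma, held-out evaluation) on an illustrative dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale the data (SVMs depend on scaling), then grid-search C and gamma by cross-validation.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [1e-3, 1e-2, 1e-1]}, cv=5)
grid.fit(X_tr, y_tr)
print("Best parameters:", grid.best_params_)
print("Test accuracy:", grid.score(X_te, y_te))
```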
Conclusions
200
• Excellent empirical performance for many biomedical datasets.
• Can learn efficiently both simple linear and very complex nonlinear functions by using the "kernel trick".
• Address outliers and noise by using "slack variables".
• Internal capacity control (regularization) and overfitting avoidance.
• Do not require direct access to data and can work with only dot-products of data points.
• Require solution of a convex QP optimization problem that has a global minimum and can be solved efficiently. Thus optimizing the SVM model parameters is not subject to heuristic optimization failures that plague other machine learning methods (e.g., neural networks, decision trees).
• Produce sparse models that are defined only by a small subset of training points ("support vectors").
• Do not have more free parameters than the number of support vectors, irrespective of the number of variables in the dataset.
• Depend on data scaling/normalization.
• Can be used with established data analysis protocols for model selection and error estimation, such as cross-validation.
• When used for feature selection, employ heuristic strategies to identify relevant features and possible mechanisms.
201
Bibliographyg p y
202
Part 1: Support vector machines for binary classification: classical formulation
• Boser BE, Guyon IM, Vapnik VN: A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT)1992:144‐152.
• Burges CJC: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 1998, 2:121‐167.
• Cristianini N, Shawe‐Taylor J: An introduction to support vector machines and other kernel‐based learning methods. Cambridge: Cambridge University Press; 2000.
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Herbrich R: Learning kernel classifiers: theory and algorithms. Cambridge, Mass: MIT Press; 2002.
• Schölkopf B, Burges CJC, Smola AJ: Advances in kernel methods: support vector learning. Cambridge, Mass: MIT Press; 1999.
• Shawe-Taylor J, Cristianini N: Kernel methods for pattern analysis. Cambridge, UK: Cambridge University Press; 2004.
• Vapnik VN: Statistical learning theory. New York: Wiley; 1998.
• Vapnik VN: The nature of statistical learning theory. New York: Springer, 2000.
203
Part 1: Basic principles of statistical machine learning
• Aliferis CF, Statnikov A, Tsamardinos I: Challenges in the analysis of mass‐throughput data: a technical commentary from the statistical machine learning perspective. Cancer Informatics 2006, 2:133‐162.
• Duda RO, Hart PE, Stork DG: Pattern classification. 2nd edition. New York: Wiley; 2001.
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Mitchell T: Machine learning. New York, NY, USA: McGraw-Hill; 1997.
• Vapnik VN: Statistical learning theory. New York: Wiley; 1998.
• Vapnik VN: The nature of statistical learning theory. New York: Springer, 2000.
204
Part 2: Model selection for SVMs
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining,
inference, and prediction. New York: Springer; 2001.
• Kohavi R: A study of cross‐validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI) 1995, 2:1137‐1145.
• Scheffer T: Error estimation and model selection. Ph.D.Thesis, Technischen UniversitätBerlin, School of Computer Science; 1999.
• Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF: GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform 2005, 74:491-503.
• Weiss SM, Kulikowski CA: Computer systems that learn: classification and prediction methods from statistics, neural nets, machine learning, and expert systems. San Mateo, California: M. Kaufmann Publishers; 1991.
205
Part 2: SVMs for multicategory data
• Crammer K, Singer Y: On the learnability and design of output codes for multiclass
problems. Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT) 2000.
• Platt JC, Cristianini N, Shawe‐Taylor J: Large margin DAGs for multiclass classification. Advances in Neural Information Processing Systems (NIPS) 2000, 12:547‐553.
• Schölkopf B, Burges CJC, Smola AJ: Advances in kernel methods: support vector learning. Cambridge, Mass: MIT Press; 1999.
• Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21:631-643.
• Weston J, Watkins C: Support vector machines for multi-class pattern recognition. Proceedings of the Seventh European Symposium On Artificial Neural Networks 1999, 4:6.
206
Part 2: Support vector regression
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining,
inference, and prediction. New York: Springer; 2001.
• Smola AJ, Schölkopf B: A tutorial on support vector regression. Statistics and Computing2004, 14:199‐222.
Part 2: Novelty detection with SVM-based methods and Support Vector Clustering
• Scholkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC: Estimating the support of a high-dimensional distribution. Neural Computation 2001, 13:1443-1471.
• Tax DMJ, Duin RPW: Support vector domain description. Pattern Recognition Letters 1999, 20:1191-1199.
• Manevitz LM, Yousef M: One-class SVMs for document classification. The Journal of Machine Learning Research 2002, 2:154.
• Ben-Hur A, Horn D, Siegelmann HT, Vapnik V: Support vector clustering. Journal of Machine Learning Research 2001, 2:125-137.
207
Part 2: SVM-based variable selection
• Guyon I, Elisseeff A: An introduction to variable and feature selection. Journal of Machine
Learning Research 2003, 3:1157-1182.
• Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using
support vector machines. Machine Learning 2002, 46:389-422.
• Hardin D, Tsamardinos I, Aliferis CF: A theoretical characterization of linear SVM-based
feature selection. Proceedings of the Twenty First International Conference on Machine Learning (ICML) 2004.
• Statnikov A, Hardin D, Aliferis CF: Using SVM weight-based methods to identify causally relevant and non-causally relevant variables. Proceedings of the NIPS 2006 Workshop on Causality and Feature Selection 2006.
• Tsamardinos I, Brown LE: Markov Blanket‐Based Variable Selection in Feature Space. Technical report DSL‐08‐01 2008.
• Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V: Feature selection for SVMs. Advances in Neural Information Processing Systems (NIPS) 2000, 13:668‐674.
• Weston J, Elisseeff A, Scholkopf B, Tipping M: Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research 2003, 3:1439-1461.
• Zhu J, Rosset S, Hastie T, Tibshirani R: 1‐norm support vector machines. Advances in Neural Information Processing Systems (NIPS) 2004, 16.
208
Part 2: Computing posterior class probabilities for SVM classifiers
• Platt JC: Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers. Edited by Smola A, Bartlett B, Scholkopf B, Schuurmans D. Cambridge, MA: MIT press; 2000.
Part 3: Classification of cancer gene expression microarray data (Case Study 1)
• Diaz‐Uriarte R, Alvarez de Andres S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7:3.
• Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21:631-643.
• Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF: GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform 2005, 74:491-503.
• Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008, 9:319.
209
Part 3: Text Categorization in Biomedicine (Case Study 2)
• Aphinyanaphongs Y, Aliferis CF: Learning Boolean queries for article quality filtering. Medinfo 2004 2004, 11:263‐267.
• Aphinyanaphongs Y, Tsamardinos I, Statnikov A, Hardin D, Aliferis CF: Text categorization models for high‐quality article retrieval in internal medicine. J Am Med Inform Assoc2005, 12:207‐216.
• Aphinyanaphongs Y, Statnikov A, Aliferis CF: A comparison of citation metrics to machine learning filters for the identification of high quality MEDLINE documents. J Am Med Inform Assoc 2006, 13:446‐455.
• Aphinyanaphongs Y, Aliferis CF: Prospective validation of text categorization models for indentifying high‐quality content‐specific articles in PubMed. AMIA 2006 Annual Symposium Proceedings 2006.
• Aphinyanaphongs Y, Aliferis C: Categorization Models for Identifying Unproven Cancer Treatments on the Web. MEDINFO 2007.
• Fu L, Aliferis CF: Models for Predicting and Explaining Citation Count of Biomedical Articles. AMIA 2008 Annual Symposium Proceedings 2008.
• Fu L, Aliferis CF: Using content-based and bibliometric features for machine learning models to predict citation counts in the biomedical literature. Scientometrics, 2010.
210
Part 3: Modeling clinical judgment (Case Study 4)
• Sboner A, Aliferis CF: Modeling clinical judgment and implicit guideline compliance in the diagnosis of melanomas using machine learning. AMIA 2005 Annual Symposium Proceedings 2005:664‐668.
Part 3: Using SVMs for feature selection (Case Study 5)
• Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD: Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification. Part II: Analysis and Extensions. Journal of Machine Learning Research 2010.
• Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD: Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification. Part I: Algorithms and Empirical Evaluation. Journal of Machine Learning Research 2010.
211
Part 3: Outlier detection in ovarian cancer proteomics data (Case Study 6)
• Baggerly KA, Morris JS, Coombes KR: Reproducibility of SELDI‐TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics 2004, 20:777‐785.
• Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002, 359:572‐577.
212
Thank you for your attention!
Questions/Comments?
Email: [email protected]
URL: http://ww.nyuinformatics.org
213
Upcoming book on SVMs
World Scientific Publishing (in press), 2010
214