Sp’10 Bafna/Ideker
Classification (SVMs / Kernel method)
LP versus Quadratic programming
LP:  \min_x \; c^T x \quad \text{s.t.} \quad Ax \le b, \; x \ge 0
QP:  \min_x \; x^T Q x + c^T x \quad \text{s.t.} \quad Ax \le b, \; x \ge 0
• LP: linear constraints, linear objective function
• LP can be solved in polynomial time.
• In QP, the objective function contains a quadratic form.
• For positive semidefinite Q, the QP can be solved in polynomial time
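A minimal numeric sketch of why matrices of the form Q = AᵀA are positive semidefinite (so the corresponding QP is tractable): xᵀQx = ‖Ax‖² ≥ 0 for every x. The matrix A and the random test vectors are made up for illustration.

```python
# Check x^T Q x >= 0 for Q = A^T A; A and the test vectors are made up.
import random

def quad_form(Q, x):
    """Compute x^T Q x for a matrix Q (list of rows) and vector x."""
    return sum(x[i] * Q[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))

A = [[1.0, 2.0], [3.0, 4.0]]
# Q = A^T A is positive semidefinite by construction: x^T Q x = ||Ax||^2 >= 0.
Q = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

random.seed(0)
for _ in range(100):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    assert quad_form(Q, x) >= 0.0
```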
Margin of separation
• Suppose we find a separating hyperplane (β, β₀) s.t.
– For all +ve points x:  β^T x - β_0 \ge 1
– For all -ve points x:  β^T x - β_0 \le -1
• What is the margin of separation?
(Figure: the hyperplane β^T x - β_0 = 0 with margin boundaries β^T x - β_0 = 1 and β^T x - β_0 = -1)
Separating by a wider margin
• Solutions with a wider margin are better.
\text{Maximize } \frac{2}{\|\beta\|}, \text{ or minimize } \frac{\|\beta\|^2}{2}
Separating via misclassification
• In general, data is not linearly separable
• What if we also wanted to minimize misclassified points?
• Recall that each sample xᵢ in our training set has a label yᵢ ∈ {-1,1}
• For each point i, yᵢ(βᵀxᵢ - β₀) should be positive
• Define ξᵢ = max{0, 1 - yᵢ(βᵀxᵢ - β₀)}
• If i is correctly classified (yᵢ(βᵀxᵢ - β₀) ≥ 1), then ξᵢ = 0
• If i is incorrectly classified, or close to the boundary, ξᵢ > 0
• We must minimize Σᵢ ξᵢ
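The slack computation above can be sketched directly; this is a minimal illustration (not from the lecture), and the sample points, labels, β, and β₀ below are made up.

```python
# Slack xi_i = max(0, 1 - y_i * (beta . x_i - beta_0)); data is made up.

def slack(beta, beta_0, x, y):
    """Hinge slack: 0 when x is correctly classified with margin >= 1."""
    score = sum(b * xj for b, xj in zip(beta, x)) - beta_0
    return max(0.0, 1.0 - y * score)

beta, beta_0 = [1.0, 1.0], 0.0
assert slack(beta, beta_0, [2.0, 2.0], +1) == 0.0            # margin >= 1
assert abs(slack(beta, beta_0, [0.1, 0.1], +1) - 0.8) < 1e-9 # correct side, too close
assert slack(beta, beta_0, [-1.0, -1.0], +1) == 3.0          # misclassified
```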
Support Vector machines (wide margin and misclassification)
• Maximize margin while minimizing misclassification
• Solved using non-linear optimization techniques
• The problem can be reformulated to use only dot-products of the variables, which allows us to employ the kernel method.
• This gives the method a lot of power.
\min \; \frac{\|\beta\|^2}{2} + C\sum_i \xi_i
Reformulating the optimization
\min \; \frac{\|\beta\|^2}{2} + C\sum_i \xi_i
\text{s.t.} \quad \xi_i \ge 0, \quad \xi_i \ge 1 - y_i(\beta^T x_i - \beta_0)
Lagrangian relaxation
• Goal:
\min \; \frac{\|\beta\|^2}{2} + C\sum_i \xi_i
• S.t.
\xi_i \ge 0, \quad \xi_i \ge 1 - y_i(\beta^T x_i - \beta_0)
• We minimize the Lagrangian:
L = \frac{\|\beta\|^2}{2} + C\sum_i \xi_i - \sum_i \alpha_i\left(\xi_i - 1 + y_i(\beta^T x_i - \beta_0)\right) - \sum_i \lambda_i \xi_i
Simplifying
L = \frac{\|\beta\|^2}{2} + C\sum_i \xi_i - \sum_i \alpha_i\left(\xi_i - 1 + y_i(\beta^T x_i - \beta_0)\right) - \sum_i \lambda_i \xi_i
  = \frac{\beta^T\beta}{2} - \beta^T\left(\sum_i \alpha_i y_i x_i\right) + \sum_i (C - \alpha_i - \lambda_i)\,\xi_i + \beta_0\sum_i \alpha_i y_i + \sum_i \alpha_i
• For fixed α ≥ 0, λ ≥ 0, we minimize the Lagrangian:
\frac{\partial L}{\partial \beta} = \beta - \sum_i \alpha_i y_i x_i = 0 \quad (1)
\frac{\partial L}{\partial \beta_0} = \sum_i y_i \alpha_i = 0 \quad (2)
\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \lambda_i = 0 \quad (3)
Substituting
• Substituting (1), i.e. β = Σᵢ αᵢyᵢxᵢ:
L = \frac{\beta^T\beta}{2} - \beta^T\left(\sum_i \alpha_i y_i x_i\right) + \sum_i (C - \alpha_i - \lambda_i)\,\xi_i + \beta_0\sum_i \alpha_i y_i + \sum_i \alpha_i
L = -\frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j x_i^T x_j + \sum_i (C - \alpha_i - \lambda_i)\,\xi_i + \beta_0\sum_i \alpha_i y_i + \sum_i \alpha_i
• Substituting (2) and (3) into
L = -\frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j x_i^T x_j + \sum_i (C - \alpha_i - \lambda_i)\,\xi_i + \beta_0\sum_i \alpha_i y_i + \sum_i \alpha_i
the middle terms vanish, leaving the dual problem:
\max_\alpha \; -\frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j x_i^T x_j + \sum_i \alpha_i
\text{s.t.} \quad \sum_i y_i\alpha_i = 0, \quad 0 \le \alpha_i \le C
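As a sanity check on the dual above, the objective and constraints can be evaluated for a candidate α; this is a sketch (not from the lecture), and the two-point training set xs, labels y, C, and α below are made up.

```python
# Evaluate the dual objective sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j x_i.x_j
# and check feasibility; the training set and alpha are made up.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, y, xs):
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(xs[i], xs[j])
               for i in range(len(xs)) for j in range(len(xs)))
    return sum(alpha) - 0.5 * quad

xs = [[1.0, 0.0], [-1.0, 0.0]]   # one +ve and one -ve point
y = [+1, -1]
C = 1.0
alpha = [0.5, 0.5]
# Feasibility: sum_i y_i alpha_i = 0 and 0 <= alpha_i <= C
assert abs(sum(a * yi for a, yi in zip(alpha, y))) < 1e-12
assert all(0.0 <= a <= C for a in alpha)
```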
Classification using SVMs
• Under these conditions, the problem is a quadratic programming problem and can be solved using known techniques
• Quiz: When we have solved this QP, how do we classify a point x?
f(x) = \beta^T x - \beta_0 = \sum_i y_i \alpha_i x_i^T x - \beta_0
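The decision function f(x) above can be sketched directly from the α's; the support vectors xs, labels ys, multipliers alphas, and β₀ below are made up for illustration.

```python
# f(x) = sum_i y_i alpha_i x_i^T x - beta_0; all data below is made up.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def f(x, alphas, ys, xs, beta_0):
    return sum(yi * ai * dot(xi, x) for ai, yi, xi in zip(alphas, ys, xs)) - beta_0

xs = [[2.0, 0.0], [-2.0, 0.0]]   # "support vectors"
ys = [+1, -1]
alphas = [0.25, 0.25]
beta_0 = 0.0
# sign(f(x)) gives the predicted class
assert f([3.0, 1.0], alphas, ys, xs, beta_0) > 0
assert f([-3.0, 1.0], alphas, ys, xs, beta_0) < 0
```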
The kernel method
• The SVM formulation can be solved using QP on dot-products.
• As these are wide-margin classifiers, they provide a more robust solution.
• However, the true power of SVMs comes from using ‘the kernel method’, which allows us to go to higher-dimensional (and non-linear) spaces
kernel
• Let X be the set of objects
– Ex: X = the set of samples in micro-arrays.
– Each object x ∈ X is a vector of gene expression values
• k: X × X → R is a positive semidefinite kernel if
– k is symmetric:  k(x,x') = k(x',x)
– k is +ve semidefinite: for the kernel matrix K on any p objects,
c^T K c \ge 0 \quad \forall c \in R^p
Kernels as dot-product
• Quiz: Suppose the objects x are all real vectors (as in gene expression)
• Define
k_L(x,x') = x^T x'
• Is k_L a kernel? It is symmetric, but is it +ve semi-definite?
Linear kernel is +ve semidefinite
• Write X as a matrix, such that each column is a sample:  X = [x₁ x₂ …]
• By definition, the linear kernel matrix is K_L = XᵀX
• For any c:
c^T K_L c = c^T X^T X c = \|Xc\|^2 \ge 0
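The identity above is easy to verify numerically; this is a sketch with a made-up matrix X and vector c.

```python
# Verify c^T (X^T X) c == ||Xc||^2 >= 0; X and c are made up.

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

X = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]          # columns are samples x1, x2, x3
c = [1.0, -2.0, 0.5]

Xc = matvec(X, c)
norm_sq = sum(v * v for v in Xc)

# Build K_L = X^T X and compute c^T K_L c directly
K = [[sum(X[k][i] * X[k][j] for k in range(2)) for j in range(3)] for i in range(3)]
quad = sum(c[i] * K[i][j] * c[j] for i in range(3) for j in range(3))

assert abs(quad - norm_sq) < 1e-9 and quad >= 0.0
```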
Generalizing kernels
• Any object can be represented by a feature vector in real space:
\varphi : X \to R^p
k(x,x') = \varphi(x)^T \varphi(x')
Generalizing
• Note that the feature mapping could actually be non-linear.
• On the flip side, every kernel can be represented as a dot-product in a high-dimensional space.
• Sometimes the kernel is easier to define than the mapping.
The kernel trick
• If an algorithm for vectorial data is expressed exclusively in the form of dot-products, it can be converted into an algorithm on an arbitrary kernel:
– Simply replace the dot-product by the kernel
Kernel trick example
• Consider a kernel k defined via a mapping φ:
– k(x,x') = φ(x)ᵀφ(x')
• It could be that φ is very difficult to compute explicitly, but k is easy to compute
• Suppose we define a distance function between two objects as
d(x,x') = \|\varphi(x) - \varphi(x')\|
• How do we compute this distance?
d(x,x')^2 = \varphi(x)^T\varphi(x) + \varphi(x')^T\varphi(x') - 2\varphi(x)^T\varphi(x') = k(x,x) + k(x',x') - 2k(x,x')
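The distance can thus be computed purely from kernel evaluations, without ever forming φ; a minimal sketch, using the Gaussian RBF kernel as an example (the choice of σ is arbitrary).

```python
# d(x,x')^2 = k(x,x) + k(x',x') - 2 k(x,x'); the RBF kernel and sigma are example choices.
import math

def k_rbf(x, xp, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq / (2.0 * sigma ** 2))

def kernel_distance(x, xp, k):
    # max(0, .) guards against tiny negative values from rounding
    return math.sqrt(max(0.0, k(x, x) + k(xp, xp) - 2.0 * k(x, xp)))

assert kernel_distance([0.0], [0.0], k_rbf) == 0.0
assert kernel_distance([0.0], [5.0], k_rbf) > kernel_distance([0.0], [1.0], k_rbf)
```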
Kernels and SVMs
• Recall that SVM-based classification is described as:
\max_\alpha \; -\frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j x_i^T x_j + \sum_i \alpha_i
\text{s.t.} \quad \sum_i y_i\alpha_i = 0, \quad 0 \le \alpha_i \le C
Kernels and SVMs
• Applying the kernel trick
• We can try kernels that are biologically relevant
\max_\alpha \; -\frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j\, k(x_i, x_j) + \sum_i \alpha_i
\text{s.t.} \quad \sum_i y_i\alpha_i = 0, \quad 0 \le \alpha_i \le C
Examples of kernels for vectors
\text{linear kernel:} \quad k_L(x,x') = x^T x'
\text{poly kernel:} \quad k_p(x,x') = \left(x^T x' + c\right)^d
\text{Gaussian RBF kernel:} \quad k_G(x,x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)
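The three kernels above are one-liners; a sketch with arbitrary parameter choices for c, d, and σ, and made-up test vectors.

```python
# Linear, polynomial, and Gaussian RBF kernels; parameters c, d, sigma are arbitrary.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k_linear(x, xp):
    return dot(x, xp)

def k_poly(x, xp, c=1.0, d=2):
    return (dot(x, xp) + c) ** d

def k_gauss(x, xp, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq / (2.0 * sigma ** 2))

x, xp = [1.0, 2.0], [3.0, 0.0]
assert k_linear(x, xp) == 3.0
assert k_poly(x, xp) == 16.0            # (3 + 1)^2
assert abs(k_gauss(x, x) - 1.0) < 1e-12 # distance 0 maps to kernel value 1
```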
String kernel
• Consider a string s = s₁s₂…
• Define an index set I = {i₁ < i₂ < … < i_k} as a subset of indices
• s[I] is the subsequence of s limited to those indices
• l(I) = i_k − i₁ + 1, the span of I
• W(I) = c^{l(I)}, with c < 1
– Weight decreases as span increases
• For any string u of length k:
\varphi_u(s) = \sum_{I:\, s[I] = u} c^{\,l(I)}
String Kernel
• Map every string s to a high-dimensional vector φ(s), with one coordinate φ_u(s) for every string u of length up to n
• The mapping is expensive, but given two strings s, t, the dot-product kernel k(s,t) = φ(s)ᵀφ(t) can be computed in O(n |s| |t|) time
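To make the definitions concrete, φ_u(s) can be computed by brute force over all index sets (exponential in |s|, so only for tiny strings; the efficient O(n|s||t|) dynamic program is not shown). The decay constant c = 0.5 is a made-up choice.

```python
# Brute-force phi_u(s) = sum over index sets I with s[I] = u of c^span(I),
# and the resulting dot-product kernel; only feasible for tiny strings.
from itertools import combinations, product

def phi_u(s, u, c=0.5):
    total = 0.0
    for I in combinations(range(len(s)), len(u)):
        if ''.join(s[i] for i in I) == u:
            span = I[-1] - I[0] + 1   # l(I)
            total += c ** span
    return total

def string_kernel(s, t, n, c=0.5):
    """k(s,t) = sum over all strings u of length <= n of phi_u(s) * phi_u(t)."""
    alphabet = sorted(set(s) | set(t))
    total = 0.0
    for k in range(1, n + 1):
        for u in product(alphabet, repeat=k):
            total += phi_u(s, ''.join(u), c) * phi_u(t, ''.join(u), c)
    return total

assert phi_u("ab", "ab") == 0.25        # one index set, span 2 -> 0.5^2
assert string_kernel("ab", "ba", 1) > 0 # shared single-letter subsequences
```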
SVM conclusion
• SVMs are a generic scheme for classifying data with wide margins and low misclassification
• For data that is not easily represented as vectors, the kernel trick provides a standard recipe for classification:
– Define a meaningful kernel, and solve using SVM
• Many standard kernels are available (linear, poly., RBF, string)
Classification review
• We started out by treating the classification problem as one of separating points in high dimensional space
• Obvious for gene expression data, but applicable to any kind of data
• Questions of separability, linear separation
• Algorithms for classification:
– Perceptron
– Lin. discriminant
– Max likelihood
– Linear programming
– SVMs
– Kernel methods & SVM
Classification review
• Recall that we considered 3 problems:
– Group samples together in an unsupervised fashion (clustering)
– Classify based on training data (often by learning a hyperplane that separates)
– Select marker genes that are diagnostic for the class; all other genes can be discarded, leading to lower dimensionality
Dimensionality reduction
• Many genes have highly correlated expression profiles.
• By discarding some of the genes, we can greatly reduce the dimensionality of the problem.
• There are other, more principled ways to do such dimensionality reduction.
Why is high dimensionality bad?
• With a high enough dimensionality, all points can be linearly separated.
• Recall that a point xᵢ is misclassified if
– it is +ve, but βᵀxᵢ - β₀ ≤ 0
– it is -ve, but βᵀxᵢ - β₀ > 0
• In the first case, choose δᵢ s.t.
– βᵀxᵢ - β₀ + δᵢ ≥ 0
• By adding a dimension for each misclassified point, we create a higher-dimensional hyperplane that perfectly separates all of the points!
Principal Components Analysis
• PCA recovers the intrinsic dimensionality of a data-set.
Principal Components Analysis
• Consider the expression values of 2 genes over 6 samples.
• Clearly, the expression of the two genes is highly correlated.
• Projecting all the samples onto a single line could explain most of the data.
• This is a generalization of “discarding the gene”.
Projecting
• Consider the mean m of all points, and a unit vector β emanating from the mean
• Algebraically, projection onto β means that every sample x can be represented by a single value βᵀ(x - m)
(Figure: point x, mean m, displacement x - m, and its projection βᵀ(x - m))
Higher dimensions
• Consider a set of 2 (more generally, k) orthonormal vectors β₁, β₂, …
• Once projected, each sample x can be represented by a 2 (k) dimensional vector:
\left(\beta_1^T(x - m),\; \beta_2^T(x - m), \ldots\right)
(Figure: point x projected onto the orthonormal directions β₁, β₂ from the mean m)
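The projection step is a few lines of arithmetic; a minimal sketch in which the mean m, the orthonormal directions, and the sample point are made up.

```python
# Represent a sample by its projections beta_i^T (x - m); data is made up.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(x, m, betas):
    diff = [a - b for a, b in zip(x, m)]
    return [dot(beta, diff) for beta in betas]

m = [1.0, 1.0]
betas = [[1.0, 0.0], [0.0, 1.0]]   # orthonormal directions
assert project([3.0, 2.0], m, betas) == [2.0, 1.0]
# With k = 1 we keep only the first coordinate
assert project([3.0, 2.0], m, betas[:1]) == [2.0]
```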
How to project
• The generic scheme allows us to project an m-dimensional point set onto a k-dimensional one.
• How do we select the k ‘best’ dimensions?
• The strategy used by PCA is to maximize the variance of the projected points around the mean
PCA
• Suppose all of the data were to be reduced by projecting onto a single line β through the mean m.
• How do we select the line β?
PCA cont’d
• Let each point x_k map to x'_k = m + a_kβ. We want to minimize the error
\sum_k \|x_k - x'_k\|^2
• Observation 1: Each point x_k maps to x'_k = m + ββᵀ(x_k - m), i.e. a_k = βᵀ(x_k - m)
(Figure: point x_k and its projection x'_k on the line through m)
Proof of Observation 1
\min_{a_k} \|x_k - x'_k\|^2
= \min_{a_k} \|x_k - m + m - x'_k\|^2
= \min_{a_k} \|x_k - m\|^2 + \|m - x'_k\|^2 - 2(x'_k - m)^T(x_k - m)
= \min_{a_k} \|x_k - m\|^2 + a_k^2\beta^T\beta - 2a_k\beta^T(x_k - m)
= \min_{a_k} \|x_k - m\|^2 + a_k^2 - 2a_k\beta^T(x_k - m)
Differentiating w.r.t. a_k:
2a_k - 2\beta^T(x_k - m) = 0
a_k = \beta^T(x_k - m)
\Rightarrow a_k^2 = a_k\beta^T(x_k - m)
\Rightarrow \|x_k - x'_k\|^2 = \|x_k - m\|^2 - \beta^T(x_k - m)(x_k - m)^T\beta
Minimizing PCA Error
\sum_k \|x_k - x'_k\|^2 = C - \sum_k \beta^T(x_k - m)(x_k - m)^T\beta = C - \beta^T S\beta
• To minimize the error, we must maximize βᵀSβ
• Maximizing βᵀSβ over unit vectors β gives Sβ = λβ: λ = βᵀSβ is an eigenvalue of S, and β the corresponding eigenvector
• Therefore, we must choose the eigenvector corresponding to the largest eigenvalue
PCA steps
• X = starting matrix with n columns (one sample x_j per column), m rows
1. m = \frac{1}{n}\sum_{j=1}^{n} x_j
2. h^T = [1\; 1 \cdots 1]
3. M = X - mh^T
4. S = MM^T = \sum_{j=1}^{n} (x_j - m)(x_j - m)^T
5. B^T S B = \Lambda \quad \text{(diagonalize S)}
6. Return B^T M
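The steps above can be sketched from scratch for 2-dimensional data; the data points are made up, and the 2×2 eigenproblem in step 5 is solved in closed form (leading eigenvector [b, λ − a] of a symmetric [[a,b],[b,d]]) rather than by a general diagonalization.

```python
# PCA on made-up 2D data: mean, center, scatter matrix, leading eigenvector, project.
import math

pts = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]  # one sample per row here
n = len(pts)

# Steps 1-3: mean and centered data
m = [sum(p[i] for p in pts) / n for i in range(2)]
M = [[p[i] - m[i] for i in range(2)] for p in pts]

# Step 4: scatter matrix S = sum_j (x_j - m)(x_j - m)^T
S = [[sum(r[i] * r[j] for r in M) for j in range(2)] for i in range(2)]

# Step 5 (2x2 closed form): leading eigenvalue/eigenvector of S
a, b, d = S[0][0], S[0][1], S[1][1]
lam = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b ** 2)
v = [b, lam - a]                      # eigenvector for lam (valid when b != 0)
norm = math.hypot(*v)
beta = [v[0] / norm, v[1] / norm]

# Step 6: project each centered sample onto the first principal component
scores = [r[0] * beta[0] + r[1] * beta[1] for r in M]
```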
End of Lecture
ALL-AML classification
• The two leukemias need different therapeutic regimens.
• Usually distinguished through hematopathology
• Can gene expression be used for a more definitive test?
– 38 bone marrow samples
– Total mRNA was hybridized against probes for 6817 genes
– Q: Are these classes separable?
Neighborhood analysis (cont’d)
• Each gene is represented by an expression vector v(g) = (e1,e2,…,en)
• Choose an idealized expression vector as center.
• Discriminating genes will be ‘closer’ to the center (any distance measure can be used).
Neighborhood analysis
• Q: Are there genes whose expression correlates with one of the two classes?
• A: For each class, create an idealized vector c
– Compute the number of genes N_c whose expression ‘matches’ the idealized expression vector
– Is N_c significantly larger than N_{c*} for a random c*?
Neighborhood test
• Distance measure used:
– For any binary vector c, let the 1 entries denote class 1, and the 0 entries denote class 2
– For each gene g, compute the mean and std. dev. [μ₁(g), σ₁(g)] of its expression in class 1, and likewise [μ₂(g), σ₂(g)]
– P(g,c) = [μ₁(g) - μ₂(g)] / [σ₁(g) + σ₂(g)]
– N₁(c,r) = {g | P(g,c) = r}
– High density for some r is indicative of correlation with the class distinction
– The neighborhood is significant if a random center does not produce the same density
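The P(g,c) score is straightforward to compute with Python's statistics module; the expression values and class labels below are made up for illustration.

```python
# P(g,c) = (mu1 - mu2) / (sigma1 + sigma2) for one gene; data is made up.
from statistics import mean, stdev

def P(expr, labels):
    class1 = [e for e, c in zip(expr, labels) if c == 1]
    class2 = [e for e, c in zip(expr, labels) if c == 0]
    return (mean(class1) - mean(class2)) / (stdev(class1) + stdev(class2))

labels = [1, 1, 1, 0, 0, 0]
high_in_class1 = [5.0, 5.5, 4.5, 1.0, 1.5, 0.5]   # discriminating gene
uninformative = [3.0, 1.0, 2.0, 3.0, 1.0, 2.0]    # same profile in both classes
assert P(high_in_class1, labels) > P(uninformative, labels)
assert P(uninformative, labels) == 0.0
```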
Neighborhood analysis
• #{g | P(g,c) > 0.3} > 709 (ALL) vs. 173 by chance
• Class prediction should be possible using micro-array expression values
Class prediction
• Choose a fixed set of informative genes (based on their correlation with the class distinction).
– The predictor is uniquely defined by the sample and the subset of informative genes.
• For each informative gene g, define (w_g, b_g):
– w_g = P(g,c)  (When is this +ve?)
– b_g = [μ₁(g) + μ₂(g)]/2
• Given a new sample X:
– x_g is the normalized expression value at g
– Vote of gene g: v_g = w_g(x_g - b_g)  (+ve value is a vote for class 1, negative for class 2)
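The weighted vote, together with the prediction strength PS = (V_win − V_lose)/(V_win + V_lose), can be sketched in a few lines; the per-gene weights w, thresholds b, and expression values below are made up.

```python
# Weighted-vote prediction and prediction strength; all data is made up.

def predict(x, w, b):
    """Sum votes w_g * (x_g - b_g); returns (winning class, prediction strength)."""
    votes = [wg * (xg - bg) for wg, xg, bg in zip(w, x, b)]
    v1 = sum(v for v in votes if v > 0)    # total vote for class 1
    v2 = -sum(v for v in votes if v < 0)   # total vote for class 2
    v_win, v_lose = max(v1, v2), min(v1, v2)
    ps = (v_win - v_lose) / (v_win + v_lose)
    winner = 1 if v1 >= v2 else 2
    return winner, ps

w = [2.0, 1.0, -1.5]   # w_g = P(g,c); negative w_g marks genes high in class 2
b = [1.0, 2.0, 1.0]    # b_g = (mu1(g) + mu2(g)) / 2
winner, ps = predict([3.0, 2.5, 0.5], w, b)
assert winner == 1 and 0.0 <= ps <= 1.0
```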
Prediction Strength
• PS = [V_win - V_lose]/[V_win + V_lose]
– Reflects the margin of victory
• A 50-gene predictor is correct on 36/38 samples (cross-validation)
• Prediction accuracy on other samples: 100% (a prediction was made for 29/34 samples)
• Median PS = 0.73
• Other predictors, with between 10 and 200 genes, all worked well
Performance
Differentially expressed genes?
• Do the predictive genes reveal any biology?
• Initial expectation is that most genes would be of a hematopoietic lineage.
• However, many genes encode:
– Cell cycle progression genes
– Chromatin remodelling
– Transcription
– Known oncogenes
– Leukemia targets (etoposide)
Relationship between ML and the Golub predictor
• ML classification, when the covariance matrix is diagonal with identical variances for the different classes, is similar to Golub’s classifier
p(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - m_i)^T\Sigma^{-1}(x - m_i)\right)
g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)
Compute \arg\max_i g_i(x)
With diagonal Σ (up to class-independent constants):
g_i(x) = -\frac{1}{2}\sum_{j=1}^{p} \frac{(x_j - \mu_{ij})^2}{\sigma_j^2}
g_1(x) - g_2(x) = \sum_j \frac{\mu_{1j} - \mu_{2j}}{\sigma_j^2}\left(x_j - \frac{\mu_{1j} + \mu_{2j}}{2}\right)
Automatic class discovery
• The classification of different cancers is the result of years of hypothesis-driven research.
• Suppose you were given unlabeled samples of ALL/AML. Would you be able to distinguish the two classes?
Self Organizing Maps
• SOMs were applied to group the 38 samples
• Class A1 contained 24/25 ALL and 3/13 AML samples.
• How can we validate this?
– Use the labels to do supervised classification via cross-validation
– A 20-gene predictor gave 34 accurate predictions, 1 error, and 2 of 3 uncertains
Comparing various error models
Conclusion