Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik &...
![Page 1: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/1.jpg)
Support Vector Machine Data Mining
Olvi L. Mangasarian
with
Glenn M. Fung, Jude W. Shavlik
& Collaborators at ExonHit – Paris
Data Mining Institute
University of Wisconsin - Madison
![Page 2: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/2.jpg)
What is a Support Vector Machine?
An optimally defined surface
Linear or nonlinear in the input space
Linear in a higher dimensional feature space
Implicitly defined by a kernel function K(A,B) C
![Page 3: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/3.jpg)
What are Support Vector Machines Used For?
Classification
Regression & Data Fitting
Supervised & Unsupervised Learning
![Page 4: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/4.jpg)
Principal Topics
Knowledge-based classification
Incorporate expert knowledge into a classifier
Breast cancer prognosis & chemotherapy
Classify patients on basis of distinct survival curves
Isolate a class of patients that may benefit from chemotherapy
Multiple Myeloma detection via gene expression measurements
Drug discovery based on gene macroarray expression
Joint work with ExonHit
![Page 5: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/5.jpg)
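Support Vector Machines
Maximize the Margin between Bounding Planes
[Figure: point sets A+ and A- separated by the bounding planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$, with normal vector $w$; the margin between the planes is $\frac{2}{\|w\|_2}$.]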
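As a minimal sketch of the geometry on this slide, the snippet below computes the distance between the bounding planes x'w = γ + 1 and x'w = γ - 1 and classifies a point by the side of the midplane x'w = γ on which it falls. The values of `w` and `gamma` and the function names are hypothetical, chosen only for illustration.

```python
import math

def margin_width(w):
    """Distance between the planes x'w = gamma + 1 and x'w = gamma - 1, i.e. 2 / ||w||_2."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def classify(x, w, gamma):
    """Return +1 (side of A+) or -1 (side of A-) relative to the plane x'w = gamma."""
    score = sum(xi * wi for xi, wi in zip(x, w)) - gamma
    return 1 if score >= 0 else -1

w = [3.0, 4.0]   # hypothetical normal vector, ||w||_2 = 5
gamma = 1.0      # hypothetical offset

print(margin_width(w))                  # 0.4
print(classify([1.0, 1.0], w, gamma))   # 1
print(classify([0.0, 0.0], w, gamma))   # -1
```

Shrinking the norm of w widens the margin, which is why the norm appears in the objective that the SVM minimizes.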
![Page 6: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/6.jpg)
Principal Topics
Knowledge-based classification (NIPS*2002)
![Page 7: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/7.jpg)
Conventional Data-Based SVM
![Page 8: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/8.jpg)
Knowledge-Based SVM via Polyhedral Knowledge Sets
![Page 9: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/9.jpg)
Incorporating Knowledge Sets Into an SVM Classifier
Suppose that the knowledge set $\{x \mid Bx \le b\}$ belongs to the class A+. Hence it must lie in the halfspace $\{x \mid x'w \ge \gamma + 1\}$.
We therefore have the implication:
$$Bx \le b \implies x'w \ge \gamma + 1$$
This implication is equivalent to a set of constraints that can be imposed on the classification problem.
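One standard way to turn the implication "Bx ≤ b implies x'w ≥ γ + 1" into constraints is linear programming duality. The statement below is a sketch under the assumption that the knowledge set is nonempty; the multiplier vector u is a symbol introduced here and does not appear on the slide.

```latex
\{x \mid Bx \le b\} \neq \emptyset: \qquad
\bigl( Bx \le b \implies x'w \ge \gamma + 1 \bigr)
\iff
\exists\, u \ge 0 : \quad B'u + w = 0, \quad b'u + \gamma + 1 \le 0 .
```

The "if" direction can be checked directly: for any x with Bx ≤ b, x'w = -u'Bx ≥ -u'b ≥ γ + 1. The right-hand side is linear in (w, γ, u), so it can be added to a linear programming formulation of the classifier.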
![Page 10: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/10.jpg)
Numerical Testing
The Promoter Recognition Dataset
Promoter: Short DNA sequence that precedes a gene sequence.
A promoter consists of 57 consecutive DNA nucleotides belonging to {A,G,C,T}.
Important to distinguish between promoters and nonpromoters
This distinction identifies starting locations of genes in long uncharacterized DNA sequences.
![Page 11: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/11.jpg)
The Promoter Recognition Dataset
Numerical Representation
Simple “1 of N” mapping scheme for converting nominal attributes into a real valued representation:
Not most economical representation, but commonly used.
![Page 12: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/12.jpg)
The Promoter Recognition Dataset
Numerical Representation
Feature space mapped from 57-dimensional nominal space to a real valued 57 x 4 = 228 dimensional space.
57 nominal values
57 x 4 = 228 binary values
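The "1 of N" mapping can be sketched as follows. The column order within each group of four (A, G, C, T) is an assumption taken from the order of the set {A,G,C,T}, since the mapping table itself did not survive in this transcript; the example sequence is hypothetical.

```python
NUCLEOTIDES = "AGCT"  # assumed column order, following the set {A,G,C,T}

def one_of_n(sequence):
    """Map a nucleotide string to a flat list of 0/1 values, four per position."""
    encoded = []
    for base in sequence:
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        encoded.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return encoded

example = "ACGT" * 14 + "A"        # hypothetical sequence of 57 nucleotides
vector = one_of_n(example)

print(len(example), len(vector))   # 57 228
print(vector[:8])                  # [1, 0, 0, 0, 0, 0, 1, 0]
```

Exactly one entry in each group of four is 1, so every encoded sequence has 57 ones among its 228 values.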
![Page 13: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/13.jpg)
Promoter Recognition Dataset Prior Knowledge Rules
Prior knowledge consists of the following 64 rules:
$$(R_1 \lor R_2 \lor R_3 \lor R_4) \land (R_5 \lor R_6 \lor R_7 \lor R_8) \land (R_9 \lor R_{10} \lor R_{11} \lor R_{12}) \implies \text{PROMOTER}$$
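The count of 64 is consistent with picking one rule from each of the three groups of four rules and requiring all three to hold (4 x 4 x 4 = 64). A short sketch that enumerates these combinations by name only:

```python
from itertools import product

group_1 = ["R1", "R2", "R3", "R4"]
group_2 = ["R5", "R6", "R7", "R8"]
group_3 = ["R9", "R10", "R11", "R12"]

# Each conjunction takes exactly one rule from every group.
conjunctions = list(product(group_1, group_2, group_3))

print(len(conjunctions))                      # 64
print(("R4", "R8", "R10") in conjunctions)    # True
```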
![Page 14: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/14.jpg)
Promoter Recognition Dataset Sample Rules
$R_4: (p_{-36} = T) \land (p_{-35} = T) \land (p_{-34} = G) \land (p_{-33} = A) \land (p_{-32} = C)$;
$R_8: (p_{-12} = T) \land (p_{-11} = A) \land (p_{-07} = T)$;
$R_{10}: (p_{-45} = A) \land (p_{-44} = A) \land (p_{-41} = A)$;
where $p_j$ denotes the position of a nucleotide with respect to a meaningful reference point, starting at position $p_{-50}$ and ending at position $p_7$.
Then:
$$R_4 \land R_8 \land R_{10} \implies \text{PROMOTER}$$
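The sample rules can be checked against a sequence with a few lines of code. This is an illustrative sketch: the index mapping assumes there is no position 0, which is what makes the range p-50 to p7 contain exactly 57 positions, and the test sequence is hypothetical.

```python
def position_index(j):
    """Map a position label j in -50..-1 or 1..7 to a 0-based index 0..56 (assumes no position 0)."""
    if -50 <= j <= -1:
        return j + 50
    if 1 <= j <= 7:
        return j + 49
    raise ValueError(f"position {j} is outside p-50..p7")

# Rules as {position: required nucleotide}, copied from the slide.
R4 = {-36: "T", -35: "T", -34: "G", -33: "A", -32: "C"}
R8 = {-12: "T", -11: "A", -7: "T"}
R10 = {-45: "A", -44: "A", -41: "A"}

def rule_holds(rule, sequence):
    """True if every position named by the rule carries the required nucleotide."""
    return all(sequence[position_index(j)] == base for j, base in rule.items())

def fires_r4_r8_r10(sequence):
    """True if R4, R8 and R10 all hold, which the slide says implies PROMOTER."""
    return all(rule_holds(rule, sequence) for rule in (R4, R8, R10))

# Hypothetical sequence: all C, then overwritten where the three rules require it.
bases = ["C"] * 57
for rule in (R4, R8, R10):
    for j, base in rule.items():
        bases[position_index(j)] = base
sequence = "".join(bases)

print(fires_r4_r8_r10(sequence))    # True
print(fires_r4_r8_r10("C" * 57))    # False
```

The rules are sufficient conditions only: a sequence on which they do not fire is not thereby labeled a nonpromoter.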
![Page 15: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/15.jpg)
The Promoter Recognition Dataset
Comparative Algorithms
KBANN: Knowledge-based artificial neural network [Shavlik et al]
BP: Standard back propagation for neural networks [Rumelhart et al]
O’Neill’s Method: Empirical method suggested by biologist O’Neill [O’Neill]
NN: Nearest neighbor with k=3 [Cost et al]
ID3: Quinlan’s decision tree builder [Quinlan]
SVM1: Standard 1-norm SVM [Bradley et al]
![Page 16: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/16.jpg)
The Promoter Recognition Dataset
Comparative Test Results
![Page 17: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/17.jpg)
Principal Topics
Breast cancer prognosis & chemotherapy
![Page 18: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/18.jpg)
Kaplan-Meier Curves for Overall Patients: With & Without Chemotherapy
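For readers unfamiliar with these curves, the following is a minimal sketch of the standard Kaplan-Meier (product-limit) estimator. The follow-up times and event flags are hypothetical and are not taken from the study.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed death.

    times:  follow-up time for each patient
    events: 1 if death was observed at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

times = [6, 12, 12, 20, 30]    # hypothetical months of follow-up
events = [1, 1, 0, 1, 0]       # hypothetical: 1 = death, 0 = censored

print([(t, round(s, 2)) for t, s in kaplan_meier(times, events)])
# [(6, 0.8), (12, 0.6), (20, 0.3)]
```

Computing one such curve per patient group (for example, with and without chemotherapy) gives the kind of comparison these slides plot.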
![Page 19: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/19.jpg)
Breast Cancer Prognosis & Chemotherapy
Good, Intermediate & Poor Patient Groupings
(6 Input Features: 5 Cytological, 1 Histological)
(Clustering: Utilizes 2 Histological Features & Chemotherapy)
253 Patients (113 NoChemo, 140 Chemo)
Compute Initial Cluster Centers
Good1: Lymph=0 AND Tumor<2. Compute Median Using 6 Features
Poor1: Lymph>=5 OR Tumor>=4. Compute Median Using 6 Features
Cluster 113 NoChemo Patients. Use k-Median Algorithm with Initial Centers: Medians of Good1 & Poor1
69 NoChemo Good, 44 NoChemo Poor
Cluster 140 Chemo Patients. Use k-Median Algorithm with Initial Centers: Medians of Good1 & Poor1
67 Chemo Good, 73 Chemo Poor
Resulting groups: Good, Intermediate, Poor
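The clustering step can be sketched as below: initial centers are the medians of the Good1 and Poor1 subsets, and a k-median iteration then alternates between assigning each patient to the nearest center in the 1-norm and recomputing each center as a coordinate-wise median. This is an illustrative sketch, not the study's code. For brevity the hypothetical patients carry only two features, (lymph, tumor), whereas the slide computes medians over 6 features.

```python
from statistics import median

def l1_distance(x, c):
    return sum(abs(a - b) for a, b in zip(x, c))

def coordinate_median(points):
    """Coordinate-wise median of a non-empty list of points."""
    return [median(column) for column in zip(*points)]

def k_median(points, centers, max_iter=100):
    """Alternate nearest-center assignment (1-norm) and median updates until centers stop changing."""
    centers = [list(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: l1_distance(p, centers[k]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        new_centers = [coordinate_median(c) if c else centers[k] for k, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical patients as (lymph, tumor) pairs.
patients = [(0, 1.0), (0, 1.5), (1, 2.5), (2, 3.0), (6, 3.5), (8, 5.0), (0, 4.5)]

good1 = [p for p in patients if p[0] == 0 and p[1] < 2]    # Lymph=0 AND Tumor<2
poor1 = [p for p in patients if p[0] >= 5 or p[1] >= 4]    # Lymph>=5 OR Tumor>=4
initial_centers = [coordinate_median(good1), coordinate_median(poor1)]

centers, clusters = k_median(patients, initial_centers)
print(len(clusters[0]), len(clusters[1]))   # 5 2
```

In the slides this procedure is run separately on the 113 NoChemo patients and the 140 Chemo patients, each time starting from the same two initial centers.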
![Page 20: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/20.jpg)
Kaplan-Meier Survival Curves for Good, Intermediate & Poor Patients
82.7% Classifier Correctness via 3 SVMs
![Page 21: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/21.jpg)
Kaplan-Meier Survival Curves for Intermediate Group
Note Reversed Role of Chemotherapy
![Page 22: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/22.jpg)
Multiple Myeloma Detection
Multiple Myeloma is cancer of the plasma cell
Plasma cells normally produce antibodies
Out of control plasma cells produce tumors
When tumors appear in multiple sites they are called Multiple Myeloma
Dataset
105 patients: 74 with MM, 31 healthy
Each patient is represented by 7008 gene measurements taken from plasma cell samples
For each one of the 7008 gene measurements:
Absolute Call (AC): Absent (A), Marginal (M) or Present (P)
Average Difference (AD): Positive or negative number
![Page 23: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/23.jpg)
Multiple Myeloma Data Representation
A 1 0 0
M 0 1 0
P 0 0 1
AC (A/M/P encoding): 7008 x 3 = 21,024
AD: 7008
Total = 28,032 per patient
105 Patients: 74 MM + 31 Healthy
105 x 28,032 Data Matrix A
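The per-patient feature count can be reproduced with a short sketch. The column layout (all Absolute Call indicator triples first, then all Average Difference values) and the placeholder measurements are assumptions made for illustration.

```python
AC_CODES = {"A": (1, 0, 0), "M": (0, 1, 0), "P": (0, 0, 1)}

def encode_patient(absolute_calls, average_differences):
    """Build one row of the data matrix: 3 indicator values per AC, then 1 value per AD."""
    if len(absolute_calls) != len(average_differences):
        raise ValueError("need one AC and one AD per gene")
    row = []
    for call in absolute_calls:
        row.extend(AC_CODES[call])
    row.extend(float(v) for v in average_differences)
    return row

n_genes = 7008
calls = ["A", "M", "P"] * (n_genes // 3)   # placeholder calls, 7008 in total
diffs = [0.0] * n_genes                    # placeholder average differences

row = encode_patient(calls, diffs)
print(len(row))   # 28032
```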
![Page 24: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/24.jpg)
Multiple Myeloma 1-Norm SVM Linear Classifier
Leave-one-out-correctness (looc) = 100%
Average number of features used = 7 per fold
Total computing time for 105 folds = 7892 sec.
Overall number of features used in 105 folds = 7
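The small number of features used comes from the 1-norm penalty on w, which tends to drive most components of w to zero. A commonly used linear programming form of the 1-norm SVM, written to match the bounding planes x'w = γ ± 1 used earlier in the talk, is sketched below. The symbols D (diagonal matrix of ±1 class labels), e (vector of ones), y (slack vector) and ν (error weight) are assumptions introduced here; they do not appear in this transcript.

```latex
\min_{w,\,\gamma,\,y} \;\; \nu\, e'y + \|w\|_1
\quad \text{subject to} \quad
D(Aw - e\gamma) + y \ge e, \qquad y \ge 0 .
```

Only the genes whose components of w are nonzero enter the resulting classifier, which is how 28,032 input features can reduce to a handful.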
![Page 25: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/25.jpg)
Breast Cancer Treatment Response
Joint with ExonHit - Paris (Curie Dataset)
35 patients treated by a drug cocktail
9 partial responders; 26 nonresponders
25 gene expressions out of 692, selected by Arnaud Zeboulon
Most patients had 3 replicate measurements
1-Norm SVM classifier selected 14 out of 25 gene expressions
Leave-one-out correctness was 80%
Greedy combinatorial approach selected 5 genes out of 14
Separating plane obtained in 5-dimensional gene-expression space
Replicates of all patients except one used in training
Average of replicates of patient left out used for testing
Leave-one-out correctness was 33 out of 35, or 94.2%
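The leave-one-out protocol with replicates can be sketched as follows: for each patient, train on the replicates of all other patients, then test on the average of the held-out patient's replicates. A nearest-class-mean classifier stands in for the SVM here purely to keep the example self-contained; it is not the method used in the talk, and the patient data are hypothetical.

```python
def mean_vector(rows):
    return [sum(column) / len(column) for column in zip(*rows)]

def nearest_mean_classifier(train_rows, train_labels):
    """Stand-in for the SVM: predict the class whose mean is closest (squared Euclidean distance)."""
    means = {}
    for label in set(train_labels):
        means[label] = mean_vector([r for r, l in zip(train_rows, train_labels) if l == label])

    def predict(x):
        return min(means, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, means[lab])))

    return predict

def leave_one_patient_out(replicates, labels, fit=nearest_mean_classifier):
    """replicates: {patient_id: [replicate vectors]}; labels: {patient_id: class label}."""
    correct = 0
    for held_out in replicates:
        train_rows, train_labels = [], []
        for pid, reps in replicates.items():
            if pid != held_out:
                train_rows.extend(reps)                        # every replicate is a training point
                train_labels.extend([labels[pid]] * len(reps))
        predict = fit(train_rows, train_labels)
        test_point = mean_vector(replicates[held_out])         # average of the held-out replicates
        if predict(test_point) == labels[held_out]:
            correct += 1
    return correct, len(replicates)

# Hypothetical data: 4 patients, 2 gene expressions, 1 to 3 replicates each.
replicates = {
    "p1": [[0.0, 0.1], [0.2, 0.0]],
    "p2": [[0.1, 0.3], [0.0, 0.2], [0.2, 0.2]],
    "p3": [[1.0, 1.1], [0.9, 1.0]],
    "p4": [[1.2, 0.9]],
}
labels = {"p1": "nonresponder", "p2": "nonresponder", "p3": "responder", "p4": "responder"}

print(leave_one_patient_out(replicates, labels))   # (4, 4)
```

Holding out all replicates of a patient together avoids testing on a point whose near-duplicates were seen in training.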
![Page 26: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/26.jpg)
Separation of Convex Hull of Replicates of:
10 Synthetic Nonresponders & 4 Synthetic Partial Responders
![Page 27: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/27.jpg)
Linear Classifier in 3-Gene Space
35 Patients with 93 Replicates
26 Nonresponders & 9 Partial Responders
![Page 28: Support Vector Machine Data Mining Olvi L. Mangasarian with Glenn M. Fung, Jude W. Shavlik & Collaborators at ExonHit – Paris Data Mining Institute University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f495503460f94c6b70c/html5/thumbnails/28.jpg)
Conclusion
New approaches for SVM-based classification
Algorithms capable of classifying data with few examples in very large dimensional spaces
Typical of microarray classification problems
Classifiers based on both abstract prior knowledge as well as conventional datasets
Identification of breast cancer patients that can benefit from chemotherapy
Useful tool for drug discovery