Mathematical Programming in Data Mining Author: O. L. Mangasarian Advisor: Dr. Hsu Graduate:...
Mathematical Programming in Data Mining
Author: O. L. Mangasarian
Advisor: Dr. Hsu
Graduate: Yan-Cheng Lin
Abstract
Describes the application of mathematical programming to feature selection, clustering, and robust representation
Outline
Motivation
Objective
Problems
Feature Selection
Clustering
Robust Representation
Conclusion
Motivation
Mathematical programming has been applied to a great variety of theoretical problems
Problems can be formulated and effectively solved as mathematical programs
Objective
Describe three mathematical-programming-based developments relevant to data mining
Problems
Feature Selection
Clustering
Robust Representation
Problem - Feature Selection
Discriminate between two finite point sets in n-dimensional feature space while using as few of the features as possible
Formulated as a mathematical program with a parametric objective function and linear constraints
Problem - Clustering
Assign m points in the n-dimensional real space R^n to k clusters
Formulated as determining k centers in R^n such that the sum of the distances of each point to its nearest center is minimized
Problem - Robust Representation
Model a system of relations in a manner that preserves the validity of the representation when the data on which the model is based changes
A sufficiently small error τ is purposely tolerated
Feature Selection
Use the simplest model to describe the essence of a phenomenon
Binary classification problem: discriminate between two given point sets A and B in the n-dimensional real space R^n using as few of the n dimensions of the space as possible
Binary classification
[Figure: binary classification; labels W and P]
The following are defined: A, B
Feature Selection
Successive Linearization Algorithm
The vector w is the result
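The slides do not reproduce the mathematical program itself. As a rough sketch of the trade-off that a parametric objective of this kind expresses, the following Python evaluates, for a candidate plane x.w = gamma, a weighted sum of the average separation error on A and B and the number of nonzero components of w. The function names, the unit margin, the weight `lam`, and the tolerance `tol` are assumptions made for illustration. This is not the Successive Linearization Algorithm, only the kind of quantity such an algorithm would try to minimize.

```python
def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(xi * yi for xi, yi in zip(x, y))


def selected_features(w, tol=1e-8):
    """Indices of the features the plane actually uses (nonzero entries of w)."""
    return [j for j, wj in enumerate(w) if abs(wj) > tol]


def parametric_objective(A, B, w, gamma, lam, tol=1e-8):
    """(1 - lam) * average separation error + lam * number of features used.

    A, B  : the two point sets, each a list of n-dimensional points
    w     : normal vector of the candidate separating plane x.w = gamma
    lam   : weight in [0, 1]; larger values favour fewer features
    All names and the unit margin are illustrative assumptions.
    """
    # Assumed convention: points of A should satisfy x.w >= gamma + 1 and
    # points of B should satisfy x.w <= gamma - 1; any shortfall is error.
    err_a = sum(max(0.0, gamma + 1.0 - dot(x, w)) for x in A) / len(A)
    err_b = sum(max(0.0, dot(x, w) - gamma + 1.0) for x in B) / len(B)
    n_used = len(selected_features(w, tol))
    return (1.0 - lam) * (err_a + err_b) + lam * n_used


if __name__ == "__main__":
    A = [[3.0, 0.5], [4.0, -1.0]]
    B = [[0.0, 2.0], [-1.0, 0.0]]
    # Both planes separate A from B; the first one uses a single feature.
    print(parametric_objective(A, B, [1.0, 0.0], 1.5, lam=0.05))
    print(parametric_objective(A, B, [1.0, 0.1], 1.5, lam=0.05))
```

Under these assumptions, two planes that separate the data equally well are ranked by how many features they use, which is the behaviour the feature selection problem asks for.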
Experimentation
32-feature Wisconsin Prognostic Breast Cancer (WPBC)
N = 32, m = 28, k = 118, r = 0.05; 4 features selected, increasing tenfold cross-validation correctness by 35.4%
Clustering
Determine k cluster centers such that the sum of the 1-norm distances of each point in a given database to its nearest cluster center is minimized
Formulated as minimizing the product of two linear functions on a set defined by linear inequalities
K-Median Algorithm
Need to solve
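The subproblem that the slide says must be solved is not preserved in this transcript. One common way to attack the 1-norm clustering objective is an alternating scheme: assign every point to its nearest center in the 1-norm, then move each center to the coordinate-wise median of its assigned points, which minimizes the sum of 1-norm distances within that cluster. The sketch below follows that scheme. Whether it matches the slide's K-Median Algorithm step for step is an assumption, and the function names and the fixed iteration cap are illustrative.

```python
from statistics import median


def one_norm(x, c):
    """1-norm distance between point x and center c."""
    return sum(abs(xi - ci) for xi, ci in zip(x, c))


def nearest(p, centers):
    """Index of the center closest to p in the 1-norm."""
    return min(range(len(centers)), key=lambda i: one_norm(p, centers[i]))


def k_median(points, init_centers, max_iter=100):
    """Alternate nearest-center assignment and coordinate-wise median update."""
    centers = [list(c) for c in init_centers]
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j, c in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous center.
            new_centers.append(
                [median(col) for col in zip(*members)] if members else c
            )
        if new_centers == centers:  # centers stopped moving
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels


if __name__ == "__main__":
    pts = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
    centers, labels = k_median(pts, [[1, 1], [9, 9]])
    print(centers, labels)
```

The result depends on the initial centers, so in practice the procedure is usually run from several starting points.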
Experimentation
Used as a KDD tool to mine WPBC to discover medical knowledge
The key observation is that the curves are well separated
Experimentation
Robust Representation
The model remains valid under a class of data perturbations
Use a τ-tolerance zone wherein errors are disregarded
Gives better generalization results than the conventional zero-tolerance approach
Robust Representation
A is an m×n matrix, a is an m×1 vector
x is a vector to be “learned”
Find x that minimizes the error Ax - a
Robust Representation
[Figure: the system Ax = a, shown with and without τ-tolerance]
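To make the idea of a tolerance zone concrete, the following Python sketch measures the error of a candidate x while disregarding any component of Ax - a whose magnitude is at most τ. Using the 1-norm of the remaining excess is an assumption, as are the function name and the argument name `tau`; the slides do not state which norm is minimized.

```python
def tolerant_error(A, x, a, tau):
    """Sum over rows of max(|A_i x - a_i| - tau, 0).

    A   : m rows, each a list of n coefficients
    x   : candidate vector of length n
    a   : right-hand side of length m
    tau : tolerance; residuals of magnitude <= tau contribute nothing
    """
    total = 0.0
    for row, ai in zip(A, a):
        residual = abs(sum(rij * xj for rij, xj in zip(row, x)) - ai)
        total += max(0.0, residual - tau)
    return total


if __name__ == "__main__":
    A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    a = [1.05, 2.0, 3.5]
    x = [1.0, 2.0]
    print(tolerant_error(A, x, a, tau=0.0))  # conventional zero tolerance
    print(tolerant_error(A, x, a, tau=0.1))  # small residuals are ignored
```

With `tau = 0` this reduces to the ordinary 1-norm residual, so the conventional zero-tolerance case is the special case the slides compare against.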
Conclusion
Mathematical programming codes are reliable and robust
The problems solved demonstrate that mathematical programming is a versatile and effective tool for solving important problems in data mining and knowledge discovery in databases
Opinion
A mathematical description can explain complex problems and convince others, but … you must understand it first