COP5992 – DATA MINING TERM PROJECT RANDOM SUBSPACE METHOD + CO-TRAINING by SELIM KALAYCI.
COP5992 – DATA MINING TERM PROJECT
RANDOM SUBSPACE METHOD + CO-TRAINING
by
SELIM KALAYCI
RANDOM SUBSPACE METHOD (RSM)
Proposed by Ho: "The Random Subspace Method for Constructing Decision Forests", 1998
Another technique for combining weak classifiers, like Bagging and Boosting.
RSM ALGORITHM
1. Repeat for b = 1, 2, ..., B:
(a) Select an r-dimensional random subspace X̃b from the original p-dimensional feature space X.
(b) Construct a classifier Cb(x) using only the features in X̃b.
2. Combine the classifiers Cb(x), b = 1, 2, ..., B, by simple majority voting into a final decision rule.
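The two steps above can be sketched in pure Python. This is a minimal illustration, not Ho's implementation: the one-feature decision stump used as the weak base classifier, the toy data, and all names are assumptions made for the example.

```python
import random
from collections import Counter

def train_stump(X, y, features):
    # Weak base classifier: pick the single feature/threshold/sign
    # combination (within the given subspace) with lowest training error.
    best = None
    for f in features:
        for t in sorted(set(row[f] for row in X)):
            for sign in (1, -1):
                pred = [1 if sign * (row[f] - t) >= 0 else 0 for row in X]
                err = sum(p != yi for p, yi in zip(pred, y))
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    _, f, t, sign = best
    return lambda row: 1 if sign * (row[f] - t) >= 0 else 0

def rsm_fit(X, y, B=11, r=1, seed=0):
    # Step 1: train B classifiers, each on an r-dimensional
    # random subspace of the p original features.
    rng = random.Random(seed)
    p = len(X[0])
    return [train_stump(X, y, rng.sample(range(p), r)) for _ in range(B)]

def rsm_predict(ensemble, row):
    # Step 2: combine the B classifiers by simple majority vote.
    votes = Counter(c(row) for c in ensemble)
    return votes.most_common(1)[0][0]
```

On a toy two-class set such as `X = [[-2, -2], [-1, -1], [1, 1], [2, 2]]`, `y = [0, 0, 1, 1]`, an ensemble from `rsm_fit(X, y, B=5, r=1)` votes the expected class for new points on either side.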
MOTIVATION FOR RSM
Redundancy in the data feature space:
Completely redundant feature set
Redundancy spread over many features
Weak classifiers that have critical training sample sizes
RSM PERFORMANCE ISSUES
RSM performance depends on:
Training sample size
The choice of base classifier
The choice of combining rule (simple majority vs. weighted)
The degree of redundancy of the dataset
The number of features chosen
DECISION FORESTS (by Ho)
A combination of trees instead of a single tree
Assumption: the dataset has some redundant features
Works efficiently with any decision tree algorithm and data-splitting method
Ideally, look for the best individual trees with the lowest tree similarity
UNLABELED DATA
Small number of labeled documents
Large pool of unlabeled documents
How can we classify unlabeled documents accurately?
EXPECTATION-MAXIMIZATION (E-M)
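This slide carries only a figure. As background: E-M treats the unlabeled examples' class labels as hidden variables. The E-step soft-labels the unlabeled pool under the current model, and the M-step re-estimates the model from the labeled data plus those soft labels. Below is a minimal sketch for two 1-D unit-variance Gaussian classes; the whole setup (fixed variance, equal priors, all names) is an illustrative assumption, not taken from the slides.

```python
import math

def em_semisup(labeled, unlabeled, n_iter=20):
    # labeled: list of (x, class) pairs; unlabeled: list of x values.
    # Initialize each class mean from the labeled examples only.
    mu = []
    for c in (0, 1):
        xs = [x for x, y in labeled if y == c]
        mu.append(sum(xs) / len(xs))
    for _ in range(n_iter):
        # E-step: responsibility P(class 1 | x) for each unlabeled point,
        # under unit-variance Gaussians with equal class priors.
        resp = []
        for x in unlabeled:
            w0 = math.exp(-((x - mu[0]) ** 2) / 2)
            w1 = math.exp(-((x - mu[1]) ** 2) / 2)
            resp.append(w1 / (w0 + w1))
        # M-step: re-estimate each class mean from hard (labeled)
        # plus soft (unlabeled) assignments combined.
        new_mu = []
        for c in (0, 1):
            w = [r if c == 1 else 1 - r for r in resp]
            num = sum(x for x, y in labeled if y == c) \
                + sum(wi * x for wi, x in zip(w, unlabeled))
            den = sum(1 for _, y in labeled if y == c) + sum(w)
            new_mu.append(num / den)
        mu = new_mu
    return mu
```

With one labeled example per class and a handful of unlabeled points, the re-estimated means move toward the centers of both clusters, which is exactly how the unlabeled pool helps.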
CO-TRAINING
Blum and Mitchell, "Combining Labeled and Unlabeled Data with Co-Training", 1998.
Requirements:
Two sufficiently strong feature sets
Conditionally independent (given the class label)
APPLICATION OF CO-TRAINING TO A SINGLE FEATURE SET
Algorithm:
Obtain a small set L of labeled examples
Obtain a large set U of unlabeled examples
Obtain two sets F1 and F2 of features that are sufficiently redundant
While U is not empty do:
  Learn classifier C1 from L based on F1
  Learn classifier C2 from L based on F2
  For each classifier Ci do:
    Ci labels examples from U based on Fi
    Ci chooses the most confidently predicted examples E from U
    E is removed from U and added (with their given labels) to L
End loop
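The loop above can be sketched in pure Python. This is a hedged illustration only: the nearest-centroid classifier per view, the distance-margin confidence score, and all names are assumptions made for the example, not the method from the slides.

```python
import math

def centroid_fit(X, y, feats):
    # Per-view base classifier: one centroid per class,
    # restricted to the features of this view.
    cent = {}
    for c in sorted(set(y)):
        rows = [[row[f] for f in feats] for row, yi in zip(X, y) if yi == c]
        cent[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return cent

def centroid_predict(cent, row, feats):
    # Return (label, confidence); confidence is the margin between
    # the nearest and second-nearest centroid distances.
    v = [row[f] for f in feats]
    d = sorted((math.dist(v, c), lab) for lab, c in cent.items())
    conf = d[1][0] - d[0][0] if len(d) > 1 else 0.0
    return d[0][1], conf

def co_train(L_X, L_y, U, F1, F2, per_round=2, rounds=10):
    # Each view's classifier labels the pool U and promotes its most
    # confident examples (with their predicted labels) into L.
    L_X, L_y, U = list(L_X), list(L_y), list(U)
    for _ in range(rounds):
        for feats in (F1, F2):
            if not U:
                return L_X, L_y
            cent = centroid_fit(L_X, L_y, feats)
            preds = [centroid_predict(cent, row, feats) for row in U]
            order = sorted(range(len(U)), key=lambda i: -preds[i][1])
            # Pop highest indices first so remaining indices stay valid.
            for i in sorted(order[:per_round], reverse=True):
                L_X.append(U.pop(i))
                L_y.append(preds[i][0])
    return L_X, L_y
```

On a toy set whose two features form redundant views (`F1 = [0]`, `F2 = [1]`), the loop empties U and every promoted example ends up with the label its sign implies.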
THINGS TO DO
How can we measure redundancy and use it efficiently?
Can we improve Co-training?
How can we apply RSM efficiently to:
Supervised learning
Semi-supervised learning
Unsupervised learning
QUESTIONS?