Analysis of Bootstrapping Algorithms Seminar of Machine Learning for Text Mining UPC, 18/11/2004
Analysis of Bootstrapping Algorithms
Seminar of Machine Learning for Text Mining
UPC, 18/11/2004
Mihai Surdeanu
Goals
Introduce Steven Abney's "Understanding the Yarowsky Algorithm" (Computational Linguistics 30(3), 2004) paper
What are the bootstrapping algorithms covered and their properties?
Will skip theorem proofs
What do they mean in the context of document clustering and pattern acquisition?
How do they compare with other iterative refinement clustering algorithms and with Yangarber 2003?
Notations
WSD: x = word, j = word sense, f = word/context feature
Clustering: x = document, j = category/domain, f = document feature (word, pattern)
Generic Yarowsky Algorithm (Y-0)
Needs a base learner
Changes the labeling only if the prediction is larger than an arbitrary threshold
Does not change labels of seeds
Nothing formal can be shown about Y-0.
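The loop described by the callouts above can be sketched roughly as follows. This is a minimal illustration and not the figure from the slide: the `train` interface, the `toy_train` learner, and the example documents are all hypothetical, and `None` stands for an unlabeled example.

```python
def yarowsky_y0(examples, seeds, train, threshold, max_iter=100):
    """Generic bootstrapping loop; None marks an unlabeled example."""
    labels = list(seeds)
    for _ in range(max_iter):
        # Train the base learner on the currently labeled examples.
        predict = train(examples, labels)
        new_labels = []
        for x, seed in zip(examples, seeds):
            if seed is not None:          # seed labels are never changed
                new_labels.append(seed)
                continue
            pi = predict(x)
            best = max(pi, key=pi.get)
            # Label only when the prediction clears the arbitrary threshold.
            new_labels.append(best if pi[best] > threshold else None)
        if new_labels == labels:          # labeling is stable: stop
            break
        labels = new_labels
    return labels


def toy_train(examples, labels):
    """Hypothetical base learner: smoothed feature/label co-occurrence counts."""
    counts = {}
    for x, y in zip(examples, labels):
        if y is None:
            continue
        for f in x:
            by_label = counts.setdefault(f, {})
            by_label[y] = by_label.get(y, 0) + 1
    label_set = sorted({y for y in labels if y is not None})

    def predict(x):
        scores = {j: 1e-3 for j in label_set}   # tiny smoothing constant
        for f in x:
            for j, c in counts.get(f, {}).items():
                scores[j] += c
        total = sum(scores.values())
        return {j: s / total for j, s in scores.items()}

    return predict


docs = [{"plant", "factory"}, {"plant", "leaf"}, {"factory", "worker"},
        {"leaf", "tree"}, {"worker", "union"}]
seeds = ["industry", "botany", None, None, None]
result = yarowsky_y0(docs, seeds, toy_train, threshold=0.6)
```

In this toy run the last document shares no feature with the seeds, so it can only be labeled in the second pass, after "worker" has been picked up from a newly labeled document.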
Modified Algorithm (Y-1)
A labeled example cannot become unlabeled again.
Fixed threshold
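A sketch of the relabeling rule these two callouts describe. The slide does not give the value of the fixed threshold; the code assumes it is 1/L for L labels, which should be checked against the paper. The function name and arguments are hypothetical.

```python
def y1_relabel(old_label, pi, num_labels):
    """Return the new label for one example (None means unlabeled).

    old_label  -- label from the previous iteration, or None
    pi         -- dict mapping each label to its predicted probability
    num_labels -- L; the threshold 1/L is an assumption, not from the slide
    """
    best = max(pi, key=pi.get)
    # Once labeled, an example stays labeled (its label may still change);
    # an unlabeled example is labeled only if it clears the fixed threshold.
    if old_label is not None or pi[best] > 1.0 / num_labels:
        return best
    return None
```

With two labels, an unlabeled example whose prediction is exactly uniform stays unlabeled, while an already labeled example always keeps some label.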
Properties of Y-1
If the base learner reduces the divergence on the labeled (or all) examples, algorithm Y-1 decreases H (cross entropy, equation (6)) at each iteration until it reaches a critical point of H.
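As a rough illustration of the quantity H, the sketch below computes a cross entropy between the current labeling and the classifier's predictions. It assumes (this is not stated on the slide) that a labeled example puts all its mass on its label and that an unlabeled example is treated as uniform over the L labels; equation (6) of the paper should be consulted for the exact form. All names are hypothetical.

```python
import math


def cross_entropy_H(labels, predictions, label_set):
    """Sum over examples of the cross entropy between labeling and prediction.

    labels      -- one label per example, None if unlabeled
    predictions -- one dict per example: label -> predicted probability
    """
    L = len(label_set)
    total = 0.0
    for y, pi in zip(labels, predictions):
        for j in label_set:
            # Assumed labeling distribution: point mass if labeled, uniform if not.
            phi = (1.0 if j == y else 0.0) if y is not None else 1.0 / L
            if phi > 0.0:
                total -= phi * math.log(pi[j])
    return total


uniform = {"a": 0.5, "b": 0.5}
confident = {"a": 0.9, "b": 0.1}
h_before = cross_entropy_H(["a", None], [uniform, uniform], ["a", "b"])
h_after = cross_entropy_H(["a", None], [confident, uniform], ["a", "b"])
```

A sharper correct prediction on the labeled example lowers the value, which is the kind of decrease the property above refers to.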
The Original Decision List Induction Algorithm (DL-0)
Smooth the precision with an arbitrary value
Pick the label given by the rule with the best score
The resulting score is NOT a probability distribution!
Nothing formal can be shown about DL-0.
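A minimal sketch of a decision-list learner of this kind. The smoothing form `(count + eps) / (total + L * eps)` and the value of `eps` are assumptions for illustration; the slide only says that precision is smoothed with an arbitrary value.

```python
from collections import Counter


def dl0_train(examples, labels, label_set, eps=0.1):
    """Smoothed precision of every (feature, label) rule over labeled examples."""
    n_f, n_fj = Counter(), Counter()
    for x, y in zip(examples, labels):
        if y is None:
            continue
        for f in set(x):
            n_f[f] += 1
            n_fj[f, y] += 1
    L = len(label_set)
    return {(f, j): (n_fj[f, j] + eps) / (n_f[f] + L * eps)
            for f in n_f for j in label_set}


def dl0_predict(theta, x, label_set):
    """Score each label by its single strongest rule and pick the best label."""
    scores = {j: max((theta[f, j] for f in set(x) if (f, j) in theta), default=0.0)
              for j in label_set}
    return max(scores, key=scores.get), scores


theta = dl0_train([["a", "b"], ["a", "c"]], ["x", "y"], ["x", "y"])
best, scores = dl0_predict(theta, ["a", "b"], ["x", "y"])
score_sum = sum(scores.values())
```

Because each label takes the maximum over a different rule, the per-label scores generally do not sum to 1, which is the point made on the slide.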
The EM-based Decision List Algorithm (DL-EM)
A mixture of the feature scores θ_fj is used to compute the prediction π_x (see above). Because each θ_f is a probability distribution over labels, π_x is also a probability distribution.
Whereas in DL-0 the prediction is given by the “strongest” feature, here the algorithm permits a block of “weaker” features to outweigh the strongest feature.
DL-EM does not construct a classifier from scratch (like DL-0), but rather builds upon the previous classifier (θ_fj^old and π_x^old).
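The contrast between the "strongest feature" rule and the mixture can be illustrated with a small sketch. The equal-weight average over the features of x and the toy scores are assumptions made for the example.

```python
def max_rule(theta, x, label_set):
    """DL-0 style: each label is scored by its single strongest feature."""
    return {j: max(theta[f][j] for f in x) for j in label_set}


def mixture_rule(theta, x, label_set):
    """DL-EM style: each label is scored by the average over all features of x."""
    return {j: sum(theta[f][j] for f in x) / len(x) for j in label_set}


# Hypothetical feature scores: one strong feature for A, two weaker ones for B.
theta = {
    "strong": {"A": 0.8, "B": 0.2},
    "weak1": {"A": 0.3, "B": 0.7},
    "weak2": {"A": 0.3, "B": 0.7},
}
x = ["strong", "weak1", "weak2"]
by_max = max_rule(theta, x, ["A", "B"])
by_mix = mixture_rule(theta, x, ["A", "B"])
```

Here the max rule prefers A because of the single strong feature, while the mixture prefers B because the two weaker features together outweigh it; the mixture scores also sum to 1.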
The EM-based Decision List Algorithm (DL-EM)
Probability that feature f was responsible for label j for object x
Normalization over all features
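The two callouts above describe the usual E-step quantity of a mixture model. A sketch under the assumption of equal mixing weights, so that the weights cancel and only the old feature scores remain (the function name and toy scores are hypothetical):

```python
def responsibilities(theta_old, x, j):
    """Share of label j on example x attributed to each feature of x."""
    # Normalization over all features of x.
    denom = sum(theta_old[g][j] for g in x)
    return {f: theta_old[f][j] / denom for f in x}


theta_old = {
    "strong": {"A": 0.8, "B": 0.2},
    "weak1": {"A": 0.3, "B": 0.7},
    "weak2": {"A": 0.3, "B": 0.7},
}
r = responsibilities(theta_old, ["strong", "weak1", "weak2"], "B")
```

For label B the strong feature gets 0.2 / 1.6 of the responsibility and each weak feature gets 0.7 / 1.6.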
Algorithm DL-EM-Λ
A similar algorithm exists when the feature score is computed over all examples, not just the labeled ones: DL-EM-V.
What are the θ^(0) parameters???
Properties of DL-EM-*
Y-1/DL-EM-Λ and Y-1/DL-EM-V decrease H at each iteration until they reach a critical point of H (local minimum).
Algorithm DL-1-R
“Raw” precision
Mixture of feature scores
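A minimal sketch of a base learner built from these two ingredients: unsmoothed precision per feature, and an equal-weight mixture of feature scores for prediction. The uniform fallback for examples with no known feature is an assumption added so the function is total.

```python
from collections import Counter


def raw_precision_scores(examples, labels, label_set):
    """theta[f][j] = share of labeled examples containing f that have label j."""
    n_f, n_fj = Counter(), Counter()
    for x, y in zip(examples, labels):
        if y is None:
            continue
        for f in set(x):
            n_f[f] += 1
            n_fj[f, y] += 1
    return {f: {j: n_fj[f, j] / n_f[f] for j in label_set} for f in n_f}


def mixture_predict(theta, x, label_set):
    """Average the scores of the features of x that have been seen in training."""
    known = [f for f in set(x) if f in theta]
    if not known:  # assumption: fall back to a uniform prediction
        return {j: 1.0 / len(label_set) for j in label_set}
    return {j: sum(theta[f][j] for f in known) / len(known) for j in label_set}


theta = raw_precision_scores([["a", "b"], ["a", "c"], ["b", "d"]],
                             ["x", "y", None], ["x", "y"])
pred = mixture_predict(theta, ["b", "d"], ["x", "y"])
```

Without smoothing, a feature seen with only one label gets precision 1.0 for it and 0.0 for the others.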
Algorithm DL-1-VS
Precision with variable smoothing for each feature
Mixture of feature scores
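One way to read "variable smoothing" is that each feature is smoothed toward the uniform distribution in proportion to how many unlabeled examples contain it. The sketch below assumes that form; it is an illustration to be checked against the paper, not a transcription of the slide's formula.

```python
from collections import Counter


def variable_smoothing_scores(examples, labels, label_set):
    """Precision per feature, smoothed by the count of unlabeled examples with f."""
    L = len(label_set)
    lab_f, lab_fj, unl_f = Counter(), Counter(), Counter()
    for x, y in zip(examples, labels):
        for f in set(x):
            if y is None:
                unl_f[f] += 1          # unlabeled example containing f
            else:
                lab_f[f] += 1          # labeled example containing f
                lab_fj[f, y] += 1
    features = set(lab_f) | set(unl_f)
    return {f: {j: (lab_fj[f, j] + unl_f[f] / L) / (lab_f[f] + unl_f[f])
                for j in label_set}
            for f in features}


theta = variable_smoothing_scores(
    [["a"], ["a"], ["a"], ["a"], ["b"]],
    ["x", "x", None, None, None],
    ["x", "y"],
)
```

A feature that occurs only in unlabeled examples ends up with a uniform score, and the smoothing shrinks as more of its examples become labeled.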
Properties of DL-1-*
Y-1/DL-1-R minimizes K (an upper bound on H) over labeled examples
Y-1/DL-1-VS minimizes K over all examples X
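One way to see why K bounds H from above is Jensen's inequality applied to the concave logarithm. The derivation below is a sketch under assumptions about notation that is not visible in this transcript: φ_x is the labeling distribution of example x, F_x is its feature set with m = |F_x|, the prediction is the equal-weight mixture π_x(j) = (1/m) Σ θ_gj, and K is taken to be the right-hand side.

```latex
\begin{aligned}
H &= -\sum_{x}\sum_{j}\phi_x(j)\,\log \pi_x(j)
   = -\sum_{x}\sum_{j}\phi_x(j)\,\log\Big(\frac{1}{m}\sum_{g\in F_x}\theta_{gj}\Big) \\
  &\le -\sum_{x}\sum_{j}\phi_x(j)\,\frac{1}{m}\sum_{g\in F_x}\log\theta_{gj} \;=\; K
\end{aligned}
```

Under these assumptions, any step that lowers K also lowers a ceiling on H, even when H itself is not directly minimized.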
So far…
Y-0/DL-0: the original Yarowsky algorithm. Cannot be shown to minimize H or K.
Y-1/DL-EM-Λ and Y-1/DL-EM-V minimize H
Y-1/DL-1-R and Y-1/DL-1-VS minimize K
Sequential Algorithms
All previous algorithms do "parallel" updates, in the sense that the parameters {θ_fj} are all recomputed at every iteration.
Sequential algorithms: one feature is selected at each iteration: S_(t+1) = S_t ∪ {f_t}
Only the score of the selected feature and the scores of the documents containing the chosen feature are recomputed.
More flexible: shown to converge for more base learners.
Algorithm YS
Choose a feature that:
(1) is not a seed
(2) is seen in training
(3) has a changed score
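The three conditions above can be sketched as a candidate filter. The data layout (dicts of per-label scores, a count of labeled examples per feature) and the tolerance are hypothetical; how the final feature is picked among the candidates depends on the base learner and is not shown.

```python
def ys_candidates(theta_new, theta_old, seed_features, labeled_counts, tol=1e-9):
    """Features eligible for selection in one sequential step."""
    eligible = []
    for f, scores in theta_new.items():
        if f in seed_features:                 # (1) is not a seed
            continue
        if labeled_counts.get(f, 0) == 0:      # (2) is seen in training
            continue
        old = theta_old.get(f, {})
        changed = any(abs(s - old.get(j, 0.0)) > tol for j, s in scores.items())
        if changed:                            # (3) has a changed score
            eligible.append(f)
    return sorted(eligible)


theta_new = {"seed": {"x": 1.0}, "unseen": {"x": 0.9},
             "same": {"x": 0.5}, "moved": {"x": 0.8}}
theta_old = {"seed": {"x": 0.5}, "unseen": {"x": 0.1},
             "same": {"x": 0.5}, "moved": {"x": 0.5}}
candidates = ys_candidates(theta_new, theta_old, {"seed"},
                           {"seed": 3, "same": 2, "moved": 2})
```

Only the feature that is not a seed, occurs in labeled data, and has a changed score survives the filter.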
Base Learners for YS
Biased towards the feature that maximizes raw precision = anti-smoothing
Properties of YS-*
YS-P and YS-R reduce K in every iteration.
YS-FS reduces K in every iteration for new features.
Yarowsky versus Co-training
Co-training attempts to maximize agreement on unlabeled data between classifiers trained on different "views" of the data.
The modified Yarowsky algorithms introduced in this paper reduce an upper bound on the cross entropy (H), similarly to co-training.
Co-training assumes at least two independent views of the data. Hence it is more restricted.
YS versus Yangarber (1)
NOT a probability distribution
set = 1, else = 0
Recompute
YS versus Yangarber (2)
Yangarber does not require the computation of Y, as its goal is to learn patterns (features) relevant for each label (category).
A plus for Yangarber, as Y_x = ŷ is a VERY strong statement in document classification: it classifies a document based on the limited information available in this iteration.
Y can be computed as a side effect when the algorithm completes. This is used as an indirect evaluation.
YS versus Yangarber (3)
The base learner for Yangarber generates scores that are NOT probability distributions! Hard to analyze the algorithm formally!
θ_fj = raw_precision(f, j) * log(number of documents that contain f)
???This part similar to YS-R
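A small sketch of a score of this shape shows why it is not a probability distribution over labels. It assumes the logarithm is taken over the total number of documents containing the feature; the slide leaves open which document set is counted, and the function name is hypothetical.

```python
import math


def yangarber_like_score(n_fj, n_f):
    """raw precision of feature f for label j, times log of f's document count."""
    if n_f == 0:
        return 0.0
    return (n_fj / n_f) * math.log(n_f)


# A feature seen in 4 documents: 3 with label A, 1 with label B.
scores = {"A": yangarber_like_score(3, 4), "B": yangarber_like_score(1, 4)}
total = sum(scores.values())
```

The raw precisions sum to 1 over the labels, so the scores sum to log(4) rather than 1, and a feature seen in a single document scores 0 regardless of its precision.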
Bootstrapping versus K-Means and EM
K-Means and Bootstrapping “hard” classify objects in each iteration: Yx = ŷ. EM (and Yangarber) compute Y only in the last iteration.
I think K-Means and EM converge more rapidly because they accumulate features faster than bootstrapping.
In K-Means basically after the first iteration all features are in use.
In FS (and Yangarber) only one feature (or a very small number of features) is selected in every iteration.
Conclusions
Abney's simple modifications of the Yarowsky bootstrapping algorithm can be formally shown to converge to a local minimum (like EM).
Based on this work, Yangarber (and Riloff) are far from the formalization required to show that they converge.
Is there a better algorithm for pattern learning?