Transcript of "Classifying with limited training data: Active and semi-supervised learning" by Sunita Sarawagi (32 slides).

Page 1

Classifying with limited training data: Active and semi-supervised learning

Sunita Sarawagi

sunita@it.iitb.ac.in

http://www.it.iitb.ac.in/~sunita

Page 2

Motivation

- Several learning methods depend critically on the quality of labeled training data
- Labeled data is often expensive to collect, while unlabeled data is abundant
- Two techniques to reduce labeling effort:
  - Active learning: iteratively select small sets of unlabeled data to be labeled by a human
  - Semi-supervised learning: use unlabeled data to train the classifier

Page 3

Outline

- Active learning
  - Definition, applications, algorithms
  - Case studies: duplicate elimination, information extraction
- Semi-supervised learning
  - Definition, some methods

Page 4

Application areas

- Text classification
- Duplicate elimination
- Information extraction: HTML wrappers, free text
- Speech recognition: reducing the need for transcribed data
- Semantic parsing of natural language: reducing the need for complex annotated data

Page 5

Example: active learning

Assume: points from two classes (red and green) on a real line, perfectly separable by a single-point separator.

[Figure: labeled points pin down a "sure red" region and a "sure green" region; between them lies the region of uncertainty, containing the unlabeled points.]

Goal: query the point that gives the greatest expected reduction in the size of the uncertainty region.
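On this one-dimensional example the optimal strategy amounts to binary search: each label queried at the midpoint of the uncertainty region halves it. A minimal sketch, with a hypothetical `oracle` function standing in for the human labeler:

```python
def active_learn_threshold(oracle, lo=0.0, hi=1.0, n_queries=10):
    # Query the midpoint of the current uncertainty region each round;
    # every answer halves the region (vs. a linear shrink for random labels).
    for _ in range(n_queries):
        mid = (lo + hi) / 2.0
        if oracle(mid):          # True: the point is red (right of the separator)
            hi = mid
        else:
            lo = mid
    return lo, hi

# hypothetical hidden separator at 0.37; the oracle plays the human labeler
lo, hi = active_learn_threshold(lambda x: x >= 0.37, n_queries=20)
```

After 20 queries the uncertainty region has width 2^-20, whereas 20 randomly labeled points would shrink it only linearly on average.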

Page 6

Active learning

Explicit measure:
- For each unlabeled instance and each class label: add it to the training data, train the classifier, and measure the classifier confusion
- Compute the expected confusion
- Choose the instance that yields the lowest expected confusion

Implicit measure:
- Train the classifier
- For each unlabeled instance, measure the prediction uncertainty
- Choose the instance with the highest uncertainty

Page 7

Measuring prediction certainty

Classifier-specific methods:
- Support vector machines: distance from the separator
- Naïve Bayes classifier: posterior probability of the winning class
- Decision tree classifier: weighted sum of distance from different boundaries, error of the leaf, depth of the leaf, etc.

Committee-based approach (Seung, Opper, and Sompolinsky 1992):
- Disagreement amongst the members of a committee
- The most successfully used method
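The classifier-specific scores above can be sketched as plain functions; the helper names and numbers below are illustrative, not from the talk:

```python
def margin_certainty(distance_from_separator):
    # SVM-style score: farther from the separating boundary means more certain
    return abs(distance_from_separator)

def posterior_certainty(p_winning_class):
    # naive-Bayes-style score: posterior probability of the winning class
    return p_winning_class

def pick_most_uncertain(instances, certainty):
    # active learning queries the instance the classifier is least sure about
    return min(instances, key=certainty)
```

For example, `pick_most_uncertain([0.9, 0.55, 0.99], posterior_certainty)` returns 0.55, the prediction closest to a coin flip.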

Page 8

Forming a classifier committee

Randomly perturb the learnt parameters:
- Probabilistic classifiers: sample from the posterior distribution on parameters given the training data. Example: a binomial parameter p has a beta posterior whose mean is the estimated p
- Discriminative classifiers: pick a random boundary within the uncertainty region
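The beta-posterior idea can be sketched for a single binomial parameter. The uniform Beta(1, 1) prior and the helper name `sample_committee` are assumptions for illustration:

```python
import random

def sample_committee(k, n, size=5, seed=0):
    # Posterior of a binomial parameter after k successes in n trials,
    # under an assumed uniform Beta(1, 1) prior, is Beta(k + 1, n - k + 1);
    # each committee member draws its own parameter from this posterior.
    rng = random.Random(seed)
    a, b = k + 1, n - k + 1
    return [rng.betavariate(a, b) for _ in range(size)]

committee = sample_committee(k=7, n=10)
```

Each draw is a plausible parameter value given the training data, so the committee's predictions disagree most where the data pin the parameter down least.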

Page 9

Committee-based algorithm

- Train k classifiers C1, C2, ..., Ck on the training data
- For each unlabeled instance x:
  - Find the predictions y1, ..., yk of the k classifiers
  - Compute the uncertainty U(x) as the entropy of these y-s
- Pick the instance with the highest uncertainty

Sampling for representativeness: with U(x) as the weight, do weighted sampling to select an instance for labeling.
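The algorithm above, as a sketch: `vote_entropy` implements U(x) as the entropy of the committee's votes, and `select_query` picks the highest-entropy instance (the threshold classifiers below are toy stand-ins for trained committee members):

```python
import math
from collections import Counter

def vote_entropy(votes):
    # U(x): entropy (in bits) of the committee's predicted labels for x
    n = len(votes)
    return -sum((c / n) * math.log2(c / n) for c in Counter(votes).values())

def select_query(unlabeled, committee):
    # committee: list of classifiers, each a callable x -> predicted label
    return max(unlabeled, key=lambda x: vote_entropy([clf(x) for clf in committee]))

# three toy threshold classifiers that disagree only near their thresholds
committee = [lambda x, t=t: x >= t for t in (0.3, 0.5, 0.7)]
query = select_query([0.1, 0.6, 0.9], committee)   # picks the contested 0.6
```

Points where all members agree get entropy 0 and are never queried; the contested point maximizes the vote entropy.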

Page 10

Case study: duplicate elimination

Given a list of semi-structured records, find all records that refer to the same entity.

Example applications:
- Data warehousing, merging name/address lists. Entity: (a) person, (b) household
- Automatic citation databases (Citeseer), references. Entity: paper

Challenges:
- Errors and inconsistencies in large datasets
- The problem is domain-specific

Page 11

Motivating example: citations

Our prior: duplicate when author, title, booktitle and year match.

Author match could be hard:
- L. Breiman, L. Friedman, and P. Stone, (1984).
- Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone.

Conference match could be harder:
- In VLDB-94
- In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.

Page 12

Fields may not be segmented, and word overlap could be misleading.

Duplicates with little overlap even in the title:
- Johnson Laird, Philip N. (1983). Mental models. Cambridge, Mass.: Harvard University Press.
- P. N. Johnson-Laird. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Cambridge University Press, 1983.

Non-duplicates with lots of word overlap:
- H. Balakrishnan, S. Seshan, and R. H. Katz. Improving Reliable Transport and Handoff Performance in Cellular Wireless Networks. ACM Wireless Networks, 1(4), December 1995.
- H. Balakrishnan, S. Seshan, E. Amir, R. H. Katz. "Improving TCP/IP Performance over Wireless Networks." Proc. 1st ACM Conf. on Mobile Computing and Networking, November 1995.

Page 13

Experiences with the learning approach

- Too much manual search in preparing training data
- Hard to spot challenging and covering sets of duplicates in large lists
- Even harder to find close non-duplicates that will capture the nuances: one has to examine instances that are similar on one attribute but dissimilar on another

Active learning is a generalization of this!

Page 14

Learning to identify duplicates

[Figure: each candidate record pair is mapped through similarity functions f1, f2, ..., fn to a feature vector (e.g. 1.0 0.4 ... 0.2). Labeled example pairs (D = duplicate, N = non-duplicate) train a classifier; pairs from the unlabeled list get feature vectors with unknown labels, from which the active learner selects instances such as (0.7 0.1 ... 0.6) and (0.3 0.4 ... 0.4) for labeling.]
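The feature construction in the figure can be sketched as follows; the Jaccard and exact-match similarity functions here are illustrative choices, not the 20 functions used in the experiments:

```python
def jaccard(a, b):
    # word-overlap similarity in [0, 1]
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def exact(a, b):
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

def pair_features(rec1, rec2):
    # one similarity score per field: f1 (author), f2 (title), f3 (year)
    sims = {"author": jaccard, "title": jaccard, "year": exact}
    return [f(rec1[k], rec2[k]) for k, f in sims.items()]

r1 = {"author": "L. Breiman", "title": "Mental Models", "year": "1983"}
r2 = {"author": "Leo Breiman", "title": "Mental Models", "year": "1983"}
features = pair_features(r1, r2)
```

The classifier then sees only these per-pair vectors, never the raw records, which is what lets a generic learner handle domain-specific notions of "duplicate".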

Page 15

Forming a committee of trees

Selecting the split attribute:
- Normally: the attribute with the lowest entropy
- Perturbed: a random attribute within close range of the lowest

Selecting a split point:
- Normally: the midpoint of the range with the lowest entropy
- Perturbed: a random point anywhere in the range with the lowest entropy
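A sketch of the perturbed attribute selection above; the tolerance `tol` defining "close range of the lowest" is an assumed parameter:

```python
import random

def perturbed_split(candidates, entropy, tol=0.05, seed=None):
    # Instead of always taking the minimum-entropy candidate, each committee
    # tree picks uniformly among candidates within tol of the best entropy.
    rng = random.Random(seed)
    best = min(entropy(c) for c in candidates)
    close = [c for c in candidates if entropy(c) <= best + tol]
    return rng.choice(close)
```

With `tol=0`, this degenerates to the normal greedy choice; a larger tolerance makes the committee trees diverge more, and hence disagree on more instances.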

Page 16

Experimental analysis

- 250 references from Citeseer: 32,000 pairs, of which only 150 are duplicates
- Citeseer's script used to segment into author, title, year, page and rest
- 20 text and integer similarity functions
- Average of 20 runs
- Default classifier: decision tree
- Initial labeled set: just two pairs

Page 17

Methods of creating a committee

- Data partitioning: bad when data is limited
- Attribute partitioning: bad when data is sufficient
- Parameter perturbation: best overall

Page 18

Importance of randomization

It is important to randomize the selection for generative classifiers like naïve Bayes.

[Figure: learning-curve plots for the decision tree and naïve Bayes classifiers]

Page 19

Choosing the right classifier

- SVMs: good initially, but not effective at choosing instances
- Decision trees: best overall

Page 20

Benefits of active learning

- Active learning does much better than random selection: with only 100 actively selected instances, 97% accuracy versus only 30% for random selection
- Committee-based selection is close to optimal

Page 21

Analyzing selected instances

- Fraction of duplicates among the selected instances: 44%, starting from only 0.5% in the data
- Is the gain due to the increased fraction of duplicates? Replacing the non-duplicates in the selected set with random non-duplicates drops accuracy to only 40%!

Page 22

Case study: Information Extraction (IE)

The IE task: given E, a set of structured elements (the target schema), and S, an unstructured source, extract all instances of E from S.

Varying levels of difficulty, depending on the input and the kind of extracted patterns:
- Text segmentation: extraction by segmenting text
- HTML wrapper: extraction from formatted text
- Classical IE: extraction from free-format text

Page 23

IE by text segmentation

Source: a concatenation of structured elements, with limited reordering and some missing fields. Examples: addresses, bibliography records.

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
(segments: Author | Year | Title | Journal | Volume | Page)

36/307 Unnat Nagar (II) Goregaon (W) Bombay 400 079
(segments: House number | Area | City | Zip)

Page 24

IE with Hidden Markov Models

Probabilistic models for IE.

[Figure: an HMM whose states correspond to fields (Author, Title, Journal, Year), connected by transition probabilities (values such as 0.9, 0.8, 0.5, 0.2, 0.1 on the arrows), with each state carrying an emission probability table over tokens (for example, a journal state emitting tokens like "journal", "ACM", "IEEE", and a year state emitting the digit patterns dddd with probability 0.8 and dd with probability 0.2).]
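Extraction with such a model is standard Viterbi decoding: each token sequence is assigned its most likely state (field) sequence. A minimal sketch with toy probabilities in the spirit of the figure, not the talk's actual model:

```python
def viterbi(tokens, states, start, trans, emit):
    # delta[s]: probability of the best state path ending in s at this token
    delta = {s: start[s] * emit[s].get(tokens[0], 1e-6) for s in states}
    back = []
    for tok in tokens[1:]:
        prev, delta, back_t = delta, {}, {}
        for s in states:
            p_best, score = max(((p, prev[p] * trans[p].get(s, 1e-6)) for p in states),
                                key=lambda pair: pair[1])
            delta[s] = score * emit[s].get(tok, 1e-6)
            back_t[s] = p_best
        back.append(back_t)
    best = max(delta, key=delta.get)
    path = [best]
    for back_t in reversed(back):   # follow back-pointers to recover the path
        path.append(back_t[path[-1]])
    return list(reversed(path))

states = ["Author", "Year"]
start = {"Author": 0.9, "Year": 0.1}
trans = {"Author": {"Author": 0.1, "Year": 0.9},
         "Year": {"Author": 0.5, "Year": 0.5}}
emit = {"Author": {"Breiman": 0.9}, "Year": {"1984": 0.9}}
path = viterbi(["Breiman", "1984"], states, start, trans, emit)
```

The unseen-token floor of 1e-6 is an assumed smoothing choice; a real model would estimate smoothed emission tables from training data.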

Page 25

A model for Indian Addresses

Page 26

Active learning in IE with HMMs

Forming a committee of HMMs by random perturbation:
- Emission and transition probabilities are independent multinomial distributions
- The posterior distribution of a multinomial's parameters is a Dirichlet, with its mean set to the maximum-likelihood estimate

Results on part-of-speech tagging (Dagan 1999): 92.6% accuracy using active learning with 20,000 instances, as against 100,000 randomly selected instances.
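Sampling one committee member's multinomial parameters from a Dirichlet posterior can be sketched with normalized Gamma draws; the add-one (uniform) prior is an assumption:

```python
import random

def sample_multinomial(counts, seed=None):
    # Dirichlet(counts + 1) sample via the standard Gamma construction:
    # draw g_i ~ Gamma(c_i + 1, 1) independently and normalize.
    rng = random.Random(seed)
    draws = [rng.gammavariate(c + 1, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]
```

Applying this to every transition and emission row (they are independent multinomials) yields one perturbed HMM per committee member.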

Page 27

Semi-supervised learning

Unlabeled data can improve classifier accuracy by providing correlation information between features.

Three methods:
- Probabilistic classifiers like naïve Bayes and HMMs: the Expectation Maximization (EM) method
- Distance-based classifiers like k-nearest neighbor: the graph min-cut method
- Paired independent classifiers: co-training

Page 28

The EM approach

Dl: labeled data; Du: unlabeled data.

- Train the classifier parameters using Dl
- While the likelihood of Dl + Du improves:
  - E step: for each d in Du, find its fractional membership in each class using the current classifier parameters
  - M step: use the fractional memberships of Du and the labels of Dl to re-estimate the maximum-likelihood parameters of the classifier
- Output the classifier
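The loop above, sketched for a toy one-feature model with two Gaussian classes (shared unit variance and equal class priors are assumed simplifications; the talk's experiments used naïve Bayes and HMMs):

```python
import math

def em_semisupervised(labeled, unlabeled, iters=20):
    # labeled: list of (x, y) pairs with y in {0, 1}; unlabeled: list of x values
    mu = []
    for c in (0, 1):
        xs = [x for x, y in labeled if y == c]
        mu.append(sum(xs) / len(xs))          # initial fit from Dl only
    for _ in range(iters):
        # E step: fractional class membership for each unlabeled point
        resp = []
        for x in unlabeled:
            w = [math.exp(-0.5 * (x - m) ** 2) for m in mu]
            z = sum(w)
            resp.append([wi / z for wi in w])
        # M step: re-estimate each class mean from hard labels plus fractions
        for c in (0, 1):
            num = sum(x for x, y in labeled if y == c) + \
                  sum(r[c] * x for r, x in zip(resp, unlabeled))
            den = sum(1 for _, y in labeled if y == c) + sum(r[c] for r in resp)
            mu[c] = num / den
    return mu

mu = em_semisupervised([(0.0, 0), (4.0, 1)], [0.1, -0.2, 3.9, 4.2])
```

With two labeled points and four unlabeled ones, the class means end up anchored by the unlabeled clusters around 0 and 4.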

Page 29

Results with EM

Practical consideration: when the unlabeled data is very large and the class labels don't correspond to natural clusters in the data, the contribution of the unlabeled data to the parameters must be down-weighted.

- Text classification with naïve Bayes on 20 Newsgroups: the 70% accuracy obtained with 10,000 labeled documents is matched with 600 labeled plus 20,000 unlabeled documents
- IE with HMMs: no improvement in accuracy

Page 30

The graph min-cut method

- Construct a weighted graph over Dl + Du, where Dl = Dl+ ∪ Dl-
- Edge weight wij = similarity between instances i and j
- A minimum cut separating Dl+ from Dl- assigns labels to the unlabeled nodes
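A sketch of that labeling objective: fix the labels of Dl+ and Dl-, and give the unlabeled nodes the assignment that cuts the least total edge weight wij between oppositely labeled nodes. Exhaustive search over assignments stands in here for a real max-flow/min-cut solver:

```python
from itertools import product

def mincut_labels(pos, neg, unlabeled, edges):
    # pos/neg: sets of labeled node names; edges: (i, j, w_ij) triples.
    # Returns the assignment minimizing the total weight of cut edges.
    best, best_cut = None, float("inf")
    for assign in product([1, -1], repeat=len(unlabeled)):
        lab = {**{v: 1 for v in pos}, **{v: -1 for v in neg},
               **dict(zip(unlabeled, assign))}
        cut = sum(w for i, j, w in edges if lab[i] != lab[j])
        if cut < best_cut:
            best, best_cut = lab, cut
    return best, best_cut

lab, cut = mincut_labels({"a"}, {"d"}, ["b", "c"],
                         [("a", "b", 3), ("b", "c", 1), ("c", "d", 3)])
```

On this chain, the cheapest cut severs only the weight-1 edge between b and c, so b inherits the positive label from a and c the negative label from d.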

Page 31

Conclusion

Active learning:
- Successfully used in several applications to reduce the need for training data

Semi-supervised learning:
- Limited improvement observed in text classification with naïve Bayes
- Most proposed methods are classifier-specific
- Still open to further research

Page 32

References

Shlomo Argamon-Engelson and Ido Dagan. Committee-based sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research, 11:335-360, 1999.

Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.

S. Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. ACM SIGKDD, 2002.

H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Computational Learning Theory, pages 287-294, 1992.

T. Zhang and F. J. Oles. A probability analysis on the value of unlabeled data for classification problems. ICML, 2000.

Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic text segmentation for extracting structured records. SIGMOD, 2001.

D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. AAAI, 2000.