Importance of Semantic Representation: Dataless Classification

Page 1: Importance of Semantic Representation:  Dataless Classification

Importance of Semantic Representation:

Dataless Classification

Ming-Wei Chang Lev Ratinov Dan Roth Vivek Srikumar

University of Illinois, Urbana-Champaign

Slide 2

Text Categorization

Classify the following sentence:

Syd Millar was the chairman of the International Rugby Board in 2003.

Pick a label:

Class1 vs. Class2

Traditionally, we need annotated data to train a classifier

Slide 3

Text Categorization

Humans don’t seem to need labeled data

Syd Millar was the chairman of the International Rugby Board in 2003.

Pick a label:

Sports vs. Finance

Label names carry a lot of information!

Slide 4

Text Categorization

Do we really always need labeled data?

Slide 5

Contributions

We can often go quite far without annotated data … if we “know” the meaning of text

This works for text categorization … and is consistent across different domains

Slide 6

Outline

Semantic Representation

On-the-fly Classification

Datasets

Exploiting unlabeled data

Robustness to different domains

Slide 8

Semantic Representation

One common representation is the Bag of Words representation

All text is a vector in the space of words.
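A minimal sketch of the bag-of-words idea, assuming simple lowercase tokenization and raw word counts (the slides do not specify a weighting scheme; `bag_of_words` and `cosine` are illustrative names). It also shows why this space alone struggles with label names: the example sentence shares no word with "Sports" or "Finance".

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Map text to a sparse vector: one dimension per word, valued by its count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


doc = bag_of_words("Syd Millar was the chairman of the International Rugby Board in 2003.")
print(doc.most_common(3))                     # most frequent words in the sentence
print(cosine(doc, bag_of_words("Sports")))    # similarity to the label name "Sports"
print(cosine(doc, bag_of_words("Finance")))   # similarity to the label name "Finance"
```

Both similarities are zero because neither label name occurs in the sentence, which is the gap a richer semantic representation is meant to close.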

Slide 9

Semantic Representation

Explicit Semantic Analysis [Gabrilovich & Markovitch, 2006, 2007]

Text is a vector in the space of concepts

Concepts are defined by Wikipedia articles
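A toy sketch of the concept-space idea, not the ESA implementation itself. It assumes four invented one-line "articles" in place of Wikipedia and a plain TF-IDF weighting; `CONCEPTS`, `build_index`, and `esa_vector` are hypothetical names introduced for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stand-ins for Wikipedia articles (title -> article text).
CONCEPTS = {
    "Rugby union": "rugby is a team sport played with an oval ball",
    "Central bank": "a central bank manages monetary policy and interest rates",
    "Sport": "sport includes rugby football hockey and other games",
    "Finance": "finance studies money banking interest rates and investment",
}


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(concepts):
    """Inverted index: word -> {concept title: TF-IDF weight of the word in that article}."""
    tf = {title: Counter(tokens(body)) for title, body in concepts.items()}
    df = Counter(word for counts in tf.values() for word in counts)
    index = defaultdict(dict)
    for title, counts in tf.items():
        for word, count in counts.items():
            index[word][title] = count * math.log(len(concepts) / df[word])
    return index


def esa_vector(text, index):
    """Represent text as a weighted vector over concept titles."""
    vec = Counter()
    for word in tokens(text):
        for title, weight in index.get(word, {}).items():
            vec[title] += weight
    return vec


index = build_index(CONCEPTS)
print(esa_vector("International Rugby Board", index).most_common())
print(esa_vector("monetary policy", index).most_common())
```

Each dimension of the resulting vector is an article title, so two texts can be close even when they share no words, as long as their words point to the same articles.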

Slide 10

Explicit Semantic Analysis: Example

Text: Monetary Policy

ESA representation (Wikipedia article titles): International Monetary Fund, Monetary policy, Economic and Monetary Union, Hong Kong Monetary Authority, Monetarism, Central bank

Text: Apple IPod

ESA representation (Wikipedia article titles): IPod mini, IPod photo, IPod nano, Apple Computer, IPod shuffle, ITunes

Slide 11

Semantic Representation

Two semantic representations

Bag of words

ESA

Slide 12

Outline

Semantic Representation

On-the-fly Classification

Datasets

Exploiting unlabeled data

Robustness to different domains

Slide 13

Traditional Text Categorization

Sports Finance

Labeled corpus

Semantic space

A classifier

Slide 14

Dataless Classification

Sports Finance

Labeled corpus

Labels

What can we do using just the labels?

Slide 15

But labels are text too!

Slide 16

Dataless Classification

Sports Finance

Semantic space

Labels

New unlabeled document

Slide 17

What is Dataless Classification?

Humans don’t need training for classification

Annotated training data not always needed

Look for the meaning of words

Slide 19

On-the-fly Classification

Sports Finance

Semantic space

Labels

New unlabeled document

Slide 20

On-the-fly Classification

No training data needed

We know the meaning of label names

Pick the label that is closest in meaning to the document

Nearest neighbors
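The nearest-neighbor rule above can be sketched as follows. This is an illustration under stated assumptions: `WORD_CONCEPTS` is a small invented word-to-concept table standing in for a real semantic representation, and cosine similarity is assumed as the closeness measure (the slides only say "closest in meaning").

```python
import math
import re
from collections import Counter

# Hypothetical word -> concept weights, standing in for a real ESA index.
WORD_CONCEPTS = {
    "rugby": {"Rugby union": 1.0, "Sport": 0.8},
    "sports": {"Sport": 1.0, "Rugby union": 0.3},
    "chairman": {"Corporation": 0.6},
    "board": {"Corporation": 0.5},
    "finance": {"Finance": 1.0, "Central bank": 0.7, "Corporation": 0.4},
}


def embed(text):
    """Sum the concept vectors of the words in the text."""
    vec = Counter()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        for concept, weight in WORD_CONCEPTS.get(word, {}).items():
            vec[concept] += weight
    return vec


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def classify(document, label_names):
    """Return the label name whose vector is nearest to the document's vector."""
    doc_vec = embed(document)
    return max(label_names, key=lambda name: cosine(doc_vec, embed(name)))


DOC = "Syd Millar was the chairman of the International Rugby Board in 2003."
print(classify(DOC, ["Sports", "Finance"]))
print(classify(DOC, ["Rugby", "Finance"]))  # the label set is just an argument
```

No training step appears anywhere: the label names are embedded with the same function as the document, so a different label set can be passed at call time.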

Slide 21

On-the-fly Classification

Hockey Baseball

Semantic space

New labels

New unlabeled document

Slide 22

On-the-fly Classification

No need to even know the labels beforehand

Compare with traditional classification: annotated training data is needed for each label

Slide 23

Outline

Semantic Representation

On-the-fly Classification

Datasets

Exploiting unlabeled data

Robustness to different domains

Slide 24

Dataset 1: Twenty Newsgroups

Posts to newsgroups

Newsgroups have descriptive names

sci.electronics = Science Electronics

rec.motorbikes = Motorbikes

Slide 25

Dataset 2: Yahoo Answers

Posts to Yahoo! Answers

Posts categorized into a two-level hierarchy

20 top-level categories

280 categories in total at the second level

Arts and Humanities, Theater Acting

Sports, Rugby League

Slide 26

Experiments

20 Newsgroups: 10 binary problems (from [Raina et al., '06])

Religion vs. Politics.guns

Motorcycles vs. MS Windows

Yahoo! Answers: 20 binary problems

Health, Diet fitness vs. Health, Allergies

Consumer Electronics, DVRs vs. Pets, Rodents

Slide 27

Results: On-the-fly classification

Dataset      Supervised Baseline   Bag of Words   ESA
Newsgroup    71.7                  65.7           85.3
Yahoo!       84.3                  66.8           88.6

Supervised Baseline: Naïve Bayes classifier; uses annotated data, ignores labels

Bag of Words and ESA: nearest neighbors; uses labels, no annotated data

Slide 28

Outline

Semantic Representation

On-the-fly Classification

Datasets

Exploiting unlabeled data

Robustness to different domains

Slide 29

Using Unlabeled Data

Knowing the data collection helps

We can learn specific biases of the dataset

Potential for semi-supervised learning

Slide 30

Bootstrapping

Each label name is a “labeled” document

One “example” in word or concept space

Train initial classifier

Same as the on-the-fly classifier

Loop:

Classify all documents with current classifier

Retrain classifier with highly confident predictions
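The bootstrapping loop can be sketched as below. Several choices here are assumptions, not taken from the slides: a nearest-centroid classifier in bag-of-words space, confidence measured as the similarity gap between the best and second-best label, and the `rounds` and `margin` parameters. The four documents are invented for illustration.

```python
import math
import re
from collections import Counter


def bow(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def predict(vec, centroids):
    """Return (best label, similarity gap to the runner-up); needs at least two labels."""
    ranked = sorted(((cosine(vec, c), label) for label, c in centroids.items()), reverse=True)
    return ranked[0][1], ranked[0][0] - ranked[1][0]


def bootstrap(label_names, documents, rounds=2, margin=0.1):
    # Each label name is the single "labeled" example of its class.
    seeds = {label: bow(label) for label in label_names}
    centroids = dict(seeds)  # initial classifier = on-the-fly classifier
    vecs = [bow(d) for d in documents]
    for _ in range(rounds):
        confident = {label: [] for label in label_names}
        for vec in vecs:
            label, gap = predict(vec, centroids)
            if gap >= margin:  # keep only highly confident predictions
                confident[label].append(vec)
        # Retrain: each centroid is the seed plus its confidently labeled documents.
        centroids = {label: sum(confident[label], Counter(seeds[label]))
                     for label in label_names}
    return [predict(vec, centroids)[0] for vec in vecs]


DOCS = [
    "hockey goalie saves puck",
    "goalie blocks puck on ice",
    "baseball pitcher throws strike",
    "pitcher allows home run",
]
print(bootstrap(["hockey", "baseball"], DOCS))
```

The second and fourth documents never mention a label name; they become classifiable only after the first round pulls words such as "goalie" and "pitcher" into the class centroids.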

Slide 31

Co-training

Words and concepts are two independent “views”

Each view is a teacher for the other

[Blum & Mitchell ‘98]

Slide 32

Co-training

Train initial classifiers in word space and concept space

Loop:

Classify documents with current classifiers

Retrain with highly confident predictions of both classifiers
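A rough sketch of this co-training loop, under several assumptions not stated in the slides: nearest-centroid classifiers in both views, a similarity-gap confidence threshold, dropping documents on which the two views confidently disagree, and a final decision that sums the two views' similarities. `CONCEPTS` is an invented word-to-concept table standing in for ESA, and the documents are made up.

```python
import math
import re
from collections import Counter

# Hypothetical word -> concept titles, standing in for a real ESA index.
CONCEPTS = {
    "hockey": ["Ice hockey"], "puck": ["Ice hockey"], "goalie": ["Ice hockey"],
    "rink": ["Ice hockey"],
    "baseball": ["Baseball"], "pitcher": ["Baseball"], "inning": ["Baseball"],
}


def words(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))


def concepts(text):
    return Counter(c for w in words(text).elements() for c in CONCEPTS.get(w, []))


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def similarities(vec, centroids):
    return {label: cosine(vec, c) for label, c in centroids.items()}


def confident_label(scores, margin):
    """Best label if it beats the runner-up by at least `margin`, else None."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, second = ranked[0], ranked[1]
    return best if scores[best] - scores[second] >= margin else None


def co_train(label_names, documents, rounds=2, margin=0.1):
    views = [words, concepts]
    seeds = [{l: view(l) for l in label_names} for view in views]
    vecs = [[view(d) for d in documents] for view in views]
    centroids = [dict(s) for s in seeds]
    for _ in range(rounds):
        votes = {}  # document index -> labels confidently predicted by either view
        for v in range(len(views)):
            for i, vec in enumerate(vecs[v]):
                label = confident_label(similarities(vec, centroids[v]), margin)
                if label is not None:
                    votes.setdefault(i, set()).add(label)
        # Keep documents with a single confident label; drop disagreements.
        pooled = {i: next(iter(ls)) for i, ls in votes.items() if len(ls) == 1}
        # Retrain both views on the pooled predictions, so each view teaches the other.
        centroids = [
            {l: sum((vecs[v][i] for i, lab in pooled.items() if lab == l),
                    Counter(seeds[v][l]))
             for l in label_names}
            for v in range(len(views))
        ]
    results = []
    for i in range(len(documents)):
        per_view = [similarities(vecs[v][i], centroids[v]) for v in range(len(views))]
        combined = {l: sum(s[l] for s in per_view) for l in label_names}
        results.append(max(combined, key=combined.get))
    return results


DOCS = [
    "hockey goalie saves puck",
    "the rink was loud tonight",
    "baseball pitcher throws strike",
    "late inning rally wins it",
    "crowd was loud tonight",
]
print(co_train(["hockey", "baseball"], DOCS))
```

In this toy run the concept view labels the "rink" document first, and the word view then learns "loud" and "tonight" from it, which is what lets the last document be classified even though it maps to no concept at all.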

Slide 33

Using unlabeled data

Three approaches

Bootstrapping with labels using Bag of Words

Bootstrapping with labels using ESA

Co-training

Slide 34

More Results

No annotated data

Co-training using just labels does as well as supervision with 100 examples

Slide 35

Outline

Semantic Representation

On-the-fly Classification

Datasets

Exploiting unlabeled data

Robustness to different domains

Slide 36

Domain Adaptation

Classifiers trained on one domain and tested on another

Performance usually decreases across domains

Slide 37

But the label names are the same

Label names don’t depend on the domain

Label names are robust across domains

On-the-fly classifiers are domain independent

Slide 38

Example: Baseball vs. Hockey

Slide 39

Conclusion

Sometimes, label names tell us more about a class than annotated examples

Standard learning practice of treating labels as unique identifiers loses information

The right semantic representation helps

What is the right one?