Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification? Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. Universidad Nacional de Educación a Distancia. June 4, 2009

description

My presentation at SSLNLP Workshop (NAACL 2009) on June 4th, 2009

Transcript of Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Page 1: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez

Universidad Nacional de Educación a Distancia

June 4, 2009

Page 2: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Text Classification

Index

1 Text Classification

2 Motivation

3 Support Vector Machines

4 Multiclass SVM

5 S3VM

6 Multiclass S3VM

7 Compared Approaches: Multiclass SVM vs Multiclass S3VM

8 Experiments

9 Results

10 Conclusions and Outlook

11 Thank you

A. Zubiaga, V. Fresno, R. Martínez (UNED) Unlabeled Data for Multiclass SVM June 4, 2009 2 / 31

Page 3: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Text Classification

What is it?

We have a set of documents:

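$D = \{d_1, \ldots, d_{|D|}\}$

With a set of predefined categories:

$C = \{c_1, \ldots, c_{|C|}\}$

Classification decides, for each pair below, whether document $d_j$ belongs to category $c_i$:

$\langle d_j, c_i \rangle \in D \times C$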

Page 5: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Motivation

Motivation

Several studies address plain-text (news) classification, but only a few address web page classification.

Typical web page classification task:

Semi-supervised: few labeled documents.

Multiclass: taxonomy with more than 2 categories.

(Joachims, 1999) showed the suitability of unlabeled data for binary tasks.

What about multiclass tasks?

(Chapelle et al., 2006) did it over image datasets, but never for text/web pages.

Page 7: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Support Vector Machines

SVM

It looks for a hyperplane to separate the classes

Margin maximization

Page 9: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Support Vector Machines

SVM

Optimization function:

$$\min \; \frac{1}{2}\|\omega\|^2 + C \cdot \sum_{i=1}^{n} \xi_i^d$$

Subject to:

$$y_i(\omega \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

It only handles binary and supervised problems by nature.
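As a rough illustration of the objective on this slide, the sketch below evaluates it for one fixed hyperplane on a toy dataset. It assumes d = 1 (linear slack penalty); the function name `svm_objective` and the data points are made up for the example and are not from the talk. It only evaluates the objective, it does not solve the optimization.

```python
def svm_objective(w, b, X, y, C=1.0, d=1):
    """Soft-margin SVM objective for a fixed hyperplane (w, b).

    Each slack is the smallest xi_i >= 0 that satisfies
    y_i * (w . x_i + b) >= 1 - xi_i.
    """
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    value = 0.5 * dot(w, w) + C * sum(s ** d for s in slacks)
    return value, slacks


# Hypothetical 2-D toy data: two positive and two negative documents.
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (-0.5, 0.0)]
y = [1, 1, -1, -1]
value, slacks = svm_objective(w=(1.0, 0.0), b=0.0, X=X, y=y, C=1.0)
```

Points inside the margin (here the two at distance 0.5 from the hyperplane) get a positive slack, so C controls how strongly margin violations are traded against a wider margin.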

Page 11: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Multiclass SVM

Multiclass SVM

Approaches to multiclass SVM:

Direct.

Combining binary classifiers:

One-against-one.

One-against-all.

Usually applied to supervised tasks, but hardly ever to semi-supervised ones.

Page 12: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Multiclass SVM

Multiclass SVM: Direct approach

The optimization function considers all the hyperplanes at the same time.

$$\min \; \frac{1}{2} \sum_{m=1}^{n} \|w_m\|^2 + C \sum_{i=1}^{l} \sum_{m \neq y_i} \xi_i^m$$

Subject to:

$$w_{y_i} \cdot x_i + b_{y_i} \geq w_m \cdot x_i + b_m + 2 - \xi_i^m, \quad \xi_i^m \geq 0$$
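The constraint above can be read as "the true class must beat every other class by a margin of 2, up to a slack". A minimal sketch that evaluates this objective for fixed hyperplanes follows; the function name, the class indexing from 0, and the toy numbers are assumptions made for illustration only.

```python
def direct_multiclass_objective(W, b, X, y, C=1.0):
    """Evaluate the direct multiclass objective for fixed hyperplanes.

    W[m], b[m] define the hyperplane of class m; y[i] is the index of the
    true class of X[i]. Each slack is the smallest value that satisfies
    w_y . x + b_y >= w_m . x + b_m + 2 - slack.
    """
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    regularizer = 0.5 * sum(dot(w, w) for w in W)
    slack_total = 0.0
    for xi, yi in zip(X, y):
        true_score = dot(W[yi], xi) + b[yi]
        for m in range(len(W)):
            if m != yi:
                slack_total += max(0.0, dot(W[m], xi) + b[m] + 2.0 - true_score)
    return regularizer + C * slack_total


# Hypothetical example: three classes, two training documents.
W = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
b = [0.0, 0.0, 0.0]
objective = direct_multiclass_objective(W, b, X=[(2.0, 0.0), (0.0, 1.0)], y=[0, 1])
```

In this example only the second document violates a margin (against class 0), so a single slack of 1 is added to the regularization term.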

Page 16: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Multiclass SVM

Multiclass SVM: One-against-one

It creates $\frac{k \cdot (k-1)}{2}$ binary classifiers

$\operatorname{sign}(\omega_{ij}^T \cdot x + b_{ij})$ → add a vote for the winning class between i and j

The class with the most votes is the output.
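The voting scheme can be sketched as follows. This is a minimal illustration, assuming that a positive decision value votes for class i and a non-positive one for class j, and that ties go to the lowest class index; both conventions, the function name, and the toy hyperplanes are assumptions, not details from the talk.

```python
from collections import Counter
from itertools import combinations


def one_against_one_predict(x, pairwise, k):
    """pairwise maps each pair (i, j), i < j, to the (w, b) of its binary classifier."""
    votes = Counter()
    for i, j in combinations(range(k), 2):
        w, b = pairwise[(i, j)]
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        votes[i if score > 0 else j] += 1  # one vote per pairwise classifier
    # Most votes wins; ties are broken towards the lowest class index.
    winner = max(range(k), key=lambda c: (votes[c], -c))
    return winner, votes


# Hypothetical hyperplanes for k = 3 classes: 3 * (3 - 1) / 2 = 3 classifiers.
pairwise = {
    (0, 1): ((1.0, 0.0), 0.0),
    (0, 2): ((0.0, 1.0), 0.0),
    (1, 2): ((-1.0, 1.0), 0.0),
}
label, votes = one_against_one_predict((2.0, 1.0), pairwise, k=3)
```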

Page 20: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Multiclass SVM

Multiclass SVM: One-against-all

It creates k binary classifiers

$$C_i = \arg\max_{i=1,\ldots,k} \; (\omega_i \cdot x + b_i)$$
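In code, the one-against-all decision is an argmax over the k decision values. A minimal sketch, with a hypothetical function name and made-up hyperplanes:

```python
def one_against_all_predict(x, classifiers):
    """classifiers[i] is the (w, b) of the binary classifier 'class i vs. the rest'."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in classifiers]
    best = max(range(len(scores)), key=scores.__getitem__)  # index of the largest score
    return best, scores


# Hypothetical hyperplanes for k = 3 classes.
classifiers = [((1.0, 0.0), -0.5), ((0.0, 1.0), 0.0), ((-1.0, -1.0), 0.2)]
best, scores = one_against_all_predict((1.0, 2.0), classifiers)
```

Unlike one-against-one, the raw decision values are compared directly, so the k classifiers should be trained with comparable settings for the comparison to be meaningful.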

Page 22: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

S3VM

Semi-supervised SVM (S3VM)

Unlabeled documents are considered during the learning phase.

The resulting optimization function is:

$$\min \; \frac{1}{2} \cdot \|\omega\|^2 + C \cdot \sum_{i=1}^{l} \xi_i^d + C^* \cdot \sum_{j=1}^{u} {\xi_j^*}^d$$

Convex optimization algorithms required.

Commonly used on binary taxonomies, but hardly ever with more classes.
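To show how the unlabeled term enters the objective, the sketch below evaluates it for a fixed hyperplane. It assumes d = 1 and that each unlabeled document is given whichever label costs less, which yields a slack of max(0, 1 - |f(x)|); any class-balance constraint is ignored. The function name and data are illustrative assumptions.

```python
def s3vm_objective(w, b, X_lab, y_lab, X_unl, C=1.0, C_star=0.5, d=1):
    """Evaluate the S3VM objective for a fixed hyperplane (w, b)."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    f = lambda x: dot(w, x) + b
    # Labeled slack: usual hinge loss with the known label.
    labeled = [max(0.0, 1.0 - yi * f(xi)) for xi, yi in zip(X_lab, y_lab)]
    # Unlabeled slack: hinge loss under the cheaper of the two possible labels.
    unlabeled = [max(0.0, 1.0 - abs(f(xj))) for xj in X_unl]
    return (0.5 * dot(w, w)
            + C * sum(s ** d for s in labeled)
            + C_star * sum(s ** d for s in unlabeled))


# Hypothetical data: two labeled documents and two unlabeled ones.
objective = s3vm_objective(
    w=(1.0, 0.0), b=0.0,
    X_lab=[(2.0, 0.0), (-2.0, 0.0)], y_lab=[1, -1],
    X_unl=[(0.25, 0.0), (-3.0, 0.0)],
)
```

Only the unlabeled document close to the hyperplane is penalized, which is what pushes the decision boundary away from dense regions of unlabeled data.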

Page 24: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Multiclass S3VM

Multiclass S3VM

(Yajima and Kuo, 2006) present the following optimization function:

$$\min \left( \frac{1}{2} \sum_{i=1}^{h} {\beta^{i}}^{T} K^{-1} \beta^{i} + C \sum_{j=1}^{l} \sum_{i \neq y_j} \max\left(0,\; 1 - (\beta_j^{y_j} - \beta_j^{i})\right)^2 \right)$$

where β represents the product of a vector and a kernel matrix defined by the authors.

(Chapelle et al., 2006): direct approach by means of the Continuation Method.

2 steps:

(Qi et al., 2004) use Fuzzy C-Means to predict labels for new unlabeled documents.

(Xu and Schuurmans, 2005) rely on a clustering-based approach to label the unlabeled data.

Page 26: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Compared Approaches: Multiclass SVM vs Multiclass S3VM

Multiclass SVM vs Multiclass S3VM

2-steps-SVM / 1-step-SVM: multiclass SVM.

Does an intermediate step that adds newly labeled data improve the classifier's performance?

One-against-all-S3VM / One-against-all-SVM.

One-against-one-S3VM / One-against-one-SVM.

Does unlabeled data help improve the results of classifiers that combine binary ones?
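One plausible reading of the 2-steps idea is: train on the labeled documents, use that model to label the unlabeled ones, then retrain on everything. The sketch below follows that reading. To stay self-contained it uses a nearest-centroid learner as a stand-in, which is NOT an SVM and not what the talk used; all names and data are hypothetical.

```python
def train_centroids(X, y):
    """Stand-in learner (not an SVM): one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for t, v in enumerate(xi):
            acc[t] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def predict(model, x):
    """Return the class whose centroid is closest to x."""
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(model[c], x))
    return min(sorted(model), key=dist)


def two_step_train(X_lab, y_lab, X_unl):
    first = train_centroids(X_lab, y_lab)            # step 1: labeled data only
    y_new = [predict(first, x) for x in X_unl]       # label the unlabeled documents
    final = train_centroids(X_lab + X_unl, y_lab + y_new)  # step 2: retrain on all
    return final, y_new


model, y_new = two_step_train(
    X_lab=[(0.0, 0.0), (4.0, 4.0)], y_lab=["a", "b"],
    X_unl=[(1.0, 0.0), (3.0, 5.0)],
)
```

The 1-step variant corresponds to stopping after the first training call, so the comparison isolates the effect of the automatically labeled documents.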

Page 28: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Experiments

Experimental settings

Datasets:

BankSearch: 10,000 web documents / 10 categories (4,000 for the training set).

WebKB: 4,518 web documents / 6 categories (2,000 for the training set).

Yahoo! Science: 788 web documents / 6 categories (200 for the training set).

Numerous labeled/unlabeled sets.

9 executions for each.

Representation: TF-IDF.

Software:

SVM-light (http://svmlight.joachims.org)

SVM-multiclass

Evaluation by means of accuracy (percentage of correct predictions).
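For readers unfamiliar with the representation and the metric, here is a minimal sketch of both. TF-IDF has several variants and the slides do not say which one was used; this sketch assumes raw term frequency times log(N / document frequency). The tokenized documents and labels are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def accuracy(predicted, gold):
    """Percentage of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


docs = [["svm", "web", "web"], ["svm", "kernel"], ["svm", "web"]]
vectors = tf_idf(docs)
acc = accuracy(["a", "b", "a", "c"], ["a", "b", "b", "c"])
```

A term that occurs in every document (here "svm") gets weight 0 under this variant, since it carries no information for telling categories apart.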

Page 30: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Results

Results: BankSearch

Page 31: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Results

Results: WebKB

Page 32: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Results

Results: Yahoo! Science

Page 33: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Results

Results

Supervised multiclass approaches (2-steps-SVM & 1-step-SVM) outperform the rest.

Among binary combinations, one-against-all outperforms one-against-one.

Unlabeled data helps slightly for one-against-all.

1-step-SVM and 2-steps-SVM show similar results, except for WebKB, where the former wins.

This could be due to the homogeneous nature of the WebKB dataset.

Page 35: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Conclusions and Outlook

Conclusions

Comparison of multiclass SVM and S3VM approaches for web page classification.

Direct and combining approaches.

Direct approaches outperform the rest.

Unlabeled data did not provide considerable improvements, and even degraded performance in some cases.

Page 36: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Conclusions and Outlook

Future Work

To add more multiclass S3VM approaches to the study.

To test with different SVM settings (kernel, parameters, ...).

Page 38: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?

Thank you

Thank you
