Ceyhun Karbeyaz Çağrı Toraman Anıl Türel Ahmet Yeniçağ

Ceyhun Karbeyaz Çağrı Toraman Anıl Türel Ahmet Yeniçağ 12.05.2010 Text Categorization For Turkish News


Ceyhun Karbeyaz

Çağrı Toraman

Anıl Türel

Ahmet Yeniçağ

12.05.2010

Text Categorization For Turkish News


Text categorization

Classify text into predefined categories

Supervised learning: requires a labeled corpus

Used in: indexing (e.g. libraries), news articles, spam filtering


Bilkent News Portal

Gather news from different news providers

News is more accessible

New event detection and tracking

Novelty detection

Duplicate elimination

Personalization

News categorization


[Figure: news items assigned to RSS categories: Aktuel, En Çok Okunanlar, Anasayfa, Spor, Politika, Çevre, Tüm Haberler, Son Dakika, ...]


Motivation

News items are categorized from RSS

24 good categories: Aktuel, Avrupa_Futbol, BilimTeknoloji, Bilisim, Cevre, DisHaberler, Dunya, Ege, Egitim, Ekonomi, Formula1, Hava_Yol, Ispanya, Italya, KulturSanat, Politika, Saglik, Siyaset, Spor, Televizyon, Turkiye, Yasam, Yazarlar, YurtHaberler

A few bad categories: Anasayfa, EnCokOkunanlar, Gundem, SonDakika, Tum_Haberler, Yazarlar


Data Set

Categories are skewed (not homogeneous): documents are unevenly distributed across categories.


Approach

Classifiers: K Nearest Neighbour (kNN) and Support Vector Machines (SVM), found to be the best [1]

Training set: news with good categories

Test set: news with good categories (for evaluation)

Evaluation: test with already categorized news
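As a minimal sketch of the evaluation step, the share of test news whose predicted category matches the category it already carries can be computed as below. The function name `accuracy` and the sample labels are illustrative assumptions; the slides do not specify the exact metric beyond a "success" rate.

```python
def accuracy(predicted: list[str], actual: list[str]) -> float:
    """Fraction of test documents whose predicted category equals the known one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    # Count positions where the prediction agrees with the existing label.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels, for illustration only.
predicted_labels = ["Spor", "Ekonomi", "Spor", "Dunya", "Spor"]
actual_labels = ["Spor", "Ekonomi", "Dunya", "Dunya", "Ekonomi"]
score = accuracy(predicted_labels, actual_labels)
```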


Methodology

Cleaning Noises

Preprocessing

Indexing

Document Classification


Cleaning Noises

News documents coming from different RSS feeds generally contain noise such as advertisements and hyperlinks.

This noise inflates the similarity between documents that contain the same or similar noise.

As a result, the performance of systems such as Bilkent News Portal, which rely on similarity between documents, decreases.


Cleaning Noises (cont'd)

Noise such as hyperlinks is easily cleaned by removing the sentences that contain it.

The pattern of such noise does not change across news documents coming from different RSS feeds.

E.g. hyperlinks, which point to other documents, have the form “<a href="http://…" ></a>”.
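A minimal sketch of this step, assuming a simple regular expression for anchor tags and a naive punctuation-based sentence splitter (both are illustrative choices, not the project's actual code):

```python
import re

# Assumed pattern for an HTML anchor tag; the slide only gives its general shape.
ANCHOR_RE = re.compile(r"<a\s+href=[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
# Naive sentence splitter: break after '.', '!' or '?' when whitespace follows.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def remove_hypertext_sentences(text: str) -> str:
    """Drop every sentence that contains an anchor tag and rejoin the rest."""
    sentences = SENTENCE_SPLIT_RE.split(text.strip())
    kept = [s for s in sentences if not ANCHOR_RE.search(s)]
    return " ".join(kept)


sample = (
    "The dollar rose today. "
    'For details <a href="http://example.com/x">click here</a>. '
    "The stock market fell."
)
cleaned = remove_hypertext_sentences(sample)
```

Dropping the whole sentence, rather than only the tag, also removes the surrounding call-to-action text, which would otherwise remain as noise.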


Cleaning Noises (cont'd)

Each RSS feed attaches specific advertisements to its news documents.

There is no general pattern that holds for all news documents.

Over time, even the same RSS feed changes the advertisement it attaches to its news documents.


Cleaning Noises (cont'd)

Compare two consecutive news documents from the same RSS feed sentence by sentence: each sentence is compared with every sentence of the next news document.

Calculate the similarity between each pair of sentences using cosine similarity.
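The sentence-by-sentence comparison can be sketched as follows. This assumes raw term-frequency vectors over whitespace tokens and treats sentences whose similarity exceeds a hypothetical threshold of 0.9 as repeated noise; the slides do not state the weighting scheme or the threshold.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between term-frequency vectors of two sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def repeated_sentences(doc_a: list[str], doc_b: list[str],
                       threshold: float = 0.9) -> list[str]:
    """Sentences of doc_a that closely match at least one sentence of doc_b."""
    return [s for s in doc_a
            if any(cosine_similarity(s, t) >= threshold for t in doc_b)]


# Two consecutive documents from the same (hypothetical) feed.
first = ["Dollar rises against lira", "Subscribe now for daily news"]
second = ["Team wins the cup final", "Subscribe now for daily news"]
noise = repeated_sentences(first, second)
```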


Preprocessing

Stemming – the Zemberek API is used.

Stop word list comparison – frequently occurring words are not taken into consideration.
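Stop word filtering can be sketched as below. The stop word set here is a tiny hypothetical sample, not the list used in the project, and stemming is left out because Zemberek is a separate Java library whose calls are not shown in the slides.

```python
# Hypothetical, very small stop word list for illustration only.
TURKISH_STOP_WORDS = {"ve", "bir", "bu", "da", "de", "için", "ile"}


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only tokens that are not on the stop word list."""
    return [t for t in tokens if t.lower() not in TURKISH_STOP_WORDS]


tokens = "Bu haber spor ve ekonomi için".split()
filtered = remove_stop_words(tokens)
```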


Document Indexing

Creating a vector space model from index terms, together with feature selection, may be a costly process.

For consistency with Bilkent News Portal, Lemur [2] is used for document indexing. Lemur can only index predefined formats, such as:

TREC text
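A minimal sketch of wrapping a cleaned news document in TREC text markup before indexing. The `<DOC>`, `<DOCNO>` and `<TEXT>` layout follows the commonly documented TREC text format; the function name and the document identifier are assumptions made for illustration.

```python
def to_trec_text(doc_id: str, text: str) -> str:
    """Wrap one document in TREC text tags (one <DOC> element per document)."""
    return (
        "<DOC>\n"
        f"<DOCNO> {doc_id} </DOCNO>\n"
        "<TEXT>\n"
        f"{text}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


# Hypothetical identifier and content.
record = to_trec_text("news-0001", "dolar yükseldi borsa düştü")
```

Several such records can be concatenated into a single file, so a whole feed can be indexed in one pass.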


Document Classification

Two different approaches are used, to assess which one performs better:

K-Nearest Neighbor

Support Vector Machines


K Nearest Neighbor

Given training data D (categorized news in our case), the goal is to assign a test point X (a news item with unknown category in our case) the label of its closest neighbors in D.

To find the k nearest news items, we again used Lemur as the distance function.

Lemur can also retrieve documents ranked by a score calculation.

Lemur is quite fast at retrieving the k most similar documents.
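The kNN decision rule can be sketched as below. In the project the k nearest documents are retrieved with Lemur; this sketch substitutes a plain cosine similarity over term counts, and the majority vote, the function names and the toy training set are all assumptions for illustration.

```python
import math
from collections import Counter


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query: list[str],
                 training: list[tuple[list[str], str]],
                 k: int = 3) -> str:
    """Return the majority category among the k most similar training documents."""
    q = Counter(query)
    # Rank training documents from most to least similar to the query.
    ranked = sorted(training,
                    key=lambda item: cosine(q, Counter(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training data D: (tokens, category) pairs.
training_set = [
    (["gol", "maç", "takım"], "Spor"),
    (["maç", "lig", "gol"], "Spor"),
    (["borsa", "dolar", "faiz"], "Ekonomi"),
    (["dolar", "kur", "faiz"], "Ekonomi"),
]
predicted = knn_classify(["gol", "lig", "maç"], training_set, k=3)
```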


Support Vector Machines

Support Vector Machines (SVMs) have been applied to various problems.

The data lies in a k-dimensional space.

Find a hyperplane (i.e. a subspace with k-1 dimensions) that separates the data.

Several hyperplanes are possible.


Support Vector Machines (cont'd)

Figure 2. Possible hyperplanes in a sample space. (Figure labels: margin of u2, support vectors.)


Support Vector Machines (cont'd)

Aim: find a hyperplane that classifies the data correctly with maximal margin.

Only the support vectors affect the position of the hyperplane.

Represent a hyperplane:
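The formula on this slide did not survive in the transcript. As an assumption about what it showed, the standard textbook form for a linear SVM is given below, where $\mathbf{w}$ is the normal vector of the hyperplane, $b$ is the offset, and $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1, +1\}$ are the training points:

```latex
\[
  \mathbf{w} \cdot \mathbf{x} + b = 0,
  \qquad
  y_i \,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 \quad \text{for all } i,
  \qquad
  \text{margin} = \frac{2}{\lVert \mathbf{w} \rVert}
\]
```

Maximizing the margin is therefore equivalent to minimizing $\lVert \mathbf{w} \rVert$ subject to the classification constraints, and the support vectors are the points for which the constraint holds with equality.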


Support Vector Machines (cont'd)

Figure 3. A sample linear SVM.


Support Vector Machines (cont'd)

Figure 4. A sample non-linear SVM.

Figure 5. Mapping with kernel function.
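To illustrate the idea behind Figure 5, the sketch below checks that a degree-2 polynomial kernel on 2-dimensional points gives the same value as a dot product after an explicit mapping into 3 dimensions. The choice of kernel and the names `phi` and `poly_kernel` are assumptions for illustration; the slides do not say which kernel was used.

```python
import math


def dot(u: tuple[float, ...], v: tuple[float, ...]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x: tuple[float, float]) -> tuple[float, float, float]:
    """Explicit feature map for the kernel (x . z)^2 in two dimensions."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x: tuple[float, float], z: tuple[float, float]) -> float:
    """Degree-2 homogeneous polynomial kernel, computed in the input space."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
in_input_space = poly_kernel(x, z)       # kernel evaluated directly
in_feature_space = dot(phi(x), phi(z))   # same quantity after mapping
```

The kernel avoids building the mapped vectors at all, which matters when the feature space is much larger than the input space.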


Conclusion

The necessity of Turkish news categorization is covered.

Bilkent News Portal: RSS feeds may lack category information or have unrealistic categories such as Last Minute, Main Page, Agenda, etc.

A categorization methodology for Turkish news is proposed.

The correct category is predicted with both kNN (the base classifier) and SVM, to evaluate which one performs better.

kNN: 60% success over 100 experiments.


References

[1] Yang, Y. and Liu, X. A re-examination of text categorization methods. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1999.

[2] http://www.lemurproject.org/


Questions?

Thank you for listening…