Presenter : CHANG, SHIH-JIE Authors : ADNAN YAHYA and ALI SALHI 2014. ACM TALIP .

Post on 24-Feb-2016

Intelligent Database Systems Lab

Presenter : CHANG, SHIH-JIE

Authors : ADNAN YAHYA and ALI SALHI

2014. ACM TALIP.

Arabic Text Categorization Based on Arabic Wikipedia

Outlines
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments

Motivation

Arabic text categorization is challenging because certain subcategories are correlated and the main categories overlap.

EX:

Objectives
• To solve this, we use an algorithm and further adopt two approaches.

CATEGORIZATION CORPORA - Training Data

Related Tags Approach

Testing Data

10 categories with 40 documents in each category

Methodology - PREPROCESSING TECHNIQUES

• Root Extraction (RE)
• Light Stemming (LS)
• Special Expressions Extraction
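
As a rough illustration of light stemming (LS), the sketch below strips a few common Arabic prefixes and suffixes without reducing a word to its root. The affix lists, the minimum-length thresholds, and the function name `light_stem` are assumptions chosen for illustration, loosely modeled on published light stemmers; they are not necessarily the configuration used by the authors.

```python
# Illustrative affix lists (assumption, not the authors' exact configuration).
PREFIXES = ["بال", "كال", "فال", "ال", "لل"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word: str) -> str:
    """Strip common affixes from an Arabic word while keeping a minimal stem."""
    # Drop a leading conjunction "و" only if at least 3 letters remain.
    if word.startswith("و") and len(word) - 1 >= 3:
        word = word[1:]
    # Drop at most one definite-article prefix if at least 2 letters remain.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 2:
            word = word[len(prefix):]
            break
    # Walk the suffix list once, dropping each match if at least 2 letters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[: -len(suffix)]
    return word


stems = [light_stem(w) for w in ["والكتاب", "المعلمون", "مدرسة"]]
```

Light stemming of this kind is deliberately shallow: it can strip a letter that actually belongs to the word, which is one reason it is usually compared against root extraction rather than assumed to be better.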

Methodology - CATEGORIZATION PROCESS

Categorize the input text in two phases:

Phase one: we categorize the text into one of the main categories.

Phase two: we further categorize the input text based on subcategories.
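
The two phases can be sketched as follows. This is a minimal illustration, not the authors' implementation: the scoring function `overlap_score`, the dictionary layout, and the English toy vocabulary are all hypothetical, and the sketch assumes that phase two only considers subcategories of the main category chosen in phase one.

```python
from collections import Counter


def overlap_score(text_words, category_words):
    """Count how many tokens of the text appear in the category vocabulary."""
    vocab = set(category_words)
    return sum(n for word, n in Counter(text_words).items() if word in vocab)


def categorize_two_phase(text_words, main_vocab, sub_vocab):
    """Phase one picks a main category; phase two picks one of its subcategories.

    main_vocab: {main_category: [words]}
    sub_vocab:  {main_category: {subcategory: [words]}}
    """
    # Phase one: best-scoring main category.
    main = max(main_vocab, key=lambda c: overlap_score(text_words, main_vocab[c]))
    # Phase two: best-scoring subcategory under that main category (assumption).
    subs = sub_vocab[main]
    sub = max(subs, key=lambda s: overlap_score(text_words, subs[s]))
    return main, sub


# Toy data, purely illustrative.
main_vocab = {
    "sports": ["match", "team", "goal", "player"],
    "science": ["cell", "energy", "atom", "theory"],
}
sub_vocab = {
    "sports": {"football": ["goal", "striker"], "tennis": ["racket", "serve"]},
    "science": {"physics": ["energy", "atom"], "biology": ["cell", "gene"]},
}
text = "the team scored a late goal and the striker celebrated".split()
result = categorize_two_phase(text, main_vocab, sub_vocab)
```

Any scoring function can be swapped in for `overlap_score`; the point of the sketch is only the main-then-sub control flow.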

Methodology - Basic Categorization Algorithm (BCA)

Methodology - Percentage and Difference Categorization (PDC) Algorithm

has frequency 7 in the 300-word

The category with the highest sum of flag values is considered to be the best match for the input text.
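
A hedged sketch of this selection step is given below. The slides only state that a word's frequency is turned into a percentage (for example, frequency 7 in 300 words is about 2.33%) and that the category with the highest sum of flag values wins. The rule that maps a percentage difference to a flag, the thresholds `close` and `near`, and all names in the code are assumptions made for illustration.

```python
from collections import Counter


def percentages(words):
    """Map each word to its share of the text, in percent."""
    total = len(words)
    return {w: 100.0 * n / total for w, n in Counter(words).items()}


def flag(text_pct, cat_pct, close=0.5, near=1.5):
    """Assumed rule: a smaller percentage difference gives a larger flag."""
    diff = abs(text_pct - cat_pct)
    if diff <= close:
        return 1.0
    if diff <= near:
        return 0.5
    return 0.0


def pdc_categorize(text_words, category_pcts):
    """Return (best category, flag sum per category)."""
    text_pcts = percentages(text_words)
    sums = {}
    for category, cat_pct in category_pcts.items():
        # Only words known to the category contribute a flag.
        sums[category] = sum(
            flag(pct, cat_pct[word])
            for word, pct in text_pcts.items()
            if word in cat_pct
        )
    # The category with the highest sum of flag values is the best match.
    return max(sums, key=sums.get), sums


# Toy 300-word text in which "goal" has frequency 7.
text = ["goal"] * 7 + ["team"] * 3 + ["filler"] * 290
category_pcts = {
    "sports": {"goal": 2.0, "team": 1.2},
    "science": {"atom": 1.5, "team": 4.0},
}
best, sums = pdc_categorize(text, category_pcts)
```

Here both "goal" and "team" fall within the assumed `close` threshold for the sports category, so it collects two full flags while the science category collects none.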

Methodology – PDC Algorithm vs. BCA Algorithm

Methodology – Enhancing Main/Subcategories Grouping

(1) Overlapping Main Categories for Phase Two

Problem: the possibly high correlation between subcategories of different main categories.

(2) Replacing Main Categories by Groups of Related Categories

Methodology - Word Filtration Techniques within Categories

Methodology - The result of applying the three techniques

Modified PDC with N Scales

Define a scaling of:

3 scales: 1, 0.5, 0

5 scales: 1, 0.75, 0.5, 0.25, 0
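
The values listed above look like evenly spaced levels between 1 and 0. Assuming that is the intent, a small helper (the name `scale_values` is hypothetical) can generate the levels for any N:

```python
def scale_values(n):
    """Return n evenly spaced scale values from 1 down to 0."""
    if n < 2:
        raise ValueError("need at least two scale levels")
    return [1 - i / (n - 1) for i in range(n)]


three_levels = scale_values(3)
five_levels = scale_values(5)
```

With N = 3 this gives 1, 0.5, 0, and with N = 5 it gives 1, 0.75, 0.5, 0.25, 0, matching the two scalings shown on the slide.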

Further Testing on the PDC Algorithm
• Tool: Root Extraction
• Tool: Light Stemming & Light10
• Tool: Double Words
• Tool: Expressions Extraction

Using Testing Data from the Reference Categories

Training Data Characteristics

COMPARISON WITH RELATED WORK

Using Testing Data from the Reference Categories

Conclusions
– The first method uses training and testing data from the same source by splitting the corpus into test and training components. This consistently gives better results.
– However, we believe that the second method (different sources) makes more sense, as the tests will be more credible and indicative of performance in real-life environments.
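
For the same-source setup, splitting one corpus into training and test components can be sketched as below. The 80/20 ratio, the fixed seed, and the function name `split_corpus` are assumptions for illustration; the slides do not state how the authors performed the split.

```python
import random


def split_corpus(documents, test_fraction=0.2, seed=0):
    """Shuffle a copy of the corpus and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy corpus of 40 document identifiers.
docs = [f"doc{i}" for i in range(40)]
train_docs, test_docs = split_corpus(docs)
```

In the different-source setup no split is needed: the classifier is trained on one corpus and evaluated on documents collected elsewhere, which is why those results are a stricter indication of real-life performance.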

Comments
• Advantages
– To.
• Applications
– Arabic Text Categorization.