UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
A Genetic Algorithm for Text Classification
Rule Induction
A. Pietramala (1), V. Policicchio (1), P. Rullo (1,2), I. Sidhu (3)
1. Università della Calabria (Rende, Italy) {a.pietramala,policicchio,rullo}@mat.unical.it
2. Exeura Srl (Rende, Italy)
3. Kenetica Ltd (Chicago, IL-USA) {isidhu}@computer.org
ECML PKDD 2008, 15-19 September 2008, Antwerp, Belgium
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu A Genetic Algorithm for Text Classification Rule Induction
Outline
– Motivations
– The Olex Hypothesis Language
– The Genetic Algorithm Approach (Olex-GA)
– Experimental Results and Comparative Evaluation
– Discussions
– Conclusions and Future Work
Motivations
• Rule learning algorithms have become a successful strategy for classifier induction.
• Rule-based classifiers provide the desirable property of being readable and, thus, easy to understand (and, possibly, modify).
• Genetic Algorithms (GAs) are stochastic search methods inspired by biological evolution.
• GAs have been shown to provide good solutions for classical optimization tasks (e.g. TSP and Knapsack).
Rule Induction and GAs
• Rule induction is one of the application fields of GAs.
The basic idea is that:
– Each individual in the population represents a candidate solution
(a classification rule or a classifier)
– The fitness of an individual is evaluated in terms of the predictive
accuracy.
• We propose a GA approach, called Olex-GA, for the induction of rule-based text classifiers.
Olex-GA - The hypothesis language
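• A classifier Hc(Pos, Neg) is of the form:
$$c \leftarrow (t_1 \in d \lor \dots \lor t_n \in d) \land \lnot (t_{n+1} \in d \lor \dots \lor t_{n+m} \in d)$$
where c is a category, d is a document, each t_i is a term (n-gram), Pos = {t1,…,tn} and Neg = {tn+1,…,tn+m}.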
“if any of the terms t1,…,tn occurs in d and none of the terms tn+1,…,tn+m occurs in d, then classify d under category c”
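The rule can be read as a simple membership test. The following is a minimal sketch, assuming a document is represented as a set of terms; the function name `classify` and the sample terms are illustrative and do not come from the slides.

```python
def classify(doc_terms, pos, neg):
    """Return True if the document is assigned to the category.

    The rule fires when at least one positive term occurs in the
    document and no negative term does.
    """
    has_positive = any(t in doc_terms for t in pos)
    has_negative = any(t in doc_terms for t in neg)
    return has_positive and not has_negative


# Hypothetical rule for a "grain" category (illustrative terms only).
pos = {"wheat", "corn"}
neg = {"recipe"}

print(classify({"wheat", "export", "tonnes"}, pos, neg))  # True
print(classify({"corn", "recipe", "butter"}, pos, neg))   # False
print(classify({"oil", "barrel"}, pos, neg))              # False
```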
Olex-GA The hypothesis language
• The terms in Pos and Neg are chosen among the ones belonging to the local vocabulary:
$$V(k, f) = \bigcup_{c \in C} V_c(k, f)$$
• Intuitively, Vc(k, f) is the set of the best k terms for category c according to a given scoring function f.
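As a rough illustration of how such a vocabulary could be assembled, the sketch below takes the k best-scoring terms per category and unites them. The slides do not fix the scoring function f, so `df_difference` is only a placeholder; the single label per document and all function names are likewise assumptions made for the example.

```python
def top_k_terms(docs, labels, category, k, score):
    """V_c(k, f): the k best-scoring terms for one category."""
    terms = set().union(*docs)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(terms, key=lambda t: (-score(t, docs, labels, category), t))
    return set(ranked[:k])


def vocabulary(docs, labels, categories, k, score):
    """V(k, f): union of the per-category vocabularies."""
    vocab = set()
    for c in categories:
        vocab |= top_k_terms(docs, labels, c, k, score)
    return vocab


def df_difference(term, docs, labels, category):
    """Placeholder scoring function f (an assumption, not from the slides):
    share of in-category documents containing the term minus the share of
    out-of-category documents containing it."""
    inside = [d for d, l in zip(docs, labels) if l == category]
    outside = [d for d, l in zip(docs, labels) if l != category]
    in_rate = sum(term in d for d in inside) / max(len(inside), 1)
    out_rate = sum(term in d for d in outside) / max(len(outside), 1)
    return in_rate - out_rate


docs = [{"wheat", "export"}, {"wheat", "price"}, {"oil", "price"}, {"oil", "barrel"}]
labels = ["grain", "grain", "crude", "crude"]

print(sorted(top_k_terms(docs, labels, "grain", 1, df_difference)))            # ['wheat']
print(sorted(vocabulary(docs, labels, ["grain", "crude"], 1, df_difference)))  # ['oil', 'wheat']
```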
Olex-GA - Problem statement
• The Olex-GA learning problem is stated as an optimization problem:
PROBLEM MAX-F
Let a category c ∈ C and a vocabulary V(k, f) over the training set TS be given.
Then, find two subsets of V(k, f), Pos = {t1,…,tn} and Neg = {tn+1,…,tn+m}, with Pos ≠ Ø, such that Hc(Pos, Neg) applied to TS yields a maximum value of Fc,α (over TS), for a given α ∈ [0,1].
• Problem MAX-F is NP-hard.
Olex-GA A Genetic Algorithm to Solve MAX-F
• Problem MAX-F is a combinatorial optimization problem aimed at finding a best combination of terms taken from a given vocabulary.
• MAX-F is a typical problem for which GAs are known to be good candidate solution methods.
Olex-GA - Our implementation of the GA
• In the following, we describe our choices concerning:
– Population Encoding
– Fitness Function
– Evolutionary Operators
Olex-GA - Population Encoding
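• Each individual represents an entire classifier.
• An individual is simply a binary representation of the sets Pos and Neg of a classifier Hc(Pos, Neg).
EXAMPLE
Given a vocabulary
$$V(k, f) = \{t_1, t_2, t_3, t_4, t_5\}$$
the classifier
$$c \leftarrow (t_1 \in d \lor t_3 \in d) \land \lnot (t_2 \in d \lor t_4 \in d)$$
is encoded as the chromosome
t1 t2 t3 t4 t5   t1 t2 t3 t4 t5
 1  0  1  0  0    0  1  0  1  0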
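The binary encoding can be sketched as follows, assuming the chromosome concatenates one bit per vocabulary term for Pos with one bit per vocabulary term for Neg. The helper names `encode` and `decode` are illustrative and are not taken from the slides.

```python
def encode(vocab, pos, neg):
    """Encode (Pos, Neg) as a bit list: |vocab| bits for Pos, then |vocab| bits for Neg."""
    k_pos = [1 if t in pos else 0 for t in vocab]
    k_neg = [1 if t in neg else 0 for t in vocab]
    return k_pos + k_neg


def decode(vocab, chromosome):
    """Recover the sets Pos and Neg from a chromosome produced by encode()."""
    n = len(vocab)
    pos = {t for t, bit in zip(vocab, chromosome[:n]) if bit}
    neg = {t for t, bit in zip(vocab, chromosome[n:]) if bit}
    return pos, neg


vocab = ["t1", "t2", "t3", "t4", "t5"]
chromosome = encode(vocab, pos={"t1", "t3"}, neg={"t2", "t4"})
print(chromosome)  # [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]

pos, neg = decode(vocab, chromosome)
print(sorted(pos), sorted(neg))  # ['t1', 't3'] ['t2', 't4']
```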
Olex-GA Population Encoding
• We restrict the search for positive and negative terms, respectively, to:
– Pos*, the set of terms belonging to Vc(k, f) (candidate positive terms);
– Neg*, the set of terms that occur in some document which contains a candidate positive term but does not belong to the training set TSc of c (candidate negative terms).
• This reduction of the search space allows:
– an improvement in the efficiency of the algorithm;
– quick convergence toward good solutions.
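One possible reading of these two definitions is sketched below. It assumes documents are sets of terms with a parallel list of flags marking membership in TSc, and it restricts Neg* to terms of the vocabulary V(k, f), since Pos and Neg are drawn from that vocabulary. The function name and the sample data are hypothetical.

```python
def candidate_sets(docs, in_category, vocab_c, vocab):
    """Compute (Pos*, Neg*) for one category.

    docs        : list of term sets
    in_category : parallel list of bools (True if the document is in TS_c)
    vocab_c     : V_c(k, f), the best terms for the category
    vocab       : V(k, f), the overall vocabulary
    """
    pos_star = set(vocab_c)
    neg_star = set()
    for terms, is_member in zip(docs, in_category):
        # Only documents outside TS_c that contain a candidate positive
        # term contribute candidate negative terms.
        if not is_member and terms & pos_star:
            neg_star |= terms & vocab
    return pos_star, neg_star


docs = [{"wheat", "export"}, {"wheat", "recipe", "bread"}, {"oil", "barrel"}]
in_category = [True, False, False]

pos_star, neg_star = candidate_sets(
    docs, in_category, vocab_c={"wheat"}, vocab={"wheat", "recipe", "oil"}
)
print(sorted(pos_star))  # ['wheat']
print(sorted(neg_star))  # ['recipe', 'wheat']
```

Read literally, the definition lets a term such as "wheat" appear in both candidate sets; whether the original system filters such terms is not stated in the slides.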
Olex-GA - Fitness Function
• The fitness of a chromosome K, representing Hc(Pos, Neg), is the value of the F-measure resulting from applying Hc(Pos, Neg) to the training set TS.
• This choice naturally follows from the formulation of problem MAX-F.
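A fitness evaluation along these lines might look as follows. This is a sketch under assumptions: the slides do not spell out the F-measure formula, so the α-weighted form used here (which reduces to F1 at α = 0.5) is assumed, and for simplicity the Pos and Neg halves of the chromosome are indexed over the same term list. All names are illustrative.

```python
def f_measure(tp, fp, fn, alpha=0.5):
    """Assumed form: Pr * Re / ((1 - alpha) * Pr + alpha * Re); F1 when alpha = 0.5."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision * recall / ((1 - alpha) * precision + alpha * recall)


def fitness(chromosome, vocab, docs, in_category, alpha=0.5):
    """F-measure of the classifier encoded by the chromosome on the given documents."""
    n = len(vocab)
    pos = {t for t, bit in zip(vocab, chromosome[:n]) if bit}
    neg = {t for t, bit in zip(vocab, chromosome[n:]) if bit}
    tp = fp = fn = 0
    for terms, is_member in zip(docs, in_category):
        predicted = bool(terms & pos) and not (terms & neg)
        if predicted and is_member:
            tp += 1
        elif predicted:
            fp += 1
        elif is_member:
            fn += 1
    return f_measure(tp, fp, fn, alpha)


vocab = ["wheat", "corn", "recipe"]
docs = [{"wheat", "export"}, {"corn", "price"}, {"wheat", "recipe"}, {"oil"}]
in_category = [True, True, False, False]

# Pos = {wheat, corn}, Neg = {}: one false positive (the recipe document).
print(round(fitness([1, 1, 0, 0, 0, 0], vocab, docs, in_category), 3))  # 0.8
# Adding "recipe" to Neg removes the false positive.
print(round(fitness([1, 1, 0, 0, 0, 1], vocab, docs, in_category), 3))  # 1.0
```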
Olex-GA - Evolutionary Operators
• We perform:
– selection via the roulette-wheel method;
– crossover by the uniform crossover scheme;
– mutation, which consists of flipping each single bit with a given (low) probability;
– elitism, in order to ensure that the best individuals of the current generation are passed to the next one without being altered.
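The four operators can be combined into one generation step roughly as shown below. This is a generic textbook sketch, not the authors' implementation: the number of elite individuals (`n_elite`), the way parents are paired, and the toy bit-counting fitness used in the demo are all assumptions.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    if sum(fitnesses) == 0:
        return rng.choice(population)  # all weights zero: fall back to uniform
    return rng.choices(population, weights=fitnesses, k=1)[0]


def uniform_crossover(a, b, rng):
    """Swap each gene between the two parents with probability 0.5."""
    child1, child2 = [], []
    for x, y in zip(a, b):
        if rng.random() < 0.5:
            x, y = y, x
        child1.append(x)
        child2.append(y)
    return child1, child2


def mutate(chromosome, rate, rng):
    """Flip each bit independently with the given probability."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]


def next_generation(population, fitness_fn, rng, mutation_rate=0.001, n_elite=1):
    """Build a new population of the same size as the current one."""
    fitnesses = [fitness_fn(ind) for ind in population]
    ranked = sorted(zip(fitnesses, population), key=lambda p: p[0], reverse=True)
    # Elitism: the best individuals are copied over unchanged.
    new_pop = [list(ind) for _, ind in ranked[:n_elite]]
    while len(new_pop) < len(population):
        p1 = roulette_select(population, fitnesses, rng)
        p2 = roulette_select(population, fitnesses, rng)
        for child in uniform_crossover(p1, p2, rng):
            if len(new_pop) < len(population):
                new_pop.append(mutate(child, mutation_rate, rng))
    return new_pop


def one_max(ind):
    """Toy fitness for the demo only: fraction of bits set to 1."""
    return sum(ind) / len(ind)


rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(10)] for _ in range(20)]
best_before = max(one_max(ind) for ind in population)
for _ in range(30):
    population = next_generation(population, one_max, rng)
best_after = max(one_max(ind) for ind in population)
print(best_before, best_after)
```

Because the elite individuals bypass crossover and mutation, the best fitness in the population cannot decrease from one generation to the next.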
Olex-GA - Experimentation
We have experimentally evaluated our algorithm on two
standard benchmark corpora:
• REUTERS-21578 (R10)
– It consists of 12,902 documents.
– They are manually classified with respect to 135 categories. We have considered the subset of the 10 most populated categories.
• OHSUMED
– We used the collection consisting of the first 20,000 documents from the 50,216 medical abstracts of the year 1991.
– The classification scheme consisted of the 23 MeSH disease categories.
Experimental settings
• We applied the stratified holdout method:
REUTERS:
– ModApté split: 9,603 documents are used to form the training corpus (seen data) and 3,299 to form the test set (unseen data).
OHSUMED:
– The first 10,000 documents were used as seen data and the second 10,000 as unseen data.
In both cases, we randomly split the set of seen data into
– a training set (70%), on which to run the GA,
– and a validation set (30%), on which to tune the model parameters.
Experimental settings
• GA Parameters:
Parameter            Value
Iterations           3
Population Size      500
Num of Generations   200
Cross-over Rate      1.0
Mutation Rate        0.001
Elitism Probability  0.2
• For each chromosome K in the population, we initialized K+ at random, while we set K-[t] = 0 for each t ∈ Neg* (thus, K initially encodes a classifier Hc(Pos, Neg) with no negative terms).
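The initialization described here, a random positive part and an all-zero negative part, could be sketched as below. Drawing each positive bit uniformly at random is an assumption, as are the function name and the gene counts used in the example; only the population size of 500 comes from the parameter table.

```python
import random


def init_population(n_pos_genes, n_neg_genes, pop_size, rng):
    """Create chromosomes with a random K+ part and an all-zero K- part."""
    population = []
    for _ in range(pop_size):
        k_pos = [rng.randint(0, 1) for _ in range(n_pos_genes)]  # random positive terms
        k_neg = [0] * n_neg_genes                                # no negative terms yet
        population.append(k_pos + k_neg)
    return population


rng = random.Random(42)
# Hypothetical sizes: 5 candidate positive terms, 8 candidate negative terms.
population = init_population(n_pos_genes=5, n_neg_genes=8, pop_size=500, rng=rng)

print(len(population), len(population[0]))  # 500 13
```

Under this sketch an individual may start with an all-zero K+; such a rule fires on no document.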
Comparative Evaluation
• On both corpora, we carried out a direct comparison with the following systems:
– SVM (both polynomial and radial basis function)
– Ripper (with two optimization steps)
– C4.5
– Naive Bayes
– Olex-Greedy
• The performances were evaluated using the Weka library of ML algorithms (apart from Olex-Greedy).
Performance Comparison on Reuters
• Efficacy
– SVMpoli > SVMrbf > Ripper ≈ Olex-GA > C45 > Olex-Greedy > NB
• Efficiency
– NB > Olex-Greedy > SVMpoli > Olex-GA > C45 > SVMrbf > Ripper
Performance Comparison on OHSUMED
• Efficacy
– Olex-GA > Ripper > SVMpoli > Olex-Greedy > SVMrbf ≈ NB > C45
• Efficiency
– NB > Olex-Greedy > SVMpoli > Olex-GA > C45 > SVMrbf > Ripper
Discussions – Relation to other inductive rule learners
• Conventional Rule Learners (Ripper, C4.5):
– Usually rely on a two-stage process: rule induction and rule pruning.
– Each of these stages in turn consists of several steps.
• Olex-GA relies on a single-step process which does not need any post-induction optimization.
• With respect to Olex-Greedy, Olex-GA provides better predictive accuracy, but is less efficient.
Conclusions
• Olex-GA encodes a classifier, in a very natural and compact way, as an individual.
• The fitness of an individual is evaluated as the F-measure of the encoded classifier.
• Experimental results point out:
– Olex-GA quickly converges to very accurate classifiers;
– Olex-GA performs at a competitive level with standard algorithms;
– its time efficiency is lower than that of Olex-Greedy but higher than that of the other rule learning methods, such as Ripper and C4.5.
Future work
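• Extension of the proposed technique to deal with classifiers of the form
$$c \leftarrow (T_1 \in d \lor \dots \lor T_n \in d) \land \lnot (T_{n+1} \in d \lor \dots \lor T_{n+m} \in d)$$
where each T_i is a conjunction of “simple” terms:
$$T_i = t_{i_1} \land \dots \land t_{i_k}$$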