ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED SCIENCES
MSc THESIS
Esra SARAÇ
WEB PAGE CLASSIFICATION USING ANT COLONY OPTIMIZATION
DEPARTMENT OF COMPUTER ENGINEERING
ADANA, 2011
ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED SCIENCES
WEB PAGE CLASSIFICATION USING ANT COLONY OPTIMIZATION
Esra SARAÇ
MSc THESIS
DEPARTMENT OF COMPUTER ENGINEERING
We certify that the thesis titled above was reviewed and approved for the award of the degree of Master of Science by the board of jury on 18/01/2011.

Asst. Prof. Dr. Selma Ayşe ÖZEL (SUPERVISOR)
Prof. Dr. Mehmet TÜMAY (MEMBER)
Assoc. Prof. Dr. Zekeriya TÜFEKÇİ (MEMBER)

This MSc thesis was written at the Institute of Natural and Applied Sciences of Çukurova University.

Registration Number:
Prof. Dr. İlhami YEĞİNGİL Director Institute of Natural and Applied Sciences
Note: The usage of the presented specific declarations, tables, figures, and photographs either in this thesis or in any other reference without citation is subject to "The Law of Arts and Intellectual Products", number 5846, of the Turkish Republic.
ABSTRACT
MSc THESIS
WEB PAGE CLASSIFICATION USING ANT COLONY OPTIMIZATION
Esra SARAÇ
ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED SCIENCES
DEPARTMENT OF COMPUTER ENGINEERING
Supervisor: Asst. Prof. Dr. Selma Ayşe ÖZEL
Year: 2011, Pages: 73
Jury: Asst. Prof. Dr. Selma Ayşe ÖZEL
      Prof. Dr. Mehmet TÜMAY
      Assoc. Prof. Dr. Zekeriya TÜFEKÇİ
In this study, Web pages are classified by selecting the best features with an Ant Colony Optimization algorithm and applying the C4.5 classifier. The proposed Ant Colony Optimization based algorithm was tested on the WebKB and Conference datasets. The aim of this study is to reduce the number of features used during the classification process, in order to improve the run-time performance and the efficiency of the classifier. The experimental results show that Ant Colony Optimization is an acceptable optimization algorithm for Web page feature selection.

Key Words: Feature Selection, Ant Colony Optimization, Web Page Classification
ÖZ

MSc THESIS

WEB PAGE CLASSIFICATION USING ANT COLONY OPTIMIZATION

Esra SARAÇ

ÇUKUROVA UNIVERSITY
INSTITUTE OF NATURAL AND APPLIED SCIENCES
DEPARTMENT OF COMPUTER ENGINEERING

Supervisor: Asst. Prof. Dr. Selma Ayşe ÖZEL
Year: 2011, Pages: 73
Jury: Asst. Prof. Dr. Selma Ayşe ÖZEL
      Prof. Dr. Mehmet TÜMAY
      Assoc. Prof. Dr. Zekeriya TÜFEKÇİ

In this study, Web pages were classified using the best features selected by an Ant Colony Optimization algorithm and by applying the C4.5 classifier. The proposed algorithm based on Ant Colony Optimization was tested on the WebKB and Conference datasets. The aim of this study is to reduce the number of features used during the classification process in order to improve the run-time performance and the effectiveness of the classifier.

Key Words: Feature Selection, Ant Colony Optimization, Web Page Classification
ACKNOWLEDGEMENTS
Foremost, I would like to express my sincere gratitude to my advisor Asst.
Prof. Dr. Selma Ayşe ÖZEL for her supervision, guidance, encouragement,
patience, motivation, useful suggestions, and the valuable time she devoted to this work.
I would like to thank the members of the MSc thesis jury, Prof. Dr. Mehmet
TÜMAY and Assoc. Prof. Dr. Zekeriya TÜFEKÇİ, for their suggestions and
corrections.
My sincere thanks also go to Nilgün Özgenç, Neslihan Gündoğdu, and
Çiğdem İnan Acı for their patience, motivation, and useful suggestions.
Special thanks to my right hand, Bengisu Özyeşildağ, for her endless support
and patience.
Last but not least, I would like to thank my family: my parents Ayşe and
Şeref, and my brothers Emre and Kemal Saraç, for their endless support and
encouragement throughout my life and career.
CONTENTS PAGE
ABSTRACT ......................................................................................................... I
ÖZ ....................................................................................................................... II
ACKNOWLEDGEMENTS ............................................................................... III
CONTENTS……………………………………………………………………...IV
LIST OF TABLES............................................................................................. VI
LIST OF FIGURES .............................................................................................X
1. INTRODUCTION ............................................................................................ 1
2. PRELIMINARY WORK ................................................................................. 5
2.1. Preliminary Works in Web Page Classification ……………….………...5
2.2. Preliminary Works in Feature Selection…………………………..……...9
2.3. Preliminary Works in Nature-Inspired Techniques in Web Page
Classification and Feature Selection.…………………………………....13
3. MATERIAL AND METHOD ........................................................................ 17
3.1. Material ............................................................................................... 17
3.1.1. WebKB Dataset…………………………………………………..17
3.1.2. Conference Dataset………………………………………..…...... 18
3.1.3. Ant Colony Optimization ………………………….....……...…..18
3.1.4. Weka Data Mining Tool……………………….……………....…24
3.2. Method..………………..…………………………………………….....27
3.2.1. Construction of Dataset ……………………………………….....29
3.2.2. Feature Extraction …………………………………………….....30
3.2.3. Feature Selection ……………………………………………...…31
4. RESEARCH AND DISCUSSION .............................................................. …37
4.1. Classification Experiments With Only URL Addresses.……………….39
4.2. Classification Experiments With Only <title> Tags……..………...…...41
4.3. Classification Experiments With Bag of Terms Method..………...……43
4.3.1.Classification Experiments With Bag of Terms Method in 5%
Document Frequency Value…………………….………………...44
4.3.2. Classification Experiments With Bag of Terms Method in 10%
Document Frequency Value…………..…………………………..46
4.3.3. Classification Experiments With Bag of Terms Method in 15%
Document Frequency Value…………..…………………………..48
4.4. Classification Experiments With Tagged Terms Method……………....50
4.4.1. Classification Experiments With Tagged Terms Method in 5%
Document Frequency Value..…….……………………………….51
4.4.2. Classification Experiments With Tagged Terms Method in 10%
Document Frequency Value………......…………………………..54
4.4.3. Classification Experiments With Tagged Terms Method in 15%
Document Frequency Value…..………………………………......56
4.5. Comparison With C4.5………………………………………………….60
4.6. Comparison of the Proposed Method With Earlier Studies.……………61
5. CONCLUSION ............................................................................................. 63
REFERENCES……………………………………………………….…………..65
CURRICULUM VITAE….………………………………………….…………...73
LIST OF TABLES PAGE
Table 3.1. Distribution of Each Class .................................................................... 17
Table 3.2. Distribution of Pages With Respect to Universities ............................... 18
Table 3.3. Distribution of Conference Dataset Pages ............................................. 18
Table 3.4. Train/Test Distribution of WebKB Dataset for Binary Class
Classification ........................................................................................ 29
Table 3.5. Train/Test Distribution of the Conference Dataset ................................ 30
Table 3.6. Number of Features for all Classes According to the Selected Tags ...... 31
Table 3.7. Number of Features for all Classes In Bag-of-Term Method with
Respect to Document Frequency Values ............................................... 32
Table 3.8. Number of Features for all Classes in Tagged Term Method with
Respect to Document Frequency Values ............................................... 32
Table 4.1. Classification Results of Student Class With Respect to 500 Epoch
Value .................................................................................................... 37
Table 4.2. Classification Results of Student Class With Respect to 250 Epoch
Value .................................................................................................... 38
Table 4.3. F-measures of NB, RBF and C4.5 Classifiers for the WebKB
Dataset ................................................................................................. 39
Table 4.4. Experimental Results Using URLs With 60 Features of All Classes ..... 40
Table 4.5. Experimental Results Using URL With 30 Features of All Classes ....... 40
Table 4.6. Experimental Results Using URLs With 10 Features of All Classes ..... 41
Table 4.7. Experimental Results Using <title> Tags With 60 Features of All
Classes ................................................................................................. 42
Table 4.8. Experimental Results Using <title> Tags With 30 Features of All
Classes ................................................................................................. 42
Table 4.9. Experimental Results Using <title> Tags With 10 Features of All
Classes ................................................................................................. 43
Table 4.10. Experimental Results Using Bag of Terms Method for Course
Class With 5% Document Frequency.................................................... 44
Table 4.11. Experimental Results Using Bag of Terms Method for Project
Class With 5% Document Frequency.................................................... 45
Table 4.12. Experimental Results Using Bag of Terms Method for Student
Class With 5% Document Frequency.................................................... 45
Table 4.13. Experimental Results Using Bag of Terms Method for Faculty
Class With 5% Document Frequency.................................................... 46
Table 4.14. Experimental Results Using Bag of Terms Method for Conference
Class With 5% Document Frequency.................................................... 46
Table 4.15. Experimental Results Using Bag of Terms Method for Course
Class With 10% Document Frequency .................................................. 47
Table 4.16. Experimental Results Using Bag of Terms Method for Project
Class With 10% Document Frequency .................................................. 47
Table 4.17. Experimental Results Using Bag of Terms Method for Student
Class With 10% Document Frequency .................................................. 47
Table 4.18. Experimental Results Using Bag of Terms Method for Faculty
Class With 10% Document Frequency .................................................. 48
Table 4.19. Experimental Results Using Bag of Terms Method for Conference
Class With 10% Document Frequency .................................................. 48
Table 4.20. Experimental Results Using Bag of Terms Method for Course
Class With 15% Document Frequency .................................................. 48
Table 4.21. Experimental Results Using Bag of Terms Method for Project
Class With 15% Document Frequency .................................................. 49
Table 4.22. Experimental Results Using Bag of Terms Method for Student
Class With 15% Document Frequency .................................................. 49
Table 4.23. Experimental Results Using Bag of Terms Method for Faculty
Class With 15% Document Frequency .................................................. 49
Table 4.24. Experimental Results Using Bag of Terms Method for Conference
Class With 15% Document Frequency .................................................. 50
Table 4.25. Number of Features For Each Tag With 5% Document Frequency
Value for Each Class ............................................................................ 51
Table 4.26. Experimental Results Using Tagged Terms Method for Course
Class With 5% Document Frequency ................................................... 52
Table 4.27. Experimental Results Using Tagged Terms Method for Project
Class With 5% Document Frequency.................................................... 52
Table 4.28. Experimental Results Using Tagged Terms Method for Student
Class With 5% Document Frequency.................................................... 53
Table 4.29. Experimental Results Using Tagged Terms Method for Faculty
Class With 5% Document Frequency.................................................... 53
Table 4.30. Experimental Results Using Tagged Terms Method for Conference
Class With 5% Document Frequency.................................................... 53
Table 4.31. Number of Features For Each Tag With 10% Document Frequency
Value .................................................................................................... 54
Table 4.32. Experimental Results Using Tagged Terms Method for Course
Class With 10% Document Frequency .................................................. 54
Table 4.33. Experimental Results Using Tagged Terms Method for Project
Class With 10% Document Frequency .................................................. 55
Table 4.34. Experimental Results Using Tagged Terms Method for Student
Class With 10% Document Frequency .................................................. 55
Table 4.35. Experimental Results Using Tagged Terms Method for Faculty
Class With 10% Document Frequency .................................................. 55
Table 4.36. Experimental Results Using Tagged Terms Method for Conference
Class With 10% Document Frequency .................................................. 56
Table 4.37. Number of Features For Each Tag With 15% Document Frequency
Value .................................................................................................... 56
Table 4.38. Experimental Results Using Tagged Terms Method for Course
Class With 15% Document Frequency .................................................. 57
Table 4.39. Experimental Results Using Tagged Terms Method for Project
Class With 15% Document Frequency .................................................. 57
Table 4.40. Experimental Results Using Tagged Terms Method for Student
Class With 15% Document Frequency ................................................. 57
Table 4.41. Experimental Results Using Tagged Terms Method for Faculty
Class With 15% Document Frequency .................................................. 58
Table 4.42. Experimental Results Using Tagged Terms Method for Conference
Class With 15% Document Frequency .................................................. 58
Table 4.43. Distribution of Selected Features With Respect to Tags for Project
Classes When 15% Document Frequency is applied ............................. 59
Table 4.44. Distribution of the Selected Features With Respect to Tags for the
Best Cases ............................................................................................ 59
Table 4.45. Comparison of the Proposed ACO Feature Selection Algorithm
with C4.5 .............................................................................................. 60
LIST OF FIGURES PAGE
Figure 2.1. Binary classification ……………………………………………………..6
Figure 2.2. Multiclass classification ......................................................................... 7
Figure 3.1. Figure of Ant Behavior ......................................................................... 19
Figure 3.2. ACO Flow Chart .................................................................................. 24
Figure 3.3. The starting graphical user interface of Weka ....................................... 25
Figure 3.4. Explorer Environment of Weka ............................................................ 26
Figure 3.5. iris.arff File as an Example of arff File ................................................. 27
Figure 3.6. Architecture of The Proposed System ................................................... 28
Figure 3.7. Flow Chart of Proposed ACO Algorithm .............................................. 34
Figure 3.8. An Instance of an arff File .................................................................... 35
Figure 3.9. Pseudo Code of General C4.5 algorithm ............................................... 36
1. INTRODUCTION Esra SARAÇ
1. INTRODUCTION
The rapid growth of Internet use and developments in communication
technologies have caused a rapid increase in the amount of online text
information. As a result, it has become difficult to manage this huge amount of
online information. To solve this problem, many new techniques have been
developed and used by search engines, and several studies have been conducted to
give users more accurate and faster results. One of the most important topics in this
area is text classification. Text categorization, or classification, which is widely used
by search engines, is one of the key techniques for handling and organizing text data.
The aim of text categorization is to classify documents into a certain number
of pre-defined categories by using document features. Text classification plays a
crucial role in many information retrieval and management tasks, such as
information retrieval, information extraction, document filtering, and building
hierarchical Web directories (Qi and Davison, 2009). When text classification
focuses on Web pages, it is called Web classification or Web page classification.
Web pages differ from plain text, however: they contain additional kinds of
information, such as URLs, links, and HTML tags, that plain text documents do not have.
Because of this distinction, Web classification is different from traditional text
classification (Qi and Davison, 2009).
On the Web, page classification is essential to topic-specific Web link
selection, to analysis of the topical structure of the Web, to development of Web
directories, and to focused crawling (Qi and Davison, 2009). Some Web directories,
such as Yahoo! (http://www.yahoo.com/) and the Open Directory Project
(www.dmoz.org), were constructed manually, with documents labeled by hand for
classification. However, manual classification is time consuming and requires a lot
of human effort, which makes it unscalable given the rapid growth of the Web.
Therefore, a great need has arisen for automated Web page classification systems.
In addition to saving time, automated Web page classification produces clearer and
more objective results than manual classification, which suffers from the
subjectivity of human experts.
The first examples of text classification systems were automated text
indexing systems, studied in the 1970s (Salton, 1970). These systems were later
supported with machine learning techniques to improve classification performance.
A major problem of text classification is the high dimensionality of the feature
space. We need to select proper subsets of features from the original feature space
to reduce its dimensionality and to improve the efficiency and performance of the
classifier (Shang et al., 2007). Several approaches have been applied to select
proper features (Yu and Liddy, 1999), including Document Frequency, Information
Gain, Mutual Information, and the χ² test (Han and Kamber, 2006).
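As an illustration of how such filter measures work, the sketch below computes document frequency and information gain for a toy corpus. The documents, labels, and term names are invented for the example; they are not data from this thesis.

```python
import math

def document_frequency(docs, term):
    """Number of documents in which `term` occurs."""
    return sum(1 for doc in docs if term in doc)

def entropy(pos, neg):
    """Binary entropy of a class distribution with `pos`/`neg` counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(docs, labels, term):
    """Expected reduction in class entropy from observing `term`."""
    pos = sum(labels)
    neg = len(labels) - pos
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    gain = entropy(pos, neg)
    for subset in (with_term, without):
        if subset:
            sp = sum(subset)
            gain -= len(subset) / len(docs) * entropy(sp, len(subset) - sp)
    return gain

# Toy corpus: each document is a set of terms; label 1 = "course" page.
docs = [{"syllabus", "homework"}, {"syllabus", "exam"},
        {"publications", "cv"}, {"research", "cv"}]
labels = [1, 1, 0, 0]

print(document_frequency(docs, "syllabus"))        # 2
print(information_gain(docs, labels, "syllabus"))  # 1.0 (perfect separator)
print(information_gain(docs, labels, "exam"))
```

A filter method of this kind simply ranks every candidate term by its score and keeps the top-scoring ones.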
Feature extraction methods have become an important issue for classification
performance, and many studies have been carried out in this area. In the past,
researchers focused on classifying text files, but with the growth of online
documents, interest in Web categorization has grown rapidly in recent years
(Qi and Davison, 2009). Classification of Web page content is essential to many
tasks in Web information retrieval, such as maintaining Web directories, focused
crawling, and question-answering systems (Qi and Davison, 2009). Since a Web
page has more features than a text document, feature selection is even more
important for Web page classification than for typical text classification.
Applying machine learning is one of the most popular approaches for
selecting the best features. Machine learning algorithms fall into two main
categories: supervised learning and unsupervised learning (Mitchell, 1997). In
supervised learning, a global model that maps input objects to desired outputs is
generated (Mitchell, 1997). Support vector machines, k-Nearest Neighbors, and
Naïve Bayes are examples of supervised learning algorithms. The task of a
supervised learner is to predict the value of the function for any valid input object
after having seen a number of training examples (Mitchell, 1997); to achieve this,
the learner has to generalize from the presented data to unseen situations in a
reasonable way. In unsupervised learning, on the other hand, all the observations are
assumed to be caused by latent variables, that is, the observations are assumed to be
at the end of the causal chain (Mitchell, 1997). Self-Organizing Maps are
commonly used unsupervised learning algorithms (Haykin, 1999). Supervised
learning requires that the target variable is well defined and that a sufficient number
of its values are given. In unsupervised learning, typically, either the target variable
is unknown or it has been recorded for too few cases (Mitchell, 1997).
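A minimal sketch of a supervised learner of the kind mentioned above, k-Nearest Neighbors, illustrates the idea of generalizing from labeled training examples. The 2-D points and class names below are invented for the example.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """k-Nearest Neighbors: label a query point by majority vote
    among the k closest labeled training examples."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D training set with two classes.
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_predict(train, (1, 1)))  # A
print(knn_predict(train, (5, 4)))  # B
```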
To date, several methods have been developed for feature extraction and
selection problems. However, most of these methods were proposed for text
classification; for Web pages, which have many more features, there are only a few
studies on feature selection. Given the structural properties of Web pages, such as
HTML tags and headers, an Ant Colony Optimization (ACO) algorithm is a
promising alternative method for selecting features. The purpose of this study is to
find the best features in HTML pages by using an optimization algorithm, in order
to improve the performance of existing Web page classification systems. To choose
the best features, the Ant Colony Optimization technique, which was developed to
solve optimization problems, will be used.
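To make the idea concrete, here is a minimal, simplified sketch of ACO-style feature selection; it is not the exact algorithm developed in this thesis. The `evaluate` function, the pheromone-update rule, the parameter values, and the toy fitness are all illustrative assumptions (in the thesis, the quality of a subset is judged by the C4.5 classifier's performance).

```python
import random

def aco_feature_selection(n_features, evaluate, n_ants=10, n_iters=20,
                          subset_size=3, rho=0.1, seed=42):
    """Pheromone-guided search for a good feature subset.

    `evaluate(subset) -> float` is a stand-in for classifier quality.
    """
    rng = random.Random(seed)
    pheromone = [1.0] * n_features
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            # Each ant picks features with probability proportional to pheromone.
            subset = set()
            while len(subset) < subset_size:
                total = sum(pheromone)
                r = rng.uniform(0, total)
                acc = 0.0
                for f, tau in enumerate(pheromone):
                    acc += tau
                    if r <= acc:
                        subset.add(f)
                        break
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = set(subset), score
        # Evaporate, then reinforce the features of the best subset so far.
        pheromone = [(1 - rho) * tau for tau in pheromone]
        for f in best_subset:
            pheromone[f] += rho * best_score
    return best_subset, best_score

# Hypothetical fitness: features 0, 3, 7 are the truly informative ones.
target = {0, 3, 7}
subset, quality = aco_feature_selection(10, lambda s: len(s & target))
print(subset, quality)
```

Over the iterations, pheromone accumulates on features that appear in high-scoring subsets, so later ants are biased toward them.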
2.PRELIMINARY WORK Esra SARAÇ
2. PRELIMINARY WORK
2.1. Preliminary Works in Web Page Classification
Classification is the process of assigning appropriate predefined class labels
to the available data. For this purpose, a set of labeled data is used to train a
classifier, which is then used for labeling unseen data. This classification process is
also defined as supervised learning (Mitchell, 1997).
The process is not different in Web page classification: there are one or more
predefined class labels, and a classification model assigns Web pages to them.
Web pages, which are in fact hypertext, have many features, such as textual tokens,
markup tags, URLs, and host names in URLs, and all of these features can be
meaningful for classifiers. Therefore, Web page classification has several
differences from traditional text classification.
Web page classification has subfields such as subject classification and
functional classification (Qi and Davison, 2009). In subject classification, the
classifier is concerned with the content of a Web page and tries to determine its
subject; for example, the categories of online newspapers, like finance, sport, and
technology, are instances of subject classification. Functional classification is
concerned with the function or type of the Web page; for example, determining
whether a page is a "personal homepage" or a "course page" is an instance of
functional classification. These two are the most popular classification types
(Qi and Davison, 2009).
Classification can also be divided into binary classification and multiclass
classification according to the number of classes (Qi and Davison, 2009). In binary
classification there is only one class label: the classifier examines an instance and
decides whether or not it belongs to that class. Instances of the class are called
relevant, and the others are called non-relevant. The binary classification process is
presented in Figure 2.1.
Figure 2.1 Binary classification (Qi and Davison, 2009)
If there is more than one class, the task is called multiclass classification
(Qi and Davison, 2009). The classifier assigns an instance to one of the multiple
classes, as shown in Figure 2.2.
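One common way to relate the two settings is to decompose a multiclass problem into several binary (relevant/non-relevant) decisions, one per class, and pick the class whose binary scorer is most confident. The sketch below illustrates this with hypothetical keyword-count scorers; the class names and keywords are invented for the example.

```python
def one_vs_rest_predict(scorers, instance):
    """Decompose multiclass classification into binary decisions:
    one relevant/non-relevant scorer per class; pick the highest score."""
    return max(scorers, key=lambda label: scorers[label](instance))

# Hypothetical keyword-count scorers for three page classes.
scorers = {
    "course":  lambda page: sum(w in page for w in ("syllabus", "homework")),
    "faculty": lambda page: sum(w in page for w in ("professor", "publications")),
    "student": lambda page: sum(w in page for w in ("resume", "hobbies")),
}

page = {"syllabus", "homework", "professor"}
print(one_vs_rest_predict(scorers, page))  # course
```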
In our problem, we study Web content classification, which is a form of
subject classification, and we focus on binary classification. Binary classification
is the basis of the focused crawler (Chakrabarti et al., 1999), also called the topical
crawler (Menczer and Belew, 1998). The aim of a focused crawler is to increase
search engine performance by crawling and indexing only Web pages about a
specific topic. To achieve this goal, a focused crawler needs to determine whether
or not a Web page belongs to the specific class.
Figure 2.2 Multiclass classification (Qi and Davison, 2009)
Web content classification has many benefits for information retrieval tasks,
whose performance can be improved by classification techniques. For example,
Web directories are platforms that offer a predefined set of categories for browsing
information; the most popular examples are Yahoo! (http://www.yahoo.com) and the
Open Directory Project (http://www.dmoz.org). Constructing, maintaining, or
expanding Web directories manually requires extensive human effort. Huang et al.
(2004) proposed an approach, named LiveClassifier, for the automatic creation of
classifiers from Web corpora based on user-defined hierarchies. They presented a
system that can automatically extract corpora from the Web to train classifiers,
using the Vector Space Model to describe the features and the tf*idf weighting to
measure the similarity between features and classes. Based on these two basic
methods, they developed their own method, LiveClassifier. According to this study,
the main merits of LiveClassifier are its wide adaptability and its flexibility: a
classifier can be created simply by defining a topic hierarchy, and the necessary
corpora can be fetched and organized automatically, promptly, and effectively. By
making classification automatic, the construction, maintenance, and expansion of
Web directories become more effective, and automatic classification techniques
improve the performance of Web directories (Huang et al., 2004).
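A minimal sketch of the tf*idf weighting mentioned above, using one common idf variant (the exact formula used by Huang et al. may differ). The toy corpus is invented for the example.

```python
import math

def tf_idf(term, doc, corpus):
    """tf * idf: term frequency in `doc`, scaled down for terms
    that appear in many documents of `corpus`."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["ant", "colony", "optimization", "ant"],
    ["web", "page", "classification"],
    ["feature", "selection", "web"],
]
print(tf_idf("ant", corpus[0], corpus))   # 2 * ln(3/1): frequent and rare
print(tf_idf("web", corpus[1], corpus))   # 1 * ln(3/2): common across docs
```

The weight rewards terms that are frequent in one document but rare across the corpus, which is what makes them discriminative features.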
Improving the quality of search results is another benefit of Web content
classification. Query terms are important for search results: if the user cannot select
the right term, the returned results will be meaningless, and if the selected term has
multiple meanings, some results will be irrelevant to the intended query. For
example, the query term "bank" could mean "the border of a body of water" or
"financial establishment" (Qi and Davison, 2009). Various approaches have been
proposed to improve retrieval quality by disambiguating query terms. Chekuri et al.
(1997) studied automatic Web page classification to increase the precision of
Web search. In their study, at query time the user is asked to specify one or more
desired categories so that only the results in those categories are returned, or the
search engine returns a list of categories under which the pages would fall. This
approach works when the user is looking for a known item, in which case it is not
difficult to specify the preferred categories. However, there are situations in which
the user is less certain about which documents will match; in such cases the above
approach does not help much.
As a solution to the ranking problem, Page and Brin (1997) developed the
link-based ranking algorithm called PageRank. PageRank was developed at
Stanford University as part of a research project about a new kind of search engine
(Page and Brin, 1997). The idea was that information on the Web could be ordered
in a hierarchy by "link popularity": a page is ranked higher the more links point to
it. PageRank calculates the authoritativeness of Web pages based on a graph
constructed from Web pages and their hyperlinks, without considering the topic of
each page. Since then, much research has explored how to differentiate authorities
of different topics. Haveliwala et al. (2003) proposed Topic-Sensitive PageRank,
which performs multiple PageRank calculations, one for each topic. When
computing the PageRank score for each category, the random surfer jumps at
random to a page in that category rather than to just any Web page. This has the
effect of biasing the PageRank toward that topic. This approach needs a set of
pages that are accurately classified.
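The basic PageRank computation can be sketched as a power iteration over the link graph. The damping factor and dangling-page handling below follow the commonly described formulation, not necessarily the exact original one, and the three-page graph is invented for the example.

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank on a dict {page: [outgoing links]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += damping * share
            else:  # dangling page: spread its rank over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Toy graph: everyone links to C, so C should rank highest.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C
```

Note that the topic of each page never enters the computation, which is exactly the limitation Topic-Sensitive PageRank addresses by biasing the random jump.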
If only domain-specific queries are expected, performing a full crawl is
usually inefficient. Focused crawling, proposed by Chakrabarti et al. (1999), is an
approach that crawls documents in only a small part of the Web. A focused crawler
analyzes its crawl boundary to find the links that are likely to be the most relevant
for the crawl, and avoids irrelevant regions of the Web. It has a predefined set of
topics, and these topics define the crawling area. In this approach, a classifier is
used to evaluate the relevance of a Web page to the given topics, providing
evidence for the crawl boundary. The proposed algorithm consists of two parts: a
classifier that evaluates the relevance of a hypertext document with respect to the
focus topics, and a distiller that identifies hypertext nodes that are good access
points to many relevant pages within a few links (Chakrabarti et al., 1999).
Good-quality document summarization can accurately represent the major
topic of a Web page. Shen et al. (2004) proposed an approach to classify Web pages
through summarization, and showed that classifying Web pages based on their
summaries improves accuracy by around 10% compared with content-based
classifiers.
2.2. Preliminary Works in Feature Selection
Feature selection is one of the most important steps in classification
systems. Web pages are generally in HTML format, which means they are not fully
structured; they are semi-structured, since they contain HTML tags and hyperlinks
in addition to pure text. Because of this, feature selection for Web page
classification differs from that for traditional text classification. Feature selection is
commonly used to reduce the dimensionality of datasets with tens or hundreds of
thousands of features, which would otherwise be impossible to process further. A
major problem of Web page classification is the high dimensionality of the feature
space. The best feature subset contains the fewest features that contribute most to
accuracy and efficiency.
To improve the performance of Web page classification, several approaches
originally developed for feature selection in text classification have been applied to
the problem of feature selection for Web pages. In addition to traditional feature
selection methods, machine learning techniques are also popular for feature
selection.
Many feature scoring measures have been proposed. Information Gain (Mitchell, 1997), Mutual Information (Shannon, 1948), Document Frequency (Yang and Pedersen, 1997), and Term Strength (Chakrabarti, 2002) are the most popular traditional techniques. Information gain (IG) measures the amount of information, in bits, about the class prediction when the only information available is the presence of a feature and the corresponding class distribution. Concretely, it measures the expected reduction in entropy (Mitchell, 1997). Mutual information (MI) was first introduced by Shannon (1948) in the context of digital communications between discrete random variables and was later generalized to continuous random variables. Mutual information is considered an acceptable measure of relevance between two random variables (Cover and Thomas, 1991). The Mutual Information method is a probabilistic method which measures how much information the presence or absence of a term contributes to making the correct classification decision on a class (Guiasu, 1977). Document frequency
(DF) is the number of documents in which a term occurs in a dataset. It is the
simplest criterion for term selection and easily scales to a large dataset with linear
computation complexity. It is a simple but effective feature selection method for text
categorization (Yang and Pedersen, 1997). Term strength (TS) was proposed and
evaluated by Wilbur and Sirotkin (1992) for vocabulary reduction in text retrieval.
Term strength is also used in text categorization (Yang, 1995: Yang and Wilbur,
1996). This method predicts term importance based on how commonly a term is likely to appear in "closely-related" documents. TS uses a training set of documents to derive document pairs whose similarity, measured using the cosine value of the two document vectors, is above a threshold. Then, term strength is computed based on the predicted conditional probability that a term occurs in the second half of a pair of related documents, given that it occurs in the first half. The above methods,
namely the IG, the DF, the MI, and the TS, were compared by Yang and Pedersen (1997), who used a kNN classifier on the Reuters corpus (http://archive.ics.uci.edu/ml/databases/reuters). According to this study, IG is the most effective method, allowing 98% feature reduction, while DF is the simplest method with the lowest computational cost, and it can credibly be used instead of IG when computing the other measures is too expensive.
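To make the IG computation concrete, the following sketch (illustrative Python only, not the implementation of any cited study) scores a term by the expected entropy reduction it yields over a small labeled toy collection:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Expected entropy reduction from knowing whether `term` is present.

    `docs` is a list of token sets; `labels` holds the parallel class labels.
    """
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part)
                      for part in (with_t, without_t) if part)
    return entropy(labels) - conditional

# Toy collection: "ant" perfectly separates the two classes.
docs = [{"ant", "colony"}, {"ant", "web"}, {"page", "web"}, {"page", "class"}]
labels = ["pos", "pos", "neg", "neg"]
print(information_gain(docs, labels, "ant"))  # 1.0 bit
```

A term present in exactly the positive documents removes all class uncertainty, so its IG equals the full entropy of the collection; a term spread evenly across classes (such as "web" above) scores 0.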
Kwon and Lee (2000, 2003) proposed classifying Web pages using a modified k-Nearest Neighbor algorithm, in which terms within different tags are given different weights. In the k-Nearest Neighbor algorithm, a constant k value is selected. They divided all the HTML tags into three groups and assigned each group
a random weight. Thus, utilizing tags can take advantage of the structural
information embedded in the HTML files, which is generally ignored by plain text
approaches. However, since most HTML tags are oriented toward presentation rather
than semantics, Web page authors may generate different but conceptually equivalent
tag structures. Therefore, using HTML tagging information in Web classification
may suffer from the inconsistent formation of HTML documents. In Kwon and Lee's (2000) modified k-nearest neighbor algorithm, features are selected using two well-known metrics: expected mutual information and mutual information. They also weighted terms according to the HTML tags in which the terms appear, so that terms within different tags bear different importance. k-Nearest Neighbor (kNN) classifiers require a document dissimilarity measure to quantify the distance between a test document and each training document. Kwon and Lee replaced the traditional cosine measure by their own similarity measure, which is based on the number of matching terms between two documents, called the "matching factor". The proposed
similarity measure was modified from the traditional similarity measure to use the
matching factor. The intuition behind their similarity measure is that frequently co-
occurring terms constrain the semantic concept of each other. The more co-occurred
terms any two documents have in common, the stronger the relationship between the
two documents. According to the experimental results, with only the cosine similarity measure, the micro-averaging breakeven point was reported as 18.23%, and with only the inner product method, the micro-averaging breakeven point was reported as 19.74%. If
the matching factor was used together with these two similarity methods, the results were reported as 19.23% and 20.02%, respectively.
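The idea can be illustrated with the following Python sketch, which contrasts the traditional cosine measure with a simple matching factor (the count of shared terms); the combined measure shown is only an illustrative weighting, not Kwon and Lee's exact formula:

```python
import math

def cosine(u, v):
    """Traditional cosine similarity between two term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def matching_factor(u, v):
    """Number of terms the two documents share."""
    return len(set(u) & set(v))

def combined_similarity(u, v):
    """Cosine scaled by the matching factor normalised to the smaller
    vocabulary -- an illustrative combination only; the weighting in
    Kwon and Lee (2000) differs in detail."""
    smaller = min(len(u), len(v))
    return cosine(u, v) * matching_factor(u, v) / smaller if smaller else 0.0

a = {"web": 1.0, "page": 2.0, "ant": 1.0}
b = {"web": 1.0, "page": 1.0, "colony": 3.0}
print(matching_factor(a, b))  # 2 shared terms: "web" and "page"
```

Scaling by the matching factor rewards document pairs that share many terms, reflecting the intuition that co-occurring terms constrain each other's semantic concept.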
Chakrabarti et al. (1998) proposed a term-based classifier which uses a score-based function for feature selection. The proposed algorithm provides new
techniques for automatically classifying hypertext into a given topic hierarchy, using
information latent in hyperlinks. There is much information in the hyperlink
neighborhood of a document. The iterative relaxation technique bootstraps off a text-
based classifier and then uses both local text in a document, as well as the
distribution of the estimated classes of other documents in its neighborhood, to refine
the class distribution of the document being classified. They applied this approach to the Yahoo! directory, converting documents into bag-of-words format. According to this study, the proposed algorithm is able to improve accuracy from 32% to 75%. Using even a small neighborhood around the test document significantly boosts classification accuracy, reducing error by up to 62% compared with text-based classifiers.
Rather than deriving information from the page content, Kan and Thi (2005) demonstrated that a Web page can be classified based on its URL. They were inspired by Kan's (2004) study. Although the accuracy is not high, this approach eliminates the
necessity of downloading the page. Therefore, it is especially useful when the page
content is not available or time/space efficiency is strictly emphasized. Performance
of the proposed study was measured both by accuracy and macro F1. According to the
experimental results, accuracy value was reported as 76.18% and macro F1 value was
reported as 0.525. In addition to these results, their URL-only method achieved about
95% of the performance of the full-text methods. A similar URL-based study was proposed by Baykan et al. (2009) for the binary classification of Web pages. In their study, features are extracted from the URL addresses of Web pages. They have
used a support vector machine (SVM) and an n-gram technique to classify Open
Directory Project (ODP) (www.dmoz.org) Web pages. According to the
experimental results, F-measure values are between 80% and 85%.
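The kind of URL features such studies rely on can be sketched as follows (an illustrative Python fragment; the exact tokenization and n-gram lengths used by Kan and Thi or Baykan et al. may differ):

```python
import re

def url_features(url, n_values=(4, 5)):
    """Tokens and character n-grams extracted from a URL -- an illustrative
    feature representation for URL-only classification."""
    # Strip the scheme and split on non-alphanumeric separators.
    text = re.sub(r"^https?://", "", url.lower())
    tokens = [t for t in re.split(r"[^a-z0-9]+", text) if t]
    grams = set(tokens)  # keep whole tokens as features too
    for tok in tokens:
        for n in n_values:
            grams.update(tok[i:i + n] for i in range(len(tok) - n + 1))
    return grams

feats = url_features("http://www.cs.cmu.edu/~webkb/")
print(sorted(feats))
```

Each extracted token or n-gram becomes one binary feature for the classifier, so a page never needs to be downloaded before being classified.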
Most supervised learning approaches only learn from training examples. Co-
training, which is introduced by Blum and Mitchell (1998), is an approach that
makes use of both labeled and unlabeled data to achieve better accuracy. In a binary
classification scenario, two classifiers that are trained on different sets of features are
used to classify the unlabeled instances. The prediction of each classifier is used to
train the other. Compared with the approach which only uses the labeled data, this
co-training approach is able to cut the error rate in half. Ghani (2001: 2002) adapted this approach to multi-class problems. The results showed that co-training does not improve accuracy when there is a large number of categories. Classification usually
requires manually labeled positive and negative examples. Yu et al. (2004) devised
an SVM-based approach to eliminate the need for manual collection of negative
examples while still retaining similar classification accuracy. Given positive data and
unlabeled data, their algorithm is able to identify the most important positive
features. Using these positive features, it filters out possible positive examples from
the unlabeled data, which leaves only negative examples. An SVM classifier could
then be trained on the labeled positive examples and the filtered negative examples.
2.3. Preliminary Works on Nature-Inspired Techniques in Web Page Classification and Feature Selection
Nature inspired techniques including genetic algorithm (GA), ant colony
optimization (ACO) and particle swarm optimization (PSO) algorithms have also
been proposed for text and Web classification problems.
Gordon (1988) has used GAs to find the best document description for each user-specified document according to the queries used and the relevance judgments made during the retrieval process. This is one of the earliest studies on the application of GAs to the information retrieval domain. Chen and Kim (1995) have proposed a hybrid GA
and neural network based system, called GANNET. They used a GA to choose the
best keywords which describe user-selected documents, and with a neural network,
weights for the keywords are determined. Additionally, Boughanem et al. (1999)
have applied a GA-based technique to optimize document descriptions and to
improve query formulations. Ribeiro et al. (2003) proposed a Web page classifier
which is based on rule extraction. They labeled Web pages as ‘‘Not Relevance”,
‘‘Medium Relevance”, and ‘‘Extreme Relevance” with fuzzy membership function.
They used both Naive Bayes and a GA for classification. In their study, the fuzzy membership function performed better with a Naive Bayes classifier than with a GA-based classifier. Liu and Huang (2003) have proposed a semi-supervised fuzzy
clustering algorithm based on a GA. Both labeled and unlabeled documents are taken
together to derive a classifier. Each document is represented as a tf-idf weighted word frequency vector, and stemming and stopword removal are not used. HTML tags are
also not considered. Liu and Huang (2003) compared their classifier with Naive
Bayes, and observed a gain in classification accuracy. Özel (2010) has proposed a Web page classification method which is based on a GA. In the proposed approach, both HTML tags and terms are used as features, and the optimal weights of the features are learned with a GA. The proposed method was compared with the Naive Bayes and the kNN algorithms, and the accuracy of the proposed approach is higher than that of the compared algorithms.
AntMiner (Parpinelli et al., 2002) is the first method that uses the ACO in the classification domain. Holden and Freitas (2004) were inspired by AntMiner and make use of the Ant Colony paradigm to find a set of rules that classify Web pages into several categories. They made no prior assumptions about which words in the Web pages to be classified were to be used as potential discriminators. To reduce data sparsity, they used stemming, a technique by which different grammatical forms of a root word, such as help, helping, and helped, can be considered equivalent. Holden and Freitas (2004) also grouped words together if they
were closely related in the WordNet electronic thesaurus. Holden and Freitas (2004)
compared their Ant_Miner with the rule inference algorithms C4.5 and CN2. They
found that Ant_Miner was comparable in accuracy, and formed simpler rules. The
best result of Holden and Freitas's study is 81.0% accuracy when using WordNet generalization with Title features. Aghdam et al. (2009) have proposed an ACO-based feature selection algorithm for text classification. The performance of the proposed algorithm was compared with that of a genetic algorithm, information gain, and CHI on the task of feature selection on the Reuters-21578 dataset (http://archive.ics.uci.edu/ml/databases/reuters21578). Their simulation results on the Reuters-21578 dataset showed the superiority of the proposed algorithm.
Another nature-inspired algorithm which has been used in the Web page classification problem is the PSO. Wang et al. (2007) have used the PSO as a classifier in the Web page classification task. They have used entropy weighting for feature selection and the PSO for classification on the Reuters (http://archive.ics.uci.edu/ml/databases/reuters21578) and the TREC (http://trec.nist.gov/data.html) datasets. The proposed algorithm yields much better performance than other conventional algorithms. Liangtu and Xiaoming (2007) have used the PSO for the Web text feature extraction problem. They have used the Vector Space Model (VSM) as the description of Web text. Their algorithm is based on the PSO with reverse-thinking particles, and the structure of the particles is also improved. According to their experimental results, the accuracy of their study varies between 88.9% and 96.1% depending upon the classified text size.
3. MATERIAL AND METHOD
3.1. Material
This section describes the datasets used, namely the WebKB and the Conference datasets; the Weka classification environment; and the ACO method employed in the proposed algorithm.
3.1.1. WebKB Dataset
The WebKB dataset is a set of Web pages collected by the World Wide Knowledge Base (Web->Kb) project of the CMU (http://www.cs.cmu.edu) text learning group, and was downloaded from The 4 Universities Dataset Homepage (http://www.cs.cmu.edu/~webkb/). These pages were collected from the computer
science departments of various universities in 1997, and manually classified into
seven different classes namely: student, faculty, staff, department, course, project,
and other. For each class, the collection contains pages from four universities
which are Cornell, Texas, Washington, Wisconsin universities, and other
miscellaneous pages collected from other universities.
The 8,282 pages were manually classified into the seven categories such
that the student category has 1641 pages, faculty has 1124, staff has 137,
department has 182, course has 930, project has 504, and other contains 3764
pages (Table 3.1). The class other is a collection of pages that were not deemed "main pages" and do not represent an instance of the previous six classes.
Table 3.1. Distribution of Each Class

Class       Student  Faculty  Staff  Department  Course  Project  Other
# of Pages  1641     1124     137    182         930     504      3764
The WebKB dataset includes 867 Web pages from Cornell University, 827
pages from Texas University, 1205 pages from Washington University, 1263
pages from Wisconsin University and finally 4,120 miscellaneous pages from
other universities (Table 3.2.).
Table 3.2. Distribution of Pages With Respect to Universities

University  Cornell  Texas  Washington  Wisconsin  Other
# of Pages  867      827    1205        1263       4,120
3.1.2. Conference Dataset
The Conference dataset consists of the Computer Science related
conference homepages that were obtained from the DBLP web site
(http://www.informatik.uni-trier.de/~ley/db/). The conference Web pages were
labeled as positive documents in the dataset. To complete the dataset, the short names of the conferences were queried manually using the Google search engine (http://www.google.com), and the irrelevant pages in the result set were taken as negative documents. The dataset consists of 824 relevant pages and 1545 irrelevant pages, approximately twice the number of relevant pages. The distribution of the Conference dataset pages is shown in Table 3.3.
Table 3.3. Distribution of Conference Dataset Pages

            Relevant Pages  Non-relevant Pages
Conference  824             1545
3.1.3. Ant Colony Optimization
Ant Colony Optimization (ACO) studies artificial systems that take inspiration from the behavior of real ant colonies, and it is used to solve discrete optimization problems. The Ant Colony Optimization metaheuristic was defined by Dorigo et al. in 1999 (Dorigo et al., 1999). The first ACO system was introduced by Marco Dorigo in his Ph.D. thesis (1992), and was called the Ant System (AS). The AS is the result of research on computational intelligence
approaches to combinatorial optimization (Dorigo et al., 1992). The AS was
initially applied to the travelling salesman problem, and to the quadratic
assignment problem.
The original AS was motivated by the natural phenomenon that ants deposit
pheromone on the ground in order to mark some favorable path that should be
followed by other members of the colony.
Natural behaviors of ants are shown in Figure 3.1. The aim of the colony is
to find the shortest path between a food source and the nest. The behaviors of ants
can be listed as follows:
1. The first ant finds the food source (Food) via any path (a), then returns to the nest (Nest), leaving behind a pheromone trail (b).
2. The ants then follow the possible paths indiscriminately, but the strengthening of the trail makes the shortest path more attractive.
3. Ants take the shortest path, and long portions of the other paths lose their pheromone trails.
Figure 3.1. Figure of Ant Behavior
(http://en.wikipedia.org/wiki/Ant_colony_optimization)
Ants use the environment as a medium of communication. They exchange information indirectly by depositing pheromones, which detail the status of their work. The information exchanged has a local scope: only an ant located where the
pheromones were left has a notion of them. The mechanism is a good example of
a self-organized system. This system is based on positive feedback (the deposit of
pheromone attracts other ants that will strengthen it themselves) and negative
feedback (dissipation of the route by evaporation prevents the system from
thrashing). Theoretically, if the quantity of pheromone remained the same over
time on all edges, no route would be chosen. However, because of feedback, a
slight variation on an edge will be amplified and thus allow the choice of an edge.
The algorithm will move from an unstable state in which no edge is stronger than
another, to a stable state where the route is composed of the strongest edges.
The basic philosophy of the algorithm involves the movement of a colony of
ants through the different states of the problem influenced by two local decision
policies, viz., trails and attractiveness. Thereby, each such ant incrementally
constructs a solution to the problem. When an ant completes a solution, or during
the construction phase, the ant evaluates the solution and modifies the trail value
on the components used in its solution. This pheromone information will direct
the search of the future ants. Furthermore, the algorithm also includes two more
mechanisms, trail evaporation and daemon actions. Trail evaporation reduces all
trail values over time thereby avoiding any possibilities of getting stuck in local
optima. The daemon actions are used to bias the search process from a non-local
perspective.
ACO algorithms can be applied to any optimization problem, for which the
following problem-dependent aspects can be defined (Bonabeau et al., 1999:
Dorigo et al., 1996):
• An appropriate graph representation to represent the discrete search space.
A graph should accurately represent all states and transitions between
states. A solution representation scheme also has to be defined.
• A positive feedback process; that is, a mechanism to update pheromone concentrations such that current successes positively influence future solution construction.
• A constraint-satisfaction method to ensure that only feasible solutions are
constructed.
• A solution construction method which defines the way in which solutions
are built and a state transition probability.
The first ACO algorithm was applied to the travelling salesman problem
(TSP). This is the problem of finding the shortest tour visiting all the nodes of a
fully-connected graph, the nodes of which represent locations, and the arcs
represent a path with an associated cost (normally assumed to be distance). This
problem has a clear analogy with the shortest path finding ability of real ants, and
is also a widely studied NP-hard combinatorial optimization problem. Dorigo
applied an ACO to the TSP with his ‘Ant System’ (AS) approach (Dorigo, 1992).
In AS, each (artificial) ant is placed on a randomly chosen city and has a memory which stores information about its route so far (a partial solution); initially this contains only the starting city. Setting off from its starting city, an ant builds a complete
tour by probabilistically selecting cities to move next until all cities have been
visited. While at city i, an ant k picks an unvisited city j with a probability given
by equation 3.1.
p_k(i,j) = \frac{[\tau(i,j)]^{\alpha} \, [\eta(i,j)]^{\beta}}{\sum_{u \in N_k} [\tau(i,u)]^{\alpha} \, [\eta(i,u)]^{\beta}}    (3.1)
In equation 3.1, η(i,j) = 1/d(i,j), where d(i,j) is the distance between cities i and j, and represents the heuristic information available to the ants. N_k is the 'feasible' neighborhood of ant k, that is, all cities as yet unvisited by ant k. τ(i,j) is the pheromone trail value between cities i and j. α and β are parameters which determine the relative influence of the heuristic and pheromone information: if α is 0, the ants will effectively be performing a stochastic greedy search using the 'nearest-neighbor' heuristic; if β is 0, then the ants use only pheromone
information to build their tours. After all the ants have built a complete tour the
pheromone trail is updated according to the global update rule defined in equation
3.2.
\tau(i,j) = \rho \cdot \tau(i,j) + \sum_{k=1}^{m} \Delta\tau_k(i,j)    (3.2)
where ρ denotes a pheromone evaporation parameter which decays the pheromone
trail (and thus implements a means of ‘forgetting’ solutions which are not
reinforced often), and m is the number of ants. The specific amount of pheromone,
∆τk(i,j), that each ant k deposits on the trail is given by equation 3.3.
\Delta\tau_k(i,j) = \begin{cases} 1/L_k & \text{if ant } k \text{ used edge } (i,j) \text{ in its tour} \\ 0 & \text{otherwise} \end{cases}    (3.3)
In equation 3.3, L_k is the length of ant k's tour. This means that the shorter
the ant’s tour, the more pheromone will be deposited on the arcs used in the tour,
and these arcs will thus be more likely to be selected in the next iteration.
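The selection and update rules of equations 3.1, 3.2, and 3.3 can be sketched as follows (an illustrative Python fragment; parameter values are arbitrary, and a full AS implementation would loop this over many ants and iterations):

```python
import random

def select_next_city(i, unvisited, tau, eta, alpha=1.0, beta=2.0, rng=random):
    """Equation 3.1: choose city j with probability proportional to
    tau(i,j)^alpha * eta(i,j)^beta over the feasible neighborhood."""
    weights = [tau[i][j] ** alpha * eta[i][j] ** beta for j in unvisited]
    r = rng.random() * sum(weights)
    acc = 0.0
    for j, w in zip(unvisited, weights):
        acc += w
        if acc >= r:
            return j
    return unvisited[-1]

def update_pheromone(tau, tours, lengths, rho=0.5):
    """Equations 3.2 and 3.3: decay every trail by rho, then let each
    ant k deposit 1/L_k on the edges of its tour."""
    n = len(tau)
    for i in range(n):
        for j in range(n):
            tau[i][j] *= rho
    for tour, length in zip(tours, lengths):
        for i, j in zip(tour, tour[1:] + tour[:1]):  # close the tour
            tau[i][j] += 1.0 / length
            tau[j][i] += 1.0 / length

# Toy 3-city instance: eta(i,j) = 1/d(i,j).
d = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
eta = [[0.0 if i == j else 1.0 / d[i][j] for j in range(3)] for i in range(3)]
tau = [[1.0] * 3 for _ in range(3)]
nxt = select_next_city(0, [1, 2], tau, eta)
update_pheromone(tau, [[0, 1, 2]], [4.0])
```

Note that, following the text above, equation 3.2 is applied with ρ as a direct decay factor on the existing trail before the new deposits are added.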
The algorithm iterates through each of these stages until the termination
criteria are met. The results from the AS were encouraging, but it did not improve on state-of-the-art approaches. This original system has since been adapted in various ways, including an elitist strategy which allows only the iteration-best or global-best ant to leave pheromone (known as Max-Min AS (Stützle and Hoos, 2000)), and allowing the ants to leave pheromone as they build the solution (Dorigo and Stützle, 2002). As the AS and its adaptations of the ACO approach have been modified and applied to many different problems, the implementation details have moved some way from the original biological analogy. Dorigo and
Stützle (2002) provide a useful distillation of the ideas of various approaches to
implementing ACO algorithms, and describe the main features that any ACO
approach must define for a new problem. The five main requirements are
identified below (Dorigo and Stützle, 2002):
• A heuristic function η(·), which will guide the ants' search with problem-specific information.
• A pheromone trail definition, which states what information is to be stored in the pheromone trail. This allows the ants to share information about good solutions.
• The pheromone update rule, which defines the way in which good solutions are reinforced in the pheromone trail.
• A fitness function which determines the quality of a particular ant’s
solution.
• A construction procedure that the ants follow as they build their solutions
(this also tends to be problem specific).
The overall process of ACO can be seen in Figure 3.2. The process begins
by generating a number of ants which are then placed randomly on the graph.
Alternatively, the number of ants to place on the graph may be set equal to the
number of nodes within the data; each ant starts path construction at a different
node. From these initial positions, they traverse nodes probabilistically until a
traversal stopping criterion is satisfied. The resulting subsets are gathered and then
evaluated. If an optimal subset has been found or the algorithm has executed a
certain number of times, then the process halts and outputs the best solution
encountered. If none of these conditions holds, then the pheromone is updated, a new set of ants is created, and the process iterates once more.
The solution construction process is stochastic and is biased by a pheromone
model, that is, a set of parameters associated with graph components (either nodes
or edges) whose values are modified at runtime by the ants. Thus, the ACO algorithm can be applied to any problem which can be represented in a graph or node format. The basic flow of the algorithm is illustrated in Figure 3.2. The diversity of the algorithm comes from the different stochastic selection methods which are used in the subset evaluation section of the flow chart. The roulette wheel and ranking methods are examples of these stochastic selection methods (Bäck, 1996).
In roulette wheel selection algorithm, which is a stochastic algorithm, the
individuals are mapped to contiguous segments of a line, such that each
individual's segment is equal in size to its fitness. A random number is generated
and the individual whose segment spans the random number is selected. The
process is repeated until the desired number of individuals is obtained. This
technique is similar to a roulette wheel with each slice proportional in size to the
fitness. In rank-based selection algorithms, the population is sorted according to the objective values. The fitness assigned to each individual depends only on its position in the ranking and not on the actual objective value. After
sorting, a random number is generated and the individual whose order spans the
random number is selected.
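Both selection schemes described above can be sketched in a few lines (illustrative Python; the linear ranking shown here is one possible choice of rank-to-fitness mapping):

```python
import random

def roulette_wheel(fitnesses, rng=random):
    """Select an index with probability proportional to its fitness:
    individuals occupy contiguous segments sized by fitness, and a
    random point on the line picks the winner."""
    r = rng.random() * sum(fitnesses)
    acc = 0.0
    for i, f in enumerate(fitnesses):
        acc += f
        if acc >= r:
            return i
    return len(fitnesses) - 1

def rank_based(fitnesses, rng=random):
    """Select by position in the sorted order only: the worst individual
    gets rank 1, the best gets rank n, and a wheel is spun over the
    ranks rather than the raw objective values."""
    order = sorted(range(len(fitnesses)), key=lambda i: fitnesses[i])
    ranks = list(range(1, len(order) + 1))
    return order[roulette_wheel(ranks, rng)]
```

With fitnesses such as [1, 1, 8], the roulette wheel favors the third individual strongly, while rank-based selection flattens that advantage to the ratio of ranks 3:2:1.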
Figure 3.2. ACO Flow Chart
3.1.4. Weka Data Mining Tool
Weka (Waikato Environment for Knowledge Analysis)
(http://www.cs.waikato.ac.nz/ml/weka) is a popular suite of machine learning
software written in Java, developed at the University of Waikato, New Zealand.
Weka is free software available under the GNU General Public License. The starting graphical user interface of Weka is shown in Figure 3.3. Weka supports
several standard data mining tasks, more specifically, data preprocessing,
clustering, classification, regression, visualization, and feature selection.
Figure 3.3. The starting graphical user interface of Weka
Weka has four different working modes, namely Explorer, Experimenter, Knowledge Flow, and Simple CLI (Witten and Frank, 2005). Simple CLI
provides a simple command-line interface that allows direct execution of Weka
commands for operating systems that do not provide their own command line
interface. Experimenter is an environment for performing experiments and
conducting statistical tests between learning schemes. Knowledge Flow supports
essentially the same functions as the Explorer but with a drag-and-drop interface.
One advantage is that it supports incremental learning. Explorer is an
environment for exploring data with Weka (Witten and Frank, 2005). Explorer is
the most frequently used environment of Weka and it is shown in Figure 3.4.
Figure 3.4. Explorer Environment of Weka
The Explorer environment has six different tabs, namely Preprocess, Classify, Cluster, Associate, Select attributes, and Visualize. In the Preprocess tab, the data to be analyzed is chosen and modified. The 71 algorithms available in the Classify tab of
Weka are grouped into 6 categories, namely, Bayes (Bayesian algorithms),
Functions (function algorithms such as logistic regression and SVMs), Lazy (lazy
algorithms or instance based learners), Meta (algorithms that combine several
models and in some cases models from different algorithms), Trees
(classification/regression tree algorithms) and Rules (rule based algorithms). For
clustering, Cluster tab can be used to learn clusters for the data. Association rules
are learned in Associate tab. Select attributes tab includes attribute selection
methods. Finally, with Visualize tab, 2D plot of the data can be viewed.
In order to maintain format independence, data is converted to an
intermediate representation called ARFF (Attribute Relation File Format). ARFF
files contain blocks describing relations and their attributes, together with all the instances of the relation, of which there are often very many. They are stored as plain text for ease of manipulation. A relation is simply a single word or string
naming the concept to be learned. Each attribute has a name, a data type (which must be one of enumerated, real, or integer), and a value range (enumerations for nominal data, intervals for numeric data). The instances of the relation are provided in comma-separated form to simplify interaction with spreadsheets and databases. Missing or unknown values are specified by the '?' character. An
example arff file is shown in Figure 3.5. In this study, Weka was used in the
classification phase.
Figure 3.5. iris.arff File as an Example of arff File
(http://www.cs.waikato.ac.nz/ml/weka)
3.2. Method
This section describes the proposed algorithm. Detailed information about the datasets used, the ACO algorithm applied for feature selection, and the classification algorithm used for fitness function evaluation is given below. The general steps of the proposed system are shown in Figure 3.6.
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
Figure 3.6. Architecture of The Proposed System
The proposed method includes four main steps, namely dataset construction, feature extraction, feature selection, and classification. In the dataset construction phase, the datasets are prepared for the binary class classification problem. After the preparation step, features are extracted from these datasets. In the feature selection phase, a subset of features is selected from among the extracted features. Finally, the selected features are sent to Weka for classification. The feature selection and classification phases are repeated until the best feature set is found. These steps are explained in detail in the following sections.
3.2.1. Construction of Dataset
In this work, two different datasets namely the WebKB and the Conference
datasets were used. The first of these is the WebKB dataset, which was described in section 3.1.1. From the WebKB dataset, the Project, Faculty, Student, and Course classes were used in this study. As the Staff and Department classes have an insufficient number of positive examples, they were not considered in this thesis. Training and test datasets were constructed as described on the WebKB project Web site (http://www.cs.cmu.edu). For each class, the training set includes the relevant pages which belong to three randomly chosen universities, together with pages from the other class of the dataset. The fourth university's pages were used in the test phase. Approximately 75% of the irrelevant pages were added to the training set, and 25% of the irrelevant pages were included in the test set. Detailed information about the dataset used in this study is given in Table 3.4. For example, the Course class includes 846 relevant and 2822 irrelevant pages for the training phase; in the test phase of the Course class, 86 pages were relevant and 942 pages were irrelevant.
Table 3.4. Train/Test Distribution of WebKB Dataset for Binary Class Classification

Class    Train (Relevant / Non-relevant)  Test (Relevant / Non-relevant)
Course   846 / 2822                       86 / 942
Project  840 / 2822                       26 / 942
Student  1485 / 2822                      43 / 942
Faculty  1084 / 2822                      42 / 942
In the Conference dataset, approximately 75% of both the relevant and irrelevant pages were added to the training set, and 25% of the relevant and irrelevant pages were included in the test set. Detailed information about the Conference dataset used in this study is given in Table 3.5. Thus, 618 relevant and 1159 irrelevant pages were used in the training phase, and in the test phase, 206 pages were relevant and 386 pages were irrelevant.
Table 3.5. Train/Test Distribution of the Conference Dataset

Class        Train (Relevant / Non-relevant)   Test (Relevant / Non-relevant)
Conference   618 / 1159                        206 / 386
3.2.2. Feature Extraction
In the feature extraction phase, all <title>, <h1>, <a>, <b>, <p>, and <li> tags, which denote title, header at level 1, anchor, bold, paragraph, and list item, as well as the URL addresses of pages, were used. According to the experimental results of earlier studies (Kim and Zhang, 2003; Ribeiro et al., 2003; Trotman, 2005), these tags are meaningful for feature extraction. To extract features, all the terms from each of the above-cited tags and the URL addresses of the relevant pages in the training set were taken. After term extraction, stopword removal and Porter's stemming algorithm (Porter, 1980) were applied. Each stemmed term was added to the feature set.
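The tag-based term extraction described above can be sketched as follows (an illustrative Java sketch; the stopword list and tokenization are simplified assumptions, and Porter stemming is omitted for brevity):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagTermExtractor {
    // Illustrative stopword list; a standard stopword set would be used in practice.
    static final Set<String> STOPWORDS = Set.of("a", "an", "the", "of", "for", "and");

    // Collect the lower-cased terms appearing inside a given HTML tag.
    public static List<String> termsInTag(String html, String tag) {
        Pattern p = Pattern.compile("<" + tag + "[^>]*>(.*?)</" + tag + ">",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        List<String> terms = new ArrayList<>();
        while (m.find()) {
            // Split the tag content into alphabetic tokens and drop stopwords.
            for (String tok : m.group(1).toLowerCase().split("[^a-z]+")) {
                if (!tok.isEmpty() && !STOPWORDS.contains(tok)) {
                    terms.add(tok); // Porter stemming would be applied here.
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String page = "<html><title>Course Home Page</title>"
                + "<b>an introduction</b></html>";
        System.out.println(termsInTag(page, "title")); // [course, home, page]
        System.out.println(termsInTag(page, "b"));     // [introduction]
    }
}
```

Repeating this per tag type, and once for tokenized URL strings, yields the raw term pool from which the feature set is built.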
The number of features varies based on the dataset, the selected tags (i.e., only URL addresses, only <title> tags, all tags, or all terms), and the selected document frequency value. The number of features for each class of all datasets with respect to the selected tags is shown in Table 3.6. As an example, 33998 features were extracted when all of the above-mentioned tags and the URL were used for the Course class. When only the <title> tags were considered, the number of features was reduced to 305 for this class. As shown in Table 3.6, the number of features was too large for the Weka software, since Weka processes all data in memory. Therefore, we applied the document frequency feature selection method (Salton and Buckley, 1988) to reduce the number of features so that Weka could handle them. The document frequency of a feature is defined as the number of positive documents in the training dataset that contain the feature (Baeza-Yates and Ribeiro-Neto, 1999). In this study, features whose document frequencies are at least approximately 5%, 10% and 15% were chosen, since according to Salton and Buckley (1988) these features are good. After that, the ACO algorithm was used for feature selection from this feature pool.
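The document frequency elimination can be sketched as follows (an illustrative Java sketch; representing each positive training document as a set of terms is an assumption made for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class DocumentFrequencyFilter {
    // Keep only features that occur in at least minRatio of the positive
    // training documents, where each document is represented as a term set.
    public static List<String> filter(List<String> features,
                                      List<Set<String>> positiveDocs,
                                      double minRatio) {
        List<String> kept = new ArrayList<>();
        for (String f : features) {
            long df = positiveDocs.stream().filter(d -> d.contains(f)).count();
            if (df >= minRatio * positiveDocs.size()) {
                kept.add(f);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                Set.of("course", "exam"), Set.of("course", "homework"),
                Set.of("course"), Set.of("exam"));
        // "course" appears in 3 of 4 docs (75%), "homework" in 1 of 4 (25%).
        System.out.println(filter(List.of("course", "homework"), docs, 0.5)); // prints [course]
    }
}
```

Running the filter with ratios 0.05, 0.10 and 0.15 would produce the three feature pools reported in Tables 3.7 and 3.8.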
Table 3.6. Number of Features for all Classes According to the Selected Tags

Class        Tagged Terms   Bag of Terms   Title Tag   URL
Course       33998          16421          305         476
Project      31542          15466          596         686
Student      51009          22417          1987        1557
Faculty      48584          24756          1502        1208
Conference   3667           18788          890         1115
3.2.3. Feature Selection
The ACO algorithm was used for feature selection. According to the experimental results of preliminary studies on the ACO algorithm (Dorigo and Stützle, 2002), the optimum number of ants is 5; therefore, 5 ants were used in this thesis. The ACO algorithm, which was described in Section 3.1.2, was adapted for the Web page feature selection problem. Before the selection of features, the number of features was reduced by document frequency, so that only the more frequent features remained before the ACO algorithm was applied.
In this study, features were selected from four different feature groups for each class. In the first group, features were extracted from only the URL addresses of Web pages. For URL address features, document frequency elimination was not used, because the number of extracted features was not too large. Detailed information about the number of features for all classes according to the URL addresses is given in Table 3.6. Secondly, only <title> tags were used for feature extraction. The number of features extracted from the <title> tags was not too large either, so document frequency elimination was not used here, too. The number of features for all classes using the <title> tags is given in Table 3.6. In the third feature extraction method, all terms were used as features without their tag properties. In other words, a term that appeared anywhere in the document, regardless of its position, was taken as a feature. This feature extraction approach is called the "bag-of-terms" method. In the bag-of-terms method, the number of features was very large, so the document frequency factor was used as a reducer. In this case, the number of features extracted for all classes with respect to their document frequency values is shown in Table 3.7.
Table 3.7. Number of Features for all Classes in Bag-of-Terms Method with Respect to Document Frequency Values

Class        5% Document Frequency   10% Document Frequency   15% Document Frequency
Course       459                     217                      130
Project      241                     89                       54
Student      292                     121                      70
Faculty      386                     194                      107
Conference   492                     245                      141
Finally, all terms that appeared in all tags were used as features. In other words, a term that appeared in different tags was taken as a different feature for each tag (i.e., tagged terms). The number of features in this case is also too large, so document frequency values were used to reduce it; the same document frequency values as in the previous method were used. Detailed information about the number of features for all classes in tagged terms form with respect to document frequency values is given in Table 3.8.
Table 3.8. Number of Features for all Classes in Tagged Term Method with Respect to Document Frequency Values

Class        5% Document Frequency   10% Document Frequency   15% Document Frequency
Course       757                     326                      193
Project      324                     115                      66
Student      450                     169                      98
Faculty      603                     259                      140
Conference   831                     370                      201
After the feature extraction step, the ACO algorithm was used to select the optimum subset of features. In the proposed method, each feature represents a node, and all nodes are independent. Nodes (i.e., features) were selected according to their selection probability Pk(i), which is given in Equation 3.4.
P_k(i) = \frac{[\tau(i)]^{\alpha}\,[\eta(i)]^{\beta}}{\sum_{u \in N_k} [\tau(u)]^{\alpha}\,[\eta(u)]^{\beta}} \qquad (3.4)
In Equation 3.4, η(i) = df(i), where df(i) is the document frequency of feature i, and represents the heuristic information available to the ants. N_k is the 'feasible' neighborhood of ant k, that is, all features not yet visited by ant k. τ(i) is the pheromone trail value of feature i. The initial pheromone values are set to 10. α and β are parameters that determine the relative influence of the heuristic and pheromone information, and both of them are set to 1. Previous studies have shown that 1 is the most appropriate value for α and β, and 10 is a suitable value for the initial pheromone trail (Dorigo and Stützle, 2002). After all the ants have built a complete tour, the pheromone trail is updated according to the global update rule, which is defined in Equation 3.5.
\tau(i) = \rho \cdot \tau(i) + \sum_{k=1}^{m} \Delta\tau_k(i) \qquad (3.5)
where ρ denotes a pheromone evaporation parameter that decays the pheromone trail, and m is the number of ants. The ρ value is set to 0.2 (Dorigo and Stützle, 2002). The specific amount of pheromone, Δτk(i), that each ant k deposits on the trail is given by Equation 3.6.
\Delta\tau_k(i) = \begin{cases} 2 \cdot F_k \cdot B_k & \text{if node } i \text{ is used by ant } k \text{ and } k \text{ is the elitist ant} \\ F_k \cdot B_k & \text{if node } i \text{ is used by ant } k \\ 0 & \text{otherwise} \end{cases} \qquad (3.6)
In Equation 3.6, F_k is the F-measure value of ant k's subset, and B_k is the unit pheromone value. This means that the higher the F-measure of the ant's subset, the more pheromone will be deposited on the arcs used in the subset, and these arcs will thus be more likely to be selected in the next iteration.
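The global update rule of Equations 3.5 and 3.6 can be sketched as follows (an illustrative Java sketch, using ρ = 0.2 and a unit pheromone value B_k = 1.0 as stated in the text; the data layout is an assumption):

```java
import java.util.List;
import java.util.Set;

public class PheromoneUpdate {
    // Equation 3.6: pheromone deposited by ant k on feature i.
    static double delta(boolean used, boolean elitist, double fMeasure, double unit) {
        if (!used) return 0.0;
        return elitist ? 2.0 * fMeasure * unit : fMeasure * unit;
    }

    // Equation 3.5: tau(i) = rho * tau(i) + sum over ants of delta_k(i).
    public static void update(double[] tau, List<Set<Integer>> subsets,
                              double[] fMeasures, int elitistAnt,
                              double rho, double unit) {
        for (int i = 0; i < tau.length; i++) {
            double sum = 0.0;
            for (int k = 0; k < subsets.size(); k++) {
                sum += delta(subsets.get(k).contains(i), k == elitistAnt,
                             fMeasures[k], unit);
            }
            tau[i] = rho * tau[i] + sum;
        }
    }

    public static void main(String[] args) {
        double[] tau = {10.0, 10.0};                 // initial pheromone = 10
        List<Set<Integer>> subsets = List.of(Set.of(0), Set.of(1));
        double[] f = {0.9, 0.8};                     // F-measures of the two ants
        update(tau, subsets, f, 0, 0.2, 1.0);        // ant 0 is the elitist ant
        // feature 0: 0.2*10 + 2*0.9 = 3.8; feature 1: 0.2*10 + 0.8 = 2.8
        System.out.println(tau[0] + " " + tau[1]);
    }
}
```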
Until each ant has chosen a predefined number of features, the selection probability of each unselected node is evaluated by Equation 3.4. After the probability evaluation, a roulette wheel selection algorithm is used for selecting the next feature (Bäck and Thomas, 1996). The flow chart of the proposed feature subset selection algorithm is shown in Figure 3.7.
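The node selection step, Equation 3.4 followed by roulette wheel selection, can be sketched as follows (an illustrative Java sketch with α = β = 1, so the selection probability is proportional to τ(i)·df(i); the fixed random seed is only for the demonstration):

```java
import java.util.List;
import java.util.Random;

public class RouletteSelection {
    // Equation 3.4 with alpha = beta = 1: p(i) proportional to tau(i) * df(i)
    // over the unvisited features; returns the index of the chosen feature.
    public static int select(double[] tau, double[] df, List<Integer> unvisited,
                             Random rng) {
        double total = 0.0;
        for (int i : unvisited) total += tau[i] * df[i];
        double r = rng.nextDouble() * total;  // spin the roulette wheel
        double acc = 0.0;
        for (int i : unvisited) {
            acc += tau[i] * df[i];
            if (r <= acc) return i;
        }
        return unvisited.get(unvisited.size() - 1); // numerical safety
    }

    public static void main(String[] args) {
        double[] tau = {10.0, 10.0, 10.0};   // initial pheromone = 10
        double[] df = {8.0, 1.0, 1.0};       // heuristic: document frequencies
        Random rng = new Random(42);
        int[] counts = new int[3];
        for (int t = 0; t < 10000; t++) {
            counts[select(tau, df, List.of(0, 1, 2), rng)]++;
        }
        // Feature 0 should be chosen roughly 80% of the time.
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
    }
}
```

Each selected feature is removed from the unvisited list before the next draw, so every ant builds a subset without repetitions.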
Figure 3.7. Flow Chart of Proposed ACO Algorithm
When all ants complete their subset selection process, two arff files are generated for each ant (one for the training phase and one for the test phase). In the @data section of the arff files, each row represents a Web page, and each value in a row shows the frequency of the corresponding feature. If a Web page is relevant, its row ends with R; if it is irrelevant, its row ends with N. In this way, the arff files were generated. The obtained arff files were classified with Weka. An example of an arff file is shown in Figure 3.8. In Figure 3.8, in the row '@attribute 14 real', 14 denotes the number of the feature (i.e., the index of the feature).
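The arff generation step can be sketched as follows (a minimal Java sketch; attributes are named by feature index and rows end with R or N as described above, while the example frequencies are made up):

```java
public class ArffWriter {
    // Build a minimal arff file: one numeric attribute per feature index,
    // one data row per page, ending with class label R (relevant) or N.
    public static String toArff(String relation, int numFeatures,
                                int[][] frequencies, boolean[] relevant) {
        StringBuilder sb = new StringBuilder("@relation " + relation + "\n");
        for (int i = 1; i <= numFeatures; i++) {
            sb.append("@attribute ").append(i).append(" real\n");
        }
        sb.append("@attribute class {R,N}\n@data\n");
        for (int row = 0; row < frequencies.length; row++) {
            for (int f : frequencies[row]) sb.append(f).append(",");
            sb.append(relevant[row] ? "R" : "N").append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[][] freq = {{2, 0, 1}, {0, 3, 0}};  // made-up term frequencies
        System.out.print(toArff("course", 3, freq, new boolean[]{true, false}));
    }
}
```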
Figure 3.8. An Instance of an arff File
The well-known C4.5 algorithm, which is based on decision trees, was used for classification. C4.5 is an extension of Quinlan's earlier ID3 algorithm (Quinlan, 1993). The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. C4.5 tries to find small decision trees by pruning. Pseudocode of the general C4.5 algorithm is given in Figure 3.9.
J48 is an open source Java implementation of the C4.5 algorithm in the Weka data mining tool. Results were compared with respect to F-measure (Van Rijsbergen, 1979) values. The formulation of the Weka F-measure value is given in Equation 3.7.
Figure 3.9. Pseudo Code of General C4.5 algorithm (Quinlan, 1993).
F\text{-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (3.7)
In Equation 3.7, recall is the ratio of relevant documents found in the search result to the total number of relevant documents, and precision is the proportion of relevant documents in the results returned. In earlier studies, researchers measured the performance of their methods with respect to the F-measure. To comply with this standard, the F-measure was chosen as the performance metric in this study. As a result of classification, the ant that gives the best result is chosen as the elitist ant. After that, pheromone values are updated based on Equations 3.5 and 3.6. The process is repeated a predetermined number of epochs. Finally, the feature subset with the best F-measure value is chosen as the optimum subset.
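Equation 3.7 and the precision/recall definitions can be sketched as follows (an illustrative Java sketch; the example counts are made up):

```java
public class FMeasure {
    // Equation 3.7: harmonic mean of precision and recall.
    public static double fMeasure(double precision, double recall) {
        if (precision + recall == 0.0) return 0.0;
        return 2.0 * precision * recall / (precision + recall);
    }

    // Precision and recall from raw counts of a binary classification run.
    public static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    public static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }

    public static void main(String[] args) {
        double p = precision(80, 20);  // 80 relevant among 100 returned
        double r = recall(80, 40);     // 80 found of 120 relevant overall
        System.out.println(fMeasure(p, r)); // approximately 0.7273
    }
}
```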
1. Check for base cases
2. For each attribute a
   2.1. Find the normalized information gain from splitting on a
3. Let a_best be the attribute with the highest normalized information gain
4. Create a decision node that splits on a_best
5. Recur on the sublists obtained by splitting on a_best, and add those nodes as children of the node
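The information gain used in step 2 of the pseudocode can be sketched for a binary split as follows (an illustrative Java sketch; C4.5 additionally normalizes this gain by the split information to obtain the gain ratio):

```java
public class InformationGain {
    static double log2(double x) { return Math.log(x) / Math.log(2.0); }

    // Entropy of a binary class distribution with pos positives of n total.
    static double entropy(int pos, int n) {
        if (n == 0 || pos == 0 || pos == n) return 0.0;
        double p = pos / (double) n;
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    // Information gain of splitting n examples (pos positive) into two
    // branches: the left branch (nL examples, posL positive) and the rest.
    public static double gain(int pos, int n, int posL, int nL) {
        int nR = n - nL, posR = pos - posL;
        double after = (nL / (double) n) * entropy(posL, nL)
                     + (nR / (double) n) * entropy(posR, nR);
        return entropy(pos, n) - after;
    }

    public static void main(String[] args) {
        // A perfectly separating split: 4 of 8 examples are positive and the
        // left branch holds all 4, so the gain equals the full 1-bit entropy.
        System.out.println(gain(4, 8, 4, 4)); // 1.0
    }
}
```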
4. RESEARCH AND DISCUSSION
In this section, the experiments performed and their results are presented. The Perl programming language (http://www.perl.org/) was used for the whole feature extraction phase and for the document frequency part of the feature selection phase. The ACO feature selection algorithm was programmed in the Java programming language under the Eclipse environment (http://www.oracle.com/technetwork/developer-tools/eclipse/downloads/index.html). The proposed method was tested under the Microsoft Windows XP SP3 operating system. The hardware used in the experiments had 1 GB of RAM and an Intel Core2Duo 1.60 GHz processor. The proposed method consists of two main parts: the first part is extracting features, explained in the previous section, and the second is selecting the optimal feature subset from these features with an ACO algorithm. The suggested method was tested on the Conference and the WebKB (http://www.cs.cmu.edu/~webkb/) datasets. Detailed information about the datasets was given in the previous section. The proposed selection method was run for 250 epochs for each class; in other words, according to the experimental results, the epoch number of the selection method was defined as 250. After 250 epochs, there was no improvement in the classification results, so 250 was the optimum epoch value for the proposed method. Classification results of an experiment up to 500 epochs are shown in Table 4.1.
Table 4.1. Classification Results of Student Class with Respect to 500 Epoch Value

# of Epoch   Max/Min F-Measure
1            0.979/0.890
125          0.978/0.969
250          0.977/0.972
375          0.933/0.9277
500          0.933/0.9267
Run Time in Seconds: 30.08
Classification results of an experiment with 250 epochs are shown in Table 4.2.
Table 4.2. Classification Results of Student Class with Respect to 250 Epoch Value

# of Epoch   Max/Min F-Measure
1            0.979/0.890
125          0.978/0.969
250          0.977/0.972
Run Time in Seconds: 15.04
The results given in Table 4.1 belong to the Student class, using the bag-of-terms method with a 15% document frequency value. The number of selected features was set to 70 in these results. According to the experimental results, after 250 epochs there was no improvement in the F-measure value, while the run time of the method doubled. Therefore, the epoch number was defined as 250.
In this study, each ant chooses a predefined number of features. This number was determined by the total number of features of each class. In the bag-of-terms and tagged terms methods, after the document frequency selection phase, half of the minimum feature number was taken as the limit of selected features for all classes. For example, as seen in Table 3.7, the Conference dataset has 492, 245 and 141 features based on 5%, 10% and 15% document frequency values, respectively, in the bag-of-terms method. Approximately half of 141, which was the minimum feature number of the Conference dataset in this case (i.e., 70), was taken as the upper limit of the number of selected features. In the URL and <title> tag methods, the document frequency reduction technique was not applied to the datasets; for this reason, 20% of the minimum feature number was taken as the upper limit of selected features for all classes. For example, as seen in Table 3.6, the Course dataset has 305 features in the title tags method. Approximately 20% of 305, which was the minimum feature number of the Course dataset in this case (i.e., 60), was taken as the upper limit of the number of selected features. The purpose of this study is to minimize the number of features; therefore, after this determination, the predefined upper limits were reduced gradually. Detailed information about the selected number of features is given specifically in each section of each experiment.
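One plausible reading of the upper-limit rule above, rounding to the nearest ten, can be sketched as follows (an illustrative Java sketch; the exact rounding used in the thesis is not stated, so this is an assumption that reproduces the 70 and 60 examples):

```java
public class FeatureLimit {
    // Upper limit on the number of features each ant may select:
    // half the smallest feature count when document frequency reduction
    // was applied, and 20% of it when it was not (URL and <title> methods).
    public static int upperLimit(int minFeatureCount, boolean dfReduced) {
        double ratio = dfReduced ? 0.5 : 0.2;
        // Round the raw limit to the nearest multiple of ten (assumption).
        return (int) Math.round(minFeatureCount * ratio / 10.0) * 10;
    }

    public static void main(String[] args) {
        System.out.println(upperLimit(141, true));   // Conference, bag of terms: 70
        System.out.println(upperLimit(305, false));  // Course, title tags: 60
    }
}
```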
In a previous study (Saraç and Özel, 2010), the selected features were classified with the Naive Bayes, RBF (Poggio and Girosi, 1990), and C4.5 classification algorithms of the Weka data mining tool, and the results of this comparison are given in Table 4.3.
Table 4.3. F-measures of NB, RBF, and C4.5 Classifiers for the WebKB Dataset

Classifier   Course   Project   Faculty   Student
NB           0.149    0.947     0.097     0.1
RBF          0.871    0.959     0.926     0.775
C4.5         0.877    0.962     0.947     0.793
According to Table 4.3, the C4.5 classification algorithm was chosen in this study for the classification of Web pages, since the C4.5 classifier had the highest classification F-measure.
4.1. Classification Experiments With Only URL Addresses
The performance of the proposed method with only the URL addresses of Web pages was considered first. For all classes, the m value shown in Figure 3.7 was defined according to the total number of features, which is shown in Table 3.6. To make a comparison between the URL and <title> tag methods, the same number of features is used in both cases. So, to define the m value, the numbers of features extracted from the <title> tags and URLs were considered. According to Table 3.6, the Course class in the title tag method has the minimum feature number, 305, so the limit m value was defined as 60, since 60 is approximately 20% of 305. After this predefinition, the proposed algorithm was also tested with 30 and 10 features. In the first experiment, each ant selected 60 features from the whole feature set for all classes, and Web pages were classified with respect to these selected 60 features. Classification results in F-measure values for 60 features are given in Table 4.4.
Table 4.4. Experimental Results Using URLs With 60 Features of All Classes

# of Epoch           Course    Project   Student   Faculty   Conference
1                    1.0/1.0   1.0/1.0   1.0/1.0   1.0/1.0   0.817/0.745
125                  1.0/1.0   1.0/1.0   1.0/1.0   1.0/1.0   0.835/0.817
250                  1.0/1.0   1.0/1.0   1.0/1.0   1.0/1.0   0.835/0.826
Run Time (min:sec)   12:45     06:11     04:23     07:15     12:40
Run times of the experiments can also be seen in Table 4.4. In the second experiment, each ant selected 30 features from the whole feature set for all classes, and Web pages were classified with respect to these selected 30 features. Classification results in F-measure values for 30 features are given in Table 4.5.
Table 4.5. Experimental Results Using URLs With 30 Features of All Classes

# of Epoch           Course    Project   Student   Faculty   Conference
1                    1.0/1.0   1.0/1.0   1.0/1.0   1.0/1.0   0.798/0.606
125                  1.0/1.0   1.0/1.0   1.0/1.0   1.0/1.0   0.824/0.812
250                  1.0/1.0   1.0/1.0   1.0/1.0   1.0/1.0   0.824/0.792
Run Time (min:sec)   06:24     06:11     02:16     03:47     06:52
Finally, each ant selected 10 features from the whole feature set for all classes, and Web pages were classified with respect to these selected 10 features. Classification results in F-measure values for 10 features are given in Table 4.6.
Table 4.6. Experimental Results Using URLs With 10 Features of All Classes

# of Epoch           Course        Project   Student       Faculty   Conference
1                    1.0/0.98654   1.0/1.0   1.0/0.82201   1.0/1.0   0.767/0.545
125                  1.0/1.0       1.0/1.0   1.0/1.0       1.0/1.0   0.780/0.746
250                  1.0/1.0       1.0/1.0   1.0/1.0       1.0/1.0   0.780/0.751
Run Time (min:sec)   02:31         01:12     01:05         01:32     02:22
In the given experimental results, the tables include the maximum and minimum F-measure values at the defined iteration numbers. So, the 0.767/0.545 value in Table 4.6 specifies the maximum and the minimum F-measure values of the Conference dataset at the first iteration.
According to the obtained experimental results, the WebKB dataset includes meaningful URL addresses, but the Conference dataset does not. There was no noticeable change between different numbers of features for the WebKB dataset; a reduced number of features only changed the run time of the algorithm. In the Conference dataset, however, F-measure values decreased as the number of features was reduced. A limited number of features, although advantageous in terms of time, is a disadvantage in terms of classification performance for the Conference dataset.
4.2. Classification Experiments With Only <title> Tags
The performance of the proposed method with only the <title> tags of Web pages was considered in this section. As in the URL address tests, in the first experiment each ant selected 60 features from the whole feature set for all classes, and Web pages were classified with respect to these selected 60 features. Classification results in F-measure values for 60 features are given in Table 4.7. In Table 4.7, # E denotes the epoch number and RT denotes the run time of the algorithm in minutes and seconds.
Table 4.7. Experimental Results Using <title> Tags With 60 Features of All Classes

# E   Course        Project       Student       Faculty       Conference
1     0.880/0.871   0.983/0.983   0.917/0.913   0.940/0.932   0.741/0.698
125   0.880/0.874   0.983/0.983   0.911/0.911   0.935/0.922   0.721/0.718
250   0.874/0.874   0.983/0.983   0.913/0.911   0.935/0.926   0.724/0.717
RT    29:05         11:21         21:41         25:17         11:38
Run times of the experiments can be seen in Table 4.7. In the second experiment, each ant selected 30 features from the whole feature set for all classes, and Web pages were classified with respect to these selected 30 features. Classification results in F-measure values for 30 features are given in Table 4.8.
Table 4.8. Experimental Results Using <title> Tags With 30 Features of All Classes

# E   Course        Project       Student       Faculty       Conference
1     0.876/0.869   0.983/0.976   0.920/0.917   0.939/0.932   0.710/0.687
125   0.882/0.875   0.983/0.983   0.911/0.911   0.927/0.922   0.730/0.721
250   0.880/0.877   0.983/0.983   0.917/0.911   0.927/0.926   0.732/0.718
RT    13:58         05:54         11:53         12:56         05:40
Finally, each ant selected 10 features from the whole feature set for all classes, and Web pages were classified with respect to these selected 10 features. Classification results in F-measure values for 10 features are given in Table 4.9.
According to the obtained experimental results, the WebKB dataset includes meaningful title declarations, but the Conference dataset's title declarations are not as meaningful as the WebKB dataset's. In this experiment, a reduced number of features affected F-measure values negatively. In the previous experiment, short and meaningful information about Web pages was extracted from the URL addresses, so favorable features were selected in all cases. For <title> tag features, however, the number of features is larger than for URL address features.
When the m value is reduced, the selection probability of meaningful features is reduced. A limited number of features, although advantageous in terms of time, is a disadvantage in terms of classification performance for the Conference dataset. The number of pages is also important for the run time of the algorithm; the Conference and Project classes have the minimum numbers of pages, which means a minimum number of cycles, so the run times of these classes are lower than those of the others.
Table 4.9. Experimental Results Using <title> Tags With 10 Features of All Classes

# E   Course        Project       Student       Faculty       Conference
1     0.882/0.866   0.983/0.948   0.917/0.886   0.946/0.930   0.698/0.687
125   0.877/0.875   0.983/0.983   0.920/0.917   0.940/0.936   0.736/0.710
250   0.877/0.875   0.983/0.983   0.917/0.917   0.940/0.935   0.736/0.699
RT    02:20         02:19         05:06         04:00         02:13
4.3. Classification Experiments With Bag of Terms Method
The performance of the proposed algorithm with the bag-of-terms method was considered in this section. The numbers of features for all classes can be seen in Table 3.6. In the bag-of-terms method, the number of features is very high; because of this, the document frequency feature elimination method was applied in this experiment. The m value of each class was determined by its number of features, which is shown in Table 3.7. According to these numbers of features, the upper limit of the m value was defined as approximately half of the minimum feature number. For example, for the Conference class, the minimum feature number is 141, so the upper limit of the m value was defined as 70. For the Course class, the upper limit of the m value was defined as 60; for the Project and Student classes, as 30; and finally, for the Faculty class, as 50. After the determination of the upper limits of the m values, the numbers of features were reduced according to these upper limits. The experimental results are given in three parts, with respect to document frequencies.
4.3.1. Classification Experiments With Bag of Terms Method in 5% Document Frequency Value
The specified numbers of features and classification results for the 5% document frequency value are discussed in this section. Each ant was run with three different numbers of features. For the Course class, in the first experiment each ant selected 60 features, in the second experiment 40 features, and in the final experiment 10 features. We defined a fixed number of features to compare the classification performance of all classes under the same conditions; thus, the aim of the final experiment is to provide a comparison between different classes. Because of this, there is an experimental result with 10 features for all classes. The specified numbers of features and classification results of the Course class for the 5% document frequency value are shown in Table 4.10.
Table 4.10. Experimental Results Using Bag of Terms Method for Course Class With 5% Document Frequency

      # of Features for Course Class
# E   60            40            10
1     0.980/0.851   0.985/0.974   0.977/0.799
125   0.975/0.958   0.964/0.959   0.982/0.915
250   0.975/0.958   0.981/0.964   0.982/0.914
RT    08:50         06:08         03:40
For the Project class, in the first experiment each ant selected 30 features, in the second experiment 20 features, and in the final experiment 10 features. The specified numbers of features and classification results of the Project class for the 5% document frequency value are shown in Table 4.11.
Table 4.11. Experimental Results Using Bag of Terms Method for Project Class With 5% Document Frequency

      # of Features for Project Class
# E   30            20            10
1     0.981/0.887   0.994/0.958   0.951/0.792
125   0.973/0.977   0.988/0.963   0.994/0.953
250   0.976/0.973   0.992/0.963   0.994/0.956
RT    05:31         03:05         02:50
For the Student class, in the first experiment each ant selected 30 features, in the second experiment 20 features, and in the final experiment 10 features. The specified numbers of features and classification results of the Student class for the 5% document frequency value are shown in Table 4.12.
Table 4.12. Experimental Results Using Bag of Terms Method for Student Class With 5% Document Frequency

      # of Features for Student Class
# E   30            20            10
1     0.982/0.688   0.983/0.920   0.865/0.820
125   0.968/0.950   0.985/0.949   0.984/0.854
250   0.968/0.962   0.983/0.891   0.988/0.979
RT    07:41         07:04         05:02
For the Faculty class, in the first experiment each ant selected 50 features, in the second experiment 30 features, and in the final experiment 10 features. The specified numbers of features and classification results of the Faculty class for the 5% document frequency value are shown in Table 4.13.
Table 4.13. Experimental Results Using Bag of Terms Method for Faculty Class With 5% Document Frequency

      # of Features for Faculty Class
# E   50            30            10
1     0.978/0.835   0.958/0.930   0.990/0.887
125   0.981/0.983   0.989/0.877   0.990/0.959
250   0.978/0.972   0.991/0.983   0.993/0.981
RT    11:00         07:26         04:15
For the Conference class, in the first experiment each ant selected 70 features, in the second experiment 50 features, and in the final experiment 10 features. The specified numbers of features and classification results of the Conference class for the 5% document frequency value are shown in Table 4.14.
Table 4.14. Experimental Results Using Bag of Terms Method for Conference Class With 5% Document Frequency

      # of Features for Conference Class
# E   70            50            10
1     0.992/0.952   0.991/0.910   0.994/0.911
125   0.987/0.984   0.992/0.973   0.994/0.970
250   0.987/0.984   0.992/0.985   0.992/0.978
RT    08:02         05:31         03:35
4.3.2. Classification Experiments With Bag of Terms Method in 10% Document Frequency Value
Classification results for the 10% document frequency value are discussed in this section. The same numbers of features were used for all document frequency values, so each ant was run with three different numbers of features in this case as well. Classification results of the Course class for the 10% document frequency value with respect to the number of features are shown in Table 4.15.
Table 4.15. Experimental Results Using Bag of Terms Method for Course Class With 10% Document Frequency

      # of Features for Course Class
# E   60            40            10
1     0.980/0.822   0.974/0.781   0.987/0.890
125   0.986/0.964   0.986/0.884   0.984/0.896
250   0.986/0.964   0.964/0.964   0.983/0.888
RT    08:30         05:54         03:39
Classification results of the Project class for 10% document frequency
value with respect to number of features are shown in Table 4.16.
Table 4.16. Experimental Results Using Bag of Terms Method for Project Class With 10% Document Frequency

      # of Features for Project Class
# E   30            20            10
1     0.993/0.963   0.990/0.877   0.995/0.785
125   0.993/0.963   0.992/0.963   0.993/0.955
250   0.977/0.973   0.995/0.973   0.992/0.987
RT    05:30         03:55         02:46
Classification results of the Student class for 10% document frequency
value with respect to number of features are shown in Table 4.17.
Table 4.17. Experimental Results Using Bag of Terms Method for Student Class With 10% Document Frequency

      # of Features for Student Class
# E   30            20            10
1     0.966/0.655   0.868/0.709   0.854/0.780
125   0.826/0.798   0.791/0.763   0.897/0.833
250   0.826/0.776   0.793/0.761   0.887/0.840
RT    20:25         15:19         07:05
Classification results of the Faculty class for 10% document frequency
value with respect to number of features are shown in Table 4.18.
Table 4.18. Experimental Results Using Bag of Terms Method for Faculty Class With 10% Document Frequency

      # of Features for Faculty Class
# E   50            30            10
1     0.988/0.889   0.985/0.855   0.932/0.805
125   0.980/0.974   0.991/0.974   0.994/0.986
250   0.978/0.971   0.989/0.972   0.993/0.946
RT    10:51         07:19         03:48
Classification results of the Conference class for 10% document frequency
value with respect to number of features are shown in Table 4.19.
Table 4.19. Experimental Results Using Bag of Terms Method for Conference Class With 10% Document Frequency

      # of Features for Conference Class
# E   70            50            10
1     0.991/0.985   0.992/0.922   0.991/0.915
125   0.985/0.984   0.992/0.985   0.992/0.977
250   0.987/0.985   0.987/0.987   0.994/0.991
RT    07:23         05:24         03:41
4.3.3. Classification Experiments With Bag of Terms Method in 15% Document Frequency Value
Classification results for the 15% document frequency value are discussed in this section. Each ant was run with three different numbers of features in this case. Classification results of the Course class for the 15% document frequency value with respect to the number of features are shown in Table 4.20.
Table 4.20. Experimental Results Using Bag of Terms Method for Course Class With 15% Document Frequency

      # of Features for Course Class
# E   60            40            10
1     0.986/0.963   0.979/0.959   0.914/0.727
125   0.975/0.958   0.964/0.964   0.985/0.917
250   0.986/0.958   0.964/0.964   0.984/0.883
RT    07:58         04:30         03:24
Classification results of the Project class for 15% document frequency
value with respect to number of features are shown in Table 4.21.
Table 4.21. Experimental Results Using Bag of Terms Method for Project Class With 15% Document Frequency

      # of Features for Project Class
# E   30            20            10
1     0.974/0.885   0.994/0.964   0.994/0.859
125   0.976/0.963   0.992/0.974   0.994/0.725
250   0.988/0.962   0.993/0.977   0.991/0.934
RT    05:16         04:06         02:43
Classification results of the Student class for 15% document frequency
value with respect to number of features are shown in Table 4.22.
Table 4.22. Experimental Results Using Bag of Terms Method for Student Class With 15% Document Frequency

      # of Features for Student Class
# E   30            20            10
1     0.987/0.707   0.988/0.952   0.984/0.730
125   0.963/0.948   0.988/0.968   0.988/0.858
250   0.988/0.949   0.981/0.949   0.987/0.858
RT    07:18         05:33         02:27
Classification results of the Faculty class for 15% document frequency
value with respect to number of features are shown in Table 4.23.
Table 4.23. Experimental Results Using Bag of Terms Method for Faculty Class With 15% Document Frequency

      # of Features for Faculty Class
# E   50            30            10
1     0.994/0.927   0.979/0.899   0.980/0.746
125   0.980/0.970   0.977/0.973   0.989/0.984
250   0.982/0.972   0.993/0.975   0.988/0.915
RT    11:05         06:43         03:50
Classification results of the Conference class for 15% document frequency
value with respect to number of features are shown in Table 4.24.
Table 4.24. Experimental Results Using Bag of Terms Method for Conference Class With 15% Document Frequency

      # of Features for Conference Class
# E   70            50            10
1     0.992/0.984   0.991/0.964   0.991/0.954
125   0.992/0.984   0.992/0.984   0.992/0.991
250   0.987/0.984   0.992/0.987   0.992/0.977
RT    07:11         05:20         03:13
According to the obtained experimental results, the text of Web pages is
more informative than URL addresses and titles for the Conference dataset,
whereas for the WebKB dataset, URL addresses are more informative than page
contents and titles. As in the previous experiments, F-measure values (i.e.,
classification performance) changed with the number of features. The 15%
document frequency value yielded the best classification performance with the
maximum number of features. With the document frequency method, meaningless
features were eliminated before the ACO was applied, which forced the ants to
select meaningful features. The number of features also affected the run time
of the algorithm: run time increased with the number of selected features.
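The document frequency elimination step described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation; the function name and threshold handling are assumptions.

```python
# Sketch of document-frequency feature elimination: a term survives only if
# it occurs in at least min_df_ratio of the training documents. The name
# df_eliminate is illustrative, not from the thesis code.

def df_eliminate(documents, min_df_ratio=0.15):
    """documents: list of sets of terms; returns the surviving vocabulary."""
    threshold = min_df_ratio * len(documents)
    counts = {}
    for doc in documents:
        for term in doc:                      # count each term once per document
            counts[term] = counts.get(term, 0) + 1
    return {t for t, c in counts.items() if c >= threshold}

docs = [{"course", "exam"}, {"course", "syllabus"}, {"exam", "grading"}]
print(sorted(df_eliminate(docs, 0.5)))  # ['course', 'exam']
```

With a 0.5 ratio, only terms appearing in at least two of the three toy documents survive; the thesis uses 5%, 10%, and 15% ratios per class.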
4.4. Classification Experiments With Tagged Terms Method

This section considers the performance of the proposed algorithm with the
tagged terms method. The number of features for each class can be seen in
Table 3.6. In the tagged terms method the number of features is very large;
therefore, the document frequency feature elimination method was applied here
as well. The m values previously defined for the bag of terms method are also
used for the tagged terms method, so that the two methods (i.e., bag of terms
and tagged terms) can be compared fairly. The experimental results are given
in three parts, with respect to document frequency values.
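In the tagged terms method each classification feature is a (tag, term) pair, so the same word under &lt;title&gt; and under &lt;p&gt; yields two distinct features. The sketch below illustrates this idea; the tag set and tokenization are simplified assumptions, not the thesis's preprocessing code.

```python
# Rough sketch of tagged-term feature extraction: features are (tag, term)
# pairs. TAGS and the tokenizer are illustrative simplifications.
from html.parser import HTMLParser
import re

TAGS = {"title", "h1", "a", "b", "p", "li"}   # tags considered in this chapter

class TaggedTermExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags of interest
        self.features = set()    # collected (tag, term) pairs
    def handle_starttag(self, tag, attrs):
        if tag in TAGS:
            self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
    def handle_data(self, data):
        if self.stack:
            for term in re.findall(r"[a-z]+", data.lower()):
                self.features.add((self.stack[-1], term))

parser = TaggedTermExtractor()
parser.feed("<title>CS course</title><p>homework due</p>")
print(sorted(parser.features))   # (tag, term) pairs
```

Note that "course" under &lt;title&gt; and "homework" under &lt;p&gt; become separate features, which is what lets the classifier weigh tags differently.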
4.4.1. Classification Experiments With Tagged Terms Method in 5%
Document Frequency Value
The specified numbers of features and classification results for the 5%
document frequency value are discussed in this section. Each ant was run with
three different numbers of features. The number of features for each tag is
presented in Table 4.25.
Table 4.25. Number of Features for Each Tag With 5% Document Frequency Value for Each Class

    Tag          Course   Project   Student   Faculty   Conference
    URL            19        7        16        11         10
    Title          10        8         4         3          8
    Header         42       14        22        30         21
    Anchor         77       29        57        59        122
    Bold           33        9        19        31         45
    Text          457      240       291       384        488
    List item     119       17        41        85        137
    Total         757      324       450       603        831
For the Course class, each ant selected 60 features in the first experiment,
40 in the second, and 10 in the final experiment. The classification results
of the Course class for the 5% document frequency value are shown in Table
4.26.
Table 4.26. Experimental Results Using Tagged Terms Method for Course Class With 5% Document Frequency

    # E        60 Features     40 Features     10 Features
      1        1.0/0.826       1.0/0.760       1.0/0.788
    125        1.0/1.0         1.0/1.0         1.0/0.746
    250        1.0/1.0         1.0/1.0         1.0/1.0
    Run Time   10:24           08:12           02:44
For the Project class, each ant selected 30 features in the first experiment,
20 in the second, and 10 in the final experiment. The classification results
of the Project class for the 5% document frequency value are shown in Table
4.27.

Table 4.27. Experimental Results Using Tagged Terms Method for Project Class With 5% Document Frequency

    # E        30 Features     20 Features     10 Features
      1        1.0/0.85524     1.0/0.87269     1.0/0.83372
    125        1.0/1.0         1.0/1.0         1.0/0.84466
    250        1.0/1.0         1.0/1.0         1.0/0.83577
    Run Time   06:42           04:05           02:20
For the Student class, each ant selected 30 features in the first experiment,
20 in the second, and 10 in the final experiment. The classification results
of the Student class for the 5% document frequency value are shown in Table
4.28.
Table 4.28. Experimental Results Using Tagged Terms Method for Student Class With 5% Document Frequency

    # E        30 Features     20 Features     10 Features
      1        1.0/0.86697     1.0/0.71351     0.856/0.810
    125        1.0/0.81951     1.0/1.0         1.0/0.31080
    250        1.0/1.0         1.0/1.0         1.0/1.0
    Run Time   07:41           07:04           05:02
For the Faculty class, each ant selected 50 features in the first experiment,
30 in the second, and 10 in the final experiment. The classification results
of the Faculty class for the 5% document frequency value are shown in Table
4.29.

Table 4.29. Experimental Results Using Tagged Terms Method for Faculty Class With 5% Document Frequency

    # E        50 Features     30 Features     10 Features
      1        1.0/0.81510     1.0/0.84927     0.961/0.915
    125        1.0/1.0         1.0/1.0         1.0/1.0
    250        1.0/1.0         1.0/1.0         1.0/1.0
    Run Time   09:47           06:39           03:11
For the Conference class, each ant selected 70 features in the first
experiment, 50 in the second, and 10 in the final experiment. The
classification results of the Conference class for the 5% document frequency
value are shown in Table 4.30.

Table 4.30. Experimental Results Using Tagged Terms Method for Conference Class With 5% Document Frequency

    # E        70 Features     50 Features     10 Features
      1        0.998/0.992     0.998/0.998     0.998/0.898
    125        0.998/0.998     0.998/0.998     0.998/0.998
    250        0.998/0.998     0.998/0.998     0.998/0.998
    Run Time   08:26           06:32           04:49
4.4.2. Classification Experiments With Tagged Terms Method in 10%
Document Frequency Value
Classification results for the 10% document frequency value are discussed in
this section. The same numbers of features were used for all document
frequency values, so each ant was again run with three different numbers of
features. The number of features for each tag is presented in Table 4.31.
Table 4.31. Number of Features for Each Tag With 10% Document Frequency Value

    Tag          Course   Project   Student   Faculty   Conference
    URL             9        7         8         7          6
    Title           5        3         3         3          3
    Header         18        4         9         9          8
    Anchor         34       12        19        19         50
    Bold            8        0         3         7         13
    Text          215       87       118       192        243
    List item      37        2         9        22         47
    Total         326      115       169       259        370
Classification results of the Course class for the 10% document frequency
value with respect to the specified number of features are shown in Table 4.32.

Table 4.32. Experimental Results Using Tagged Terms Method for Course Class With 10% Document Frequency

    # E        60 Features     40 Features     10 Features
      1        1.0/0.853       1.0/0.854       1.0/0.547
    125        1.0/1.0         1.0/1.0         1.0/1.0
    250        1.0/1.0         1.0/1.0         1.0/1.0
    Run Time   10:38           08:08           02:36
Classification results of the Project class for the 10% document frequency
value with respect to the number of features are shown in Table 4.33.
Table 4.33. Experimental Results Using Tagged Terms Method for Project Class With 10% Document Frequency

    # E        30 Features     20 Features     10 Features
      1        1.0/0.885       1.0/0.900       0.956/0.854
    125        1.0/1.0         1.0/1.0         1.0/0.702
    250        1.0/1.0         1.0/1.0         1.0/0.729
    Run Time   05:46           04:13           03:45
Classification results of the Student class for the 10% document frequency
value with respect to the number of features are shown in Table 4.34.

Table 4.34. Experimental Results Using Tagged Terms Method for Student Class With 10% Document Frequency

    # E        30 Features     20 Features     10 Features
      1        1.0/0.738       1.0/0.556       1.0/0.844
    125        1.0/0.785       1.0/1.0         1.0/1.0
    250        1.0/1.0         1.0/0.682       1.0/0.831
    Run Time   06:41           04:28           03:33
Classification results of the Faculty class for the 10% document frequency
value with respect to the number of features are shown in Table 4.35.

Table 4.35. Experimental Results Using Tagged Terms Method for Faculty Class With 10% Document Frequency

    # E        50 Features     30 Features     10 Features
      1        1.0/0.841       1.0/0.864       0.925/0.715
    125        1.0/1.0         1.0/0.916       1.0/1.0
    250        1.0/1.0         1.0/0.932       1.0/0.942
    Run Time   09:29           06:08           02:57
Classification results of the Conference class for the 10% document frequency
value with respect to the number of features are shown in Table 4.36.
Table 4.36. Experimental Results Using Tagged Terms Method for Conference Class With 10% Document Frequency

    # E        70 Features     50 Features     10 Features
      1        0.998/0.998     0.998/0.998     0.998/0.905
    125        0.998/0.998     0.998/0.998     0.998/0.998
    250        0.998/0.998     0.998/0.998     0.998/0.998
    Run Time   07:53           06:05           04:57
4.4.3. Classification Experiments With Tagged Terms Method in 15%
Document Frequency Value
Classification results for the 15% document frequency value are discussed in
this section. Each ant was again run with three different numbers of features.
The number of features for each tag is presented in Table 4.37.
Table 4.37. Number of Features for Each Tag With 15% Document Frequency Value

    Tag          Course   Project   Student   Faculty   Conference
    URL             7        6         6         8          4
    Title           4        0         3         3          2
    Header          8        2         5         2          3
    Anchor         21        7        11        12         29
    Bold            4        0         2         4          2
    Text          129       51        67       105        137
    List item      20        0         4         6         24
    Total         193       66        98       140        201
Classification results of the Course class for the 15% document frequency
value with respect to the specified number of features are shown in Table 4.38.
Table 4.38. Experimental Results Using Tagged Terms Method for Course Class With 15% Document Frequency

    # E        60 Features     40 Features     10 Features
      1        1.0/1.0         1.0/0.829       1.0/0.605
    125        1.0/1.0         1.0/0.883       1.0/1.0
    250        1.0/1.0         1.0/1.0         1.0/0.753
    Run Time   11:02           08:47           02:38
Classification results of the Project class for the 15% document frequency
value with respect to the number of features are shown in Table 4.39.

Table 4.39. Experimental Results Using Tagged Terms Method for Project Class With 15% Document Frequency

    # E        30 Features     20 Features     10 Features
      1        1.0/0.890       1.0/0.826       1.0/0.953
    125        1.0/1.0         1.0/1.0         1.0/0.845
    250        1.0/1.0         1.0/1.0         1.0/0.939
    Run Time   05:42           04:03           03:17
Classification results of the Student class for the 15% document frequency
value with respect to the number of features are shown in Table 4.40.

Table 4.40. Experimental Results Using Tagged Terms Method for Student Class With 15% Document Frequency

    # E        30 Features     20 Features     10 Features
      1        1.0/0.881       1.0/0.881       1.0/0.834
    125        1.0/0.746       1.0/0.746       1.0/1.0
    250        1.0/0.748       1.0/0.748       1.0/1.0
    Run Time   06:21           04:10           02:32
Classification results of the Faculty class for the 15% document frequency
value with respect to the number of features are shown in Table 4.41.
Table 4.41. Experimental Results Using Tagged Terms Method for Faculty Class With 15% Document Frequency

    # E        50 Features     30 Features     10 Features
      1        1.0/1.0         1.0/0.848       1.0/0.869
    125        1.0/1.0         1.0/1.0         1.0/1.0
    250        1.0/1.0         1.0/1.0         1.0/0.853
    Run Time   10:33           06:00           03:11
Classification results of the Conference class for the 15% document frequency
value with respect to the number of features are shown in Table 4.42.

Table 4.42. Experimental Results Using Tagged Terms Method for Conference Class With 15% Document Frequency

    # E        70 Features     50 Features     10 Features
      1        0.998/0.998     0.998/0.998     0.998/0.964
    125        0.998/0.951     0.998/0.998     0.998/0.998
    250        0.998/0.998     0.998/0.998     0.998/0.998
    Run Time   07:49           06:02           05:00
According to the obtained experimental results, the 15% document frequency
value yielded the best classification performance with the maximum number of
features. The number of features also affected the run time of the algorithm:
run time increased with the number of selected features. The tagged terms
method performed better than the other three methods. There was no noticeable
difference among the WebKB dataset classes; their classification performances
were similar. The Conference dataset, however, differed from the others: its
classification performance was lower because of the content of its Web pages,
which do not include clear information about the class of the page. When we
analyzed the best arff files belonging to the tagged terms method (i.e.,
those with the maximum F-measure value), the most popular tag was the <p> tag
(i.e., the text tag). For example, in the Faculty class, 38 of the 50 selected
features belong to the <p> tag, 5 to the <a> tag, 2 to the <h1> tag, and 5 to
the URL. The distribution of the selected features with respect to tags is
shown in Table 4.43 for the Faculty class
with the 15% document frequency value. We can say that the text tag is more
informative than the other tags.
Table 4.43. Distribution of Selected Features With Respect to Tags for the Faculty Class When 15% Document Frequency is Applied
# of Features URL <title> <h1> <a> <b> <p> <li>
10 2 0 0 0 0 8 0
30 5 0 0 1 0 23 1
50 5 0 2 5 0 38 0
According to our arff file analysis for the 15% document frequency value with
the maximum number of selected features, in the Conference class 52 of the 70
selected features belong to the <p> tag, 12 to the <a> tag, 1 to the <h1> tag,
1 to the <title> tag, 2 to the <b> tag, and 2 to the URL. In the Project
class, 20 of the 30 selected features belong to the <p> tag, 4 to the <a> tag,
and 6 to the URL. In the Student class, 20 of the 30 selected features belong
to the text tag, 5 to the <a> tag, 2 to the <title> tag, and 3 to the URL. In
the Course class, 38 of the 60 selected features belong to the text tag, 8 to
the <a> tag, 4 to the <h1> tag, 1 to the <title> tag, 3 to the <b> tag, and 6
to the URL. The distribution of the selected features with respect to tags for
the best cases can be seen in Table 4.44. We can say that the text content of
Web pages is more informative than the other tags.
Table 4.44. Distribution of the Selected Features With Respect to Tags for the Best Cases
Class # of Features URL <title> <h1> <a> <b> <p>
Conference 70 2 1 1 12 2 52
Project 30 6 - - 4 - 20
Student 30 3 2 - 5 - 20
Course 60 6 1 4 8 3 38
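The per-tag counts above come from inspecting the attributes of the best arff files. A hypothetical sketch of such an analysis is given below, assuming attribute names carry a tag prefix such as p_deadline; the actual naming scheme in the thesis's arff files may differ.

```python
# Hypothetical sketch: count selected .arff attributes per tag, assuming
# attribute names are "<tag>_<term>". The naming convention is an assumption.
from collections import Counter

def tag_distribution(arff_lines):
    """Return a Counter mapping tag prefix -> number of attributes."""
    counts = Counter()
    for line in arff_lines:
        if line.lower().startswith("@attribute"):
            name = line.split()[1]
            if name != "class":               # skip the class attribute itself
                counts[name.split("_", 1)[0]] += 1
    return counts

sample = ["@attribute p_deadline numeric",
          "@attribute p_paper numeric",
          "@attribute a_submission numeric",
          "@attribute class {yes,no}"]
print(dict(tag_distribution(sample)))  # {'p': 2, 'a': 1}
```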
4.5. Comparison With C4.5
In this section, the performance of the proposed ACO feature selection
algorithm is compared with the pure C4.5 classifier. For this purpose, the
F-measure values of the C4.5 classifier with and without the proposed
ACO-based feature selection are computed and compared. This comparison is made
for the Conference dataset only, because lower F-measure values were obtained
with this dataset in our previous experiments. As future work we plan to
repeat this experiment for the WebKB dataset.

The results of this experiment can be seen in Table 4.45. In this experiment,
the whole feature set of the Conference dataset is used for the URL and
<title> tag methods, while for the tagged terms and bag of terms methods the
5% document frequency feature selection method is used to reduce the feature
set so that Weka can work with it. As seen in Table 4.45, the proposed ACO
feature selection algorithm improves classification performance for the
tagged terms, bag of terms, and <title> tag methods. The ACO feature selection
algorithm also reduced the run time of classification, in proportion to the
number of selected features.
Table 4.45. Comparison of the Proposed ACO Feature Selection Algorithm with C4.5

    Method                          Metric       With ACO            Without ACO
                                                 Feature Selection   Feature Selection
    Tagged Terms, 5% Doc. Freq.     F-measure    0.998               0.998
                                    Run Time     0.22 sec            1.42 sec
    Bag of Terms, 5% Doc. Freq.     F-measure    0.994               0.991
                                    Run Time     0.27 sec            1.23 sec
    URL                             F-measure    0.835               0.857
                                    Run Time     0.63 sec            10.57 sec
    <title>                         F-measure    0.741               0.715
                                    Run Time     0.5 sec             7.73 sec
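The F-measure compared above is, presumably, the standard balanced F-measure, i.e. the harmonic mean of precision and recall; a one-line sketch:

```python
# Balanced F-measure from a confusion matrix's true positives (tp),
# false positives (fp), and false negatives (fn).
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(tp=90, fp=10, fn=10), 3))  # 0.9
```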
4.6. Comparison of the Proposed Method With Earlier Studies
In this section, the proposed method is compared with earlier studies.

Kan and Thi (2005) used URL features: they derived features from the URL with
sequential n-grams, and classified the selected features with Maximum Entropy
and Support Vector Machine classifiers separately. Their average F-measure
value is reported as 0.525 for multi-class classification. Our average
F-measure value was 1.0 for the WebKB dataset with the URL-only method in
binary classification. Based on these results, we can say that binary
classification is more suitable than multi-class classification for the WebKB
dataset, and that our proposed ACO-based algorithm has better classification
performance than Kan and Thi (2005)’s method.
In Özel (2010)’s study, tagged terms features are used with a GA-based
classifier; URL addresses are not used in the feature extraction step. The
average F-measure value is reported as 0.9 for the Course class and 0.7 for
the Student class of the WebKB dataset. With our proposed method, the average
F-measure value is 1.0 for both the Course and the Student classes. This
comparison shows that URL features affect classification performance
positively.
Jiang (2010) proposed a text classification algorithm that combines a k-means
clustering scheme with an Expectation Maximization (EM) variation; it can
learn from a very small number of labeled samples and a large quantity of
unlabeled data. Jiang (2010)’s experimental results show an average F-measure
value of 0.7 for the WebKB dataset in multi-class classification. This result
shows that our ACO-based algorithm performs better than the k-means based
algorithm.
Joachims (1999) used Transductive Support Vector Machines on the WebKB dataset
with binary classification, using the bag of terms method. According to the
experimental results of that study, average F-measure values are reported as
93.8%, 53.7%, 18.4%, and 83.8% for the Course, the
Faculty, the Project, and the Student classes, respectively. These results
also show that the proposed ACO-based algorithm performs better than the
transductive SVM algorithm.
5. CONCLUSION Esra SARAÇ
63
5. CONCLUSION
In this thesis we have developed an ACO-based Web page classification system
which uses HTML tag and term pairs as classification features. In our system,
ants learn the optimal features through the ACO, and experimental evaluation
shows that using tagged terms as features increases classification
performance with respect to using bag of terms, the URL alone, or the <title>
tag alone. In addition to the tags of the features, the document frequency
value is important for classification performance; experimental evaluation
shows that a 15% document frequency value is acceptable. The proposed system
is effective in reducing the number of features, so it is suitable for
datasets with large numbers of features.
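The overall system can be summarized in a highly simplified sketch: each ant builds a fixed-size feature subset guided by pheromone, the subset is scored by a classifier (C4.5 F-measure in the thesis; a toy scorer below), and pheromone is reinforced on the best subset found. All parameter values and names here are illustrative assumptions, not those of the thesis implementation.

```python
# Highly simplified ACO feature-selection loop. The roulette-wheel choice,
# pheromone update rule, and all parameter defaults are illustrative.
import random

def aco_select(n_features, subset_size, score, n_ants=10, n_iters=20,
               evaporation=0.1, seed=0):
    rng = random.Random(seed)
    pheromone = [1.0] * n_features
    best, best_score = set(), float("-inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            subset = set()
            while len(subset) < subset_size:  # roulette-wheel selection
                total = sum(pheromone)
                r, acc = rng.uniform(0, total), 0.0
                for i, p in enumerate(pheromone):
                    acc += p
                    if acc >= r:
                        subset.add(i)
                        break
            s = score(subset)                 # classifier F-measure in the thesis
            if s > best_score:
                best, best_score = subset, s
        # evaporate everywhere, reinforce features of the best subset so far
        pheromone = [(1 - evaporation) * p + (1.0 if i in best else 0.0)
                     for i, p in enumerate(pheromone)]
    return best, best_score

# Toy scorer: features 0-4 are the "meaningful" ones
subset, s = aco_select(20, 5, lambda sub: len(sub & set(range(5))))
```

With the toy scorer, pheromone gradually concentrates on the first five features; in the thesis the scorer is the F-measure of a C4.5 classifier trained on the selected features.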
As future work, we plan to study the effect of tag weights on the accuracy of
our ACO-based classifier system in more detail. Tags can be weighted with
respect to their importance, which may improve the performance of the
classifier.
REFERENCES
AGHDAM, M. H., GHASEM-AGHAEE, N., and BASIRI, M. E., 2009. Text
Feature Selection Using Ant Colony Optimization. Expert Systems with
Applications 36: 6843-6853.
BÄCK, T., 1996. Evolutionary Algorithms in Theory and Practice. Oxford
University Press. New York, USA. 328 p.
BAEZA-YATES, R., and RIBEIRO-NETO, B., 1999. Modern Information
Retrieval. Addison-Wesley ACM Press. Harlow, England. 513 p.
BAYKAN, E., HENZINGER, M., MARIAN, L., and WEBER, I., 2009. Purely
URL-based Topic Classification. International World Wide Web
Conference . Madrid, Spain. 1109-1110.
BONABEAU, E., DORIGO, M., and THERAULAZ, G., 1999. Swarm Intelligence:
From Natural To Artificial Systems. First Edition. Oxford University
Press. New York, USA. 320 p.
BOUGHANEM, M., CHRISMENT, C., and TAMINE, L., 1999. Genetic Approach
to Query Space Exploration. Journal of Information Retrieval 1(3): 175-
192.
BLUM, A., and MITCHELL, T., 1998. Combining Labeled and Unlabeled Data
With Co-training. In COLT’ 98: Proceedings of the 11th Annual
Conference on Computational Learning Theory, New York, USA. 92-
100.
CHAKRABARTI, S., 2002. Mining the Web: Discovering Knowledge from
hypertext data. First Edition. Morgan Kaufmann Press. San Francisco,
USA. 344 p.
CHAKRABARTI, S., VAN DEN BERG, M., and DOM, B., 1999. Focused
Crawling: A New Approach to Topic-specific Web Resource Discovery.
Computer Networks, 31(11-16): 1623-1640.
CHEKURI, C., GOLDWASSER, M., RAGHAVAN, P., and UPFAL, E., 1997. Web
Search Using Automated Classification. In Proceedings of the Sixth
International World Wide Web Conference, Santa Clara, CA. Poster
POS725.
CHEN, H., and KIM, J., 1995. GANNET: A Machine Learning Approach to
Document Retrieval. Journal of Management Information Systems -
Special section: Information technology and IT organizational impact.
New York, USA. 11(3): 7-41.
COVER, T. M., and THOMAS, J. A., 1991. Elements of Information Theory. First
    Edition. Wiley-Interscience Press. 542 p.
DBLP Web Site, http://www.informatik.uni-trier.de/~ley/db/
DORIGO, M., 1992. Optimization, Learning and Natural Algorithms. Ph.D.Thesis,
Politecnico di Milano, Italy.
DORIGO, M., DI CARO, G., and GAMBARDELLA, L. M., 1999. Ant Algorithms
for Discrete Optimization. Artificial Life, 5(2):137-172.
DORIGO, M., MANIEZZO, V., and COLORNI, A., 1991. Positive Feedback as a
Search Strategy. Technical Report No. 91-016. Politecnico di Milano,
Italy.
DORIGO, M., MANIEZZO, V., and COLORNI, A., 1996. Ant System:
Optimization by A Colony Of Cooperating Agents. IEEE Transaction on
Systems, Man, and Cybernetics-Part B, 26(1): 29-41.
GHANI, R., 2001. Combining Labeled And Unlabeled Data For Text Classification
With A Large Number of Categories. In First IEEE International
Conference on Data Mining (ICDM), Los Alamitos, CA. 597.
GHANI, R., 2002. Combining Labeled And Unlabeled Data For Multiclass Text
Categorization. In ICML ’02: Proceedings of the 19th International
Conference on Machine Learning, San Francisco, CA. 187-194.
GOOGLE, www.google.com.
GORDON, M., 1988. Probabilistic and Genetic Algorithms in Document Retrieval.
Communications of the ACM. 31(10): 1208-1218.
GUIASU, S., 1977. Information Theory with Applications. First Edition. McGraw-
Hill Press. New York, USA. 439 p.
HAN, J., and KAMBER, M., 2006. Data Mining: Concepts and Techniques. Second
    Edition. Morgan Kaufmann Publishers. 550 p.
HAVELIWALA, T., KAMVAR, S., and JEH, G., 2003. An analytical comparison of
approaches to personalizing PageRank. Stanford University technical
report. Available at:
http://infolab.stanford.edu/~taherh/papers/comparison.pdf.
HAYKIN, S., 1999. Neural networks - A Comprehensive Foundation. Second
Edition. Prentice Hall. 842 p.
HOLDEN, N., and FREITAS, A. A., 2004. Web Page Classification With An Ant
Colony Algorithm. Parallel Problem Solving from Nature, 8 (LNCS
3242): 1092-1102.
HUANG, C. C., CHUANG, S. L., and CHIEN, L. F., 2004. Liveclassifier: Creating
Hierarchical Text Classifiers Through Web Corpora. In WWW ’04:
Proceedings of the 13th International Conference on World Wide Web,
New York, USA. 184-192.
JIANG, E. P., 2010. Learning to Integrate Unlabeled Data in Text Classification.
Computer Science and Information Technology (ICCSIT), 3rd IEEE
International Conference on. 9: 82-86.
JOACHIMS, T., 1999. Transductive Inference for Text Classification using Support
Vector Machines. Proceedings of the 16th International Conference on
Machine Learning. 200-209.
KAN, M.-Y., 2004. Web Page Classification Without The Web Page. In WWW Alt.
’04: Proceedings of the 13th International World Wide Web Conference
Alternate Track Papers & Posters, New York, USA. 262-263.
KAN, M.-Y., and THI, H. O. N., 2005. Fast Webpage Classification Using URL
Features. In Proceedings of the 14th ACM International Conference on
Information and Knowledge Management (CIKM ’05). New York, USA.
325-326.
KIM, S., and ZHANG, B.T., 2003. Genetic Mining Of HTML Structures For
Effective Web Document Retrieval. Applied Intelligence. 18: 243-256.
KWON, O.-W., and LEE, J.-H., 2000. Web Page Classification Based on k-Nearest
Neighbor Approach. In IRAL ’00: Proceedings of the 5th International
Workshop on Information Retrieval with Asian languages, New York,
USA. 9-15.
KWON, O.-W., and LEE, J.-H., 2003. Text Categorization Based on k-Nearest
Neighbor Approach for Web Site Classification. Information Processing
and Management. 29(1): 25-44.
LIANGTU, S., and XIAOMING, Z., 2007. Web Text Feature Extraction with
Particle Swarm Optimization. International Journal of Computer Science
and Network Security. 7(6): 132-136.
LIU, H., and HUANG, S., 2003. A Genetic Semi-Supervised Fuzzy Clustering
Approach to Text Classification. Lecture Notes in Computer Science
2762.173-180.
MENCZER, F., and BELEW, R. K., 1998. Adaptive Information Agents in
Distributed Textual Environments. In Proc. 2nd International Conference
on Autonomous Agents, Minneapolis.
MITCHELL, T. M., 1997. Machine Learning. First Edition. McGraw-Hill. New
York. 432 p.
MLADENIC, D., BRANK, J., GROBELNIK M., and MILIC-FRAYLING, N., 2004.
Feature Selection Using Support Vector Machines. The 27th Annual
International ACM SIGIR Conference. 234-241.
ODP, Open Directory Project. Available at: http://www.dmoz.org.
ÖZEL, S. A., 2010. A Web Page Classification System Based On A Genetic
Algorithm Using Tagged-Terms As Features. Expert Systems with
Applications. doi:10.1016/j.eswa.2010.08.126.
ÖZEL, S. A., and SARAÇ, E., 2008. Focused Crawler for Finding Professional
Events Based On User Interests. In: Proceedings of the 23rd of the
International Symposium on Computer and Information Sciences ISCIS.
Istanbul, Turkey. 441-444.
PAGE, L., and BRIN, S., 1997. PageRank: Bringing Order to the Web. Available at:
http://web.archive.org/web/20020506051802/www-
diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1997-0072?1
PARPINELLI, R. S., LOPES, H. S., and FREITAS, A., 2002. An Ant Colony
Algorithm for Classification Rule Discovery. IEEE Transactions on
Evolutionary Computation. 6(4): 321-332.
PERL Programming Language, http://www.perl.org/
POGGIO, T. and GIROSI, F., 1990. Networks For Approximation And Learning.
Proc. IEEE 78(9): 1484-1487.
PORTER, M.F. ,1980. An Algorithm For Suffix Stripping. Program. 14(3):130-137.
REUTERS Dataset, http://archive.ics.uci.edu/ml/databases/reuters
RIBEIRO, A., FRESNO, V., GARCIA-ALEGRE, M.C., and GUINEA, D., 2003.
Web Page Classification: A Soft Computing Approach. Lecture Notes in
Artificial Intelligence 2663. 103-112.
QI, X., and DAVISON, B. D., 2009. Web Page Classification: Features and
Algorithms, ACM Computing Surveys 41(2): Article 12.
QUINLAN, J. R., 1993. C4.5: Programs for Machine Learning. First Edition.
Morgan Kaufmann Publishers. San Mateo, California. 302 p.
SALTON, G., 1970. Automatic Text Analysis. Science. 168: 335-343.
SALTON, G., and BUCKLEY, C., 1988. Term-weighting Approaches In Automatic
Text Retrieval. Inform. Process. Man. 24(5): 513-523.
SARAÇ, E., and ÖZEL, S. A., 2010. URL Tabanlı Web Sayfası Sınıflandırma. Akıllı
Sistemlerde Yenilikler Ve Uygulamaları Sempozyumu. 1: 13-18
SHANG, W., HUANG, H., ZHU, H., LIN, Y., QU, T., and WANG, Z., 2007. A Novel
    Feature Selection Algorithm For Text Categorization. Expert Systems
    with Applications: An International Journal. 33(1): 1-5.
SHANNON, C. E., 1948. A Mathematical Theory of Communication. Bell System
    Technical Journal. 27: 379-423, 27: 623-656.
SHEN, D., CHEN, Z., YANG, Q., ZENG, H. J., ZHANG, B., LU, Y., and MA, W.
Y., 2004. Web-Page Classification Through Summarization. In SIGIR
’04: Proceedings of the 27th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval. New
York, USA. 242-249.
STÜTZLE, T., and HOOS, H., 2000. Max-Min Ant System. Journal of Future
Generation Computer Systems. 16: 889-914.
TREC Dataset, http://trec.nist.gov/data.html
TROTMAN, A., 2005. Choosing Document Structure Weights. Information
Processing & Management 41(2): 243-264.
VAN RIJSBERGEN, C. J., 1979. Information Retrieval. Second Edition.
Butterworth-Heinemann Publishers. London, UK. 224 p.
WANG, Z., ZHANG, Q., and ZHANG, D., 2007. A PSO-based Web Document
Classification Algorithm. In Proc. the Eighth ACIS International
Conference on Software Engineering, Artificial Intelligence, Networking,
and Parallel/Distributed Computing. 659-664.
WebKB, CMU World Wide Knowledge Base (Web->KB) project. Available at:
http://www.cs.cmu.edu/~webkb/
WEKA, Data Mining Software in Java. Available at:
http://www.cs.waikato.ac.nz/~ml/weka/
WILBUR, W. J., and SIROTKIN, K., 1992. The Automatic Identification of Stop
    Words. Journal of Information Science, 18(1): 45
WIKIPEDIA, http://en.wikipedia.org/wiki/Ant_colony_optimization
WITTEN, I. H., FRANK, E., 2005. Data Mining: Practical Machine Learning Tools
and Techniques. 2nd Edition. Morgan Kaufmann, San Francisco.
YAHOO!, http://www.yahoo.com
YANG, Y., 1995. Noise Reduction in a Statistical Approach to Text Categorization.
In Proceedings of the 18th Ann Int ACM SIGIR Conference on Research
and Development in Information Retrieval. 256-263.
YANG, Y., and PEDERSEN, J. O., 1997. A Comparative Study On Feature
Selection In Text Categorization. Proc. of ICML. 412-420.
YANG, Y., WILBUR, W. J., 1996. Using Corpus Statistics to Remove Redundant
Words in Text Categorization. In J Amer Soc Inf Sci.
YU, E. S., and LIDDY, E. D., 1999. Feature Selection in Text Categorization Using
the Baldwin Effect. Proceedings of International Joint Conference on
Neural Networks. Washington DC.
YU, H., HAN, J,. and CHANG, K. C.-C., 2004. PEBL: Web Page Classification
Without Negative Examples. IEEE Transactions on Knowledge and Data
Engineering. 16 (1): 70-81.
CURRICULUM VITAE
Esra Saraç was born in İskenderun in 1986. She completed her elementary
education at İskenderun Demirçelik Primary School and her secondary education
at İskenderun Demirçelik Anatolian High School, after which she earned a place
at the Niğde Science High School. She completed her undergraduate education at
the Department of Computer Engineering of Çukurova University in 2008. Since
2008, she has been working as a research assistant at the Computer Engineering
Department of Çukurova University in Adana.