7/28/2019 Investigation of Feature Selection Methods in Web Page Classification using Data Mining Techniques in C++
PROJECT REPORT
ON
Investigation of Feature Selection methods for Web Page Classification
BY
Aayush Gupta 2008A4PS004U
FOR
FINAL YEAR PROJECT
COMPUTER PROJECT (BITSC331)
BITS Pilani, Dubai Campus
Dubai International Academic City (DIAC)
Dubai, U.A.E
FIRST SEMESTER 2011-2012
BITS Pilani, Dubai Campus
Dubai International Academic City (DIAC)
Dubai, U.A.E
Duration: 6th Sept 2012 to 27th Dec 2012
Date of Start: 06/09/2012
Date of Submission: ___________
Title of the Project: Investigation of Feature Selection methods for Web Page
classification
Name of Student:
Aayush Gupta 2008A4PS004U
Discipline of Student: Mechanical Engineering
Name of the Faculty: Ms. J. Alamelu Mangai
Abstract:
Since the Internet provides billions of web pages for every search word, getting
suitable and relevant results from it quickly becomes very difficult. Automatic
classification of web pages into required categories is a current research
subject, which allows the search engine to retrieve essential results. As web pages
contain many extraneous, infrequent words that reduce the effectiveness of the
classifier, mining or selecting characteristic features from the web page is an
essential pre-processing step. This project analyzes various feature selection
methods and applies them to the representation of web pages in order to improve
categorization accuracy. We process data collected from WebKB, which is sourced from
more than 100 webpages of different universities, and we use these processed files to
find the efficiency of various feature selection methods through the WEKA tool; we also
classify them with a decision tree to further analyze our results and find the optimal
algorithm. Our experiments show the usefulness of dimensionality reduction and of a
new, structure-oriented weighting technique. The objective of this report is to identify the
applications of web page classification, to understand the various challenges of WPC
algorithms, and to understand, survey and analyze the various feature selection
methods used for web page classification.
TABLE OF CONTENTS
Abstract
Table of Contents
List of Figures/Tables
Chapter 1 INTRODUCTION
1.1 DATA MINING: BASICS
1.2 WEB PAGE CLASSIFICATION: BASICS
1.2.1 APPLICATIONS OF WEB CLASSIFICATION
1.3 DISCUSSION: FEATURES
1.3.1 USING ON-PAGE FEATURES
1.3.2 USING NEIGHBOR FEATURES
1.4 DISCUSSION: DATA MINING IN WEB PAGE CLASSIFICATION
1.5 DISCUSSION: FEATURE SELECTION IN DATA MINING
1.6 RELATED WORK
Chapter 2 LITERATURE SURVEY
Chapter 3 METHODOLOGIES
2.1 PROBLEM DESCRIPTION
2.2 TOOLS USED FOR RESEARCH
2.3 DATA EXTRACTION
Chapter 4 PROPOSED WORK / IMPORTANT CONCEPTS
3.1 ALGORITHMS USED FOR WEB PAGE CLASSIFIER
3.2 IMPLEMENTED CLASSIFIERS
3.2.1 NAÏVE BAYES
3.2.2 K-STAR
3.2.3 J48
Chapter 5 RESULTS AND DISCUSSION
4.1 INITIAL FEATURE SELECTION
4.2 FEATURE SELECTION, F2
4.3 FEATURE SELECTION, F3
4.4 CLASSIFICATION
Chapter 6 CONCLUSION
Chapter 7 LITERATURE SURVEY / RESOURCES
CHAPTER 1 INTRODUCTION
1.1 DATA MINING: BASICS
Data Mining (the analysis step of the Knowledge Discovery in Databases process, or
KDD), a relatively young and interdisciplinary field of computer science, is the process
of discovering new patterns from large data sets involving methods
from statistics and artificial intelligence but also database management. In contrast
to machine learning, the emphasis lies on the discovery of previously unknown patterns
as opposed to generalizing known patterns to new data.[1]
According to the Gartner Group, "Data mining is the process of discovering meaningful
new correlations, patterns and trends by sifting through large amounts of data stored in
repositories, using pattern recognition technologies as well as statistical and
mathematical techniques." [1]
Data mining is predicted to be "one of the most revolutionary developments of the next
decade," according to the online technology magazine ZDNET News. In fact, the MIT
Technology Review chose data mining as one of 10 emerging technologies that will
change the world. "Data mining expertise is the most sought after . . . among
information technology professionals," according to the 1999 Information Week National
Salary Survey [9]. The survey reports: "Data mining skills are in high demand this year,
as organizations increasingly put data repositories online. Effectively analyzing
information from customers, partners, and suppliers has become important to more
companies. 'Many companies have implemented a data warehouse strategy and are
now starting to look at what they can do with all that data,' says Dudley Brown,
managing partner of BridgeGate LLC, a recruiting firm in Irvine, Calif." [3]
The following list shows the most common data mining tasks.[3]
1. Description - Sometimes, researchers and analysts are simply trying to find
ways to describe the patterns and trends lying within data. Descriptions of patterns and
trends often suggest possible explanations for them. For example, those who have been
laid off may now be less well off financially than before the incumbent was elected, and
so would tend to prefer an alternative.
2. Estimation - Estimation is similar to classification except that the
target variable is numerical rather than categorical. Models are built using
complete records, which provide the value of the target variable as well as the
predictors. Then, for new observations, approximations of the value of the target
variable are made, based on the values of the predictors. For example, we
might be interested in estimating the systolic blood pressure reading of a
hospital patient, based on the patient's age, gender, body-mass index, and
blood sodium levels.
3. Prediction - Prediction is similar to classification and estimation, except
that for prediction, the results lie in the future. Any of the methods and techniques
used for classification and estimation may also be used, under appropriate
circumstances, for prediction. These include the traditional statistical methods of
point estimation and confidence interval estimation, simple linear regression
and correlation, and multiple regression.
4. Classification - In classification, there is a target categorical variable, such
as income bracket, which, for example, could be divided into three classes or
categories: high income, middle income, and low income. The data-mining model
inspects a large set of records, each record containing information on the target variable
as well as a set of input or predictor variables.
5. Clustering - Clustering refers to the grouping of records, observations, or cases
into classes of similar objects. A cluster is a collection of records that are similar to one
another, and dissimilar to records in other clusters. Clustering differs from classification
in that there is no target variable for clustering. The clustering task does not try to
classify, estimate, or predict the value of a target variable. Instead, clustering algorithms
seek to segment the entire data set into relatively homogeneous subgroups or clusters,
where the similarity of the records within the cluster is maximized and the similarity to
records outside the cluster is minimized.
6. Association - The association task for data mining is the job of finding
which attributes go together. Most prevalent in the business world, where it is
known as affinity analysis or market basket analysis, the task of association
seeks to uncover rules for quantifying the relationship between two or more
attributes. Association rules are of the form "If antecedent, then consequent,"
together with a measure of the support and confidence associated with the rule. [9]
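The support and confidence measures mentioned above can be computed directly from a set of transactions. Below is a minimal illustrative sketch in Python (the transactions and item names are made up; the report's own experiments use WEKA, not this code):

```python
# Compute support and confidence for a rule "if antecedent then consequent"
# over a toy set of market-basket transactions (illustrative data only).

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: if {bread} then {milk}
print(support({"bread", "milk"}, transactions))      # 0.5  (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}, transactions)) # 2/3  (2 of the 3 bread transactions)
```

Here the rule holds in 2 of 4 transactions (support 0.5), and in 2 of the 3 transactions containing the antecedent (confidence ≈ 0.67).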
1.2 WEB PAGE CLASSIFICATION: BASICS
Webpage classification or webpage categorization is the process of assigning a
webpage to one or more category labels, e.g. News, Sport, Business.
Classification of web page content is essential to many tasks in web information
retrieval such as maintaining web directories and focused crawling. The uncontrolled
nature of web content presents additional challenges to web page classification as
compared to traditional text classification, but the interconnected nature of hypertext
also provides features that can assist the process.[4]
Classification plays a fundamental role in many information management
and retrieval tasks. On the Web, classification of page content is
important to focused crawling, to the supported development of web
directories, to topic-specific web link analysis, and to analysis of the
topical structure of the Web. Web page classification can also help
improve the quality of web search.
The general problem of webpage classification can be divided into:
Subject classification: the subject or topic of the webpage, e.g. Adult, Sport, Business.
Function classification: the role that the webpage plays, e.g. Personal homepage,
Course page, and Admission page.
Based on the number of classes, webpage classification can be divided into:
Binary classification
Multi-class classification
Based on the number of classes that can be assigned to an instance, classification
can be separated into single-label classification and multi-label classification. [5]
1.2.1 Applications of web classification
1) Constructing, maintaining or expanding web directories (web hierarchies)
Web directories, such as those provided by Yahoo! (2007) and the dmoz Open
Directory Project (ODP) (2007), provide an efficient way to browse for information within
a predefined set of categories. Currently, these directories are mainly constructed and
maintained by editors, requiring extensive human effort. As of July 2006, it was reported
(Corporation 2007) that there are 73,354 editors involved in the dmoz ODP. As the Web
changes and continues to grow, this manual approach will become less effective. One
could easily imagine building classifiers to help update and expand such directories. For
example, Huang et al. (2004a, 2004b) propose an approach to automatic creation of
classifiers from web corpora based on user-defined hierarchies. Furthermore,
with advanced classification techniques, customized (or even dynamic)
views of web directories can be generated automatically. [7]
2) Improving quality of search results
Query ambiguity is among the problems that undermine the quality of
search results. For example, the query term "bank" could mean the border of a
water area or a financial establishment. Numerous approaches
have been proposed to improve retrieval quality by disambiguating query
terms. Chekuri et al. (1997) studied automatic web page
classification in order to increase the precision of web search. A statistical
classifier, trained on existing web directories, is applied to new web pages
and creates an ordered list of categories in which the web page could be
placed. At query time the user is asked to specify one or more desired
categories so that only the results in those categories are returned, or the search
engine returns a list of categories under which the pages would fall. This method
works when the user is looking for a known item. In such a case, it is not difficult to
specify the favored categories. However, there are circumstances in which the
user is less positive about what documents will match, for which the above
method does not help much.
Search results are usually presented in a ranked list. However, presenting categorized,
or clustered, results could be more useful to users. An approach proposed by Chen and
Dumais (2000) classifies search results into a predefined hierarchical structure and
presents the categorized view of the results to the user. Their user study demonstrated
that the category interface is liked by the users better than the result list interface, and is
more effective for users to find the desired information. Compared to the approach
suggested by Chekuri et al., this approach is less efficient at query time because it
categorizes web pages on the fly. However, it does not require the user to specify
desired categories; therefore, it is more helpful when the user does not know the query
terms well. Similarly, Kaki (2005) also proposed to present a categorized aview of
asearch aresults to ausers. aExperiments ashowed that the acategorized view is
abeneficial for the ausers, aespecially when the aranking ofaresults is not asatisfying.
In 1998, Page and Brin developed the link-based ranking algorithm called PageRank
(1998). PageRank calculates the authoritativeness of web pages based on a graph
constructed by web pages and their hyperlinks, without considering the topic of each
page. Since then, much research has explored how to differentiate authorities of
different topics. Haveliwala (2002) proposed Topic-sensitive PageRank, which performs
multiple PageRank calculations, one for each topic. When computing the PageRank
score for each category, the random surfer jumps to a page in that category at random
rather than just any web page. This has the effect of biasing the PageRank to that topic.
This approach needs a set of pages that are accurately classified. Nie et al. (2006)
proposed another web ranking algorithm that considers the topics of web pages. In that
work, the contribution that each category has to the authority of web pages is
distinguished by means of soft classification, in which a probability distribution is given
for a web page being in each category. In order to answer the question to what
granularity of topic the computation of biased page ranks makes sense, Kohlschutter et
al. (2007) conducted analysis on ODP categories, and showed that ranking
performance increases with the ODP level up to a certain point. It seems
further research along this direction is quite promising. [7]
4) Building efficient focused crawlers or vertical (domain-specific) search engines
When only domain-specific queries are expected, performing a full crawl is usually
inefficient. Chakrabarti et al. (Chakrabarti et al. 1999) proposed an approach called
focused crawling, in which only documents relevant to a predefined set of topics are of
interest. In this approach, a classifier is used to evaluate the relevance of a web page to
the given topics so as to provide evidence for the crawl boundary.
5) Other applications
Besides the applications discussed above, web page classification is also useful in web
content filtering (Hammami, Chahir, and Chen 2003; Chen, Wu, Zhu, and Hu 2006),
assisted web browsing (Armstrong, Freitag, Joachims, and Mitchell 1995; Pazzani,
Muramatsu, and Billsus 1996; Joachims, Freitag, and Mitchell 1997) and in knowledge
base construction (Craven, DiPasquo, Freitag, McCallum, Mitchell, Nigam, and Slattery
1998).
1.3 DISCUSSION: FEATURES
Feature selection has been a dynamic research area in pattern recognition,
statistics, and data mining communities. The main idea of feature selection is to
choose a subset of input variables by excluding features with little or no
predictive information. Feature selection can significantly improve the clarity of
the resulting classifier models and often builds a model that generalizes better
to unseen points. Further, it is often the case that finding the correct subset of
predictive features is an important problem in its own right. For example, a
physician may make a decision based on the selected features whether a
dangerous surgery is necessary for treatment or not. [5]
In this section, we review the types of features found to be useful in web page
classification research.
Written in HTML, web pages contain additional information, such as HTML tags,
hyperlinks and anchor text (the text to be clicked on to activate and follow a hyperlink to
another web page, placed between HTML <a> and </a> tags), other than the textual
content visible in a web browser. These features can be divided into two broad classes:
on-page features, which are directly located on the page to be classified, and features
of neighbors, which are found on the pages related in some way with the page to be
classified. [7]
1.3.1 Using on-page features
a) Textual content and tags - Directly located on the page, the textual content is the
most forthright feature that one may consider to use. However, due to the variety of
unrestrained noise in web pages, directly using a bag-of-words representation for all
terms may not achieve top performance.
One obvious feature that appears in HTML documents but not in plain text documents
is HTML tags. It has been demonstrated that using information derived from tags can
boost the classifier's performance. Golub and Ardo (2005) derived significance
indicators for textual content in different tags. In their work, four elements from the web
page are used: title, headings, metadata, and main text. They showed that the best
result is achieved from a well-tuned linear amalgamation of the four elements.
Thus, utilizing tags can take benefit of the structural information embedded in the
HTML files, which is usually ignored by plain text methods. However, since most HTML
tags are leaning toward representation rather than semantics, web page authors may
generate different but theoretically corresponding tag structures. Therefore, using HTML
tagging information in web classification may suffer from the unpredictable formation of
HTML documents. [7]
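The idea of weighting term occurrences by the tag they appear in, as in Golub and Ardo's linear combination of title, headings, metadata and main text, can be sketched as follows. The tag weights below are arbitrary placeholders, not the tuned values from their work, and the page content is made up:

```python
# Build a tag-weighted bag-of-words: a term counts more when it appears in a
# structurally important element (title > headings > metadata > main text).
# The weights are illustrative placeholders, not tuned values.
from collections import Counter

TAG_WEIGHTS = {"title": 4.0, "h1": 2.0, "meta": 1.5, "body": 1.0}

def weighted_bag_of_words(sections):
    """sections: dict mapping tag name -> text found inside that tag."""
    bag = Counter()
    for tag, text in sections.items():
        w = TAG_WEIGHTS.get(tag, 1.0)
        for term in text.lower().split():
            bag[term] += w
    return bag

page = {
    "title": "machine learning course",
    "h1": "course schedule",
    "body": "lectures on machine learning every week",
}
bag = weighted_bag_of_words(page)
print(bag["course"])   # 6.0 = 4.0 (title) + 2.0 (h1)
print(bag["machine"])  # 5.0 = 4.0 (title) + 1.0 (body)
```

A plain-text method would score "course" and "lectures" by raw frequency alone; the tag weighting is what lets the structural information influence the classifier's term vector.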
b) Visual analysis- Each web page has two representations, if not more. One is the
text representation written in HTML. The other one is the visual representation rendered
by a web browser. They provide different views of a page. Most approaches focus on
the text representation while ignoring the visual information. Yet the visual
representation is useful as well.
Although the visual layout of a page relies on the tags, using visual information of the
rendered page is arguably more generic than analyzing document structure focusing on
HTML tags (Kovacevic, Diligenti, Gori, and Milutinovic 2004). The reason is that
different tagging may have the same rendering effect. In other words, sometimes one
can change the tags without affecting the visual representation. Based on the
assumption that most web pages are built for human eyes, it makes more sense to use
visual information rather than intrinsic tags. [7]
1.3.2 Using features of neighbors
a) Motivation - Although web pages contain useful features as discussed above, in a
particular web page these features are sometimes missing, misleading, or
unrecognizable for various reasons. For example, a web page may contain large images
or flash objects but little textual content.
In such cases, it is difficult for classifiers to make reasonable judgments based on
features on the page. In order to address this problem, features can be extracted from
neighboring pages that are related in some way to the page to be classified to supply
supplementary information for categorization. There are a variety of ways to derive such
connections among pages. One obvious connection is the hyperlink. Since most
existing work that utilizes features of neighbors is based on hyperlink connections, in the
following, we focus on hyperlink connections. However, other types of connections can
also be derived; and some of them have been shown to be useful for web page
classification. [7]
c) Neighbor selection - Another question when using features from neighbors is that
of which neighbors to examine. Existing research mainly focuses on pages within two
steps of the page to be classified. At a distance no greater than two, there are six types
of neighboring pages according to their hyperlink relationship with the page in question:
parent, child, sibling, spouse, grandparent and grandchild, as illustrated in Figure 4. The
effect and contribution of the first four types of neighbors have been studied in existing
research. Although grandparent pages and grandchild pages have also been used, their
individual contributions have not yet been specifically studied. [7]
d) Features of neighbors - The features that have been used from neighbors
comprise labels, fractional content (anchor text, the surrounding text of anchor text,
titles, and headers), and full content. [7]
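The neighbor types enumerated above (parent, child, sibling, spouse, grandparent, grandchild) can be derived mechanically from the hyperlink graph. A sketch using hypothetical page identifiers and links:

```python
# Given hyperlinks as (source, target) pairs, derive the neighbor sets of a
# page p within two steps: parents link to p, children are linked from p,
# siblings share a parent with p, spouses share a child with p.
# The page names and links below are hypothetical.

links = {("a", "p"), ("a", "s"), ("p", "c"), ("q", "c"), ("c", "g")}

def neighbors(p, links):
    parents = {u for (u, v) in links if v == p}
    children = {v for (u, v) in links if u == p}
    siblings = {v for (u, v) in links if u in parents and v != p}
    spouses = {u for (u, v) in links if v in children and u != p}
    grandparents = {u for (u, v) in links if v in parents}
    grandchildren = {v for (u, v) in links if u in children}
    return parents, children, siblings, spouses, grandparents, grandchildren

parents, children, siblings, spouses, gp, gc = neighbors("p", links)
print(parents)   # {'a'}  - a links to p
print(siblings)  # {'s'}  - a also links to s
print(spouses)   # {'q'}  - q also links to p's child c
print(gc)        # {'g'}  - reached through child c
```

Features (labels, anchor text, titles, and so on) harvested from these sets are then what supplements the on-page representation of p.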
1.4 DISCUSSION: DATA MINING IN WEB PAGE CLASSIFICATION.
Data mining turns a large collection of data into knowledge.
A search engine (e.g., Google) receives hundreds of millions of queries every day. Each
query can be viewed as a transaction where the user describes her or his information
need. What novel and useful knowledge can a search engine learn from such a huge
collection of queries collected from users over time? Interestingly, some patterns found
in user search queries can disclose invaluable knowledge that cannot be obtained by
reading individual data items alone. For example, Google Flu Trends uses specific
search terms as indicators of flu activity. It found a close association between the
number of people who search for flu-related information and the number of people who
actually have flu symptoms. A pattern emerges when all of the search queries related
to flu are aggregated. Using aggregated Google search data, Flu Trends can estimate flu
activity up to two weeks faster than traditional systems can. This example shows
how data mining can turn a large collection of data into knowledge that can help meet a
current global challenge. [9]
1.5 DISCUSSION: FEATURE SELECTION IN DATA MINING
Feature selection is a must for any data mining product. That is because,
when you build a data mining model, the dataset often contains more
information than is needed to build the model. For example, a dataset may
contain 1000 columns that describe the characteristics of customers, but perhaps only 100 of
those columns are used to build a specific model. If you keep the unneeded
columns while building the model, more CPU and memory are required during
the training process, and more storage space is required for the completed
model.
Even if resources are not an issue, you typically want to remove unneeded columns
because they might degrade the quality of discovered patterns, for the following
reasons:
Some columns are noisy or redundant. This noise makes it more difficult to
discover meaningful patterns from the data;
Feature selection helps solve this problem, of having too much data that is of little
value, or of having too little data that is of high value. [9]
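One simple, concrete form of the column pruning described above is dropping near-constant columns, which carry almost no predictive information. A minimal sketch (the data and the variance threshold are illustrative choices):

```python
# Drop columns whose variance falls below a threshold: a near-constant column
# cannot help discriminate between classes. Threshold chosen for illustration.

def column_variances(rows):
    """Population variance of each column of a numeric row-major table."""
    cols = list(zip(*rows))
    out = []
    for col in cols:
        mean = sum(col) / len(col)
        out.append(sum((x - mean) ** 2 for x in col) / len(col))
    return out

def drop_low_variance(rows, threshold=0.01):
    keep = [i for i, v in enumerate(column_variances(rows)) if v > threshold]
    return [[row[i] for i in keep] for row in rows], keep

data = [
    [1.0, 0.0, 3.1],
    [1.0, 0.0, 2.9],
    [1.0, 1.0, 3.5],
]
reduced, kept = drop_low_variance(data)
print(kept)  # [1, 2] - the constant first column is removed
```

This addresses the noise problem directly; redundancy (two columns carrying the same information) would additionally call for a correlation-based check.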
1.6 RELATED WORK
Aijun An, Yanhui Huang, Xiangji Huang and Nick Cercone (2005) - presented a
feature reduction method based on rough set theory and investigated the
effectiveness of the rough set feature selection method on web page classification. [10]
Ali Selamat, Sigeru Omatu (2003) - propose a news web page classification
method (WPCM). The WPCM uses a neural network with inputs obtained by
both the principal components and class profile-based features. Each news
web page is represented by the term-weighting scheme. As the number of
unique words in the collection set is big, principal component analysis
(PCA) has been used to select the most relevant features for the classification.
Then the final output of the PCA is combined with the feature vectors from the
class profile, which contains the most regular words in each class. The
experimental evaluation demonstrates that the WPCM method provides acceptable
classification accuracy with the sports news datasets. [11]
Oh-Woog Kwon, Jong-Hyeok Lee (2006) - proposed a Web page classifier based on
an adaptation of k-Nearest Neighbor (k-NN) approach. To improve the performance of
k-NN approach, they supplement k-NN approach with a feature selection method and a
term-weighting scheme using markup tags, and reform document-document similarity
measure used in vector space model. [13]
Wakaki, T, Itakura, H, Tamura, M (2005) - investigated how rough set theory can
aid in selecting applicable features for Web-page classification. Their experimental
results show that the combination of the rough set-aided feature selection
method and the Support Vector Machine with a linear kernel is quite beneficial
in practice to classify Web-pages into many categories, since not only does the
performance give acceptable accuracy but also high dimensionality
reduction is achieved without depending on arbitrary thresholds for feature selection.
[14]
Soumen Chakrabarti, Byron Dom, Rakesh Agrawal and Prabhakar Raghavan
(2007) - describe an automatic system that starts with a small sample of the
corpus in which topics have been assigned by hand, and then updates the
database with new documents as the corpus grows, assigning topics to these new
documents with high speed and accuracy. [12]
CHAPTER 2 METHODOLOGY
2.1 PROBLEM DESCRIPTION
Web page classification, also known as web page categorization, is the process of
assigning a web page to one or more predefined category labels. Classification is often
posed as a supervised learning problem (Mitchell 1997) in which a set of labeled data is
used to train a classifier which can be applied to label future examples. [6]
This project aims at discovering the most efficient feature selection methods that can
be used for web page classification. We will study in depth about data mining, feature
selection and their applications in web page classification. We will also learn how to use
the WEKA tool in testing the performance of various feature selection methods through
classification tools such as decision tree. [3] We will then collect appropriate datasets
for testing these feature selection methods and use these data sets in finding the
accuracy of various feature selection methods. Finally we will work towards the
development of algorithms for optimum classification results using WEKA tool. [5]
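The train-then-evaluate workflow described above can be illustrated outside WEKA with a toy classifier. The sketch below trains a one-level decision tree (a "decision stump") on made-up binary data and measures held-out accuracy; it is a stand-in for, not a reproduction of, the decision-tree classification step the project runs in WEKA:

```python
# A minimal train/evaluate loop with a one-level decision tree ("stump"):
# pick the single binary feature whose split best predicts the label on the
# training data, then measure accuracy on held-out examples.
# All data below is made up for illustration.

def train_stump(X, y):
    best = None
    for f in range(len(X[0])):                # try each feature
        for pos_label in (0, 1):              # and each label orientation
            preds = [pos_label if row[f] else 1 - pos_label for row in X]
            acc = sum(p == t for p, t in zip(preds, y)) / len(y)
            if best is None or acc > best[0]:
                best = (acc, f, pos_label)
    _, f, pos_label = best
    return lambda row: pos_label if row[f] else 1 - pos_label

X_train = [[1, 0], [1, 1], [0, 0], [0, 1]]
y_train = [1, 1, 0, 0]                        # label follows feature 0 exactly
stump = train_stump(X_train, y_train)

X_test, y_test = [[1, 0], [0, 1]], [1, 0]
accuracy = sum(stump(r) == t for r, t in zip(X_test, y_test)) / len(y_test)
print(accuracy)  # 1.0
```

Real decision-tree learners such as J48 recurse on this idea, choosing splits by information gain rather than raw accuracy, but the fit-then-score loop is the same shape as the WEKA experiments described later.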
Based on the number of classes that can be assigned to an instance, classification can be divided
into single-label classification and multi-label classification. In single-label
classification, one and only one class label is to be assigned to each instance, while in
multi-label classification, more than one class can be assigned to an instance. If a
problem is multi-class, say four-class classification, it means four classes are
involved, say Arts, Business, Computers, and Sports. It can be either single-
label, where exactly one class label can be assigned to an instance, or multi-label,
where an instance can belong to any one, two, or all of the classes. Based on
the type of class assignment, classification can be divided into hard
classification and soft classification. In hard classification, an instance can
either be or not be in a particular class, without a transitional state; while in
soft classification, an instance can be predicted to be in some class with
some probability (often a probability distribution across all classes).
Based on the organization of categories, web page classification can also be divided
into flat classification and hierarchical classification. In flat classification, categories are
considered parallel, i.e., one category does not supersede another. While in hierarchical
classification, the categories are organized in a hierarchical tree-like structure, in which
each category may have a number of subcategories.
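The hard/soft distinction described above amounts to whether the classifier commits to a single label or outputs a probability distribution over all classes. A minimal sketch (the raw scores are made-up classifier outputs, and the class names reuse the four-class example above):

```python
# Soft classification outputs a probability distribution over classes;
# hard classification commits to the argmax. Scores below are illustrative.
import math

def softmax(scores):
    """Turn raw scores into a probability distribution summing to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Arts", "Business", "Computers", "Sports"]
scores = [0.2, 1.1, 2.3, 0.4]            # hypothetical raw classifier scores

soft = softmax(scores)                    # soft: P(class) for every class
hard = classes[soft.index(max(soft))]     # hard: one committed label

print(hard)  # Computers
```

Soft outputs are what topic-biased ranking schemes such as Nie et al.'s (Section 1.2.1) consume: each page contributes to every category in proportion to its probability.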
2.2 DETAILS OF TOOLS/SOFTWARES USED (WEKA)
WEKA
A Java package developed at the University of Waikato in New Zealand. Weka stands
for the Waikato Environment for Knowledge Analysis.
Weka is a collection of machine learning algorithms for solving real-world data
mining problems. The algorithms can either be applied directly to a dataset or
called from your own Java code.
Weka contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization. It is also well suited for developing new machine
learning schemes. Weka is open source software issued under the GNU General
Public License. [6]
The Weka workbench contains a collection of visualization tools and algorithms
for data analysis and predictive modeling, together with graphical user
interfaces for easy access to this functionality. [6]
Advantages of Weka include:
Free availability under the GNU General Public License
Portability, since it is fully implemented in the Java programming
language and thus runs on almost any modern computing platform
A comprehensive collection of data preprocessing and modeling techniques
Ease of use due to its graphical user interfaces
2.3 DATA EXTRACTION (WEBKB)

The datasets used for testing the feature selection methods have been taken from WebKB. WebKB refers to the knowledge base (KB) servers WebKB-1 and WebKB-2. KB servers are not Web search engines; they are online knowledge-based systems. KBMSs are database management systems that, unlike relational DBMSs, object-oriented DBMSs and deductive DBMSs, permit end-users to dynamically modify a large number of conceptual definitions in their KBs, and hence do not limit end-users to predefined kinds of data [7].
Specifically, we use the 4 Universities Dataset. This data set contains WWW pages collected from the computer science departments of various universities in January 1998 by the WebKB project. For each class the data set contains pages from the four universities. The 8,282 pages were manually classified into the following categories:
Category      No. of Pages
Student       1641
Faculty       1124
Staff         137
Department    182
Course        930
Project       504
Other         3764

University    Pages
Cornell       867
Washington    1205
Texas         827
Wisconsin     1263
CHAPTER 3: PROPOSED WORK AND IMPORTANT CONCEPTS
3.1 ALGORITHM USED FOR WEB PAGE CLASSIFIER
The aim of this project is to find a suitable combination of feature selection methods for web page categorization; it also covers the main topics in feature selection. The procedure covers three stages: a) the extraction of representative features to describe the content (the initial set); b) the selection of the best features from the initial set by applying a further feature selection method (minimizing the number of features while maximizing the discriminative information carried by them); and c) training and classification using the resulting features in different classifiers, to determine the quality of the features. [14]
3.1.1. PREPROCESSING
Remove stop words, common words, punctuation symbols and HTML tags
Perform stemming
1) Stop Words - In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). The list is controlled by human input and is not automated. There is not one definite list of stop words which all tools use, if one is used at all; some tools specifically avoid removing them to support phrase search. [15]
2) Stemming - In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since 1968. [15]
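The two preprocessing steps above can be sketched in Python (the project itself performs them inside the WEKA pipeline; the stop-word list and the suffix-stripping rules below are toy stand-ins for a full stop-word list and a Porter-style stemmer):

```python
import re

# A toy stop-word list; real systems use much longer lists
# (an assumption, not the exact list used in the project).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for"}

def strip_html(page):
    """Remove HTML tags with a simple regex (adequate for well-formed tags)."""
    return re.sub(r"<[^>]+>", " ", page)

def stem(word):
    """Very naive suffix-stripping stemmer, standing in for a real
    Porter-style stemmer."""
    for suffix in ("ingly", "edly", "ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(page):
    """Tags -> tokens -> drop stop words/punctuation/digits -> stem."""
    text = strip_html(page).lower()
    tokens = re.findall(r"[a-z]+", text)  # drops punctuation and digits
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>The students are studying classification in 2011</p>"))
# -> ['student', 'study', 'classification']
```

Note how the HTML tags, the stop words and the digits disappear, and the surviving words are reduced to a common root before the frequency counts of the next step are taken.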
3.1.2. FEATURE SET EXTRACTION, F1
Find frequency of occurrence of independent words
Find the term frequency-inverse document frequency of the documents.

Term Frequency-Inverse Document Frequency
The term frequency of a term in a given document is the number of times the term appears in that document. This count is usually normalized to prevent a bias towards longer web pages (which may have a higher term frequency regardless of the actual importance of that term in the web page), giving a measure of the importance of the term ti within the particular web page dj: [16]

tf(i,j) = n(i,j) / sum_k n(k,j)

where n(i,j) is the number of occurrences of the considered term in web page dj, and the denominator is the number of occurrences of all terms in web page dj.
The inverse document frequency is a measure of the general importance of the term, obtained by dividing the number of all documents by the number of web pages containing the term, and then taking the logarithm of that quotient: [16]

idf(i) = log( |D| / |{d : ti in d}| )

where |D| is the total number of web pages in the corpus, and |{d : ti in d}| is the number of web pages where the term ti appears. Then

tfidf(i,j) = tf(i,j) x idf(i)
A high weight in tf-idf is reached by a high term frequency (in the given web page) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. [16]
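The formulas above can be computed directly, as in the following sketch (a toy illustration: the corpus and the term names are invented, and real implementations often smooth or vary the idf term):

```python
import math

def tf(term, doc):
    """Normalized term frequency: n(i,j) / total terms in web page dj."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(|D| / number of web pages containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Three invented "web pages", already preprocessed into word lists.
corpus = [
    ["course", "exam", "lecture", "course"],
    ["faculty", "research", "lecture"],
    ["student", "course", "project"],
]

# "course" appears in 2 of 3 pages, so its idf is low; "exam" appears
# in only 1 page, so it receives a higher weight on that page.
print(tfidf("exam", corpus[0], corpus))
print(tfidf("course", corpus[0], corpus))
```

The comparison at the end shows exactly the filtering effect described above: the rarer term "exam" outweighs the more frequent but widespread term "course" on the same page.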
3.1.3. FEATURE SELECTION, F2

Select features from F1 using the CfsSubsetEval method.
The CfsSubsetEval Algorithm
A feature is useful if it is correlated with or predictive of the class; otherwise it is irrelevant. A feature Vi is said to be relevant iff there exist some vi and c for which p(Vi = vi) > 0 such that p(C = c | Vi = vi) ≠ p(C = c).

Empirical evidence from the feature selection literature shows that, along with irrelevant features, redundant information should be eliminated as well. The above definitions of relevance and redundancy lead to the following hypothesis, on which the feature selection method used in this report is based: a good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other. [17]
When we develop a composite which we intend to use as a basis for predicting an outside variable, it is likely that the components we select to form the composite will have relatively low inter-correlations. When we seek to predict some variable from several other variables, we try to select predictor variables which measure different aspects of the outside variable.
The CfsSubsetEval algorithm evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between the features. Hence it is a central part of our approach. [17]
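Concretely, CFS scores a candidate subset of k features with Hall's merit heuristic, Merit_S = k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature inter-correlation of the subset [8]. A minimal sketch, with the correlation values invented purely for illustration:

```python
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Hall's CFS merit: rewards subsets whose features correlate with
    the class (numerator) while penalizing redundancy among the
    features themselves (denominator)."""
    r_cf = mean_feature_class_corr
    r_ff = mean_feature_feature_corr
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# Two hypothetical 5-feature subsets with the same class correlation:
# the less redundant one receives the higher merit.
low_redundancy = cfs_merit(5, 0.6, 0.1)
high_redundancy = cfs_merit(5, 0.6, 0.8)
print(low_redundancy, high_redundancy)
```

WEKA searches the space of subsets (e.g. with best-first search) for the subset maximizing this score; the sketch only shows how the score itself trades class correlation against redundancy.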
3.1.4. TRAINING THE C4.5 DECISION TREE CLASSIFIER WITH F2

C4.5 Decision Tree Classifier

Determining the relative importance of a feature is one of the basic tasks during decision tree generation. C4.5 (Quinlan, 1993) uses a univariate feature selection strategy. At each level of the tree building process only one attribute, the attribute with the highest value of the selection criterion, is picked out of the set of all attributes. Afterwards the sample set is split into subsample sets according to the values of this attribute, and the whole procedure is recursively repeated until only samples from one class remain in the sample set, or until the remaining sample set has no discrimination power anymore and the tree building process stops. [19]
As we can see, feature selection is only done at the root node over the entire decision space. After this level, the sample set is split into sub-samples and only the most important feature in each remaining sub-sample set is selected. Geometrically this means that, beginning after the root node, the search for good features is only done in orthogonal decision subspaces, which might not represent the real distributions. Thus, unlike statistical feature search strategies (Fukunaga, 1990), this approach is not driven by an evaluation measure over the combinatorial feature subset; it is driven only by the best single feature. This might not lead to an optimal feature subset in terms of classification accuracy. [19]
C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = s1, s2, ... of already classified samples. Each sample si = x1, x2, ... is a vector where x1, x2, ... represent attributes or features of the sample. The training data is augmented with a vector C = c1, c2, ..., where c1, c2, ... represent the class to which each sample belongs.
At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision, and the C4.5 algorithm then recurses on the smaller sublists. [13]
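The entropy-based splitting criterion can be sketched as follows (a simplified illustration computing plain information gain; C4.5 actually normalizes this into the gain ratio, and the toy data is invented):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy of the parent node minus the weighted entropy of the
    subsets produced by splitting on one attribute."""
    parent = entropy(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s)
                    for s in subsets.values())
    return parent - remainder

# Toy data: attribute 0 separates the classes perfectly (gain = 1 bit),
# while attribute 1 carries no information about the class (gain = 0).
rows = [("yes", "x"), ("yes", "y"), ("no", "x"), ("no", "y")]
labels = ["course", "course", "student", "student"]
print(information_gain(rows, labels, 0), information_gain(rows, labels, 1))
```

The attribute with the highest gain becomes the split at the current node, and the procedure recurses on each resulting subset, exactly as described above.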
3.1.5. The features in the pruned decision tree form the final set of features, F3.

3.1.6. Evaluate the performance of the machine learning classifiers using the final set of features, F3.
3.2 IMPLEMENTED CLASSIFIERS
3.2.1 NAÏVE BAYES CLASSIFIER

In some cases the Naïve Bayes classifier is even seen to outperform more complex algorithms. It makes use of the variables contained in the data sample by observing them individually, independent of each other. [12]
The Naïve Bayes classifier is based on the Bayes rule of conditional probability. It
makes use of all the attributes contained in the data, and analyses them individually as
though they are equally important and independent of each other. For example,
consider that the training data consists of various animals (say elephants, monkeys and
giraffes), and our classifier has to classify any new instance that it encounters. We know
that elephants have attributes like they have a trunk, huge tusks, a short tail, are
extremely big, etc. Monkeys are short in size, jump around a lot, and can climb trees;
whereas giraffes are tall, have a long neck and short ears.[12]
It is seen that the Naïve Bayes classifier performs almost at par with the other classifiers in most cases. Of the 26 different experiments carried out on various datasets, the Naïve Bayes classifier shows a drop in performance in only 3-4 cases when compared with J48 and Support Vector Machines. This supports the widely held belief that, though simple in concept, the Naïve Bayes classifier works well in most data classification problems. [16]
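A minimal sketch of the idea on word lists (multinomial Naïve Bayes with Laplace smoothing; the tiny training set is invented for illustration, and the project itself uses WEKA's NaiveBayes implementation rather than hand-written code):

```python
import math
from collections import Counter, defaultdict

# Invented toy training data: (word list, class label).
train = [
    (["lecture", "exam", "syllabus"], "course"),
    (["exam", "homework", "lecture"], "course"),
    (["thesis", "advisor", "research"], "student"),
    (["research", "homework", "advisor"], "student"),
]

class_docs = defaultdict(list)
for words, label in train:
    class_docs[label].extend(words)
vocab = {w for words, _ in train for w in words}

def predict(words):
    """argmax over classes of log P(class) + sum of log P(word | class),
    treating each word independently (the 'naive' assumption)."""
    best, best_score = None, -math.inf
    for label, bag in class_docs.items():
        counts = Counter(bag)
        prior = sum(1 for _, l in train if l == label) / len(train)
        score = math.log(prior)
        for w in words:
            # Laplace smoothing keeps unseen words from zeroing the product.
            score += math.log((counts[w] + 1) / (len(bag) + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(predict(["exam", "lecture"]))
```

Each word contributes its own conditional probability independently of the others, which is exactly the attribute-independence assumption described above.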
3.2.2. K*
K* is an instance-based classifier; that is, the class of a test instance is based upon the class of those training instances similar to it, as determined by some similarity function.
It differs from other instance-based learners in that it uses an entropy-based distance
function. [17]
Instance-based learners classify an instance by comparing it to a database of pre-classified examples. The fundamental assumption is that similar instances will have similar classifications. The question lies in how to define a similar instance and a similar classification. The corresponding components of an instance-based learner are the distance function, which determines how similar two instances are, and the classification function, which specifies how instance similarities yield a final classification for the new instance.
The approach taken here to computing the distance between two instances is motivated by information theory. The intuition is that the distance between instances is defined as the complexity of transforming one instance into another. The calculation of the complexity is done in two steps. First, a finite set of transformations which map instances to instances is defined. A program to transform one instance (a) to another (b) is then a finite sequence of transformations starting at a and terminating at b. [17]
3.2.3 J48
A decision tree is a predictive machine-learning model that decides the target value (dependent variable) of a new sample based on various attribute values of the available data. The internal nodes of a decision tree denote the different attributes; the branches between the nodes tell us the possible values that these attributes can have in the observed samples, while the terminal nodes tell us the final value (classification) of the dependent variable. [19]
The attribute that is to be predicted is known as the dependent variable, since its value
depends upon, or is decided by, the values of all the other attributes. The other
attributes, which help in predicting the value of the dependent variable, are known as
the independent variables in the dataset.[19]
The J48 decision tree classifier follows a simple algorithm. In order to classify a new item, it first needs to create a decision tree based on the attribute values of the available training data. So, whenever it encounters a set of items (a training set), it identifies the attribute that discriminates the various instances most clearly. This feature, the one that tells us most about the data instances so that we can classify them best, is said to have the highest information gain. Now, among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category all have the same value for the target variable, then we terminate that branch and assign to it the target value that we have obtained.
For the other cases, we then look for another attribute that gives us the highest
information gain.[19] Hence we continue in this manner until we either get a clear
decision of what combination of attributes gives us a particular target value, or we run
out of attributes. In the event that we run out of attributes, or if we cannot get an
unambiguous result from the available information, we assign this branch a target value
that the majority of the items under this branch possess.
CHAPTER 4: RESULTS & DISCUSSION

4.1 Initial Feature Selection F1

The web pages are initially preprocessed as follows. The HTML tags, stopwords, punctuation and digits are removed, and words are reduced to their root form. To diminish the weight of frequently occurring words and increase the weight of rare words in a web page, the term frequency-inverse document frequency of each word in a web page is computed. This forms the initial set of features F1, as listed in Table 1.
Table 1: Initial set of features F1

S. No.   Sample size   No. of Instances   No. of features
1.       p70-n30       10                 694
2.       p350-n150     100                2759
4.2 Feature Selection F2

CfsSubsetEval is run on F1 to further select the features with more information gain on the class and less inter-correlation. This forms F2, as shown in Table 2.
4.3 Feature Selection F3

A C4.5 decision tree classifier is built using F2. The features that C4.5 uses in its pruned tree are alone selected as the final features F3, as shown in Table 3.
Table 2: Features selected, F2

S. No.   Sample size   No. of Instances   No. of features
1.       p70-n30       10                 6
2.       p350-n150     100                25

Table 3: Features selected, F3

S. No.   Sample size   No. of features
1.       p70-n30       2
2.       p350-n150     5

4.4 CLASSIFICATION

The performance of the various machine learning classifiers is evaluated on the different feature sets F1, F2 and F3. The time taken to build the classifier in each case is also observed, as illustrated in the following tables.

Table 4: Classification on the feature set F1

S. No.   Sample Size   NB     K*     J48
1.       p70-n30       70%    90%    40%
2.       p350-n150     91%    50%    95%
Table 5: Classification on the feature set F2

S. No.   Sample Size   NB     K*     J48
1.       p70-n30       90%    100%   50%
2.       p350-n150     98%    96%    95%

It can be observed that the classification accuracy has increased with all classifiers on the reduced feature set F2, and the time taken to build the classifiers is also reduced. K* has exhibited higher accuracy on F2 than on F1.
Figure 1: Decision Tree on file 1
Figure 2: Decision Tree on file 2
Table 6: Classification on the feature set F3

S. No.   Sample Size   NB     K*     J48
1.       p70-n30       97%    100%   85%
2.       p350-n150     99%    100%   95%

It can be inferred from Table 6 that the time taken to build the classifiers with F3 has significantly reduced, with no compromise in accuracy. Therefore, for better accuracy with reduced resource utilization, selecting the best and optimum features is most important. This is one way of improving the performance of the classifiers.
CHAPTER 5: CONCLUSION

After reviewing web page classification research with respect to its features and algorithms, we conclude this report by summarizing the lessons learned from existing research and pointing out future opportunities in web classification.

Web page classification is a type of supervised learning problem that aims to categorize web pages into a set of predefined categories based on labeled training data. Classification tasks include assigning documents on the basis of subject, function, sentiment, genre, and more.
Unlike more general text classification, web page classification methods can take
advantage of the semi-structured content and connections to other pages within the
Web. We have surveyed the space of published approaches to web page classification
from various viewpoints, and summarized their findings and contributions, with a special
emphasis on the utilization and benefits of web-specific features and methods. We
found that while the appropriate use of textual and visual features that reside directly on
the page can improve classification performance, features from neighboring pages
provide significant supplementary information to the page being classified. Feature
selection and the combination of multiple techniques can bring further improvement. We
expect that future web classification efforts will certainly combine content and link
information in some form.
Finally, we wish to explicitly note the connection between machine learning and
information retrieval (especially ranking). This idea is not new, and has underpinned
many of the ideas in the work presented here. A learning to rank community (Joachims
2002; Burges, Shaked, Renshaw, Lazier, Deeds, Hamilton, and Hullender 2005;
Radlinski and Joachims 2005; Roussinov and Fan 2005; Agarwal 2006; Cao, Xu, Liu,
Li, Huang, and Hon 2006; Richardson, Prakash, and Brill 2006) is making advances for
both static and query specific ranking. There may be unexplored opportunities for using
retrieval techniques for classification as well. In general, many of the features and
approaches for web page classification have counterparts in analysis for web page
retrieval. Future advances in web page classification should be able to inform retrieval
and vice versa.
CHAPTER 6: LITERATURE SURVEY/RESOURCES

BOOKS

[1] Discovering Knowledge in Data: An Introduction to Data Mining (Daniel T. Larose)
[2] The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition) by Trevor Hastie, Robert Tibshirani and Jerome Friedman (2009)
[3] Data Mining Techniques for Web Page Classification (Gabriel Fiol-Roig, Margaret
Mir-Juli, Eduardo Herraiz)
[4] Qi X and Davison B.D. (2009) Web Page Classification: Features and Algorithms.
ACM Computing Surveys
[5] Feature Selection with Rough Sets for Web Page Classification Lecture Notes in
Computer Science, 2005, Volume 3135/2005, 280-305,
[6] Web page feature selection and classification using neural networks - Computer and
System Sciences, Graduate School of Engineering, Osaka Prefecture University
[7] Web page classification based on k-nearest neighbor approach Dept. of Computer
Science and Engineering, Pohang University of Science and Technology, San 31 Hyoja
Dong, Pohang, 790-784, Korea
[8] M. A. Hall (1998). Correlation based Feature Subset Selection for Machine Learning.
Hamilton, New Zealand.
JOURNAL RESEARCH PAPERS
[9] Feature Selection for Web Page Classification (Daniele Riboni D.S.I., Universita
degli Studi di Milano, Italy)
[10] WEKA: Experiences with a Java Open-Source Project (Department of Computer Science, University of Waikato, Hamilton, New Zealand)
[11] Mir-Juli M., Fiol-Roig G. and Vaquer-Ferrer D. (2009), Classification using Intelligent Approaches: an Example in Social Assistance, Frontiers in Artificial Intelligence and Applications
[12] Arul Prakash Asirvatham, Kranthi Kumar Ravi (2001), Web Page Classification based on Document Structure, awarded second prize in the National Level Student Paper Contest conducted by IEEE India Council.
[13] Xiaogang Peng, Ben Choi (2002), Automatic Web Page Classification in a Dynamic and Hierarchical Way, In Proceedings of the Second IEEE International Conference on Data Mining, Washington DC, IEEE Computer Society, pp. 386-393
[14] Yicen Liu, Mingrong Liu, Liang Xiang and Qing Yang, (2008), Entity-Based
Classification of Web Page in Search Engine, ICADL, LNCS, Vol. 5362, pp:411- 412.
[15] S.B. Kotsiantis, Supervised Machine Learning: A Review of Classification Techniques, Informatica 31 (2007), 249-268
[16] Harry Zhang "The Optimality of Naive Bayes". FLAIRS2004 conference.
WEBSITES
[17] Effectiveness of Web Page Classification on Finding List Answers
(www.comp.nus.edu.sg)
[18] Data Mining Techniques for Web Page Classification
(http://eduherraiz.com/papers/)
[19] Term Frequency Inverse Document (http://nlp.stanford.edu/IR-
book/html/htmledition/inverse-document-frequency-1.html )
[20] Classification via Decision Trees in WEKA
(http://maya.cs.depaul.edu/classes/ect584/weka/classify.html )
LIST OF FIGURES AND TABLES

TABLES

1) Table 1: Initial set of features F1
2) Table 2: Features selected, F2
3) Table 3: Features selected, F3
4) Table 4: Classification on the feature set F1
5) Table 5: Classification on the feature set F2
6) Table 6: Classification on the feature set F3

FIGURES

1) Figure 1: Decision Tree on file 1
2) Figure 2: Decision Tree on file 2