7/28/2019 Investigation of Feature Selection Methods in Web Page Classification using Data Mining Techniques in C++
PROJECT REPORT
ON
Investigation of Feature Selection methods for Web Page Classification
BY
Aayush Gupta 2008A4PS004U
FOR
FINAL YEAR PROJECT
COMPUTER PROJECT (BITSC331)
BITS Pilani, Dubai Campus
Dubai International Academic City (DIAC)
Dubai, U.A.E
FIRST SEMESTER 2011-2012
BITS Pilani, Dubai Campus
Dubai International Academic City (DIAC)
Dubai, U.A.E
Duration: 6th Sept 2012 to 27th Dec 2012
Date of Start: 06/09/2012
Date of Submission: ___________
Title of the Project: Investigation of Feature Selection methods for Web Page
classification
Name of Student:
Aayush Gupta 2008A4PS004U
Discipline of Student: Mechanical Engineering
Name of the Faculty: Ms. J. Alamelu Mangai
Abstract:
Since the Internet provides billions of web pages for every search word, getting
suitable and relevant results from it quickly becomes very difficult. Automatic
classification of web pages into required categories is a current research
subject, which allows the search engine to retrieve essential results. As web pages
contain many extraneous, infrequent words that reduce the effectiveness of the
classifier, mining or selecting characteristic features from the web page is an
essential pre-processing step. This project analyzes various feature selection
methods and applies them to the representation of web pages in order to improve
categorization accuracy. We process data collected from WebKB, which is sourced from
more than 100 webpages of different universities, and we use these processed files to
find the efficiency of various feature selection methods through the WEKA tool; we also
classify them with a decision tree to further analyze our results and find the optimal
algorithm. Our experiments show the usefulness of dimensionality reduction and of a
new, structure-oriented weighting technique. The objective of this report is to identify the
applications of web page classification, to understand the various challenges of WPC
algorithms, and to understand, survey and analyze the various feature selection
methods used for web page classification.
TABLE OF CONTENTS
Abstract
Table of Contents
List of Figures/Tables
Chapter 1 INTRODUCTION
1.1 DATA MINING: BASICS
1.2 WEB PAGE CLASSIFICATION: BASICS
1.2.1 APPLICATIONS OF WEB CLASSIFICATION
1.3 DISCUSSION: FEATURES
1.3.1 USING ON-PAGE FEATURES
1.3.2 USING NEIGHBOR FEATURES
1.4 DISCUSSION: DATA MINING IN WEB PAGE CLASSIFICATION
1.5 DISCUSSION: FEATURE SELECTION IN DATA MINING
1.6 RELATED WORK
Chapter 2 LITERATURE SURVEY
Chapter 3 METHODOLOGIES
2.1 PROBLEM DESCRIPTION
2.2 TOOLS USED FOR RESEARCH
2.3 DATA EXTRACTION
Chapter 4 PROPOSED WORK / IMPORTANT CONCEPTS
3.1 ALGORITHMS USED FOR WEB PAGE CLASSIFIER
3.2 IMPLEMENTED CLASSIFIERS
3.2.1 NAÏVE BAYES
3.2.2 K-STAR
3.2.3 J48
Chapter 5 RESULTS AND DISCUSSION
4.1 INITIAL FEATURE SELECTION
4.2 FEATURE SELECTION, F2
4.3 FEATURE SELECTION, F3
4.4 CLASSIFICATION
Chapter 6 CONCLUSION
Chapter 7 LITERATURE SURVEY / RESOURCES
CHAPTER 1 INTRODUCTION
1.1 DATA MINING: BASICS
Data Mining (the analysis step of the Knowledge Discovery in Databases process, or
KDD), a relatively young and interdisciplinary field of computer science, is the process
of discovering new patterns from large data sets involving methods
from statistics and artificial intelligence but also database management. In contrast
to machine learning, the emphasis lies on the discovery of previously unknown patterns
as opposed to generalizing known patterns to new data.[1]
According to the Gartner Group, "Data mining is the process of discovering meaningful
new correlations, patterns and trends by sifting through large amounts of data stored in
repositories, using pattern recognition technologies as well as statistical and
mathematical techniques." [1]
Data mining is predicted to be "one of the most revolutionary developments of the next
decade," according to the online technology magazine ZDNET News. In fact, the MIT
Technology Review chose data mining as one of 10 emerging technologies that will
change the world. "Data mining expertise is the most sought after . . . among
information technology professionals," according to the 1999 Information Week National
Salary Survey [9]. The survey reports: "Data mining skills are in high demand this year,
as organizations increasingly put data repositories online. Effectively analyzing
information from customers, partners, and suppliers has become important to more
companies. 'Many companies have implemented a data warehouse strategy and are
now starting to look at what they can do with all that data,' says Dudley Brown,
managing partner of BridgeGate LLC, a recruiting firm in Irvine, Calif." [3]
The following list shows the most common data mining tasks.[3]
1. Description - Sometimes, researchers and analysts are simply trying to find
ways to describe the patterns and trends lying within data. Descriptions of patterns and
trends often suggest possible explanations for them. For example, those who have been
laid off may now be less well off financially than before the incumbent was elected, and
so would tend to prefer an alternative.
2. Estimation - Estimation is similar to classification except that the
target variable is numerical rather than categorical. Models are built using
complete records, which provide the value of the target variable as well as the
predictors. Then, for new observations, approximations of the value of the target
variable are made, based on the values of the predictors. For example, we
might be interested in estimating the systolic blood pressure reading of a
hospital patient, based on the patient's age, gender, body-mass index, and
blood sodium levels.
3. Prediction - Prediction is similar to classification and estimation, except
that for prediction, the results lie in the future. Any of the methods and techniques
used for classification and estimation may also be used, under appropriate
circumstances, for prediction. These include the traditional statistical methods of
point estimation and confidence interval estimation, simple linear regression
and correlation, and multiple regression.
4. Classification - In classification, there is a target categorical variable, such
as income bracket, which, for example, could be divided into three classes or
categories: high income, middle income, and low income. The data-mining model
inspects a large set of records, each record containing information on the target variable
as well as a set of input or predictor variables.
5. Clustering - Clustering refers to the grouping of records, observations, or cases
into classes of similar objects. A cluster is a collection of records that are similar to one
another, and dissimilar to records in other clusters. Clustering differs from classification
in that there is no target variable for clustering. The clustering task does not try to
classify, estimate, or predict the value of a target variable. Instead, clustering algorithms
seek to segment the entire data set into relatively homogeneous subgroups or clusters,
where the similarity of the records within the cluster is maximized and the similarity to
records outside the cluster is minimized.
6. Association - The association task for data mining is the job of finding
which attributes go together. Most prevalent in the business world, where it is
known as affinity analysis or market basket analysis, the task of association
seeks to uncover rules for quantifying the relationship between two or more
attributes. Association rules are of the form "If antecedent, then consequent,"
together with a measure of the support and confidence associated with the rule. [9]
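The support and confidence measures mentioned above can be computed directly from a set of transactions. Below is a minimal illustrative sketch in Python (the transactions and item names are made up; the report's own experiments use WEKA, not this code):

```python
# Compute support and confidence for a rule "if antecedent then consequent"
# over a toy set of market-basket transactions (illustrative data only).

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: if {bread} then {milk}
print(support({"bread", "milk"}, transactions))      # 0.5  (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}, transactions)) # 2/3  (2 of the 3 bread transactions)
```

Here the rule holds in 2 of 4 transactions (support 0.5), and in 2 of the 3 transactions containing the antecedent (confidence ≈ 0.67).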
1.2 WEB PAGE CLASSIFICATION: BASICS
Webpage classification or webpage categorization is the process of assigning a
webpage to one or more category labels, e.g. News, Sport, Business.
Classification of web page content is essential to many tasks in web information
retrieval such as maintaining web directories and focused crawling. The uncontrolled
nature of web content presents additional challenges to web page classification as
compared to traditional text classification, but the interconnected nature of hypertext
also provides features that can assist the process.[4]
Classification plays a fundamental role in many information management
and retrieval tasks. On the Web, classification of page content is
important to focused crawling, to the supported development of web
directories, to topic-specific web link analysis, and to analysis of the
topical structure of the Web. Web page classification can also help
improve the quality of web search.
The general problem of webpage classification can be divided into:
Subject classification: the subject or topic of the webpage, e.g. Adult, Sport, Business.
Function classification: the role that the webpage plays, e.g. Personal homepage,
Course page, and Admission page.
Based on the number of classes, webpage classification can be divided into:
Binary classification
Multi-class classification
Based on the number of classes that can be assigned to an instance, classification
can be separated into single-label classification and multi-label classification. [5]
1.2.1 Applications of web classification
1) Constructing, maintaining or expanding web directories (web hierarchies)
Web directories, such as those provided by Yahoo! (2007) and the dmoz Open
Directory Project (ODP) (2007), provide an efficient way to browse for information within
a predefined set of categories. Currently, these directories are mainly constructed and
maintained by editors, requiring extensive human effort. As of July 2006, it was reported
(Corporation 2007) that there are 73,354 editors involved in the dmoz ODP. As the Web
changes and continues to grow, this manual approach will become less effective. One
could easily imagine building classifiers to help update and expand such directories. For
example, Huang et al. (2004a, 2004b) propose an approach to automatic creation of
classifiers from web corpora based on user-defined hierarchies. Furthermore,
with advanced classification techniques, customized (or even dynamic)
views of web directories can be generated automatically. [7]
2) Improving quality of search results
Query ambiguity is among the problems that undermine the quality of
search results. For example, the query term "bank" could mean the border of a
water area or a financial establishment. Numerous approaches
have been proposed to improve retrieval quality by disambiguating query
terms. Chekuri et al. (1997) studied automatic web page
classification in order to increase the precision of web search. A statistical
classifier, trained on existing web directories, is applied to new web pages
and creates an ordered list of categories in which the web page could be
placed. At query time the user is asked to specify one or more desired
categories so that only the results in those categories are returned, or the search
engine returns a list of categories under which the pages would fall. This method
works when the user is looking for a known item. In such a case, it is not difficult to
specify the favored categories. However, there are circumstances in which the
user is less positive about what documents will match, for which the above
method does not help much.
Search results are usually presented in a ranked list. However, presenting categorized,
or clustered, results could be more useful to users. An approach proposed by Chen and
Dumais (2000) classifies search results into a predefined hierarchical structure and
presents the categorized view of the results to the user. Their user study demonstrated
that the category interface is liked by the users better than the result list interface, and is
more effective for users to find the desired information. Compared to the approach
suggested by Chekuri et al., this approach is less efficient at query time because it
categorizes web pages on the fly. However, it does not require the user to specify
desired categories; therefore, it is more helpful when the user does not know the query
terms well. Similarly, Kaki (2005) also proposed to present a categorized aview of
asearch aresults to ausers. aExperiments ashowed that the acategorized view is
abeneficial for the ausers, aespecially when the aranking ofaresults is not asatisfying.
In 1998, Page and Brin developed the link-based ranking algorithm called PageRank
(1998). PageRank calculates the authoritativeness of web pages based on a graph
constructed by web pages and their hyperlinks, without considering the topic of each
page. Since then, much research has explored how to differentiate authorities of
different topics. Haveliwala (2002) proposed Topic-sensitive PageRank, which performs
multiple PageRank calculations, one for each topic. When computing the PageRank
score for each category, the random surfer jumps to a page in that category at random
rather than just any web page. This has the effect of biasing the PageRank to that topic.
This approach needs a set of pages that are accurately classified. Nie et al. (2006)
proposed another web ranking algorithm that considers the topics of web pages. In that
work, the contribution that each category has to the authority of web pages is
distinguished by means of soft classification, in which a probability distribution is given
for a web page being in each category. In order to answer the question to what
granularity of topic the computation of biased page ranks makes sense, Kohlschutter et
al. (2007) conducted analysis on ODP categories, and showed that ranking
performance increases with the ODP level up to a certain point. It seems
further research along this direction is quite promising. [7]
4) Building efficient focused crawlers or vertical (domain-specific) search engines
When only domain-specific queries are expected, performing a full crawl is usually
inefficient. Chakrabarti et al. (Chakrabarti et al. 1999) proposed an approach called
focused crawling, in which only documents relevant to a predefined set of topics are of
interest. In this approach, a classifier is used to evaluate the relevance of a web page to
the given topics so as to provide evidence for the crawl boundary.
5) Other applications
Besides the applications discussed above, web page classification is also useful in web
content filtering (Hammami, Chahir, and Chen 2003; Chen, Wu, Zhu, and Hu 2006),
assisted web browsing (Armstrong, Freitag, Joachims, and Mitchell 1995; Pazzani,
Muramatsu, and Billsus 1996; Joachims, Freitag, and Mitchell 1997) and in knowledge
base construction (Craven, DiPasquo, Freitag, McCallum, Mitchell, Nigam, and Slattery
1998).
1.3 DISCUSSION: FEATURES
Feature selection has been a dynamic research area in pattern recognition,
statistics, and data mining communities. The main idea of feature selection is to
choose a subset of input variables by excluding features with little or no
predictive information. Feature selection can significantly improve the clarity of
the resulting classifier models and often builds a model that generalizes better
to unseen points. Further, it is often the case that finding the correct subset of
predictive features is an important problem in its own right. For example, a
physician may make a decision based on the selected features whether a
dangerous surgery is necessary for treatment or not. [5]
In this section, we review the types of features found to be useful in web page
classification research.
Written in HTML, web pages contain additional information, such as HTML tags,
hyperlinks and anchor text (the text to be clicked on to activate and follow a hyperlink to
another web page, placed between HTML <a> and </a> tags), other than the textual
content visible in a web browser. These features can be divided into two broad classes:
on-page features, which are directly located on the page to be classified, and features
of neighbors, which are found on the pages related in some way with the page to be
classified. [7]
1.3.1 Using on-page features
a) Textual content and tags - Directly located on the page, the textual content is the
most forthright feature that one may consider to use. However, due to the variety of
unrestrained noise in web pages, directly using a bag-of-words representation for all
terms may not achieve top performance.
One obvious feature that appears in HTML documents but not in plain text documents
is HTML tags. It has been demonstrated that using information derived from tags can
boost the classifier's performance. Golub and Ardo (2005) derived significance
indicators for textual content in different tags. In their work, four elements from the web
page are used: title, headings, metadata, and main text. They showed that the best
result is achieved from a well-tuned linear amalgamation of the four elements.
Thus, utilizing tags can take benefit of the structural information embedded in the
HTML files, which is usually ignored by plain text methods. However, since most HTML
tags are leaning toward representation rather than semantics, web page authors may
generate different but theoretically corresponding tag structures. Therefore, using HTML
tagging information in web classification may suffer from the unpredictable formation of
HTML documents. [7]
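The idea of weighting term occurrences by the tag they appear in, as in Golub and Ardo's linear combination of title, headings, metadata and main text, can be sketched as follows. The tag weights below are arbitrary placeholders, not the tuned values from their work, and the page content is made up:

```python
# Build a tag-weighted bag-of-words: a term counts more when it appears in a
# structurally important element (title > headings > metadata > main text).
# The weights are illustrative placeholders, not tuned values.
from collections import Counter

TAG_WEIGHTS = {"title": 4.0, "h1": 2.0, "meta": 1.5, "body": 1.0}

def weighted_bag_of_words(sections):
    """sections: dict mapping tag name -> text found inside that tag."""
    bag = Counter()
    for tag, text in sections.items():
        w = TAG_WEIGHTS.get(tag, 1.0)
        for term in text.lower().split():
            bag[term] += w
    return bag

page = {
    "title": "machine learning course",
    "h1": "course schedule",
    "body": "lectures on machine learning every week",
}
bag = weighted_bag_of_words(page)
print(bag["course"])   # 6.0 = 4.0 (title) + 2.0 (h1)
print(bag["machine"])  # 5.0 = 4.0 (title) + 1.0 (body)
```

A plain-text method would score "course" and "lectures" by raw frequency alone; the tag weighting is what lets the structural information influence the classifier's term vector.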
b) Visual analysis- Each web page has two representations, if not more. One is the
text representation written in HTML. The other one is the visual representation rendered
by a web browser. They provide different views of a page. Most approaches focus on
the text representation while ignoring the visual information. Yet the visual
representation is useful as well.
Although the visual layout of a page relies on the tags, using visual information of the
rendered page is arguably more generic than analyzing document structure focusing on
HTML tags (Kovacevic, Diligenti, Gori, and Milutinovic 2004). The reason is that
different tagging may have the same rendering effect. In other words, sometimes one
can change the tags without affecting the visual representation. Based on the
assumption that most web pages are built for human eyes, it makes more sense to use
visual information rather than intrinsic tags. [7]
1.3.2 Using features of neighbors
a) Motivation - Although web pages contain useful features as discussed above, in a
particular web page these features are sometimes missing, misleading, or
unrecognizable for various reasons. For example, a web page may contain large images
or flash objects but little textual content.
In such cases, it is difficult for classifiers to make reasonable judgments based on
features on the page. In order to address this problem, features can be extracted from
neighboring pages that are related in some way to the page to be classified to supply
supplementary information for categorization. There are a variety of ways to derive such
connections among pages. One obvious connection is the hyperlink. Since most
existing work that utilizes features of neighbors is based on hyperlink connections, in the
following, we focus on hyperlink connections. However, other types of connections can
also be derived; and some of them have been shown to be useful for web page
classification. [7]
c) Neighbor selection - Another question when using features from neighbors is that
of which neighbors to examine. Existing research mainly focuses on pages within two
steps of the page to be classified. At a distance no greater than two, there are six types
of neighboring pages according to their hyperlink relationship with the page in question:
parent, child, sibling, spouse, grandparent and grandchild, as illustrated in Figure 4. The
effect and contribution of the first four types of neighbors have been studied in existing
research. Although grandparent pages and grandchild pages have also been used, their
individual contributions have not yet been specifically studied. [7]
d) Features of neighbors - The features that have been used from neighbors
comprise labels, fractional content (anchor text, the surrounding text of anchor text,
titles, and headers), and full content. [7]
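The neighbor types enumerated above (parent, child, sibling, spouse, grandparent, grandchild) can be derived mechanically from the hyperlink graph. A sketch using hypothetical page identifiers and links:

```python
# Given hyperlinks as (source, target) pairs, derive the neighbor sets of a
# page p within two steps: parents link to p, children are linked from p,
# siblings share a parent with p, spouses share a child with p.
# The page names and links below are hypothetical.

links = {("a", "p"), ("a", "s"), ("p", "c"), ("q", "c"), ("c", "g")}

def neighbors(p, links):
    parents = {u for (u, v) in links if v == p}
    children = {v for (u, v) in links if u == p}
    siblings = {v for (u, v) in links if u in parents and v != p}
    spouses = {u for (u, v) in links if v in children and u != p}
    grandparents = {u for (u, v) in links if v in parents}
    grandchildren = {v for (u, v) in links if u in children}
    return parents, children, siblings, spouses, grandparents, grandchildren

parents, children, siblings, spouses, gp, gc = neighbors("p", links)
print(parents)   # {'a'}  - a links to p
print(siblings)  # {'s'}  - a also links to s
print(spouses)   # {'q'}  - q also links to p's child c
print(gc)        # {'g'}  - reached through child c
```

Features (labels, anchor text, titles, and so on) harvested from these sets are then what supplements the on-page representation of p.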
1.4 DISCUSSION: DATA MINING IN WEB PAGE CLASSIFICATION.
Data mining turns a large collection of data into knowledge.
A search engine (e.g., Google) receives hundreds of millions of queries every day. Each
query can be viewed as a transaction where the user describes her or his information
need. What novel and useful knowledge can a search engine learn from such a huge
collection of queries collected from users over time? Interestingly, some patterns found
in user search queries can disclose invaluable knowledge that cannot be obtained by
reading individual data items alone. For example, Google Flu Trends uses specific
search terms as indicators of flu activity. It found a close association between the
number of people who search for flu-related information and the number of people who
actually have flu symptoms. A pattern emerges when all of the search queries related
to flu are aggregated. Using aggregated Google search data, Flu Trends can estimate flu
activity up to two weeks faster than traditional systems can. This example shows
how data mining can turn a large collection of data into knowledge that can help meet a
current global challenge. [9]
1.5 DISCUSSION: FEATURE SELECTION IN DATA MINING
Feature selection is a must for any data mining product. That is because,
when you build a data mining model, the dataset often contains more
information than is needed to build the model. For example, a dataset may
contain 1000 columns that describe the characteristics of customers, but perhaps only 100 of
those columns are used to build a specific model. If you keep the unneeded
columns while building the model, more CPU and memory are required during
the training process, and more storage space is required for the completed
model.
Even if resources are not an issue, you typically want to remove unneeded columns
because they might degrade the quality of discovered patterns, for the following
reasons:
Some columns are noisy or redundant. This noise makes it more difficult to
discover meaningful patterns from the data;
Feature selection helps solve this problem, of having too much data that is of little
value, or of having too little data that is of high value. [9]
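One simple, concrete form of the column pruning described above is dropping near-constant columns, which carry almost no predictive information. A minimal sketch (the data and the variance threshold are illustrative choices):

```python
# Drop columns whose variance falls below a threshold: a near-constant column
# cannot help discriminate between classes. Threshold chosen for illustration.

def column_variances(rows):
    """Population variance of each column of a numeric row-major table."""
    cols = list(zip(*rows))
    out = []
    for col in cols:
        mean = sum(col) / len(col)
        out.append(sum((x - mean) ** 2 for x in col) / len(col))
    return out

def drop_low_variance(rows, threshold=0.01):
    keep = [i for i, v in enumerate(column_variances(rows)) if v > threshold]
    return [[row[i] for i in keep] for row in rows], keep

data = [
    [1.0, 0.0, 3.1],
    [1.0, 0.0, 2.9],
    [1.0, 1.0, 3.5],
]
reduced, kept = drop_low_variance(data)
print(kept)  # [1, 2] - the constant first column is removed
```

This addresses the noise problem directly; redundancy (two columns carrying the same information) would additionally call for a correlation-based check.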
1.6 RELATED WORK
Aijun An, Yanhui Huang, Xiangji Huang and Nick Cercone (2005) - presented a
feature reduction method based on rough set theory and investigated the
effectiveness of the rough set feature selection method on web page classification. [10]
Ali Selamat, Sigeru Omatu (2003) - propose a news web page classification
method (WPCM). The WPCM uses a neural network with inputs obtained by
both the principal components and class profile-based features. Each news
web page is represented by the term-weighting scheme. As the number of
unique words in the collection set is big, principal component analysis
(PCA) has been used to select the most relevant features for the classification.
Then the final output of the PCA is combined with the feature vectors from the
class profile, which contains the most regular words in each class. The
experimental evaluation demonstrates that the WPCM method provides acceptable
classification accuracy with the sports news datasets. [11]
Oh-Woog Kwon, Jong-Hyeok Lee (2006) - proposed a Web page classifier based on
an adaptation of k-Nearest Neighbor (k-NN) approach. To improve the performance of
k-NN approach, they supplement k-NN approach with a feature selection method and a
term-weighting scheme using markup tags, and reform document-document similarity
measure used in vector space model. [13]
Wakaki, T, Itakura, H, Tamura, M (2005) - investigated how rough set theory can
aid in selecting applicable features for Web-page classification. Their experimental
results show that the combination of the rough set-aided feature selection
method and the Support Vector Machine with a linear kernel is quite beneficial
in practice to classify Web-pages into many categories, since not only does the
performance give acceptable accuracy but also high dimensionality
reduction is achieved without depending on arbitrary thresholds for feature selection.
[14]
Soumen Chakrabarti, Byron Dom, Rakesh Agrawal and Prabhakar Raghavan
(2007) - describe an automatic system that starts with a small sample of the
corpus in which topics have been assigned by hand, and then updates the
database with new documents as the corpus grows, assigning topics to these new
documents with high speed and accuracy. [12]
CHAPTER 2 METHODOLOGY
2.1 PROBLEM DESCRIPTION
Web page classification, also known as web page categorization, is the process of
assigning a web page to one or more predefined category labels. Classification is often
posed as a supervised learning problem (Mitchell 1997) in which a set of labeled data is
used to train a classifier which can be applied to label future examples. [6]
This project aims at discovering the most efficient feature selection methods that can
be used for web page classification. We will study in depth about data mining, feature
selection and their applications in web page classification. We will also learn how to use
the WEKA tool in testing the performance of various feature selection methods through
classification tools such as decision tree. [3] We will then collect appropriate datasets
for testing these feature selection methods and use these data sets in finding the
accuracy of various feature selection methods. Finally we will work towards the
development of algorithms for optimum classification results using WEKA tool. [5]
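The train-then-evaluate workflow described above can be illustrated outside WEKA with a toy classifier. The sketch below trains a one-level decision tree (a "decision stump") on made-up binary data and measures held-out accuracy; it is a stand-in for, not a reproduction of, the decision-tree classification step the project runs in WEKA:

```python
# A minimal train/evaluate loop with a one-level decision tree ("stump"):
# pick the single binary feature whose split best predicts the label on the
# training data, then measure accuracy on held-out examples.
# All data below is made up for illustration.

def train_stump(X, y):
    best = None
    for f in range(len(X[0])):                # try each feature
        for pos_label in (0, 1):              # and each label orientation
            preds = [pos_label if row[f] else 1 - pos_label for row in X]
            acc = sum(p == t for p, t in zip(preds, y)) / len(y)
            if best is None or acc > best[0]:
                best = (acc, f, pos_label)
    _, f, pos_label = best
    return lambda row: pos_label if row[f] else 1 - pos_label

X_train = [[1, 0], [1, 1], [0, 0], [0, 1]]
y_train = [1, 1, 0, 0]                        # label follows feature 0 exactly
stump = train_stump(X_train, y_train)

X_test, y_test = [[1, 0], [0, 1]], [1, 0]
accuracy = sum(stump(r) == t for r, t in zip(X_test, y_test)) / len(y_test)
print(accuracy)  # 1.0
```

Real decision-tree learners such as J48 recurse on this idea, choosing splits by information gain rather than raw accuracy, but the fit-then-score loop is the same shape as the WEKA experiments described later.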
Based on the number of classes that can be assigned to an instance, classification can be divided
into single-label classification and multi-label classification. In single-label
classification, one and only one class label is to be assigned to each instance, while in
multi-label classification, more than one class can be assigned to an instance. If a
problem is multi-class, say four-class classification, it means four classes are
involved, say Arts, Business, Computers, and Sports. It can be either single-
label, where exactly one class label can be assigned to an instance, or multi-label,
where an instance can belong to any one, two, or all of the classes. Based on
the type of class assignment, classification can be divided into hard
classification and soft classification. In hard classification, an instance can
either be or not be in a particular class, without a transitional state; while in
soft classification, an instance can be predicted to be in some class with
some probability (often a probability distribution across all classes).
Based on the organization of categories, web page classification can also be divided
into flat classification and hierarchical classification. In flat classification, categories are
considered parallel, i.e., one category does not supersede another. While in hierarchical
classification, the categories are organized in a hierarchical tree-like structure, in which
each category may have a number of subcategories.
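The hard/soft distinction described above amounts to whether the classifier commits to a single label or outputs a probability distribution over all classes. A minimal sketch (the raw scores are made-up classifier outputs, and the class names reuse the four-class example above):

```python
# Soft classification outputs a probability distribution over classes;
# hard classification commits to the argmax. Scores below are illustrative.
import math

def softmax(scores):
    """Turn raw scores into a probability distribution summing to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Arts", "Business", "Computers", "Sports"]
scores = [0.2, 1.1, 2.3, 0.4]            # hypothetical raw classifier scores

soft = softmax(scores)                    # soft: P(class) for every class
hard = classes[soft.index(max(soft))]     # hard: one committed label

print(hard)  # Computers
```

Soft outputs are what topic-biased ranking schemes such as Nie et al.'s (Section 1.2.1) consume: each page contributes to every category in proportion to its probability.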
2.2 DETAILS OF TOOLS/SOFTWARES USED (WEKA)
WEKA
A Java package developed at the University of Waikato in New Zealand. Weka stands
for the Waikato Environment for Knowledge Analysis.
Weka is a collection of machine learning algorithms for solving real-world data
mining problems. The algorithms can either be applied directly to a dataset or
called from your own Java code.
Weka contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization. It is also well suited for developing new machine
learning schemes. Weka is open source software issued under the GNU General
Public License. [6]
The Weka workbench contains a collection of visualization tools and algorithms
for data analysis and predictive modeling, together with graphical user
interfaces for easy access to this functionality. [6]
Advantages of Weka include:
Free availability under the GNU General Public License
Portability, since it is fully implemented in the Java programming
language and thus runs on almost any modern computing platform
A comprehensive collection of data preprocessing and modeling techniques
Ease of use due to its graphical user interfaces
2.3 DATA EXTRACTION (WEBKB)

The datasets used for testing the feature selection methods have been taken from WebKB. WebKB refers to the knowledge base (KB) servers WebKB-1 and WebKB-2. KB servers are not Web search engines; they are online knowledge-based systems. KBMSs are database management systems that, unlike relational DBMSs, object-oriented DBMSs and deductive DBMSs, permit end-users to dynamically modify a large number of conceptual definitions in their KBs, and hence do not limit end-users to predefined kinds of data [7].
Specifically, we use the 4 Universities Dataset. This data set contains WWW pages collected from the computer science departments of various universities in January 1998 by the WebKB project. For each class the data set contains pages from the four universities. The 8,282 pages were manually classified into the following categories:
Category      No. of Pages
Student       1641
Faculty       1124
Staff         137
Department    182
Course        930
Project       504
Other         3764

University    Pages
Cornell       867
Washington    1205
Texas         827
Wisconsin     1263
CHAPTER 3: PROPOSED WORK AND IMPORTANT CONCEPTS
3.1 ALGORITHM USED FOR WEB PAGE CLASSIFIER
The aim of this project is to find a suitable combination of feature selection methods for web page categorization; it also covers the main topics in feature selection. The procedure covers three stages: a) the extraction of representative features to describe the content (the initial set); b) the selection of the best features from the initial set by applying a further feature selection method (minimizing the number of features while maximizing the discriminative information carried by them); and c) training and classification using the resulting features in different classifiers, to determine the quality of the features. [14]
3.1.1. PREPROCESSING
Remove stop words, common words, punctuation symbols and HTML tags
Perform stemming
1) Stop Words - In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). The list is controlled by human input and is not automated. There is not one definite list of stop words which all tools use, if one is used at all; some tools specifically avoid removing them to support phrase search. [15]
2) Stemming - In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since 1968. [15]
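The two preprocessing steps above can be sketched in Python (the project itself performs them inside the WEKA pipeline; the stop-word list and the suffix-stripping rules below are toy stand-ins for a full stop-word list and a Porter-style stemmer):

```python
import re

# A toy stop-word list; real systems use much longer lists
# (an assumption, not the exact list used in the project).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for"}

def strip_html(page):
    """Remove HTML tags with a simple regex (adequate for well-formed tags)."""
    return re.sub(r"<[^>]+>", " ", page)

def stem(word):
    """Very naive suffix-stripping stemmer, standing in for a real
    Porter-style stemmer."""
    for suffix in ("ingly", "edly", "ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(page):
    """Tags -> tokens -> drop stop words/punctuation/digits -> stem."""
    text = strip_html(page).lower()
    tokens = re.findall(r"[a-z]+", text)  # drops punctuation and digits
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>The students are studying classification in 2011</p>"))
# -> ['student', 'study', 'classification']
```

Note how the HTML tags, the stop words and the digits disappear, and the surviving words are reduced to a common root before the frequency counts of the next step are taken.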
3.1.2. FEATURE SET EXTRACTION, F1
Find frequency of occurrence of independent words
Find the term frequency-inverse document frequency of the documents.

Term Frequency-Inverse Document Frequency
The term frequency of a term in a given document is the number of times the term appears in that document. This count is usually normalized to prevent a bias towards longer web pages (which may have a higher term frequency regardless of the actual importance of that term in the web page), giving a measure of the importance of the term ti within the particular web page dj: [16]

tf(i,j) = n(i,j) / sum_k n(k,j)

where n(i,j) is the number of occurrences of the considered term in web page dj, and the denominator is the number of occurrences of all terms in web page dj.
The inverse document frequency is a measure of the general importance of the term, obtained by dividing the number of all documents by the number of web pages containing the term, and then taking the logarithm of that quotient: [16]

idf(i) = log( |D| / |{d : ti in d}| )

where |D| is the total number of web pages in the corpus, and |{d : ti in d}| is the number of web pages where the term ti appears. Then

tfidf(i,j) = tf(i,j) x idf(i)
A high weight in tf-idf is reached by a high term frequency (in the given web page) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. [16]
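The formulas above can be computed directly, as in the following sketch (a toy illustration: the corpus and the term names are invented, and real implementations often smooth or vary the idf term):

```python
import math

def tf(term, doc):
    """Normalized term frequency: n(i,j) / total terms in web page dj."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(|D| / number of web pages containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Three invented "web pages", already preprocessed into word lists.
corpus = [
    ["course", "exam", "lecture", "course"],
    ["faculty", "research", "lecture"],
    ["student", "course", "project"],
]

# "course" appears in 2 of 3 pages, so its idf is low; "exam" appears
# in only 1 page, so it receives a higher weight on that page.
print(tfidf("exam", corpus[0], corpus))
print(tfidf("course", corpus[0], corpus))
```

The comparison at the end shows exactly the filtering effect described above: the rarer term "exam" outweighs the more frequent but widespread term "course" on the same page.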
3.1.3. FEATURE SELECTION, F2

Select features from F1 using the CfsSubsetEval method.
The CfsSubsetEval Algorithm
A feature is useful if it is correlated with or predictive of the class; otherwise it is irrelevant. A feature Vi is said to be relevant iff there exist some vi and c for which p(Vi = vi) > 0 such that p(C = c | Vi = vi) ≠ p(C = c).

Empirical evidence from the feature selection literature shows that, along with irrelevant features, redundant information should be eliminated as well. The above definitions of relevance and redundancy lead to the following hypothesis, on which the feature selection method used in this report is based: a good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other. [17]
When we develop a composite which we intend to use as a basis for predicting an outside variable, it is likely that the components we select to form the composite will have relatively low inter-correlations. When we seek to predict some variable from several other variables, we try to select predictor variables which measure different aspects of the outside variable.
The CfsSubsetEval algorithm evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between the features. Hence it is a central part of our approach. [17]
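Concretely, CFS scores a candidate subset of k features with Hall's merit heuristic, Merit_S = k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature inter-correlation of the subset [8]. A minimal sketch, with the correlation values invented purely for illustration:

```python
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Hall's CFS merit: rewards subsets whose features correlate with
    the class (numerator) while penalizing redundancy among the
    features themselves (denominator)."""
    r_cf = mean_feature_class_corr
    r_ff = mean_feature_feature_corr
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# Two hypothetical 5-feature subsets with the same class correlation:
# the less redundant one receives the higher merit.
low_redundancy = cfs_merit(5, 0.6, 0.1)
high_redundancy = cfs_merit(5, 0.6, 0.8)
print(low_redundancy, high_redundancy)
```

WEKA searches the space of subsets (e.g. with best-first search) for the subset maximizing this score; the sketch only shows how the score itself trades class correlation against redundancy.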
3.1.4. TRAINING THE C4.5 DECISION TREE CLASSIFIER WITH F2

C4.5 Decision Tree Classifier

Determining the relative importance of a feature is one of the basic tasks during decision tree generation. C4.5 (Quinlan, 1993) uses a univariate feature selection strategy. At each level of the tree building process only one attribute, the attribute with the highest value of the selection criterion, is picked out of the set of all attributes. Afterwards the sample set is split into subsample sets according to the values of this attribute, and the whole procedure is recursively repeated until only samples from one class remain in the sample set, or until the remaining sample set has no discrimination power anymore and the tree building process stops. [19]
As we can see, feature selection is only done at the root node over the entire decision space. After this level, the sample set is split into sub-samples and only the most important feature in each remaining sub-sample set is selected. Geometrically this means that, beginning after the root node, the search for good features is only done in orthogonal decision subspaces, which might not represent the real distributions. Thus, unlike statistical feature search strategies (Fukunaga, 1990), this approach is not driven by an evaluation measure over the combinatorial feature subset; it is driven only by the best single feature. This might not lead to an optimal feature subset in terms of classification accuracy. [19]
C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = s1, s2, ... of already classified samples. Each sample si = x1, x2, ... is a vector where x1, x2, ... represent attributes or features of the sample. The training data is augmented with a vector C = c1, c2, ..., where c1, c2, ... represent the class to which each sample belongs.
At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision, and the C4.5 algorithm then recurses on the smaller sublists. [13]
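The entropy-based splitting criterion can be sketched as follows (a simplified illustration computing plain information gain; C4.5 actually normalizes this into the gain ratio, and the toy data is invented):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy of the parent node minus the weighted entropy of the
    subsets produced by splitting on one attribute."""
    parent = entropy(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s)
                    for s in subsets.values())
    return parent - remainder

# Toy data: attribute 0 separates the classes perfectly (gain = 1 bit),
# while attribute 1 carries no information about the class (gain = 0).
rows = [("yes", "x"), ("yes", "y"), ("no", "x"), ("no", "y")]
labels = ["course", "course", "student", "student"]
print(information_gain(rows, labels, 0), information_gain(rows, labels, 1))
```

The attribute with the highest gain becomes the split at the current node, and the procedure recurses on each resulting subset, exactly as described above.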
3.1.5. The features in the pruned decision tree form the final set of features, F3.

3.1.6. Evaluate the performance of the machine learning classifiers using the final set of features, F3.
3.2 IMPLEMENTED CLASSIFIERS
3.2.1 NAÏVE BAYES CLASSIFIER

In some cases the Naïve Bayes classifier is even seen to outperform more complex algorithms. It makes use of the variables contained in the data sample by observing them individually, independent of each other. [12]
The Naïve Bayes classifier is based on the Bayes rule of conditional probability. It
makes use of all the attributes contained in the data, and analyses them individually as
though they are equally important and independent of each other. For example,
consider that the training data consists of various animals (say elephants, monkeys and
giraffes), and our classifier has to classify any new instance that it encounters. We know
that elephants have attributes like they have a trunk, huge tusks, a short tail, are
extremely big, etc. Monkeys are short in size, jump around a lot, and can climb trees;
whereas giraffes are tall, have a long neck and short ears.[12]
It is seen that the Naïve Bayes classifier performs almost at par with the other classifiers in most cases. Of the 26 different experiments carried out on various datasets, the Naïve Bayes classifier shows a drop in performance in only 3-4 cases when compared with J48 and Support Vector Machines. This supports the widely held belief that, though simple in concept, the Naïve Bayes classifier works well in most data classification problems. [16]
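A minimal sketch of the idea on word lists (multinomial Naïve Bayes with Laplace smoothing; the tiny training set is invented for illustration, and the project itself uses WEKA's NaiveBayes implementation rather than hand-written code):

```python
import math
from collections import Counter, defaultdict

# Invented toy training data: (word list, class label).
train = [
    (["lecture", "exam", "syllabus"], "course"),
    (["exam", "homework", "lecture"], "course"),
    (["thesis", "advisor", "research"], "student"),
    (["research", "homework", "advisor"], "student"),
]

class_docs = defaultdict(list)
for words, label in train:
    class_docs[label].extend(words)
vocab = {w for words, _ in train for w in words}

def predict(words):
    """argmax over classes of log P(class) + sum of log P(word | class),
    treating each word independently (the 'naive' assumption)."""
    best, best_score = None, -math.inf
    for label, bag in class_docs.items():
        counts = Counter(bag)
        prior = sum(1 for _, l in train if l == label) / len(train)
        score = math.log(prior)
        for w in words:
            # Laplace smoothing keeps unseen words from zeroing the product.
            score += math.log((counts[w] + 1) / (len(bag) + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(predict(["exam", "lecture"]))
```

Each word contributes its own conditional probability independently of the others, which is exactly the attribute-independence assumption described above.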
3.2.2. K*
K* is an instance-based classifier; that is, the class of a test instance is based upon the class of those training instances similar to it, as determined by some similarity function.
It differs from other instance-based learners in that it uses an entropy-based distance
function. [17]
Instance-based learners classify an instance by comparing it to a database of pre-classified examples. The fundamental assumption is that similar instances will have similar classifications. The question lies in how to define a similar instance and a similar classification. The corresponding components of an instance-based learner are the distance function, which determines how similar two instances are, and the classification function, which specifies how instance similarities yield a final classification for the new instance.
The approach taken here to computing the distance between two instances is motivated by information theory. The intuition is that the distance between instances is defined as the complexity of transforming one instance into another. The calculation of the complexity is done in two steps. First, a finite set of transformations which map instances to instances is defined. A program to transform one instance (a) to another (b) is then a finite sequence of transformations starting at a and terminating at b. [17]
3.2.3 J48
A decision tree is a predictive machine-learning model that decides the target value (dependent variable) of a new sample based on various attribute values of the available data. The internal nodes of a decision tree denote the different attributes; the branches between the nodes tell us the possible values that these attributes can have in the observed samples, while the terminal nodes tell us the final value (classification) of the dependent variable. [19]
The attribute that is to be predicted is known as the dependent variable, since its value
depends upon, or is decided by, the values of all the other attributes. The other
attributes, which help in predicting the value of the dependent variable, are known as
the independent variables in the dataset.[19]
The J48 decision tree classifier follows a simple algorithm. In order to classify a new item, it first needs to create a decision tree based on the attribute values of the available training data. So, whenever it encounters a set of items (a training set), it identifies the attribute that discriminates the various instances most clearly. This feature, the one that tells us most about the data instances so that we can classify them best, is said to have the highest information gain. Now, among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category all have the same value for the target variable, then we terminate that branch and assign to it the target value that we have obtained.
For the other cases, we then look for another attribute that gives us the highest
information gain.[19] Hence we continue in this manner until we either get a clear
decision of what combination of attributes gives us a particular target value, or we run
out of attributes. In the event that we run out of attributes, or if we cannot get an
unambiguous result from the available information, we assign this branch a target value
that the majority of the items under this branch possess.
CHAPTER 4: RESULTS & DISCUSSION

4.1 Initial Feature Selection F1

The web pages are initially preprocessed as follows. The HTML tags, stopwords, punctuation and digits are removed, and words are reduced to their root form. To diminish the weight of frequently occurring words and increase the weight of rare words in a web page, the term frequency-inverse document frequency of each word in a web page is computed. This forms the initial set of features F1, as listed in Table 1.
Table 1: Initial set of features F1

S. No.   Sample size   No. of Instances   No. of features
1.       p70-n30       10                 694
2.       p350-n150     100                2759
4.2 Feature Selection F2

CfsSubsetEval is run on F1 to further select the features with more information gain on the class and less inter-correlation. This forms F2, as shown in Table 2.
4.3 Feature Selection F3

A C4.5 decision tree classifier is built using F2. The features that C4.5 uses in its pruned tree are alone selected as the final features F3, as shown in Table 3.
Table 2: Features selected, F2

S. No.   Sample size   No. of Instances   No. of features
1.       p70-n30       10                 6
2.       p350-n150     100                25

Table 3: Features selected, F3

S. No.   Sample size   No. of features
1.       p70-n30       2
2.       p350-n150     5

4.4 CLASSIFICATION

The performance of the various machine learning classifiers is evaluated on the different feature sets F1, F2 and F3. The time taken to build the classifier in each case is also observed, as illustrated in the following tables.

Table 4: Classification on the feature set F1

S. No.   Sample Size   NB     K*     J48
1.       p70-n30       70%    90%    40%
2.       p350-n150     91%    50%    95%
Table 5: Classification on the feature set F2

S. No.   Sample Size   NB     K*     J48
1.       p70-n30       90%    100%   50%
2.       p350-n150     98%    96%    95%

It can be observed that the classification accuracy has increased with all classifiers on the reduced feature set F2, and the time taken to build the classifiers is also reduced. K* has exhibited higher accuracy on F2 than on F1.
Figure 1: Decision Tree on file 1
Figure 2: Decision Tree on file 2
Table 6: Classification on the feature set F3

S. No.   Sample Size   NB     K*     J48
1.       p70-n30       97%    100%   85%
2.       p350-n150     99%    100%   95%

It can be inferred from Table 6 that the time taken to build the classifiers with F3 has significantly reduced, with no compromise in accuracy. Therefore, for better accuracy with reduced resource utilization, selecting the best and optimum features is most important. This is one way of improving the performance of the classifiers.
CHAPTER 5: CONCLUSION

After reviewing web page classification research with respect to its features and algorithms, we conclude this report by summarizing the lessons learned from existing research and pointing out future opportunities in web classification.

Web page classification is a type of supervised learning problem that aims to categorize web pages into a set of predefined categories based on labeled training data. Classification tasks include assigning documents on the basis of subject, function, sentiment, genre, and more.
Unlike more general text classification, web page classification methods can take
advantage of the semi-structured content and connections to other pages within the
Web. We have surveyed the space of published approaches to web page classification
from various viewpoints, and summarized their findings and contributions, with a special
emphasis on the utilization and benefits of web-specific features and methods. We
found that while the appropriate use of textual and visual features that reside directly on
the page can improve classification performance, features from neighboring pages
provide significant supplementary information to the page being classified. Feature
selection and the combination of multiple techniques can bring further improvement. We
expect that future web classification efforts will certainly combine content and link
information in some form.
Finally, we wish to explicitly note the connection between machine learning and
information retrieval (especially ranking). This idea is not new, and has underpinned
many of the ideas in the work presented here. A learning to rank community (Joachims
2002; Burges, Shaked, Renshaw, Lazier, Deeds, Hamilton, and Hullender 2005;
Radlinski and Joachims 2005; Roussinov and Fan 2005; Agarwal 2006; Cao, Xu, Liu,
Li, Huang, and Hon 2006; Richardson, Prakash, and Brill 2006) is making advances for
both static and query specific ranking. There may be unexplored opportunities for using
retrieval techniques for classification as well. In general, many of the features and
approaches for web page classification have counterparts in analysis for web page
retrieval. Future advances in web page classification should be able to inform retrieval
and vice versa.
CHAPTER 6: LITERATURE SURVEY/RESOURCES

BOOKS

[1] Discovering Knowledge in Data: An Introduction to Data Mining (Daniel T. Larose)
[2] The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition) by Trevor Hastie, Robert Tibshirani and Jerome Friedman (2009)
[3] Data Mining Techniques for Web Page Classification (Gabriel Fiol-Roig, Margaret
Mir-Juli, Eduardo Herraiz)
[4] Qi X and Davison B.D. (2009) Web Page Classification: Features and Algorithms.
ACM Computing Surveys
[5] Feature Selection with Rough Sets for Web Page Classification Lecture Notes in
Computer Science, 2005, Volume 3135/2005, 280-305,
[6] Web page feature selection and classification using neural networks - Computer and
System Sciences, Graduate School of Engineering, Osaka Prefecture University
[7] Web page classification based on k-nearest neighbor approach Dept. of Computer
Science and Engineering, Pohang University of Science and Technology, San 31 Hyoja
Dong, Pohang, 790-784, Korea
[8] M. A. Hall (1998). Correlation based Feature Subset Selection for Machine Learning.
Hamilton, New Zealand.
JOURNAL RESEARCH PAPERS
[9] Feature Selection for Web Page Classification (Daniele Riboni D.S.I., Universita
degli Studi di Milano, Italy)
[10] WEKA: Experiences with a Java Open-Source Project (Department of Computer Science, University of Waikato, Hamilton, New Zealand)
[11] Mir-Juli M., Fiol-Roig G. and Vaquer-Ferrer D. (2009), Classification using Intelligent Approaches: an Example in Social Assistance, Frontiers in Artificial Intelligence and Applications
[12] Arul Prakash Asirvatham, Kranthi Kumar Ravi (2001), Web Page Classification based on Document Structure, awarded second prize in the National Level Student Paper Contest conducted by IEEE India Council.
[13] Xiaogang Peng, Ben Choi (2002), Automatic Web Page Classification in a Dynamic and Hierarchical Way, In Proceedings of the Second IEEE International Conference on Data Mining, Washington DC, IEEE Computer Society, pp. 386-393
[14] Yicen Liu, Mingrong Liu, Liang Xiang and Qing Yang, (2008), Entity-Based
Classification of Web Page in Search Engine, ICADL, LNCS, Vol. 5362, pp:411- 412.
[15] S.B. Kotsiantis, Supervised Machine Learning: A Review of Classification Techniques, Informatica 31 (2007), 249-268
[16] Harry Zhang "The Optimality of Naive Bayes". FLAIRS2004 conference.
WEBSITES
[17] Effectiveness of Web Page Classification on Finding List Answers
(www.comp.nus.edu.sg)
[18] Data Mining Techniques for Web Page Classification
(http://eduherraiz.com/papers/)
[19] Term Frequency Inverse Document (http://nlp.stanford.edu/IR-
book/html/htmledition/inverse-document-frequency-1.html )
[20] Classification via Decision Trees in WEKA
(http://maya.cs.depaul.edu/classes/ect584/weka/classify.html )
LIST OF FIGURES AND TABLES

TABLES

1) Table 1: Initial set of features F1
2) Table 2: Features selected, F2
3) Table 3: Features selected, F3
4) Table 4: Classification on the feature set F1
5) Table 5: Classification on the feature set F2
6) Table 6: Classification on the feature set F3

FIGURES

1) Figure 1: Decision Tree on file 1
2) Figure 2: Decision Tree on file 2