Investigation of Feature Selection Methods in Web Page Classification using Data Mining...



    PROJECT REPORT

    ON

    Investigation of Feature Selection methods for Web Page Classification

    BY

    Aayush Gupta 2008A4PS004U

    FOR

    FINAL YEAR PROJECT

    COMPUTER PROJECT (BITSC331)

BITS Pilani, Dubai Campus, Dubai International Academic City (DIAC)

    Dubai, U.A.E

    FIRST SEMESTER 2011-2012



BITS Pilani, Dubai Campus, Dubai International Academic City (DIAC)

    Dubai, U.A.E

Duration: 6th Sept 2012 to 27th Dec 2012

Date of Start: 06/09/2012

    Date of Submission: ___________

Title of the Project: Investigation of Feature Selection Methods for Web Page Classification

    Name of Student:

    Aayush Gupta 2008A4PS004U

    Discipline of Student: Mechanical Engineering

    Name of the Faculty: Ms. J. Alamelu Mangai

    Abstract:

Since the Internet provides billions of web pages for every search word, getting suitable and relevant results from it quickly becomes very difficult. Automatic classification of web pages into required categories is a current research subject, as it allows a search engine to retrieve the essential results. Because web pages contain many extraneous and infrequent words that reduce the effectiveness of the classifier, mining or selecting characteristic features from the web page is an essential pre-processing step. This project analyzes various feature selection methods and applies them to the representation of web pages in order to improve categorization accuracy. We process data collected from WebKB, which is sourced from more than 100 web pages of different universities, and use these processed files to measure the efficiency of various feature selection methods through the WEKA tool; we also classify them with the decision tree method to further analyze our results and find the optimal algorithm. Our experiments show the usefulness of dimensionality reduction and of a new, structure-oriented weighting technique. The objective of this report is to identify the applications of web page classification, to understand the various challenges of WPC


algorithms, and to understand, survey, and analyze the various feature selection methods used for web page classification.


TABLE OF CONTENTS

Abstract

Table of Contents

List of Figures/Tables

Chapter 1 INTRODUCTION

1.1 DATA MINING: BASICS

1.2 WEB PAGE CLASSIFICATION: BASICS

1.2.1 APPLICATIONS OF WEB CLASSIFICATION

1.3 DISCUSSION: FEATURES

1.3.1 USING ON-PAGE FEATURES

1.3.2 USING FEATURES OF NEIGHBORS

1.4 DISCUSSION: DATA MINING IN WEB PAGE CLASSIFICATION

1.5 DISCUSSION: FEATURE SELECTION IN DATA MINING

1.6 RELATED WORK

Chapter 2 LITERATURE SURVEY

Chapter 3 METHODOLOGIES

2.1 PROBLEM DESCRIPTION

2.2 TOOLS USED FOR RESEARCH

2.3 DATA EXTRACTION

Chapter 4 PROPOSED WORK / IMPORTANT CONCEPTS

3.1 ALGORITHMS USED FOR WEB PAGE CLASSIFIER

3.2 IMPLEMENTED CLASSIFIERS

3.2.1 NAÏVE BAYES

3.2.2 K-STAR

3.2.3 J48

Chapter 5 RESULTS AND DISCUSSION


4.1 INITIAL FEATURE SELECTION

4.2 FEATURE SELECTION, F2

4.3 FEATURE SELECTION, F3

4.4 CLASSIFICATION

Chapter 6 CONCLUSION

Chapter 7 LITERATURE SURVEY / RESOURCES


CHAPTER 1 INTRODUCTION

    1.1 DATA MINING: BASICS

    Data Mining (the analysis step of the Knowledge Discovery in Databases process, or

    KDD), a relatively young and interdisciplinary field of computer science, is the process

    of discovering new patterns from large data sets involving methods

    from statistics and artificial intelligence but also database management. In contrast

to machine learning, the emphasis lies on the discovery of previously unknown patterns

    as opposed to generalizing known patterns to new data.[1]

According to the Gartner Group, "Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques." [1]

Data mining is predicted to be one of the most revolutionary developments of the next decade, according to the online technology magazine ZDNET News. In fact, the MIT Technology Review chose data mining as one of 10 emerging technologies that will change the world. "Data mining expertise is the most sought after . . . among information technology professionals," according to the 1999 Information Week National Salary Survey [9]. The survey reports: "Data mining skills are in high demand this year, as organizations increasingly put data repositories online. Effectively analyzing information from customers, partners, and suppliers has become important to more companies. 'Many companies have implemented a data warehouse strategy and are now starting to look at what they can do with all that data,' says Dudley Brown, managing partner of BridgeGate LLC, a recruiting firm in Irvine, Calif." [3]


    The following list shows the most common data mining tasks.[3]

1. Description - Sometimes, researchers and analysts are simply trying to find ways to describe the patterns and trends lying within the data. Descriptions of patterns and trends often suggest possible explanations for them. For example, those who have been laid off may now be less well off financially than before the incumbent was elected, and so would tend to prefer an alternative.

2. Estimation - Estimation is similar to classification except that the target variable is numerical rather than categorical. Models are built using complete records, which provide the value of the target variable as well as the predictors. Then, for new observations, estimates of the value of the target variable are made, based on the values of the predictors. For example, we might be interested in estimating the systolic blood pressure reading of a hospital patient, based on the patient's age, gender, body-mass index, and blood sodium levels.

3. Prediction - Prediction is similar to classification and estimation, except that for prediction, the results lie in the future. Any of the methods and techniques used for classification and estimation may also be used, under appropriate circumstances, for prediction. These include the traditional statistical methods of point estimation and confidence interval estimation, simple linear regression and correlation, and multiple regression.

4. Classification - In classification, there is a target categorical variable, such as income bracket, which, for example, could be divided into three classes or categories: high income, middle income, and low income. The data-mining model inspects a large set of records, each record containing information on the target variable as well as a set of input or predictor variables.


5. Clustering - Clustering refers to the grouping of records, observations, or cases

    into classes of similar objects. A cluster is a collection of records that are similar to one

    another, and dissimilar to records in other clusters. Clustering differs from classification

    in that there is no target variable for clustering. The clustering task does not try to

    classify, estimate, or predict the value of a target variable. Instead, clustering algorithms

    seek to segment the entire data set into relatively homogeneous subgroups or clusters,

    where the similarity of the records within the cluster is maximized and the similarity to

    records outside the cluster is minimized.

6. Association - The association task for data mining is the job of finding which attributes go together. Most prevalent in the business world, where it is known as affinity analysis or market basket analysis, the task of association seeks to uncover rules for quantifying the relationship between two or more attributes. Association rules are of the form "If antecedent, then consequent," together with a measure of the support and confidence associated with the rule. [9]
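The support and confidence of such a rule can be computed directly from transaction data. The sketch below uses a tiny invented basket dataset; the item names are purely illustrative.

```python
# Support and confidence of an association rule "If antecedent, then consequent",
# computed over a tiny invented basket dataset.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

def support(itemset, baskets):
    # Fraction of baskets that contain every item in the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    # Estimated P(consequent | antecedent): support of the union
    # divided by support of the antecedent alone.
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

print(support({"bread", "milk"}, baskets))       # 0.5
print(confidence({"bread"}, {"milk"}, baskets))  # ≈ 0.667
```

A rule such as "If bread, then milk" would typically be reported only when both measures clear user-chosen minimum thresholds.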

    1.2 WEB PAGE CLASSIFICATION: BASICS

Webpage classification or webpage categorization is the process of assigning a webpage to one or more category labels, e.g. News, Sport, Business.

    Classification of web page content is essential to many tasks in web information

    retrieval such as maintaining web directories and focused crawling. The uncontrolled

    nature of web content presents additional challenges to web page classification as

    compared to traditional text classification, but the interconnected nature of hypertext

    also provides features that can assist the process.[4]

Classification plays a fundamental role in many information management and retrieval tasks. On the Web, classification of page content is


important to focused crawling, to the assisted development of web directories, to topic-specific web link analysis, and to analysis of the topical structure of the Web. Web page classification can also help improve the quality of web search.

The universal problem of webpage classification can be divided into:

Subject classification: the subject or topic of the webpage, e.g. Adult, Sport, Business.

Function classification: the role that the webpage plays, e.g. Personal homepage, Course page, Admission page.

Based on the number of classes, webpage classification can be divided into binary classification and multi-class classification. Based on the number of classes that can be assigned to an instance, classification can be separated into single-label classification and multi-label classification. [5]


    1.2.1 Applications of web classification

1) Constructing, maintaining or expanding web directories (web hierarchies)

    Web directories, such as those provided by Yahoo! (2007) and the dmoz Open

    Directory Project (ODP) (2007), provide an efficient way to browse for information within

    a predefined set of categories. Currently, these directories are mainly constructed and

    maintained by editors, requiring extensive human effort. As of July 2006, it was reported

    (Corporation 2007) that there are 73,354 editors involved in the dmoz ODP. As the Web


    changes and continues to grow, this manual approach will become less effective. One

    could easily imagine building classifiers to help update and expand such directories. For

    example, Huang et al. (2004a, 2004b) propose an approach to automatic creation of

classifiers from web corpora based on user-defined hierarchies. Furthermore, with advanced classification techniques, customized (or even dynamic) views of web directories can be generated automatically. [7]

2) Improving the quality of search results

Query ambiguity is among the problems that undermine the quality of search results. For example, the query term "bank" could mean the border of a water area or a financial establishment. Numerous approaches have been proposed to improve retrieval quality by disambiguating query terms. Chekuri et al. (1997) studied automatic web page classification in order to increase the precision of web search. A statistical classifier, trained on existing web directories, is applied to new web pages and creates an ordered list of categories in which the web page could be placed. At query time the user is asked to specify one or more desired categories so that only the results in those categories are returned, or the search engine returns a list of categories under which the pages would fall. This method works when the user is looking for a known item. In such a case, it is not difficult to specify the preferred categories. However, there are circumstances in which the user is less certain about what documents will match, for which the above method does not help much.

    Search results are usually presented in a ranked list. However, presenting categorized,

    or clustered, results could be more useful to users. An approach proposed by Chen and

    Dumais (2000) classifies search results into a predefined hierarchical structure and

    presents the categorized view of the results to the user. Their user study demonstrated

that users like the category interface better than the result list interface and find it more effective for locating the desired information. Compared to the approach


    suggested by Chekuri et al., this approach is less efficient at query time because it

    categorizes web pages on the fly. However, it does not require the user to specify

    desired categories; therefore, it is more helpful when the user does not know the query

terms well. Similarly, Kaki (2005) also proposed presenting a categorized view of search results to users. Experiments showed that the categorized view is beneficial for the users, especially when the ranking of results is not satisfying.

3) Topic-specific web link analysis

In 1998, Page and Brin developed the link-based ranking algorithm called PageRank

    (1998). PageRank calculates the authoritativeness of web pages based on a graph

    constructed by web pages and their hyperlinks, without considering the topic of each

page. Since then, much research has explored differentiating the authorities of different topics. Haveliwala (2002) proposed Topic-sensitive PageRank, which performs multiple PageRank calculations, one for each topic. When computing the PageRank

    score for each category, the random surfer jumps to a page in that category at random

    rather than just any web page. This has the effect of biasing the PageRank to that topic.

    This approach needs a set of pages that are accurately classified. Nie et al. (2006)

    proposed another web ranking algorithm that considers the topics of web pages. In that

    work, the contribution that each category has to the authority of web pages is

    distinguished by means of soft classification, in which a probability distribution is given

for a web page being in each category. In order to answer the question of to what granularity of topic the computation of biased page ranks makes sense, Kohlschutter et al. (2007) conducted an analysis on ODP categories and showed that ranking performance increases with the ODP level up to a certain point. Further research along this direction seems quite promising. [7]
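The idea of topic-biased PageRank described above can be sketched in a few lines: the random surfer teleports only to pages in the chosen category, which biases the resulting scores toward that topic. The four-page link graph, topic set, and parameters below are invented for illustration; this is not Haveliwala's actual implementation.

```python
# Topic-biased PageRank sketch: teleportation jumps land only on pages in the
# topic set, biasing scores toward that category. Graph and topic set are invented.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # page -> outlinked pages
topic_pages = {0, 1}                           # pages known to be in the topic
n, d, iters = 4, 0.85, 100                     # graph size, damping, iterations

rank = [1.0 / n] * n
teleport = [1.0 / len(topic_pages) if p in topic_pages else 0.0 for p in range(n)]
for _ in range(iters):
    new = [(1 - d) * teleport[p] for p in range(n)]   # biased random jump
    for src, outs in links.items():
        for dst in outs:
            new[dst] += d * rank[src] / len(outs)     # follow an outlink
    rank = new
print(rank)
```

Page 3 has no inlinks and sits outside the topic set, so its score decays to zero here, whereas an unbiased PageRank would still grant it the baseline teleport mass.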

    4) Building efficient focused crawlers or vertical (domain-specific) search engines

    When only domain-specific queries are expected, performing a full crawl is usually

    inefficient. Chakrabarti et al. (Chakrabarti et al. 1999) proposed an approach called

    focused crawling, in which only documents relevant to a predefined set of topics are of


    interest. In this approach, a classifier is used to evaluate the relevance of a web page to

    the given topics so as to provide evidence for the crawl boundary.

    5) Other applications

    Besides the applications discussed above, web page classification is also useful in web

    content filtering (Hammami, Chahir, and Chen 2003; Chen, Wu, Zhu, and Hu 2006),

    assisted web browsing (Armstrong, Freitag, Joachims, and Mitchell 1995; Pazzani,

    Muramatsu, and Billsus 1996; Joachims, Freitag, and Mitchell 1997) and in knowledge

    base construction (Craven, DiPasquo, Freitag, McCallum, Mitchell, Nigam, and Slattery

    1998).

    1.3 DISCUSSION: FEATURES

Feature selection has been a dynamic research area in the pattern recognition, statistics, and data mining communities. The main idea of feature selection is to choose a subset of input variables by excluding features with little or no predictive information. Feature selection can significantly improve the comprehensibility of the resulting classifier models and often builds a model that generalizes better to unseen points. Further, it is often the case that finding the correct subset of predictive features is an important problem in its own right. For example, a physician may decide, based on the selected features, whether a dangerous surgery is necessary for treatment. [5]

In this section, we review the types of features found to be useful in web page classification research.

    Written in HTML, web pages contain additional information, such as HTML tags,

hyperlinks and anchor text (the text to be clicked on to activate and follow a hyperlink to another web page, placed between the HTML <a> and </a> tags), other than the textual

    content visible in a web browser. These features can be divided into two broad classes:


    on-page features, which are directly located on the page to be classified, and features

    of neighbors, which are found on the pages related in some way with the page to be

    classified. [7]

    1.3.1 Using on-page features

a) Textual content and tags - Directly located on the page, the textual content is the most straightforward feature that one may consider using. However, due to the variety of uncontrolled noise in web pages, directly using a bag-of-words representation for all terms may not achieve top performance.
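A bag-of-words representation simply counts term occurrences. The sketch below, on an invented page snippet, shows how frequent function words dominate the raw counts (one source of the noise just mentioned) and how even a small stop list reduces it.

```python
# Bag-of-words representation of a short invented page snippet.
from collections import Counter

STOPWORDS = {"the", "for", "a", "of"}   # tiny illustrative stop list

def bag_of_words(text, drop_stopwords=False):
    words = text.lower().split()
    if drop_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return Counter(words)

page = "the course page for the data mining course"
print(bag_of_words(page))                        # raw counts: 'the' ties for top
print(bag_of_words(page, drop_stopwords=True))   # only content words remain
```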

    One obvious feature that appears in HTML documents but not in plain text documents

    is HTML tags. It has been demonstrated that using information derived from tags can

boost the classifier's performance. Golub and Ardo (2005) derived significance indicators for textual content in different tags. In their work, four elements from the web page are used: title, headings, metadata, and main text. They showed that the best result is achieved from a well-tuned linear combination of the four elements.

Thus, utilizing tags can take advantage of the structural information embedded in HTML files, which is usually ignored by plain-text methods. However, since most HTML tags are oriented toward presentation rather than semantics, web page authors may generate different but conceptually equivalent tag structures. Therefore, using HTML tagging information in web classification may suffer from the inconsistent use of tags across HTML documents. [7]
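A significance indicator of the kind Golub and Ardo describe can be sketched as a linear combination of a term's counts in each page element. The element weights and the sample page below are invented, not the tuned values from their work.

```python
# Tag-weighted term significance: a linear combination of per-element counts.
# The weights and the sample page are invented for illustration.
WEIGHTS = {"title": 3.0, "headings": 2.0, "metadata": 1.5, "text": 1.0}

def term_score(term, elements):
    # elements maps an element name to the list of tokens extracted from it.
    return sum(WEIGHTS[e] * tokens.count(term) for e, tokens in elements.items())

page = {
    "title": ["machine", "learning"],
    "headings": ["learning", "algorithms"],
    "metadata": ["ai"],
    "text": ["machine", "learning", "is", "fun"],
}
print(term_score("learning", page))  # 3.0 + 2.0 + 1.0 = 6.0
```

A term in the title thus counts for more than the same term in the main text, which is the structural information a plain bag-of-words model discards.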

    b) Visual analysis- Each web page has two representations, if not more. One is the

    text representation written in HTML. The other one is the visual representation rendered

    by a web browser. They provide different views of a page. Most approaches focus on

    the text representation while ignoring the visual information. Yet the visual

    representation is useful as well.


    Although the visual layout of a page relies on the tags, using visual information of the

    rendered page is arguably more generic than analyzing document structure focusing on

    HTML tags (Kovacevic, Diligenti, Gori, and Milutinovic 2004). The reason is that

    different tagging may have the same rendering effect. In other words, sometimes one

    can change the tags without affecting the visual representation. Based on the

    assumption that most web pages are built for human eyes, it makes more sense to use

visual information rather than intrinsic tags. [7]

    1.3.2 Using features of neighbors

a) Motivation - Although web pages contain useful features as discussed above, in a particular web page these features are sometimes missing, misleading, or

unrecognizable for various reasons. For example, a web page may contain large images or flash objects but little textual content.

    In such cases, it is difficult for classifiers to make reasonable judgments based on

    features on the page. In order to address this problem, features can be extracted from

    neighboring pages that are related in some way to the page to be classified to supply

    supplementary information for categorization. There are a variety of ways to derive such

    connections among pages. One obvious connection is the hyperlink. Since most

existing work that utilizes features of neighbors is based on hyperlink connections, in the following we focus on hyperlink connections. However, other types of connections can also be derived, and some of them have been shown to be useful for web page classification. [7]

c) Neighbor selection - Another question when using features from neighbors is that of which neighbors to examine. Existing research mainly focuses on pages within two

    steps of the page to be classified. At a distance no greater than two, there are six types

    of neighboring pages according to their hyperlink relationship with the page in question:

    parent, child, sibling, spouse, grandparent and grandchild, as illustrated in Figure 4. The

    effect and contribution of the first four types of neighbors have been studied in existing


    research. Although grandparent pages and grandchild pages have also been used, their

individual contributions have not yet been specifically studied. [7]

d) Features of neighbors - The features that have been used from neighbors comprise labels, partial content (anchor text, the surrounding text of anchor text, titles, and headers), and full content. [7]

1.4 DISCUSSION: DATA MINING IN WEB PAGE CLASSIFICATION

    Data mining turns a large collection of data into knowledge.

A search engine (e.g., Google) receives hundreds of millions of queries every day. Each query can be viewed as a transaction in which a user describes her or his information need. What novel and useful knowledge can a search engine learn from such a huge collection of queries collected from users over time? Interestingly, some patterns found in user search queries can disclose invaluable knowledge that cannot be obtained by reading individual data items alone. For example, Google Flu Trends uses specific search terms as indicators of flu activity. It found a close association between the number of people who search for flu-related information and the number of people who actually have flu symptoms. A pattern emerges when all of the search queries related to flu are aggregated. Using aggregated Google search data, Flu Trends can estimate flu activity up to two weeks faster than traditional systems can. This example shows how data mining can turn a large collection of data into knowledge that can help meet a current global challenge. [9]

    1.5 DISCUSSION: FEATURE SELECTION IN DATA MINING

Feature selection is a must for any data mining product. That is because, when you build a data mining model, the dataset often contains more information than is needed to build the model. For example, a dataset may contain 1000 columns that describe the characteristics of customers, but perhaps only 100 of those columns are used to build a specific model. If you keep the unneeded


columns while building the model, more CPU and memory are required during the training process, and more storage space is required for the completed model.

    Even if resources are not an issue, you typically want to remove unneeded columns

    because they might degrade the quality of discovered patterns, for the following

    reasons:

    Some columns are noisy or redundant. This noise makes it more difficult to

    discover meaningful patterns from the data;

    Feature selection helps solve this problem, of having too much data that is of little

    value, or of having too little data that is of high value. [9]
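A minimal filter-style selector illustrates the idea of dropping uninformative columns: here, columns whose variance falls below a threshold are discarded. The dataset and threshold are invented; real tools such as WEKA's attribute selection use richer criteria (e.g. information gain).

```python
# Variance-threshold feature filter on an invented numeric dataset.
def variance(col):
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col) / len(col)

rows = [
    [1.0, 0.0, 5.0],
    [2.0, 0.0, 5.1],
    [3.0, 0.0, 4.9],
]
cols = list(zip(*rows))
# Keep only columns whose values actually vary enough to be informative.
keep = [i for i, c in enumerate(cols) if variance(c) > 0.01]
print(keep)  # column 1 is constant and column 2 nearly so: [0]
```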

    1.6 RELATED WORK

Aijun An, Yanhui Huang, Xiangji Huang and Nick Cercone (2005) - presented a feature reduction method based on rough set theory and investigated the effectiveness of the rough set feature selection method on web page classification. [10]

Ali Selamat, Sigeru Omatu (2003) - proposed a news web page classification method (WPCM). The WPCM uses a neural network with inputs obtained from both the principal components and class profile-based features. Each news web page is represented by the term-weighting scheme. As the number of unique words in the collection set is large, principal component analysis (PCA) has been used to select the most relevant features for the classification. Then the final output of the PCA is combined with the feature vectors from the class profile, which contains the most regular words in each class. The experimental evaluation demonstrates that the WPCM method provides acceptable classification accuracy with the sports news datasets. [11]


Oh-Woog Kwon, Jong-Hyeok Lee (2006) - proposed a Web page classifier based on an adaptation of the k-Nearest Neighbor (k-NN) approach. To improve the performance of the k-NN approach, they supplemented it with a feature selection method and a term-weighting scheme using markup tags, and reformed the document-document similarity measure used in the vector space model. [13]
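The document-document similarity at the heart of such a k-NN classifier is typically the cosine of the angle between term-weight vectors. The sketch below uses invented three-term vectors and a 1-NN decision; it is not Kwon and Lee's actual scheme.

```python
import math

def cosine(u, v):
    # Cosine similarity between two term-weight vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc = [1.0, 2.0, 0.0]                                    # page to classify
labeled = {"course": [1.0, 2.0, 0.1], "news": [0.0, 0.0, 3.0]}
# 1-NN decision: take the label of the most similar training document.
best = max(labeled, key=lambda k: cosine(doc, labeled[k]))
print(best)  # course
```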

Wakaki, T., Itakura, H., Tamura, M. (2005) - investigated how rough set theory can help select applicable features for Web-page classification. Their experimental results show that the combination of the rough set-aided feature selection method and the Support Vector Machine with a linear kernel is quite beneficial in practice for classifying Web pages into many categories, since not only does the performance give acceptable accuracy, but high dimensionality reduction is also achieved without depending on arbitrary thresholds for feature selection. [14]

    Soumen Chakrabarti, Byron Dom, Rakesh Agrawal and Prabhakar Raghavan

(2007) - describe an automatic system that starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents as the corpus grows, assigning topics to these new documents with high speed and accuracy. [12]


CHAPTER 2 METHODOLOGY

    2.1 PROBLEM DESCRIPTION

    Web page classification, also known as web page categorization, is the process of

    assigning a web page to one or more predefined category labels. Classification is often

    posed as a supervised learning problem (Mitchell 1997) in which a set of labeled data is

    used to train a classifier which can be applied to label future examples. [6]

    This project aims at discovering the most efficient feature selection methods that can

    be used for web page classification. We will study in depth about data mining, feature

    selection and their applications in web page classification. We will also learn how to use

    the WEKA tool in testing the performance of various feature selection methods through

    classification tools such as decision tree. [3] We will then collect appropriate datasets

    for testing these feature selection methods and use these data sets in finding the

accuracy of various feature selection methods. Finally, we will work towards the development of algorithms for optimum classification results using the WEKA tool. [5]

Based on the number of classes that can be assigned to an instance, classification can be divided into single-label classification and multi-label classification. In single-label classification, one and only one class label is to be assigned to each instance, while in multi-label classification, more than one class can be assigned to an instance. If a problem is multi-class, say four-class classification, it means four classes are involved, say Arts, Business, Computers, and Sports. It can be either single-label, where exactly one class label can be assigned to an instance, or multi-label, where an instance can belong to any one, two, or all of the classes. Based on the type of class assignment, classification can be divided into hard classification and soft classification. In hard classification, an instance can either be or not be in a particular class, without an intermediate state; while in


soft classification, an instance can be predicted to be in some class with some probability (often a probability distribution across all classes).

    Based on the organization of categories, web page classification can also be divided

    into flat classification and hierarchical classification. In flat classification, categories are

considered parallel, i.e., one category does not supersede another, while in hierarchical

    classification, the categories are organized in a hierarchical tree-like structure, in which

    each category may have a number of subcategories.

2.2 DETAILS OF TOOLS/SOFTWARE USED (WEKA)

WEKA

WEKA is a Java package developed at the University of Waikato in New Zealand. Weka stands for the Waikato Environment for Knowledge Analysis.

Weka is a collection of machine learning algorithms for solving real-world data mining problems. The algorithms can either be applied directly to a dataset or called from your own Java code.

    Weka contains tools for data pre-processing, classification, regression, clustering,

    association rules, and visualization. It is also well suited for developing new machine

    learning schemes. Weka is open source software issued under the GNU General

    Public License. [6]

The Weka workbench contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. [6]


Advantages of Weka include:

- Free availability under the GNU General Public License
- Portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform
- A comprehensive collection of data preprocessing and modeling techniques
- Ease of use due to its graphical user interfaces

2.3 DATA EXTRACTION (WEBKB)

The datasets used for testing the feature selection methods have been taken from WebKB. WebKB refers to the knowledge base (KB) servers WebKB-1 and WebKB-2. KB servers are not Web search engines; they are online knowledge-based systems. KBMSs are database management systems that, unlike relational DBMSs, object-oriented DBMSs and deductive DBMSs, permit end-users to dynamically modify a large number of conceptual definitions in their KBs and hence do not limit end-users to predefined kinds of data [7].

Specifically, we will use the 4 Universities Dataset. This data set contains WWW pages collected from the computer science departments of various universities in January 1998 by WebKB. For each class the data set contains pages from the four universities. The 8,282 pages were manually classified into the following categories:

Category      No. of Pages
Student       1641
Faculty       1124
Staff         137
Department    182
Course        930
Project       504
Other         3764

University    Pages
Cornell       867
Washington    1205
Texas         827
Wisconsin     1263


CHAPTER 3: PROPOSED WORK AND IMPORTANT CONCEPTS

    3.1 ALGORITHM USED FOR WEB PAGE CLASSIFIER

The aim of this project is to find the best grouping of feature selection methods for web page categorization; it also covers the main topics in feature selection. The procedure covers three stages: (a) the extraction of illustrative features to describe the content, forming the initial set; (b) the selection of the best features from the initial set by applying a further feature selection method (minimizing the number of features while maximizing the discriminative information carried by them); and (c) training and classification using the resulting features in the different classifiers, to determine the quality of the features. [14]

3.1.1. PREPROCESSING

- Remove stop words, common words, punctuation symbols and HTML tags
- Perform stemming

1) Stop Words - In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). The list is controlled by human input and is not automated. There is no single definite list of stop words which all tools use, if they use one at all; some tools specifically avoid removing them in order to support phrase search. [15]

2) Stemming - In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since 1968. [15]
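A minimal Python sketch may help make these two preprocessing steps concrete. The stop-word list and suffix rules below are simplified, illustrative stand-ins; a real pipeline would use a full stop-word list and a Porter-style stemmer:

```python
import re

# Illustrative mini stop-word list; real pipelines use much larger lists.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in"}

def stem(word):
    """Very crude suffix-stripping stemmer; a Porter-style stemmer
    would be used in practice."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    """Strip HTML tags, punctuation and digits, drop stop words, then stem."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # remove punctuation and digits
    tokens = [t.lower() for t in text.split()]
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>The students are learning in classes.</p>"))
# -> ['student', 'are', 'learn', 'class']
```

The order of the steps matters: tag and punctuation removal happens before tokenization, and stop-word filtering before stemming, so that the stop-word list can match the original word forms.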


3.1.2. FEATURE SET EXTRACTION, F1

- Find the frequency of occurrence of independent words
- Find the term frequency-inverse document frequency of the documents

Term Frequency-Inverse Document Frequency

The term frequency in a given document is the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer web pages (which may have a higher term frequency regardless of the actual importance of that term in the web page), giving a measure of the importance of the term t_i within the particular web page d_j: [16]

    tf_{i,j} = n_{i,j} / \sum_k n_{k,j}

where n_{i,j} is the number of occurrences of the considered term in web page d_j, and the denominator is the number of occurrences of all terms in web page d_j.

The inverse document frequency is a measure of the general importance of the term, obtained by dividing the number of all documents by the number of web pages containing the term, and then taking the logarithm of that quotient: [16]

    idf_i = \log ( |D| / |\{ d_j : t_i \in d_j \}| )

where |D| is the total number of documents in the corpus and |\{ d_j : t_i \in d_j \}| is the number of web pages where the term t_i appears (that is, where n_{i,j} \neq 0). Then

    tfidf_{i,j} = tf_{i,j} \times idf_i


A high weight in tf-idf is reached by a high term frequency (in the given web page) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. [16]
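The formulas above translate directly into a short Python sketch. The three-document corpus is a made-up toy example, and the natural logarithm is used here; any logarithm base works as long as it is applied consistently:

```python
import math

def tf(term, doc):
    """Normalized term frequency: n_ij divided by the total number
    of term occurrences in the web page."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(|D| / number of web pages containing the term);
    assumes the term occurs somewhere in the corpus."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    ["web", "page", "classification"],
    ["web", "mining"],
    ["feature", "selection", "classification"],
]
# "web" occurs in 2 of the 3 documents, so idf("web") = log(3/2)
print(tf_idf("web", corpus[0], corpus))
```

A term appearing in every document gets idf = log(1) = 0 and is filtered out entirely, which is exactly the common-term suppression described above.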

3.1.3. FEATURE SELECTION, F2

Select features from F1 using the CfsSubsetEval method.

    The CfsSubsetEval Algorithm

A feature is useful if it is correlated with or predictive of the class; otherwise it is irrelevant. A feature Vi is said to be relevant iff there exist some vi and c for which p(Vi = vi) > 0 such that p(C = c | Vi = vi) ≠ p(C = c).

Empirical evidence from the feature selection literature shows that, along with irrelevant features, redundant information should be eliminated as well. The above definitions of relevance and redundancy lead to the following hypothesis, on which the feature selection method presented in this work is based: a good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other. [17]

    When we develop a composite, which we intend to use as a basis for predicting an

    outside variable, it is likely that the components we select to form the composite will

have relatively low inter-correlations. When we seek to predict some variable from

    several other variables, we try to select predictor variables which measure different

    aspects of the outside variable.


The CfsSubsetEval algorithm evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them. Hence it is a critically important part of our approach. [17]
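This trade-off can be made concrete with the subset-merit formula from Hall's correlation-based feature selection work [8], on which CfsSubsetEval is based. The correlation values below are illustrative inputs; computing the correlations themselves is omitted:

```python
import math

def cfs_merit(feature_class_corrs, feature_feature_corrs):
    """Merit of a feature subset as in Hall's CFS:
    k * mean(feature-class corr) / sqrt(k + k(k-1) * mean(feature-feature corr)).
    High class correlation raises the score; redundancy lowers it."""
    k = len(feature_class_corrs)
    r_cf = sum(feature_class_corrs) / k
    r_ff = (sum(feature_feature_corrs) / len(feature_feature_corrs)
            if feature_feature_corrs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Three features well correlated with the class, barely with each other:
good = cfs_merit([0.8, 0.7, 0.75], [0.10, 0.05, 0.10])
# Same class correlations, but the features are highly redundant:
bad = cfs_merit([0.8, 0.7, 0.75], [0.90, 0.85, 0.90])
print(good > bad)  # -> True
```

The numerator rewards features that are predictive of the class, while the denominator grows with feature-feature redundancy, so the less redundant subset scores higher, exactly the hypothesis stated above.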

3.1.4. TRAINING THE C4.5 DECISION TREE CLASSIFIER WITH F2

C4.5 Decision Tree Classifier

Determining the relative importance of a feature is one of the basic tasks during decision tree generation. C4.5 (Quinlan, 1993) uses a univariate feature selection strategy. At each level of the tree building process only one attribute, the attribute with the highest value for the selection criterion, is picked out of the set of all attributes. Afterwards the sample set is split into subsample sets according to the values of this attribute, and the whole procedure is recursively repeated until only samples from one class are in the remaining sample set, or until the remaining sample set has no discrimination power anymore and the tree building process stops. [19]

    As we can see feature selection is only done at the root node over the entire decision

    space. After this level, the sample set is split into sub-samples and only the most

    important feature in the remaining sub-sample set is selected. Geometrically it means,

    the search for good features is only done in orthogonal decision subspaces, which

    might not represent the real distributions, beginning after the root node. Thus, unlike

    statistical feature search strategies (Fukunaga, 1990) this approach is not driven by the

evaluation measure for the combinatorial feature subset; it is driven only by the best single feature. This might not lead to an optimal feature subset in terms of classification accuracy. [19]

    C4.5 builds decision trees from a set of training data in the same way as ID3, using the

    concept of information entropy. The training data is a set S = s1,s2,... of already

classified samples. Each sample si = x1, x2, ... is a vector where x1, x2, ... represent


attributes or features of the sample. The training data is augmented with a vector C = c1, c2, ... where c1, c2, ... represent the class to which each sample belongs.

    At each node of the tree, C4.5 chooses one attribute of the data that most effectively

    splits its set of samples into subsets enriched in one class or the other. Its criterion is

    the normalized information gain (difference in entropy) that results from choosing an

    attribute for splitting the data. The attribute with the highest normalized information gain

is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists. [13]
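The split criterion just described can be sketched as follows. Note that this computes plain information gain; C4.5 itself normalizes it into a gain ratio:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy reduction obtained by splitting the samples on a feature."""
    n = len(labels)
    split = {}
    for value, label in zip(feature_values, labels):
        split.setdefault(value, []).append(label)
    remainder = sum(len(sub) / n * entropy(sub) for sub in split.values())
    return entropy(labels) - remainder

labels = ["course", "course", "faculty", "faculty"]
feature = ["has_syllabus", "has_syllabus", "no_syllabus", "no_syllabus"]
print(information_gain(labels, feature))  # a perfect split recovers all 1.0 bits
```

The attribute names here are invented for illustration; at each node C4.5 evaluates this quantity for every remaining attribute and splits on the winner.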

3.1.5. The features in the pruned decision tree form the final set of features F3.

3.1.6. Evaluate the performance of the machine learning classifiers using the final set of features F3.

    3.2 IMPLEMENTED CLASSIFIERS

3.2.1 NAÏVE BAYES CLASSIFIER

In some cases it is even seen that Naïve Bayes outperforms more complex algorithms. It makes use of the variables contained in the data sample by observing them individually, independent of each other. [12]

The Naïve Bayes classifier is based on the Bayes rule of conditional probability. It makes use of all the attributes contained in the data, and analyses them individually as though they are equally important and independent of each other. For example,

    consider that the training data consists of various animals (say elephants, monkeys and


    giraffes), and our classifier has to classify any new instance that it encounters. We know

    that elephants have attributes like they have a trunk, huge tusks, a short tail, are

    extremely big, etc. Monkeys are short in size, jump around a lot, and can climb trees;

    whereas giraffes are tall, have a long neck and short ears.[12]

It is seen that the Naïve Bayes classifier performs almost at par with the other classifiers in most of the cases. Of the 26 different experiments carried out on various datasets, the Naïve Bayes classifier shows a drop in performance in only 3-4 cases, when compared with J48 and Support Vector Machines. This supports the widely held belief that, though simple in concept, the Naïve Bayes classifier works well in most data classification problems. [16]
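To make the idea concrete, here is a minimal categorical Naïve Bayes sketch with Laplace smoothing, reusing the animal example above. It illustrates the principle only and is not WEKA's NaiveBayes implementation; the attribute values are invented:

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal categorical Naïve Bayes with Laplace smoothing;
    an illustration of the principle, not WEKA's implementation."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.n = len(labels)
        self.value_counts = defaultdict(Counter)  # (class, attr index) -> counts
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(label, i)][value] += 1
        return self

    def predict(self, row):
        def score(c):
            # prior times the product of per-attribute likelihoods
            p = self.class_counts[c] / self.n
            for i, value in enumerate(row):
                p *= (self.value_counts[(c, i)][value] + 1) / (self.class_counts[c] + 2)
            return p
        return max(self.class_counts, key=score)

rows = [["big", "trunk"], ["big", "trunk"], ["small", "climbs"], ["tall", "long-neck"]]
labels = ["elephant", "elephant", "monkey", "giraffe"]
model = NaiveBayes().fit(rows, labels)
print(model.predict(["big", "trunk"]))  # -> elephant
```

Each attribute contributes an independent factor to the class score, which is exactly the independence assumption described above.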

    3.2.2. K*

K* is an instance-based classifier; that is, the class of a test instance is based upon the

    class of those training instances similar to it, as determined by some similarity function.

    It differs from other instance-based learners in that it uses an entropy-based distance

    function. [17]

Instance-based learners classify an instance by comparing it to a database of pre-classified examples. The fundamental assumption is that similar instances will have similar classifications. The question lies in how to define "similar instance" and "similar classification". The corresponding components of an instance-based learner are the distance function, which determines how similar two instances are, and the classification function, which specifies how instance similarities yield a final classification for the new instance.

The approach we take here to computing the distance between two instances is motivated by information theory. The intuition is that the distance between instances can be defined as the complexity of transforming one instance into another. The calculation of the complexity is done in two steps. First, a finite set


of transformations which map instances to instances is defined. A program to transform one instance (a) to another (b) is a finite sequence of transformations starting at a and terminating at b. [17]
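K*'s full entropy-based transformation distance is too involved for a short sketch, but the instance-based scheme it builds on can be illustrated with a plain nearest-neighbour classifier. The attribute-overlap distance below is a simple hypothetical stand-in for the entropic distance:

```python
def overlap_distance(a, b):
    """Number of attribute positions where two instances differ;
    a simple stand-in for K*'s entropy-based transformation distance."""
    return sum(x != y for x, y in zip(a, b))

def classify(instance, training):
    """Assign the class of the most similar stored example."""
    nearest = min(training, key=lambda example: overlap_distance(instance, example[0]))
    return nearest[1]

training = [
    (("big", "trunk", "short-tail"), "elephant"),
    (("small", "climbs", "long-tail"), "monkey"),
    (("tall", "long-neck", "short-ears"), "giraffe"),
]
print(classify(("big", "trunk", "long-tail"), training))  # -> elephant
```

Swapping in a different distance function changes the learner's notion of "similar instance" without touching the classification function, which is the separation of components described above.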

3.2.3 J48

A decision tree is a predictive machine-learning model that decides the target value (dependent variable) of a new sample based on various attribute values of the available data. The internal nodes of a decision tree denote the different attributes, the branches between the nodes tell us the possible values that these attributes can have in the observed samples, while the terminal nodes tell us the final value (classification) of the dependent variable. [19]

    The attribute that is to be predicted is known as the dependent variable, since its value

    depends upon, or is decided by, the values of all the other attributes. The other

    attributes, which help in predicting the value of the dependent variable, are known as

    the independent variables in the dataset.[19]

The J48 decision tree classifier follows a simple algorithm. In order to classify a new item, it first needs to create a decision tree based on the attribute values of the available training data. So, whenever it encounters a set of items (a training set) it identifies the attribute that discriminates the various instances most clearly. This feature, the one that tells us the most about the data instances so that we can classify them best, is said to have the highest information gain. Now, among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category have the same value for the target variable, then we terminate that branch and assign to it the target value that we have obtained.

    For the other cases, we then look for another attribute that gives us the highest


    information gain.[19] Hence we continue in this manner until we either get a clear

    decision of what combination of attributes gives us a particular target value, or we run

    out of attributes. In the event that we run out of attributes, or if we cannot get an

    unambiguous result from the available information, we assign this branch a target value

    that the majority of the items under this branch possess.


CHAPTER 4: RESULTS & DISCUSSION

4.1 Initial Feature Selection, F1

The web pages are initially preprocessed as follows. The HTML tags, stop words, punctuation and digits are removed. Words are reduced to their root form. To diminish the weight of frequently occurring words and increase the weight of rare words in a web page, the term frequency-inverse document frequency of each word in a web page is computed. This forms the initial set of features F1, as listed in Table 1.

Table 1: Initial set of features F1

S. No.   Sample size   No. of Instances   No. of features
1.       p70-n30       10                 694
2.       p350-n150     100                2759

4.2 Feature Selection F2

CfsSubsetEval is run on F1 to further select the features with more information gain on the class and less inter-correlation. This forms F2, as shown in Table 2.

4.3 Feature Selection F3

A C4.5 decision tree classifier is built using F2. The features that C4.5 uses in its pruned tree alone are selected as the final features F3, as shown in Table 3.


Table 2: Features selected, F2

S. No.   Sample size   No. of Instances   No. of features
1.       p70-n30       10                 6
2.       p350-n150     100                25

Table 3: Features selected, F3

S. No.   Sample size   No. of features
1.       p70-n30       2
2.       p350-n150     5

4.4 CLASSIFICATION

The performance of the various machine learning classifiers is evaluated on the different feature sets F1, F2 and F3. The time taken to model the classifier in each case is also observed, as illustrated in the following tables.

Table 4: Classification on the feature set F1

S. No.   Sample size   NB     K*     J48
1.       p70-n30       70%    90%    40%
2.       p350-n150     91%    50%    95%


Table 5: Classification on the feature set F2

S. No.   Sample size   NB     K*     J48
1.       p70-n30       90%    100%   50%
2.       p350-n150     98%    96%    95%

It can be observed that the classification accuracy has increased with all classifiers on the reduced feature set F2. The time taken to model the classifiers is also reduced. K* has exhibited more accuracy over F2 than F1.

    Figure 1: Decision Tree on file 1


    Figure 2: Decision Tree on file 2

Table 6: Classification on the feature set F3

S. No.   Sample size   NB     K*     J48
1.       p70-n30       97%    100%   85%
2.       p350-n150     99%    100%   95%

It can be inferred from Table 6 that the time taken to build the classifiers with F3 has significantly reduced, with no compromise in accuracy. Therefore, for better accuracy with reduced resource utilization, selecting the best and optimum features is most important. This is one way of improving the performance of the classifiers.


CHAPTER 5: CONCLUSION

After reviewing web classification research with respect to its features and algorithms, we conclude by summarizing the lessons learned from existing research and pointing out future opportunities in web classification.

Web page classification is a type of supervised learning problem that aims to categorize web pages into a set of predefined categories based on labeled training data. Classification tasks include assigning documents on the basis of subject, function, sentiment, genre, and more.

    Unlike more general text classification, web page classification methods can take

    advantage of the semi-structured content and connections to other pages within the

    Web. We have surveyed the space of published approaches to web page classification

    from various viewpoints, and summarized their findings and contributions, with a special

    emphasis on the utilization and benefits of web-specific features and methods. We

    found that while the appropriate use of textual and visual features that reside directly on

    the page can improve classification performance, features from neighboring pages

    provide significant supplementary information to the page being classified. Feature

    selection and the combination of multiple techniques can bring further improvement. We

    expect that future web classification efforts will certainly combine content and link

    information in some form.

    Finally, we wish to explicitly note the connection between machine learning and

    information retrieval (especially ranking). This idea is not new, and has underpinned

    many of the ideas in the work presented here. A learning to rank community (Joachims

    2002; Burges, Shaked, Renshaw, Lazier, Deeds, Hamilton, and Hullender 2005;

    Radlinski and Joachims 2005; Roussinov and Fan 2005; Agarwal 2006; Cao, Xu, Liu,

    Li, Huang, and Hon 2006; Richardson, Prakash, and Brill 2006) is making advances for

    both static and query specific ranking. There may be unexplored opportunities for using


    retrieval techniques for classification as well. In general, many of the features and

    approaches for web page classification have counterparts in analysis for web page

    retrieval. Future advances in web page classification should be able to inform retrieval

    and vice versa.


CHAPTER 6: LITERATURE SURVEY / RESOURCES

    BOOKS

[1] Discovering Knowledge in Data: An Introduction to Data Mining (Daniel T. Larose)

[2] The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition) by Trevor Hastie, Robert Tibshirani and Jerome Friedman (2009)

[3] Data Mining Techniques for Web Page Classification (Gabriel Fiol-Roig, Margaret Miró-Julià, Eduardo Herraiz)

    [4] Qi X and Davison B.D. (2009) Web Page Classification: Features and Algorithms.

    ACM Computing Surveys

    [5] Feature Selection with Rough Sets for Web Page Classification Lecture Notes in

    Computer Science, 2005, Volume 3135/2005, 280-305,

    [6] Web page feature selection and classification using neural networks - Computer and

    System Sciences, Graduate School of Engineering, Osaka Prefecture University

    [7] Web page classification based on k-nearest neighbor approach Dept. of Computer

    Science and Engineering, Pohang University of Science and Technology, San 31 Hyoja

    Dong, Pohang, 790-784, Korea

    [8] M. A. Hall (1998). Correlation based Feature Subset Selection for Machine Learning.

    Hamilton, New Zealand.

    JOURNAL RESEARCH PAPERS

    [9] Feature Selection for Web Page Classification (Daniele Riboni D.S.I., Universita

    degli Studi di Milano, Italy)

[10] WEKA: Experiences with a Java Open-Source Project (Department of Computer Science, University of Waikato, Hamilton, New Zealand)

[11] Miró-Julià M., Fiol-Roig G. and Vaquer-Ferrer D. (2009) Classification using Intelligent Approaches: an Example in Social Assistance. Frontiers in Artificial Intelligence and Applications

[12] Arul Prakash Asirvatham, Kranthi Kumar Ravi (2001), Web Page Classification based on Document Structure. Awarded Second Prize in the National Level Student Paper Contest conducted by IEEE India Council.


[13] Xiaogang Peng, Ben Choi (2002), Automatic Web Page Classification in a Dynamic and Hierarchical Way. In Proceedings of the Second IEEE International Conference on Data Mining, Washington DC, IEEE Computer Society, pp. 386-393

[14] Yicen Liu, Mingrong Liu, Liang Xiang and Qing Yang (2008), Entity-Based Classification of Web Page in Search Engine. ICADL, LNCS, Vol. 5362, pp. 411-412.

[15] S.B. Kotsiantis, Supervised Machine Learning: A Review of Classification Techniques. Informatica 31 (2007), pp. 249-268.

[16] Harry Zhang, "The Optimality of Naive Bayes". FLAIRS 2004 conference.

    WEBSITES

    [17] Effectiveness of Web Page Classification on Finding List Answers

    (www.comp.nus.edu.sg)

    [18] Data Mining Techniques for Web Page Classification

    (http://eduherraiz.com/papers/)

[19] Term Frequency-Inverse Document Frequency (http://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html)

[20] Classification via Decision Trees in WEKA (http://maya.cs.depaul.edu/classes/ect584/weka/classify.html)


LIST OF FIGURES AND TABLES

TABLES
1) Table 1: Initial set of features F1
2) Table 2: Features selected, F2
3) Table 3: Features selected, F3
4) Table 4: Classification on the feature set F1
5) Table 5: Classification on the feature set F2
6) Table 6: Classification on the feature set F3

FIGURES
1) Figure 1: Decision Tree on file 1
2) Figure 2: Decision Tree on file 2