
    Fuzzy association rule mining for web usage visualization

Suash Deb 1, Simon Fong 2,*, Cecilia Ho 2

1 Department of Computer Science and Engineering, Cambridge Institute of Technology, Ranchi, India

2 Department of Computer Science, University of Macau, Taipa, Macau SAR

    Abstract

An important task in Web business management is monitoring the growth of the website via visual inspection and alerting on any anomaly. Web mining is a popular research area for knowledge discovery on Websites and Web operations. In particular, association rule mining (ARM) has been studied and applied for finding Web pages or Web links that are frequently accessed together in a session. However, most of the previous works in the literature used ARM for studying the browsing patterns of Web visitors/customers on the Website, so that the Website could be fine-tuned or personalized according to their Web surfing preferences. In this paper, we embark on a slightly different perspective, from the views and requirements of a Website monitor, which aims at visualizing the dynamic activities (also known as Web usages) on the Website, so that the relations between Web pages, in terms of being clicked in a sequence of visits, can be visualized. Fuzzy ARM is applied here because the contextual relations between Web pages are supposed not to be strictly defined but fuzzy in nature. An experiment is conducted to verify the efficacy of our proposed model, with superior results when compared to using the ARM algorithm alone.

© 2013 Elsevier Science. All rights reserved.

    Keywords: Web usage mining; Fuzzy association rules; Web usage visualization

    1. Introduction

Website monitoring is crucial to an online business or e-government as it offers insight on the progress of the business which runs upon the Website. Many Web diagnosis software programs are readily available in commercial markets; they usually output the insights and knowledge of the Web operation in the form of tabulated statistics or, at best, bar charts, such as the most number of hits, visitor counts, and busiest hours of the day/week/year, etc. Earlier, a Website performance monitoring system called WebVS (Web visualization system) was proposed by the authors [1]. WebVS checks and visualizes both the static website structure and the dynamic usage data. When combined, the static view and the dynamic view represent the health of the growth of a website with respect to the addition of Web contents/pages and the actual popularity of the contents on the Website. With the aid of visualization, the Web structure is rendered as a radial tree and the analytic results are overlaid on it by visual cues. Web administrators and analysts can select different

data attributes and thresholds to visualize. The static and dynamic views of the Web graphs rendered by WebVS give an idea of how a Website is doing. Such a presentation of the portal status makes it easier to understand and to locate anomalies or interesting phenomena than lists of statistics and numbers in a report. While radial tree Web graphs are useful for illustrating the full view of a website and the information pertaining to different parts of the website, they fall short in visualizing the association between the parts of the website being visited. WebVS also generates association rules by applying the Fuzzy Apriori-T association rules algorithm and visualizes the rules in a relation graph. Visualization of such associations is implemented here along with radial tree Web graphs, because the dynamic usages of the website, represented by Web visits, complement the growth of website structures and contents as a holistic approach.

For discovering the relations between Web pages or Web links, association rule mining (ARM) has been studied widely in the Web mining research community. However, most of the previous works in the literature used ARM for studying the browsing patterns of Web visitors/customers on the Website, so that


the Website could be fine-tuned or personalized according to their Web surfing preferences. In this paper, we focus on a slightly different perspective, from the views and requirements of WebVS, which aims at visualizing the dynamic activities (also known as usages) on the Website. Therefore, the relations between the Web pages, in terms of being clicked in a sequence of visits, can be visualized. Fuzzy ARM is applied here instead of the original ARM because the contextual relations between Web pages are known to be defined differently by different people; for example, a session of Web browsing may be long in one culture but not so in another. The term evening is loosely defined as the period of time after sunset, which of course differs geographically from city to city. The measures used in such rule associations are hence fuzzy in nature. The main contribution of this paper is a fuzzified ARM model which could be used as an important element in WebVS or a similar Web visualization package.

The paper is structured as follows. A brief review of the related technology, such as Website performance visualization and ARM algorithms, is given in Section 2. Section 3 depicts the theoretical model of Fuzzy ARM, or simply FARM. An experiment is conducted in Section 4 and the efficacy of our proposed FARM model is discussed as well. Section 5 shows the visualized results. Section 6 concludes this paper.

2. Reviews of Web Mining and Visualization Systems

The authors in [3] introduced different 2D and 3D visualization diagrams of particular interest, classifying Web pages into two classes, hot (with many hits) and cold (with few hits), and illustrating the behavior of users. Their framework enables flexible selection of mappings between data attributes and visualization dimensions for different diagrams. Selected existing academic and commercial Web analysis and visualization systems are briefly reviewed in this section.

Table I. Selected academic Web analysis and visualization systems proposed in the past

Web Mining aspects compared: Content Mining; Structure Mining; Usage Mining; Clustering.
Visualization aspects compared: Structure; Static Information; Dynamic Usage; Usage Relation; Personalization; Growth Monitoring; Inter-site Comparison.

Systems compared (by authors): Smith and Ng [4]; Song and Shepperd [5]; Chen et al. [6]; Munzner [7, 8]; Chi et al. [9]; Chi et al. [10]; Eick [11]; Liu et al. [12]; Liu et al. [13]; Niu et al. [14]; Reiss and Eddon [15]; Chen [16]; Pascual-Cid et al. [17, 18].

To help users search for information and organize information layout, Smith and Ng [4] suggested using a self-organizing map (SOM) to mine Web data and provided a visual tool, LOGSOM, to assist user navigation. LOGSOM organizes Web pages into a two-dimensional map based solely on the users' navigation behavior, rather than the content of the Web pages.


Song and Shepperd [5] view the topology of a Web site as a directed graph and mine Web browsing patterns for e-commerce. They use vector analysis and fuzzy set theory to cluster users and URLs. Their frequent access path identification algorithm is not based on sequence mining, which has a very important role in knowledge discovery in Web log data due to the ordered nature of click-streams.

Chen et al. [6] describe a novel representation technique which makes use of the Web structure together with summarization techniques to better represent knowledge in actual Web documents. They named the proposed technique Semantic Virtual Document (SVD). The SVD can be used together with a suitable clustering algorithm to achieve an automatic content-based categorization of similar Web documents. This technique allows an automatic content-based classification of Web documents, and a tree-like graphical user interface for post-retrieval document browsing enhances the relevance judgment process for Internet users. They also introduce a cluster-biased automatic query expansion technique to interpret short queries accurately. They present a prototype, Intelligent Search and Review of Cluster Hierarchy (iSEARCH), for Web content mining.

The H3 hyperbolic site viewer was developed by Tamara Munzner while at Stanford University [7, 8]. Using a

sophisticated two-pass algorithm to organize pages in hyperbolic space, it lays pages out on a hemisphere using a non-Euclidean distance metric, ensuring there is exponential room to place nodes and enabling it to cope with large Web sites. The H3 viewer is also interactive; the sphere can be rotated, and rendering maintains a fixed target frame rate to keep interaction responsive.

Chi et al. introduced a system based on the visualization of website structure using the radial tree visual metaphor [9]. The edge thickness was used to map the amount of traffic occurring on a link, while colour was used to map the type of content of the target node. The same authors also presented Time Tube [10], which consists of a set of snapshots that represent the evolution of the website over time.

Eick [11] proposed a visualisation method to depict user behaviour based on three columns. The left column contains nodes that represent the most frequent referrer pages used to reach a desired page, located in the middle column. The destination pages after the focus node are placed in the right column. Hence, it is quite intuitive to identify the users' flow around a single node.

WebCompare [12] is a Web comparison system that uses information retrieval and data mining techniques to compare keywords in U pages and C pages in order to identify potentially interesting pages. Liu et al. [13] proposed VSComp, which combines clustering and visualization to highlight potentially interesting pages from two Web sites. The key idea of the approach is that Web pages from the two sites are combined first, and then clustered and displayed together. This naturally reveals the interesting pages, i.e., similar and different pages in the two sites. In terms of techniques, VSComp differs from WebCompare in that VSComp uses clustering and visualization, which are not used in WebCompare.

In the WebKIV system [14], a radial tree algorithm is used to construct the Web site structure in a 2D plane. It implemented the disktree representation to compare Web navigational patterns and defined a three-dimensional scale to describe the Web visualization task.

Reiss and Eddon [15] proposed Webviz, which gathers usage data from large numbers of users by monitoring the URLs of the Web pages they are currently browsing. It summarizes this information by category and then displays the results so that users can understand browsing patterns over time, spot trends, and identify any unusual patterns. The display consists of concentric circles, each representing a different time interval, with the outermost interval representing the most recent period. Within each interval Webviz displays the different categories of information. The saturation and brightness of a region and the frequency, width, and amplitude of the interior line encode the additional information.

The Web Knowledge Visualization and Discovery System (WEBKVDS) [16] is mainly composed of two parts: a) FootPath, for visualizing the Web structure with the different data and pattern layers; b) Web Graph Algebra, for manipulating and operating on the Web graph objects for visual data mining. The authors presented the idea of layering data and patterns in distinct layers on top of a disktree representation of the Web structure, allowing the display of information in context, which is more suited to the interpretation of discovered patterns. With the help of the Web graph algebra, the system provides a means for interactive visual Web mining.

The WebViz system, a tool to visualize both the structure and usage of Web sites, is proposed in [22]. The structure of a Web segment is rendered as a radial tree, and usage data is extracted and layered as polygonal graphs. By interactively creating and adjusting these layers, a user can develop real-time insight into the data. The system introduces the idea of interactive visual operators and the idea of a polygon graph as a visual cue. This technique extends the concept of the radial tree by generating polygons that arise from connecting parent nodes in the hierarchy with representative points on the edges, calculated according to the usage data of their children nodes. The polygonal graphs, however, are not straightforward and take time to interpret in order to discover useful information.


The Website Exploration Tool (WET) [17], the closest system we found from academia for Web graph visualization, uses the Graphs Logic System (GLS) to calculate representative subgraphs from the whole collected Web graph, reducing the quantity of data to be visualized and avoiding overlapped visualizations. GLS generates a GraphML file to be visualized as a radial tree and a treemap. The main goal of WET is to assist in the conversion of Web data into information by providing an already known context in which Web analysts may interpret the data. In their most recent paper, the authors describe the assessment process of two Virtual Learning Environments (VLE). An improved version of WET [18] provides a set of combined visual abstractions that can be visually customised as well as recomputed by changing the focus of interest. However, WET only focuses on visualizing the website data for usage evaluation.

    3. Fuzzy Association Rules Mining

Fuzzy association rule mining [31] is an improved version of ARM and has been applied in many areas, such as rainfall prediction [32], refining search queries in Web retrieval [33], and Website personalization together with a case-reasoning engine [34]. In this section, the basic concepts and the crisp boundary problem of traditional association rules are introduced, followed by a new fuzzy approach that addresses the crisp problem.

3.1. Definitions of Association Rules

To measure the reliability and accuracy of a rule, two values, support and confidence, were initially introduced and have been used extensively. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items (objects) and $T = \{t_1, t_2, \ldots, t_n\}$ a set of transactions with items in $I$, both assumed to be finite. $I$ contains all the possible items of a database; different combinations of those items are called itemsets.

Definition 1. An association rule is an expression of the form $X \Rightarrow Y$, where $X, Y \subseteq I$, $X, Y \neq \emptyset$, and $X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ means that every transaction of $T$ that contains $X$ contains $Y$ too. The usual measures to assess association rules are support and confidence, both based on the concept of support of an itemset. The support measures the reliability of a rule by the relative frequency of co-occurrence of the rule's items. The confidence measures the rule's accuracy as the quotient between the support of the rule and the relative frequency of the items belonging to the left part of the rule.

Definition 2. The support of an itemset $I_j \subseteq I$ with respect to a set of transactions $T$ is

$\mathrm{supp}(I_j, T) = \dfrac{|\{t \in T : I_j \subseteq t\}|}{|T|}$   (1)

indicating the probability that a transaction of $T$ contains $I_j$.

Definition 3. The support of the association rule $X \Rightarrow Y$ in $T$ is

$\mathrm{Supp}(X \Rightarrow Y, T) = \mathrm{supp}(X \cup Y, T)$   (2)

and its confidence is

$\mathrm{Conf}(X \Rightarrow Y, T) = \dfrac{\mathrm{Supp}(X \Rightarrow Y, T)}{\mathrm{supp}(X, T)} = \dfrac{\mathrm{supp}(X \cup Y, T)}{\mathrm{supp}(X, T)}$   (3)

It is usual to assume that $T$ is fixed for each problem, and thus it is customary to avoid any reference to it. Then, the above values are simply noted $\mathrm{supp}(I_j)$, $\mathrm{Supp}(X \Rightarrow Y)$ and $\mathrm{Conf}(X \Rightarrow Y)$, respectively. Support is the percentage of transactions where the rule holds. Confidence is the conditional probability of $Y$ with respect to $X$ or, in other words, the relative cardinality of $Y$ with respect to $X$. Association rule mining is the attempt to discover rules whose support and confidence are greater than two user-defined thresholds called minsupp and minconf, respectively. Such rules are called strong rules.

Most of the existing algorithms work in the following two steps (illustrated by the sketch after Step 2):

Step 1. Find the frequent itemsets. Considering transactions one by one, the algorithm updates the support of the itemsets each time a transaction is considered. This is the most expensive step from the computational point of view.

Step 2. Obtain the rules with support and confidence greater than the user-defined thresholds from the frequent itemsets obtained in the previous step. Specifically, if the itemsets $X$ and $X \cup Y$ are frequent, we can obtain the rule $X \Rightarrow Y$, since its support is equal to the support of the itemset $X \cup Y$ according to Definition 3.
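The two steps can be sketched in a few lines of Python; this is an illustrative naive enumeration, not the Apriori-T implementation used later, and the example transactions and item names are assumptions for demonstration:

from itertools import combinations

# Toy transactions: each is a set of items (e.g., page categories visited in one session).
transactions = [
    {"Home", "Missions", "History"},
    {"Home", "Missions"},
    {"Home", "Countdown"},
    {"Missions", "History"},
]

def supp(itemset, T):
    """Relative frequency of transactions containing the itemset, as in Eq. (1)."""
    return sum(1 for t in T if itemset <= t) / len(T)

def frequent_itemsets(T, minsupp):
    """Step 1: enumerate all itemsets whose support reaches minsupp (naive, not Apriori)."""
    items = sorted(set().union(*T))
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = supp(frozenset(combo), T)
            if s >= minsupp:
                frequent[frozenset(combo)] = s
    return frequent

def strong_rules(frequent, T, minconf):
    """Step 2: split each frequent itemset into X -> Y and keep the high-confidence rules."""
    rules = []
    for itemset, s in frequent.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for X in combinations(itemset, k):
                X = frozenset(X)
                conf = s / supp(X, T)  # Eq. (3): supp(X u Y) / supp(X)
                if conf >= minconf:
                    rules.append((set(X), set(itemset - X), s, conf))
    return rules

for X, Y, s, c in strong_rules(frequent_itemsets(transactions, 0.5), transactions, 0.6):
    print(X, "->", Y, "supp=%.2f conf=%.2f" % (s, c))

Real implementations such as Apriori-T prune the search by generating candidate k-itemsets only from frequent (k-1)-itemsets, rather than enumerating every combination as done here.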

    3.2 Crisp Boundary Problem: Motivation to Fuzzy Approach



Conventional association rule mining (ARM) algorithms usually deal with datasets with categorical values and expect any numerical values to be converted into categorical ones using ranges. In real life, data is neither only categorical nor only numerical but a combination of both, and the general method adopted is to convert numerical attributes into categorical attributes using ranges. The problem with dividing ranged values into sub-ranges in this way is that the boundaries between the sub-ranges are crisp. Fuzzy association rule mining (FARM) is intended to address this crisp boundary problem encountered in traditional ARM. The principal idea is that ranged values can belong to more than one sub-range; we say that the value has a membership degree, μ, that associates it with each of the available sub-ranges.

The best-known ARM algorithm, Apriori, is based on a simple but key observation about frequent itemsets: every subset of a frequent itemset must be a frequent itemset too. From this, the algorithm is designed to proceed iteratively, starting from frequent itemsets containing a single item. The Fuzzy Apriori-T algorithm we use in this research is a fuzzy version of the Apriori-T algorithm. Using a fuzzy membership function, it is possible to assign a membership degree to each of the elements in X. Elements of the set could, but are not required to, be numbers, as long as a degree of membership can be deduced from them. For the purpose of mining fuzzy association rules, numeric elements are used for quantitative data, but other categories might also exist where no numerical elements will be found. For example, we define three age categories, Young, Middle-aged and Old, and then ascertain the fuzzy membership (in the range [0, 1]) of each crisp numerical value in these categories. Thus, Age = 35 may have μ = 0.6 for the fuzzy partition Middle-aged, μ = 0.3 for Young and μ = 0.1 for Old [35]. By using fuzzy partitions, we preserve the information encapsulated in the numerical attribute, and are also able to convert it to a categorical attribute, albeit a fuzzy one. Therefore, many fuzzy sets can be defined on the domain of each quantitative attribute, and the original dataset is transformed into an extended one with attribute values having fuzzy memberships in the interval [0, 1]. Applying this to Web mining, since we aim to evaluate the reputation or popularity of the Web pages or e-services and different types of information on a given Website, we can predefine the content categories of the Web pages. Then, by the accessed URL in the Web log, we can identify what content the accessed page is about.

    3.3 Fuzzy Association Rules

As in classical association rules, $I = \{i_1, i_2, \ldots, i_m\}$ represents all the attributes (items) appearing in the transaction database $T = \{t_1, t_2, \ldots, t_n\}$. $I$ contains all the possible items of a database; different combinations of those items are called itemsets. Each item $i_k$ will be associated (to some degree) with several fuzzy sets. The degree of association is given by a membership degree in the range [0, 1].

Definition 5. A fuzzy transaction is a nonempty fuzzy subset $\tau$ of $I$. For every $i \in I$, we note $\tau(i)$ the membership degree of $i$ in a fuzzy transaction $\tau$. We note $\tau(I_j)$ the degree of inclusion of an itemset $I_j \subseteq I$ in a fuzzy transaction $\tau$, defined as $\tau(I_j) = \min_{i \in I_j} \tau(i)$.

Definition 6. Let $I$ be a set of items, $T$ an FT-set (a set of fuzzy transactions), and $X, Y \subseteq I$ two crisp subsets, with $X, Y \neq \emptyset$ and $X \cap Y = \emptyset$. A fuzzy association rule $X \Rightarrow Y$ holds in $T$ iff $\tau(X) \leq \tau(Y)$ for every $\tau \in T$, i.e., the inclusion degree of $Y$ is not less than that of $X$ for every fuzzy transaction in $T$. This definition preserves the meaning of association rules, because if we assume $X \subseteq \tau$ then $Y \subseteq \tau$, given that $\tau(X) \leq \tau(Y)$.

The support of the fuzzy association rule $X \Rightarrow Y$ in the FT-set $T$ is $\mathrm{supp}(X \cup Y)$. The confidence computation uses a scalar cardinality of fuzzy sets based on the weighted summation of the cardinalities of its $\alpha$-cuts. The confidence measure then looks as follows:

$\mathrm{conf}(X \Rightarrow Y) = \dfrac{|\Gamma_{X \cup Y}|}{|\Gamma_{X}|}, \qquad |\Gamma_Z| = \sum_{i=1}^{t} (\alpha_i - \alpha_{i-1})\,|(\Gamma_Z)_{\alpha_i}|$   (4)

where $0 = \alpha_0 < \alpha_1 < \cdots < \alpha_t$ are the membership degrees occurring in $T$, $\Gamma_Z$ is the fuzzy set of transactions including $Z$ (with $\Gamma_Z(\tau) = \tau(Z)$), and $(\Gamma_Z)_{\alpha_i}$ is its $\alpha_i$-cut. This method puts greater emphasis on elements with higher membership degrees, due to the fact that an element with membership $\alpha_k$ occurs in each summand $\alpha_i$, $i \in \{1, \ldots, k\}$.
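Under this weighting the scalar cardinality reduces to the sum of membership degrees (a sigma-count), which the short sketch below uses directly; the minimum is used for the inclusion degree of an itemset as in Definition 5, and the example fuzzy transactions are illustrative assumptions:

# Fuzzy transactions: item -> membership degree in [0, 1].
fuzzy_transactions = [
    {"Noon": 0.8, "Short": 0.6, "US": 1.0},
    {"Noon": 0.3, "Long": 0.7, "US": 1.0},
    {"Afternoon": 0.9, "Short": 0.4, "UK": 1.0},
]

def inclusion(itemset, tau):
    """Degree to which fuzzy transaction tau includes an itemset (min of memberships)."""
    return min(tau.get(i, 0.0) for i in itemset)

def fuzzy_supp(itemset, T):
    """Sigma-count support: average inclusion degree over all fuzzy transactions."""
    return sum(inclusion(itemset, tau) for tau in T) / len(T)

def fuzzy_conf(X, Y, T):
    """Confidence as the ratio of the scalar cardinalities of X u Y and X, cf. Eq. (4)."""
    num = sum(inclusion(X | Y, tau) for tau in T)
    den = sum(inclusion(X, tau) for tau in T)
    return num / den if den else 0.0

X, Y = {"Noon"}, {"US"}
print(fuzzy_supp(X | Y, fuzzy_transactions))  # approximately 0.37
print(fuzzy_conf(X, Y, fuzzy_transactions))   # 1.0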

    3.4 Our Approach

An item and a transaction are abstract concepts that may be seen as representing some kind of object and a subset of objects, respectively. Fuzzy data have uncertain values associated with fuzzy (linguistic) labels (such as "high" and "low") and a membership function, which normalizes the design parameter to the range between 0 and 1. In this case, a fuzzy transaction can contain more than one item corresponding to different labels


of the same attribute, because it is possible for a single value in the table to fit more than one label to a certain degree.

Similar to [37], let $Lab(X_j) = \{L^{X_j}_{1}, \ldots, L^{X_j}_{c_j}\}$ be a set of linguistic labels for attribute $X_j$. We shall use the labels to name the corresponding fuzzy sets, i.e. $L^{X_j}_{k} : Dom(X_j) \rightarrow [0, 1]$.

Let $L = \bigcup_{j \in \{1,\ldots,m\}} Lab(X_j)$. Then, the set of items with labels in $L$ associated to the relational scheme $RE$ is

$I_{RE} = \{\, \langle X_j, L^{X_j}_{k} \rangle \mid L^{X_j}_{k} \in Lab(X_j),\; j \in \{1,\ldots,m\},\; k \in \{1,\ldots,c_j\} \,\}$

Every instance $r$ of $RE$ is associated to an FT-set, denoted $T_r$, with items in $I_{RE}$. Each tuple $t \in r$ is associated to a single fuzzy transaction $\tau_t$, such that

$\tau_t(\langle X_j, L^{X_j}_{k} \rangle) = L^{X_j}_{k}(t[X_j])$

In this project, we aim to discover interesting and meaningful patterns in the browsing behaviors and preferences of users from different origins. Therefore, we have four attributes, $A_r$ = {Origin, Hour, Duration, Content}, as described in Table II below. Although the authors in [38] suggest using both the duration and the visiting frequency as weighting parameters, we only use the duration in this research, since the visiting frequency may be unreliable.

Table II. Categorical attributes for FARM

Origin: Visitor location, i.e. the country resolved from the remote-host field of a log entry
Hour: Hour of the day a visitor made the page access request
Duration: The length of the period that a visitor spent on a page, i.e. the activity time
Content: Different content categories of the requested page, predefined by experts

For origin we use the set of labels Lab(Origin) = {US, UK, China, Japan, Canada, ...}. The geographical location of a visitor can be resolved unambiguously from the IP address or domain name in the Web access log; therefore the membership degree of the origin attribute is always 1. This information can help in relating the user behaviors and preferences to their origin.

Hour is the time of day a visitor made the access request. This can be obtained from the timestamp of the access log entry, in the format HH:MM:SS. Visitors may behave differently in different hours of the day; for example, they may be interested in different kinds of content. Following [37], the set of labels for hour, Lab(Hour) = {Early morning, Morning, Noon, Afternoon, Night}, can be defined as in Figure 1.

Fig. 1. Representation of some fuzzy labels for Hour.

Duration refers to the length of the period that a user spent on a page, indicating the user's interest in the page content. It is measured by the length of time between two successive activities or requests of the same user within a session. Figure 2 shows a possible definition of the set of labels for duration, Lab(Duration) = {Short, Quite Short, Medium, Quite Long, Long}. Note that we use a fixed 15 minutes as the maximum difference between two requests in the same session, as suggested by [39]. The duration reflects the relative importance of each page, because a user generally spends more time on a more useful page that contains updated, useful and attractive content. If a user is not interested in a page, he/she will usually jump to another page quickly. However, a quick jump may also occur when there is too little content on the page to browse through. Hence, it is more appropriate to normalize the duration by the total bytes of the page. The equation is defined as shown in (5), where each $p_i \in P$ is a page viewed.

$Duration(p_i) = \dfrac{TotalDuration(p_i)\,/\,Size(p_i)}{\max_{p \in P}\big(TotalDuration(p)\,/\,Size(p)\big)}$   (5)


Fig. 2. Representation of some fuzzy labels for Duration.
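A small sketch of the normalization in Eq. (5), followed by a coarse assignment of the Duration labels; the page names, durations, sizes and label breakpoints are illustrative assumptions, and a full FARM implementation would assign graded memberships to neighbouring labels as in Figure 2 rather than a single crisp label:

# Total viewing time (seconds) and page size (bytes) per page; values are illustrative.
total_duration = {"/missions.html": 240.0, "/history.html": 30.0, "/home.html": 90.0}
page_size      = {"/missions.html": 52000, "/history.html": 8000, "/home.html": 31000}

def normalized_duration(pages):
    """Eq. (5): duration per byte, scaled into [0, 1] by the maximum over all pages."""
    per_byte = {p: total_duration[p] / page_size[p] for p in pages}
    peak = max(per_byte.values())
    return {p: v / peak for p, v in per_byte.items()}

def duration_label(d):
    """Map a normalized duration to one coarse label; the breakpoints are assumptions."""
    if d < 0.2: return "Short"
    if d < 0.4: return "Quite Short"
    if d < 0.6: return "Medium"
    if d < 0.8: return "Quite Long"
    return "Long"

for page, d in normalized_duration(total_duration).items():
    print(page, round(d, 2), duration_label(d))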

We categorize the Web pages by page content fuzzily, in the sense that a Web page may contain more than one kind of information and, therefore, can be classified into more than one content category to a certain degree. Table 7 shows an example of the fuzzy membership associated with the set of labels Lab(Content) = {A, B, C, D} for the content attribute, where P = {p1, ..., pk} is a set of Web pages; 60% of the content of page p1 is classified as content category B and 40% as content category C.

We use the sets of labels for hour and duration to discover the relations between the length and the hour of the activity time. Then we have $L = Lab(Hour) \cup Lab(Duration)$ and

$I_{RE}$ = {⟨Hour, Early morning⟩, ⟨Hour, Morning⟩, ⟨Hour, Noon⟩, ⟨Hour, Afternoon⟩, ⟨Hour, Night⟩, ⟨Duration, Short⟩, ⟨Duration, Quite Short⟩, ⟨Duration, Medium⟩, ⟨Duration, Quite Long⟩, ⟨Duration, Long⟩}.

The FT-set $T_r$ on $I_{RE}$ is obtained by associating one fuzzy transaction $\tau_t$ with each tuple $t \in r$. Table III shows the fuzzy transactions with items in $I_{RE}$, with the columns defining the fuzzy transactions of $T_r$ as fuzzy subsets of $I_{RE}$.

    Table III. Fuzzy transactions for the temporal relation of results

    0 0 0 1 0 0.25

    0 0 0 0 0 0.75

    0 0 0.5 0 0 0

    0.75 0 0.5 0 1 0

    0.25 1 0 0 0 0

    0 0 0 0.33 0 0

    0 0 0.25 0.67 0 1

    1 1 0.75 0. 0 0

    0 0 0 0 1 0

    0 0 0 0 0 0

We have a collection of Web pages P = {p1, . . . , pn} and a set of access log transactions associated to the collection of pages P, TP = {t1, . . . , tm}. We can obtain a set of items I = {i1, . . . , im} which represents all the attribute labels appearing in the transaction collection TP. The degrees of association to these items in an access transaction ti are given by membership values normalized in the range [0, 1] and are represented by μ = {μi1, . . . , μim}. Therefore, we can define a set of fuzzy transactions F = {t1, . . . , tn}, where each transaction ti corresponds to a fuzzy transaction fi in F, and where the membership values μ = {μi1, . . . , μim} of the item set I = {i1, . . . , im} are fuzzy values from a fuzzy weighting scheme as mentioned earlier. The process of building the fuzzy representation of TP is shown in Algorithm 1. Given a set of transactions, all possible items are extracted and their associated membership values are obtained by the item weighting scheme. On this set of fuzzy transactions we apply Algorithm 2 to extract the association rules; a short sketch of this pipeline is given after Algorithm 2.

Algorithm 1. Basic algorithm to obtain the fuzzy representations of all Web page access transactions

Input: a set of transactions TP = {t1, . . . , tm}

Output: a fuzzy representation for all transactions in TP.


1. Let TP = {t1, . . . , tm} be a collection of page access transactions.

2. Extract an initial set of items I from each transaction ti in TP.

3. Apply the fuzzy membership weighting scheme described earlier in this section.

4. The representation of ti obtained is a set of items I = {i1, . . . , im} with their associated membership values {μi1, . . . , μim}.

Algorithm 2. Basic algorithm to obtain the association rules from the Web access log

Input: a set of fuzzy transactions F = {t1, . . . , tm}, where ti contains a set of items I = {i1, . . . , im} with their associated membership values μ = {μi1, . . . , μim}.

Output: a set of association rules.

1. Construct the itemsets from the set of transactions F.

2. Establish the threshold values of minimum support minsupp and minimum confidence minconf.

3. Find all the itemsets that have a support above the threshold minsupp, that is, the frequent itemsets.

4. Generate the rules, discarding those rules below the threshold minconf.
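The following sketch illustrates the input side of this pipeline, turning preprocessed log records into fuzzy transactions whose items are (attribute, label) pairs, as Algorithm 1 and the ⟨Xj, Lk⟩ notation of Section 3.4 describe; the membership functions, breakpoints and sample records are illustrative assumptions, and the rules would then be obtained by running a fuzzy Apriori step (Algorithm 2) over the printed transactions:

def hour_memberships(hour):
    """Coarse fuzzy partition of the hour of day (cf. Figure 1); breakpoints are assumed."""
    labels = {"Early morning": (4, 7), "Morning": (7, 11), "Noon": (11, 14),
              "Afternoon": (14, 19), "Night": (19, 28)}   # Night wraps past midnight
    h = hour if hour >= 4 else hour + 24
    out = {}
    for label, (lo, hi) in labels.items():
        if lo <= h < hi - 1:
            out[label] = 1.0                # core of the label
        elif hi - 1 <= h < hi:
            out[label] = hi - h             # ramp down into the next label
        elif lo - 1 <= h < lo:
            out[label] = h - (lo - 1)       # ramp up from the previous label
    return out

def to_fuzzy_transaction(record):
    """Algorithm 1, steps 2-4: extract items and attach their membership degrees."""
    tau = {("Origin", record["origin"]): 1.0}              # origin is always crisp
    for label, mu in hour_memberships(record["hour"]).items():
        tau[("Hour", label)] = round(mu, 2)
    for label, mu in record["duration_labels"].items():    # e.g. from Eq. (5) and Fig. 2
        tau[("Duration", label)] = mu
    return tau

records = [
    {"origin": "US", "hour": 13.5, "duration_labels": {"Short": 0.7, "Quite Short": 0.3}},
    {"origin": "Japan", "hour": 6.4, "duration_labels": {"Long": 1.0}},
]
for r in records:
    print(to_fuzzy_transaction(r))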

    4. Experiment

A real-life dataset of a public organization Web site, the NASA Web site, was used to evaluate the efficacy of the proposed method for Web visualization. NASA was chosen because of its large amount of publicly available data. We extracted the Web structure and the page information of the present NASA Web site with a Web crawler utility [41] and Web Content Extractor [20]. By specifying the crawling rules, the preferred data and the output format, we can obtain specific data from a particular website automatically from the Internet. Since the Web site is very large, we only extracted the Missions part with a crawling depth of 3 for this experiment. The extracted data contains 103 unique HTML pages. Theoretically our prototype can work with any size of data provided that a view zooming function is available.

The logs were collected from 00:00:00 August 1, 1995 through 23:59:59 August 31, 1995. In this period there were 1,569,898 requests. Timestamps have 1-second resolution. There are a total of 18,688 unique IPs requesting pages, with a total of 171,529 sessions. A total of 15,429 unique pages are requested. The logs are in an ASCII file format with one line per request, with the following attribute columns:

    1. Remote-host making the request. A hostname when possible, otherwise the Internet address if the name

    could not be looked up.

    2. Timestamp in the format "DAY/MON/YEAR HH:MM:SS ZONE", where DAY is the day of the month,

    MON is the name of the month, YEAR is the year, HH:MM:SS is the time of day using a 24-hour clock,

    and ZONE is the time zone which is -0400 in this dataset.

    3. Request given in quotes.

    4. Status code.

5. Bytes in the reply.

A transaction contains a set of Web page access requests made by a user in the Web logs within a predefined period of time. After all pre-processing, which includes the filtering of unwanted data, user identification, content type mapping, etc., we calculate the fuzzy membership values of the attributes {Hour, Duration, Origin} according to the labels defined earlier and fuzzily classify the pages into content categories by the keywords of their URLs (a sketch of parsing and sessionizing the log is given below). Therefore, we obtain the data in pairs. In every entry, the first pair indicates the geographical location of the user, followed by the itemsets of the content accessed by the user. The first numeric value in a pair is the code representing the attribute label, while the second numeric value is the fuzzy membership value.
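A sketch of parsing one log line in the format listed above and of the 15-minute session cut-off mentioned in Section 3.4; the regular expression assumes the usual Common Log Format layout (optional identity fields and a DD/Mon/YYYY:HH:MM:SS timestamp), and the sample line and the simplified sessionization are illustrative assumptions:

import re
from datetime import datetime, timedelta

# One request per line: remote-host, [timestamp], "request", status, bytes.
LOG_RE = re.compile(
    r'^(?P<host>\S+)\s+(?:\S+\s+\S+\s+)?\[(?P<ts>[^\]]+)\]\s+'
    r'"(?P<request>[^"]*)"\s+(?P<status>\d{3})\s+(?P<bytes>\S+)'
)

def parse_line(line):
    m = LOG_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group("ts").split()[0], "%d/%b/%Y:%H:%M:%S")
    parts = m.group("request").split()
    return {"host": m.group("host"), "ts": ts,
            "url": parts[1] if len(parts) > 1 else "",
            "status": int(m.group("status"))}

def sessionize(requests, gap=timedelta(minutes=15)):
    """Split one host's time-ordered requests into sessions using a 15-minute gap."""
    sessions, current = [], []
    for r in requests:
        if current and r["ts"] - current[-1]["ts"] > gap:
            sessions.append(current)
            current = []
        current.append(r)
    if current:
        sessions.append(current)
    return sessions

sample = 'burger.letters.com - - [01/Aug/1995:00:00:11 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985'
print(parse_line(sample))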

For the experiments, two datasets containing the access requests of more than 60 hosts from 5 countries are prepared. We aim to assess the performance of our proposed FARM algorithm with respect to the standard Apriori-T algorithm. NASA1 contains 3 attributes {Hour, Duration, Origin}, while NASA2 contains 4 attributes {Hour, Duration, Origin, Content}. The experiments are run on a Windows Vista machine with a 2 GHz Intel Core Duo CPU and 3 GB RAM.

Association rules (ARs) are generated from what are called frequent itemsets; these are itemsets with a support count above some user-specified support threshold. The support threshold is typically given a low value so that no potentially interesting rules are missed. Once the frequent itemsets in a data set have been identified, the


ARs can be generated. Each frequent itemset of size greater than one can produce two or more ARs. To reduce this number, only those rules above a given confidence threshold are selected. Therefore the confidence threshold value chosen is usually quite high.

More importantly, for any dataset there is a particular support value for which an optimal number of itemsets is generated; for supports less than this value we get a flood of itemsets which are of no practical use. From our experiments, we have observed that our algorithm performs most efficiently at this optimal support value, which occurs in the range 0.015-0.03 for the dataset NASA1 and the range 0.025-0.05 for NASA2. Applying the Fuzzy Apriori-T algorithm, we have the following results.

    Fig. 3. Number of frequent itemsets (NASA1). Fig. 4. Number of frequent itemsets (NASA2).

Fig. 5. Number of rules with minsupp = 0.01 (NASA1+2). Fig. 6. Number of rules with minsupp = 0.02 (NASA1+2).


Fig. 7. Execution time with confidence = 0.03 (NASA1). Fig. 8. Execution time with confidence = 0.03 (NASA2).

Figures 3 and 4 show the results and demonstrate the difference between the numbers of frequent itemsets generated using the Fuzzy Apriori-T and Apriori-T algorithms for the two datasets, NASA1 and NASA2. As expected, the number of frequent itemsets increases as the minimum support decreases. From the results, it is clear that FARM produces more frequent itemsets (and consequently rules) than Apriori-T. Figures 5 and 6 show the number of rules produced using support thresholds 0.01 and 0.02, respectively. Fuzzy Apriori-T generates many more rules than Apriori-T in the case of the dataset with 4 attributes.

Figures 7 and 8 show the execution-time performance of the two algorithms when varying the support threshold for the different datasets. It can be seen that the execution time increases as the threshold decreases in all cases, irrespective of dataset type. Although the two algorithms have similar performance in execution time and number of frequent itemsets, Fuzzy Apriori-T benefits over standard Apriori-T by extracting more interesting rules, especially in the case of more attributes. Table IV lists groups of rules with quite high confidence that are discarded by Apriori-T but considered by Fuzzy Apriori-T, for different settings, from the NASA2 dataset.

Table IV. Rules discarded by Apriori-T

Rule | Confidence (%) | Lift Ratio
{Procurement} -> {US} | 99.57 | 1.75
{History, Japan} -> {Short} | 83.77 | 1.35
{Quite Long} -> {US} | 81.92 | 1.44
{History, Night} -> {Short} | 80.77 | 1.3

minsupp = 0.03, minconf = 0.65
Rule | Confidence (%) | Lift Ratio
{Long, Home} -> {US} | 91.06 | 1.6
{Noon, Home} -> {US} | 88.52 | 1.56
{Afternoon, History} -> {Short} | 80.88 | 1.3
{Noon, History} -> {Short} | 78.51 | 1.27
{UK, Missions} -> {Short} | 74.67 | 1.2
{US, Countdown} -> {Short} | 67.18 | 1.08
{Early Morning, Missions} -> {Short} | 66.67 | 1.08

minsupp = 0.035, minconf = 0.65
Rule | Confidence (%) | Lift Ratio
{Long, Home} -> {US} | 91.06 | 1.6
{Noon, Home} -> {US} | 88.52 | 1.56
{US, Countdown} -> {Short} | 67.18 | 1.08
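Lift, reported alongside confidence in Table IV but not defined in the text, is conventionally the ratio of the rule's confidence to the support of its consequent, i.e. how much more often the antecedent and consequent co-occur than would be expected if they were independent; a standard formulation in the notation of Section 3.1 is

$\mathrm{lift}(X \Rightarrow Y) = \dfrac{\mathrm{Conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)} = \dfrac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)}$

For example, under this reading the rule {Procurement} -> {US}, with confidence 99.57% and lift 1.75, corresponds to supp(US) ≈ 0.9957 / 1.75 ≈ 0.57.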

5. Visualization by Relation Graph

The association rules generated by applying the Fuzzy Apriori-T algorithm are visualized in Figure 9, showing the relationship between the four attributes of the dataset NASA2. The horizontal bars in the visualization show the absolute frequency of how often each category occurred. The category N/A indicates that no item of the rule fits in any of the categories in that dimension. The purple line from the category Early morning on the left indicates the rule


{Early morning, About} -> {US}, while the one on the right indicates {Early morning, History} -> {Short}. By choosing the dimensions of origin and content, we obtain the relation graph visualizing the rules of these two attributes in Figure 10. We can see clearly that US users are interested in the content of Procurement and About, while Japanese users prefer History content.

Fig. 9. Association rules with support 0.02 and confidence 80%. Fig. 10. Association rules showing users from different origins.

    6. Conclusion

In this paper we described Web usage analysis by applying an association rule mining algorithm called the Fuzzy Apriori-T algorithm. It is set to find out the relations between visitors' locations and their navigation preferences. Visualization of the generated rules in a relation graph helped to make the discovered patterns easy to understand. The motive of this approach is to enable visualization of the balanced growth of a Website, which can be observed quantitatively from the website structure as well as from the distribution of popularity received from the Web visitors, via the association rules. The Web graph only visualizes part of the NASA Web site as a trial test, and the fuzzy association rule mining covers a period of the Web logs as the experiment. The experiment validated the capabilities of our proposed visualization and data mining models.

    References

[1] Simon Fong, Ho Si Meng, A Web-based Performance Monitoring System for e-Government Portal, The 3rd International Conference on Theory and Practice of Electronic Governance (ICEGOV 2009), 10-13 November 2009, Bogota, Colombia, pp. 74-82.

[2] A. H. Youssefi, D. J. Duke, M. J. Zaki, Visual Web Mining, WWW2004, New York, May 2004.

[3] Oosthuizen, C., Wesson, J., Cilliers, C., Visual Web Mining of Organizational Web Sites, Tenth International Conference on Information Visualization, (2006), pp. 395-401.

[4] Smith, K.A. and Ng, A., Web page clustering using a self-organizing map of user navigation patterns, Decision Support Systems, Volume 35, Issue 2, (2003), pp. 245-256.

[5] Q. Song and M. Shepperd, Mining Web browsing patterns for e-Commerce, Computers in Industry 57(7) (2006), pp. 622-630.

[6] L. Chen, W. Lian and W. Chue, Using Web structure and summarization techniques for Web content mining, Information Processing and Management: an International Journal, Volume 41, Issue 5, September (2005).

[7] Munzner, T., Interactive Visualization of Large Graphs and Networks, Ph.D. Dissertation, Stanford University, June 2000; graphics.stanford.edu/papers/munzner_thesis

[8] Munzner, T., Exploring large graphs in 3D hyperbolic space, IEEE Comput. Graph. Appl. 18(4) (July/Aug. 1998), pp. 18-23.

[9] E. H. Chi, Improving Web usability through visualization, IEEE Internet Computing, 6(2), (2002), pp. 64-71.

[10] E. H. Chi, J. Pitkow, J. Mackinlay, P. Pirolli, R. Gossweiler, and S. K. Card, Visualizing the evolution of Web ecologies, In CHI 98: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, (1998), pp. 400-407.

[11] S. G. Eick, Visualizing online activity, Communications of the ACM, 44(8), (2001), pp. 45-50.

[12] Liu, B., Ma, Y. and Yu, P. S., Discovering Unexpected Information from Your Competitors' Web Sites, KDD-01, 2001.

[13] Bing Liu, Kaidi Zhao and Lan Yi, Visualizing Web Site Comparisons, WWW 2002, May 7-11, (2002), Honolulu, Hawaii, USA.

[14] Yonghe Niu, Tong Zheng, Jiyang Chen, Randy Goebel, WebKIV: Visualizing Structure and Navigation for Web Mining Applications, IEEE/WIC International Conference on Web Intelligence (WI'03).

[15] Steven P. Reiss and Guy Eddon, Visualizing What People are Doing on the Web, IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC'05), (2005).

[16] J. Chen, L. Sun, O. R. Zaïane, R. Goebel, Visualizing and Discovering Web Navigational Patterns, 7th International Workshop on the Web and Databases, Paris, June (2004).

[17] V. Pascual-Cid, An Information System for the Understanding of Web Data, IEEE Symposium on Visual Analytics Science and Technology, October (2008).

[18] V. Pascual-Cid, R. Baeza-Yates, J.C. Dursteler, S. Minguez and C. Middleton, New Techniques for Visualising Web Navigational Data, 13th International Conference Information Visualization, (2009).

[19] Toyoda, M., Kitsuregawa, M., A system for visualizing and analyzing the evolution of the web with a time series of graphs, Proceedings of the Sixteenth ACM Conference on Hypertext and Hypermedia, (2005), pp. 151-160.

[20] Web Content Extractor. http://www.newprosoft.com/Web-content-extractor.htm

[21] Hood, C. and Margetts, H., The Tools of Government in the Digital Age, London: Palgrave, (2006).

[22] J. Chen, T. Zheng, W. Thorne, D. Huntley, O. R. Zaïane, R. Goebel, Visualizing Web Navigation Data with Polygon Graphs, Proceedings of the 11th International Conference Information Visualization, (2007).

[23] WebTrends, http://www.webtrends.com/Products/Analytics/Web

[24] NetTracker, http://www.sane.com/products/NetTracker/

[25] David Durand and Paul Kahn, MAPA: a system for inducing and visualizing hierarchy in websites.
