Data Preparation for Web Usage Mining

download Data Preparation for Web Usage Mining

of 21

Transcript of Data Preparation for Web Usage Mining

  • 8/16/2019 Data Preparation for Web Usage Mining

    1/21

    Data Preparation for Web Usage

    Mining

    Reference : http://maya.cs.depaul.edu/~classes/ect584/papers/cms-kais.pdf 

    Dr. Ahmed Rafea

    http://maya.cs.depaul.edu/~classes/ect584/papers/cms-kais.pdfhttp://maya.cs.depaul.edu/~classes/ect584/papers/cms-kais.pdf

  • 8/16/2019 Data Preparation for Web Usage Mining

    2/21

    Outline

    • A General Overview

    • Preprocessing – Data Cleaning – User Identification

     – Session Identification

     – Path Completion

     – Formatting

    • Transaction Identification

     – General Model

     – Transaction Identification by Reference Length – Transaction Identification by Maximal Forward

    Reference

     – Transaction Identification by Time Window

  • 8/16/2019 Data Preparation for Web Usage Mining

    3/21

     A General Overview

  • 8/16/2019 Data Preparation for Web Usage Mining

    4/21

    Data Cleaning

    • Techniques to clean a server log to eliminate irrelevant itemsare of importance for any type of Web log analysis, not just

    data mining.• The discovered associations are only useful if the datarepresented in the server log gives an accurate picture of theuser accesses to the Web site. .

    • A user’s request to view a particular page often results inseveral log entries since graphics and scripts are down-loaded in addition to the HTML file.

    • In most cases, only the log entry of the HTML file request isrelevant and should be kept for the user session file .

    • Elimination of the items deemed irrelevant can be reasonablyaccomplished by checking the suffix oft he URL name.

    • For instance, all log entries with filename suffixes such as, gif, jpeg, GIF, JPEG, jpg, JPG, and map can be removed.

  • 8/16/2019 Data Preparation for Web Usage Mining

    5/21

    User Identification (1)

    • This task is greatly complicated by the existence of local caches, corporatefirewalls, and proxy servers.

    • The Web Usage Mining methods that rely on user cooperation are the

    easiest ways to deal with this problem.• However, even for the log/site based methods, there are heuristics that can

    be used to help identify unique users.

    • A reasonable assumption to make is that each different agent type for an IPaddress represents a different user.

    • If a requested page is not directly reachable by a hyperlink from any of thepages visited by the user, the heuristic assumes that there is another userwith the same IP address.

    • Two users with the same IP address that use the same browser on thesame type of machine can easily be confused as a single user if they arelooking at the same set of pages.

    • Conversely, a single user with two different browsers running, or who typesin URLs directly without using a sites link structure can be mistaken formultiple users.

  • 8/16/2019 Data Preparation for Web Usage Mining

    6/21

    User Identification (2)

    •The fifth, sixth, eighth, andtenth entries were accessedusing a different agent than theothers, suggesting that the logrepresents at least two usersessions.

    •The third entry, page L, is notdirectly reachable from pages Aor B. Also, the seventh entry,page R is reachable from pageL, but not from any of the other

    previous log entries. This wouldsuggest that there is a third userwith the same IP address

    •Three unique users areidentified with browsing paths of

     A-B-F-O-G-A-D, A-B-C-J, and L-R, respectively.

  • 8/16/2019 Data Preparation for Web Usage Mining

    7/21

    Session Identification (1)

    • The goal of session identification is to divide thepage accesses of each user into individual

    sessions.• The simplest method of achieving this is through

    a timeout,

    • Many commercial products use 30 minutes as adefault timeout.

    • Once a site log has been analyzed and usage

    statistics obtained, a timeout that is appropriatefor the specific Web site can be fed back into thesession identification algorithm.

  • 8/16/2019 Data Preparation for Web Usage Mining

    8/21

    Session Identification (2)

    • Using a 30 minute timeout,the path for user 1 from the

    sample log is broken into two

    separate sessions since the

    last two references are over

    an hour later than the first five.

    The session identification step

    results in four user sessions consisting of A-B-F-

    O-G, A-D, A-B-C-J, and L-R.

  • 8/16/2019 Data Preparation for Web Usage Mining

    9/21

    Path Completion (1)

    • Another problem in reliably identifying unique user sessions is determining ifthere are important accesses that are not recorded in the access log. Thisproblem is referred to as  path completion.

    • If a page request is made that is not directly linked to the last page a userrequested, the referrer log can be checked to see what page the requestcame from.

    • If the page is in the user’s recent request history, the assumption is that theuser backtracked with the “back” button available on most browsers,

    • If the referrer log is not clear, the site topology can be used to the sameeffect.

    • If more than one page in the user’s history contains a link to the requestedpage, it is assumed that the page closest to the previously requested pageis the source of the new request.

    • Missing page references that are inferred through this method are added tothe user session file.

    • A simple method of picking a time-stamp is to assume that any visit to apage already seen will be effectively treated as an auxiliary page.

    • The average reference length for auxiliary pages for the site can be used toestimate the access time for the missing pages.

  • 8/16/2019 Data Preparation for Web Usage Mining

    10/21

    Path Completion (2)

    • Page G is not directly accessiblefrom page O. The referrer log forthe page G request lists page B as

    the requesting page. Thissuggests that user 1 backtrackedto page B using the back buttonbefore requesting page G.

    • Therefore, pages F and B should

    be added into the session file foruser 1.

    • Again, while it is possible that theuser knew the URL for page Gand typed it in directly, this isunlikely, and should not occuroften enough to affect the miningalgorithms.

    • The path completion step resultsin user paths of A-B-F-O-F-B-G, A-D, A-B-A-C-J, and L-R.

  • 8/16/2019 Data Preparation for Web Usage Mining

    11/21

    Formatting

    • A final preparation module can be used to

    properly format the sessions or transactions forthe type of data mining to be accomplished.

    • For example, since temporal information is not

    needed for the mining of association rules, afinal association rule preparation module would:

     – strip out the time for each reference, and

     – do any other formatting of the data necessary for the

    specific data mining algorithm to be used.

  • 8/16/2019 Data Preparation for Web Usage Mining

    12/21

    Summary of Sample Log

    Preprocessing Results

  • 8/16/2019 Data Preparation for Web Usage Mining

    13/21

    General Model for Transaction

    Identification (1)• The goal of transaction identification is to create

    meaningful clusters of references for each user.

    • Let L be a set of user session file entries. Asession entry l L includes the client IP addressl.ip, the client user id l.uid, the URL of the

    accessed page l.url, and the time of accessl.time.

    • A General Transaction t is:

    t =< ipt, uidt, { (l t 1.url, l t 1.time), . . . , (l 

    t m.url, l 

    t m .time) } >

    for 1 ≤ k ≤ m, l t k ε L, l t 

    k .ip = ipt, l t 

    k .uid  = uidt 

  • 8/16/2019 Data Preparation for Web Usage Mining

    14/21

    General Model for Transaction

    Identification (2)• Since the initial input to the transaction

    identification process consists of all of the page

    references for a given user session, the first stepin the transaction identification process willalways be the application of a divide approach.

    • There are three divide transaction identificationapproaches. – The first two, reference length and maximal forward

    reference, make an attempt to identify semanticallymeaningful transactions.

     – The third, time window , is not based on any browsingmodel, and is mainly used as a benchmark tocompare with the other two algorithms.

  • 8/16/2019 Data Preparation for Web Usage Mining

    15/21

    Transaction Identification by

    Reference Length (1)• This approach is based on the

    assumption that the amount of time auser spends on a page correlates towhether the page should be classifiedas a auxiliary or content page.

    • The following Figure shows ahistogram of the lengths of pagereferences between 0 and 600seconds for a server log of a site.

    • It is expected that the variance of thetimes spent on the auxiliary pages issmall, and the auxiliary referencesmake up the lower end of the curve.

    • The length of content references isexpected to have a wide variance and

    would make up the upper tail thatextends out to the longest reference.

    • We need to have a method to computethe reference length that discriminatesauxiliary and content pages

  • 8/16/2019 Data Preparation for Web Usage Mining

    16/21

    Transaction Identification by

    Reference Length (2)• The definition of a transaction within the reference length approach is :

    trl  =< iptrl , uid trl  , { (l trl 

    1 .url, l trl 

    1 .time, l trl 

    1 .length),. . . , (l trl 

    m .url, l trl 

    m .time, l trl 

    m .length) } >

    for 1≤

    k≤

    m, l 

    trl 

    k    L, l 

    trl 

    k  .ip = iptrl  , l 

    trl 

    k  .uid  = uid trl 

    • The length of each reference is estimated by taking the difference between the timeof the next reference and the current reference.

    • Obviously, the last reference in each transaction has no “next” time to use inestimating the reference length.

    • The reference length approach makes the assumption that all of the last referencesare content references, and ignores them while calculating the cutoff time.

    • This assumption can introduce errors if a specific auxiliary page is commonly used asthe exit point for a Web site.

    • While interruptions such as a phone call or lunch break can result in the erroneous

    classification of a auxiliary reference as a content reference,• it is unlikely that the error will occur on a regular basis for the same page.

    • A reasonable minimum support threshold during the application of a data miningalgorithm would be expected to weed out these errors.

  • 8/16/2019 Data Preparation for Web Usage Mining

    17/21

    Transaction Identification by

    Reference Length (3)• Once the cutoff time is calculated, the two types of

    transactions can be formed by comparing each

    reference length against the cutoff time. Depending onthe goal of the analysis, the auxiliary-contenttransactions or the content-only transactions can beidentified.

    • If C is the cutoff time, for auxiliary-content transactionsthe conditions,for 1 ≤ k ≤ (m− 1) : l trl k  .length ≤ C and k = m : l 

    trl k  .length > C are

    added as auxiliary-content transaction• For content-only transactions, the condition,

    for 1 ≤ k ≤ m : l trl k  .length > C is added as content transaction .

  • 8/16/2019 Data Preparation for Web Usage Mining

    18/21

    Transaction Identification by

    Reference Length (4)• Using user session: A-B-F-O-F-B-G, (given

    example) and assuming that the cutoff time is78.4 seconds, this results in the followingauxiliary content transaction – A-B-F because the user stayed in A,B less than 78.4,

    and stayed in F for 240 sec.

     – O-F-B-G because F, B are added to complete thepath, and G was a final page

    • To extract content-only transactions for theabove user, contents pages are only taken andthis results in one transaction: F-G for this user 

  • 8/16/2019 Data Preparation for Web Usage Mining

    19/21

    Transaction Identification by

    Maximal Forward Reference• In this approach, each transaction is defined to be the set of pages in

    the path from the first page in a user session up to the page before abackward reference is made.

    • A forward reference is defined to be a page not already in the set ofpages for the current transaction.

    • Similarly, a backward reference is defined to be a page that is alreadycontained in the set of pages for the current transaction.

    • A new transaction is started when the next forward reference is made.• The underlying model for this approach is that the maximal forward

    reference pages are the content pages, and the pages leading up toeach maximal forward reference are the auxiliary pages.

    • Using the user session: A-B-F-O-F-B-G, (given example) , auxiliary-content transactions of A-B-F-O, A-B-G, would be formed.

    • The content-only transactions would be O-G, for the above usersession

  • 8/16/2019 Data Preparation for Web Usage Mining

    20/21

    Transaction Identification by

    Time Window• This approach partitions a user session into time intervals no larger than a

    specified parameter.

    • The approach assumes that meaningful transactions have an overall

    average length associated with them.• For a sufficiently large specified time window, each transaction will contain

    an entire user session.

    • If W is the length of the time window, the transactions will be identified withthe following added condition:

    (l t m .time − l t 1.time) ≤ W • Since there is some standard deviation associated with the length of each

    “real” transaction, it is unlikely that a fixed time window will break a log upappropriately.

    • However, the time window approach can also be used as a merge approach

    in conjunction with one of the other divide approaches. For example, afterapplying the reference length approach, a merge time window approachwith a 10 minute input parameter could be used to ensure that eachtransaction has some minimum overall length.

  • 8/16/2019 Data Preparation for Web Usage Mining

    21/21

    Summary of Sample Transaction

    Identification Results