Web Usage Mining (Clickstream Analysis) Mark Levene (Follow the links to learn more!)

Page 1: Web Usage Mining Web Usage Mining (Clickstream Analysis) Mark Levene (Follow the links to learn more!)

Web Usage Mining (Clickstream Analysis)

Mark Levene

(Follow the links to learn more!)

Page 2:

Reminder - W3C Extended Log File Format

Field | Identifier | Description
Date | date | The date on which the activity occurred
Time | time | The time at which the activity occurred
Client IP Address | c-ip | The IP address of the client that accessed your server
User Name | cs-username | The name of the authenticated user who accessed your server; anonymous users are represented by -
Service Name | s-sitename | The Internet service and instance number that was accessed by a client
Server Name | s-computername | The name of the server on which the log entry was generated
Server IP Address | s-ip | The IP address of the server on which the log entry was generated
Server Port | s-port | The port number the client is connected to
Method | cs-method | The action the client was trying to perform
URI Stem | cs-uri-stem | The resource accessed
URI Query | cs-uri-query | The query, if any, the client was trying to perform
Protocol Status | sc-status | The status of the action, in HTTP or FTP terms
Win32 Status | sc-win32-status | The status of the action, in terms used by Microsoft Windows
Bytes Sent | sc-bytes | The number of bytes sent by the server
Bytes Received | cs-bytes | The number of bytes received by the server
Time Taken | time-taken | The duration of time, in milliseconds, that the action consumed
Protocol Version | cs-version | The protocol (HTTP, FTP) version used by the client
Host | cs-host | The content of the host header
User Agent | cs(User-Agent) | The browser used on the client
Cookie | cs(Cookie) | The content of the cookie sent or received, if any
Referrer | cs(Referer) | The previous site visited by the user; this site provided a link to the current site

Prefixes:
cs = client-to-server actions
s = server actions
c = client actions
sc = server-to-client actions
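
To make the field list concrete, here is a minimal sketch of reading such a log. It assumes whitespace-separated entries preceded by a "#Fields:" directive naming the columns; the function name parse_w3c_log and the sample lines are made up for illustration.

```python
def parse_w3c_log(lines):
    """Yield one dict per log entry, keyed by the names in the #Fields directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive lines start with '#'; "#Fields:" lists the column identifiers.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        # Skip malformed entries whose column count does not match the directive.
        if len(values) == len(fields):
            yield dict(zip(fields, values))


# Made-up sample data using identifiers from the table above.
sample = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status",
    "2004-01-01 10:00:00 192.0.2.1 GET /index.html 200",
    "2004-01-01 10:00:07 192.0.2.1 GET /about.html 404",
]
entries = list(parse_w3c_log(sample))
print(entries[0]["cs-uri-stem"], entries[1]["sc-status"])
```

Keying each entry by its identifier means the later analysis steps do not depend on which optional fields a particular server was configured to log.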

Page 3:

Analog – Web Log File Analyser

• Gives basic statistics such as
  – number of hits
  – average hits per time period
  – what are the popular pages in your site
  – who is visiting your site
  – what keywords are users searching for to get to you
  – what is being downloaded

• Log data does not disclose the visitor’s identity

• What do Analog’s reports mean?

• Report for www.dcs.bbk.ac.uk/~mark
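
As a rough illustration of the kind of statistics listed above (not Analog's own implementation), the sketch below counts hits, popular pages and hits per hour. The entries are invented and assumed to be already parsed into dicts keyed by W3C field identifiers.

```python
from collections import Counter

# Hypothetical parsed log entries (keys follow the W3C field identifiers).
entries = [
    {"time": "10:00:00", "cs-uri-stem": "/index.html"},
    {"time": "10:05:12", "cs-uri-stem": "/papers.html"},
    {"time": "11:01:40", "cs-uri-stem": "/index.html"},
    {"time": "11:20:03", "cs-uri-stem": "/index.html"},
]

total_hits = len(entries)
# Popular pages: number of requests per resource.
hits_per_page = Counter(e["cs-uri-stem"] for e in entries)
# Hits per time period: here the period is the hour taken from "hh:mm:ss".
hits_per_hour = Counter(e["time"][:2] for e in entries)

print(total_hits)
print(hits_per_page.most_common(1))
print(dict(hits_per_hour))
```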

Page 4:

Applications of Usage Mining

• Pre-fetching and caching web pages

• eCommerce and clickstream analysis

• Web site reorganisation

• Personalisation

• Recommendation of links and products

Page 5:

Identification of User

• By IP address
  – Not so reliable, as IP addresses can be dynamic
  – Different users may share the same IP address

• Through cookies
  – Reliable, but the user may remove cookies
  – Security and privacy issues

• Through login
  – Users have to register

Page 6:

Sessionising

• Time oriented (robust)
  – By total duration of session
    • not more than 30 minutes
  – By page stay times (good for short sessions)
    • not more than 10 minutes per page

• Navigation oriented (good for short sessions and when timestamps are unreliable)
  – Referrer is the previous page in the session, or
  – Referrer is undefined but the request is within 10 secs, or
  – There is a link from the previous page to the current page in the web site
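
The time-oriented heuristics can be sketched as below. This is a minimal sketch that assumes one user's requests arrive already ordered by time, and it applies both thresholds together (the slide lists them as separate heuristics); the name sessionise and the sample requests are illustrative only.

```python
from datetime import datetime, timedelta

MAX_SESSION = timedelta(minutes=30)    # total duration of a session
MAX_PAGE_STAY = timedelta(minutes=10)  # time spent on a single page


def sessionise(requests):
    """Split one user's time-ordered (timestamp, page) requests into sessions."""
    sessions = []
    current = []
    for ts, page in requests:
        if current:
            too_long = ts - current[0][0] > MAX_SESSION
            idle = ts - current[-1][0] > MAX_PAGE_STAY
            if too_long or idle:
                # Either threshold exceeded: close the session and start a new one.
                sessions.append(current)
                current = []
        current.append((ts, page))
    if current:
        sessions.append(current)
    return sessions


start = datetime(2004, 1, 1, 10, 0)
requests = [
    (start, "A1"),
    (start + timedelta(minutes=2), "A2"),
    (start + timedelta(minutes=5), "A3"),
    (start + timedelta(minutes=20), "A4"),  # 15 minutes after A3: new session
    (start + timedelta(minutes=22), "A5"),
]
trails = [[page for _, page in s] for s in sessionise(requests)]
print(trails)
```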

Page 7:

Mining Navigation Patterns

• Each session induces a user trail through the site

• A trail is a sequence of web pages followed by a user during a session, ordered by time of access.

• A pattern in this context is a frequent trail.

• Co-occurrence of web pages is important, e.g. shopping-basket and checkout.

• Use a Markov chain model.

Page 8:

Trails inferred from Log data (Each session results in a trail)

ID Trail

1 A1 > A2 > A3

2 A1 > A2 > A3

3 A1 > A2 > A3 > A4

4 A5 > A2 > A4

5 A5 > A2 > A4 > A6

6 A5 > A2 > A3 > A6

Page 9:

Construct Markov Chain from Data

• Add a unique start state.
  – the start state has a transition to all visited web pages in the site.

• Add a unique final state.
  – the last page in each trail has a transition to the final state.

• The transition probabilities are obtained from counting click-throughs.

• The Markov chain built is called absorbing since we always end up in the final state.
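
The construction can be sketched with the six trails from the earlier slide. This is a minimal sketch, not a definitive implementation: it assumes the start-state probabilities are proportional to how often each page was visited, and the state labels "S" and "F" and the function name are hypothetical.

```python
from collections import Counter, defaultdict

START, FINAL = "S", "F"  # hypothetical labels for the added start and final states


def build_markov_chain(trails):
    """Estimate transition probabilities by counting click-throughs."""
    visits = Counter(page for trail in trails for page in trail)
    total_visits = sum(visits.values())
    clicks = defaultdict(Counter)
    for trail in trails:
        if not trail:
            continue
        for src, dst in zip(trail, trail[1:]):
            clicks[src][dst] += 1
        clicks[trail[-1]][FINAL] += 1  # last page of the trail leads to the final state
    # Assumption: the start state reaches every visited page, weighted by visit count.
    chain = {START: {page: n / total_visits for page, n in visits.items()}}
    for src, outgoing in clicks.items():
        out_total = sum(outgoing.values())
        chain[src] = {dst: n / out_total for dst, n in outgoing.items()}
    return chain


trails = [
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3", "A4"],
    ["A5", "A2", "A4"],
    ["A5", "A2", "A4", "A6"],
    ["A5", "A2", "A3", "A6"],
]
chain = build_markov_chain(trails)
print(round(chain["A2"]["A3"], 2), round(chain["A2"]["A4"], 2))
```

Because every page occurrence is followed either by another page or by the final state, each row of probabilities sums to one.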

Page 10:

The Markov Chain from the Data

Page 11:

Support and Confidence

• Support s in [0,1)
  – accept only trails whose initial probability (the probability of the transition from the start state to the trail's first page) is above s.
  – Setting support to be above the average click-through is reasonable.

• Confidence c in [0,1)
  – accept only trails whose probability is above c.
  – The probability of a trail is obtained by multiplying the transition probabilities of the links in the trail.
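
As a worked example of the trail probability, using counts from the six trails listed earlier (A1 is always followed by A2; A2 is visited six times and followed by A3 four times). The symbols p and a_i are notation introduced here for illustration.

```latex
P(a_1 > a_2 > \dots > a_n) = \prod_{i=1}^{n-1} p(a_i, a_{i+1})

P(A_1 > A_2 > A_3) = p(A_1, A_2)\, p(A_2, A_3) = \frac{3}{3} \cdot \frac{4}{6} \approx 0.67
```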

Page 12:

Mining Frequent Trails

• Find all trails whose initial probability is higher than s, and whose trail probability is above c.

• Use depth-first search on the Markov chain to compute the trails.

• The average time needed to find the frequent trails is proportional to the number of web pages in the site.
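
A depth-first search of this kind might look as follows. This is a sketch under stated assumptions rather than the author's exact algorithm: initial probabilities are taken to be proportional to page visit counts, only trails that cannot be extended further are reported, and confidence is assumed to be greater than zero so the search terminates on sites with cycles. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_chain(trails):
    """Return (initial, transitions) estimated from click-through counts."""
    visits = Counter(page for trail in trails for page in trail)
    total = sum(visits.values())
    counts = defaultdict(Counter)
    for trail in trails:
        for src, dst in zip(trail, trail[1:]):
            counts[src][dst] += 1
    initial = {page: n / total for page, n in visits.items()}
    # Dividing by visits leaves the remaining probability for the final state.
    transitions = {
        src: {dst: n / visits[src] for dst, n in out.items()}
        for src, out in counts.items()
    }
    return initial, transitions


def frequent_trails(initial, transitions, support, confidence):
    """Depth-first search for trails above the support and confidence thresholds."""
    found = []

    def extend(trail, prob):
        extended = False
        for nxt, p in transitions.get(trail[-1], {}).items():
            if prob * p > confidence:
                extended = True
                extend(trail + [nxt], prob * p)
        # Report a trail only when no link can extend it above the threshold.
        if not extended and len(trail) > 1:
            found.append((trail, prob))

    for page, p0 in initial.items():
        if p0 > support:
            extend([page], 1.0)
    return found


trails = [
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3", "A4"],
    ["A5", "A2", "A4"],
    ["A5", "A2", "A4", "A6"],
    ["A5", "A2", "A3", "A6"],
]
initial, transitions = build_chain(trails)
for trail, prob in frequent_trails(initial, transitions, support=0.1, confidence=0.3):
    print(" > ".join(trail), round(prob, 2))
```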

Page 13:

Frequent Trails (Support = 0.1 and Confidence = 0.3)

Trail Probability

A1 > A2 > A3 0.67

A5 > A2 > A3 0.67

A2 > A3 0.67

A1 > A2 > A4 0.33

A5 > A2 > A4 0.33

A2 > A4 0.33

A4 > A6 0.33

Page 14:

Frequent Trails (Support = 0.1 and Confidence = 0.5)

Trail Probability

A1 > A2 > A3 0.67

A5 > A2 > A3 0.67

A2 > A3 0.67

Page 15:

Content Mining

• Incorporate the categories that users are navigating through so we may better understand their activities.
  – E.g. what type of book is the user interested in; this may be used for recommendation.

• Classify users according to behaviour.
  – Is the user’s intent to browse, search or buy?

• Cluster users with common interests.

Page 16:

Pre-fetching and Caching Pages

• Learn access patterns to predict future accesses.

• Pre-fetch predicted pages to reduce latency.

• Can use Markov model and base the prediction on history of access.

• Also cache results of popular search engine queries.
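
A first-order Markov predictor for pre-fetching could be sketched as below. It assumes the prediction is based only on the current page and on sessions that continue (session exits are ignored); the function names and the 0.5 threshold are illustrative choices, not taken from the slides.

```python
from collections import Counter, defaultdict


def learn_transitions(trails):
    """Count how often each page is followed by each other page."""
    counts = defaultdict(Counter)
    for trail in trails:
        for src, dst in zip(trail, trail[1:]):
            counts[src][dst] += 1
    return counts


def pages_to_prefetch(counts, current_page, threshold=0.5):
    """Return likely next pages, given that the user keeps browsing."""
    outgoing = counts.get(current_page)
    if not outgoing:
        return []
    total = sum(outgoing.values())
    return [page for page, n in outgoing.most_common() if n / total >= threshold]


history = [
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3", "A4"],
    ["A5", "A2", "A4"],
    ["A5", "A2", "A4", "A6"],
    ["A5", "A2", "A3", "A6"],
]
counts = learn_transitions(history)
print(pages_to_prefetch(counts, "A2"))
```

A higher threshold pre-fetches fewer pages and wastes less bandwidth, at the cost of fewer latency savings.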

Page 17:

eCommerce Clickstream Analysis

• What is the user’s intention: browse, search or buy?

• Measure time spent on site - site stickiness

• Repeat visits – it has been shown that repeat visitors spend less time on the site; this can be explained by visitors learning the site.

• Measure visit-to-purchase conversion ratio, and predict purchase likelihood.
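
The two measurements above reduce to simple ratios once sessions have been identified. The sketch below uses invented per-session summaries; the key names seconds_on_site and purchased are hypothetical.

```python
# Hypothetical per-session summaries derived from the log.
sessions = [
    {"seconds_on_site": 310, "purchased": False},
    {"seconds_on_site": 95, "purchased": False},
    {"seconds_on_site": 640, "purchased": True},
    {"seconds_on_site": 155, "purchased": False},
]

visits = len(sessions)
purchases = sum(1 for s in sessions if s["purchased"])

# Visit-to-purchase conversion ratio: purchasing visits over all visits.
conversion_ratio = purchases / visits
# Stickiness: average time spent on the site per visit, in seconds.
average_stay = sum(s["seconds_on_site"] for s in sessions) / visits

print(conversion_ratio, average_stay)
```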

Page 18:

Supplementary Analyses to Improve eCommerce Web Sites

• Detecting visits from crawlers as opposed to human visitors.

• Form error analysis, e.g. login errors, mandatory fields not filled, incorrect format.

• When and why do people exit the site, e.g. visitor puts item in cart but exits before reaching the checkout.

• Analysis of local search engine logs – correlate with site behaviour.

• Product recommendations based on association rules (people who bought x also bought y).

• Geographic analysis – where are the customers?

• Demographic analysis – who are the customers?
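
For the association-rule bullet above, "people who bought x also bought y" can be scored by the confidence of the rule x → y, that is, the fraction of baskets containing x that also contain y. The sketch below is a simple pairwise version with invented baskets and a hypothetical function name; it is not a full association-rule miner.

```python
from collections import Counter
from itertools import permutations


def also_bought(baskets, min_confidence=0.5):
    """Map each item x to the (y, confidence) pairs with confidence(x -> y) >= threshold."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = set(basket)  # ignore repeated items within one basket
        item_counts.update(items)
        pair_counts.update(permutations(items, 2))  # ordered pairs (x, y)
    rules = {}
    for (x, y), n in pair_counts.items():
        confidence = n / item_counts[x]
        if confidence >= min_confidence:
            rules.setdefault(x, []).append((y, confidence))
    return rules


# Invented purchase data.
baskets = [
    ["book", "lamp"],
    ["book", "lamp", "pen"],
    ["book"],
    ["pen"],
]
rules = also_bought(baskets)
print(sorted(rules["lamp"]))
```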

Page 19:

Adaptive web sites

• Modify the web site according to user access.
  – Automatic synthesis of index pages (hubs that contain links on a specific topic)
  – Based on a clustering algorithm that uses the co-occurrence frequencies of pages from the log data.
  – Finds a concept that best describes each cluster.
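
The co-occurrence frequencies that feed such a clustering step could be computed as below. This sketch covers only the counting, using the six example trails as sessions; the clustering and concept-finding steps are not shown, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sessions):
    """Count in how many sessions each unordered pair of pages appears together."""
    counts = Counter()
    for session in sessions:
        # sorted(set(...)) gives each pair once per session, in a canonical order.
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return counts


sessions = [
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3", "A4"],
    ["A5", "A2", "A4"],
    ["A5", "A2", "A4", "A6"],
    ["A5", "A2", "A3", "A6"],
]
counts = cooccurrence_counts(sessions)
print(counts[("A2", "A3")], counts[("A4", "A6")])
```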