Nov, 2002 Banerjee and Ghosh 1
Characterizing Visitors to a Website Across Multiple Sessions
NGDM Workshop, Nov 2002
Arindam Banerjee, Joydeep Ghosh
Motivation
Why Characterize or Predict web user behavior?
• Site-centric view: Personalization, sticky websites
• User-centric view: personal agents for information acquisition
• Universalist approaches: PageRank, web metrics, …
Clustering Users from Web Logs
• Wide variety of web behavior: segment users based on surfing behavior as a first step to further analysis.
• User: set of sessions
• Session: sequence of (page ID, time spent on that page) tuples
– How to cluster sets of sequences?
The Approach
• Cluster Sessions
– Session Similarity Measure
– Session Similarity Graph
• Outlier Detection
– Graph Partitioning
• Create a Cluster Space
• Cluster users in this Space
A Similarity Measure for Sessions
1. Overlap between two sessions represented by the longest common subsequence (LCS)
2. Obtain session similarity using LCS and time information
session similarity = (time similarity in LCS) x (importance of LCS)
• The similarity component :
– Average min-max similarity for each page in the LCS
• The importance component :
– Average of the fraction of overall session time spent in the LCS
• Both components take values in [0, 1]
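One way to write the measure as a formula is sketched below. The symbols are assumptions introduced here and do not appear on the slides: L is the LCS of sessions S_1 and S_2, t_i(p) is the time session i spent on the matched occurrence of page p, and T_i is the total time of session i.

```latex
\[
\mathrm{sim}(S_1, S_2) =
\underbrace{\frac{1}{|L|} \sum_{p \in L}
  \frac{\min\bigl(t_1(p),\, t_2(p)\bigr)}{\max\bigl(t_1(p),\, t_2(p)\bigr)}}_{\text{similarity component}}
\;\times\;
\underbrace{\frac{1}{2} \left(
  \frac{\sum_{p \in L} t_1(p)}{T_1} + \frac{\sum_{p \in L} t_2(p)}{T_2}
\right)}_{\text{importance component}}
\]
```

Each factor is an average of ratios between 0 and 1, so the product also lies between 0 and 1.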
Session Clustering
• Find the pairwise similarity values between all pairs of sessions; record only similarities greater than a threshold
• Incrementally construct similarity graph G
– the vertices are the sessions, the edge weights are the session similarity values
– no isolated vertices (discard “outliers”)
• Balanced Graph Partitioning
– we used Metis [Karypis, Kumar]
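The graph construction step can be sketched in Python as below. This is only an illustration under stated assumptions: the function name, the all-pairs loop (the slides describe an incremental construction) and the toy similarity function are hypothetical, and the Metis partitioning step itself is not shown.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple


def build_similarity_graph(
    items: Sequence,
    similarity: Callable[[object, object], float],
    threshold: float,
) -> Tuple[Dict[Tuple[int, int], float], List[int], List[int]]:
    """Return (edges, kept vertices, outliers) for a thresholded similarity graph.

    Hypothetical helper: vertices are item indices, edge weights are the
    similarity values, and only similarities above the threshold are recorded.
    """
    edges: Dict[Tuple[int, int], float] = {}
    for i, j in combinations(range(len(items)), 2):
        weight = similarity(items[i], items[j])
        if weight > threshold:
            edges[(i, j)] = weight
    # Vertices with no surviving edge are isolated and treated as outliers.
    connected = {vertex for edge in edges for vertex in edge}
    kept = sorted(connected)
    outliers = [v for v in range(len(items)) if v not in connected]
    return edges, kept, outliers


# Toy usage with numbers standing in for sessions (assumed similarity function).
toy_items = [1.0, 1.1, 5.0, 5.2, 100.0]
toy_edges, toy_kept, toy_outliers = build_similarity_graph(
    toy_items, lambda a, b: 1.0 / (1.0 + abs(a - b)), threshold=0.5
)
```

The surviving graph (without the outliers) is what would then be handed to a balanced graph partitioner.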
The Cluster Space
• Given: each session assigned to one of k clusters (sets)
• Sessions of a user are distributed among the k sets
– vector u = [u1 u2 … uk]^T where ui = number of sessions of the user belonging to cluster i
• Stage II : User Clustering
– find pairwise similarity values using the extended Jaccard measure
– partition similarity graph
• Gives l user clusters and a set of outlier users
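A minimal Python sketch of the cluster-space mapping and the user similarity follows. The slides do not spell out the extended Jaccard formula; the version used here, x·y / (|x|² + |y|² − x·y), is the commonly used definition and is an assumption, as are the function names and the 0-based cluster labels.

```python
from typing import List, Sequence


def user_vector(session_labels: Sequence[int], k: int) -> List[int]:
    """Count how many of a user's sessions fall in each of the k session clusters.

    Assumes cluster labels are integers in 0..k-1 (outlier sessions already removed).
    """
    u = [0] * k
    for label in session_labels:
        u[label] += 1
    return u


def extended_jaccard(x: Sequence[float], y: Sequence[float]) -> float:
    """Extended Jaccard similarity: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0


# Toy usage: a user with five sessions spread over k = 4 session clusters.
u_example = user_vector([0, 2, 2, 1, 2], k=4)
```

The pairwise values from `extended_jaccard` would then play the role of edge weights in the user similarity graph.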
Dataset details
• Logs over a one month period
• Raw log size 184 Mb
• 453,953 files accessed
• 37,753 sessions in all
• 23,310 sessions after some preprocessing/filtering
• 2,493 users
Results : Session Clusters
Cluster 1 – interest in coffeehouse, contests
-(/,12)(/movies,6)(/contests,178)
-(/contests,142)
-(/coffeehouse,5)(/contests,183)
-(/contests,172)
-(/,10)(/contests,143)
Cluster 2 – glance through home, articles
-(/,22)(/articles,22)
-(/,20)(/articles,20)
-(/,21)(/articles,21)
-(/,19)(/articles,19)
-(/,20)(/articles,19)
Cluster 3 – interest in author, articles
-(/,148)(/authors,6)(/articles,77)
-(/authors,290)(/articles,290)
-(/authors,295)(/articles,295)
-(/,33)(/authors,90)(/articles,475)
-(/,32)(/authors,91)(/articles,425)
Cluster 4 – read articles
-(/,39)(/articles,98)(/misc,17)(/articles,2649)
-(/,9)(/articles,2666)
-(/authors,26)(/articles,2561)
-(/misc,20)(/articles,77)(/misc,32)(/articles,43)(/authors,16)(/articles,2373.1)
Results : User Clusters
• user : [(128.194.xxx.xxx)]
– (/authors,3)(/articles,129)
– (/authors,8)(/articles,8)
– (/authors,80)(/articles,2141)
• user : [(209.30.xxx.xxx)]
– (/home,77)(/articles,111)(/authors,93)(/articles,629)(/misc,58)(/coffeehouse,75)(/wo-men,967)
– (/articles,2627)
• user : [(171.68.xxx.xxx)]
– (/home,323)(/articles,24)(/authors,45)(/articles,1290)
A user cluster : people who read the articles
Results : User Clusters
• user : [(152.170.xxx.xxx)]
– (/home,21)(/wo-men,1075)(/philosophy,52)
• user : [(209.244.xxx.xxx)]
– (/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)(/wo-men,31)
– (/home,52)(/philosophy,67)(/wo-men,955)(/philosophy,26)(/coffeehouse,382)(/biztech,298)(/philosophy,290)
– (/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6)(/biztech,94)(/coffeehouse,2)(/philosophy,1093)
A user cluster : people interested in wo-men, philosophy, coffeehouse
Results : User Clusters
• user : [(216.154.xxx.xxx)]
– (/coffeehouse,12)(/biztech,25)(/books,48)
– (/coffeehouse,13)(/biztech,26)(/books,19)
• user : [(204.220.xxx.xxx)]
– (/coffeehouse,162)
– (/coffeehouse,40)
• user : [(32.100.xxx.xxx)]
– (/coffeehouse,12)(/contests,12)
– (/coffeehouse,43)(/contests,44)
A user cluster : people interested in coffeehouse (bookmarked it!)
Result Visualization using CLUSION [Strehl & Ghosh 01]
[Figure: two panels, Sessions and Users]
Conclusions
• Segmentation: a basic pre-processing step for Web Mining
• Similarity measure + Cluster Space Concept: applicable to clustering of sets of any data-structure
• For certain websites, time spent on the pages matters
– not handled by current commercial tools
• Outlier detection before clustering is important
• Results QA-ed by human subjects
– Results for clusters & outliers at both levels were subjectively good
• No good way to find cluster quality analytically
• Formation of similarity graph is a slow process
Future Work
• Improve the present method by:
– using cluster seeds for cluster growing
– using alternative clustering algorithms for each stage
– studying the effect of thresholds, number of clusters on performance
– studying the importance of order of page-visits
– studying the importance of balanced clustering
Issues : Choice of Parameters
• Number of session clusters, k, should be chosen appropriately
• Thresholds for forming session & user similarity graphs :
– threshold value should be chosen after looking at the distribution of edge weights
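One simple way to look at the edge-weight distribution before fixing a threshold is sketched below in Python. The helper names and the toy weights are hypothetical, and the sketch assumes all similarity values lie in [0, 1].

```python
from typing import List, Sequence


def weight_histogram(weights: Sequence[float], bins: int = 10) -> List[int]:
    """Count similarity values per equal-width bin over [0, 1]."""
    counts = [0] * bins
    for w in weights:
        # A weight of exactly 1.0 is placed in the last bin.
        index = min(int(w * bins), bins - 1)
        counts[index] += 1
    return counts


def fraction_kept(weights: Sequence[float], threshold: float) -> float:
    """Fraction of candidate edges that would survive a given threshold."""
    return sum(1 for w in weights if w > threshold) / len(weights)


# Toy usage with made-up pairwise similarity values.
toy_weights = [0.05, 0.15, 0.15, 0.43, 0.9, 1.0]
toy_counts = weight_histogram(toy_weights, bins=10)
toy_fraction = fraction_kept(toy_weights, threshold=0.4)
```

Scanning `fraction_kept` over a few candidate thresholds shows how quickly the graph thins out, which is the kind of inspection the bullet above calls for.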
Related Work
• Research in Web Mining :
– Extraction of navigational patterns : Spiliopoulou, Faulstich
– Ordering relationships : Mannila, Meek
– Surfing prediction : Pitkow, Pirolli
– Clustering web usage sessions : Fu, Sandhu, Shih
Example
• Sessions :
– Session1 = [(a,8) (b,100) (d,8) (c,5) (e,23) (a,5)]
– Session2 = [(b,5) (d,12) (f,1) (a,7) (c,5)]
• LCS pages = [(b)(d)(c)]
• Corresponding Index, Time Sequences :
– Index1 = [(1)(2)(3)], Time1 = [(100) (8) (5)]
– Index2 = [(0)(1)(4)], Time2 = [(5) (12) (5)]
• Similarity over each LCS page : min / max of the two times
– Similarity on page b = 5/100 = 0.05
– Similarity on page d = 8/12 = 0.67
– Similarity on page c = 5/5 = 1.00
Example (contd.)
• The similarity component = (0.05 + 0.67 + 1.00)/3 = 0.57
• The importance component :
– Fraction of time spent in the LCS by Session1 = 113/149 = 0.76
– Fraction of time spent in the LCS by Session2 = 22/30 = 0.73
– The mean = (0.76+0.73)/2 = 0.75
• The overall similarity = 0.57 x 0.75 = 0.43
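The worked example can be reproduced with a short Python sketch. This is an illustration, not the authors' implementation: the function names are hypothetical, and because the LCS is not unique in general (here b, d, a is also a common subsequence of length 3), the tie-breaking rule in the backtracking step is an assumption chosen so that the alignment matches the one on the slide.

```python
from typing import List, Tuple

Session = List[Tuple[str, float]]  # (page id, time spent on that page)


def lcs_alignment(s1: Session, s2: Session) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of one longest common subsequence of page ids."""
    n, m = len(s1), len(s2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1][0] == s2[j - 1][0]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1][0] == s2[j - 1][0]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:  # assumed tie-break: shorten s1 first
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def session_similarity(s1: Session, s2: Session) -> float:
    """(mean min/max time ratio over LCS pages) x (mean fraction of time in LCS).

    Assumes both sessions have a positive total time.
    """
    pairs = lcs_alignment(s1, s2)
    if not pairs:
        return 0.0
    ratios = []
    for i, j in pairs:
        t1, t2 = s1[i][1], s2[j][1]
        ratios.append(min(t1, t2) / max(t1, t2) if max(t1, t2) > 0 else 1.0)
    time_similarity = sum(ratios) / len(ratios)
    frac1 = sum(s1[i][1] for i, _ in pairs) / sum(t for _, t in s1)
    frac2 = sum(s2[j][1] for _, j in pairs) / sum(t for _, t in s2)
    return time_similarity * (frac1 + frac2) / 2


session1 = [("a", 8), ("b", 100), ("d", 8), ("c", 5), ("e", 23), ("a", 5)]
session2 = [("b", 5), ("d", 12), ("f", 1), ("a", 7), ("c", 5)]
alignment = lcs_alignment(session1, session2)
overall = session_similarity(session1, session2)
```

With this tie-break the alignment is [(1, 0), (2, 1), (3, 4)], matching Index1 and Index2 above, and the overall similarity rounds to 0.43.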
Issues : Session Resolution
• Generate coarse resolution paths making use of the concept hierarchy of the website
• Reduces computations; Increases interpretability of results
Original Path                              Concept-level Path
(/authors/ramesh_mahadevan.html,3)         (/authors,3)
(/articles/rm_phattas.html,75)             (/articles,114)
(/articles/rm_desidads.html,39)

(/authors/arun_sampath.html,109)           (/authors,109)
(/philosophy/messages/1951.html,102)       (/philosophy,148)
(/philosophy/messages/1953.html,46)
(/,3)                                      (/,3)
(/philosophy/messages/1954.html,69)        (/philosophy,69)
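The coarsening shown in the two examples can be sketched in Python as follows. The rule used here (take the top-level directory as the concept, map everything else to "/", and merge consecutive visits to the same concept by summing their times) is an assumption inferred from the examples; the function names are hypothetical.

```python
from typing import List, Tuple

Session = List[Tuple[str, float]]  # (path, time spent)


def concept_of(url: str) -> str:
    """Map a page path to its top-level concept (assumed rule, see lead-in)."""
    parts = [p for p in url.split("/") if p]
    return "/" + parts[0] if len(parts) >= 2 else "/"


def to_concept_path(session: Session) -> Session:
    """Replace pages by concepts and merge consecutive visits to the same concept."""
    out: Session = []
    for url, seconds in session:
        concept = concept_of(url)
        if out and out[-1][0] == concept:
            out[-1] = (concept, out[-1][1] + seconds)
        else:
            out.append((concept, seconds))
    return out


path1 = to_concept_path([
    ("/authors/ramesh_mahadevan.html", 3),
    ("/articles/rm_phattas.html", 75),
    ("/articles/rm_desidads.html", 39),
])
path2 = to_concept_path([
    ("/authors/arun_sampath.html", 109),
    ("/philosophy/messages/1951.html", 102),
    ("/philosophy/messages/1953.html", 46),
    ("/", 3),
    ("/philosophy/messages/1954.html", 69),
])
```

Only consecutive visits are merged, so a return to /philosophy after visiting / stays a separate entry, as in the second example.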
Comments
• Results QA-ed by human subjects
– Results for clusters & outliers at both levels were subjectively good
– No good way to find cluster quality analytically
• Clustering algorithms for the two stages
– Stage I : Graph partitioning works well for large sparse graphs, so it is desirable in this stage
– Stage II : Since the space is not high-dimensional, any reasonable clustering algorithm should be adequate
• Cluster space
– Gives a general framework for mapping any non-vector clustering problem to an equivalent vector clustering problem