Mining the Structure of User Activity using Cluster Stability
Jeffrey Heer, Ed H. Chi
Palo Alto Research Center, Inc.
2002.04.13 – SIAM Web Analytics Workshop
Motivation
Want to understand the composition of web user traffic.
– What are users’ information goals?
– Leads to improved site design, content, and performance
Strategy: Content, Usage, and Topology
User Session Clustering
Cluster user sessions into common activities such as product browsing and job seeking.
A number of approaches have been proposed ([Shahabi97], [Fu99], [Banerjee01], and [Heer01])
These require specifying the number of clusters in advance or browsing a large cluster hierarchy.
Can we automatically infer the structure of user activity?
Overview
System Description
– Clustering Method
– Stability Analysis
Case Studies
Discussion
System Description
Use web access logs and web site content to generate a user profile for each site visitor.
– How: Build a multi-featured vector space model of user activity (multi-modal clustering).
Group user profiles into common activities like “product browsing” and “job seeking”.
– How: Apply clustering algorithms to user profiles
System Description
[Figure: processing pipeline with stages Access Logs, Web Crawl, Document Model, User Sessions, User Profiles, Clustered Profiles]
1. Process Access Logs
2. Crawl Web Site
3. Build Document Model
4. Extract User Sessions
5. Build User Profiles
6. Cluster Profiles
Document Model
The web site is crawled, and relevant pages listed in the web logs are retrieved.
Retrieved data is represented as feature vectors:
– Content: TF.IDF weighted keyword vector
– URL: Tokenized and TF.IDF weighted
– Inlinks: Column vectors in topology matrix
– Outlinks: Row vectors in topology matrix
These are concatenated to form a single multi-modal vector Pd for each document.
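The concatenation step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the three-page site, its tokens and links, and the particular TF.IDF variant (raw term frequency times log(N/df)) are all assumptions, and the URL modality is left out for brevity.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """Return one TF.IDF vector per token list, in vocab order (assumed weighting)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n / df[t]) if df[t] else 0.0 for t in vocab])
    return vectors

# Hypothetical 3-page site: page -> content tokens, page -> pages it links to
content = {"A": ["printer", "ink", "printer"], "B": ["jobs", "careers"], "C": ["printer", "jobs"]}
links = {"A": ["B", "C"], "B": ["C"], "C": []}
pages = sorted(content)

vocab = sorted({t for doc in content.values() for t in doc})
content_vecs = tfidf_vectors([content[p] for p in pages], vocab)

# Topology matrix: T[i][j] = 1 if page i links to page j
T = [[1.0 if q in links[p] else 0.0 for q in pages] for p in pages]
outlinks = T                                                  # row vectors
inlinks = [[T[i][j] for i in range(len(pages))] for j in range(len(pages))]  # column vectors

# One multi-modal vector P_d per document (URL modality omitted here)
P = {p: content_vecs[k] + inlinks[k] + outlinks[k] for k, p in enumerate(pages)}
```

Keeping the modalities as contiguous slices of one vector makes it easy to weight each modality separately later on.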
User Sessions
Sessions are extracted from web logs and represented by an attribute vector.
– For path i = A→B→D, s_i = <1,1,0,1,0>
  » (for a site with 5 documents <A,B,C,D,E>)
Experimented with various weightings for s, including viewing times and path position.
Viewing times achieved the highest accuracy in empirical studies.
– A(10s)→B(20s)→D(15s), s_i = <10,20,0,15,0>
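The two encodings in the example can be reproduced with a small helper. The function name `session_vector` and the choice to sum viewing times when a page is revisited are assumptions made for illustration; the slides do not say how revisits are handled.

```python
def session_vector(path, documents, viewing_times=None):
    """Encode a session path as a vector over the site's documents.

    Without viewing times each visited page gets weight 1; with them,
    each page gets its total viewing time in seconds (revisits are summed,
    which is an assumption).
    """
    index = {doc: k for k, doc in enumerate(documents)}
    vec = [0] * len(documents)
    for step, doc in enumerate(path):
        if viewing_times is None:
            vec[index[doc]] = 1
        else:
            vec[index[doc]] += viewing_times[step]
    return vec

documents = ["A", "B", "C", "D", "E"]
binary = session_vector(["A", "B", "D"], documents)               # [1, 1, 0, 1, 0]
timed = session_vector(["A", "B", "D"], documents, [10, 20, 15])  # [10, 20, 0, 15, 0]
```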
User Profiles
User profiles are created by linearly combining the document and session models:

$$UP_i = \sum_{d=1}^{N} s_{id} \, P_d$$
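As a worked example of this linear combination, the sketch below weights each document vector by the corresponding session entry and sums the results. The two-dimensional document vectors are made-up values chosen only to keep the arithmetic easy to follow.

```python
def user_profile(session, doc_vectors):
    """UP_i = sum over documents d of s_id * P_d."""
    dim = len(doc_vectors[0])
    profile = [0.0] * dim
    for weight, p_d in zip(session, doc_vectors):
        for k in range(dim):
            profile[k] += weight * p_d[k]
    return profile

# Hypothetical document vectors P_d for documents A..E
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.0], [0.0, 2.0]]
s_i = [10, 20, 0, 15, 0]                 # viewing-time weighted session
up_i = user_profile(s_i, doc_vectors)    # [17.5, 20.0]
```

Documents the user never visited have weight 0 and so contribute nothing to the profile.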
Clustering
Similarity metric is a weighted cosine measure:

$$d(UP_i, UP_j) = \sum_{m \in Modalities} w_m \cos(UP_i^m, UP_j^m), \qquad \sum_m w_m = 1$$

Clustering is then done by recursive bisection, using K-Means to perform the bisections [Karypis00, Zhao01]. The corresponding criterion function is:

$$I_2 = \sum_{r=1}^{k} \sum_{UP_i \in S_r} d(UP_i, C_r)$$
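A minimal sketch of the weighted cosine measure and the criterion function follows. It assumes each profile is stored as a dictionary from modality name to vector, that S_r denotes the r-th cluster and C_r its centroid, and that the centroid is the per-modality mean of the cluster's members. The modality names and weights are illustrative, not values from the study.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity(up_i, up_j, weights):
    """Weighted sum of per-modality cosines; the weights are expected to sum to 1."""
    return sum(w * cosine(up_i[m], up_j[m]) for m, w in weights.items())

def centroid(cluster):
    """Per-modality mean of the profiles in a cluster (an assumed centroid definition)."""
    return {m: [sum(col) / len(cluster) for col in zip(*(p[m] for p in cluster))]
            for m in cluster[0]}

def criterion_i2(clusters, weights):
    """Sum over clusters of each member's similarity to its cluster centroid."""
    return sum(similarity(p, centroid(c), weights) for c in clusters for p in c)

weights = {"content": 0.75, "url": 0.25}          # hypothetical modality weights
a = {"content": [1.0, 0.0], "url": [1.0, 0.0]}
b = {"content": [1.0, 0.0], "url": [0.0, 1.0]}
sim_ab = similarity(a, b, weights)                # 0.75: content agrees, URL does not
score = criterion_i2([[a, a], [b]], weights)      # 3.0: every member equals its centroid
```

A bisecting procedure would repeatedly split one cluster in two with K-Means and keep the split that gives the highest criterion value.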
User population breakdown
Detailed stats
Keywords describing user groups
Frequent documents accessed by group
Clustering Evaluation
Ran a user study on www.xerox.com to evaluate the effectiveness of the method [Heer02].
– 15 tasks, 5 task categories (104 user traces)
Using certain modalities and weighting schemes, we were able to achieve accuracies as high as 99%!
Found that page content and page viewing time significantly contribute to clustering accuracy.
OK, Great, but…
In real-world applications the number of clusters is an undetermined variable.
Want a method for automatically choosing the number of clusters.
After review of literature, decided to apply a cluster stability technique recently proposed by [BenHur02].
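Measuring Clustering Similarity
For a given clustering of a data set X, define

$$C_{ij} = \begin{cases} 1 & \text{if } x_i, x_j \text{ are in the same cluster and } i \neq j \\ 0 & \text{otherwise} \end{cases}$$

Two clusterings can then be compared using a dot product:

$$\langle C^{(1)}, C^{(2)} \rangle = \sum_{i,j} C^{(1)}_{ij} C^{(2)}_{ij}$$

This dot product can be normalized to get a cosine metric:

$$cor(C^{(1)}, C^{(2)}) = \frac{\langle C^{(1)}, C^{(2)} \rangle}{\sqrt{\langle C^{(1)}, C^{(1)} \rangle \, \langle C^{(2)}, C^{(2)} \rangle}}$$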
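The definitions above translate fairly directly into code. In this sketch a clustering is given as a list of cluster labels, one per data point; the function names and the sample label lists are illustrative.

```python
import math

def pair_matrix(labels):
    """C[i][j] = 1 if points i and j share a cluster and i != j, else 0."""
    n = len(labels)
    return [[1 if i != j and labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

def dot(c1, c2):
    return sum(a * b for r1, r2 in zip(c1, c2) for a, b in zip(r1, r2))

def cor(labels1, labels2):
    """Cosine similarity between the co-membership matrices of two clusterings."""
    c1, c2 = pair_matrix(labels1), pair_matrix(labels2)
    denom = math.sqrt(dot(c1, c1) * dot(c2, c2))
    return dot(c1, c2) / denom if denom else 0.0

same = cor([0, 0, 1, 1], [1, 1, 0, 0])      # 1.0: identical up to label names
partial = cor([0, 0, 1, 1], [0, 0, 0, 1])   # 2 / sqrt(4 * 6), about 0.41
```

Because the measure depends only on which pairs of points are grouped together, it is unaffected by how the clusters happen to be numbered.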
Cluster Stability
for k = 2 to kmax
– for i = 1 to n
  » S_i = subsample of data set X using sampling ratio f
  » C_i = cluster(S_i, k)
– Perform pairwise comparisons of all C_i, generating a distribution of similarity values for the current k
Analyze the resulting distributions to determine the most stable clusterings.
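The loop above can be sketched end to end in a few dozen lines. Several pieces here are assumptions made for illustration: a toy one-dimensional K-Means stands in for the actual clustering method, similarity between two clusterings is computed only over the points that both subsamples share, and the data is a synthetic mixture of two well-separated groups.

```python
import math
import random
from itertools import combinations

def kmeans_1d(points, k, rng, iters=20):
    """Toy 1-D K-Means used as a stand-in clusterer; returns {point_id: label}."""
    ids = list(points)
    centers = [points[i] for i in rng.sample(ids, k)]
    labels = {}
    for _ in range(iters):
        labels = {i: min(range(k), key=lambda c: abs(points[i] - centers[c])) for i in ids}
        for c in range(k):
            members = [points[i] for i in ids if labels[i] == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def cor(l1, l2):
    """Cosine similarity of co-membership indicators over points present in both clusterings."""
    shared = sorted(set(l1) & set(l2))
    d12 = d11 = d22 = 0
    for i, j in combinations(shared, 2):
        a = l1[i] == l1[j]
        b = l2[i] == l2[j]
        d12 += a and b
        d11 += a
        d22 += b
    return d12 / math.sqrt(d11 * d22) if d11 and d22 else 0.0

def stability(points, k_max, n, f, seed=0):
    """For each k, cluster n subsamples and collect all pairwise similarities."""
    rng = random.Random(seed)
    ids = list(points)
    result = {}
    for k in range(2, k_max + 1):
        clusterings = []
        for _ in range(n):
            sample = rng.sample(ids, int(f * len(ids)))
            clusterings.append(kmeans_1d({i: points[i] for i in sample}, k, rng))
        result[k] = sorted(cor(a, b) for a, b in combinations(clusterings, 2))
    return result

# Hypothetical data: two well-separated 1-D groups of 30 points each
data_rng = random.Random(1)
data = {i: data_rng.gauss(0, 0.5) for i in range(30)}
data.update({i + 30: data_rng.gauss(10, 0.5) for i in range(30)})
dist = stability(data, k_max=4, n=10, f=0.8)
```

With two clearly separated groups, the similarity values for k = 2 should sit at or very near 1, while larger k values split a group arbitrarily and spread the distribution out.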
Example
Stability Analysis
Example using 4 Gaussians [BenHur02]
Graph on right shows plot of the cumulative similarity distribution
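To make the cumulative plot concrete, the sketch below computes the fraction of similarity values at or below each point on a grid. The two lists of similarity values are invented purely to contrast a stable and an unstable choice of k.

```python
from bisect import bisect_right

def cumulative(similarities, grid):
    """Fraction of similarity values <= x, for each x in grid."""
    ordered = sorted(similarities)
    return [bisect_right(ordered, x) / len(ordered) for x in grid]

stable = [1.0, 1.0, 0.98, 1.0]       # hypothetical values for a stable k
unstable = [0.55, 0.7, 0.9, 0.62]    # hypothetical values for an unstable k
grid = [0.6, 0.8, 0.95]

stable_curve = cumulative(stable, grid)      # [0.0, 0.0, 0.0]
unstable_curve = cumulative(unstable, grid)  # [0.25, 0.75, 1.0]
```

A stable k shows up as a curve that stays near 0 until the similarity approaches 1, whereas an unstable k rises early.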
Case Study 1 – www.xerox.com
[Figure: cumulative distribution of clustering similarity; x-axis: similarity (0.5 to 1), y-axis: cumulative (0 to 1)]
User Study 8/2001; 104 sessions
n = 15, f = 0.8, k = 2 to 10
Case Study 2 – guir.berkeley.edu
[Figure: cumulative distribution of clustering similarity; x-axis: similarity (0.5 to 1), y-axis: cumulative (0 to 1)]
Nov. 1-16, 2001; 7700 sessions
n = 30, f = 0.8, k = 2 to 15
Case Study 2 – guir.berkeley.edu
[Figure: cumulative distribution of clustering similarity; x-axis 0.5 to 1, y-axis 0 to 1]
n = 30, f = 0.8, k = 3 to 7
Cluster Contents (guir, k=5)
Cluster 1: DENIM Web Design Tool
Cluster 2: Research projects & publications
Cluster 3: Quiz-Bowl Competition Site
Cluster 4: CSCW (1 project + 1 course)
Cluster 5: Random pubs + project JavaDoc
At higher values of k, more concentrated clusters appear:
– Personal pages (faculty, students) cluster emerges
– JavaDoc separates into its own cluster
Discussion
Stability method shows some utility, but results are far from conclusive… perhaps web data is not particularly structured?
User Goals
– Does the user have a specific goal?
Web Site Structure
– Does the web site support user goals?
Task Structure
– Level of generality
Possible Cases
User has task - Site supports task
– www.xerox.com study
User has task - Site doesn’t support it
User w/o singular goals - Well designed site
– Possibly guir.berkeley.edu
User w/o task - Poorly designed site
The Future…
More actionable empirical data
– Need more users over a range of sites
– Larger user study already begun
Alternative approaches
– Human supervision
– Augmented stability metric / criterion function
– Other clustering methods
  » Fuzzy Clustering
Questions?
Suggestions?