
Page 1: Mining the Structure of User Activity using Cluster Stability Jeffrey Heer, Ed H. Chi Palo Alto Research Center, Inc. 2002.04.13 – SIAM Web Analytics Workshop.

Mining the Structure of User Activity using Cluster Stability

Jeffrey Heer, Ed H. Chi
Palo Alto Research Center, Inc.

2002.04.13 – SIAM Web Analytics Workshop

Page 2:

Motivation

Want to understand the composition of web user traffic.
– What are users’ information goals?
– Leads to improved site design, content, and performance

Strategy: Content, Usage, and Topology

Page 3:

User Session Clustering

Cluster user sessions into common activities such as product browsing and job seeking.

A number of approaches have been proposed ([Shahabi97], [Fu99], [Banerjee01], and [Heer01])

These require specifying the number of clusters in advance or browsing a large cluster hierarchy.

Can we automatically infer the structure of user activity?

Page 4:

Overview

System Description
– Clustering Method
– Stability Analysis

Case Studies

Discussion

Page 5:

System Description

Use web access logs and web site content to generate a user profile for each site visitor.
– How: Build a multi-featured vector space model of user activity (multi-modal clustering).

Group user profiles into common activities like “product browsing” and “job seeking”.
– How: Apply clustering algorithms to user profiles

Page 6:

System Description

[Figure: pipeline diagram with stages Web Crawl, Access Logs, Document Model, User Sessions, User Profiles, Clustered Profiles]

1. Process Access Logs

2. Crawl Web Site

3. Build Document Model

4. Extract User Sessions

5. Build User Profiles

6. Cluster Profiles

Page 7:

Document Model

The web site is crawled, and the relevant pages listed in the web logs are retrieved.

Retrieved data is represented as feature vectors:

Content: TF.IDF weighted keyword vector

URL: Tokenized and TF.IDF weighted

Inlinks: Column vectors in topology matrix

Outlinks: Row vectors in topology matrix

These are concatenated to form a single multi-modal vector $P_d$ for each document.
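As a rough illustration of the document model described above, the sketch below builds the four modalities and concatenates them. The function names (`tfidf`, `document_model`), the whitespace and regex tokenizers, and the specific TF.IDF variant (raw term count times log(N/df)) are assumptions made for the example; the slides do not specify them.

```python
import math
import re
from collections import Counter

def tfidf(token_lists):
    """Return one TF.IDF vector per token list over a shared, sorted vocabulary.

    Assumed weighting: raw term frequency * log(N / document frequency).
    """
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)  # missing terms count as 0
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vectors

def document_model(pages, links):
    """Build one concatenated multi-modal vector P_d per document.

    pages: list of (url, text) pairs.
    links: topology matrix, links[i][j] == 1 if page i links to page j.
    """
    content = tfidf([text.lower().split() for _, text in pages])
    urls = tfidf([re.findall(r"[a-z0-9]+", url.lower()) for url, _ in pages])
    n = len(pages)
    model = []
    for d in range(n):
        inlinks = [links[i][d] for i in range(n)]  # column d of the topology matrix
        outlinks = list(links[d])                  # row d of the topology matrix
        model.append(content[d] + urls[d] + inlinks + outlinks)
    return model
```

Keeping the modalities in a fixed order inside the concatenated vector makes it possible to slice them apart again later, for example when each modality is weighted separately.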


Page 8:

User Sessions

Sessions are extracted from web logs and represented by an attribute vector.
– For path i = A→B→D, $s_i$ = <1,1,0,1,0>
» (For a site with 5 documents <A,B,C,D,E>)

Experimented with various weightings for $s$, including viewing times and path position.

Viewing times achieved the highest accuracy in empirical studies.
– A (10s) → B (20s) → D (15s), $s_i$ = <10,20,0,15,0>
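The two session encodings above can be reproduced with a few lines of Python. This is a minimal sketch: `session_vector` is a hypothetical helper name, and summing the viewing times of repeated visits to the same page is an assumption, since the slides do not say how revisits are handled.

```python
def session_vector(path, documents, viewing_times=None):
    """Encode a session path as an attribute vector over the site's documents.

    Without viewing_times the vector is binary (1 if the page was visited).
    With viewing_times, each visited page gets its viewing time in seconds;
    repeated visits are summed (an assumption for this sketch).
    """
    index = {doc: pos for pos, doc in enumerate(documents)}
    vector = [0] * len(documents)
    for step, doc in enumerate(path):
        if viewing_times is None:
            vector[index[doc]] = 1
        else:
            vector[index[doc]] += viewing_times[step]
    return vector

documents = ["A", "B", "C", "D", "E"]
print(session_vector(["A", "B", "D"], documents))                # [1, 1, 0, 1, 0]
print(session_vector(["A", "B", "D"], documents, [10, 20, 15]))  # [10, 20, 0, 15, 0]
```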


Page 9:

User Profiles

User profiles are created by linearly combining the document and session models:

$$UP_i = \sum_{d=1}^{N} s_{id} P_d$$
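Read as code, the linear combination is a weighted sum of document vectors, with the session vector supplying the weights. The sketch below assumes plain Python lists for both the session vector and the document vectors; `user_profile` is a hypothetical name.

```python
def user_profile(session, doc_vectors):
    """Combine a session vector with document vectors into a user profile.

    session: weights s_id, one per document (e.g. viewing times).
    doc_vectors: one multi-modal vector P_d per document, all of equal length.
    Returns the sum over d of s_id * P_d.
    """
    dim = len(doc_vectors[0])
    profile = [0.0] * dim
    for weight, doc_vec in zip(session, doc_vectors):
        for j in range(dim):
            profile[j] += weight * doc_vec[j]
    return profile
```

For example, a session with weights `[10, 20, 0]` over three documents with vectors `[1, 0]`, `[0, 1]`, and `[1, 1]` yields the profile `[10.0, 20.0]`; the unvisited third document contributes nothing.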


Page 10:

Clustering

Similarity metric is a weighted cosine measure:

$$d(UP_i, UP_j) = \sum_{m \in \text{Modalities}} w_m \cos(UP_i^m, UP_j^m), \qquad \sum_{m} w_m = 1$$

Clustering is then done by recursive bisection, using K-Means to perform the bisections [Karypis00, Zhao01]. The corresponding criterion function is:

$$I_2 = \sum_{r=1}^{k} \sum_{UP_i \in S_r} d(UP_i, C_r)$$
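The similarity measure and the criterion function can be sketched as follows. Several details here are assumptions for illustration only: profiles are stored as dictionaries mapping a modality name to its vector, $S_r$ is taken to be the set of profiles in cluster $r$, and $C_r$ is taken to be the per-modality mean of that cluster. The bisecting K-Means search itself is not shown.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity(up_i, up_j, weights):
    """Weighted cosine measure d(UP_i, UP_j); weights are expected to sum to 1."""
    return sum(w * cosine(up_i[m], up_j[m]) for m, w in weights.items())

def centroid(cluster):
    """Per-modality mean of the profiles in a cluster (assumed form of C_r)."""
    size = len(cluster)
    return {
        m: [sum(column) / size for column in zip(*(p[m] for p in cluster))]
        for m in cluster[0]
    }

def criterion_i2(clusters, weights):
    """Sum, over all clusters, of each member's similarity to its centroid."""
    total = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        total += sum(similarity(p, center, weights) for p in cluster)
    return total
```

Higher values of this sum indicate tighter clusters, so a bisection step would compare candidate splits by the value they produce.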


Page 11:

User population breakdown

Detailed stats

Keywords describing user groups

Frequent documents accessed by group

Page 12:

Clustering Evaluation

Ran user study on www.xerox.com to evaluate effectiveness of method [Heer02].

15 tasks, 5 task categories (104 user traces)

Using certain modalities and weighting schemes, we were able to achieve accuracies as high as 99%!

Found that page content and page viewing time significantly contribute to clustering accuracy.

Page 13:

OK, Great, but…

In real-world applications the number of clusters is an undetermined variable.

Want a method for automatically choosing the number of clusters.

After review of literature, decided to apply a cluster stability technique recently proposed by [BenHur02].

Page 14:

Measuring Clustering Similarity

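For a given clustering of a data set X, define

$$C_{ij} = \begin{cases} 1 & \text{if } x_i, x_j \text{ are in the same cluster and } i \neq j \\ 0 & \text{otherwise} \end{cases}$$

Two clusterings can then be compared using a dot product:

$$\langle C^{(1)}, C^{(2)} \rangle = \sum_{i,j} C^{(1)}_{ij} C^{(2)}_{ij}$$

This dot product can be normalized to get a cosine metric:

$$\mathrm{cor}(C^{(1)}, C^{(2)}) = \frac{\langle C^{(1)}, C^{(2)} \rangle}{\sqrt{\langle C^{(1)}, C^{(1)} \rangle \langle C^{(2)}, C^{(2)} \rangle}}$$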
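The definitions above translate fairly directly into code. In this sketch a clustering is given as a list of cluster labels, one per data point, which is an assumed representation; `dot` and `cor` are hypothetical function names. The co-membership matrix is never built explicitly, since only the count of pairs that are co-clustered in both clusterings is needed.

```python
import math
from itertools import combinations

def dot(labels_1, labels_2):
    """Dot product of the two co-membership matrices.

    Counts ordered pairs (i, j), i != j, that share a cluster in both
    labelings. The matrices are symmetric, so each unordered pair counts twice.
    """
    shared = sum(
        1
        for i, j in combinations(range(len(labels_1)), 2)
        if labels_1[i] == labels_1[j] and labels_2[i] == labels_2[j]
    )
    return 2 * shared

def cor(labels_1, labels_2):
    """Cosine-normalized similarity between two clusterings of the same points."""
    denom = math.sqrt(dot(labels_1, labels_1) * dot(labels_2, labels_2))
    return dot(labels_1, labels_2) / denom if denom else 0.0
```

Because only co-membership matters, the measure does not depend on how clusters are named: `cor([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0, while `cor([0, 0, 1, 1], [0, 1, 0, 1])` is 0.0.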

Page 15:

Cluster Stability

for k = 2 to kmax
– for i = 1 to n
» $S_i$ = subsample of data set X using sampling ratio f
» $C_i$ = cluster($S_i$, k)
– Perform pairwise comparisons of all $C_i$, generating a distribution of similarity values for the current k

Analyze the resulting distributions to determine the most stable clusterings.
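The loop above might be implemented along the following lines. This is a sketch under stated assumptions, not the authors' implementation: `stability_distributions` is a hypothetical name, the clustering routine is passed in as a callable `cluster(sample, k)` that returns one label per sample point, and two clusterings are compared only on the points that appear in both subsamples (the slides do not say how points missing from one subsample are treated).

```python
import math
import random
from itertools import combinations

def cor(labels_1, labels_2):
    """Cosine similarity of two labelings, based on co-clustered pairs."""
    def dot(a, b):
        return sum(1 for i, j in combinations(range(len(a)), 2)
                   if a[i] == a[j] and b[i] == b[j])
    denom = math.sqrt(dot(labels_1, labels_1) * dot(labels_2, labels_2))
    return dot(labels_1, labels_2) / denom if denom else 0.0

def stability_distributions(data, cluster, k_max, n, f, seed=0):
    """Return {k: list of pairwise similarity values} for k = 2..k_max."""
    rng = random.Random(seed)
    size = int(f * len(data))
    distributions = {}
    for k in range(2, k_max + 1):
        runs = []
        for _ in range(n):
            idx = sorted(rng.sample(range(len(data)), size))  # subsample S_i
            labels = cluster([data[i] for i in idx], k)       # clustering C_i
            runs.append(dict(zip(idx, labels)))               # point index -> label
        sims = []
        for run_a, run_b in combinations(runs, 2):
            common = sorted(run_a.keys() & run_b.keys())      # shared points only
            sims.append(cor([run_a[i] for i in common],
                            [run_b[i] for i in common]))
        distributions[k] = sims
    return distributions
```

With n subsamples per value of k, each distribution contains n(n-1)/2 similarity values, so n = 15 gives 105 values and n = 30 gives 435.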

Page 16:

Example

Stability Analysis

Example using 4 Gaussians [BenHur02]

Graph on right shows plot of the cumulative similarity distribution
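A cumulative distribution of this kind can be computed from a list of pairwise similarity values as an empirical distribution function. The sketch below is illustrative; `cumulative_distribution` is a hypothetical name, and the plotting step is left out.

```python
def cumulative_distribution(similarities):
    """Return (similarity, cumulative fraction) points of the empirical CDF."""
    ordered = sorted(similarities)
    n = len(ordered)
    return [(s, (rank + 1) / n) for rank, s in enumerate(ordered)]

print(cumulative_distribution([0.9, 0.5, 1.0, 0.7]))
# [(0.5, 0.25), (0.7, 0.5), (0.9, 0.75), (1.0, 1.0)]
```

Under this reading, a value of k whose similarities are concentrated near 1 produces a curve that stays low until the right edge of the plot, whereas a less stable k rises earlier.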

Page 17:

Case Study 1 – www.xerox.com

[Figure: cumulative distribution of clustering similarity; x-axis: similarity (0.5 to 1), y-axis: cumulative (0 to 1)]

User Study 8/2001; 104 sessions
n = 15, f = 0.8, k = 2 to 10

Page 18:

Case Study 2 – guir.berkeley.edu

[Figure: cumulative distribution of clustering similarity; x-axis: similarity (0.5 to 1), y-axis: cumulative (0 to 1)]

Nov. 1-16, 2001; 7700 sessions
n = 30, f = 0.8, k = 2 to 15

Page 19:

Case Study 2 – guir.berkeley.edu

[Figure: plot with horizontal axis from 0.5 to 1 and vertical axis from 0 to 1]

n = 30, f = 0.8, k = 3 to 7

Page 20:

Cluster Contents (guir, k=5)

Cluster 1: DENIM Web Design Tool

Cluster 2: Research projects & publications

Cluster 3: Quiz-Bowl Competition Site

Cluster 4: CSCW (1 project + 1 course)

Cluster 5: Random pubs + project JavaDoc

At higher values of k, more concentrated clusters appear
– Personal pages (faculty, students) cluster emerges
– JavaDoc separates into its own cluster

Page 21:

Discussion

Stability method shows some utility, but results are far from conclusive… perhaps web data is not particularly structured?

User Goals
– Does the user have a specific goal?

Web Site Structure
– Does the web site support user goals?

Task Structure
– Level of generality

Page 22:

Possible Cases

User has task - Site supports task
– www.xerox.com study

User has task - Site doesn’t support it

User w/o singular goals - Well designed site
– Possibly guir.berkeley.edu

User w/o task - Poorly designed site

Page 23:

The Future…

More actionable empirical data
– Need more users over a range of sites
– Larger user study already begun

Alternative approaches
– Human supervision
– Augmented stability metric / criterion function
– Other clustering methods
» Fuzzy Clustering

Page 24:

Questions?

Suggestions?