Mining the Structure of User Activity using Cluster Stability Jeffrey Heer, Ed H. Chi Palo Alto...

Mining the Structure of User Activity using Cluster Stability

Jeffrey Heer, Ed H. Chi

Palo Alto Research Center, Inc.

2002.04.13 – SIAM Web Analytics Workshop

Motivation

Want to understand the composition of web user traffic.
– What are users’ information goals?
– Leads to improved site design, content, and performance

Strategy: Content, Usage, and Topology

User Session Clustering

Cluster user sessions into common activities such as product browsing and job seeking.

A number of approaches have been proposed ([Shahabi97], [Fu99], [Banerjee01], and [Heer01])

These require specifying the number of clusters in advance or browsing a large cluster hierarchy.

Can we automatically infer the structure of user activity?

Overview

System Description
– Clustering Method
– Stability Analysis

Case Studies

Discussion

System Description

Use web access logs and web site content to generate a user profile for each site visitor.
– How: Build a multi-featured vector space model of user activity (multi-modal clustering).

Group user profiles into common activities like “product browsing” and “job seeking”.
– How: Apply clustering algorithms to user profiles

System Description

[Diagram: Web Crawl, Access Logs, Document Model, User Sessions, User Profiles, Clustered Profiles]

1. Process Access Logs

2. Crawl Web Site

3. Build Document Model

4. Extract User Sessions

5. Build User Profiles

6. Cluster Profiles

Document Model

The web site is crawled, and the relevant pages listed in the web logs are retrieved.

Retrieved data is represented as feature vectors:

Content: TF.IDF weighted keyword vector

URL: Tokenized and TF.IDF weighted

Inlinks: Column vectors in topology matrix

Outlinks: Row vectors in topology matrix

These are concatenated to form a single multi-modal vector P_d for each document.
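The four modalities above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `tokenize`, `tfidf`, and `document_model` helpers, the `tf * log(N / df)` weighting variant, the binary link weights, and the toy `pages` and `links` data are all hypothetical and not taken from the original system. Concatenation is modeled by keying one sparse vector on (modality, feature) pairs.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf(token_lists):
    """One {term: tf * log(N / df)} dict per document (one common TF.IDF variant)."""
    n_docs = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]

def document_model(pages, links):
    """pages: {url: page text}; links: set of (source_url, target_url) pairs.

    Returns {url: P_d}, where P_d is a sparse vector keyed by (modality, feature).
    """
    urls = sorted(pages)
    content = tfidf([tokenize(pages[u]) for u in urls])
    url_terms = tfidf([tokenize(u) for u in urls])
    model = {}
    for k, u in enumerate(urls):
        p_d = {("content", t): w for t, w in content[k].items()}
        p_d.update({("url", t): w for t, w in url_terms[k].items()})
        # Column of the topology matrix: pages that link to u.
        p_d.update({("inlinks", src): 1.0 for src, dst in links if dst == u})
        # Row of the topology matrix: pages that u links to.
        p_d.update({("outlinks", dst): 1.0 for src, dst in links if src == u})
        model[u] = p_d
    return model

# Toy site (hypothetical data, for illustration only).
pages = {
    "/a": "printers and copiers",
    "/b": "printers for jobs",
    "/c": "jobs and careers",
}
links = {("/a", "/b"), ("/b", "/c")}
P = document_model(pages, links)
```

Keeping the modality name in each key makes it easy to weight modalities separately later, which a flat concatenation would hide.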



User Sessions

Sessions are extracted from web logs and represented by an attribute vector.
– For path i = A → B → D, s_i = <1,1,0,1,0>
» (For a site with 5 documents <A,B,C,D,E>)

Experimented with various weightings for s, including viewing-times and path position.

Viewing times achieved the highest accuracy in empirical studies.
– A (10s) → B (20s) → D (15s), s_i = <10,20,0,15,0>
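The two session encodings above can be reproduced with a small helper. This is a minimal sketch: the function name `session_vector` and the choice to accumulate viewing time when a page is revisited are assumptions, not details from the slides.

```python
def session_vector(path, documents, view_times=None):
    """Attribute vector for one session over a fixed document ordering.

    Without view_times the vector is binary (page visited or not);
    with view_times each visited page is weighted by seconds viewed.
    """
    index = {doc: pos for pos, doc in enumerate(documents)}
    vector = [0] * len(documents)
    for step, doc in enumerate(path):
        if view_times is None:
            vector[index[doc]] = 1
        else:
            # Assumption: time on a revisited page accumulates.
            vector[index[doc]] += view_times[step]
    return vector

documents = ["A", "B", "C", "D", "E"]
binary = session_vector(["A", "B", "D"], documents)
timed = session_vector(["A", "B", "D"], documents, view_times=[10, 20, 15])
```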



User Profiles

User profiles are created by linearly combining the document and session models:

$$UP_i = \sum_{d=1}^{N} s_{id} P_d$$
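As a worked instance of this linear combination, the sketch below sums the document vectors of one session, each scaled by its session weight. The dense list representation, the function name `user_profile`, and the toy three-feature document vectors are assumptions made for illustration.

```python
def user_profile(session, doc_vectors):
    """UP_i = sum over documents d of s_id * P_d (dense vectors)."""
    dim = len(doc_vectors[0])
    profile = [0.0] * dim
    for s_id, p_d in zip(session, doc_vectors):
        for j in range(dim):
            profile[j] += s_id * p_d[j]
    return profile

# Hypothetical document vectors for documents A..E over three features.
P = [
    [1.0, 0.0, 0.0],  # A
    [0.0, 1.0, 0.0],  # B
    [0.0, 0.0, 1.0],  # C
    [1.0, 1.0, 0.0],  # D
    [0.0, 1.0, 1.0],  # E
]
s_i = [10, 20, 0, 15, 0]  # viewing-time weighted session A, B, D
UP_i = user_profile(s_i, P)
```

Here the profile is 10·P_A + 20·P_B + 15·P_D, so pages viewed longer pull the profile more strongly toward their features.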



Clustering

Similarity metric is a weighted cosine measure:

$$d(UP_i, UP_j) = \sum_{m \in \mathrm{Modalities}} w_m \cos(UP_i^m, UP_j^m), \qquad \sum_m w_m = 1$$

Clustering is then done by recursive bisection, using K-Means to perform the bisections [Karypis00, Zhao01]. The corresponding criterion function is:

$$I_2 = \sum_{r=1}^{k} \sum_{UP_i \in S_r} d(UP_i, C_r)$$

where S_r is the r-th cluster and C_r is its centroid.
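The similarity measure and the criterion function can be evaluated directly, as in the sketch below. It assumes each profile is stored as a dict of per-modality dense vectors and that the centroid is the per-modality mean. The helper names and the toy profiles are hypothetical. The recursive bisection search itself is not shown, only the scoring of a given partition.

```python
import math

def cosine(u, v):
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def weighted_cosine(up_i, up_j, weights):
    """d(UP_i, UP_j) = sum over modalities m of w_m * cos(UP_i^m, UP_j^m)."""
    return sum(w * cosine(up_i[m], up_j[m]) for m, w in weights.items())

def centroid(cluster):
    """Per-modality mean of the profiles in one cluster (an assumption)."""
    n = len(cluster)
    return {
        m: [sum(column) / n for column in zip(*(p[m] for p in cluster))]
        for m in cluster[0]
    }

def criterion_i2(clusters, weights):
    """Sum of each profile's similarity to the centroid of its own cluster."""
    total = 0.0
    for cluster in clusters:
        c_r = centroid(cluster)
        total += sum(weighted_cosine(p, c_r, weights) for p in cluster)
    return total

weights = {"content": 0.5, "url": 0.5}  # modality weights sum to 1
a = {"content": [1.0, 0.0], "url": [1.0, 0.0]}
b = {"content": [0.0, 1.0], "url": [1.0, 0.0]}

similarity_ab = weighted_cosine(a, b, weights)
split_score = criterion_i2([[a], [b]], weights)
merged_score = criterion_i2([[a, b]], weights)
```

A bisecting procedure would try candidate two-way splits of a cluster and keep the one with the highest criterion value. In this toy case the split scores higher than the merged cluster.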



User population breakdown

Detailed stats

Keywords describing user groups

Frequent documents accessed by group

Clustering Evaluation

Ran user study on www.xerox.com to evaluate effectiveness of method [Heer02].

15 tasks, 5 task categories (104 user traces)

Using certain modalities and weighting schemes we were able to achieve accuracies as high as 99%!

Found that page content and page viewing time significantly contribute to clustering accuracy.

OK, Great, but…

In real-world applications the number of clusters is an undetermined variable.

Want a method for automatically choosing the number of clusters.

After review of literature, decided to apply a cluster stability technique recently proposed by [BenHur02].

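Measuring Clustering Similarity

For a given clustering of a data set X, define

$$C_{ij} = \begin{cases} 1 & \text{if } x_i, x_j \text{ are in the same cluster and } i \neq j \\ 0 & \text{otherwise} \end{cases}$$

Two clusterings can then be compared using a dot product:

$$\langle C^{(1)}, C^{(2)} \rangle = \sum_{i,j} C^{(1)}_{ij} C^{(2)}_{ij}$$

This dot product can be normalized to get a cosine metric:

$$\mathrm{cor}(C^{(1)}, C^{(2)}) = \frac{\langle C^{(1)}, C^{(2)} \rangle}{\sqrt{\langle C^{(1)}, C^{(1)} \rangle \, \langle C^{(2)}, C^{(2)} \rangle}}$$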
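The cosine metric between two clusterings can be computed from co-membership pairs without building the full matrices. The following is a minimal sketch with hypothetical function names. It counts unordered pairs, which gives the same ratio as the sum over ordered pairs because the factor of two cancels.

```python
import math
from itertools import combinations

def comembership(labels):
    """Unordered pairs (i, j), i != j, that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }

def cor(labels_1, labels_2):
    """Normalized dot product of the two co-membership matrices."""
    c1, c2 = comembership(labels_1), comembership(labels_2)
    if not c1 or not c2:
        return 0.0  # a clustering of singletons has a zero matrix
    return len(c1 & c2) / math.sqrt(len(c1) * len(c2))

same_up_to_renaming = cor([0, 0, 1, 1], [1, 1, 0, 0])
fully_crossed = cor([0, 0, 1, 1], [0, 1, 0, 1])
partial = cor([0, 0, 1, 1], [0, 0, 0, 1])
```

Because only co-membership matters, the measure is unaffected by how the cluster labels are numbered.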

Cluster Stability

for k = 2 to kmax
– for i = 1 to n
» S_i = subsample of data set X using sampling ratio f
» C_i = cluster(S_i, k)
– Perform pairwise comparisons of all C_i (each pair compared on the points the two subsamples share), generating a distribution of similarity values for the current k

Analyze the resulting distributions to determine the most stable clusterings.
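The loop above might be sketched as follows. The clusterer here, `gap_cluster`, is a hypothetical stand-in that cuts sorted 1-D data at its widest gaps. It is used only so the example is deterministic and self-contained, and it is not the recursive bisection method from the talk. The function names, the toy data, and the fixed random seed are likewise assumptions.

```python
import math
import random
from itertools import combinations

def cor(labels_a, labels_b, shared):
    """Cosine similarity of two clusterings, restricted to shared point ids."""
    pairs = list(combinations(sorted(shared), 2))
    a = {p for p in pairs if labels_a[p[0]] == labels_a[p[1]]}
    b = {p for p in pairs if labels_b[p[0]] == labels_b[p[1]]}
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def gap_cluster(points, k):
    """Stand-in clusterer for 1-D data: cut the sorted values at the k-1 widest gaps.

    points: {point_id: value}; returns {point_id: cluster_label}.
    """
    ids = sorted(points, key=points.get)
    gaps = sorted(
        range(1, len(ids)),
        key=lambda g: points[ids[g]] - points[ids[g - 1]],
        reverse=True,
    )
    cuts = set(gaps[: k - 1])
    labels, label = {}, 0
    for pos, pid in enumerate(ids):
        if pos in cuts:
            label += 1
        labels[pid] = label
    return labels

def stability(data, cluster, k_max, n, f, seed=0):
    """For each k, return the sorted pairwise similarities of n subsampled runs."""
    rng = random.Random(seed)
    ids = list(data)
    size = int(f * len(ids))
    result = {}
    for k in range(2, k_max + 1):
        runs = []
        for _ in range(n):
            sample = rng.sample(ids, size)
            runs.append(cluster({i: data[i] for i in sample}, k))
        result[k] = sorted(
            cor(a, b, a.keys() & b.keys()) for a, b in combinations(runs, 2)
        )
    return result

# Two well-separated groups of ten points each (toy data).
data = {i: i * 0.1 for i in range(10)}
data.update({10 + i: 10.0 + i * 0.1 for i in range(10)})
dist = stability(data, gap_cluster, k_max=3, n=5, f=0.8)
```

Each sorted list in `dist` is the raw material for a cumulative distribution plot. For this toy data every pair of runs agrees at k = 2, which is the signature of a stable clustering.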

Example

Stability Analysis

Example using 4 Gaussians [BenHur02]

Graph on right shows plot of the cumulative similarity distribution

Case Study 1 – www.xerox.com

[Figure: cumulative similarity distributions; x-axis: similarity (0.5 to 1), y-axis: cumulative (0 to 1)]

User Study 8/2001; 104 sessions

n = 15, f = 0.8, k = 2 to 10

http://www.xerox.com/go/xrx/template/013.jsp?Xcntry=USA&Xlang=en_US&Xseg=corp

Case Study 2 – guir.berkeley.edu

[Figure: cumulative similarity distributions; x-axis: similarity (0.5 to 1), y-axis: cumulative (0 to 1)]

Nov. 1-16, 2001; 7700 sessions

n = 30, f = 0.8, k = 2 to 15

Case Study 2 – guir.berkeley.edu

[Figure: plot for a narrower range of k; x-axis ticks 0.5 to 1, y-axis ticks 0 to 1]

n = 30, f = 0.8, k = 3 to 7

Cluster Contents (guir, k=5)

Cluster 1: DENIM Web Design Tool

Cluster 2: Research projects & publications

Cluster 3: Quiz-Bowl Competition Site

Cluster 4: CSCW (1 project + 1 course)

Cluster 5: Random pubs + project JavaDoc

At higher values of k, more concentrated clusters appear
– Personal pages (faculty, students) cluster emerges
– JavaDoc separates into its own cluster

Discussion

Stability method shows some utility, but results are far from conclusive… perhaps web data is not particularly structured?

User Goals
– Does the user have a specific goal?

Web Site Structure
– Does the web site support user goals?

Task Structure
– Level of generality

Possible Cases

User has task - Site supports task
– www.xerox.com study

User has task - Site doesn’t support it

User w/o singular goals - Well designed site
– Possibly guir.berkeley.edu

User w/o task - Poorly designed site

The Future…

More actionable empirical data
– Need more users over a range of sites
– Larger user study already begun

Alternative approaches
– Human supervision
– Augmented stability metric / criterion function
– Other clustering methods
» Fuzzy Clustering

Questions?

Suggestions?
