
Nov, 2002 Banerjee and Ghosh 1

Characterizing Visitors to a Website Across Multiple Sessions

NGDM Workshop, Nov 2002

Arindam Banerjee, Joydeep Ghosh

Motivation

Why Characterize or Predict web user behavior?

• Site-centric view: Personalization, sticky websites

• User-centric view: personal agents for information acquisition

• Universalist approaches: PageRank, web metrics, …

Clustering Users from Web Logs

• Wide variety of web behavior: segment users based on surfing behavior as a first step to further analysis.

• User: set of sessions

• Session: sequence of (page ID, time spent on that page) tuples

– How to cluster sets of sequences?

The Approach

• Cluster Sessions

– Session Similarity Measure

– Session Similarity Graph

• Outlier Detection

– Graph Partitioning

• Create a Cluster Space

• Cluster users in this Space

A Similarity Measure for Sessions

1. Overlap between two sessions represented by the longest common subsequence (LCS)

2. Obtain session similarity using LCS and time information

session similarity = (time similarity in LCS) x (importance of LCS)

• The similarity component :

– Average min-max similarity for each page in the LCS

• The importance component :

– Average of the fraction of overall session time spent in the LCS
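
Written out as a formula, the measure on this slide could look as follows. This is a sketch of one consistent reading, not notation taken from the slides: L is the set of LCS positions, t_1(p) and t_2(p) are the times the two sessions spent on LCS page p, and T_1, T_2 are the total session times. All of these symbols are assumptions introduced here.

```latex
\[
\mathrm{sim}(s_1, s_2) =
\underbrace{\frac{1}{|L|} \sum_{p \in L}
  \frac{\min\bigl(t_1(p),\, t_2(p)\bigr)}{\max\bigl(t_1(p),\, t_2(p)\bigr)}}_{\text{time similarity in LCS}}
\;\times\;
\underbrace{\frac{1}{2} \left(
  \frac{\sum_{p \in L} t_1(p)}{T_1} + \frac{\sum_{p \in L} t_2(p)}{T_2}
\right)}_{\text{importance of LCS}}
\]
```

Both factors lie in [0, 1] when all times are positive, so the product does too.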

Session Clustering

• Find the pairwise similarity values between all pairs of sessions; record only similarities greater than a threshold

• Incrementally construct similarity graph G

– the vertices are the sessions, the edge weights are the session similarity values

– no isolated vertices (discard “outliers”)

• Balanced Graph Partitioning

– we used Metis [Karypis, Kumar]
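
The graph-construction and outlier steps above can be sketched in Python as below. This is a minimal illustration under assumptions: the function name `build_similarity_graph`, the edge dictionary layout, and the caller-supplied `similarity` function are all hypothetical, and the partitioning step itself (Metis in the slides) is not shown.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Set, Tuple, TypeVar

T = TypeVar("T")


def build_similarity_graph(
    items: Sequence[T],
    similarity: Callable[[T, T], float],
    threshold: float,
) -> Tuple[Dict[Tuple[int, int], float], List[int]]:
    """Return (edges, outliers) for a thresholded similarity graph.

    edges maps an index pair (i, j) with i < j to its similarity weight.
    outliers lists the indices that ended up with no edge at all.
    """
    edges: Dict[Tuple[int, int], float] = {}
    connected: Set[int] = set()
    # All-pairs comparison: quadratic in the number of items.
    for i, j in combinations(range(len(items)), 2):
        weight = similarity(items[i], items[j])
        if weight > threshold:  # record only similarities above the threshold
            edges[(i, j)] = weight
            connected.update((i, j))
    # Isolated vertices are treated as outliers and left out of the graph.
    outliers = [i for i in range(len(items)) if i not in connected]
    return edges, outliers
```

The resulting `edges` could then be handed to any graph partitioner. The all-pairs loop is the expensive part of this sketch.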

The Cluster Space

• Given: each session assigned to one of k clusters (sets)

• Sessions of a user are distributed among the k sets

– vector u = [u1 u2 … uk]^T where ui = number of sessions of the user belonging to cluster i

• Stage II : User Clustering

– find pairwise similarity values using the extended Jaccard measure

– partition similarity graph

• Gives l user clusters and a set of outlier users
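
A small Python sketch of the cluster-space vector and the extended Jaccard measure follows. The formula used is the usual one for real-valued vectors, u·v / (|u|² + |v|² − u·v). The function names, the 0-based cluster labels, and the convention of returning 0.0 for two all-zero vectors are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def cluster_space_vector(session_labels: Sequence[int], k: int) -> List[int]:
    """Count how many of a user's sessions fall into each of the k session clusters.

    session_labels holds one cluster label (0 .. k-1) per non-outlier session.
    """
    counts = Counter(session_labels)
    return [counts.get(i, 0) for i in range(k)]


def extended_jaccard(u: Sequence[float], v: Sequence[float]) -> float:
    """Extended Jaccard similarity: u.v / (|u|^2 + |v|^2 - u.v)."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    # denom is zero only when both vectors are all zeros (assumed convention: 0.0).
    return dot / denom if denom else 0.0


# Two hypothetical users described by the clusters of their sessions (k = 3).
user_a = cluster_space_vector([0, 2, 2], k=3)  # [1, 0, 2]
user_b = cluster_space_vector([2, 2, 2], k=3)  # [0, 0, 3]
similarity = extended_jaccard(user_a, user_b)
```

Pairwise values computed this way can feed the same thresholded-graph construction used for sessions.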

The Dataset : Sulekha.com

Dataset details

• Logs over a one month period

• Raw log size 184 Mb

• 453,953 files accessed

• 37,753 sessions in all

• 23,310 sessions after some preprocessing/filtering

• 2,493 users

Results : Session Clusters

Cluster 1 – interest in coffeehouse, contests

– (/,12)(/movies,6)(/contests,178)

– (/contests,142)

– (/coffeehouse,5)(/contests,183)

– (/contests,172)

– (/,10)(/contests,143)

Cluster 2 – glance through home, articles

– (/,22)(/articles,22)

– (/,20)(/articles,20)

– (/,21)(/articles,21)

– (/,19)(/articles,19)

– (/,20)(/articles,19)

Cluster 3 – interest in author, articles

– (/,148)(/authors,6)(/articles,77)

– (/authors,290)(/articles,290)

– (/,39)(/articles,98)(/misc,17)

– (/misc,20)(/articles,77)(/misc,32)(/articles,43)(/authors,16)

Cluster 4 – read articles

– (/articles,2649)

– (/,9)(/articles,2666)

– (/articles,2373.1)

Results : User Clusters

• user : [(128.194.xxx.xxx)]

– (/authors,3)(/articles,129)

– (/authors,8)(/articles,8)

– (/authors,80)(/articles,2141)

• user : [(209.30.xxx.xxx)]

– (/home,77)(/articles,111)(/authors,93)(/articles,629)(/misc,58)(/coffeehouse,75)(/wo-men,967)

– (/articles,2627)

• user : [(171.68.xxx.xxx)]

– (/home,323)(/articles,24)(/authors,45)(/articles,1290)

A user cluster : people who read the articles

– (/home,21)(/wo-men,1075)(/philosophy,52)

• user : [(209.244.xxx.xxx)]

– (/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)(/wo-men,31)

– (/home,52)(/philosophy,67)(/wo-men,955)(/philosophy,26)(/coffeehouse,382)(/biztech,298)(/philosophy,290)

– (/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6)(/biztech,94)(/coffeehouse,2)(/philosophy,1093)

A user cluster : people interested in wo-men, philosophy, coffeehouse

– (/coffeehouse,12)(/biztech,25)(/books,48)

– (/coffeehouse,13)(/biztech,26)(/books,19)

• user : [(204.220.xxx.xxx)]

– (/coffeehouse,162)

– (/coffeehouse,40)

• user : [(32.100.xxx.xxx)]

– (/coffeehouse,12)(/contests,12)

– (/coffeehouse,43)(/contests,44)

A user cluster : people interested in coffeehouse – bookmarked it !

Result Visualization using CLUSION [Strehl & Ghosh 01]

[Figure: two CLUSION visualization panels, titled "Sessions" and "Users"]

Conclusions

• Segmentation: a basic pre-processing step for Web Mining

• Similarity measure + Cluster Space Concept: applicable to clustering of sets of any data-structure

• For certain websites, time spent on the pages matters

– not handled by current commercial tools

• Outlier detection before clustering is important

• Results QA-ed by human subjects

– Results for clusters & outliers at both levels were subjectively good

• No good way to find cluster quality analytically

• Formation of similarity graph is a slow process

Future Work

• Improve the present method by:

– using cluster seeds for cluster growing

– using alternative clustering algorithms for each stage

– studying the effect of thresholds, number of clusters on performance

– studying the importance of order of page-visits

– studying the importance of balanced clustering

Backup

Issues : Choice of Parameters

• Number of session clusters, k, should be chosen appropriately

• Thresholds for forming session & user similarity graphs :

– threshold value should be chosen after looking at the distribution of edge weights
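
One simple way to look at the distribution of edge weights before fixing a threshold is a coarse histogram plus the fraction of edges a candidate threshold would keep. The sketch below assumes weights in [0, 1]. The function names and the choice of ten equal-width bins are illustrative assumptions, not something the slides specify.

```python
from typing import List, Sequence


def weight_histogram(weights: Sequence[float], bins: int = 10) -> List[int]:
    """Count weights in equal-width bins over [0, 1]; a weight of 1.0 goes in the last bin."""
    counts = [0] * bins
    for w in weights:
        counts[min(int(w * bins), bins - 1)] += 1
    return counts


def fraction_kept(weights: Sequence[float], threshold: float) -> float:
    """Fraction of candidate edges whose weight is strictly above the threshold."""
    if not weights:
        return 0.0
    return sum(w > threshold for w in weights) / len(weights)
```

Comparing `fraction_kept` at a few candidate thresholds against the histogram shows how sparse the graph would become at each setting.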

Related Work

• Research in Web Mining :

– Extraction of navigational patterns : Spiliopoulou, Faulstich

– Ordering relationships : Mannila, Meek

– Surfing prediction : Pitkow, Pirolli

– Clustering web usage sessions : Fu, Sandhu, Shih

Example

• Sessions :

– Session1 = [(a,8) (b,100) (d,8) (c,5) (e,23) (a,5)]

– Session2 = [(b,5) (d,12) (f,1) (a,7) (c,5)]

• LCS pages = [(b)(d)(c)]

• Corresponding Index, Times Sequences :

– Index1 = [(1)(2)(3)], Time1 = [(100) (8) (5)]

– Index2 = [(0)(1)(4)], Time2 = [(5) (12) (5)]

• Similarity over each LCS page : min/max of the two times

– Similarity on page b = 5/100 = 0.05

– Similarity on page d = 8/12 = 0.67

– Similarity on page c = 5/5 = 1.00

Example (contd.)

• The similarity component = (0.05 + 0.67 + 1.00)/3 = 0.57

• The importance component :

– Fraction of time spent in the LCS by Session1 = 113/149 = 0.76

– Fraction of time spent in the LCS by Session2 = 22/30 = 0.73

– The mean = (0.76+0.73)/2 = 0.75

• The overall similarity = 0.57 x 0.75 = 0.43
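
The worked example can be reproduced with a short Python sketch. Function names and the `Session` type are assumptions made for illustration, and times are assumed to be positive. The example sessions also share a second common subsequence of length 3 (b, d, a), and the slides do not say how such ties are broken. The scoring function therefore takes the alignment indices as arguments, and the tie-breaking inside `lcs_indices` is an arbitrary choice.

```python
from typing import List, Sequence, Tuple

Session = List[Tuple[str, float]]  # (page id, time spent on that page)


def lcs_indices(p1: Sequence[str], p2: Sequence[str]) -> Tuple[List[int], List[int]]:
    """Indices into p1 and p2 of one longest common subsequence (ties broken arbitrarily)."""
    n, m = len(p1), len(p2)
    # dp[i][j] = LCS length of the suffixes p1[i:] and p2[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if p1[i] == p2[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    idx1, idx2, i, j = [], [], 0, 0
    while i < n and j < m:
        if p1[i] == p2[j]:
            idx1.append(i)
            idx2.append(j)
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return idx1, idx2


def session_similarity(s1: Session, s2: Session, idx1: Sequence[int], idx2: Sequence[int]) -> float:
    """(average min/max time ratio over the LCS) x (mean fraction of session time in the LCS)."""
    if not idx1:
        return 0.0
    t1 = [s1[i][1] for i in idx1]
    t2 = [s2[j][1] for j in idx2]
    time_sim = sum(min(a, b) / max(a, b) for a, b in zip(t1, t2)) / len(t1)
    total1 = sum(t for _, t in s1)
    total2 = sum(t for _, t in s2)
    importance = (sum(t1) / total1 + sum(t2) / total2) / 2
    return time_sim * importance


session1 = [("a", 8), ("b", 100), ("d", 8), ("c", 5), ("e", 23), ("a", 5)]
session2 = [("b", 5), ("d", 12), ("f", 1), ("a", 7), ("c", 5)]
# Alignment from the example: LCS pages b, d, c.
score = session_similarity(session1, session2, [1, 2, 3], [0, 1, 4])
print(round(score, 2))  # 0.43
```

Without intermediate rounding the product is about 0.427, which rounds to the 0.43 given in the example.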

Issues : Session Resolution

• Generate coarse resolution paths making use of the concept hierarchy of the website

• Reduces computations; Increases interpretability of results

Original Path:

(/authors/ramesh_mahadevan.html,3)

(/articles/rm_phattas.html,75)

(/articles/rm_desidads.html,39)

Concept-level Path:

(/authors,3)

(/articles,114)

Original Path:

(/authors/arun_sampath.html,109)

(/philosophy/messages/1951.html,102)

Concept-level Path:

(/authors,109)

(/philosophy,148)

(/philosophy,69)
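
A minimal Python sketch of the coarsening step, based on the first example above: each page is mapped to its top-level directory, and consecutive visits to the same concept are merged by adding their times. Treating the first path component as the "concept" is an assumption here, since the site's actual concept hierarchy is not given in the slides. The function names are hypothetical.

```python
from typing import List, Tuple

Session = List[Tuple[str, float]]  # (page path, time spent on that page)


def concept_of(path: str) -> str:
    """Map a page path to its top-level directory, e.g. /articles/x.html -> /articles."""
    first = path.strip("/").split("/")[0]
    return "/" + first  # the site root "/" maps to "/"


def to_concept_path(session: Session) -> Session:
    """Replace pages by concepts and merge consecutive visits to the same concept."""
    coarse: Session = []
    for path, seconds in session:
        concept = concept_of(path)
        if coarse and coarse[-1][0] == concept:
            coarse[-1] = (concept, coarse[-1][1] + seconds)
        else:
            coarse.append((concept, seconds))
    return coarse


original = [
    ("/authors/ramesh_mahadevan.html", 3),
    ("/articles/rm_phattas.html", 75),
    ("/articles/rm_desidads.html", 39),
]
print(to_concept_path(original))  # [('/authors', 3), ('/articles', 114)]
```

Only consecutive visits are merged, so a session that returns to a concept later keeps separate entries for it.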

Comments

• Results QA-ed by human subject

– Results for clusters & outliers at both levels were subjectively good

– No good way to find cluster quality analytically

• Clustering algorithms for the two stages

– Stage I : Graph partitioning works well for large sparse graphs, so it is desirable in this stage

– Stage II : Since the space is not high-dimensional, any reasonable clustering algorithm should be adequate

• Cluster space

– Gives a general framework for mapping any non-vector clustering problem to an equivalent vector clustering problem
