Web Usage Mining: A case study of the GoMercer.com website
Martin Zhao
Mar 16, 2007
Topics
• What is data mining?
• The data mining process
• Web usage mining: basic concepts
• The robust fuzzy relational clustering algorithm
• An application to the GoMercer.com web logs
• Q & A
What is Data Mining? – definition
• A concise definition: finding hidden information in large datasets
• A slightly longer version: data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules
• Differences from accessing info in a database
  • The query is not well formed or precisely stated
  • The data needs to be pre-processed before mining
  • The output is new knowledge, which may not be a subset of the database
What is Data Mining? – a historical perspective
• Data mining is a relatively new field of study
  • The 1st International Conference on Knowledge Discovery and Data Mining (KDD) was held in 1995
• But its roots can be traced back to five areas:
  • Statistics: Bayes' theorem (1700s), regression (1900s), classification (1960s), k-means clustering (1970s)
  • Artificial Intelligence: neural networks (1940s), genetic algorithms (1970s), decision tree algorithms (1980s)
  • Algorithms
  • Information Retrieval: similarity measures (1960s), clustering (1960s), SMART IR systems (1970s)
  • Databases: batch reports (1960s), relational data models (1970s), data warehousing & OLAP (1990s)
Why Data Mining?
• The growth of data is the most important factor propelling the growth of data mining
  • In 2003, Wal-Mart captured 20 million transactions per day in a 10-terabyte database (1 TB = $10^6$ MB)
  • In 1950, the largest companies had only several dozen megabytes
  • The total amount of data produced in 2002 was estimated at 5 exabytes (1 EB = $10^6$ TB)
  • 40% of this was produced in the US
• When we have more data, we expect more sophisticated information from it
Business Intelligence – from data to knowledge
• Data: factual information; may be incomplete; stored in huge amounts
• Information: relevant data; well formatted; for a targeted audience
• Knowledge: models, patterns, and rules; can be used in prediction
• Intelligence: using knowledge in decision making
Basic Data Mining Tasks
• Classification (map data into predefined groups)
• Regression (map a data item to a real valued prediction variable)
• Prediction (similar to classification, but deals with a future state)
• Clustering (similar to classification, but the groups are defined by the data)
• Association rules (identify associations among data)
• Sequence discovery (determine sequential patterns in data)
The Data Mining Process – the steps
• Develop an understanding of the purpose
• Obtain the dataset to be used
• Explore, clean, and preprocess the data
• Reduce the data, if necessary
• Determine the data mining tasks
• Choose the data mining techniques to be used
• Use algorithms to perform the tasks
• Interpret the results
• Deploy the model
Phases in the DM Process – CRISP-DM
Web Data Mining
• Web mining: the use of data mining techniques to automatically discover and extract useful and novel information from web documents and services
• Web mining can be categorized as
  • Content mining: extracts models from web contents, such as text, images, video, and semi-structured (HTML or XML) or structured documents (digital libraries)
  • Structure mining: aims at finding the underlying topology and organization of web resources
  • Usage mining: discovers usage patterns from web server log files, user queries, and registration data
User Clustering and Profiling – goals
• Major application areas for web usage mining
  • Personalization
  • System improvement
  • Site modification
  • Business intelligence
  • Usage characterization
User Clustering and Profiling – process
• Data cleaning
  • omitting entries about individual objects on a page (such as .gif or .jpg image files)
• (User and) session identification
  • including identifying distinct pages, IPs, and agents
  • a session is a sequence of page views accessed through a certain IP using a certain agent within a certain amount of time (set as 45 minutes)
• Clustering and profiling
  • Define similarity between page views
  • Categorize user sessions into clusters based on similarity of the pages visited
Web Log File Entries
• Web log files keep track of the following data
  • Date and time (e.g., 2006-10-01@00:01:01)
  • Client IP address (e.g., 70.168.242.49)
  • Server IP address (e.g., 192.168.1.52 or www.GoMercer.com)
  • URI stem (web page or a specific file requested, e.g., /choose-mercer/apply-online.aspx)
  • User agent (browser used by the user, e.g., Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322))
  • Referrer (the previous page visited)
  • Cookie
  • Etc.
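A minimal sketch of reading entries like these, assuming the logs are in the W3C extended format that IIS writes (space-separated values, with a `#Fields:` directive naming the columns). The slides do not state the log format, so this is an assumption; the function name `parse_w3c_log` is hypothetical.

```python
def parse_w3c_log(lines):
    """Yield one dict per log entry, keyed by the names in the #Fields directive.

    Assumes W3C extended log format: directive lines start with '#',
    and entry values are separated by whitespace.
    """
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Only the #Fields directive matters here; other directives are skipped.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed entries (or entries seen before any #Fields)
        yield dict(zip(fields, values))
```

Each yielded record can then be looked up by field name, for example `record["c-ip"]` for the client IP or `record["cs-uri-stem"]` for the URI stem, provided those fields were enabled in the server's logging configuration.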
Data Model
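[Diagram: entities User Cluster, User Session, Web Page, Web Browser, and IP Address, connected by associations labeled with multiplicities 1, *, and 5..*, and the constraint "Within 45 minutes".]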
Session Identification
1. Use original web server log files as input
2. Parse log entries to omit individual objects (such as images), and
   a. Keep track of unique client IPs, URIs of interest, and user agents
   b. Keep track of date/time and identifiers for IP, URI, and agent for each entry of interest
3. For each entry of interest
   a. add the URI to an existing session with the same {IP, agent} identifiers and within 45 minutes
   b. otherwise, create a new session with the URI
4. Persist the session information to a file (or DB)
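Steps 2 and 3 above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: entries are assumed to be already parsed into (timestamp, client IP, user agent, URI) tuples sorted by time, the 45 minutes are measured from the first request of a session, and the names `identify_sessions`, `SESSION_WINDOW`, and `IGNORED_SUFFIXES` are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_WINDOW = timedelta(minutes=45)   # session length from the slides
IGNORED_SUFFIXES = (".gif", ".jpg")      # individual page objects to omit

def identify_sessions(entries):
    """Group log entries into sessions keyed by {IP, agent}.

    entries: iterable of (timestamp, client_ip, user_agent, uri), sorted by time.
    Returns a list of session dicts in order of creation.
    """
    open_sessions = {}   # (ip, agent) -> most recent session for that pair
    sessions = []
    for timestamp, ip, agent, uri in entries:
        # Step 2: omit individual objects such as images.
        if uri.lower().endswith(IGNORED_SUFFIXES):
            continue
        key = (ip, agent)
        current = open_sessions.get(key)
        # Step 3a: same {IP, agent} and within 45 minutes of the session start
        # (measuring from the session start is an assumption).
        if current is not None and timestamp - current["start"] <= SESSION_WINDOW:
            current["uris"].append(uri)
        else:
            # Step 3b: otherwise start a new session with this URI.
            current = {"ip": ip, "agent": agent, "start": timestamp, "uris": [uri]}
            open_sessions[key] = current
            sessions.append(current)
    return sessions
```

Step 4 (persisting the sessions) is left out; the returned list could be written to a file or a database table as needed.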
Sample Session Information
Clustering – a one-dimensional example
• Let’s try to group this set of test scores into letter grades
• Classification: map data into pre-defined groups
• Clustering: just specify the number of groups; the groups themselves are defined by the data
• Maximize the inter-cluster distance and minimize the intra-cluster distance
[Figure: test scores plotted on an axis from 50 to 95. Inter-cluster distances (gap used here): 8, 6, 6. Intra-cluster distances: 3, 4, 2.13, 3.33.]
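One simple way to realize the "gap" idea in one dimension is to sort the scores and cut at the largest gaps between neighbors. The sketch below is only an illustration: the function name `cluster_by_gaps` is hypothetical, and the scores in the usage note are made up, not the data behind the slide.

```python
def cluster_by_gaps(values, k):
    """Split one-dimensional values into k groups by cutting at the k-1 largest gaps."""
    ordered = sorted(values)
    # Index i marks the gap between ordered[i - 1] and ordered[i].
    gap_positions = sorted(
        range(1, len(ordered)),
        key=lambda i: ordered[i] - ordered[i - 1],
        reverse=True,
    )[: k - 1]
    clusters, start = [], 0
    for cut in sorted(gap_positions):
        clusters.append(ordered[start:cut])
        start = cut
    clusters.append(ordered[start:])
    return clusters
```

For a hypothetical score list such as `[52, 55, 58, 66, 68, 70, 76, 79, 85, 88, 91, 94]` and `k = 4`, the cuts fall at the three widest gaps, so only the number of groups is specified and the group boundaries come from the data.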
Page and Session (Dis-)Similarity
• The “syntactic” similarity between (the URLs of) the ith and jth pages is defined as the smaller of 1 and the ratio of the overlap of the two paths to the larger of the two path lengths: $S_u(i, j) = \min\left(1, \frac{|p_i \cap p_j|}{\max(1, \max(|p_i|, |p_j|))}\right)$
• For instance, the similarity score for /mercer-411/contact.aspx and /mercer-411/ask-a-student.aspx is 1/2, whereas the score for /mercer-411/contact.aspx and assets/flash/location.xml is 0
• Dissimilarity is defined as $(1 - S_u(i, j))^2$
• Dissimilarity between two clusters is then calculated by summing up pair-wise dissimilarity scores
Page Similarity – an example
• For instance, the similarity score for /mercer-411/contact.aspx and /mercer-411/ask-a-student.aspx is 1/2, whereas the score for /mercer-411/contact.aspx and assets/flash/location.xml is 0

/
  /mercer-411
    /contact.aspx
    /ask-a-student.aspx
    …
  /assets
    /flash
      /location.xml
      /CLA_1.flv
      …
  …
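The similarity definition can be checked against the two example pairs with a short sketch. It assumes that a path's length is its number of segments below the site root and that the overlap is the length of the common leading segments; the function names are hypothetical.

```python
def path_segments(url):
    """Split a URL path into its non-empty segments, e.g. '/a/b.aspx' -> ['a', 'b.aspx']."""
    return [segment for segment in url.strip("/").split("/") if segment]

def page_similarity(url_i, url_j):
    """Syntactic similarity: min(1, overlap / max(1, longer path length)).

    Note: two root URLs ('/') have no segments and score 0 under this counting.
    """
    p_i, p_j = path_segments(url_i), path_segments(url_j)
    overlap = 0
    for a, b in zip(p_i, p_j):
        if a != b:
            break
        overlap += 1   # count shared leading segments only
    return min(1.0, overlap / max(1, max(len(p_i), len(p_j))))

def page_dissimilarity(url_i, url_j):
    """Dissimilarity as defined on the slide: (1 - similarity) squared."""
    return (1.0 - page_similarity(url_i, url_j)) ** 2
```

With this counting, /mercer-411/contact.aspx and /mercer-411/ask-a-student.aspx share one of two segments, giving 1/2, while /mercer-411/contact.aspx and assets/flash/location.xml share none, giving 0.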
Medoid and Membership
• Each cluster is represented by a medoid, which is a centrally located session in the cluster
• The affiliation of a session to a cluster is represented as a membership score, or the similarity to the corresponding medoid
  • A session is not considered to exclusively belong to a single cluster
  • The affiliation is determined by the highest membership score in a given iteration
Relational Clustering Algorithm
1. Use identified sessions as input
2. Specify the number of clusters, C, and the maximum number of iterations, M, to be used
3. Choose an initial medoid for each cluster i in [1, C]
4. Compute membership u_ij for each session j in [1, N] with regard to each cluster i (using the similarity measure)
5. Store the old medoids
6. Compute the new medoids to minimize overall intra-cluster distances
7. Repeat steps 4 through 6 until the medoids do not change or the maximum number of iterations M is reached
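The loop in steps 3 through 7 can be sketched with plain fuzzy c-medoids. This is a simplification under stated assumptions, not the robust variant named in the talk: it takes a precomputed N x N session dissimilarity matrix, uses the inverse-dissimilarity membership commonly used with fuzzy c-medoids (with a fuzzifier `m` greater than 1), and all function and parameter names are hypothetical.

```python
def compute_memberships(dissim, medoids, m):
    """Step 4: membership u[i][j] of session j in cluster i (each column sums to 1)."""
    exponent = 1.0 / (m - 1.0)
    u = [[0.0] * len(dissim) for _ in medoids]
    for j in range(len(dissim)):
        dists = [dissim[v][j] for v in medoids]
        zero = [i for i, d in enumerate(dists) if d == 0.0]
        if zero:
            # Session j coincides with a medoid: full membership there.
            for i in zero:
                u[i][j] = 1.0 / len(zero)
        else:
            weights = [(1.0 / d) ** exponent for d in dists]
            total = sum(weights)
            for i, w in enumerate(weights):
                u[i][j] = w / total
    return u

def best_medoid(dissim, weights, m):
    """Step 6: the session minimizing the membership-weighted distance to all sessions."""
    n = len(dissim)
    def cost(k):
        return sum(weights[j] ** m * dissim[k][j] for j in range(n))
    return min(range(n), key=cost)

def fuzzy_c_medoids(dissim, initial_medoids, max_iter=100, m=2.0):
    """Iterate steps 4-6 until the medoids stop changing or max_iter is reached."""
    medoids = list(initial_medoids)                                # step 3
    for _ in range(max_iter):
        u = compute_memberships(dissim, medoids, m)                # step 4
        old = list(medoids)                                        # step 5
        medoids = [best_medoid(dissim, u[i], m) for i in range(len(old))]  # step 6
        if medoids == old:                                         # step 7
            break
    return medoids, compute_memberships(dissim, medoids, m)
```

The function returns medoid indices and the membership matrix; assigning each session to the cluster with its highest membership gives the affiliation described on the previous slide.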
Application to GoMercer.com
• Meeting w/ Rob Saxon
• Obtain & read web log files
• Preliminary study using CSC data
• Parsing data for sessions
• Clustering w/ FCMdd
• Data analysis & visualization
Ongoing
Results – summary of log files
• 148 files (one per day from 09/29/06 to 02/23/07), totaling about 2.5 GB
• File sizes for Oct 2006 and Feb 2007 as shown
• Session counts in the same periods present similar patterns
[Figures: daily bar charts for Oct 2006 (1-sun to 31-tue) and Feb 2007 (07-2-1-thu to 07-2-23-fri), vertical axes 0 to 60000; a further chart with a vertical axis of 0 to 600.]
Results – frequencies by URI type
• User client programs (or browsers used)
• Main page
• ASP scripts
  • Breakdown for /accepted, /choose-mercer, and /mercer-411
• Flash videos
  • Individual videos
  • Combined by topic
[Figures: frequency charts, one with a logarithmic axis (1 to 10000). Individual flash videos (e.g., Academic+Overview_1, Residence+Life+1, Athletics+1). Flash videos combined by topic: Residence_Life, Academic_Overview, Campus_Life, CLA, Athletics, Greek_Life, Religious_Life, Engineering, Education, Business, Rec_and_Activites, mercer-admissions, Music. ASP pages under /accepted, /choose-mercer, and /mercer-411 (e.g., /choose-mercer/apply-online.aspx, /mercer-411/ask-a-student.aspx). Top-level paths: /, /why-mercer, /mercer-life, /choose-mercer, /accepted, /mercer-411.]
Results – user cluster and profiles
[Figure data labels: 279, 128, 156, 278, 267, 145, 305, 399, 190, 320, 268, 279, 158, 251, 225; 162, 147, 166, 263, 112, 150, 345, 206, 233, 281, 291, 151, 186, 229]
Questions and Discussions
References
• Data Mining for Business Intelligence, by Shmueli et al., Wiley-Interscience, 2007
• Data Mining, by Dunham, Prentice Hall, 2003
• Web Mining: Applications and Techniques, Scime (ed.), Idea Group, 2005
• What is data mining? by Squier (www.dama-ncr.org/Library/2001.11.14-Laura%20Squier.ppt)
• Automatic web user profiling and personalization using robust fuzzy relational clustering, by Nasraoui et al., 1999
• Web usage mining: discovery and application of interesting patterns from web data, by Cooley, PhD thesis, Univ. of Minnesota, 2000