Outline
• Hype of the web
• Difficulties with web
• Web Mining
• Advantages / Disadvantages
• Categories of Web Mining
• Web Usage Mining
Difficulties With Web
§ Huge amount of information
§ Semi-structured data
§ Redundant data
§ Web is noisy
§ Customer behaviour
Web Mining - Sub Tasks
• Resource Finding
• Information Selection
• Information Preprocessing
• Data Mining Techniques
• Analysis
Web Usage Mining
The process of automatically discovering patterns and profiles of users interacting with a web site.
Data Cleaning: clean the raw data
- Missing values
- Redundant data
- Outlier deletion
Transaction Derivation:
- Group the log records into individual user transactions
Data Integration:
- Combines data from multiple sources into a single data store
Transformation:
- The data are transformed into forms appropriate for mining, for example by generalization and normalization
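The cleaning and transaction derivation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log records, the list of ignored file suffixes, and the 30-minute inactivity threshold used to split transactions are all hypothetical and do not come from the slides.

```python
from datetime import datetime, timedelta

# Hypothetical raw log records: (user id, timestamp, requested URL, HTTP status).
RAW_LOG = [
    ("u1", "2024-01-01 10:00:00", "/index.html", 200),
    ("u1", "2024-01-01 10:00:00", "/index.html", 200),    # redundant record
    ("u1", "2024-01-01 10:05:00", "/products/", 200),
    ("u1", "2024-01-01 11:30:00", "/contact.html", 200),
    ("u2", "2024-01-01 10:01:00", "/logo.gif", 200),      # embedded image
    ("u2", "2024-01-01 10:02:00", "/products/", 200),
    (None, "2024-01-01 10:03:00", "/products/", 200),     # missing user id
    ("u2", "2024-01-01 10:04:00", "/missing.html", 404),  # failed request
]

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed noise

def clean(records):
    """Drop records with missing values, failed requests, noise and duplicates."""
    seen = set()
    cleaned = []
    for user, ts, url, status in records:
        if user is None or status != 200:
            continue
        if url.lower().endswith(IGNORED_SUFFIXES):
            continue
        key = (user, ts, url)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((user, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), url))
    return cleaned

def derive_transactions(cleaned):
    """Group each user's requests into transactions split by inactivity gaps."""
    transactions = []
    last_seen = {}  # user -> (time of last request, index of open transaction)
    for user, ts, url in sorted(cleaned, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(user)
        if prev is None or ts - prev[0] > SESSION_TIMEOUT:
            transactions.append((user, [url]))
            idx = len(transactions) - 1
        else:
            idx = prev[1]
            transactions[idx][1].append(url)
        last_seen[user] = (ts, idx)
    return transactions
```

Calling derive_transactions(clean(RAW_LOG)) on this sample yields two transactions for u1 (split by the long gap before /contact.html) and one for u2.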
Pattern Discovery
Association Rule:
• X ==> Y (support, confidence)
• Example: 60% of clients who accessed /products/ also accessed /products/software/webminer.htm
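To make support and confidence concrete, the sketch below computes both for a rule X ==> Y over a list of sessions. The session data is hypothetical and was chosen so that the confidence matches the 60% figure in the example; the function name is an illustrative choice, not part of any library.

```python
def support_and_confidence(transactions, x, y):
    """Return (support, confidence) for the rule x ==> y.

    support    = fraction of all transactions containing both x and y
    confidence = fraction of transactions containing x that also contain y
    """
    n = len(transactions)
    with_x = sum(1 for t in transactions if x in t)
    with_both = sum(1 for t in transactions if x in t and y in t)
    support = with_both / n if n else 0.0
    confidence = with_both / with_x if with_x else 0.0
    return support, confidence

# Hypothetical sessions, each the set of pages one client accessed.
WEBMINER = "/products/software/webminer.htm"
sessions = (
    [{"/products/", WEBMINER}] * 3
    + [{"/products/"}] * 2
    + [{"/index.html"}] * 5
)

support, confidence = support_and_confidence(sessions, "/products/", WEBMINER)
```

On these ten sessions the rule has support 0.3 (3 of 10 sessions contain both pages) and confidence 0.6 (3 of the 5 sessions containing /products/ also contain the WebMiner page).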
Sequential Pattern:
• Discovery of frequently occurring ordered events or subsequences as patterns.
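A very small illustration of the idea, assuming each session is an ordered list of pages: the sketch counts how many sessions contain page a somewhere before page b and keeps the pairs that reach a minimum support. It handles only length-2 subsequences and is not a full sequential pattern mining algorithm; the function name and sample sessions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_ordered_pairs(sequences, min_support):
    """Return {(a, b): support} for pairs where a occurs before b in a sequence."""
    counts = Counter()
    for seq in sequences:
        # combinations() keeps input order, so (a, b) means a came before b.
        # A set is used so each pair counts at most once per sequence.
        pairs = {(a, b) for a, b in combinations(seq, 2) if a != b}
        counts.update(pairs)
    n = len(sequences)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

# Hypothetical ordered sessions.
sessions = [
    ["/home", "/products/", "/cart"],
    ["/home", "/cart"],
    ["/products/", "/home", "/cart"],
    ["/cart", "/home"],
]

patterns = frequent_ordered_pairs(sessions, min_support=0.5)
```

With a minimum support of 0.5, only ("/home", "/cart") and ("/products/", "/cart") survive here, with supports 0.75 and 0.5 respectively.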
Clustering
The process of grouping a set of objects into classes of similar objects.
Classification
The process of finding a model that describes and distinguishes data classes or concepts.
K-means Algorithm
Used for clustering, where each cluster’s center is represented by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output:
A set of k clusters.
Steps:
(1) randomly choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) until no change;
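The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: it assumes points are numeric tuples, uses squared Euclidean distance as the similarity measure, keeps the old center if a cluster becomes empty, and adds a max_iter safety cap that is not part of the slide.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # (1) random initial centers
    assignment = None
    for _ in range(max_iter):                  # (2) repeat
        # (3) assign each object to the cluster with the nearest mean
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:       # (5) until no change
            break
        assignment = new_assignment
        # (4) update each cluster mean
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    clusters = [[p for p, a in zip(points, assignment) if a == c] for c in range(k)]
    return clusters, centers

# Hypothetical data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centers = k_means(points, k=2)
```

On this sample the two groups are recovered as the two clusters. In general the result depends on the random initial centers, so different seeds can give different clusterings on less clearly separated data.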
K-Medoids Algorithm
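- Addresses limitations of the k-means algorithm, such as its sensitivity to outliers.
- Each cluster is represented by one of its own objects (the medoid) rather than by a mean; representatives are chosen to minimize the total cost, i.e. the sum of dissimilarities between each object and its representative.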
Input:
k: the number of clusters,
D: a data set containing n objects.
Output:
A set of k clusters.
Steps:
(1) randomly choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a nonrepresentative object, o_random;
(5) compute the total cost, S, of swapping representative object, o_j, with o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of k representative objects;
(7) until no change;
p: non-representative object
o_j: current representative object
o_i: another representative object
o_random: candidate replacement object (a non-representative object)
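A minimal Python sketch of the swap-based procedure follows. Several choices here are assumptions rather than part of the slide: Euclidean distance is used as the dissimilarity, points are assumed to be distinct numeric tuples, and "until no change" is interpreted as "a full pass over all non-representative objects (in random order) produces no improving swap". The swap cost S is computed as the difference in total cost after and before the swap.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from each object to its nearest representative."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)            # (1) random initial representatives
    changed = True
    while changed:                             # (2) repeat ... (7) until no change
        changed = False
        candidates = [p for p in points if p not in medoids]
        rng.shuffle(candidates)                # (4) random non-representative objects
        for o_random in candidates:
            for j in range(k):
                trial = medoids[:j] + [o_random] + medoids[j + 1:]
                # (5) total cost S of swapping o_j with o_random
                s = total_cost(points, trial) - total_cost(points, medoids)
                if s < 0:                      # (6) swap only if it lowers the cost
                    medoids = trial
                    changed = True
                    break
    # (3) assign every object to the nearest representative object
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: math.dist(p, m))
        clusters[nearest].append(p)
    return clusters

# Hypothetical data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters = k_medoids(points, k=2)
```

The returned dictionary maps each medoid to the objects assigned to it. Because every medoid is an actual object from the data set, a single extreme value cannot drag a representative away from its cluster the way it can shift a mean.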