Web Mining

Presented By: Akshat Saxena, Anjul Sahu


Page 1: Web Mining

Presented By: Akshat Saxena, Anjul Sahu

Page 2: Web Mining

Definition

Application of data mining techniques on the web to discover interesting patterns.

Page 3: Web Mining

Introduction

Size of the web is extremely large
Data present on the web is unstructured
Good scope for data mining
Types of data on the web:
◦ Content of the actual web page
◦ Intrapage structure
◦ Interpage structure
◦ Usage data
◦ User profiles and cookies

Page 4: Web Mining

Web Mining Taxonomy

Page 5: Web Mining

Web Content Mining

Extends the work of search engines
Improves on traditional crawler techniques
Uses data mining for efficiency, effectiveness and scalability
Further divided into:
◦ Agent based approach
◦ Database based approach
Text mining is sometimes, but not always, treated as content mining
Crawlers
Personalization

Page 6: Web Mining

Web Content Mining Subtasks

Resource finding
◦ Retrieving intended documents
Information selection/pre-processing
◦ Select and pre-process specific information from selected documents
Generalization
◦ Discover general patterns within and across web sites
Analysis
◦ Validation and/or interpretation of mined patterns

Page 7: Web Mining

Text Mining

Page 8: Web Mining

Web Crawler

A program that browses the WWW in a methodical, automated manner
Copies pages into a cache and indexes them
Starts from a seed URL
Searches for and finds links and keywords
Types of crawler:
◦ Context focused
◦ Focused
◦ Incremental
◦ Periodic
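The crawl loop described on this slide can be sketched as a breadth-first traversal from a seed URL. This is a minimal illustration, not a production crawler: the `TOY_WEB` dictionary and the `fetch_links` callback are hypothetical stand-ins for real page fetching and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
TOY_WEB = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl from a seed URL; returns pages in visit order."""
    visited = []
    seen = {seed}
    frontier = deque([seed])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)  # stand-in for "copy to cache and index"
        for link in fetch_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl("a.example/", lambda u: TOY_WEB.get(u, []))
print(order)
```

A real crawler would replace the callback with an HTTP fetch plus HTML link extraction, and would also respect robots.txt and rate limits.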

Page 9: Web Mining

Focused Crawler

Page 10: Web Mining

Focused Crawler

Visits only pages of interest
Architecture consists of:
◦ Hyperlink Classifier
◦ Distiller
◦ Crawler
Hub pages - links to relevant pages
Hard focus - parent node relevant
Soft focus - probability of relevance
Harvest rate – precision rate
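One way to read the hard focus, soft focus and harvest rate terms above is sketched below. The `PAGES` table, its relevance scores and the 0.5 threshold are all hypothetical; in a real focused crawler the score would come from the classifier component. Here hard focus drops the links of any page judged irrelevant, soft focus only uses the parent's relevance probability to order the frontier, and harvest rate is the fraction of fetched pages that were relevant.

```python
import heapq

# Hypothetical toy data: page -> (relevance probability, outgoing links).
PAGES = {
    "seed": (0.9, ["hub", "off"]),
    "hub": (0.8, ["p1", "p2"]),
    "off": (0.1, ["junk"]),
    "p1": (0.7, []),
    "p2": (0.6, []),
    "junk": (0.0, []),
}


def focused_crawl(seed, budget, threshold=0.5, hard=True):
    """Crawl up to `budget` pages; return (visited pages, harvest rate)."""
    frontier = [(-1.0, seed)]  # max-heap emulated with negated priorities
    seen, visited, relevant = {seed}, [], 0
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        score, links = PAGES[url]
        visited.append(url)
        is_relevant = score >= threshold
        relevant += is_relevant
        if hard and not is_relevant:
            continue  # hard focus: do not follow links of irrelevant parents
        for link in links:
            if link not in seen:
                seen.add(link)
                # priority of a link = relevance probability of its parent
                heapq.heappush(frontier, (-score, link))
    rate = relevant / len(visited) if visited else 0.0
    return visited, rate


hard_pages, hard_rate = focused_crawl("seed", budget=10, hard=True)
soft_pages, soft_rate = focused_crawl("seed", budget=10, hard=False)
print(hard_pages, hard_rate)
print(soft_pages, soft_rate)
```

With an unlimited budget the soft variant eventually reaches the irrelevant "junk" page, which lowers its harvest rate; with a tight budget both variants spend their fetches on the most promising links first.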

Page 11: Web Mining

Context Focused Crawler

The focused crawler was static
Drawbacks:
◦ Non-relevant pages may have links to relevant ones. These need to be followed
◦ Relevant pages may not link to other relevant ones. Requires backward crawling
CFC works in two steps:
◦ Construct context graphs and classifiers
◦ Crawl using these classifiers
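The first CFC step, building a context graph by crawling backward from a relevant page, might look roughly like the sketch below. The `LINKS` map is hypothetical, and the layering rule (group pages by the minimum number of links needed to reach the target) is an assumption about how the context graph is organized; training one classifier per layer is left out.

```python
from collections import deque

# Hypothetical forward links: page -> pages it links to.
LINKS = {
    "news": ["sports"],
    "sports": ["team"],
    "blog": ["team"],
    "team": ["target"],
    "target": [],
}


def context_layers(target, links, max_depth=2):
    """Group pages by the minimum number of links needed to reach `target`."""
    # Invert the link map so we can walk backward from the target.
    backlinks = {}
    for src, dsts in links.items():
        for dst in dsts:
            backlinks.setdefault(dst, []).append(src)

    layers = {0: [target]}
    depth_of = {target: 0}
    queue = deque([target])
    while queue:
        page = queue.popleft()
        depth = depth_of[page]
        if depth == max_depth:
            continue  # do not expand beyond the outermost layer
        for parent in backlinks.get(page, []):
            if parent not in depth_of:
                depth_of[parent] = depth + 1
                layers.setdefault(depth + 1, []).append(parent)
                queue.append(parent)
    return layers


layers = context_layers("target", LINKS)
print(layers)
```

During the crawl step, a page classified into a low-numbered layer would be given higher priority, since it is expected to be only a few links away from relevant content.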

Page 12: Web Mining

Harvest System

Uses caching, indexing and crawling
Acts as a tool for gathering information from other sources
Components:
◦ Gatherer - obtains information
◦ Broker - provides index and query interface
Essence systems
Semantic indexing

Page 13: Web Mining

Virtual Web View

Web as a multiple layer database (MLDB)
A view of the MLDB is a virtual web view
No spiders used
Websites send their indices to others
WebML – DMQL for web mining
KEYWORDS – covers, covered by, like, close to
Difficult to implement

Page 14: Web Mining

Personalization

Contents of the web are modified as per the user's desires
Personalized, not targeted
Uses cookies, userID, profile information
Legal issues to be considered
Includes clustering, classification or even prediction

Page 15: Web Mining

Personalization

Types:
◦ User preference
◦ Collaborative filtering
◦ Content based filtering

Example: My Yahoo! was first. Now almost every service offers personalization.
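Of the types listed above, collaborative filtering is the easiest to show in a few lines. The sketch below is a user-based variant using cosine similarity; the `RATINGS` data is invented for illustration and the similarity-weighted average is one common scoring choice among several.

```python
from math import sqrt

# Hypothetical ratings: user -> {content section: rating}.
RATINGS = {
    "ann": {"news": 5, "sports": 4, "finance": 1},
    "bob": {"news": 4, "sports": 5, "weather": 4},
    "cat": {"finance": 5, "weather": 1},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings):
    """Score unseen items by the similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        if sim <= 0:
            continue
        for item, rating in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    return sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)


recs = recommend("ann", RATINGS)
print(recs)
```

Content based filtering would instead compare item descriptions with the user's own history, without looking at other users.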

Page 16: Web Mining

Personalization

Yahoo was the first to introduce the concept of a ’personalized portal’, i.e. a Web site designed to have the look-and-feel as well as content personalized to the needs of an individual end-user.

Mining MyYahoo usage logs provides Yahoo valuable insight into an individual’s Web usage habits, enabling Yahoo to provide compelling personalized content, which in turn has led to the tremendous popularity of the Yahoo Web site.

Page 17: Web Mining

Web Structure Mining

Creating a model of web organization
Classify web pages
Create similarity measures between web pages
Page Rank
The Clever system
Hyperlink induced topic search (HITS)

Page 18: Web Mining

PageRank™

A link analysis algorithm that assigns a numerical weight to each web page.

The numerical weight assigned to any given element E is called the PageRank of E and is denoted PR(E).

The PageRank value of a page u depends on the PageRank value of each page v in the set B_u (the set of all pages linking to u), divided by the number L(v) of links from page v.
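Written as a formula, the dependency described above is the simplified PageRank relation (without a damping factor, which the slide does not mention):

```latex
PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}
```

As a small worked example with made-up numbers: if u is linked from v_1 with PR(v_1) = 0.4 and L(v_1) = 2, and from v_2 with PR(v_2) = 0.3 and L(v_2) = 3, then PR(u) = 0.4/2 + 0.3/3 = 0.3.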

Page 19: Web Mining

Page Rank

Increases effectiveness of search engines
Based on number of back links
Rank sink problem exists
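A minimal iterative computation of PageRank is sketched below. The damping factor of 0.85 and the even redistribution of rank from pages without out-links are assumptions not stated on the slides; they are the usual remedy for the rank sink problem. The `GRAPH` is a hypothetical four-page web in which every link target also appears as a key.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a dict of page -> outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links[p])
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for v in pages:
            for u in links[v]:
                # Page v passes an equal share of its rank to each page it links to.
                new[u] += damping * rank[v] / len(links[v])
        rank = new
    return rank


# Hypothetical graph: "c" has the most back links.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(GRAPH)
print(sorted(ranks.items(), key=lambda kv: -kv[1]))
```

Without the `(1 - damping) / n` term, a group of pages that link only to each other would keep absorbing rank from the rest of the graph.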

Page 20: Web Mining

Clever System

Finds both authoritative pages and hubs
Authoritative - best source
Hub - links to authoritative pages
Most valuable pages returned
Hyperlink Induced Topic Search:
◦ Keywords
◦ Authority and hub measure
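The mutual reinforcement between hubs and authorities can be sketched as the iteration below. This is a simplified illustration: it skips the keyword-based selection of a root set and runs directly on a hypothetical link graph, and the fixed iteration count and Euclidean normalization are assumptions.

```python
from math import sqrt


def hits(links, iterations=20):
    """Return (hub, authority) scores for a dict of page -> outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: h1..h3 act as hubs, a1 and a2 as authorities.
GRAPH = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(GRAPH)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```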

Page 21: Web Mining

Alternatives to PageRank

HITS Algorithm
IBM Clever Project
TrustRank
But PageRank is the most popular and widely used algorithm by search engines

Page 22: Web Mining

Web Usage Mining

Applies mining on web usage data, weblogs or clickstream data
Client perspective
Server perspective
Aids in personalization
Helps in evaluating quality and effectiveness
Preprocessing, pattern discovery and data structures

Page 23: Web Mining

Trackers for site usage and analysis

Page 24: Web Mining
Page 25: Web Mining

Issues in Web Log

Identify exact user

Exact sequence of pages visited

Security, privacy and legal issues

Page 26: Web Mining

Preprocessing

Information not in presentable format
Data cleaning required
Log: (<src id>,<literal>,<timestamp>)
Data might be grouped
Sessions
Path completion
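Grouping log entries into sessions can be sketched as follows. The log tuples follow the (src id, literal, timestamp) shape from the slide, with the literal assumed to be the requested page and the timestamp assumed to be in seconds; the 30-minute inactivity timeout is a common heuristic, not something the slides specify.

```python
from collections import defaultdict


def sessionize(log, timeout=1800):
    """Split each source's clicks into sessions separated by > timeout seconds."""
    by_source = defaultdict(list)
    for src, page, ts in sorted(log, key=lambda entry: (entry[0], entry[2])):
        by_source[src].append((page, ts))

    sessions = []
    for src, clicks in by_source.items():
        current = [clicks[0][0]]
        for (_, prev_ts), (page, ts) in zip(clicks, clicks[1:]):
            if ts - prev_ts > timeout:  # long gap: close the current session
                sessions.append((src, current))
                current = []
            current.append(page)
        sessions.append((src, current))
    return sessions


# Hypothetical log entries: (src id, page, timestamp in seconds).
LOG = [
    ("u1", "/home", 0), ("u1", "/news", 60), ("u2", "/home", 100),
    ("u1", "/home", 5000), ("u1", "/sports", 5100),
]
sessions = sessionize(LOG)
print(sessions)
```

Path completion, which restores page views hidden by browser caching, would be a further step applied to each session.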

Page 27: Web Mining

Data Structure

A data structure is needed to keep track of the patterns identified
The data structure used is a trie
A rooted tree where each path from the root to a node represents a sequence
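A minimal trie over page sequences is sketched below. The class and method names are illustrative, and storing a visit counter at each node is an assumption about what "keeping track of patterns" needs.

```python
class TrieNode:
    """One node of the trie; the path from the root spells a page sequence."""

    def __init__(self):
        self.children = {}  # page -> TrieNode
        self.count = 0      # number of inserted sequences passing through here


class SequenceTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, sequence):
        """Add a page sequence, incrementing the counter of every prefix."""
        node = self.root
        for page in sequence:
            node = node.children.setdefault(page, TrieNode())
            node.count += 1

    def count(self, prefix):
        """Return how many inserted sequences start with `prefix`."""
        node = self.root
        for page in prefix:
            node = node.children.get(page)
            if node is None:
                return 0
        return node.count


trie = SequenceTrie()
for session in (["A", "B", "C"], ["A", "B", "D"], ["A", "C"]):
    trie.insert(session)
print(trie.count(["A", "B"]))
```

Sequences that share a prefix share nodes, so common navigation paths are stored only once.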

Page 28: Web Mining

Pattern Discovery

Traversal pattern - pages visited in a session
Properties:
◦ Duplicate references may / may not be allowed
◦ Consists of only contiguous page references
◦ Pattern may / may not be maximal
Association rules - pages accessed together
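Association rules over pages accessed together can be illustrated with page pairs only. The sessions below are invented, and the support and confidence definitions are the standard ones (fraction of sessions containing both pages, and that count divided by the count of the antecedent page).

```python
from itertools import combinations


def pair_rules(sessions, min_support=0.5):
    """Return rules (antecedent, consequent, support, confidence) for page pairs."""
    n = len(sessions)
    item_count, pair_count = {}, {}
    for session in sessions:
        pages = sorted(set(session))  # ignore order and duplicate visits
        for page in pages:
            item_count[page] = item_count.get(page, 0) + 1
        for pair in combinations(pages, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support >= min_support:
            rules.append((a, b, support, count / item_count[a]))
            rules.append((b, a, support, count / item_count[b]))
    return rules


# Hypothetical sessions (sets of pages viewed together).
SESSIONS = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "D"]]
rules = pair_rules(SESSIONS)
for rule in rules:
    print(rule)
```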

Page 29: Web Mining

Pattern Discovery

Sequential pattern - ordered set satisfying a support and maximal
Similar to Apriori algorithm
Web access pattern - efficient counting
Episodes – partially ordered by access time; users not identified
Pattern analysis
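The idea of a sequential pattern that satisfies a support threshold and is maximal can be shown with a brute-force sketch. Note that this is not the Apriori-style level-wise algorithm the slide alludes to: it simply enumerates every ordered subsequence up to a small length, so it is only suitable for toy data. The sessions and the 0.75 threshold are hypothetical.

```python
from itertools import combinations


def is_subsequence(pattern, session):
    """True if `pattern` occurs in `session` in order (gaps allowed)."""
    remaining = iter(session)
    return all(page in remaining for page in pattern)


def frequent_sequences(sessions, min_support, max_len=3):
    """Return {pattern: support} for ordered patterns meeting min_support."""
    candidates = set()
    for session in sessions:
        for k in range(1, max_len + 1):
            candidates.update(combinations(session, k))  # keeps page order
    n = len(sessions)
    frequent = {}
    for pattern in candidates:
        support = sum(is_subsequence(pattern, s) for s in sessions) / n
        if support >= min_support:
            frequent[pattern] = support
    return frequent


def maximal(frequent):
    """Keep only patterns not contained in another frequent pattern."""
    return {
        p: s for p, s in frequent.items()
        if not any(p != q and is_subsequence(p, q) for q in frequent)
    }


SESSIONS = [["A", "B", "C"], ["A", "C"], ["A", "B", "C", "D"], ["B", "C"]]
freq = frequent_sequences(SESSIONS, min_support=0.75)
print(maximal(freq))
```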

Page 30: Web Mining

Queries ‘N Suggestions

References:
http://maya.cs.depaul.edu/~mobasher/webminer/survey/
Google.com/Technology
http://www.almaden.ibm.com/projects/clever.shtml

Thanks !! {akshatsaxena11, anjulsahu}@gmail.com