Web Mining - Sharifce.sharif.edu/courses/85-86/1/ce925/resources/root/class... · December 24, 2006...

Web Mining Kyumars Sheykh Esmaili Data Mining Course Sharif University of Technology Fall 2006


Web Mining

Kyumars Sheykh Esmaili

Data Mining Course

Sharif University of Technology

Fall 2006


December 24, 2006 Web Mining 2

Table of Contents
• Introduction
• Web Content Mining
  – Feature Selection and Similarity Measures
• Web Structure Mining
  – Web as Social Network
  – Features and Similarity Measures
  – Social Network Analysis Algorithms
    – PageRank
    – Cyber-Communities (HITS, CT)
• Web Content-Structure Clustering
• Web Usage Mining
• Some Concrete Applications of Web Mining
  – Focus Crawling
  – Web Search Result Clustering
• Summary


Introduction

Information overload on the Web: size
• 2001: new information created: 6 exabytes (1 exabyte = 10^18 bytes); 10 billion (non-spam) e-mail messages were sent per day.
• 2002: new information created: 12 exabytes.
• 2003: the public Internet contained about 1 trillion pages and was increasing at a rate of approximately 8 million pages per day.
• 2005: 35 billion messages per day.


Challenges on WWW Interactions

• Finding relevant information
• Creating knowledge from the information available
• Personalization of the information
• Learning about customers / individual users

Web Mining can play an important Role!


Introduction

Web mining: the use of data mining techniques to automatically discover and extract information from Web documents and services

Web mining research integrates research from several research communities:
• Database (DB)
• Information retrieval (IR)
• The sub-areas of machine learning (ML)
• Natural language processing (NLP)


Web Data

• Web pages
• Intra-page structures
• Inter-page structures
• Usage data
• Supplemental data
  – Profiles
  – Registration information
  – Cookies


Web Data Categories

[Figure: tree of Web data categories]

Web Data
• Content Data
  – Free Texts
  – HTML Files
  – XML Files
  – Dynamic Content
  – Multimedia
• Structure Data
  – Static Link
  – Dynamic Link
• Usage Data
• User Profile Data


Web Mining Taxonomy

Web Mining
• Web Structure Mining
• Web Content Mining
• Web C-S (Content-Structure) Mining
• Web Usage Mining


Web Mining: Subtasks

• Resource Finding
  – Task of retrieving the intended web documents
• Information Selection & Pre-processing
  – Automatic selection and pre-processing of specific information from the retrieved web resources
• Generalization
  – Automatic discovery of patterns in web sites
• Analysis
  – Validation and/or interpretation of the mined patterns


Feature Selection for Web Mining

For the purposes of automated text classification, text features should be:
• Relatively few in number
• Moderate in frequency of assignment
• Low in redundancy
• Low in noise
• Related in semantic scope to the classes to be assigned
• Relatively unambiguous in meaning


Feature Selection

Potential features:
• BODY
• META
• TITLE
• Snippet: the sentences attached to URL u where it appears in search results
• Anchor Window: the anchor text and the text around the hyperlink v->u in the source page v
• MT: the union of META and TITLE content
• BMT: the union of BODY, META and TITLE content


Feature Selection for Content Mining

[Figure: Percentage of Web Pages With Words in HTML Tags]


Feature Selection For Web Pages

[Figure: classification performance for various representations of web pages]


Vector Space Model for Content-Similarity

• IR systems usually adopt index terms to process queries
• Index term:
  – a keyword or group of selected words
  – any word (more general)
• Stemming might be used:
  – connect: connecting, connection, connections
• An inverted file is built for the chosen index terms
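As a rough illustration of the last two points, the sketch below stems tokens and builds an inverted file that maps each index term to the documents containing it. The suffix list, the `stem` and `build_inverted_file` helpers, and the three sample documents are all hypothetical; a real system would use a proper stemmer such as Porter's.

```python
from collections import defaultdict

# Hypothetical suffix list, for illustration only (not a real stemmer).
SUFFIXES = ("ions", "ion", "ing", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_file(docs):
    """Map each stemmed index term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


# Hypothetical mini-collection.
docs = {
    "d1": "connecting pages",
    "d2": "connection lost",
    "d3": "connections between pages",
}
index = build_inverted_file(docs)
print(sorted(index["connect"]))
print(sorted(index["page"]))
```

With this toy stemmer, "connecting", "connection" and "connections" all fall under the single index term "connect", which is the effect the slide describes.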


Vector Space Model - Basic Concepts

• ki is an index term
• dj is a document
• t is the number of index terms
• K = (k1, k2, …, kt) is the set of all index terms
• wij >= 0 is a weight associated with (ki, dj)
• wij = 0 indicates that the term does not belong to the document
• vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
• gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki, dj)


The Vector Space Model

Sim(dk,dj) = cos(Θ) = [vec(dk) • vec(dj)] / (|dk| * |dj|) = [Σ wik * wij] / (|dk| * |dj|)

Since wij >= 0 and wik >= 0, 0 <= Sim(dk,dj) <= 1

A document is retrieved even if it matches the target document's terms only partially

[Figure: document vectors dj and dk drawn in the term space, separated by the angle Θ]


The Vector Space Model: Example

[Figure: documents d1 to d7 placed among the index terms k1, k2 and k3]

      k1   k2   k3   q • dj   |dj|   Sim(dj,q)
d1     1    0    1      2     1.41     0.82
d2     1    0    0      1     1        0.58
d3     0    1    1      2     1.41     0.82
d4     1    0    0      1     1        0.58
d5     1    1    1      3     1.73     1
d6     1    1    0      2     1.41     0.82
d7     0    1    0      1     1        0.58

q      1    1    1      |q| = 1.73
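The table can be checked with a few lines of code. This is a minimal sketch of the cosine measure defined on the previous slide, applied to the binary weights of the example; the `cosine` helper and the dictionary layout are illustrative choices, not part of the original slides.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Binary weights over (k1, k2, k3), copied from the example table.
docs = {
    "d1": (1, 0, 1),
    "d2": (1, 0, 0),
    "d3": (0, 1, 1),
    "d4": (1, 0, 0),
    "d5": (1, 1, 1),
    "d6": (1, 1, 0),
    "d7": (0, 1, 0),
}
q = (1, 1, 1)

sims = {name: round(cosine(vec, q), 2) for name, vec in docs.items()}
for name in sorted(sims, key=sims.get, reverse=True):
    print(name, sims[name])
```

Sorting by similarity puts d5 first, since it is the only document that contains all three query terms.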


The Vector Space Model - Weighting

Sim(q,dj) = [Σ wij * wiq] / (|dj| * |q|)

How to compute the weights wij and wiq?

A good weight must take into account two effects:
• quantification of intra-document content (similarity)
  – tf factor, the term frequency within a document
• quantification of inter-document separation (dissimilarity)
  – idf factor, the inverse document frequency

wij = tf(i,j) * idf(i)


The Vector Model - Weighting

Example:
• A collection includes 10,000 documents
• The term A appears 20 times in a particular document
• The maximum appearance count of any term in this document is 50
• The term A appears in 2,000 of the collection's documents
• f(i,j) = freq(i,j) / max(freq(l,j)) = 20/50 = 0.4
• idf(i) = log(N/ni) = log(10,000/2,000) = log(5) = 2.32 (logarithm to base 2)
• wij = f(i,j) * log(N/ni) = 0.4 * 2.32 = 0.93
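The arithmetic of this example can be reproduced directly. The sketch below assumes a base-2 logarithm, which is what the value log(5) = 2.32 implies; the function name `tf_idf` and its parameter names are hypothetical.

```python
import math


def tf_idf(freq, max_freq, n_docs, doc_freq):
    """Weight wij = f(i,j) * idf(i), with normalized tf and a base-2 idf."""
    tf = freq / max_freq                 # f(i,j) = freq(i,j) / max(freq(l,j))
    idf = math.log2(n_docs / doc_freq)   # idf(i) = log(N / ni), base 2 assumed
    return tf * idf


# Numbers from the example: 20 occurrences, maximum 50, N = 10,000, ni = 2,000.
weight = tf_idf(freq=20, max_freq=50, n_docs=10_000, doc_freq=2_000)
print(round(weight, 2))
```

A different logarithm base only rescales every idf value by the same constant, so it does not change the ranking of documents.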


Social network analysis

• Social network analysis is the study of social entities (people in an organization, called actors) and their interactions and relationships.
• The interactions and relationships can be represented with a network or graph:
  – each vertex (or node) represents an actor, and
  – each link represents a relationship.
• From the network, we can study the properties of its structure, and the role, position and prestige of each social actor.
• We can also find various kinds of sub-graphs, e.g., communities formed by groups of actors.


Social network and the Web

• Social network analysis is useful for the Web because the Web is essentially a virtual society, and thus a virtual social network:
  – each page is a social actor, and
  – each hyperlink is a relationship.
• Many results from social network analysis can be adapted and extended for use in the Web context.


Web Structure Mining

• The Web consists not only of pages, but also of hyperlinks pointing from one page to another
• These hyperlinks contain an enormous amount of latent human annotation
• Assumption: a link from page A to page B is a recommendation of page B by A
• If A and B are connected by a link, there is a higher probability that they are on the same topic


Web Link Analysis

Used for:
• Ordering documents matching a user query: ranking
• Deciding what pages to add to a collection: crawling
• Page categorization
• Finding related pages
• Finding duplicated web sites


Structural Similarity Measures

We must define the similarity of two nodes

Method I:
• For page A and page B, A is related to B if there is a hyperlink from A to B, or from B to A
• Not so good: consider the home pages of IBM and Microsoft, which are related but unlikely to link to each other.

[Figure: Page A linked to Page B]


Structural Similarity Measures

Method II (from bibliometrics)
• Co-citation: the similarity of A and B is measured by the number of pages that cite both A and B
• Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B

[Figure: two diagrams of Page A and Page B, one illustrating co-citation and one illustrating bibliographic coupling]
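Both measures reduce to simple counts over the link graph. The sketch below assumes the graph is stored as a dictionary from each page to the set of pages it links to; the page names and the two helper functions are hypothetical.

```python
def cocitation(links, a, b):
    """Number of pages that link to (cite) both a and b."""
    return sum(1 for targets in links.values() if a in targets and b in targets)


def bibliographic_coupling(links, a, b):
    """Number of pages that are linked to (cited) by both a and b."""
    return len(links.get(a, set()) & links.get(b, set()))


# Hypothetical link graph: page -> set of pages it links to.
links = {
    "P1": {"A", "B"},
    "P2": {"A", "B"},
    "P3": {"A"},
    "A": {"X", "Y"},
    "B": {"Y", "Z"},
}

print(cocitation(links, "A", "B"))              # P1 and P2 cite both A and B
print(bibliographic_coupling(links, "A", "B"))  # A and B both cite Y
```

Neither measure requires a direct link between A and B, which is why they handle cases such as competing home pages better than Method I.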


Using link structure of web (cont.)

There are two famous link-structure-based algorithms for ranking:
• PageRank
• HITS

Nearly all other algorithms are based on these two:
• Salsa, Clever, ...


PageRank

• Introduced by Page et al. (1998)
• An offline algorithm (query independent)
• The weight of a page is assigned from the ranks of its parents (the pages that link to it)
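The slide gives no formula, so the following is only a sketch of the commonly used iterative formulation, in which each page splits its rank evenly among the pages it links to. The damping factor of 0.85, the uniform treatment of pages without out-links, the fixed iteration count and the three-page graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute ranks; each parent passes rank / out-degree to its children."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly (assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for parent, targets in links.items():
            if not targets:
                continue
            share = damping * rank[parent] / len(targets)
            for child in targets:
                new_rank[child] += share
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Because the computation never looks at a query, the ranks can be computed offline and reused, which matches the "query independent" point above.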


A Practical Example for PageRank


What is a cyber-community?

• A community on the web is a group of web pages sharing a common interest
  – e.g. a group of web pages talking about pop music
  – e.g. a group of web pages interested in data mining
• Main properties:
  – Pages in the same community should be similar to each other in content
  – The pages in one community should differ from the pages in another community
  – Similar to a cluster


Cyber Communities


Two different types of communities

Explicitly-defined communities
• They are well-known ones, such as the resources listed by Yahoo!
• e.g. a directory hierarchy [Figure: Arts, with children Music and Painting; Music, with children Classic and Pop]

Implicitly-defined communities
• They are communities unexpected by or invisible to most users
• e.g. the group of web pages interested in a particular singer


Different types of communities

• The explicit communities are easy to identify
  – e.g. Yahoo!, InfoSeek, Clever System
• In order to extract the implicit communities, we need to analyze the web graph objectively
• In research, people are more interested in the implicit communities


Methods of clustering

• Clustering methods based on co-citation analysis
• Methods derived from HITS (Kleinberg)
  – Using the co-citation matrix
• CT method


HITS: Hubs and Authorities

• Hub: a web page that links to a collection of prominent sites on a common topic
• Authority: a prominent web page on the topic; a web page pointed to by hubs
• Mutually reinforcing relationship: a good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities


Authority and Hubness

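[Figure: two diagrams. In one, pages 2, 3 and 4 point to page 1; in the other, page 1 points to pages 5, 6 and 7.]

x(1) = y(2) + y(3) + y(4)   (authority weight of page 1)
y(1) = x(5) + x(6) + x(7)   (hub weight of page 1)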


HITS Steps (1)

Creating root and base sets


HITS Steps (2)

Calculating Weights

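Authority weight: x(p) = Σ y(q), summed over all pages q that point to p

Hub weight: y(p) = Σ x(q), summed over all pages q that p points to

Matrix notation:
• A is the adjacency matrix
• A(i, j) = 1 if the i-th page points to the j-th page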
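A minimal sketch of the weight calculation follows, using x for authority weights and y for hub weights as on the "Authority and Hubness" slide: the authority weight of a page sums the hub weights of the pages pointing to it, and the hub weight sums the authority weights of the pages it points to. The normalization to unit length after each round, the fixed iteration count and the small graph are assumptions for illustration.

```python
import math


def normalize(weights):
    """Scale weights to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}


def hits(links, iterations=50):
    """Return (authority, hub) weights after a fixed number of update rounds."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # x(p): sum of hub weights y(q) over all pages q that point to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # y(p): sum of authority weights x(q) over all pages q that p points to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Hypothetical base set: three hub-like pages pointing to two candidate authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
auth, hub = hits(links)
print(max(auth, key=auth.get))
print(sorted(hub, key=hub.get, reverse=True)[:2])
```

In matrix terms the two update steps correspond to multiplying by the transpose of A for authorities and by A for hubs, which is why the adjacency matrix appears on the slide.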


Final Result of HITS


HITS Results – 3D perspective


A Practical Example for HITS


Difference between PageRank and HITS

• PageRank is computed for all web pages stored in the database, and thus prior to the query; HITS is performed on the set of retrieved web pages, for each query.
• HITS computes authorities and hubs; PageRank computes authorities only.
• PageRank: non-trivial to compute. HITS: easy to compute, but real-time execution is hard.


A cheaper method

• The previous methods are expensive
• There is another, simpler method called community trawling (CT)
• It has been run on a graph of 200 million pages, and it worked very well


Basic idea of CT

Definition of communities
• Dense directed bipartite subgraphs

Bipartite graph:
• Nodes are partitioned into two sets, F (fans) and C (centers)
• Every directed edge in the graph is directed from a node u in F to a node v in C
• The graph is dense if many of the possible edges between F and C are present

[Figure: fans F on one side pointing to centers C on the other]


Basic idea of CT

Bipartite cores
• A complete bipartite subgraph with at least i nodes from F and at least j nodes from C
• i and j are tunable parameters
• Such a subgraph is called an (i, j) bipartite core

Every community has such a core for some i and j.

[Figure: an (i=3, j=3) bipartite core]


Basic idea of CT

• A bipartite core is the identity of a community
• To extract all the communities is to enumerate all the bipartite cores on the web
• The authors devised an efficient algorithm to enumerate the bipartite cores. Its main idea is iterative pruning: elimination-generation pruning
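The slides do not spell out the pruning rules, so the sketch below covers only a plausible elimination half, under the (i, j) convention of the previous slide: in a complete core of i fans and j centers every fan points to at least j centers and every center is pointed to by at least i fans, so any edge whose endpoints fail those degree tests cannot belong to a core and can be dropped, repeatedly. The generation step and the final enumeration are not shown, and the edge list and function name are hypothetical.

```python
def eliminate(edges, i, j):
    """Repeatedly drop edges whose fan has out-degree < j or whose center has in-degree < i."""
    edges = set(edges)
    changed = True
    while changed:
        out_deg, in_deg = {}, {}
        for fan, center in edges:
            out_deg[fan] = out_deg.get(fan, 0) + 1
            in_deg[center] = in_deg.get(center, 0) + 1
        kept = {(f, c) for f, c in edges if out_deg[f] >= j and in_deg[c] >= i}
        changed = kept != edges
        edges = kept
    return edges


# Hypothetical graph: a (3, 3) core plus two stray edges.
core = {(f, c) for f in ("f1", "f2", "f3") for c in ("c1", "c2", "c3")}
edges = core | {("f4", "c1"), ("f1", "c9")}

survivors = eliminate(edges, i=3, j=3)
print(len(survivors))
```

Pruning of this kind only shrinks the graph; the surviving edges are candidates that still have to be checked for complete bipartite cores.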


Content Link Clustering

By CLC, each web page q in data set D is represented as 3 vectors:
• qOut
• qIn
• qKword
with M, N and L as the respective vector dimensions

The ith item of vector qOut indicates whether q has the ith of the M out-links (and likewise for qIn and the N in-links). If yes, the ith item is 1, else 0.

The kth item of vector qKword is the frequency with which the kth of the L terms appears in page q.


Similarity Measure

The similarity of two pages Q and R is the linear combination of three parts:

S(Q,R) = pout * S(Qout,Rout) + pin * S(Qin,Rin) + pterm * S(Qterm,Rterm)

pout + pin + pterm = 1

S(Qout,Rout) is defined as the cosine of the two out-link vectors.
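The combined measure can be sketched as follows. The slide defines only S(Qout,Rout) as a cosine; using the cosine for the in-link and term parts as well is an assumption here, as are the `clc_similarity` name, the dictionary layout and the two sample pages.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def clc_similarity(q, r, p_out, p_in, p_term):
    """Weighted sum of the out-link, in-link and term similarities of two pages."""
    if abs(p_out + p_in + p_term - 1.0) > 1e-9:
        raise ValueError("weighting factors must sum to 1")
    return (p_out * cosine(q["out"], r["out"])
            + p_in * cosine(q["in"], r["in"])
            + p_term * cosine(q["term"], r["term"]))


# Hypothetical pages: same out-links and terms, disjoint in-links.
Q = {"out": [1, 1, 0], "in": [1, 0], "term": [2, 0, 1]}
R = {"out": [1, 1, 0], "in": [0, 1], "term": [2, 0, 1]}

print(round(clc_similarity(Q, R, 0.0, 0.0, 1.0), 2))  # terms only
print(round(clc_similarity(Q, R, 0.5, 0.5, 0.0), 2))  # links only
print(round(clc_similarity(Q, R, 0.2, 0.3, 0.5), 2))  # content-link coupled
```

Changing the three factors shifts how much each kind of evidence contributes, without changing the underlying vectors.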


Tuning the similarity measure

• By varying the weighting factors pout, pin and pterm in the similarity formula, it is possible to study the effects of out-links, in-links and terms on the clustering process.
• The result of term-based clustering is rather coarse and usually consists of very general groups, which are totally different from each other from a semantic point of view.
  – E.g. for the topic “jaguar”, the “car” group and the “animal” group are two very general groups with very different semantic topics.


Tuning the similarity measure

• So, term-based clustering can only roughly separate pages into general semantic groups and fails to handle finer cases
  – such as “racing car” and “car driver club”, since both pages may include terms like “car”, “model”, etc.
• The main reasons for the poor “purity” of clusters produced by term-based clustering are:
  – Noise pages are included in clusters instead of being removed, since noise pages share some unimportant terms with other pages;
  – Pages on different finer topics (but the same general topic) are mixed together.


Tuning the similarity measure

• Hyperlinks represent the authors’ view of the relationships among Web pages
  – hyperlink-based clustering expresses the “association” of pages.
• Therefore, we could say that clusters produced by link-based clustering are of finer granularity.
• The problem with link-based clustering is that some similar pages (e.g. newly created pages) may not have enough co-citations/citations to be grouped together. That is to say, recall is somewhat low.


Tuning the similarity measure

“T”, “L” and “CLC” denote the term-based, link-based and content-link coupled clustering approaches, with (pout, pin, pterm) set to (0, 0, 1), (0.5, 0.5, 0) and (0.2, 0.3, 0.5) respectively.

Parameters are:
• Similarity threshold
• Weighting factors

The label of each cluster is identified automatically from the term vector of the cluster's centroid.


Content Link Mining


Web Usage Mining

• Web usage mining is also known as Web log mining
• Mining techniques to discover interesting usage patterns from the secondary data derived from the interactions of users while surfing the Web
• Including web log data, click-stream data, cookies, user queries, and any other data resulting from users' interaction with the Web


Web Usage Mining Applications

• Target potential customers for electronic commerce
• Enhance the quality and delivery of Internet information services to the end user
• Improve Web server system performance
• Identify potential prime advertisement locations
• Facilitate personalization/adaptive sites
• Improve site design
• Fraud/intrusion detection
• Predict user’s actions (allows prefetching)

Web Log Clustering Applications

Association rules
– Find pages that are often viewed together
Clustering
– Cluster users based on browsing patterns
– Cluster pages based on content

Server Logs

Fields

Client IP: 128.101.228.20
Authenticated User ID: - -
Time/Date: [10/Nov/1999:10:16:39 -0600]
Request: "GET / HTTP/1.0"
Status: 200
Bytes: -
Referrer: "-"
Agent: "Mozilla/4.61 [en] (WinNT; I)"
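
The fields above can be pulled out of a raw log line with a regular expression. The following is a minimal sketch, assuming the server writes the common "combined" log layout (host, two identity fields, time, request, status, bytes, referrer, agent); the names `LOG_PATTERN` and `parse_log_line` are illustrative and do not come from the slides.

```python
import re
from datetime import datetime

# Assumed layout: host ident authuser [time] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" in the bytes field means that no size was logged.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    # Month abbreviations are parsed with the default (English) locale.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

# The sample line is assembled from the field values shown on the slide.
sample = ('128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] "GET / HTTP/1.0" '
          '200 - "-" "Mozilla/4.61 [en] (WinNT; I)"')
entry = parse_log_line(sample)
print(entry["client_ip"], entry["status"], entry["request"])
```

Lines that do not match return `None`, so malformed entries can be counted or dropped in a later cleaning pass.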

WUM – Pre-Processing

Data Cleaning
– Removes log entries that are not needed for the mining process
Data Integration
– Synchronize data from multiple server logs
User Identification
– Associates page references with different users
Session/Episode Identification
– Groups user's page references into user sessions
Path Completion
– Fills in page references missing due to browser and proxy caching
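
Three of these steps (cleaning, user identification, session identification) can be sketched in a few lines. This is an illustration under stated assumptions rather than the method used in the course: users are identified by the pair (IP, agent), a session ends after 30 minutes of inactivity, and image/style requests are treated as noise. The timeout, the suffix list and all names are hypothetical choices.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)                      # assumed heuristic
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")   # assumed cleaning rule

def clean(entries):
    """Data cleaning: drop failed requests and embedded resources."""
    return [e for e in entries
            if e["status"] == 200 and not e["url"].lower().endswith(IGNORED_SUFFIXES)]

def sessionize(entries):
    """User and session identification: group by (ip, agent), split on long gaps."""
    by_user = defaultdict(list)
    for e in sorted(entries, key=lambda e: e["time"]):
        by_user[(e["ip"], e["agent"])].append(e)
    sessions = []
    for user_entries in by_user.values():
        current = [user_entries[0]]
        for prev, cur in zip(user_entries, user_entries[1:]):
            if cur["time"] - prev["time"] > SESSION_TIMEOUT:
                sessions.append([e["url"] for e in current])
                current = []
            current.append(cur)
        sessions.append([e["url"] for e in current])
    return sessions

# Hypothetical log entries, already parsed into dicts.
t0 = datetime(1999, 11, 10, 10, 0)
log = [
    {"ip": "10.0.0.1", "agent": "A", "time": t0, "url": "/index.html", "status": 200},
    {"ip": "10.0.0.1", "agent": "A", "time": t0 + timedelta(minutes=1), "url": "/logo.gif", "status": 200},
    {"ip": "10.0.0.1", "agent": "A", "time": t0 + timedelta(minutes=5), "url": "/products.html", "status": 200},
    {"ip": "10.0.0.2", "agent": "B", "time": t0 + timedelta(minutes=6), "url": "/index.html", "status": 200},
    {"ip": "10.0.0.2", "agent": "B", "time": t0 + timedelta(minutes=7), "url": "/missing.html", "status": 404},
    {"ip": "10.0.0.1", "agent": "A", "time": t0 + timedelta(hours=2), "url": "/index.html", "status": 200},
]
sessions = sessionize(clean(log))
print(sessions)
```

Path completion and multi-server integration are left out here, since they need the site's link structure and synchronized clocks.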

WUM – Association Rule Generation

Discovers the correlations between pages that are most often referenced together in a single server session
Provides information such as
– What is the set of pages frequently accessed together by Web users?
– What page will be fetched next?
– What paths are frequently accessed by Web users?
Association rule
– A ⇒ B [ Support = 60%, Confidence = 80% ]
Example
– "50% of visitors who accessed URLs /infor-f.html and labo/infos.html also visited situation.html"
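
Support and confidence can be computed directly from a list of sessions. The sketch below uses the usual definitions (support: fraction of sessions containing both sides of the rule; confidence: fraction of sessions containing the antecedent that also contain the consequent). The five sessions are made-up data built around the URLs of the example, and `rule_metrics` is an illustrative name.

```python
def rule_metrics(sessions, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent over sessions."""
    antecedent, consequent = set(antecedent), set(consequent)
    page_sets = [set(s) for s in sessions]
    n_antecedent = sum(1 for pages in page_sets if antecedent <= pages)
    n_both = sum(1 for pages in page_sets if (antecedent | consequent) <= pages)
    support = n_both / len(page_sets) if page_sets else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical sessions: 4 of 5 contain the antecedent, 2 of those also the consequent.
sessions = [
    ["/infor-f.html", "labo/infos.html", "situation.html"],
    ["/infor-f.html", "labo/infos.html"],
    ["/infor-f.html", "labo/infos.html", "situation.html"],
    ["situation.html"],
    ["/infor-f.html", "labo/infos.html"],
]
support, confidence = rule_metrics(
    sessions, ["/infor-f.html", "labo/infos.html"], ["situation.html"])
print(support, confidence)  # 0.4 0.5
```

A full rule miner would first enumerate frequent page sets (for example with Apriori) and then evaluate every candidate rule this way.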

WUM – Clustering

Groups together a set of items having similar characteristics
User clusters
– Discover groups of users exhibiting similar browsing patterns
Page recommendation
– User's partial session is classified into a single cluster
– The links contained in this cluster are recommended
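
The recommendation step can be sketched as follows, assuming that usage clusters have already been computed and that each cluster is summarized by the set of pages its users visited. Jaccard similarity is used here only as one plausible way to match a partial session to a cluster; the slides do not name a similarity measure, and the cluster contents are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of pages."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(partial_session, clusters):
    """Assign the partial session to its most similar cluster and
    recommend that cluster's pages the user has not visited yet."""
    best = max(clusters, key=lambda name: jaccard(partial_session, clusters[name]))
    visited = set(partial_session)
    return best, sorted(p for p in clusters[best] if p not in visited)

# Hypothetical clusters: cluster name -> pages typical for that group of users.
clusters = {
    "software": {"/products/software/", "/products/software/demos/", "/download.html"},
    "research": {"/papers/", "/people/", "/projects/"},
}
cluster, pages = recommend(["/products/software/", "/download.html"], clusters)
print(cluster, pages)  # software ['/products/software/demos/']
```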

Web Usage Clustering – Sample Results

Clients who often access /products/software/webminer.html tend to be from educational institutions.
Clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States.
75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends.

Focused Crawling

Only visit links from a page if that page is determined to be relevant.
Classifier is static after learning phase.
Components:
– Classifier, which assigns a relevance score to each page based on the crawl topic.
– Distiller, which identifies hub pages.
– Crawler, which visits pages based on classifier and distiller scores.

Focused Crawling

Classifier also determines how useful outgoing links are.
Hub pages contain links to many relevant pages. They must be visited even if their relevance score is not high.
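
The interplay of classifier, distiller and crawler can be illustrated on a toy in-memory web. Everything below is a simplified stand-in: the classifier is a keyword test, the distiller scores a page by the fraction of its out-links that lead to relevant pages (it peeks at the toy graph, whereas a real distiller would work on the portion of the link graph crawled so far), and the `WEB` dictionary is hypothetical data.

```python
import heapq

# Hypothetical toy web: url -> (page text, out-links).
WEB = {
    "seed": ("cycling news", ["hub", "cooking"]),
    "hub": ("links", ["bike-a", "bike-b"]),
    "bike-a": ("cycling gear", []),
    "bike-b": ("cycling routes", []),
    "cooking": ("pasta recipes", ["pasta"]),
    "pasta": ("pasta shapes", []),
}

def relevance(text, topic="cycling"):
    """Stand-in classifier: 1.0 if the topic word occurs in the page, else 0.0."""
    return 1.0 if topic in text.split() else 0.0

def hub_score(url):
    """Stand-in distiller: fraction of out-links that point to relevant pages."""
    links = WEB[url][1]
    if not links:
        return 0.0
    return sum(relevance(WEB[link][0]) for link in links) / len(links)

def focused_crawl(seed, threshold=0.5, max_pages=10):
    """Visit pages in priority order; expand only relevant pages and hubs."""
    frontier = [(-1.0, seed)]   # max-heap emulated with negated priorities
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        visited.append(url)
        score = max(relevance(text), hub_score(url))
        if score < threshold:
            continue            # irrelevant non-hub page: do not follow its links
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited

visited = focused_crawl("seed")
print(visited)
```

In this run the page "hub" has no topical text of its own but is still expanded because of its hub score, while the links of "cooking" are never followed, so "pasta" is not fetched.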

Focused Crawling

Motivation

In the web search context: organizing web pages (search results) into groups, so that different groups correspond to different user needs
– e.g., for the query "engine": search engine, car part, Engine Corp.
Why not other data mining techniques?

(1) Using Contents of Documents

Creating clusters based on snippets returned by web search engines.
Clusters based on snippets are almost as good as clusters created using the full text of Web documents.
Suffix Tree Clustering (STC): incremental, O(n) time algorithm
– Linear
– Incremental
– Overlapping
– Can be extended to hierarchical

STC algorithm

Step 1: Cleaning
– Stemming
– Sentence boundary identification
– Punctuation elimination
Step 2: Suffix tree construction
– Produces base clusters (internal nodes)
– Base clusters are scored based on size and phrase score (which depends on length and word "quality")
Step 3: Merging base clusters
– Highly overlapping clusters are merged
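
The three steps can be mimicked in a short sketch. It is not a real STC implementation: a dictionary of word n-grams replaces the suffix tree (so it is neither linear-time nor incremental), stemming and word "quality" weighting are skipped, the score is simply cluster size times phrase length, and the 0.5 overlap threshold is an assumed value. All function names and snippets are illustrative.

```python
import re
from collections import defaultdict
from itertools import combinations

def clean_snippet(snippet):
    """Step 1 (simplified): lowercase, split into sentences, strip punctuation."""
    sentences = re.split(r"[.!?]+", snippet.lower())
    return [re.findall(r"[a-z0-9]+", s) for s in sentences if s.strip()]

def base_clusters(snippets, max_len=4):
    """Step 2 (simplified): map each phrase to the documents that contain it."""
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        for words in clean_snippet(snippet):
            for i in range(len(words)):
                for j in range(i + 1, min(i + max_len, len(words)) + 1):
                    phrase_docs[tuple(words[i:j])].add(doc_id)
    # Keep phrases shared by at least two documents; score = size * phrase length.
    scored = [(len(docs) * len(p), p, docs)
              for p, docs in phrase_docs.items() if len(docs) >= 2]
    return sorted(scored, key=lambda c: (-c[0], c[1]))

def merge(clusters, overlap=0.5):
    """Step 3: merge base clusters whose document sets overlap heavily."""
    parent = list(range(len(clusters)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(range(len(clusters)), 2):
        docs_a, docs_b = clusters[a][2], clusters[b][2]
        common = len(docs_a & docs_b)
        if common / len(docs_a) > overlap and common / len(docs_b) > overlap:
            parent[find(a)] = find(b)
    merged = defaultdict(set)
    for i, (_, _, docs) in enumerate(clusters):
        merged[find(i)] |= docs
    return sorted(sorted(docs) for docs in merged.values())

# Hypothetical search-result snippets.
snippets = [
    "Jaguar cars for sale. Luxury jaguar cars.",
    "New jaguar cars reviewed.",
    "The jaguar is a big cat.",
    "A big cat in the jungle.",
]
result = merge(base_clusters(snippets))
print(result)  # [[0, 1, 2], [2, 3]]
```

Document 2 ends up in both final clusters, which illustrates the "overlapping" property listed for STC.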

(2) Using user’s usage logs

Advantage: relevancy information is objectively reflected by the usage logs
An experimental result on www.nasa.gov/
Cluster 1: /shuttle/missions/41-c/news, /shuttle/missions/61-b…
Cluster 2: /history/apollo/sa-2/news/, /history/apollo/sa-2/images…
Cluster 3: /software/winvn/userguide/3_3_2.htm, /software/winvn/userguide/3_3_4.htm…
…

(3) Using hyperlinks

For each URL P in the search results R, we extract all of its out-links as well as its top n in-links, using AltaVista's services.
This yields N distinct out-links and M distinct in-links over all URLs in R.
Each page P in R (result set) is represented as 2 vectors:
– P_Out (N-dimensional)
– P_In (M-dimensional)
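
The two vectors can be built as binary indicator vectors over the distinct out-links and in-links of the result set. The sketch below does this for hypothetical link data; comparing pages with cosine similarity is an assumption made for illustration, since the similarity measure itself is not recoverable from this text.

```python
import math

def link_vectors(pages):
    """pages: url -> {"out": set of out-links, "in": set of in-links}.
    Returns url -> (P_Out, P_In) as binary lists over the N distinct
    out-links and M distinct in-links found across the result set."""
    out_dims = sorted(set().union(*(p["out"] for p in pages.values())))
    in_dims = sorted(set().union(*(p["in"] for p in pages.values())))
    return {
        url: ([1 if d in p["out"] else 0 for d in out_dims],
              [1 if d in p["in"] else 0 for d in in_dims])
        for url, p in pages.items()
    }

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical result set with made-up link targets and sources.
pages = {
    "p1": {"out": {"a", "b"}, "in": {"x"}},
    "p2": {"out": {"b", "c"}, "in": {"x", "y"}},
    "p3": {"out": {"d"}, "in": {"z"}},
}
vectors = link_vectors(pages)
out_similarity = cosine(vectors["p1"][0], vectors["p2"][0])
in_similarity = cosine(vectors["p1"][1], vectors["p2"][1])
print(out_similarity, in_similarity)
```

Here N = 4 and M = 3; pages p1 and p2 share one out-link and one in-link, while p3 shares nothing with either.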

(3) Using Hyperlinks: continued

(3) Using Hyperlinks: continued

Concerns on current methods

Each method has pros and cons
– Using hyperlinks: the best accuracy, with still some room to improve
– STC: best for browsing and for incrementality

Sample systems

Scatter/Gather
Grouper
Carrot2
Vivisimo
Mapuccino
SHOC

Grouper

Online
Operates on query result snippets
Clusters together documents with large common subphrases
Suffix Tree Clustering (STC)
STC induces labeling

Summary

Web Mining
  Web Structure Mining
  Web Content Mining
    Web Page Content Mining
    Search Result Mining
  Web Usage Mining
    General Access Pattern Tracking
    Customized Usage Tracking

Thank You