Web Mining - Sharifce.sharif.edu/courses/85-86/1/ce925/resources/root/class... · December 24, 2006...


Web Mining

Kyumars Sheykh Esmaili

Data Mining Course, Sharif University of Technology

Fall 2006

December 24, 2006 Web Mining 2

Table of Contents
• Introduction
• Web Content Mining
  – Feature Selection and Similarity Measures
• Web Structure Mining
  – Web as Social Network
  – Features and Similarity Measures
  – Social Network Analysis Algorithms: PageRank
  – Cyber-Communities: HITS, CT
• Web Content-Structure Clustering
• Web Usage Mining
• Some Concrete Applications of Web Mining
  – Focused Crawling
  – Web Search Result Clustering
• Summary


Introduction

Information overload on the Web: size
• 2001: 6 exabytes (1 exabyte = 10^18 bytes) of new information created; 10 billion (non-spam) e-mail messages sent per day.
• 2002: 12 exabytes of new information created.
• 2003: the public Internet contained about 1 trillion pages and was growing at a rate of approximately 8 million pages per day.
• 2005: 35 billion messages per day.

Challenges on WWW Interactions
• Finding relevant information
• Creating knowledge from the information available
• Personalization of the information
• Learning about customers / individual users

Web mining can play an important role!

Introduction
• Web mining: the use of data mining techniques to automatically discover and extract information from Web documents and services
• Web mining research integrates research from several research communities:
  – Database (DB)
  – Information retrieval (IR)
  – The sub-areas of machine learning (ML)
  – Natural language processing (NLP)

Web Data
• Web pages
• Intra-page structures
• Inter-page structures
• Usage data
• Supplemental data
  – Profiles
  – Registration information
  – Cookies

Web Data Categories

Web data
• Content data
  – Free texts
  – HTML files
  – XML files
  – Dynamic content
  – Multimedia
• Structure data
  – Static link
  – Dynamic link
• Usage data
• User profile data

Web Mining Taxonomy

Web mining
• Web structure mining
• Web content mining
• Web content-structure (C-S) mining
• Web usage mining

Web Mining: Subtasks
• Resource finding: retrieving the intended Web documents
• Information selection and pre-processing: automatically selecting and pre-processing specific information from the retrieved Web resources
• Generalization: automatic discovery of patterns in Web sites
• Analysis: validation and/or interpretation of the mined patterns


Feature Selection for Web Mining

For the purposes of automated text classification, text features should be:
• Relatively few in number
• Moderate in frequency of assignment
• Low in redundancy
• Low in noise
• Related in semantic scope to the classes to be assigned
• Relatively unambiguous in meaning

Feature Selection

Potential features:
• BODY
• META
• TITLE
• Snippet: the sentences shown with URL u when it appears in search results
• Anchor window: the anchor text and the text around the hyperlink v->u in the source page v
• MT: the union of META and TITLE content
• BMT: the union of BODY, META and TITLE content

Feature Selection for Content Mining

[Figure: Percentage of Web Pages With Words in HTML Tags]

Feature Selection For Web Pages

[Figure: Classification performance for various representations of web pages]

Vector Space Model for Content-Similarity
• IR systems usually adopt index terms to process queries
• Index term:
  – a keyword or group of selected words
  – any word (more general)
• Stemming might be used, e.g. connect: connecting, connection, connections
• An inverted file is built for the chosen index terms

Vector Space Model - Basic Concepts
• $k_i$ is an index term
• $d_j$ is a document
• $t$ is the number of index terms
• $K = (k_1, k_2, \ldots, k_t)$ is the set of all index terms
• $w_{ij} \ge 0$ is a weight associated with the pair $(k_i, d_j)$
• $w_{ij} = 0$ indicates that term $k_i$ does not belong to document $d_j$
• $\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{tj})$ is the weighted vector associated with document $d_j$
• $g_i(\vec{d_j}) = w_{ij}$ is a function that returns the weight associated with the pair $(k_i, d_j)$

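The Vector Space Model

$$\mathrm{sim}(d_k, d_j) = \cos\Theta = \frac{\vec{d_k} \cdot \vec{d_j}}{|\vec{d_k}|\,|\vec{d_j}|} = \frac{\sum_{i=1}^{t} w_{ik}\, w_{ij}}{|\vec{d_k}|\,|\vec{d_j}|}$$

• Since $w_{ij} \ge 0$ and $w_{ik} \ge 0$, we have $0 \le \mathrm{sim}(d_k, d_j) \le 1$
• A document is retrieved even if it matches the target document's terms only partially

[Figure: document vectors $d_j$ and $d_k$ separated by the angle Θ]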

The Vector Space Model: Example

[Figure: documents d1 to d7 placed among the index terms k1, k2 and k3]

      k1  k2  k3   q • dj   |dj|   Sim(dj,q)
d1     1   0   1      2     1.41     0.82
d2     1   0   0      1     1        0.58
d3     0   1   1      2     1.41     0.82
d4     1   0   0      1     1        0.58
d5     1   1   1      3     1.73     1
d6     1   1   0      2     1.41     0.82
d7     0   1   0      1     1        0.58

q      1   1   1     |q| = 1.73
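The similarity column of this example can be recomputed with a few lines of code. The following is a minimal sketch, not part of the original slides; the function name `cosine` and the dictionary layout are illustrative choices.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Binary term weights (k1, k2, k3) taken from the example table.
docs = {
    "d1": (1, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 1), "d4": (1, 0, 0),
    "d5": (1, 1, 1), "d6": (1, 1, 0), "d7": (0, 1, 0),
}
q = (1, 1, 1)

# Similarity of every document to the query, rounded as in the table.
sims = {name: round(cosine(vec, q), 2) for name, vec in docs.items()}
```

Sorting `sims` by value would give the ranking a retrieval system returns for q.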

The Vector Space Model - Weighting

$$\mathrm{sim}(q, d_j) = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{|\vec{d_j}|\,|\vec{q}|}$$

How should the weights $w_{ij}$ and $w_{iq}$ be computed? A good weight must take two effects into account:
• Quantification of intra-document content (similarity): the tf factor, the term frequency within a document
• Quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency

$$w_{ij} = tf(i,j) \cdot idf(i)$$

The Vector Model - Weighting

Example:
• A collection includes 10,000 documents
• The term A appears 20 times in a particular document
• The maximum number of appearances of any term in this document is 50
• The term A appears in 2,000 of the collection's documents
• $f(i,j) = freq(i,j) / \max_l freq(l,j) = 20/50 = 0.4$
• $idf(i) = \log_2(N/n_i) = \log_2(10{,}000/2{,}000) = \log_2 5 \approx 2.32$
• $w_{ij} = f(i,j) \cdot \log_2(N/n_i) = 0.4 \times 2.32 \approx 0.93$
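The arithmetic of this example can be checked directly. This is a small sketch under one assumption: the logarithm is taken to base 2, which is the base that reproduces the value 2.32 given for log(5). The function name `tf_idf` and its parameter names are illustrative.

```python
import math

def tf_idf(freq, max_freq, n_docs, doc_freq):
    """Weight of a term in a document: normalized tf times idf.

    Assumption: base-2 logarithm, matching the 2.32 in the example.
    """
    tf = freq / max_freq                 # f(i,j) = freq(i,j) / max freq(l,j)
    idf = math.log2(n_docs / doc_freq)   # idf(i) = log(N / n_i)
    return tf * idf

w = tf_idf(freq=20, max_freq=50, n_docs=10_000, doc_freq=2_000)
```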


Social network analysis
• Social network analysis is the study of social entities (people in an organization, called actors) and their interactions and relationships.
• The interactions and relationships can be represented with a network or graph: each vertex (or node) represents an actor and each link represents a relationship.
• From the network, we can study the properties of its structure, and the role, position and prestige of each social actor.
• We can also find various kinds of sub-graphs, e.g., communities formed by groups of actors.

Social network and the Web
• Social network analysis is useful for the Web because the Web is essentially a virtual society, and thus a virtual social network: each page is a social actor and each hyperlink is a relationship.
• Many results from social network analysis can be adapted and extended for use in the Web context.

Web Structure Mining
• The Web consists not only of pages, but also of hyperlinks pointing from one page to another
• These hyperlinks contain an enormous amount of latent human annotation
• Assumption: a link from page A to page B is a recommendation of page B by A
• If A and B are connected by a link, there is a higher probability that they are on the same topic

Web Link Analysis

Used for:
• Ordering documents matching a user query: ranking
• Deciding what pages to add to a collection: crawling
• Page categorization
• Finding related pages
• Finding duplicated web sites


Structural Similarity Measures

We must define the similarity of two nodes.

Method I:
• For page A and page B, A is related to B if there is a hyperlink from A to B, or from B to A
• Not so good: consider the home pages of IBM and Microsoft, which are related but need not link to each other

[Figure: Page A linked to Page B]

Structural Similarity Measures

Method II (from bibliometrics):
• Co-citation: the similarity of A and B is measured by the number of pages that cite both A and B
• Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B

[Figures: co-citation and bibliographic coupling between Page A and Page B]
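Both measures reduce to counting shared neighbours in the link graph. The sketch below uses a made-up toy graph; the page names and the `links` dictionary (page to the set of pages it links to) are hypothetical.

```python
# Hypothetical toy graph: page -> set of pages it links to (cites).
links = {
    "P1": {"A", "B"},
    "P2": {"A", "B"},
    "P3": {"A"},
    "A": {"X", "Y"},
    "B": {"Y", "Z"},
}

def co_citation(links, a, b):
    """Number of pages that cite both a and b."""
    return sum(1 for targets in links.values() if a in targets and b in targets)

def bibliographic_coupling(links, a, b):
    """Number of pages cited by both a and b."""
    return len(links.get(a, set()) & links.get(b, set()))

cocited = co_citation(links, "A", "B")              # P1 and P2 cite both
coupled = bibliographic_coupling(links, "A", "B")   # both cite Y
```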


Using link structure of web (cont.)
• There are two famous link-structure-based algorithms for ranking:
  – PageRank
  – HITS
• Nearly all other algorithms are based on these:
  – Salsa, Clever, ...

PageRank
• Introduced by Page et al. (1998)
• An offline algorithm (query independent)
• The weight of a page is assigned from the ranks of its parents (the pages that link to it)
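The idea that a page inherits weight from its parents can be sketched as a power iteration. The slides do not give the formula, so the following assumes the commonly used formulation with a damping factor of 0.85, uniform teleportation, and dangling pages spreading their rank over all pages; the graph `toy` and all names are illustrative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of linked pages.

    Assumptions (not from the slides): damping factor 0.85, uniform
    teleportation, dangling pages distribute their rank uniformly.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Each parent splits its rank evenly among its out-links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank over every page.
                share = damping * rank[p] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy)
```

In this toy graph C is linked from both A and B, so it ends up ranked above B, which has a single parent that splits its vote.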


A Practical Example for PageRank


What is a cyber-community?
• A community on the Web is a group of web pages sharing a common interest
  – E.g. a group of web pages talking about pop music
  – E.g. a group of web pages interested in data mining
• Main properties:
  – Pages in the same community should be similar to each other in content
  – The pages in one community should differ from the pages in another community
  – Similar to a cluster


Cyber Communities

Two different types of communities
• Explicitly-defined communities: well-known ones, such as the resources listed by Yahoo!
  – E.g. a directory hierarchy: Arts > Music (Classic, Pop), Painting
• Implicitly-defined communities: communities that are unexpected or invisible to most users
  – E.g. the group of web pages interested in a particular singer

Different types of communities
• The explicit communities are easy to identify
  – E.g. Yahoo!, InfoSeek, Clever System
• To extract the implicit communities, we need to analyze the web graph objectively
• In research, people are more interested in the implicit communities

Methods of clustering
• Clustering methods based on co-citation analysis
• Methods derived from HITS (Kleinberg)
• Using the co-citation matrix
• CT method


HITS: Hubs and Authorities
• Hub: a web page that links to a collection of prominent sites on a common topic
• Authority: a prominent page on a topic; a web page pointed to by many hubs
• Mutually reinforcing relationship: a good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities

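Authority and Hubness

[Figure: two link diagrams, page 1 with pages 2, 3 and 4, and page 1 with pages 5, 6 and 7]

With x(p) the authority weight and y(p) the hub weight of page p:

$$x(1) = y(2) + y(3) + y(4), \qquad y(1) = x(5) + x(6) + x(7)$$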


HITS Steps (1)

Creating root and base sets

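HITS Steps (2)

Calculating weights
• Authority weight: $x(p) = \sum_{q \to p} y(q)$
• Hub weight: $y(p) = \sum_{p \to q} x(q)$
• Matrix notation: A is the adjacency matrix, with $A(i,j) = 1$ if the i-th page points to the j-th page (0 otherwise), so that $x = A^{T} y$ and $y = A x$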
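The weight calculation can be sketched as an iteration that alternates the two updates and normalizes after each round. This is an illustrative sketch rather than the slides' implementation: the fixed iteration count and the Euclidean normalization are assumptions, and the toy graph takes pages 2, 3 and 4 to point to page 1 and page 1 to point to pages 5, 6 and 7, which is one reading of the "Authority and Hubness" diagram.

```python
import math

def hits(links, iterations=50):
    """Iterative HITS over a dict: page -> list of pages it points to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub weights of pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Hub update: sum of authority weights of pages p points to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors to unit Euclidean length (assumed choice).
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub

# Assumed toy graph: 2, 3, 4 -> 1 and 1 -> 5, 6, 7.
toy = {2: [1], 3: [1], 4: [1], 1: [5, 6, 7]}
auth, hub = hits(toy)
```

Page 1 receives three links and so gets the largest authority weight, while pages 5, 6 and 7 link to nothing and get a hub weight of zero.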

Final Result of HITS

HITS Results – 3D perspective

A Practical Example for HITS

Difference between PageRank and HITS
• PageRank is computed for all web pages stored in the database, prior to any query; HITS is performed on the set of retrieved web pages, for each query.
• HITS computes authorities and hubs; PageRank computes authorities only.
• PageRank is non-trivial to compute; HITS is easy to compute, but real-time execution is hard.


A cheaper method
• The previous methods are expensive
• There is another, simpler method called community trawling (CT)
• It has been run on a graph of 200 million pages, where it worked very well

Basic idea of CT
• Definition of communities: dense directed bipartite subgraphs
• Bipartite graph: the nodes are partitioned into two sets, F (fans) and C (centers)
• Every directed edge in the graph is directed from a node u in F to a node v in C
• The graph is dense if many of the possible edges between F and C are present

[Figure: fans (F) pointing to centers (C)]

Basic idea of CT
• Bipartite core: a complete bipartite subgraph with at least i nodes from F and at least j nodes from C
• i and j are tunable parameters
• Such a subgraph is called an (i, j) bipartite core
• Every community has such a core for a certain i and j

[Figure: an (i=3, j=3) bipartite core]

Basic idea of CT
• A bipartite core is the identity of a community
• Extracting all the communities amounts to enumerating all the bipartite cores on the Web
• The authors devised an efficient algorithm to enumerate the bipartite cores; its main idea is iterative pruning (elimination-generation pruning)
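The elimination side of the pruning can be illustrated with degree bounds: in an (i, j) core every fan points to at least j centers and every center is pointed to by at least i fans, so edges violating either bound can be dropped repeatedly. This is a simplified sketch of that idea only, not the full enumeration algorithm; the function names and the toy edge set are hypothetical.

```python
def is_bipartite_core(edges, fans, centers):
    """True if every fan links to every center (complete bipartite)."""
    return all((f, c) in edges for f in fans for c in centers)

def prune(edges, i, j):
    """Repeatedly drop edges that cannot belong to any (i, j) core.

    A fan needs out-degree >= j and a center needs in-degree >= i.
    """
    edges = set(edges)
    changed = True
    while changed:
        out_deg, in_deg = {}, {}
        for u, v in edges:
            out_deg[u] = out_deg.get(u, 0) + 1
            in_deg[v] = in_deg.get(v, 0) + 1
        kept = {(u, v) for u, v in edges if out_deg[u] >= j and in_deg[v] >= i}
        changed = kept != edges
        edges = kept
    return edges

# Toy graph: a (3, 3) core plus two stray edges that should be pruned.
fans, centers = ["f1", "f2", "f3"], ["c1", "c2", "c3"]
core_edges = {(f, c) for f in fans for c in centers}
graph = core_edges | {("x", "c1"), ("f1", "y")}
pruned = prune(graph, i=3, j=3)
```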


Content Link Clustering
• In CLC, each web page q in data set D is represented as 3 vectors, qOut, qIn and qKword, with M, N and L as the vector dimensions respectively
• The ith item of vector qOut indicates whether q has the ith of the M out-links (and likewise for qIn and the in-links): 1 if yes, 0 otherwise
• The kth item of vector qKword is the frequency with which the kth of the L terms appears in page q

Similarity Measure

The similarity of two pages Q and R is a linear combination of three parts:

$$p_{out}\,S(Q_{out},R_{out}) + p_{in}\,S(Q_{in},R_{in}) + p_{term}\,S(Q_{term},R_{term})$$

$$p_{out} + p_{in} + p_{term} = 1$$

S(Qout, Rout) is defined as the cosine of the two out-link vectors.
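The linear combination can be sketched as follows, assuming that all three parts use the cosine measure (the text states this only for the out-link part). The dictionary layout of a page, the example vectors, and the weights passed in are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def clc_similarity(q, r, p_out, p_in, p_term):
    """Weighted sum of out-link, in-link and term similarities."""
    if abs(p_out + p_in + p_term - 1.0) > 1e-9:
        raise ValueError("weighting factors must sum to 1")
    return (p_out * cosine(q["out"], r["out"])
            + p_in * cosine(q["in"], r["in"])
            + p_term * cosine(q["term"], r["term"]))

# Hypothetical pages: same out-links and terms, disjoint in-links.
page_q = {"out": [1, 1, 0], "in": [1, 0], "term": [2, 0, 1]}
page_r = {"out": [1, 1, 0], "in": [0, 1], "term": [2, 0, 1]}
sim = clc_similarity(page_q, page_r, p_out=0.2, p_in=0.3, p_term=0.5)
```

Setting one weight to 1 and the others to 0 reduces the measure to purely link-based or purely term-based similarity.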

Tuning the similarity measure
• By varying the weighting factors pout, pin and pterm, it is possible to study the effects of out-links, in-links and terms on the clustering process.
• The results of term-based clustering are rather coarse and usually consist of very general groups that are completely different from each other from a semantic point of view.
• E.g. for the topic “jaguar”, the “car” group and the “animal” group are two very general groups with very different semantic topics.

Tuning the similarity measure
• So term-based clustering can only roughly separate pages into general semantic groups and fails to handle finer cases, such as “racing car” versus “car driver club”, since both pages may include terms like “car”, “model”, etc.
• The main reasons for the poor “purity” of clusters produced by term-based clustering are:
  – Noise pages are included in clusters instead of being removed, since noise pages share some unimportant terms with other pages
  – Pages on different finer topics (but the same general topic) are mixed together

Tuning the similarity measure
• Hyperlinks represent the authors’ view of the relationships among Web pages
• Hyperlink-based clustering expresses the “association” of pages
• Therefore, we could say that clusters produced by link-based clustering have a finer granularity
• The problem with link-based clustering is that some similar pages (e.g. newly created pages) may not have enough co-citations/citations to be grouped together; that is to say, recall is somewhat low

Tuning the similarity measure
• “T”, “L” and “CLC” denote the following clustering approaches respectively:
  – terms-based, with (pout, pin, pterm) = (0, 0, 1)
  – link-based, with (pout, pin, pterm) = (0.5, 0.5, 0)
  – contents-link coupled, with (pout, pin, pterm) = (0.2, 0.3, 0.5)
• Parameters are:
  – Similarity threshold
  – Weighting factors
• The label of each cluster is identified automatically from the term vector of the cluster's centroid


Content Link Mining


Web Usage Mining
• Web usage mining is also known as Web log mining
• It applies mining techniques to discover interesting usage patterns from the secondary data derived from the interactions of users while surfing the Web
• This data includes:
  – web log data, click-stream data, cookies, user queries, and any other data resulting from users’ interaction with the Web

Web Usage Mining Applications
• Target potential customers for electronic commerce
• Enhance the quality and delivery of Internet information services to the end user
• Improve Web server system performance
• Identify potential prime advertisement locations
• Facilitate personalization / adaptive sites
• Improve site design
• Fraud / intrusion detection
• Predict user’s actions (allows prefetching)

Web Log Clustering Applications
• Association rules
  – Find pages that are often viewed together
• Clustering
  – Cluster users based on browsing patterns
  – Cluster pages based on content

Server Logs

Fields
• Client IP: 128.101.228.20
• Authenticated User ID: - -
• Time/Date: [10/Nov/1999:10:16:39 -0600]
• Request: "GET / HTTP/1.0"
• Status: 200
• Bytes: -
• Referrer: "-"
• Agent: "Mozilla/4.61 [en] (WinNT; I)"
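These fields can be pulled out of a raw log line with a regular expression. The sketch below assumes the fields appear in the order listed, on a single line, in the layout of the widely used combined log format; the sample `line` is assembled from the field values above and is not quoted from an actual log file. The pattern and function names are illustrative.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [time] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

# Sample line assembled from the field values on the slide (assumption).
line = ('128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] '
        '"GET / HTTP/1.0" 200 - "-" "Mozilla/4.61 [en] (WinNT; I)"')
entry = parse_line(line)
```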

WUM – Pre-Processing
• Data cleaning: removes log entries that are not needed for the mining process
• Data integration: synchronizes data from multiple server logs
• User identification: associates page references with different users
• Session/episode identification: groups a user’s page references into user sessions
• Path completion: fills in page references missing due to browser and proxy caching
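Session identification is often done with a time-out heuristic: a new session starts when a user has been idle for longer than some threshold. The slides do not specify a rule, so the 30-minute time-out below is an assumption, as are the function name and the `(user, timestamp, url)` tuple layout.

```python
def sessionize(hits, timeout=30 * 60):
    """Group (user, timestamp_seconds, url) hits into per-user sessions.

    Assumption: a gap longer than `timeout` seconds starts a new session.
    Returns a list of (user, [urls]) in user and time order.
    """
    sessions = []
    last_seen = {}
    current = {}
    for user, ts, url in sorted(hits, key=lambda h: (h[0], h[1])):
        if user not in current or ts - last_seen[user] > timeout:
            current[user] = []
            sessions.append((user, current[user]))
        current[user].append(url)
        last_seen[user] = ts
    return sessions

# Hypothetical hits: u1 pauses for more than 30 minutes before "/c".
hits = [("u1", 0, "/a"), ("u2", 100, "/a"), ("u1", 600, "/b"), ("u1", 5000, "/c")]
sessions = sessionize(hits)
```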

WUM – Association Rule Generation
• Discovers the correlations between pages that are most often referenced together in a single server session
• Provides information such as:
  – What sets of pages are frequently accessed together by Web users?
  – What page will be fetched next?
  – What paths are frequently accessed by Web users?
• Association rule: A ⇒ B [Support = 60%, Confidence = 80%]
• Example: “50% of visitors who accessed URLs /infor-f.html and labo/infos.html also visited situation.html”
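Support and confidence of a rule A ⇒ B can be computed by counting sessions. A minimal sketch with made-up sessions follows; support is taken as the fraction of sessions containing both A and B, and confidence as the fraction of sessions containing A that also contain B. The URLs and function name are hypothetical.

```python
def support_confidence(sessions, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(sessions)
    n_a = sum(1 for s in sessions if antecedent <= set(s))
    n_ab = sum(1 for s in sessions if (antecedent | consequent) <= set(s))
    support = n_ab / n if n else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

# Hypothetical sessions, each a list of visited URLs.
sessions = [
    ["/a", "/b"], ["/a", "/b"], ["/a", "/b"], ["/a", "/c"], ["/c"],
]
support, confidence = support_confidence(sessions, ["/a"], ["/b"])
```

Here three of five sessions contain both pages (support 60%) and three of the four sessions containing /a also contain /b (confidence 75%).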

WUM – Clustering
• Groups together a set of items having similar characteristics
• User clusters
  – Discover groups of users exhibiting similar browsing patterns
• Page recommendation
  – The user’s partial session is classified into a single cluster
  – The links contained in this cluster are recommended

Web Usage Clustering – Sample Results
• Clients who often access /products/software/webminer.html tend to be from educational institutions.
• Clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States.
• 75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends.


Focused Crawling
• Only visit links from a page if that page is determined to be relevant.
• The classifier is static after the learning phase.
• Components:
  – Classifier, which assigns a relevance score to each page based on the crawl topic.
  – Distiller, which identifies hub pages.
  – Crawler, which visits pages based on the classifier and distiller scores.

Focused Crawling
• The classifier also determines how useful outgoing links are.
• Hub pages contain links to many relevant pages; they must be visited even if they do not have a high relevance score.
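The crawling policy can be sketched with a priority queue over an in-memory graph. This is a deliberately simplified illustration: precomputed `relevance` scores stand in for the classifier, the distiller (hub detection) is left out, and the relevance threshold, graph and page names are hypothetical.

```python
import heapq

def focused_crawl(graph, relevance, seeds, threshold=0.5, max_pages=10):
    """Visit pages best-first; follow links only from relevant pages.

    `relevance` is a stand-in for the classifier's score (assumption);
    no distiller is modelled here.
    """
    frontier = [(-relevance.get(s, 0.0), s) for s in seeds]
    heapq.heapify(frontier)
    visited, order = set(), []
    while frontier and len(order) < max_pages:
        neg_score, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        if -neg_score < threshold:
            continue  # irrelevant page: do not follow its links
        for nxt in graph.get(page, []):
            if nxt not in visited:
                heapq.heappush(frontier, (-relevance.get(nxt, 0.0), nxt))
    return order

graph = {"seed": ["good", "bad"], "good": ["deep"], "bad": ["hidden"]}
relevance = {"seed": 0.9, "good": 0.8, "bad": 0.1, "deep": 0.7, "hidden": 0.9}
order = focused_crawl(graph, relevance, ["seed"])
```

The page "hidden" is never reached because its only parent is judged irrelevant, which is the situation a distiller is meant to mitigate for hub pages.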


Focused Crawling


Motivation
• In the web search context: organizing web pages (search results) into groups, so that different groups correspond to different user needs
• E.g. for the query “engine”: search engine, car part, Engine Corp.
• Why not other data mining techniques?

(1) Using Contents of Documents
• Creating clusters based on the snippets returned by web search engines.
• Clusters based on snippets are almost as good as clusters created using the full text of Web documents.
• Suffix Tree Clustering (STC): an incremental, O(n)-time algorithm
  – Linear
  – Incremental
  – Overlapping
  – Can be extended to hierarchical

STC algorithm
• Step 1: Cleaning
  – Stemming
  – Sentence boundary identification
  – Punctuation elimination
• Step 2: Suffix tree construction
  – Produces base clusters (internal nodes)
  – Base clusters are scored based on size and phrase score (which depends on length and word “quality”)
• Step 3: Merging base clusters
  – Highly overlapping clusters are merged
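Step 3 can be illustrated in isolation. The sketch below assumes the base clusters are already available as sets of document ids keyed by their shared phrase, and treats two base clusters as highly overlapping when their common documents make up more than half of each; that 0.5 threshold and the grouping by connected components are assumptions here, not details given on the slide. No suffix tree is built.

```python
def merge_base_clusters(base_clusters, threshold=0.5):
    """Merge base clusters (phrase -> non-empty set of doc ids) that overlap.

    Assumption: two clusters are linked when the shared documents exceed
    `threshold` of each; linked clusters are merged transitively.
    """
    names = list(base_clusters)
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            docs_a, docs_b = base_clusters[a], base_clusters[b]
            common = len(docs_a & docs_b)
            if common / len(docs_a) > threshold and common / len(docs_b) > threshold:
                parent[find(a)] = find(b)

    merged = {}
    for n in names:
        merged.setdefault(find(n), set()).update(base_clusters[n])
    return sorted(sorted(docs) for docs in merged.values())

# Hypothetical base clusters keyed by shared phrase.
base = {"jaguar car": {1, 2, 3}, "car dealer": {2, 3, 4}, "big cat": {5, 6}}
clusters = merge_base_clusters(base)
```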

(2) Using users’ usage logs
• Advantage: relevancy information is objectively reflected by the usage logs
• An experimental result on www.nasa.gov/:

Cluster 1: /shuttle/missions/41-c/news, /shuttle/missions/61-b…
Cluster 2: /history/apollo/sa-2/news/, /history/apollo/sa-2/images…
Cluster 3: /software/winvn/userguide/3_3_2.htm, /software/winvn/userguide/3_3_4.htm…
…

(3) Using hyperlinks
• For each URL P in the search results R, we extract all of its out-links, as well as its top n in-links using the services of AltaVista
• This yields N distinct out-links and M distinct in-links over all URLs in R
• Each page P in R (the result set) is represented as 2 vectors:
  – POut (N-dimensional)
  – PIn (M-dimensional)

(3) Using Hyperlinks: continued

Concerns on current methods
• Each method has pros and cons
• Using hyperlinks: the best accuracy, with still some room to improve
• STC: best for browsing and for incrementality

Sample systems
• Scatter/Gather
• Grouper
• Carrot2
• Vivisimo
• Mapuccino
• SHOC

Grouper
• Online
• Operates on query result snippets
• Clusters together documents with large common subphrases
• Suffix Tree Clustering (STC)
• STC induces labeling


Summary

Web mining
• Web structure mining
• Web content mining
  – Web page content mining
  – Search result mining
• Web usage mining
  – General access pattern tracking
  – Customized usage tracking


Thank You