Chapter 8 Web Structure Mining Part-1 1. Web Structure Mining Deals mainly with discovering the...

Chapter 8 Web Structure Mining Part-1

Web Structure Mining

• Deals mainly with discovering the model underlying the link structure of the web
• Deals with the topology of hyperlinks, with or without the description of the links

Why?

• The model can be used to classify web pages.
• It helps generate information such as the similarity and relationships between different websites.
• It is useful for discovering the type of a website.

Website type

• Web structure mining is a suitable tool for discovering authority sites and overview sites for a subject
• Authority sites contain information about the subject
• Overview sites point to many authority sites

Web Content Mining / Web Structure Mining

• Web Content Mining explores the structure within a document.
• Web Structure Mining studies the citation relationships between documents within the web.

Algorithms for Web Structure Mining

PageRank algorithm (Google founders)
• Looks at the number of links to a website and the importance of the referring links
• Computed before the user enters the query

HITS algorithm (Hyperlink-Induced Topic Search)
• The user receives two lists of pages for a query (authority pages and hub pages)
• Computations are done after the user enters the query

PageRank

PageRank Algorithm

• The idea of the algorithm came from the academic citation literature.
• It was developed in 1998 as part of the Google search engine prototype.
• It studies the citation relationships between documents within the web.
• The Google search engine ranks documents as a function of both the query terms and the hyperlink structure of the web.

Definition of PageRank

• PageRank produces a ranking that is independent of the user's query.
• The importance of a web page is determined by the number of other important web pages pointing to it and by the number of outlinks those pages have.

[Figure: illustration by Felipe Micaroni Lalli (micaroni@gmail.com)]

Example of Backlinks

Page A is a backlink of page B and page C, while page B and page C are backlinks of page D.

Each backlink of a page is an outlink of the page it comes from, and it counts toward that page's OutDegree.
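The backlink example can be sketched in a few lines of Python. This is only an illustration: the edge list is read off the sentence above (A links to B and C; B and C link to D), and the variable names are assumptions, not part of the slides.

```python
from collections import defaultdict

# Links from the backlink example, written as (source, target) pairs.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

backlinks = defaultdict(list)   # page -> pages that link to it
out_degree = defaultdict(int)   # page -> number of links going out of it

for source, target in edges:
    backlinks[target].append(source)
    out_degree[source] += 1

for page in "ABCD":
    print(page, "backlinks:", backlinks[page], "OutDegree:", out_degree[page])
```

Every link appears once as a backlink of its target and once in the OutDegree of its source, which is why the two counts always add up to the same total.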

Example-1

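With four pages that each start at a PageRank of 0.25, A receives the full value from each of its three backlinks:

PR(A) = 0.25 + 0.25 + 0.25 = 0.75

[Figure: link graph of pages A, B, C, and D]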

Example-2

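Each page divides its PageRank of 0.25 equally among its outlinks (B has 2, C has 1, D has 3):

PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3 = 0.125 + 0.25 + 0.0833 = 0.4583

[Figure: link graph of pages A, B, C, and D]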
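The arithmetic of Example-1 and Example-2 can be checked with a minimal Python sketch. The function name `simplified_pr` and the dictionaries are hypothetical; the out-degrees are the ones implied by the divisors in the formulas above, and no damping factor is applied yet.

```python
def simplified_pr(backlink_ranks, out_degree):
    """Sum PR(v) / OutDegree(v) over the backlinks v of a page (no damping)."""
    return sum(rank / out_degree[v] for v, rank in backlink_ranks.items())

# Every backlink of A starts with a PageRank of 0.25.
backlink_ranks = {"B": 0.25, "C": 0.25, "D": 0.25}

# Example-1: B, C, and D each have a single outlink, pointing to A.
pr_a_example1 = simplified_pr(backlink_ranks, {"B": 1, "C": 1, "D": 1})

# Example-2: B has 2 outlinks, C has 1, D has 3.
pr_a_example2 = simplified_pr(backlink_ranks, {"B": 2, "C": 1, "D": 3})

print(pr_a_example1)            # Example-1 value
print(round(pr_a_example2, 4))  # Example-2 value, rounded to 4 places
```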

Page Ranking

A page will have a high page rank if:
• There are many pages pointing to it.
• There are some pages pointing to it that have high page ranks.

In other words:
• Pages that are well cited from around the web are worth looking at.
• Pages that have only one citation, but from a highly ranked web page, are also worth looking at.

Damping Factor

The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is a damping factor d.

Damping Factor d

The damping factor is subtracted from 1, and this term is then added to the product of the damping factor and the sum of the incoming PageRank scores. So any page's PageRank is derived in large part from the PageRanks of other pages. The damping factor adjusts the derived value downward.

Computing PageRank

The PageRank of a page u is computed as follows:

$$\mathrm{PageRank}(u) = (1 - d) + d \sum_{(v,u) \in E} \frac{\mathrm{PageRank}(v)}{\mathrm{OutDegree}(v)}$$

where E is the set of links, so the sum runs over every page v that links to u; OutDegree(v) is the number of links going out of page v; and d is the damping factor, a real number between 0 and 1.

The value of d is generally taken as 0.85.
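A single application of this formula might look like the following Python sketch. The function name `pagerank_update`, the dictionary layout, and the starting value of 1.0 for every page are assumptions made for illustration; the graph is the one from the backlink example (A links to B and C; B and C link to D).

```python
def pagerank_update(u, ranks, out_links, d=0.85):
    """Apply PageRank(u) = (1 - d) + d * sum(PageRank(v) / OutDegree(v)) once."""
    incoming = 0.0
    for v, targets in out_links.items():
        if u in targets:                      # (v, u) is a link in E
            incoming += ranks[v] / len(targets)
    return (1 - d) + d * incoming

# out_links maps each page to the pages it links to.
out_links = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
ranks = {page: 1.0 for page in out_links}     # assumed starting guess

new_a = pagerank_update("A", ranks, out_links)  # no backlinks: only the (1 - d) term
new_b = pagerank_update("B", ranks, out_links)  # receives half of A's rank
new_d = pagerank_update("D", ranks, out_links)  # receives all of B's and C's rank
print(new_a, new_b, new_d)
```

A page with no backlinks still receives the (1 - d) term, so with d = 0.85 no page falls below 0.15 under this form of the formula.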

PageRank Algorithm

Applied Example

A Simple Network of Pages (Ian Roger, 2006)

Two pages, A and B, link to each other (OutDegree(A) = 1 and OutDegree(B) = 1). We do not know what their PageRanks should be to begin with, so we take a guess of 1.0 and, assuming d = 0.85, perform the following calculations:

PageRank(A) = (1 - d) + d (PageRank(B)/1)
PageRank(B) = (1 - d) + d (PageRank(A)/1)

PageRank(A) = 0.15 + 0.85 * 1 = 1
PageRank(B) = 0.15 + 0.85 * 1 = 1

With this guess, the PageRanks of A and B both come out as 1.

A Simple Network of Pages (Ian Roger, 2006)

Now we plug in 0 as the guess and perform the calculations again:

PageRank(A) = 0.15 + 0.85 * 0 = 0.15
PageRank(B) = 0.15 + 0.85 * 0.15 = 0.2775

We now have new estimates for both pages, so we use the latest PageRank(B) to recalculate PageRank(A) and continue:

PageRank(A) = 0.15 + 0.85 * 0.2775 = 0.3859
PageRank(B) = 0.15 + 0.85 * 0.3859 = 0.4780

Example (cont.)

Repeating the calculations, we get:

PageRank(A) = 0.15 + 0.85 * 0.4780 = 0.5563
PageRank(B) = 0.15 + 0.85 * 0.5563 = 0.6229

If we keep repeating the calculations, the PageRanks of both pages eventually converge to 1.
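The hand calculation for the two-page network can be reproduced with a short loop. This is a sketch under the same assumptions as the slides (d = 0.85, and PageRank(B) is computed from the freshly updated PageRank(A)); the function name `iterate_two_pages` is hypothetical.

```python
def iterate_two_pages(initial_guess, iterations, d=0.85):
    """Repeat the PageRank updates for two pages A and B that link to each other."""
    pr_a = pr_b = initial_guess
    history = []
    for _ in range(iterations):
        pr_a = (1 - d) + d * (pr_b / 1)   # B's only outlink goes to A
        pr_b = (1 - d) + d * (pr_a / 1)   # uses the value of PageRank(A) just computed
        history.append((round(pr_a, 4), round(pr_b, 4)))
    return history

# Starting from a guess of 0, as in the worked example.
for step, (pr_a, pr_b) in enumerate(iterate_two_pages(0.0, 3), start=1):
    print(step, pr_a, pr_b)

# Many more iterations, to observe the convergence toward 1.
final_a, final_b = iterate_two_pages(0.0, 100)[-1]
print(final_a, final_b)
```

Starting from a guess of 1.0 instead leaves both values at 1 from the first iteration, which matches the first calculation in the example.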

Rank Sink

A and B both have rank, but they will never circulate any rank.

[Figure: rank sink example]

Remarks on PageRank

• A page with no successors (no outgoing links) has nowhere to send its importance.
• Likewise, a group of pages that has no links out of the group will eventually collect all the importance of the web.
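Both remarks can be observed in a small experiment. The sketch below assumes the simplified transfer rule from Example-1 and Example-2 (each page splits its rank equally among its outlinks, with no damping term); the two tiny graphs and the function name `simple_step` are invented for illustration.

```python
def simple_step(ranks, out_links):
    """One synchronous step of simplified PageRank (no damping term)."""
    new_ranks = {page: 0.0 for page in ranks}
    for v, targets in out_links.items():
        for u in targets:
            new_ranks[u] += ranks[v] / len(targets)
    return new_ranks

# Remark 1: B has no successors, so the rank it receives goes nowhere.
dead_end_links = {"A": ["B"], "B": []}
dead_end_ranks = {"A": 0.5, "B": 0.5}
for _ in range(2):
    dead_end_ranks = simple_step(dead_end_ranks, dead_end_links)
print(dead_end_ranks)   # the total rank has drained away

# Remark 2: B and C link only to each other, so rank flows in but never out.
sink_links = {"A": ["B"], "B": ["C"], "C": ["B"]}
sink_ranks = {page: 1 / 3 for page in sink_links}
for _ in range(5):
    sink_ranks = simple_step(sink_ranks, sink_links)
print(sink_ranks)       # A is left with nothing; B and C hold everything
```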

PageRank Toolbar

Sample Scores with Their Meaning

Toolbar PageRank and Corresponding Real PageRank

Activity

There are links from page A to both B and C, and links from pages B and C back to A.

Begin with an initial PageRank value of 0.

Complete 6 iterations.

[Figure: link graph of pages A, B, and C]
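Hand calculations for the activity can be checked with a sketch like the one below. It assumes d = 0.85 and the same convention as the worked example, where each page is updated in turn and later pages use the values just computed; the update order A, B, C and the function name `iterate_in_place` are assumptions, and a different order would give different intermediate values.

```python
def iterate_in_place(out_links, iterations, d=0.85, initial=0.0):
    """Repeat the PageRank formula, updating pages one at a time in dictionary order."""
    ranks = {page: initial for page in out_links}
    history = []
    for _ in range(iterations):
        for u in out_links:
            incoming = sum(
                ranks[v] / len(targets)
                for v, targets in out_links.items()
                if u in targets
            )
            ranks[u] = (1 - d) + d * incoming
        history.append(dict(ranks))   # snapshot after each full pass
    return history

# A links to B and C; B and C each link back to A.
activity_links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
history = iterate_in_place(activity_links, iterations=6)

for step, ranks in enumerate(history, start=1):
    print(step, {page: round(value, 4) for page, value in ranks.items()})
```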