Web Page Clustering using Heuristic Search in the Web Graph (IJCAI 07)




Motivation - 1/2

• The reasons for clustering search results are two-fold:

– Cluster hypothesis: similar documents tend to be relevant to the same requests.

– The ranked list is usually too large and contains many irrelevant documents.

• Clustering has been successful in both academic and industrial settings (e.g., vivisimo.com):

– Search results are organized into groups (clusters)

– Groups are based on topical similarity


Motivation - 2/2

• Clustering problem:

– There is often not enough contextual information on a page

• For example: savethejaguar.com

– Web sites can be contextually different but actually refer to the same meaning of the query

• For example, Michel Décary appears on the Web as:

– a computer scientist (www.zoominfo.com/MichelDecary),

– a lawyer (www.stikeman.com/cgi-bin/profile.cfm?P ID=366),

– and a chansonnier (www.decary.com).


Introduction - 1/3

• Thematic locality of the Web graph:

– The Web graph is a directed graph in which nodes are Web pages and edges are hyperlinks.

– If page A links to page B, then page A and page B tend to be semantically close.

– For example: Michel Décary

• a computer scientist (www.zoominfo.com/MichelDecary),

• and a chansonnier (www.decary.com).

– cogilex.com


Introduction - 2/3

• Heuristic search:

– The goal is to collect as much useful information as possible while crawling the Web.

– Heuristics estimate the amount of information available in a particular Web sub-graph.

– This paper uses heuristics to estimate the utility of expanding the current node in terms of leading to the target node.

• The heuristics are used not to reduce the search time but to improve the search accuracy.

– Heuristics act as filters that prune branches of the search tree that are likely to establish undesired connections between unrelated Web pages.


Introduction - 3/3

• Multi-agent system:

– Given n Web pages in the ranked list

– n collaborative Web agents are created

• Initially, each agent is assigned one page of the dataset

• Each agent performs heuristic search to traverse the Web graph in order to meet as many other agents as possible.

• Two applications:

– Web appearance disambiguation

– Search result clustering
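
The agent idea above can be sketched in a few lines of Python. This is only an illustrative sketch, not the algorithm from the paper: the graph is an in-memory dictionary rather than the live Web, the page labels and links are invented, two agents are assumed to "meet" when the sets of pages they have reached overlap, and the function names `explore` and `cluster_by_meetings` are hypothetical.

```python
from collections import deque


def explore(graph, source, max_hops, keep=lambda url: True):
    """Breadth-first traversal from `source`, following at most `max_hops` links.

    `keep` plays the role of a heuristic filter: pages it rejects are
    neither recorded nor expanded.
    """
    visited = {source}
    frontier = deque([(source, 0)])
    while frontier:
        url, hops = frontier.popleft()
        if hops == max_hops:
            continue  # hop budget used up on this branch
        for nxt in graph.get(url, ()):
            if nxt not in visited and keep(nxt):
                visited.add(nxt)
                frontier.append((nxt, hops + 1))
    return visited


def cluster_by_meetings(graph, sources, max_hops=1, keep=lambda url: True):
    """Group source pages whose agents reach at least one common page."""
    reached = {s: explore(graph, s, max_hops, keep) for s in sources}
    clusters = []
    for s in sources:
        merged, rest = {s}, []
        for cluster in clusters:
            # Merge with any existing cluster that contains an agent s meets.
            if any(reached[s] & reached[t] for t in cluster):
                merged |= cluster
            else:
                rest.append(cluster)
        clusters = rest + [merged]
    return clusters


# Toy graph with invented links (assumption, for illustration only).
toy_graph = {
    "scientist-page": ["shared-page"],
    "chansonnier-page": ["shared-page"],
    "lawyer-page": ["law-firm-page"],
}
toy_sources = ["scientist-page", "chansonnier-page", "lawyer-page"]

clusters = cluster_by_meetings(toy_graph, toy_sources)
filtered = cluster_by_meetings(
    toy_graph, toy_sources, keep=lambda url: url != "shared-page"
)
```

With the toy graph, the first two source pages end up in one cluster because both agents reach `shared-page`; when the filter rejects that page, no agents meet and every source page stays in its own cluster.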


Multi-agent heuristic search

• Two multi-agent heuristic search algorithms

– Sequential Heuristic Search (SHS)

• Frontier:

– a list of nodes (URLs) to be expanded (initially, the URL of the agent's source page)

• Filter: (described later)

• Initialize:


Multi-agent heuristic search

• The SHS algorithm

– is simple and intuitive

• One crucial drawback

– There is no way to control the topology of the constructed clusters

– In the worst case:

• If Page A --> Page B, Page B --> Page C, and Page C --> Page D,

• pages A and D will be placed in the same cluster even though the semantic relation between them is probably weak
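
The chaining effect can be reproduced with a tiny example. The sketch below is not SHS itself; it assumes only that clusters are formed by transitively merging pages connected by a link, and uses a hypothetical `connected_clusters` helper built on union-find to show that A and D land together although no link joins them directly.

```python
def connected_clusters(links):
    """Merge pages transitively along links (union-find) and return the clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for page in parent:
        groups.setdefault(find(page), set()).add(page)
    return list(groups.values())


chain = [("A", "B"), ("B", "C"), ("C", "D")]
chained = connected_clusters(chain)
```

Here `chained` contains a single cluster holding A, B, C and D, which is exactly the situation described as the worst case.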


Multi-agent heuristic search

• Incremental Heuristic Search (IHS)


Heuristics - 1/2

• Two heuristics

– Topology-driven

• High-degree node elimination: remove high out-degree pages and high in-degree pages

– Content-driven

• Person name heuristic


Heuristics - 2/2

• To detect high-degree URLs

– Using Google's link: operator

– Threshold: 1000 in/out hyperlinks

• Person names consist of two, three or four words

– This heuristic excludes people names that are too common (again, we use Google's link: operator)

• In many cases, an entity wrongly tagged as a person name (a tagger error) has millions of Google hits.

• Examples of such entities are Price Range and Mac Os.
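
Both filters can be written as simple predicates. The following is a minimal sketch under stated assumptions: link counts and hit counts are passed in by the caller (no search-engine API is called), the comparison is assumed to be strictly greater than the threshold of 1000, and the cut-off of one million hits for "too common" names is an illustrative value standing in for the slide's "millions". The function names are hypothetical.

```python
DEGREE_THRESHOLD = 1000      # in/out hyperlink threshold stated on the slide
MAX_NAME_HITS = 1_000_000    # assumption: illustrative cut-off for "too common"


def is_high_degree(in_links, out_links, threshold=DEGREE_THRESHOLD):
    """True if a page has too many incoming or outgoing links to be expanded."""
    return in_links > threshold or out_links > threshold


def is_plausible_person_name(name, hit_count, max_hits=MAX_NAME_HITS):
    """True if `name` has two to four words and is not suspiciously common."""
    n_words = len(name.split())
    return 2 <= n_words <= 4 and hit_count < max_hits


# Counts below are made up for illustration.
portal_rejected = is_high_degree(in_links=5, out_links=2500)
small_page_rejected = is_high_degree(in_links=40, out_links=12)
rare_name_kept = is_plausible_person_name("Michel Décary", hit_count=800)
tagger_error_kept = is_plausible_person_name("Price Range", hit_count=5_000_000)
```

A page or name that fails a predicate would simply not be followed during the search, which is how the heuristics act as filters rather than as speed-ups.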


Datasets - disambiguation dataset

• Web appearance disambiguation dataset

– www.cs.umass.edu/~ronb

– It consists of 1085 Web pages retrieved for 12 names of people from Melinda Gervasio's social network (mostly SRI engineers and university professors).

– The dataset is labeled according to the person's occupation.

• The process crawled the Web starting with these 1085 pages (source pages).

– 7009 pages at the first hop ("hop" as in one leg of a flight)

– 69,454 pages at the second hop

– 592,299 pages at the third hop


One-Cluster


Datasets - Jaguar dataset - 1/2

• Problem of clustering Web search results

• Retrieved and labeled the first 100 Google hits obtained for the query jaguar.


Datasets - Jaguar dataset - 2/2

• Jaguar dataset

– K = 3 (car, Mac OS, and cats)

– 883 pages on the first hop

– 8548 pages on the second hop

– 56,287 pages on the third hop


– Agglomerative/Conglomerative Distributional Clustering (A/CDC) (Bekkerman and McCallum, 2005)


Conclusion

• This paper is the first study of heuristic search in the Web graph.

• Heuristic search:

– Viable in the vast domain of the WWW

– Clustering of Web search results

– Web appearance disambiguation


Introduction - 4/4

• Topological clustering

– Keep only the k largest clusters:

• a set $C$ of $k$ clusters

– Initially, each document from the original ranked list is placed into one cluster of $C'$:

• a set $C'$ of $k' > k$ topical clusters

– For each cluster $c_i \in C$, find its closest cluster $c'_j$ from $C'$:

• $j = \arg\max_{j'} |c_i \cap c'_{j'}|$
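
The matching step can be illustrated with Python sets. This sketch assumes the formula on this slide is read as $j = \arg\max_{j'} |c_i \cap c'_{j'}|$, that is, closeness is the number of shared documents; the helper name `closest_topical_cluster`, the document ids, and the tie-breaking rule (lowest index wins) are assumptions made for the example.

```python
def closest_topical_cluster(c_i, topical_clusters):
    """Index j of the topical cluster sharing the most documents with c_i.

    Ties are broken in favour of the lowest index, because max() returns
    the first maximal element it encounters.
    """
    return max(
        range(len(topical_clusters)),
        key=lambda j: len(c_i & topical_clusters[j]),
    )


# Made-up document ids: C holds k = 2 topological clusters,
# C_prime holds k' = 3 topical clusters.
C = [{1, 2, 3, 4}, {5, 6}]
C_prime = [{1, 2}, {3, 5, 6}, {4, 7}]

matches = [closest_topical_cluster(c_i, C_prime) for c_i in C]
```

For the first cluster the overlaps with the three topical clusters are 2, 1 and 1, so index 0 is chosen; for the second they are 0, 2 and 0, so index 1 is chosen.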