Web Page Clustering using Heuristic Search in the Web Graph

IJCAI 07

Motivation - 1/2

• The reasons for clustering search results are two-fold:
  – Cluster hypothesis: similar documents tend to be relevant to the same requests.
  – A ranked list is usually too large and contains many irrelevant documents.
• Successful in both academia and industry (vivisimo.com):
  – Organize search results into groups (clusters)
  – Topical similarity

Motivation - 2/2

• Clustering problem:
  – There is not enough contextual information on a page.
    • For example: savethejaguar.com
  – Web sites are contextually different but actually refer to the same meaning of the query.
    • Michel Décary:
      – a computer scientist (www.zoominfo.com/MichelDecary),
      – a lawyer (www.stikeman.com/cgi-bin/profile.cfm?P ID=366),
      – and a chansonnier (www.decary.com).

Introduction - 1/3

• Thematic locality of the Web graph:
  – The Web graph is a directed graph in which nodes are Web pages and edges are hyperlinks.
  – If page A links to page B, pages A and B are semantically close.
  – For example, Michel Décary:
    • a computer scientist (www.zoominfo.com/MichelDecary),
    • and a chansonnier (www.decary.com).
    • cogilex.com

Introduction - 2/3

• Heuristic search:
  – The goal is to collect as much useful information as possible while crawling the Web.
  – A heuristic estimates the amount of information available in a particular Web sub-graph.
  – This paper uses heuristics to estimate the utility of expanding the current node in terms of leading to the target node.
• The heuristics are not meant to reduce the search time, but to improve the search accuracy.
  – Heuristics are used as filters to prune branches of search trees that are likely to establish undesired connections between unrelated Web pages.

Introduction - 3/3

• Multi-agent system:
  – Given n Web pages in the ranked list
  – n collaborative Web agents
    • Initial dataset: each agent is assigned one page.
    • Each agent performs heuristic search to traverse the Web graph in order to meet as many other agents as possible.
• Two applications:
  – Web appearance disambiguation
  – Search result clustering
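
The multi-agent setup can be illustrated with a small sketch. It assumes a toy in-memory link graph instead of a live crawl, a fixed hop limit, and that two agents "meet" when the sets of pages they reach overlap. The slides do not specify any of these details, and the names `reachable`, `meetings`, and `max_hops` are hypothetical.

```python
from collections import deque
from itertools import combinations

def reachable(graph, source, max_hops):
    """Pages an agent can reach from its source page within max_hops links."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        page, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for target in graph.get(page, ()):
            if target not in seen:
                seen.add(target)
                frontier.append((target, hops + 1))
    return seen

def meetings(graph, sources, max_hops=1):
    """Pairs of agents (named by source page) whose explored regions overlap."""
    explored = {s: reachable(graph, s, max_hops) for s in sources}
    return sorted(
        (a, b)
        for a, b in combinations(sorted(sources), 2)
        if explored[a] & explored[b]
    )

# Toy link graph with made-up pages: p1 and p2 both link to "hub"; p3 does not.
toy_graph = {"p1": ["hub"], "p2": ["hub"], "p3": ["other"]}
pairs = meetings(toy_graph, ["p1", "p2", "p3"])
```

In this toy graph only the agents started at `p1` and `p2` meet, because both reach `hub` within one hop.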

Multi-agent heuristic search

• Two multi-agent heuristic search algorithms:
  – Sequential Heuristic Search (SHS)
    • Frontier: a list of nodes (URLs) to be expanded (initially, the URL of the agent's source page)
    • Filter: (later)
    • Initialize:

• The SHS algorithm:
  – Simple and intuitive
• One crucial drawback:
  – There is no possibility to control the topology of the constructed clusters.
  – In the worst case:
    • If Page A --> Page B, Page B --> Page C, and Page C --> Page D,
    • pages A and D will be placed in the same cluster even though the semantic relation between them is probably weak.
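
The worst case amounts to transitive chaining. A minimal sketch, assuming clusters are formed as connected components over pairwise connections (the slides do not give the merge rule, and `cluster_by_connections` is a hypothetical helper):

```python
def cluster_by_connections(pages, connections):
    """Group pages into connected components using union-find."""
    parent = {p: p for p in pages}

    def find(p):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in connections:
        parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for p in pages:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(c) for c in clusters.values())

# The chain from the slide: A --> B, B --> C, C --> D (E is unconnected).
chain = [("A", "B"), ("B", "C"), ("C", "D")]
result = cluster_by_connections(["A", "B", "C", "D", "E"], chain)
```

Under this assumption, A and D end up in the same cluster although no direct link connects them, which is exactly the lack of topological control noted for SHS.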

• Incremental Heuristic Search (IHS)

Heuristics - 1/2

• Two heuristics:
  – Topology-driven
    • High-degree node elimination: remove high out-degree pages and high in-degree pages
  – Content-driven
    • Person name heuristic

Heuristics - 2/2

• To detect high out-degree URLs:
  – Use Google’s link: operator
  – Threshold: 1000 in/out hyperlinks
• Person names consist of two, three or four words.
  – This heuristic excludes person names that are too common (again, we use Google’s link: operator).
    • In many cases, an entity tagged as a person name has millions of Google hits if it is a tagger error.
    • Examples of such entities are Price Range and Mac Os.
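
Both heuristics can be read as simple predicates. The sketch below is illustrative only: degrees and hit counts are passed in as plain numbers rather than fetched from a search engine, the hit-count cutoff is an assumption (the slides only say "millions"), whether the threshold is inclusive is also an assumption, and the function names are hypothetical.

```python
DEGREE_THRESHOLD = 1000       # in/out hyperlink threshold given in the slides
COMMON_NAME_HITS = 1_000_000  # assumed cutoff; the slides only say "millions"

def passes_degree_filter(in_degree, out_degree, threshold=DEGREE_THRESHOLD):
    """Keep a page only if neither its in-degree nor out-degree exceeds the threshold."""
    return in_degree <= threshold and out_degree <= threshold

def is_plausible_person_name(entity, hit_count, max_hits=COMMON_NAME_HITS):
    """Accept two- to four-word entities that are not suspiciously frequent."""
    n_words = len(entity.split())
    return 2 <= n_words <= 4 and hit_count < max_hits

# Made-up hit counts, used purely for illustration.
keep_page = passes_degree_filter(in_degree=12, out_degree=5000)
tagger_error = is_plausible_person_name("Price Range", 5_000_000)
real_name = is_plausible_person_name("Michel Decary", 300)
```

With these made-up numbers, the page is dropped for its out-degree, "Price Range" is rejected as too common, and the two-word rare name is kept.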

Datasets - disambiguation dataset

• Web appearance disambiguation dataset:
  – www.cs.umass.edu/~ronb
  – It consists of 1085 Web pages retrieved on 12 names of people from Melinda Gervasio’s social network (mostly SRI engineers and university professors).
  – The dataset is labeled according to the person’s occupation.
• The process crawled the Web starting with these 1085 pages (source pages).
  – 7009 pages at the first hop (a hop is a single leg of the journey)
  – 69,454 pages at the second hop
  – 592,299 pages at the third hop
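
Per-hop page counts of this kind can be produced by a level-by-level traversal. The sketch below assumes a toy in-memory graph and counts only pages first discovered at each hop; the slides do not say whether the reported figures exclude previously seen pages, and `pages_per_hop` is a hypothetical helper.

```python
def pages_per_hop(graph, sources, max_hops):
    """Count pages first discovered at each hop from the source pages."""
    seen = set(sources)
    frontier = set(sources)
    counts = []
    for _ in range(max_hops):
        next_frontier = set()
        for page in frontier:
            for target in graph.get(page, ()):
                if target not in seen:
                    seen.add(target)
                    next_frontier.add(target)
        counts.append(len(next_frontier))
        frontier = next_frontier
    return counts

# Toy graph with made-up pages; s1 and s2 play the role of source pages.
toy = {"s1": ["a", "b"], "s2": ["b", "c"], "a": ["d"], "c": ["d", "e"], "d": ["s1"]}
hop_counts = pages_per_hop(toy, ["s1", "s2"], 3)
```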

One-Cluster

Datasets - Jaguar dataset - 1/2

• Problem of clustering Web search results
• Retrieved and labeled the first 100 Google hits obtained on the query jaguar.

Datasets - Jaguar dataset - 2/2

• Jaguar dataset:
  – K = 3 (car, Mac Os, and cats)
  – 883 pages on the first hop
  – 8548 pages on the second hop
  – 56,287 pages on the third hop

– Agglomerative/Conglomerative Distributional Clustering (A/CDC) (Bekkerman and McCallum, 2005)

Conclusion

• This paper is the first study of heuristic search in the Web graph.

• Heuristic search:
  – Viable in the vast domain of the WWW
  – Clustering of Web search results
  – Web appearance disambiguation

Introduction - 4/4

• Topological clustering:
  – Keep only the k largest clusters:
    • a set C of k clusters
  – Initially: each document from the original ranked list is placed into one cluster of C'
    • a set C' of k' > k topical clusters
  – For each cluster c_i ∈ C, find its closest cluster c'_j from C':
    • j = argmax_{j'} |c_i ∩ c'_{j'}|
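
A minimal sketch of this matching step, assuming clusters are represented as sets of document IDs and that closeness is measured by the number of shared documents. The document IDs and the helper name `closest_topical_cluster` are hypothetical.

```python
def closest_topical_cluster(topological, topical_clusters):
    """Index j of the topical cluster sharing the most documents with `topological`."""
    return max(
        range(len(topical_clusters)),
        key=lambda j: len(topological & topical_clusters[j]),
    )

# Made-up example: k = 2 topological clusters, k' = 3 topical clusters.
C = [{"d1", "d2", "d3"}, {"d4", "d5"}]
C_prime = [{"d1", "d2"}, {"d3", "d6"}, {"d4", "d5", "d7"}]
matches = [closest_topical_cluster(c, C_prime) for c in C]
```

Here the first topological cluster is matched to topical cluster 0 and the second to topical cluster 2. When overlaps tie, `max` returns the lowest index, which is a choice of this sketch rather than something the slides specify.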
