Adaptive Focused Crawling

Presented by: Siqing Du. Date: 10/19/05


Transcript of Adaptive Focused Crawling

Page 1: Adaptive Focused Crawling

Adaptive Focused Crawling

Presented by: Siqing Du

Date: 10/19/05

Page 2: Adaptive Focused Crawling

Outline

Introduction of web crawling
Exploiting the hypertextual information
Genetic-based crawler
Ant-based crawler
Machine learning-based crawler
Evaluation

Page 3: Adaptive Focused Crawling

Crawling the Web

Simple crawling on the web proceeds by following the URLs in the seed pages, retrieving the linked web pages, and adding them to a local repository.

Taking the Web as a graph structure (V,E), web crawling is similar to a graph traversal problem.

Breadth-first search
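As a minimal sketch of breadth-first crawling, the code below treats the web as a small in-memory graph. The `TOY_WEB` dictionary, the `get_links` callback, and the `max_pages` limit are illustrative assumptions, not part of the original slides; a real crawler would fetch and parse pages over HTTP.

```python
from collections import deque

def bfs_crawl(seeds, get_links, max_pages):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)      # FIFO queue of URLs waiting to be fetched
    seen = set(seeds)            # URLs already queued, to avoid refetching
    repository = []              # stands in for the local page repository
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        repository.append(url)   # "retrieve" the page and store it
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository

# Hypothetical toy web: each URL maps to the URLs it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}

order = bfs_crawl(["a"], lambda url: TOY_WEB.get(url, []), max_pages=10)
```

Because the frontier is a FIFO queue, all pages one link away from the seed are stored before any page two links away.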

Page 4: Adaptive Focused Crawling

Flow of a Basic Sequential Crawler

Page 5: Adaptive Focused Crawling

What is the Problem

The current size of the (static, crawlable, visible) web is 4 to 10 billion pages, or maybe a lot more.

The average out-degree (number of URLs in a page) of a random page on the web is 7.

Hence the number of pages reachable from a seed grows exponentially with link depth, by a factor of about 7 per level.

Even a well-known web search engine can cover only a part of the whole web.
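To make the growth concrete, here is a rough back-of-the-envelope sketch. It assumes every page links to exactly 7 previously unseen pages, which overestimates real growth because links on the web overlap heavily; the function name `depth_to_exceed` is an illustrative choice.

```python
def depth_to_exceed(target_pages, out_degree=7):
    """Smallest link depth d at which out_degree**d reaches target_pages."""
    depth, pages_at_depth = 0, 1
    while pages_at_depth < target_pages:
        depth += 1
        pages_at_depth *= out_degree
    return depth

# Depth at which one level alone would hold at least 10 billion pages.
depth = depth_to_exceed(10_000_000_000)
```

Under this idealized assumption, only a dozen link hops separate a single seed from more pages than the estimated size of the web, which is why exhaustive crawling does not scale.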

Page 6: Adaptive Focused Crawling

Adaptive Focused Crawling

Focused crawling: developing particular crawlers able to seek out and collect pages related to a given topic.

It is also called topical crawling.

If a focused crawler includes learning methods in order to adapt its behavior during the crawl to the particular environment and its relationships with the given input parameters, e.g., the set of retrieved pages and the user-defined topic, the crawler is named adaptive.

Best-first search
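A best-first crawler replaces the FIFO queue of breadth-first search with a priority queue ordered by an estimate of topical relevance. The sketch below is one possible reading; the toy graph, the fixed `TOPIC_SCORE` table, and the `score` callback are assumptions made for illustration (a real focused crawler would estimate scores from anchor text or page content).

```python
import heapq

def best_first_crawl(seeds, get_links, score, max_pages):
    """Always fetch the highest-scoring URL in the frontier next."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited

# Hypothetical toy web and relevance estimates for the topic "crawling".
TOY_WEB = {
    "seed": ["sports", "crawler-intro"],
    "sports": ["football"],
    "crawler-intro": ["focused-crawler"],
    "football": [],
    "focused-crawler": [],
}
TOPIC_SCORE = {
    "seed": 0.5, "sports": 0.1, "crawler-intro": 0.8,
    "football": 0.0, "focused-crawler": 0.9,
}

visited = best_first_crawl(["seed"], lambda u: TOY_WEB[u], TOPIC_SCORE.get, max_pages=3)
```

With a budget of three pages, the crawler spends it on the on-topic branch and never fetches the off-topic pages.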

Page 7: Adaptive Focused Crawling

Outline

Introduction of web crawling
Exploiting the hypertextual information
Genetic-based crawler
Ant-based crawler
Machine learning-based crawler
Evaluation

Page 8: Adaptive Focused Crawling

Exploiting the Hypertextual Information

PageRank and HITS are founded on citation analysis, started in the 1950s by Garfield.

In focused crawling systems, the precision is not defined only in terms of the number of crawled pages, but in terms of rank.

Short result lists of high-rank documents are definitely better than long lists of interesting documents that force the users to sift through them in order to find the most valuable information.

Page 9: Adaptive Focused Crawling

Topical Locality and Anchors

Topical locality occurs each time a page is linked to others with related content (in order to give users the chance to see further related information or services).

Proximal cues or residues correspond to the imperfect information at intermediate locations that a user exploits to decide which paths to follow in order to reach the target information.

Text snippets, anchor text, or icons are usually the imperfect information related to a certain distant content.

Page 10: Adaptive Focused Crawling

HITS

Authorities: have relevant content about a topic.

Hubs: contain several links toward relevant authoritative pages.

$$a(p) = \sum_{q:(q,p) \in E} h(q)$$

$$h(p) = \sum_{q:(p,q) \in E} a(q)$$
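In HITS, the authority score of a page is the sum of the hub scores of the pages linking to it, and the hub score of a page is the sum of the authority scores of the pages it links to. The following sketch iterates those two rules on a tiny graph. The L2 normalization after each round, the iteration count, and the example edges are assumptions added so the scores stay bounded; they are not taken from the slides.

```python
import math

def hits(edges, iterations=50):
    """Iterate the HITS update rules on a directed graph given as (p, q) links."""
    nodes = {n for edge in edges for n in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all q that link to p
        new_auth = {n: 0.0 for n in nodes}
        for q, p in edges:
            new_auth[p] += hub[q]
        # h(p) = sum of a(q) over all q that p links to
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        # Assumption: rescale to unit L2 norm so the values do not blow up.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return auth, hub

# Hypothetical graph: two hub pages pointing at two content pages.
EDGES = [("hub1", "page_a"), ("hub1", "page_b"), ("hub2", "page_a")]
auth, hub = hits(EDGES)
```

Here `page_a` ends up with the higher authority score because both hubs point to it, and `hub1` with the higher hub score because it points to both authorities.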

Page 11: Adaptive Focused Crawling

PageRank

Random surfer model: a surfer in this model randomly clicks on one of the links contained in a page p, each with equal probability 1/N_p.

$$rank(p) = c \sum_{q:(q,p) \in E} \frac{rank(q)}{N_q} + cE(p)$$
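The random surfer model can be simulated with a simple power iteration. The sketch below uses the widely known damped variant with uniform teleportation, which is a swapped-in simplification rather than the exact formulation on the slide; the `damping` value of 0.85, the dangling-page handling, and the example graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration for a random surfer with uniform teleportation.

    links maps every page to the list of pages it links to; every link
    target must itself be a key of links.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for q, out_links in links.items():
            if out_links:
                # Each out-link of q receives an equal share rank(q) / N_q.
                share = damping * rank[q] / len(out_links)
                for p in out_links:
                    new_rank[p] += share
            else:
                # Dangling page: spread its rank evenly (a common convention).
                for p in pages:
                    new_rank[p] += damping * rank[q] / n
        rank = new_rank
    return rank

# Hypothetical four-page web.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(LINKS)
```

Page `c` collects rank from three pages and scores highest in this example, while `d`, which nobody links to, keeps only the teleportation share.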

Page 12: Adaptive Focused Crawling

Outline

Introduction of web crawling
Exploiting the hypertextual information
Genetic-based crawler
Ant-based crawler
Machine learning-based crawler
Evaluation

Page 13: Adaptive Focused Crawling

AI-based Approaches

These approaches view crawlers as single autonomous units that live and keep moving in search of interesting resources.

Genetic-based crawlers

Ant paradigm

Page 14: Adaptive Focused Crawling

Genetic-based crawlers

InfoSpiders, also known as ARACHNID (Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery)

Genetic algorithms have been introduced in order to find approximate solutions to hard-to-solve combinatorial optimization problems.

Inspired by evolutionary biology studies.

Page 15: Adaptive Focused Crawling

Basic Idea of GA

A population of candidate solutions.

Genetic operators, such as inheritance, mutation, and crossover.

The individuals that are closer to the better solutions are given more chances to live and reproduce, while the ones that are ill-suited for an environment die out.

The initial population is generated randomly.
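The loop of random initialization, selection, crossover, and mutation can be sketched on a toy problem. Everything concrete here is an assumption for illustration: bit-string genomes, the "count the ones" fitness, truncation selection of the fitter half, one-point crossover, and the parameter values. It is not the encoding used by any crawler discussed in this talk.

```python
import random

def evolve(fitness, genome_length, population_size=20, generations=30,
           mutation_rate=0.05, seed=0):
    """Evolve bit-string genomes; returns the best genome and fitness history."""
    rng = random.Random(seed)
    # The initial population is generated randomly.
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # The fitter half survives unchanged; the ill-suited half dies out.
        survivors = population[:population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_length)          # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = survivors + children
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history

# Toy fitness: the number of 1 bits in the genome.
best, history = evolve(fitness=sum, genome_length=16)
```

Because the surviving half is carried over unchanged, the best fitness in `history` can never decrease from one generation to the next.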

Page 16: Adaptive Focused Crawling

InfoSpider

In InfoSpiders, an evolving population of intelligent agents browses the Web, driven by the user queries.

Each agent is able to draw relevant resources and reason autonomously about the next page to download and analyze.

The goal is to mimic the intelligent browsing behavior of human users with little or no interaction among agents.

Page 17: Adaptive Focused Crawling

InfoSpider cont.

Each agent is built on top of a genotype (a parameter that represents the degree to which an agent trusts the textual description of outgoing links, a set of keywords initialized with the query terms, and a vector of weights).

A feed-forward neural network is used to judge which keywords in the first set best discriminate the documents relevant to the user.

Page 18: Adaptive Focused Crawling

InfoSpider cont.

The adaptivity is both unsupervised and supervised (with or without users' feedback).

If an error occurs (an uninteresting page is visited) due to an agent's action selection, the weights of its neural network are updated accordingly.

Mutation and crossover provide the second kind of adaptivity to the environment.

An agent's energy value is assigned at the beginning and updated according to the relevance of the pages visited.

The energy determines which agent survives or dies out.
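The role of energy can be illustrated with a deliberately simplified sketch. All specifics are assumptions made for the example and are not taken from the slides: the fixed per-visit `cost`, the `reproduce_at` threshold, the rule that a reproducing agent splits its energy with its offspring, and the agent names.

```python
def step_population(agents, relevance_of_visit, cost=0.1, reproduce_at=2.0):
    """Advance every agent by one page visit.

    agents maps an agent name to its current energy;
    relevance_of_visit maps an agent name to the relevance (0..1)
    of the page it just visited.
    """
    next_generation = {}
    for name, energy in agents.items():
        # Energy grows with the relevance of the visited page, minus a cost.
        energy += relevance_of_visit[name] - cost
        if energy <= 0:
            continue                      # the agent dies out
        if energy >= reproduce_at:
            # Assumed rule: parent and offspring share the energy equally.
            next_generation[name] = energy / 2
            next_generation[name + "'"] = energy / 2
        else:
            next_generation[name] = energy
    return next_generation

# Hypothetical population after one round of visits.
agents = {"a1": 1.8, "a2": 0.05, "a3": 1.0}
relevance = {"a1": 0.9, "a2": 0.0, "a3": 0.3}
survivors = step_population(agents, relevance)
```

In this run the agent on a relevant page reproduces, the agent on an irrelevant page runs out of energy and disappears, and the third simply carries on.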

Page 19: Adaptive Focused Crawling

Itsy Bitsy Spider

Itsy Bitsy Spider is an implementation of a genetic-based crawler that was tested on the Yahoo database.

In the evaluation, the genetic approach does not outperform the best-first search algorithm (recall is high, but precision shows no significant difference).

However, Itsy Bitsy is a simplified version of InfoSpiders: it has no neural network, lacks some other components, and has no ability to reason autonomously.

Page 20: Adaptive Focused Crawling

Outline

Introduction of web crawling
Exploiting the hypertextual information
Genetic-based crawler
Ant-based crawler
Machine learning-based crawler
Evaluation

Page 21: Adaptive Focused Crawling

Ant-based Crawlers

Based on a model of social insect collective behavior.

It draws on studies of how blind animals, such as ants, are able to find the shortest paths from their nest to food sources and back.

Ants can release a hormonal substance, the pheromone, to mark the ground, leaving a trail.

Other ants follow the trail and reinforce it.

Page 22: Adaptive Focused Crawling

Mechanism

The first ants returning to their nest from the food sources are those that chose the shortest paths.

The round trip lets them release pheromone on the path twice.

Other ants, when they have to choose between different paths, will prefer the ones with more pheromone.

Page 23: Adaptive Focused Crawling

Ant-based Crawlers

Each agent corresponds to a virtual ant that moves from url_i to url_j.

The system execution is divided into cycles; in each of them, the ants make a sequence of moves.

At the end of a cycle, the ants update the pheromone intensity values of the followed path as a function of the retrieved resource scores.

Page 24: Adaptive Focused Crawling

Ant-based Crawlers

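The transition probability from url_i to url_j at cycle t is

$$p_{ij}(t) = \frac{\tau_{ij}(t)}{\sum_{l:(i,l) \in E} \tau_{il}(t)}$$

where \tau_{ij}(t) is the pheromone intensity on the link from url_i to url_j.

To prevent circular paths, each ant stores a list L containing the visited URLs.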
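An ant picks its next URL with probability proportional to the pheromone on each outgoing link. The sketch below is one possible implementation; the dictionary-based data layout and the example values are assumptions, and so is the choice to drop already visited URLs before normalizing (the talk only says the visited list prevents circular paths).

```python
import random

def transition_probabilities(current, out_links, pheromone, visited):
    """Probability of moving from current to each unvisited out-link."""
    candidates = [url for url in out_links[current] if url not in visited]
    total = sum(pheromone[(current, url)] for url in candidates)
    if total == 0:
        return {}                          # nowhere left to go
    return {url: pheromone[(current, url)] / total for url in candidates}

def choose_next(current, out_links, pheromone, visited, rng=random):
    """Sample the next URL according to the transition probabilities."""
    probs = transition_probabilities(current, out_links, pheromone, visited)
    if not probs:
        return None
    urls = list(probs)
    return rng.choices(urls, weights=[probs[u] for u in urls], k=1)[0]

# Hypothetical page "i" with three out-links and their pheromone levels.
OUT_LINKS = {"i": ["j1", "j2", "j3"]}
PHEROMONE = {("i", "j1"): 3.0, ("i", "j2"): 1.0, ("i", "j3"): 4.0}

all_probs = transition_probabilities("i", OUT_LINKS, PHEROMONE, visited=set())
probs_after_j3 = transition_probabilities("i", OUT_LINKS, PHEROMONE, visited={"j3"})
next_url = choose_next("i", OUT_LINKS, PHEROMONE, visited={"j3"}, rng=random.Random(0))
```

Once `j3` is on the ant's visited list, the remaining probability mass is shared between `j1` and `j2` in proportion to their pheromone.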

Page 25: Adaptive Focused Crawling

Updating Rule

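The pheromone of the trail from url_i to url_j at cycle t+1 is

$$\tau_{ij}(t+1) = \rho \cdot \tau_{ij}(t) + \sum_{k=1}^{M} \Delta\tau^{(k)}$$

$$\Delta\tau^{(k)} = \frac{\sum_{j=1}^{|P^{(k)}|} score\left(P^{(k)}[j]\right)}{|P^{(k)}|}$$

where M is the number of ants, P^{(k)} is the path followed by ant k, and \rho is the trail evaporation coefficient.

Adaptivity: the pheromone intensities are updated according to the visited resource scores.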
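At the end of a cycle, old pheromone is scaled down and every ant adds the average score of the pages on its path to the links it followed. The sketch below is a hedged reading of that rule: the evaporation coefficient `rho=0.8`, the edge-keyed dictionary, and the example scores are assumptions, as is applying each ant's contribution only to the links on its own path.

```python
def update_pheromone(pheromone, ant_paths, score, rho=0.8):
    """Return the pheromone levels for the next cycle.

    pheromone maps (url_i, url_j) to its intensity;
    ant_paths holds one list of visited URLs per ant;
    score maps a URL to the relevance score of its page.
    """
    # Evaporation: every existing trail is scaled by rho.
    updated = {edge: rho * tau for edge, tau in pheromone.items()}
    for path in ant_paths:
        if not path:
            continue
        # Contribution of this ant: mean score of the pages on its path.
        delta = sum(score(url) for url in path) / len(path)
        for edge in zip(path, path[1:]):
            updated[edge] = updated.get(edge, 0.0) + delta
    return updated

# Hypothetical cycle with two ants leaving page "a".
PHEROMONE = {("a", "b"): 1.0, ("a", "c"): 1.0}
SCORES = {"a": 0.5, "b": 0.9, "c": 0.1}
new_pheromone = update_pheromone(PHEROMONE, [["a", "b"], ["a", "c"]], SCORES.get)
```

The link toward the more relevant page `b` ends the cycle with more pheromone, so ants in the next cycle are more likely to follow it.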

Page 26: Adaptive Focused Crawling

Outline

Introduction of web crawling
Exploiting the hypertextual information
Genetic-based crawler
Ant-based crawler
Machine learning-based crawler
Evaluation

Page 27: Adaptive Focused Crawling

Intelligent Crawling's Statistical Model

Aims at statistically learning the characteristics of the linkage structure of the Web while performing the search.

Uses the knowledge obtained during the search to calculate the conditional probability and interest ratio that determine whether an unseen page satisfies the user's needs.

It does not need any collection of topical examples for training.

The crawler adapts its behavior by learning the correlations among given features.
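One way to read the interest ratio is as the conditional probability that a page is relevant given an observed feature, divided by the overall probability of relevance. That definition, the boolean feature (for example "the URL contains a topic keyword"), and the counts below are assumptions for illustration rather than details given in the talk.

```python
def interest_ratio(crawled_pages):
    """Compute P(relevant | feature) / P(relevant) from crawl history.

    crawled_pages is a list of (has_feature, is_relevant) boolean pairs,
    one per page crawled so far.
    """
    total = len(crawled_pages)
    relevant = sum(1 for _, is_rel in crawled_pages if is_rel)
    with_feature = [is_rel for has_feat, is_rel in crawled_pages if has_feat]
    if total == 0 or relevant == 0 or not with_feature:
        return 1.0                         # no evidence either way
    p_relevant = relevant / total
    p_relevant_given_feature = sum(with_feature) / len(with_feature)
    return p_relevant_given_feature / p_relevant

# Hypothetical history: 10 pages crawled, 4 relevant; 5 had the feature,
# and 3 of those 5 were relevant.
HISTORY = ([(True, True)] * 3 + [(True, False)] * 2
           + [(False, True)] * 1 + [(False, False)] * 4)
ratio = interest_ratio(HISTORY)
```

A ratio above 1 suggests the feature is positively correlated with relevance, so unseen pages showing it would be given priority; the statistics are gathered during the crawl itself, with no training collection.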

Page 28: Adaptive Focused Crawling

Reinforcement Learning-based Approaches

A classifier evaluates the relevance of a hypertext document with respect to the chosen topics.

The interesting documents found are the rewards.

The goal is to learn, during the crawl, which text in the neighborhood of a hyperlink most likely points to relevant pages.

Page 29: Adaptive Focused Crawling

Outline

Introduction of web crawling
Exploiting the hypertextual information
Genetic-based crawler
Ant-based crawler
Machine learning-based crawler
Evaluation

Page 30: Adaptive Focused Crawling

Evaluation Methodologies

The goodness of the retrieved documents is measured by precision and recall:

$$P_r = \frac{\text{number of relevant documents retrieved}}{\text{total number retrieved}}$$

$$R_r = \frac{\text{number of relevant documents retrieved}}{\text{total number relevant}}$$

The percentage of important pages retrieved over the progress of the crawl is another often-used measure.
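Precision and recall can be computed directly from the set of retrieved pages and the set of relevant pages. The URLs in this small sketch are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a crawl result."""
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_retrieved = len(retrieved & relevant)
    precision = relevant_retrieved / len(retrieved) if retrieved else 0.0
    recall = relevant_retrieved / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical crawl: 4 pages retrieved, 5 pages relevant overall, 2 in common.
precision, recall = precision_recall(
    retrieved=["u1", "u2", "u3", "u4"],
    relevant=["u2", "u4", "u5", "u6", "u7"],
)
```

Here half of what the crawler fetched is relevant (precision 0.5), but it found only two of the five relevant pages (recall 0.4).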

Page 31: Adaptive Focused Crawling

An Example of Performance Plot

Calculated over 159 topics

One-tailed t-test performed, p < 0.01

Page 32: Adaptive Focused Crawling

Summarization

Focused crawling has become an interesting alternative to the current Web search tools.

Focused crawlers are a particular kind of crawler able to seek out and collect the subset of Web pages related to a given topic.

With learning methods, adaptive focused crawlers are able to adapt the system behavior to the particular environment and input parameters during the search.

Evaluation results show how the whole searching process can benefit from these techniques and increase crawling performance.

Page 33: Adaptive Focused Crawling

Reference

Core paper:

– Alessandro Micarelli and Fabio Gasparetti, Adaptive Focused Crawling.

Additional papers:

– Gautam Pant, Padmini Srinivasan, and Filippo Menczer, Crawling the Web, in Web Dynamics, Springer-Verlag, 2003.

– Martin Ester, Matthias Groß, and Hans-Peter Kriegel, Focused Web Crawling: A Generic Framework for Specifying the User Interest and for Adaptive Crawling Strategies, VLDB 2001.

Page 34: Adaptive Focused Crawling

Questions & Comments?

Thanks!