Claudio Scordino, Ph.D. Student
Crawling the Web: problems and techniques
Computer Science Department - University of Pisa
May 2004
Outline
• Introduction
• Crawler architectures
- Increasing the throughput
• What pages we do not want to fetch
- Spider traps
- Duplicates
- Mirrors
Introduction
Job of a crawler (or spider): fetching Web pages to a computer where they will be analyzed
The algorithm is conceptually simple, but…
…it’s a complex and often underestimated activity
Famous Crawlers
• Mercator (Compaq, Altavista)
Java
Modular (components loaded dynamically)
Priority-based scheduling for URL downloads
- The algorithm is a pluggable component
Different processing modules for different contents
Checkpointing
- Allows the crawler to recover its state after a failure
- In a distributed crawler, checkpointing is performed by the Queen
Famous Crawlers
• GoogleBot (Stanford, Google)
C/C++
• WebBase (Stanford)
• HiWE: Hidden Web Exposer (Stanford)
• Heritrix (Internet Archive)
http://www.crawler.archive.org/
Famous Crawlers
• Sphinx
Java
Visual and interactive environment
Relocatable: capable of executing on a remote host
Site-specific
- Customizable crawling
- Classifiers: site-specific content analyzers
1. Links to follow
2. Parts to process
- Not scalable
Crawler Architecture
[Diagram: seed URLs feed the URL FRONTIER; the SCHEDULER, supported by a Load Monitor, a Hosts table and Crawl Metadata, hands URLs to the RETRIEVERS (DNS, HTTP), which fetch pages from the Internet; the PARSER with its HREFs extractor and normalizer produces Citations, and extracted URLs pass through the URL Filter and the Duplicate URL Eliminator back into the frontier.]
Web masters annoyed
Web Server administrators could be annoyed by:
1. Server overload
- Solution: per-server queues
2. Fetching of private pages
- Solution: Robot Exclusion Protocol
- File: /robots.txt
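A minimal sketch of honoring the Robot Exclusion Protocol with Python's standard urllib.robotparser; the crawler name and URLs are hypothetical placeholders:

import urllib.robotparser

# Download and parse /robots.txt once per host, then consult it before every fetch.
rp = urllib.robotparser.RobotFileParser("http://www.example.com/robots.txt")
rp.read()

url = "http://www.example.com/private/report.html"   # hypothetical URL
if rp.can_fetch("MyCrawler", url):                    # hypothetical user-agent name
    print("allowed to fetch", url)
else:
    print("disallowed by robots.txt:", url)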
Crawler Architecture
[Diagram: the same architecture extended with per-server queues and a Robots store consulted before fetching.]
Mercator’s scheduler
FRONT-END: prioritizes URLs with a value between 1 and k
BACK-END: ensures politeness (no server overload)
- Queues containing URLs of only a single host
- Specifies when a server may be contacted again
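A sketch of the back-end idea only, assuming a plain dict of per-host FIFO queues and a "next allowed contact time" per server; the class and parameter names are illustrative, not Mercator's actual components:

import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteScheduler:
    """Back-end sketch: one queue per host, plus the earliest time
    at which that host may be contacted again."""
    def __init__(self, delay=2.0):
        self.delay = delay                      # seconds between requests to one host (assumed value)
        self.queues = defaultdict(deque)        # host -> URLs waiting
        self.next_contact = defaultdict(float)  # host -> earliest allowed contact time

    def add(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self):
        now = time.time()
        for host, queue in self.queues.items():
            if queue and self.next_contact[host] <= now:
                self.next_contact[host] = now + self.delay
                return queue.popleft()
        return None                              # no host may be contacted right now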
Increasing the throughput
Parallelize the process to fetch many pages at the same time (~thousands per second).
Possible levels of parallelization: DNS, HTTP, parsing.
Domain Name resolution
Problem: DNS requires time to resolve the server hostname
Domain Name resolution
1. Asynchronous DNS resolver:
• Concurrent handling of multiple outstanding requests
• Not provided by most UNIX implementations of gethostbyname
• GNU ADNS library
• http://www.chiark.greenend.org.uk/~ian/adns/
• In Mercator, this cut the share of each thread’s elapsed time spent on DNS from 87% to 25%
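The same effect (many lookups outstanding at once) can be sketched with Python's asyncio event loop instead of ADNS; the hostnames are placeholders:

import asyncio

async def resolve_all(hostnames):
    # Issue all lookups concurrently; the event loop multiplexes the outstanding requests.
    loop = asyncio.get_running_loop()
    tasks = [loop.getaddrinfo(h, 80) for h in hostnames]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return {h: r for h, r in zip(hostnames, results)}

# Example usage with placeholder hostnames:
# asyncio.run(resolve_all(["www.example.com", "www.example.org"]))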
Domain Name resolution
2. Customized DNS component:
• Caching server with a persistent cache largely residing in memory
• Prefetching: hostnames extracted from HREFs and requests made to the caching server
• Does not wait for resolution to be completed
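A rough sketch of the caching/prefetching idea, assuming an in-memory cache and a background thread pool; fire-and-forget lookups mean the component issuing prefetches never blocks on resolution:

import socket
from concurrent.futures import ThreadPoolExecutor

class CachingResolver:
    """Sketch: in-memory DNS cache plus prefetching of hostnames seen in HREFs."""
    def __init__(self):
        self.cache = {}                                   # hostname -> IP address
        self.pool = ThreadPoolExecutor(max_workers=20)    # assumed pool size

    def prefetch(self, hostname):
        # Fire-and-forget: submit the lookup and return immediately.
        if hostname not in self.cache:
            self.pool.submit(self._lookup, hostname)

    def _lookup(self, hostname):
        try:
            self.cache[hostname] = socket.gethostbyname(hostname)
        except OSError:
            pass                                          # unresolved hosts stay out of the cache

    def resolve(self, hostname):
        # By the time the URL is actually scheduled, the answer is usually already cached.
        return self.cache.get(hostname) or socket.gethostbyname(hostname)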
Crawler Architecture
[Diagram: the architecture further extended with an asynchronous DNS resolver client, a DNS cache, and DNS prefetching, alongside the per-server queues and Robots store.]
Page retrieval
Problem: HTTP requires time to fetch a page
1. Multithreading
• Blocking system calls (synchronous I/O)
• pthreads multithreading library
• Used in Mercator, Sphinx, WebRace
• Sphinx uses a monitor to determine the optimal number of threads at runtime
• Mutual exclusion overhead
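A minimal multithreaded fetcher with Python threads and blocking I/O; the URLs are placeholders, and the thread count is the kind of knob Sphinx's monitor would tune:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    # Blocking call: the thread sleeps while the page is transferred.
    with urlopen(url, timeout=10) as response:
        return url, response.read()

urls = ["http://www.example.com/", "http://www.example.org/"]   # placeholder URLs
with ThreadPoolExecutor(max_workers=50) as pool:                # assumed number of threads
    for url, body in pool.map(fetch, urls):
        print(url, len(body), "bytes")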
Page retrieval
2.Asynchronous sockets
• not blocking the process/thread
• select monitors several sockets at the same time
• Does not need mutual exclusion since it performs a serialized completion of threads (i.e. the code that completes processing the page is not interrupted by other completions).
• Used in IXE (1024 connections at once)
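A bare-bones sketch of the select-based approach with non-blocking sockets (HTTP/1.0 requests, no error handling); the hostnames are placeholders and the structure is illustrative, not IXE's actual code:

import select, socket

def fetch_roots(hosts):
    """Fetch '/' from several hosts at once with a single select() loop."""
    socks, pending, pages = {}, set(), {}
    for host in hosts:
        s = socket.socket()
        s.setblocking(False)
        s.connect_ex((host, 80))             # non-blocking connect: returns immediately
        socks[s], pages[host] = host, b""
        pending.add(s)                        # request still to be sent on this socket
    while socks:
        readable, writable, _ = select.select(list(socks), list(pending), [], 5)
        if not readable and not writable:     # timeout: give up on the rest
            break
        for s in writable:                    # connection established: send the request once
            s.sendall(b"GET / HTTP/1.0\r\nHost: " + socks[s].encode() + b"\r\n\r\n")
            pending.discard(s)
        for s in readable:
            data = s.recv(4096)
            if data:
                pages[socks[s]] += data
            else:                             # server closed the connection: page complete
                s.close()
                del socks[s]
    return pages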
Page retrieval
3.Persistent connection
• Multiple documents requested on a single connection
• Feature of HTTP 1.1
• Reduce the number of HTTP connection setups
• Used in IXE
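With Python's http.client the same HTTP/1.1 connection can be reused for several documents; the host and paths are placeholders:

import http.client

# One TCP connection, several requests: HTTP/1.1 keeps the connection alive.
conn = http.client.HTTPConnection("www.example.com", timeout=10)   # placeholder host
for path in ["/", "/a.html", "/b.html"]:                           # placeholder paths
    conn.request("GET", path)
    response = conn.getresponse()
    body = response.read()            # the body must be drained before the next request
    print(path, response.status, len(body), "bytes")
conn.close()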
IXE Crawler
[Diagram: the Scheduler feeds per-host queues (Hosts, Robots); a Feeder passes URLs to several Retriever threads multiplexed with select(); the Parser updates the Crawler's Table<UrlInfo>, Citations and Cache/CrawlInfo structures, kept partly in memory and partly on disk; a UrlEnumerator supplies URLs, and threads coordinate through synchronization objects.]
IXE Parser
• Problem: parsing requires 30% of execution time
• Possible solution: distributed parsing
IXE Parser
[Diagram: the Parser sends extracted URLs (URL1, URL2) to the cache of the URL Table Manager ("Crawler"), which looks them up in Table<UrlInfo> and Citations and returns the corresponding DocIDs (DocID1, DocID2).]
A distributed parser
[Diagram: a Scheduler assigns pages to Parsers 1..N (Sched() → Parser 1); each extracted URL is routed by hash to the responsible Table Manager (Hash(URL1) → Manager 2, Hash(URL2) → Manager 1); on a cache HIT the existing DocID is returned, on a MISS a new DocID is assigned.]
A distributed parser
• Does this solution scale?
- High traffic on the main link
• Suppose that:
- Average page size = 10KB
- Average out-links per page = 10
- URL size = 40 characters (40 bytes)
- DocID size = 5 bytes
• X = throughput (pages per second)
• N = number of parsers
A distributed parser
• Bandwidth for web pages:
- X * 10 * 1024 * 8 = 81920*X bps
(pages per second × page size × byte → bit)
• Bandwidth for messages (hit):
- X/N * 10 * (40+5) * 8 * N = 3600*X bps
(pages per parser × out-links per page × (DocID request + DocID reply) × byte → bit × number of parsers)
• Using 100 Mbps: X = 1226 pages per second
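The figure can be checked in a few lines; here 100 Mbps is taken as 100 * 2^20 bps, which is the reading that reproduces the slide's number:

page_bits = 10 * 1024 * 8            # 10 KB per page, in bits
message_bits = 10 * (40 + 5) * 8     # 10 out-links, 40-byte request + 5-byte reply, in bits
link = 100 * 2**20                   # 100 Mbps link
X = link / (page_bits + message_bits)
print(round(X))                      # ~1226 pages per second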
What we don’t want to fetch
1.Spider traps
2.Duplicates
2.1 Different URLs for the same page
2.2 Already visited URLs
2.3 Same document on different sites
2.4 Mirrors
• At least 10% of the hosts are mirrored
Spider traps
• Spider trap: hyperlink graph constructed unintentionally or malevolently to keep a crawler trapped
1. Infinitely “deep” Web sites
• Problem: using CGI it is possible to generate an infinite number of pages
• Solution: check the URL length
Spider traps
2. Large number of dummy pages
• Example: http://www.troutbums.com/Flyfactory/flyfactory/flyfactory/hatchline/hatchline/flyfactory/hatchline/flyfactory/hatchline/flyfactory/flyfactory/flyfactory/hatchline/flyfactory/hatchline/
• Solution: disable crawling
- a guard removes from consideration any URL from a site which dominates the collection
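Both defenses (the URL length check from the previous slide and the per-site guard) fit in a few lines; the thresholds below are arbitrary example values, not ones taken from any particular crawler:

from collections import Counter
from urllib.parse import urlparse

MAX_URL_LENGTH = 256        # arbitrary example threshold
MAX_PAGES_PER_SITE = 10000  # arbitrary example threshold
pages_per_site = Counter()

def should_fetch(url):
    if len(url) > MAX_URL_LENGTH:                    # likely an infinitely "deep" CGI trap
        return False
    host = urlparse(url).netloc
    if pages_per_site[host] >= MAX_PAGES_PER_SITE:   # site dominates the collection: guard kicks in
        return False
    pages_per_site[host] += 1
    return True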
Avoid duplicates
• Problem almost nonexistent in classic IR
• Duplicate content
• wastes resources (index space)
• annoys users
Virtual Hosting
• Problem: Virtual Hosting
• Allows mapping different sites to a single IP address
• Could be used to create duplicates
• Feature of HTTP 1.1
• Rely on canonical hostnames (CNAMEs) provided by DNS
• Example: http://www.cocacola.com and http://www.coke.com both map to 129.33.45.163
Already visited URLs
• Problem: how to recognize an already visited URL ?
• The page is reachable by many paths
• We need an efficient Duplicate URL Eliminator
Already visited URLs
1. Bloom Filter
• Probabilistic data structure for set membership testing
• Problem: false positives
- new URLs marked as already seen
[Diagram: a URL is fed to n hash functions, each selecting one bit (0/1) in a bit vector.]
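A toy Bloom filter over a plain Python bytearray; the size and the double-hashing scheme are illustrative choices, not what any particular crawler uses:

import hashlib

class BloomFilter:
    def __init__(self, bits=8 * 1024 * 1024, hashes=4):    # assumed sizes
        self.bits, self.hashes = bits, hashes
        self.vector = bytearray(bits // 8)

    def _positions(self, url):
        # Derive several hash values from a single MD5 digest (double hashing).
        digest = hashlib.md5(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.bits for i in range(self.hashes)]

    def add(self, url):
        for pos in self._positions(url):
            self.vector[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # May answer True for a URL never added (false positive), never False for an added one.
        return all(self.vector[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))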
Already visited URLs
2. URL hashing
• MD5
• Using a 64-bit hash function, a billion URLs require 8GB
- Does not fit in memory
- Using the disk limits the crawling rate to 75 downloads per second
[Diagram: MD5 maps a URL to a 128-bit digest.]
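Computing such digests is one line with hashlib; keeping only 8 of the 16 MD5 bytes gives the 64-bit fingerprint mentioned above (the URL is a placeholder):

import hashlib

def url_fingerprint(url):
    # Full MD5 digest is 128 bits; the first 8 bytes give a 64-bit fingerprint.
    return hashlib.md5(url.encode()).digest()[:8]

print(url_fingerprint("http://www.example.com/index.html").hex())   # placeholder URL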
Already visited URLs
3. Two-level hash function
• The crawler is likely to explore URLs within the same site
• Relative URLs create a spatiotemporal locality of access
• Exploit this kind of locality using a cache
[Diagram: the hash is split into a hostname+port component and a path component (24 bits + 40 bits).]
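A sketch of the idea, assuming a 24-bit host component and a 40-bit path component combined into one 64-bit key, and a simple in-memory set as the cache of recently seen fingerprints; which component gets which width is an assumption, not taken from the slide:

import hashlib

def two_level_fingerprint(host, path):
    # Host component (24 bits, assumed) + path component (40 bits, assumed):
    # URLs on the same host share the high-order bits, so they cluster together.
    host_part = int.from_bytes(hashlib.md5(host.encode()).digest()[:3], "big")   # 24 bits
    path_part = int.from_bytes(hashlib.md5(path.encode()).digest()[:5], "big")   # 40 bits
    return (host_part << 40) | path_part

recent = set()   # cache of recently seen fingerprints, exploiting the locality of access

def already_seen(host, path):
    fp = two_level_fingerprint(host, path)
    if fp in recent:
        return True
    recent.add(fp)       # a real crawler would also check and update the on-disk structure
    return False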
Content based techniques
• Problem: how to recognize duplicates based on the page contents?
1. Edit distance
• Number of replacements required to transform one document into the other
• Cost: l1*l2, where l1 and l2 are the lengths of the documents: impractical!
Content based techniques
Problem: pages could have minor syntactic differences!
• site maintainer’s name, latest update
• anchors modified
• different formatting
2.Hashing
• A digest associated with each crawled page
• Used in Mercator
• Cost: one seek in the index for each new crawled page
Content based techniques
3.Shingling
• Shingle (or q-gram): contiguous subsequence of tokens taken from document d
• representable by a fixed length integer
• w-shingle: shingle of width w
• S(d,w): w-shingling of document d
• unordered set of distinct w-shingles contained in document d
Content based techniques
Sentence: a rose is a rose is a rose
Tokens: a, rose, is, a, rose, is, a, rose
4-shingles: (a,rose,is,a) (rose,is,a,rose) (is,a,rose,is) (a,rose,is,a) (rose,is,a,rose)
S(d,4): (a,rose,is,a) (rose,is,a,rose) (is,a,rose,is)
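A few lines reproduce the example: slide a window of width w over the token list and keep the distinct shingles, giving S(d, w):

def shingles(text, w):
    tokens = text.split()
    # Unordered set of distinct w-shingles (contiguous runs of w tokens).
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

print(shingles("a rose is a rose is a rose", 4))
# {('a','rose','is','a'), ('rose','is','a','rose'), ('is','a','rose','is')}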
Content based techniques
• Each token = 32 bits
• w = 10 (suitable value)
• S(d,10) = set of 320-bit numbers (one w-shingle = 10 tokens × 32 bits = 320 bits)
• We can hash the w-shingles and keep 500 bytes of digests for each document
Content based techniques
• Resemblance of documents d1 and d2 (Jaccard coefficient):
r(d1,d2) = |S(d1,w) ∩ S(d2,w)| / |S(d1,w) ∪ S(d2,w)|
• Eliminate pages that are too similar (pages whose resemblance value is close to 1)
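Given two shingle sets (for instance produced by the shingles() sketch above), the resemblance is just the size of the intersection over the size of the union:

def resemblance(s1, s2):
    # s1, s2: sets of w-shingles of the two documents.
    return len(s1 & s2) / len(s1 | s2)   # Jaccard coefficient

# A value close to 1 means the two pages are near-duplicates and one can be dropped.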
Mirrors
[Diagram: anatomy of the URL http://www.research.digital.com/SRC/ : access method, hostname, path.]
• Precision = relevant retrieved docs / retrieved docs
Mirrors
1. URL String based
• Vector Space model: term vector matching to compute the likelihood that a pair of hosts are mirrors
• terms with df(t) < 100
Mirrors
a) Hostname matching
• Terms: substrings of the hostname
• Term weighting: weight(t) = log(len(t)) / (1 + log(df(t)))
len(t) = number of segments obtained by breaking the term at ‘.’ characters
• This weighting favours substrings composed of many segments, which are very specific
• Precision: 27%
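A sketch under the assumption that "substrings of the hostname" means contiguous runs of dot-separated segments, with df(t) coming from some precomputed table; both the term definition and the df table are assumptions made for illustration:

import math

def hostname_terms(hostname):
    # Contiguous runs of dot-separated segments, e.g. "research.digital.com"
    # yields "research", "digital", "com", "research.digital", "digital.com", ...
    segments = hostname.split(".")
    return {".".join(segments[i:j])
            for i in range(len(segments))
            for j in range(i + 1, len(segments) + 1)}

def term_weight(term, df):
    # df: hypothetical table {term: number of hosts containing the term}
    length = term.count(".") + 1                      # number of segments in the term
    return math.log(length) / (1 + math.log(df.get(term, 1)))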
Mirrors
b) Full path matching
• Terms: entire paths
• Term weighting: weight(t) = log(1 + mdf/df(t)), where mdf = max df(t) over all terms in the collection
• Precision: 59%
Connectivity based filtering stage:
• Idea: mirrors share many common paths
• Test, for each common path, whether it has the same set of out-links on both hosts
• Remove hostnames from local URLs
• Precision gain: +19%
Mirrors
c) Positional word bigram matching
• Term creation:
- Break the path into a list of words by treating ‘/’ and ‘.’ as breaks
- Eliminate non-alphanumeric characters
- Replace digits with ‘*’ (effect similar to stemming)
- Combine successive pairs of words in the list
- Append the ordinal position of the first word
• Precision: 72%
Mirrors
Example: conferences/d299/advanceprogram.html
Words: conferences, d*, advanceprogram, html
Positional word bigrams: conferences_d*_0, d*_advanceprogram_1, advanceprogram_html_2
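The transformation is mechanical enough to sketch directly; the function below reproduces the example above:

import re

def positional_word_bigrams(path):
    words = [re.sub(r"\d+", "*", re.sub(r"[^0-9A-Za-z]", "", w))   # strip non-alphanumerics, digits -> '*'
             for w in re.split(r"[/.]", path) if w]                # break at '/' and '.'
    return [f"{a}_{b}_{i}" for i, (a, b) in enumerate(zip(words, words[1:]))]

print(positional_word_bigrams("conferences/d299/advanceprogram.html"))
# ['conferences_d*_0', 'd*_advanceprogram_1', 'advanceprogram_html_2']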
Mirrors
2. Host connectivity based
• Consider all documents on a host as a single large document
• Graph:
- host → node
- document on host A pointing to a document on host B → directed edge from A to B
• Idea: two hosts are likely to be mirrors if their nodes point to the same nodes
• Term vector matching
- Terms: set of nodes that a host’s node points to
• Precision: 45%
References
S. Chakrabarti, Mining the Web: Analysis of Hypertext and Semi Structured Data, Morgan Kaufmann, 2002. Pages 17-43, 71-72.
S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search engine, Proceedings of the 7th World Wide Web Conference (WWW7), 1998.
A. Heydon and M. Najork, Mercator: A scalable, extensible Web crawler, World Wide Web Conference, 1999.
K. Bharat, A. Broder, J. Dean, and M. R. Henzinger, A comparison of Techniques to Find Mirrored Hosts on the WWW, Journal of the American Society for Information Science, 2000.
References
A. Heydon and M. Najork, High performance Web Crawling, SRC Research Report 173, Compaq Systems Research Center, 26 September 2001.
R. C. Miller and K. Bharat, SPHINX: a framework for creating personal, site-specific web crawlers, Proceedings of the 7th World Wide Web Conference (WWW7), 1998.
D. Zeinalipour-Yazti and M. Dikaiakos, Design and Implementation of a Distributed Crawler and Filtering Processor, Proceedings of the 5th Workshop on Next Generation Information Technologies and Systems (NGITS 2002), June 2002.