1
Searching the Web
Representation and Management
of Data on the Internet
2
What does a Search Engine do?
• Processes users’ queries
• Finds pages with related information
• Returns a resources list
• Why can’t we use an ordinary database system
that is reachable via an ordinary Web server?
• What are the difficulties in creating a search
engine?
3
Motivation
• The web is
– Used by millions
– Contains lots of information
– Link based
– Incoherent
– Changes rapidly
– Distributed
• Traditional information retrieval was built with the
exact opposite in mind
4
The Web’s Characteristics
• Size
– Over a billion pages available (Google is a spelling of
googol = 10^100)
– 5-10K per page => tens of terabytes
– Size doubles every 2 years
• Change
– 23% change daily
– About half of the pages do not exist after 10 days
– Bowtie structure
5
Bowtie Structure
• Core: strongly connected component (28%)
• Reachable from the core (22%)
• Can reach the core (22%)
6
Search Engine Components
• User Interface
• Query processor
• Crawler
• Indexer
• Ranker
7
An HTML form for inserting a search query
Usually a query is a list of words
What was the most popular query in Google in the last year?
What does it mean to be popular in Google?
8
9
10
Crawling the Web
11
Basic Crawler (Spider)
• A crawler finds Web pages to download into a search engine cache
• It maintains a queue of pages and repeatedly:
– removeBestPage( )
– findLinksInPage( )
– insertIntoQueue( )
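A minimal sketch of this loop in Python (illustrative only; fetch, extract_links and the importance function are assumed helpers, and the priority queue plays the role of removeBestPage):

    import heapq

    def crawl(seed_urls, fetch, extract_links, importance, max_pages=1000):
        # priority queue keyed by (negated) importance
        queue = [(-importance(u), u) for u in seed_urls]
        heapq.heapify(queue)
        seen = set(u for _, u in queue)
        cache = {}
        while queue and len(cache) < max_pages:
            _, url = heapq.heappop(queue)          # removeBestPage()
            html = fetch(url)                      # download into the cache
            cache[url] = html
            for link in extract_links(html):       # findLinksInPage()
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(queue, (-importance(link), link))  # insertIntoQueue()
        return cache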
12
Choosing Pages to Download
• Q: Which pages should be downloaded?
• A: It is usually not possible to download all
pages because of space limitations. Try to
get the most important pages
• Q: When is a page important?
• A: Use a metric – by interest, by popularity,
by location, or a combination of these
13
Interest Driven
• Suppose that there is a query Q that contains the words we
will be interested in
• Define the importance of a page P by its textual similarity to
the query Q
• Example: use a formula that combines
– The number of appearances of words from Q in P
– For each word of Q, how frequently it is used overall (why is this
important?)
• Problem: We must decide if a page is important while
crawling. However, we don’t know how rare a word is until the
crawl is complete
• Solution: Use an estimate
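A small sketch of such an estimate, assuming we track, for each word, in how many of the pages crawled so far it has appeared (doc_freq_so_far is an assumed helper):

    import math

    def interest_importance(page_words, query_words, doc_freq_so_far, pages_so_far):
        # page_words: list of words in P; query_words: the words of Q
        score = 0.0
        for w in query_words:
            count = page_words.count(w)                       # appearances of w in P
            rarity = math.log((pages_so_far + 1) / (doc_freq_so_far(w) + 1))
            score += count * rarity                           # rarer query words weigh more
        return score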
14
Popularity Driven
• The importance of a page P is proportional
to the number of pages with a link to P
• This is also called the number of back links
of P
• As before, need to estimate this amount
• There is a more sophisticated metric, called
PageRank (was taught on Tuesday)
15
Location Driven
• The importance of P is a function of its URL
• Example:
– Words appearing in the URL (e.g., edu or ac)
– Number of “/” in the URL
• Easily evaluated, requires no data from previous
crawls
• Note: We can also use a combination of all three
metrics
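A tiny sketch of a location-driven score (the weights and the list of “interesting” domain suffixes are arbitrary illustrative choices):

    from urllib.parse import urlparse

    def location_importance(url):
        parsed = urlparse(url)
        score = 0.0
        if parsed.netloc.endswith((".edu", ".ac.il")):  # educational domains
            score += 1.0
        score -= 0.1 * parsed.path.count("/")           # deeper pages score lower
        return score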
16
Refreshing Web Pages
• Pages that have been downloaded must be
refreshed periodically
• Q: Which pages should be refreshed?
• Q: How often should we refresh a page?
17
Freshness Metric
• A cached page is fresh if it is identical to the
version on the Web
• Suppose that S is a set of pages (i.e., a
cache)
Freshness(S) = (number of fresh pages in S) / (number of pages in S)
18
Age Metric
• The age of a page is the number of days
since it was refreshed
• Suppose that S is a set of pages (i.e., a
cache)
Age(S) = average age of the pages in S
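A sketch of both metrics, assuming each cached page records whether it is identical to the live copy (is_fresh) and when it was last refreshed (last_refresh, a date):

    def freshness(cache):
        return sum(1 for page in cache if page.is_fresh) / len(cache)

    def age(cache, today):
        return sum((today - page.last_refresh).days for page in cache) / len(cache)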
19
Refresh Goal
• Crawlers can refresh only a certain number
of pages in a period of time
• The page download resource can be
allocated in many ways
• Goal: Minimize the age of a cache and
maximize the freshness of a cache
• We need a refresh strategy
20
Refresh Strategies
• Uniform Refresh: The crawler revisits all pages
with the same frequency, regardless of how often
they change
• Proportional Refresh: The crawler revisits a page
with frequency proportional to the page’s change
rate (i.e., if it changes more often, we visit it more
often)
Which do you think is better?
21
Trick Question
• Two-page database
• e1 changes daily
• e2 changes once a week
• Can visit one page per week
• How should we visit pages?
– e1 e2 e1 e2 e1 e2 e1 e2... [uniform]
– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 … [proportional]
– e1 e1 e1 e1 e1 e1 ...
– e2 e2 e2 e2 e2 e2 ...
– ?
[Diagram: pages e1 and e2 on the web and their cached copies in the database]
22
Proportional Often Not Good!
• Visiting fast-changing e1 gains about 1/2 day of freshness
• Visiting slow-changing e2 gains about 1/2 week of freshness
• Visiting e2 is a better deal!
23
Another Example
• The collection contains 2 pages: e1 changes 9
times a day, e2 changes once a day
• Simplified change model:
– Day is split into 9 equal intervals: e1 changes once on
each interval, and e2 changes once during the day
– Don’t know when the pages change within the intervals
• The crawler can download a page a day
• Our goal is to maximize the freshness
24
Which Page Do We Refresh?
• Suppose we refresh e2 in midday
• If e2 changes in the first half of the day, it
remains fresh for the rest (half) of the day
– 50% chance of a 0.5 day freshness increase
– 50% chance of no increase
– Expected freshness increase: 0.25 day
25
Which Page Do We Refresh?
• Suppose we refresh e1 in midday
• If e1 changes in the first half of its interval and we
refresh at the middle of that interval, it remains fresh
for the remaining half of the interval = 1/18 of a day
– 50% chance of a 1/18 day freshness increase
– 50% chance of no increase
– Expected freshness increase: 1/36 day
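The same expected-gain arithmetic written out, as a small sketch of the calculation on the two slides above:

    # Refreshing in the middle of an interval pays off only if the page already
    # changed in the first half of that interval (probability 1/2); the copy then
    # stays fresh for the remaining half of the interval.
    interval_e2 = 1.0        # e2 changes once per day
    interval_e1 = 1.0 / 9    # e1 changes once per 1/9 of a day

    gain_e2 = 0.5 * (interval_e2 / 2)   # 0.25 day of expected freshness
    gain_e1 = 0.5 * (interval_e1 / 2)   # 1/36 day of expected freshness
    print(gain_e2, gain_e1)             # refreshing e2 is the better deal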
26
Not Every Page is Equal!
• Suppose that e1 is accessed twice as often
as e2
• Then, it is twice as important to us that e1 is
fresh as it is that e2 is fresh
27
Politeness Issues
• When a crawler crawls a site, it uses the site’s
resources:
– The web server needs to find the file in its file system
– The web server needs to send the file over the network
• If a crawler asks for many pages at a high rate, it may
– crash the site’s web server, or
– be banned from the site
• Solution: Ask for pages “slowly”
28
Politeness Issues (cont)
• A site may identify pages that it doesn’t want to be
crawled (how?)
• A polite crawler will not crawl these pages (although
nothing stops a crawler from being impolite)
• Put a file called robots.txt at the main directory to
identify pages that should not be crawled (e.g.,
http://www.cnn.com/robots.txt)
29
robots.txt
• Use the header User-Agent to identify
programs whose access should be restricted
• Use the header Disallow to identify pages
that should be restricted
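An illustrative robots.txt (the paths and the agent name below are made up for the example):

    User-Agent: *
    Disallow: /private/
    Disallow: /search-results/

    User-Agent: BadBot
    Disallow: /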
30
Other Issues
• Suppose that a search engine uses several
crawlers at the same time (in parallel)
• How can we make sure that they are not
doing the same work (i.e., visiting the same
pages)?
31
Index Repository
32
Storage Challenges
• Scalability: Should be able to store huge amounts
of data (data spans disks or computers)
• Dual Access Mode: Random access (find specific
pages) and Streaming access (find large subsets
of pages)
• Large Batch Updates: Reclaim old space, avoid
access/update conflicts
• Obsolete Pages: Remove pages no longer on the
web (how do we find these pages?)
33
Storage Challenges
• Storage cost: Should be able to store the
huge amounts of data at a reasonable cost
(a disk that can store a few terabytes is very
expensive, so what do search engines such
as Google do?)
34
Update Strategies
• Updates are generated by the crawler
• Several characteristics
– Time in which the crawl occurs and the
repository receives information
– Whether the crawl’s information replaces the
entire database or modifies parts of it
35
Batch Crawler vs. Steady Crawler
• Batch mode
– Periodically executed
– Allocated a certain amount of time
• Steady mode
– Runs all the time
– Continuously sends results back to the repository
36
Partial vs. Complete Crawls
• A batch mode crawler can either do
– A complete crawl every run, and replace entire cache
– A partial crawl and replace only a subset of the cache
• The repository can implement
– In-place update: replaces the data in the cache, thus
refreshing pages quickly
– Shadowing: creates a new index with the updates and later
replaces the previous one, thus avoiding refresh-access
conflicts
37
Partial vs. Complete Crawls
• Shadowing resolves the conflicts between
updates and the reads issued by queries
• Batch mode fits well with shadowing
• A steady crawler fits well with in-place updates
38
Types of Indices
• Content index: Allows us to easily find pages
with certain words
• Links index: Allows us to easily find links
between pages
• Utility index: Allows us to easily find pages in
a certain domain, of a certain type, etc.
• Q: What do we need these for?
39
Is the Following Content Index Good?
• Consider the table:
• We want to quickly find pages with a specific word
• Is this a good way of storing a content index?
Word | Frequency | UrlId
 ... |    ...    |  ...
40
Is the Following Content Index Good? NO
• If a word appears in a thousand documents, then
the word will be in a thousand rows. Why waste the
space?
• If a word appears in a thousand documents, we will
have to access a thousand rows in order to find the
documents
• Does not easily support queries that require
multiple words
41
Inverted Keyword Index
bush: (1, 5, 11, 17)
saddam: (3, 5, 11, 17)
war: (3, 5, 17, 28)
butterfly: (4, 22)
• A hashtable with words as keys and the lists of
matching documents (urlIds) as values
• The lists are sorted by urlId
42
Query: “bush saddam war”
bush: (1, 5, 11, 17)
saddam: (3, 5, 11, 17)
war: (3, 5, 17, 28)
Answers: 5, 17
Algorithm: always advance the pointer(s) with the lowest urlId
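A sketch of this merge in Python, reproducing the example above (the posting lists are taken from the previous slide):

    def intersect(postings):
        # postings: one sorted list of urlIds per query word
        pointers = [0] * len(postings)
        answers = []
        while all(p < len(lst) for p, lst in zip(pointers, postings)):
            current = [lst[p] for p, lst in zip(pointers, postings)]
            if len(set(current)) == 1:                 # all lists agree: an answer
                answers.append(current[0])
                pointers = [p + 1 for p in pointers]
            else:                                      # advance pointer(s) with lowest urlId
                low = min(current)
                pointers = [p + 1 if lst[p] == low else p
                            for p, lst in zip(pointers, postings)]
        return answers

    index = {"bush": [1, 5, 11, 17], "saddam": [3, 5, 11, 17], "war": [3, 5, 17, 28]}
    print(intersect([index[w] for w in ["bush", "saddam", "war"]]))  # [5, 17]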
43
Challenges
• Index building must be:
– Fast
– Economical
• Incremental indexing must be supported
• Tradeoff when using compression: memory
is saved but time is lost compressing and
decompressing
44
How do we Distribute the Indices Between Files?
• Local inverted file
– Each file holds the full index for a disjoint subset of the pages
– A query is broadcast to all files
– The result is the merge of the individual answers
• Global inverted file
– Each file is responsible for a subset of terms in the collection
– Query is “sent” only to the appropriate files
• What happens if a disk crashes (which scheme is better in
this case?)
45
Ranking
46
A Naïve Approach
• Let Q (the query) be a set of words
• Let countQ(P) be the number of occurrences of
words of Q in P
• A naïve approach:
– If countQ(P1) > countQ(P2), then the rank of P1 should be
higher than the rank of P2
• What are the problems with the naïve approach?
47
Testing the Naïve Approach
• Q = “green men mars”
– P1 = “I live in a green house with a green roof”
– P2 = “There is no life form on Mars”
– P3 = “Men don’t like green cars”
– P4 = “I saw some little green men yesterday”
• In what order do you think that these ‘pages’
should be returned?
48
The Vector Space Model
• The Vector Space Model (VSM) is a way of
representing documents through the words that
they contain
• It is a standard technique in Information Retrieval
• The VSM allows decisions to be made about which
documents are similar to each other and to
keyword queries
49
How Does it Work
• Each document is broken down into a word
frequency table
• The tables are called vectors and can be stored as
arrays
• A vocabulary is built from all the words in all
documents in the system
• Each document is represented as a vector based
on the vocabulary
50
Example
• Document A
– “A dog and a cat.”
• Document B
– “A frog.”
Word frequencies for A:   a: 2, dog: 1, and: 1, cat: 1
Word frequencies for B:   a: 1, frog: 1
51
Example (continued)
• The vocabulary contains all the words that
are used:
– a, dog, and, cat, frog
• The vocabulary is sorted
– a, and, cat, dog, frog
52
Example (continued)
• Document A: “A dog and a cat.”
– Vector: (2,1,1,1,0)
• Document B: “A frog.”
– Vector: (1,0,0,0,1)
      a  and  cat  dog  frog
A:    2   1    1    1    0
B:    1   0    0    0    1
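A short sketch that reproduces this example (the simple tokenizer below is an assumption for illustration):

    def tokenize(text):
        return text.lower().replace(".", "").split()

    docs = {"A": "A dog and a cat.", "B": "A frog."}
    tokens = {name: tokenize(text) for name, text in docs.items()}
    vocabulary = sorted(set(w for words in tokens.values() for w in words))
    # vocabulary == ['a', 'and', 'cat', 'dog', 'frog']

    vectors = {name: [words.count(w) for w in vocabulary]
               for name, words in tokens.items()}
    # vectors == {'A': [2, 1, 1, 1, 0], 'B': [1, 0, 0, 0, 1]}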
53
Queries
• Queries can be represented as vectors in
the same way as documents:
– “dog” = (0,0,0,1,0)
– “frog” = (0,0,0,0,1)
– “dog and frog” = (0,1,0,1,1)
54
Similarity Measures
• There are many different ways to measure how
similar two documents are, or how similar a
document is to a query
• The cosine measure is a very common similarity
measure
• Using a similarity measure, a set of documents can
be compared to a query and the most similar
document returned
55
The Cosine Measure
• For two vectors d and d’ the cosine similarity
between d and d’ is given by:
sim(d, d’) = (d · d’) / (|d| × |d’|)
• Here d · d’ is the dot product of d and d’,
calculated by multiplying corresponding
frequencies together and summing the results
• The cosine measure calculates the angle between
the vectors in a high-dimensional virtual space
56
Example
• Let d = (2,1,1,1,0) and d’ = (0,0,0,1,0)
– d · d’ = 2·0 + 1·0 + 1·0 + 1·1 + 0·0 = 1
– |d| = √(2² + 1² + 1² + 1² + 0²) = √7 ≈ 2.646
– |d’| = √(0² + 0² + 0² + 1² + 0²) = √1 = 1
– Similarity = 1 / (1 × 2.646) ≈ 0.378
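A sketch of the cosine measure that reproduces this worked example:

    import math

    def cosine(d, d2):
        dot = sum(x * y for x, y in zip(d, d2))
        norm_d = math.sqrt(sum(x * x for x in d))
        norm_d2 = math.sqrt(sum(y * y for y in d2))
        return dot / (norm_d * norm_d2)

    print(cosine([2, 1, 1, 1, 0], [0, 0, 0, 1, 0]))  # ~0.378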
57
Ranking Documents
• A user enters a query
• The query is compared to all documents
using a similarity measure
• The user is shown the documents in
decreasing order of similarity to the query
58
Vocabulary
• Stopword lists
– Commonly occurring words are unlikely to give useful
information and may be removed from the vocabulary to
speed processing
• Examples: a, and, to, is, of, in, if, would, very, when, you, …
– Stopword lists contain frequent words to be excluded
– Stopword lists need to be used carefully
• E.g. “to be or not to be”
59
Stemming
• Suppose that a user is interested in finding
pages about “running shoes”
• In many cases it is desirable to also return pages
containing shoe (not just shoes), and pages
containing run or runs (not just running)
• In order to accommodate such variations, a
stemmer is used
60
Stemming (continued)
• A stemmer receives a keyword as input, and
returns its stem (or normal form)
• For example, the stem of running might be run
• Instead of checking whether a word w appears in a
page P, a search engine might check if there is a
word w' in P that has the same stem as w, i.e.,
stem(w)=stem(w')
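As an illustration (assuming the NLTK package is available), a Porter stemmer can be used:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    print(stemmer.stem("running"), stemmer.stem("runs"), stemmer.stem("shoes"))
    # e.g. run run shoe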
61
Term Weighting
• Not all words are equally useful
• A word is most likely to be highly relevant to
document A if it is:
– Infrequent in other documents
– Frequent in document A
• The cosine measure needs to be modified to
reflect this
62
Normalised Term Frequency (tf)
• A normalised measure of the importance of a word
to a document is its frequency, divided by the
maximum frequency of any term in the document
• This is known as the tf factor
• Example:
– Given the raw frequency vector (2,1,1,1,0), the maximum frequency is 2
– We get the tf vector: (1, 1/2, 1/2, 1/2, 0)
• This stops large documents from scoring higher
63
Inverse Document Frequency (idf)
• A calculation designed to make rare words more
important than common words
• The idf of word w is given by: idf_w = log(N / n_w)
• where N is the number of documents and n_w is the
number of pages that contain the word w
64
tf-idf
• The tf-idf weighting scheme is to multiply
each word in each document by its tf factor
and idf factor
– TF-IDF(P, Q) = Σ_{w in Q} tf(P, w) · idf(w)
• Different schemes are usually used for query
vectors
• Different variants of tf-idf are also used
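A sketch of tf-idf scoring that follows the definitions above (tf normalised by the most frequent term in the page, idf computed over the whole collection):

    import math

    def tf(page_words, w):
        max_freq = max(page_words.count(t) for t in set(page_words))
        return page_words.count(w) / max_freq

    def idf(w, collection):
        # collection: list of pages, each a list of words
        n_w = sum(1 for words in collection if w in words)
        return math.log(len(collection) / n_w) if n_w else 0.0

    def tf_idf_score(page_words, query_words, collection):
        return sum(tf(page_words, w) * idf(w, collection) for w in query_words)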
65
Traditional Ranking Faults (e.g., TF-IDF)
• Many pages containing a term may be of
poor quality or not relevant
• People put popular words in irrelevant sites
to promote the site
• Queries are short, so containing the words
from a query does not indicate importance
66
Additional Factors for Ranking
• Links: If an important page links to P, then P must
be important
• Words on links: If a page links to P with the query
keyword in the link text, the page P must really be
about the keywords
• Style of words: If a keyword appears in P in a title,
header, large font size, it is more important
67
The Hidden Web Challenge
68
The Hidden (Deep) Web
• Web pages that are protected by a password
• Web pages that require filling in a registration form in
order to reach them
• Web pages that are dynamically created from data
in a database (e.g., search results)
• In a weaker sense:
– Web pages that no other page links to
– Pages that search engines are not allowed to crawl (by
robots.txt)
69
One of the Challenges in Archiving the Web
• Can we reach all of the Web by crawling?
• Why do we care about parts that are not reachable
by ordinary web crawlers?
• It has been estimated that the deep web is 500 times
larger than the visible web
• What will be the effect of web services on the ratio
between the visible web and the hidden web?