Web Searching & Ranking
Zachary G. Ives, University of Pennsylvania
CIS 455/555 – Internet and Web Systems
April 20, 2023
Some content based on slides by Marti Hearst, Ray Larson
Recall Where We Left Off
We were discussing information retrieval ranking models
The Boolean model captures some intuitions of what we want – AND, OR
But it’s too restrictive, and has no real ranking between returned answers
Vector Model

sim(q, dj) = cos(θ) = (vec(dj) · vec(q)) / (|dj| · |q|) = (Σi wij · wiq) / (|dj| · |q|)

Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q, dj) ≤ 1
A document is retrieved even if it matches the query terms only partially
[Figure: vectors dj and q in term space, with θ the angle between them]
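A minimal sketch of the cosine computation in Python (the dict-of-term-weights representation and the toy values are my assumptions, not from the slides):

```python
import math

def cosine_sim(d, q):
    """Cosine similarity between a document and a query, each represented
    as a dict mapping term -> weight (the w_ij and w_iq of the slides)."""
    dot = sum(w * q[t] for t, w in d.items() if t in q)   # sum_i w_ij * w_iq
    norm_d = math.sqrt(sum(w * w for w in d.values()))    # |d_j|
    norm_q = math.sqrt(sum(w * w for w in q.values()))    # |q|
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# A document matching only one of the query's two terms still gets a
# nonzero score: partial matching.
print(cosine_sim({"olympics": 2.0}, {"olympics": 1.0, "sports": 1.0}))  # ~0.707
```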
Weights in the Vector Model

sim(q, dj) = (Σi wij · wiq) / (|dj| · |q|)

How do we compute the weights wij and wiq? A good weight must take into account two effects:
Quantification of intra-document content (similarity): the tf factor, the term frequency within a document
Quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency
wij = tf(i,j) · idf(i)
TF and IDF Factors
Let:N be the total number of docs in the collectionni be the number of docs which contain ki
freq(i,j) raw frequency of ki within dj
A normalized tf factor is given byf(i,j) = freq(i,j) / max(freq(l,j))
where the maximum is computed over all terms which occur within the document dj
The idf factor is computed asidf(i) = log (N / ni)
the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information
associated with the term ki
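These formulas translate directly to code. A small sketch (the use of natural log is an assumption; the slides don't fix the base):

```python
import math

def tf(freq_ij, max_freq_j):
    # Normalized term frequency: f(i,j) = freq(i,j) / max_l freq(l,j)
    return freq_ij / max_freq_j

def idf(N, n_i):
    # Inverse document frequency: idf(i) = log(N / n_i)
    return math.log(N / n_i)

def weight(freq_ij, max_freq_j, N, n_i):
    # w_ij = tf(i,j) * idf(i)
    return tf(freq_ij, max_freq_j) * idf(N, n_i)

# A term occurring 3 times in a doc whose most frequent term occurs 10 times,
# in a collection of 1000 docs where 50 contain the term:
print(weight(3, 10, 1000, 50))   # 0.3 * log(20) ≈ 0.90
```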
Vector Model, Example II

[Figure: documents d1–d7 plotted in the space of terms k1, k2, k3]

      k1  k2  k3  | q · dj
  d1   1   0   1  |   4
  d2   1   0   0  |   1
  d3   0   1   1  |   5
  d4   1   0   0  |   1
  d5   1   1   1  |   6
  d6   1   1   0  |   3
  d7   0   1   0  |   2
   q   1   2   3  |
Vector Model, Example III

[Figure: documents d1–d7 plotted in the space of terms k1, k2, k3]

      k1  k2  k3  | q · dj
  d1   2   0   1  |   5
  d2   1   0   0  |   1
  d3   0   1   3  |  11
  d4   2   0   0  |   2
  d5   1   2   4  |  17
  d6   1   2   0  |   5
  d7   0   5   0  |  10
   q   1   2   3  |
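To check the arithmetic, a short script that recomputes the q · dj column of Example III (note the tables rank by plain dot product; the full cosine measure would also normalize by |dj| · |q|):

```python
# Documents and the query are vectors of weights for (k1, k2, k3).
docs = {
    "d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 3], "d4": [2, 0, 0],
    "d5": [1, 2, 4], "d6": [1, 2, 0], "d7": [0, 5, 0],
}
q = [1, 2, 3]

scores = {d: sum(wd * wq for wd, wq in zip(vec, q)) for d, vec in docs.items()}
for d, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(d, s)   # d5 17, d3 11, d7 10, d1 5, d6 5, d4 2, d2 1
```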
Vector Model, Summarized
The best term-weighting schemes use tf-idf weights: wij = f(i,j) · log(N / ni)
For the query term weights, a suggestion is: wiq = (0.5 + 0.5 · freq(i,q) / maxl freq(l,q)) · log(N / ni)
This model is very good in practice:
tf-idf works well with general collections
Simple and fast to compute
The vector model is usually as good as the known ranking alternatives
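The suggested query weighting in code (again assuming natural log; the function name is illustrative):

```python
import math

def query_weight(freq_iq, max_freq_q, N, n_i):
    # w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)
    return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log(N / n_i)

# A term appearing once in a one-term query, against a collection of
# 1000 docs where 50 contain the term:
print(query_weight(1, 1, 1000, 50))   # 1.0 * log(20) ≈ 3.0
```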
Pros & Cons of Vector Model

Advantages:
Term weighting improves quality of the answer set
Partial matching allows retrieval of docs that approximate the query conditions
The cosine ranking formula sorts documents according to degree of similarity to the query

Disadvantages:
Assumes independence of index terms; not clear if this is a good or bad assumption
Comparison of Classic Models
Boolean model does not provide for partial matches and is considered to be the weakest classic model
Experiments indicate that the vector model outperforms the third alternative, the probabilistic model, in general
Most text search systems use a variation of the vector model
Switching Our Sights to the Web
Information retrieval on the Web is more heterogeneous in nature:
No editor to control quality
Deliberately misleading information ("web spam")
Great variety in types of information: phone books, catalogs, technical reports, news, slide shows, …
Many languages; partial duplication; jargon
Diverse user goals
Very short queries: ~2.35 words on average (Aug 2000; Google results)
And much larger scale!
Handling Short Queries & Mixed-Quality Information

Human processing: Web directories (Yahoo, Open Directory, …) and human-created answers (about.com, Search Wikia); still not clear that automated question-answering works
Capitalism: "paid placement", where advertisers pay to be associated with certain keywords
Clicks / page popularity: pages visited most often
Link analysis: use link structure to determine credibility
… or a combination of all?
Link Analysis for Starting Points: HITS (Kleinberg), PageRank (Google)

Assumptions:
Credible sources will mostly point to credible sources
Names of hyperlinks suggest meaning
Ranking is a function of the query terms and of the hyperlink structure

An example of why this makes sense:
The official Olympics site will be linked to by most high-quality sites about sports, the Olympics, etc.
A spammer who adds "Olympics" to his/her web site probably won't have many links to it
Caveat: "search engine optimization"
Google’s PageRank (Brin/Page 98)
Mine the structure of the web graph independently of the query!
Each web page is a node; each hyperlink is a directed edge
Assumes a random walk (surf) through the web:
Start at a random page
At each step, the surfer proceeds to a randomly chosen successor of the current page with probability d, or to a randomly chosen web page with probability 1 − d
The PageRank of a page p is the fraction of steps the surfer spends at p in the limit
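A Monte Carlo sketch of this definition: simulate the surfer and report visit fractions. The value d = 0.85 and the toy graph are my assumptions, not from the slides:

```python
import random

def simulate_pagerank(succ, d=0.85, steps=100_000):
    """Estimate PageRank by simulating the random surfer: the answer is
    the fraction of steps spent at each page.  succ maps a page to the
    list of pages it links to."""
    pages = list(succ)
    visits = {p: 0 for p in pages}
    cur = random.choice(pages)              # start at a random page
    for _ in range(steps):
        visits[cur] += 1
        if succ[cur] and random.random() < d:
            cur = random.choice(succ[cur])  # follow a random out-link (prob. d)
        else:
            cur = random.choice(pages)      # jump to a random page (prob. 1 - d)
    return {p: visits[p] / steps for p in pages}

# Hypothetical three-page graph: g links to a; y to g and a; a to g and y.
print(simulate_pagerank({"g": ["a"], "y": ["g", "a"], "a": ["g", "y"]}))
```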
Link Counts Aren't Everything…

[Figure: a web graph in which the "A-Team" page is linked to from pages of very different quality: the Yahoo Directory, Wikipedia, Mr. T's page, a Hollywood "Series to Recycle" page, a Team Sports page, and a Cheesy TV Shows page]
PageRank

Importance of page i is governed by pages linking to it:

xi = Σ(j ∈ Bi) xj / Nj

where xi is the rank of page i, Bi is the set of pages j that link to i, and Nj is the number of links out from page j
Computing PageRank (Simple version)

Initialize so total rank sums to 1.0: xi(0) = 1/n
Iterate until convergence: xi(k+1) = Σ(j ∈ Bi) xj(k) / Nj
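A direct Python sketch of this iteration (the three-page graph below is an assumed reconstruction of the example on the next slides; the simple version requires every page to have at least one out-link):

```python
def pagerank_simple(succ, iters=50):
    """Power iteration for the simple (no-decay) PageRank.  succ maps each
    page to the list of pages it links to; dead ends are not handled."""
    pages = list(succ)
    x = {p: 1.0 / len(pages) for p in pages}   # x_i(0) = 1/n
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for j in pages:
            share = x[j] / len(succ[j])        # x_j(k) / N_j
            for i in succ[j]:
                new[i] += share                # sum over in-edges of i
        x = new
    return x

# Assumed three-page example: p1 -> p3; p2 -> p1, p3; p3 -> p2.
print(pagerank_simple({"p1": ["p3"], "p2": ["p1", "p3"], "p3": ["p2"]}))
# -> roughly {'p1': 0.2, 'p2': 0.4, 'p3': 0.4}
```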
Computing PageRank (Step 0)

Initialize so total rank sums to 1.0: xi(0) = 1/n
[Figure: a three-page example; each page starts with rank 0.33]
Computing PageRank (Step 1)

Propagate weights across out-edges: xi(k+1) = Σ(j ∈ Bi) xj(k) / Nj
[Figure: each page sends its rank along its out-edges, split evenly; the edge weights shown are 0.17, 0.17, 0.33, 0.33]
Computing PageRank (Step 2)

Compute weights based on in-edges
[Figure: each page's new rank is the sum of the weights on its in-edges: 0.17, 0.50, 0.33]
Computing PageRank (Convergence)

[Figure: after repeated iterations the three ranks converge to 0.2, 0.4, 0.4]
Naïve PageRank Algorithm Restated

Let:
N(p) = number of outgoing links from page p
B(p) = set of pages that link back to p

Each page b distributes its importance to all of the pages it points to (so we scale by N(b))
Page p's importance is increased by the importance of its back set:

PageRank(p) = Σ(b ∈ B(p)) PageRank(b) / N(b)
In Linear Algebra Terms

Create an m × m matrix M to capture links: M(i, j) = 1/nj if page i is pointed to by page j and page j has nj outgoing links, and 0 otherwise
Initialize all PageRanks to 1; multiply by M repeatedly until all values converge:

[PageRank(p1′), PageRank(p2′), …, PageRank(pm′)]T = M × [PageRank(p1), PageRank(p2), …, PageRank(pm)]T

(Computes principal eigenvector via power iteration)
A Brief Example

[Figure: three pages, Google (g), Yahoo (y), Amazon (a), where g links to a, y links to g and a, and a links to g and y]

 g′       0    0.5  0.5       g
 y′   =   0    0    0.5   ×   y
 a′       1    0.5  0         a

Total rank sums to number of pages
Running for multiple iterations:

 (g, y, a) = (1, 1, 1) → (1, 0.5, 1.5) → (1, 0.75, 1.25) → … → (1, 0.67, 1.33)
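A quick check of this example with numpy (numpy and the iteration counts are my own choices, not from the slides):

```python
import numpy as np

# Columns are pages (g, y, a); M[i, j] = 1/n_j if page j links to page i.
M = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5],
              [1.0, 0.5, 0.0]])

x = np.ones(3)            # initialize all PageRanks to 1
for _ in range(3):
    x = M @ x
    print(x)              # [1. 0.5 1.5], [1. 0.75 1.25], [1. 0.625 1.375]

for _ in range(100):      # keep multiplying until the values converge
    x = M @ x
print(x.round(2))         # [1. 0.67 1.33]
```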
Oops #1 – PageRank Sinks: Dead Ends

[Figure: g links to y and a, and a links to g and y, but y has no out-links, a dead end]

 g′       0    0    0.5       g
 y′   =   0.5  0    0.5   ×   y
 a′       0.5  0    0         a

Running for multiple iterations:

 (g, y, a) = (1, 1, 1) → (0.5, 1, 0.5) → (0.25, 0.5, 0.25) → … → (0, 0, 0)

The dead end leaks all the rank out of the system.
Oops #2 – Hogging all the PageRank

[Figure: g links to y and a, a links to g and y, and y links only to itself]

 g′       0    0    0.5       g
 y′   =   0.5  1    0.5   ×   y
 a′       0.5  0    0         a

Running for multiple iterations:

 (g, y, a) = (1, 1, 1) → (0.5, 2, 0.5) → (0.25, 2.5, 0.25) → … → (0, 3, 0)

The self-loop sink accumulates all the rank.
Improved PageRank

Remove out-degree 0 nodes (or consider them to refer back to their referrers)
Add a decay factor to deal with sinks:

PageRank(p) = d · Σ(b ∈ B(p)) PageRank(b) / N(b) + (1 − d)

Intuition in the idea of the "random surfer": the surfer occasionally stops following the link sequence and jumps to a new random page, with probability 1 − d
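A sketch of the improved computation, using the unnormalized form above (ranks sum to the number of pages); d = 0.8 matches the next slide, and dead-end removal is omitted, so every page is assumed to have at least one out-link:

```python
def pagerank(succ, d=0.8, iters=100):
    """PageRank with a decay factor d; sinks no longer hog or leak rank."""
    pages = list(succ)
    x = {p: 1.0 for p in pages}
    for _ in range(iters):
        new = {p: 1.0 - d for p in pages}       # the (1 - d) term
        for b in pages:
            share = d * x[b] / len(succ[b])     # d * PageRank(b) / N(b)
            for p in succ[b]:
                new[p] += share
        x = new
    return x

# The "hog" graph from the next slide: y links only to itself.
print(pagerank({"g": ["y", "a"], "y": ["y"], "a": ["g", "y"]}))
# -> roughly {'g': 0.33, 'y': 2.33, 'a': 0.33}
```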
Stopping the Hog

[Figure: the Google / Yahoo / Amazon "hog" graph as before]

 g′             0    0    0.5       g       0.2
 y′   =  0.8 ×  0.5  1    0.5   ×   y   +   0.2
 a′             0.5  0    0         a       0.2

Running for multiple iterations:

 (g, y, a) = (0.6, 1.8, 0.6) → (0.44, 2.12, 0.44) → (0.38, 2.25, 0.38) → … → (0.35, 2.30, 0.35)

… though does this seem right?
Summary of Link Analysis
Use back-links as a means of adjusting the “worthiness” or “importance” of a page
Use an iterative process over matrix/vector values to reach a convergence point
PageRank is query-independent and considered relatively stable, but vulnerable to SEO
Can We Go Beyond?
PageRank assumes a "random surfer" who starts at any node, and estimates the likelihood that the surfer will end up at a particular page
A more general notion: label propagation (sketched below)
Take a set of start nodes, each with a different label
Estimate, for every node, the distribution of arrivals from each label
In essence, this captures the relatedness or influence of nodes
Used in YouTube video matching, schema matching, …
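A toy sketch of one way to realize this idea, via random walks that restart at their labeled seed node; the specifics here are my assumptions, not the actual algorithms used in the systems mentioned:

```python
import random

def label_distribution(succ, seeds, d=0.8, steps=100_000):
    """For each node, count visits by random walks that carry a label and
    restart at their labeled seed node.  Normalizing each node's counts
    estimates its distribution of arrivals per label."""
    counts = {p: {lbl: 0 for lbl in seeds.values()} for p in succ}
    for start, lbl in seeds.items():
        cur = start
        for _ in range(steps):
            counts[cur][lbl] += 1
            if succ[cur] and random.random() < d:
                cur = random.choice(succ[cur])  # continue the walk
            else:
                cur = start                     # restart at the labeled seed
    return counts

# Hypothetical graph with two labeled seed nodes:
g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(label_distribution(g, {"a": "red", "c": "blue"}))
```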
Overall Ranking Strategies in Web Search Engines

Everybody has their own "secret sauce" that uses:
Vector model (TF/IDF)
Proximity of terms
Where terms appear (title vs. body vs. link)
Link analysis
Info from directories
Page popularity
(gorank.com, a "search engine optimization" site, compares these factors)

Some alternative approaches:
Some newer engines (Vivisimo, Teoma, Clusty) try to do clustering
A few engines (Dogpile, Mamma.com) try to do meta-search