Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.

Basic Architectures: Search

[Diagram: Web, Spider, Index, Log, search engines (SE, replicated), Browser. Annotations: Spam, Freshness, Quality results, 20M queries/day, 800M pages?, 24x7]

Query Language

Augmented vector space: relevance-scored results

Tf, idf weighting

Boolean constraints: +, -

Phrases: “”

Fields: e.g. title:
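The slide names tf and idf weighting together with + and - constraints but gives no formula, so the following is only a minimal sketch of how the two might combine. The toy corpus, the smoothed idf variant, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus; document ids and texts are invented for illustration.
DOCS = {
    "d1": "information retrieval on the web",
    "d2": "retrieval of web pages by search engines",
    "d3": "cooking pasta at home",
}

def tokenize(text):
    return text.lower().split()

def idf(term, docs):
    """Smoothed inverse document frequency (one of several common variants)."""
    df = sum(1 for text in docs.values() if term in tokenize(text))
    return math.log((1 + len(docs)) / (1 + df)) + 1.0

def score(query_terms, required, excluded, docs):
    """Rank documents by summed tf * idf after applying + and - constraints."""
    results = []
    for doc_id, text in docs.items():
        tf = Counter(tokenize(text))
        if any(tf[t] == 0 for t in required):
            continue  # a +term is missing from this document
        if any(tf[t] > 0 for t in excluded):
            continue  # a -term is present in this document
        total = sum(tf[t] * idf(t, docs) for t in query_terms)
        if total > 0:
            results.append((doc_id, total))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

For example, `score(["retrieval"], ["web"], ["information"], DOCS)` keeps only documents that contain "web" and lack "information" before ranking them.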

Does Word Order Matter?

Try “information retrieval” versus “retrieval information”

Do you get the same results?

The query parser

Interprets query syntax: +, -, “” (rarely used)

Generates a query from free text (critical for precision)
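A parser for the syntax listed on these slides (+, -, quoted phrases, and field prefixes such as title:) could be sketched as below. The regular expression, the clause dictionary layout, and the name `parse_query` are hypothetical; no engine's actual grammar is given in the slides.

```python
import re

# One token is: optional +/- sign, optional field prefix, then either a
# double-quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'([+-]?)(?:(\w+):)?(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a query string into clauses with an operator, field and terms."""
    clauses = []
    for sign, field, phrase, term in TOKEN.findall(query):
        is_phrase = term == ""
        text = phrase if is_phrase else term
        if not text:
            continue  # skip an empty pair of quotes
        clauses.append({
            "op": {"+": "require", "-": "exclude"}.get(sign, "rank"),
            "field": field or None,
            "terms": text.lower().split(),
            "is_phrase": is_phrase,
        })
    return clauses
```

Because a phrase clause keeps its terms in order, `"information retrieval"` and `"retrieval information"` parse to different clauses, which is one way word order can end up mattering.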

Precision Enhancement

Phrase induction: all terms, the closer the better

URL and title matching

Site clustering: group URLs from the same site

Quality-based reranking
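"All terms, the closer the better" can be illustrated with a proximity measure based on the shortest span of text that contains every query term. This is a sketch of one possible measure, not the scoring any particular engine used; both function names are made up here.

```python
def min_window(tokens, query_terms):
    """Length of the shortest token span containing every query term, or None."""
    needed = set(query_terms)
    last_seen = {}
    best = None
    for pos, tok in enumerate(tokens):
        if tok in needed:
            last_seen[tok] = pos
            if len(last_seen) == len(needed):
                # Shortest window ending here starts at the oldest last-seen term.
                span = pos - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best

def proximity_score(tokens, query_terms):
    """1.0 when the terms are adjacent, smaller as they spread out, 0.0 if any is missing."""
    window = min_window(tokens, query_terms)
    if window is None:
        return 0.0
    return len(set(query_terms)) / window
```

A score like this would typically be blended with the base relevance score rather than used alone.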

Sigir’997

Link Analysis

Authors vote via links: pages with more inlinks are higher quality

Not all links are equal: links from higher-quality sites are better

Links in context are better

Resistant to spam: only cross-site links considered
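The "only cross-site links considered" rule can be sketched as an inlink count that ignores links whose source and target share a host. Treating the URL host as the "site" and counting each linking site once are assumptions of this sketch, not details from the slides.

```python
from collections import defaultdict
from urllib.parse import urlparse

def site(url):
    """Approximate a page's site by its host name (an assumption)."""
    return urlparse(url).netloc.lower()

def cross_site_inlinks(links):
    """links: iterable of (source_url, target_url).

    Returns, per target, the number of distinct other sites linking to it.
    """
    sources = defaultdict(set)
    for src, dst in links:
        if site(src) != site(dst):  # same-site links carry no vote
            sources[dst].add(site(src))
    return {dst: len(sites) for dst, sites in sources.items()}
```

Counting sites rather than raw links means many links from one host count as a single vote, which makes the count harder to inflate from a single domain.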

Sigir’998

PageRank (Page’98)

Limiting distribution of a random walk: jump to a random page with probability $\epsilon$; follow a link with probability $1-\epsilon$

Probability of landing at a page D: $P(D) = \epsilon/T + (1-\epsilon)\sum_{C \to D} P(C)/L(C)$

Sum over pages C leading to D

T = total number of pages

L(C) = number of links on page C
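The random-walk definition above can be approximated by repeatedly applying the update until the values settle. The sketch below assumes a jump probability of 0.15 and spreads the weight of pages without outgoing links uniformly; neither choice comes from the slides, and every link target is expected to appear as a key of `out_links`.

```python
def pagerank(out_links, eps=0.15, iterations=100):
    """out_links: dict mapping each page to the list of pages it links to."""
    pages = list(out_links)
    total = len(pages)                      # T in the slide's notation
    rank = {p: 1.0 / total for p in pages}
    for _ in range(iterations):
        new_rank = {p: eps / total for p in pages}   # random-jump term
        for c, targets in out_links.items():
            if targets:
                share = (1 - eps) * rank[c] / len(targets)  # P(C) / L(C)
                for d in targets:
                    new_rank[d] += share
            else:
                # Dangling page: spread its weight over all pages (assumption).
                for d in pages:
                    new_rank[d] += (1 - eps) * rank[c] / total
        rank = new_rank
    return rank
```

On the three-page graph `{"a": ["b", "c"], "b": ["c"], "c": ["a"]}`, page c receives links from both other pages and ends up with the largest value.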

Sigir’999

HITS (Kleinberg’98)

Hubs: pages that point to many good pages

Authorities: pages pointed to by many good pages

Operates over a vicinity graph: pages relevant to a query

Refined by the IBM Clever group: further contextualization
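The mutual definition of hubs and authorities lends itself to an alternating update over the vicinity graph. The following is a minimal sketch of that iteration with L2 normalization after each step; it omits how the vicinity graph is built and any of the later Clever refinements.

```python
import math

def _normalize(scores):
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}

def hits(out_links, iterations=50):
    """out_links: dict mapping a page to the pages it links to.

    Returns (hub_scores, authority_scores).
    """
    pages = set(out_links) | {d for ts in out_links.values() for d in ts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            for dst in targets:
                auth[dst] += hub[src]
        auth = _normalize(auth)
        # Hub: sum of authority scores of the pages it points to.
        hub = _normalize(
            {p: sum(auth[d] for d in out_links.get(p, ())) for p in pages}
        )
    return hub, auth
```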

Sigir’9910

Hyperlink Vector Voting (Li’97)

Index documents by in-link anchor texts: follow links backward

Can be both precision and recall enhancing: the “evil empire”

How to combine with standard ranking? Relative weight is a tuning issue
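Indexing a page under the anchor text of links pointing to it might look like the sketch below, together with a linear blend for the "relative weight" tuning issue. The data layout, the linear combination, and the default weight of 0.5 are assumptions for illustration only.

```python
from collections import defaultdict

def build_anchor_index(links):
    """links: iterable of (source_url, target_url, anchor_text).

    Returns term -> {target_url: number of in-link anchors containing the term}.
    """
    index = defaultdict(lambda: defaultdict(int))
    for _src, dst, anchor in links:
        # Each distinct term in an anchor casts one vote for the link target.
        for term in set(anchor.lower().split()):
            index[term][dst] += 1
    return index

def combined_score(content_score, anchor_votes, anchor_weight=0.5):
    """Blend a standard ranking score with anchor votes; the weight is a tuning knob."""
    return content_score + anchor_weight * anchor_votes
```

A target page can be retrieved for a term that never appears in its own text, which is how anchor indexing can raise recall as well as precision.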

Sigir’9911

Evaluation

No industry-standard benchmark: evaluations are qualitative

Excessive claims abound

Press is not discerning

Shifting target: indices change daily

Cross-engine comparison elusive

Sigir’9912

Complexity Analysis

Search is both CPU and I/O intensive

I/O to access postings: random access

CPU to compute scores

Caching strategies are very effective: term cache has 40% hit rate

Expensive queries are long and loaded with rare terms
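A term cache that avoids repeated postings I/O could be sketched as a small least-recently-used map that also tracks its hit rate. The LRU eviction policy, the `TermCache` name, and the `fetch` callback standing in for a disk read are assumptions; the slides only report the hit rate.

```python
from collections import OrderedDict

class TermCache:
    """LRU cache of postings lists keyed by term."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch            # called on a miss, e.g. a disk read
        self.entries = OrderedDict()
        self.hits = 0
        self.lookups = 0

    def postings(self, term):
        self.lookups += 1
        if term in self.entries:
            self.hits += 1
            self.entries.move_to_end(term)    # mark as most recently used
            return self.entries[term]
        value = self.fetch(term)
        self.entries[term] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0
```

Queries made of rare terms tend to miss a cache like this, so each of their terms pays for a postings read.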

Sigir’9913

Performance versus Size

[Figure: plot with axes labeled Index Size and Time]

Sigir’9914

Complexity Analysis

CPU costs asymptotically constant: due to term truncation

I/O cost can be kept to one I/O per term: again due to truncation

Implies the bigger the better: no advantage to distributed search
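The slides do not say how truncation is done. One reading, assumed here, is that each term's postings list is cut to its highest-weighted entries, so the work per term is bounded no matter how large the index grows. The sketch below counts the postings touched to make that bound visible; a real system would presumably truncate at index-build time rather than per query.

```python
import heapq
from collections import defaultdict

def truncate_postings(postings, limit):
    """Keep only the `limit` highest-weighted (doc_id, weight) entries."""
    return heapq.nlargest(limit, postings, key=lambda entry: entry[1])

def score_query(terms, index, limit):
    """Score documents using truncated postings.

    Returns (scores, postings_touched); the second value never exceeds
    len(terms) * limit, whatever the size of the index.
    """
    scores = defaultdict(float)
    touched = 0
    for term in terms:
        for doc_id, weight in truncate_postings(index.get(term, []), limit):
            scores[doc_id] += weight
            touched += 1
    return dict(scores), touched
```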

Sigir’9915

The Economics of Big Indices

Very large indices require distributed search: easy scalability; maintenance

Practical hardware limitations

Implies Cost = Size * Throughput, since each half of a big index requires the same hardware to sustain the same throughput

Worse: queries needing a big index are hard to monetize
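The Cost = Size * Throughput relation can be made concrete with a small capacity calculation. The sketch assumes every query is sent to every partition of the index and that each partition is replicated enough to carry the full query load; the per-machine capacities in the usage note are invented numbers.

```python
import math

def machines_needed(index_pages, queries_per_sec, pages_per_machine, qps_per_machine):
    """Machines for a partitioned, replicated index.

    Partitions grow with index size; replicas of each partition grow with
    throughput, so the product grows with size * throughput.
    """
    partitions = math.ceil(index_pages / pages_per_machine)
    replicas = math.ceil(queries_per_sec / qps_per_machine)
    return partitions * replicas
```

With hypothetical capacities of 50 million pages and 50 queries per second per machine, a 100-million-page index at 200 queries per second needs 2 * 4 = 8 machines; doubling either the index size or the query rate doubles that to 16.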

Sigir’9916

How to Have your Cake...

Layered search

Small, high-quality engine for common queries: low cost per query, high revenue per query

Large, low-throughput engine for rare queries: high cost per query, low revenue per query

Average query costs can be kept low while still offering comprehensiveness
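Layering can be sketched as a router that tries the small engine first and falls back to the large one, plus the resulting average cost per query. The fallback rule (too few results from the small engine), the function names, and the cost figures in the usage note are assumptions, not details from the slides.

```python
def layered_search(query, small_engine, large_engine, min_results=10):
    """Answer from the small engine when it returns enough results."""
    results = small_engine(query)
    if len(results) >= min_results:
        return results, "small"
    return large_engine(query), "large"

def average_cost(fraction_small, cost_small, cost_large):
    """Expected cost per query; fallback queries pay for both tiers."""
    return cost_small + (1 - fraction_small) * cost_large
```

If, hypothetically, 90% of queries are answered by a small engine costing 1 unit and the rest fall through to a large engine costing 20 units, the average is about 3 units per query, well below the large engine's cost alone.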

Novel Search Engines

Ask Jeeves: question answering

Directory for the Hidden Web

Direct Hit: direct popularity

Click stream mining

Summary

Search engines are surprisingly effective, given short queries

Precision-enhancing techniques are critical

Centralized search is maximally efficient, but one can achieve a big index through layering
