Page 1:

Quality of a search engine

Paolo Ferragina

Dipartimento di Informatica

Università di Pisa

Reading 8

Page 2:

Is it good?

How fast does it index?
  Number of documents/hour (average document size)

How fast does it search?
  Latency as a function of index size

Expressiveness of the query language

Page 3:

Measures for a search engine

All of the preceding criteria are measurable

The key measure: user happiness
  Useless answers won’t make a user happy

Page 4:

Happiness: elusive to measure

The most common approach is to measure the relevance of search results. How do we measure it?

Requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair

Page 5:

Evaluating an IR system

Standard benchmarks
  TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years
  Other doc collections: human experts mark each query-doc pair as Relevant or Irrelevant

On the Web everything is more complicated, since we cannot mark the entire corpus!

Page 6:

General scenario

[Figure: the Relevant and Retrieved sets of documents within the collection]

Page 7:

Precision vs. Recall

Precision: % of retrieved docs that are relevant [issue: “junk” found]

Recall: % of relevant docs that are retrieved [issue: “info” found]

[Figure: the Relevant and Retrieved sets of documents within the collection]

Page 8:

How to compute them

Precision: fraction of retrieved docs that are relevant
Recall: fraction of relevant docs that are retrieved

                 Relevant               Not Relevant
Retrieved        tp (true positive)     fp (false positive)
Not Retrieved    fn (false negative)    tn (true negative)

Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)
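A minimal sketch of the two formulas for a single query, assuming the retrieved and relevant documents are given as sets of doc ids. The function name `precision_recall` and the ids in the example are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two collections of doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # relevant docs that were retrieved
    fp = len(retrieved - relevant)  # retrieved docs that are not relevant ("junk")
    fn = len(relevant - retrieved)  # relevant docs that were missed
    # Convention assumed here: return 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 docs retrieved, 3 docs relevant, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d5"})
print(p, r)
```

Here tp = 2, fp = 2 and fn = 1, so P = 2/4 and R = 2/3. Note that tn never enters either formula.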

Page 9:

Some considerations

Can get high recall (but low precision) by retrieving all docs for all queries!

Recall is a non-decreasing function of the number of docs retrieved

Precision usually decreases as more docs are retrieved
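These considerations can be checked on a ranked result list by computing precision and recall after each cutoff k. The sketch below assumes a non-empty set of relevant docs; the function name `precision_recall_at_k`, the ranking and the doc ids are hypothetical.

```python
def precision_recall_at_k(ranking, relevant):
    """Return a list of (k, precision, recall) for every cutoff k of the ranking."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without relevant docs")
    hits = 0
    points = []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1  # one more relevant doc among the top k
        points.append((k, hits / k, hits / len(relevant)))
    return points


# Hypothetical ranking of 5 docs, 3 of which are relevant.
points = precision_recall_at_k(["d3", "d7", "d1", "d9", "d4"], {"d3", "d1", "d4"})
for k, prec, rec in points:
    print(k, round(prec, 2), round(rec, 2))
```

In this example recall never goes down as k grows, while precision drops every time a non-relevant doc enters the list and rises slightly when a relevant one does, which is why the slide says "usually".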

Page 10:

Precision-Recall curve

We measure Precision at various levels of Recall
Note: it is an AVERAGE over many queries

[Figure: precision (y-axis) plotted against recall (x-axis), four points marked with x]

Page 11:

A common picture

[Figure: precision (y-axis) plotted against recall (x-axis), four points marked with x]

Page 12:

F measure

Combined measure (weighted harmonic mean):

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$$

People usually use the balanced F1 measure, i.e., with α = ½, thus 1/F = ½ (1/P + 1/R)

Use this if you need to optimize a single measure that balances precision and recall.
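A small sketch of the weighted harmonic mean, written as 1/F = α/P + (1 − α)/R so that α = ½ gives the balanced F1 of the slide. The function name `f_measure` and the zero-handling convention are assumptions made for this example.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall: 1/F = alpha/P + (1 - alpha)/R."""
    if precision == 0 or recall == 0:
        return 0.0  # assumed convention: F is 0 when either measure is 0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)


# Balanced F1 for P = 0.5 and R = 2/3; this equals 2PR / (P + R).
f1 = f_measure(0.5, 2 / 3)
print(f1)
```

With alpha closer to 1 the measure weighs precision more heavily; with alpha closer to 0 it weighs recall more heavily.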

Page 13:

Recommendation systems

Paolo Ferragina

Dipartimento di Informatica

Università di Pisa

Page 14:

Recommendations

We have a list of restaurants, with positive and negative ratings for some

Which restaurant(s) should I recommend to Dave?

[Table: one row per person, one column per restaurant; each person rated only some restaurants, and only the row order of the ratings survives]

Restaurants: Brahma Bull, Spaghetti House, Mango, Il Fornaio, Zao, Ming's, Ramona's, Straits, Homma's

Alice: Yes, No, Yes, No
Bob: Yes, No, No
Cindy: Yes, No, No
Dave: No, No, Yes, Yes, Yes
Estie: No, Yes, Yes, Yes
Fred: No, No

Page 15:

Basic Algorithm

Recommend the most popular restaurants, say by # positive votes minus # negative votes

What if Dave does not like Spaghetti?

[Table: one row per person, one column per restaurant; each person rated only some restaurants, and only the row order of the ratings survives]

Restaurants: Brahma Bull, Spaghetti House, Mango, Il Fornaio, Zao, Ming's, Ramona's, Straits, Homma's

Alice: 1, -1, 1, -1
Bob: 1, -1, -1
Cindy: 1, -1, -1
Dave: -1, -1, 1, 1, 1
Estie: -1, 1, 1, 1
Fred: -1, -1
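A minimal sketch of the popularity baseline: score each restaurant by positive votes minus negative votes and rank by that score. The users, restaurants and votes below are hypothetical (they are not the slide's table), and the function names are chosen for this example only.

```python
from collections import defaultdict

# Hypothetical votes, not the slide's table: +1 = positive, -1 = negative.
ratings = {
    "u1": {"A": 1, "B": -1},
    "u2": {"A": 1, "C": -1},
    "u3": {"A": -1, "B": 1, "C": 1},
}


def popularity(ratings):
    """Sum the +1/-1 votes per restaurant (# positive minus # negative)."""
    scores = defaultdict(int)
    for votes in ratings.values():
        for restaurant, vote in votes.items():
            scores[restaurant] += vote
    return dict(scores)


def most_popular(ratings, exclude=()):
    """Rank restaurants by score, highest first; ties are broken by name."""
    ranked = sorted(popularity(ratings).items(), key=lambda item: (-item[1], item[0]))
    return [(name, score) for name, score in ranked if name not in exclude]


print(most_popular(ratings))
```

The ranking is the same for every user, which is exactly the weakness the slide's question points at: the baseline ignores the individual tastes of the person receiving the recommendation.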

Page 16:

Smart Algorithm

Basic idea: find the person “most similar” to Dave according to cosine-similarity (i.e. Estie), and then recommend something this person likes.

Perhaps recommend Straits Cafe to Dave

[Table: one row per person, one column per restaurant; each person rated only some restaurants, and only the row order of the ratings survives]

Restaurants: Brahma Bull, Spaghetti House, Mango, Il Fornaio, Zao, Ming's, Ramona's, Straits, Homma's

Alice: 1, -1, 1, -1
Bob: 1, -1, -1
Cindy: 1, -1, -1
Dave: -1, -1, 1, 1, 1
Estie: -1, 1, 1, 1
Fred: -1, -1

Do you want to rely on one person’s opinions?
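The nearest-neighbour idea can be sketched as follows, treating each person as a vector of +1/-1 votes with 0 for unrated restaurants. The data and the names `cosine` and `recommend` are hypothetical, and taking only the single most similar user is the simplification the slide itself questions.

```python
import math

# Hypothetical votes, not the slide's table: +1 = likes, -1 = dislikes, missing = unrated.
ratings = {
    "target": {"A": -1, "B": -1, "C": 1, "D": 1},
    "u1": {"A": -1, "C": 1, "D": 1, "E": 1},
    "u2": {"A": 1, "C": -1, "F": 1},
}


def cosine(u, v):
    """Cosine similarity of two sparse vote vectors (unrated items count as 0)."""
    dot = sum(u[item] * v[item] for item in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(target, ratings):
    """Return the most similar other user and the items they like that target has not rated."""
    others = [name for name in ratings if name != target]
    nearest = max(others, key=lambda name: cosine(ratings[target], ratings[name]))
    seen = ratings[target]
    picks = sorted(item for item, vote in ratings[nearest].items() if vote > 0 and item not in seen)
    return nearest, picks


print(recommend("target", ratings))
```

A less fragile variant would combine the votes of several similar users, weighted by their similarity, rather than relying on one person's opinions.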

Page 17:

Main idea

[Figure: users U, V, W, Y and documents d1 to d7]

What do we suggest to U?

Page 18:

A glimpse of XML retrieval (eXtensible Markup Language)

Paolo Ferragina

Dipartimento di Informatica

Università di Pisa

Reading 10

Page 19:

XML vs HTML

HTML is a markup language for a specific purpose (display in browsers)

XML is a framework for defining markup languages

HTML has a fixed set of markup tags; XML does not

HTML can be formalized as an XML language (XHTML)

Page 20:

XML Example (visual)

Page 21:

XML Example (textual)

<chapter id="cmds">
  <chaptitle> FileCab </chaptitle>
  <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
</chapter>

Page 22:

Basic Structure

An XML doc is an ordered, labeled tree

character data: leaf nodes contain the actual data (text strings)

element nodes: each is labeled with a name (often called the element type) and a set of attributes, each consisting of a name and a value; element nodes can have child nodes
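To make the tree view concrete, the chapter example from the previous slide can be parsed with Python's standard `xml.etree.ElementTree` module. This is only a sketch: the helper `describe` is hypothetical, and ElementTree stores character data in the `text` and `tail` fields of elements rather than as separate leaf nodes.

```python
import xml.etree.ElementTree as ET

# The chapter example, with whitespace between elements removed.
xml_text = (
    '<chapter id="cmds">'
    "<chaptitle>FileCab</chaptitle>"
    "<para>This chapter describes the commands that manage the "
    "<tm>FileCab</tm>inet application.</para>"
    "</chapter>"
)


def describe(element, depth=0):
    """Return one line per element node or text fragment, in document order."""
    indent = "  " * depth
    lines = [f"{indent}element {element.tag} attributes={element.attrib}"]
    if element.text and element.text.strip():
        lines.append(f"{indent}  text {element.text.strip()!r}")
    for child in element:  # children are kept in document order
        lines.extend(describe(child, depth + 1))
        if child.tail and child.tail.strip():
            lines.append(f"{indent}  text {child.tail.strip()!r}")
    return lines


root = ET.fromstring(xml_text)
print("\n".join(describe(root)))
```

The root element is labeled `chapter` and carries one attribute (`id` with value `cmds`); its ordered children are `chaptitle` and `para`, and `para` mixes character data with a nested `tm` element.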

Page 23:

XML: Design Goals

Separate syntax from semantics to provide a framework for structuring information

Allow tailor-made markup for any imaginable application domain

Support internationalization (Unicode) and platform independence

Be the standard of (semi)structured information (do some of the work now done by databases)

Page 24:

Why Use XML?

Represent semi-structured data

XML is more flexible than DBs

XML is more structured than simple IR

You get a massive infrastructure for free

Page 25:

Data vs. Text-centric XML

Data-centric XML: used for messaging between enterprise applications
  Mainly a recasting of relational data

Text-centric XML: used for annotating content
  Rich in text
  Demands good integration of text retrieval functionality
  E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

Page 26:

IR Challenges in XML

There is no document unit in XML
How do we compute tf and idf?
Indexing granularity
  Need to go to the document for retrieving or displaying a fragment
  E.g., give me the Abstracts of Papers on existentialism

Need to identify similar elements in different schemas
  Example: employee

Page 27:

XQuery: SQL for XML?

Simple attribute/value
  /play/title contains “hamlet”

Path queries
  title contains “hamlet”
  /play//title contains “hamlet”

Complex graphs
  Employees with two managers

What about relevance ranking?
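The difference between the child path `/play/title` and the descendant path `/play//title` can be illustrated with the limited XPath subset of Python's `xml.etree.ElementTree`. This is not XQuery, only a stand-in for the two path shapes; the sample play and the helper `contains` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, made up for this example.
xml_text = """
<play>
  <title>Hamlet</title>
  <act>
    <scene><title>Hamlet meets the ghost</title></scene>
    <scene><title>A room in the castle</title></scene>
  </act>
</play>
""".strip()


def contains(element, word):
    """Case-insensitive substring test on all the text below an element."""
    return word.lower() in "".join(element.itertext()).lower()


root = ET.fromstring(xml_text)  # root is the <play> element

# /play/title: title elements that are direct children of play
child_titles = [t.text for t in root.findall("title") if contains(t, "hamlet")]

# /play//title: title elements at any depth below play
descendant_titles = [t.text for t in root.findall(".//title") if contains(t, "hamlet")]

print(child_titles)
print(descendant_titles)
```

Both queries return an unordered match set; neither says anything about which matching element is more relevant, which is the open question raised at the end of the slide.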

Page 28:

Data structures for XML retrieval

Inverted index: give me all elements matching text query Q
  We know how to do this: treat each element as a document

Give me all elements below any instance of the Book element
  (Parent/child relationship is not enough)

Page 29:

Positional containment

Doc: 1
Play: 27 1122 2033 5790
Verse: 431 867
Term droppeth: 720

droppeth under Verse under Play.

Containment can be viewed as merging postings.
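A sketch of the containment check on the slide's numbers, assuming each element posting is a list of (start, end) position pairs within Doc 1, so Play spans 27 to 1122 and 2033 to 5790 and Verse spans 431 to 867. That reading of the postings, and all function names, are assumptions. The lookup here uses binary search over sorted, non-overlapping spans instead of a linear merge of the lists.

```python
from bisect import bisect_right

# Postings for Doc 1, read as (start, end) spans; this reading is an assumption.
play_spans = [(27, 1122), (2033, 5790)]
verse_spans = [(431, 867)]
droppeth_positions = [720]


def enclosing_span(spans, lo, hi):
    """Return the span that contains [lo, hi], or None (spans sorted, non-overlapping)."""
    starts = [start for start, _ in spans]
    i = bisect_right(starts, lo) - 1  # last span starting at or before lo
    if i >= 0 and spans[i][1] >= hi:
        return spans[i]
    return None


def term_under_verse_under_play(term_positions, verse_spans, play_spans):
    """Find term occurrences that lie inside a Verse that lies inside a Play."""
    matches = []
    for pos in term_positions:
        verse = enclosing_span(verse_spans, pos, pos)
        if verse is None:
            continue
        play = enclosing_span(play_spans, verse[0], verse[1])
        if play is not None:
            matches.append((pos, verse, play))
    return matches


matches = term_under_verse_under_play(droppeth_positions, verse_spans, play_spans)
print(matches)
```

Position 720 falls inside the Verse span 431 to 867, which in turn falls inside the Play span 27 to 1122, so the query "droppeth under Verse under Play" matches in Doc 1.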

Page 30:

Summary of data structures

Path containment etc. can essentially be solved by positional inverted indexes

Retrieval consists of “merging” postings

All the compression tricks are still applicable

Complications arise from insertion/deletion of elements and of text within elements
  Beyond the scope of this course

Page 31:

Search Engines

Advertising

Page 32:

Classic approach…

Socio-demo
Geographic
Contextual

Page 35:

Search Engines vs Advertisement

First generation -- use only on-page, web-text data
  Word frequency and language

Second generation -- use off-page, web-graph data
  Link (or connectivity) analysis
  Anchor-text (how people refer to a page)

Third generation -- answer “the need behind the query”
  Focus on “user need”, rather than on query
  Integrate multiple data-sources
  Click-through data

Pure search vs Paid search
  Ads shown on search (who pays more): Goto/Overture
  2003: Google/Yahoo, new model
  All players now have: SE, Adv platform + network

Page 36:

The new scenario

SEs make possible
  aggregation of interests
  unlimited selection (Amazon, Netflix, ...)

Incentives for specialized niche players

The biggest money is in the smallest sales!

Page 37:

Two new approaches

Sponsored search (AdWords): Ads driven by search keywords (and the profile of the user issuing them)

Page 39:

Two new approaches

Sponsored search (AdWords): Ads driven by search keywords (and the profile of the user issuing them)

Context match (AdSense): Ads driven by the content of a web page (and the profile of the user reaching that page)

Page 41:

How does it work?

1) Match Ads to query or page content
2) Order the Ads
3) Pricing on a click-through

IR

Econ

Page 42:

Visited Pages

Clicked Banner

Web Searches

Clicks on Search Results

Web usage data !!!

Page 43:

Dictionary problem

Page 44:

A new game

For advertisers:
  What words to buy, how much to pay
  SPAM is an economic activity

For search engine owners:
  How to price the words
  Find the right Ad
  Keyword suggestion, geo-coding, business control, language restriction, proper Ad display

Similar to web searching, but: the Ad-DB is smaller, Ad-items are small pages, ranking depends on clicks