Beauty of IR Venkatesh Vinayakarao An IR enthusiast!

description

Information Retrieval is about how we can search and retrieve things. In this talk, we look at the various components that make up a typical search engine and discuss the associated challenges.

Transcript of Beauty ofir

Page 1: Beauty ofir

Beauty of IR

Venkatesh Vinayakarao

An IR enthusiast!

Page 2: Beauty ofir

Disclaimer

Most examples and discussions in this talk revolve around well-known search engines. This is only to make the learning experience concrete. Please keep in mind that IR goes beyond search engines.

25+ slides of interesting discussion ahead…

2/2014

Page 3: Beauty ofir

Quiz

1. Explain any two challenges in Query Intent Understanding using examples, and discuss why it is a hard problem.

2. How are “Tiles”, as discussed in the class, used in search engines? What purpose do they serve?

3. Search engines have no UI-related design concerns. True or false?

Page 4: Beauty ofir

About Me

BE Computer Science (Y2K)

MS (IT)

IT Service Industry

Start Up

Nokia

Yahoo

Microsoft (Bing)

PhD

Let me learn everything all over again!

Page 5: Beauty ofir

Our Agenda: The Beauty of IR!

Crawling, Content Processing, Indexing (Offline Horror!)

Query (Intent) Understanding, Ranking, User Interface (Online Terror!)

Me!

How to process Korean queries for local listings?

Page 6: Beauty ofir

Crawling

How frequently should we crawl? Fresh & Super-Fresh!

How to crawl cricket scores? Are we even crawling here?

How to avoid 404 - Page not found?

How much time did it take Google to show your first personal page?

Page 7: Beauty ofir

Content Processing

Good Read: https://getlisted.org/static/resources/local-search-data-providers.html

Page 8: Beauty ofir

Content Processing

Query: “Schools in Delhi” Answer: “Delhi Public School” Good or Bad?

Query: “Schools in Hyderabad” Answer: “Delhi Public School” Good or Bad?

Query: “Hotels in Bombay” Answer: “Grand Hyatt, Mumbai” Good or Bad? How do we get the same results for both Mumbai and Bombay?

Query: “Maruti Car service in delhi” Answer: “Rana Motors Private Limited”. What happened?

Page 9: Beauty ofir

Content Processing & Indexing

A real example: http://www.yelp.com/dataset_challenge/

Enriched Business
• Category synonyms (e.g., auto service & car service are interchangeable at times)
• Users’ query forms (e.g., McDonalds is commonly queried as McD)
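The synonym and alias examples on these slides (Mumbai/Bombay, auto service/car service, McD/McDonalds) can be sketched as a small normalization step applied to both queries and business records before matching. This is only an illustrative sketch: the `ALIASES` table and the `normalize` function are hypothetical names, and a real system would derive such mappings from data rather than hard-code them.

```python
import re

# Hypothetical alias table (an assumption for illustration only):
# each alias maps to one canonical form.
ALIASES = {
    "bombay": "mumbai",
    "car service": "auto service",
    "mcd": "mcdonalds",
}


def normalize(text):
    """Lowercase, collapse whitespace, and rewrite known aliases."""
    normalized = " ".join(text.lower().split())
    for alias, canonical in ALIASES.items():
        # Word boundaries keep "mcd" from matching inside "mcdonalds".
        pattern = r"\b" + re.escape(alias) + r"\b"
        normalized = re.sub(pattern, canonical, normalized)
    return normalized
```

With both sides normalized this way, “Hotels in Bombay” and “Hotels in Mumbai” reduce to the same string, so they can hit the same index entries.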

Page 10: Beauty ofir

Derived Values & Indexing

Given a location, how will you find all businesses within a 1 km radius?

Query: schools near govindpuri delhi
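One possible answer to the radius question is to store a derived value, a grid tile id, with every business at indexing time, and at query time look only at the tiles that overlap the search circle before checking exact distances. The sketch below assumes a simple fixed-size latitude/longitude grid; `CELL_DEG`, `tile_of`, `build_index`, `within_radius`, and the sample coordinates are all hypothetical and are not the tile scheme discussed in the talk. The bounding box is an approximation that is reasonable for small radii away from the poles.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0
CELL_DEG = 0.01  # assumed tile size in degrees (roughly 1 km of latitude)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def tile_of(lat, lon):
    """Derived value stored with each business: its grid tile id."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def build_index(businesses):
    """Map tile id -> list of (name, lat, lon) records in that tile."""
    index = defaultdict(list)
    for name, lat, lon in businesses:
        index[tile_of(lat, lon)].append((name, lat, lon))
    return index


def within_radius(index, lat, lon, radius_km=1.0):
    """Names of businesses within radius_km of (lat, lon), sorted."""
    # Approximate bounding box of the search circle, in degrees.
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    dlon = dlat / math.cos(math.radians(lat))
    lo_row, lo_col = tile_of(lat - dlat, lon - dlon)
    hi_row, hi_col = tile_of(lat + dlat, lon + dlon)
    hits = []
    for row in range(lo_row, hi_row + 1):
        for col in range(lo_col, hi_col + 1):
            for name, blat, blon in index.get((row, col), []):
                # Exact distance check on the few candidates that remain.
                if haversine_km(lat, lon, blat, blon) <= radius_km:
                    hits.append(name)
    return sorted(hits)


# Made-up sample points, not real school locations.
BUSINESSES = [
    ("School A", 28.5400, 77.2650),
    ("School B", 28.5350, 77.2800),
    ("School C", 28.5350, 77.2745),
    ("School D", 28.6000, 77.2000),
]
INDEX = build_index(BUSINESSES)
NEARBY = within_radius(INDEX, 28.5350, 77.2650, radius_km=1.0)
```

The design choice here is to trade a little index space for query speed: only businesses in a handful of tiles are ever compared against the query location, instead of scanning every listing.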

Page 11: Beauty ofir

Query Understanding Challenge

Need a team of 3 people and one laptop.

Volunteers?

Page 12: Beauty ofir

Rules

I will give an entity name. You will have to frame at least three different (dissimilar) queries (and as many as you can) that give the same document as the correct result in first place.

At the end, you should submit: the query, and the maximum number of top-n correct results that you managed to keep the same. You will have 5 minutes.

Page 13: Beauty ofir

Questions

Tom Cruise, Aishwarya Rai, Tom Hanks, Srikanta Bedathur, Venkatesh Vinayakarao, Pankaj Jalote, Amir Khan, Andre Agassi, Manmohan Singh

Page 14: Beauty ofir

Query Understanding

Query: Michael Jordan

Which MJ to return? The basketball player or the actor?

Factors: user profile, query context (session details, browser data, links, etc.), …

Query: Delhi School

What does the user want? “Delhi Public School” or “Schools in Delhi” or “some Indian school in US”?

Query: “IR”

Predict the top three results.

Page 15: Beauty ofir

Ok! I give up!!

A frustrated search user: “please show me some t-shirt brands”

Page 16: Beauty ofir

More fun with auto completion

Page 17: Beauty ofir

System Overview (Simplified)

[Diagram. Components shown: User Query; Front-end (four instances); Query Understanding, Query Classifiers; Expanded Query; Web Answer, Local Answer, Finance Answer, Tech Answer & many more; KB; Index Serve; Crawled Content; Crawler; Web]

Page 18: Beauty ofir

Ranking & Relevance

How do we know if the document is relevant (in web search context)?

• Popularity of URL
• Domain score (is it ac.in or .edu?)
• TF, IDF
• Entity, Chain entity?
• Trust factor (Wikipedia?)
• Inlinks/Outlinks
• Position of query terms
• Sequence of query terms
• … and 1000s of such things
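Of the signals listed above, TF and IDF are the easiest to show in a few lines. The sketch below scores documents against a query with a plain TF × log(N/df) weighting; the function name `tf_idf_scores` and the three toy documents are assumptions for illustration, and production rankers combine this with many other features and usually use smoothed or normalized variants.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score each document as the sum of tf * log(N / df) over query terms."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term]:  # skip terms that occur in no document
                score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores


# Toy corpus (made up for illustration).
DOCS = ["delhi public school", "schools in delhi", "hotels in mumbai"]
SCORES = tf_idf_scores("delhi school", DOCS)
```

Note that without stemming, “school” and “schools” are different terms, so the second document only matches on “delhi”. This is one small reason why TF and IDF alone are not enough.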

Page 19: Beauty ofir

Are current search engines good at relevance & ranking?

Bing and Google

Query1: Vegetarian hotels in south delhi

Query2: South Indian hotels in south delhi

Page 20: Beauty ofir

…More examples

Query3: South Indian restaurants in south delhi

What’s the difference between query2 and query3? Should search engines give different results?

Page 21: Beauty ofir

How far for a coffee?

Google: Just one word (iiitd) missing. So what?

Let’s change the query to “coffee shops near iiitd delhi”.

“Coffee shops near me” gives results from Janakpuri, Gurgaon, CP & Kamla Nagar.

Page 22: Beauty ofir

Why is it hard?

What makes Ranking & Relevance hard?

Page 23: Beauty ofir

User Interface

Is UI important for a search engine?
• Maps in local results
• Live sport score cards
• Finance tickers
• Filters
• Search operators
• Entity infoboxes

What impact do these have?

Page 24: Beauty ofir

Our Agenda: The Beauty of IR!

Crawling, Content Processing, Indexing (Offline Horror!)

Query (Intent) Understanding, Ranking, User Interface (Online Terror!)

Me!

How to process Korean queries for local listings?

Page 25: Beauty ofir

Evaluation

Various evaluation methods
• Precision/Recall
• Mean Average Precision (MAP)
• Mean Reciprocal Rank (MRR): if the first relevant doc is at the kth position, RR = 1/k.
• NDCG: for non-Boolean (graded) relevance scores

DCG = r_1 + r_2/log_2(2) + r_3/log_2(3) + … + r_n/log_2(n)
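The set-based and rank-based measures above can be sketched in a few lines. This is a minimal illustration with hypothetical function names (`precision_recall`, `reciprocal_rank`, `mean_reciprocal_rank`); it assumes each query comes with a ranked result list and a set of relevant document ids.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a result list against a set of relevant docs."""
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def reciprocal_rank(retrieved, relevant):
    """RR = 1/k, where k is the position of the first relevant document."""
    relevant = set(relevant)
    for k, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / k
    return 0.0  # no relevant document was retrieved


def mean_reciprocal_rank(runs):
    """Average RR over (retrieved, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
```

For example, if the only relevant document retrieved sits at position 3, the reciprocal rank for that query is 1/3.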

Page 26: Beauty ofir

NDCG - Example

i   Ground Truth          Ranking Function 1    Ranking Function 2
    Document order  r_i   Document order  r_i   Document order  r_i
1   d4              2     d3              2     d3              2
2   d3              2     d4              2     d2              1
3   d2              1     d2              1     d4              2
4   d1              0     d1              0     d1              0

NDCG_GT = 1.00   NDCG_RF1 = 1.00   NDCG_RF2 = 0.9203

4 documents: d1, d2, d3, d4

Taken from http://www.stanford.edu/class/cs276/handouts/EvaluationNew.ppt
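The NDCG values in this example can be reproduced with a short script. The sketch below uses the DCG definition from the Evaluation slide (no discount at rank 1, division by log2 of the rank afterwards) and normalizes by the DCG of the ideal ordering; the names `dcg`, `ndcg`, and `GAINS` are assumptions for illustration.

```python
import math


def dcg(gains):
    """DCG = r_1 + r_2/log2(2) + r_3/log2(3) + ... + r_n/log2(n)."""
    return sum(
        gain if rank == 1 else gain / math.log2(rank)
        for rank, gain in enumerate(gains, start=1)
    )


def ndcg(gains):
    """DCG divided by the DCG of the same gains sorted in ideal order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Graded relevance r_i in ranked order, read off the table above.
GAINS = {
    "Ground Truth": [2, 2, 1, 0],        # d4, d3, d2, d1
    "Ranking Function 1": [2, 2, 1, 0],  # d3, d4, d2, d1
    "Ranking Function 2": [2, 1, 2, 0],  # d3, d2, d4, d1
}
NDCG = {name: round(ndcg(gains), 4) for name, gains in GAINS.items()}
```

Ranking Function 1 swaps d3 and d4, which have equal relevance, so its NDCG stays at 1.0; Ranking Function 2 places the less relevant d2 above d4, which lowers the score to 0.9203.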

Page 27: Beauty ofir

Are we done?

Q & A
