Set Retrieval 2.0

© 2008 Endeca Technologies, Inc. All rights reserved. Set Retrieval 2.0 Daniel Tunkelang Chief Scientist, Endeca

description

This presentation outlines the principles of information seeking as a dialogue and walks through concrete examples that illustrate the principles of human-computer information retrieval (HCIR). The foundation is an interactive set retrieval approach that responds to queries with an overview of the user's current context and an organized set of options for incremental exploration. Contextual summaries of document sets optimize the system's communication with the user, while query refinement options optimize the user's communication with the system. By enabling bidirectional communication between the user and the system, we can address the inherent limitations of best-match approaches.

Transcript of Set Retrieval 2.0

Page 1: Set Retrieval 2.0

© 2008 Endeca Technologies, Inc. All rights reserved.

Set Retrieval 2.0

Daniel Tunkelang, Chief Scientist, Endeca

Page 2: Set Retrieval 2.0

howdy!

• 1988 – 1992

• 1993 – 1998

• 1999 –

Page 3: Set Retrieval 2.0

overview

what’s right with search today?

what’s wrong with search today?

how do we fix it?

Page 4: Set Retrieval 2.0

let’s quickly review some history…

Page 5: Set Retrieval 2.0

1947: Hans Peter Luhn

Page 6: Set Retrieval 2.0

1968: Gerard Salton

Page 7: Set Retrieval 2.0

1972: Karen Spärck Jones

Page 8: Set Retrieval 2.0

1980s: lots of progress

Page 9: Set Retrieval 2.0

1990s – 2000s: WWW

Page 10: Set Retrieval 2.0

today

Page 11: Set Retrieval 2.0

so, do we all feel lucky?

Page 12: Set Retrieval 2.0

recession? what recession?

Page 13: Set Retrieval 2.0

ask the users…

Page 14: Set Retrieval 2.0

…though they do have complaints

78% wish search engines could read their minds

what frustrates users most?

– 25%: deluge of results

– 24%: too many paid listings

– 19%: inability to understand their keywords

– 19%: disorganized / random results

The State of Search, Autobytel & Kelton Research, Oct ’07

Page 15: Set Retrieval 2.0

web search vs. enterprise search

“Search on the internet is solved. I always find what I need.

But why not in the enterprise?

Seems like a solution waiting to happen.”

- a Fortune 500 CTO

Page 16: Set Retrieval 2.0

enterprise users really have complaints

Why is Joe the Knowledge Worker so upset?

– 49%: finding the information needed to do their job is difficult and time consuming

– 50%: findability within organization worse than on their own consumer-facing site

Market IQ Report on Findability, AIIM, June ’08

Page 17: Set Retrieval 2.0

selection bias?

Page 18: Set Retrieval 2.0

the library and information science critique

• models

– relevance is subjective

• evaluation

– neglects interactivity

• tools

– no support for exploration

Page 19: Set Retrieval 2.0

the rebuttal

"Tell us what to do, and we will do it."

Page 20: Set Retrieval 2.0

besides, search is 90% solved

Page 21: Set Retrieval 2.0

we need to call a truce

- real, effective systems

- that support interaction

- cost-effective to evaluate

Page 22: Set Retrieval 2.0

let’s go back to the 80s for a moment

Page 23: Set Retrieval 2.0

then vs. now

• known-item search was an open problem

– now it’s a commodity

• library and information science ideas of the 80s

– ahead of their time

• now we can find known items

– let’s tackle more ambitious information needs

Page 24: Set Retrieval 2.0

requirements

Page 25: Set Retrieval 2.0

transparency

Page 26: Set Retrieval 2.0

control

Page 27: Set Retrieval 2.0

guidance

Page 28: Set Retrieval 2.0

precision = fraction of retrieved documents that are relevant

recall = fraction of relevant documents that are retrieved

retrieved documents

relevant documents

set retrieval
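
The precision and recall definitions on this slide can be made concrete with a small sketch. The document IDs are made up for illustration; only the two formulas come from the slide.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 relevant, 2 in common.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here half of the retrieved documents are relevant, and two of the three relevant documents are retrieved.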

Page 29: Set Retrieval 2.0

recall

precision

the classic trade-off

Page 30: Set Retrieval 2.0

set retrieval: 2 out of 3

Page 31: Set Retrieval 2.0

set retrieval 2.0 = set retrieval + guidance

Did you mean: guidance

Related Searches: Guidance Counselor Salary, Guidance Counselor Job Description, Definition of Guidance, Guidance Counseling, History of Guidance Counseling, Child Guidance, Career Guidance, What Is the Meaning of Guidance, Free Marriage Counseling, Problems in Marriage, Career Exploration, Role of School Counselor

Page 32: Set Retrieval 2.0

guidance vs. mind reading

• system can’t read your mind

• spouse / best friend can’t read your mind

• sometimes you can’t read your own mind

Page 33: Set Retrieval 2.0

so where does guidance come from?

Page 34: Set Retrieval 2.0

it’s people!

Page 35: Set Retrieval 2.0

human-computer information retrieval

• don’t just guess the user’s intent

– optimize communication

• de-emphasize the top ten documents

– response is a set of documents

• think beyond single queries

– support refinement and exploration

Page 36: Set Retrieval 2.0

recall

precision

hcir cheats the trade-off

Page 37: Set Retrieval 2.0

but how do we get there?

Page 38: Set Retrieval 2.0

set retrieval 2.0

• set retrieval that responds to queries with

– overview of the user’s current context

– organized set of options for exploration

• contextual summaries of document sets

– optimize system’s communication with user

• query refinement options

– optimize user’s communication with system
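
A minimal sketch of this response shape, assuming a toy in-memory collection: a query returns the matching set together with a summary of that set (counts per facet value), and the same facet values double as refinement options for the next step. The records, field names, and the `search` helper are hypothetical and are not Endeca's API.

```python
from collections import Counter

# Hypothetical toy collection; fields and values are made up for illustration.
DOCS = [
    {"id": 1, "text": "countertop microwave oven", "brand": "Acme", "color": "white"},
    {"id": 2, "text": "over-the-range microwave", "brand": "Acme", "color": "black"},
    {"id": 3, "text": "compact microwave", "brand": "Zenith", "color": "white"},
    {"id": 4, "text": "ceiling fan with light", "brand": "Zenith", "color": "white"},
]
FACETS = ("brand", "color")


def search(keyword, filters=None):
    """Return the matching set plus a per-facet summary of that set."""
    filters = filters or {}
    matches = [
        d for d in DOCS
        if keyword in d["text"].split()
        and all(d[f] == v for f, v in filters.items())
    ]
    # Summary of the current context: how the set breaks down along each facet.
    summary = {f: Counter(d[f] for d in matches) for f in FACETS}
    return matches, summary


# First step: the system describes the set it found.
matches, summary = search("microwave")
print(len(matches), dict(summary["brand"]), dict(summary["color"]))

# Second step: the user picks one of the offered options to refine.
refined, refined_summary = search("microwave", {"color": "white"})
print([d["id"] for d in refined], dict(refined_summary["brand"]))
```

The summary is the system's side of the dialogue, and the chosen filter is the user's side.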

Page 39: Set Retrieval 2.0

faceted search guides refinement

Page 40: Set Retrieval 2.0

showing the right facets: microwaves

Page 41: Set Retrieval 2.0

showing the right facets: ceiling fans
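
One possible way to choose which facets to show for a given result set is sketched below. The selection rule (keep a facet only if most records carry it and it takes more than one value) is an assumption made for illustration, not the method used in the systems shown; the product attributes are also made up.

```python
from collections import Counter


def useful_facets(records, min_coverage=0.5):
    """Pick facets that cover enough of the records and can actually split them."""
    if not records:
        return {}
    chosen = {}
    fields = sorted({key for record in records for key in record})
    for field in fields:
        values = Counter(r[field] for r in records if field in r)
        coverage = sum(values.values()) / len(records)
        # A facet with a single value cannot narrow the set, so skip it.
        if coverage >= min_coverage and len(values) > 1:
            chosen[field] = values
    return chosen


# Hypothetical result sets with different attributes.
microwaves = [
    {"wattage": "700W", "capacity": "0.7 cu ft", "type": "microwave"},
    {"wattage": "1000W", "capacity": "1.1 cu ft", "type": "microwave"},
    {"wattage": "1000W", "capacity": "1.6 cu ft", "type": "microwave"},
]
ceiling_fans = [
    {"blade_span": "42 in", "light_kit": "yes", "type": "ceiling fan"},
    {"blade_span": "52 in", "light_kit": "no", "type": "ceiling fan"},
]

print(list(useful_facets(microwaves)))
print(list(useful_facets(ceiling_fans)))
```

Under this rule each result set gets its own facets, and the constant `type` field is dropped because it offers no way to refine.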

Page 42: Set Retrieval 2.0

query-driven clarification before refinement

Matching Categories include:

Appliances > Small Appliances > Irons & Steamers

Appliances > Small Appliances > Microwaves & Steamers

Bath > Sauna & Spas > Steamers

Kitchen > Bakeware & Cookware > Cookware > Open Stock Pots > Double Boilers & Steamers

Kitchen > Small Appliances > Steamers
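
A rough sketch of this kind of clarification: match the query against category names and offer the matching category paths before showing results. The five steamer paths are taken from the slide. The extra non-matching path, the assumed query "steamers", and the matching rule (case-insensitive substring match on the last category in each path) are assumptions for illustration.

```python
CATEGORY_PATHS = [
    ("Appliances", "Small Appliances", "Irons & Steamers"),
    ("Appliances", "Small Appliances", "Microwaves & Steamers"),
    ("Bath", "Sauna & Spas", "Steamers"),
    ("Kitchen", "Bakeware & Cookware", "Cookware", "Open Stock Pots",
     "Double Boilers & Steamers"),
    ("Kitchen", "Small Appliances", "Steamers"),
    ("Kitchen", "Small Appliances", "Toasters"),  # hypothetical non-matching path
]


def matching_categories(query, paths=CATEGORY_PATHS):
    """Return category paths whose last (most specific) category mentions the query."""
    q = query.lower()
    return [" > ".join(path) for path in paths if q in path[-1].lower()]


# Assumed query; the slide does not state what the user typed.
for path in matching_categories("steamers"):
    print(path)
```

Offering these paths lets the user say which sense of the query they meant before any refinement of results begins.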

Page 43: Set Retrieval 2.0

results-driven clarification before refinement

Search: storage

Page 44: Set Retrieval 2.0

taxonomies are so 1990s

Page 45: Set Retrieval 2.0

dynamic topic facet

Subject

Electronic data processing (1002)

Distributed processing (937)

Parallel processing (619)

Computer networks (562)

Fault-tolerant computing (365)

Show more…

Page 46: Set Retrieval 2.0

facets populated using entity extraction

apple production

Page 47: Set Retrieval 2.0

bootstrap on folksonomies

Page 48: Set Retrieval 2.0

or learn from users

Page 49: Set Retrieval 2.0

hcir using set retrieval 2.0

emphasize set summaries over ranked lists

establish a dialog between the user and the data

enable exploration and discovery

Page 50: Set Retrieval 2.0

think outside the (search) box

• best-first search works for many use cases

• but not for some of the most valuable ones

• set retrieval 2.0 = set retrieval + guidance

• human-computer information retrieval

Page 51: Set Retrieval 2.0

thank you

communication 1.0

email: [email protected]

communication 2.0

blog: http://thenoisychannel.com

twitter: http://twitter.com/dtunkelang