Page 1:

Ryen White
Microsoft Research
ryenw@microsoft.com
research.microsoft.com/~ryenw/talks/ppt/WhiteIMT542E.ppt

Page 2:

Overview
- Short, selfish bit about me
- User evaluation in IR
- Case study combining two approaches
  - User study
  - Log-based
- Introduction to Exploratory Search Systems
  - Focus on evaluation
- Short group activity
- Wrap-up

Page 3:

Me, Me, Me
- Interested in understanding and supporting people’s search behaviors, in particular on the Web
- Ph.D. in Interactive Information Retrieval from University of Glasgow, Scotland (2001–2004)
- Post-doc at University of Maryland Human-Computer Interaction Lab (2004–2006)
- Instructor for a course on Human-Computer Interaction at the UMD College of Library and Information Studies
- Researcher in the Text Mining, Search, and Navigation group at Microsoft Research, Redmond (2006–present)

Page 4:

Overview
- Short, selfish bit about me
- User evaluation in IR
- Case study combining two approaches
  - User study
  - Log-based
- Introduction to Exploratory Search Systems
  - Focus on evaluation
- Short group activity
- Wrap-up

Page 5:

Search Interfaces
- There are lots of different search interfaces, for lots of different situations
- Big question: How do we evaluate these interfaces?

Page 6:

Some Approaches
- Laboratory experiments
- Naturalistic studies
- Longitudinal studies
- Formative (during) and summative (after) evaluations
- Traditional usability studies: Is an interface usable? Generally not comparative.
- Case studies: often driven by the designer, not the user

Page 7:

Research Questions
- Research questions are the questions you hope your study will answer (a formal statement of your goal)
- Hypotheses are specific predictions about relationships among variables
- Questions should be meaningful, answerable, concise, open-ended, and value-free

Page 8:

Research Questions: Example 1
For a study of advanced query syntax (e.g., +, -, “”, site:), the research questions were:
- Is there a relationship between the use of advanced syntax and other characteristics of a search?
- Is there a relationship between the use of advanced syntax and post-query navigation behaviors?
- Is there a relationship between the use of advanced syntax and measures of search success?

Page 9:

Research Questions: Example 2
For a study of an interface gadget that points users to popular destinations (i.e., pages that many people visit):
- Are popular destinations preferable and more effective than query refinement suggestions and unaided Web search for:
  - Searches that are well-defined (“known-item” tasks)?
  - Searches that are ill-defined (“exploratory” tasks)?
- Should popular destinations be taken from the end of query trails or the end of session trails?

More on this research question in the case study later!

Page 10:

Variables
- Independent Variable (IV): the “cause”; often (but not always) controlled or manipulated by the investigator
- Dependent Variable (DV): the “effect”; what is proposed to change as a result of different values of the independent variable
- Other variables:
  - Intervening variable: explains the link between variables
  - Moderating variable: affects the direction/strength of the IV-to-DV relationship
  - Confounding variable: not controlled for, affects the DV

Page 11:

Hypotheses
- Alternative Hypothesis: a statement describing the relationship between two or more variables
  - E.g., search engine users that use advanced query syntax find more relevant Web pages
- Null Hypothesis: a statement declaring that there is no relationship among variables; you may have heard of “rejecting the null hypothesis” and “failing to reject the null hypothesis”
  - E.g., search engine users that use advanced query syntax find Web pages that are no more or less relevant than those found by other users

Page 12:

Experimental Design
- Within- and/or between-subjects
  - Within-subjects: all subjects use all systems
  - Between-subjects: subjects use only one system; different blocks of users use each system
- Control:
  - System with no modifications (in within-subjects designs)
  - Group of subjects that do not use the experimental system, but instead use a baseline (in between-subjects designs)
- Factorial designs: more than one variable (factor), e.g., system × task type

Page 13:

Tasks
- Task or topic?
  - Task is the activity the user is asked to perform
  - Topic is the subject matter of the task
- Artificial tasks: subjects are given the task or even the queries; relevance is pre-determined
- Simulated work tasks (Borlund, 2000): subjects are given a task, compose their own queries, and determine relevance themselves
- Natural tasks (Kelly & Belkin, 2004): subjects construct their own tasks as part of real needs

Page 14:

System & Task Rotation
- Rotation & counterbalancing to counteract learning effects
- Latin Square rotation: an n × n table filled with n different symbols so that each symbol occurs exactly once in each row and exactly once in each column
- Factorial rotation: all possible combinations
- The factorial rotation needs twice as many subjects (6 vs. 3 for n = 3), so it is twice as expensive to perform

Latin Square (3 subjects):    Factorial (all 3! = 6 orders):
2 1 3                         1 2 3
1 3 2                         2 1 3
3 2 1                         1 3 2
                              3 1 2
                              2 3 1
                              3 2 1
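Both rotations are easy to generate programmatically. The following is a minimal Python sketch (illustrative, not from the original slides): a cyclic construction for the Latin square and itertools.permutations for the factorial rotation.

```python
from itertools import permutations

def latin_square_orders(n):
    """Cyclic Latin square: subject `row` uses system (row + col) % n + 1
    in position `col`, so each system occurs exactly once in each row
    and exactly once in each column."""
    return [[(row + col) % n + 1 for col in range(n)] for row in range(n)]

def factorial_orders(n):
    """Factorial rotation: every one of the n! possible system orders."""
    return [list(p) for p in permutations(range(1, n + 1))]

print(latin_square_orders(3))  # 3 orders: 3 subjects suffice
print(factorial_orders(3))     # 6 orders: twice as many subjects needed
```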

Page 15:

Data Collection
- Questionnaires
- Diaries
- Interviews
- Focus groups
- Observation
- Think-aloud
- Logging (system, proxy & server, client)

Page 16:

Data Analysis: Quantitative
- Descriptive statistics
  - Describe the characteristics of a sample or the relationship among variables
  - Present summary information about the sample
  - E.g., mean, correlation coefficient
- Inferential statistics
  - Used for hypothesis testing
  - Demonstrate cause/effect relationships
  - E.g., t-value (from t-test), F-value (from ANOVA)
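For instance, both kinds of analysis can be run in a few lines of Python. The data below are invented purely for illustration; only the pattern of the analysis matters.

```python
import statistics
from scipy import stats

# Hypothetical task-completion times in seconds for two systems.
baseline = [348, 410, 365, 390, 402, 355, 380, 370]
experimental = [272, 301, 288, 310, 265, 295, 280, 290]

# Descriptive statistics: summarize each sample.
print("means:", statistics.mean(baseline), statistics.mean(experimental))

# Inferential statistics: independent-samples t-test of the null
# hypothesis that the two systems have equal mean completion times.
t, p = stats.ttest_ind(baseline, experimental)
print(f"t = {t:.2f}, p = {p:.4f}")  # a small p lets us reject the null
```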

Page 17:

Data Analysis: Qualitative
- Coding: for open-ended questions, transcribed think-aloud, …
  - Classifying or categorizing individual pieces of data
- Open coding: codes are suggested by the investigator’s examination and questioning of the data; an iterative process
- Closed coding: codes are identified before the data are collected
- Each passage can have more than one code
- Not all passages have to have a code
- Code, code, and code some more!

Page 18:

Overview
- Short, selfish bit about me
- User evaluation in IR
- Case study combining two approaches
  - User study
  - Log-based
- Introduction to Exploratory Search Systems
  - Focus on evaluation
- Short group activity
- Wrap-up

Page 19:

Case Study
Leveraging popular destinations to enhance Web search interaction

White, R.W., Bilenko, M. and Cucerzan, S. (2007). Studying the use of popular destinations to enhance Web search interaction. In Proceedings of the 30th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 159-166.

Page 20:

Motivation
- Query suggestion is a popular approach to help users better define their information needs
- Incremental: may be inappropriate for exploratory needs
- In exploratory searches, users rely a lot on browsing
- Can we use the places others go rather than what they say?

Query = [hubble telescope]
[Screenshot: query suggestions]

Page 21:

Search Trails: from user logs
- Initiated with a query to a top-5 search engine
- Query trails: run from one query until the next query
- Session trails: run from a query until a session-ending event: session timeout, visiting a homepage, typing a URL, or checking Web-based email / logging on to an online service

[Diagram: example trails for the queries “digital cameras”, “digital camera canon”, “amazon”, and “canon lenses”, passing through pages such as dpreview.com, pmai.org, canon.com, amazon.com, howstuffworks.com, and digitalcamera-hq.com; markers labeled QueryTrailEnd and SessionTrailEnd show where each kind of trail terminates]
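To make the two trail definitions concrete, here is a simplified Python sketch of how one user’s time-ordered log events might be segmented. The field names, the 30-minute timeout, and the event vocabulary are assumptions for illustration, not the study’s actual pipeline, and for brevity it does not enforce that trails begin with a query.

```python
from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)        # assumed timeout value
SESSION_ENDERS = {"visit_homepage", "type_url", "webmail_or_logon"}

def split_trails(events):
    """Segment time-ordered events into query trails and session trails.
    Each event is a dict with hypothetical fields: 'time' (datetime) and
    'kind' ('query', 'pageview', or one of SESSION_ENDERS)."""
    query_trails, session_trails = [], []
    q_trail, s_trail, prev_time = [], [], None
    for ev in events:
        timed_out = prev_time is not None and ev["time"] - prev_time > SESSION_TIMEOUT
        ends_session = timed_out or ev["kind"] in SESSION_ENDERS
        if ev["kind"] == "query" or ends_session:
            if q_trail:
                query_trails.append(q_trail)   # query trail ends at the next query
            q_trail = []
        if ends_session:
            if s_trail:
                session_trails.append(s_trail) # session trail ends here
            s_trail = []
        if ev["kind"] in ("query", "pageview"):
            q_trail.append(ev)
            s_trail.append(ev)
        prev_time = ev["time"]
    if q_trail:
        query_trails.append(q_trail)
    if s_trail:
        session_trails.append(s_trail)
    return query_trails, session_trails
```

A query trail closes as soon as the next query arrives; the enclosing session trail keeps accumulating pages until a session-ending event occurs.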

Page 22:

Popular Destinations
- Pages at which other users frequently end up after submitting the same or similar queries and then browsing away from initially clicked search results
- Popular destinations lie at the end of many users’ trails
  - May not be among the top-ranked results
  - May not contain the queried terms
  - May not even be indexed by the search engine
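Given a corpus of such trails, ranking popular destinations can be as simple as counting trail endpoints per query. A minimal sketch (exact query matching only; per the slide, the real system also matched similar queries):

```python
from collections import Counter, defaultdict

def popular_destinations(trails, top_k=5):
    """trails: iterable of (query, [page, page, ...]) pairs.
    Returns, per query, the pages at which trails most often end."""
    ends = defaultdict(Counter)
    for query, pages in trails:
        if pages:
            ends[query][pages[-1]] += 1  # destination = trail's last page
    return {q: c.most_common(top_k) for q, c in ends.items()}

trails = [
    ("digital cameras", ["dpreview.com", "dpreview.com/buying-guide"]),
    ("digital cameras", ["canon.com", "amazon.com"]),
    ("digital cameras", ["digitalcamera-hq.com", "amazon.com"]),
]
print(popular_destinations(trails))
# {'digital cameras': [('amazon.com', 2), ('dpreview.com/buying-guide', 1)]}
```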

Page 23:

Suggesting Destinations
Can we exploit a corpus of trails to support Web search?

Page 24:

Research Questions
- RQ1: Are destination suggestions preferable and more effective than query refinement suggestions and unaided Web search for:
  - Searches that are well-defined (“known-item” tasks)?
  - Searches that are ill-defined (“exploratory” tasks)?
- RQ2: Should destination suggestions be taken from the end of query trails or the end of session trails?

Page 25:

User Study
- Conducted a user study to answer these questions
- 36 subjects drawn from a subject pool within our organization
- 4 systems
- 2 task types (“known-item” and “exploratory”)
- Within-subjects experimental design
- Graeco-Latin square design
- Subjects attempted 2 known-item and 2 exploratory tasks, one on each system
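A Graeco-Latin square superimposes two mutually orthogonal Latin squares, here one for system order and one for task order, so that every (system, task) pairing occurs exactly once across the design. The slides do not give the actual square used, so the following order-4 construction is purely illustrative, with a built-in check:

```python
# Two mutually orthogonal 4x4 Latin squares (illustrative, not the study's own).
SYSTEMS = [[0, 1, 2, 3],
           [1, 0, 3, 2],
           [2, 3, 0, 1],
           [3, 2, 1, 0]]
TASKS = [[0, 1, 2, 3],
         [2, 3, 0, 1],
         [3, 2, 1, 0],
         [1, 0, 3, 2]]

def is_latin(sq):
    """Each symbol occurs exactly once per row and per column."""
    n = len(sq)
    return (all(len(set(row)) == n for row in sq) and
            all(len({sq[r][c] for r in range(n)}) == n for c in range(n)))

# Orthogonality: superimposing the squares yields all 16 distinct pairs.
pairs = {(SYSTEMS[r][c], TASKS[r][c]) for r in range(4) for c in range(4)}
assert is_latin(SYSTEMS) and is_latin(TASKS) and len(pairs) == 16

for subject in range(36):  # 36 subjects: 9 per row of the design
    row = subject % 4
    print(f"S{subject + 1}:",
          [(SYSTEMS[row][c], TASKS[row][c]) for c in range(4)])
```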

Page 26:

Systems: Unaided Web Search
- Live Search backend
- No direct support for query refinement

Query = [hubble telescope]

Page 27:

Systems: Query Suggestion
- Suggests queries based on popular extensions of the query typed by the user

Query = [hubble telescope]

Page 28:

Systems: Destination Suggestion
- Query Destination (unaided + page support): suggests pages many users visit before the next query
- Session Destination (unaided + page support): same as above, but before session end rather than the next query

Query = [hubble telescope]

Page 29:

Tasks
- Tasks taken and adapted from the TREC Interactive Track and QA communities (e.g., Live QnA, Yahoo! Answers)
- Six of each task type; subjects chose without replacement
- Two task types: known-item and exploratory
  - Known-item: Identify three tropical storms (hurricanes and typhoons) that have caused property damage and/or loss of life.
  - Exploratory: You are considering purchasing a Voice over Internet Protocol (VoIP) telephone. You want to learn more about VoIP technology and providers that offer the service, and select the provider and telephone that best suits you.

Page 30:

Methodology
Subjects:
- Chose two known-item and two exploratory tasks from the six
- Completed a demographic and experience questionnaire
For each of the four interfaces, subjects were:
- Given an explanation of the interface functionality (2 min.)
- Asked to attempt the task on the assigned system (10 min.)
- Asked to complete a post-search questionnaire after each task
After using all four systems, subjects answered an exit questionnaire

Page 31:

Findings: System Ranking
- Subjects were asked to rank the systems in preference order
- Subjects preferred QuerySuggestion and QueryDestination
- Differences were not statistically significant
- The overall ranking merges performance on different types of search task to produce one ranking

System:   Baseline   QuerySuggestion   QueryDestination   SessionDestination
Ranking:  2.47       2.14              1.92               2.31

Relative ranking of systems (lower = better).

Page 32:

Findings: Subject Comments
Responses to open-ended questions

Baseline:
+ familiarity of the system (e.g., “was familiar and I didn’t end up using suggestions” (S36))
− lack of support for query formulation (“Can be difficult if you don’t pick good search terms” (S20))
− difficulty locating relevant documents (e.g., “Difficult to find what I was looking for” (S13))

Page 33:

Findings: Subject Comments
QuerySuggestion:
+ rapid support for query formulation (e.g., “was useful in saving typing and coming up with new ideas for query expansion” (S12); “helps me better phrase the search term” (S24); “made my next query easier” (S21))
− suggestion quality (e.g., “Not relevant” (S11); “Popular queries weren’t what I was looking for” (S18))
− quality of the results they led to (e.g., “Results (after clicking on suggestions) were of low quality” (S35); “Ultimately unhelpful” (S1))

Page 34:

Findings: Subject Comments
QueryDestination:
+ support for accessing new information sources (e.g., “provided potentially helpful and new areas / domains to look at” (S27))
+ bypassing the need to browse to these pages (“Useful to try to ‘cut to the chase’ and go where others may have found answers to the topic” (S3))
− lack of specificity in the suggested domains (“Should just link to site-specific query, not site itself” (S16); “Sites were not very specific” (S24); “Too general/vague” (S28))
− quality of the suggestions (“Not relevant” (S11); “Irrelevant” (S6))

Page 35:

Findings: Subject Comments
SessionDestination:
+ utility of the suggested domains (“suggestions make an awful lot of sense in providing search assistance, and seemed to help very nicely” (S5))
− irrelevance of the suggestions (e.g., “did not seem reliable, not much help” (S30); “irrelevant, not my style” (S21))
− need for explanations of why the suggestions were offered (e.g., “low-quality results, not enough information presented” (S35))

Page 36:

Findings: Task Completion
Subjects felt they were more successful for known-item searches on QuerySuggestion and more successful for exploratory searches on QueryDestination

Task type     Baseline   QSuggestion   QDestination   SDestination
Known-item    2.0        1.3           1.4            1.4
Exploratory   2.8        2.3           1.4            2.6

Perceptions of task success (lower = better, scale = 1-5).

Page 37:

Findings: Task Completion Time
- QuerySuggestion and QueryDestination sped up known-item performance
- Exploratory tasks took longer

System         Known-item   Exploratory
Baseline       348.8        513.7
QSuggestion    272.3        467.8
QDestination   232.3        474.2
SDestination   359.8        472.2

Mean task completion time (seconds) by system and task category.

Page 38:

Findings: Interaction
- Known-item tasks: subjects used query suggestions most heavily
- Exploratory tasks: subjects benefited most from destination suggestions
- Subjects submitted fewer queries and clicked fewer search results on QueryDestination

Task type     QSuggestion   QDestination   SDestination
Known-item    35.7          33.5           23.4
Exploratory   30.0          35.2           25.3

Suggestion uptake (values are percentages).

Page 39:

Log Analysis
- These findings are all from the laboratory
- Logs from consenting users of the Windows Live Toolbar allowed us to determine the external validity of our experimental findings
  - Do the behaviors observed in the study mimic those of real users in the “wild”?
- Extracted search sessions from the logs that started with the same initial queries as our user study subjects

Page 40:

Log Analysis: Search Trails
- Initiated with a query to a top-5 search engine
- Query trails: run from one query until the next query
- Session trails: run from a query until a session-ending event: session timeout, visiting a homepage, typing a URL, or checking Web-based email / logging on to an online service

[Diagram: the same example trails as on Page 21, for the queries “digital cameras”, “digital camera canon”, “amazon”, and “canon lenses”, with markers for QueryTrailEnd and SessionTrailEnd]

Page 41:

Log Analysis: Trails
- We extracted 2,038 trails from the logs that began with the same query as a user study session
  - 700 from known-item and 1,338 from exploratory tasks
- In vitro group: user study subjects
- Ex vitro group: remote (log) subjects
- Compared: # query iterations, # unique query terms, # result clicks, and # unique domains visited

Page 42:

Log Analysis: Results
- Generally the same, apart from the number of unique query terms submitted
- Subjects may be taking terms from the textual task descriptions provided to them

                     Known-item                        Exploratory
                     In vitro  Ex vitro   Ex vitro     In vitro  Ex vitro   Ex vitro
Feature                        (10 min)   (all)                  (10 min)   (all)
Query iterations     1.9       2.3        2.6          3.1       3.0        3.8
Unique query terms   5.2       2.8        3.2          7.4       4.4        4.9
Result clicks        2.6       1.8        2.5          3.3       2.8        3.1
Unique domains       1.3       1.4        1.7          2.1       1.8        2.1

The in vitro unique query term counts (5.2 and 7.4) are notably high!

Page 43:

Log Analysis: Results
- Known-item tasks: 72% overlap between queries issued and terms appearing in the task description
- Exploratory tasks: 79% overlap between queries issued and terms appearing in the task description
- Could confound the experiment if we are interested in query formulation behavior; this needs to be addressed!
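Overlap of this kind is straightforward to measure. A rough Python sketch (the slide does not specify the exact tokenization or computation, so this is only a proxy):

```python
import re

def term_overlap(queries, task_description):
    """Fraction of unique query terms that also appear in the task
    description."""
    tokenize = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    query_terms = set().union(*(tokenize(q) for q in queries))
    task_terms = tokenize(task_description)
    return len(query_terms & task_terms) / len(query_terms) if query_terms else 0.0

task = ("Identify three tropical storms (hurricanes and typhoons) "
        "that have caused property damage and/or loss of life.")
print(term_overlap(["tropical storms", "hurricane damage"], task))  # 0.75
```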

Page 44:

Conclusions
- The user study compared popular destinations with traditional query refinement and unaided Web search
- Results revealed that:
  - RQ1a: query suggestion was preferred for known-item tasks
  - RQ1b: destination suggestion was preferred for exploratory tasks
  - RQ2: destinations should come from query trails rather than session trails
- Differences in the number of unique query terms suggest that textual task descriptions may introduce some degree of experimental bias

Page 45:

Case Study
What did we learn?
- Showed how a user evaluation can be conducted
- Showed how analysis of different sources – questionnaire responses and interaction logs (both local and remote) – can be combined to answer our research questions
- Showed that the findings of a user study can be generalized in some respects to the “real” world (i.e., they have some external validity)
- Anything else?

Page 46:

Overview
- Short, selfish bit about me
- User evaluation in IR
- Case study combining two approaches
  - User study
  - Log-based
- Introduction to Exploratory Search Systems
  - Focus on evaluation
- Short group activity
- Wrap-up

Page 47:

Exploratory Search
“Exploratory search” describes:
- an information-seeking problem context that is open-ended, persistent, and multi-faceted, common in scientific discovery, learning, and decision-making contexts (the user’s search problem)
- information-seeking processes that are opportunistic, iterative, and multi-tactical; exploratory tactics are used in all manner of information seeking and reflect seeker preferences and experience as much as the goal (the user’s search strategies)

Page 48:

Marchionini’s definition:
[Figure: Marchionini’s taxonomy of search activities (CACM, 2006): lookup, learn, and investigate; exploratory search spans the learn and investigate activities]

Page 49:

Exploratory Search Systems
- Support both querying and browsing activities
  - Search engines generally just support querying
- Help users explore complex information spaces
- Help users learn about new topics: go beyond finding
- Can consider user context
  - E.g., task constraints, user emotion, changing needs

Page 50:

Overview
- Short, selfish bit about me
- User evaluation in IR
- Case study combining two approaches
  - User study
  - Log-based
- Introduction to Exploratory Search Systems
  - Focus on evaluation
- Short group activity
- Wrap-up

Page 51:

Group Activity
- Divide into two groups of 3-4 people
- Each group designs an evaluation of an exploratory search system
- Two systems:
  - mSpace: faceted spatial browser for classical music
  - PhotoMesa: photo browser with flexible filtering, grouping, and zooming tools
- You pick the evaluation criteria, comparator systems, approach, metrics, etc.

Page 52:

mSpace (mspace.fm)

Page 53:

PhotoMesa (photomesa.com)

Page 54:

Some questions to think about
- What are the independent/dependent variables?
- Which experimental design?
- What task types? What tasks? What topics?
- Any comparator systems?
- What subjects? How many? How will you recruit?
- Which instruments (e.g., questionnaires)?
- Which data analysis methods (qualitative/quantitative)?
- Most importantly: which metrics? How do you determine user and system performance?

Page 55:

Overview
- Short, selfish bit about me
- User evaluation in IR
- Case study combining two approaches
  - User study
  - Log-based
- Introduction to Exploratory Search Systems
  - Focus on evaluation
- Short group activity
- Wrap-up

Page 56:

Evaluating Exploratory Search
- SIGIR 2006 workshop on Evaluating Exploratory Search Systems
- Brought together around 40 experts to discuss issues in the evaluation of exploratory search systems
- http://research.microsoft.com/~ryenw/eess
- What metrics did they come up with? How do they compare to yours?

Page 57:

Metrics from the workshop
- Engagement and enjoyment: e.g., task focus, happiness with system responses, the number of actionable events (e.g., purchases, forms filled)
- Information novelty: e.g., the amount of new information encountered
- Task success: e.g., did the user reach the target document? encounter sufficient information en route?
- Task time: to assess efficiency
- Learning and cognition: e.g., cognitive load, attainment of learning outcomes, richness/completeness of the post-exploration perspective, amount of topic space covered, number of insights

Page 58:

Activity Wrap-up
[insert summary of comments from group activity]

Page 59:

Conclusion
We have:
- Described aspects of user experimentation in IR
- Walked through a case study
- Introduced exploratory search
- Planned an evaluation of exploratory search systems
- Related our proposed metrics to those of others interested in evaluating exploratory search systems

Page 60:

Acknowledgements

A few of the earlier slides in this lecture were based (with modifications) on an excellent SIGIR 2006 tutorial given by Diane Kelly and David Harper – Thank you Diane and David!

Page 61:

Referenced Reading

Borlund, P. (2000). Experimental components for the evaluation of interactive information retrieval systems. Journal of Documentation, 56(1): 71-90.

Kelly, D. and Belkin, N.J. (2004). Display time as implicit feedback: Understanding task effects. In Proceedings of the 27th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 377-384.