Machine Learning Applications to Modeling Web Searcher Behavior

Eugene Agichtein

Intelligent Information Access Lab (IRLab)

Emory University

Talk Outline

• Overview of the Emory IR Lab

• Intent-centric Web Search

• Contextualized search intent detection

• One example medical application


Intelligent Information Access Lab (IRLab)


Lab members: Qi Guo, Ablimit Aji, Julia Kiseleva, Dmitry Lagun, Qiaoling Liu, Yu Wang

• Modeling information seeking behavior

• Searching the Web and social media

• Text and data mining for medical informatics

In collaboration with:

- Beth Buffalo (Neurology)

- Charlie Clarke (Waterloo)

- Ernie Garcia (Radiology)

- Phil Wolff (Psychology)

- Hongyuan Zha (GaTech)

Our Approach to Intelligent Information Access

[Diagram: Data-Driven Model Discovery (machine learning / data mining) over search logs (queries, clicks), feeding four application areas: intelligent search, information sharing, health informatics, and cognitive diagnostics]

Web-scale Text Mining

Extract entities, relationships, events from text

Estimate accuracy of web content

Some Applications:

– Incorporating extracted information into (web) search

– Finding implicit connections between events, entities

– Visualizing and exploring large text collections

[DL’00, ICDE 2003 “best student paper”, SIGMOD 2006 “best paper”, … ]

[Example: disease outbreaks extracted from The New York Times archive]


Social Media Language Analysis

Social Media != WSJ Text

Text Mining/NLP Challenges:

• Content quality

• Authority/expertise

• User goals

• Subjectivity

• Sentiment

• Temporal sensitivity

• Effort and incentives


Content Quality

Hybrid Web/Social Search


[Diagram: intelligent search + information sharing]

Talk Outline

• Overview of the Emory IR Lab

• Intent-centric Web Search

• Contextualized search intent detection

• One example medical application


Some Key Challenges for Web Search

• Query interpretation (infer intent)

• Ranking (high dimensionality)

• Evaluation (system improvement)

• Result presentation (information visualization)


Task-Goal-Search Model

[Diagram of the task-goal-search model; example query: "car safety ratings consumer reports"]

Problem Statement

• Given a sequence of user actions, predict the user's goal, task, and future actions (a minimal data representation is sketched below)

– Tasks and goals are defined next

• Example applications:

– Predict document relevance (ranking, result presentation, summarization)

– Predict next query (query suggestion, spelling correction)

– Predict user satisfaction (market share)

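To make the prediction task concrete, here is a minimal sketch of one way a search session could be represented as a time-ordered sequence of actions. The class and field names (Action, Session, kind, goal, …) are illustrative assumptions, not the schema used in this work.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Action:
    """A single logged user action (names are illustrative, not the EMU schema)."""
    timestamp: float               # seconds since session start
    kind: str                      # e.g., "query", "click", "scroll", "mousemove"
    target: Optional[str] = None   # e.g., query string or clicked URL
    x: Optional[int] = None        # mouse coordinates, if applicable
    y: Optional[int] = None

@dataclass
class Session:
    """A sequence of actions for one user; goal and task are the labels
    that the models discussed in this talk try to predict."""
    user_id: str
    actions: List[Action] = field(default_factory=list)
    goal: Optional[str] = None     # e.g., "research" or "purchase"
    task: Optional[str] = None

# Toy example: a query followed by a result click
session = Session(
    user_id="u42",
    actions=[
        Action(timestamp=0.0, kind="query", target="car safety ratings"),
        Action(timestamp=7.3, kind="click", target="http://example.com/ratings"),
    ],
)
```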

Intent (Goal) Classes, top level only

User intent taxonomy (Broder 2002)

– Informational – want to learn about something (~40% / 65%)

– Navigational – want to go to that page (~25% / 15%)

– Transactional – want to do something (web-mediated) (~35% / 20%)

• Access a service

• Downloads

• Shop

– Gray areas

• Find a good hub

• Exploratory search “see what’s there”


[from SIGIR 2008 Tutorial, Baeza-Yates and Jones]

Example queries: "History nonya food", "Singapore Airlines", "Jakarta weather", "Kalimantan satellite images", "Nikon Finepix", "Car rental Kuala Lumpur"

Information Retrieval Process: Implementation

[Diagram: the information retrieval process - source selection → query formulation (query: "car safety ratings") → search → ranked list on the search engine result page (SERP) → selection → examination → document delivery, with feedback loops for query reformulation, vocabulary learning, relevance feedback, and source reselection]

Search Actions

• Keystrokes:

– typing a query, scrolling, CTRL-C, …

• GUI:

– scrolling, button presses, clicks

• Mouse:

– moving, scrolling, button down/up

• Browser:

– new tab, close, back/forward


All of these can be easily captured on the SERP (via JavaScript)

How Do We Know “True” User Intent?

• Ask the user (surveys, field studies, pop-ups)

– Does not scale, users get annoyed

• Observe user actions and guess

– Intent usually obvious to humans but not always

• Detect signals from the user’s brain (fMRI, EEG) and attempt to interpret neural activity


Adapted from [Daniel M. Russell, 2007]

What Eye Movement Can Tell

• Eye tracking gives information about searcher interests:

– Eye position

– Pupil diameter

– Saccades and fixations


[Figure: gaze patterns during reading vs. visual search, recorded with an eye-tracking camera]

“An Eye Tracker on Every Table”

• Eye tracking equipment is bulky and expensive

• Can we infer gaze position from observable actions?

• Exploratory study from Google (Rodden et al.) says maybe: mouse position is sometimes related to eye position


Relationship Between Mouse and Gaze Position

• Searchers might use the mouse to focus reading attention, bookmark promising results, or not at all

• Behavior varies with task difficulty and user expertise


[K. Rodden, X. Fu, A. Aula, and I. Spiro, Eye-mouse coordination patterns on web search results pages, Extended Abstracts of ACM CHI 2008]

Assume “Transitivity” Holds

• Given:

– Gaze position ==> user intent

– Mouse movement ==> gaze position

• Therefore: mouse movement ==> user intent

• Restated problem: given user actions, infer the current user’s intent, focusing on the individual user’s actions


Collecting Search Data: EMU


• Firefox + LibX plugin

• Tracks whitelisted sites, e.g., Emory, Google, Yahoo search…

• All SERP events logged (asynchronous HTTP requests)

• 150 public-use machines, 5,000+ opted-in users

[Architecture diagram: logged browser events flow over HTTP to a server and HTTP log, into usage data management and data mining, which train the prediction models]
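As a rough illustration of this logging pipeline, the sketch below parses a hypothetical tab-separated event log (user id, timestamp, event type, detail) into time-ordered per-user event sequences. The log format is an assumption made for the example; it is not the actual EMU format.

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical log format: user_id <TAB> unix_timestamp <TAB> event_type <TAB> detail
SAMPLE_LOG = """u42\t1259540000.1\tquery\tcar safety ratings
u42\t1259540007.4\tclick\thttp://example.com/ratings
u43\t1259540012.9\tscroll\t480
"""

def parse_event_log(log_text):
    """Group raw log lines into time-ordered event sequences, one per user."""
    sessions = defaultdict(list)
    for user_id, ts, event_type, detail in csv.reader(StringIO(log_text), delimiter="\t"):
        sessions[user_id].append((float(ts), event_type, detail))
    for events in sessions.values():
        events.sort()  # order each user's events by timestamp
    return dict(sessions)

if __name__ == "__main__":
    for user, events in parse_event_log(SAMPLE_LOG).items():
        print(user, events)
```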

EMU: Querying Behavior Data


Playback Example


Research vs. Purchase Intent

• 12 subjects (grad students and staff) were asked to:

1. Research a product they want to purchase eventually (Research intent)

2. Search for the best deal on an item they want to purchase immediately (Purchase intent)

• Eye tracking and browser instrumentation performed in parallel for some of the subjects

• EyeTech Systems TM3 (rental) → avoid!

– At reasonable resolution, samples at only ~12-15 Hz

– Loses calibration after a few minutes


Qi Guo

Research Intent


Purchase Intent


Informational query: “spanish wine”


Mouse Features: Simple


• First representation:

– Trajectory length

– Horizontal range

– Vertical range

[Figure: example mouse trajectory annotated with horizontal range, vertical range, and trajectory length]
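A minimal sketch of how these three features could be computed from a list of (x, y) mouse samples; the exact feature definitions are not spelled out on the slide, so this is an illustrative reading of them.

```python
import math

def simple_mouse_features(points):
    """Trajectory length and horizontal/vertical range for a list of (x, y)
    mouse positions (illustrative sketch, not the authors' code)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Total distance travelled by the cursor between consecutive samples
    trajectory_length = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )
    return {
        "trajectory_length": trajectory_length,
        "horizontal_range": max(xs) - min(xs),
        "vertical_range": max(ys) - min(ys),
    }

# Example usage with a toy trajectory
print(simple_mouse_features([(0, 0), (30, 40), (30, 100), (120, 100)]))
```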

Mouse Features: Full


• Second representation:

– 5 segments: initial, early, middle, late, and end

– Each segment: speed, acceleration, rotation, slope, etc.

[Figure: example mouse trajectory split into segments 1-5]
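A hedged sketch of the segmented representation: split time-stamped mouse samples into five roughly equal segments and compute a couple of per-segment statistics. The choice of statistics here (mean speed and a displacement slope) is an assumption; the slide also mentions acceleration and rotation, which would be computed analogously.

```python
import math

def segment_features(samples, n_segments=5):
    """samples: time-ordered list of (t, x, y) tuples.
    Returns per-segment mean speed and displacement slope (illustrative)."""
    features = []
    size = max(1, len(samples) // n_segments)
    for i in range(n_segments):
        end = (i + 1) * size if i < n_segments - 1 else len(samples)
        seg = samples[i * size:end]
        if len(seg) < 2:
            features.append({"mean_speed": 0.0, "slope": 0.0})
            continue
        speeds = [
            math.hypot(x2 - x1, y2 - y1) / max(t2 - t1, 1e-6)
            for (t1, x1, y1), (t2, x2, y2) in zip(seg, seg[1:])
        ]
        dx = seg[-1][1] - seg[0][1]   # net horizontal displacement
        dy = seg[-1][2] - seg[0][2]   # net vertical displacement
        features.append({
            "mean_speed": sum(speeds) / len(speeds),
            "slope": dy / dx if dx else 0.0,
        })
    return features
```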

Classifying Search Intent


Use Support Vector Machine (SVM) Classifier

• SVMs maximize the margin around the separating hyperplane

• A.k.a. large-margin classifiers

• The decision function is fully specified by a subset of training samples, the support vectors

• Training is a quadratic programming problem

• Seen by many as the most successful current text classification method


[Figure: separating hyperplane with support vectors and maximized margin]
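To make the classification step concrete, here is a minimal scikit-learn sketch that trains an SVM on per-query mouse-feature vectors. The data is a synthetic placeholder, and the linear kernel and C value are assumptions, not the settings reported in the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic placeholder data: each row is a mouse-feature vector
# (e.g., trajectory length, horizontal range, vertical range);
# labels: 0 = research intent, 1 = purchase intent.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scale the features, then fit a large-margin (SVM) classifier
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```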

Results: Research vs. Purchase


Contextualized Intent Inference


Implementation: Conditional Random Field (CRF) Model


From HMMs to MEMMs to CRFs

Notation: a hidden state sequence $\vec{s} = s_1, s_2, \ldots, s_n$ and an observation sequence $\vec{o} = o_1, o_2, \ldots, o_n$.

[Diagram: chain-structured graphical models for the HMM (generative), MEMM (locally normalized conditional), and CRF (globally normalized conditional), each over states $s_{t-1}, s_t, s_{t+1}$ and observations $o_{t-1}, o_t, o_{t+1}$]

HMM:

$$P(\vec{s}, \vec{o}) \propto \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$$

MEMM:

$$P(\vec{s} \mid \vec{o}) \propto \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}, o_t) = \prod_{t=1}^{|\vec{o}|} \frac{1}{Z_{s_{t-1}, o_t}} \exp\Big( \sum_j \lambda_j f_j(s_{t-1}, s_t) + \sum_k \mu_k g_k(s_t, o_t) \Big)$$

CRF:

$$P(\vec{s} \mid \vec{o}) \propto \frac{1}{Z_{\vec{o}}} \prod_{t=1}^{|\vec{o}|} \exp\Big( \sum_j \lambda_j f_j(s_{t-1}, s_t) + \sum_k \mu_k g_k(s_t, o_t) \Big)$$

Conditional Random Fields (CRFs)

[from Lafferty, McCallum, Pereira 2001]
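As one possible implementation of the CRF-based contextual intent model, the sketch below uses the sklearn-crfsuite package to tag each action in a session with a hidden intent state. The feature functions and the label set ("browsing", "reading", "receptive") are illustrative assumptions, not the model from the talk.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def action_features(session, t):
    """Feature dict for the t-th action in a session (illustrative features)."""
    kind, dwell = session[t]
    feats = {"kind": kind, "dwell_bucket": "long" if dwell > 10 else "short"}
    if t > 0:
        feats["prev_kind"] = session[t - 1][0]
    return feats

# Toy training data: each session is a list of (action_kind, dwell_seconds);
# each label sequence assigns a hidden intent state to every action.
sessions = [
    [("query", 2), ("click", 15), ("click", 3)],
    [("query", 1), ("ad_click", 20)],
]
labels = [
    ["browsing", "reading", "browsing"],
    ["browsing", "receptive"],
]

X = [[action_features(s, t) for t in range(len(s))] for s in sessions]
y = labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)
print(crf.predict(X))  # predicted intent state for every action in each session
```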

Application: Predict Ad Receptiveness

Hypothesis: the right time to show search ads is when the searcher is receptive to seeing them.


Results: Ad Click Prediction

• 200%+ precision improvement (within task)


Challenges

• Separate context from intent (e.g., smart phones)

• User variability: individual differences, tasks

• Scale of data: representation, compression

• Privacy: client-side data similar to other PII

• Obtaining realistic user data: see above

– EMU toolbar has been tracking searches in the Emory Libraries since 2007 (a biased sample)


Current and Future Work

• Detect mouse “reading” behavior

• Unsupervised intent clustering

• User vs. task

• Personalized behavior models

• Long-term interests/effects

• User mental state (frustration, cognitive impairment, …)


Towards Web-based Visual Paired Comparison Test

• VPC can be used to detect mild cognitive impairment (MCI) years before Alzheimer’s disease (AD), but requires eye-tracking equipment

• Goal: develop web-based version of VPC

– NIH ADRC Pilot Grant, jointly with Beth Buffalo

• Approach: exploit connection between mouse movement and gaze position.

– Force usage of mouse to reveal image

• Or parts of image

– Develop robust machine learning techniques to predict cognitive impairment based on (noisy) mouse data


Dmitry Lagun

with Beth Buffalo, Neurology/Yerkes

Initial Results

• Preference for the novel image (59%) is consistently observed

• Still exploring the parameter space and metrics to optimize

• Results are sensitive to MTurk worker instructions, incentives, and other factors (?)

• Looking for advice on remote behavioral testing…


[Figure: eye-tracking gaze positions vs. mouse ("oculus") center positions]

Summary: From Behavior to State of Mind

• Approach:

– Machine learning methods for detecting searcher intent

– Calibrated and augmented with lab studies

• Foundational contributions:

– Methods to mine and integrate a wide range of interactions

– Data-driven discovery of user state-of-mind

• Impact:

– Intelligent, intuitive search and information sharing

– Potential for new research tools and techniques


Main References

• Azin Ashkan, Charles L. A. Clarke, Eugene Agichtein, and Qi Guo, Classifying and Characterizing Query Intent, in Proc. of ECIR 2009.

• Qi Guo and Eugene Agichtein, Exploring Client-Side Instrumentation for Personalized Search Intent Inference: Preliminary Experiments, in Proc. of the AAAI 2008 Workshop on Intelligent Techniques for Web Personalization (ITWP 2008).

• Qi Guo, Eugene Agichtein, Azin Ashkan, and Charles L. A. Clarke, In the Mood to Click? Inferring Searcher Advertising Receptiveness, in Proc. of WI 2009.

• Other papers here: http://www.mathcs.emory.edu/~eugene/


Thank you!

• Qi Guo, Dmitry Lagun, Beth Buffalo, and Phil Wolff


Supported by: