Maximizing the Synergy between Man and Machine

Carnegie Mellon

Alex Hauptmann

School of Computer Science

Carnegie Mellon University

Exploiting Human Abilities in Video Retrieval Interfaces

Background

Automatic video analysis for detection/recognition is still quite poor

• Consider baseline (random guessing)

• Improvement is limited

• Consider near-duplicates (trivial similarity)

• Does not generalize well over video sources

• Better than nothing

Need humans to make up for this shortcoming!


Differences to VIRAT

• Most interface work done on broadcast TV

• Harder:

• Unconstrained subject matter

• Graphics, animations, photos

• Broadcasters

• Many different, short shots out of context

• Easier:

• Better resolution

• Conventions in editing, structure

• Audio track

• Keyframes are typical unit of analysis


“Classic” Interface Work

• Interactive Video Queries

• Fielded text query matching capabilities

• Fast image matching with simplified interface for launching image queries

• Interactive Browsing, Filtering, and Summarizing

• Browsing by visual concepts

• Quick display of contents and context in synchronized views


Informedia Client Interface Example


Interface options (Christel08)


Suggesting Related Concepts (Zavesky07)


Fork Browser (Snoek09)


“Classic” Video Interface Results

• Concept browsing and image search used frequently

• Novices still have lower performance than experts

• Some topics cause “interactivity” to be one-shot query with little browsing/exploration

• Classic Informedia interface including concept browsing often good enough that user never proceeds to any additional text or image query

• “Classic Informedia” scored highest of those testing with novice users in TRECVID evaluations


Visual Browsing


Augmented Video Retrieval

The computer observes the user and LEARNS, based on what is marked as relevant

The system can learn:

• What image characteristics are relevant

• What text characteristics (words) are relevant

• What combination weights should be used

We exploit the human’s ability to quickly mark relevant video and the computer’s ability to learn from given examples


Combining Concept Detectors for Retrieval

Combination of diverse knowledge sources

[Figure: a multimodal question (text "Find Pope John Paul", audio, image, motion) is run against the video library. The outputs (Output1, Output2, …, Outputn) of the knowledge sources/APIs (closed caption, color feature, motion feature, audio feature, and concept detectors such as Face, Building, … (3k)) are combined into the final ranked list for the interface.]


Why Relevance Feedback?

• Limited training data

• Untrained sources are useful for some specific searches

Query: finding some boats/ships

Q-Type: general object

Txt: 0.5, Img: 0.3, Face: -0.5 (learned from training set)

Outdoor: ?, Ocean: ? (unable to be learned)
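The boats/ships example can be read as a weighted linear fusion of per-source scores. The following is a minimal sketch, assuming each knowledge source returns one score per shot and that a source without a learned weight contributes nothing. The function name `fuse_scores` and the per-shot scores are hypothetical; only the weights Txt 0.5, Img 0.3, Face -0.5 come from the slide.

```python
def fuse_scores(shot_scores, weights):
    """Rank shots by a weighted sum of per-source scores.

    Sources with no learned weight (e.g. Outdoor, Ocean) contribute 0,
    which is the gap relevance feedback is meant to fill.
    """
    fused = {}
    for shot, scores in shot_scores.items():
        fused[shot] = sum(weights.get(src, 0.0) * s for src, s in scores.items())
    return sorted(fused, key=fused.get, reverse=True)


# Weights from the slide (learned from the training set for "general object").
weights = {"Txt": 0.5, "Img": 0.3, "Face": -0.5}

# Hypothetical per-source scores for three shots.
shot_scores = {
    "shot_a": {"Txt": 0.9, "Img": 0.2, "Face": 0.8, "Ocean": 0.1},
    "shot_b": {"Txt": 0.4, "Img": 0.7, "Face": 0.0, "Ocean": 0.9},
    "shot_c": {"Txt": 0.1, "Img": 0.1, "Face": 0.1, "Ocean": 0.0},
}

ranking = fuse_scores(shot_scores, weights)
```

In this toy data the Ocean detector strongly favors `shot_b`, but that evidence is ignored until a weight for it is estimated from feedback.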


Probabilistic Local Context Analysis (pLCA) (Yan07)

• Goal: Refine results of the current query

• Method: assume the combination parameters of “un-learned” sources υ to be latent variables and compute P(yj|aj,Dj)

• Discover useful search knowledge based on initial results Ai

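[Figure: the query produces an initial search result over Video 1, Video 2, …, Video M, with initial results A1, A2, …, Am and relevance Y1, Y2, …, Ym; the combination parameters υ1, υ2, …, υN of the un-learned sources are unknown.]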


Undirected Model and Parameter Estimation

Compute the posterior probability of document relevance Y given initial results A based on an undirected graphical model

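[Figure: undirected graphical model linking each document Dj to its initial result Aj and relevance Yj, for j = 1, …, m, with the latent parameters υ shared across all documents.]

$$P(y \mid a; D, Q) = \frac{1}{Z} \int \prod_{l} P_0(\upsilon_l) \prod_{j=1}^{M_D} \exp\Big( y_j \Big( a_j + \sum_{l} \upsilon_l f_l(D_j, Q) \Big) \Big)\, d\upsilon$$

Variational inference, i.e., iterate until convergence and approximate P(yj|aj) by qyj

Maximize w.r.t. variational para. of Y:

$$q_{y_j} = \frac{1}{1 + \exp\Big( -a_j - \sum_{l} q_{\upsilon_l} f_l(D_j, Q) \Big)}$$

Maximize w.r.t. variational para. of υ:

$$q_{\upsilon_l} = \upsilon_l^{0} + \sum_{j} q_{y_j} f_l(D_j, Q)$$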
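The alternating maximization can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from Yan07: relevance is binary, the relevance estimate for document j is a logistic function of its initial score plus the weighted outputs of the un-learned sources, and each weight estimate is a prior value plus the relevance-weighted sum of that source's outputs. The names `plca_mean_field`, `v0`, the fixed iteration count, and the sample data are all hypothetical.

```python
import math


def plca_mean_field(a, f, v0, iters=50):
    """Alternate the two variational updates for a fixed number of rounds.

    a  : initial retrieval scores a_j, one per document
    f  : f[j][l] = output of un-learned source l on document j
    v0 : assumed prior value for each latent combination weight
    Returns (q_y, q_v): estimated relevance per document, weight per source.
    """
    m, n = len(a), len(v0)
    q_v = list(v0)
    q_y = [0.5] * m
    for _ in range(iters):
        # Update relevance estimates given the current weights.
        for j in range(m):
            s = a[j] + sum(q_v[l] * f[j][l] for l in range(n))
            q_y[j] = 1.0 / (1.0 + math.exp(-s))
        # Update weights given the current relevance estimates.
        for l in range(n):
            q_v[l] = v0[l] + sum(q_y[j] * f[j][l] for j in range(m))
    return q_y, q_v


# Hypothetical data: four documents, one un-learned source (say, "Ocean")
# whose centered output happens to agree with the initial scores.
a = [2.0, 1.0, -1.0, -2.0]
f = [[1.0], [0.8], [-0.9], [-1.0]]
q_y, q_v = plca_mean_field(a, f, v0=[0.0])
```

Because the source's high outputs coincide with documents that already score well, its weight ends up positive and the gap between likely relevant and likely non-relevant documents widens.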


Automatic vs Interactive Search


Extreme Video Retrieval

• Automatic retrieval baseline for ranking order

• Two methods of presentation:

• System-controlled Presentation - Rapid Serial Visual Presentation (RSVP)

• User-controlled Presentation - Manual Browsing with Resizing of Pages


System-controlled Presentation

• Rapid Serial Visual Presentation (RSVP)

• Minimizes eye movements

• All images in same location

• Maximizes information transfer: System → Human

• Up to 10 key images/second

• 1 or 2 images per page

• Presentation intervals are dynamically adjustable

• Click when relevant shot is seen

• Mark previous page also as relevant

• A final verification step is necessary
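The marking rule above can be sketched as follows, assuming shots arrive in ranked order, every page holds a fixed number of keyframes, and a click is logged as the index of the page on screen at that moment. The function `rsvp_candidates` and its arguments are hypothetical names.

```python
def rsvp_candidates(ranked_shots, images_per_page, clicked_pages):
    """Return the shots to check in the final verification step.

    A click marks the page on screen and the page before it (assumed
    here to compensate for the user's reaction lag).
    """
    pages = [
        ranked_shots[i:i + images_per_page]
        for i in range(0, len(ranked_shots), images_per_page)
    ]
    marked = set()
    for p in clicked_pages:
        for q in (p - 1, p):
            if 0 <= q < len(pages):
                marked.add(q)
    # Keep ranked order; these shots still need manual verification.
    return [shot for q in sorted(marked) for shot in pages[q]]


shots = [f"shot{i}" for i in range(10)]
to_verify = rsvp_candidates(shots, images_per_page=2, clicked_pages=[0, 3])
```

Marking the previous page roughly doubles the candidate set, which is one reason the verification pass is needed.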


User-controlled presentations

• Manual Browsing with Resizing of Pages

• Manually page through images

• User decides to view next page

• Vary the number of images on a page

• Allow chording on the keypad to mark shots

• A very brief final verification step


MBRP - Manual Browsing with Resizable Pages


Extreme QA with RSVP

3x3 display

1 page/second

Numpad chording to select shots
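One way to picture numpad chording on a 3x3 page is shown below. It assumes the keypad's physical layout (7 8 9 on the top row, 1 2 3 on the bottom) mirrors the on-screen grid and that a chord is the set of keys pressed together; the mapping and the function name `chord_to_cells` are illustrative assumptions.

```python
def chord_to_cells(chord):
    """Map a set of numpad digits to (row, col) cells of a 3x3 page.

    Assumes the keypad mirrors the screen: 7 is top-left, 3 is bottom-right.
    """
    cells = []
    for key in sorted(chord):
        if not 1 <= key <= 9:
            raise ValueError(f"not a numpad digit: {key}")
        row = 2 - (key - 1) // 3   # keys 7-9 -> row 0, keys 1-3 -> row 2
        col = (key - 1) % 3
        cells.append((row, col))
    return sorted(cells)


selected = chord_to_cells({7, 5, 3})
```

With a spatial mapping like this, the user can mark several shots on a page in one gesture without looking away from the display.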


Mindreading – an EEG interface

• Learn Relevant/Non-Relevant

• 5 EEG probes

• Simple features

• Too slow

• 250 ms/image

• Significant recovery time after hit

• Relevance feedback


Summarizing Video: Beyond Keyframes

• BBC Rushes

• Unedited video for TV series production

• Summarize as video in 1/50th of total

• Note the non-scalable target factor

• Lots of smart analysis

• Clustering, salience, redundancy, importance

• Best performance for retrieval was to play every frame


Speed-up summarization results


Surveillance Event Detection

• Interesting stuff is rare

• Detection accuracy is limited

• Monitor many streams


Surveillance Event Detection

• Need action, not key frame

• Difficult for humans

• Combine speed-up with automatic analysis

• Slow down when interesting stuff happens
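A minimal sketch of detector-driven speed-up follows, assuming an automatic detector emits an interest score in [0, 1] per segment and playback speed is interpolated linearly between a fast rate and normal speed. The rates, the segment length, and the function names are hypothetical.

```python
def playback_rate(score, fast=16.0, slow=1.0):
    """Interpolate playback speed from a detector score in [0, 1].

    Score 0 (nothing detected) plays at `fast`; score 1 plays at `slow`.
    """
    score = min(1.0, max(0.0, score))
    return fast - score * (fast - slow)


def viewing_time(segment_seconds, scores):
    """Total time a human needs to watch all segments at adaptive speed."""
    return sum(segment_seconds / playback_rate(s) for s in scores)


# Hypothetical detector scores for six 60-second segments.
scores = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
total = viewing_time(60.0, scores)
```

In this toy case six minutes of footage take 78.75 seconds to review, with the one flagged segment shown at normal speed so the human can judge the action.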


Summary

Interfaces have much to contribute in retrieval

• We don’t know what is best

• Task-specific

• User-specific

• System-dependent

• Collaborative search

• Combining “best of current systems”

• Simpler is usually better (Occam’s razor)

General principles are difficult to find


Questions?
