Maximizing the Synergy between Man and Machine
Carnegie Mellon
Maximizing the Synergy between Man and Machine
Alex Hauptmann
School of Computer Science
Carnegie Mellon University
Exploiting Human Abilities in Video Retrieval Interfaces
Background
Automatic video analysis for detection/recognition is still quite poor
• Consider baseline (random guessing)
• Improvement is limited
• Consider near-duplicates (trivial similarity)
• Does not generalize well over video sources
• Better than nothing
Need humans to make up for this shortcoming!
Differences to VIRAT
• Most interface work done on broadcast TV
• Harder:
  • Unconstrained subject matter
  • Graphics, animations, photos
  • Broadcasters
  • Many different, short shots out of context
• Easier:
  • Better resolution
  • Conventions in editing, structure
  • Audio track
• Keyframes are typical unit of analysis
“Classic” Interface Work
• Interactive Video Queries
  • Fielded text query matching capabilities
  • Fast image matching with simplified interface for launching image queries
• Interactive Browsing, Filtering, and Summarizing
  • Browsing by visual concepts
  • Quick display of contents and context in synchronized views
Informedia Client Interface Example
Interface options [Christel08]
Suggesting Related Concepts [Zavesky07]
Fork Browser [Snoek09]
“Classic” Video Interface Results
• Concept browsing and image search used frequently
• Novices still have lower performance than experts
• Some topics reduce “interactivity” to a one-shot query with little browsing/exploration
• Classic Informedia interface including concept browsing often good enough that user never proceeds to any additional text or image query
• “Classic Informedia” scored highest of those testing with novice users in TRECVID evaluations
Visual Browsing
Augmented Video Retrieval
The computer observes the user and LEARNS, based on what is marked as relevant
The system can learn:
• What image characteristics are relevant
• What text characteristics (words) are relevant
• What combination weights should be used
We exploit the human’s ability to quickly mark relevant video and the computer’s ability to learn from given examples
Combining Concept Detectors for Retrieval
[Figure: combination of diverse knowledge sources. A multimodal question (text “Find Pope John Paul”, audio, image, motion) is sent to knowledge sources/APIs over the video library (closed caption, color feature, motion feature, audio feature, Face, Building, … (3k)); their outputs (Output1, Output2, …, Outputn) are combined into the final ranked list for the interface.]
Why Relevance Feedback?
• Limited training data
• Untrained sources are useful for some specific searches
Query: finding some boats/ships
Txt: 0.5, Img: 0.3, Face: -0.5
Q-Type: general object
(Learned from training set)
Outdoor: ?, Ocean: ?
(Unable to be learned)
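The boats/ships example can be read as a weighted sum over knowledge sources. The sketch below is only an illustration of that reading: the function name, the per-shot scores, and the choice to skip sources whose weight could not be learned are assumptions, not details from the slides. Only the weights Txt 0.5, Img 0.3, Face -0.5 come from the example above.

```python
def combine_scores(source_scores, weights):
    """Weighted sum of per-source scores for one shot.

    Sources whose weight is None (not learnable from the training set)
    are skipped here; that is an assumption made for illustration.
    """
    total = 0.0
    for source, score in source_scores.items():
        weight = weights.get(source)
        if weight is None:
            continue  # un-learned source: contributes nothing yet
        total += weight * score
    return total


# Weights for the "general object" query type, as on the slide.
weights = {"text": 0.5, "image": 0.3, "face": -0.5,
           "outdoor": None, "ocean": None}

# Hypothetical per-source scores for two shots.
shots = {
    "shot_a": {"text": 0.8, "image": 0.5, "face": 0.2, "outdoor": 0.9, "ocean": 0.9},
    "shot_b": {"text": 0.6, "image": 0.2, "face": 0.9, "outdoor": 0.1, "ocean": 0.0},
}

ranked = sorted(shots, key=lambda s: combine_scores(shots[s], weights), reverse=True)
```

In this toy data the Outdoor and Ocean scores would clearly help separate the two shots, but they are ignored because their weights are unknown, which is the gap relevance feedback is meant to fill.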
Probabilistic Local Context Analysis (pLCA) [Yan07]
• Goal: Refine results of the current query
• Method: assume the combination parameters of “un-learned” sources υ to be latent variables and compute P(yj|aj,Dj)
• Discover useful search knowledge based on initial results Ai
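[Figure: a query produces an initial search result over Video 1, Video 2, …, Video M, with initial results A1, A2, …, Am and relevance labels Y1, Y2, …, Ym; the weights of the un-learned sources, υ1, υ2, …, υN, are unknown.]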
Undirected Model and Parameter Estimation
Compute the posterior probability of document relevance Y given initial results A based on an undirected graphical model
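[Figure: undirected graphical model connecting each document D_j with its relevance label Y_j and initial result A_j (j = 1, …, m); the latent weights υ are shared across all documents.]

$$P(y \mid a; D, Q) = \frac{1}{Z} \int \prod_{l} P_0(\upsilon_l) \prod_{j=1}^{M_D} \exp\Big( y_j a_j + y_j \sum_{l} \upsilon_l f_l(D_j, Q) \Big)\, d\upsilon$$

Variational inference, i.e., iterate until convergence and approximate P(yj|aj) by q_{y_j}:

Maximize w.r.t. the variational parameters of Y:

$$q_{y_j} = \Big[ 1 + \exp\Big( -a_j - \sum_{l} q_{\upsilon_l} f_l(D_j, Q) \Big) \Big]^{-1}$$

Maximize w.r.t. the variational parameters of υ:

$$q_{\upsilon_l} = \upsilon_l^0 + \sum_{j} q_{y_j} f_l(D_j, Q)$$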
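The alternating scheme can be sketched as a pair of coordinate updates. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes binary relevance labels, a logistic form for the relevance posterior, and a unit-variance Gaussian prior on each latent weight. All names (`mean_field`, `a`, `F`, `nu0`) and the toy numbers are hypothetical.

```python
import math


def sigmoid(x):
    # Numerically safe logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def mean_field(a, F, nu0, max_iter=200, tol=1e-9):
    """Alternate the two variational updates until they stop changing.

    a[j]    initial retrieval score of document j
    F[j][l] feature of un-learned source l for document j
    nu0[l]  assumed prior mean of the latent weight of source l
    """
    n_docs, n_src = len(a), len(nu0)
    q_y = [0.5] * n_docs
    q_nu = list(nu0)
    for _ in range(max_iter):
        # Update relevance estimates given the current weight estimates.
        new_q_y = [
            sigmoid(a[j] + sum(q_nu[l] * F[j][l] for l in range(n_src)))
            for j in range(n_docs)
        ]
        # Update weight estimates given the new relevance estimates.
        new_q_nu = [
            nu0[l] + sum(new_q_y[j] * F[j][l] for j in range(n_docs))
            for l in range(n_src)
        ]
        delta = max(
            max(abs(x - y) for x, y in zip(new_q_y, q_y)),
            max(abs(x - y) for x, y in zip(new_q_nu, q_nu)),
        )
        q_y, q_nu = new_q_y, new_q_nu
        if delta < tol:
            break
    return q_y, q_nu


# Toy data: three documents, one un-learned source.
q_y, q_nu = mean_field(a=[1.0, 0.0, -1.0], F=[[1.0], [0.5], [0.0]], nu0=[0.0])
```

In the toy run the source fires on the documents that already score well, so its weight estimate grows and pushes those documents further up, while the third document, where the source is silent, keeps its initial estimate.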
Automatic vs Interactive Search
Extreme Video Retrieval
• Automatic retrieval baseline for ranking order
• Two methods of presentation:
  • System-controlled Presentation - Rapid Serial Visual Presentation (RSVP)
  • User-controlled Presentation - Manual Browsing with Resizing of Pages
System-controlled Presentation
• Rapid Serial Visual Presentation (RSVP)
  • Minimizes eye movements
  • All images in same location
• Maximizes information transfer: System → Human
  • Up to 10 key images/second
  • 1 or 2 images per page
  • Presentation intervals are dynamically adjustable
  • Click when relevant shot is seen
  • Mark previous page also as relevant
• A final verification step is necessary
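The click handling described on this slide can be sketched as follows. This is a minimal illustration: the function name, the fixed presentation interval, and the data layout are assumptions. Marking the previous page as well presumably absorbs the user's reaction delay, and everything marked still goes to the final verification step.

```python
def shots_to_verify(pages, click_times, interval):
    """Return the shots to queue for final verification.

    pages        list of pages, each a list of shot ids, in presentation order
    click_times  seconds since the start of the presentation
    interval     seconds each page stays on screen (assumed fixed here)
    """
    marked = set()
    for t in click_times:
        page = int(t // interval)  # page on screen at the moment of the click
        if not 0 <= page < len(pages):
            continue  # click outside the presentation
        for idx in (page - 1, page):  # the previous page is marked as well
            if idx >= 0:
                marked.add(idx)
    return [shot for idx in sorted(marked) for shot in pages[idx]]


# 10 images/second, one image per page; the user clicks 0.35 s in.
pages = [["s0"], ["s1"], ["s2"], ["s3"], ["s4"]]
candidates = shots_to_verify(pages, click_times=[0.35], interval=0.1)
```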
User-controlled presentations
• Manual Browsing with Resizing of Pages
• Manually page through images
• User decides to view next page
• Vary the number of images on a page
• Allow chording on the keypad to mark shots
• A very brief final verification step
MBRP - Manual Browsing with Resizable Pages
Extreme QA with RSVP
3x3 display
1 page/second
Numpad chording to select shots
Mindreading – an EEG interface
• Learn Relevant/Non-Relevant
• 5 EEG probes
• Simple features
• Too slow
  • 250 ms/image
  • Significant recovery time after hit
• Relevance feedback
Summarizing Video: Beyond Keyframes
• BBC Rushes
  • Unedited video for TV series production
• Summarize as video in 1/50th of total
  • Note the non-scalable target factor
• Lots of smart analysis
  • Clustering, salience, redundancy, importance
• Best performance for retrieval was to play every frame
Speed-up summarization results
Surveillance Event Detection
• Interesting stuff is rare
• Detection accuracy is limited
• Monitor many streams
Surveillance Event Detection
• Need action, not key frame
• Difficult for humans
• Combine speed-up with automatic analysis
  • Slow down when interesting stuff happens
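One way to picture speed-up driven by automatic analysis is a playback rate that falls as the detector's interest score rises. The sketch below is a hypothetical illustration: the linear mapping, the 8x maximum, and the function names are assumptions, not the system described in the talk.

```python
def playback_speed(interest, max_speedup=8.0):
    """Map an interest score in [0, 1] to a playback speed.

    Score 0 plays at max_speedup; score 1 plays at normal (1x) speed.
    The linear mapping is an assumption made for illustration.
    """
    level = min(max(interest, 0.0), 1.0)  # clamp detector output
    return max_speedup - (max_speedup - 1.0) * level


def viewing_time(interest_scores, frame_duration, max_speedup=8.0):
    """Total time a human spends watching the sped-up stream."""
    return sum(frame_duration / playback_speed(s, max_speedup)
               for s in interest_scores)


# Four one-second segments; only the third looks interesting.
scores = [0.0, 0.0, 1.0, 0.0]
total = viewing_time(scores, frame_duration=1.0)
```

With these toy scores the four seconds of video take 1.375 seconds to watch, and the interesting second is still seen at normal speed.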
Summary
Interfaces have much to contribute in retrieval
• We don’t know what is best
  • Task-specific
  • User-specific
  • System-dependent
• Collaborative search
• Combining “best of current systems”
• Simpler is usually better (Occam’s razor)
General principles are difficult to find
Questions?