Marc Davis Chairman and Chief Technology Officer Representing Video for Retrieval and Repurposing...
Date posted: 22-Dec-2015
Marc Davis, Chairman and Chief Technology Officer
Representing Video for Retrieval and Repurposing
SIMS 202 Information Organization and Retrieval
amova
Amova Proprietary
Presentation Outline
• Introductions
• Problem space and motivation
• Current approaches
• Issues in video representation and retrieval
• Media streams demonstration
Global Media Network
• Digital video produced anywhere by anyone accessible to anyone anywhere
• Today’s video users become tomorrow’s video producers
• Not 500 Channels — 500,000,000 Video Web Sites
What is the Problem?
• Today people cannot easily create, find, edit, share, and reuse media
• Computers don’t understand video content
– Video is opaque and data rich
– We lack structured representations
• Without content representation (metadata), manipulating digital video will remain like word-processing with bitmaps
The Search for Solutions
• Current approaches don’t work
– Signal-based analysis
– Keywords
– Natural language
• Need standardized metadata framework
– Designed for video and rich media data
– Human and machine readable and writable
– Standardized and scalable
– Integrated into media capture, production, editing, distribution, and reuse
– Enables widespread use and reuse of video in daily life
Signal-Based Parsing
• Theoretical problem
– Mismatch between percepts and concepts (e.g., dogs, friends, cars)
• Practical problem
– Parsing unstructured, unknown video is very, very hard
Signal-Based Parsing
• Some things are doable and usable
– Video
  • Scene break detection
  • Camera motion
  • Low-level visual similarity
– Audio
  • Pause detection
  • Audio pattern matching
  • Simple speech recognition
• Some things can be made easier
– At the point of capture, simplify and/or interact with the recording device, the environment, and agents in the environment
– If not, after capture use “human-in-the-loop” algorithms
Keywords vs. Semantic Descriptors
dog,biting,Steve
Why Keywords Don’t Work
• Are not a semantic representation
• Do not describe relations between descriptors
• Do not describe temporal structure
• Do not converge
• Do not scale
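The point about relations can be illustrated with a minimal sketch. It assumes a hypothetical `Relation` triple of (subject, action, object), which is not taken from the slides, and compares it with a plain keyword set for the "dog, biting, Steve" example.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """Hypothetical relational descriptor: who is doing what to whom."""
    subject: str
    action: str
    obj: str


def keywords(rel: Relation) -> frozenset:
    """Flatten a relation into an unordered keyword set, discarding roles."""
    return frozenset(rel)


dog_bites_steve = Relation("dog", "biting", "Steve")
steve_bites_dog = Relation("Steve", "biting", "dog")

# As keyword sets the two descriptions are indistinguishable.
print(keywords(dog_bites_steve) == keywords(steve_bites_dog))  # True

# As relations they differ, because subject and object roles are kept.
print(dog_bites_steve == steve_bites_dog)  # False
```

The keyword set loses exactly the information that separates the two events, which is one concrete reading of "do not describe relations between descriptors".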
Natural Language vs. Visual Language
Jack, an adult male police officer, while walking to the left, starts waving with his left arm, and then has a puzzled look on his face as he turns his head to the right; he then drops his facial expression and stops turning his head, immediately looks up, and then stops looking up after he stops waving but before he stops walking.
Visual Language Advantages
• A language designed as an accurate and readable representation of video (especially for actions, expressions, and spatial relations)
• Enables Gestalt view and quick recognition of descriptors due to designed visual similarities
• Supports global use of annotations
Representing Video
• Streams vs. Clips
• Video syntax and semantics
• Ontological issues in video representation
• Retrieving video
Video is Temporal: Streams vs. Clips
Stream of 100 Frames of Video
A Clip from Frame 47 to Frame 68 with Descriptors
Streams vs. Clips
• Clip-based representation
– Fixes a segmentation of the video stream
– Separates the clip from its context of origin
– Encodes only one particular segmentation of the original data
Streams vs. Clips
The Stream of 100 Frames of Video with 6 Annotations Resulting in Many Possible Segmentations of the Stream
Stream of 100 Frames of Video
Streams vs. Clips
• Stream-based representation
– The stream of frames is left intact
– The stream has many possible segmentations by multi-layered annotations with precise time indexes (and the intersections, unions, etc. of these annotations)
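A minimal sketch of how layered annotations induce segmentations of an intact stream. The `Annotation` type, the labels, and the frame ranges are illustrative assumptions, not the Media Streams data model; frames are numbered 0 to 99 and ranges are inclusive.

```python
from typing import List, NamedTuple, Optional, Tuple


class Annotation(NamedTuple):
    """Hypothetical annotation layer entry with precise frame indexes."""
    label: str
    start: int  # first frame, inclusive
    end: int    # last frame, inclusive


def finest_segmentation(annotations: List[Annotation],
                        n_frames: int) -> List[Tuple[int, int]]:
    """Split the stream at every frame where an annotation starts or has just ended."""
    cuts = {0, n_frames}
    for a in annotations:
        cuts.add(a.start)
        cuts.add(a.end + 1)
    ordered = sorted(cuts)
    return [(s, e - 1) for s, e in zip(ordered, ordered[1:])]


def intersect(a: Annotation, b: Annotation) -> Optional[Tuple[int, int]]:
    """Frames covered by both annotations, or None if they do not overlap."""
    start, end = max(a.start, b.start), min(a.end, b.end)
    return (start, end) if start <= end else None


# Illustrative annotations on a 100-frame stream (labels are made up).
annotations = [Annotation("Jack", 10, 80), Annotation("waving", 47, 68)]

print(finest_segmentation(annotations, 100))
# [(0, 9), (10, 46), (47, 68), (69, 80), (81, 99)]
print(intersect(annotations[0], annotations[1]))
# (47, 68)
```

Two annotations already yield five segments; each added layer can only refine the set of available segmentations, while the underlying frames are never cut.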
Stream-Based Representation
• Makes annotation pay off
– The richer the annotation, the more numerous the possible segmentations of the video stream
• Clips
– Change from being fixed segmentations of the video stream, to being the results of retrieval queries based on annotations of the video stream
• Annotations
– Create representations which make clips, not representations of clips
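The idea of clips as query results can be sketched as follows. This is only an illustration under assumed names: `query` is a hypothetical function, annotations are plain `(label, start, end)` tuples with inclusive frame ranges, and a clip is any frame range where all requested labels co-occur.

```python
from typing import List, Tuple

Interval = Tuple[int, int]            # (start, end), both inclusive
Annotation = Tuple[str, int, int]     # (label, start, end)


def intersect_all(interval_lists: List[List[Interval]]) -> List[Interval]:
    """Pairwise-intersect lists of intervals, keeping non-empty overlaps."""
    result = interval_lists[0]
    for intervals in interval_lists[1:]:
        result = [
            (max(s1, s2), min(e1, e2))
            for (s1, e1) in result
            for (s2, e2) in intervals
            if max(s1, s2) <= min(e1, e2)
        ]
    return result


def query(annotations: List[Annotation], *labels: str) -> List[Interval]:
    """Return the clips (frame ranges) where every requested label is present."""
    if not labels:
        return []
    per_label = [
        [(s, e) for (name, s, e) in annotations if name == label]
        for label in labels
    ]
    return sorted(intersect_all(per_label))


# Illustrative annotations on one stream (values are made up).
annotations = [
    ("hammer", 5, 40),
    ("nail", 20, 60),
    ("hammer", 70, 90),
    ("wood", 0, 99),
]

print(query(annotations, "hammer", "nail"))  # [(20, 40)]
print(query(annotations, "hammer", "wood"))  # [(5, 40), (70, 90)]
print(query(annotations, "dog"))             # []
```

No clip is stored anywhere in this sketch; different queries over the same annotations produce different clips, which is the sense in which the annotations "make clips".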
Video Syntax and Semantics
• The Kuleshov Effect
• Video has a dual semantics
– Sequence-independent invariant semantics of shots
– Sequence-dependent variable semantics of shots
Ontological Issues for Video
• Video plays with rules for identity and continuity
– Space
– Time
– Character
– Action
Space and Time
• Actual Recorded Space and Time
– GPS
– Studio space and time
• Inferable Space and Time
– Establishing shots
– Cues and clues
Time: Temporal Durations
• Story (Fabula) Duration
– Example: Brushing teeth in story world (5 minutes)
• Plot (Syuzhet) Duration
– Example: Brushing teeth in plot world (1 minute: 6 steps of 10 seconds each)
• Screen Duration
– Example: Brushing teeth (10 seconds: 2 shots of 5 seconds each)
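The arithmetic behind the three durations can be worked through in a few lines. The constant names and the `compression` helper are illustrative assumptions; only the numbers come from the brushing-teeth example.

```python
# Durations from the brushing-teeth example, in seconds.
STORY_SECONDS = 5 * 60   # 5 minutes in the story world
PLOT_SECONDS = 6 * 10    # 6 steps of 10 seconds each
SCREEN_SECONDS = 2 * 5   # 2 shots of 5 seconds each


def compression(longer: float, shorter: float) -> float:
    """How many times longer one duration is than another."""
    return longer / shorter


print(compression(STORY_SECONDS, PLOT_SECONDS))    # 5.0
print(compression(PLOT_SECONDS, SCREEN_SECONDS))   # 6.0
print(compression(STORY_SECONDS, SCREEN_SECONDS))  # 30.0
```

A representation that records only screen duration (10 seconds) cannot recover that the depicted event lasts 300 seconds in the story, so the three durations need to be annotated separately.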
Character and Continuity
• Identity of character is constructed through
– Continuity of actor
– Continuity of role
• Alternative continuities
– Continuity of actor only
– Continuity of role only
Representing Action
• Describe the intersubjective, physically visible aspects of what you see and hear
– Emotions vs. expressions
– Abstract actions vs. conventionalized actions
• Consider how actions can be decomposed and combined (temporally and spatially)
– Actions and subactions
• Consider how actions can be recontextualized
– By montage and reuse
– By cultural differences
Retrieving Video
• Query:
– Retrieve a video segment of “a hammer hitting a nail into a piece of wood”
• Sample results:
– Video of a hammer hitting a nail into a piece of wood
– Video of a hammer, a nail, and a piece of wood
– Video of a nail hitting a hammer, and a piece of wood
– Video of a sledgehammer hitting a spike into a railroad tie
– Video of a rock hitting a nail into a piece of wood
– Video of a hammer swinging
– Video of a nail in a piece of wood
Types of Video Similarity
• Semantic
– Similarity of descriptors
• Relational
– Similarity of relations among descriptors in compound descriptors
• Temporal
– Similarity of temporal relations among descriptors and compound descriptors
Retrieval Examples to Think With
• “Video of a hammer, a nail, and a piece of wood”
– Exact semantic and temporal similarity, but no relational similarity
• “Video of a nail hitting a hammer, and a piece of wood”
– Exact semantic and temporal similarity, but incorrect relational similarity
• “Video of a sledgehammer hitting a spike into a railroad tie”
– Approximate semantic similarity of the subject and objects of the action and exact semantic similarity of the action; and exact temporal and relational similarity
• “Video of a hammer swinging” cut to “Video of a nail in a piece of wood”
– Combines two disparate elements in the database (partial results) to create an effective query response
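The first three examples can be reproduced with a toy scorer. Everything here is an assumption made for illustration: the `CATEGORY` table stands in for a real semantic hierarchy, `Description` is a hypothetical structure, relations are reduced to (subject, action, object), and temporal similarity is left out entirely.

```python
from typing import NamedTuple, Optional, Tuple

# Hypothetical semantic categories; a real system would use a richer hierarchy.
CATEGORY = {
    "hammer": "striking tool",
    "sledgehammer": "striking tool",
    "nail": "fastener",
    "spike": "fastener",
    "piece of wood": "wooden object",
    "railroad tie": "wooden object",
}


class Description(NamedTuple):
    objects: Tuple[str, ...]
    relation: Optional[Tuple[str, str, str]]  # (subject, action, object) or None


def category(name: str) -> str:
    """Look up the assumed category, falling back to the name itself."""
    return CATEGORY.get(name, name)


def semantic(query: Description, cand: Description) -> str:
    """Compare the descriptors only, ignoring who does what to whom."""
    if set(query.objects) == set(cand.objects):
        return "exact"
    if sorted(map(category, query.objects)) == sorted(map(category, cand.objects)):
        return "approximate"
    return "none"


def relational(query: Description, cand: Description) -> str:
    """Compare which kinds of things fill the subject and object roles."""
    if query.relation is None or cand.relation is None:
        return "none"
    q_subj, q_act, q_obj = query.relation
    c_subj, c_act, c_obj = cand.relation
    same_roles = (category(q_subj), q_act, category(q_obj)) == \
                 (category(c_subj), c_act, category(c_obj))
    return "exact" if same_roles else "incorrect"


things = ("hammer", "nail", "piece of wood")
query = Description(things, ("hammer", "hitting", "nail"))
no_relation = Description(things, None)
reversed_roles = Description(things, ("nail", "hitting", "hammer"))
sledge = Description(("sledgehammer", "spike", "railroad tie"),
                     ("sledgehammer", "hitting", "spike"))

for cand in (no_relation, reversed_roles, sledge):
    print(semantic(query, cand), relational(query, cand))
# exact none
# exact incorrect
# approximate exact
```

Keeping the semantic and relational scores separate is what lets a result such as the sledgehammer clip rank as a useful near match rather than a miss.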
Media Streams
Media Annotation and Retrieval Engine
• Key benefits
– More accurate annotation and retrieval
– Global usability and standardization
– Reuse of rich media according to content and structure
• Key features
– Stream-based representation (better segmentation)
– Semantic indexing (what things are similar to)
– Relational indexing (who is doing what to whom)
– Temporal indexing (when things happen)
– Iconic interface (designed visual language)
– Universal annotation (standardized markup schema)
Media Streams Demonstration
Media Streams GUI Components
• Media Time Line
• Icon Space
– Icon Workshop
– Icon Palette
Media Time Line
• Visualize video at multiple time scales
• Write and read multi-layered iconic annotations
• One interface for annotation, query, and composition
Media Time Line
Icon Space
• Icon Workshop
– Utilize categories of video representation
– Create iconic descriptors by compounding iconic primitives
– Extend set of iconic descriptors
• Icon Palette
– Dynamically group related sets of iconic descriptors
– Reuse descriptive effort of others
– View and use query results
Icon Space
Icon Space: Icon Workshop
• General to specific (horizontal)
– Cascading hierarchy of icons with increasing specificity on subordinate levels
• Combinatorial (vertical)
– Compounding of hierarchically organized icons across multiple axes of description
Icon Space: Icon Workshop Detail
Icon Space: Icon Palette
• Dynamically group related sets of iconic descriptors
• Collect icon sentences
• Reuse descriptive effort of others
Icon Space: Icon Palette Detail
Video Retrieval In Media Streams
• Same interface for annotation and retrieval
• Assembles responses to queries as well as finds them
• Query responses use semantics to degrade gracefully
Media Streams Technologies
• Minimal consensual representation distinguishing video syntax and semantics
• Iconic visual language for annotating and retrieving video content
• Retrieval-by-composition methods for repurposing video
Methodological Considerations
• Techne-centered methodology combines
– Constructing theories by constructing artifacts
– Constructing artifacts informed by (de)constructing theories
– Practitioners
  • Lev Kuleshov, Sergei Eisenstein, Seymour Papert, Narrative Intelligence Reading Group, Marc Davis
• Designing video representation and retrieval systems requires a techne-centered methodology
Future Work
• MPEG-7 standardization efforts
• Gathering more and better metadata at the point of capture
• Integrating metadata into conventional media editing and sharing
• Developing “human-in-the-loop” indexing algorithms and systems
• Representing action sequences and even higher level narrative structures
• Fair use advocacy