Shin’ichi Satoh National Institute of Informatics.
Introduction to Content-based Media Analysis and Search Technology
Technology Overview and Historical Trends from an Academic Point of View
Shin’ichi Satoh
National Institute of Informatics
Nowadays abundant multimedia information is available
Web, broadband networks, CATV, satellite... digital cameras, mobile phones, ...
Abundant Multimedia
YouTube: 35 hours of video uploaded every minute
Abundant Multimedia
Flickr: 5 billion photos; Facebook: 3 billion photos per month
Abundant Multimedia
How can we utilize such huge amounts of multimedia?
Search could be one promising option
Any technical problems? It seems that multimedia search is already available: Google, Yahoo!, Bing image search, Flickr, YouTube, etc.
Abundant Multimedia
Multimedia search is currently possible only via text search technology
This problem is especially prominent for visual media (audio can be converted into text via ASR)
Major Part of Multimedia is Inaccessible
But the major part of multimedia data has no associated text
We checked a number of photos on Flickr and found that around 85% of them have no tags or description
As far as we rely on text-search-based technologies, such large amounts of multimedia are not accessible at all!
Major Part of Multimedia is Inaccessible
Moreover, text-based multimedia search is NOT perfect
Searching for images of “people playing drums”, some results are good
but some results are very strange
Major Part of Multimedia is Inaccessible
Multimedia semantic content analysis is required
However, it is difficult
◦ Multimedia is difficult for computers to handle
◦ Inherently difficult due to the “semantic gap”
Multimedia Content Analysis and Search
Query: Lion
Lion
Multimedia data is huge
◦ text: 1 kb/s (10 words), audio: 100 kb/s (MP3), video: 10 Mb/s (MPEG-2)
Computers have existed since the 1940s (ENIAC, 1946); text processing by computers since the 1950s! (Turing test 1950, ELIZA and SHRDLU in the 1960s); Project Gutenberg since 1971
CD-ROM (1985), DVD (1993), larger memory, external storage (hard disk drives): multimedia data (audio/image/video) became manageable only after the 1990s
Handling Multimedia
Please guess what this is.
Semantic Gap
Water Lilies, Monet
Please guess what this is.
Semantic Gap
Semantic Gap
Computers are very good at handling text, but not so good at handling multimedia
◦ text: artificial media, symbolic by nature
◦ multimedia: ambiguous, dependent on cognition, natural media, not symbolized, etc.
Humans can easily “see” or perceive, but we cannot explain how we “see”
Semantic Gap
The quick brown fox jumps over the lazy dog
1980s: Landsat images, medical images, stock photos
Search using relational DBs, only via statistics and text
The issue was how to handle “huge” amounts of image data; less attention was paid to content analysis
Early Media Search System
CBIR: image retrieval based on “content”
◦ T. Kato, TRADEMARK & ART MUSEUM (1989)
◦ IBM QBIC (1990s)
Take an image as a query, and return “similar” images
Use “features,” e.g., color histograms, edges, shapes, etc.
It worked for images without metadata
Assumes that images similar in the feature space are semantically similar as well, but this is not always true
Content-Based Image Retrieval (CBIR)
Feature space
Semantic Gap
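The CBIR recipe above (extract a color histogram as the feature, then rank database images by similarity in feature space) can be reduced to a minimal sketch. This is an illustration of the idea, not any particular system such as QBIC; the function names and the pixel-list image representation are hypothetical.

```python
# Minimal sketch of color-histogram-based CBIR. An "image" here is just
# a list of (r, g, b) tuples; real systems decode image files and use
# richer features (edges, shapes, local descriptors).

def color_histogram(pixels, bins=4):
    """Quantize each channel into `bins` levels and count occurrences."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = float(len(pixels)) or 1.0
    return [h / total for h in hist]  # normalize so histograms are comparable

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def search(query, database, bins=4):
    """Return database keys ranked by feature-space similarity to the query."""
    qh = color_histogram(query, bins)
    scored = [(histogram_intersection(qh, color_histogram(img, bins)), name)
              for name, img in database.items()]
    return [name for _, name in sorted(scored, reverse=True)]
```

Note that this also makes the semantic gap concrete: any red-dominated image ranks high for a sunset query, because nearness in color-feature space does not imply semantic similarity.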
Let’s take a look at face detection as an example...
Face detection is now a very stable technology
Before 1990, face detection was very unstable
◦ Shapes of facial features and their geometric relations were hard-coded
After the late 1990s, face detectors using machine learning achieved very stable performance
◦ Simply provide a lot of face image examples (a few thousand) to the system and let it learn
Multimedia Semantic Content Analysis via Machine Learning
Early face detection method
Machine learning
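The canonical example of this machine-learning approach is the Viola-Jones detector (“V-J Face Det.” in the timeline later in this talk): it learns a cascade of thousands of simple Haar-like rectangle features from labeled face examples, each feature computable in constant time from an integral image. The sketch below shows only that feature-computation core; the helper names are hypothetical, and the learning step (AdaBoost over labeled examples) is omitted.

```python
# Toy illustration of the feature machinery behind machine-learning-era
# face detection (Viola-Jones style): Haar-like rectangle features
# evaluated in O(1) via an integral image (summed-area table).

def integral_image(img):
    """img: 2D list of grayscale values. Returns ii where
    ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle, in constant time."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total

def two_rect_feature(ii, top, left, height, width):
    """Haar-like two-rectangle feature: brightness of the right half
    minus the left half of a window (real detectors learn which such
    contrasts discriminate faces from non-faces)."""
    mid = left + width // 2
    left_sum = rect_sum(ii, top, left, top + height - 1, mid - 1)
    right_sum = rect_sum(ii, top, mid, top + height - 1, left + width - 1)
    return right_sum - left_sum
```

The point of the integral image is that a detector can evaluate many rectangle features at every window position without re-summing pixels, which is what made learned detectors fast enough to be practical.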
• Following the success of machine-learning-based approaches in face detection, OCR, ASR, etc., researchers decided to “train” computers for media semantic content analysis
• Build a corpus: tens, hundreds, or thousands of images/video shots per concept, with manual annotation
• Extract features (low-level, but recently “local” features are known to be more effective)
• Train computers to automatically map low-level features to semantic categories using machine learning
• Several corpora are available
Media Semantic Content Analysis
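The pipeline above (annotated corpus, feature extraction, learned mapping from features to concepts) can be reduced to a toy sketch. A nearest-centroid classifier stands in for the real learner; actual systems used SVMs or similar over bag-of-features representations, and all names below are hypothetical.

```python
# Toy concept learner: map feature vectors to semantic concepts.
# train() builds one centroid per concept from an annotated corpus;
# classify() assigns a new feature vector to the nearest centroid.

def train(corpus):
    """corpus: {concept: [feature_vector, ...]} -> {concept: centroid}."""
    centroids = {}
    for concept, vectors in corpus.items():
        dim = len(vectors[0])
        centroids[concept] = [sum(v[i] for v in vectors) / len(vectors)
                              for i in range(dim)]
    return centroids

def classify(centroids, vector):
    """Return the concept whose centroid is closest (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: dist2(centroids[c], vector))
```

Training averages the feature vectors per concept, and classification picks the concept with the nearest centroid: the “map low-level features to semantic categories” step in miniature.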
Caltech 101 (2003), Caltech 256 (2007)
◦ 101/256 concepts
◦ Define the set of concepts first, then collect images (via image search engines)
◦ Manual selection, so clean annotation
◦ Up to a few hundred images per concept
◦ Standard benchmark datasets
◦ “Small world” effect anticipated; questionable selection of concepts
Caltech 101/256
airplane, chair, elephant, faces, leopards, rhino
bonsai, brain, scorpion, trilobite, yin_yang...
ImageNet: large number of concepts, large number of images
◦ #concepts: 10,000+; #images: 10,000,000+
◦ Concepts are systematically selected from WordNet (a machine-readable thesaurus)
◦ Manual annotation via Amazon Mechanical Turk: hard to control quality, scalability issues
• Currently researchers are focusing on the issue: how to effectively learn semantic concepts from a GIVEN training media corpus
• Corpus: the larger, the better
• But how to obtain a large corpus?
• CGM (Flickr, Web): noisy
• Manual annotation (AMT): costly, less scalable
• Other approaches such as the ESP game could be interesting
Issues
Timeline, 1970–2010 (milestones per medium):
Text: Project Gutenberg, bag-of-words, TF/IDF, WSJ corpus, TREC, PageRank
Audio/Speech: MFCC, Viterbi/HMM, 1000-word LVCSR, IBM ViaVoice
Image: USPS OCR (single digit), CMU-MIT Face DB, V-J face detection, Caltech 101, Pascal VOC, ImageNet
Video: TRECVID
Multimedia content analysis research has “just started”
More advanced results are yet to come. Business value? Killer applications?
Conclusion