Shin’ichi Satoh National Institute of Informatics.
Introduction to Content-based Media Analysis and Search Technology
Technology Overview and Historical Trends from an Academic Point of View
Shin’ichi Satoh
National Institute of Informatics
Nowadays abundant multimedia information is available
Web, broadband networks, CATV, satellite... digital cameras, mobile phones, ...
Abundant Multimedia
YouTube: 35 hours of video uploaded every minute
Abundant Multimedia
Flickr: 5 billion photos; Facebook: 3 billion photos per month
Abundant Multimedia
How can we utilize such huge amounts of multimedia?
Search could be one promising option
Any technical problems? It seems that multimedia search is already available: Google, Yahoo!, Bing image search, Flickr, YouTube, etc.
Abundant Multimedia
Multimedia search is currently possible only via text search technology
This problem is especially prominent for visual media (audio can be converted into text via ASR)
Major Part of Multimedia is Inaccessible
But the major part of multimedia data has no associated text
We checked a number of photos on Flickr and found that around 85% of them have no tags or description
As far as we rely on text-search-based technologies, such large amounts of multimedia are not accessible at all!
Major Part of Multimedia is Inaccessible
Moreover, text-based multimedia search is NOT perfect
Searching for images of “people playing drums”, some results are good
but some results are very strange
Major Part of Multimedia is Inaccessible
Multimedia semantic content analysis is required
However, it is difficult
◦ Multimedia is difficult for computers to handle
◦ Inherently difficult due to the “semantic gap”
Multimedia Content Analysis and Search
Query: Lion
Lion
Multimedia data is huge
◦ text: 1 kb/s (10 words), audio: 100 kb/s (MP3), video: 10 Mb/s (MPEG-2)
Computers have existed since the 1940s (ENIAC, 1946); text processing by computers since the 1950s! (Turing test 1950, ELIZA and SHRDLU in the 1960s); Project Gutenberg since 1971
CD-ROM (1985), DVD (1993), larger memory, external storage (hard disk drives): multimedia data (audio/image/video) became manageable only after the 1990s
Handling Multimedia
Please guess what this is.
Semantic Gap
Water Lilies, Monet
Please guess what this is.
Semantic Gap
Semantic Gap
Computers are very good at handling text, but not so good at handling multimedia
◦ text: artificial media, symbolic by nature
◦ multimedia: ambiguous, dependent on cognition, natural media, not symbolized, etc.
Humans can easily “see” or perceive, but we cannot explain how we “see”
Semantic Gap
The quick brown fox jumps over the lazy dog
1980s: Landsat images, medical images, stock photos
Search using relational DBs, only via statistics and text
The issue was how to handle “huge” amounts of image data; less attention was paid to content analysis
Early Media Search System
CBIR: image retrieval based on “content”
◦ T. Kato, TRADEMARK & ART MUSEUM (1989)
◦ IBM QBIC (1990s)
Take an image as a query, and return “similar” images
Use “features,” e.g., color histograms, edges, shapes, etc.
It worked for images without metadata
Assumes that images similar in the feature space are semantically similar as well, but this is not always true
Content-Based Image Retrieval (CBIR)
Feature space
Semantic Gap
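The CBIR recipe above (extract a color histogram as the feature, then rank database images by similarity in feature space) can be reduced to a minimal sketch. This is an illustration of the idea, not any particular system such as QBIC; the function names and the pixel-list image representation are hypothetical.

```python
# Minimal sketch of color-histogram-based CBIR. An "image" here is just
# a list of (r, g, b) tuples; real systems decode image files and use
# richer features (edges, shapes, local descriptors).

def color_histogram(pixels, bins=4):
    """Quantize each channel into `bins` levels and count occurrences."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = float(len(pixels)) or 1.0
    return [h / total for h in hist]  # normalize so histograms are comparable

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def search(query, database, bins=4):
    """Return database keys ranked by feature-space similarity to the query."""
    qh = color_histogram(query, bins)
    scored = [(histogram_intersection(qh, color_histogram(img, bins)), name)
              for name, img in database.items()]
    return [name for _, name in sorted(scored, reverse=True)]
```

Note that this also makes the semantic gap concrete: any red-dominated image ranks high for a sunset query, because nearness in color-feature space does not imply semantic similarity.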
Let’s take a look at face detection as an example...
Face detection is now a very stable technology
Before 1990, face detection was very unstable
◦ Shapes of facial features and their geometric relations were hard-coded
After the late 1990s, face detectors using machine learning achieved very stable performance
◦ Simply provide a lot of face image examples (a few thousand) to the system and let it learn
Multimedia Semantic Content Analysis via Machine Learning
Early face detection method
Machine learning
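The canonical example of this machine-learning approach is the Viola-Jones detector (“V-J Face Det.” in the timeline later in this talk): it learns a cascade of thousands of simple Haar-like rectangle features from labeled face examples, each feature computable in constant time from an integral image. The sketch below shows only that feature-computation core; the helper names are hypothetical, and the learning step (AdaBoost over labeled examples) is omitted.

```python
# Toy illustration of the feature machinery behind machine-learning-era
# face detection (Viola-Jones style): Haar-like rectangle features
# evaluated in O(1) via an integral image (summed-area table).

def integral_image(img):
    """img: 2D list of grayscale values. Returns ii where
    ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle, in constant time."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total

def two_rect_feature(ii, top, left, height, width):
    """Haar-like two-rectangle feature: brightness of the right half
    minus the left half of a window (real detectors learn which such
    contrasts discriminate faces from non-faces)."""
    mid = left + width // 2
    left_sum = rect_sum(ii, top, left, top + height - 1, mid - 1)
    right_sum = rect_sum(ii, top, mid, top + height - 1, left + width - 1)
    return right_sum - left_sum
```

The point of the integral image is that a detector can evaluate many rectangle features at every window position without re-summing pixels, which is what made learned detectors fast enough to be practical.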
• Following the success of machine-learning-based approaches in face detection, OCR, ASR, etc., researchers decided to “train” computers for media semantic content analysis
• Build a corpus: tens, hundreds, or thousands of images/video shots per concept, with manual annotation
• Extract features (low-level, but recently “local” features are known to be more effective)
• Train computers to automatically map low-level features to semantic categories using machine learning
• Several corpora are available
Media Semantic Content Analysis
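The pipeline above (annotated corpus, feature extraction, learned mapping from features to concepts) can be reduced to a toy sketch. A nearest-centroid classifier stands in for the real learner; actual systems used SVMs or similar over bag-of-features representations, and all names below are hypothetical.

```python
# Toy concept learner: map feature vectors to semantic concepts.
# train() builds one centroid per concept from an annotated corpus;
# classify() assigns a new feature vector to the nearest centroid.

def train(corpus):
    """corpus: {concept: [feature_vector, ...]} -> {concept: centroid}."""
    centroids = {}
    for concept, vectors in corpus.items():
        dim = len(vectors[0])
        centroids[concept] = [sum(v[i] for v in vectors) / len(vectors)
                              for i in range(dim)]
    return centroids

def classify(centroids, vector):
    """Return the concept whose centroid is closest (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: dist2(centroids[c], vector))
```

Training averages the feature vectors per concept, and classification picks the concept with the nearest centroid: the “map low-level features to semantic categories” step in miniature.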
Caltech 101 (2003), Caltech 256 (2007)
◦ 101/256 concepts
◦ Define the set of concepts first, then collect images (via image search engines)
◦ Manual selection, so clean annotation
◦ Up to a few hundred images per concept
◦ Standard benchmark datasets
◦ “Small world” effect anticipated; questionable selection of concepts
Caltech 101/256
airplane, chair, elephant, faces, leopards, rhino
bonsai, brain, scorpion, trilobite, yin_yang...
ImageNet: large number of concepts, large number of images
◦ #concepts: 10,000+; #images: 10,000,000+
◦ Concepts are systematically selected from WordNet (a machine-readable thesaurus)
◦ Manual annotation via Amazon Mechanical Turk: hard to control quality, scalability issues
• Currently researchers are focusing on the issue: how to effectively learn semantic concepts from a GIVEN training media corpus
• Corpus: the larger, the better
• But how to obtain a large corpus?
• CGM (Flickr, Web): noisy
• Manual annotation (AMT): costly, less scalable
• Other approaches such as the ESP game could be interesting
Issues
Timeline, 1970–2010 (milestones per medium):
Text: Project Gutenberg, bag-of-words, TF/IDF, WSJ corpus, TREC, PageRank
Audio/Speech: MFCC, Viterbi/HMM, 1000-word LVCSR, IBM ViaVoice
Image: USPS OCR (single digit), CMU-MIT Face DB, V-J face detection, Caltech 101, Pascal VOC, ImageNet
Video: TRECVID
Multimedia content analysis research has “just started”
More advanced results are yet to come. Business value? Killer applications?
Conclusion