Image-Language Association: are we looking at the right features?

Katerina Pastra

Language Technology Applications,

Institute for Language and Speech Processing, Athens, Greece

The pervasive digital video context

File-swapping networks (P2P), (video files & video blogs)

IPTV, iTV

Video search engines

Conversational robots, MM presentation systems...

access to MM content

generation of MM content

Auto-analysis of image-language relations:

complementarity, independence, equivalence

Overview

Focus on the semantic equivalence relation = Multimedia Integration = image-language association

Brief review of state-of-the-art association mechanisms – feature sets used

The OntoVis feature set suggestion

Using OntoVis in the VLEMA prototype

Prospects for going from 3D to 2D

Future plans and conclusions

Association Mechanisms in prototypes

Intelligent MM systems from SHRDLU to the conversational robots of the new millennium (Pastra and Wilks 2004):

Simulated or manually abstracted visual input is used, to avoid difficulties in image analysis

Integration resources are used with a priori known associations (e.g. image X on screen is a “ball”), or allow only simple inferences (e.g. matching an input image to an object model in the resource, which is in turn linked to a “concept/word”), to avoid difficulties in associating V-L

Applications are restricted to blocksworlds/miniworlds: scaling issues

Association algorithms

To be embedded in prototypes:

Probabilistic approaches for learning (e.g. Barnard et al. 2003): use word/phrase + image/image region (f-v vectors); require properly annotated corpora (IBM, Pascal etc.)

Logic-based approaches (e.g. Dasiopoulou et al. 2004): use feature-augmented ontologies; match low-level image features + leaf nodes

Use of both approaches reported too (Srikanth et al. 2005)
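As a rough illustration of the probabilistic family only, the sketch below estimates P(word | region) from co-occurrence counts in a tiny annotated corpus. It is far simpler than the models of Barnard et al. (2003) and is not taken from any of the cited systems. The corpus, the region labels (standing in for quantised feature vectors) and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical annotated corpus: each image is a pair of
# (region labels standing in for quantised feature vectors, caption words).
CORPUS = [
    (["blue_top", "green_bottom"], ["sky", "grass"]),
    (["blue_top", "grey_bottom"], ["sky", "road"]),
    (["green_bottom", "brown_mid"], ["grass", "horse"]),
]


def train_cooccurrence(corpus):
    """Count how often each word appears in the caption of an image containing each region."""
    counts = defaultdict(Counter)
    for regions, words in corpus:
        for region in regions:
            for word in words:
                counts[region][word] += 1
    return counts


def word_probabilities(counts, region):
    """Relative-frequency estimate of P(word | region); empty dict for unseen regions."""
    region_counts = counts.get(region, Counter())
    total = sum(region_counts.values())
    return {word: c / total for word, c in region_counts.items()} if total else {}


def best_word(counts, region):
    """Most probable word for a region, or None if the region was never seen."""
    probs = word_probabilities(counts, region)
    return max(probs, key=probs.get) if probs else None


counts = train_cooccurrence(CORPUS)
print(best_word(counts, "blue_top"))
```

Even this toy version shows why properly annotated corpora are required: the estimate is only as good as the region/word pairs it has seen.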

Feature set used: shape, colour, texture, position, size

Scaling?

The quest for the appropriate f-set

Constraints in defining a f-set:

Features must be distinctive of object classes (at the basic level)

Feature values must be detectable by image analysis modules

Cognitive thesis: No feature set is fully representative of the characteristics of an object, but one may be more or less successful in fixing the reference of the corresponding concept (word)

The OntoVis suggestion

Feature set suggested:

• physical structure: the number of parts into which an object is expected to be decomposed in different dimensions

• visually verifiable functionality: visual characteristics an object may have which are related to its function

• interrelations: relative location of objects, relative size

A domain model: Ontology + KBase for static indoor scenes (sitting rooms in 3D – XI KR language)
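To make the three feature groups concrete, here is a minimal Python sketch of how one segmented object might be represented, and how interrelations could be derived from object extents. The class, the field names, and the convention that x grows to the viewer's right are assumptions for illustration; the original domain model is written in the XI KR language, not Python.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    label: str
    # physical structure: number of clusters (parts) along each axis
    x_clusters: int
    y_clusters: int
    z_clusters: int
    # visually verifiable functionality
    has_surface: bool
    # extent along x, used here to derive interrelations (assumed units: metres)
    x_min: float
    x_max: float

    @property
    def length(self) -> float:
        return self.x_max - self.x_min


def interrelations(a: SceneObject, b: SceneObject) -> list:
    """Relative location and relative size of a with respect to b."""
    relations = []
    if a.x_min >= b.x_max:
        relations.append("right_of")
    elif a.x_max <= b.x_min:
        relations.append("left_of")
    if a.length < b.length:
        relations.append("shorter_than")
    elif a.length > b.length:
        relations.append("longer_than")
    return relations


# Hypothetical values, not measurements from the original scenes.
heater = SceneObject("heater", 1, 1, 1, True, x_min=3.0, x_max=3.8)
sofa = SceneObject("sofa", 3, 2, 2, True, x_min=0.5, x_max=2.5)
print(interrelations(heater, sofa))
```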

The OntoVis suggestion

[Figure: x, y and z axes]

OntoVis – KB examples

props(sofa(X),[has_xclusters_moreThan(X,1)]).

props(sofa(X),[has_yclusters_equalMoreThan(X,2)]).

props(sofa(X),[has_yclusters_equalLessThan(X,4)]).

props(sofa(X),[has_zclusters_equalMoreThan(X,2)]).

props(sofa(X),[has_zclusters_equalLessThan(X,3)]).

props(sofa(X),[on_floor(X,yes)]).

props(sofa(X),[has_surface(X,yes)]).

props(sofa(X),[size(X,XCLUSTERS)]).

props(chair(X),[has_xclusters(X,1)]).

props(chair(X),[has_yclusters_equalMoreThan(X,2)]).

props(chair(X),[has_yclusters_equalLessThan(X,4)]).

props(chair(X),[has_zclusters_equalMoreThan(X,2)]).

props(chair(X),[has_zclusters_equalLessThan(X,3)]).

props(chair(X),[on_floor(X,yes)]).

props(chair(X),[has_surface(X,yes)]).

props(chair(X),[size(X,XCLUSTER_YValue,TableYDIM_UpperConstraint)]).

armchairs?

stools?

OntoVis – KB examples

props(table(X),[has_xclusters(X,1)]).

props(table(X),[has_yclusters(X,2)]).

props(table(X),[has_zclusters(X,1)]).

props(table(X),[on_floor(X,yes)]).

props(table(X),[has_surface(X,yes)]).

props(table(X),[size(X,YDIM,XDIM, Relative_to_Room_YXDIM)]).

props(heater(X),[has_xclusters(X,1)]).

props(heater(X),[has_yclusters(X,1)]).

props(heater(X),[has_zclusters(X,1)]).

props(heater(X),[on_wall(X,yes)]).

props(heater(X),[on_floor(X,no)]).

props(heater(X),[has_surface(X,yes)]).

props(heater(X),[size(X,XDIM,YDIM, Relative_to_Wall_YXDIM)]).
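The facts above can be read as constraints that a segmented object must satisfy to receive a name. The following Python sketch mirrors the sofa, chair, table and heater entries as (feature, comparison, value) triples and returns every class whose constraints all hold. It is an illustration under stated assumptions, not the original XI/Prolog mechanism: the size constraints are left out, and the dictionary layout and function name are hypothetical.

```python
import operator as op

# One (feature, comparison, value) triple per props/2 fact; size facts omitted.
KB = {
    "sofa": [("x_clusters", op.gt, 1),
             ("y_clusters", op.ge, 2), ("y_clusters", op.le, 4),
             ("z_clusters", op.ge, 2), ("z_clusters", op.le, 3),
             ("on_floor", op.eq, True), ("has_surface", op.eq, True)],
    "chair": [("x_clusters", op.eq, 1),
              ("y_clusters", op.ge, 2), ("y_clusters", op.le, 4),
              ("z_clusters", op.ge, 2), ("z_clusters", op.le, 3),
              ("on_floor", op.eq, True), ("has_surface", op.eq, True)],
    "table": [("x_clusters", op.eq, 1), ("y_clusters", op.eq, 2),
              ("z_clusters", op.eq, 1),
              ("on_floor", op.eq, True), ("has_surface", op.eq, True)],
    "heater": [("x_clusters", op.eq, 1), ("y_clusters", op.eq, 1),
               ("z_clusters", op.eq, 1),
               ("on_wall", op.eq, True), ("on_floor", op.eq, False),
               ("has_surface", op.eq, True)],
}


def name_object(features):
    """Return, in alphabetical order, every class whose constraints all hold.

    `features` must provide a value for every feature used in KB.
    """
    return sorted(
        label
        for label, constraints in KB.items()
        if all(compare(features[feature], value)
               for feature, compare, value in constraints)
    )


# Hypothetical segmentation results.
segment_a = {"x_clusters": 3, "y_clusters": 2, "z_clusters": 2,
             "on_floor": True, "on_wall": False, "has_surface": True}
segment_b = {"x_clusters": 1, "y_clusters": 1, "z_clusters": 1,
             "on_floor": False, "on_wall": True, "has_surface": True}
print(name_object(segment_a), name_object(segment_b))
```

Without the size facts, chairs and sofas are separated only by the number of x clusters, which hints at why the armchair and stool cases are raised as open questions.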

OntoVis F-set advantages

It generalizes over visual appearance differences (e.g. different styles of sofas)

It goes beyond viewpoint (view angle + distance) differences

It can be used to reason about object identity by analogy (e.g. to describe “sofa-like” objects when the identification is not certain)
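One way to picture reasoning by analogy is partial matching: score an object by the fraction of a class's constraints it satisfies, and fall back to an "X-like" label when the best match is close but not complete. The sketch below does this in Python. The scoring rule and the 0.75 threshold are assumptions chosen for illustration and are not described in the source; the size constraints are again omitted.

```python
# Each class is a list of boolean tests over a feature dictionary.
PROTOTYPES = {
    "sofa": [
        lambda f: f["x_clusters"] > 1,
        lambda f: 2 <= f["y_clusters"] <= 4,
        lambda f: 2 <= f["z_clusters"] <= 3,
        lambda f: f["on_floor"],
        lambda f: f["has_surface"],
    ],
    "heater": [
        lambda f: f["x_clusters"] == 1,
        lambda f: f["y_clusters"] == 1,
        lambda f: f["z_clusters"] == 1,
        lambda f: not f["on_floor"],
        lambda f: f["has_surface"],
    ],
}


def describe(features, threshold=0.75):
    """Return the class name, an 'X-like object' label, or 'unknown object'."""
    scores = {
        label: sum(bool(test(features)) for test in tests) / len(tests)
        for label, tests in PROTOTYPES.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 1.0:
        return best
    if scores[best] >= threshold:
        return f"{best}-like object"
    return "unknown object"


# A hypothetical bench: several seats but no back, so one y cluster only.
bench = {"x_clusters": 2, "y_clusters": 1, "z_clusters": 2,
         "on_floor": True, "has_surface": True}
print(describe(bench))
```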

Using OntoVis

VLEMA: A Vision-Language intEgration MechAnism

Input: automatically re-constructed static scenes in 3D (VRML format) from RESOLV (robot-surveyor)

Integration task: Medium Translation from images (3D sitting rooms) to text (what and where in EN)

Domain: estates surveillance

Horizontal prototype

Implemented in shell programming and Prolog

The Input

System Architecture

[Figure: architecture diagram with the labels Data Transformations, Object Segmentation, Object Naming, OntoVis + KB and Description; sample output: “…a heater … and a sofa with 3 seats…”]

The Output

Wed Jul 7 13:22:22 GMTDT 2004

VLEMA V1.0

Katerina Pastra@University of Sheffield

Description of the automatically constructed VRML file

“development-scene.wrl”

This is a general view of a room.

We can see the front wall, the left-side wall, the floor,

A heater on the lower part of the front-wall and a sofa with 3 seats.

The heater is shorter in length than the sofa.

It is on the right of the sofa.
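The final step, turning named objects and their interrelations into English, can be pictured as template filling. The Python sketch below produces sentences in the style of the output above from a hand-written scene summary. It is not VLEMA's generator, which was implemented in shell and Prolog: the templates, relation names and function names are hypothetical, and pronoun use ("It is on the right...") is not modelled.

```python
# Hypothetical sentence templates keyed by relation name.
RELATION_TEMPLATES = {
    "shorter_than": "The {a} is shorter in length than the {b}.",
    "longer_than": "The {a} is longer than the {b}.",
    "right_of": "The {a} is on the right of the {b}.",
    "left_of": "The {a} is on the left of the {b}.",
}


def join_list(items):
    """Join items as 'a, b and c'."""
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    return ", ".join(items[:-1]) + " and " + items[-1]


def describe_scene(structures, relations):
    """Build description sentences from room structures and (a, relation, b) triples."""
    lines = ["This is a general view of a room."]
    if structures:
        lines.append("We can see " + join_list(["the " + s for s in structures]) + ".")
    for a, relation, b in relations:
        lines.append(RELATION_TEMPLATES[relation].format(a=a, b=b))
    return lines


lines = describe_scene(
    ["front wall", "left-side wall", "floor"],
    [("heater", "shorter_than", "sofa"), ("heater", "right_of", "sofa")],
)
print("\n".join(lines))
```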

Future Plans & Conclusions

Extension of OntoVis and testing in VRML worlds

Modular description of clusters/parts (not relying just on their number in each dimension)

Exploration of portability of the f-set to 2D images

Initial signs of feasibility: cf. research on detecting spatial relations in 2D, structure identification in 2D, and algorithms for 3D reconstruction from photographs

OntoVis:

To what extent is it scalable, even in 3D?

Is it complementary or an alternative to current approaches?

There are indications of OntoVis scalability & feasibility that are worth further exploration