September 11th, 2013

    A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy


    Igor García Olaizola

    Supervised by Prof. Basilio Sierra Araujo


    Dr. Julián Flórez Esnal

    The doctoral candidate    The supervisor    The supervisor

    Donostia-San Sebastián, Wednesday 11th September, 2013

    A Framework for Content Based Semantic Information Extraction from Multimedia Contents


    Author: Igor García Olaizola

    Advisor: Basilio Sierra Araujo

    Advisor: Julián Flórez Esnal

    The following web-page address contains up to date information about this dissertation and related topics:



    Text printed in Donostia San Sebastián

    First edition, September 2013

    For you, Dad.

    One of the main characteristics of the new digital era is the media big bang, in which images (still or moving) are one of the main types of data. Moreover, this trend keeps growing, driven mainly by the ease of capture offered by new mobile devices that include one or more cameras.

    From a professional perspective, most content-related sectors face two main problems in operating efficient content management systems: a) the need for new technologies to store, process and retrieve huge and continuously growing datasets, and b) the lack of effective methods for the automatic analysis and characterization of unannotated media.

    More specifically, the audiovisual and broadcasting sector, which is experiencing a radical transformation towards a fully Internet-convergent ecosystem, requires content-based search and retrieval systems to browse huge distributed datasets and include content from different and heterogeneous sources.

    On the other hand, Earth observation technologies are improving the quantity and quality of the sensors installed in new satellites. This implies a much higher input data flow that must be stored and processed.


    In general terms, the aforementioned sectors and many other media-related activities are good examples of the Big Data phenomenon, where one of the main problems lies in the semantic gap: the inability to transform mathematical descriptors obtained by image processing algorithms into concepts that humans can naturally understand.


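    This dissertation presents an overview of applied research activity across different R&D projects related to computer vision and multimedia content management. One of the main outcomes of this activity has been the Mandragora model, an architecture design that aims to minimize the semantic gap and to create automatic annotations based on a previously defined ontology.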

    One of the main characteristics of the new digital era is the great explosion of multimedia content, in which images (both still and moving) are the main type of data. Moreover, this trend keeps growing, mainly because of the ease of capture offered by mobile devices that include one or more cameras.

    For the professional sectors related to digital media, this enormous growth of data is causing two main problems. On the one hand, new technologies are required that allow content to be stored, processed and retrieved effectively in huge and growing datasets. On the other hand, automatic methods for analyzing and characterizing content without prior annotation are also needed.

    More specifically, we can highlight the audiovisual sector, which is undergoing a deep transformation driven mainly by its convergence with the Internet. In this situation, there is a growing need for content search and retrieval systems that allow navigation through massive datasets that are increasingly distributed and come from increasingly heterogeneous sources.

    In the case of Earth observation, data acquisition systems are increasingly numerous and precise, which generates ever larger content flows that must be continuously processed and stored.

    In general, the aforementioned sectors and other activities related to multimedia content are clear examples of the current Big Data phenomenon, in which one of the main problems is bridging the semantic gap. We can define the semantic gap as the difference, not yet bridged, between the mathematical concepts derived from image processing techniques and the concepts that humans use to describe the same content.

    This thesis presents a review of the applied research activity carried out in several projects related to computer vision and multimedia content management. One of the main results of this research activity has been the Mandragora model, an architecture design aimed at minimizing the semantic gap and creating automatic annotations based on a previously defined ontology.

    Since one of the main problems faced by the implementation of Mandragora is that the lack of prior knowledge about the content limits the initial analysis, we have proposed a new method (DITEC) for the semantic characterization of images. The good results obtained in the experimental tests have led to an adaptation of the original method, which is based on a global descriptor, so that a variant of that global descriptor is effective as a local descriptor. This document also describes the local DITEC variant, whose experimental results (even with an implementation still under development) have shown highly competitive behavior when compared with the most popular local descriptors in the scientific literature.

    One of the main characteristics of the new digital era is the big bang, or unbounded proliferation, of media content, with images (both photographs and videos) being the main type of content. Moreover, this trend is still growing, mainly because of the ease of capture offered by mobile devices that carry one or two cameras.


    From a professional perspective, content-related sectors face two main problems when it comes to managing content efficiently. On the one hand, new technologies are needed to store, process and retrieve these huge amounts of data. On the other hand, effective methods for the automatic analysis and characterization of content without any prior annotation have yet to be developed.

    More specifically, the audiovisual and broadcasting sector, in the deep transformation it is undergoing as it converges with the Internet, is managing the largest volumes of content ever. In addition, content comes from increasingly diverse sources and is increasingly heterogeneous in nature, which severely reduces the effectiveness of systems built around the centralized, monolithic models of the past. For this reason, performing effective searches over these data requires the development of flexibly structured systems that are able to analyze the content itself.

    On the other hand, Earth observation technologies use more numerous and more precise instruments in new-generation satellites. As a result, observation systems transmit ever larger data flows, and the need to store and process all that content is increasingly difficult to meet.

    In general, the aforementioned sectors and many others that work with multimedia content are clear examples of the Big Data phenomenon, in which the semantic gap, that is, the inability to turn the mathematical features extracted by image processing algorithms into concepts that humans can understand, has become one of the main problems.

    This thesis presents the general results obtained in several applied research projects related to computer vision. One of the main results is the Mandragora architecture. The main goal of Mandragora is to create an automatic, ontology-based image annotation system that reduces the semantic gap.

    Since one of the main problems of Mandragora is that, without initial knowledge, the first processing step has to be carried out blindly, we present a new method, called DITEC, for characterizing the semantic domain of images. Given the good results obtained in the experimental tests, we have worked on adapting the global descriptor at the core of DITEC for local use. The DITEC local feature is therefore also described in this document. Although the implementation of the method is still under development, the experimental results have been very good compared with the best-known local descriptors in the scientific literature.

    This is not the story of a self-made man. Instead, all the achievements presented in this work have a long chain behind them, a chain made up of people who have supported my entire professional career, which cannot be separated from personal experiences. At this point, it is worth acknowledging all these people.

    In this sense, both of my supervisors, Basilio Sierra and Julián Flórez, have been an essential part of this work, with their unconditional commitment and highly valuable scientific guidance. Without a doubt, Julián, it can be said that I took the first steps of my professional career with you. The trust and support I felt from you from the very beginning have not diminished over a whole decade, and that is saying something. What I have learned from you throughout this time is one of the main foundations of my daily work, and it is reflected in this work as well. On the other hand, when looking for a suitable supervisor at the university, I was extremely lucky to meet Basi. From the beginning he has known how to adapt to the interruptions and lack of continuity caused by my professional situation. I would say that his advice and scientific supervision have been excellent, and delivered in a kind and enthusiastic way, turning what could have been a hard and tedious process into work that is a pleasure to do.


    Within Vicomtech, the environment in which most of my professional activity has taken place and where this thesis is framed, I have had countless sources of support. I will surely leave someone out (my apologies in advance), but I do not want to miss the chance to mention some of them, such as Jorge Posada, Associate Director, who has supported me at all times with encouragement and practical advice, which is very welcome when one focuses too much on one's own problem. Amalia and David, fellow sufferers who showed me that it is indeed possible to combine a doctoral thesis with professional activity in a technology centre. Shabs, this interesting man who always shows me that things may have a non-obvious point of view that is worth observing. Edurne, a colleague and friend who smooths my path again and again by making difficult things easy. Of course, the Digital TV and Multimedia Services department, which I am part of and from which I have felt enormous support throughout the whole process, deserves a special mention. I truly hope to be able to reciprocate in the same measure.


    This work has been carried out in strong collaboration with some colleagues who deserve a special mention. Marco Quartulli, a Renaissance scientist with whom I play and learn every day. Naiara Aginako, a companion on the road since the day we started working with images, as rigorous at work as she is kind in person; I am looking forward to your turn coming. Gorka, whose thesis marked the starting point of this work. So many interesting discussions that led to good ideas or... to more discussions :-) I hope to keep enjoying your counterpoint. And of course Iñigo Barandiaran, a meticulous, creative researcher with a taste for work well done who remains an example to me. Of the countless hours we have spent together in this process, there has not been a minute I did not enjoy. I am excited to know that our work still needs improvements and further research, because that will be the way for us to keep collaborating. It looks like we are finally going to have that beer!

    Looking back a little, I cannot forget other old friends who, although from further away, have been fundamental to making this work happen. Haritz, mein Bruder, so many hours together since we started our degree, so many projects and discussions, always ready to help and work together. Then a year-long adventure together around Wichernstrasse. You have had a great deal to do with the kind of engineer I am. Jaizki, an old friend since our university days and the other vertex of the Zarautz triangle in our German adventure. As before, you have helped me a great deal this time too. No sooner said than done: the corrections to the paper arrived as soon as I asked for them, and with great precision. Offering help in words is easy; you showed it in deeds.


    I am one of those who believe that, to be able to work wholeheartedly at a professional level, it is essential to find balance at a personal level. Sharing my life with you, I owe it to you, Myriam, that starting each day is a wonderful thing, even if sometimes you need a crane.

    We are because they were, and because we are, they will be. Naroa and Maddi, you are my real project; you may be small, but next to you everything else seems small to me.

    As I said, we are because they were, and if I am who I am, I owe it to my family and above all to my parents (at least the good part). Uncle Joxe, who showed me the basic principles of my life; after so many years they have not changed. Mum, you taught me the value of effort, of taking pleasure in work, of doing things neatly. You still show me that every day. Dad, with this thesis I want to give you the most important mention of all. Masterfully combining ingenuity, imagination and knowledge, it was you who steered me towards engineering. I strive to follow your example: at work as outside it, being upright and sensible, trying to help those around me, and holding on to the right decisions with full consistency even when they are hard. Every day I try to walk with dignity the path you showed me so clearly in word and deed.

    Thank you all

    Igor García Olaizola

    September 2013

    List of Figures

    1.1 Begira scene definition
    1.2 Example of the cloud segmentation process
    1.3 Rushes content analysis workflow
    1.4 Grafema Assets
    1.5 Grafema System Workflow
    1.6 Grafema System Architecture
    1.7 IQCBM System Architecture
    1.8 Screenshots of the IQCBM user interface
    1.9 Relationship between R&D projects and scientific activity in multimedia content analysis

    2.1 Information Retrieval Reference Model
    2.2 Mandragora Architecture
    2.3 DIKW Pyramid
    3.1 Watson DeepQA High-Level Architecture
    3.2 Idealized query process decomposition on EO image mining
    3.3 Envisat instruments
    3.4 General architecture of the meteorological information management system

    4.1 DITEC System workflow
    4.2 Trace transform, geometrical representation
    4.3 Trace transform contribution mask at very high resolution parameters (Image resolution: 100x100 px. n = 1000, n = 1000, n = 5000)
    4.4 Pixels relevance in trace transform scanning process with different parameters (n, n, n). Original image resolution = 384x256
    4.5 Trace Transform and subsequent Discrete Cosine Transform of Lenna (Y channel of YCbCr color space)
    4.6 Conceptual scheme: DCT matrix transformation into , k pair vector
    4.7 Statistical properties of all Kurtosis measurements made on the distributions obtained by processing Corel 1000 dataset
    4.8 Examples of probability density distribution and histograms obtained by the samples
    4.9 Samples of Corel 1000 dataset. The dataset includes 256x384 or 384x256 images
    4.10 Distance among classes in the Corel 1000 dataset according to misclassified instances
    4.11 Distance among most inter-related classes in the Corel 1000 dataset according to misclassified instances
    4.12 Corel 1000 picture corresponding to class Architecture and classified as Mountain
    4.13 Corel 1000 precision results with different feature extraction algorithms
    4.14 Samples of satellite footage dataset. 256x256 px patches at different scales
    4.15 Distance among classes in the Geoeye dataset according to misclassified instances
    4.16 Time performance behavior
    4.17 System workflow for DITEC as local feature
    4.18 Matching accuracy depending on the number of angular samples
    4.19 Matching accuracy depending on the number of radial samples
    4.20 Matching accuracy depending on the simultaneous increase of angular and radial sampling
    4.21 Computation time depending on the simultaneous increase of angular and radial sampling
    4.22 In-plane Rotation Transformation matching results
    4.23 Scale Transformation matching results
    4.24 Projective Transformation matching results
    4.25 Exposure change photometric Transformation matching results
    4.26 Trace transform row and column analysis

    A.1 DITEC development platform
    A.2 Circular patch image
    A.3 Result of (,) space exploration with Bresenham
    A.4 First half of the source image is sampled (blue regions) while areas around vertical and horizontal axes are not considered
    A.5 Second half of the source image is sampled (red and green). These regions are moved to the [...] areas in order to be sampled with the Bresenham algorithm
    A.6 Result of (,) sampling with Bresenham algorithm and a single image rotation
    A.7 Result of (,) pixelwise sampling with image rotation for each angular iteration
    A.8 Result of different sampling strategies of (,) space
    B.1 Scanline defined in terms of and

    List of Tables

    4.1 List of Trace Transform functionals proposed in [KP01]
    4.2 Quantization effects of the trace transform
    4.3 Corel 1000 dataset confusion matrix
    4.4 Geoeye dataset confusion matrix


    Part I

    Work Description


    If our brains were simple enough
    for us to understand them, we'd be
    so simple that we couldn't.

    Ian Stewart



    Artificial Intelligence (AI) is probably one of the most exciting fields of knowledge, one in which even the definition of the term is controversial, because intelligence is understood in many different ways and defining it remains a hard epistemological problem. Learning, reasoning, understanding, abstract thought, planning, problem solving and other related topics are all different aspects of intelligence.

    The emergence of programmable digital computers in the late 1940s offered a revolutionary way to explore new methods for formal reasoning and logic experimentally. However, the initial great expectations for AI did not materialize, and the prediction made by Herbert A. Simon¹ that "machines will be capable, within twenty years, of doing any work a man can do" still remains science fiction.


    The fashions of AI have moved over the years from automated theorem proving to expert systems, which were later replaced by behaviour-based robotics, and now seem to seek the solution in learning from big data [Lev13]. None of these trends has been able to meet the expectations that the founders of AI placed on the field [Wan11]. Patrick Winston (director of the MIT Artificial Intelligence Laboratory from 1972 to 1997) cited the problem of "mechanistic balkanization", with research focusing on ever-narrower specialties such as neural networks or genetic algorithms: "When you dedicate your conferences to mechanisms, there's a tendency to not work on fundamental problems, but rather [just] those problems that the mechanisms can deal with" [Cas11].

    ¹ Herbert Alexander Simon (June 15, 1916 - February 9, 2001) received the ACM Turing Award in 1975 for basic contributions to artificial intelligence, the psychology of human cognition, and list processing, and is considered one of the founders of AI.

    However, there has been great scientific and technological progress in many AI-related domains (formal logic, reasoning, statistics and data mining, genetic programming, knowledge representation, etc.) that, without satisfying the foundations proposed by Winston or Chomsky, has enabled the creation of technological solutions for different application fields such as natural language processing, computer vision, drug design, medical diagnosis, genetics, finance and economy, user recommendation systems and many others.

    1.1 Context of this research activity

    The research activity described in this dissertation has been carried out mainly within the applied research perspective provided by both basic and applied research projects developed at Vicomtech-IK4. Vicomtech-IK4 is an applied research center located in San Sebastián (Basque Country, Spain) that combines excellence in basic research with its application and transfer to industry. Some of these projects have been transferred to industry and their intellectual property has been protected through patent applications. In other cases, the scientific progress made within the project has been published in journals or conferences.

    Knowledge of market needs, of the technological state of the art and of real-world integration, which introduces many practical constraints, is combined with the scientific method and basic research activities in collaboration with universities. In this case, the collaboration with the University of the Basque Country, and more specifically with the Computer Science and Artificial Intelligence Department of the Computer Engineering Faculty, has been a key element for the balanced applied and scientific progress of the research work.

    1.1.1 Vicomtech-IK4

    Vicomtech-IK4, as an applied research centre, focuses on all aspects of multimedia and visual communication technologies along the entire content production pipeline, from generation through processing and transmission to rendering, interaction and reproduction. Vicomtech-IK4 is structured in six departments that offer different views and specific technological solutions around the aforementioned research activity. These departments are the following:

    Digital Television and Multimedia Services

    Speech and Natural Language Technologies

    eTourism and Cultural Heritage

    Intelligent Transport Systems and Engineering

    3D Animation and Interactive Virtual Environments

    eHealth and Biomedical Applications

    The research described in this dissertation has been carried out within the Digital TV & Multimedia Services department, in strong collaboration with the eHealth and Biomedical Applications department. In fact, computer vision and multimedia content understanding is one of the main research lines of Vicomtech-IK4. One of the main problems addressed in this dissertation is aligned with one of the main difficulties of current multimedia management systems in diverse sectors such as broadcasting, remote sensing and medical imaging: the huge amount of unannotated data and extremely broad domains that cannot be explicitly defined.

    Relation with other Vicomtech-IK4 PhD processes

    The research and technological activity performed in this work has been carried out in strong collaboration with two other PhD theses developed at Vicomtech-IK4. These two works analyze and develop other aspects of the analysis and understanding of multimedia. More specifically, Marcos [Mar11] studied the multimedia retrieval problem from a semantic point of view, creating a semantic middleware-based approach as an intermediary layer between users' high-level queries and the system's low-level annotations. Some of the final considerations of the work carried out by Marcos, and the requirements identified there as future work to feed this semantic middleware from a bottom-up approach, are the basis of the initial context of this dissertation.

    On the other hand, Barandiaran [Bar13] focused his work on the analysis of local descriptors. The collaboration with Barandiaran has resulted in a local adaptation of the global descriptor, which is one of the main contributions proposed in this dissertation (see Section 4.6). This novel local feature has proven to be highly robust as a local descriptor.


    1.1.2 Computer Science and Artificial Intelligence Department of the Computer Engineering Faculty

    The Robotics and Autonomous Systems Group which belongs to the Computer

    Science and Artificial Intelligence Department of the Computer Engineering

    Faculty is very active in two main areas:

    Mobile Robotics

    Behavior-based control architectures for mobile robots.

    Bio-inspired robot navigation.

    Use of visual information for navigation.

    Machine Learning

    Dynamic learning mechanisms

    Classifier combination

    New paradigms for supervised classification

    Optimization problems

    This group's deep knowledge of machine learning science and technologies has provided the scientific foundations for the more technological work developed at Vicomtech-IK4. This combination provides a context with high potential for scientific research.

    1.2 R&D Projects

    This dissertation builds on R&D projects with common underlying scientific needs and customer-specific requirements. The knowledge and experience acquired during these projects have driven the general framework presented as one of the main contributions of this work.

    1.2.1 Begira

    Title: Diseño y Desarrollo de un Sistema Seguimiento Preciso de Objetos en Transmisiones Deportivas (Design and development of a high accuracy object tracking system for sports broadcasting).

    Project typology: Industrial project partially supported by the Gaitek programme.

    Company name: G93.

    Period: 2005-2009.

    Summary

    Augmented reality projects require a deep knowledge of the scene that has to be extracted and updated in real time. To ensure the accuracy and real-time performance of the system, this knowledge must be explicitly defined.

    The goal of the Begira project was to develop a single-camera system to track the ball trajectory and locate the bouncing point for live TV broadcasts of Basque Pelota. The main constraints of the system were:

    Single camera.

    Broadcasting camera (720p@50).

    Tracking, positioning and virtual reconstruction under 20 seconds.

    Single standard computer for processing purposes.

    From an Artificial Intelligence perspective, we can consider it as a system

    where the knowledge domain is reduced to a single scene (the Basque Pelota

    court) and thus can be explicitly defined. The main elements that define this

    domain are:

    3D environment: A court composed of three planar surfaces (front wall, side wall and ground).

    The relative position of the camera to the court is obtained during a

    calibration process by putting a checkerboard on the ground.

    Once the camera is calibrated, its position is fixed during the entire


    Dynamic objects: There are only two types of dynamic objects in the scene:

    Players: There can be two or four players. They are much bigger than the ball and most of the time their lowest part is touching the ground.

    Ball: It is white, round and much smaller than the players. Sudden trajec-

    tory changes are due to the hit of the players or a bounce. The ball is

    so rigid that the bounce can be considered elastic.
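Because the domain is this small, the knowledge about dynamic objects can be written down directly as rules. The following is a minimal sketch of that idea, not the Begira implementation: the `Blob` fields, the threshold values and the `classify` function are all hypothetical, chosen only to mirror the description above (a small, white, round ball and much larger players).

```python
from dataclasses import dataclass


@dataclass
class Blob:
    """Hypothetical summary of a moving region found in one frame."""
    area_px: float       # region area in pixels
    circularity: float   # 4*pi*area / perimeter**2; 1.0 for a perfect disc
    mean_luma: float     # mean brightness, 0 (black) to 255 (white)


# Illustrative thresholds only; real values depend on camera and court.
MAX_BALL_AREA = 150.0
MIN_BALL_CIRCULARITY = 0.8
MIN_BALL_LUMA = 200.0
MIN_PLAYER_AREA = 2000.0


def classify(blob: Blob) -> str:
    """Label a blob using explicit, hand-written domain rules."""
    is_small = blob.area_px <= MAX_BALL_AREA
    is_round = blob.circularity >= MIN_BALL_CIRCULARITY
    is_white = blob.mean_luma >= MIN_BALL_LUMA
    if is_small and is_round and is_white:
        return "ball"
    if blob.area_px >= MIN_PLAYER_AREA:
        return "player"
    return "unknown"
```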

    According to the domain defined with the aforementioned concepts, a homog-

    raphy matrixHis calculated to obtain cameras extrinsic parameters. Then the


    Figure 1.1: Scene definition: Ball trajectory samples used to estimate the parametric curves and the calculation of the bouncing point on the ground once the center position of the ball is obtained (crossing point of the two curves).

    ball is initially detected and the tracking system follows its trajectory. Abrupt tra-

    jectory changes define the limit between the instant before and after the bounce.

    Once the two parametric curves are estimated, their crossing point is calculated

    on the image. This two-dimensional position (in pixels) is then converted to the

    3D space using the inverse of the homography matrix (H^{-1}). To solve the uncertainty

    of the 3D position obtained by the 2D projection, the condition Z = 0 is established for the bouncing point. More details of the project can be found in

    Section 7.3.

    Conclusions

    The Begira project is a good example of expert systems applied to image processing

    and computer vision. The technical goals were successfully achieved and the

    results of the project were exploited by the Basque public broadcaster ETB and

    the TV content producer G93. However, the knowledge acquired by the system

    was so hard-coded that it is very difficult to extend or integrate it in other more

    general solutions. The good performance and accuracy results rely on its reduced domain definition and rigid nature.
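
    The back-projection step described in the Begira summary can be illustrated with a minimal sketch. It assumes a 3x3 homography H that maps points of the ground plane (Z = 0) to image pixels; the function names and the numeric matrix are hypothetical and are not taken from the project.

```python
from typing import List, Tuple

Matrix3 = List[List[float]]


def invert_3x3(m: Matrix3) -> Matrix3:
    """Invert a 3x3 matrix with the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("singular homography")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def apply_homography(m: Matrix3, x: float, y: float) -> Tuple[float, float]:
    """Map (x, y) through m in homogeneous coordinates and normalize."""
    u = m[0][0] * x + m[0][1] * y + m[0][2]
    v = m[1][0] * x + m[1][1] * y + m[1][2]
    w = m[2][0] * x + m[2][1] * y + m[2][2]
    return u / w, v / w


def image_to_ground(h: Matrix3, px: float, py: float) -> Tuple[float, float, float]:
    """Back-project a pixel to the court floor, imposing Z = 0."""
    ground_x, ground_y = apply_homography(invert_3x3(h), px, py)
    return ground_x, ground_y, 0.0


# Hypothetical ground-plane-to-image homography, for illustration only.
H_EXAMPLE: Matrix3 = [
    [800.0, 20.0, 640.0],
    [10.0, -600.0, 360.0],
    [0.01, 0.05, 1.0],
]
```

    Projecting a ground point with apply_homography(H_EXAMPLE, X, Y) and passing the resulting pixel to image_to_ground recovers the original point. The Z = 0 constraint is what removes the depth ambiguity of a single camera.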

    1.2.2 Skeye

    Title: Sistema de análisis meteorológico basado en imágenes del cielo

    tomadas desde tierra (Meteorological analysis system based on images of the sky

    taken from the ground).

    Project typology: Industrial project supported by the Gaitek programme.



    Company name: Dominion.

    Period: 2007-2008.

    Summary

    Meteorological stations provide multiple sensor data as well as some more subjective

    information such as the cloudiness. The goal of Skeye was to provide

    an automatic system to accurately estimate the cloudiness factor, avoiding any

    human intervention.

    As in Begira, the semantic domain was small and quite straightforward

    to model. The four classes that compose the domain are: cloud, sun, blue

    sky and earth. The project was carried out by analyzing the features that were

    characteristic for each class and the scene was defined in terms of a dome with normalized illumination conditions.

    (a) (b)

    Figure 1.2: Example of the cloud segmentation process

    Conclusions

    Similarly to Begira, in this case the feature extraction process provided all the

    information needed for a further class assignment by applying specific thresholds.

    However, the further integration of the developed system in more domains or

    scenes would be a difficult task since the whole development and the selected features totally depend on the domain definition and scene conditions. More information

    about this work can be found in Section 7.1.
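
    The threshold-based class assignment mentioned for Skeye can be sketched as follows. The features (mean brightness and red/blue ratio) and all threshold values are assumptions made only for illustration; the text does not state which features or thresholds the project used.

```python
from collections import Counter
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each in 0..255

# Hypothetical thresholds, not taken from the Skeye project.
SUN_BRIGHTNESS = 245
EARTH_BRIGHTNESS = 40
CLOUD_RED_BLUE_RATIO = 0.75


def classify_pixel(pixel: Pixel) -> str:
    """Assign one of the four domain classes with fixed thresholds."""
    red, green, blue = pixel
    brightness = (red + green + blue) / 3
    if brightness >= SUN_BRIGHTNESS:
        return "sun"
    if brightness <= EARTH_BRIGHTNESS:
        return "earth"
    # Clouds are close to grey (red ~ blue); clear sky is dominated by blue.
    ratio = red / blue if blue > 0 else float("inf")
    return "cloud" if ratio >= CLOUD_RED_BLUE_RATIO else "blue sky"


def cloudiness(pixels: Iterable[Pixel]) -> float:
    """Fraction of sky pixels (cloud + blue sky) labeled as cloud."""
    counts = Counter(classify_pixel(p) for p in pixels)
    sky = counts["cloud"] + counts["blue sky"]
    return counts["cloud"] / sky if sky else 0.0
```

    Because every decision depends on fixed thresholds tuned to one scene model, a sketch like this also shows why such a system is hard to move to other domains.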

    1.2.3 SiRA

    Title: Diseño y Desarrollo de un Sistema de Reconocimiento de Marcas

    Comerciales en Emisiones Televisivas (Design and development of a system

    for commercial brand recognition in TV broadcasts).


    Project typology: Industrial project supported by the Gaitek programme.

    Company name: Vilau.

    Period: 2007-2008.

    Summary

    This project is another example of a system based on a reduced semantic domain,

    but in this case the approach was more general and some higher abstraction level

    elements were introduced. The goal of SiRA was to detect logos in TV content in

    order to automate advertisement monitoring tasks. This project was also supported

    by the Basque Government and its industrial application was envisioned

    by Vilau, a media communication company.

    In this case, the constraints in terms of real time behavior and equipment were lighter than in Begira. However, the domain was broader: any type of logo

    embedded in any type of content and taken from different perspectives.

    The approach followed in this case was to firstly detect a logo candidate as-

    suming that a logo would be typically surrounded by a regular shape (square,

    circle, triangle, etc.) and composed of very few colors. Once the logo was detected,

    different feature extraction algorithms could be applied in order to compare the

    results with those features corresponding to the target logo dataset. Depending

    on the extracted features, different distance metrics were applied.

    Conclusions

    The results of SiRA can be integrated as a new feature in other content analysis

    systems. In this case, SiRA would provide information about potential logos existing

    in a specific video or still image. Moreover, even if the process itself is carried

    out by using low-level operators, it can be considered that the result of SiRA is a

    set of high level features with valuable semantic content, as in general terms the

    presence of a logo means that there is a product or an advertisement related to it.

    1.2.4 SIAM

    Title: Diseño y Desarrollo de un Sistema de Análisis Multimedia de Contenido

    Audiovisual en Plataformas Web Colaborativas (Design and development

    of a system for multimedia analysis of audiovisual content in

    collaborative web platforms).



    Project typology: Industrial project supported by the Gaitek programme.

    Company name: Hispavista (http://hispavista.com/).

    Period: 2009-2010.

    Summary

    The first ideas of this work related to the semantic analysis of multimedia content

    were developed in SIAM. The goal of this project was to create content analysis

    tools to improve the exploitation of large amounts of user generated content.

    The context of the project was www.tu.tv, a YouTube-like video sharing platform

    owned by Hispavista. According to this approach, the semantic labels can be obtained

    from unstructured user comments. Then, by finding similar contents, new

    untagged content can be assigned to an existing label.

    As the content analyzed in SIAM could be any kind of video, the semantic

    domain was too broad and complex to be defined, and one of the main problems

    was the definition of a semantic unit in a video. The assumption of a video

    as a semantic unit is too inconsistent in many cases, as its elements can

    change over time. Therefore, each video was decomposed into shots and

    each shot was analyzed and labeled. Finally, the entire video would be labeled as

    the composition of its shot labels.

    Conclusions

    The main outcome of SIAM was the shot based content analysis model and a shot

    boundary detector that has been later used for semantic analysis purposes. More-

    over, the potential of user generated metadata was addressed in this project. We

    identified the potential of this amount of unstructured data that could be com-

    plementary to the perfectly organized but expensive to populate professional


    1.2.5 Cantata

    Title: Content Aware Networked systems Towards Advanced and Tailored


    Project typology: ITEA



    Consortium: Bosch Security Systems, Philips Electronics Netherlands, Philips

    Medical Systems, Philips Consumer Electronics, TU/e, TU Delft, Multitel,

    ACIC, Barco, Traficon, VTT, Solid, Hantro, Capacity Networks, I&IMS, Telefonica,

    VicomtechIK4, University Pompeu Fabra, CRP, Henri Tudor, Codasystem, Kingston University, University of York, INRIA.

    Summary

    The goal of Cantata was to create a distributed service for content analysis. The application

    fields included medical imaging, entertainment and security. Our activity

    was focused on the entertainment sector, where the content analysis modules were

    connected to user profiles in order to create content recommendation systems.

    In this case, the logo detection system was used to provide content information

    to the main content analysis and recommendation system.

    Conclusions

    The recommendation system was intended to combine user activity information,

    content metadata and low-level feature based information. However, the broad

    domain definition required an unaffordable amount of low-level descriptors and

    even the combination of all these descriptors would be a very complex issue. Due to this complexity, most recommendation systems rely basically on metadata.

    1.2.6 RUSHES

    Title: Retrieval of mUltimedia Semantic units for enHanced rEuSability.

    Project typology: FP6-2005-IST-6.


    Consortium: Heinrich-Hertz-Institut (DE), University of Surrey (UK), Athens Technology Centre (GR), Vicomtech (ES), Queen Mary University of

    London (UK), Telefonica I+D (ES), FAST Search & Transfer (NO), University

    of Brescia (IT), ETB (ES).

    Summary

    The overall aim of RUSHES was to design, implement, validate, and trial a system

    for both the delivery of and access to raw media material (rushes) and the reuse of that



    content in the production of new multimedia assets, following a multimodal approach

    that combines knowledge extracted from the original content, metadata,

    visualization and highly interactive browsing functionality.

    The core research issues of RUSHES were:

    Knowledge extraction for semantic inference

    Automatic semantic-based content annotation

    Scalable multimedia cataloging

    Interactive navigation over distributed databases

    Non-linear querying and retrieval techniques using hierarchic descriptors.

    Figure 1.3: Rushes content analysis workflow

    Conclusions

    The RUSHES consortium tried to address the semantic gap by creating a powerful

    architecture composed of low-level operators. The workflow designed in Rushes

    (Figure 1.3) was able to combine multiple low-level features and multiple types of sources


    (video, audio, text). Moreover, the shot was considered as a semantic unit of a

    video. Due to the fact that different shot boundary operators provide different

    shots, an extra complexity was added to the metadata model where each feature

    could define its temporal boundaries.

    All the low-level operators were applied to every content in the database. This

    fact introduced a strong limitation in the scalability of the domain. In order to

    identify new concepts, more low-level operators might be needed and as the size

    of the feature-space dimensionality increased, the system became both computationally

    too demanding and unaffordable for the data mining and ontology

    management processes. We presented a potential solution to this problem in

    [OMK+09] by splitting the domain in sub-domains that only apply those low-

    level feature extraction operators suggested by the domain definition (ontology).

    However, it requires prior knowledge of the content that should be obtained by

    applying low-level operators. This chicken-and-egg problem will be one of the key

    topics of this research work.

    1.2.7 Grafema

    Title: Grafema: Multimodal content search platform

    Project typology: Basic research project.

    Period: 2012.

    Summary

    The goal of the Grafema project was to create a base platform to store, annotate

    and retrieve multimedia content of diverse nature. More than focusing on the

    algorithms to obtain content descriptors or methods for automatic content an-

    notation, Grafema was focused on the architectural aspects and the design of a

    generic solution to deal with different types of content. In this sense, an asset

    could be either text, image, audio, video, 3D or even a combination of these pre-

    vious elementary units. According to this generic description of a digital asset,

    similarity metrics must also adapt to each case or combination. As can be observed

    in Figure 1.4, assets containing the label tiger can be considered as similar

    if they include this information in the metadata or if this label is found in any of

    the elementary units that compose the content.
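
    The label-based similarity rule described above can be sketched with a small recursive data structure. The Asset class, its fields and the helper functions are hypothetical names introduced only for this illustration; they are not the Grafema data model.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Asset:
    """A digital asset: an elementary unit or a combination of other assets."""
    kind: str                                           # e.g. "text", "image", "video", "composite"
    labels: Set[str] = field(default_factory=set)       # labels held in the asset metadata
    parts: List["Asset"] = field(default_factory=list)  # elementary units composing the asset


def has_label(asset: Asset, label: str) -> bool:
    """True if the label is in the metadata or in any unit that composes the asset."""
    if label in asset.labels:
        return True
    return any(has_label(part, label) for part in asset.parts)


def share_label(first: Asset, second: Asset, label: str) -> bool:
    """Two assets are considered similar with respect to a label if both contain it."""
    return has_label(first, label) and has_label(second, label)
```

    With this structure, an image tagged tiger and a composite asset whose text unit mentions tiger are related by the same label, even though the label sits at different levels of each asset.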

    The workflow designed for Grafema (Figure 1.5) is based on low-level operators

    that are independently processed. The information obtained from these



    Figure 1.4: Grafema Assets

    operators is then ported to a higher level of abstraction by using data mining

    techniques. The obtained information is then introduced in a semantic model

    and stored in a database. The similarity of two assets can be then computed ac-

    cording to this semantic model, but it is not limited to this metric. An iterative

    process starts and enables the calculation of similarity metrics between assets

    of the same type that belong to different instances. This iterative process is the

    basis of the Grafema architecture (Figure 1.6) and provides a new paradigm of

    content search and retrieval based more on a browsing process than on a pure text

    based search.

    Figure 1.5: Grafema System Workflow


    Conclusions

    The results of Grafema have shown the big potential of iterative processes for mul-

    timedia searching. Even if the tests have been carried out with limited datasets in

    terms of size and domain complexity, the results show that text based search can be dramatically improved when datasets include high volumes of multimedia content.

    Figure 1.6: Grafema System Architecture

    Regarding the state of the art, the annotation and individual metrics as well

    as the unsuitability of most common database solutions for multimedia data are

    still the main drawbacks that limit the potential of this kind of system.

    1.2.8 IQCBM

    Title: Image Query by Compression Based Methods

    Project typology: Industrial project.


    Consortium: DLR (German Aerospace Center).

    Summary

    The goal of this project was to create low-level operators and define distance met-

    rics for satellite imagery that would be applied during the ingestion process of the



    delivered streams. The main idea behind these operators was to gather prior characteristic

    information that could be useful for later retrieval operations. The lack

    of knowledge regarding the queries that may be applied during the retrieval process

    made it difficult to define low-level features, which could not be focused on any specific aspect.

    The domain of Remote Sensing is not as broad as those related to the audiovisual

    sector, but it is still too big and complex to be explicitly defined. Moreover,

    new definitions and relationships could be dynamically introduced.








    Figure 1.7: IQCBM System Architecture (pre-processing, analysis, indexing and query/ranking frameworks, with MonetDB for storage and execution)

    In order to address the lack of prior knowledge, global features were consid-

    ered more adequate than the local ones. The first algorithm implemented in this

    project was based on the codewords provided by a Lempel-Ziv compressor as suggested

    by Watanabe et al. [WSS02]. The L0 distance (Equation 1.1) was used as

    a metric for the codewords related to each element (in this case an element is

    represented by each of the patches obtained after a tiling process applied to the

    multi-resolution satellite imagery).

    $d_{L_0} = \sum_{i=1}^{n} |x_i - y_i|^{0}$, where $0^{0} = 0$    (1.1)

    Conclusions

    The developed system (Figure 1.8) was tested against the Corel 1000 dataset [Cor]

    and a subset of the GeoEye imagery [Glo], and obtained good accuracy.

    The length of the feature vector for each item was variable and was an attribute by


    itself as it provides a measure of the complexity of the image. However, in terms

    of scalability, the average length (several thousands of codewords) obtained by

    this algorithm might become a limitation.
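
    The codeword-based comparison used in IQCBM can be sketched as follows. The sketch assumes an LZ78-style parse as the Lempel-Ziv variant and represents each patch by the presence or absence of each codeword over the joint vocabulary; both choices, as well as the function names, are assumptions made for illustration rather than the project's actual implementation.

```python
from typing import List, Sequence, Set


def lz78_codewords(data: bytes) -> Set[bytes]:
    """Collect the phrases (codewords) produced by an LZ78-style parse."""
    dictionary: Set[bytes] = set()
    phrase = b""
    for value in data:
        phrase += bytes([value])
        if phrase not in dictionary:
            dictionary.add(phrase)   # new codeword found, start a new phrase
            phrase = b""
    return dictionary


def l0_distance(x: Sequence[float], y: Sequence[float]) -> int:
    """Count the components in which x and y differ (|x_i - y_i|^0 with 0^0 = 0)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)


def codeword_distance(first: Set[bytes], second: Set[bytes]) -> int:
    """L0 distance between binary presence vectors over the joint vocabulary."""
    vocabulary: List[bytes] = sorted(first | second)
    x = [1 if word in first else 0 for word in vocabulary]
    y = [1 if word in second else 0 for word in vocabulary]
    return l0_distance(x, y)
```

    Under these assumptions the distance equals the number of codewords that appear in only one of the two dictionaries, and the dictionary size itself reflects the complexity of the patch.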

    Figure 1.8: Screenshots of the IQCBM user interface

    As a result of this project, a deep study of the current trends of the community was carried out [QGO13]. Moreover, a new global feature extraction algorithm

    was developed based on the ideas of Kadyrov et al. [KP98,KP01,KP06].

    1.2.9 Relationship of projects and scientific activity

    Figure 1.9 shows a summary of the main scientific activity around the aforementioned

    projects. It can be observed that different projects share the same research

    activities and scientific background. However, this activity sharing does not imply

    the reusability of previous developments as different semantic domains require

    specific low-level features, metrics, mining techniques, etc.

    In order to minimize these barriers to a higher practical re-usability, a new

    architecture is proposed in this work.



    Figure 1.9: Relationship between R&D projects and scientific activity in multimedia

    content analysis.


    By three methods we may learn

    wisdom: First, by reflection, which

    is noblest; Second, by imitation,

    which is easiest; and third by ex-

    perience, which is the bitterest.

    Confucius

    CHAPTER 2

    Computer Vision from a Cognitive Point of View

    The main approach for computer vision tasks has been based on the identifi-

    cation of low-level features that can be used to segment, identify or determine

    higher abstraction levels. Following the three learning methods stated by Confucius

    in the previous quotation, we can say that this approach provides knowledge

    to the system by imitation. By giving this explicit knowledge based on existing

    datasets or contents that have been used to allow researchers to understand the

    relationship between the identification of real world phenomena and specific

    low-level features, the computer vision system just reproduces the experiments

    with different datasets. This process offers high performance results for narrow

    domains but it cannot be reused in other contexts and it is not a scalable ap-

    proach. The feature space created by these low-level operators easily gets too

    complex for manual (and typically linear) thresholding techniques. For those

    cases where the behavior of a set of low-level feature extractors is too complex

    to model, data mining techniques are applied. In those cases, we can move to

    the third level of Confucius' statement. The experience of dealing with this data

    (training for supervised classification and other specific metrics or criteria for

    clustering) provides an adaptive behavior within the system to create the regions

    or hyperplanes that fit best for each specific problem.

    The use of ontologies introduces a new way of adding formal explicit knowl-

    edge to the system. This is typically carried out by establishing concepts and


    Figure 2.1: Information Retrieval Reference Model [Mar11]

    relationships among them, defining a domain in this way. One common use of

    ontologies is to establish shared vocabularies and taxonomies between scientists or professionals. However, from a cognitive system perspective, the most powerful

    characteristic of ontologies is the capability of inference, which creates new

    rules that were not explicitly defined. The main drawback of ontologies comes

    from the fact that broad complex domains such as those related to common

    vision understanding cannot be specifically defined, mainly because of the size, the

    complexity and the fuzziness of this kind of domain.

    Content Based Image Retrieval (CBIR) systems can be considered as one of the

    branches of cognitive vision since they require the four functionalities considered

    as the pillars of a cognitive vision system: detection, localization, recognition and

    understanding [Ver06]. Marcos et al. propose a reference model that addresses the

    use of ontologies for multimedia retrieval purposes [MIOF11]. This work presents

    a reference model (Figure 2.1) based on a semantic middleware. The main goal

    of this approach is to create a layer to deal with semantic functionalities (e.g.,

    knowledge extraction, semantic query expansion, ...).

    Marcos proposes in his PhD work [Mar11] the use of the semantic middleware




    to automatically generate annotations of the multimedia assets. This approach,

    initiated in the Rushes project by using a set of low-level features and applying

    fuzzy reasoning to the information provided by those modules, offered good results

    for narrow domains, but the system was unable to deal with a large number of different low-level features and did not perform well in broad complex domains.

    One of the main drawbacks of this architecture was the fact that all

    low-level features were considered at the same level when no prior information

    was given.

    2.1 Mandragora Framework

    In order to overcome this scalability drawback, we presented a novel architecture

    called Mandragora [OMK+09]. This architecture enhances the metadata with

    new labels that can be ported to the semantic layer by using a two step itera-

    tive approach. The implicit and explicit knowledge about a certain domain can

    be introduced in the system with a combination of classifiers and the semantic

    middleware. This combination allows the modeling of bigger and more complex

    domains [SASK08] and reduces the semantic gap by connecting low-level features

    with high-level hypotheses and reinforcement factors. The reinforcement factors

    make it possible to extend the dimensionality of the domain and provide the framework for

    specific analysis methods.

    The main idea behind this two step approach is to break big domains in sub-

    domains that are more homogeneous both semantically and in terms of low-level

    features. Then, specific feature extractors and semantic definitions can be used

    with much higher precision. One of the key aspects of this framework is the initial

    domain estimation, the hypothesis that will be considered by the next layer to

    launch domain specific analyzers that afterwards will feed the semantic middle-

    ware. If the results of this second step confirm the characteristics of the estimated

    domain, the hypothesis will be accepted and the elements identified in the con-

    tent will be considered as descriptors of this specific asset. Otherwise, the process

    will be restarted with a different hypothesis.
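
    The two step hypothesis-and-confirmation loop described above can be sketched as follows. All names (annotate, the estimator and analyzer callables, accept_threshold) are hypothetical and only illustrate the control flow; they are not the Mandragora implementation.

```python
from typing import Callable, Dict, List, Optional, Tuple

Features = Dict[str, float]
# Step 1: scores each candidate domain from global features.
DomainEstimator = Callable[[Features], Dict[str, float]]
# Step 2: domain specific analysis returning (labels, confirmation score).
DomainAnalyzer = Callable[[Features], Tuple[List[str], float]]


def annotate(
    features: Features,
    estimate_domains: DomainEstimator,
    analyzers: Dict[str, DomainAnalyzer],
    accept_threshold: float = 0.5,
) -> Optional[Tuple[str, List[str]]]:
    """Try domain hypotheses from most to least likely until one is confirmed."""
    ranked = sorted(estimate_domains(features).items(),
                    key=lambda item: item[1], reverse=True)
    for domain, _score in ranked:
        analyzer = analyzers.get(domain)
        if analyzer is None:
            continue                      # no specific analyzer for this hypothesis
        labels, confirmation = analyzer(features)
        if confirmation >= accept_threshold:
            return domain, labels         # hypothesis accepted
    return None                           # every hypothesis was rejected
```

    In this sketch a rejected hypothesis simply triggers the next most likely domain, which mirrors the restart step of the framework while keeping the domain specific analyzers independent of each other.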

    2.2 Image Processing and AI Approach

    From an Artificial Intelligence perspective we can consider the climbing on the

    DIKW Pyramid (Figure 2.3) as the process that our system has to follow from raw





    Figure 2.2: Mandragora Architecture for automatic video annotation [OMK+09]

    unannotated images to structured content with semantic information. The main

    issue consists of the semantic gap between the mathematical representation obtained by the developed operators and the high abstraction level concepts that

    are intended to be discovered by using such low-level features. Smeulders et al.

    define the semantic gap as "... the lack of coincidence between the information

    that one can extract from the visual data and the interpretation that the same data

    have for a user in a given situation". According to this definition, we can consider

    the semantic gap as the distance between data and wisdom in the DIKW pyramid.

    The typical approaches are top-down (ontologically driven approaches




    that build domain definitions by creating relationships between high level con-

    cepts) and bottom-up, automatic annotation or labeling approaches that try to

    discover correspondences between high level annotations and automatically extracted

    features [HSL+06]. Both approaches can also be combined in the same process.

    Figure 2.3: DIKW Pyramid

    Most bottom-up approaches rely on data-mining techniques to move from

    low-level mathematical representation to classes that will be at a higher abstraction

    level. There is an enormous diversity of supervised and unsupervised

    classification or regression techniques, methods for feature space analysis, algorithms

    for attribute selection, etc. Thus, each specific problem requires the set

    of tools and algorithms that best suits its characteristics and requirements

    (type of attributes and classes, dimensionality, dataset size, computational cost,





    The difference between stupidity

    and genius is that genius has its


    Albert Einstein


    3 Domain Identification

    As it has been stated in the previous section, the domain identification is one

    of the key issues of cognitive vision as it allows the use of contextual information.

    Current best performing systems are mainly those where the size and the complexity

    of the domain are relatively low. Deng et al. [DBLFF10] study

    the effects of dealing with more than 10,000 categories. The results show that:

    Computational issues become crucial in algorithm design.

    Conventional wisdom from a couple of hundred image categories on rela-

    tive performance of different classifiers does not necessarily hold when the

    number of categories increases.

    There is a surprisingly strong relationship between the structure of the

    WordNet and the difficulty of visual categorization.

    Classification can be improved by exploiting the semantic hierarchy.

    The process carried out by Deng et al. is based on state of the art descriptors

    such as GIST [OT01] and SIFT [Low99]. The classification process uses Support

    Vector Machines and the dataset includes more than 9 million assets.

    Popular AI development results such as Deep Blue against Kasparov [Dee],

    commonly considered as a great step in AI where machines are able to beat human

    minds, are clear cases where the domain and the rules that define it are rather

    simple, while the combinatorial space derived from it becomes huge. For those cases,


    Figure 3.1: Watson DeepQA High-Level Architecture [FBCC+10]

    brute force algorithms can defeat human experience and heuristics capabilities.

    In the case of Deep Blue its domain dependence was so high that even some

    hardware components were specifically designed for chess playing purposes.

    A step forward was made by Watson [Wat], which in 2011 won the Jeopardy! prize

    against former winners. In this case, Watson was able to process natural language

    by identifying keywords and accessing 200 million pages of structured and

    unstructured content. As the IBM DeepQA Research Team (developers of Watson)

    states when referring to Watson: "This is no easy task for a computer,

    given the need to perform over an enormously broad domain, with consistently

    high precision and amazingly accurate confidence estimations in the correctness of

    its answers". However, even if the constraints to perform this task are much harder

    than for chess playing, apart from the natural language processing module, the

    task of playing Jeopardy! can be considered as an advanced text search engine

    that does not require prior contextual knowledge, as can be observed in its architectural design (Figure 3.1).

    The current state of the art is full of AI approaches that face the same limitation

    observed in these two examples. They obtain a very good performance in a

    specific narrow domain but fail when it scales up or when the same system is ap-

    plied for a different problem. Current multimedia information retrieval systems

    are exactly in this situation where contents belonging to specific contexts can be

    successfully managed but have strong limitations of flexibility and scalability.




    3.1 Domain characterization for CBIR

    The importance of semantic context is very well known in Content Based Image

    Retrieval (CBIR) [SF91,TS01]. This is especially relevant for broad-domain data

    intensive multimedia retrieval activities such as TV production and marketing or

    large-scale earth observation archive navigation and exploitation. Most modeling

    approaches rely on local low-level features, based on shape, texture, color etc. The

    drawback of these methods is that the characterization of the context requires

    prior contextual information, introducing a chicken-and-egg problem [TMF10]. A

    possible approach to reduce this dependency involves the exploitation of global

    image context characterization for semantic domain inference. This prior in-

    formation on scene context could represent a valuable asset in computer vision

    for purposes ranging from regularization to the pre-selection of local primitive

    feature extractors [SWS+00].

    3.1.1 Broadcasting

    The broadcasting sector has experienced a deep transformation with the intro-

    duction of digital technologies. All internal work-flows have been affected by the

    fact of representing content digitally. Regarding the Multimedia Asset Manage-

    ment (MAM) systems, before the content was digital, all assets were centralized

    and managed by documentalists/librarians, professionals who, following a rigid

    taxonomy, were responsible for annotating, storing and retrieving the content.

    Therefore, the work-flow was organized so that documentalists offered

    the content management service to editors. Since the digitalization of the ingest

    and delivery processes, editors can directly and concurrently access the content

    they are looking for. This offers great advantages in terms of efficiency, allowing

    non-linear editing and minimizing access times. However, this new work style

    introduces many more inconsistencies in the metadata, since contents are concurrently annotated

    by users that do not strictly follow a given taxonomy. Moreover, in order

    to create direct search and retrieval services, content annotations must be richer

    and better since editors do not have the knowledge of documentalists to browse

    among millions of assets. Manual annotation is too expensive

    in most cases to obtain this improved metadata, and automatic annotation systems are

    not able to characterize high abstraction level categories, especially due to the size

    and complexity of the broadcasting context.


    From a technical point of view, there are many industrial solutions and stan-

    dards for metadata (SMEF, BMF, Dublin Core, TV Anytime, MPEG-7, SMPTE

    Descriptive Metadata, PB Core, MXF-DMS1, XMP etc.) that offer good retrieval

    characteristics. However, all these technologies and specifications rely on a previously annotated dataset that in most practical cases cannot be populated at an

    affordable cost.

    Alternative methods for massive content annotation

    The explosion of prosumers and web video portals offers a new way of enriching

    content with metadata. Most of these platforms offer the possibility of leaving

    comments that can be used as annotations afterwards. However, these annotations

    are always unstructured and their confidence is much lower. Therefore, they cannot be used directly as a source of metadata.

    On the other hand, speech processing tools, which nowadays are being used

    to create subtitles, offer another source of textual information that is very

    representative of the content. The use of the audio channel to create metadata faces

    the same problem of unstructured text as user comments. However, it offers a

    very rich and highly related text that fits very well with current text based search


    3.1.2 Earth Observation, Meteorology

    An extensive review of the state of the art of content-based retrieval in Earth

    Observation (EO) image archives is presented in Section 7.2. Compared with

    the broadcasting application field, EO archives deal with even bigger

    data volumes (approaching the zettabyte)1. The assets they contain are largely

    under-exploited: according to figures reported in conferences, up to 95% of

    records have never been accessed.

    The situation is exacerbated by the growing interest in and availability of met-

    ric and sub-metric resolution sensors, due to the ever-expanding data volumes

    and the extreme diversity of content in the imaged scenes at these scales. As

    in the broadcasting sector, interpreters who manually annotate archived

    content are expensive and tend to operate in applicative domains with stable,

    well-formalized requirements rather than on the open-ended needs of the remote

    sensing community at large or of broad efforts like GEOSS [KYDN11].


    1 The data volume for the EOC DIMS Archive in Oberpfaffenhofen is projected to about 2

    petabytes in 2013 (Christoph Reck, DLR-DFD, presentation during ESA EOLib User Requirements

    workshop, ESRIN November 17, 2011)


    Figure 3.2: Idealized query process decomposition into processing modules and

    basic operations based on an adaptation of Smeulders et al. [SWS+00].

    Regarding the domain, the EO semantic space is much more focused than the

    one required for broadcasting content. In fact, domain-specific ontologies help

    to define concepts at a finer granularity. For specific uses, such as

    disaster management in coastal areas, ontologies for Landsat1 and MODIS2

    imagery based on the Anderson classification system [And76] have been developed.

    However, given the huge amount of data, the semantic gap still remains an

    issue for automatically populating these specific ontologies. A general decomposition

    of a theoretical query process is depicted in Figure 3.2.



    A particularity of the EO domain is the diversity of data types provided

    by the instruments installed in a satellite, most of which are affected

    by noise and distortions produced by distance, the atmosphere, etc.

    Envisat (Environmental Satellite), launched in 2002 and operated by ESA (European Space Agency), includes the following instruments1 (Figure 3.3):

    ASAR: Advanced Synthetic Aperture Radar. Operating at C-band, ASAR ensures

    continuity with the image mode (SAR) and the wave mode of the ERS-1/2


    MERIS: a programmable, medium-spectral resolution imaging spectrometer

    operating in the solar reflective spectral range. Fifteen spectral bands can be

    selected by ground command, each of which has a programmable width

    and a programmable location in the 390 nm to 1040 nm spectral range.

    AATSR: Advanced Along Track Scanning Radiometer. It provides continuity of the

    ATSR-1 and ATSR-2 data sets of precise sea surface temperature (SST), at levels

    of accuracy of 0.3 K or better.

    RA-2: Radar Altimeter 2, an instrument for determining the two-way

    delay of the radar echo from the Earth's surface to a very high precision: less than a nanosecond. It also measures the power and the shape of the

    reflected radar pulses.

    MWR: Microwave Radiometer, for the measurement of the integrated

    atmospheric water vapour column and cloud liquid water content, as correction

    terms for the radar altimeter signal. In addition, MWR measurement

    data are useful for the determination of surface emissivity and soil moisture

    over land, for surface energy budget investigations to support atmospheric

    studies, and for ice characterization.

    GOMOS: measures atmospheric constituents by spectral analysis of the spectral

    bands from 250 nm to 675 nm, 756 nm to 773 nm, a