Content-Based Image Retrieval for Medical Applications
Igor Francisco Areias Amaral
Content-Based Image Retrieval for
Medical Applications
Faculdade de Ciências da Universidade do Porto
October 2010
Igor Francisco Areias Amaral
Content-Based Image Retrieval for
Medical Applications
Thesis submitted to the Faculdade de Ciências
da Universidade do Porto for the degree
of Master in Mathematical Engineering
Dissertation carried out under the supervision
of
Prof. Doutor Jaime dos Santos Cardoso (INESC-Porto)
and
Prof. Doutor Joaquim Fernando Pinto da Costa (DMA-FCUP)
Porto, October 2010
To my parents, José and Maria
Acknowledgments
It is finally done. These first words you read were the last to be written in this
document. Looking back, this work reflects a hard learning process. Now, at the end,
I feel that I know little; I learned that there is much more to learn. However, I was
never alone during this task.
With the help of my thesis supervisors, Professors Jaime Santos Cardoso and
Joaquim Pinto da Costa, I was able to acquire the necessary motivation and knowledge
to achieve my goals. They provided the freedom to pursue my own ideas and, at the
same time, were rigorous in reviewing my work. For that, and for granting me the
opportunity to work in such a remarkable research field, I am thankful. They had a
fundamental role in bringing this document to life.
While at INESC, where this work was developed, I also had the opportunity to meet
amazing people and make new friends. They not only made many small contributions to
this work, during our informal talks at lunch or coffee breaks, but are also responsible
for the amazing atmosphere inside the institution.
My family was also very important to me during these last months, especially my
parents, who were supportive on every occasion, giving me all the chances I have had.
Lastly, Cristina, who taught me, during these last years, the importance of having
someone waiting for you.
Igor Francisco Areias Amaral
October, 2010
Abstract
Advances in digital imaging technologies and the increasing prevalence of picture
archival systems have led to an exponential growth in the number of images generated
and stored in hospitals in recent years. Thus, automatic medical image annotation
and categorization can be very useful for the purposes of image database management.
Conventional image retrieval systems are based on textual annotation, where key
information about the image is stored. In medical images this forms an essential
component of a patient's record. However, on many occasions this information is
lost as a consequence of image compression or human error. Also, given the
number of different standards adopted for medical image annotation, building a
comprehensive ontology of medical terms is not always consensual. Recently,
advances in Content-Based Image Retrieval have prompted researchers towards new
approaches to information retrieval for image databases. In medical applications it
has already met some degree of success in constrained problems.
This document addresses the problem of medical image annotation relying only on
pictorial information, where images are classified by means of a hierarchical standard.
We present a comprehensive survey of related work and a description of the
mathematical tools used to achieve our goals. Our methodology consists of approaches
commonly applied to this problem as well as an implementation of our own ideas,
aiming to explore the hierarchical nature of the standard used for annotation.
Afterwards, we improve our initial results by means of two merging strategies and
provide an interpretation of our results.
Keywords: medical images, image descriptors, classification, support vector machines.
Resumo
Advances in digital imaging technology, together with the increased use of image
archival systems, have led in recent years to a growth in the number of images
generated and stored in the hospital environment. As a consequence, the automatic
annotation and categorization of medical images can be very useful for database
maintenance.
Conventional approaches to image retrieval systems rely on textual annotations in
which crucial information about the image content is stored. However, this
information is frequently lost as a consequence of image compression or human
error. Additionally, given the number of different standards adopted for medical
image annotation, the construction of an ontology of medical terms is not always
consensual.
Recently, advances in content-based image retrieval have driven researchers
towards new approaches to retrieving images from databases. In medical
applications some degree of success has already been achieved for specific problems.
This document addresses the problem of medical image annotation based only on
visual information, where images are annotated according to a hierarchical standard. A
survey of related work is presented, together with a description of the mathematical
tools used to achieve the proposed goals. Our approach consists of methods commonly
used for this problem as well as the implementation of new strategies developed with
the aim of exploring the hierarchy of the standard used for annotation. Later, through
annotation fusion methods, we improve the initial results, followed by an
interpretation of them.
Keywords: medical images, image descriptors, classification, support vector
machines.
Contents
Introduction
1.1 Motivation
1.1.1 Concept-based systems
1.1.2 Medical image standards and ontologies
1.1.3 Concept-based retrieval limitations: the road to CBIR
1.2 Content-based image retrieval
1.2.1 CBIR systems
1.2.2 Smeulders' CBIR paradigm formalization
1.2.3 CBIR future work overview
1.2.4 CBIR in medical applications
1.3 Structure of this document
1.4 Goals
1.5 Main contributions
Related Work
2.1 The IRMA code
2.2 Error evaluation for the IRMA code
2.3 ImageCLEF Medical Evaluation Tasks
2.3.1 2005 Medical Annotation Task
2.3.2 2006 Medical Annotation Task
2.3.3 2007 Medical Image Annotation Task
2.3.4 2008 Medical Image Annotation Task
2.3.5 2009 Medical Image Annotation Task
2.4 Other IRMA database related work
Background Information
3.1 The image domain
3.1.1 Image properties
3.1.1.1 Color
3.1.1.2 Shape
3.1.1.3 Texture
3.1.1.4 Interest points
3.1.2 Image descriptors
3.1.2.1 Tamura textures
3.1.2.2 Edge Histogram Descriptor (EHD)
3.1.2.3 Color Layout Descriptor (CLD)
3.1.2.4 Scalable Color Descriptor (SCD)
3.1.2.5 Color and Edge Directivity Descriptor (CEDD)
3.1.2.6 Fuzzy Color and Texture Histogram (FCTH)
3.1.2.7 Spatial Envelope (GIST)
3.1.2.8 Speeded Up Robust Features (SURF)
3.2 Support Vector Machine (SVM)
3.3 2007 Medical Annotation Task database
Methodology
4.1 Framework Description
4.1.1 Feature Extraction
4.1.1.1 Global descriptors
4.1.1.2 Bag-of-words model
4.1.2 Model Training and Image Annotation
4.1.3 Methods Fusion
Results
5.1 Feature Extraction
5.2 Annotation
5.3 Semantically Meaningless Codes
5.4 Fusion
Conclusions and Future Work
References
Chapter 1
Introduction
1.1 Motivation
The image is probably one of the most important tools in medicine, since it provides a
means for diagnosis, for monitoring drug treatment responses and for the disease
management of patients, with the advantage of being a very fast, non-invasive procedure
with very few side effects and an excellent cost-effectiveness ratio.
Hard-copy image formats, i.e., analog screen films, were the initial support for medical
images, but they are becoming rarer. Maintenance, storage room and the amount of material
needed to display images in this format contributed to its disuse. Nowadays digital images,
the soft-copy format, lack the previously mentioned problems while offering the possibility
of text annotations in metadata format. Table 1.1 gives an overview of digital types, sizes
and numbers of images per exam in medical imaging. Curiously, this transition from
hard-copy to soft-copy images is still the focus of an interesting debate related to human
perception and interpretation issues during exam analysis [1].
Exam type                               One image (bits)   Images/exam   One examination
Nuclear medicine (NM)                   128x128x12         30-60         1-2 MB
Magnetic resonance imaging (MRI)        256x256x12         60-3000       8 MB up
Ultrasound (US)*                        512x512x8          20-240        5-60 MB
Digital subtraction angiography (DS)    512x512x8          15-40         4-10 MB
Digital microscopy                      512x512x8          1             0.25 MB
Digital color microscopy                512x512x24         1             0.75 MB
Color light images                      512x512x24         4-20          3-15 MB
Computed tomography (CT)                512x512x12         40-3000       20 MB up
Computed/digital radiography (CR/DR)    2048x2048x12       2             16 MB
Digitized X-rays                        2048x2048x12       2             16 MB
Digital mammography                     4000x5000x12       4             160 MB
*Doppler US with 24-bit color images
Table 1.1 – Types and sizes of some commonly used digital medical images (From [2]).
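The per-examination sizes in Table 1.1 follow from the image dimensions, bit depth and number of images per exam. A quick sketch of the arithmetic, assuming uncompressed storage:

```python
def exam_size_mb(width, height, bits_per_pixel, images_per_exam):
    """Uncompressed size of one examination in megabytes (1 MB = 1024*1024 bytes)."""
    total_bits = width * height * bits_per_pixel * images_per_exam
    return total_bits / 8 / (1024 * 1024)

# A CT exam of 40 slices at 512x512x12 bits:
# 512 * 512 * 12 * 40 bits = 15 MB, matching the table's "20 MB up" order of magnitude
# (real storage is often larger because 12-bit pixels are packed into 16-bit words).
print(round(exam_size_mb(512, 512, 12, 40), 2))

# A single digital microscopy image at 512x512x8 bits: 0.25 MB, as in the table.
print(exam_size_mb(512, 512, 8, 1))
```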
With the increase in data storage capacity and the development of digital imaging devices
designed to increase efficiency and produce more accurate information, a steady growth in the
number of medical images produced can easily be inferred. A good example of this trend is the
Radiology Department of the University Hospital of Geneva, which alone went from producing
12,000 medical images a day in 2002 [3] to 50,000 medical images a day in 2007 [4]. The main
contributors to these numbers are video frames from cardiac catheterizations and endoscopies.
Aside from the obvious usefulness of medical images for patient diagnosis and treatment, this
huge amount of data also provides an excellent resource for researchers in the medical field.
1.1.1 Concept-based systems
With the exponential increase of medical data in digital libraries, it is becoming more and
more difficult to execute certain search-related tasks. Because textual information
retrieval is already a mature discipline, one way to overcome this problem is to use metadata
for image indexation, where key descriptions of the image's content and context can be stored.
For medical images we could store, for instance, patient identification, the type of exam and
its technical details, or even a small text comment concerning clinically relevant information.
With this information annotated, text-matching techniques, mediated by a thesaurus, can be
applied to retrieve images satisfying a given search statement, performed by evaluating the
similarity between the search statement and the metadata. Evaluation of the output can
motivate a later thesaurus expansion, new rules for validation and matching, or a new search
statement. This is called text-based or concept-based image retrieval. A diagram of this type
of system is depicted in Figure 1.1.
Figure 1.1 – A basic diagram representing a concept-based image retrieval system (From [5]).
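The text-matching step described above can be sketched with a simple token-overlap similarity between the search statement and each image's metadata. This is a toy stand-in for the thesaurus-mediated matching of a real system; the metadata records and query below are invented for illustration:

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two text fields (Jaccard similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Hypothetical metadata records: image id -> annotation text
metadata = {
    "img01": "chest x-ray frontal adult",
    "img02": "hand x-ray pediatric",
    "img03": "chest ct axial adult",
}

query = "adult chest x-ray"
# Rank images by similarity of their metadata to the search statement
ranked = sorted(metadata, key=lambda k: jaccard(query, metadata[k]), reverse=True)
print(ranked[0])  # img01 shares the most query terms
```

A real concept-based system would add thesaurus expansion (synonyms, broader/narrower terms) before matching, which is exactly where the ontology problems of the next section arise.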
Concept-based systems can be traced back, in a much wider domain, to the end of the 1970's
according to Rui [6], and are still used in photo- and video-sharing websites like Flickr¹,
Google image search² or YouTube³.
1.1.2 Medical image standards and ontologies
To foster the concept-based approach, a nomenclature of medical terms together with a
relational or hierarchical model - a standard - is needed to bridge the content of the medical
image and its context. Standards regarding image compression formats, database
programming languages and network protocols are also essential, as they enable mutual
understanding among users with different backgrounds in user-machine environments, as well
as interchangeability of data via machine-machine protocols.
The ACR-NEMA standard for medical images was first developed in the 1980's by a joint
venture between the American College of Radiology (ACR) and the National Electrical
Manufacturers Association (NEMA). Later, in 1992, after the inclusion of network protocols
and numerous glossary revisions, ACR-NEMA was renamed Digital Imaging and
Communications in Medicine⁴ (DICOM), and it is the most common standard used for specifying
components of a medical imaging system. Other standards like SNOMED⁵, MeSH⁶, HL7⁷,
GALEN⁸, ICD-10⁹ and UMLS¹⁰ were also developed, alongside other types of solutions that
define interoperability between them: the IHE¹¹ uses DICOM/HL7 for internal/external
communications without being a standard itself. The "order entry" issues, related to specific
information demanded by law but only an optional part of the DICOM header, also led to the
development of the Japanese JJ1017 standard [7]. In Japan the medical environment works with
more detailed information not fully covered by the DICOM standard. After failing to
change the DICOM standard to suit these needs, Japan developed its own system as an
extension of DICOM.
The degree to which the ontology of any standard can be a transparent representation of the
content underlying medical images is questionable. Understanding ontology as a formal way to
codify semantics that are representative of a reason, we face the difficulty of choosing an adequate
1 http://www.flickr.com
2 http://images.google.com
3 http://www.youtube.com
4 http://medical.nema.org
5 http://www.snomed.org
6 http://www.nlm.nih.gov/mesh/meshhome.html
7 http://www.hl7.org
8 http://www.opengalen.org
9 http://www.who.int/classifications/icd
10 http://www.nlm.nih.gov/research/umls
11 http://www.ihe.net
terminology that captures the meaning of the image. Very often the problem is reversed, when
such terminology is already well defined but the concepts that we are trying to represent become
the subject of attention [8]. This is particularly evident in Emotional Information Retrieval (EmIR)
[9]. Furthermore, meaning is not a well-defined, quantifiable attribute but, as Heidorn defines it
[10], a property ascribed by human analysis of the image, bringing to bear a combination of
objective and subjective knowledge in a sociocognitive process. Then, on the one hand, words can
be used to denote the image content if its meaning is straightforward and literal, which is not
very usual. On the other hand, if the image content can be connoted with different layers of
knowledge, then words are not enough to describe its meaning [8].
1.1.3 Concept-based retrieval limitations: the road to CBIR
In practice, the conceptualization of a general thesaurus of medical terms consumes many
resources and demands extensive collaboration efforts where consensus is hard to reach. It is
reasonable to use inductive approaches by starting with more specific standards and attempting
generalization later. Such a strategy is used in the composite SNOMED-DICOM micro-glossary
[11]. Nevertheless, the standards presented are not ineffectual, since they are used in several
Picture Archive and Communications Systems (PACS). Given the amount of images in a
database, annotation by human hand can be a time-consuming and cumbersome task where
the subjectivity of perception can lead to unrecoverable errors. A study of medical images using
DICOM headers revealed 15% of annotation errors of both human and machine origin [12].
The number of different languages that can be used for annotation is extensive and may lead to
translation/interpretation errors during a search statement or when databases are merged. It is
convenient to be aware of the prospect of re-indexing images due to the presence of an event
that changes the importance of a particular aspect, e.g., Forsyth's previously unknown famous
person photos [13], or the need to link the content of the image to a new search statement
possibility, e.g., Seloff's engineers searching for a misaligned mounting bracket present only in
an annotated astronaut training image [14]. From the foregoing it is clear that concept-based image
retrieval poses too many problems, both from the ontology point of view, as stated in the previous
section, and from a practical point of view. Other major obstacles for concept-based image
retrieval systems are the existence of homographs and the fact that the search statement, or
query, does not allow the user to switch and/or combine interaction paradigms [15] during text
transactions. The ideal system would relieve the human factor from the annotation task, by
doing it automatically, and allow image retrieval by content in its purest form, not by text
description. This is Content-Based Image Retrieval (CBIR).
1.2 Content-Based Image Retrieval
During our lives, and from a very early age, we have the ability to easily recognize
thousands of objects in many different conditions. Trying to understand how we do it is a deep
and complex subject. The pre-iconographic, iconographic and iconological formalism proposed by
Panofsky [16], generalized by Shatford [17] and extended by Shatford-Layne [18], provides the
notion that an image is not a single unit but an amalgam of generic, specific and abstract
content [16], where it may be necessary to determine which attributes will result in useful
groupings of images and which attributes should be left for the user to identify [18]. The
pre-iconographic elements imply a simple identification through familiarity; they are
representative of a very low level of knowledge in terms of human abstraction, but enough to
comprehend some factual information within the image. Pure forms, like volumes and lines, and
their disposition are at this level. The iconographic elements attempt to describe a motif or
groups of motifs associated with the pre-iconographic level, and can imply a statistical procedure
to identify those that are important or unimportant, depending on their role in the image.
Iconological interpretation is the highest level of knowledge that can be extracted from the image,
and it results from grouping the pre-iconographic and iconographic interpretations together with
reasoning: it is the symbolic value of the image [16]. The experience of the individual plays a
role at all stages of the formalism, exerting influence on the ability to group content based on
the image attributes [17]. Shatford's generalization of Panofsky's work comprises only the first
two levels, pre-iconographic and iconographic, replacing them by of and about relational
sentences. Therefore, an image can be of a generic/specific person, animal, thing, action,
condition, place or time of day, etc., and about abstractions symbolized by objects or beings,
actions, events, places or time [18]. While not rejected, real applications of such theoretical
models were attempted by Enser but met little success due to the dichotomous character of the
queries made by users in a concept-based retrieval system [19].
We can attempt some simplification by considering a two-step process: first, we retrieve
information from what we see; second, we categorize the scene and the objects within it using
a prior cognitive process. If we define images as two-dimensional representations of our
three-dimensional reality, then the same process holds. But what is the image content? There is
no precise answer to this question. Nonetheless, relationships between image properties like
color, shape, texture and interest points are certain to be fundamental for its characterization.
The goal of CBIR is to replicate this human ability of object recognition using a similar two-
step process: the use of quantified measures from the image that are believed to represent color,
shape, texture and interest points - the image descriptors - as an approach to human perception;
and the use of machine learning techniques, to create a model for the data, or similarity measures, to
interpret the image and establish the difference between two elements or groups of elements, as
an approach to human cognition.
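This two-step process can be sketched in miniature, with a normalized intensity histogram standing in for an image descriptor and Euclidean distance standing in for the similarity measure. The "images" below are toy lists of pixel values, not real data:

```python
import math

def histogram_descriptor(image, bins=4, max_val=256):
    """Step 1 (perception): a normalized intensity histogram of a flat pixel list."""
    counts = [0] * bins
    for px in image:
        counts[px * bins // max_val] += 1
    return [c / len(image) for c in counts]

def euclidean(u, v):
    """Step 2 (cognition): distance between two descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

dark = [10, 20, 30, 40] * 4      # toy image with mostly low intensities
light = [200, 210, 220, 230] * 4  # toy image with high intensities
query = [15, 25, 35, 45] * 4      # query image, also dark

d_dark = euclidean(histogram_descriptor(query), histogram_descriptor(dark))
d_light = euclidean(histogram_descriptor(query), histogram_descriptor(light))
print(d_dark < d_light)  # the query is judged closer to the dark image
```

Real descriptors (Tamura, EHD, SIFT, etc., covered in Chapter 3) are far richer, but the two-step structure - extract a numeric summary, then compare summaries - is the same.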
1.2.1 CBIR systems
In a typical CBIR system (Figure 1.2) the input from the user consists of one or more images,
a test set. Pictorial content is then extracted into image descriptors and stored in the form of
feature vectors. In the system there is a database of images, a training set, where information
extraction has already taken place and was used to choose the best models and/or
similarity measures for comparison. With the help of these models and similarity measures, the
test set is indexed and/or similar images are retrieved. Relevance feedback takes the results into
consideration and acts by weighting or ranking feature vectors to discriminate their
importance; deciding which image descriptors are or are not relevant for the query; changing
models and/or similarity measures; providing new model training and/or new similarity measure
definitions; or performing a new query. Human interaction can also be used as an integral part of
a CBIR system at this stage, not only when automatic methods fail. From a user perspective, a
CBIR system should meet, according to Chang [15], the general requirement of timely delivery
and easy accessibility of image and associated information for the user, at a resolution
appropriate for the intended task(s).
Figure 1.2 – A scheme of a typical CBIR system. Relevance feedback can be accomplished
using human interaction (From [20]).
The first theoretical CBIR system designs appeared in 1987 [21]. The first prototype of a
CBIR system appeared five years later, in 1992, and was developed by T. Kato [22] for an
electronic art gallery containing 205 pictures of paintings. Kato is also credited as the first to
use the term CBIR [23]. In his system, information extraction was performed by an adaptive
filter, based on the Weber-Fechner law of the human vision mechanism, to capture global and
local edge points. With this form of image abstraction, Query by Visual Example (QVE)
algorithms, based on correlation between the query image and the database, were employed for
the retrieval process. The best correlation was used as a similarity measure to match image
candidates to the given query.
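Kato's correlation-based QVE matching can be sketched in miniature: given binary edge maps flattened to vectors, the database image whose map correlates best with the query wins. The edge maps below are invented for illustration, and Pearson correlation stands in for whatever correlation measure the original system used:

```python
import math

def correlation(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Toy 3x3 binary edge maps, flattened row by row
query    = [1, 0, 0, 1, 0, 0, 1, 0, 0]  # vertical edge on the left
vertical = [1, 0, 0, 1, 0, 0, 1, 0, 0]
diagonal = [1, 0, 0, 0, 1, 0, 0, 0, 1]
database = {"vertical": vertical, "diagonal": diagonal}

# Retrieve the database image whose edge map best correlates with the query
best = max(database, key=lambda k: correlation(query, database[k]))
print(best)  # the vertical edge map correlates perfectly with the query
```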
The first commercial release of a CBIR system, the IBM Query by Image Content (QBIC)
[24], took place in 1995 and shaped the nature of future frameworks. Surveys of CBIR systems
can be found in Aigrain [25], Eakins [23] and Rui [6].
Historically, CBIR is a relatively recent research area, but one with numerous and diverse
application fields, well summarized in Eakins [23]. Smeulders [26] points to the lack of
computational capacity and digital imaging devices, and an underdeveloped Internet, as the main
causes that hampered serious research attempts in this area before 1995. Aigrain [25] criticizes
precisely the fact that at this time too much effort was being placed on information systems and
not on content processing. However, it is consensual that the lack of communication between
retrieval research and database systems contributed to the slower development of CBIR, perceived
as early as 1979 at the Conference on Database Applications of Pictorial Applications [26], with
the two fields remaining largely unrelated since. Datta [27] also states that effective means of
indexation were overshadowed by the research of efficient visual representations and similarity
measures. This opinion slightly contradicts Rui [6], who states that the stimulus given by the
introduction of the wavelet transform in the early 1990's had an impact on the growth of the
number of available image descriptors and, consequently, motivated the appearance of CBIR
systems. Other forms of image information retrieval for shape, color and texture, already
established at this time, found extensive use in CBIR, leading to the first Moving Picture Experts
Group (MPEG) standards in 1992. During this decade image descriptors also started to focus
not on general information about the whole or partitioned image but on interest points, aiming to
capture higher-level information. Such image descriptors were mainly influenced by the
works of Harris on corner detection [28] and Lindeberg on blob detection [29]. One of the major
advances in this type of descriptor came most probably in 1999, when Lowe presented the
Scale-Invariant Feature Transform (SIFT) [30], inspiring research into other affine
transformation detectors [31] invariant to certain image conditions. The extent to which these
interest points could be used changed when computer vision borrowed word frequency analysis
from text-search operations in the so-called bag-of-words or bag-of-features models [32] around
2003, an approach that is still gradually being unfolded today with the help of machine learning.
Undoubtedly, one of the major contributions to CBIR was the Internet boom and the arrival of
the first Internet browsers around 1995, which demanded
urgent tools to retrieve the information that could suddenly be accessed. Datta [27] reports an
exponential growth in the number of scientific papers made available by three main publishers,
from around 150 in 1995 to 1200 in 2008.
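The bag-of-words idea borrowed from text retrieval can be sketched as follows: each local descriptor extracted from an image is quantized to its nearest "visual word" in a codebook, and the image is then represented by the word counts. The codebook and descriptors below are toy 2-D points, not real SIFT or SURF features:

```python
def nearest_word(desc, codebook):
    """Index of the codebook entry closest (squared distance) to a descriptor."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(desc, codebook[i])))

def bag_of_words(descriptors, codebook):
    """Histogram of visual-word occurrences for one image."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_word(d, codebook)] += 1
    return counts

# Toy codebook of three visual words (2-D "local descriptors")
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Local descriptors extracted from one hypothetical image
descriptors = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.0, 0.8)]
print(bag_of_words(descriptors, codebook))  # [1, 1, 2]
```

In practice the codebook is learned by clustering (e.g., k-means over many training descriptors), and the resulting histograms are what a classifier such as an SVM consumes; this is the model revisited in Chapter 4.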
1.2.2 Smeulders' CBIR paradigm formalization δ
Notwithstanding all the contributions of the previous section, a proper formalization
of the whole CBIR paradigm is a necessity, as otherwise it would be hard to develop mission-
critical applications or claim the much needed level of consistency and integrity of a recently
independent research field. Somehow this remained ignored by early implementations, which did
not demand any kind of full understanding of the field, thus overlooking a broader overview of
the problems involved and a better refinement of the components of CBIR systems. Only in 2000
did Smeulders present a deep review towards a formalism for CBIR that has influenced
researchers ever since.
When CBIR systems are used for image retrieval or annotation, the output often does not
satisfy the given query. Smeulders calls this problem the semantic gap, or the lack of
coincidence between the information that one can extract from the visual data and the
interpretation that the same data have for a user in a given situation, and justifies this behavior
with the difficulty of connecting high-level concepts associated with the image to low-level
content in data-driven features. Considering that computers use numerical information only, when
an image is converted into digital format it is important to be aware of how much information is
lost during the process. This is the sensory gap, or the gap between the object in the world and
the information in a numerical/verbal/categorical description derived from an image recording of
that scene. Missing information can derive from clutter, illumination conditions, occlusion,
distortion, differences in camera viewpoint or any other accidental elements in the image.
The sensory gap is closely related to the variability of the image content, which
Smeulders categorizes into two opposite domains: the narrow domain, if the image has a limited
and predictable variability in all relevant aspects of its appearance, and the broad domain, if the
image has an unlimited and unpredictable variability in its appearance even for the same
semantic meaning. The distinction between image domains plays an important role in the design
of CBIR systems. Professional applications are usually domain-specific, dealing with narrow-
domain images for object recognition or a quantitative, objective description of their content.
Public applications use larger databases with broader-domain images, aimed at generic
applications for qualitative information retrieval.
δ To avoid excessive citation of the same source, all definitions presented in italics in this
section can be found in [26].
A formalism for the query was also formulated, to capture the essence of the user's
intention. If the user has no specific aim for the query, then a search by association is performed.
Systems satisfying this requirement use iterative refinement of the given examples, thus being
very interactive. If the search is made for objects belonging to a certain category, then we have a
category search. Systems for category searches rely on similarity measures that characterize an
image as part, or not, of a certain category. If the goal of the user is to search for a precise copy of
an image, it is said that the user targets the search, or performs an aimed search, where the system
must search for images matching the specific example. Depending on the query intention, Datta [27]
defines the user as a browser, surfer or searcher, respectively. With the image domain and user
intention defined, Smeulders reformulates the goal of CBIR systems in the following way:
the challenge for image search engines on a broad domain is to tailor the engine to the
narrow domain the user has in mind via specification, examples and interaction.
1.2.3 CBIR future work overview
There is room for development of CBIR in many directions. CBIR systems for narrow-
domain images have achieved a good degree of success; still, as the variability of images grows in
larger datasets, the problem grows deeper. From what has been presented so far it is possible to
point out some trends:
• Concept-based systems for image retrieval should not be ignored. Even if somewhat
independent from CBIR, the two approaches can complement each other in hybrid systems
where the integration of natural language and computer vision takes place. Aigrain [25]
mentions that this can help to capture rich semantic content of the image, like names,
places, actions or prices.
• A better understanding of what the user wants from the available information is also
fundamental for any further work. For medical images, a study of what a doctor looks
for when examining an image for diagnosis can be found in [33].
• From a database perspective there is also a lot of work to be done, since the
developments made so far in this field are poorly connected to the developments in
CBIR, being targeted more at increasing capacity than at organizing information
for future retrieval. Meanwhile, a good interdisciplinary relationship for CBIR research is
slowly being established between areas like machine learning, multimedia, computer
vision, information retrieval and human-computer interaction, according to Datta in 2007
[27], a need previously expressed by Cawkell in 1993 [34] and Rui in 1999 [6].
• Today, digital image representations consist mostly of color models like the additive
Red-Green-Blue (RGB), the Hue-Saturation-Value (HSV) or the grayscale. The
dependence on these color systems raises the question of whether they are sufficient to
provide information about the image. Specific color systems, like Tint-Saturation-Luminance
(TSL) for face detection, are to be considered a potential solution for specific problems
in CBIR.
• A better interpretation of semantic image similarity is also needed to define new metrics
for similarity measures, since these degrade as databases grow and are usually domain-
specific. Smeulders advocates the search for similarity outside the scope of histogram
similarity [26].
• To counter the semantic gap by linking low-level visual features to high-level
semantic meaning, more effort should also be placed in the research of additional
descriptors for a better characterization of the image, thus allowing, at the same time, a
decrease in the sensory gap. Image descriptors invariant to illumination conditions,
distortion, clutter, occlusion, etc., would reduce data from a broader to a narrower domain,
satisfying Smeulders' definition of the purpose of a CBIR system.
• New interface designs for user-machine interaction. Jain makes an original observation
about this subject in his blog1, noting developers' fear of appearing 'simplistic' when they
produce simple and useful systems rather than complex designs, and extending this
observation to academics, criticizing the excess of jargon used to obfuscate their ideas.
• One of Smeulders' concluding remarks [26] points to the necessity of classifying usage
types, aims and purposes in order to clearly evaluate whether a proposed system solves a
particular problem or just performs better than a previous system.
• New, and more, general or domain-specific public databases.
Not all the future work possibilities in CBIR are stated here, as they are very extensive; only a
general idea of the growth directions is given. It is worth pointing out one last aspect of CBIR that may be an
1 http://ngs.ics.uci.edu , December 18th 2009 entry.
important future work area that is largely ignored in CBIR surveys. As already noted, there was a
lack of proper formalism in the CBIR paradigm until the valuable contribution of Smeulders.
However, Smeulders never refers to any theoretical model of human visual cognition, like the
Panofsky/Shatford work, or to any attempt to implement such models, like Enser in his survey
[35]. It seems that unambiguous, level-based theoretical models of knowledge suitable for
integration in CBIR lie beyond any formalism definitions and may play a key role in
understanding how high-level concepts can be constructed by grouping low-level content.
1.2.4 CBIR in medical applications
CBIR in the medical field also presents a growing trend in publications [36]. Although the
number of experimental algorithms addressing specific problems and databases is growing, this
is still poorly reflected in the number of medical applications and frameworks. Only a few
systems exist with relative success. The CervigramFinder system [37] was developed to study
uterine cervix cancer. It is a computer-assisted framework where
local features from a user-defined region in an image are computed and, using similarity
measures, similar images are retrieved from a database. The Spine Pathology & Image Retrieval
System (SPIRS) [38] is a web-based hybrid retrieval system, working with both image visual
features and text-based information. It allows the user to extract spine x-ray images from a
database by providing a sketch/image of the vertebral outline. The retrieval process is based on
an active contours algorithm for shape discrimination. The Image Retrieval for Medical
Applications (IRMA) system [39] is a generic web-based x-ray retrieval system. It allows the
user to extract images from a database given an x-ray image query. Local features and similarity
measures are used to compute the nearest images. The SPIRS and IRMA systems were merged
to form the SPIRS-IRMA system, with the functionalities of both. More recently a CBIR
framework prototype was proposed for retrieval of images from a broader domain, including x-
rays, CT and US [40]. In this system multiple image features, based on intensity, shape and
texture, are extracted from a given query and used to retrieve similar images based on
similarity measures. Reviews of CBIR for medical applications can be found in [41] and [42]. A
review of 21 CBIR systems for Radiology can be found in [43].
Medical applications are one of the priority areas where CBIR can meet more success
outside the experimental sphere, due to population aging in developed countries.
Notwithstanding the progress already achieved in the few frameworks available, there is still a
lot of work to be done in order to develop a commercial system able to fulfill image
retrieval/diagnosis over a broader image domain.
1.3 Structure of this document
In this chapter we presented the motivation for the problem, a small survey of the CBIR
paradigm, its current state and future work. The rest of this thesis is structured as follows:
• In Chapter 2 we discuss the related work. We start by presenting the IRMA hierarchical
code for classification of medical images, the adopted error evaluation metric and an
overview of the IRMA database. Next, we survey the work for the
ImageCLEF medical annotation tasks from 2005 to 2009, as well as other works on the
IRMA database1.
• Chapter 3 contains the background information for the comprehension of this work.
Image structure, color, shape and texture properties will be addressed, together with a
discussion of the global and local descriptors used herein for image retrieval. Moreover,
we describe the machine learning technique used, the Support Vector Machine (SVM),
and the IRMA database subset used in this work.
• Chapter 4 contains the methodologies used, such as the bag of visual words,
classification strategies and decision fusion schemes.
• In Chapter 5 we present the results together with a discussion of them. In Chapter 6
the major conclusions and future work are drawn.
1.4 Goals
The problem proposed in this work consists in medical image classification/annotation:
given a medical image, we want to know what it is, i.e., which class the image belongs to,
considering a database of collected images belonging to specific classes.
Some general goals were defined at the beginning and were adjusted depending on the
intermediate results achieved. An emphasis on learning subjects related to this work was also
an essential part of the initial goals. These were:
• To comprehend the fundamental aspects involved in CBIR, namely in the areas of
image processing, focusing on image descriptors, and machine learning, namely
Support Vector Machines (SVMs).
• To study the previous work done in the last years for CBIR in the medical image
domain, especially the IRMA Database related work.
1 The IRMA Database and all images presented in this work are a courtesy of T.M. Deserno, Department
of Medical Informatics, RWTH Aachen, Germany.
• Based on the two previous points, to implement one of the best approaches on the
available IRMA database within the institution. This goal was adjusted very early:
instead of an implementation, we decided to design our own system, though some
aspects of previous related works were preserved.
• To use well-known image descriptors, as well as other image descriptors not used in
previous related works, focusing on those with code provided by feature extraction
engines or by their authors.
• To investigate new classification strategies and compare them with the related works.
• To investigate fusion methods to improve our initial results in order to make them
competitive with or better than the results found in the literature.
• To point new directions and considerations where future work can be developed.
In order to achieve these goals we followed a set of fundamental principles of the computer
vision/machine learning group within Instituto Nacional de Engenharia de Sistemas e
Computadores (INESC-Porto)1 where this work was developed:
• Images were considered in their raw format.
• Results were reported on a quantitative basis in order to allow comparison with related works.
• Ongoing progress and preliminary results were presented in regular meetings within the
institution in order to gather feedback from researchers that work in similar problems
and whose contributions proved valuable.
1.5 Main contributions
The main contributions of this work are:
• An experimental performance evaluation of image descriptors in the context of image
annotation in medical databases.
• A new interpretable classification method using the SVM, based on the adopted
hierarchical standard for medical images, and its comparison with other methods used in
related works.
• An experimental evaluation of the fusion between methods.
• A relearning method using SVMs to identify potentially misclassified images, which are
then subjected to the fusion process stated in the previous point.
1 http://www2.inescporto.pt/
Results from this work led to the following publication:
• Igor F. Amaral, Filipe Coelho, Joaquim F. Pinto da Costa and Jaime S. Cardoso;
“Hierarchical Medical Image Annotation Using SVM-based Approaches”, in
Proceedings of the 10th IEEE International Conference on Information Technology and
Applications in Biomedicine, 2010.
Chapter 2
Related work
2.1 The IRMA code
The IRMA code for medical image classification [44] is a mono-hierarchical multi-axial
classification scheme for medical images. It consists of four axes, with three to four
positions each, describing different content within the image: the technical (T) axis code
describing the image modality; the direction (D) axis code describing body orientation; the
anatomical (A) axis code for the body region examined; and the biological (B) axis code for the
examined body system. All axes have three positions with the exception of the T axis, which has
four. Therefore the full IRMA code for one particular image consists of 13 characters (IRMA:
TTTT-DDD-AAA-BBB). In order to emphasize the mono-hierarchical order of the positions we
adopted a slightly different notation for the IRMA code: IRMA: T1T2T3T4-D1D2D3-A1A2A3-
B1B2B3. This means, for example, that position T2 in the modality axis is hierarchically
higher than position T3 within the same axis. This notation will prove useful in
subsequent chapters.
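As an illustration of this notation, a full code string can be split into its per-axis positions; the following is our own minimal helper, not part of the IRMA specification, and the example code is merely illustrative:

```python
def parse_irma(code):
    """Split a full IRMA code 'TTTT-DDD-AAA-BBB' into per-axis position lists.

    Returns e.g. {'T': ['1', '1', '2', '1'], ...}; position T2 is then
    parsed['T'][1], hierarchically higher than T3 at parsed['T'][2].
    """
    parts = code.split('-')
    if [len(p) for p in parts] != [4, 3, 3, 3]:
        raise ValueError("expected the 13-character form TTTT-DDD-AAA-BBB")
    return dict(zip('TDAB', (list(p) for p in parts)))

axes = parse_irma("1121-127-700-500")  # illustrative code, not a verified entry
```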
The possible values for a particular position are {0, …, 9, a, …, z}, where ‘0’ in a particular
position of an axis denotes ‘unspecified’, truncating the code and forcing the assignment of the
same value to any hierarchically inferior position. Each sub-position in an axis, i.e. a position
that is not the hierarchically highest (the root), is connected with one and only one hierarchically
higher position. Therefore any axis consists in a tree whose leaves are reached by one and only
one top-down path, making it, as previously stated, mono-hierarchical. An IRMA code will
consequently be a forest of trees, each representing an axis. Only two relational sentences are
allowed: “is a” for the root and “part-of” for sub-positions. Even if the meanings of two or more
sub-positions at different hierarchical levels are literally identical, like in some T3 and T4 sub-
positions for the sonography modality (T1 = “2”, T2 ∈ {1, …, 8}), different meanings of the axis
are established depending on the value of the hierarchically higher sub-position of which they
are “part-of”, guaranteeing non-ambiguity. Such a structure allows the development of methods
for semantic queries in databases. Figure 2.1 shows some examples of images with their
respective IRMA codes.
Figure 2.1 – Examples of x-ray images annotated with the IRMA code. Notice that some axes
may be completely ‘unspecified’ (From: 2007 ImageCLEF Medical Annotation
Task database).
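The truncation rule for ‘unspecified’ positions can be checked mechanically; a minimal sketch (the helper name is ours):

```python
def axis_respects_truncation(positions):
    """Check the truncation rule: once a position of an axis is '0'
    ('unspecified'), every hierarchically inferior position must be '0' too."""
    seen_zero = False
    for p in positions:
        if seen_zero and p != '0':
            return False
        seen_zero = seen_zero or p == '0'
    return True
```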
The technical (T) axis, with four positions, describes the image modality by assigning the
image acquisition source to the T1 position, whose details are then assigned to the T2 position.
The T3 position specifies the technique used, with further details on such techniques specified in
the T4 position. The directional (D) axis starts with the description of the common orientation of
the body in the D1 position, whose details are specified in the D2 position. Here there was a
concern to distinguish the posteroanterior and anteroposterior directions due to the differences
in scale between organs or bones. The last position of the directional axis, D3, describes the
functional orientation during the exam. The anatomy of the human body is described in the
anatomy (A) axis. The first position of this axis, A1, specifies nine major regions and the
subsequent positions, A2 and A3, define these regions in more detail. The B axis denotes the
organ system under analysis and complements the A axis, because different types of organs exist
in the same anatomical region. The B1 position in this axis specifies 10 organ systems and the
remaining positions, B2 and B3, specify a particular system until an organ is identified.
Due to the structure of the IRMA code, modifications or extensions can easily take place by
replacing or adding new positions/position values, or even by adding completely new axes.
Other standards with the same purpose suffer from incompleteness, ambiguity, lack of causality
and lack of hierarchy. The MeSH thesaurus is a polyhierarchical standard, where several codes
for modality possess the same meaning as a unique IRMA T-axis code, and it also suffers from
incompleteness. The DICOM and SNOMED nomenclatures are incomplete and ambiguous for
the description of body anatomy. The JJ1017 standard is closely related to the IRMA code but
offers only three axes for image classification, raising problems for semantic retrieval due to
ambiguity and lack of detail for the human body regions. Specific examples of the limitations of
these standards in comparison with the IRMA code can be found in [44]. A complete reference
for the IRMA code values is available on request at the IRMA project website1.
2.2 Error evaluation for the IRMA code
Consider a particular axis X of an IRMA code. Let $l = l_1 l_2 \dots l_i \dots l_I$ be the correct code
for X. Here $l_i$ is a particular position in X and $I$ is the depth of the tree for the considered axis.
Notice that $I$ may change for different axes. Let $\hat{l} = \hat{l}_1 \hat{l}_2 \dots \hat{l}_i \dots \hat{l}_I$ be a classified code for X,
where each position $\hat{l}_i$ can take any value valid for that position or, if a “don't know”
decision is chosen, can be coded with the wildcard “*”. If a position $\hat{l}_i$ in X
was wrongly classified then all positions $\hat{l}_{i+1}, \dots, \hat{l}_I$ will also be considered wrong due to the
hierarchical structure of the IRMA code. If in some position a ‘0’ or a “*” classification is
given then, again, all subsequent hierarchically inferior positions will be ‘0’ or “*”,
respectively.
1 http://www.irma-project.org
The error corresponding to a particular axis is given by:

$$\sum_{i=1}^{I} \underbrace{\frac{1}{b_i}}_{(a)}\,\underbrace{\frac{1}{i}}_{(b)}\,\underbrace{\delta(l_i, \hat{l}_i)}_{(c)} \qquad (2.1)$$

with

$$\delta(l_i, \hat{l}_i) = \begin{cases} 0 & \text{if } l_j = \hat{l}_j \;\forall\, j \le i \\ 0.5 & \text{if } \hat{l}_j = \text{“*”} \;\exists\, j \le i \\ 1 & \text{if } l_j \ne \hat{l}_j \;\exists\, j \le i \end{cases} \qquad (2.2)$$

where in (2.1)
• (a) is the branching factor, which accounts for the decision difficulty in the specific
position, with $b_i$ the number of possible values for that position.
• (b) accounts for the position in the axis code string, i.e. the level in the hierarchy.
• (c) is the weight given to a correct/not known/incorrect decision.
If an error is found in a position $i$ then $\delta(l_j, \hat{l}_j) = 1$ for every $j \in \{i+1, \dots, I\}$.
The normalized error is given by:

$$\frac{\sum_{i=1}^{I} \frac{1}{b_i}\,\frac{1}{i}\,\delta(l_i, \hat{l}_i)}{\sum_{i=1}^{I} \frac{1}{b_i}\,\frac{1}{i}} \qquad (2.3)$$
The normalized error values for an axis range between 0, for a completely correct
classification, and 1 for a completely misclassified axis. The contribution of this error to the total
error of a complete IRMA code is weighted by the number of axes. Therefore, because we are
considering a multi-axial scheme, the error count for each axis is obtained by multiplying (2.3)
by $1/k$, where k is the number of axes in the IRMA code. In our case $1/k$ is 0.25 because four
axes are considered.
In Table 2.1 an example of the error count for the anatomical (A) IRMA code axis is
presented. For the specific code the $b_i$ value1 in (2.1) is 11 for the A1 position, 7 for the A2
position and 8 for the A3 position. The maximum error for a completely misclassified axis is
therefore 0.20400, obtained by setting the (c) factor in (2.1) to 1. The weight of the error for
each of the four axes in the IRMA code is 0.25. Multiplying this value by the normalized error
gives us the error count, which is the contribution to the total error of a complete IRMA code.
Correct code: 463

Classified   Error (eq. 2.1)   Normalized Error (eq. 2.3)   Error Count
463          0                 0                            0.000000
46*          0.020833          0.102122                     0.025531
461          0.041666          0.204244                     0.051061
4*1          0.056550          0.277188                     0.069297
4**          0.056550          0.277188                     0.069297
47*          0.113100          0.554377                     0.138594
473          0.113100          0.554377                     0.138594
477          0.113100          0.554377                     0.138594
***          0.102000          0.500000                     0.125000
731          0.204000          1.000000                     0.250000

Table 2.1 – Example of the error count for the anatomical (A) IRMA code axis (From: [45])
Two special situations not depicted in Table 2.1 should also be considered: if the true code of
the axis is completely unspecified, i.e. ‘000’, assigning wildcards to all positions, ‘***’, is
not considered an error; if the true code of the axis is again completely unspecified and a
wildcard is assigned to the first position with other values, even if correct, in the subsequent
positions, like ‘*00’, then the classified result will have an error according to (2.1).
The goal of this error counting scheme is to penalize wrong decisions that are supposed to be
easy, i.e., at positions high in the hierarchy or with few choices for that particular node, more
than wrong decisions made at hierarchically deeper nodes or at nodes with a wide range of
possibilities.
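Under the definitions of equations (2.1)-(2.3), the per-axis error can be sketched in Python; this is our own illustrative implementation (the special ‘000’ cases above are omitted for brevity), reproducing the anatomical-axis example of Table 2.1:

```python
def axis_error(true_code, pred_code, branching):
    """Per-axis IRMA error (eq. 2.1) and normalized error (eq. 2.3).

    true_code / pred_code: position strings, '*' meaning "don't know";
    branching: the b_i values (number of possible values per position).
    A wrong position makes every deeper position wrong (delta = 1), and a
    '*' propagates downwards as "don't know" (delta = 0.5).
    """
    wrong = dont_know = False
    err = 0.0
    for i, (t, p, b) in enumerate(zip(true_code, pred_code, branching), start=1):
        if wrong or (not dont_know and p != '*' and p != t):
            wrong, delta = True, 1.0
        elif dont_know or p == '*':
            dont_know, delta = True, 0.5
        else:
            delta = 0.0
        err += delta / (b * i)          # (1/b_i) * (1/i) * delta
    max_err = sum(1.0 / (b * i) for i, b in enumerate(branching, start=1))
    return err, err / max_err

# Table 2.1 example: true anatomical code '463', b = (11, 7, 8).
error, normalized = axis_error('463', '47*', (11, 7, 8))
# error ~ 0.11310 and normalized ~ 0.554377; the error count is 0.25 * normalized
```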
1 According to the IRMA code for the 2007 Medical Annotation Task. Changes in the IRMA code since
that year influence the penalization for a wrong decision, with consequences for the error count.
2.3 ImageCLEF Medical Image Annotation Tasks
Evaluation campaigns for information retrieval, object classification, detection and
segmentation, machine translation, video tracking and even speech recognition are becoming
more and more adopted as a way to research new methods or to improve existent ones. The
competitive character of such campaigns, where several teams aim for the best overall results, is
useful to benchmark proposed systems. The idea behind the campaign concept is this: a
database is provided for a task in one of the designated areas. If the results are satisfactory then
the amount of data provided in the next evaluation campaign increases, making the complexity
of the problem higher, or a new database is provided for a different task. This data reusability
allows researchers to learn from their previous experiences and refine their future work.
Examples of such evolving campaigns are the Text REtrieval Conference (TREC)1, running
since 1992 for information retrieval, TRECVID2, a part of TREC for video retrieval, and the
PASCAL3 network, in 2005 and 2006, for image segmentation, object detection and
classification.
The evolution of the state of the art in automatic annotation methods for medical images,
based on purely visual features, can be tracked in the medical image annotation tasks of the
CLEF4 cross-language image retrieval campaign (ImageCLEF), which has taken place since
2003 for information extraction from digital image libraries; the medical image annotation
tasks ran from 2005 to 2009 and were discontinued from 2010 onwards. Like the other tasks of
the ImageCLEF campaign, the medical image annotation task also assumed a competition
format. The goal of the tasks, however, was to explore, develop and promote automatic
annotation techniques and strategies for semantic information extraction in medical image
databases with little or no annotation.
The IRMA database was gradually used during the ImageCLEF Medical Annotation Tasks.
This database consists of approximately 17000 radiographs collected during daily routine
examinations at the Department of Diagnostic Radiology of RWTH Aachen University, Aachen,
Germany5. A consequence of this routine is the uneven distribution of the types of
radiographs gathered. However, each class in the database has a minimum of 10 images. A
distinctive characteristic of this database is that many images have strong intra-category
variability, i.e., very distinctive images with similar codes, and inter-category similarity, i.e.,
very similar images possessing different codes (Figures 2.2 and 2.3).
1 http://trec.nist.gov
2 http://www-nlpir.nist.gov/projects/trecvid/
3 http://www.pascal-network.org
4 http://www.clef-campaign.org
5 http://www.rad.rwth-aachen.de
Figure 2.2 – Example of x-rays belonging to the IRMA database with high intra-category
variability. All images share the same IRMA code 1121-120-800-700 (From [46]).
Figure 2.3 – Example of x-rays belonging to the IRMA database with high inter-category
similarity. Top row images are from the elbow, with an IRMA code
xxxx-xxx-44x-xxx, and bottom row images are from the knee, with an IRMA
code xxxx-xxx-94x-xxx (From [46]).
All images were stored using gray-level values in the Portable Network Graphics (PNG) format.
Furthermore, images were scaled proportionally to their original size, keeping the aspect ratio, in
order to fit a 512×512 maximum pixel window. All images were annotated by radiologists using
the IRMA code, but this was disregarded during the 2005 and 2006 tasks, where all images were
annotated simply with a class number. Only in 2007 was the IRMA code introduced for
classification purposes.
2.3.1 2005 Medical Annotation Task
The 2005 Medical Annotation Task [47] provided a subset of 10000 images from the IRMA
database. Of these 10000 images, a random set of 9000 was selected as training data, published
with the annotation, and the remaining 1000 images were given for evaluation, without any
annotation. Images belonged to 57 distinct classes and no IRMA code was used for annotation;
image classes were simply labeled with integers ranging from 1 to 57. Error evaluation was
performed by considering the total error rate, i.e., the percentage of wrongly classified images,
and not the error evaluation scheme presented in Section 2.2. A total of 12 teams participated,
submitting 41 runs1. Table 2.2 gives an overview of the results for the 2005 Medical
Annotation Task.
Rank Team Error Rate (%) Difference (%)
1 RWTH-i6 12.6
2 RWTH-mi 13.3 0.7
3 Ulg.ac.be 14.1 1.5
4 Geneva-gift 20.6 8.0
5 Infocomm 20.6 8.0
6 MIRACLE 21.4 8.8
7 NTU 21.7 9.1
8 NCTU-DBLAB 24.7 12.1
9 CEA 36.9 24.3
10 Mtholyoke 37.8 25.2
11 CINDI 45.3 30.7
12 Montreal 55.7 43.1
Table 2.2 – Results overview from 2005 ImageCLEF Medical Annotation Task. (From: [47]).
RWTH-i6 - Computer Science Department, RWTH University, Aachen, Germany:
The Computer Science Department from RWTH University was the winning team, with a total
of 12.6% misclassified images. The method consisted in the Image Distortion Model (IDM)
using image thumbnails of size X × 32. From the thumbnails, vertical and horizontal gradients
obtained by applying a Sobel filter, Tamura texture features and 3 × 3 and 5 × 5 subimages were
extracted. All features were extracted with the Flexible Image Retrieval Engine2 (FIRE) CBIR
framework. The IDM is related to the field of image registration by its inherent optimization or
matching process, and aims to compensate only for deformations that leave the image class
unchanged, discarding emphasis on discrimination between classes. This is not the objective of
other methods in the same area, which focus on best matches to distinguish images of two
classes.
1 Sometimes the number of submissions of one team can be very high. The results and discussion
presented for each task and each team will focus only on the method that performed best.
2 http://thomas.deselaers.de/fire/
Classification was done using a nearest neighbor (1-NN) classifier. Every image was
mapped onto a reference image (from the training set) by summing local pixel distances, which
were squared Euclidean distances. The distance between images was calculated by minimizing
the cost over all possible deformation mappings. For this, a subset of 1000 images from the
training data was randomly selected and the best parameters for the model were chosen. From
experiments on this set it was found that the X × 32 image thumbnail features performed better.
This performance held for the test set after evaluation. Other slightly different runs from the
RWTH-i6 team also performed very well in this task.
RWTH-mi – Department of Medical Informatics, RWTH University, Aachen, Germany:
The Department of Medical Informatics from RWTH University had an error rate of 13.3%,
reaching second place in the task. An IDM using the Cross Correlation Function (CCF) on
32 × 32 image thumbnails with a 1-NN Euclidean distance classifier was combined with an
IDM using Tamura texture features with the Jensen-Shannon divergence as distance metric for
histogram comparison. Because the IDM is time consuming, performance was boosted by
passing only the 500 nearest neighbors to the CCF-IDM. The best parameters for the weight of
each IDM in the model combination were empirically evaluated using a subset of 1000 images
from the training set.
Ulg.ac.be – Institut Montefiore, University of Liège, Liège, Belgium:
The team from the University of Liège scored an error rate of 14.1% using 16 × 16
patches randomly extracted from the images. Patch extraction was performed inversely
proportionally to the number of examples of each class in the training set; more
precisely, n = N_p/(m × N_c) patches were extracted per image, where N_p is the total number
of patches, m is the number of classes and N_c the number of images of class c. A fixed total
of N_p = 800000 patches was extracted from the training set, giving approximately 14000
patches per class. For the test set this number was fixed at 500 patches per image. Image
contrast enhancement, with ImageMagick1, was applied to every patch. For the learning phase,
25 boosted decision trees (Tree Boosting) were used together with a stop-splitting criterion
using a G² statistic to determine the significance of the test. A G² statistic can be seen as an
alternative to the chi-squared goodness-of-fit test and is mostly used in hierarchical models of
data. All patches of a test image were then aggregated according to their classified class and the
final classification was reached through majority voting over the patches, image by image.
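A minimal sketch of the patch-allocation rule above, under our reading that n = N_p/(m × N_c) patches are taken per image (this reconstruction is consistent with the stated 800000-patch total and the approximately 14000 patches per class, but the function name is ours):

```python
def patches_per_image(total_patches, n_classes, images_in_class):
    """Patches to extract from each image of a class: n = N_p / (m * N_c),
    inversely proportional to class size, so every class contributes about
    N_p / m patches in total regardless of how many images it has."""
    return total_patches / (n_classes * images_in_class)

# 800000 patches over 57 classes: each class contributes ~14035 patches.
```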
1 http://www.imagemagick.org/
Geneva-gift – University and University Hospitals of Geneva, Service of Medical
Informatics, Geneva, Switzerland:
The team from Geneva had an error rate of 20.6% using an adaptation of the GNU Image
Finding Tool1 (GIFT) for medical images, medGIFT2, for image feature retrieval. medGIFT
supports a number of image descriptors, like color histograms and Gabor Texture Filters (GTF),
considering multiple scales, gray levels and directions. The best results relied on 8 gray-level
images and 8 directions for the GTF. For this, features were extracted from the training set and
term frequency-inverse document frequency (tf-idf) weighting was performed. The same
procedure took place for the test set and a query was made for every image. Query results
consisted of the 5-NN using histogram intersection for similarity scoring and comparison. The
class with the highest final score became the class selected for the image.
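The retrieval-and-voting step described above can be sketched as follows; this is a schematic of 5-NN histogram-intersection scoring, not medGIFT's actual code, and the names are ours:

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Similarity of two histograms: the sum of bin-wise minima."""
    return float(np.minimum(h1, h2).sum())

def classify_by_5nn(query_hist, train_hists, train_labels, k=5):
    """Score the query against every training histogram, keep the k most
    similar, sum the similarities per class and pick the best class."""
    sims = np.array([histogram_intersection(query_hist, h) for h in train_hists])
    top = np.argsort(sims)[::-1][:k]          # indices of the k most similar
    scores = {}
    for idx in top:
        scores[train_labels[idx]] = scores.get(train_labels[idx], 0.0) + sims[idx]
    return max(scores, key=scores.get)
```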
Infocomm – Institute for Infocomm (I2R), Singapore:
The Infocomm team achieved a score similar to Geneva-gift, with a 20.6% error rate. For this
task various image features were extracted, among them polarity, anisotropy and contrast
for texture, and Low Resolution Pixel Maps (LRPM) from 16 × 16 image thumbnails
for spatial layout. Several subsets of the training data were used to train an SVM with a Radial
Basis Function (RBF) kernel. Once the best parameters for cost and gamma were chosen, the
test set was classified. This group also combined the previous descriptors with Blob features,
with better results, but never submitted this run. Other techniques like Principal Component
Analysis (PCA) were also attempted, but with less success.
MIRACLE – Universidad Politecnica de Madrid, Universidad Carlos III de Madrid,
DAEDALUS S.A., Madrid, Spain:
The MIRACLE team scored a 21.4% error rate in this task. Images were reduced to
32 × 32 thumbnails and GIFT was used for image retrieval given a test image as query.
Classification was performed via a decision table: for the 20-NN, a weighting function was
applied to the relevance of each result. Weights corresponding to the same class were summed,
as a measure of confidence, and the class corresponding to the highest sum was assigned
to the query image. The best number of closest retrievals was optimized using 10-fold cross-
validation on the training set. Other strategies from this team using a NN classifier achieved a
worse performance.
1 http://www.gnu.org/software/gift/
2 http://www.sim.hcuge.ch/medgift/
NTU – National Taiwan University, Taipei, Taiwan:
The NTU team scored the same error rate of 21.7% with two methods. Images were resized to
256 × 256 pixels and divided into 32 × 32 blocks of 8 × 8 pixels each. The average
gray value of each block was computed, giving a 1024-element feature vector. Similarity
between test images and classes was measured using the cosine metric in a 1-NN classifier and a
2-NN classifier. No learning phase seems to have taken place for this method.
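The block-averaging descriptor and cosine matching described above can be sketched as follows (our own illustration with assumed names; NumPy's reshape is used to group each 8 × 8 block):

```python
import numpy as np

def block_average_features(img):
    """1024-element descriptor: the mean gray value of each 8x8 block
    of a 256x256 image (32 x 32 blocks in total)."""
    assert img.shape == (256, 256)
    # axes become (block_row, row_in_block, block_col, col_in_block)
    return img.reshape(32, 8, 32, 8).mean(axis=(1, 3)).ravel()

def cosine_1nn(query, refs, labels):
    """Label of the reference vector most similar to the query by cosine."""
    q = query / np.linalg.norm(query)
    sims = [float(q @ (r / np.linalg.norm(r))) for r in refs]
    return labels[int(np.argmax(sims))]
```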
NCTU-DBLAB – Department of Computer & Information Science, National Chiao
Tung University, Hsinchu, Taiwan:
NCTU-DBLAB scored an error rate of 24.7%. The method involved scaling images
to 8 × 8 pixels; the corresponding 64-dimensional gray-level vector was used as a
spatial feature to feed an SVM classifier with an RBF kernel. Other image feature combinations
and SVM kernels were tried without better success. No indication of a training phase for kernel
parameterization was found.
CEA – CEA-LIST/LIC2M, Fontenay-aux-Roses, France:
The CEA team submitted three runs, the best scoring an error rate of 36.9%. All images
were resized to 100 × 100 pixels and divided into four blocks of 50 × 50 pixels. A Sobel filter
was applied to each block and the pixels within were projected onto the vertical and horizontal
axes, originating an aggregation histogram. The test image class is attributed by majority voting
of the 3-NN using a Euclidean metric for histogram comparison.
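The projection step described above can be sketched as follows; this is our own illustration, and the exact Sobel normalization and histogram layout are assumptions:

```python
import numpy as np

def projection_histogram(block):
    """Sobel edge magnitudes of a block, summed (projected) onto the
    vertical and horizontal axes and concatenated into one histogram."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T                                 # Sobel kernel for the other axis
    h, w = block.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):                    # 'valid' correlation with the kernels
        for j in range(w - 2):
            patch = block[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    mag = np.hypot(gx, gy)                    # gradient magnitude
    return np.concatenate([mag.sum(axis=1), mag.sum(axis=0)])  # row + column sums
```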
Mtholyoke – Mt. Holyoke College, South Hadley, Massachusetts, USA:
The method used by Mount Holyoke College started by scaling images from both sets to
256 × 256 pixels. Each image was then divided into a 5 × 5 square grid. A total of 250000
blocks over all images was built. Because Gabor energy measures and Tamura texture features
are not correlated, these features were extracted from all blocks with the help of the FIRE
framework. Coarseness was separated from the Tamura textures and used as a separate feature.
Two clustering methods, Cluster Query Likelihood (CQL) and Cluster Based Document Model
(CBDM), were used. Ranking measures using the error rate and K-NN clustering showed that
Gabor energy performed best as an image descriptor with the CQL model, even better than a
combination with Tamura texture features. The best model parameters were optimized with
10-fold cross-validation and different values of K for each set of features. These models were
constructed with the 9000-image training set. Several clustering methods for the proposed
models, K-means and K-NN, were experimented with, the latter producing better results. K = 25
for the CQL model and K = 50 for the CBDM model were empirically established and the
models were submitted for evaluation on the test set. It is not clear in the literature which of
these models performed best, but they had error rates of 37.8% and 40.3%, respectively.
CINDI – CINDI group, Concordia University, Montreal, Canada:
CINDI team submitted only one run, with an error score of 43.3%. Several image features
were extracted to build a vector of approximately 200 elements: invariant moments and the Canny
edge detector for shape; gray level co-occurrence matrices (GLCM) for texture and, from these,
higher level features like entropy, energy, contrast and homogeneity. Training set feature
vectors were used as input to an SVM with an RBF kernel. The best parameters for the SVM
kernel, found using 10 cross-validation folds, were (C, γ) = (200, 0.01), giving a 54.65% rate of
correct classifications. The performance achieved on the test set was only 2.05% better than expected.
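A minimal sketch of the GLCM texture features named above (the displacement and gray-level count here are illustrative choices, not necessarily the team's configuration): a normalized co-occurrence matrix is accumulated for one pixel displacement, and energy, entropy, contrast and homogeneity are read off it.

```python
import math

def glcm(img, dx=1, dy=0, levels=4):
    """Normalized gray level co-occurrence matrix for displacement (dx, dy)."""
    h, w = len(img), len(img[0])
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                counts[img[y][x]][img[y2][x2]] += 1
                total += 1
    return [[c / total for c in row] for row in counts]

def glcm_features(p):
    """Energy, entropy, contrast and homogeneity of a normalized GLCM."""
    n = len(p)
    energy = sum(v * v for row in p for v in row)
    entropy = -sum(v * math.log(v) for row in p for v in row if v > 0)
    contrast = sum((i - j) ** 2 * p[i][j] for i in range(n) for j in range(n))
    homogeneity = sum(p[i][j] / (1 + abs(i - j))
                      for i in range(n) for j in range(n))
    return energy, entropy, contrast, homogeneity
```

For a perfectly uniform image the GLCM has a single nonzero entry, so energy and homogeneity are 1 while entropy and contrast are 0.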
Montreal – Montreal University, Montreal, Canada:
Montreal team achieved the worst performance in the task, with 55.7% of images
misclassified. Aside from the fact that a combination of Fourier shape and contour descriptors,
together with texture coarseness and directionality, was used for image feature extraction, no
annotation method is known because this team never published working notes.
Discussion for the 2005 Medical Annotation Task:
The first Medical Annotation Task gathered a good number of participants with submissions.
However, 14 of the teams registered for the task did not participate. This may be related to
interest in the data provided rather than in presenting any methodology for the task. Such
behavior would be a constant in every edition of this task.
Looking at Table 2.2, it is visible that the best results of each team can be grouped into three
categories: below 15%, with 3 teams; from 20% to 25%, with 5 teams; and from 35% to 56%,
with 4 teams. The RWTH University teams, partially because they were more familiar with the data,
achieved the best performances with the IDM. Pixel intensity values from scaled images
outperform general image features, but only slightly when compared with object recognition
methods using patches as a visual-word representation. The performance of classifiers relative
to the IDM and k-NN is mixed, spread all over the rankings with good and less good results.
Some teams used existing frameworks for feature retrieval, which may signal the need for such
systems for rapid experimentation. Fusion of the best results was attempted but no score
improvement was achieved.
2.3.2 2006 Medical Annotation Task
The subset for the 2006 Medical Annotation Task [48] increased to 11000 images, 10000
annotated for training and 1000 without any annotation for evaluation, from 116 different
classes. No IRMA code was used for annotation and the evaluation considered the error rate, as
in 2005. In this task 28 runs were submitted by 12 teams. Another 14 registered teams did not
submit any runs. Table 2.3 shows the score rankings according to the best run of each team.
Rank Team Error Rate (%) Difference (%)
1 RWTH-i6 16.2
2 UFR 16.7 0.5
3 MedIC-CismEF 17.2 1.0
4 MSRA 17.6 1.4
5 RWTH-mi 21.5 5.3
6 CINDI 24.1 7.9
7 OSHU 26.3 10.1
8 NCTU 26.7 10.5
9 MU 28.0 11.8
10 ULG 29.0 12.8
11 DEU 29.5 13.3
12 UTD 31.7 15.5
Table 2.3 – Best runs / team for the 2006 ImageCLEF Medical Annotation Task (From: [48]).
RWTH-i6 - Computer Science Department, RWTH University, Aachen, Germany:
The Computer Science Department from RWTH University managed, once again, to achieve
the overall best performance for this task with an error rate of 16.2%. The best run used sparse
histograms of image patches and a discriminative log-linear maximum entropy model was used
for classification. Square patches of edge lengths of 7, 11, 21 and 31 pixels were extracted at
every position of the images, allowing coverage of objects of different sizes and providing some
degree of invariance to scale changes. Patch dimensionality was reduced to between 6 and 8
features with the use of Principal Component Analysis (PCA).
The histogram grid was built with uniformly distributed bins using the mean and variance of the
dimensionally reduced patches. For every image a histogram with 65536 bins was built. The
position of each patch was added to the histogram as a way to add spatial information and,
consequently, achieve invariance to translations. Log-linear maximum entropy models optimize
the class posterior probability by discriminative training. Model parameters were optimized
with a modified generalization of the iterative scaling algorithm and classification followed
Bayes’ decision rule. This team also submitted the run used in the 2005 Medical Annotation
Task, but that method underperformed. Another run, using SVMs with histograms from image
patches, was also presented, with results very similar to the best run.
UFR – LMB Group, Albert-Ludwigs-University, Freiburg, Germany:
Albert-Ludwigs-University team's best error rate score was 16.7%, not very far from the best
overall score. Using a wavelet-based point detector, relational feature vectors are extracted
by three parameterizations of a relational function for texture analysis, considering a
surrounding region, in order to capture high level concepts from the image. These vectors are
concatenated and, for all images, clustered by a k-means algorithm, with the number of clusters
set empirically. All feature vectors from each image are accumulated in a global feature vector
using the 1-NN cluster center in three steps: first, an all-invariant accumulator with 20 bins
was created by simply counting the number of cluster occurrences for each image; second, a
rotation-invariant accumulator with 10 bins was made for all pairs of salient points lying within
a specific distance range and sharing identical cluster indices; third, an orientation-invariance
accumulator with 4 bins, using a co-occurrence matrix to capture the statistical properties of
the joint distribution of cluster membership indices, was built for all pairs of salient points.
The total feature vector for one image totaled 16000 elements. The best run, of the two
submitted, used 1000 salient points per image and, for classification, a one-vs-rest multiclass
SVM with a histogram intersection kernel and empirical parameterization.
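The histogram intersection kernel mentioned above can be stated in a few lines (a standard formulation, not the team's code): the kernel value of two histograms is the sum of their bin-wise minima, so identical histograms score their total mass and disjoint ones score zero.

```python
def histogram_intersection(h1, h2):
    """Kernel k(h1, h2) = sum_i min(h1[i], h2[i])."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```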
MedIC-CismEF – LITIS Laboratory - INSA de Rouen, CISMeF Team, Rouen University
Hospital & GCSIS Medical School of Rouen, Rouen, France:
MedIC-CismEF team's method started by resizing all images to 256 × 256 pixels. After splitting
each image into 16 equal blocks (64 × 64 pixels), global and local features were extracted. These
features include: 4 co-occurrence matrices, one for each of 4 directions, after 64-gray-level
quantization, producing a 16-element feature vector; the fractal dimension, a single number between 2
and 3 denoting texture smoothness; Gabor features for 3 scales and 4 orientations,
generating a 24-element vector; a 3 × 3 discrete cosine transform with the exclusion of the low
frequency component, resulting in an 8-element vector; gray-level run lengths in different
directions, yielding a 14-element vector; Laws masks for textural energy, for a 28-element
vector; the Multispectral Simultaneous Autoregressive Model (MSAR), where the gray level of a
pixel is expressed as a function of the gray levels in its neighborhood, for a 24-element vector;
and gray level statistical measures, first to fourth moments, as local features, resulting in a
7-element vector. The total feature size is 122 for each block. Considering also the extraction of all
features from the scaled image as a whole, this results in a 2074-element feature vector per image.
The best run used an SVM with RBF kernel, whose parameterization was tuned with cross-validation
on the training set, after a PCA over the feature space retaining 95% of the variance, yielding a
reduced feature vector of 335 elements. The error rate score was 17.2% and a third
place in the task was achieved.
MSRA - Microsoft Research Asia, Beijing, China:
MSRA team scored an error rate of 17.6%, with the best method relying only on global image
descriptor information together with an SVM for classification. Three main global features
were extracted: images were divided into 8 × 8 blocks and normalized average gray levels from
each block, for illumination invariance, were extracted; images were again divided into 4 × 4
blocks and wavelet coefficients from each block at different scales were computed for texture
information; and images were converted to binary using Otsu’s method for the threshold. The
area and the central point of the regions were calculated. Then, morphological operations were
performed to extract the contour and edges of the image to describe its shape. These features
were duplicated 6 times to increase their importance in the SVM. An RBF kernel was chosen
for this classifier and its parameters were tuned by 5-fold cross-validation. A similar method
using local descriptors was also attempted, with slightly worse results.
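Otsu's method, used above for binarization, picks the gray-level threshold that maximizes the between-class variance of the histogram. A compact sketch of the standard formulation (not the team's code):

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level maximizing between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(levels))
    sum_b = 0.0      # cumulative intensity of the background class
    w_b = 0          # background pixel count
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b = sum_b / w_b                  # background mean
        m_f = (sum_all - sum_b) / w_f      # foreground mean
        var_between = w_b * w_f * (m_b - m_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

On a bimodal intensity distribution the returned threshold falls between the two modes, separating the image into foreground and background regions.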
RWTH-mi – Department of Medical Informatics, RWTH University, Aachen, Germany:
The Department of Medical Informatics from RWTH University returned this year with
exactly the same method from the 2005 Medical Annotation Task, only with the parameters
adjusted for the new expanded database. Classification yielded an error rate of 21.5%, a worse
performance than the one in 2005.
CINDI – CINDI group, Concordia University, Montreal, Canada:
CINDI team presented a different method in this competition. Instead of an SVM whose
input was the fusion of different image features, a pairwise combination of SVMs, each
using a different image description, was used. MPEG-7 image descriptors, the Edge
Histogram Descriptor (EHD) and the Color Layout Descriptor (CLD), were extracted as global
features. From 5 overlapping grid image divisions, invariant shape moments (first and second)
and the GLCM were also extracted and combined into a semi-global descriptor. Scaled
images of 64 × 64 pixels were also used and, from these, the mean gray level of 4 × 4
pixel blocks was calculated and concatenated. The summation rule for the combination of
classifiers delivered the best results, with an error rate of 24.1%.
OSHU - Department of Medical Informatics & Clinical Epidemiology; Oregon Health
& Science University, Portland, OR, USA:
OSHU team achieved a best error rate of 26.3%. Four runs were submitted using different
image descriptors. The best run used gray values from 16 × 16 image thumbnails in a 32-bin
histogram. Classification was performed by a neural network, with a multi-layer perceptron
architecture, optimized with the training set of images.
NCTU – Department of Computer & Information Science; DBLAB, National Chiao
Tung University, Hsinchu, Taiwan:
No working notes from the NCTU team could be found but, besides some image descriptors
used in the 2005 Medical Annotation Task (not the most successful ones), Gabor texture features
and the coherence moment, the corrected vector layout was also added. These three image
descriptors together with a NN classifier yielded an error rate of 26.7%.
MU –Media Understanding Group, Institute for Infocomm (I2R), Singapore:
MU team used a two-stage SVM classification for the test set annotation using different
features from those used in the 2005 Medical Annotation Task. A 16 × 16 map of salient
regions denoting the conspicuity of the image was computed, forming a 256-element feature
vector. From multi-resolution wavelets a set of salient points was detected in the image. Then, 13 × 13 image patches around the top 50 salient points were extracted and the respective gray
level values were turned into feature vectors. SIFT was also used to extract multi-directional
features in 4 × 4 patches around keypoints detected with a Difference of Gaussians (DoG).
Finally, histograms of pixel gray values from left-tilt strips were also constructed.
In the first stage of the classification an SVM with RBF kernel is trained using cross-
validation for parameter tuning and feature selection. From the training set, the classes that
include 95% of the wrongly classified samples were marked for a more refined second evaluation
stage. This refinement process, using the training set, defines a threshold used on the test set. To
evaluate a “bad” image in the training set the three maximal elements of the distance vector
from a reference basic system are considered. If a criterion involving the first and second
maximum values is met then the image is reclassified. Otherwise, if a criterion involving
the second and third maximum values is verified then the image is declared “good”.
The error rate scored 28.0% for this methodology.
ULG – Institut Monteflore, University of Liège, Liège, Belgium:
ULG team submitted runs based on the same methods explored in the 2005 Medical
Annotation Task. Although these methods achieved a very good performance in 2005, in 2006,
using 20 extremely randomized trees, the classification results were less good, with an error
rate of 29.0%.
DEU – Dokuz Eylul University, Tinaztepe, Turkey:
DEU team scored an error rate of 29.5% using the EHD as image descriptor and a 3-NN
classifier. Only one run was submitted.
UTD - The University of Texas, Dallas TX, USA:
UTD team achieved the last place in the task with an error rate of 31.7%. The only run
submitted scaled all images to 16 × 16 thumbnails and then performed PCA over the pixel gray
level values. A weighted k-NN was used for classification.
Discussion for the 2006 Medical Annotation Task:
The second year of this task saw, with the exception of the best results, an overall improvement
of the error rates even with the increased complexity of the problem. The top results are
separated by a maximum of only 1.5% in error difference. In a second group the error rate varies
between 21.5% and 31.7%.
The increase in the number of classes made the RWTH-i6 and RWTH-mi runs based on the IDM
underperform. The ULG run based on tree boosting was also not as successful as in the 2005 Medical
Annotation Task. SVMs undoubtedly dominated the classifier preferences, with 9 teams relying
on them for the proposed task. One of the most interesting approaches is probably the MU team's
SVM classifier, because it is the only one that tries to implement active learning by
automatically identifying and reclassifying suspicious classes. A fusion of the top three results using
majority voting led to an improved error rate of 14.4%. In general, image recognition and
detection techniques seem well suited for automatic annotation as well.
2.3.3 2007 Medical Annotation Task
In the 2007 Medical Annotation Task [49] a database of 11000 annotated training images
was provided. For the first time the complete IRMA code was used for annotation. Thus, best
runs were evaluated according to the error evaluation scheme (see section 2.2) and not the
error rate. 1000 test images were made available for this task. Regarding a full IRMA code as a
class of objects, a total of 116 image classes existed. A total of 29 teams registered but only 10
submitted their work, in a total of 68 runs. Table 2.4 shows the overall results for this year's task.
Rank Team Error Score Error Rate (%)
1 BLOOM 26.8 10.3
2 RWTH-i6 30.9 13.2
3 UFR 31.4 12.1
4 RWTH-mi 51.3 20.0
5 UNIBAS 58.1 22.4
6 OSHU 67.8 22.7
7 BIOMOD 73.8 22.9
8 CYU 79.3 25.3
9 MIRACLE 158.8 50.3
10 GENEVA 375.7 99.7
Table 2.4 – Rankings for the 2007 ImageCLEF Medical Annotation Task (From: [49]).
BLOOM – IDIAP Research Institute, Martigny, Switzerland:
BLOOM team scored the lowest error count of the 2007 competition, 26.8. A BoW was
built using SIFT as image descriptor (at one octave only1), performed with dense point
sampling covering both training and test sets rather than a keypoint detector. The visual
dictionary consisted of 500 words/concepts for the full image and 2 × 2 partitions. Each
dictionary was built with the aid of the K-means algorithm using a Euclidean metric. For the 5
dictionaries considered, 1500 sampling points were extracted. Furthermore, images were resized
to 32 × 32 thumbnails and the pixel gray values were also used.
For classification two methods, the Discriminative Accumulation Scheme (DAS) and the
Multi Cue Kernel (MCK), were used. While the DAS is a high level integration approach,
classifying the same set using different image descriptors and fusing the results at the end, the
MCK is a mid-level integration approach using a multiclass SVM with a linear combination of
two exponential chi-square kernels, one per image descriptor. Cross-validation on 5
disjoint subsets of the training set provided the best weights for the kernel combination. The best
run was achieved using the MCK in a one-vs-all SVM.
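The quantization step at the heart of the BoW representation can be sketched as follows (illustrative helper names; the dictionary itself would come from K-means): each densely sampled descriptor is assigned to its nearest visual word under the Euclidean metric, and the image becomes the histogram of word counts.

```python
def nearest_word(desc, dictionary):
    """Index of the dictionary word closest to `desc` (squared Euclidean)."""
    best, best_d = 0, float("inf")
    for i, word in enumerate(dictionary):
        d = sum((a - b) ** 2 for a, b in zip(desc, word))
        if d < best_d:
            best, best_d = i, d
    return best

def bow_histogram(descriptors, dictionary):
    """Bag-of-words histogram of an image's descriptors over the dictionary."""
    hist = [0] * len(dictionary)
    for desc in descriptors:
        hist[nearest_word(desc, dictionary)] += 1
    return hist
```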
1 The notion of octave in Lowe's work on the SIFT descriptor is similar to image scaling. The terms
differ because Lowe uses scale to denote different levels of smoothness resulting from the convolution
of the image with a Gaussian function.
RWTH-i6 - Computer Science Department, RWTH University, Aachen, Germany:
RWTH-i6 team presented 4 runs based on their previous work on sparse histograms of image
patches used in the 2006 Medical Annotation Task. Histograms of 65536 or 4096 bins were
created and classification was performed with SVMs, taking into consideration the complete
IRMA code or each axis separately. In the end the results of the 4 runs were combined by
majority voting. The error count was 30.9.
UFR – LMB Group, Albert-Ludwigs-University, Freiburg, Germany:
UFR team scored an error count of 31.4, repeating the methods based on relational
features used in the 2006 Medical Annotation Task. After the extraction of these features,
classification is performed with SVMs using the full IRMA code and axis-wise. A Binary
Classification Tree (BCT), using the dot product between SVM hyperplanes as a similarity
measure, was also used but did not perform very well.
RWTH-mi – Department of Medical Informatics, RWTH University, Aachen, Germany:
RWTH-mi team once again submitted a run similar to those used in the 2005 and 2006 Medical
Annotation Tasks, optimized for the current image set and using the full IRMA code as a class.
Three runs were submitted, differing only in the way the 1-NN predicted codes are
assembled into the final code. The error count achieved was 51.3; later, using 5-NN
predicted codes, the error count worsened while the error rate improved.
UNIBAS – Databases and Information Systems group, University of Basel, Basel,
Switzerland:
The first participation of the UNIBAS team resulted in an error count of 58.1. The main
focus of the UNIBAS runs, 7 in total, was to speed up image annotation using a more generic IDM for 3 × 3 and 5 × 5 neighborhoods without degrading the quality of the results. Inspired mainly by the
run using the IDM in the 2005 Medical Annotation Task, two layers were used in the
model: in one, images were resized to X × 32, where X is the smallest edge of the full image; in
the other, the information retrieved from the image consisted of the application of a Sobel filter on
the full image, with downscaling performed later. Different weights were applied to both layers,
the gray level values being more relevant than the filtered thumbnails. A series of model
configurations and algorithmic considerations were made to speed up the IDM. A weighted k-NN
classifier, using the inverse Euclidean distance, was then applied. The closest 3-NNs are compared
in order to classify suspicious positions along the axes with the “do not know” option.
OSHU - Department of Medical Informatics & Clinical Epidemiology; Oregon Health
& Science University, Portland, OR, USA:
OSHU team ranked 6th, scoring an error count of 67.8. Images were first scaled to
a size of 256 × 256 pixels. Then information from every image was gathered using: the
GLCM; the gray level co-occurrence matrix of five 128 × 128 overlapping blocks from the
image (GLCM2); a global discrete cosine transform; a color histogram with 32 bins; and 16 × 16
thumbnails from the scaled images. All the previous features were then again extracted from the
thumbnails. Separately, the spatial envelope (GIST) was also used. A neural network with a
multi-layer perceptron was the classifier for the annotation. The number of hidden nodes was
optimized with the training data. The best run used the histogram of the thumbnails and
300 hidden nodes for the neural network. The score achieved is very similar to that of the previous
year's task. A neural network classifier, also with 300 hidden nodes, for the GIST descriptor
produced similar, but not better, results.
BIOMOD – Bioinformatics and Modeling group, University of Liège, Liège, Belgium:
BIOMOD returned for the 2007 task, again using randomized trees with boosting, scoring a
73.8 error count. The method does not differ much from the previous year's, considering, like
other approaches to the same task, a full IRMA code as a single class as well as axis-wise
classifications, with both results combined afterwards. However, the combination of methods
and the axis-wise approach do not outperform the full-code classification method.
CYU - Information Management AI lab, Ching Yun University, Jhongli City, Taiwan:
CYU team submitted only one run, for an error count of 79.3. For image features this team
proposed an illumination-invariant relative local measure for neighboring pixels. According to
this feature, each pixel is classified into one of three categories. Dividing the image into 4
identical blocks and evaluating the occurrence frequency of pixels within the same category
gathers the spatial information therein. A signature of the image, represented by a vector of 324
elements, is then created. A NN rule, using a metric defined by the authors and tuned with the
training set, performs the annotation.
MIRACLE – Universidad Politecnica de Madrid, Universidad Carlos III de Madrid,
DEADALUS S.A., Madrid, Spain:
MIRACLE team used the FIRE framework to extract several global features from the
images: for histogram-like information, gray pixel value histograms and Tamura texture
features; for vector-like information, the global aspect ratio, a global texture descriptor and
Gabor features. A total of 30 approaches using a 10-NN classifier, comprising one specific
type of feature or all together, classified the test set using the full IRMA code, axis-wise and
pairwise axes, for normalized and non-normalized features. The run for all features
(normalized) with an axis-wise strategy provided the best results, with an error count of 158.8.
GENEVA – University and University Hospitals of Geneva, Service of Medical
Informatics, Geneva, Switzerland:
GENEVA team ranked last in this task with an error count of 375.7. The error rate was
particularly bad, with only around 30 completely well classified images. As in the 2005
Medical Annotation Task, the GIFT framework was used to extract image descriptors and
annotate the test set. The amount of features extracted was extensive: local color features at
different scales and considering image-block partitions; color histograms; quantization of the GTF
in 10 strengths; Gabor filters; and aspect ratio. All features were weighted and several k-NN
classifiers for k ∈ {1, ..., 20} were tested. The best final results took k=5 into consideration. This
work was largely improved and explored after the task.
Discussion for the 2007 Medical Annotation Task:
From Table 2.4 it is clearly seen that the results can be clustered into three main groups: the
first three results; ranks 4-8; and the last two ranks. One of the major conclusions that can be
drawn from this task is that SVMs outperform NN for image classification. All three top teams
use them for annotation. Furthermore, their results were combined using majority voting for an error
count of 24 and an error rate of 10.3%. Only 54 wildcards on 31 images were placed during the
combination. For image features another important conclusion can also be reached: local image
features provide better results than global ones. Teams ranking from 1st to 5th place use local
features alone or combinations of these with global features. Only OSHU (6th place) uses purely
global descriptors. Most of the classification strategies adopted ignored the hierarchical
structure of the IRMA code and made little or no use of wildcards. The usage of the full IRMA
code as one class of objects, or of each IRMA code axis as a class considering its specific meaning,
dominated the preferences, with the former performing better than the latter. Deselaers et al. [49]
also report that no image was completely misclassified by all submitted runs and only one
image was completely well classified by them all. However, such a fact brings no new
insight into the evaluation of the problem complexity, since the GENEVA team had a 99.7% error
rate. A conclusion of the review for this task states that “the task is now at the point where it can
be applied directly to images being inserted into a medical picture archiving system” [49].
However, this does not hold for the runs provided by the BLOOM team, since the BoW dictionary
is assembled taking into consideration features extracted from the test set, thus improving the
identification of visual concepts during the clustering process. Even if no previous annotation is
needed for this technique, which is an acceptable strategy for the task, it finds no applicability
in PACS.
2.3.4 2008 Medical Annotation Task
For this year's task [50] the training database was increased to 12076 images, for a total
of 196 unique IRMA codes. The test set was again set to 1000 images for annotation. The extra 80
classes, when compared with the 2007 task, posed a very high challenge to the participants.
This is probably the reason why, from 37 registrations, only 6 groups submitted runs, 23 in
total.
Rank Team Error Count
1 IDIAP 74.92
2 TAU 105.75
3 RWTH-mi 182.77
4 MIRACLE 187.90
5 GENEVA 209.70
6 FEIT 286.48
Table 2.5 – Results for the 2008 ImageCLEF Medical Annotation Task considering the best
runs for each group (From: [50]).
IDIAP – IDIAP Research Institute, Martigny, Switzerland:
IDIAP team achieved, for the second consecutive year, the first place in the task with an
error score of 74.92. The best of the 10 runs submitted by this team consisted in a low level
feature integration of the previous 2007 Medical Annotation Task modified SIFT descriptor, at
one scale and ignoring rotation, considering the full image and 2 < 2 partitions for a 2500
elements feature vector together with the Locally Binary Pattern (LBP) rotationally invariant
operator. The idea behind the LBP is to extract textural information from the region around a
pixel after performing a binarization accordingly to a threshold. As an image feature the
concatenation of two two-dimensional histograms of LBP was considered leading to a vector of
648 elements.
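A sketch of the basic 8-neighbor LBP operator behind this descriptor (the team used a rotation-invariant variant; this plain version only illustrates the thresholding idea): each neighbor is binarized against the center pixel and the resulting bits form the pattern code.

```python
def lbp_code(img, y, x):
    """Basic LBP code of the pixel at (y, x); neighbors clockwise from top-left."""
    c = img[y][x]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if img[y + dy][x + dx] >= c:   # binarize neighbor against the center
            code |= 1 << bit
    return code
```

A flat region yields the all-ones code, while a center brighter than every neighbor yields 0; the image feature is then a histogram of these codes.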
Automatic annotation was performed with an SVM using an exponential chi-squared kernel
after a simple concatenation of features. Virtual image examples of under-represented classes
were added to the training set by performing SIFT-invariant transformations. Among these were
small rotations up to 40 degrees, shifts of 50 pixels in several directions, increments of 50-100
pixels in scaling and illumination variations. The best parameters for the SVM kernel were
optimized using subsets of the training set, taking into consideration the class abundance
therein. Once the annotation was performed, a confidence-based opinion fusion followed: the
attained distance is subtracted from the maximum distance of a class to the hyperplane and, if
the result is less than a defined threshold, a wildcard is placed. Other methods considering the
MCK were also used but did not perform better. The method used in the 2007 Medical
Annotation Task was also submitted, with poor results.
TAU – Medical Image Processing Lab, Tel Aviv University, Tel Aviv, Israel:
TAU team achieved the second place in their first participation, with a score of 105.75. A
visual vocabulary of 700 words was built from a collection of 9 × 9 rectangular patches, separated
by 6 pixels, from 400 randomly selected images. The covariance matrix of
approximately 2 million patches (around 2500 patches per image) was then computed and PCA
was applied to find its eigenvectors. The 6 highest-energy eigenvectors were used as a basis for the
rest of the patches. All patches extracted from one image were normalized to zero mean and
unit variance, thus providing some invariance to illumination. The patch mean gray level is lost
during the PCA and is later added back as one more feature. Moreover, the spatial location of the
patches was also used, extending the number of elements in the feature vector to 7. The Euclidean
distance was used for the clustering of all features extracted from the 400 random images to build
the dictionary.
For image annotation a multiclass one-vs-one SVM with an RBF kernel was trained directly
on the histograms. The hierarchy was ignored and every IRMA code was considered as a class
of objects. However, another method by this team, using the SVMs' probability outputs, uses
wildcards in the annotation process.
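The patch normalization step described above can be sketched in a few lines (illustrative names): each patch is shifted to zero mean and scaled to unit variance, giving some illumination invariance, while the discarded mean gray level is kept to be re-added as an extra feature.

```python
import math

def normalize_patch(patch):
    """Return the zero-mean, unit-variance version of a flat patch
    together with its original mean gray level."""
    n = len(patch)
    mean = sum(patch) / n
    var = sum((p - mean) ** 2 for p in patch) / n
    std = math.sqrt(var) if var > 0 else 1.0   # guard against flat patches
    return [(p - mean) / std for p in patch], mean
```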
RWTH-mi – Department of Medical Informatics, RWTH University, Aachen, Germany:
RWTH-mi team submitted the same run as in 2005 and achieved an error count of 182.77. This
run is a baseline for evaluating the evolution of the other methods over the several years
that the Medical Annotation Task took place. The hierarchical nature of the IRMA code was again
disregarded.
MIRACLE – Universidad Politecnica de Madrid, Universidad Carlos III de Madrid,
DEADALUS S.A., Madrid, Spain:
MIRACLE team, with a score of 187.90, achieved the fourth place in this task. The FIRE
engine was abandoned this year and the features extracted comprised a gray histogram, statistics
of several orders, Gabor features (4 scales, 8 orientations), co-occurrence matrix statistics,
DCT coefficients, Tamura features and Discrete Wavelet Transform (DWT) coefficients from images resized to 256 × 256 pixels. The same features were also extracted from 64 × 64 blocks
of the resized images. For annotation the classifier contained two blocks: the first selects the
images from the training set whose distance to the feature vector of a test image is under a
defined threshold; the second generates the IRMA code depending on the codes and similarity
of the chosen images, assigning a “*” when the added strings disagree. This is, indeed, a
variation of a k-NN algorithm. The best results were achieved for k=3. Relevance feedback
methods were also applied but led to slightly worse results.
GENEVA – University and University Hospitals of Geneva, Service of Medical
Informatics, Geneva, Switzerland:
GENEVA team achieved the 5th place in the task with an error count of 209.70. The best
method used the GIFT, as in previous tasks, to perform image annotation. The only changes
were in the parameterization settings and the classification strategies adopted. Image features were
similar to those used in the 2007 Medical Annotation Task. The annotation, however, now took
into consideration the bias during the k-NN voting caused by the unbalanced number of
images in each class. This strategy did not produce better results than a simple axis-wise
descending voting strategy for k=5. Later, different thresholds were tested and it was verified that a
letter-by-letter voting performs best. However, the meaning of such a threshold is not very clear.
FEIT – Faculty of Electrical Engineering and Information Technologies, University of
Skopje, Skopje, Macedonia:
The FEIT team from the University of Skopje placed last in this task with a score of
286.48. Only one descriptor, the EHD, was used to extract information from the images. These
were divided into 16 × 16 blocks and, for each, an edge histogram corresponding to 5 different
orientations (vertical, horizontal, 45 degrees, 135 degrees and non-directional edges) was
computed; these histograms were concatenated into a vector of 80 elements. For classification an
axis-wise strategy was adopted using top-down induction of decision trees and random forests with
bagging, taking into consideration the maximization of the reduction of variance for a better cluster
homogeneity. An ensemble of 4 training and 4 test sets, for each axis, of 100 un-pruned trees was
created and the feature subset size for random forests was set to 7. As examples travel through the
tree, a threshold is defined for the Euclidean distance between the variances of the two nodes.
Discussion for the 2008 Medical Annotation Task
As expected, the 2008 Medical Annotation Task led to a better use of the hierarchy, with the
best runs, except the ones from TAU and RWTH-mi, using wildcards during the process. The
motivation for this was the increase of image classes from 116 to 196, which posed a challenge
for classifiers because these rely heavily on examples to perform annotation. Image classes where
wildcards were most often used (sometimes with 8 or more wildcards) were the least represented
in the training set. The amount of wildcards used over the totality of runs ranges from nearly
1000 to 7000 [50]. The winning run, from the IDIAP team, used 4148 wildcards.
Results for this task vary much more than in the previous years' tasks. Only the IDIAP and TAU
teams achieved error counts below the baseline (RWTH-mi run); nevertheless, the difference
between them is quite large. Other methods achieved scores close to the baseline error. Of these,
perhaps MIRACLE saw the largest improvement compared with previous years. Like in the 2007
Medical Annotation Task, SVMs and local descriptors based on dense sampling (IDIAP team)
or image patches (TAU team) together with a BoW outperformed all other classifiers and
descriptors.
2.3.5 2009 Medical Annotation Task
This was the last time that the Medical Annotation Task took place. No further editions are
planned for the near future. An overall task, comprising a survey of all the tasks from
previous years, was proposed [45]. A total of 12677 images were made available for training
and 1733 images for testing. Images corresponding to three IRMA codes present in previous
years, 1121-120-450-700, 1121-120-700-400 and 1121-490-913-700, were discarded. Teams
were allowed to submit only one annotation method for all years. Nevertheless, small variations
within the same method, such as different parameterizations, were allowed.
Error evaluation underwent some changes due to the mixture of data from different years in the
same test set. Moreover, the "*" classification option was also introduced for the 2005 and 2006
annotations. For the first two years a correctly classified code yields no error while a
misclassified one incurs an error of 1.0. For the "*", half of the maximum error, 0.5, is given.
Table 2.6 gives a small example of this error counting scheme.
Classified | Error Count
18 | 0.0
26 | 1.0
* | 0.5
Table 2.6 – Error count scheme considering the “*” for 2005 and 2006 data (From: [45])
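This scheme can be written down directly (a sketch with hypothetical function and argument names; as in Table 2.6, the true code is assumed to be 18):

```python
def error_2005_2006(classified, truth):
    """Error count for the 2005/2006 data: 0.0 if correct,
    0.5 for the '*' (don't know) option, 1.0 if misclassified."""
    if classified == "*":
        return 0.5
    return 0.0 if classified == truth else 1.0
```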
The usage of test data for multiple tasks introduced the concept of clutter class. Because
not all classes are considered in the evaluation for a specific year, a clutter class is assigned to
all classes that have no expression therein. Hence, any annotation for a clutter class is
not subject to error. Table 2.7 exemplifies the error score for a clutter class.
Classified (2005-2006) | Error Count
18 | 0.0
26 | 0.0
* | 0.0
C | 0.0

Classified (2007-2008) | Error Count
111 | 0.0
11* | 0.0
1** | 0.0
*** | 0.0
*C* | 0.0
Table 2.7 – Error count scheme considering the clutter class for 2005-2008 tasks (From: [45])
The details about the number of classes per task, for training and annotation, are depicted in
Table 2.8. It is convenient to remember that, despite the data distribution being similar for 2006
and 2007, the annotation was performed without and with the IRMA code, respectively. When not
using numerical classes for annotation (2005 and 2006), each of these can contain several types
of images corresponding to distinct IRMA codes.
Data Distribution for the 2009 Medical Annotation Task

Training | 2005 | 2006 | 2007 | 2008
Classes | 57 | 116 | 116 | 193
Images | 12631 | 12334 | 12334 | 12677
Clutter | 46 | 343 | 343 | -

Test | 2005 | 2006 | 2007 | 2008
Classes | 55 | 109 | 109 | 169
Images | 1639 | 1353 | 1353 | 1733
Clutter | 94 | 380 | 380 | -
Table 2.8 – Data distribution for the 2009 Medical Annotation Task (From: [45])
This task was particularly extensive and all the teams that participated in previous years
were invited to do so again. Only 7 teams submitted a total of 19 runs. Table 2.9 shows the final
standings for the participants of this task.
Rank | Team | 2005 | 2006 | 2007 | 2008 | Error Count Sum
1 | TAUBiomed | 356 | 263 | 64.3 | 169.5 | 852.8
2 | IDIAP | 393 | 260 | 67.23 | 178.93 | 899.16
3 | FEITIJS | 549 | 433 | 128.1 | 242.46 | 1352.56
4 | VPA | 578 | 462 | 155.05 | 261.16 | 1456.21
5 | MedGIFT | 618 | 507 | 190.73 | 317.53 | 1633.26
6 | IRMA | 790 | 638 | 207.55 | 359.29 | 1994.84
7 | DEU | 1368 | 1183 | 487.5 | 642.5 | 3681
Table 2.9 – Final standings for the 2009 Medical Annotation Task (From: [45])
TAUBiomed – Medical Image Processing Lab, Tel Aviv University, Tel Aviv, Israel:
TAUBiomed won the task with an error sum of 852.8. Information extraction from the
images was performed identically to the 2008 Medical Annotation Task and, for classification,
an extensive grid search was conducted to optimize the parameters of a multiclass SVM with a χ² kernel
using 5-fold cross-validation, but taking into consideration the error count and not the error rate. The
IRMA code hierarchy was not used and, in the end, all '0' positions were replaced by wildcards.
Therefore, if the '0' code was correct this strategy does not imply an additional error. Moreover,
if the '0' was wrongly placed in a last position then the error count is reduced to half. Model
training and annotation took approximately 90 minutes.
IDIAP – IDIAP Research Institute, Martigny, Switzerland:
The IDIAP team achieved 2nd place with a sum score of 899.16 and the lowest error count
for the 2006 IRMA database. The method used is exactly the same as in the previous 2008
Medical Annotation Task.
FEITIJS – Faculty of Electrical Engineering and Information Technologies, University
of Skopje, Skopje, Macedonia:
The FEITIJS team ranked 3rd with a sum score of 1352.56. The approach used was very similar to
the one used during the 2008 Medical Annotation Task. Besides the EHD descriptor, SIFT
was also used in a three-stage process: first, the keypoints and the corresponding descriptors were
extracted; second, all keypoints were clustered into 2000 clusters; third, a histogram of 2000 bins
was created by distributing the keypoints according to the closest cluster. This is, indeed, a
BoW with 2000 words. This histogram was concatenated with the EHD descriptor and decision
trees were once again used to perform annotation, following exactly the same approach as in the 2008
Medical Annotation Task but with a feature subset size of 11.
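The third stage, building the bag-of-words histogram from precomputed cluster centres, can be sketched as follows (our own minimal NumPy version; the actual run used SIFT descriptors and 2000 clusters obtained beforehand):

```python
import numpy as np

def bow_histogram(descriptors, centroids):
    """Assign each keypoint descriptor to its closest cluster centre
    (Euclidean distance) and accumulate one histogram bin per visual word."""
    # squared distance from every descriptor to every centroid
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    return np.bincount(words, minlength=len(centroids))
```

Each image is then represented by one fixed-length vector regardless of how many keypoints it produced.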
VPA – Computer Vision and Pattern Analysis Laboratory, Sabanci University, Istanbul,
Turkey:
The VPA team ranked 4th, achieving an error sum of 1456.21. Images were divided into
4 × 4 non-overlapping blocks and, for each, a derived LBP histogram was extracted; these were
concatenated into a single, spatially enhanced histogram. This derivation of the LBP aimed to
capture edges, spots and flat areas over the image. The feature vector totalled 944 elements
and a normalization to the interval [-1, +1] was performed before submission to the classifier.
For the annotation task a multiclass one-vs-all SVM with an RBF kernel was chosen.
Parameter configuration was done empirically (trial and error) using 5 disjoint subsets of 2000
images, considering the minimum average error rate (maximum accuracy). Of the 3 strategies
used, the axis-wise one performed best. One SVM was trained for each code axis and the final code
was predicted from the composition of all the results from each model. The other 2 strategies
evaluated, ignoring the hierarchy and training SVMs according to the abundance of classes in
the training set, performed slightly worse.
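A spatially enhanced histogram of this kind can be sketched with the basic 256-bin LBP (our own simplification; the team's derived LBP variant, which reduces the vector to 944 elements, is not reproduced here):

```python
import numpy as np

def lbp_codes(img):
    """Basic 8-neighbour LBP code for every interior pixel:
    one bit per neighbour, set when the neighbour is >= the centre."""
    c = img[1:-1, 1:-1]
    nbrs = [img[:-2, :-2], img[:-2, 1:-1], img[:-2, 2:], img[1:-1, 2:],
            img[2:, 2:], img[2:, 1:-1], img[2:, :-2], img[1:-1, :-2]]
    codes = np.zeros(c.shape, dtype=np.uint8)
    for bit, n in enumerate(nbrs):
        codes |= ((n >= c).astype(np.uint8) << bit)
    return codes

def spatial_lbp_histogram(img, grid=4):
    """Split the LBP code image into grid x grid blocks and concatenate
    the per-block 256-bin histograms into one spatially enhanced vector."""
    codes = lbp_codes(img)
    h, w = codes.shape
    hists = []
    for i in range(grid):
        for j in range(grid):
            block = codes[i * h // grid:(i + 1) * h // grid,
                          j * w // grid:(j + 1) * w // grid]
            hists.append(np.bincount(block.ravel(), minlength=256))
    return np.concatenate(hists)
```

The concatenation preserves which texture patterns occur in which part of the image, which a single global LBP histogram discards.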
MedGIFT – University and University Hospitals of Geneva, Service of Medical
Informatics, Geneva, Switzerland:
MedGIFT scored an error sum of 1633.26 using the GIFT retrieval system. The method used
was similar to those of the 2007 and 2008 Medical Annotation Tasks, with a 5-NN using GIFT with
8 gray levels performing consistently better for all years. An SVM approach using SIFT was
also attempted but performed poorly due to a poor configuration of the RBF kernel parameters.
IRMA – Department of Medical Informatics, RWTH University, Aachen, Germany:
The IRMA team, with a sum score of 1994.84, placed 6th in the rankings with the usual
baseline run. As in previous years, the hierarchy was disregarded.
DEU – Dokuz Eylul University, Tinaztepe, Turkey:
The DEU team ranked last in this task with a sum score of 3681. Local and global
descriptors were used with a k-NN algorithm for classification.
Discussion for the 2009 Medical Annotation Task and final remarks:
The methods used in the 2009 Medical Annotation Task were based on the best methods
from previous years. Therefore, the consistency of the error count rankings in each individual year
is no surprise. Local image descriptors and support vector machines were still the best
options for image annotation. The TAUBiomed team surpassed the IDIAP team in all dataset years with
the exception of 2006. This is a curious result since the difference in 2008 was quite large. The
replacement of '0' ending positions by a wildcard may have contributed to a better error count;
however, the interpretation of this strategy admits a dual point of view: the meaning of '0',
unspecified, can be seen as possessing the same semantic meaning as a wildcard, 'do not know'.
By assigning '0' or a wildcard, the information about a specific position is null. On the other hand,
a slightly different interpretation can be made by understanding '0' as unspecified because it is
surely not one of the other possible choices, e.g., as a rejection option, and a wildcard as 'do not
know' because there is not enough confidence to assign a specific classification. The extensive
grid search should not be excluded from the reasons for such a boost in performance. Local
descriptors also helped the FEITIJS team to improve in the rankings and achieve the best
error sum for a non-SVM classifier approach, similarly to 2005. Curiously, the VPA team's
method also uses one of the image descriptors that performed best in 2007 and 2008 but, with a
different SVM kernel, it performs poorly.
From 2005 until 2009 it is remarkable to see the variety of strategies encompassing image
descriptors and classifiers. However, there are still many aspects to consider in this problem. For
instance, the IRMA database consists only of X-rays, with the T-axis locked in some positions,
but the IRMA code covers a wider range of modalities. Databases regarding these other
modalities and annotated with the IRMA code are still not available. Taking into consideration
this larger spectrum of medical image types and their corresponding classes would require the
whole problem to be readdressed. Also, many images from the 2007-2009 databases have
several unspecified axes/positions in their IRMA code. Therefore, the success of the methods
presented during the 2008-2009 task years cannot be evaluated to their full extent, in a setting where such
incompleteness is absent. It is convenient to remember that the IRMA code is still under
development and its number of positions within an axis is expected to grow. This will have an
impact on the error evaluation scheme.
As a last remark, we note that there were generally no particular criteria behind the image
descriptors selected in the works reviewed, with many relying on feature extraction engines.
Possibly this choice reflects the image descriptors the authors used in other works in their
research. The only exception is the work from RWTH University, home of
the IRMA database.
2.4 Other IRMA database related work
In [51] a mammographic database annotated with the IRMA code is presented. This database
contains more image direction and anatomical information than the mammographic images
present in the IRMA database. It is a good example of the IRMA code's evolution. Another
IRMA database related work can be found in [52]. Here a subset of the IRMA database
consisting of 9100 images was annotated as belonging to 40 distinctive classes. The IRMA
code is not used in this database, only two single annotations regarding anatomy and
direction. The paper presents experimental results on a new merging schema for medical image
classification where the 40 classes are merged into 25 hierarchically superior classes. For this, a
large number of image descriptors are extracted from the images: shape features, like invariant
moments, Fourier descriptors and axis orientation; texture features, namely of statistical origin,
like energy, homogeneity, contrast and correlation; finally, a tessellation-based spectral feature
as well as a directional histogram, both in multi-scale space. Feature selection uses a
backward algorithm and classification, in a first stage, is performed using a multi-layer
perceptron. Thereafter, the merging scheme, an iterative procedure based on a distribution
function estimation using a mixture of Gaussian functions and the expectation-maximization
algorithm, is applied and the results are encompassed in the 25 hierarchically superior classes. The 9100-image
dataset was divided into a training set consisting of 7861 images and a test set consisting
of 1239 images. Ultimately the merged class system had an accuracy of 90.83%.
Chapter 3
Background Information
3.1 The image domain
The use of digital images dates from 1920, when the Bartlane cable picture transmission
service was used to transfer images between London and New York. These were codified
in 5 gray levels (later 15) and reconstructed using a telegraph printer. The use of digital images
as we know it today appears in the 1960s, when improvements in computing technology and the
onset of the space race led to a surge in digital image processing, especially in the enhancement
of pictures of the moon taken by the Ranger and Apollo missions [53]. In the medical field the
digital image appears in the 1970s and its importance is recognized in 1979, when Sir Godfrey
N. Hounsfield and Prof. Allan M. Cormack shared the Nobel Prize in Medicine for the development
of computer-assisted tomography, the invention behind Computerized Axial Tomography. But what is an
image? The image, in a literal definition, is a two-dimensional pictorial representation. The
digital image is an approximation of a two-dimensional image by a set of values called pixels or
texels. Each pixel is described in terms of its color, intensity/luminance or value. Each digital
image¹ has a limited extent, window or size, an outer scale, and a limited resolution, the inner
scale.
Mathematically the image is a real function $f: \mathbb{R}^2 \to \mathbb{R}$ mapping two real variables into a
third real variable. Thus $f(\mathbf{x}) = f(x, y) = v$, where $(x, y)$ is the full spatial domain of the
image, i.e., a set of points in the Cartesian plane, and $v$ is the luminance/color/value of the
image point, with $\mathbf{x} \in \mathbb{R}^2$ and $v \in \mathbb{R}$. The value $v$ can be interpreted in many ways and it is not
necessarily a positive value. Also, depending on the color system used, $v$ may have a higher
dimensionality. In the digital image $\mathbf{x} = (x, y)$ are pixel coordinates, whose values are bounded
by the image size $M \times N$, with $x \in \{1, 2, \ldots, M\}$ and $y \in \{1, 2, \ldots, N\}$. Therefore, the digital
image can be seen as an $M \times N$ matrix of elements.
In order to retrieve images according to a given query we need to enhance their relevant
elements while reducing the remaining aspects. This is the goal of image processing.
Generically, we act on the image using an operator, $g$, over the full spatial domain of the
image, $f(x, y)$, an image patch, $f(u, v)$, or an interest point, $f(u_i, v_j)$, to generate a feature
space containing the information needed to identify the objects, in the following way:

$F(\mathbf{x}) = g \circ f(x, y)$ (3.1)

$F(\mathbf{x}_{(u,v)}) = g \circ f(u, v)$ (3.2)

$F(\mathbf{x}_{(u_i,v_j)}) = g \circ f(u_i, v_j)$ (3.3)

where $f(x, y)$ is the full image; $f(u, v)$ is an image patch, i.e., a connected subset of Cartesian
points with $(u, v) \subset (x, y)$ and $u, v \in \mathbb{R}$; and $f(u_i, v_j)$ is the $v$ value at an interest point
$(u_i, v_j)$, with $i \in \{1, 2, \ldots, M_u\}$ and $j \in \{1, 2, \ldots, N_v\}$.

¹ For easy reference a "digital image" will henceforth be addressed simply as "image".
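The three cases in equations (3.1)-(3.3), applying an operator to the full image, to a patch, or at an interest point, can be mirrored in code (a schematic sketch with names of our own choosing):

```python
import numpy as np

def apply_descriptor(img, g, patch=None, point=None):
    """Apply an operator g over the full image (3.1), a rectangular
    image patch (3.2), or a single interest point (3.3)."""
    if point is not None:
        i, j = point
        return g(img[i:i + 1, j:j + 1])  # value at one interest point
    if patch is not None:
        (r0, r1), (c0, c1) = patch
        return g(img[r0:r1, c0:c1])      # restricted spatial domain
    return g(img)                        # full spatial domain
```

With `g = np.mean`, the three cases yield a global, a local, and a point-wise feature from the same image.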
3.1.1 Image properties
In Chap. 1 it was stated that there is not a clear definition of what image content is. Instead,
relationships between image properties like color, shape, texture and interest points are certain
to be fundamental for its characterization. But what exactly are these properties and how are
they used for CBIR purposes?
3.1.1.1 Color
Natural Conceptualists say that color is the product of an epistemological conceptualization:
an object is blue because we learn that it is blue and accept it as a "truth". This is acceptable if
we can explain why it is "truth". We are therefore left with a difficult philosophical problem
where there is no consensus on the origin of color. Two very distinct points of view exist:
some theorists hold that color is a perceiver-relative property of objects, e.g., dispositions
or powers to induce experiences of a certain kind, or to appear in certain ways to observers of a
certain kind, while others state that colors are objective physical properties of objects, e.g.,
colors rely on the microscopic physical properties of bodies and are, therefore, irreducible
[54]. There are many theories about color: Color Fictionalists state that there are no colors at all
(!), exploiting the gaps of other theories and thus supporting a perceiver-relative point of view;
Simple Objectivists hold that color is either related to physical properties of
objects or to the nature of light, hoping that science will provide an answer; Ecologists (!)
diverge slightly from Color Fictionalists, arguing that color is a relational property between the
environment and the individual. There are other theories attempting a "unified" definition of
color, but so far an agreement seems impossible. For the interested reader, details of color theory
can be found in [54].
In CBIR, color is a widely used visual feature to categorize objects. For this, the variable $v$ is
expressed in terms of a color space to represent image colors. The RGB (Red, Green, Blue)
system is commonly used to represent color images, where colors are
expressed as a sum of red, green and blue gray level intensities. There are many other color
systems, like the HSV (Hue, Saturation, Value) or the CMYK (Cyan, Magenta, Yellow, Key).
The image can also be represented as an 8-bit grayscale image, where pixel intensity is
registered in terms of 256 shades of gray, or as a 1-bit binary image, in black and white. A
quick reference to color systems and conversions between them can be found in [55].
Depending on the color system, one or more histograms are employed to quantify the color
distribution, defined by the number of bins used. Differences in color distribution are
sometimes essential to determine differences between images. However, such a distribution can
lead to errors when different images present similar histograms. Aiming to capture spatial
relationships between colors, the image is partitioned into smaller subimages and a color
histogram is extracted from each of these. This results in the color layout of the image.
Correlations between pairs of similar colors, based on their mutual distance within the image,
can also be explored in what is called a color auto-correlogram.
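The distinction between a global histogram and a color layout can be illustrated on grayscale intensities (a toy sketch of our own; real color layouts are computed per channel of a color space):

```python
import numpy as np

def layout_histogram(img, grid=2, bins=8):
    """Concatenate one intensity histogram per sub-image, capturing a
    coarse spatial layout of the distribution of values."""
    h, w = img.shape
    parts = []
    for i in range(grid):
        for j in range(grid):
            block = img[i * h // grid:(i + 1) * h // grid,
                        j * w // grid:(j + 1) * w // grid]
            parts.append(np.histogram(block, bins=bins, range=(0, 256))[0])
    return np.concatenate(parts)

# Two images with identical global histograms but different layouts:
left_right = np.zeros((4, 4)); left_right[:, 2:] = 200.0
top_bottom = np.zeros((4, 4)); top_bottom[2:, :] = 200.0
```

The global histograms of the two images match bin for bin, while their layout histograms differ, which is exactly the failure mode of plain histograms described above.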
3.1.1.2 Shape
The shape of an object is its apparent form. In the image domain, extracting shape consists of
identifying lines and curves. Shape extraction is already a well-developed field in
image processing where two main streams exist: gradient-like methods using directional
maxima lookup to quantify the edge strength, like the Canny edge detector and the Sobel, Prewitt and
Roberts operators, and second-derivative zero-crossing search methods, like the Laplacian-based
approaches. Other methods based on the Hough transform, curve propagation and wavelets also
exist. Quantification of edges is made using histograms considering the full image spatial
domain or, as with color, subimages.
3.1.1.3 Texture
The concept of texture is somewhat intuitive, being closely related to visual patterns
perceived on the surface of objects that present homogeneity. However, its definition is not exact.
Coggins summarizes several definitions of texture in the literature [56] just to find out that each is
adjusted to the context of the works therein presented. Perhaps a good definition of texture for
image processing is that it is a function of the spatial variation in pixel intensities [57].
In image processing, texture representation relies mainly on two methods: structural and
statistical. Structural methods aim to identify texture structural primitives and the placement of their
elements, looking for regularity. Examples of this method are adjacency graphs or
morphological operations. Statistical methods use first- or higher-order statistics to analyze the
distribution of luminance in the image. These include the popular co-occurrence matrix, Fourier
power spectra or shift-invariant principal component analysis (SPCA). In CBIR, similarity
metrics are the most used method to compare the textures of images during the retrieval process.
A review of texture extraction methods can be found in [57].
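The co-occurrence matrix mentioned above can be sketched for one displacement vector (a minimal version of our own; Haralick-style statistics such as contrast are then moments of this matrix):

```python
import numpy as np

def cooccurrence(img, dx=1, dy=0, levels=4):
    """Grey-level co-occurrence matrix: entry (a, b) counts pixel pairs
    whose values are a and, at offset (dx, dy), b."""
    h, w = img.shape
    glcm = np.zeros((levels, levels), dtype=int)
    for y in range(h - dy):
        for x in range(w - dx):
            glcm[img[y, x], img[y + dy, x + dx]] += 1
    return glcm

def glcm_contrast(glcm):
    """Contrast statistic: expected squared grey-level difference."""
    p = glcm / glcm.sum()
    a, b = np.indices(glcm.shape)
    return ((a - b) ** 2 * p).sum()
```

A smooth texture concentrates mass on the diagonal of the matrix (low contrast), while a rapidly varying one spreads it off-diagonal.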
3.1.1.4 Interest points
Interest points are themselves the result of an operation over the full spatial
domain $(x, y)$. Rather than being used directly, they usually anchor an image descriptor computed
over an image patch around them, which is what plays a role in image retrieval. Therefore,
equations (3.2) and (3.3) are related if the patch is centered on an interest point. Examples of
interest point detectors are the Difference of Gaussians (DoG), the Harris corner detector or the
Hessian matrix. In CBIR the use of image descriptors around interest points is grounded in two
methods: direct image matching, to find the same image in a collection, or together with a
bag-of-words approach to capture image concepts. A review of image descriptors using information
around interest points can be found in [31].
3.1.2 Image descriptors
It is very difficult to ascertain which image properties are fundamental to characterize a
specific image. It depends on the context of the problem we want to solve and the knowledge
within the image itself. If we want to recognize specific objects in a scene, the shape property is
probably more relevant than the others. However, if such objects have a distinct color, then the
relevance of this property is higher than the rest. If we want to detect light bulbs in a night
scenario, we rely on an interest point detector. Sometimes the image can be quite complex and all
properties are essential for its characterization.
Equations (3.1)-(3.3) enable the quantification of the image properties, giving rise to image
features. Such quantification is an attempt to bridge the perception of image properties by the
human vision mechanism to mathematical measure(s) taken from the image. This is the aim of
an image descriptor. Image descriptors can be seen as global, like (3.1), or local, like (3.2) and
(3.3), depending on whether the whole image spatial domain or only a restricted one is used.
An important aspect of information retrieval from an image is that such information remains
unchanged under different conditions. Thus, descriptors invariant to illumination conditions,
distortion, clutter, occlusion or rotation will provide valuable information for the retrieval
process in CBIR.
In addition to image descriptors focusing on color, shape and texture, there are also composite
descriptors. Table 3.1 summarizes the image descriptors used in this work, their
spatial domain and the image properties they cover.
Image descriptors can also be perception-based, if the computed image features are believed
to be related to human perception, like the Tamura textures, or machine-centered, simply
computing a series of statistical measures from the image that, through experimentation, prove
their value in specific problems. Good examples of the latter are the Haralick features [58],
including the co-occurrence matrix. Machine-centered image descriptors have the advantage of
performing analysis of aspects not captured by human vision.
Descriptor | Global | Local | Color | Shape | Texture | Interest Points
Tamura Textures | x | | | | x |
Edge Histogram (EHD) | | x | | x | |
Color Layout (CLD) | | x | x | | |
Scalable Color (SCD) | x | | x | | |
Color and Edge Directivity (CEDD) | x | x | x | | x |
Fuzzy Color and Texture (FCTH) | x | x | x | | x |
Spatial Envelope (GIST) | x | | | x | |
Speeded Up Robust Features (SURF) | | x | | | | x
Table 3.1 – Image descriptors used in this work, classified according to their spatial domain
and the image properties covered. Low-level descriptors, like Tamura, EHD, CLD and
SCD, involve only one image property. CEDD and FCTH are mid-level
descriptors. SURF uses gradient information around interest points and can be seen
as a mid-level descriptor; however, its use to construct a bag-of-words makes it a
high-level descriptor.
Some of the descriptors used are compliant with the Moving Picture Experts Group (MPEG-7)
standards. MPEG-7 defines a Descriptor Definition Language (DDL) for descriptor
schemas in multimedia content. For images, an MPEG-7 descriptor should be compact, in order
to be added as metadata, and with proven results for image retrieval. Incorporating a set of image
features as part of a standard in the image file metadata can improve image retrieval systems
because the computation of such features is bypassed. With the exception of GIST and SURF, all
descriptors presented here are MPEG-7 descriptors. More information about image descriptors
compliant with the MPEG-7 standards can be found in [59].
In this work we selected image descriptors based on several criteria: first, we intend to capture
image information with respect to the image properties described in the previous section; second
is an availability issue, as we enforced the use of image descriptors with code provided by their
authors or as part of an image information extraction engine; third, we wanted to use recently
proposed descriptors, CEDD and FCTH, with machine learning methods rather than similarity
measures. As in the related works, we did not choose our image descriptors based on the
nature of the images. Most are general image descriptors used in a large variety of problems.
Therefore this work is also a test of their robustness.
3.1.2.1 Tamura Texture Features
Based on psychological experiments, Tamura [60] developed a series of computable features
that correlate with human perception of images: coarseness, contrast, directionality, line-likeness,
regularity and roughness.
Coarseness
Coarseness is computed from considerable spatial variations of gray levels. It is related,
although implicitly, to the size of the primitive textural structures (texels) present in the
image. In a first step, averages over $2^k \times 2^k$ image sub-windows (Figure 3.1), where
$k \in \{0, 1, \ldots, n\}$, are computed around each pixel:

$A_k(u, v) = \sum_{i=u-2^{k-1}}^{u+2^{k-1}-1} \sum_{j=v-2^{k-1}}^{v+2^{k-1}-1} f(i, j) / 2^{2k}$ (3.4)
Then the absolute differences between pairs of non-overlapping averages on opposite sides,
both in the horizontal and vertical directions, are calculated:

$d_k^h(u, v) = |A_k(u + 2^{k-1}, v) - A_k(u - 2^{k-1}, v)|$ (3.5)

$d_k^v(u, v) = |A_k(u, v + 2^{k-1}) - A_k(u, v - 2^{k-1})|$ (3.6)
Figure 3.1 – $2^k \times 2^k$ image sub-windows for coarseness extraction (From [60]).
Now considering the maximum difference found,

$\max_{k,d} d_k^d(u, v)$ (3.7)

where $k \in \{0, 1, \ldots, n\}$ and $d \in \{h, v\}$, its corresponding scale $S_{best}(u, v) = 2^k$ is used to
compute the coarseness over the entire image:

$F_{crs} = \frac{1}{M \times N} \sum_i \sum_j S_{best}(i, j)$ (3.8)

with $M \times N$ the image size.
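Equations (3.4)-(3.8) can be sketched brute-force as follows (our own simplification: borders are clipped, and the cumulative-sum tricks normally used to speed up the window averages are omitted):

```python
import numpy as np

def coarseness(img, n=2):
    """Tamura coarseness: for each pixel keep the scale 2^k that maximizes
    the horizontal/vertical difference of opposite window averages (3.5-3.7),
    then average the best scales over the image (3.8)."""
    h, w = img.shape
    sbest = np.zeros((h, w))
    maxdiff = np.zeros((h, w))
    for k in range(1, n + 1):
        half = 2 ** (k - 1)
        # A_k: mean over (clipped) 2^k x 2^k neighbourhoods, eq. (3.4)
        A = np.zeros((h, w))
        for y in range(h):
            for x in range(w):
                A[y, x] = img[max(0, y - half):min(h, y + half),
                              max(0, x - half):min(w, x + half)].mean()
        for y in range(h):
            for x in range(w):
                dh = abs(A[y, min(w - 1, x + half)] - A[y, max(0, x - half)])
                dv = abs(A[min(h - 1, y + half), x] - A[max(0, y - half), x])
                d = max(dh, dv)
                if d > maxdiff[y, x]:
                    maxdiff[y, x] = d
                    sbest[y, x] = 2 ** k
    return sbest.mean()
```

Coarse textures select large windows before the opposing averages start to differ, so they yield larger values than fine ones.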
Contrast
Contrast measures how gray levels vary in the image and the extent to which their
distribution is biased either towards white or black. If an image has low contrast, its
histogram is expected to be a Gaussian function; if it is a Gaussian function, then it is unimodal.
Polarization gives a measure of the number of peaks that a distribution has and can be estimated
by the kurtosis

$\alpha_4 = \frac{\mu_4}{\sigma^4}$ (3.9)

where

$\mu_4 = E[(f(u, v) - \mu)^4]$ (3.10)

$\sigma^4 = (E[(f(u, v) - \mu)^2])^2$ (3.11)

are the 4th moment about the mean and the squared variance, respectively.
If the kurtosis is platykurtic (negative), the histogram of the image exhibits multiple peaks,
i.e., it is not Gaussian. Otherwise, if the histogram is unimodal, a leptokurtic
(positive) kurtosis is expected. The contrast is then defined as

$F_{con} = \frac{\sigma}{(\alpha_4)^{1/4}}$ (3.12)
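Equations (3.9)-(3.12) translate directly into code (a sketch of our own):

```python
import numpy as np

def tamura_contrast(img):
    """Tamura contrast (3.12): standard deviation divided by the fourth
    root of the kurtosis alpha4 = mu4 / sigma^4 from (3.9)-(3.11)."""
    mu = img.mean()
    sigma2 = ((img - mu) ** 2).mean()   # variance
    mu4 = ((img - mu) ** 4).mean()      # 4th moment about the mean
    alpha4 = mu4 / sigma2 ** 2          # kurtosis, eq. (3.9)
    return np.sqrt(sigma2) / alpha4 ** 0.25
```

For a symmetric two-value image the kurtosis is 1, so the contrast reduces to the standard deviation itself.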
Directionality
To compute the directionality of an image, a histogram $H_D$ of local edges at different orientations
is constructed using a Sobel filter. This histogram is expected to be uniform for images without
a strong orientation and to exhibit peaks for images with high directionality. The estimation of
the sharpness of the peaks in the histogram, by summing the second moments around each one
of them, gives the measure of directionality:

$F_{dir} = 1 - r \cdot n_p \cdot \sum_{p}^{n_p} \sum_{\phi \in w_p} (\phi - \phi_p)^2 \cdot H_D(\phi)$ (3.13)

where $r$ is a normalization factor related to the quantization of the angles, $n_p$ is the number of
peaks, $w_p$ are the points at peak $p$ and $\phi_p$ is the position of peak $p$ in $H_D$. All summations are
stored in a 16-bin histogram.
Line-likeness, Regularity and Roughness
The last Tamura features are related to the previously described ones, thus adding little in
terms of textural discriminative power. Line-likeness is the average coincidence of edge
directions that co-occur in pairs of pixels at a distance $d$ along the edge direction. This is
measured using the cosine difference between the angles.
The regularity is defined as

$F_{reg} = 1 - r \cdot (\sigma_{crs} + \sigma_{con} + \sigma_{dir} + \sigma_{lin})$ (3.14)

where $r$ is the normalization factor stated before and $\sigma$ is the standard deviation of the measure
stated in the subscript index. Roughness is the sum of coarseness and contrast.
3.1.2.2 Edge Histogram Descriptor (EHD)
The EHD captures the spatial distribution of edges across the image in five orientations: horizontal, vertical, 45°, 135° and non-directional (Figure 3.2).
Figure 3.2– Types of edges in the Edge Histogram Descriptor (From [61]).
Images are divided into blocks of size $2^n \times 2^n$ pixels, usually $16 \times 16$ pixels, and, for each sub-image, one or more digital filters (Figure 3.3) are applied to detect edges along the mentioned orientations
Figure 3.3– Example of a commonly used edge detector operator (From [61]).
Quantification of the number of edges for each sub-image takes into consideration its maximum magnitude. Consider the matrix operators in Figure 3.3. Let $m_v(i, j)$, $m_h(i, j)$, $m_{d45}(i, j)$, $m_{d135}(i, j)$, $m_{nd}(i, j)$ be the magnitudes of the vertical, horizontal, 45-degree, 135-degree and non-directional edges of the $(i, j)$ image block, and let $f_v(k)$, $f_h(k)$, $f_{d45}(k)$, $f_{d135}(k)$, $f_{nd}(k)$ be their respective filter coefficients. Then, the magnitudes of each block can be computed as:

$m_v(i, j) = \left| \sum_{k=0}^{3} a_k(i, j) \times f_v(k) \right|$ (3.15)

$m_h(i, j) = \left| \sum_{k=0}^{3} a_k(i, j) \times f_h(k) \right|$ (3.16)

$m_{d45}(i, j) = \left| \sum_{k=0}^{3} a_k(i, j) \times f_{d45}(k) \right|$ (3.17)

$m_{d135}(i, j) = \left| \sum_{k=0}^{3} a_k(i, j) \times f_{d135}(k) \right|$ (3.18)

$m_{nd}(i, j) = \left| \sum_{k=0}^{3} a_k(i, j) \times f_{nd}(k) \right|$ (3.19)

where $a_k(i, j)$ are the pixel intensity values of the $(i, j)$ image block. Given a magnitude threshold, we construct a histogram of edges for each block. The descriptor is therefore a vector of 80 elements. Furthermore, dividing the number of occurrences in each bin by the total number of blocks normalizes the feature vector. A general flowchart for the extraction of the descriptor can be found below in Figure 3.4.
descriptor can be found below in Figure 3.4.
Figure 3.4– Flowchart for the Edge Histogram Descriptor.
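As an illustrative sketch of equations (3.15)-(3.19), the dominant edge type of one 2x2 sub-block can be computed as below. The filter weights follow common EHD implementations and are an assumption, not necessarily the exact MPEG-7 values:

```python
import numpy as np

# 2x2 edge operators (weights as used in common EHD implementations).
EDGE_FILTERS = {
    "vertical":        np.array([ 1.0, -1.0,  1.0, -1.0]),
    "horizontal":      np.array([ 1.0,  1.0, -1.0, -1.0]),
    "diagonal_45":     np.array([ 2 ** 0.5, 0.0, 0.0, -(2 ** 0.5)]),
    "diagonal_135":    np.array([ 0.0, 2 ** 0.5, -(2 ** 0.5), 0.0]),
    "non_directional": np.array([ 2.0, -2.0, -2.0,  2.0]),
}

def block_edge_type(block, threshold=10.0):
    """Dominant edge type of a 2x2 sub-block (eqs. 3.15-3.19), or None
    when the maximum magnitude stays below the threshold."""
    a = np.array([block[0, 0], block[0, 1], block[1, 0], block[1, 1]], float)
    mags = {name: abs(float(a @ f)) for name, f in EDGE_FILTERS.items()}
    best = max(mags, key=mags.get)
    return best if mags[best] >= threshold else None
```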
3.1.2.3 Color Layout Descriptor (CLD)
The CLD [59] specifies the spatial distribution of colors. In a first stage, images are divided into $8 \times 8 = 64$ blocks and the dominant color of each block is extracted to build an image of size $8 \times 8$. The color averaging method is commonly used, but any other method can be an option during this second phase (Figure 3.5).
Figure 3.5 – Flowchart for the Color Layout Descriptor
In a third stage, each of the three $(Y, C_b, C_r)$ color space components of the $8 \times 8$ image is transformed by a Discrete Cosine Transform (DCT), yielding three sets of coefficients. In the end, a non-linear quantization from a zigzag scanning of the image is used to weight these coefficients, producing the feature vector.
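The pipeline above can be sketched for one channel in NumPy. This is a simplified illustration (the real descriptor also quantizes the coefficients non-linearly, and all function names are ours):

```python
import numpy as np

def dct2_ortho(a):
    """Orthonormal 2-D DCT-II of a square array."""
    n = a.shape[0]
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m *= np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m @ a @ m.T

def zigzag_indices(n=8):
    """(row, col) pairs of an n x n grid in zigzag scan order."""
    idx = [(i, j) for i in range(n) for j in range(n)]
    return sorted(idx, key=lambda p: (p[0] + p[1],
                                      p[0] if (p[0] + p[1]) % 2 else p[1]))

def cld_features(channel, n_coeffs=6):
    """CLD sketch for one color channel: average down to an 8x8 image,
    apply the DCT, zigzag-scan and keep the first n_coeffs coefficients."""
    c = np.asarray(channel, dtype=float)
    h, w = (d - d % 8 for d in c.shape)
    tiny = c[:h, :w].reshape(8, h // 8, 8, w // 8).mean(axis=(1, 3))
    coeffs = dct2_ortho(tiny)
    return np.array([coeffs[i, j] for i, j in zigzag_indices(8)[:n_coeffs]])
```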
3.1.2.4 Scalable Color Descriptor (SCD)
SCD [59] is a histogram of colors in the HSV color space. In a first step, the Hue (H) component is quantized into a 16-bin histogram, while Saturation (S) and Value (V) are quantized into 4-bin histograms (Figure 3.6). Afterwards, a series of 1-D Haar wavelets is applied to these histograms, generating 16 low-pass and 240 high-pass coefficients. Some high-pass coefficients can be discarded, as they consist of low positive and negative values arising from redundant information in the original histograms. By doing this, the total length of the descriptor vector can be reduced to 128, 64 or 32 bins, or to 16 bins if the high-pass coefficients are discarded completely.
Figure 3.6– Flowchart of the Scalable Color Descriptor
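The Haar analysis of a histogram can be sketched as follows. This is a hypothetical helper (averaged sums and differences are one common normalization; the actual SCD normalization may differ):

```python
import numpy as np

def haar_decompose(hist):
    """Full 1-D Haar decomposition of a power-of-two-length histogram:
    returns the low-pass residue and the list of high-pass coefficients."""
    h = np.asarray(hist, dtype=float)
    high_pass = []
    while h.size > 1:
        low = (h[0::2] + h[1::2]) / 2.0        # averaged sums
        high = (h[0::2] - h[1::2]) / 2.0       # averaged differences
        high_pass.extend(high.tolist())
        h = low
    return float(h[0]), high_pass
```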
3.1.2.5 Color and Edge Directivity Descriptor (CEDD)
The Color Edge Directivity Descriptor (CEDD) is a recently proposed composite image
descriptor [62] that captures and relates shape, texture and color from an image (Figure 3.7).
Figure 3.7– Flowchart of the Color and Edge Directivity Descriptor (From [62]).
In this descriptor we can consider the full image or an image block. The texture block
receives the input block in the YIQ color space and applies the EHD descriptor to construct a
histogram of 6 bins, five corresponding to the types of edges found in the image (Figure 3.2)
plus one for when no edge of any type is found. However, the EHD is computed in a different way because the magnitudes of the edges are normalized:

$m'_v = \frac{m_v}{T_{max}},\; m'_h = \frac{m_h}{T_{max}},\; m'_{d45} = \frac{m_{d45}}{T_{max}},\; m'_{d135} = \frac{m_{d135}}{T_{max}},\; m'_{nd} = \frac{m_{nd}}{T_{max}}$ (3.20)

where $T_{max} = \max(m_v, m_h, m_{d45}, m_{d135}, m_{nd})$ is the maximum edge magnitude.

Then, given a threshold, an edge may fall into more than one of the five directional bins. This determines its texture. If the edge does not fall into any edge category, it belongs to the last bin, corresponding to no edges.
For color each input block is processed in the HSV color space according to the types of
edges found previously. The first step is to map each edge block in a preset 10 color bins
histogram: Black, Gray, White, Red, Orange, Yellow, Green, Cyan, Blue and Magenta using a
Binary Haar Wavelet descriptor and a 20 fuzzy-linking rules method [62]. In a second stage this
histogram is expanded into a 24 bin color histogram by using Coordinate Logic Filters (CLF)
for vertical edge detection in all three HSV channels: Hue is divided into 8 areas: Red to
Orange, Orange, Yellow, Green, Cyan, Blue, Magenta and Blue to Red; Saturation is divided
into two fuzzy regions defining the shade of a color based in white; Value channel is divided
into three areas: one defines when the pixel (block) will be black and the other two, in combination with Saturation, when it will be gray. Based on this area division, a set of 4 fuzzy-like rules is applied, transforming the previous 10-color histogram into a 24-color bin histogram
comprehending Black, Gray, White, Dark Red, Red, Light Red, Dark Orange, Orange, Light
Orange, Dark Yellow, Yellow, Light Yellow, Dark Green, Green, Light Green, Dark Cyan,
Cyan, Light Cyan, Dark Blue, Blue, Light Blue, Dark Magenta, Magenta and Light Magenta.
Processing the color information for every edge type yields a $6 \times 24 = 144$ bin histogram. Ignoring the last component of the color unit leads to a $6 \times 10 = 60$ bin histogram named Compact Color and Edge Directivity Descriptor (CCEDD). To meet MPEG-7 definitions, this feature vector undergoes a Gustafson-Kessel classifier to map the final histogram bin values from decimal to integer.
3.1.2.6 Fuzzy Color and Texture Histogram (FCTH)
The Fuzzy Color and Texture Histogram (FCTH) is also a recent composite descriptor [63]
that resembles the CEDD aiming to capture the image texture, shape and color (Figure 3.8).
Figure 3.8– Flowchart of the Fuzzy Color and Texture Histogram (From [63]).
In the texture unit, a one-level Haar transform is applied to the luminosity (Y) channel of the YIQ color space of an input block, resulting in four frequency bands, each containing $2 \times 2$ coefficients. For example, considering a $4 \times 4$ image block, the HL band coefficients are $\{c_{i,j},\, c_{i,j+1},\, c_{i+1,j},\, c_{i+1,j+1}\}$. From here, one feature is computed as:

$f = \frac{1}{4}\sum_{i=1}^{2}\sum_{j=1}^{2} c_{i,j}^2$ (3.21)
The features for the LH and HH bands are computed similarly. These features, moments of the wavelet coefficients, are effective in discerning the image texture because coefficients in different bands signal variations in different directions. For instance, the HL band discriminates activity in the horizontal direction, while vertical variations show high energy in this band and low energy in the LH band. The features computed by (3.21) undergo a fuzzy system which shapes an 8-bin histogram representing several areas: Low Energy Linear Area; Low Energy Horizontal activation; Low Energy Vertical activation; Low Energy Horizontal and Vertical activation; High Energy Linear Area; High Energy Horizontal activation; High Energy Vertical activation; High Energy Horizontal and Vertical activation.
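A sketch of the one-level Haar split into four bands and the per-band energies of (3.21). Band naming conventions (HL vs. LH) vary between references, and this is not the thesis code:

```python
import numpy as np

def haar_band_energies(block):
    """One-level 2-D Haar transform of a square block with even sides;
    returns the mean squared coefficient of each band."""
    b = np.asarray(block, dtype=float)
    lo = (b[:, 0::2] + b[:, 1::2]) / 2.0       # low-pass along rows
    hi = (b[:, 0::2] - b[:, 1::2]) / 2.0       # high-pass along rows
    ll = (lo[0::2] + lo[1::2]) / 2.0           # then along columns
    lh = (lo[0::2] - lo[1::2]) / 2.0
    hl = (hi[0::2] + hi[1::2]) / 2.0
    hh = (hi[0::2] - hi[1::2]) / 2.0
    return {name: float((band ** 2).mean())
            for name, band in [("LL", ll), ("HL", hl), ("LH", lh), ("HH", hh)]}
```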
The remaining procedure for the color unit is similar to the CEDD descriptor and takes into consideration each of the areas computed in the texture unit. Therefore, the FCTH descriptor results in an $8 \times 24 = 192$ bin histogram. To meet MPEG-7 definitions, this feature vector also undergoes a Gustafson-Kessel classifier to convert the bin values from decimal to integer. The compact version of the FCTH descriptor disregards the 24-bin fuzzy-linking block in the color unit, yielding a histogram of $8 \times 10 = 80$ bins.
3.1.2.7 Spatial Envelope (GIST)
The Spatial Envelope (GIST) is an image descriptor developed for natural scene
categorization. Influenced by seminal approaches in computational vision that have depicted
visual processing as a hierarchical organization of modules of increasing complexity (edges,
surfaces, objects), one prominent view of scene recognition is based on the idea that a scene is
built as a collection of objects [64]. GIST processes the scene as a single entity, aiming for its
shape representation. This means that scenes belonging to the same category have a similar
shape or spatial structure. Since medical images are not likely to possess any particular objects
we found that this image descriptor could provide us some valuable information. Indeed, in [49]
the GIST is one of the image descriptors used in the 2007 Medical Annotation Task database.
In [64], 5 spatial envelope properties are considered: naturalness, openness, roughness, expansion and ruggedness. However, this approach was designed for the natural scene image database used, containing landscapes with man-made objects (buildings, roads) or natural landscapes (trees, rivers). For this reason the GIST is seen as an intermediate-level
knowledge descriptor. Computation of the GIST descriptor is based on the spatial distribution of spectral information by means of a Windowed Discrete Fourier Transform (WDFT):

$I(x, y, f_x, f_y) = \sum_{x', y'} i(x', y')\, h_r(x' - x, y' - y)\, e^{-j 2\pi (f_x x' + f_y y')}$ (3.22)

where $i(x, y)$ is the intensity distribution of the image window along the spatial variables $(x, y)$, $f_x$ and $f_y$ are the spatial frequency variables and $h_r(x', y')$ is a Hamming window with a circular support of radius $r$.
The localized energy spectrum (spectrogram) along a number of pre-determined directions is then computed as:

$A(x, y, f_x, f_y) = |I(x, y, f_x, f_y)|^2$ (3.23)
and gives the distribution of the signal’s energy among the different spatial frequencies,
providing localized structure information. The size of the feature vector generated depends on
the window size and directions intended.
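The windowed power spectrum of (3.22)-(3.23) can be sketched with the FFT. A plain Hamming taper and non-overlapping windows are simplifying assumptions; the published descriptor then pools these energies over orientations and scales:

```python
import numpy as np

def local_energy_spectra(img, win=32):
    """Spectrogram sketch for GIST: tile the image with win x win windows,
    taper each with a Hamming window, take the power spectrum |FFT|^2."""
    g = np.asarray(img, dtype=float)
    taper = np.outer(np.hamming(win), np.hamming(win))
    spectra = []
    for i in range(0, g.shape[0] - win + 1, win):
        for j in range(0, g.shape[1] - win + 1, win):
            patch = g[i:i + win, j:j + win] * taper
            spectra.append(np.abs(np.fft.fft2(patch)) ** 2)
    return spectra
```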
3.1.2.8 Speeded Up Robust Features (SURF)
The SURF [65] is an invariant interest point descriptor for finding correspondences between
two images of the same scene or objects. The motivation for the development of the SURF is to
speed up this generic correspondence process as it is slow in another well known previously
mentioned interest point descriptor, the SIFT. The methodology consists in considering the integral image, i.e., the cumulative sum of the gray-level pixel values, and a second-order Haar wavelet as an intermediate image representation. Interest points are then located using a Hessian matrix:

$\mathcal{H} = \begin{bmatrix} L_{xx} & L_{xy} \\ L_{xy} & L_{yy} \end{bmatrix}$ (3.24)

where $L_{xx}(x, y, \sigma)$ is the convolution of the Gaussian second-order derivative with the image. In SURF the second-order derivatives are approximated with box filters (mean/average filters), depicted in Figure 3.9. By changing the weights of the filter we increase or decrease the sensitivity of the detector.
Figure 3.9– Box filters used as an approximation of the Gaussian second order derivative (From
[65]).
The scale space analysis is performed with a constant image size during feature extraction. Given the scale s at which an interest point is detected, a circular neighborhood of radius 6s is considered around this point. Then, to represent the descriptor directionality, $(x, y)$ Haar wavelet responses are computed and represented as a vector. Afterwards, all responses within an angle of 60° centered in the vector direction are summed.
This circular region is split into $4 \times 4$ square sub-regions with $5 \times 5$ regularly spaced sample points inside. For each region the $(x, y)$ Haar wavelet responses are computed and weighted with a Gaussian kernel centered in the interest point. Summing the responses for each region separately yields a feature vector of size 32. Information on the polarity of the intensity changes is then added by extracting the sum of the absolute values of the responses, originating a 64-element feature vector. Both vectors are then concatenated and normalized.

Computation of the SURF with a feature vector of 128 elements takes into consideration a separate computation of the $d_x$, $|d_x|$, $d_y$ and $|d_y|$ responses for $d_y < 0$ and $d_y \geq 0$, thus doubling the length of the 64-element feature vector.
3.2 The Support Vector Machine (SVM)
Support Vector Machines (SVMs) are applied to the problem of making predictions based on
previously seen examples in what is called inductive inference. In order to understand what an
SVM is we need first to consider what we are aiming for when we talk about classifier
performance. Because we hope that the algorithm predicts correct labels of previously
unlabelled data it is natural to measure the performance of a classifier by the probability of
misclassification of an unseen example. The problem is that to establish such probability we
need to know the true underlying probability distributions of the data we are dealing with. If we
actually knew this, then there would be no need for inductive inference. Indeed, knowledge of the true probability distributions would allow us to calculate the theoretically best decision rule, corresponding to a Bayesian classifier. A good way to estimate the probability of misclassification is to use real data for which the class labels are known, comparing these with the ones predicted by the learning algorithm. This misclassification probability can be estimated by applying the learning algorithm to disjoint subsets of our real data, in what is called cross-validation or, more specifically, n-fold cross-validation, where n is the number of subsets used. This is a key idea in evaluating a learning algorithm.
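The n-fold procedure can be sketched generically; the train/predict callables below are hypothetical placeholders, not any particular learner:

```python
import numpy as np

def kfold_error(train_fn, predict_fn, X, y, n_folds=5, seed=0):
    """Estimate the misclassification probability by n-fold cross-validation:
    train on n-1 folds, measure the error on the held-out fold, average."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    folds = np.array_split(order, n_folds)
    errors = []
    for k in range(n_folds):
        held_out = folds[k]
        train = np.concatenate([folds[m] for m in range(n_folds) if m != k])
        model = train_fn(X[train], y[train])
        errors.append(float(np.mean(predict_fn(model, X[held_out]) != y[held_out])))
    return float(np.mean(errors))
```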
The SVM is a supervised learning algorithm that receives labeled examples as input and outputs a mathematical function that is used to predict the labels of new examples. Given the space from which the examples are taken, there are infinitely many hyperplanes, or linear functions, that can separate two distinct classes (Figure 3.10). The main idea behind the SVM is to determine which separating hyperplane is optimal. (Most of the contents in this section can be found in [66].)
Figure 3.10 – Possible separating hyperplanes separating labeled examples in their space
representation.
In the linearly separable case, the SVM generates a mathematical function, $f(x)$, learned from the known examples, or training set, that outputs a label:

$f(x) = \operatorname{sign}(g(x))$ (3.25)

where $g(x) = \langle w, x \rangle + b$, with $w$ a weight vector and $b$ a scalar. The inner product $\langle w, x \rangle$ is defined as:

$\langle w, x \rangle = \sum_{i=1}^{d} w_i x_i$ (3.26)

where $d$ is the dimensionality and $w_i$ is the i-th element of $w = (w_1, w_2, \ldots, w_d)$.
We can then formalize the problem addressed by the linear SVM as follows: given a training set of vectors $x_1, x_2, \ldots, x_n$ with corresponding class membership labels $y_1, y_2, \ldots, y_n$ that take on the values +1 or -1, choose the parameters $w$ and $b$ of the linear decision function that generalizes well to unseen examples.
The decision rule for the choice of the best hyperplane is that it not only correctly separates
two classes in the training set, but lies as far from the training examples as possible. Therefore,
the search for such hyperplane is an optimization problem. To solve the optimization problem
we need an objective function as well as a set of restrictions regarding the intended hyperplane
(Figure 3.11).
Figure 3.11 – Optimal hyperplane (solid line) in a linear separable classification problem
(From [66]).
In order for our hyperplane to correctly separate the two classes, we need two sets of constraints:

$\langle w, x_i \rangle + b > 0$, for all $y_i = 1$ (3.27)

$\langle w, x_i \rangle + b < 0$, for all $y_i = -1$ (3.28)

which can be combined as:

$(\langle w, x_i \rangle + b)\, y_i > 0, \quad i = 1, \ldots, n$ (3.29)

The constraints (3.27) and (3.28) mean that the data must be classified on the correct side of the hyperplane. However, they are not sufficient to separate the two classes optimally: we need to do so with a maximum margin. The hyperplane satisfying $\langle w, x \rangle + b = 0$ in Figure 3.11 is the optimal hyperplane. The function $\langle w, x \rangle + b$ equals +1 in the upper right and -1 in the lower left, as represented by the dashed lines. In order to maximize the margin, these two dashed hyperplanes must be equidistant from the optimal hyperplane and parallel to each other. This constraint can be written as:

$y_i (\langle w, x_i \rangle + b) \geq 1, \quad i = 1, \ldots, n$ (3.30)
Now the margin can be maximized subject to this constraint. This distance is equal to $2 / \sqrt{\langle w, w \rangle}$. Since maximizing $2 / \sqrt{\langle w, w \rangle}$ is the same as minimizing $\langle w, w \rangle$, we end up with the following optimization problem:

$\min_{w, b} \; \frac{1}{2}\langle w, w \rangle \quad \text{such that} \quad y_i (w \cdot x_i + b) \geq 1 \quad \text{for all } i = 1, \ldots, n$ (3.31)
However, situations may arise where the data is not linearly separable (Figure 3.12). For these cases we need to soften the constraints to allow some data to lie on the incorrect side of the +1 and -1 hyperplanes, by means of a penalization.
Figure 3.12 – Linearly inseparable problem (From [66]).
We now introduce a parameter C to balance the goals of margin maximization and correctness of the training set classification. Various tradeoffs between these goals are achieved by choosing C using cross-validation on the training set. Our optimization problem becomes:

$\min_{w, b, \xi} \; \frac{1}{2}\langle w, w \rangle + C \sum_{i=1}^{n} \xi_i \quad \text{such that} \quad y_i (w \cdot x_i + b) + \xi_i \geq 1, \; \xi_i \geq 0, \quad \text{for all } i = 1, \ldots, n$ (3.32)
While in (3.31) the restrictions could not be violated at all, in (3.32) we look for solutions that keep the $\xi_i$ values small; we allow the point $x_i$ to violate the margin by an amount $\xi_i$. The boundary points, or support vectors, play a significant role in the performance of the learning algorithm. The value C trades off how large a margin we would prefer against how many of the training set examples violate this margin. This idea for linearly inseparable data extends to more complex situations (Figure 3.12).
Figure 3.12 – A more complex linear inseparable case and its mapping into a linear
separable feature space (From [66]).
A linear classifier for the example in Figure 3.12 would never perform well. To overcome this, we assume that there is a mapping Φ that transforms the initial data into a linearly separable feature space, possibly with higher dimensionality, and perform the usual SVM classification in
this space. If a reasonable margin can be achieved in the feature space then a good
generalization of the problem can be expected. With the increase of the dimensionality there
was some fear regarding the curse of dimensionality because it could be difficult to find a
classifier that generalizes well if the number of examples is inferior to the dimension of the
feature space. However Vapnik [67] proved otherwise, opening the way for researchers to
further explore methods that map data into high-dimensional spaces where maximum margin
classifiers could perform.
Linear classifiers in high-dimensional spaces can take a considerable amount of time to solve
the maximum margin optimization problem. A way to approach this difficulty is to convert the
soft margin SVM problem into an equivalent Lagrangian dual problem. If the problems are
equivalent then the solutions must be the same. The new optimization problem becomes:
$\min_{\alpha} \; \frac{1}{2}\sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j) - \sum_{i=1}^{n} \alpha_i \quad \text{such that} \quad \sum_{i=1}^{n} y_i \alpha_i = 0, \; 0 \leq \alpha_i \leq C, \; i = 1, \ldots, n$ (3.33)

where the $\alpha_i$'s are the dual variables of the problem, related to the primal variables $w$ and $b$ by:

$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \alpha_i \left( y_i (\langle w, x_i \rangle + b) - 1 \right) = 0$ (3.34)
Using the inner product rule $\langle a + b, c \rangle = \langle a, c \rangle + \langle b, c \rangle$, we can write the decision function as:

$g(x) = \langle w, x \rangle + b = \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b$ (3.35)

where the sign of $g(x)$ gives us the predicted label. In order to determine the optimal values of $\alpha_i$ and $b$ and to calculate $g(x)$, we do not need to know the training or testing vectors, but only their inner products with one another. There is thus no need to explicitly map the data from the initial space into a new feature space. What we need is a kernel function equal to the inner product of the mapped data:
$K(x, y) = \langle \Phi(x), \Phi(y) \rangle$ (3.36)

The kernel function should be a good measure of similarity between the vectors x and y and has to satisfy a series of conditions known as Mercer's conditions. The kernel function $K(x, y)$ satisfies Mercer's conditions if, for any square integrable function h, it is positive definite:

$\iint K(x, y)\, h(x)\, h(y)\, dx\, dy \geq 0$ (3.37)
Some of the most popular kernel functions are:

• Linear: $K(x, y) = x^T y$.
• Polynomial: $K(x, y) = (\gamma x^T y + r)^d$, $\gamma > 0$.
• Radial Basis Function (RBF): $K(x, y) = \exp(-\gamma \|x - y\|^2)$, $\gamma > 0$.
• Sigmoid: $K(x, y) = \tanh(\gamma x^T y + r)$.
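These four kernels translate directly into NumPy; the function and parameter names below are illustrative:

```python
import numpy as np

def linear_kernel(x, y):
    return float(x @ y)

def polynomial_kernel(x, y, gamma=1.0, r=0.0, d=3):
    return float((gamma * (x @ y) + r) ** d)

def rbf_kernel(x, y, gamma=1.0):
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    return float(np.tanh(gamma * (x @ y) + r))
```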
Solving the optimization problem exactly requires quadratic programming. The SVM can take advantage of sparse solutions, because in many cases the $\alpha_i$'s are equal to zero; the support vectors are the examples with $\alpha_i$ different from zero, the hard cases to decide.
With the basics of the SVM understood we now will quickly refer to two aspects used in this
work. Notice that we only considered two possible classifications for the examples, -1 and +1.
However, in problems where more than two labels exist (Figure 3.13), a multiclass problem,
two approaches, one-vs-all or one-vs-one, are used.
Figure 3.13 – A multiclass classification problem case.
Given a problem with data from n classes, in the one-vs-all strategy we train n classifiers, one for each class, by assigning the label +1 to the examples of that class and -1 to its complement. Then, given an unlabelled example, we apply each of the classifiers separately. The decision of a particular classifier does not influence the decision of the others. To choose a particular class we rely on the maximum margin attained, by means of a maximum score, confidence value or probability. In the one-vs-one strategy we train, for each class, $(n - 1)$ classifiers separately, one against each of the other classes, which totals $n(n - 1) / 2$ classifiers. For an unlabelled example we choose the class that is selected by the most classifiers.
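The one-vs-all strategy can be sketched generically. In the test below, a nearest-centroid scorer stands in for the SVM confidence values used in this work; every name is illustrative:

```python
import numpy as np

def ova_train(X, y, train_binary):
    """One-vs-all: train one binary classifier per class, relabelling the
    class examples +1 and every other example -1."""
    return {c: train_binary(X, np.where(y == c, 1, -1)) for c in np.unique(y)}

def ova_predict(models, score, x):
    """Assign the class whose classifier attains the largest score."""
    return max(models, key=lambda c: score(models[c], x))
```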
3.3 2007 Medical Annotation Task database
In Chapter 2 we presented the IRMA database. It was clear that during the ImageCLEF
Medical Annotation tasks several subsets of this database were used. In this work we will use
the 2007 Medical Image Annotation database. This database comprises 12000 images
belonging to 116 different classes. Initially it was separated into three different subsets: a
training set (10000 images), a development set (1000 images) and a test set (1000 images). The
training set was the first to be made available to the task participants. Following this release, the
development set was made available for validation purposes. These two sets, for which the true
classes of objects are known, were merged to perform the analysis of the unlabelled test set.
Some images from this database are depicted in Figure 3.14.
Figure 3.14 – Some examples from the ImageCLEF 2007 Medical Annotation Task
database.
The database characteristics are similar to the main IRMA database. All images have gray
level values stored in Portable Network Graphics (PNG) format. They are scaled proportionally
to their original size and fitted within a 512x512 maximum pixel window. Each class has at least 10 images in the training set, although the number of images per class is uneven. The amounts of images in the training and test sets are proportional (Figure 3.15). The number of images per class in the test set never exceeds that in the training set. All images share equal T1, T2 and T3 positions in the technique axis.
Figure 3.15 – Frequency of classes in the ImageCLEF 2007 Medical Annotation Task
database training and test sets. Two classes represented in the training set
are absent from the test set.
We considered as training set the joint training and development sets made available. During our work we detected some images with several layers of repeated pixel intensities; we corrected these images before feature extraction.

Our goal is now clear, as it is identical to the one proposed for the 2007 Medical Annotation task: we will use the examples in the training set to train a model with the SVM in order to predict the correct labels of the test set. These labels consist of the IRMA code.
Chapter 4
Methodology
4.1 Framework description
Annotation of medical images in this work comprehends two systems, each one undergoing a number of stages. We named these the Normal Fusion System (NFS) and the Smart Fusion System (SFS). Both systems are similar during the initial stages of the annotation process and differ only in the fusion methods involved.
Figure 4.1– Generic framework flowchart for NFS and SFS systems. The differences between
these consist in the Methods Fusion block.
The framework (Figure 4.1) consists in several sequential processing blocks. In the initial
stage we apply the image descriptors described in Chapter 3 to the database in order to extract
information from the images. Feature vectors from the training set will be used to train SVM
models based in three different annotation approaches in a second stage. These models are then
used for the annotation. Afterwards we will fuse these initial annotations in order to further
improve our results in a final annotation. From this fusion between methods we attain the final
IRMA code annotation for all images.
4.1.1 Feature extraction
MPEG-7 image features - Tamura Textures, EHD, CLD, SCD, CCEDD and CFCTH - were
extracted using a framework developed in C# from the Img(Rummager)1 feature extraction
engine Dynamic-Link Library (DLL) files. The GIST2 and SURF3 descriptors were extracted with code provided by their respective authors in MATLAB4. Details of the feature extraction block can be seen in Figure 4.2.
Figure 4.2– The feature extraction block. MPEG-7 global image features were extracted using the Img(Rummager) engine, while GIST was extracted using code provided by its authors. The only local image descriptor, SURF, was used to construct a dictionary of visual words.
The resulting image features from each image descriptor were concatenated in the following order: {CLD, SCD, CCEDD, CFCTH, EHD, Tamura Textures, GIST, SURF}, resulting in a single 954-element vector per image.
4.1.1.1 Global descriptors
Some MPEG-7 descriptors were not used exactly as described in Chapter 3. For the Tamura
textures the line-likeliness, regularity and roughness were disregarded as they are functions of
coarseness, contrast and directionality. This means that from these features no new information
about the image is provided. Therefore, Tamura textures resulted in an 18 element feature
1 http://savvash.blogspot.com/2008/06/imgrummager-in-now-available-for.html
2 http://people.csail.mit.edu/torralba/code/spatialenvelope/
3 http://www.vision.ee.ethz.ch/~surf/download.html
4 http://www.mathworks.com
vector. CEDD and FCTH were used in their compact forms, CCEDD and CFCTH, computed over the full image instead of image blocks. The remaining MPEG-7 image descriptors, EHD, SCD and CLD, were used according to their definitions.
For the GIST descriptor, all images were resized to 256 × 256 pixels. The feature vector was generated from 64 × 64 non-overlapping sub-windows in 8 different directions, yielding a feature vector of 256 values.
4.1.1.2 Bag-of-words model
We used the SURF together with a bag-of-words (BoW) model [32]. This model aims to create a dictionary of visual terms, representing image concepts, based on local image features. Based on this dictionary, the local image features of an image are quantized into a histogram of visual terms. Unlike the previous image descriptors mentioned, with the exception of GIST, the BoW tries to capture high-level features from the image instead of low-level content. The idea behind the bag-of-words model is similar to the creation of dictionaries for text retrieval.

The creation of a BoW model starts with the extraction of local features from an image dataset. These undergo a clustering algorithm where they are grouped according to a metric. The number of clusters/centers is user-defined. Few clusters originate a small dictionary, where different visual concepts may be represented by the same visual word, while too many clusters may create visual words which do not represent a visual concept. The exact dimension of the visual vocabulary is somewhat database-dependent, requiring exhaustive testing to reach a value that achieves the desired performance.
Once the dictionary of visual terms is created, for every image we assign each local image descriptor to the visual word that is its nearest neighbor. Let $v \in \mathbb{R}^d$ be a local image descriptor with dimension d. We define its nearest neighbor as:

$x_{nn} = \{ x_i \in D \mid \forall x_j \in D,\, x_j \neq x_i : dist(v, x_i) \leq dist(v, x_j) \}$ (4.1)

where D is our visual dictionary. This yields a frequency histogram of visual terms for the image, with a number of bins equal to the visual dictionary size.
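Equation (4.1) and the histogram assembly can be sketched in NumPy (an illustrative helper, not the thesis code):

```python
import numpy as np

def bow_histogram(descriptors, dictionary):
    """Quantize local descriptors (eq. 4.1): assign each one to its nearest
    visual word (Euclidean) and return the normalized frequency histogram."""
    desc = np.asarray(descriptors, dtype=float)
    words = np.asarray(dictionary, dtype=float)
    # squared distances between every descriptor and every visual word
    d2 = ((desc[:, None, :] - words[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(words)).astype(float)
    return hist / max(hist.sum(), 1.0)
```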
In this work we built a BoW model with 512 visual words from SURF local image features. At the same time, it was our goal to build a sparse frequency histogram of visual terms with a density of no more than 1 word per bin. This choice is based on the fact that it is not only intuitively preferable to compare frequency histograms that possess different words, rather than the same word in different quantities, but it also takes advantage of the SVM quadratic solver. However, we noticed that some images had a large number of interest points, sometimes more than 1000, while others possessed very few, as low as 3 or 5. To overcome this issue, we dynamically modified the sensitivity of the interest point detector in the SURF code to detect between 256 and 512 interest points.
To build the visual dictionary we uniformly selected 30 local descriptors per image in the training set. The reason for this uniform selection is to avoid region-based sampling, since in the output descriptor file the SURF features are ordered according to their (x, y) coordinates. A total of 11000 × 30 = 330000 points were gathered and clustered into 512 centers with a k-means1 algorithm using the Euclidean distance. Afterwards the frequency histograms were assembled using (4.1) for both training and test sets.
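The dictionary construction and the histogram assembly via (4.1) can be sketched as below; the k-means here is a naive Lloyd iteration in NumPy with toy dimensions, rather than the external implementation cited in the footnote.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Naive Lloyd's algorithm with Euclidean distance."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Assign each local descriptor to its nearest visual word, as in
    (4.1), and accumulate a frequency histogram over the dictionary."""
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(centers))

# Toy data: 300 training descriptors of dimension 8 and 16 visual words
# (the thesis uses 330000 SURF descriptors and 512 words).
rng = np.random.default_rng(1)
train_desc = rng.normal(size=(300, 8))
dictionary = kmeans(train_desc, 16)
hist = bow_histogram(rng.normal(size=(40, 8)), dictionary)
```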
4.1.2 Model training and image annotation
For the annotation task we relied on SVMs with a Radial Basis Function (RBF) kernel. We set up a framework in MATLAB using the LIBSVM [68] multi-class implementation with probability estimates, considering three approaches: flat, axis-wise and position-wise annotation. The flat annotation disregards the image IRMA code completely by considering it a whole class of objects; here, each IRMA code was replaced by an integer number. The axis-wise approach consists of annotating each IRMA code axis separately, with the final IRMA code assembled from each axis's independent result. These two approaches were the most commonly used strategies in work related to the IRMA database (see Chapter 2). However, both disregard the hierarchical nature of the IRMA code. For this reason we decided to further explore this hierarchy by introducing the position-wise approach.
The position-wise method operates on each axis code separately. The algorithm is as follows:
1. Isolate the highest hierarchical position (X1) of the axis, its root, and use the whole training set to perform the initial annotation.
2. Group all previously unlabelled examples sharing the same annotated code.
3. For each group, reduce the training examples to those images that match the annotation given, in a semantic reduction of the training set, and train new SVM models to classify the hierarchically subsequent inferior position.
4. Repeat this top-down process through the axis tree until it is completely classified, going back to step 2 at the current hierarchical level.
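The four steps above can be sketched as a recursive top-down procedure. The SVM at each node is replaced here by a trivial majority-label classifier purely for illustration; `annotate_axis` performs the semantic reduction of the training set at each hierarchical level.

```python
from collections import Counter

def majority_classifier(pairs):
    """Stand-in for the per-node SVM: always predicts the most common
    training label."""
    label = Counter(lbl for _, lbl in pairs).most_common(1)[0][0]
    return lambda x: label

def annotate_axis(train, test, depth, max_depth):
    """Position-wise annotation of one IRMA axis.
    train: list of (features, code string); test: list of features."""
    if depth == max_depth or not train:
        return ['' for _ in test]
    # step 1: annotate the current hierarchical position
    clf = majority_classifier([(x, code[depth]) for x, code in train])
    preds = [clf(x) for x in test]
    out = [''] * len(test)
    # steps 2-4: group by the annotated symbol, semantically reduce the
    # training set, and recurse on the next (inferior) position
    for sym in set(preds):
        idx = [i for i, p in enumerate(preds) if p == sym]
        reduced = [(x, c) for x, c in train if c[depth] == sym]
        deeper = annotate_axis(reduced, [test[i] for i in idx],
                               depth + 1, max_depth)
        for i, rest in zip(idx, deeper):
            out[i] = sym + rest
    return out

# Toy axis with 3-position codes.
train = [(0, '112'), (1, '112'), (2, '120')]
codes = annotate_axis(train, [0, 1, 2], 0, 3)
```

A wrong prediction at the root confines all deeper decisions to the wrong subtree, which is the error-propagation behavior noted below.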
We undertake the same methodology for all axes and assemble the final IRMA code. As we
move along the tree, more groups will be created based on the ongoing annotation and the data
is systematically reduced.
1 http://www.cs.cmu.edu/~dpelleg/kmeans.html
It is clear that wrong decisions in early stages of the annotation process will result in completely misclassified axes. However, images correctly annotated in early stages are expected to remain correctly annotated in subsequent stages, since data that could induce an error is discarded during the semantic reduction.
For SVM model training we performed an extensive grid search on the flat and axis-wise strategies to optimize the RBF kernel parameters (γ, C) using 10-fold cross validation. Because LIBSVM normalizes all input data, this search was conducted around 1/k, where k is the dimension of the feature vector. A total of 108 models were trained for each axis in the axis-wise method, as well as for the flat strategy. For γ we considered the interval between one order of magnitude above and one below 1/k, for a total of 9 values, and for the cost we considered C = 2^n where n ∈ {−4, −3, …, 7}. Parameter optimization for the RBF kernel in the position-wise approach raised a problem at this point. For the Technique (T) and Biology (B) axes the axis-wise parameterization is suitable, because in the first there is only one position to annotate and in the second the annotation of the first position determines the values of the subsequent positions. Some preliminary experiments were done on a grid search for the first position in the Direction (D) and Anatomy (A) axes. However, in these axes, the grid search returned a large set of optimal parameters, and we found it cumbersome to pursue a more accurate result here. Nevertheless, we noticed that the best RBF kernel parameterization for the axis-wise approach was among these results. Therefore, even if non-optimal, we used the best axis-wise RBF kernel parameterization for the position-wise approach.
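As a concrete illustration, the (γ, C) grid described above can be generated as follows, assuming k = 954 (the full feature vector dimension) and a geometric spacing for γ; the exact nine γ values are not listed in the text, so the spacing here is an assumption.

```python
import numpy as np

k = 954                                  # feature vector dimension (assumed)
# nine gamma values spanning one order of magnitude below and above 1/k
gammas = np.logspace(np.log10(0.1 / k), np.log10(10.0 / k), 9)
# cost values C = 2^n for n in {-4, -3, ..., 7}
costs = 2.0 ** np.arange(-4, 8)
grid = [(g, c) for g in gammas for c in costs]
# 9 gammas x 12 costs = 108 (gamma, C) models, as stated in the text
```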
The cross validation output during SVM training consists of the overall accuracy for each pair of parameters. Aside from the accuracy, we also wanted to see the error count (2.3) that subsets of the training data yield for the best RBF kernel parameters, considering all approaches. For this we divided the training set into 11 randomly selected disjoint subsets, each with 1000 images, and performed the annotation of each one separately based on SVM models created from the remaining 10000 images using the optimal parameters. The weights used were equal for all feature vector elements.
4.1.3 Methods fusion
So far the methodology for the NFS and SFS systems is similar. They diverge only in the fusion methods applied thereafter. We expected that a fusion of annotations derived from the three approaches could improve our initial results. Such a method also led to an improvement of final results in related works (see Chapter 2). This process also offers the possibility of a wildcard assignment for a particular position, given the IRMA code error count evaluation schema in (2.1-3).
In this work we perform only pairwise fusions. In the NFS system the fusion strategy consists of majority voting for each axis independently. We called this a normal fusion. If the position values are coincident then the final code retains that value. Otherwise a wildcard is placed (Example 4.1). Once a wildcard is placed, the subsequent positions are assigned a wildcard as well.
    1121-110-500-000  +  1121-120-421-000  →  1121-1**-***-000

Example 4.1 – IRMA code fusion by majority voting. Code 1121-120-421-000 is an example of a semantically meaningless code.
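The normal fusion rule can be sketched per axis: positions are compared top-down and, on the first disagreement, a wildcard is placed and propagated to all subsequent positions, reproducing Example 4.1.

```python
def fuse_axis(a, b):
    """Majority voting between two annotations of one IRMA axis:
    keep coincident positions; on the first mismatch place a wildcard
    and propagate it to all subsequent positions."""
    out, disagreed = [], False
    for p, q in zip(a, b):
        if disagreed or p != q:
            disagreed = True
            out.append('*')
        else:
            out.append(p)
    return ''.join(out)

def fuse_code(code_a, code_b):
    """Fuse two full IRMA codes axis by axis."""
    return '-'.join(fuse_axis(a, b)
                    for a, b in zip(code_a.split('-'), code_b.split('-')))

fused = fuse_code('1121-110-500-000', '1121-120-421-000')
# fused == '1121-1**-***-000', matching Example 4.1
```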
However, at this point we noticed that some codes assembled by the axis-wise and position-wise approaches have no representation in the 116 classes existing in the database. This happens when these approaches misclassify one or more axes, producing a semantically meaningless code. These codes can be easily identified, and we know that they possess an error. The problem is that we do not know which axis or axes are wrong. In normal fusion we disregarded these codes, but we also experimented with bypassing them in the flat/axis-wise and flat/position-wise fusions: when such a code is detected, we assign the flat IRMA code annotation as the final annotation. An example of such a code can be seen in Example 4.1. Bypassing meaningless codes in the axis-wise/position-wise fusion was not considered because on many occasions the same image is annotated with a meaningless code by both approaches.
In a second fusion method we attempt to identify potentially misclassified codes in the flat annotation in order to undergo a normal fusion with the two other approaches. This method is based on the fact that flat classification outperformed the two other approaches; in Chapter 5 this result will be presented in detail. Nevertheless, we provide here, in advance, a rationale for the smart fusion method in the SFS system (Figure 4.3). If a misclassified code in the flat method can be detected, we expect a gain from its fusion with the other methods whenever these provide a correct or partially correct annotation. If they also provide an incorrect code different from the flat annotation, we can still reduce the error count; if they are similar, the error does not change. The only issue is if false positives are detected. Here we can consider two situations: if a correct flat code is classified as incorrect, the axis-wise and position-wise annotations may also be correct, in which case normal fusion does not increase the error; otherwise, the error increases if we merge a false positive detected in the flat annotation with an incorrect code from another approach.
Figure 4.3– Detailed flowchart for the SFS system. The methods fusion block comprehends
normal fusion (majority voting), where the flat approach was considered our
baseline.
To identify misclassified codes in the flat annotation we proceed, considering one code, as follows:
1. Store the probability estimate of the annotated code as the first element of the feature vector.
2. Compute the average error distance between the annotated code and the k closest codes, by which we mean the k codes yielding the closest probability values to the annotated code. Use this value as the second element of the feature vector.
3. Group training examples that share the same classification as our test code.
4. With the training examples, labeled as '0' if correct and '1' if incorrect, train the classifier and predict the correctness of the code.
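Steps 1 and 2 build a two-element feature vector per code. A sketch, where `error_dist` is a pluggable stand-in for the IRMA error-count schema of (2.1-3); a simple character-difference count is used here for illustration only.

```python
def detection_features(probs, error_dist, k=3):
    """probs: dict mapping class code -> SVM probability estimate.
    Returns [probability of the annotated code, average error distance
    between the annotated code and the k codes with the closest
    probability values]."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    top_code, top_p = ranked[0]
    # the k codes whose probabilities are closest to the annotated one
    closest = sorted(ranked[1:], key=lambda kv: abs(kv[1] - top_p))[:k]
    avg = sum(error_dist(top_code, c) for c, _ in closest) / len(closest)
    return [top_p, avg]

# Toy stand-in for the IRMA error-count schema: count differing characters.
char_diff = lambda a, b: sum(x != y for x, y in zip(a, b))

probs = {'1121-110-500-000': 0.45,
         '1121-120-500-000': 0.30,
         '1123-110-500-000': 0.15,
         '1124-310-700-400': 0.10}
feats = detection_features(probs, char_diff)
```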
We again used an SVM with an RBF kernel and performed a grid search to optimize its parameters using 11-fold cross validation, each fold comprehending one of the 11 disjoint sets previously created, and tested directly the normal fusion between the flat method and each of the other two approaches. Therefore, cross validation is made using the minimum error count achieved. For γ we considered the values {0.1, 0.2, …, 0.9}, for a total of 9 values, and for the cost we considered C = 2^n where n ∈ {0, 1, …, 7}. The best (γ, C) parameters were chosen for the minimum average error count attained considering all disjoint training subsets and k = 3 nearest neighbors. A total of 72 models were trained during the grid search. Weights were kept equal for both features.
Chapter 5
Results and Discussion
5.1 Feature extraction
THE 954-element feature vector, originated by the concatenation of the image descriptors' output, yielded zero values for 121 bins. This result was expected, since some image descriptors act in different color spaces while the database consists of grayscale images. The Tamura textures descriptor resulted in zero values for contrast and coarseness in all images1.
Changing the sensitivity of the box filters resulted in a more balanced number of interest points detected among all images. Even at a very low sensitivity, in some cases, the minimum number of points for an image was roughly 100. We also tried to equalize the image histograms, to improve contrast in order to extract more points in the difficult cases, but it proved ineffectual. Our strategy of using between 256 and 512 points for the visual word frequency histograms proved successful, since it originated sparse histograms as desired. However, some visual words are very common, resulting in high-frequency bins (Figure 5.1).
Figure 5.1 – Visual words frequency histogram using the SURF local descriptor.
1 Probably originated from a software bug.
We also performed some experiments with the SIFT descriptor, based on a Difference of Gaussians (DoG) interest point detector, without satisfactory results.
5.2 Annotation
Evaluation of the annotation results is based on the error count, according to (2.1-3), and the error rate, i.e., the percentage of codes that have at least one error in one position within one code axis. The best RBF kernel parameterization for the flat and axis-wise approaches resulted from the extensive grid search conducted. Overall results for this grid search are depicted in Figures 5.2-6, with the values of the γ and C kernel parameters transformed by a natural logarithm for better visualization. Highlighted values in all figures correspond to the parameters that yielded the highest accuracy rate. These are summarized in Table 5.1.
Figure 5.2 – Grid search results for the flat approach.
The grid search for the flat approach returned a value of γ at the limit of all values tested. Hoping for a slightly better accuracy, we experimented with further decreasing this value.
[Surface plot for Figure 5.2: accuracy (%) over Log(γ) and Log(C); highlighted maximum at Log(γ) = −9.21, Log(C) = 2.773, accuracy 88.84%.]
Unfortunately, the accuracy dropped immediately. Perhaps this behavior can be explained by the decreasing slope in the lower-left corner of the grid search surface, which reveals a downward accuracy trend for smaller values of γ.
Figure 5.3 – Grid search results for the axis-wise approach (Technique axis).
Figure 5.4 – Grid search results for the axis-wise approach (Direction axis).
[Surface plots for Figures 5.3-4: accuracy (%) over Log(γ) and Log(C); highlighted maxima at Log(γ) = −7.958, Log(C) = 1.386, with accuracy 99.67% (Technique) and 90.04% (Direction).]
Figure 5.5 – Grid search results for the axis-wise approach (Anatomy axis).
Figure 5.6 – Grid search results for the axis-wise approach (Biology axis).
[Surface plots for Figures 5.5-6: accuracy (%) over Log(γ) and Log(C); highlighted maxima at Log(γ) = −7.958, Log(C) = 1.386, with accuracy 92.97% (Anatomy) and 99.04% (Biology).]
From the grid search performed for the axis-wise approach, we realized that classification in
the Direction (D) and Anatomy (A) axes is more troublesome than in the Technique (T) and
Biology (B) axes.
                Flat      Axis-wise
                          Technique   Direction   Anatomy   Biology
  Gamma (γ)     0.0001    0.00035     0.00035     0.00035   0.00035
  Cost (C)      16        4           4           4         4
  Accuracy (%)  88.8      99.7        90.4        93.0      99.0
Table 5.1 – Best parameters for the RBF kernel according to the flat and axis-wise
methods.
Before this grid search we wanted to verify whether the BoW model could provide some additional accuracy when concatenated with the remaining global image descriptors. Nowak [69] states that BoW models based on local image descriptors around interest points perform worse than those based on dense point sampling. Indeed, some tests using only our BoW model provided poor accuracy during cross validation with an empirical RBF kernel configuration. However, experiments using only global image descriptors in a grid search did not achieve higher accuracy than that presented in Table 5.1. The decision to use the RBF kernel is grounded in the worse results achieved by other types of kernels, namely the linear, polynomial and sigmoid kernels, during preliminary experiments with the data.
The accuracy values in Table 5.1 do not tell us anything about the error count; there is no straightforward relationship between the two. Even if we misclassify few codes, they can be severely penalized according to the error evaluation schema. Therefore, with the best parameters, we proceeded as described in Section 4.1.2 and evaluated the average error count in the training set considering 11-fold cross validation. Remember that for the position-wise approach the RBF kernel parameterization used is identical to the axis-wise strategy. Considering the accuracy returned by the grid search, a better performance was expected for the flat method. Even if the axis-wise strategy can outperform the flat annotation in terms of accuracy for each axis, this method suffers from error propagation: multiplying all per-axis probabilities, the overall accuracy for a completely correctly predicted code is 82.3%, less than the 88.8% attained for the flat annotation. We did not know how the position-wise annotation would perform at this point. Afterwards we annotated our test set too. Results are shown in Table 5.2.
                  Training Set                  Test Set
                  Error count   Error rate (%)  Error count   Error rate (%)
  Flat            30.8          11.9            31.4          13.3
  Axis-wise       32.7          14.5            37.2          16.6
  Position-wise   36.3          16.2            39.9          17.4

Table 5.2 – Error evaluation for all strategies considered.
The flat method eventually outperformed its counterparts in both error count and error rate. This was also verified in several related works in Chapter 2. To analyze the differences between the classifiers we computed the percentage of images correctly classified by all three methods, by two methods but not the third, by only one of the methods, and by none. Table 5.3 summarizes our findings for the second and third cases. Examples of images for the results in Table 5.3 are presented in Figure 5.7.
                  Flat    Axis-wise   Position-wise
  Flat            4.2%    3.1%        0.8%
  Axis-wise       3.1%    0.3%        1.4%
  Position-wise   0.8%    1.4%        1.8%

Table 5.3 – Percentage of test images correctly classified by only one method (diagonal cells), and by two methods but not the third.
The percentage of images correctly classified by all three methods is 78.8%, while the percentage misclassified by all methods is 9.8%. From Table 5.3 it is clear that the flat annotation is more accurate where the other methods fail to classify correctly. The 9.8% of images misclassified by all methods defines a theoretical upper limit of 90.2% on the accuracy of any methodology to further improve the initial results. We noticed some confusion between classes '1123-127-500-000' and '1123-120-500-000', responsible for roughly 5% of misclassifications in all methods. These classes have a high inter-category
similarity. However, the D3 position is unspecified in the second case and may actually be identical to the first code (Figure 5.8).

Figure 5.7 – Examples of images a) classified correctly by all approaches, b) misclassified by all approaches, c) classified correctly only by the flat method, d) classified correctly only by the axis-wise method, e) classified correctly only by the position-wise method, f) misclassified only by the flat method, g) misclassified only by the axis-wise method, h) misclassified only by the position-wise method.
Figure 5.8 – Examples from classes '1123-127-500-000' (left) and '1123-120-500-000' (right). Confusion between these two classes comprehends roughly 5% of misclassifications in all approaches.

5.3 Semantically meaningless codes

Another result is that when the IRMA code is assembled by the axis-wise and position-wise methods, the misclassification of a particular axis can produce a code that has no representation among the 116 possible classes. We named this a "semantically meaningless" code. In Figure 5.7 a), the classification returned meaningless IRMA codes for both axis-based approaches. The importance of these codes is that we are sure they possess some error. However, experiments to find which axis or axes were misclassified produced no useful results. We attempted to use the minimum error distance between meaningless codes and all 116 classes, weighted by class frequency in the training set, with no success. We found that there is some correlation between the meaningless codes detected and misclassified codes in the flat method. The flat method does not suffer from this kind of misclassification, but it seems that images producing meaningless codes are "hard" to classify with this strategy. For instance, 45.2% of the meaningless codes returned by the axis-wise annotation correspond to misclassifications by the flat annotation. By identifying this correspondence, the flat error rate and error count would decrease to 11.9% and 23.82 respectively. In the case of the position-wise method these values would decrease to 12.7% and 26.9, as only 25.6% of meaningless codes correspond to a wrong classification in the flat annotation. Also, correcting these codes in both axis-based approaches would decrease the two error measures, as they correspond to roughly 4% of all annotations (Table 5.4). This relationship between meaningless codes and misclassifications in the flat approach is clear from the difference in the number of wildcards assigned considering the two decision rules for majority voting.
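Since the valid classes are known, detecting a semantically meaningless code reduces to a set-membership test; a sketch with a toy subset standing in for the 116 database classes:

```python
def is_meaningless(code, valid_classes):
    """An assembled IRMA code is semantically meaningless when it
    matches none of the classes existing in the database, which
    guarantees that at least one axis was misclassified (though not
    which one)."""
    return code not in valid_classes

# Toy subset standing in for the 116 database classes.
valid = {'1121-110-500-000', '1123-127-500-000', '1123-120-500-000'}
flagged = is_meaningless('1121-120-421-000', valid)  # the code from Example 4.1
```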
5.4 Fusion
The NFS fuses IRMA codes by majority voting between pairs of methods. Here we tested the decision rule of ignoring, or not, meaningless codes. We also evaluated the number of wildcards generated. The results are in Table 5.4.
                              Normal Fusion                   Replace Meaningless
                              Error   Error     Wildcards     Error   Error     Nm    Wildcards
                              count   rate (%)  ("*")         count   rate (%)        ("*")
  Train
    Flat/Axis-wise            26.9    16.1      -             28.3    14.6      36    -
    Flat/Position-wise        27.7    18.6      -             27.8    15.7      47    -
    Axis-wise/Position-wise   30.6    18.4      -             -       -         -     -
  Test
    Flat/Axis-wise            29.1    18.3      352           30.1    16.0      42    219
    Flat/Position-wise        29.4    20.6      481           28.6    17.4      43    310
    Axis-wise/Position-wise   34.9    20.0      296           -       -         -     -
Table 5.4 – Results for the NFS system. Nm is the number of meaningless codes detected for the
axes-based method involved.
In fact, we also tested fusion between all three methods, resulting in an error count of 33.7 and an error rate of 15.7%. This result is not unexpected, since all correct annotations by one method that were misclassified by the other two are lost (summing the diagonal cells in Table 5.3). From Table 5.4 it can be seen that we did not perform normal fusion between the axis-based methods using meaningless code replacement. This decision is grounded in the fact that both methods share 22 images with meaningless codes; thus, when a meaningless code is detected, we have a high probability of replacing it with another meaningless code. Also, normal fusion between them did not provide lower error measures than the flat annotation. Results in Table 5.4 show that there is a decrease in the error count but an increase in the error rate. From this we can conclude that we lose accuracy at lower hierarchical positions but gain in error count by assigning wildcards to wrongly classified superior hierarchical positions. Results were consistent with the predictions.
The SVMs trained in this work make use of probability estimates. In the previous NFS system we disregarded this information. Some tests training a probability threshold for wildcard assignment over all IRMA codes did not result in any significant gain: some images with a high probability output for a class, axis or position are incorrectly classified, while others, with a low probability output, were annotated correctly. Therefore, inferring a decision rule from the probability distribution did not work. First, let us look at the two-dimensional feature space generated by our pair of variables: probability and the average 3-NN distance (Figures 5.9-10).
Figure 5.9 – Feature spaces for two distinct IRMA codes. For low representation (left) the separation between classes is better than for highly represented codes (right) in training.

Figure 5.10 – Feature spaces for two distinct IRMA codes, one completely labeled as correct (left) and one labeled only as incorrect (right). There are classes of codes completely misclassified during the training phase.
In Figure 5.9 (left), for low representation, the two elements of the feature vector possess some information able to separate both classes, with the distribution of wrong classes lying in a low-probability region. However, some correct codes also lie in this region. In Figure 5.9 (right) this is not very clear, with both classes mixed at high probability and different average distances. We trained a second SVM with an RBF kernel in a grid search to find the optimal parameters that return the lowest error count when performing fusion. Results are depicted in Figures 5.11-12.
[Scatter plots for Figures 5.9-10: SVM probability output vs. average 3-NN distance, with points labeled Correct/Incorrect, for IRMA codes 1121-230-961-700, 1123-110-500-000 and 1121-240-438-700.]
Figure 5.11 – Grid search for the flat/axis-wise fusion. Best parameters are γ=0.9 and C=128.
Figure 5.12 – Grid search for the flat/position-wise fusion. Best parameters are γ=0.7 and
C=128.
[Surface plots for Figures 5.11-12: error count over Log(C) and Log(γ); minima at Log(C) = 4.852, Log(γ) = −0.105 with error count 27.82 (flat/axis-wise fusion), and Log(C) = 4.852, Log(γ) = −0.357 with error count 27.34 (flat/position-wise fusion).]
The best parameterization was found at the extreme values tested. This would normally require a wider range of parameters to be tested. Even so, we used both configurations and tested their performance (Table 5.5).
Table 5.5 – Results for the SFS system.
The strategy used in the SFS system resulted in a marginal improvement of the error count but a more significant improvement of the error rate compared with the NFS system. A total of 70 codes were flagged as incorrect for the flat/axis-wise fusion; of these, only 43 correspond to true positives. In the case of the flat/position-wise fusion the number of incorrect codes detected is also 70, but with 50 true positives. While in both methods some false positives are merged, which may result in more error, the fusion of true positives yields a higher gain. The number of wildcards is smaller than in the NFS system: while fusion comprehends fewer codes, only 7% of the data, the discrepancy between the codes involved is higher.
                        Smart Fusion
                        Error count   Error rate (%)   Wildcards ("*")
  Flat/Axis-wise        29.0          14.4             135
  Flat/Position-wise    28.3          14.9             180
Chapter 6
Conclusions and Future Work
IN this work we addressed the medical image annotation problem and explored fusion strategies between all the methods involved. Standalone results show that annotation considering the conceptualization of the image as a single class, disregarding the IRMA code, works better than exploiting the nature of the code by separating it into its constituent axes or even positions. This result is in line with related works. Methods involving separate axis classification are prone to error propagation.
Benchmarking our results against related works on the same database places our SFS system close to the state of the art, with an error count of 26.8 (see Table 2.4 for more comparisons). However, our work involves different assumptions: the test set is not used to build the bag-of-words model, and the model is based on interest points instead of dense point sampling.
The image descriptors used during the several classification stages are the same; we only divide the feature space according to the concepts we want to annotate. The choice of the best image descriptors for each stage will be explored in future work; here it is also possible to add new image descriptors. Also, the weight of the elements in the image descriptor for SVM classification is the same. This is particularly important in the bag-of-words model, where all words have equal weight. Common words, with high frequency, are not good at discriminating classes. Therefore, the application of term frequency–inverse document frequency weighting to the histogram of visual words is also a possibility to test in future work.
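The tf-idf reweighting suggested above could be applied to the visual-word histograms roughly as follows; this is a sketch of the standard scheme, not something implemented in this work.

```python
import math

def tfidf(histograms):
    """histograms: list of visual-word frequency histograms (one per
    image, all the same length). Down-weights words that occur in many
    images, since common words discriminate classes poorly."""
    n_docs = len(histograms)
    n_words = len(histograms[0])
    # document frequency: in how many images each visual word appears
    df = [sum(1 for h in histograms if h[w] > 0) for w in range(n_words)]
    idf = [math.log(n_docs / df[w]) if df[w] > 0 else 0.0
           for w in range(n_words)]
    out = []
    for h in histograms:
        total = sum(h) or 1
        out.append([(h[w] / total) * idf[w] for w in range(n_words)])
    return out

# Toy histograms over a 3-word vocabulary.
hists = [[5, 0, 1], [3, 2, 0], [4, 0, 0]]
weighted = tfidf(hists)
```

A visual word present in every image gets idf 0 and is effectively removed, which is exactly the treatment one would want for the high-frequency bins observed in Figure 5.1.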
Our position-wise method underperformed compared with the other two methods. Nevertheless, it showed interesting results when fused with the flat method. Error propagation is large in this method, but we are aware that the RBF kernel parameterization used is not optimal. Searching for an optimal parameterization for each stage seems cumbersome but can also be addressed in the future. It would also be interesting to develop a decision rule that allowed the classifier to step back in the top-down tree classification when an error is likely.
Different classification strategies can also be applied. Aside from the flat and axis-wise methods, we could also group the database classes into larger concepts comprehending 2 or 3 axes. To avoid meaningless codes, a semantic reduction for the classification of the remaining grouped axes can take place. For the position-wise method we could also consider all the different top axis position
configurations as a class of objects. A high performance in this stage would lead to fewer errors committed during the annotation of subsequent positions.
The fusion schemes implemented led to better results, but at a higher error rate. This means
that even if more errors are counted due to the wildcard assignment, these are of lesser
importance. While the NFS method relies on simple majority voting, the SFS method can be
explored further. Detecting possible misclassifications is a complex problem, since the SVM
output consists only of a label. Although the results show that a careful fusion is possible,
there are still many issues we would like to address carefully in the future, since the
classification in this fusion method involves ordinal data, from the SVM probability output,
together with categorical data. We could also try to use, as input features at this stage,
similarity measures between the predicted class and its nearest classes instead of the
average error distance.
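The position-wise majority vote with wildcard assignment can be sketched as follows. This is a simplified illustration of the voting idea, not the exact NFS implementation; the function name and wildcard symbol are our own choices:

```python
from collections import Counter

def fuse_codes(predictions, wildcard="*"):
    """Fuse several predicted codes position by position.

    predictions: list of equal-length code strings from different
    classifiers. Each position keeps the character chosen by a strict
    majority of the classifiers; otherwise the wildcard is assigned,
    trading a full error for a less-penalised 'don't know' answer.
    """
    fused = []
    n = len(predictions)
    for chars in zip(*predictions):
        symbol, votes = Counter(chars).most_common(1)[0]
        fused.append(symbol if votes > n / 2 else wildcard)
    return "".join(fused)
```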
One of the most important findings was the meaningless codes. They are easy to find and do
not require knowledge of the true image annotations. However, we could not implement any
methodology to detect which axes were wrongly classified. The relationship between the
meaningless codes and the flat method's misclassifications can be extremely useful for
developing a new fusion process. Moreover, the meaningless codes reveal that the IRMA code is
not axis-independent as stated. The relationships between IRMA code axes can be a target of
future work.
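Detecting meaningless codes is straightforward once the set of valid codes defined by the annotation standard is known. A minimal sketch, assuming codes are compared as plain strings:

```python
def find_meaningless(predicted_codes, valid_codes):
    """Flag predicted codes that do not occur among the known valid codes.

    No true annotations are needed: any predicted code absent from the
    set of codes defined by the annotation standard must contain at
    least one misclassified axis.
    """
    valid = set(valid_codes)
    return [c for c in predicted_codes if c not in valid]
```

The flagged images are exactly the ones a future fusion process could route to a second, more careful classification step.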
In the future we would also like to evaluate our methodologies on the more complex 2008
Medical Image Annotation task database, in order to test whether the conceptualization of the
image content into smaller concepts pays off in the case of an unbalanced training/test example
distribution. Future databases involving more image modalities would also be interesting to
work with. As a final remark, the methodologies presented here are not exclusive to medical
image databases; they can be used with any database that follows a hierarchical annotation
standard.
References
[1] E.A. Krupinski, “The importance of perception research in medical imaging”, Radiation
Medicine 18, (6), 2000.
[2] H.K. Huang, “PACS and imaging informatics: basic principles and applications”, John Wiley
& Sons Inc.: 25, 2004.
[3] H. Müller, N. Michoux, D. Bandon and A. Geissbuhler, “A review of content-based image
retrieval systems in medicine - clinical benefits and future directions,” International Journal
of Medical Informatics, vol. 73, no. 1, pp. 1–23, 2004.
[4] X. Zhou, A. Depeursinge and H. Müller, “Hierarchical classification using a frequency-based
weighting and simple visual features,” Pattern Recogn. Lett., vol. 29, no. 15, pp. 2011–
2017, 2008.
[5] http://archive.nlm.nih.gov/pubs/antani/icvgip02/icvgip02.php
[6] Y. Rui, T. Huang and S.F. Chang, “Image retrieval: Current techniques, promising directions
and open issues”, J. Visual Commun. Image Represent. 10, 1, 39–62, 1999.
[7] M. Kimura, M. Kuranishi, Y. Sukenobu, H. Watanabe, S. Tani, T. Sakusabe, T. Nakajima, S.
Morimura and S. Kabata, “JJ1017 committee report: image examination order codes –
standardized codes for image modality, region, and direction with local expansion: an
extension of DICOM”, Journal of Digital Imaging, 15(2), 106-13, 2002.
[8] B. Smith, “Beyond Concepts: Ontology as Reality Representation”, in Proceedings of the
International Conference on Formal Ontology and Information Systems, 2004.
[9] W.G. Stock and S. Schmidt, “Collective Indexing of Emotions in Images. A Study in
Emotional Information Retrieval”, Journal of the American Society for Information Science
and Technology 60(5), S. 863-876, 2009.
[10] P.B. Heidorn, “Image retrieval as linguistic and non-linguistic visual model matching”,
Library Trends, S. 309. 54, Chen & Rasmussen, 1999.
[11] W.D. Bidgood, “The SNOMED DICOM microglossary: controlled terminology resource for
data interchange in biomedical imaging”, Methods Inf. Med. 37, (4-5), 404-14, 1998.
[12] M.O. Güld, M. Kohnen, D. Keysers, H. Schubert, B. Wein, J. Bredno and T. M. Lehmann,
“Quality of dicom header information for image categorization,” in Intl. Symposium on
Medical Imaging, ser. Proc. SPIE, vol. 4685, San Diego, CA, pp. 280–287, 2002.
[13] D.A. Forsyth, “Computer vision tools for finding images and video sequences”, Library
Trends, Vol. 48, No. 2, pp. 326-355, 1999.
[14] G. A. Seloff, “Automated access to the NASA-JSC image archives”, Library Trends, 38(4),
682-696, 1990.
[15] S.K. Chang and A. Hsu, “Image Information Systems: Where do we go from here?”, IEEE
Transactions on Knowledge and Data Engineering, 4, 431-442, 1992.
[16] E. Panofsky, “Meaning in the Visual Arts”, Doubleday Anchor Books, Garden City, NY,
1955.
[17] S. Shatford, “Analyzing the subject of a picture: a theoretical approach”, Cataloguing and
Classification Quarterly, 6(3), 39–62, 1986.
[18] S. Shatford-Layne, “Some issues in the indexing of images”, Journal of the American
Society of Information Science, 45(8), 583-588, 1994.
[19] P.G.B. Enser, “Pictorial information retrieval”, Journal of Documentation, 51(2), 126-170,
1995.
[20] http://lmb.informatik.uni-freiburg.de/research/completed_projects/isearch/research.en.html
[21] P. Aigrain, “Organizing Image Banks for Visual Access: Model and Techniques”, OPTICA’87
Conf. Proc., pp.257-270, Amsterdam, 1987.
[22] T. Kato, T. Kurita, N. Otsu and K. Hirata, “A Sketch Retrieval Method for Full Color Image
Database – Query by visual example”, Proc. ICPR, Computer Vision and Applications,
pp.530-533, 1992.
[23] J. Eakins and M. Graham, “Content based image retrieval”, JISC Technology Applications
Program, Report 39, 1999.
[24] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D.
Lee, D. Petkovic and D. Steele, “Query by image and video content: The QBIC system”,
Computer, 28(9), 23-32, 1996.
[25] P. Aigrain, H. Zhang and D. Petkovic, “Content-Based Representation and Retrieval of
Visual Media: A State of the Art Review”, Multimedia Tools and Applications, Vol. 3, No. 3.
pp. 179-202, 1996.
[26] A. Smeulders, M. Worring, S. Santini, A. Gupta and R. Jain, “Content Based Image Retrieval
at the End of the Early Years”, IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 22, No. 12, 2000.
[27] R. Datta, D. Joshi, J. Li and J. Z. Wang, “Image retrieval: Ideas, influences, and
trends of the new age”, ACM Computer Surveys, (39), 2007.
[28] C. Harris and M.J. Stephens, ”A combined corner and edge detector”, In Alvey Vision
Conference, pp 147–152, 1988.
[29] T. Lindeberg, “Detecting salient blob-like image structures and their scales with a scale-
space primal sketch: a method for focus of attention”, International Journal of Computer
Vision 11 (3): pp 283–318, 1993.
[30] D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, International
Journal of Computer Vision, 60, 2, pp. 91-110, 2004.
[31] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors", IEEE
Transactions on Pattern Analysis and Machine Intelligence, 10, 27, pp. 1615–1630, 2005.
[32] L. Fei-Fei, R. Fergus, and P. Perona, “A Bayesian Approach to Unsupervised One-Shot
Learning of Object Categories,” Proc. IEEE Int. Conf. Computer Vision, 2003.
[33] S. Sedghi, M. Sanderson and P. Clough, “A study on the relevance criteria for medical
images”, Pattern Recognition Letters, 29, pp. 2046-2057, 2008.
[34] A. Cawkell, “Indexing collections of electronic images: A review”, British Library Research
Review, 15, 1993.
[35] P.G.B. Enser, “Towards a comprehensive review of the semantic gap in visual image
retrieval”, Lecture Notes on Computer Science, vol. 2728/2003, 163-168, 2003.
[36] L. R. Long, S. Antani, T. M. Deserno, and G. R. Thoma, “Content based image retrieval in
medicine: retrospective assessment, state of the art, and future directions”, Int J Healthc
Inf Syst Inform, vol. 4, no. 1, pp. 1–16, 2009.
[37] Z. Xue, L.R. Long, S. Antani, J. Jeronimo and G.R. Thoma, “A Web accessible content-based
cervicographic image retrieval system”, in Proceedings of the SPIE Medical Imaging, 6919,
2008.
[38] W. Hsu, S. Antani and L.R. Long, “SPIRS: A Framework for Content-based Image Retrieval
from Large Biomedical Databases”, in Proceedings of the MEDINFO, 12(1), 188-92.
[39] T.M. Deserno, M.O. Güld, B. Plodowski, K. Spitzer, B.B. Wein, H. Schubert, H. Ney and T.
Seidl, “Extended query refinement for medical image retrieval”, Journal of Digital
Imaging; online-first, DOI 10.1007/s10278-007-9037-4, 2007.
[40] S. Antani, T.M. Deserno, L.R. Long, M.O. Güld, L. Neve and G.R. Thoma, “Interfacing global
and local CBIR systems for medical image retrieval”, In Proceedings of the Workshop on
Medical Imaging Research (Bildverarbeitung fur die Medizin), 166-71, 2007.
[41] T.M. Lehmann, M.O. Güld, C. Thies, B. Fisher, K. Spitzer, D. Keysers et al., “Content Based
Image Retrieval In Medical Applications”, Methods of Information in Medicine, 43, 354-61,
2004.
[42] H. Müller, N. Michaux, D. Bandon and A. Geissbuhler, “A review of content-based image
retrieval systems in medical applications: Clinical benefits and future directions”, 2007.
[43] C. Akgül, D. Rubin, S. Napel, C. Beaulieu, H. Greenspan and B. Acar, “Content Based Image
Retrieval in Radiology: Current Status and Future Directions”, Journal of Digital Imaging,
[Epub ahead of print], 2010
[44] T. M. Lehmann, H. Schubert, D. Keysers, M. Kohnen and B. B. Wein, “The IRMA code for
unique classification of medical images,” in Medical Imaging, Volume 5033 of SPIE
Proceedings, pp. 109–117, 2003.
[45] T. Tommasi, B. Caputo, P. Welter, M. O. Güld and T. M. Deserno, “Overview of the clef
2009 medical image annotation track,” in Proceedings of the 9th CLEF workshop 2009, ser.
Lecture Notes in Computer Science (LNCS), Corfu, Greece, September 2009.
[46] “Tutorial on Medical Image Retrieval - IRMA”, Medical Informatics Europe, 2005.
[47] P. Clough, H. Müller, T. Deselaers, M. Grubinger, T. M. Lehmann, J. Jensen and W. Hersh,
“The CLEF 2005 cross-language image retrieval track,” in Working Notes of the 2005 CLEF
Workshop, Vienna, Austria, 2005.
[48] H. Müller, T. Deselaers, T. M. Lehmann, P. Clough, E. Kim and W. Hersh, “Overview of the
ImageCLEFmed 2006 medical retrieval and medical annotation tasks,” in CLEF 2006
Proceedings, ser. Lecture Notes in Computer Science (LNCS), vol. 4730. Alicante, Spain:
Springer, 2007, pp. 595–608.
[49] T. Deselaers, T. M. Deserno and H. Müller, “Automatic medical image annotation in
ImageCLEF 2007: Overview, results, and discussion”, Pattern Recognition Letters, vol. 29,
no. 15, pp. 1988–1995, 2008.
[50] T. Deselaers and T. Deserno, “Medical image annotation in ImageCLEF 2008,” in CLEF
Workshop 2008: Evaluating Systems for Multilingual and Multimodal Information Access,
Aarhus, Denmark, September, 2009.
[51] J.E.E. Oliveira, A.P.B. Lopes, G. Camara-Chavez, A. de Araujo and T.M. Deserno,
“MammoSVD: A content-based image retrieval system using a reference database of
mammographies”, Computer-Based Medical Systems, 2009. CBMS 2009. 22nd IEEE
International Symposium on 2009.
[52] H. Pourghassem and H. Ghassemian, “Content-based medical image classification using a
new hierarchical merging scheme”, Comput Med Imaging Graph 2008; (Draft), 2008.
[53] E. Dougherty, “Electronic Imaging Technology”, Technology & Engineers, 1999.
[54] http://plato.stanford.edu/entries/color/
[55] http://www.colour.org
[56] J. Coggins, “A Framework for Texture Analysis Based on Spatial Filtering,” Ph.D. Thesis,
Computer Science Department, Michigan State University, East Lansing, Michigan,1982.
[57] M. Tuceyrn and A. Jain, “Texture Analysis”, The Handbook of Pattern Recognition and
Computer Vision (2nd Edition), pp. 207-248, World Scientific Publishing Co., 1998.
[58] R.M. Haralick, K. Shanmugam and I. Dinstein, “Textural features for image classification,”
IEEE Transactions on Systems, Man, and Cybernetics, SMC-3, pp. 610-621, 1973.
[59] T. Sikora, “The MPEG-7 visual standard for content description - an overview,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 696–702,
June 2001.
[60] H. Tamura, S. Mori and Y. Yamawaki, “Textural Features Corresponding to Visual
Perception,” IEEE Transactions on Systems, Man, and Cybernetics, SMC-8, pp. 460-473,
1978.
[61] D. Park, Y. Jeon and C. Won, “Efficient use of local edge histogram descriptors”,
Proceeding of the 2000 ACM workshops on Multimedia, 51-54, 2000.
[62] S. A. Chatzichristofis and Y. S. Boutalis, “CEDD: Color and edge directivity descriptor: A
compact descriptor for image indexing and retrieval.” in ICVS, ser. Lecture Notes in
Computer Science, A. Gasteratos,M. Vincze, and J. K. Tsotsos, Eds., vol. 5008. Springer, pp.
312–322, 2008.
[63] S. Chatzichristofis and Y. Boutalis, “FCTH: Fuzzy color and texture histogram-a low level
feature for accurate image retrieval,” in Proceedings of the 9th International Workshop on
Image Analysis for Multimedia Interactive Services, WIAMIS, pp. 191–196, 2008.
[64] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of
the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145–
175, 2001.
[65] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,”
Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[66] B. Lovell and C. Walder, “Support Vector Machines for Business Applications”, Business
Applications and Computational Intelligence, Idea Group Publishers, 2006.
[67] V. Vapnik, “The Nature of Statistical Learning Theory”. New York: Springer, 1995.
[68] C-C. Chang and C-J Lin, “LIBSVM: a library for support vector machines”, Software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001
[69] E. Nowak, F. Jurie and B. Triggs, “Sampling strategies for bag-of-features image
classification,” in Proc. Eur. Conf. on Computer Vision, vol. 4, pp. 490–503, 2006.