
Tutorial on Medical Image Retrieval - evaluation -

Medical Informatics Europe

28.08.2005

Henning Müller, Thomas Deselaers

Service of Medical Informatics, Geneva University & Hospitals, Switzerland
Aachen Technical University, Germany


Overview

• Introduction to (image retrieval) evaluation
• Parts of an image retrieval benchmark
  • Data sets
  • Query tasks and topics
  • Ground truth
  • Evaluation measures
  • Benchmarking events
• Current initiatives
  • TRECVID
  • Benchathlon
  • imageCLEF
• Conclusions


Introduction

• What do we actually want to evaluate?
  • Realistic scenarios
  • Real user needs
  • What can we do if it is not used in practice?
• Text retrieval has long experience with evaluation
  • Cranfield (early 1960s), SMART, TREC, CLEF
  • What can we use and what not?
  • More commercial interest
  • First systems in the 1960s were more theoretical
• Usability testing as well?


Usability testing, human factors

• Tests how real users operate the system
• User interface
  • Easy and quick to use
  • Adherence to interface standards
  • Novice vs. advanced user mode
• Interactivity tests
  • Speed is important
  • Results need to be explained to the user
  • On-screen feedback


Evaluating clinical information systems

• Validation of algorithms on test data
• Evaluation of the results on real data sets
• Clinical impact
  • Through user tests, improved diagnoses
• Outcome: does use of the system reduce patients' length of stay or the resources used?
  • Often hard to prove
• We are still at an extremely early stage for image retrieval


The need for evaluation

• Without evaluation there is no proof of performance
  • No improvement can be shown
  • Techniques cannot be compared
  • Techniques will not have any commercial success
• We need to see how far image retrieval has come in this respect: can we answer real user needs?
• Systematic evaluation can bring big improvements and deliver important results
  • Cranfield tests
  • TREC
• Other domains
  • Compression, segmentation, watermarking of images, …


History of image retrieval evaluation

• Example results of one query, then of several queries
• Use of databases (Corel, Vistex) containing very similar images
  • Problem: different subsets
• Use of self-defined measures
  • Show clustering, often only one measure
  • Definition of invariant measures (generality, invariant PR graphs)
• Use of standardized measures
  • Recall, precision, normalized average rank (MPEG-7), mean average precision (TREC)
• Why is it so hard to compare any two retrieval systems on the same basis?


Parts needed for a benchmark

• Data sets
  • Corel, Washington, Benchathlon, MPEG-7, Casimage
• Query topics and tasks
  • Definition based on real-world tasks is needed!
• Ground truth
  • Implicitly used through Corel categories
  • Otherwise expensive
• Evaluation measures
• Benchmarking events


Image data sets available

• Corel
  • Not sold anymore, but thumbnails are possible
• University of Washington
  • Groups of photographs from various regions
• MPEG-7 (copyrighted)
• Benchathlon
  • Images of people
• Casimage (medical images and multilingual text), MIRC
• Corbis test set (text and images)
  • Under which conditions?
• NIH publishes all the databases it creates, but none for retrieval so far
• Size matters!


Query tasks and topics

• Very few analyses of user behavior are available
  • Journalists' queries (Finland)
  • Image archive use (England)
  • Trademark retrieval is fairly well defined
  • Study on medical images at OHSU and HUG in 2005
• How can we define real-world tasks?
  • They will have to be based on the available databases
  • Survey of medical teaching file users
• Problem: almost no retrieval systems are in routine use
  • How can we find out real behavior without standard use of the systems?


Ground truth, Gold standard

• Expensive to define
  • Will need to involve real users
  • More than one judgment set is good to model subjectivity
• Pooling reduces complexity slightly (TREC methodology; see the sketch after this list)
• Classification of images is practical, but changing databases might be hard (complete annotation)
• Databases and ground truth will need to be changed from time to time (regularly)
• A community effort would be great
  • Common project (EU, NSF, ?), financing needed
  • Annotation?
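A minimal sketch of the pooling idea referenced above, assuming each run is represented as a mapping from topic id to the ranked list of image ids it retrieved; the function name, the pool depth, and the toy data are illustrative and not part of the tutorial.

```python
# Minimal sketch of TREC-style pooling (illustrative names and pool depth).
# Each run maps a topic id to the ranked list of image ids it retrieved.

def build_pool(runs, depth=50):
    """Union of the top-`depth` results of every run, per topic.

    Only the pooled images are shown to human assessors for relevance
    judging, which keeps ground-truthing affordable on large collections.
    """
    pool = {}
    for run in runs:
        for topic, ranked_images in run.items():
            pool.setdefault(topic, set()).update(ranked_images[:depth])
    return pool

# Toy example: two tiny runs for one topic
run_a = {"topic01": ["img3", "img9", "img4"]}
run_b = {"topic01": ["img9", "img2", "img7"]}
print(build_pool([run_a, run_b], depth=2))  # pool for topic01: img3, img9, img2
```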


Performance measures

• Standards that are easy to interpret exist
  • Precision, recall, normalized average rank (MPEG-7), …
  • Mean average precision to create a ranking at TREC
• One measure is not enough
  • Although measures are strongly correlated
• Normalization for collection size (generality, the fraction of the collection that is relevant to a query) is not needed
  • The difficulty of a query task can be described in other ways
  • Comparison across different databases is not useful
• Measures do not pose a critical problem for evaluation


Performance measures (2)

• Precision

  P = (number of relevant images retrieved) / (number of all images retrieved)

• Recall

  R = (number of relevant images retrieved) / (number of all relevant images in the DB)
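A minimal sketch of these measures in Python, extended with the (mean) average precision used for the TREC-style rankings mentioned on the previous slide; the function names and the toy data are illustrative, not taken from any of the systems discussed here.

```python
# Minimal sketch: precision, recall, and (mean) average precision for ranked
# retrieval results. `retrieved` is the ranked list an engine returns for one
# query; `relevant` is the set of image ids judged relevant in the ground truth.

def precision_recall(retrieved, relevant):
    hits = sum(1 for img in retrieved if img in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(retrieved, relevant):
    """Mean of the precision values measured at each relevant hit."""
    hits, precisions = 0, []
    for rank, img in enumerate(retrieved, start=1):
        if img in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(topics):
    """MAP over all query topics; `topics` is a list of (retrieved, relevant) pairs."""
    return sum(average_precision(ret, rel) for ret, rel in topics) / len(topics)

# Toy example
retrieved = ["img3", "img7", "img1", "img9"]
relevant = {"img3", "img1", "img5"}
print(precision_recall(retrieved, relevant))   # (0.5, 0.667): 2 of 4 retrieved, 2 of 3 relevant
print(average_precision(retrieved, relevant))  # (1/1 + 2/3) / 3 ≈ 0.56
```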


Benchmarking event

• Needed for content-based visual information retrieval!
• A friendly event that should help everyone
  • Such as TREC, CLEF
• Co-located with conferences where people go anyway, to reduce costs
  • Benchathlon at SPIE Electronic Imaging
  • CLEF at ECDL
• Feedback and acceptance from the community are important
  • But how to motivate research groups?
  • Databases, other benefits


A technical infrastructure for evaluation

• Results are submitted offline
  • TREC, CLEF (see the example run-file format below)
• Interactive user evaluations
• Automatic solution based on a standard communication protocol
  • MRML (Multimedia Retrieval Markup Language); solutions exist
• A Web-based evaluation procedure allows quick evaluations after an event
  • Harder to get acceptance
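For illustration, offline submissions at TREC and CLEF are conventionally plain-text run files with one retrieved item per line (topic number, a fixed iteration marker, item id, rank, score, run tag); the topic numbers, image ids, scores, and run tag below are made up.

```
1  Q0  img_04213  1  0.8731  GE_text_vis
1  Q0  img_00982  2  0.8412  GE_text_vis
2  Q0  img_11507  1  0.9120  GE_text_vis
```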


TRECVID

• Video retrieval at TREC, now a separate workshop
  • Started in 2001
  • 12 participants in 2001, 24 in 2003, 33 in 2004
  • 130 hours of video in 2001
• Accepted in the community; the proceedings have an impact and new tasks are added every year
• Financing through TREC; the domain seems important and databases are available (news)
• Speech and captions provide important semantic information


TRECVID tasks

• Shot boundary detection (see the sketch after this list)
  • Cut or gradual
• Story segmentation
  • One news story contains several shots
• Feature extraction
  • Concept extraction: indoor, outdoor, speech, people, train, boat, road, Bill Clinton, …
• Search
  • Human information need is expressed in text + multimedia
  • Results are a ranked shot list
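A minimal sketch of the simplest kind of shot boundary detector, a hard-cut detector that compares grey-level histograms of consecutive frames; the histogram size and threshold are arbitrary illustrative choices, and gradual transitions (fades, dissolves) need more elaborate models than this.

```python
import numpy as np

# Hard-cut detection via frame-to-frame histogram differences (illustrative
# parameters). `frames` is a list of greyscale frames as 2-D uint8 arrays.

def frame_histogram(frame, bins=64):
    """Normalised grey-level histogram of one frame."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def detect_cuts(frames, threshold=0.4):
    """Indices where consecutive histograms differ strongly (L1 distance)."""
    cuts = []
    prev = frame_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = frame_histogram(frame)
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts
```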


Benchathlon

• Goal was to create a forum for discussing the evaluation of image retrieval systems and for building an evaluation infrastructure
• Situated at SPIE Electronic Imaging
• Started in 2001, after discussions in 2000 and an outline document for such a benchmark
  • 2002: 5 papers
  • 2003: 8 papers
  • 2004: only discussions among participants
  • 2006: special session on evaluation planned


imageCLEF

• Located at the Cross-Language Evaluation Forum (CLEF)
• Goal is to evaluate the retrieval of images through multilingual information retrieval
• 2003: first image retrieval task, 4 participants
  • Queries in languages other than the English collection annotation; an image is part of the query
• 2004: 17 participants for three tasks (~200 runs)
  • Medical task for visual image retrieval added, where the query topic is an image only
• 2005: 24 participants for four tasks (~300 runs)
  • Two medical tasks, one on retrieval and one on classification
  • 36 groups registered: much interest in the data


imageCLEF methodology

• Based on the TREC/CLEF methodologies
  • Schedule for participation (January to September)
  • Release of data to participants, then of the query tasks
  • After result submission, pooling and ground truthing
  • Event to compare results
  • Proceedings with an impact
• Still in a learning phase, as this is only the third year
• New tasks have been added
  • Interactive query/retrieval in 2004
  • Medical classification, visual only, in 2005
• Tasks need to vary every year to cover new ground


imageCLEF St. Andrews

• 28,000 images, 25 queries
• Submissions include visual and textual runs and a large variety of techniques
• 2005: three example images per query, and tasks that take visual content into account


An example query topic in several languages:

• Spanish: Fotos de faros ingleses
• English: Pictures of English lighthouses
• Finnish: Kuvia englantilaisista majakoista
• German: Bilder von englischen Leuchttürmen


imageCLEFmed

• More than 50,000 images, 25 query tasks
• Goal was to model the search for information by medical specialists
• Teaching file databases; tasks chosen based on a user survey, with three classes of topics
• Ground truthing by medical doctors
• Submissions include automatic and manual runs using several techniques
• Text only is better than visual only
  • The best systems combine visual and textual retrieval (MAP 0.28)


Image classification

• IRMA group as organizers: 9,000 training images from 57 classes, 1,000 evaluation images
• All images are annotated with the IRMA code
• Visual properties only, but the annotation is available in German and English
  • Goal: keep a challenging task for the visual community
• 12 participants, with a variety of techniques for features and classification
• Best result: 87.4% correct
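As a rough illustration of how such an accuracy figure is obtained, the sketch below scores a nearest-neighbour classifier over precomputed feature vectors; the feature extraction, distance measure, and names are placeholders, since the participants used a wide variety of features and classifiers and none of their methods is reproduced here.

```python
import numpy as np

# Illustrative evaluation of a classification run: assign each test image the
# class of its nearest training image in feature space and report accuracy.
# How the feature vectors are computed is deliberately left open (placeholder).

def nearest_neighbour_label(query_vec, train_vecs, train_labels):
    """Label of the closest training image (Euclidean distance)."""
    distances = np.linalg.norm(train_vecs - query_vec, axis=1)
    return train_labels[int(np.argmin(distances))]

def accuracy(test_vecs, test_labels, train_vecs, train_labels):
    """Fraction of test images assigned their correct class."""
    correct = sum(
        nearest_neighbour_label(v, train_vecs, train_labels) == y
        for v, y in zip(test_vecs, test_labels)
    )
    return correct / len(test_labels)
```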


imageCLEF results

• Visual classification proves popular
  • More time is needed to develop optimized algorithms
• Textual retrieval is on average better than visual retrieval
  • Important to have queries at various semantic levels
• Best results were obtained by combining textual and visual features
  • Dependent on the features, though
• Most groups wanted test data, which was not really available this year
  • The 2004 tasks were very different


Conclusions

• Evaluation is essential for any research domain to prove system performance
• Benchmarking events advance science, and everybody profits
• Data sets and feedback from real users are crucial for future tasks
  • More studies on this are needed
• Data sets should be made available for all published articles, if possible