Highly Scalable Algorithm for Distributed Real-time Text Indexing
Issues and perspectives for medical text indexing
Transcript of Issues and perspectives for medical text indexing
Olivier BodenreiderOlivier Bodenreider
Lister Hill National CenterLister Hill National Centerfor Biomedical Communicationsfor Biomedical CommunicationsBethesda, Maryland Bethesda, Maryland -- USAUSA
Issues and perspectivesfor medical text indexing
Medical Informatics Europe 2003St-Malo, France
May 6, 2003 - Workshop W20
The Indexing Initiative
3
Motivation at NLMMotivation at NLM
◆◆ Increasing volume of biomedical literatureIncreasing volume of biomedical literature●● MEDLINE has grown from about 7 million citations in MEDLINE has grown from about 7 million citations in
1996 to over 12 million now1996 to over 12 million now
●● The number of journals indexed has grown from about The number of journals indexed has grown from about 3,750 in 1996 to 4,600 now3,750 in 1996 to 4,600 now
◆◆ Increasing availability of full textIncreasing availability of full text
◆◆ Limited resourcesLimited resources●● Especially qualified indexersEspecially qualified indexers
4
The IND ProjectThe IND Project
◆◆ ObjectivesObjectives●● Investigate automatic and semiautomatic indexing Investigate automatic and semiautomatic indexing
methodsmethods
●● Producing equal or better retrievalProducing equal or better retrieval
◆◆ Initially, an independent collection of projects Initially, an independent collection of projects addressingaddressing●● Indexing methodsIndexing methods
●● EvaluationEvaluation
●● PolicyPolicy
http://ii.nlm.nih.gov
[Aronson & al., AMIA, 2000]
5
Current statusCurrent status
◆◆ SemiSemi--automatic indexingautomatic indexing●● New citations are indexed every nightNew citations are indexed every night
●● Suggested descriptors integrated in the environment Suggested descriptors integrated in the environment used by the indexersused by the indexers
●● Ongoing evaluationOngoing evaluation
◆◆ Automatic indexingAutomatic indexing●● Collections not otherwise indexedCollections not otherwise indexed
●● Descriptors not displayedDescriptors not displayed
6
OverviewOverview Title + Abstract
Ordered list of MeSH Main Headings
MeSH Main Headings
UMLS concepts
Clustering
Restrict to MeSH
TrigramPhrase
Matching
Rel. Citations
PubMedRelated
Citations
ExtractMeSHdescr.
Phrasex
MetaMap
Noun Phrases
Three issues
8
Three issuesThree issues Title + Abstract
Ordered list of MeSH Main Headings
MeSH Main Headings
UMLS concepts
Clustering
Restrict to MeSH
TrigramPhrase
Matching
Rel. Citations
PubMedRelated
Citations
ExtractMeSHdescr.
Phrasex
MetaMap
Noun Phrases
Word-senseambiguity
Terminologyvs. ontology
Evaluation
9
Word sense ambiguityWord sense ambiguity
◆◆ Inherent to natural language processing (NLP)Inherent to natural language processing (NLP)
◆◆ Active research fieldActive research field
◆◆ Compounded in the biomedical domainCompounded in the biomedical domain●● Acronyms / abbreviationsAcronyms / abbreviations
●● Gene / gene product namesGene / gene product names
●● Terms not fully specifiedTerms not fully specified
10
Terminology vs. ontologyTerminology vs. ontology
◆◆ Hierarchies often taskHierarchies often task--drivendrivenrather than based on principlesrather than based on principles
◆◆ Usually suitable for information retrievalUsually suitable for information retrieval●● Better recallBetter recall
●● Precision may not be crucialPrecision may not be crucial
◆◆ Not necessarily suitable for reasoningNot necessarily suitable for reasoning
11
EvaluationEvaluation
◆◆ IndexIndex--basedbased●● Gold standardGold standard
■■ But no ground truthBut no ground truth
●● Similarity measuresSimilarity measures■■ But multiple perspectives possibleBut multiple perspectives possible
◆◆ RetrievalRetrieval--basedbased●● Requires test collectionsRequires test collections
◆◆ SystemSystem--vs. uservs. user--centeredcentered
Perspectives
13
PerspectivesPerspectives
◆◆ RequirementsRequirements●● Better ontologiesBetter ontologies
●● Better identification of specialized entitiesBetter identification of specialized entities(e.g., gene names)(e.g., gene names)
●● Better wordBetter word--sense disambiguation techniquessense disambiguation techniques
◆◆ Tremendous interestTremendous interest(through data mining and knowledge discovery)(through data mining and knowledge discovery)●● In the medical informatics communityIn the medical informatics community
●● And beyond (KDD cup 02, genomic track at TREC 03)And beyond (KDD cup 02, genomic track at TREC 03)
Contact:Contact: olivierolivier@@nlmnlm..nihnih..govgovWeb:Web: etbsun2.etbsun2.nlmnlm..nihnih..govgov:8000:8000
Olivier BodenreiderOlivier Bodenreider
Lister Hill National CenterLister Hill National Centerfor Biomedical Communicationsfor Biomedical CommunicationsBethesda, Maryland Bethesda, Maryland -- USAUSA
MedicalOntologyResearch