Semantic indexing in PubMed
CERN Workshop on Innovations in Scholarly Communication (OAI8)
Geneva, Switzerland
June 20, 2013
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
Orientation
NLM is the world's largest biomedical library
  Located in Bethesda, Maryland, near Washington, DC
PubMed provides access to MEDLINE, NLM’s bibliographic database of over 20M citations
  MEDLINE covers 5600 journals and adds almost 1M new citations each year
  PubMed is part of the Entrez system of the National Center for Biotechnology Information (NCBI)
Outline
Anatomy of a MEDLINE citation
Types of PubMed searches
  Simple text search
  Search based on MeSH indexing
Automatic indexing
Beyond topics
Anatomy of a MEDLINE citation
Title
Abstract
Indexing
MeSH main heading[/subheading(s)][+ * for major topic]
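The pattern above can be read as a small grammar. The sketch below parses one indexing entry into its parts. It assumes the major-topic asterisk appears at the end of the entry, as the pattern suggests; `IndexingTerm` and `parse_indexing_term` are hypothetical names, not part of any NLM tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexingTerm:
    heading: str            # MeSH main heading (descriptor)
    subheadings: List[str]  # zero or more qualifiers
    major: bool             # True when flagged with "*"


def parse_indexing_term(text: str) -> IndexingTerm:
    """Parse 'heading[/subheading(s)][*]' (assumed trailing-asterisk form)."""
    text = text.strip()
    major = text.endswith("*")
    if major:
        text = text[:-1].rstrip()
    # First part is the main heading, the rest are subheadings.
    heading, *subheadings = [part.strip() for part in text.split("/")]
    return IndexingTerm(heading, subheadings, major)


if __name__ == "__main__":
    print(parse_indexing_term("Ciprofloxacin/adverse effects*"))
    print(parse_indexing_term("Mood Disorders/chemically induced"))
```

Other display formats place the asterisk elsewhere; this sketch does not handle them.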
Types of PubMed searches
http://www.ncbi.nlm.nih.gov/pubmed
Non-semantic search
PubMed does not require the use of MeSH for querying
  Supports “Google-like” text searches
  “no librarian required”
But can identify MeSH terms even if they are not labeled as such
Non-semantic search: Example
Find articles about the cheese Gruyère
  Gruyère
MeSH (semantic) search
Medical Subject Headings (MeSH)
  Controlled vocabulary developed at NLM for indexing and retrieval of MEDLINE citations
  ~26,000 descriptors (main headings)
  <100 qualifiers (subheadings)
  214,000 supplementary concept records
Hierarchical structure (“tree numbers”)
  Supports query expansion (“explosion”)
  Search for a descriptor or any of its descendants
http://www.nlm.nih.gov/mesh/2013/mesh_browser/MBrowser.html
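One way to picture “explosion” is as a prefix match on tree numbers: a descriptor's descendants are the descriptors whose tree numbers extend its own. The sketch below assumes that reading. The tree numbers are invented placeholders, not real MeSH values, and `explode` is a hypothetical helper.

```python
from typing import Dict, List, Set

# Hypothetical tree numbers, for illustration only (not real MeSH values).
TREE: Dict[str, List[str]] = {
    "Quinolones": ["X01.100"],
    "Fluoroquinolones": ["X01.100.200"],
    "Ciprofloxacin": ["X01.100.200.300"],
    "Norfloxacin": ["X01.100.200.400"],
    "Penicillins": ["X01.500"],
}


def explode(descriptor: str, tree: Dict[str, List[str]] = TREE) -> Set[str]:
    """Return the descriptor plus every descriptor located beneath it."""
    roots = tree[descriptor]
    return {
        name
        for name, numbers in tree.items()
        # A node is the root itself or sits under it in the hierarchy.
        if any(n == r or n.startswith(r + ".") for n in numbers for r in roots)
    }


if __name__ == "__main__":
    print(sorted(explode("Fluoroquinolones")))
```

A descriptor can carry several tree numbers, which is why each entry maps to a list.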
Simple MeSH search: Example
Find articles about drug-induced psychoses
  "Psychoses, Substance-Induced"[Mesh]
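Both kinds of query, free text and MeSH-tagged, can also be sent programmatically. The sketch below only builds the request URL and makes no network call. It assumes the NCBI E-utilities ESearch endpoint, which these slides do not mention; `esearch_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not shown in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a PubMed query string."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Free-text query, no MeSH knowledge needed.
    print(esearch_url("Gruyère"))
    # The same mechanism carries a MeSH-tagged query.
    print(esearch_url('"Psychoses, Substance-Induced"[Mesh]'))
```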
Search with “Explosion”
By default, PubMed retrieves articles indexed with a descriptor or any of its descendants
Use mesh:noexp to prevent “explosion” from happening
“Explosion”: Example
Find articles about fluoroquinolones (or any of their descendants)
  "fluoroquinolones"[Mesh]
Search leveraging synonymy in MeSH
MeSH descriptors include related concepts (Entry terms)
  Synonyms
  Closely related (and clustered for indexing and retrieval purposes)
All terms from a descriptor and its entry terms are used for retrieval in PubMed
Entry terms for “Addison Disease”
Search leveraging UMLS Synonymy
Unified Medical Language System (UMLS)
  Terminology integration system
  ~130 biomedical terminologies
  Synonymous terms clustered into concepts
UMLS synonymy used in PubMed
  Query translation happens “behind the scenes”
  E.g., search on “primary adrenocortical insufficiency”
    Retrieves articles about “Addison’s disease”
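The “behind the scenes” translation can be pictured as a synonym lookup followed by query rewriting. The sketch below is a deliberate simplification, not PubMed's actual translation logic. The two-entry synonym table is a toy stand-in for MeSH entry terms and UMLS synonymy, and `translate` is a hypothetical function.

```python
# Toy synonym table standing in for MeSH entry terms and UMLS synonymy.
SYNONYM_TO_DESCRIPTOR = {
    "primary adrenocortical insufficiency": "Addison Disease",
    "addison's disease": "Addison Disease",
}


def translate(user_query: str) -> str:
    """Expand a free-text query with the matching MeSH descriptor, if any."""
    key = " ".join(user_query.lower().split())  # normalize case and spacing
    descriptor = SYNONYM_TO_DESCRIPTOR.get(key)
    text_clause = f'"{user_query}"[All Fields]'
    if descriptor is None:
        return text_clause  # no known synonym: plain text search only
    return f'"{descriptor}"[MeSH Terms] OR {text_clause}'


if __name__ == "__main__":
    print(translate("primary adrenocortical insufficiency"))
    print(translate("Gruyère"))
```

Keeping the original text clause alongside the descriptor means the rewrite broadens the search without dropping literal matches.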
No entry term for “Heart attack”
Query translation
Subheading restrictions
Subheadings represent the context of use of a particular descriptor
  Ciprofloxacin/Adverse effects
  Mood Disorders/Chemically induced
Assigned during indexing
Can be queried in PubMed
Subheading restrictions: Example
Find articles about drugs involved in adverse events
  "Chemicals and Drugs Category"/adverse effects[MeSH]
Recapitulative example
Find articles about drugs involved in adverse events and drug-induced manifestations
  (("Chemicals and Drugs Category"[Mesh]) AND (adverse effects[sh] OR contraindications[sh] OR mortality[sh])) AND (chemically induced[sh] OR (("Drug-Induced Liver Injury"[Mesh:noexp]) OR ("Drug Eruptions"[Mesh:noexp]) OR ("Epidermal Necrolysis, Toxic"[Mesh]) OR ("Drug-Induced Liver Injury, Chronic"[Mesh]) OR ("Erythema Nodosum"[Mesh]) OR ("Serotonin Syndrome"[Mesh]) OR ("Hand-Foot Syndrome"[Mesh]) OR ("Neuroleptic Malignant Syndrome"[Mesh]) OR ("MPTP Poisoning"[Mesh]) OR ("Dyskinesia, Drug-Induced"[Mesh]) OR ("Neurotoxicity Syndromes"[Mesh:noexp]) OR ("Psychoses, Substance-Induced"[Mesh]) OR ("Akathisia, Drug-Induced"[Mesh]))) AND (medline[sb])
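A query of this size is easier to maintain when assembled from parts. The sketch below rebuilds a shortened version of it, using only a few of the manifestation descriptors. The helper names (`mesh`, `sh`, `any_of`, `all_of`, `build_query`) are hypothetical, and the parenthesization is simpler than in the query above.

```python
from typing import Iterable


def mesh(descriptor: str, explode: bool = True) -> str:
    """MeSH clause; explode=False adds the noexp modifier."""
    tag = "Mesh" if explode else "Mesh:noexp"
    return f'"{descriptor}"[{tag}]'


def sh(subheading: str) -> str:
    """Free-floating subheading clause."""
    return f"{subheading}[sh]"


def any_of(clauses: Iterable[str]) -> str:
    return "(" + " OR ".join(clauses) + ")"


def all_of(clauses: Iterable[str]) -> str:
    return "(" + " AND ".join(clauses) + ")"


def build_query() -> str:
    # Drugs indexed with a subheading that signals harm.
    drug_side = all_of([
        mesh("Chemicals and Drugs Category"),
        any_of([sh("adverse effects"), sh("contraindications"), sh("mortality")]),
    ])
    # Drug-induced manifestations (shortened list for illustration).
    manifestation_side = any_of([
        sh("chemically induced"),
        mesh("Drug Eruptions", explode=False),
        mesh("Serotonin Syndrome"),
        mesh("Psychoses, Substance-Induced"),
    ])
    return all_of([drug_side, manifestation_side, "medline[sb]"])


if __name__ == "__main__":
    print(build_query())
```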
Automatic indexing
Automatic indexing: Motivation
Indexing by humans is costly and has limited reproducibility
Natural language processing can effectively support named entity recognition
Automatic indexing can produce
  Suggestions for human indexers
  Final indexing for some journals
Automatic indexing: Principles
Hybrid approach
  Concepts extracted from title and abstract
    Mapped from UMLS to MeSH
  MeSH descriptors extracted from related citations
Post-processing
  Clustering and ranking
  Integrate indexing rules
    E.g., “rule of 3”
      – Index with a higher-level descriptor rather than with 3 or more lower-level descriptors
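The “rule of 3” can be sketched as a post-processing pass over candidate descriptors. This is an illustration under simplifying assumptions, not how the indexing software implements the rule. The `PARENT` table is a hypothetical single-parent lookup, and `apply_rule_of_three` is an invented name.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical child -> parent links; a real system would use MeSH tree numbers.
PARENT: Dict[str, str] = {
    "Ciprofloxacin": "Fluoroquinolones",
    "Norfloxacin": "Fluoroquinolones",
    "Ofloxacin": "Fluoroquinolones",
    "Aspirin": "Salicylates",
}


def apply_rule_of_three(descriptors: List[str],
                        parent: Dict[str, str] = PARENT,
                        threshold: int = 3) -> List[str]:
    """Replace 3 or more sibling descriptors with their shared parent."""
    unique = list(dict.fromkeys(descriptors))  # drop duplicates, keep order
    siblings = defaultdict(list)
    for d in unique:
        if d in parent:
            siblings[parent[d]].append(d)
    collapsed = {p for p, kids in siblings.items() if len(kids) >= threshold}
    result: List[str] = []
    for d in unique:
        p = parent.get(d)
        replacement = p if p in collapsed else d
        if replacement not in result:
            result.append(replacement)
    return result


if __name__ == "__main__":
    print(apply_rule_of_three(["Ciprofloxacin", "Norfloxacin", "Ofloxacin", "Aspirin"]))
```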
Medical Text Indexer
http://ii.nlm.nih.gov/mti.shtml
Automatic indexing: Workflow
Automatic indexing: Applications
MEDLINE indexing
  Support MEDLINE indexing at NLM
    3600 new citations processed every weeknight
    Suggestions displayed in the indexing environment
“First-line” indexing
  For 75 journals
  MTI recommendations are used as an indexer
  Simply reviewed by a senior indexer
Cataloging and History of Medicine
  Assisted indexing
Beyond topics
Beyond concepts… relations
Also known as
  Facts
  Predications
  Nano-publications
  …
Relation extraction
  Usually based on natural language processing (NLP)
    E.g., SemRep
Relations stored in (subject, predicate, object) form
  With provenance information
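A (subject, predicate, object) relation with provenance maps naturally onto a small record type. The sketch below is illustrative only. The field names, the sample predication and the placeholder PMID are all invented, and no actual SemRep output format is implied.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Predication:
    subject: str
    predicate: str
    obj: str       # the object of the relation
    pmid: str      # provenance: citation the relation was extracted from
    sentence: str  # provenance: supporting sentence


def about(store: List[Predication], concept: str) -> List[Predication]:
    """Return predications mentioning the concept as subject or object."""
    return [p for p in store if concept in (p.subject, p.obj)]


if __name__ == "__main__":
    store = [
        # Placeholder record, not taken from any real database.
        Predication("Levodopa", "TREATS", "Parkinson Disease",
                    pmid="00000000", sentence="(placeholder sentence)"),
    ]
    for p in about(store, "Parkinson Disease"):
        print(p.subject, p.predicate, p.obj, "from PMID", p.pmid)
```

Carrying the PMID and sentence on every record keeps each extracted fact traceable to its source.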
Experimental application: Semantic MEDLINE
Multi-document summarization
Based on a database of 60M predications extracted from MEDLINE
  Entities normalized to the UMLS Metathesaurus
  Relations aligned with the UMLS Semantic Network
Interfaced with PubMed (for retrieving PMIDs) on a given topic
  Forms the basis for summarization
http://skr3.nlm.nih.gov/SemMedDemo/
Relation extraction: Applications
Enhanced information retrieval
  Indexing on relations in addition to concepts or association main heading/subheading
Multi-document summarization
  Extract and visualize the facts extracted from 250 recent abstracts on the treatment of Parkinson’s disease
Question answering
  Clinical and biological questions
Knowledge discovery
  Connect facts from heterogeneous resources
Medical Ontology Research
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
Contact: [email protected]
Web: mor.nlm.nih.gov