Semantic indexing in PubMed

Semantic indexing in PubMed. CERN Workshop on Innovations in Scholarly Communication (OAI8), Geneva, Switzerland, June 20, 2013. Olivier Bodenreider, Lister Hill National Center for Biomedical Communications, Bethesda, Maryland - USA.


Transcript of Semantic indexing in PubMed

Page 1: Semantic indexing in PubMed

Semantic indexing in PubMed

CERN Workshop on Innovations in Scholarly Communication (OAI8)
Geneva, Switzerland
June 20, 2013

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

Page 2: Semantic indexing in PubMed

Lister Hill National Center for Biomedical Communications

Orientation

NLM is the world's largest biomedical library
  Located in Bethesda, Maryland, near Washington, DC

PubMed provides access to MEDLINE, NLM’s bibliographic database of over 20M citations
  MEDLINE covers 5600 journals and adds almost 1M new citations each year

PubMed is part of the Entrez system of the National Center for Biotechnology Information (NCBI)

Page 3: Semantic indexing in PubMed

Outline

Anatomy of a MEDLINE citation
Types of PubMed searches
  Simple text search
  Search based on MeSH indexing
Automatic indexing
Beyond topics

Page 4: Semantic indexing in PubMed

Anatomy of a MEDLINE citation

[Figure: a MEDLINE citation, with its title, abstract, and indexing labeled]

Page 5: Semantic indexing in PubMed


MeSH main heading[/subheading(s)][+ * for major topic]
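The indexing format above can be read mechanically. The following is a minimal sketch, assuming a term is written as `Heading/subheading/subheading` with an asterisk anywhere in the string marking a major topic; the class name `IndexingTerm` and the function `parse_indexing_term` are hypothetical, and the exact position of the asterisk in real MEDLINE records is not specified on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class IndexingTerm:
    """One indexing term: a main heading, optional subheadings, major-topic flag."""
    heading: str
    subheadings: list = field(default_factory=list)
    major: bool = False


def parse_indexing_term(text):
    """Parse 'Heading/sub1/sub2', treating any '*' as the major-topic marker.

    Assumption: '/' separates the heading from its subheadings and does not
    occur inside a heading name.
    """
    major = "*" in text
    parts = [part.strip() for part in text.replace("*", "").split("/")]
    return IndexingTerm(heading=parts[0], subheadings=parts[1:], major=major)


term = parse_indexing_term("Ciprofloxacin/adverse effects*")
print(term.heading, term.subheadings, term.major)
```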

Page 6: Semantic indexing in PubMed

Types of PubMed searches

http://www.ncbi.nlm.nih.gov/pubmed

Page 7: Semantic indexing in PubMed

Non-semantic search

PubMed does not require the use of MeSH for querying
  Supports “Google-like” text searches
  “no librarian required”

But can identify MeSH terms even if they are not labeled as such

Page 8: Semantic indexing in PubMed

Non-semantic search: Example

Find articles about the cheese Gruyère
  Gruyère

Page 9: Semantic indexing in PubMed

MeSH (semantic) search

Medical Subject Headings (MeSH)
  Controlled vocabulary developed at NLM for indexing and retrieval of MEDLINE citations
  ~26,000 descriptors (main headings)
  <100 qualifiers (subheadings)
  214,000 supplementary concept records

Hierarchical structure (“tree numbers”)
  Supports query expansion (“explosion”)
  Search for a descriptor or any of its descendants

http://www.nlm.nih.gov/mesh/2013/mesh_browser/MBrowser.html
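To illustrate how tree numbers can support "explosion", here is a minimal sketch, assuming that a descendant's tree number extends its ancestor's tree number with further dot-separated segments. The hierarchy `TOY_TREE`, its tree numbers, and the function `expand` are invented for illustration and are not taken from MeSH.

```python
# Toy hierarchy: descriptor -> list of tree numbers (made-up values, not real MeSH).
TOY_TREE = {
    "Fluoroquinolones": ["X01.100"],
    "Ciprofloxacin": ["X01.100.200"],
    "Ofloxacin": ["X01.100.300"],
    "Levofloxacin": ["X01.100.300.400"],
    "Penicillins": ["X01.500"],
}


def expand(descriptor, explode=True, tree=TOY_TREE):
    """Return the set of descriptors a search on `descriptor` would cover.

    With explode=True, include every descriptor whose tree number equals or
    extends one of the descriptor's tree numbers; with explode=False, only
    the descriptor itself.
    """
    if not explode:
        return {descriptor}
    prefixes = tree[descriptor]
    return {
        name
        for name, numbers in tree.items()
        for number in numbers
        if any(number == p or number.startswith(p + ".") for p in prefixes)
    }


print(sorted(expand("Fluoroquinolones")))
print(sorted(expand("Fluoroquinolones", explode=False)))
```

Matching on the prefix plus a trailing dot avoids treating `X01.1000` as a descendant of `X01.100`.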

Page 10: Semantic indexing in PubMed

Simple MeSH search: Example

Find articles about drug-induced psychoses
  "Psychoses, Substance-Induced"[Mesh]

Page 11: Semantic indexing in PubMed

Search with “Explosion”

By default, PubMed retrieves articles indexed with a descriptor or any of its descendants

Use mesh:noexp to prevent “explosion” from happening

Page 12: Semantic indexing in PubMed

“Explosion”: Example

Find articles about fluoroquinolones (or descendants)
  "fluoroquinolones"[Mesh]

Page 13: Semantic indexing in PubMed

Search leveraging synonymy in MeSH

MeSH descriptors include related concepts (Entry terms)
  Synonyms
  Closely related (and clustered for indexing and retrieval purposes)

All terms from a descriptor and its entry terms are used for retrieval in PubMed

Page 14: Semantic indexing in PubMed


Page 15: Semantic indexing in PubMed

Entry terms for “Addison Disease”

Page 16: Semantic indexing in PubMed

Search leveraging UMLS Synonymy

Unified Medical Language System (UMLS)
  Terminology integration system
  ~130 biomedical terminologies
  Synonymous terms clustered into concepts

UMLS synonymy used in PubMed
  Query translation happens “behind the scenes”
  E.g., search on “primary adrenocortical insufficiency”
    Retrieves articles about “Addison’s disease”
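The "behind the scenes" translation can be pictured as a two-step lookup: first against MeSH descriptors and their entry terms, then against a broader synonym table. The sketch below is only an illustration under that assumption and is not PubMed's actual translation logic. The tables `MESH_ENTRY_TERMS` and `UMLS_SYNONYMS` are small hand-made samples, and `normalize`, `build_lookup`, and `translate` are hypothetical helpers.

```python
# Hand-made sample tables for illustration; not extracted from MeSH or the UMLS.
MESH_ENTRY_TERMS = {
    "Addison Disease": ["Addison's Disease"],
    "Myocardial Infarction": ["Myocardial Infarct"],
}
UMLS_SYNONYMS = {
    "primary adrenocortical insufficiency": "Addison Disease",
    "heart attack": "Myocardial Infarction",
}


def normalize(text):
    """Lowercase, unify apostrophes, and collapse whitespace."""
    return " ".join(text.lower().replace("’", "'").split())


def build_lookup(entry_terms):
    """Map each normalized descriptor name and entry term to its descriptor."""
    lookup = {}
    for descriptor, terms in entry_terms.items():
        for term in [descriptor, *terms]:
            lookup[normalize(term)] = descriptor
    return lookup


def translate(user_query, lookup, fallback):
    """Add a MeSH clause when the query matches an entry term or a synonym."""
    key = normalize(user_query)
    descriptor = lookup.get(key) or fallback.get(key)
    if descriptor is None:
        return f"{user_query}[All Fields]"
    return f'"{descriptor}"[Mesh] OR {user_query}[All Fields]'


lookup = build_lookup(MESH_ENTRY_TERMS)
print(translate("primary adrenocortical insufficiency", lookup, UMLS_SYNONYMS))
print(translate("Gruyère", lookup, UMLS_SYNONYMS))
```

A query with no match in either table falls through to a plain text search, which mirrors the non-semantic search shown earlier.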

Page 17: Semantic indexing in PubMed


Page 18: Semantic indexing in PubMed

No entry term for “Heart attack”

Page 19: Semantic indexing in PubMed

Query translation

Page 20: Semantic indexing in PubMed

Subheading restrictions

Subheadings represent the context of use of a particular descriptor
  Ciprofloxacin/Adverse effects
  Mood Disorders/Chemically induced

Assigned during indexing
Can be queried in PubMed
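As a rough illustration of querying on subheadings, the sketch below filters a toy index of citations by (descriptor, subheading) pairs. The identifiers `PMID-A` to `PMID-C`, the index contents, and the function `search` are all hypothetical; only the descriptor/subheading pairs echo the slide.

```python
# Toy index: citation id -> list of (descriptor, subheading) pairs assigned at indexing.
TOY_INDEX = {
    "PMID-A": [("Ciprofloxacin", "adverse effects"),
               ("Mood Disorders", "chemically induced")],
    "PMID-B": [("Ciprofloxacin", "pharmacology")],
    "PMID-C": [("Mood Disorders", "chemically induced")],
}


def search(index, subheading, heading=None):
    """Return citations carrying `subheading`, optionally on a given descriptor."""
    hits = []
    for pmid, pairs in index.items():
        for descriptor, qualifier in pairs:
            if qualifier == subheading and (heading is None or descriptor == heading):
                hits.append(pmid)
                break  # one matching pair is enough for this citation
    return sorted(hits)


print(search(TOY_INDEX, "adverse effects"))
print(search(TOY_INDEX, "chemically induced", heading="Mood Disorders"))
```

Leaving `heading` unset corresponds to a "floating" subheading search, where the subheading may be attached to any descriptor.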

Page 21: Semantic indexing in PubMed

Subheading restrictions: Example

Find articles about drugs involved in adverse events
  "Chemicals and Drugs Category"/adverse effects[MeSH]

Page 22: Semantic indexing in PubMed

Recapitulative example

Find articles about drugs involved in adverse events and drug-induced manifestations

(("Chemicals and Drugs Category"[Mesh]) AND (adverse effects[sh] OR contraindications[sh] OR mortality[sh])) AND (chemically induced[sh] OR (("Drug-Induced Liver Injury"[Mesh:noexp]) OR ("Drug Eruptions"[Mesh:noexp]) OR ("Epidermal Necrolysis, Toxic"[Mesh]) OR ("Drug-Induced Liver Injury, Chronic"[Mesh]) OR ("Erythema Nodosum"[Mesh]) OR ("Serotonin Syndrome"[Mesh]) OR ("Hand-Foot Syndrome"[Mesh]) OR ("Neuroleptic Malignant Syndrome"[Mesh]) OR ("MPTP Poisoning"[Mesh]) OR ("Dyskinesia, Drug-Induced"[Mesh]) OR ("Neurotoxicity Syndromes"[Mesh:noexp]) OR ("Psychoses, Substance-Induced"[Mesh]) OR ("Akathisia, Drug-Induced"[Mesh]))) AND (medline[sb])
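A query of this size is easier to assemble programmatically than to type. The sketch below builds a shortened version of it (only a few of the manifestation headings) and encodes it as a request URL for the NCBI E-utilities ESearch endpoint. The slides do not mention E-utilities, so using it here is an assumption; the helpers `any_of`, `all_of`, `mesh`, and `sh` are hypothetical. No network request is made.

```python
from urllib.parse import urlencode


def any_of(*terms):
    """Parenthesized OR of the given clauses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """Parenthesized AND of the given clauses."""
    return "(" + " AND ".join(terms) + ")"


def mesh(descriptor, explode=True):
    """MeSH clause; explode=False adds :noexp to suppress explosion."""
    tag = "Mesh" if explode else "Mesh:noexp"
    return f'"{descriptor}"[{tag}]'


def sh(subheading):
    """Floating subheading clause."""
    return f"{subheading}[sh]"


drugs = all_of(
    mesh("Chemicals and Drugs Category"),
    any_of(sh("adverse effects"), sh("contraindications"), sh("mortality")),
)
manifestations = any_of(
    sh("chemically induced"),
    mesh("Drug-Induced Liver Injury", explode=False),
    mesh("Drug Eruptions", explode=False),
    mesh("Serotonin Syndrome"),
)
query = " AND ".join([drugs, manifestations, "(medline[sb])"])

# ESearch endpoint of the NCBI E-utilities; the URL is only built, not fetched.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
url = ESEARCH + "?" + urlencode(
    {"db": "pubmed", "term": query, "retmax": 20, "retmode": "json"}
)

print(query)
print(url)
```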

Page 23: Semantic indexing in PubMed

Automatic indexing

Page 24: Semantic indexing in PubMed

Automatic indexing: Motivation

Indexing by humans is costly and has limited reproducibility

Natural language processing can effectively support named entity recognition

Automatic indexing can produce
  Suggestions for human indexers
  Final indexing for some journals

Page 25: Semantic indexing in PubMed

Automatic indexing: Principles

Hybrid approach
  Concepts extracted from title and abstract
    Mapped from UMLS to MeSH
  MeSH descriptors extracted from related citations

Post-processing
  Clustering and ranking
  Integrate indexing rules
    E.g., “rule of 3”
      Index with a higher-level descriptor rather than with 3 or more lower-level descriptors
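The "rule of 3" lends itself to a small worked example. The sketch below is a simplified reading of the rule, assuming each descriptor has a single parent; the hierarchy `TOY_PARENT` and the function `apply_rule_of_three` are hypothetical and do not reproduce how the indexing software implements the rule.

```python
from collections import defaultdict

# Toy parent map (child -> parent); illustrative only, not taken from MeSH.
TOY_PARENT = {
    "Ciprofloxacin": "Fluoroquinolones",
    "Ofloxacin": "Fluoroquinolones",
    "Norfloxacin": "Fluoroquinolones",
    "Amoxicillin": "Penicillins",
}


def apply_rule_of_three(descriptors, parent=TOY_PARENT, threshold=3):
    """Replace `threshold` or more sibling descriptors with their shared parent."""
    siblings = defaultdict(list)
    for descriptor in descriptors:
        if descriptor in parent:
            siblings[parent[descriptor]].append(descriptor)

    result = []
    for descriptor in descriptors:
        higher = parent.get(descriptor)
        if higher is not None and len(siblings[higher]) >= threshold:
            if higher not in result:
                result.append(higher)  # emit the parent once, at first occurrence
        elif descriptor not in result:
            result.append(descriptor)
    return result


print(apply_rule_of_three(["Ciprofloxacin", "Amoxicillin", "Ofloxacin", "Norfloxacin"]))
print(apply_rule_of_three(["Ciprofloxacin", "Ofloxacin"]))
```

With three fluoroquinolones among the candidates, the list collapses to the higher-level descriptor; with only two, the candidates are left unchanged.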

Page 26: Semantic indexing in PubMed

Automatic indexing: Workflow

Medical Text Indexer

http://ii.nlm.nih.gov/mti.shtml

Page 27: Semantic indexing in PubMed

Automatic indexing: Applications

MEDLINE indexing
  Support MEDLINE indexing at NLM
    3600 new citations processed every weeknight
    Suggestions displayed in the indexing environment
  “First-line” indexing
    For 75 journals
    MTI recommendations are used as an indexer
    Simply reviewed by a senior indexer

Cataloging and History of Medicine
  Assisted indexing

Page 28: Semantic indexing in PubMed

Beyond topics

Page 29: Semantic indexing in PubMed

Beyond concepts… relations

Also known as
  Facts
  Predications
  Nano-publications
  …

Relation extraction
  Usually based on natural language processing (NLP)
    E.g., SemRep

Relations stored in (subject, predicate, object) form
  With provenance information
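One way to picture a stored relation is as a small record holding the triple together with its provenance. The sketch below assumes provenance means the source citation and the sentence the relation came from; the `Predication` class, its field names, the sample facts, and the `PMID-…` identifiers are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Predication:
    """A (subject, predicate, object) relation plus where it was found."""
    subject: str
    predicate: str
    object: str
    pmid: str       # provenance: citation the relation was extracted from
    sentence: str   # provenance: supporting sentence (made-up text here)


def index_by_relation(predications):
    """Group citation ids by the triple they support."""
    index = defaultdict(set)
    for p in predications:
        index[(p.subject, p.predicate, p.object)].add(p.pmid)
    return index


facts = [
    Predication("Levodopa", "TREATS", "Parkinson Disease", "PMID-1",
                "Hypothetical sentence one."),
    Predication("Levodopa", "TREATS", "Parkinson Disease", "PMID-2",
                "Hypothetical sentence two."),
    Predication("Levodopa", "CAUSES", "Dyskinesia", "PMID-2",
                "Hypothetical sentence three."),
]

index = index_by_relation(facts)
print(sorted(index[("Levodopa", "TREATS", "Parkinson Disease")]))
```

Keeping the citation id with every triple is what lets a relation be traced back to the articles that assert it.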

Page 30: Semantic indexing in PubMed

Experimental application: Semantic MEDLINE

Multi-document summarization

Based on a database of 60M predications extracted from MEDLINE
  Entities normalized to the UMLS Metathesaurus
  Relations aligned with the UMLS Semantic Network

Interfaced with PubMed (for retrieving PMIDs) on a given topic
  Forms the basis for summarization

http://skr3.nlm.nih.gov/SemMedDemo/
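The flow described here (retrieve PMIDs for a topic, look up their predications, summarize) can be approximated very crudely by counting how many retrieved citations support each relation. This is only a frequency-based sketch, not the summarization method used by Semantic MEDLINE; the function `summarize`, the sample triples, and the `PMID-…` identifiers are hypothetical.

```python
from collections import Counter


def summarize(pmids, predications_by_pmid, top_k=2):
    """Return the top_k relations ranked by number of supporting citations."""
    counts = Counter()
    for pmid in pmids:
        # set() so a relation counts once per citation, even if extracted twice
        counts.update(set(predications_by_pmid.get(pmid, [])))
    return counts.most_common(top_k)


t1 = ("Levodopa", "TREATS", "Parkinson Disease")
t2 = ("Levodopa", "CAUSES", "Dyskinesia")
t3 = ("Deep Brain Stimulation", "TREATS", "Parkinson Disease")

# Stand-in for the predication database, keyed by citation id.
predications_by_pmid = {
    "PMID-1": [t1, t2],
    "PMID-2": [t1],
    "PMID-3": [t1, t2, t3],
}

# Stand-in for the PMIDs a PubMed search on the topic would return.
retrieved = ["PMID-1", "PMID-2", "PMID-3"]
print(summarize(retrieved, predications_by_pmid))
```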

Page 31: Semantic indexing in PubMed


Page 32: Semantic indexing in PubMed

Relation extraction: Applications

Enhanced information retrieval
  Indexing on relations in addition to concepts or association main heading/subheading

Multi-document summarization
  Extract and visualize the facts extracted from 250 recent abstracts on the treatment of Parkinson’s disease

Question answering
  Clinical and biological questions

Knowledge discovery
  Connect facts from heterogeneous resources
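"Connecting facts" can be illustrated by joining two sets of triples on a shared concept: the object of one fact is the subject of another. The sketch below is one simple way to do this and is not a method described in the slides; the function `link_through_shared_concept` and the placeholder facts ("Drug X", "Gene Y", "Disease Z") are hypothetical.

```python
def link_through_shared_concept(facts_a, facts_b):
    """Chain (s, p, o) facts from two resources where o in A is a subject in B."""
    by_subject = {}
    for subject, predicate, obj in facts_b:
        by_subject.setdefault(subject, []).append((predicate, obj))

    links = []
    for subject, predicate, obj in facts_a:
        for next_predicate, next_obj in by_subject.get(obj, []):
            links.append((subject, predicate, obj, next_predicate, next_obj))
    return links


# Placeholder facts standing in for two different resources.
literature_facts = [("Drug X", "INHIBITS", "Gene Y")]
database_facts = [("Gene Y", "ASSOCIATED_WITH", "Disease Z")]

print(link_through_shared_concept(literature_facts, database_facts))
```

Such a join only works if both resources name the shared concept identically, which is why normalizing entities to a common vocabulary matters.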

Page 33: Semantic indexing in PubMed

Medical Ontology Research

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

Contact: [email protected] (nlm.nih.gov)
Web: mor.nlm.nih.gov