April 19th,2002 MuchMore Project Review
Multilingual Concept Hierarchies for Medical Information Organization and Retrieval
MUCHMORE
Project Overview
Application: Addressing a Real-Life Medical Scenario for Cross-Lingual Information Retrieval
Research & Development: Developing Novel, Hybrid (Corpus-/Concept-Based) Methods for Handling this Scenario
Evaluation: Evaluating the Technical Performance of (Combinations of) Existing and Novel Methods
User Perspective (ZInfo)
MuchMore: Provide Relevant Medical Information … for a Specific Patient Problem … Automatically, from the Web … Independent of Language
Vision: BAIK Model
Automatic Query Generation (and Expansion), Identifying the Exact Problem of the Patient
Retrieval and Relevance Ranking of Evidence Based Medical Literature, Language Independent
Summarization and Filtering of Results According to a User Profile
User Requirements
User Perspective (ZInfo)
User Evaluation
Use for Medical Cases: Part of Postgraduate Course in Medical Informatics
Evaluate Usefulness: Query Generation; Relevance for Decisions in Diagnostics and Treatment
Problematic Issues
Different medical profiles, schools, experience, speciality
What is relevant for one user may mean less or nothing to another
Evidence based medicine criteria exist only for a small fraction of medicine
User Perspective (ZInfo)
MuchMore Prototype
Overview of Prototype Functionality
Relation between Functionality and User Requirements
Issues Addressed by Research and Development within MuchMore
R&D in MuchMore
Corpus Annotation (DFKI, ZInfo): PoS, Morphology, Phrases, Grammatical Functions; Term and Relation Tagging
Term Extraction (XRCE, EIT, CMU, CSLI): Bilingual Lexicon Extraction; Extension of Semantic Resources
Relation Extraction (DFKI, CSLI): Grammatical Function Tagging; Extracting Semantic Relation Indicators; Extracting Novel Semantic Relations
Sense Disambiguation (CSLI, DFKI): Tuning and Extension of Semantic Resources; Combining Sense Disambiguation Methods
Semantic Annotation Based CLIR
Semantic Indexing/Retrieval (EIT, DFKI)
Corpus Based CLIR: Bilingual Lexicon Extraction (XRCE, EIT, CMU, CSLI); Pseudo Relevance Feedback, PRF (CMU); Generalized Vector Space Model, GVSM (CMU)
Summarization (CMU): Query, Genre Specific
Text Classification Based CLIR (CMU): Hierarchical/Flat kNN with MeSH
R&D in MuchMore: Additional Approaches in CLIR
Corpus Annotation
PoS Lexicon Update, Remaining Error Rate ~ 1.5% (EN)
Histologically, we found a subepidermal blister formation and a predominantly neutrophilic infiltrate. pos=VB > pos_correct=NN
Term and Relation Tagging: Evaluation of 8 DE/EN Parallel Abstracts, Relevant for a Query
Morphology
Test Set     German Nouns   MMorph   Recall   Incorrect   Error-Rate
test-dvlp    889            617      69.40%   42          6.81%
test-final   989            683      69.06%   79          11.57%
Incorrect, e.g.: Chorionzottenbiopsie > Chor + Ion + Zotte + Biopsie
Annotation Evaluation
Corpus
~ 9000 English and German Medical Abstracts from 41 Journals, Springer LINK WebSite, ~ 1 M Tokens for each Language
Term Extraction
Aim Bilingual Lexicon Extraction
From Comparable Corpora at Word Level; From Parallel Corpora at Word, and Term (Multi-Word) Level
Bilingual Extension of Semantic Resource (MeSH)
verbesserter transabdomineller Techniken = improved transabdominal techniques
Prognose des Frühcarcinoms = prognosis of early gastric cancer
Verletzungen des Gehirns = intracranial injuries
Lebensqualitaet = quality of life
XRCE (Aims and Resources)
Resources: Optimal Combination of Existing Resources (Corpus, General Dictionary, Thesaurus: MeSH); Corpus Specific German Decompounding (Improves Recall by 25% at Equal Precision)
Optimal Combination of Resources
Retaining only 10 best Translations for each Candidate
1. word-to-word, comparable corpora: F1 = 0.84
2.a word-to-word, parallel corpora: F1 = 0.98
2.b term-to-term, parallel corpora: F1 = 0.85
Evaluating Separately with Individual Resources (F1)
Corpus: 0.62; MeSH: 0.51; General Dictionary: 0.56
3. MeSH Extension: 1453 new multi-word terms added (synonyms or new term entries) extracted from the Springer corpus
Term Extraction XRCE (Results of Best Method)
Method
Extract Most Frequent Terms (Single Word) by Comparison of Term Frequencies in a General Corpus (German: SDA, English: LA Times) vs. Medical Corpus
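The frequency comparison described above could be sketched as follows. This is only an illustration under stated assumptions: the slide does not say how the two frequencies are compared, so the ratio of add-one-smoothed relative frequencies, the function name `domain_terms`, the `min_ratio` threshold and the toy token lists are all hypothetical.

```python
from collections import Counter

def domain_terms(medical_tokens, general_tokens, min_ratio=1.5):
    """Rank single-word terms that are relatively more frequent in the
    medical corpus than in the general corpus.

    Assumed scoring (not from the slides): ratio of add-one-smoothed
    relative frequencies; words below min_ratio are dropped."""
    med = Counter(medical_tokens)
    gen = Counter(general_tokens)
    vocab_size = len(set(med) | set(gen))
    med_total = sum(med.values()) + vocab_size
    gen_total = sum(gen.values()) + vocab_size
    scored = []
    for word, count in med.items():
        ratio = ((count + 1) / med_total) / ((gen[word] + 1) / gen_total)
        if ratio >= min_ratio:
            scored.append((word, ratio))
    # Highest ratio first: most domain-specific candidates on top
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins for the medical corpus and the general newspaper corpus
medical = "the biopsy showed a tumor and the biopsy was repeated".split()
general = "the election was held and the result was a surprise".split()
terms = [word for word, _ in domain_terms(medical, general)]
```

Function words such as "the" occur equally often in both toy corpora and are filtered out, while "biopsy" ranks first.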
Term Extraction EIT (Similarity Thesauri)
Results
Single Word Terms (Springer Abstracts)
German-English: 104,904 / English-German: 49,454
Multiword Terms (Phrase Lexicon Generated from ICD10)
German Phrases: 354 / English Phrases: 665
Bilingual Phrasal Entries Generated:
German - English: 225 / English - German: 246
Method
For each word in one language, accumulate counts of the number of times the translations of the sentences containing that word include each word of the other language. These co-occurrence counts may be restricted using word-alignment techniques.
Apply a variable threshold to filter out uncommon co-occurrences which are unlikely to be translations. The result is a lexicon listing candidate translations and their relative frequencies.
Results
~99,000 Bilingual Term Pairs (PubMed Parallel Abstracts) (Estimated Error Rate: < 10%)
Term Extraction CMU (EBT Bilingual Lexicon)
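The counting and filtering steps of this method can be illustrated with a small sketch. It assumes sentence-aligned input, whitespace tokenization and a single fixed relative-frequency cutoff (`min_rel_freq`) in place of the variable threshold mentioned on the slide; the optional word-alignment restriction is not modelled. The function name and the toy sentence pairs are hypothetical.

```python
from collections import Counter, defaultdict

def cooccurrence_lexicon(sentence_pairs, min_rel_freq=0.3):
    """Build candidate translations from sentence-aligned text.

    For each source word, count how often each target word appears in the
    translations of the sentences containing it, then keep candidates whose
    relative frequency reaches a fixed cutoff (an assumed simplification)."""
    counts = defaultdict(Counter)
    for source_sentence, target_sentence in sentence_pairs:
        target_words = set(target_sentence.lower().split())
        for source_word in set(source_sentence.lower().split()):
            counts[source_word].update(target_words)
    lexicon = {}
    for source_word, target_counts in counts.items():
        total = sum(target_counts.values())
        kept = {target: count / total
                for target, count in target_counts.items()
                if count / total >= min_rel_freq}
        if kept:
            lexicon[source_word] = kept
    return lexicon

pairs = [
    ("das Knie schmerzt", "the knee hurts"),
    ("das Knie", "the knee"),
    ("der Knochen", "the bone"),
]
lexicon = cooccurrence_lexicon(pairs)
```

In this toy run "hurts" is dropped as a candidate for "knie", but the function word "the" survives, which hints at why restricting the counts with word alignment is useful.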
Represent English and German Words as Vectors that are Produced by Recording the Number of Co-Occurrences of the Word in Question with each of a Set of Content-Bearing Words. Use (Cosine) Similarity Measure on these Rows to Find “Nearest Neighbours”.
[Figure: English words (e.g. ligament) and German words (e.g. Kreuzband, Kniegelenk) represented against 1,000 (English) content-bearing words (e.g. ligament, knee, joint)]
Term Extraction CSLI (Infomap System)
Term (EN)        SIM    Term (DE)           SIM
bone             1.00   knochen             0.82
cancellous       0.70   knochens            0.71
osteoinductive   0.67   knochenneubildung   0.67
demineralized    0.65   spongiosa           0.64
trabeculae       0.64   knochenresorption   0.60
formation        0.60   allogenen           0.60
periosteum       0.56   knöcherne           0.59
………                     ………
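A minimal sketch of the nearest-neighbour lookup over co-occurrence vectors, assuming the vectors are plain lists of counts against a shared set of content-bearing words. The three-dimensional toy vectors and the function names are invented for illustration and are not Infomap data or its API.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbours(query_word, vectors, top_n=3):
    """Return the top_n other words ranked by cosine similarity."""
    query = vectors[query_word]
    scored = [(word, cosine(query, vec))
              for word, vec in vectors.items() if word != query_word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical co-occurrence counts with three content-bearing words
vectors = {
    "ligament": [8, 5, 4],
    "Kreuzband": [7, 6, 3],
    "Kniegelenk": [3, 9, 8],
    "Tumor": [0, 1, 0],
}
neighbours = nearest_neighbours("ligament", vectors)
```

Because English and German words share the same vector space, the neighbour list of an English word can contain German terms, which is what the table above shows for "bone".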
Tuning (CSLI, DFKI)
Aligning Clusters with Senses
C0043210|GER|P|L1254343|PF|S1496289|Frauen|3|
C0043210|ENG|P|L1189496|PF|S1423265|Human adult females|0|
WSD: Terms, Senses
Extension (DFKI)
Morphological Analysis (Decomposition)
Entzündungsgewebe (infection tissue) HYPONYM Gewebe, Körpergewebe (body tissue)
Gewebe, Stoff, Textilstoff (textile)
Semantic Similarity (Co-Occurrence Patterns)
Karzinom (carcinoma), Metastase (metastasis) SYNONYM Geschwulst, Tumor, ....
Semantic Resource Extension and Tuning
WSD: Algorithm
Bilingual Sense Selection (CSLI): 1 Sense in L1 vs. >1 Sense in L2
English: blood vessel (C0005847) vs. vessel (polysaccharide) (C0148346)
German: Blutgefaesse = blood vessel (C0005847)
Combination of Methods (Task, Domain, General)
Collocations and Senses (CSLI): For an ambiguous single word term that is part of several unambiguous multiword terms, choose the sense of the most frequent multiword term.
single word term: abortion
1) a natural process C0000786 (T047)
2) a medical procedure C0000811 (T061)
multiword term: recurrent abortion C0000809 (T047) => sense 1
multiword term: induced abortion C0000811 (T061) => sense 2
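The collocation heuristic can be written down in a few lines. In this sketch the mapping from each multiword term to the sense it implies follows the abortion example above, while the corpus frequencies and the function name are hypothetical.

```python
def sense_from_collocations(implied_sense, multiword_counts):
    """Choose the sense implied by the most frequent unambiguous
    multiword term that contains the ambiguous single word term.

    implied_sense: multiword term -> sense (concept id) it implies
    multiword_counts: multiword term -> corpus frequency"""
    observed = [term for term in implied_sense
                if multiword_counts.get(term, 0) > 0]
    if not observed:
        return None  # no evidence in the corpus, leave the term ambiguous
    most_frequent = max(observed, key=lambda term: multiword_counts[term])
    return implied_sense[most_frequent]

# Senses of "abortion" as listed on the slide
implied_sense = {
    "recurrent abortion": "C0000786",  # sense 1, a natural process (T047)
    "induced abortion": "C0000811",    # sense 2, a medical procedure (T061)
}
# Hypothetical corpus frequencies, invented for illustration
counts = {"recurrent abortion": 3, "induced abortion": 12}
chosen = sense_from_collocations(implied_sense, counts)
```

With these invented counts the single word term would be tagged with sense 2.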
WSD: Algorithm
Domain Specific Senses (DFKI): Concept Relevance in Domain Corpus
Mineral
0.030774033: Mineralstoff, Eisen, Ferrum, Fluor, Kalzium, Magnesium
4.9409806E-5: Allanit, Alumogel, ..., Axionit, Beryll, ... Wurtzit, Zirkon
Combination of Methods (Task, Domain, General)
Instance-Based Learning (DFKI): Unsupervised Context Models (n-grams)
Training (Learn Class Models)
He drank <milk LIQUID>
He drank <coffee LIQUID>
He drank <tea LIQUID>
He drank <chocolate FOOD, LIQUID>
Application (Apply Class Models)
He drank <chocolate FOOD, LIQUID>
He drank <Java GEOGRAPHICAL, LIQUID>
Ambiguous: MeSH EN: 847 (2.5), DE: 780 (2.1); EWN EN: 6300 (2.8) DE: 4059 (1.5)
Evaluation (Nouns): GermaNet (40), English MeSH (59), German MeSH (28)
WSD: Evaluation
Lexical Sample
Evaluation Corpora (Medical)
Band (tape, strap, ligament)
Fall (drop, case, instance)
Gefäss (jar, vessel)
Operation (operation, surgery)
Prüfung (survey, tryout, checkup)
Verletzung (injury, trauma)
Wahl (ballot, choice, option)
Lage (site, status, position, layer)
Gewicht (weight, importance)
……
Robust, Shallow Grammatical Function Tagger: EM Model (Trained on Frankfurter Rundschau: 35M Tokens, Adaptation on Medical Corpora Under Development)
1.5M ‘Types’: Verb, Voice, Function, Nom-Head-Argument
abarbeiten ACT SUBJ Politiker
Use of PoS Information, Use of Chunk Information Planned
Tags for SUBJ, OBJ, IOBJ, ACT/PAS
German Available, English under Development
Untersucht <PRED1:PAS> wurden 30 Patienten <PRED1:SUBJ> <PRED2:SUBJ>, die sich <PRED2:SUBJ> einer elektiven aortokoronaren Bypassoperation <PRED2:IOBJ> unterziehen <PRED2:ACT> mussten.
Relation Extraction Grammatical Function Tagging (DFKI)
Cluster 1: T047/T060 (Diagnoses), T060/T101 (Affects), T060/T169, ...
Cluster 2: T101/T169, T101/T184, T101/T048, ...
Cluster 3: T047/T121 (Treats, Causes), T061/T121 (Uses), T121/T184 (Treats), ...
differentiate, conclude, discriminate, diagnose, illustrate
suffer, demonstrate, progress, develop, die
reduce, treat, follow, diagnose, cure
T047: Disease
T048: Mental Dysfunction
T060: Diagnostic Procedure
T101: Patient
T121: Pharm. Substance
T169: Funct. Concept (Syndrome)
T184: Sign or Symptom
Relation Extraction Semantic Relation Indicators (DFKI, CSLI)
Novel Semantic Relations (DFKI, CSLI)
Maximal Marginal Relevance (MMR)
Find passages most relevant to query
Maximize information novelty (minimize passage redundancy)
Assemble extracted passages for summary
$$\underset{d_i \in C}{\operatorname{Argmax}_k} \left[ \lambda\, S(Q, d_i) - (1 - \lambda) \max_{d_j \in R} S(d_i, d_j) \right]$$
Q = query, d = document, S = similarity function
λ = tradeoff factor between relevance & novelty
k = number of passages to include in summary
Summarization (CMU) Extractive Summarization
Applications
Re-ranking retrieved documents from IR Engine
Ranking passages from a document for inclusion in summaries
Ranking passages from topically-related document cluster for cluster summary
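The MMR criterion is commonly applied as a greedy loop, and a minimal sketch of that reading follows. It assumes that C is the set of candidate passages and R the set of passages already selected, which the slide does not state explicitly; the word-overlap (Jaccard) similarity, the function names and the toy passages are stand-ins chosen only to keep the example self-contained.

```python
def jaccard(a, b):
    """Toy similarity: word overlap between two strings (an assumption,
    any similarity function S could be plugged in instead)."""
    a_words, b_words = set(a.lower().split()), set(b.lower().split())
    union = a_words | b_words
    return len(a_words & b_words) / len(union) if union else 0.0

def mmr_select(query, candidates, similarity, k, lam=0.7):
    """Greedily pick k passages, trading query relevance (weight lam)
    against redundancy with already selected passages (weight 1 - lam)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(passage):
            redundancy = max((similarity(passage, s) for s in selected),
                             default=0.0)
            return lam * similarity(query, passage) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = "knee ligament injury"
passages = [
    "knee ligament injury treatment",
    "knee ligament injury therapy",
    "knee injury rehabilitation",
]
relevance_only = mmr_select(query, passages, jaccard, k=2, lam=1.0)
novelty_heavy = mmr_select(query, passages, jaccard, k=2, lam=0.3)
```

With `lam=1.0` the two near-duplicate passages are returned; with `lam=0.3` the second slot goes to the less redundant rehabilitation passage instead.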
MMR applies to English and German
Genre-based specialization (e.g. include conclusions for scientific articles)
Linguistic specialization possible
Summarization should apply when retrieving FULL articles: query-driven summaries instead of generic abstracts
MuchMore Application
Task: Query-Relevant (focused) / Query-Free (generic)
INDICATIVE, for Filtering (Do I read further?): To filter search engine results / Short abstracts
CONTENTFUL, for reading in lieu of full doc.: To solve problems for busy professionals / Executive summaries
INDICATIVE and QUERY-RELEVANT
Summarization (CMU)
Test Collection: Springer Abstracts (German and English)
Query Set: 25 of 126 Selected by ZInfo
Relevance Assessments
Assumption: Documents Retrieved by all Runs for one Query (Intersection) are Relevant
Pool Size: 500 Documents Based on 18 Runs Done by CMU, CSLI and EIT
German (ZInfo): 959 Relevant Documents
English (CMU): 500 Relevant Documents (1 judge); 964 Relevant Documents (3 judges)
Technical Evaluation Test Data
Corpus Based
Similarity Thesaurus (EIT)
Example-based Translation (CMU)
Pseudo Relevance Feedback (CMU)
Generalized Vector Space Model (CMU)
Hybrid
Classification (CMU)
Hierarchical: kNN, Rocchio
Flat: kNN, Rocchio-style Classifier
Semantic Annotation + Extraction (DFKI, XRCE)
UMLS / XRCE Terms & Semantic Relations
EuroWordNet Terms
Semantic Annotation + Similarity Thesaurus
Technical Evaluation Methods Evaluated
Overall Performance: 11-point Average Precision (Interpolated)
Performance in the High-Precision Area
Assumption: User Wants to Get the Most Relevant Documents Top-ranked within the Result List
Average Interpolated Precision at Recall of 0.1
Exact Precision after 10 Retrieved Documents
Applied to Experiments Evaluating Semantic Annotations
Technical Evaluation TREC-Style Performance Measurements
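For readers unfamiliar with these measures, the sketch below computes them for a single query, using the usual TREC-style definition in which interpolated precision at a recall level is the highest precision reached at that recall or beyond. Averaging over the query set is left out, and the function names and the toy ranking are illustrative assumptions rather than the project's evaluation code.

```python
def interpolated_precision(ranking, relevant, level):
    """Highest precision observed at any rank whose recall >= level."""
    if not relevant:
        return 0.0
    best = 0.0
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            if hits / len(relevant) >= level:
                best = max(best, hits / rank)
    return best

def eleven_point_average_precision(ranking, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(ranking, relevant, level)
               for level in levels) / len(levels)

def precision_at(ranking, relevant, cutoff=10):
    """Exact precision after `cutoff` retrieved documents."""
    return sum(1 for doc in ranking[:cutoff] if doc in relevant) / cutoff

# Toy example: four retrieved documents, two of them relevant
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
avg_11pt = eleven_point_average_precision(ranking, relevant)
prec_at_recall_01 = interpolated_precision(ranking, relevant, 0.1)
prec_at_2 = precision_at(ranking, relevant, cutoff=2)
```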
Data Sets
EIT: The Springer Parallel Corpus, i.e. 9640 Documents for English, and 9640 documents for German
CMU: Half of the Corpus, i.e. a Test Set with 4820 Documents in each.
System                                Eng-Eng   Ger-Ger   Ger-Eng   Eng-Ger
Monolingual EIT: lnu.ltn              0.1914    0.1848    N/A       N/A
Crosslingual EIT: SimThes & lnu.ltn   N/A       N/A       0.1258    0.1109
Monolingual PRF                       0.6782    0.5078    N/A       N/A
Crosslingual PRF                      N/A       N/A       0.5487    0.5758
EBT: chi-squared                      N/A       N/A       0.5232    0.5396
Crosslingual GVSM (first evaluation to be completed in July, 2002)
Technical Evaluation Results: Corpus Based Methods
Categorization (Preliminary Results)
Reuters-21578: 10,000+ documents, 90 categories
Reuters Corpus Volume 1, TREC-10 version (RCV1): 783,484 documents, 84 categories
Reuters Koller & Sahami subsets (ICML’98): 138 to 939 documents, 6-11 categories in a set
OHSUMED: 233,445 documents, 14,321 categories
System     Data Set           Macro-avg F1     Micro-avg F1
kNN        Reuters 21578      .60              .86
Rocchio    Reuters 21578      .59              .85
kNN        RCV1.TREC-10       (F0.5 = .44)     (F0.5 = .55)
Rocchio    RCV1.TREC-10       (F0.5 = .39)     (F0.5 = .49)
kNN        R-KS Subsets (3)   .85, .81, .97    .89, .80, .94
HkNN       R-KS Subsets (3)   .85, .80, .98    .86, .82, .99
Rocchio    R-KS Subsets (3)   .80, .75, .96    .82, .83, .96
HRocchio   R-KS Subsets (3)   .83, .81, .98    .78, .84, .99
kNN        OHSUMED            .26              .48
Technical Evaluation Results: Hybrid Methods
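The gap between the macro- and micro-averaged columns is easier to read with the two definitions side by side. The sketch below uses the standard definitions (macro: mean of per-category F1; micro: F1 over pooled counts); the per-category counts are invented for illustration and do not come from any of the data sets above.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def macro_micro_f1(per_category):
    """per_category maps a category name to its (tp, fp, fn) counts."""
    macro = sum(f1(*counts) for counts in per_category.values()) / len(per_category)
    pooled_tp = sum(counts[0] for counts in per_category.values())
    pooled_fp = sum(counts[1] for counts in per_category.values())
    pooled_fn = sum(counts[2] for counts in per_category.values())
    micro = f1(pooled_tp, pooled_fp, pooled_fn)
    return macro, micro

# Hypothetical counts: one frequent, easy category and one rare, hard one
per_category = {
    "frequent": (90, 10, 10),
    "rare": (1, 3, 4),
}
macro, micro = macro_micro_f1(per_category)
```

Micro-averaging is dominated by frequent categories, so a collection with many rare categories tends to show a much lower macro than micro score.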
Semantic Annotation + Extraction
Data Set: Full Springer Corpus
Weighting Scheme: Coordination Level Matching (CLM):
1. Pass: Documents Preferred Containing Matching Terms or Semantic Relations
2. Pass: All Features Using lnu.ltn
Rel. Assessments: German
System                        11pt AvPrec         Prec at Recall of 0.1   Prec at 10 Docs Retr
                              SemA-v3   SemA-v4   SemA-v3   SemA-v4       SemA-v3   SemA-v4
EN2DE: Morph & EWN            -         0.0005    -         0.0017        -         0.0040
EN2DE: Morph & UMLS           -         0.0933    -         0.2898        -         0.1840
EN2DE: Morph & UMLS & XRCE    -         0.1486    -         0.4258        -         0.3360
DE2EN: Morph & EWN            -         0.0479    -         0.1240        -         0.0960
DE2EN: Morph & UMLS           0.1507    0.1392    0.3895    0.3963        0.2520    0.2920
Technical Evaluation Results: Hybrid Methods
Semantic Annotation + Similarity Thesaurus
Data Set: Full Springer Corpus
Weighting Scheme: Coordination Level Matching (CLM)
Rel. Assessments: German
System                                    11pt AvPrec   Prec at Recall of 0.1   Prec at 10 Docs Retr
EN2DE: transl. Morphology & EWN           0.0276        0.1353                  0.1000
EN2DE: transl. Morphology & UMLS          0.1487        0.4126                  0.3320
EN2DE: transl. Morphology & UMLS & XRCE   0.1706        0.4495                  0.3600
DE2EN: transl. Morphology & EWN           0.1101        0.3165                  0.2000
DE2EN: transl. Morphology & UMLS          0.1413        0.4038                  0.2680
Technical Evaluation Results: Hybrid Methods
Assumption: CLIR achieves up to 75% of Monolingual Baseline (11pt Average Precision)
Corpus-based Methods (Compared to Monolingual PRF)
German – English PRF: 81%, EBT: 77%, EIT: 66%
English – German PRF: 113%, EBT: 106%, EIT: 60%
Hybrid Methods (Compared to Monolingual EIT)
German – English: 73% (UMLS Terms & SemRels)
English – German: 50% (UMLS Terms & SemRels)
English – German: 80% (UMLS Terms & SemRels & XRCE Terms)
German – English: 74% (SimThes & UMLS Terms & SemRels)
English – German: 80% (SimThes & UMLS Terms & SemRels)
English – German: 92% (SimThes & UMLS Terms & SemRels & XRCE Terms)
Technical Evaluation Summary of the Results
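The corpus-based percentages can be recomputed from the 11pt average precision values reported in the corpus-based results table. The sketch below assumes that each cross-lingual run is divided by the monolingual run in the target (document) language, and that the EIT runs are divided by the monolingual EIT run rather than by monolingual PRF; both are inferences, made because they reproduce the figures on this slide, not statements from the slides themselves.

```python
def percent_of_baseline(crosslingual, monolingual):
    """Cross-lingual score as a rounded percentage of the monolingual one."""
    return round(100 * crosslingual / monolingual)

# Monolingual 11pt average precision, keyed by system and document language
monolingual = {
    "PRF": {"Eng": 0.6782, "Ger": 0.5078},
    "EIT": {"Eng": 0.1914, "Ger": 0.1848},
}

# (method, query language, document language, score, assumed baseline system)
runs = [
    ("PRF", "Ger", "Eng", 0.5487, "PRF"),
    ("EBT", "Ger", "Eng", 0.5232, "PRF"),
    ("EIT", "Ger", "Eng", 0.1258, "EIT"),
    ("PRF", "Eng", "Ger", 0.5758, "PRF"),
    ("EBT", "Eng", "Ger", 0.5396, "PRF"),
    ("EIT", "Eng", "Ger", 0.1109, "EIT"),
]

relative = {
    (method, source, target): percent_of_baseline(score, monolingual[baseline][target])
    for method, source, target, score, baseline in runs
}
```

Values above 100, as for English to German PRF, mean the cross-lingual run scored higher than the monolingual run it is compared with.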
Corpus Collection
Comparable Medical Document Corpora are Very Difficult to Obtain, Anonymization Must be Validated by Hospital CIO
Work with “Shuffled” Parallel Corpus
Radiology Reports (~600,000) Available in German, to be Obtained for English
Management Deviations from the Work Plan
Corpus Annotation: More Efforts on Improving PoS Tagging and Morphological Analysis (English and German Medical Specialist Lexicon)
Relation Extraction: More Efforts on Grammatical Function Tagging as Preprocessing for Semantic Relation Tagging and Extraction
R&D Topics
Ontology Development: Combining Axes in AGK-Thesaurus (ZInfo) with Cluster Methods (CSLI, DFKI)
Semantic Web: Semantic Annotation of Medical Documents with Metadata (UMLS in Protégé)
Management Future Prospects and Activities
Related Projects and Workshops
Project Proposal IKAR/OS on KM & Visualization in Life Sciences
OntoWeb SIG on LT in Ontology Development and Use
MuchMore Workshop with Invited Experts in Medical Information Access, CLIR and Semantic Annotation (September 2002)
ZInfo/MuchMore Workshop on Electronic Patient Records (Spring 2003)