Symbolic and Machine Learning Methods for Patient Discharge Summaries Encoding

Julia Medori, CENTAL (Centre for Natural Language Processing), Université catholique de Louvain (Belgium). Séminaire du Cental - 17/12/2010


Page 1: Symbolic and Machine Learning Methods for Patient Discharge Summaries Encoding

Symbolic and Machine Learning Methods for Patient Discharge Summaries Encoding

Julia Medori

CENTAL (Centre for Natural Language Processing)

Université catholique de Louvain (Belgium)

Séminaire du Cental - 17/12/2010

Page 2: Overview

• Problem outline
• System structure
  – Extraction
  – Encoding
• Extraction module
• Encoding module
  – Machine learning methods
    • Experiments for feature selection
    • Results
  – Symbolic methods description
    • Method 1: Morphological Analysis (MA)
    • Method 2: Extended lexical patterns (ELP)
    • Methods combination
    • Results
• Conclusions

Page 3: Introduction

• Aim: build a (semi-)automated system for ICD-9-CM encoding
• Collaboration CENTAL/Saint-Luc – Université catholique de Louvain (Belgium)
  • CENTAL: Centre for Natural Language Processing
  • Saint-Luc hospital:
    – A team of 10 coders processes medical records: extraction of medical acts and diagnoses -> ICD-9-CM codes
    – 85,000 patient stays encoded each year.

Page 4: Manual encoding

Page 5: Data

• International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM)
  – Hierarchy:
    • First 3 digits -> general category: 1,135 categories
    • Digits 4 and 5 -> specific diagnosis: 15,688 codes
• Example:

  Code   Label
  001    Cholera
  0010   Cholera due to Vibrio cholerae
  0011   Cholera due to Vibrio cholerae el tor
  0019   Cholera, unspecified
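
The hierarchy above can be illustrated with a minimal Python sketch that maps a code, written without a decimal point as in the example table, to its 3-digit category. The function names `category_of` and `group_by_category` are hypothetical, and the sketch assumes purely numeric codes; it does not handle V or E codes.

```python
from collections import defaultdict

def category_of(code: str) -> str:
    """Return the 3-digit category of a numeric ICD-9-CM code such as '0011'."""
    code = code.strip().replace(".", "")
    if not (3 <= len(code) <= 5 and code.isdigit()):
        raise ValueError(f"unsupported code: {code!r}")
    return code[:3]

def group_by_category(codes):
    """Group full codes under their 3-digit category."""
    groups = defaultdict(list)
    for code in codes:
        groups[category_of(code)].append(code)
    return dict(groups)

# Codes from the example table on this slide.
cholera_codes = ["0010", "0011", "0019"]
print(group_by_category(cholera_codes))
```

Working at category level is what the symbolic methods later in the talk do (encoding according to the first 3 digits).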

Page 6: Objectives

• Design a coding help:
  – a tool that suggests the most likely codes to be assigned to a patient's medical record.
• Why not a fully automated system?
  – Main source of information: patient discharge summary (PDS)
    • PDS: a letter addressed to the patient's GP, with no standard structure
  – 15-20% of the codes are inferred from other sources in the patient's medical record (often scanned documents).

Page 7: System structure

[Diagram: system structure. Input: PDS. Output: PDS + ordered list of codes. Stages: Extraction, Coding. Box labels: Preprocessing; Dictionaries and linguistic structures; Morphological processing; Context analysis + tagging; Matching lists; Code modification according to context and stats; ICD9CM + Inclusions; Machine learning module; Manual checking.]

Page 8: Structure outline

• 2 steps:
  – Extraction
    • Develop an extraction system able to extract the information necessary for the encoding task:
      – Diagnoses, procedures, locations, dates, allergies, aggravating factors, etc.
    => Reading help tool.
  – Encoding
    • Extracted information => codes, through a combination of statistical and symbolic methods.

Page 9: Extraction

• Develop specialized linguistic resources
  – Specialized dictionaries
    • Diagnoses and procedures <= ICD-9-CM + UMLS
    • Medications
    • Anatomy
  – Linguistic structure description
    • Diagnoses context (present, absent, probable, etc.)
    • Allergies and smoking
    • Dates
    • Weight and height

Page 10: Example of linguistic structure graph

Fracture de l’épaule (shoulder fracture) => <MALINDET> Fracture de l’<ANAT>épaule</ANAT></MALINDET>
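
To make the nested annotation concrete, here is a small Python sketch that reproduces this kind of output with regular expressions. It is only an illustration: the talk describes linguistic structure graphs, not regexes, and the word lists `ANATOMY` and `DIAGNOSIS_HEADS` and the function names are hypothetical. Only the tag names `<MALINDET>` and `<ANAT>` come from the slide.

```python
import re

ANATOMY = ["épaule", "genou"]      # hypothetical mini anatomy dictionary
DIAGNOSIS_HEADS = ["fracture"]     # hypothetical diagnosis head words

def tag_anatomy(text: str) -> str:
    """Wrap known anatomy terms in <ANAT> tags."""
    terms = "|".join(map(re.escape, ANATOMY))
    return re.sub(rf"\b({terms})\b", r"<ANAT>\1</ANAT>", text, flags=re.IGNORECASE)

def tag_diagnosis(text: str) -> str:
    """Wrap 'head word + de l'/du/de la + tagged anatomy term' in <MALINDET> tags."""
    heads = "|".join(map(re.escape, DIAGNOSIS_HEADS))
    pattern = re.compile(
        rf"\b((?:{heads})\s+(?:de\s+l['’]|du\s+|de\s+la\s+)<ANAT>[^<]+</ANAT>)",
        re.IGNORECASE,
    )
    return pattern.sub(r"<MALINDET>\1</MALINDET>", text)

def annotate(text: str) -> str:
    # Inner tags first, so the outer pattern can rely on them.
    return tag_diagnosis(tag_anatomy(text))

print(annotate("Fracture de l’épaule"))
```

Tagging the inner anatomy term first lets the outer diagnosis pattern refer to it, which mirrors the nesting visible in the slide's example.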

Page 11: Extraction result

Page 12: Structure outline

• 2 steps:
  – Extraction
    • Develop an extraction system able to extract the information necessary for the encoding task:
      – Diagnoses, procedures, locations, dates, allergies, aggravating factors, etc.
    => Reading help tool.
  – Encoding
    • Extracted information => codes, through a combination of statistical and symbolic methods.

Page 13: Machine Learning

• Encoding = categorization problem
  – Features = extracted phrases?
  – Classes = codes
• Baseline method: Naive Bayes
  – Tool: Weka
• Corpus:
  – 13,635 PDS from Digestive Surgery
    • 90% training set / 10% test set (1,364 PDS)
    • Average number of codes per PDS: 6.2
• Trained 1 classifier per code occurring more than 5 times in the corpus:
  – 775 codes -> 775 classifiers
  – Limitation: 5% rare codes
  – Attributes: kept only those co-occurring at least twice with the codes.
• Measures: precision and recall according to the probability returned by the Naive Bayes test.
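
The two filtering rules on this slide (codes occurring more than 5 times, attributes co-occurring at least twice with a code) can be sketched in Python as below. This is not the Weka setup used in the talk: the data layout (a list of `(attributes, codes)` pairs, one per PDS), the function names, and the choice to count a code once per document are all assumptions made for illustration.

```python
from collections import Counter

def select_codes(corpus, min_occurrences=6):
    """Keep codes assigned to more than 5 documents (i.e. at least 6)."""
    counts = Counter(code for _, codes in corpus for code in set(codes))
    return {code for code, n in counts.items() if n >= min_occurrences}

def select_attributes(corpus, code, min_cooccurrence=2):
    """Keep attributes seen at least twice in documents carrying `code`."""
    counts = Counter()
    for attributes, codes in corpus:
        if code in codes:
            counts.update(set(attributes))
    return {attr for attr, n in counts.items() if n >= min_cooccurrence}

# Toy corpus with placeholder code names (not real ICD-9-CM codes).
corpus = (
    [({"biopsy", "bile"}, {"C1"})] * 6
    + [({"biopsy", "duct"}, {"C2"})] * 2
    + [({"rare"}, {"C1"})]
)
kept = select_codes(corpus)
print(kept, {c: select_attributes(corpus, c) for c in kept})
```

One binary classifier would then be trained for each kept code, using only the attributes that survive the second filter.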

Page 14: Experiments

• A series of experiments was conducted in which the attributes were variants of the extracted diagnoses and procedures after stemming.
• The variants were:
  – Kept original word order or not.
    • Ex: excisional biopsy bile duct
    • Or: bile biopsy duct excisional
  – Included details like location, date, context.
    • Excisional biopsy
  – Each word of the extracted phrases is a feature
    • Excisional
    • Biopsy
    • Bile
    • Duct
  – Words and morphemes (together) composing the extracted phrase
    • Bile biopsy excision excisional duct
  – Words and morphemes (separately) composing the extracted phrase
    • Excisional biopsy bile duct
    • Excision biopsy bile duct
  – Values were 0 or 1, depending on whether the attribute was in the text or not.
  – Values were the frequency of the attribute in the text.
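
A few of these variants can be sketched in Python as follows. The function names are hypothetical and the sketch leaves out stemming and the morpheme-based variants; it only shows word order (original versus alphabetical), one feature per word, and binary versus frequency values.

```python
from collections import Counter

def phrase_feature(words, keep_order=True):
    """One attribute per extracted phrase, in original or alphabetical word order."""
    return " ".join(words if keep_order else sorted(words))

def word_features(words):
    """One attribute per word of the extracted phrase."""
    return list(words)

def attribute_values(attributes, binary=True):
    """Map each attribute to 0/1 presence or to its frequency in the text."""
    counts = Counter(attributes)
    return {attr: (1 if binary else n) for attr, n in counts.items()}

words = ["excisional", "biopsy", "bile", "duct"]
print(phrase_feature(words))                    # excisional biopsy bile duct
print(phrase_feature(words, keep_order=False))  # bile biopsy duct excisional
print(attribute_values(["biopsy", "biopsy", "bile"], binary=False))
```

Sorting the words makes two phrases with the same words in a different order collapse into the same attribute.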

Page 15: Results

Features                                                                        Recall  Precision  Avg. number of suggested codes
Extracted phrases + details + same word order + 0/1 as values                   68.7    73.2       7.87
Extracted phrases + details + alphabetical order + 0/1 as values                59.1    75.7       6.49
Words and morphemes (together) + details + alphabetical order + 0/1 as values   68.5    74.2       7.54

The 3 best results, obtained when thresholding the list of results at a Naive Bayes probability of 1.
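
As a reminder of how such figures are obtained for a single document, the Python sketch below thresholds a list of scored codes and compares it with the codes assigned by human coders. The function names and the toy scores are hypothetical, and the talk does not state how the per-document figures were averaged, so no averaging is shown.

```python
def keep_confident(scored_codes, min_probability=1.0):
    """Keep the codes whose classifier probability reaches the threshold."""
    return {code for code, p in scored_codes.items() if p >= min_probability}

def precision_recall(suggested, gold):
    """Precision and recall of a set of suggested codes against the gold codes."""
    correct = len(suggested & gold)
    precision = correct / len(suggested) if suggested else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Toy example with placeholder code names.
scores = {"A": 1.0, "B": 1.0, "C": 0.4}
gold = {"A", "C", "D"}
suggested = keep_confident(scores)
print(suggested, precision_recall(suggested, gold))
```

Raising the threshold shortens the suggestion list, which typically trades recall for precision.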

Page 16: Discussion

• Limitations of the machine learning method:
  – 5% rare codes: not enough data to build a classifier for these codes
  – The need for annotated data means that these methods cannot cope with changes in the classification

In these cases, we need to use symbolic methods.

Kevers Laurent and Medori Julia, Symbolic classification methods for patient discharge summaries encoding into ICD, In: Advances in Natural Language Processing, 7th International Conference on NLP, IceTAL 2010, Reykjavik, August 16-18, 2010, Lecture Notes in Artificial Intelligence, 2010, p. 197-208

Page 17: Objective

• Automatic encoding of PDS according to categories (first 3 digits)
• Use of symbolic methods
  – No need for annotated data
  – Can assign rare codes (27% used 5 times or less)
• Principle:
  – Make use of the nomenclature
  – Enrich it with other resources in French from UMLS (Unified Medical Language System)

Page 18: Corpus

• 19,692 patient discharge summaries (PDS) in French
• General Internal Medicine
• 150,116 codes (137,336 categories)
• 6,029 distinct codes (895 categories)
• Average = 7.6 codes/document (7 categories)

Page 19: Method 1 (MA) – General Principle

• Based on the rich morphology of medical language
  – Ex. Bronchoscopy: Fibroscopie bronchique = bronchoscopie par fibre optique
• Two-step process:
  – Extract phrases or terms describing diagnoses or procedures to be encoded
  – Encoding: match these terms to the right code.

Page 20: Method 1 (MA) – Encoding

[Diagram: morphological decomposition of a PDS term and an ICD-9-CM label, followed by a similarity score.]

PDS term: Fibroscopie bronchique
  – fibroscopie: fibro- (fibre) + -scopie
  – bronchique: bronch- (bronche) + -ique

ICD-9-CM label: Bronchoscopie par fibre optique
  – bronchoscopie: bronch- (bronche) + -scopie
  – par, fibre, optique

• Bags-of-words: words – stop words + morphemes + meaning
• A similarity score is computed between the PDS bag and the ICD-9-CM bag.
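
The comparison of the two bags can be sketched in Python as below. The slide does not say which similarity measure is used, so the Jaccard coefficient here is an assumption, as are the function names and the tiny `MORPHEMES` and `STOP_WORDS` resources, which only cover the slide's example.

```python
# Toy morpheme lexicon covering the slide's example only (word -> morphemes/meanings).
MORPHEMES = {
    "fibroscopie": ["fibre", "scopie"],
    "bronchique": ["bronche", "ique"],
    "bronchoscopie": ["bronche", "scopie"],
}
STOP_WORDS = {"par", "de", "la", "le"}  # hypothetical stop-word list

def bag(term):
    """Bag-of-words: words minus stop words, plus morphemes and their meanings."""
    tokens = [t for t in term.lower().split() if t not in STOP_WORDS]
    items = set(tokens)
    for token in tokens:
        items.update(MORPHEMES.get(token, []))
    return items

def jaccard(a, b):
    """Assumed similarity measure: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

pds_bag = bag("Fibroscopie bronchique")
icd_bag = bag("Bronchoscopie par fibre optique")
print(sorted(pds_bag & icd_bag))   # ['bronche', 'fibre', 'scopie']
print(jaccard(pds_bag, icd_bag))   # 0.375
```

Without the morphological decomposition the two bags would share no element at all, which is the motivation for this method.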

Page 21: Method 1 (MA) – Results

                 Recall  Precision  F-measure  Nb. classes
Best Recall      46.13   14.70      21.10      20
Best F-measure   34.52   27.34      28.00      8.6

Page 22: Method 2 (ELP) – General principle

Developed by L. Kevers, as designed for the Stratego project on parliamentary documents.

• Symbolic method with less manual work
• Uses existing « terminological » resources
  – ICD-9-CM + UMLS
• Two-step process
  1. Automatic transformation of existing terminological resources into an extraction resource (only once)
  2. Use of the extraction resource on documents for term extraction and classification (for each document)

Page 23: Method 2 (ELP) – build extraction resource (1)

For each ICD-9-CM term (= a class), the automatic processing involves:

• Gather synonyms (UMLS)
  « dengue » → « dengues », « dengue fever », « infection by the dengue virus »
• Parse complex compound expressions
  « Infectious and parasitic diseases » → « Infectious disease »
                                        → « Parasitic disease »
• Transform initial term into Extended Lexical Pattern (ELP)
  - Stopwords: → « infection <TOKEN> dengue virus »
  - Stemming: → « infect <TOKEN> dengue virus »
  - Allow insertions: → « infect <I> <TOKEN> <I> dengue <I> virus »
• Add negative context patterns
• Build the main transducer for text annotation
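
The three transformation steps (stopwords, stemming, insertions) can be sketched in Python on the slide's own example. The stop-word list, the toy suffix-stripping stemmer, and the function names are assumptions; collapsing consecutive stop words into a single `<TOKEN>` is inferred from the example, where « by the » becomes one `<TOKEN>`.

```python
STOPWORDS = {"by", "the", "of", "and"}  # hypothetical stop-word list

def toy_stem(word):
    """Stand-in for a real stemmer: only strips a final 'ion'."""
    return word[:-3] if word.endswith("ion") and len(word) > 5 else word

def replace_stopwords(tokens):
    """Replace each run of stop words with a single <TOKEN> placeholder."""
    out = []
    for token in tokens:
        if token in STOPWORDS:
            if not out or out[-1] != "<TOKEN>":
                out.append("<TOKEN>")
        else:
            out.append(token)
    return out

def to_elp(term):
    """Turn a term into an extended lexical pattern string."""
    tokens = replace_stopwords(term.lower().split())
    stemmed = [t if t == "<TOKEN>" else toy_stem(t) for t in tokens]
    return " <I> ".join(stemmed)  # <I> allows insertions between elements

print(to_elp("infection by the dengue virus"))
# infect <I> <TOKEN> <I> dengue <I> virus
```

In the actual system the resulting patterns are compiled into a transducer; this sketch stops at the pattern string.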

Page 24: Method 2 (ELP) – Transducer & output

• Transducer for class '061'

[Figure: transducer graph]

• Output of main transducer for a document

Zona [[053]]
extremement douloureux [[729]]
gastroscopie [[Z44]]
acide [[E96]]
anemie normochrome normocytaire [[285]]
sequellaires apicales droite (tuberculose [[137]]
intestin grele [[Z45]]
tuberculose [[V12]]
oesophagite moderee aspecifique [[947]]
infection a mycobacterie [[031]]
fond de oeil [[Z16]]
pas de [[-]] atteinte du nerf [[957]]
zona [[053]]
hyperthyroidie [[242]]
goitre [[706]]
goitre [[240]]

Page 25: Method 2 (ELP) – Class assignment (2)

• For a text to classify, analyse the main transducer output
• When a negative context is found, the phrase is skipped
• Each recognized phrase has one (or more) related codes
• Compute a weight for each phrase based on
  – Frequency
  – Whether it is a multi-word expression (frequency*2) or not
• Compute a weight for each code by summing up the weights obtained for the phrases
• Result: ordered list of codes (possibly thresholded)
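
The weighting scheme can be sketched in Python as below. The input layout (one `(phrase, codes, negated)` tuple per recognized occurrence), the function name `rank_codes`, and the tie-breaking by code are assumptions; the phrases and codes in the toy input are borrowed from the transducer output shown on the previous slide.

```python
from collections import Counter, defaultdict

def rank_codes(annotations, threshold=None):
    """Rank codes from (phrase, codes, negated) tuples, one per recognized occurrence."""
    frequency = Counter()
    phrase_codes = {}
    for phrase, codes, negated in annotations:
        if negated:                      # phrases in a negative context are skipped
            continue
        key = phrase.lower()
        frequency[key] += 1
        phrase_codes[key] = codes
    weights = defaultdict(int)
    for phrase, freq in frequency.items():
        # Multi-word expressions count double.
        weight = freq * 2 if len(phrase.split()) > 1 else freq
        for code in phrase_codes[phrase]:
            weights[code] += weight
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    if threshold is not None:
        ranked = [(code, w) for code, w in ranked if w >= threshold]
    return ranked

annotations = [
    ("Zona", ["053"], False),
    ("zona", ["053"], False),
    ("goitre", ["706", "240"], False),
    ("anemie normochrome normocytaire", ["285"], False),
    ("atteinte du nerf", ["957"], True),   # preceded by "pas de"
]
print(rank_codes(annotations))
# [('053', 2), ('285', 2), ('240', 1), ('706', 1)]
```

Doubling the weight of multi-word expressions favours longer, more specific matches over single ambiguous words.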

Page 26: Method 2 (ELP) – Results

                 Recall  Precision  F-measure  Nb. of classes
Best Recall      52.74   20.69      27.37      19.6
Best F-measure   37.97   30.30      29.43      9.8

Page 27: Combination of methods 1 & 2

• Merge the lists from methods 1 & 2
  1. Threshold(M.1 union M.2)
  2. Threshold(M.1 inter M.2)
  3. Threshold(M.1) union Threshold(M.2)
  4. Threshold(M.1) inter Threshold(M.2)
• The weight for each method can be balanced
  – Example: 0.4*M.1 union 0.6*M.2
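
One possible reading of these four combinations is sketched in Python below. The talk does not give the exact scoring formula, so several points are assumptions: each method returns a `{code: score}` dictionary with comparable scores, the merged score is `alpha * M.1 + (1 - alpha) * M.2`, and a code missing from one list scores 0 there. All names and toy values are hypothetical.

```python
def merge(m1, m2, alpha=0.5, mode="union"):
    """Weighted merge of two {code: score} dicts over their union or intersection."""
    codes = set(m1) | set(m2) if mode == "union" else set(m1) & set(m2)
    return {c: alpha * m1.get(c, 0.0) + (1 - alpha) * m2.get(c, 0.0) for c in codes}

def threshold(scores, t):
    """Keep the codes whose score reaches t."""
    return {code for code, s in scores.items() if s >= t}

def mix1(m1, m2, t, alpha=0.5):   # Threshold(M.1 union M.2)
    return threshold(merge(m1, m2, alpha, "union"), t)

def mix2(m1, m2, t, alpha=0.5):   # Threshold(M.1 inter M.2)
    return threshold(merge(m1, m2, alpha, "inter"), t)

def mix3(m1, m2, t1, t2):         # Threshold(M.1) union Threshold(M.2)
    return threshold(m1, t1) | threshold(m2, t2)

def mix4(m1, m2, t1, t2):         # Threshold(M.1) inter Threshold(M.2)
    return threshold(m1, t1) & threshold(m2, t2)

# Toy scores for the two methods.
m1 = {"053": 0.9, "285": 0.2}
m2 = {"053": 0.5, "240": 0.8}
print(mix1(m1, m2, t=0.5, alpha=0.3), mix4(m1, m2, 0.5, 0.5))
```

The intersection variants keep only codes on which both methods agree, so they return shorter lists than the union variants.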

Page 28: Evaluation of symbolic methods combination

           Recall (R)  Precision (P)  F-measure (F1)  Nb. classes  Threshold  α/1-α

Mix1: Threshold(Method1 union Method2)
  Best R   60.21       13.20          20.86           30.5         No         Any
  Best F1  37.13       33.12          31.64           8.1          Yes        0.3/0.7

Mix2: Threshold(Method1 inter Method2)
  Best R   38.66       29.28          30.52           9.1          No         Any
  Best F1  34.73       34.55          31.50           7            Yes        0.3/0.7

Mix3: Threshold(Method1) union Threshold(Method2)
  Best F1  43.28       20.59          27.90           14.7         Yes        N/A

Mix4: Threshold(Method1) inter Threshold(Method2)
  Best F1  24.07       37.95          29.46           4.4          Yes        N/A

Page 29: Conclusions

• Results have to be put into perspective:
  – Inter-annotator agreement ~70%
  – 15 to 20% of the codes cannot be inferred from the PDS
  – Machine learning methods performed well.
  – Symbolic methods:
    • MA method, based on the extraction module: 66% of useful information is extracted.
    • ELP method performs better when built from short unambiguous phrases; ICD-9-CM code descriptions are more complex.
• Future work:
  – Give more weight to information contained in important parts of the PDS (introduction, conclusion…)
  – Evaluate the actual help given to human coders
  – Combine with learning algorithms