
EEG REPORT TO RECORDINGS MATCHING AND TEXT MINING

Yu-Ping Shao, Data Scientist, Massachusetts General Hospital

Manohar Ghanta, Data Engineer, Massachusetts General Hospital

Tony Ku, Ph.D., Advisory Consultant, Dell EMC Consulting

M. Brandon Westover, M.D., Ph.D., Director, Clinical Data Animation Center (CDAC), Harvard Medical School, Massachusetts General Hospital

Valdery Moura Junior, Data Engineering, Massachusetts General Hospital

Knowledge Sharing Article © 2018 Dell Inc. or its subsidiaries.

2018 Dell EMC Proven Professional Knowledge Sharing

Table of Contents

1 INTRODUCTION

2 EEG REPORTS TO RECORDINGS MATCHING

2.1 Data Sources

2.2 EEG Examination Date Range

2.3 Feature Extraction

2.4 Model

2.5 Results

3 EEG REPORT INFORMATION EXTRACTION

3.1 Structure of an EEG Report

3.2 Date and Time Extraction

3.3 Medications

3.4 Inter-Epoch and Inter-Section Reference Resolution

3.5 Auto-Tagging Based on ACNS Guidelines

3.6 Seizure Sentiment Analysis

4 CONCLUSIONS

5 REFERENCES

Disclaimer: The views, processes or methodologies published in this article are those of

the authors. They do not necessarily reflect Dell EMC’s views, processes or

methodologies.


1 INTRODUCTION

Brain function is deranged in an alarming number of patients who become ill enough to be hospitalized. It is well known that hospitalized patients who are severely ill often appear “confused” or “sleepy”, a sign of the effects of illness on the brain. Less well known is that when such patients undergo brain monitoring with electroencephalography (EEG), up to 50% of them are discovered to be experiencing seizures [1-10]. Treatments are available; however, most (>90%) ICU seizures occur without any obvious signs and are detectable only by continuous EEG (cEEG) [1,4,6]. Delayed diagnosis of such “subclinical seizures” leads to brain damage, increases ICU and hospitalization lengths, and heightens the risk of in-hospital death or long-term disability in survivors. The Massachusetts General Hospital (MGH) Neurology Department runs one of the largest in-hospital EEG monitoring services in the world, with over 2,500 EEG recordings performed each year.

As part of an effort to migrate to a modern electronic medical record (EMR) system in early 2016, the MGH Neurology Department adopted the latest ACNS guidelines [11] for neurophysiologists to report on the EEG data they have reviewed. To this end, all clinical EEG reports are written in a semi-structured format with clearly defined section labels, each followed by a free-text descriptive narrative. There is a tremendous need to curate and mine these reports for clinical and research purposes. However, two main challenges stand in the way.

The first challenge is that patients’ EEG reports and the files containing the waveform data

are stored separately and not clearly linked. MGH’s current portable EEG machines were

deployed in early 2000 and are not integrated with the EMR system. Thus, patient

identification information is manually entered and stored separately before a technician

initiates EEG recording. This manual data entry is error-prone and causes significant

patient mismatches (only about 50% match completely) between physicians’ reports

and their corresponding EEG recording files. Also, various timestamps captured in MGH’s

EMR system for EEG examination are either not available or are not accurate enough to

directly determine the correct linkage.

The second challenge is how to extract useful information from semi-structured free-text EEG reports to create a curated, feature-rich, and easy-to-query structured data store that describes clinically important neurophysiological events. Critical events are those that either directly harm the brain, i.e. seizures [12], or that signal a high risk of seizures or serve as indicators of brain injury, including epileptiform discharges [12,13] and periodic or rhythmic patterns [14,15]. Supporting information, such as each epoch’s start/end time and pertinent medication administration, is also needed.

With these two challenges, it typically takes researchers weeks to months to manually

download raw data and files, review them, and retrieve the pertinent information in a one-

off fashion. Later researchers have to repeat this process without any possibility to benefit

from previous efforts.


The main contributions of this study are:

(1) a highly accurate fuzzy matching classifier, which matches EEG reports to their corresponding EEG recordings by using not just a patient’s identification attributes but also the context of the patient’s EEG visit date/time and the EEG file creation date/time;

(2) text mining techniques that extract and curate useful neurophysiological information from EEG reports with high fidelity, specifically: epoch date/time range, pertinent medications used or not used, presence or absence of sporadic discharges [11], presence of periodic or rhythmic patterns (that is, LPD, GPD, LRDA, GRDA, BIPD) [11], and seizure sentiment detection;

(3) a comprehensive data pipeline that extracts and stores the information from (1) and (2) as new reports and recordings are created daily, eliminating the laborious one-off effort of weeks to months previously required of each individual researcher or clinician.

2 EEG REPORTS TO RECORDINGS MATCHING

As stated previously, patients’ EEG reports and the corresponding EEG waveform files are not clearly linked, because patient identification information is entered into the electronic medical record (EMR) separately from the database of EEG recording files, and data in the EEG recording database is entered manually. In this section, we present a machine learning model that performs entity matching for existing data as well as for new data as it is created. Unlike other entity matching [16] approaches, we first use date/time attributes from the EMR and the EEG recording file system to identify all potential matches, thus narrowing the search space, and then apply entity identification information from both sides to complete the match. The reasons for this approach are:

- We are certain of the start date/time and duration of recordings from the vendor’s EEG recording file system, and of the date range of a patient’s EEG examination period from the MGH EMR system.

- We are certain of the patient’s identification information in EEG reports from the EMR system, whereas the same information in the vendor’s EEG recording file system is manually entered and therefore not as reliable (hence only 50% of records fully match between the two systems).

2.1 Data Sources

The patient identification information and EEG report attributes from the EMR system that are pertinent to our matching model are as follows:

| Attribute Group | Attribute Name | Description |
|---|---|---|
| Identification | MRN | Patient’s medical record number, a 7-digit string |
| Identification | Last Name | Patient’s last name |
| Identification | First Name | Patient’s first name (mostly formal, not nickname, e.g. William instead of Bill) |
| Identification | Middle Name | Patient’s middle name (could be full middle name or just initial) |
| Identification | Date of Birth | |
| Identification | Gender | |
| Report | Report ID | EEG report ID |
| Date/Time | Admit DTS | Patient hospital admission date/time |
| Date/Time | Start Exam DTS | EEG exam start date/time, not always available or reliable |
| Date/Time | End Exam DTS | EEG exam end date/time, not always available or reliable |
| Date/Time | Discharge DTS | Patient hospital discharge date/time |
| Date/Time | Result DTS | EEG result date/time when report was completed, not always available or reliable |

The patient identification information and EEG recording file attributes from the vendor’s EEG recording file system that are pertinent to our matching model are as follows:

| Attribute Group | Attribute Name | Description |
|---|---|---|
| Identification | MRN | Patient’s medical record number, a 7-digit string |
| Identification | Last Name | Patient’s last name |
| Identification | First Name | Patient’s first name (mostly formal, not nickname, e.g. William instead of Bill) |
| Identification | Middle Name | Patient’s middle name (could be full middle name or just initial) |
| Identification | Date of Birth | |
| Identification | Gender | |
| File | File ID | A uniquely hashed value |
| Date/Time | File Created DTS | EEG recording creation date/time, always available and reliable. This is the start date/time of an EEG recording. |
| Date/Time | Duration | EEG recording duration, always available and reliable. Can be used to derive the end date/time of a recording. |

The ground truth consists of (1) the 6 patient identification attributes from the EMR system

and (2) the two date/time stamps from the vendor’s EEG file system.


As sample data we used 3,114 long-term monitoring EEG files to build and train our

classifier model.

2.2 EEG Examination Date Range

To locate the subset of EEG recordings that could potentially match a report, we need to create a time box based on the five date/time attributes of a report. In general, their chronological order is Admit DTS, Exam Start DTS, Exam End DTS, Discharge DTS, and Report DTS (listed as Start Exam DTS, End Exam DTS, and Result DTS in the table above). The order of Discharge DTS and Report DTS can be reversed, depending on the patient’s length of stay and how quickly the neurophysiologist completes a report.

Another challenge in creating this time box is that Exam Start DTS, Exam End DTS, and

Report DTS are not always available and even if they are, they are not always accurate.

However, one important practical observation can be of use. That is, an EEG examination

cycle rarely extends over two weeks. Therefore, we employed a reasonably large report

time box to ensure all potential EEG recordings’ start date/time fall inside this range:

- Lower Bound: min (Admit DTS, Exam Start DTS) – 1 day

- Upper Bound: max (Exam End DTS, Discharge DTS, Report DTS) + 14 days

The side effect of this extended time box is overmatching, namely, an EEG file often

matches to more than one report and vice versa. In a practical sense, this is not a critical

problem because the desirable outcome from this process is that reports and files in one

treatment/examination cycle are linked together. Furthermore, we will use epochs’

start/end date/time in the EEG reports to fully resolve this issue (see EEG Report

Information Extraction section below).

2.3 Feature Extraction

To extract features, we compare the patient’s corresponding identification attributes from both systems using the following common algorithms, and tune the weights so that the scores are representative of the likelihood of a match:

- For names and MRN, we use Levenshtein distance [17] to calculate their spelling similarity. In addition, Soundex [18] is used to calculate the phonetic similarity of names. Because middle name is sparsely populated, it is excluded from the feature set.

- For date of birth, the year, month, and day of the month are matched separately and the results are summed.

- For gender (male, female, and unknown), exact match is used.

- To account for last name and first name being entered in the wrong field, and for compound names (e.g. full Spanish last names such as Rodriguez-Alvarez), a method of name containing and cross-containing is used.
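The comparisons listed above can be sketched with the standard library alone. The Levenshtein and American Soundex routines below are textbook versions. The `name_containment` helper is only one plausible reading of "containing and cross-containing" (substring checks within and across the two name fields), and its encoding of the raw values 0, 1, 10, and 11 is an assumption based on the score tables in this section.

```python
from datetime import date

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {c: d for d, letters in _GROUPS.items() for c in letters}

def soundex(name):
    """American Soundex: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = []
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with equal codes
        code = _CODES.get(c, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets the previous code
    return (letters[0] + "".join(encoded) + "000")[:4]

def dob_score(a, b):
    """Sum of exact matches on year, month, and day (0 to 3)."""
    return (a.year == b.year) + (a.month == b.month) + (a.day == b.day)

def _contains(a, b):
    a, b = a.lower(), b.lower()
    return bool(a) and bool(b) and (a in b or b in a)

def name_containment(emr_last, emr_first, file_last, file_first):
    """Hypothetical encoding: 10 for a last-name hit plus 1 for a first-name hit."""
    last_hit = _contains(emr_last, file_last) or _contains(emr_last, file_first)
    first_hit = _contains(emr_first, file_first) or _contains(emr_first, file_last)
    return 10 * last_hit + first_hit
```

For example, `soundex("Robert")` and `soundex("Rupert")` both give `R163`, and `name_containment("Smith", "John", "John", "Smith")` returns 11 even though the two fields are swapped in the second record.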


Below is a summary of the initial weighted score calculation by attribute:

| Attribute | Scoring Algorithm | Raw Score Range | Transformation Method | Default Score if Null | Weight | Weighted Score Range |
|---|---|---|---|---|---|---|
| MRN | Levenshtein Distance | [0, 7] | quadratic | 3 | 3 | [0, 3, 12, …, 147] |
| Date of Birth | Sum of each exact match | [0, 3] | quadratic | 1 | 20 | [0, 20, 80, 180] |
| Gender | Exact match | [0, 1] | linear | 0.4 | 50 | [0, 20, 50] |
| Last Name | Levenshtein Distance | [0, 10] | quadratic | N/A | 1 | [0, 1, 4, .., 100] |
| Last Name | Soundex | [0, 1] | binary | N/A | 100 | [0, 100] |
| First Name | Levenshtein Distance | [0, 10] | linear | N/A | 70 | [0, 70] |
| First Name | Soundex | [0, 1] | binary | N/A | 30 | [0, 30] |
| Last & First Name | Containing and Cross Containing | [0, 1, 10, 11] | ordinal | N/A | See table below | See table below |
| Total | | | | | | [0, …, 677] |
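The weighted score ranges in the table can be reproduced with a small lookup of weight and transformation per attribute. This is a sketch under stated assumptions rather than the authors' code: the rule keys are hypothetical names, an attribute with no default ("N/A") is assumed to contribute 0 when null, and the first-name Levenshtein raw score is assumed to be divided by 10 before weighting, which is the simplest way to reconcile a raw range of [0, 10] with a weighted range of [0, 70].

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rule:
    weight: float
    transform: str             # "quadratic", "linear", or "binary"
    default: Optional[float]   # raw score used when the attribute is null
    scale: float = 1.0         # assumption: divisor applied before weighting

RULES = {
    "mrn_lev": Rule(3, "quadratic", 3),
    "dob": Rule(20, "quadratic", 1),
    "gender": Rule(50, "linear", 0.4),
    "last_lev": Rule(1, "quadratic", None),
    "last_soundex": Rule(100, "binary", None),
    "first_lev": Rule(70, "linear", None, scale=10),
    "first_soundex": Rule(30, "binary", None),
}

MAX_RAW = {"mrn_lev": 7, "dob": 3, "gender": 1, "last_lev": 10,
           "last_soundex": 1, "first_lev": 10, "first_soundex": 1}

def weighted_score(name, raw):
    """Apply the transformation and weight for one attribute."""
    rule = RULES[name]
    if raw is None:
        if rule.default is None:
            return 0.0  # assumption: no default means no contribution
        raw = rule.default
    value = raw / rule.scale
    if rule.transform == "quadratic":
        value = value ** 2
    return rule.weight * value

max_total = sum(weighted_score(name, raw) for name, raw in MAX_RAW.items())
```

With every attribute at its maximum raw score, `max_total` comes to 677, which agrees with the Total row of the table.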

In addition to the scores of individual attributes, the likelihood of a match is affected, positively or negatively, by combinations of attributes. Here is a summary of the initial score adjustments for name containing and cross-containing, and for attribute compounding effects:

| Initial Score Adjustment | Raw Score Range | Equivalence | Effect To Score |
|---|---|---|---|
| Last name and First name Containing and Cross Containing | [0] | No match | None |
| Last name and First name Containing and Cross Containing | [1] | First Name Match | Replace first name initial score with [100] |
| Last name and First name Containing and Cross Containing | [10] | Last Name Match | Replace last name initial score with [200] |
| Last name and First name Containing and Cross Containing | [11] | Last and First Match | Replace last & first name initial score with [300] |
| Birth date and names fully match | MRN = [0,7] | N/A | Add [0, 85] |
| Birth date and names fully match | [1] | N/A | Add [100] |
| Birth date match but last name Soundex doesn’t match | [1] | N/A | Subtract [100] |
| Adjusted Total | | | [0, 787] |

The adjusted score of pairwise matching of a report to a file is a strong indication of

probability of a match. However, it’s not used as a threshold to determine whether it’s a

match or not because it could change over time. Instead, the pair with the highest adjusted

score for a report is included as the possible match and those initial weighted scores for

the pair are used as its features for model training and prediction.

2.4 Model

We selected a sample data set of 3,114 EEG files recorded from June 2016 to December 2016 for model training. The distribution of the highest adjusted score among each report’s potential EEG files is shown below. For this data set, a total adjusted score of around 370 appears to separate matches from non-matches well.

To avoid the issue of class imbalance and ensure balanced distribution of the data across

all scoring ranges, we use the following data stratification: 100-599: 100%, 600-699: 20%,

>700: 1%. As a result, we have a very class-balanced sample set of 647 positives and 629

negatives for model training.

The classifiers we trained are logistic regression, decision tree, gradient boosted tree, and random forest [19], with a 70-30 split of training vs. hold-out data, 3-fold cross validation, and 10-18 sets of hyper-parameters for tuning.

2.5 Results

All models perform rather well, and random forest (20 trees, maximum depth of 6, maximum bins of 16) has a slight edge with an area under the ROC curve of 0.9947. We get an area under the ROC curve of 0.997763 for the entire data set without stratification, which is expected because the remaining records all have very high weighted scores (> 600) and are matches.

The feature importances are: MRN: 0.0136, date of birth: 0.0182, gender: 0.0043, last name Levenshtein distance: 0.4706, first name Levenshtein distance: 0.1627, last name Soundex: 0.1641, first name Soundex: 0.1528, name containing & cross containing: 0.0138. As expected, the weighted scores of last name and first name dominate, especially spelling similarity, whereas MRN is not as important as expected.

The performance of the model to predict pairwise match for the data set from April 2016 to

September 2017 for all three examination types (long-term EEG monitoring, routine EEG,

and epilepsy monitoring) is consistent with that above, except for patients from the neonatal

intensive care unit (NICU), where some newborn babies have not been named yet and a

generic name like “baby girl” or “baby boy” has been entered. Thus, we need to tweak the

name scoring features to accommodate these cases.

Here are the confusion matrix and matching performance for long-term monitoring EEG recording files (including NICU cases) from April 2016 to October 2017:

| | Actual: Positive | Actual: Negative | Sum |
|---|---|---|---|
| Predicted: Positive | TP = 21,919 | FP = 0 | 21,919 |
| Predicted: Negative | FN = 471 | TN = 516 | 987 |
| Sum | 22,390 | 516 | n = 22,906 |

Recall = 0.9790, precision = 1.00, and accuracy = 0.9794.

Here are the confusion matrix and matching performance for routine and epilepsy monitoring EEG recording files from April 2016 to October 2017:

| | Actual: Positive | Actual: Negative | Sum |
|---|---|---|---|
| Predicted: Positive | TP = 9,121 | FP = 0 | 9,121 |
| Predicted: Negative | FN = 34 | TN = 440 | 474 |
| Sum | 9,155 | 440 | n = 9,595 |
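The reported recall, precision, and accuracy follow directly from the TP, FP, FN, and TN counts in the two tables above. The short sketch below recomputes them; the function name `matching_metrics` is illustrative only.

```python
def matching_metrics(tp, fp, fn, tn):
    """Standard recall, precision, and accuracy from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Long-term monitoring files (including NICU cases)
long_term = matching_metrics(tp=21919, fp=0, fn=471, tn=516)

# Routine and epilepsy monitoring files
routine = matching_metrics(tp=9121, fp=0, fn=34, tn=440)

for label, m in (("long-term", long_term), ("routine/epilepsy", routine)):
    print(label, {k: round(v, 4) for k, v in m.items()})
```

For the long-term monitoring counts this reproduces the figures quoted in the text (recall 0.9790, precision 1.00, accuracy 0.9794).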


In the next phase of the study, we plan to use the same framework to train a classifier and

perform predictions for pairwise matching of EEG reports to files prior to MGH’s cutover to

the new EMR system, that is, from 2002 to March 2016. A separate model is needed

because the quality of data collection in that period is significantly worse than in the

period covered by the current model.


3 EEG REPORT INFORMATION EXTRACTION

3.1 Structure of an EEG Report

In early 2016, MGH’s Neurology Department standardized the format and terminology used to describe patients’ EEG recordings by adopting the American Clinical Neurophysiology Society’s (ACNS) guidelines [11] in a semi-structured fashion. The minimal time epochs to be documented separately range from the first 30 minutes (equivalent to a routine EEG) to a 24-hour period, with a new epoch also started at any significant change in the recording during that period. Thus, a report can consist of one or more epochs, and each epoch consists of the following sections, each with a free-text narrative after the section heading:

- EPOCH: start date/time and end date/time of an epoch

- MEDICATIONS: pertinent medications and sometimes their administration

- BACKGROUND [11]

- SPORADIC DISCHARGES [11]

- PERIODIC OR RHYTHMIC PATTERNS [11]

- SEIZURES: narrative regarding the presence or absence of seizures, and the type of seizures if present

- EVENTS: any events that occurred, such as a change of medications, a change of state from asleep to awake, etc.
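A report with this layout can be split into epochs and sections with one regular expression. The sketch below assumes that each heading appears in capitals at the start of a line and is followed by a colon, and that every EPOCH heading opens a new epoch. The exact layout of real reports is not shown here, so the pattern, the function name `split_epochs`, and the sample text are illustrative only.

```python
import re

SECTION_NAMES = [
    "EPOCH", "MEDICATIONS", "BACKGROUND", "SPORADIC DISCHARGES",
    "PERIODIC OR RHYTHMIC PATTERNS", "SEIZURES", "EVENTS",
]
HEADING = re.compile(
    r"^\s*(" + "|".join(re.escape(n) for n in SECTION_NAMES) + r")\s*:\s*",
    re.MULTILINE,
)

def split_epochs(report):
    """Return a list of epochs, each a dict of section name -> narrative."""
    epochs = []
    matches = list(HEADING.finditer(report))
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(report)
        body = " ".join(report[current.end():end].split())  # collapse whitespace
        if current.group(1) == "EPOCH" or not epochs:
            epochs.append({})  # an EPOCH heading opens a new epoch
        epochs[-1][current.group(1)] = body
    return epochs

sample = """EPOCH: 12/1/2016 08:00 to 12/2/2016 08:00
MEDICATIONS: levetiracetam
SEIZURES: None.
EPOCH: 12/2/2016 08:00 to 12/3/2016 08:00
SEIZURES: See above."""

epochs = split_epochs(sample)
```

Each resulting dictionary can then be handed to the section-specific extraction steps described in the rest of this section.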

In this study, the types of important information to be extracted from reports are:

- Date and Time extraction from EPOCH section

- Inter-epoch and section reference resolution from all sections except EPOCH, with

pointer to the other epoch or section for pertinent information

- Auto-tagging in the narrative, for example, the presence of LPD in PERIODIC OR

RHYTHMIC PATTERNS section, or the presence of epileptiform discharges in

SPORADIC DISCHARGES section

- Seizure sentiment (positive, negative, or neutral) described in SEIZURES section

The method required for each type of information extraction is different and they are

described below.


Here is a sample report file with two epochs:

Spelling correction has not been applied in this work for the following reasons:

- The text in these reports is concise, domain-specific, and acronym-filled; it is not necessarily grammatically correct and may not have proper sentence breaks.

- Spelling correction would be useful only if a custom, domain-specific dictionary were added, which would take a lot of time and effort with uncertain benefit.

- We can use word stems in text search to overcome issues of singular vs. plural, minor regional spelling differences (American English vs. British English), and minor typos.

- Most of our text mining work is keyword-driven, and the keywords tend to be spelled correctly, with some common shorthand among professionals and limited typos.

3.2 Date and Time Extraction

An epoch starts with an EPOCH section, which contains the start and end date/time. This information is critical for locating the corresponding EEG recordings. The free text entered in this section has the following characteristics:

- Date format: as MGH is in the U.S., in the majority of cases the date format is month-first, with three variations: MM/DD/YYYY, MM/DD/YY (no century), and MM/DD (no year). When the year or the entire date is missing, it can be derived from the patient’s visit date with some accuracy.

- Time format: the time format is much less consistent than the date format, with several variations: h24:mm (military time format), hh:mm AM/PM, hhmm (no delimiter between hour and minute), and hmm (single-digit hour).

- Delimiter between start date/time and end date/time: it can be explicit (for example, “to”, “until”, “-”, etc.) or implicit (spaces). Implicit delimiters can cause significant problems when the date is not provided.

- Ill-formatted and/or incomplete date/time text strings: there is a significant number of bad data entries. Most can be resolved through several passes of regular expressions, but a few are just too ambiguous. Rather than guess incorrectly, it is better to leave them unresolved for users of the data to decide.

Here are some samples of EPOCH section with various formats of specifying start & end

date/time.

Here are the processing steps:

- Step 1: Break the EPOCH text string into two parts, start date/time and end date/time:

o If a common delimiter (to, until, etc.) is present, split on it.

o Otherwise, identify potential date strings and split at the second date string. This only works when each date/time string starts with the date, like 12/1/2016 12:20 am (most cases). It does not work for strings that start with the time, like 12:20 am 12/1/2016.

- Step 2: Normalize the time text string to hh:mm(:ss) using regular expressions.

- Step 3: Derive the default start date and end date to pass as the default date to the Python dateutil package:

o Default start date: max(admit date, EEG exam begin date)

o Default end date: min(EEG exam end date, discharge date)

- Step 4: Use the Python dateutil parser with the default dates derived above on the start and end date/time text strings.

- Step 5: Use the Python datefinder parser on cases where the dateutil parser fails. datefinder is very good at finding dates and times embedded in long text strings.
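Steps 1 and 2 above can be sketched with the standard library alone. The regular expressions and the helper names `split_epoch` and `normalize_time` are illustrative assumptions, not the authors' patterns. The strings they produce are what would then be handed to the dateutil parser together with a default date, as in Steps 3 and 4.

```python
import re

# Explicit delimiters: the words "to"/"until", or a dash (hyphen or en dash)
DELIMITER = re.compile(r"\s+(?:to|until)\s+|\s*[-–]\s*", re.IGNORECASE)
# Month-first dates: MM/DD, MM/DD/YY, or MM/DD/YYYY
DATE = re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b")
# Times without a delimiter (hhmm or hmm) that are not part of a date or time
BARE_TIME = re.compile(r"(?<![\d/:])(\d{1,2})(\d{2})(?![\d/:])")

def split_epoch(text):
    """Step 1: split an EPOCH string into (start, end) text."""
    text = text.strip()
    parts = DELIMITER.split(text, maxsplit=1)
    if len(parts) == 2:
        return parts[0].strip(), parts[1].strip()
    dates = list(DATE.finditer(text))
    if len(dates) >= 2:
        cut = dates[1].start()  # the second date acts as the delimiter
        return text[:cut].strip(), text[cut:].strip()
    return text, ""  # unresolved: leave for later passes or manual review

def normalize_time(text):
    """Step 2: turn hhmm / hmm into hh:mm / h:mm."""
    return BARE_TIME.sub(r"\1:\2", text)

start, end = split_epoch("12/1/2016 1220 12/2/2016 0800")
start, end = normalize_time(start), normalize_time(end)
```

Treating a dash as a delimiter is safe only because the dates in these reports use slashes; a corpus with dates like 12-1-2016 would need a stricter pattern.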

Here are some ill-formed and/or ambiguous start & end date/time string along with default

start date (LowerDTS):


With the processing described above, out of 10,612 EPOCH sections, 36 contain ill-formatted and/or ambiguous date/time strings that cannot be resolved definitively. Thus, we have achieved a 99.66% success rate. With the epochs’ start/end date/time mined here, we are able to narrow down the group of candidate matching files obtained in the previous section to exactly the list of EEG files narrated in a report.

3.3 Medications

The text of the MEDICATIONS section is mostly well formed, with clear delimiters separating each medication (sometimes dosage and administration are included). This allows us to parse the medications out easily and store them in separate rows in a table for querying. Note that negation is not weeded out because sometimes a confirmed omission is as important as confirmed usage.

Here are the simple processing steps for medications:

- Step 1: Use “,”, “.”, and “and” as token separators to parse the medication text string

- Step 2: Dosage and administration information are kept in each token

- Step 3: Do not weed out “none”, “no”, “nil”, etc. as they won’t match anything for

medication match

- Step 4: Explode the token list into one record per token
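The four steps above amount to a split followed by a one-row-per-token expansion. The sketch below is a minimal version under stated assumptions: the field names `report_id`, `epoch`, and `medication` are hypothetical, and the separator pattern deliberately does not split on a period between two digits so that a dosage such as 0.5 mg stays in one token, which is a small refinement not described in the text.

```python
import re

# Separators: comma, period (but not a decimal point), or the word "and"
SEPARATOR = re.compile(r",|(?<!\d)\.(?!\d)|\band\b", re.IGNORECASE)

def tokenize_medications(text):
    """Steps 1-3: split into tokens, keeping dosage text and negations."""
    return [tok.strip() for tok in SEPARATOR.split(text) if tok.strip()]

def explode(report_id, epoch_seq, text):
    """Step 4: one record per medication token."""
    return [
        {"report_id": report_id, "epoch": epoch_seq, "medication": tok}
        for tok in tokenize_medications(text)
    ]

rows = explode("R1", 1, "levetiracetam 500 mg IV, lorazepam and propofol.")
```

Tokens such as "None" are kept on purpose. They will not match any medication name in a later lookup, but they record a confirmed omission.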

Here are some examples of the processing described from Step 1 to Step 3:

Here are some examples of the processing described in Step 4:


3.4 Inter-Epoch and Inter-Section Reference Resolution

In long-term EEG monitoring, which tends to run for multiple days, there might not be significant changes from one epoch (< 24 hours) to the next. In that case, neurophysiologists write the narrative once and then refer to that epoch in the subsequent epoch(s), rather than copying and pasting the same narrative. This is called inter-epoch referencing. There is no new information to be mined, other than propagating the same information from the referenced epoch’s section. There are also cases of chained references.

Instead of documenting their narrative in the corresponding sections, some neurophysiologists write a complete narrative in one section and then refer to it in other sections (e.g. see above, see below, see background section, etc.). This is called inter-section referencing. Unlike inter-epoch referencing, there is new information to be extracted in the right context. For example, suppose LPD is mentioned only in the SEIZURES section. Because we do not look for LPD in the context of seizures, we would miss the proper tagging of LPD in the context of PERIODIC OR RHYTHMIC PATTERNS unless the reference is resolved.

Here is a report with 5 epochs, with inter-epoch referencing for background and seizures, and inter-section referencing from sporadic discharges to periodic or rhythmic patterns. Note that the epoch sequence, as stored and processed from the report, is in reverse chronological order; that is, the epoch with the highest sequence number is the earliest epoch in terms of date/time.

The identification of a referencing text follows three rules:

1. It is short, under 50 characters.

2. It contains a referencing term such as “see”, “as”, “same”, or “similar”.

3. It contains a directional term or a specific location:

o Forward direction: “above”, “prior”, “before”, “previous”

o Backward direction: “below”, “following”, “after”, “subsequent”

o Specific location: section name
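The three rules can be expressed as a small classifier. The sketch below is an illustration under assumptions: the function name `classify_reference` and its return values are hypothetical, directional terms are checked before section names (the text does not state a precedence), and a section is recognized only when its full name appears in the text.

```python
import re

REFERENCING = {"see", "as", "same", "similar"}
FORWARD = {"above", "prior", "before", "previous"}
BACKWARD = {"below", "following", "after", "subsequent"}
# Sections that a reference may point to (EPOCH itself is excluded)
TARGET_SECTIONS = [
    "PERIODIC OR RHYTHMIC PATTERNS", "SPORADIC DISCHARGES",
    "MEDICATIONS", "BACKGROUND", "SEIZURES", "EVENTS",
]

def classify_reference(text, max_length=50):
    """Return (kind, section) for a referencing text, or None otherwise."""
    stripped = text.strip()
    if len(stripped) >= max_length:          # rule 1: short text only
        return None
    lowered = stripped.lower()
    words = set(re.findall(r"[a-z]+", lowered))
    if not words & REFERENCING:              # rule 2: referencing term
        return None
    if words & FORWARD:                      # rule 3: direction or location
        return ("forward", None)
    if words & BACKWARD:
        return ("backward", None)
    for name in TARGET_SECTIONS:
        if name.lower() in lowered:
            return ("section", name)
    return None

examples = {
    text: classify_reference(text)
    for text in ("See above.", "Same as following epoch",
                 "See background section", "No seizures.")
}
```

A result of `("section", "BACKGROUND")` tells the pipeline which narrative to pull into the current section before tagging or sentiment analysis.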


With these simple rules, we are able to find inter-epoch or inter-section references with 100% accuracy. Thus, we are able to propagate extracted information across epochs, or pull the right text into the right context between two sections, for auto-tagging or seizure sentiment analysis (see below).

3.5 Auto-Tagging Based on ACNS Guidelines

ACNS guidelines [1] are quite extensive. Our initial focus (the most useful for researchers)

is on auto-tagging main terms in PERIODIC OR RHYTHMIC PATTERNS:

- Lateralized periodic discharges (LPDs) a.k.a. Periodic lateralized epileptiform

discharges (PLEDs)

- Generalized periodic discharges (GPDs) a.k.a. Generalized periodic epileptiform

discharges (GPEDs)

- Lateralized rhythmic delta activity (LRDA)

- Generalized rhythmic delta activity (GRDA)

- Bilateral periodic independent discharges (BIPDs) a.k.a. Bilateral independent

periodic lateralized epileptiform discharges (BIPLEDs)

and the presence or absence of sporadic discharges.

Negation has not been applied to auto-tagging because neurophysiologists tend not to mention these patterns if they are not present. There are a few exceptions, but their number is insignificant. Besides, false positives are more tolerable than false negatives in our case. Using “like” (pattern) matching on the words of a compound term (e.g. lateralized periodic discharges) without considering their sequence, we are able to achieve reasonably good auto-tagging results.
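Order-insensitive matching of a compound term can be sketched as follows. This is an illustration, not the authors' tagger: the stem lists in `TERMS` are assumptions built from the term names above, and a stem counts as present when any word in the text starts with it, which also absorbs plurals such as "discharges" or "LPDs".

```python
import re

# Each tag maps to alternative stem groups; all stems in a group must be present.
TERMS = {
    "LPD": [("lateralized", "periodic", "discharge"), ("pled",), ("lpd",)],
    "GPD": [("generalized", "periodic", "discharge"), ("gped",), ("gpd",)],
    "LRDA": [("lateralized", "rhythmic", "delta"), ("lrda",)],
    "GRDA": [("generalized", "rhythmic", "delta"), ("grda",)],
    "BIPD": [("bilateral", "independent", "periodic"), ("bipled",), ("bipd",)],
}

def auto_tag(text):
    """Return the sorted tags whose stems all occur, in any order."""
    tokens = re.findall(r"[a-z]+", text.lower())

    def present(stem):
        return any(tok.startswith(stem) for tok in tokens)

    return sorted(
        tag for tag, groups in TERMS.items()
        if any(all(present(stem) for stem in group) for group in groups)
    )

tags = auto_tag("Periodic discharges, generalized, at 1 Hz; occasional GRDA.")
```

Because word order is ignored, a narrative that mentions lateralized rhythmic delta activity together with generalized periodic discharges would also be tagged LPD. This is the kind of false positive that sequence-aware matching would remove.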

Out of 4,889 unique PERIODIC OR RHYTHMIC PATTERNS sections (including inter-section referencing), we are able to tag:

| Main Term | Number of Text Tagged | Percent of Text Tagged |
|---|---|---|
| LPD | 1176 | 24% |
| GPD | 1389 | 28% |
| LRDA | 382 | 8% |
| GRDA | 975 | 20% |
| BIPD | 213 | 4% |


Out of 5,004 unique SPORADIC DISCHARGES sections (including inter-section referencing), we are able to tag:

| Main Term | Number of Text Tagged | Percent of Text Tagged |
|---|---|---|
| epileptiform | 629 | 13% |
| triphasic | 307 | 6% |
| sharp/spike wave | 1736 | 35% |
| polyspike | 100 | 2% |

These results are in line with researchers’ expectations. To improve accuracy by reducing false positives, in the next phase we plan to use an n-gram approach that matches the sequence of words in a compound term, and to extend the auto-tagging to cover more terminology in the ACNS guidelines [1].

3.6 Seizure Sentiment Analysis

One of the most important pieces of information that can be extracted from EEG reports is

the sentiment of seizures. Taking advantage of a more standardized semi-structured

format for reports, we have employed a less deterministic but more straightforward

language model, with very impressive results.

Our definition of seizure sentiment is:

- Positive if the text indicates seizure(s)

- Negative if the text indicates no seizure

- Neutral if there is no clear indication one way or the other. Eventually, this is

considered negative.

Seizure types that we want to detect in this study are electrographic seizures,

electroclinical seizure, status epilepticus, epilepsia partialis continua, and absence

seizures. The essence of our sentiment model is to find seizure keywords, and then locate

enumerating keywords and negation keywords, to determine their applicability to seizure

keywords based on their proximity.

After reviewing several hundred sample reports, we have come up with the following keyword set to be used in our sentiment model. We do not attempt to build an exhaustive list from the English dictionary. However, in the domain of seizure sentiment analysis, we find the set quite adequate, as shown in the results later.

- Seizure keywords (including singular/plural, misspelling, and shorthand):

o seizure, seizures, siezures, sz, szs

o epilepticus, epileptics

o epilepsia

- Negation keywords: no, not, none, nothing, lack, unlikely

- Enumeration keywords:


o Literal enumeration: one, two, three, …, ten (people tend to use numbers if

it’s more than ten in a sentence)

o Numeric or numeric range enumeration: for example, 5 or 5-6 seizures

o Implicit enumeration: several, multiple, numerous, handful, few, lot, many,

any, considerable, single, number, these

Out of 10,612 epochs in our sample report set, there are 10,440 epochs with seizures text.

In addition to and similar to inter-epoch and inter-section referencing, we can infer

negative seizure sentiment easily when the entire text is very short (that is, <= 50

characters) with 100% accuracy. Below is the breakdown of our sample seizure text and

their treatment:

| Type of Seizure Text | Count | Treatment |
|---|---|---|
| Inter-epoch referencing to previous epoch | 20 | Propagate sentiment from the referred epoch |
| With short text (<=50 characters) and clearly indicating no seizure (e.g. none, no/not/without seizure, seizure free) | 8,776 | Simple sentiment determination |
| Inter-section referencing to the other section (above/below or specific) in the same epoch | 408 | Copy text from the referred section and perform full sentiment analysis |
| Full self-contained text | 1,236 | Full sentiment analysis |

The seizure text from the last two rows (1,644 = 408 + 1,236) is the focus of our more

comprehensive sentiment analysis.

With the keyword set defined and the sample text set identified, our model first works on sentences individually. Once a seizure keyword is found in a sentence, the sentence is further broken down into n-grams based on proximity parameters (number of preceding and following words), using seizure keywords as anchors. Within those proximity boundaries, we look for enumerating keywords and negation keywords to determine their applicability to the seizure keyword(s). Based on the sentiment of each sentence in the text, we derive the overall seizure sentiment for an epoch.

Here are the detailed processing steps:

• Step 1: Break description into sentences using NLTK, preserve their sequence,

and remove most punctuations except dashes for numeric range enumeration

• Step 2: Tokenize each sentence (no stop word removal)

• Step 3: Determine sentiment of each sentence as follows:

• Positive if it contains at least one seizure term and no negation term

• Negative if it contains at least one seizure term and at least one negation

term

• Neutral if it contains no seizure term


• Step 4: Use proximity of enumeration terms (4 preceding and 1 following words) to

any seizure term, to determine whether sentence has enumeration of seizures:

• If true, change its sentiment to positive if it’s negative from Step 3

• Otherwise, leave the sentiment as is from Step 3

• Step 5: Use proximity of negation terms (4 preceding and 2 following) to any

seizure term, to determine whether they are applied to seizures:

• If false, change its sentiment to positive if it’s negative from Step 3 & 4

• Otherwise, leave the sentiment as is from Step 3 & 4

• Step 6: Determine the overall seizure sentiment for an epoch as follows:

• Positive if any sentence has at least one instance of enumeration of

seizures (from Step 4)

• Otherwise, further processing as follows:

• Positive if the first non-neutral sentence has positive sentiment

• Negative if the first non-neutral sentence has negative sentiment

• Neutral, otherwise
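The six steps can be sketched end to end as follows. This is an illustration under stated assumptions rather than the code used in this work: a regular-expression splitter stands in for the NLTK sentence tokenizer so the example has no external dependencies, `SEIZURE_STEMS` holds illustrative stems only (the full seizure keyword list is the one defined earlier), and all function names are hypothetical.

```python
import re

# Assumption: illustrative stems only, not the full seizure keyword list.
SEIZURE_STEMS = ("seizure", "epilep")
NEGATION = {"no", "not", "none", "nothing", "lack", "unlikely"}
ENUM_WORDS = {"one", "two", "three", "four", "five", "six", "seven", "eight",
              "nine", "ten", "several", "multiple", "numerous", "handful",
              "few", "lot", "many", "any", "considerable", "single",
              "number", "these"}
NUMERIC = re.compile(r"^\d+(-\d+)?$")


def is_seizure(tok):
    return tok.startswith(SEIZURE_STEMS)


def is_enum(tok):
    return tok in ENUM_WORDS or bool(NUMERIC.match(tok))


def is_neg(tok):
    return tok in NEGATION


def sentences(text):
    # Step 1 (stand-in for NLTK): split after ., ! or ?, keep order,
    # then drop punctuation except dashes.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.sub(r"[^\w\s-]", "", p) for p in parts if p]


def near(tokens, test, before, after):
    # True if `test` holds for a token in the window around any seizure term.
    for i, tok in enumerate(tokens):
        if is_seizure(tok):
            win = tokens[max(0, i - before):i] + tokens[i + 1:i + 1 + after]
            if any(test(t) for t in win):
                return True
    return False


def sentence_sentiment(sentence):
    tokens = sentence.lower().split()                      # Step 2
    if not any(is_seizure(t) for t in tokens):             # Step 3
        return "neutral", False
    sentiment = "negative" if any(is_neg(t) for t in tokens) else "positive"
    enumerated = near(tokens, is_enum, 4, 1)               # Step 4
    if enumerated:
        sentiment = "positive"
    if sentiment == "negative" and not near(tokens, is_neg, 4, 2):  # Step 5
        sentiment = "positive"
    return sentiment, enumerated


def epoch_sentiment(text):
    results = [sentence_sentiment(s) for s in sentences(text)]
    if any(enumerated for _, enumerated in results):       # Step 6
        return "positive"
    for sentiment, _ in results:
        if sentiment != "neutral":
            return sentiment
    return "neutral"
```

For example, `epoch_sentiment("No seizures were recorded.")` yields a negative sentiment, while a hypothetical sentence such as "No events in this long and otherwise unremarkable record resembled seizures." comes out positive because the negation lies outside the 4-word window. This is the kind of false positive that a fixed proximity window can produce.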

Sample texts with positive sentiment are shown below. The notation in the “Sentence & Paragraph Sentiment” column is [S1, S2(E), …, Sn] => P, where S1 to Sn are the sentence sentiments, (E) marks a sentence that has an applicable enumeration keyword, and P is the sentiment for the entire paragraph.


Below are some sample texts with negative sentiment. Note that the last two cases are misclassified because the negation keywords fall outside the proximity zone of the seizure keywords. However, if we enlarged the proximity zone, we would see many more false negatives, which are less desirable.

Here is the confusion matrix and performance of seizure sentiment analysis on the set of 1,644 longer texts:

| | Actual: Positive | Actual: Negative/Neutral | Sum |
| --- | --- | --- | --- |
| Predict: Positive | TP = 911 | FP = 2 | 913 |
| Predict: Negative/Neutral | FN = 0 | TN = 731 | 731 |
| Sum | 911 | 733 | n = 1,644 |

Recall = 1.00, precision = 0.9978, and accuracy = 0.9988


Here is the confusion matrix and performance of combining simple and full sentiment analysis on all 10,420 seizure texts (8,776 short texts plus 1,644 longer texts):

| | Actual: Positive | Actual: Negative/Neutral | Sum |
| --- | --- | --- | --- |
| Predict: Positive | TP = 911 | FP = 2 | 913 |
| Predict: Negative/Neutral | FN = 0 | TN = 9,507 | 9,507 |
| Sum | 911 | 9,509 | n = 10,420 |

Recall = 1.00, precision = 0.9978, and accuracy = 0.9998
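The reported metrics follow directly from the cell counts. The short sketch below recomputes them, reading the two errors as false positives, as the TP/FP/FN/TN cell labels and the reported precision do. The function name `metrics` is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Recall, precision and accuracy from confusion-matrix cell counts."""
    total = tp + fp + fn + tn
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
    }


# Cell counts taken from the two confusion matrices above.
full_only = metrics(tp=911, fp=2, fn=0, tn=731)    # 1,644 longer texts
combined = metrics(tp=911, fp=2, fn=0, tn=9507)    # all 10,420 texts

for name, m in (("full only", full_only), ("combined", combined)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Precision is identical in both cases because the short texts add only true negatives. Accuracy rises in the combined case for the same reason.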


Given its simplicity, this sentiment model performs remarkably well. In addition, we are able to retrieve the specified seizure type for epochs with positive sentiment. The breakdown is listed below:

| Seizure Type | Count | Percentage |
| --- | --- | --- |
| absence seizure | 5 | 0.5% |
| electroclinical seizure | 118 | 12.9% |
| electrographic seizure | 339 | 37.1% |
| epilepsia partialis continua | 4 | 0.4% |
| status epilepticus | 69 | 7.6% |
| not specified | 378 | 41.4% |
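One simple way to retrieve a seizure type from positive text is to look for the type phrases from the table and fall back to "not specified". This is only a sketch: how the types were actually retrieved is not described here, so plain substring matching, the tuple `SEIZURE_TYPES` and the function `seizure_type` are all assumptions.

```python
# Type phrases taken from the table above.
SEIZURE_TYPES = (
    "absence seizure",
    "electroclinical seizure",
    "electrographic seizure",
    "epilepsia partialis continua",
    "status epilepticus",
)


def seizure_type(text):
    """Return the first listed type phrase found in the text,
    or 'not specified' if none is present."""
    lowered = text.lower()
    for name in SEIZURE_TYPES:
        if name in lowered:
            return name
    return "not specified"
```

Substring matching means that a plural such as "electrographic seizures" still matches its singular phrase. Text naming more than one type resolves to whichever appears first in the tuple, which a fuller implementation would need to handle explicitly.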

In follow-up work, we plan to run this sentiment analysis on new seizure text from reports in the same semi-structured format, as well as on older seizure text dating back to 2002, which is much less structured, to determine the model’s real-world performance and identify potential improvements.

We also tried a variation of the model that removes non-negation stop words and shortens the number of preceding words from 4 to 3. Stop word removal is a common normalization technique in text analysis. However, this variation did not perform as well as the model presented here. The likely reason is that writers naturally place enumeration or negation keywords close to a seizure keyword, and that closeness is measured in words including stop words. Removing stop words distorts these natural distances and degrades the results.
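The effect of stop word removal on word distances can be illustrated with a small example. The sentence and the stop-word list `STOP_WORDS` below are hypothetical and chosen only for illustration. Negation words are kept, as in the variation described above.

```python
# Assumption: a tiny illustrative stop-word list; negation words are kept.
STOP_WORDS = {"the", "of", "in", "is", "was", "were", "a", "an", "and",
              "to", "there", "this", "on", "with", "for"}


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def distance_to_anchor(tokens, keyword, anchor):
    """Number of token positions from `keyword` forward to `anchor`."""
    return tokens.index(anchor) - tokens.index(keyword)


raw = "no change in the frequency of the seizures".split()
filtered = remove_stop_words(raw)

raw_distance = distance_to_anchor(raw, "no", "seizures")            # 7
filtered_distance = distance_to_anchor(filtered, "no", "seizures")  # 3
```

In the raw sentence, "no" is 7 positions before "seizures" and stays outside a 4-word window, so it is not applied to the seizure keyword. After stop word removal the distance shrinks to 3, and a 3-word window would treat a negation that modifies "change" as if it negated the seizures.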

4 CONCLUSIONS

We have described a highly accurate classifier to match an EEG report to its

corresponding EEG recording files. Our classifier performs this matching using the context

of visit date/time and patient identification attributes in the EMR system, combined with

information from a separate vendor EEG recording file system. We have also described

several text analysis techniques that very accurately extract various types of clinically

important information from EEG reports. With these two foundations, we have established

a highly effective and efficient data pipeline for clinical operations, quality improvement,

and neurophysiological research.



