EEG REPORT TO RECORDINGS MATCHING AND TEXT MINING
Yu-Ping Shao, Data Scientist, Massachusetts General Hospital
Manohar Ghanta, Data Engineer, Massachusetts General Hospital
Tony Ku, Ph.D., Advisory Consultant, Dell EMC Consulting
M. Brandon Westover, M.D., Ph.D., Director, Clinical Data Animation Center (CDAC), Harvard Medical School, Massachusetts General Hospital
Valdery Moura Junior, Data Engineering, Massachusetts General Hospital
Knowledge Sharing Article © 2018 Dell Inc. or its subsidiaries.
2018 Dell EMC Proven Professional Knowledge Sharing 2
Table of Contents
1 INTRODUCTION
2 EEG REPORTS TO RECORDINGS MATCHING
2.1 Data Sources
2.2 EEG Examination Date Range
2.3 Feature Extraction
2.4 Model
2.5 Results
3 EEG REPORT INFORMATION EXTRACTION
3.1 Structure of an EEG Report
3.2 Date and Time Extraction
3.3 Medications
3.4 Inter-Epoch and Inter-Section Reference Resolution
3.5 Auto-Tagging Based on ACNS Guidelines
3.6 Seizure Sentiment Analysis
4 CONCLUSIONS
5 REFERENCES
Disclaimer: The views, processes or methodologies published in this article are those of
the authors. They do not necessarily reflect Dell EMC’s views, processes or
methodologies.
1 INTRODUCTION
Brain function is deranged in an alarming number of patients who become ill enough to be
hospitalized. It is well known that hospitalized patients who are severely ill often appear
“confused” or “sleepy”, a sign of the effects of illness on the brain. Less well known is that
when such patients undergo brain monitoring with electroencephalography (EEG), up to
50% of them are discovered to be experiencing seizures [1-10]. Treatments are
available; however, most (>90%) ICU seizures occur without any obvious signs and are
detectable only by continuous EEG (cEEG) [1,4,6]. Delayed diagnosis of such “subclinical
seizures” leads to brain damage, increases ICU and hospitalization lengths, and heightens
the risk of in-hospital death or long-term disability in survivors. The Massachusetts General
Hospital (MGH) Neurology Department runs one of the largest in-hospital EEG monitoring
services in the world, with over 2,500 EEG recordings performed each year.
As part of an effort to migrate to a modern electronic medical record (EMR) system in early
2016, the MGH Neurology Department adopted the latest American Clinical Neurophysiology
Society (ACNS) guidelines [11] for neurophysiologists reporting on the EEG data they have
reviewed. To this end, all clinical EEG reports are written in a semi-structured format with
clearly defined section labels, each followed by a free-text descriptive narrative.
There is a tremendous need to curate and mine these reports for clinical and research
purposes. However, two main challenges stand in the way.
The first challenge is that patients’ EEG reports and the files containing the waveform data
are stored separately and not clearly linked. MGH’s current portable EEG machines were
deployed in early 2000 and are not integrated with the EMR system. Thus, patient
identification information is manually entered and stored separately before a technician
initiates EEG recording. This manual data entry is error-prone and causes significant
patient mismatches (only about 50% match completely) between physicians’ reports
and their corresponding EEG recording files. Also, various timestamps captured in MGH’s
EMR system for EEG examination are either not available or are not accurate enough to
directly determine the correct linkage.
The second challenge is how to extract useful information from semi-structured free-text
EEG reports to create a curated, feature rich and easy-to-query structured data store that
describes clinically important neurophysiological events. Critical events are those which
either directly harm the brain, i.e. seizures [12], or which signal a high risk of seizures
or serve as indicators of brain injury, including epileptiform discharges [12,13] and
periodic or rhythmic patterns [14,15]. Supporting information, such as each epoch’s
start/end time and pertinent medication administrations, is also needed.
With these two challenges, it typically takes researchers weeks to months to manually
download raw data and files, review them, and retrieve the pertinent information in a one-
off fashion. Later researchers have to repeat this process without any possibility to benefit
from previous efforts.
The main contributions of this study are:
(1) a highly accurate fuzzy matching classifier, which matches EEG reports to their
corresponding EEG recordings by using not just a patient’s identification attributes but also
the context of the patient’s EEG visit date/time and EEG file creation date/time;
(2) text mining techniques that extract and curate useful neurophysiological information
from EEG reports with high fidelity, specifically, epoch date/time range, pertinent
medications used or not used, presence or absence of sporadic discharges [11], presence of
periodic or rhythmic patterns (that is, LPD, GPD, LRDA, GRDA, BIPD) [11], and seizure
sentiment detection;
(3) a comprehensive data pipeline to extract and store information from (1) and (2) as new
reports and recordings are created daily, eliminating the laborious one-off effort of weeks
to months previously required of each individual researcher or clinician.
2 EEG REPORTS TO RECORDINGS MATCHING
As stated previously, patients’ EEG reports and corresponding EEG waveform files are not
clearly linked, because patient identification information is entered into the electronic
medical record (EMR) separately from the database of EEG recording files, and data in
the EEG recording database is entered manually. In this section, we present a machine
learning model that performs entity matching for existing data as well as ongoing new data
to be created in the future. Unlike other entity matching [16] approaches, we use date/time
attributes from the EMR and the EEG recording file system to identify all potential matches
thus narrowing the search space, and then apply entity identification information from both
sides to complete the match. The reasons for this approach are:
- We are certain of the start date/time and duration of recordings from the vendor’s EEG
recording file system, and of the date range of a patient’s EEG examination period
from the MGH EMR system.
- We are certain of the patient’s identification information in EEG reports from the
EMR system, whereas the same information in the vendor’s EEG recording file
system is manually entered and therefore less reliable (hence only about 50% of
records fully match between the two systems).
2.1 Data Sources
The patient identification information and EEG report attributes from EMR system,
pertinent to our matching model, are as follows:
Attribute Group | Attribute Name | Description
Identification | MRN | Patient’s medical record number, a 7-digit string
Identification | Last Name | Patient’s last name
Identification | First Name | Patient’s first name (mostly formal not nickname, e.g. William instead of Bill)
Identification | Middle Name | Patient’s middle name (could be full middle name or just initial)
Identification | Date of Birth |
Identification | Gender |
Report | Report ID | EEG report ID
Date/Time | Admit DTS | Patient hospital admission date/time
Date/Time | Start Exam DTS | EEG exam start date/time, not always available or reliable
Date/Time | End Exam DTS | EEG exam end date/time, not always available or reliable
Date/Time | Discharge DTS | Patient hospital discharge date/time
Date/Time | Result DTS | EEG result date/time when report was completed, not always available or reliable
The patient identification information and EEG recording file attributes from the vendor’s
EEG recording file system, pertinent to our matching model, are as follows:
Attribute Group | Attribute Name | Description
Identification | MRN | Patient’s medical record number, a 7-digit string
Identification | Last Name | Patient’s last name
Identification | First Name | Patient’s first name (mostly formal not nickname, e.g. William instead of Bill)
Identification | Middle Name | Patient’s middle name (could be full middle name or just initial)
Identification | Date of Birth |
Identification | Gender |
File | File ID | A uniquely hashed value
Date/Time | File Created DTS | Start date/time of an EEG recording (when the file was created), always available and reliable
Date/Time | Duration | EEG recording duration, always available and reliable. Can be used to derive the end date/time of a recording.
The ground truth consists of (1) the 6 patient identification attributes from the EMR system
and (2) the two date/time stamps from the vendor’s EEG file system.
As sample data we used 3,114 long-term monitoring EEG files to build and train our
classifier model.
2.2 EEG Examination Date Range
To locate a subset of potential EEG recordings being matched to a report, we need to
create a time box based on the 5 date/time attributes of a report. In general, their
chronological order is Admit DTS, Exam Start DTS, Exam End DTS, Discharge DTS, and
Report DTS. The order of Discharge DTS and Report DTS could be reversed, depending
on patient’s length of stay and how quickly the neurophysiologist can complete a report.
Another challenge in creating this time box is that Exam Start DTS, Exam End DTS, and
Report DTS are not always available and even if they are, they are not always accurate.
However, one important practical observation can be of use. That is, an EEG examination
cycle rarely extends over two weeks. Therefore, we employed a reasonably large report
time box to ensure all potential EEG recordings’ start date/time fall inside this range:
- Lower Bound: min (Admit DTS, Exam Start DTS) – 1 day
- Upper Bound: max (Exam End DTS, Discharge DTS, Report DTS) + 14 days
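The time box above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the tuple layout for files, and the choice to ignore missing timestamps (rather than fail) are assumptions, and "Report DTS" is taken to be the report completion timestamp.

```python
from datetime import datetime, timedelta

def report_time_box(admit, exam_start, exam_end, discharge, report):
    """Return (lower, upper) bounds for candidate recordings, or None.

    Missing timestamps are passed as None and skipped (an assumption;
    the source only says they are not always available).
    """
    lower_candidates = [d for d in (admit, exam_start) if d is not None]
    upper_candidates = [d for d in (exam_end, discharge, report) if d is not None]
    if not lower_candidates or not upper_candidates:
        return None
    lower = min(lower_candidates) - timedelta(days=1)
    upper = max(upper_candidates) + timedelta(days=14)
    return lower, upper

def candidate_files(files, box):
    """Keep the IDs of files whose start date/time falls inside the box.

    `files` is a list of (file_id, file_created) pairs (hypothetical layout).
    """
    lower, upper = box
    return [file_id for file_id, created in files if lower <= created <= upper]
```

For example, a report with only an admission time of 2016-06-01 08:00 and a discharge time of 2016-06-05 10:00 yields a box from 2016-05-31 08:00 to 2016-06-19 10:00.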
The side effect of this extended time box is overmatching, namely, an EEG file often
matches to more than one report and vice versa. In a practical sense, this is not a critical
problem because the desirable outcome from this process is that reports and files in one
treatment/examination cycle are linked together. Furthermore, we will use epochs’
start/end date/time in the EEG reports to fully resolve this issue (see EEG Report
Information Extraction section below).
2.3 Feature Extraction
To extract features by comparing a patient’s corresponding identification attributes from
both systems, we use the following common algorithms, together with weight tuning, so that
the scores are representative of the likelihood of a match:
- For names and MRN, we use Levenshtein Distance [17] to calculate their spelling
similarity. Soundex [18] is also used to calculate their phonetic similarity. Because
middle name is sparsely populated, it is excluded from the feature set.
- For date of birth, the year, month, and day of the month are matched separately and
the matches are then summed.
- For gender (male, female, and unknown), exact match is used.
- To account for the case of last name and first name being entered in the wrong
field and compound names being used (e.g. full Spanish last names,
Rodriguez-Alvarez), a method of name containing and cross-containing is used.
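The two string similarity measures named above can be sketched as follows. This is a plain textbook version of Levenshtein distance and American Soundex written for illustration; the source does not say which library or variant was actually used, so the details here are assumptions.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}

def soundex(name):
    """American Soundex code: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return (letters[0] + "".join(digits) + "000")[:4]
```

A phonetic match is then simply `soundex(a) == soundex(b)`, which gives the binary score used in the feature table.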
Below is the summary of initial weighted score calculation by attribute:
Attribute | Scoring Algorithm | Raw Score Range | Transformation Method | Default Score if Null | Weight | Weighted Score Range
MRN | Levenshtein Distance | [0, 7] | quadratic | 3 | 3 | [0, 3, 12, …, 147]
Date of Birth | Sum of each exact match | [0, 3] | quadratic | 1 | 20 | [0, 20, 80, 180]
Gender | Exact match | [0, 1] | linear | 0.4 | 50 | [0, 20, 50]
Last Name | Levenshtein Distance | [0, 10] | quadratic | N/A | 1 | [0, 1, 4, …, 100]
Last Name | Soundex | [0, 1] | binary | N/A | 100 | [0, 100]
First Name | Levenshtein Distance | [0, 10] | linear | N/A | 70 | [0, 70]
First Name | Soundex | [0, 1] | binary | N/A | 30 | [0, 30]
Last & First Name | Containing and Cross Containing | [0, 1, 10, 11] | ordinal | N/A | See table below | See table below
Total | | | | | | [0, …, 677]
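As a worked example of the table, the date of birth and gender rows can be sketched as below. The function names and the representation of a date of birth as a (year, month, day) tuple are assumptions made for illustration; a null value is represented as None and receives the default raw score from the table.

```python
def quadratic(raw, weight):
    """Quadratic transformation used for the MRN, date of birth, and last name rows."""
    return weight * raw ** 2

def dob_score(report_dob, file_dob):
    """Date of birth: count matching parts (0 to 3), square, weight by 20."""
    if report_dob is None or file_dob is None:
        return quadratic(1, 20)  # default raw score of 1 when null
    matches = sum(a == b for a, b in zip(report_dob, file_dob))
    return quadratic(matches, 20)

def gender_score(report_gender, file_gender):
    """Gender: exact match, linear, weight 50, default raw score 0.4 when null."""
    if report_gender is None or file_gender is None:
        return 0.4 * 50
    return 50 if report_gender == file_gender else 0
```

Three matching date parts give 20 x 3^2 = 180, two give 80, one gives 20, and none gives 0, which reproduces the weighted score range [0, 20, 80, 180] in the table.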
In addition to the individual attribute scores, the likelihood of a match is affected,
positively or negatively, by combinations of attributes. Here is the summary of initial score
adjustments for name containing and cross-containing, and for attribute compounding effects:
Initial Score Adjustment | Raw Score Range | Equivalence | Effect on Score
Last name and first name containing and cross-containing | [0] | No match | None
Last name and first name containing and cross-containing | [1] | First Name Match | Replace first name initial score with [100]
Last name and first name containing and cross-containing | [10] | Last Name Match | Replace last name initial score with [200]
Last name and first name containing and cross-containing | [11] | Last and First Match | Replace last & first name initial score with [300]
Birth date and names fully match | MRN = [0, 7] | N/A | Add [0, 85]
Birth date and names fully match | [1] | N/A | Add [100]
Birth date match but last name Soundex doesn’t match | [1] | N/A | Subtract [100]
Adjusted Total | | | [0, 787]
The adjusted score of pairwise matching of a report to a file is a strong indication of
probability of a match. However, it’s not used as a threshold to determine whether it’s a
match or not because it could change over time. Instead, the pair with the highest adjusted
score for a report is included as the possible match and those initial weighted scores for
the pair are used as its features for model training and prediction.
2.4 Model
We selected a sample data set of 3,114 EEG files from June 2016 to
December 2016 for model training. The score distribution of the pairs with the highest
score among potential EEG files for a report is shown below. For this data set, a total
adjusted score of around 370 appears to separate matches from non-matches well.
To avoid class imbalance and ensure a balanced distribution of the data across
all scoring ranges, we use the following stratified sampling rates: 100-599: 100%,
600-699: 20%, >700: 1%. As a result, we have a well-balanced sample set of 647 positives
and 629 negatives for model training.
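The stratification can be sketched as a score-dependent sampling rate. This is an illustration only: the source does not state how a score of exactly 700 or a score below 100 is handled, so treating 700 as part of the top band and excluding scores below 100 are assumptions, as are the function names and the `adjusted_score` field.

```python
import random

def sampling_rate(adjusted_score):
    """Fraction of report/file pairs kept in each adjusted-score band."""
    if adjusted_score >= 700:   # assumption: 700 itself falls in the top band
        return 0.01
    if adjusted_score >= 600:
        return 0.20
    if adjusted_score >= 100:
        return 1.00
    return 0.0                  # assumption: scores below 100 are not sampled

def stratified_sample(pairs, seed=0):
    """Keep each pair with the probability assigned to its score band."""
    rng = random.Random(seed)
    return [p for p in pairs if rng.random() < sampling_rate(p["adjusted_score"])]
```

Down-sampling the high-score bands, which are dominated by true matches, is what brings the positives and negatives close to parity.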
The classifiers we have trained are logistic regression, decision tree, gradient boosted
tree, and random forest [19], with a 70-30 split of training vs. hold-out, 3-fold cross
validation, and 10-18 sets of hyper-parameter tuning.
2.5 Results
All models perform well, and random forest (20 trees, maximum depth of 6,
maximum bins of 16) has a slight edge with area under ROC = 0.9947. We get area under
ROC = 0.997763 for the entire data set without stratification, which is expected because
the remaining data all have very high weighted scores (> 600) and are matches.
The feature importances are: MRN: 0.0136, date of birth: 0.0182, gender:
0.0043, last name Levenshtein distance: 0.4706, first name Levenshtein distance: 0.1627,
last name Soundex: 0.1641, first name Soundex: 0.1528, name containing & cross
containing: 0.0138. As expected, the weighted scores of last name and first name
dominate, especially spelling similarity, whereas MRN is less important than
expected.
The performance of the model in predicting pairwise matches for the data set from April 2016
to September 2017 for all three examination types (long-term EEG monitoring, routine EEG,
and epilepsy monitoring) is consistent with that above, except for patients from the neonatal
intensive care unit (NICU), where some newborn babies have not been named yet and
generic names like “baby girl” or “baby boy” have been entered. Thus, we need to tweak the
name scoring features to accommodate these cases.
Here are the confusion matrix and performance of matching for long-term monitoring EEG
recording files (including NICU cases) from April 2016 to October 2017. Rows are
predictions and columns are actual classes, consistent with the cell labels and sums:
 | Actual: Positive | Actual: Negative | Sum
Predict: Positive | TP = 21,919 | FP = 0 | 21,919
Predict: Negative | FN = 471 | TN = 516 | 987
Sum | 22,390 | 516 | n = 22,906
Recall = 0.9790, precision = 1.00, and accuracy = 0.9794.
Here are the confusion matrix and performance of matching for routine and epilepsy
monitoring EEG recording files from April 2016 to October 2017, with the same layout:
 | Actual: Positive | Actual: Negative | Sum
Predict: Positive | TP = 9,121 | FP = 0 | 9,121
Predict: Negative | FN = 34 | TN = 440 | 474
Sum | 9,155 | 440 | n = 9,595
Recall = 0.9963, precision = 1.00, and accuracy = 0.9964.
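The reported figures follow from the standard definitions of recall, precision, and accuracy applied to the four cell counts. The short sketch below recomputes them; the function name is an illustrative choice.

```python
def matching_metrics(tp, fp, fn, tn):
    """Recall, precision, and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return recall, precision, accuracy

# Long-term monitoring files (including NICU cases)
ltm = matching_metrics(tp=21919, fp=0, fn=471, tn=516)

# Routine and epilepsy monitoring files
routine = matching_metrics(tp=9121, fp=0, fn=34, tn=440)
```

For the long-term monitoring set this gives recall 21,919 / 22,390 ≈ 0.9790 and accuracy 22,435 / 22,906 ≈ 0.9794. For the routine and epilepsy monitoring set it gives recall 9,121 / 9,155 ≈ 0.9963 and accuracy 9,561 / 9,595 ≈ 0.99646, which the text reports as 0.9964.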
In the next phase of the study, we plan to use the same framework to train a classifier and
perform predictions for pairwise matching of EEG reports to files prior to MGH’s cutover to
the new EMR system, that is, from 2002 to March 2016. The reason for a separate model
is that the quality of data collection in that period is significantly worse than in the
period covered by the current model.
3 EEG REPORT INFORMATION EXTRACTION
3.1 Structure of an EEG Report
In early 2016, MGH’s neurology department standardized the format and terminology used to
describe patients’ EEG recordings by adopting the American Clinical Neurophysiology
Society’s (ACNS) guidelines [11] in a semi-structured fashion. The minimal time epochs to be
documented separately range from the first 30 minutes (equivalent to a routine EEG) to a
24-hour period, or are delimited by a significant change in the recording during this period.
Thus, a report could consist of one or more epochs, and each epoch consists of the following
sections, with a free-text narrative after each section heading:
- EPOCH: start date/time and end date/time of an epoch
- MEDICATIONS: pertinent medications and sometimes their administration
- BACKGROUND [11]
- SPORADIC DISCHARGES [11]
- PERIODIC OR RHYTHMIC PATTERNS [11]
- SEIZURES: narrative regarding presence or absence of seizures, and type of
seizures if present
- EVENTS: any event that occurred, such as a change of medications or a change of state
from asleep to awake
In this study, the types of important information to be extracted from reports are:
- Date and Time extraction from EPOCH section
- Inter-epoch and section reference resolution from all sections except EPOCH, with
pointer to the other epoch or section for pertinent information
- Auto-tagging in the narrative, for example, the presence of LPD in PERIODIC OR
RHYTHMIC PATTERNS section, or the presence of epileptiform discharges in
SPORADIC DISCHARGES section
- Seizure sentiment (positive, negative, or neutral) described in SEIZURES section
Each type of information extraction requires a different method; the methods are
described below.
Here is a sample report file with two epochs:
Spelling correction has not been applied in this work for the following reasons:
- The text in these reports is concise, domain specific, and acronym-filled; it is not
necessarily grammatically correct and may not have proper sentence breaks.
- Spelling correction would be useful only if a domain-specific custom dictionary were
added, which would take a lot of time and effort with uncertain benefit.
- We can use word stems in text search to overcome issues of singular vs. plural,
minor spelling differences among regions (American English vs. British English),
and minor typos.
- Most of our text mining work is keyword-driven, and the keywords tend to be spelled
correctly, with some common shorthand among professionals and a limited number of typos.
3.2 Date and Time Extraction
The marker of an epoch starts with an EPOCH section which contains the start and end
date/time. They are critical information to locate the corresponding EEG recordings. The
free text entered in this section has the following characteristics:
- Date format: as MGH is in the U.S., in a majority of the cases, the date format is
month-first with three variations: MM/DD/YYYY, MM/DD/YY (no century), MM/DD
(no year). In the case of missing year or date entirely, it can be derived from
patient’s visit date with some accuracy
- Time format: the time format is much less consistent than the date format, with
several variations: h24:mm (military time format), hh:mm AM/PM, hhmm (no
delimiter between hour and minute), hmm (single digit hour)
- Delimiter between start date/time and end date/time: it could be clearly stated (for
example, to, until, “-”, etc.) or implicitly inferred (spaces). The implicit ones could
cause significant problems when the date is not provided.
- Ill-formatted and/or incomplete date/time text strings: there is a significant number of
bad data entries. Most can be inferred through several passes of regular
expression matching, but a few are just too ambiguous. Rather than guessing
incorrectly, it is better to leave them unresolved for users of the data to decide.
Here are some samples of EPOCH section with various formats of specifying start & end
date/time.
Here are the processing steps:
- Step 1: Break the EPOCH text string into two parts, start date/time and end date/time:
o if a common delimiter (to, until, etc.) is present, split on it
o otherwise, identify potential date strings and use the second date string as
the delimiter. This only works with date/time strings that start with the date, like
12/1/2016 12:20 am (most cases). It won’t work for strings that start with the time,
like 12:20 am 12/1/2016.
- Step 2: Normalize the time text string to hh:mm(:ss) using regular expressions
- Step 3: Derive the default start date and end date to be used as the default dates for
the Python DateUtil package:
o Default start date: max(admit date, EEG exam begin date)
o Default end date: min(EEG exam end date, discharge date)
- Step 4: Use the Python DateUtil parser with the default dates derived above on the start
and end date/time text strings
- Step 5: Use the Python DateFinder parser on cases where the DateUtil parser fails.
DateFinder is very good at finding dates and times embedded in long text strings.
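Steps 1, 2, and 4 can be sketched with the standard library alone. This is not the authors' code: it uses a small hand-written parser as a stand-in for the DateUtil parser so that the example is self-contained, and the regular expressions, function names, and the rule that two-digit years belong to the 2000s are all assumptions.

```python
import re
from datetime import datetime

DELIM = re.compile(r"\s+(?:to|until)\s+|\s+-\s+", re.IGNORECASE)
DATE = re.compile(r"(\d{1,2})/(\d{1,2})(?:/(\d{2,4}))?")
TIME = re.compile(r"\b(\d{1,2}):?(\d{2})\s*([ap]m)?\b", re.IGNORECASE)

def split_epoch(text):
    """Step 1: split into (start, end) on a delimiter, else at the second date."""
    parts = DELIM.split(text.strip(), maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    dates = list(DATE.finditer(text))
    if len(dates) >= 2:
        cut = dates[1].start()
        return text[:cut].strip(), text[cut:].strip()
    return None  # left unresolved rather than guessed

def parse_part(part, default):
    """Steps 2 and 4: read date and time, filling missing date fields from default."""
    year, month, day = default.year, default.month, default.day
    found = DATE.search(part)
    if found:
        month, day = int(found.group(1)), int(found.group(2))
        if found.group(3):
            year = int(found.group(3))
            if year < 100:
                year += 2000  # assumption: two-digit years are in the 2000s
        part = part[:found.start()] + " " + part[found.end():]
    clock = TIME.search(part)  # accepts hh:mm, hhmm, hmm, with optional am/pm
    if not clock:
        return None
    hour, minute = int(clock.group(1)), int(clock.group(2))
    meridiem = (clock.group(3) or "").lower()
    if meridiem == "pm" and hour < 12:
        hour += 12
    elif meridiem == "am" and hour == 12:
        hour = 0
    try:
        return datetime(year, month, day, hour, minute)
    except ValueError:
        return None  # out-of-range values stay unresolved
```

In use, the `default` argument would carry the default start or end date from Step 3, so that an entry such as "1430" with no date still resolves to a full timestamp.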
Here are some ill-formed and/or ambiguous start & end date/time string along with default
start date (LowerDTS):
With the processing stated above, out of 10,612 EPOCHs, there are 36 ill-formatted and/or
ambiguous date/time strings which can’t be resolved definitively. Thus, we have achieved
a 99.66% success rate. With the epochs’ start/end date/time mined here, we are able to
narrow down the group of candidate matching files obtained in the previous
section to exactly the list of EEG files narrated in a report.
3.3 Medications
The MEDICATIONS section text is mostly well formed, with clear delimiters separating
each medication (sometimes dosage and administration are included). This allows us to
parse the medications out easily and store them in separate rows in a table for querying.
Note that negation is not weeded out because sometimes a confirmed omission is as important
as confirmed usage.
Here are the simple processing steps for medications:
- Step 1: Use “,”, “.”, “and” as token separators to parse the medication text string
- Step 2: Dosage and administration information are kept in each token
- Step 3: Do not weed out “none”, “no”, “nil”, etc. as they won’t match anything for
medication match
- Step 4: Explode the token list into one record per token
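The four steps can be sketched as a split followed by an explode. This is an illustrative sketch: the record field names are hypothetical, the sample medication strings in the usage note are made up, and not splitting on a period that is followed by a digit (to keep decimal dosages such as 0.5 mg together) is an added assumption that the source does not state.

```python
import re

# "," and the word "and" always separate tokens; "." separates tokens
# unless it is followed by a digit (assumed guard for decimal dosages).
SEPARATOR = re.compile(r",|\band\b|\.(?!\d)", re.IGNORECASE)

def medication_tokens(text):
    """Steps 1-3: split into tokens, keeping dosage text and negations as-is."""
    return [tok.strip() for tok in SEPARATOR.split(text) if tok.strip()]

def explode(report_id, epoch_seq, text):
    """Step 4: one record per medication token."""
    return [
        {"report_id": report_id, "epoch_seq": epoch_seq, "medication": tok}
        for tok in medication_tokens(text)
    ]
```

For example, `medication_tokens("Keppra 500 mg BID, propofol and fentanyl.")` yields three tokens, and an entry such as "None." is kept as the single token "None".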
Here are some examples of the processing described from Step 1 to Step 3:
Here are some examples of the processing described in Step 4:
3.4 Inter-Epoch and Inter-Section Reference Resolution
In the case of long-term EEG monitoring, which tends to run for multiple days, there might
not be significant changes observed from one epoch (< 24 hours) to the next, and
neurophysiologists would write the narrative once and then refer to that epoch in the
subsequent epoch(s), rather than copy and paste the same narrative. This is called
inter-epoch referencing. There is no new information to be mined, other than to propagate
the same information from the referred epoch section. There are also cases of chain
referencing.
Instead of documenting their narrative in the corresponding sections, some
neurophysiologists tend to write a complete narrative in one section and then refer to it in
other sections (e.g. see above, see below, see background section, etc.). This is called
inter-section referencing. In contrast to inter-epoch referencing, there is new information
to be extracted in the right context. For example, suppose LPD is mentioned only in the
SEIZURES section. Because we do not look for LPD in the context of seizures, we would miss
the proper tagging of LPD in the context of PERIODIC OR RHYTHMIC PATTERNS unless the
reference is resolved.
Here is a report having 5 epochs, with inter-epoch referencing for background and seizure,
and inter-section referencing in sporadic discharge to periodic or rhythmic patterns. Note
that the epoch sequence as stored and processed from a report is in reverse chronological
order, that is, the epoch with the highest sequence number is the earliest epoch in terms
of date/time.
The identification of a referencing text follows three rules:
1. short in length, under 50 characters
2. contains a referencing term such as “see”, “as”, “same”, “similar”
3. contains a directional term or specific location:
o Forward direction: “above”, “prior”, “before”, “previous”
o Backward direction: “below”, “following”, “after”, “subsequent”
o Specific location: section name
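The three rules above translate directly into a small classifier. The sketch below is illustrative: the function name, the return format, the list of section names, and the choice to check for a section name before a directional term are assumptions.

```python
import re

REFERENCE_TERMS = {"see", "as", "same", "similar"}
FORWARD_TERMS = {"above", "prior", "before", "previous"}
BACKWARD_TERMS = {"below", "following", "after", "subsequent"}
SECTION_NAMES = ["background", "sporadic discharges",
                 "periodic or rhythmic patterns", "seizures", "events"]

def classify_reference(text, max_len=50):
    """Return (kind, target) if text is a reference, otherwise None."""
    if len(text) >= max_len:                 # rule 1: short in length
        return None
    lowered = text.lower()
    words = set(re.findall(r"[a-z]+", lowered))
    if not words & REFERENCE_TERMS:          # rule 2: a referencing term
        return None
    for name in SECTION_NAMES:               # rule 3: specific location
        if name in lowered:
            return ("section", name)
    if words & FORWARD_TERMS:                # rule 3: directional terms
        return ("forward", None)
    if words & BACKWARD_TERMS:
        return ("backward", None)
    return None
```

A text such as "See above." is classified as a forward reference, "see background section" as a reference to a specific section, and an ordinary narrative sentence returns None.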
With these simple rules, we are able to achieve 100% accuracy in finding inter-epoch and
inter-section references. Thus, we are able to propagate extracted information across epochs
or pull the right text to be extracted in the right context between two sections, for
auto-tagging or seizure sentiment analysis (see below).
3.5 Auto-Tagging Based on ACNS Guidelines
ACNS guidelines [1] are quite extensive. Our initial focus (the most useful for researchers)
is on auto-tagging main terms in PERIODIC OR RHYTHMIC PATTERNS:
- Lateralized periodic discharges (LPDs) a.k.a. Periodic lateralized epileptiform
discharges (PLEDs)
- Generalized periodic discharges (GPDs) a.k.a. Generalized periodic epileptiform
discharges (GPEDs)
- Lateralized rhythmic delta activity (LRDA)
- Generalized rhythmic delta activity (GRDA)
- Bilateral independent periodic discharges (BIPDs) a.k.a. Bilateral independent
periodic lateralized epileptiform discharges (BIPLEDs)
and the presence or absence of sporadic discharges.
Negation has not been applied to auto-tagging because neurophysiologists tend not to
mention these patterns if they are not present, although there are a small, insignificant
number of exceptions. Besides, false positives are more tolerable than false negatives in
our case. By matching the words of a compound term (e.g. lateralized periodic discharges)
without regard to their sequence, we are able to achieve reasonably good auto-tagging results.
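Order-insensitive matching of compound terms can be sketched as follows. This is an illustration rather than the authors' implementation: the acronym lists, the word stems chosen for each term, and the use of prefix matching to approximate stemming are assumptions.

```python
import re

# tag -> (acronyms matched as whole words, stems that must all appear in any order)
TAG_RULES = {
    "LPD": ({"lpd", "lpds", "pled", "pleds"}, ("lateralized", "periodic", "discharge")),
    "GPD": ({"gpd", "gpds", "gped", "gpeds"}, ("generalized", "periodic", "discharge")),
    "LRDA": ({"lrda"}, ("lateralized", "rhythmic", "delta")),
    "GRDA": ({"grda"}, ("generalized", "rhythmic", "delta")),
    "BIPD": ({"bipd", "bipds", "bipled", "bipleds"}, ("bilateral", "independent", "periodic")),
}

def auto_tag(text):
    """Return the tags whose acronym or full set of word stems occurs in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    token_set = set(tokens)
    tags = []
    for tag, (acronyms, stems) in TAG_RULES.items():
        has_acronym = bool(acronyms & token_set)
        has_all_stems = all(any(t.startswith(s) for t in tokens) for s in stems)
        if has_acronym or has_all_stems:
            tags.append(tag)
    return tags
```

Because word order is ignored, a narrative that happens to contain all the stems of a term in unrelated phrases is also tagged, which is the source of the false positives discussed in this section.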
Out of 4,889 unique PERIODIC OR RHYTHMIC PATTERNS sections (including inter-
section referencing), we are able to tag:
Main Term | Number of Texts Tagged | Percent of Texts Tagged
LPD | 1,176 | 24%
GPD | 1,389 | 28%
LRDA | 382 | 8%
GRDA | 975 | 20%
BIPD | 213 | 4%
Out of 5,004 unique SPORADIC DISCHARGES sections (including inter-section
referencing), we are able to tag:
Main Term | Number of Texts Tagged | Percent of Texts Tagged
epileptiform | 629 | 13%
triphasic | 307 | 6%
sharp/spike wave | 1,736 | 35%
polyspike | 100 | 2%
These results are in line with researchers’ expectations. To improve accuracy and reduce
false positives, in the next phase we plan to use an n-gram approach that matches
the sequence of words in a compound term, and to extend the auto-tagging to cover more
terminology in the ACNS guidelines [1].
3.6 Seizure Sentiment Analysis
One of the most important pieces of information that can be extracted from EEG reports is
the sentiment of seizures. Taking advantage of a more standardized semi-structured
format for reports, we have employed a less deterministic but more straightforward
language model, with very impressive results.
Our definition of seizure sentiment is:
- Positive if the text indicates seizure(s)
- Negative if the text indicates no seizure
- Neutral if there is no clear indication one way or the other. Eventually, this is
considered negative.
Seizure types that we want to detect in this study are electrographic seizures,
electroclinical seizure, status epilepticus, epilepsia partialis continua, and absence
seizures. The essence of our sentiment model is to find seizure keywords, and then locate
enumerating keywords and negation keywords, to determine their applicability to seizure
keywords based on their proximity.
After reviewing several hundred sample reports, we have come up with the following
keyword set to be used in our sentiment model. We do not attempt to build an exhaustive
list from the English dictionary. However, in the domain of seizure sentiment
analysis, we find the set quite adequate, as shown in the results later.
- Seizure keywords (including singular/plural, misspelling, and shorthand):
o seizure, seizures, siezures , sz, szs
o epilepticus, epileptics
o epilepsia
- Negation keywords: no, not, none, nothing, lack, unlikely
- Enumeration keywords:
o Literal enumeration: one, two, three, …, ten (people tend to use numbers if
it’s more than ten in a sentence)
o Numeric or numeric range enumeration: for example, 5 or 5-6 seizures
o Implicit enumeration: several, multiple, numerous, handful, few, lot, many,
any, considerable, single, number, these
Out of 10,612 epochs in our sample report set, there are 10,440 epochs with seizures text.
As with inter-epoch and inter-section referencing, we can infer negative seizure
sentiment easily, with 100% accuracy, when the entire text is very short (that is, <= 50
characters). Below is the breakdown of our sample seizure text and
its treatment:
Type of Seizure Text | Count | Treatment
Inter-epoch referencing to previous epoch | 20 | Propagate sentiment from the referred epoch
Short text (<= 50 characters) clearly indicating no seizure (e.g., none, no/not/without seizure, seizure free) | 8,776 | Simple sentiment determination
Inter-section referencing to another section (above/below or specific) in the same epoch | 408 | Copy text from the referred section and perform full sentiment analysis
Full self-contained text | 1,236 | Full sentiment analysis
The seizure text from the last two rows (1,644 = 408 + 1,236) is the focus of our more
comprehensive sentiment analysis.
With the keyword set defined and the sample text set identified, our model first works on
sentences individually. Once a seizure keyword is found in a sentence, the sentence is
further broken down into n-grams based on proximity parameters (the number of preceding and
following words), using the seizure keywords as anchors. Within those proximity boundaries,
we look for enumeration keywords and negation keywords to determine whether they apply to
the seizure keyword(s). Based on the sentiment of each sentence in the text, we derive the
overall seizure sentiment for an epoch.
Here are the detailed processing steps:
• Step 1: Break the description into sentences using NLTK, preserving their sequence,
and remove most punctuation except the dashes used in numeric range enumeration
• Step 2: Tokenize each sentence (no stop word removal)
• Step 3: Determine the sentiment of each sentence as follows:
  o Positive if it contains at least one seizure term and no negation term
  o Negative if it contains at least one seizure term and at least one negation term
  o Neutral if it contains no seizure term
• Step 4: Use the proximity of enumeration terms (4 preceding and 1 following words) to
any seizure term to determine whether the sentence enumerates seizures:
  o If true, change its sentiment to positive if it is negative from Step 3
  o Otherwise, leave the sentiment as is from Step 3
• Step 5: Use the proximity of negation terms (4 preceding and 2 following words) to any
seizure term to determine whether they apply to seizures:
  o If false, change the sentiment to positive if it is negative from Steps 3 and 4
  o Otherwise, leave the sentiment as is from Steps 3 and 4
• Step 6: Determine the overall seizure sentiment for an epoch as follows:
  o Positive if any sentence has at least one instance of enumeration of
    seizures (from Step 4)
  o Otherwise, further processing as follows:
    - Positive if the first non-neutral sentence has positive sentiment
    - Negative if the first non-neutral sentence has negative sentiment
    - Neutral, otherwise
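The six steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. A simple regular-expression splitter stands in for the NLTK sentence tokenizer so that the sketch runs without downloaded models, and all function names are hypothetical. Steps 4 and 5 are implemented literally as written.

```python
import re

SEIZURE = {"seizure", "seizures", "siezures", "sz", "szs",
           "epilepticus", "epileptics", "epilepsia"}
NEGATION = {"no", "not", "none", "nothing", "lack", "unlikely"}
ENUM_WORDS = {"one", "two", "three", "four", "five", "six", "seven",
              "eight", "nine", "ten", "several", "multiple", "numerous",
              "handful", "few", "lot", "many", "any", "considerable",
              "single", "number", "these"}
NUMERIC = re.compile(r"^\d+(-\d+)?$")


def split_sentences(text):
    # Step 1 stand-in (assumption): split after ., ! or ? instead of NLTK.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Steps 1-2: lowercase, drop punctuation, keep dashes only inside
    # numeric ranges such as "5-6"; stop words are not removed.
    return re.findall(r"\d+-\d+|[a-z0-9]+", sentence.lower())


def is_enum(token):
    return token in ENUM_WORDS or NUMERIC.match(token) is not None


def near_seizure(tokens, is_target, before, after):
    # True if a target token lies within `before` words preceding or
    # `after` words following any seizure term.
    for i, tok in enumerate(tokens):
        if tok not in SEIZURE:
            continue
        lo, hi = max(0, i - before), min(len(tokens), i + after + 1)
        if any(is_target(tokens[j]) for j in range(lo, hi) if j != i):
            return True
    return False


def sentence_sentiment(sentence):
    """Return (sentiment, enumerated) for one sentence."""
    tokens = tokenize(sentence)
    if not any(t in SEIZURE for t in tokens):
        return "neutral", False                              # Step 3
    negated = any(t in NEGATION for t in tokens)
    sentiment = "negative" if negated else "positive"        # Step 3
    enumerated = near_seizure(tokens, is_enum, 4, 1)
    if enumerated:
        sentiment = "positive"                               # Step 4
    if sentiment == "negative" and not near_seizure(
            tokens, lambda t: t in NEGATION, 4, 2):
        sentiment = "positive"                               # Step 5
    return sentiment, enumerated


def epoch_sentiment(text):
    """Step 6: derive the overall seizure sentiment for an epoch."""
    results = [sentence_sentiment(s) for s in split_sentences(text)]
    if any(enumerated for _, enumerated in results):
        return "positive"
    for sentiment, _ in results:
        if sentiment != "neutral":
            return sentiment      # first non-neutral sentence decides
    return "neutral"
```

For example, `epoch_sentiment("No seizures.")` returns `"negative"`, and `epoch_sentiment("Patient had 5-6 seizures overnight.")` returns `"positive"`. A sentence whose negation lies more than four words before the seizure term, such as "The patient did not respond to stimulation during the recorded electrographic seizure.", is also classified as positive by Step 5.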
Sample texts with positive sentiment are shown below. The notation in the column
“Sentence & Paragraph Sentiment” is [S1, S2(E), …, Sn] => P, where S1 to Sn are the
sentence sentiments, (E) marks a sentence with an applicable enumeration keyword, and P
is the sentiment for the entire paragraph.
Below are some sample texts with negative sentiment. Note that the last two cases are
misclassified because the negation keywords fall outside the proximity zone of the seizure
keywords. However, if we widened the proximity zone, we would see many more false
negatives, which are less desirable.
Here are the confusion matrix and performance of the seizure sentiment analysis on the
set of 1,644 longer texts:
Predicted vs. actual | Actual: Positive | Actual: Negative/Neutral | Sum
Predict: Positive | TP = 911 | FP = 2 | 913
Predict: Negative/Neutral | FN = 0 | TN = 731 | 731
Sum | 911 | 733 | n = 1,644
Recall = 1.00, precision = 0.9978, and accuracy = 0.9988.
Here are the confusion matrix and performance of the combined simple and full sentiment
analysis on all 10,420 seizure texts:
Predicted vs. actual | Actual: Positive | Actual: Negative/Neutral | Sum
Predict: Positive | TP = 911 | FP = 2 | 913
Predict: Negative/Neutral | FN = 0 | TN = 9,507 | 9,507
Sum | 911 | 9,509 | n = 10,420
Recall = 1.00, precision = 0.9978, and accuracy = 0.9998.
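The reported figures follow directly from the four cell counts. The short sketch below recomputes them for both the 1,644-text set and the combined set, using the standard definitions of recall, precision, and accuracy. The function name is an assumption for illustration.

```python
def metrics(tp, fp, fn, tn):
    """Recall, precision, and accuracy from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),        # share of actual positives found
        "precision": tp / (tp + fp),     # share of predicted positives correct
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts taken from the two confusion matrices above.
full = metrics(tp=911, fp=2, fn=0, tn=731)
combined = metrics(tp=911, fp=2, fn=0, tn=9507)

for name, result in (("full", full), ("combined", combined)):
    print(name, {k: round(v, 4) for k, v in result.items()})
```

Both sets share the same recall and precision because the short texts add only true negatives, which raise accuracy alone.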
Given its simplicity, this sentiment model performs very well on our sample set. We are
also able to retrieve the specified seizure type for epochs with positive sentiment. The
breakdown is listed below:
Seizure Type | Count | Percentage
absence seizure | 5 | 0.5%
electroclinical seizure | 118 | 12.9%
electrographic seizure | 339 | 37.1%
epilepsia partialis continua | 4 | 0.4%
status epilepticus | 69 | 7.6%
not specified | 378 | 41.4%
In follow-up work, we plan to run this sentiment analysis on new seizure text from
reports in the same semi-structured format, as well as on older seizure text dating back
to 2002, which is much more unstructured, to determine the model’s real-world performance
and potential improvements.
We also tried a variation of the model that removes non-negation stop words and
shortens the number of preceding words from 4 to 3. Stop word removal is a common
normalization technique in text analysis. However, this variation did not perform as well
as the model presented here. The likely reason is that writers naturally place
enumeration or negation keywords close to a seizure keyword, with that distance counted
in words that include stop words. Removing stop words distorts this natural spacing and
yields poorer results.
4 CONCLUSIONS
We have described a highly accurate classifier to match an EEG report to its
corresponding EEG recording files. Our classifier performs this matching using the context
of visit date/time and patient identification attributes in the EMR system, combined with
information from a separate vendor EEG recording file system. We have also described
several text analysis techniques that very accurately extract various types of clinically
important information from EEG reports. With these two foundations, we have established
a highly effective and efficient data pipeline for clinical operations, quality improvement,
and neurophysiological research.
5 REFERENCES
1. Westover, M. B. et al. The probability of seizures during EEG monitoring in critically
ill adults. Clin. Neurophysiol. Off. J. Int. Fed. Clin. Neurophysiol. 126, 463–471 (2015).
2. Jirsch, J. & Hirsch, L. J. Nonconvulsive seizures: developing a rational approach to
the diagnosis and management in the critically ill population. Clin. Neurophysiol. Off. J. Int.
Fed. Clin. Neurophysiol. 118, 1660–1670 (2007).
3. Alvarez, V. et al. The use and yield of continuous EEG in critically ill patients: A
comparative study of three centers. Clin. Neurophysiol. Off. J. Int. Fed. Clin. Neurophysiol.
128, 570–578 (2017).
4. Pandian, J. D., Cascino, G. D., So, E. L., Manno, E. & Fulgham, J. R. Digital video-
electroencephalographic monitoring in the neurological-neurosurgical intensive care unit:
clinical features and outcome. Arch. Neurol. 61, 1090–1094 (2004).
5. Vespa, P. M. et al. Acute seizures after intracerebral hemorrhage: a factor in
progressive midline shift and outcome. Neurology 60, 1441–1446 (2003).
6. Claassen, J., Mayer, S. A., Kowalski, R. G., Emerson, R. G. & Hirsch, L. J.
Detection of electrographic seizures with continuous EEG monitoring in critically ill
patients. Neurology 62, 1743–1748 (2004).
7. Dennis, L. J. et al. Nonconvulsive status epilepticus after subarachnoid
hemorrhage. Neurosurgery 51, 1136–1143; discussion 1144 (2002).
8. DeLorenzo, R. J. et al. Persistent nonconvulsive status epilepticus after the control
of convulsive status epilepticus. Epilepsia 39, 833–840 (1998).
9. Treiman, D. M. et al. A comparison of four treatments for generalized convulsive
status epilepticus. Veterans Affairs Status Epilepticus Cooperative Study Group. N. Engl.
J. Med. 339, 792–798 (1998).
10. Towne, A. R. et al. Prevalence of nonconvulsive status epilepticus in comatose
patients. Neurology 54, 340–345 (2000).
11. L. J. Hirsch, S. M. LaRoche, et al. American Clinical Neurophysiology Society’s
Standardized Critical Care EEG Terminology: 2012 version
12. Siddharth Biswal, Zarina Nip, Valdery Moura Junior, Matt T. Bianchi, Eric S
Rosenthal, and M Brandon Westover, MD PhD. Automated Information Extraction from
Free-Text EEG Reports. Conf Proc IEEE Eng Med Biol Soc. 2015; 2015: 6804–6807. doi:
10.1109/EMBC.2015.7319956
13. Thomas J, Jing Jin, Dauwels J, Cash SS, Westover MB. Automated epileptiform
spike detection via affinity propagation-based template matching. Conf Proc IEEE Eng
Med Biol Soc. 2017 Jul;2017:3057-3060. doi: 10.1109/EMBC.2017.8037502.
2018 Dell EMC Proven Professional Knowledge Sharing 22
14. Ruiz AR, Vlachy J, Lee JW, et al. Association of periodic and rhythmic
electroencephalographic patterns with seizures in critically ill patients [published online
December 19, 2016]. JAMA Neurol. doi:10.1001/jamaneurol.2016.4990
15. Struck AF, Ustun B, Ruiz AR, Lee JW, et al. Association of an
Electroencephalography-Based Risk Score With Seizure Probability in Hospitalized
Patients. JAMA Neurol. 2017 Oct 9. doi: 10.1001/jamaneurol.2017.2459.
16. H. Köpcke, E. Rahm, Frameworks for entity matching: A comparison, Data Knowl.
Eng. (2009), doi:10.1016/j.datak.2009.10.003
17. Levenshtein Distance: https://en.wikipedia.org/wiki/Levenshtein_distance
18. Soundex: https://en.wikipedia.org/wiki/Soundex
19. Apache Spark Machine Learning APIs: http://spark.apache.org/docs/latest/ml-
guide.html