
Yosuke Tanigawa (ytanigaw@stanford.edu), Stephen Pfohl (spfohl@stanford.edu)
Biomedical Informatics Ph.D. program, School of Medicine, Stanford University

Abstract

References

Future Direction

Models and Results

Computational prediction of clinical outcome of sepsis from critical care database

1. D C Angus, W T Linde-Zwirble, J Lidicker, G Clermont, J Carcillo, and M R Pinsky. Epidemiology of severe sepsis in the United States: analysis of incidence, outcome, and associated costs of care. Critical Care Medicine, 29(7):1303-1310, 2001.

2. Alistair E W Johnson, Tom J Pollard, Lu Shen, Li-Wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

3. Robert Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

The classification algorithm would likely improve with further feature engineering that better represents temporality, and with grouping of similar features through a mapping onto an ontological knowledge graph such as the UMLS Metathesaurus. However, it is likely more worthwhile to redefine the model objectives so that the risk of a septic event and of mortality can be predicted in real time.

The data of interest are contained in the MIMIC-III database [2], an electronic health record database curated by MIT that houses de-identified demographics, vital signs, lab test results, procedures, medications, notes, imaging reports, and outcomes of 58,000 hospital admissions between 2001 and 2012 for 38,645 adults and 7,875 neonates at the Beth Israel Deaconess Medical Center. For classification, we label a hospital admission as a positive example only if sepsis occurs over the course of the admission, based on the clinical criteria of Angus et al. [1]. For survival analysis, we consider the set of admissions with a positive sepsis label that ended in death in the hospital, and define the time of death as the number of days since admission.
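The survival cohort definition above can be sketched in a few lines. This is a minimal illustration, not the poster's actual code: the dictionary keys (`hadm_id`, `sepsis`, `admit_time`, `death_time`), the timestamp format, and the toy records are all assumptions made for the example.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def days_since_admission(admit_time, death_time):
    """Time of death expressed as (fractional) days since admission."""
    delta = datetime.strptime(death_time, TIME_FORMAT) - datetime.strptime(admit_time, TIME_FORMAT)
    return delta.total_seconds() / 86400.0


def build_survival_cohort(admissions):
    """Keep sepsis-positive admissions with an in-hospital death and attach the time to death."""
    cohort = []
    for adm in admissions:
        if adm["sepsis"] and adm["death_time"] is not None:
            cohort.append({
                "hadm_id": adm["hadm_id"],
                "days_to_death": days_since_admission(adm["admit_time"], adm["death_time"]),
            })
    return cohort


# Toy records, made up for illustration only.
admissions = [
    {"hadm_id": 1, "sepsis": True, "admit_time": "2010-01-01 00:00:00", "death_time": "2010-01-04 12:00:00"},
    {"hadm_id": 2, "sepsis": True, "admit_time": "2010-02-01 08:00:00", "death_time": None},
    {"hadm_id": 3, "sepsis": False, "admit_time": "2010-03-01 08:00:00", "death_time": "2010-03-02 08:00:00"},
]
cohort = build_survival_cohort(admissions)
```

Only the first toy admission qualifies: the second has no in-hospital death and the third has a negative sepsis label.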

Discussion

[Figure: Lasso Testing Performance. ROC curves on the test set (x-axis: False Positive Rate) for the NoICD, AllICD, and NonAngusICD models.]

MIMIC-III (long format)

hadmid  Itemid  Date,time  Value  Flag
1       1       10:00      20
1       1       11:00      50     !
1       2       0:20       200
1       3       16:00      3.5
2       2       7:45       25     !
m       n       8:30       1

Wide format

hadmid  Item1 mean  Item1 slope  Item1 mean  Item1 # flag  Item1 % flag  Item1 cnt  …  Item n
1       25          2            30          1             0.50          2          …  N/A
2       N/A         0            N/A         N/A           N/A           0          …  N/A
m       30          5            4           3             0.60          5          …  3

                    (1) Original       (2) wide         (3) sparsity  (4) NZV
Table               nrow         ncol  nrow     ncol    ncol          ncol
diagnoses_icd       651,047      5     58,976   6,985   N/A           33
admissions          58,976       19    58,976   78      N/A           N/A
labevents           27,854,055   9     58,147   2,880   522           461
inputevents_cv      17,527,935   22    21,879   1,112   166           119
inputevents_mv      3,618,991    31    21,879   1,112   166           119
outputevents        4,349,218    13    51,836   4,556   18            17
procedureevents_mv  258,066      25    21,894   464     52            34
chartevents_1       38,033,561   15    28,687   268     70            61
chartevents_2       13,116,197   15    34,904   36      12            10
chartevents_3       38,657,533   15    29,085   356     108           89
chartevents_4       9,374,587    15    27,210   44      32            28
chartevents_5       18,201,026   15    27,231   168     54            49
chartevents_6       28,014,688   15    34,896   1,644   278           267
chartevents_7       255,967      15    2,030    1,488   6             5
chartevents_8       34,322,082   15    7,990    1,268   184           155
chartevents_9       1,274,692    15    7,452    404     162           156
chartevents_10      9,584,888    15    18,650   528     28            17
chartevents_11      470,141      15    8,672    996     12            10
chartevents_12      265,413      15    1,405    804     4             4
chartevents_13      39,066,570   15    56,716   500     74            53
chartevents_14      100,075,138  15    24,549   3,032   836           535
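The column counts under "sparsity" and "NZV" come from two column filters. The sketch below shows one way such filters could look. The 90% missingness cutoff is the one stated on the poster. The near-zero-variance rule and its thresholds are an assumption, loosely modeled on the defaults of caret's `nearZeroVar` (frequency ratio above 95/5 and at most 10% unique values), because the poster does not state its criteria.

```python
from collections import Counter


def columns_passing_sparsity_filter(rows, max_missing=0.90):
    """Return the columns whose fraction of missing (None) values is at most max_missing."""
    n = len(rows)
    kept = []
    for col in rows[0].keys():
        missing = sum(1 for row in rows if row[col] is None)
        if missing / n <= max_missing:
            kept.append(col)
    return kept


def is_near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a column that is constant, or dominated by one value with few distinct values.

    The thresholds are assumed, not taken from the poster.
    """
    observed = [v for v in values if v is not None]
    distinct = set(observed)
    if len(distinct) <= 1:
        return True  # zero variance (or nothing observed)
    (_, top_count), (_, second_count) = Counter(observed).most_common(2)
    freq_ratio = top_count / second_count
    percent_unique = 100.0 * len(distinct) / len(observed)
    return freq_ratio > freq_cut and percent_unique <= unique_cut


# Toy wide table with 20 admissions: "a" is fully observed, "b" is 95% missing,
# and "c" is 50% missing.
rows = [
    {"a": float(i), "b": 1.0 if i == 0 else None, "c": 2.0 if i < 10 else None}
    for i in range(20)
]
kept_columns = columns_passing_sparsity_filter(rows)
```

With this toy input, `kept_columns` retains `a` and `c` and drops `b`.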

Classification AUC

Model  ICD features included  Total # of features  Training (n=53,079)  Test (n=5,897)
Lasso  None                   797                  0.908                0.791
Lasso  NonSepsis              824                  0.910                0.817
Lasso  All                    830                  0.947                0.900
RF     None                   797                  0.998                0.816
RF     NonSepsis              824                  0.999                0.855
RF     All                    830                  1.000                0.921
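For readers unfamiliar with the metric, AUC equals the probability that a randomly chosen positive admission receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pairwise comparison. The function name and toy inputs are illustrative, and the quadratic loop is meant for clarity rather than for datasets of the size used here.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```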

Survival c-index

Model  ICD features included  Total # of features  Training (n=5,827)  Test (n=330)
Cox    None                   797                  0.92                0.81
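The c-index reported here measures how often the model assigns the higher risk to the admission that dies earlier. The sketch below is a minimal version of Harrell's concordance index. It is an illustration with made-up inputs, not the evaluation code behind the table. The optional `events` argument is an assumption added for generality, since every admission in this cohort has an observed death.

```python
def concordance_index(times, risk_scores, events=None):
    """Fraction of comparable pairs in which the earlier death has the higher predicted risk."""
    n = len(times)
    if events is None:
        events = [1] * n  # every admission has an observed death
    concordant = 0.0
    comparable = 0
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i has an observed event strictly before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: two of the three comparable pairs are ordered correctly.
example_c = concordance_index([2, 5, 9], [2.0, 3.0, 1.0])
```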

The use of Electronic Health Records (EHR) over the past several years has generated a large data source that allows for the development of machine learning models for early diagnosis, risk stratification, and clinical decision support. Generating gold-standard labels for the outcome (phenotyping) is critical to developing a training cohort, but it is often labor-intensive and requires manual chart review. Sepsis affects over a million patients annually and remains one of the largest contributors to mortality in the ICU, costing the healthcare system over 14 million dollars per year. To facilitate high-throughput development of predictive models, we propose an electronic phenotyping algorithm that retrospectively identifies sepsis cases from the EHR and attains high performance without the use of ICD-9 billing codes. Additionally, we explore models that predict the risk of mortality following sepsis on the basis of the derived EHR features.

Logistic regression with L1 regularization (Lasso)

$$L(\theta) = \sum_i \left( y^{(i)} - \theta^T x^{(i)} \right)^2 + \lambda \lVert \theta \rVert_1$$
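For a binary sepsis label, the L1 penalty is typically paired with the logistic negative log-likelihood rather than a squared error. The sketch below evaluates that penalized objective, assuming labels in {0, 1}. The function name and toy inputs are illustrative and not taken from the poster's code.

```python
import math


def l1_logistic_objective(theta, X, y, lam):
    """Logistic negative log-likelihood plus lam * ||theta||_1, for labels y in {0, 1}."""
    nll = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * v for t, v in zip(theta, x_i))
        # log(1 + exp(z)), computed in a numerically stable way
        log1pexp = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        nll += log1pexp - y_i * z
    penalty = lam * sum(abs(t) for t in theta)
    return nll + penalty


# Toy inputs, made up for illustration.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [0, 1]
objective_at_zero = l1_logistic_objective([0.0, 0.0], X, y, lam=0.5)
objective_at_theta = l1_logistic_objective([1.0, -1.0], X, y, lam=0.5)
```

At `theta = 0` every admission gets probability 0.5, so the objective is `n * log(2)` with no penalty. Larger values of `lam` drive more coefficients to exactly zero when the objective is minimized.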

Random Forest (RF) - Fit with 250 trees

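Cox Proportional Hazards with L1 regularization

$$L(\theta) = \prod_i \frac{\exp\left(\theta^T x^{(i)}\right)}{\sum_{j:\, t^{(j)} \ge t^{(i)}} \exp\left(\theta^T x^{(j)}\right)} \quad \text{subject to } \lVert \theta \rVert_1 \le \lambda$$

Here the sum in the denominator runs over the admissions j still at risk at the death time of admission i.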
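The partial likelihood is easier to work with on the log scale. The sketch below evaluates the log partial likelihood and checks the L1 constraint, assuming every admission has an observed death (as in this cohort) and no tied death times. All names and toy values are illustrative.

```python
import math


def cox_log_partial_likelihood(theta, X, times):
    """Log of the Cox partial likelihood; assumes all events observed and no tied times."""
    scores = [sum(t * v for t, v in zip(theta, x)) for x in X]
    total = 0.0
    for i, t_i in enumerate(times):
        # Risk set: admissions whose death time is at or after that of admission i.
        risk_set_sum = sum(math.exp(scores[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        total += scores[i] - math.log(risk_set_sum)
    return total


def satisfies_l1_constraint(theta, lam):
    """True when ||theta||_1 <= lam."""
    return sum(abs(t) for t in theta) <= lam


# Toy cohort: with theta = 0 each term is -log(size of risk set),
# so the total is -(log 3 + log 2 + log 1) = -log 6.
log_pl_at_zero = cox_log_partial_likelihood([0.0], [[1.0], [2.0], [3.0]], [1.0, 2.0, 3.0])
```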

We successfully processed a large and diverse clinical database for the retrospective classification of sepsis cases. The utility of the model is limited, however, because a valid classification can only be made retrospectively, so the model cannot be used for clinical decision support or real-time prediction. Nevertheless, because we achieve relatively high performance without the use of ICD-9 codes, it may be possible to use this model to develop study cohorts that include patients who would have been missed by models using only ICD-9 codes for the outcome definition. Additionally, the same set of summary features attains modest performance at predicting the time-dependent risk of death in the hospital following sepsis, although this result is weaker than in the classification case.

Feature Engineering

The raw data is sparse and temporal. We performed the following operations to extract aggregate summary features for each admission.

1. Raw Data (21 SQL Tables)
2. Convert to wide format
3. Remove variables with greater than 90% missing
4. Near-Zero-Variance filtering
5. Join Tables and define the labels
6. Split train/test (90/10); all subsequent operations are applied to each set separately
7. Near-Zero-Variance filtering
8. Log-transformation and Normalization
9. Median Imputation
10. Near-Zero-Variance filtering
11. Join ICD-codes (optional)
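The long-to-wide conversion collapses each admission's time-stamped measurements into a handful of per-item summaries. The sketch below is one possible version of that aggregation. The feature suffixes (`meanValue`, `valSlope`, `flagNum`, `flagRate`, `nMeas`) follow the feature names visible on the poster. The least-squares definition of the slope, the tuple layout of the input, and the toy events are assumptions.

```python
from collections import defaultdict


def least_squares_slope(ts, vs):
    """Slope of value against time by ordinary least squares; None if undefined."""
    n = len(ts)
    if n < 2:
        return None
    t_bar = sum(ts) / n
    v_bar = sum(vs) / n
    denom = sum((t - t_bar) ** 2 for t in ts)
    if denom == 0:
        return None
    return sum((t - t_bar) * (v - v_bar) for t, v in zip(ts, vs)) / denom


def long_to_wide(events):
    """events: (hadm_id, itemid, hours_since_admission, value, flagged) tuples."""
    groups = defaultdict(list)
    for hadm_id, itemid, hours, value, flagged in events:
        groups[(hadm_id, itemid)].append((hours, value, flagged))

    wide = defaultdict(dict)
    for (hadm_id, itemid), obs in groups.items():
        ts = [o[0] for o in obs]
        vs = [o[1] for o in obs]
        n_flagged = sum(1 for o in obs if o[2])
        row = wide[hadm_id]
        row[f"meanValue_{itemid}"] = sum(vs) / len(vs)
        row[f"valSlope_{itemid}"] = least_squares_slope(ts, vs)
        row[f"flagNum_{itemid}"] = n_flagged
        row[f"flagRate_{itemid}"] = n_flagged / len(obs)
        row[f"nMeas_{itemid}"] = len(obs)
    return dict(wide)


# Toy events, made up for illustration.
events = [
    (1, 101, 10.0, 20.0, False),
    (1, 101, 11.0, 50.0, True),
    (1, 102, 0.5, 200.0, False),
    (2, 102, 7.75, 25.0, True),
]
wide = long_to_wide(events)
```

Items never measured during an admission are simply absent from that admission's row, which corresponds to the N/A cells in the wide table and is where the later sparsity filter and median imputation come in.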

[Figure: Lasso regularization plots for the classification and Cox models, shown against log(Lambda) and L1 Norm. The y-axis labels are truncated in the source.]

[Figure: Survival curve; x-axis: Days Since Admission.]

[Figure: Random forest variable importance (MeanDecreaseGini). The y-axis lists individual features, predominantly labevents summaries per itemid (meanValue, valSlope, meanDiff, flagNum, flagRate, nMeas; for example itemidlabevents_meanValue_51250), together with a few chartevents and inputCV features and age.]

[Figure: Random Forest Testing Performance. ROC curves on the test set (x-axis: False Positive Rate) for the NoICD, AllICD, and NonAngusICD models.]

• Lasso logistic regression and random forest successfully identify sepsis patients both with and without ICD code features

• Cox models predict the risk of mortality following sepsis with modest performance

[Figure: Pipeline overview. 1. Raw Data; 2. Long to wide; 3. & 4. Filter; 5. & 6. Define Training and Test sets (Labels, Features, ICD Features); Training: Lasso, Random Forest.]
