
Yu et al. BMC Medical Informatics and Decision Making 2019, 19(Suppl 2):57. https://doi.org/10.1186/s12911-019-0763-6

RESEARCH Open Access

Inverse reinforcement learning for intelligent mechanical ventilation and sedative dosing in intensive care units

Chao Yu1,2*†, Jiming Liu2*† and Hongyi Zhao1

From the 4th China Health Information Processing Conference, Shenzhen, China, 1-2 December 2018

Abstract

Background: Reinforcement learning (RL) provides a promising technique to solve complex sequential decision making problems in health care domains. For such applications, an explicit reward function encoding domain knowledge should be specified beforehand to indicate the goal of the task. However, there is usually no explicit information regarding the reward function in medical records. It is then necessary to consider an approach whereby the reward function can be learned from a set of presumably optimal treatment trajectories using retrospective real medical data. This paper applies inverse RL to infer the reward functions that clinicians have in mind during their decisions on weaning of mechanical ventilation and sedative dosing in Intensive Care Units (ICUs).

Methods: We model the decision making problem as a Markov Decision Process, and use a batch RL method, Fitted Q Iteration with Gradient Boosting Decision Tree, to learn a suitable ventilator weaning policy from real trajectories in retrospective ICU data. A Bayesian inverse RL method is then applied to infer the latent reward functions in terms of the weights that trade off various aspects of the evaluation criteria. We then evaluate how well the policy learned using the Bayesian inverse RL method matches the policy given by clinicians, compared to other policies learned with fixed reward functions.

Results: Results show that the inverse RL method is capable of extracting meaningful indicators for recommending extubation readiness and sedative dosage, indicating that clinicians pay more attention to patients' physiological stability (e.g., heart rate and respiration rate) than to the oxygenation criteria (FiO2, PEEP and SpO2) favored by previous RL methods. Moreover, by discovering the optimal weights, new effective treatment protocols can be suggested.

Conclusions: Inverse RL is an effective approach to discovering clinicians' underlying reward functions for designing better treatment protocols for ventilation weaning and sedative dosing in future ICUs.

Keywords: Reinforcement learning, Inverse learning, Mechanical ventilation, Sedative dosing, Intensive care units

*Correspondence: [email protected]; [email protected]
†Chao Yu and Jiming Liu contributed equally to this work.
1 School of Computer Science and Technology, Dalian University of Technology, Dalian, China
2 Department of Computer Science, Hong Kong Baptist University, Hong Kong, China

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


Background

Emerging in recent years as a powerful trend and paradigm in machine learning, reinforcement learning (RL) [1] has achieved tremendous success in solving complex sequential decision making problems in various health care domains, including treatment of HIV [2], cancer [3], diabetes [4], anaemia [5], schizophrenia [6], epilepsy [7], anesthesia [8], and sepsis [9], just to name a few. However, all the existing RL applications are grounded on an available reward function, either in a numerical or a qualitative form, to indicate the goal of treatment by clinicians. Explicitly specifying such a reward function not only requires extensive prior domain knowledge, but also relies on clinicians' personal experience, which varies from one clinician to another. Thus, consistent learning performance cannot always be guaranteed. In fact, even if some components of the reward information can be manually defined, it is usually ambiguous exactly how such components should be traded off in an explicit and effective manner. As such, when no explicit information is available regarding the reward function, it is necessary to consider an approach to RL whereby the reward function can be learned from a set of presumably optimal treatment trajectories using retrospective real medical data.

The problem of deriving a reward function from observed behaviors or data is referred to as inverse reinforcement learning (IRL) [10], which has received increasing interest from researchers in the past few years. These methods have achieved substantial success in a wide range of applications, from imitation of autonomous driving behaviors [11, 12] and control of robot navigation [13] to high dimensional motion analysis [14].

Despite the above tremendous progress, there is surprisingly little work on applying IRL approaches in clinical settings. We conjecture that this absence is due to the inherent complexity of clinical data and the associated uncertainties in the decision making process. In fact, medical domains present special challenges with respect to data acquisition, analysis, interpretation and presentation in a clinically relevant and usable format [15]. Medical data are usually noisy, biased and incomplete, posing significant challenges for existing RL methods. For example, many studies are conducted with patients who fail to complete part of the study, or, because of the finite duration of most studies, there is often no information about outcomes after some fixed period of time. Missing or censored data tend to increase the variance of estimates of the value function and the policy in RL [16]. This problem becomes even more severe in the case of IRL, where algorithms not only need to learn a policy using RL, but also need to learn a reward function from data characterized by the above notorious features. The errors introduced during policy learning and reward learning intertwine with each other in IRL, potentially leading to divergent solutions that are of little use in practical clinical applications [17].

In this paper, we aim to apply IRL methods to a specific clinical decision making problem in ICUs, namely the management of invasive mechanical ventilation and the regulation of sedation and analgesia during ventilation [18]. Since prolonged dependence on mechanical ventilation increases hospital cost, while premature extubation increases the risk of complications, it is pressing to develop an effective protocol for weaning patients off a ventilator by properly trading off these two aspects and making optimal sedative dosing decisions during this process. Using sets of real treatment trajectories, we infer the reward functions that clinicians have in mind during their decisions on mechanical ventilation and sedative dosing in ICUs. Experiments verify the effectiveness of IRL in discovering clinicians' underlying reward functions, which are then exploited to design better treatment protocols in ICUs.

Related work

With the development of ubiquitous monitoring techniques, a plethora of ICU data has been generated in a variety of formats, such as free-text clinical notes, images, physiological waveforms, and vital sign time series, enabling optimal diagnosis, treatment and mortality prediction of patients in ICUs [15]. Thus far, a great deal of theoretical and experimental studies have employed RL techniques and models for decision support in critical care. Nemati et al. developed deep RL algorithms that learn an optimal heparin dosing policy from real trials in large electronic medical records [19, 20]. Sandu et al. studied the blood pressure regulation problem in post cardiac surgery patients using RL [21]. Padmanabhan et al. resorted to RL for the control of continuous intravenous infusion of propofol for ICU patients, both considering the anesthetic effect and regulating the mean arterial pressure to a desired range [8]. Raghu et al. proposed an approach to deduce treatment policies for septic patients using continuous deep RL methods [22], and Weng et al. applied RL to learn personalized optimal glycemic treatments for severely ill septic patients [9]. The most closely related work is that by Prasad et al., who applied a batch RL algorithm, fitted Q iteration with extremely randomized trees, to determine the best weaning time of invasive mechanical ventilation and the associated personalized sedative dosage [18]. Their results demonstrate that the learned policies show promise in recommending weaning protocols with improved outcomes, in terms of minimizing rates of reintubation and regulating physiological stability. However, all these studies are built upon a well predefined reward function that requires heavy domain knowledge and manual engineering.


Ng and Russell first introduced IRL to describe the problem of recovering the reward function of an MDP from demonstrations [10]. Numerous IRL methods have been proposed since, including Apprenticeship Learning [11], Maximum Entropy IRL [23], Bayesian IRL [24], and nonlinear representations of the reward function using Gaussian processes [25]. Most of these methods need to solve an RL problem in each step of reward learning, requiring an accurate model of the system's dynamics that is either given a priori or can be estimated well enough from demonstrations. However, such accurate models are rarely available in clinical settings. How to guarantee the performance of the RL solutions within an IRL process remains an unsolved issue in IRL applications, especially in clinical settings where the only available information is observations of clinicians' treatment data, which are subject to unavoidable noise, bias and censoring issues.

Methods

In this section, we investigate the possibility of applying IRL approaches to solve complex clinical decision making problems, namely automated weaning of mechanical ventilation and optimal sedative dosing in ICUs. To this end, the critical care data set and its preprocessing are introduced first. The decision making framework and its associated RL components are then discussed. Finally, an IRL method is applied to infer the reward functions used by clinicians.

Preprocessing of critical care data

We use the Medical Information Mart for Intensive Care (MIMIC III) database [26], a large, freely accessible database comprising deidentified health-related information from nearly forty thousand distinct adult patients and eight thousand neonates who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The database is mainly intended for academic and industrial research, offering a variety of forms of critical care data including demographics, vital signs, laboratory tests, diagnoses, medications, and more.

To acquire the data for our purpose, we first extract 8860 admissions of adult patients undergoing invasive ventilation from the MIMIC III database. In order to focus on weaning of ventilation and sedative dosing, we further filter these data by excluding admissions in which the patient was kept under ventilation for less than 24 hours, or was unsuccessfully discharged from the hospital by the end of the admission. This allows us to exclude admissions in which ventilation was applied only for a short period of time (e.g., after a surgery), or in which the patient expired in the ICU due to factors unrelated to ventilator support and sedative dosing [18]. Since we mainly focus on learning policies for weaning of ventilation and sedative dosing, prolonged ventilation, administration of unsuccessful spontaneous breathing trials, and reintubation within the same admission are considered the main factors in decision making.
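For concreteness, the cohort filter described above can be expressed in a few lines of pandas. This is a minimal sketch under stated assumptions: it presumes a pre-extracted table of ventilated adult admissions with hypothetical columns (vent_hours, discharged_alive), not the actual MIMIC III schema or the extraction query used in the study.

import pandas as pd

# Hypothetical pre-extracted table of ventilated adult admissions; the column names
# below (vent_hours, discharged_alive) are illustrative, not MIMIC III field names.
admissions = pd.read_csv("ventilated_admissions.csv")

# Keep admissions ventilated for at least 24 hours that ended in a successful discharge.
cohort = admissions[(admissions["vent_hours"] >= 24)
                    & (admissions["discharged_alive"] == 1)]
print(f"{len(cohort)} of {len(admissions)} admissions retained")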

Data in critical care are characterized by issues of compartmentalization, corruption and complexity [15]. Measurements of vital signs and lab results may be irregular, sparse, and error-prone. Some physiological parameters, such as heart rate or respiratory rate, are taken several times an hour, while others, such as arterial pH or oxygen pressure, are measured only once every several hours. Changes in body state occur continuously, and naive interpolation methods do not achieve the necessary accuracy at higher temporal resolutions. Therefore, we use support vector machines (SVM) to fit the physiological parameters according to measurement time. After preprocessing, we obtain complete data for each patient, at a temporal resolution of 10 minutes, from admission time to discharge time. Figure 1 shows example trajectories of three vital signs (Heart Rate, SpO2 and Respiratory Rate) after preprocessing; the predicted trajectories fit the sample trajectories with high accuracy.

Fig. 1 Example trajectories of three vital signs (Heart Rate, SpO2 and Respiratory Rate) after preprocessing. a Heart Rate b SpO2 c Respiratory Rate
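A minimal sketch of this interpolation step is given below, assuming scikit-learn's SVR with the RBF kernel (the regularization value C = 25 is reported later in the Results); the function and variable names are ours, and the example measurements are made up.

import numpy as np
from sklearn.svm import SVR

def interpolate_vital(times_min, values, horizon_min, step=10, C=25):
    """Fit an RBF-kernel SVR to irregularly sampled vital-sign measurements and
    resample the fitted curve on a regular 10-minute grid."""
    svr = SVR(kernel="rbf", C=C)
    svr.fit(np.asarray(times_min, dtype=float).reshape(-1, 1),
            np.asarray(values, dtype=float))
    grid = np.arange(0, horizon_min + step, step, dtype=float).reshape(-1, 1)
    return grid.ravel(), svr.predict(grid)

# Example with made-up heart-rate measurements over six hours.
times, heart_rate = [0, 47, 130, 255, 330], [92, 95, 88, 90, 86]
grid_min, hr_every_10min = interpolate_vital(times, heart_rate, horizon_min=360)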

MDP formulation and the RL solution

The decision making process of our problem is modeled as a Markov Decision Process (MDP) defined by a tuple ⟨S, A, P, R⟩, in which $s_t \in S$ is a patient's state at time t, $a_t \in A$ is the action taken by clinicians at time t, $P(s_{t+1} \mid s_t, a_t)$ is the probability of the next state given the current state and action, and $r(s_t, a_t) \in R$ is the observed reward following the transition at time step t. The goal of an RL agent is to learn a policy that maximizes the expected accumulated reward over the time horizon T:

$R^{\pi}(s_t) = \lim_{T\to\infty} \mathbb{E}_{s_{t+1}\mid s_t,\pi(s_t)}\left[\sum_{t+1}^{T}\gamma^{t} r(s_t, a_t)\right],$

where the discount factor γ determines the relative weight of immediate and long-term rewards.
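As a small worked example of the discounted return above, the discounted sum of a reward sequence can be computed as follows; the value of γ is illustrative, since the discount factor is not reported here.

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one trajectory of observed rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. a short trajectory of rewards observed every 10 minutes
print(discounted_return([0.0, 0.1, -0.2, 1.0], gamma=0.99))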

The patient's response to sedation and extubation depends on many different factors, from demographic characteristics, pre-existing conditions, and comorbidities to specific time-varying vital signs, and there is considerable variability in clinical opinion on the extent of the influence of these factors [18]. We extracted the 13 most important features, including respiration rate, heart rate, arterial pH, positive end-expiratory pressure (PEEP) set, oxygen saturation by pulse oximetry (SpO2), inspired oxygen fraction (FiO2), arterial oxygen partial pressure, plateau pressure, average airway pressure, mean non-invasive blood pressure, body weight (kg), age, and ventilation status. In designing the set of actions, two actions are defined to indicate a patient being off or on the ventilator, respectively. As for the sedative, we focus on a commonly used agent, propofol, and separate its dosage into four different levels of sedation. Thus, the defined action set includes eight combinational actions in total.
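The resulting action space can be enumerated directly; the sedation-level labels in the sketch below are illustrative, since the paper only states that the propofol dosage is discretized into four levels.

from itertools import product

VENT_ACTIONS = ("vent_off", "vent_on")   # patient off or on the ventilator
SEDATION_LEVELS = (0, 1, 2, 3)           # four propofol dose bins, lowest to highest

# 2 ventilation settings x 4 sedation levels = 8 joint actions, as described above.
ACTIONS = list(product(VENT_ACTIONS, SEDATION_LEVELS))
assert len(ACTIONS) == 8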

The formulation of the reward function is key to successful applications of RL approaches. Here, we design a reward function $r_{t+1}$ under the guidance of clinicians at the Hospital of the University of Pennsylvania (HUP). Current extubation guidelines at HUP require two main conditions to be met: physiological stability (respiratory rate ≤ 30, heart rate ≤ 130, and arterial pH ≥ 7.3) and the oxygenation criteria (PEEP (cm H2O) ≤ 8, SpO2 (%) ≥ 88, and FiO2 (%) ≤ 50). Following previous work [18], the reward function is defined as

$r_{t+1} = r^{vitals}_{t+1} + r^{vent\,off}_{t+1} + r^{vent\,on}_{t+1}$

to reward physiological stability and successful extubation while penalizing adverse events (i.e., failed spontaneous breathing trials (SBTs) or reintubation).

Reward component $r^{vitals}_{t+1}$ evaluates how the actions perform with respect to the patient's physiological stability, in terms of staying within a reasonable range and not changing drastically:

$r^{vitals}_{t+1} = C_1 \sum_{v^{sta}_t}\left[\frac{1}{1+e^{-(v^{sta}_t - v^{sta}_{min})}} - \frac{1}{1+e^{-(v^{sta}_t - v^{sta}_{max})}} + \frac{1}{2}\right] - C_2\left[\max\left(0,\; \frac{|v^{sta}_{t+1} - v^{sta}_t|}{v^{sta}_t} - 0.2\right)\right],$

where the values $v^{sta}_t$ are the measurements of the physiological vitals in the state definition (i.e., respiratory rate, heart rate, and arterial pH) at time t, which are believed to be indicative of physiological stability. When $v^{sta}_t \in [v^{sta}_{min}, v^{sta}_{max}]$, the patient is considered to be in a physiologically stable state. The second term on the right provides negative feedback when consecutive measurements show a sharp change, which is detrimental to the patient.

Reward component $r^{vent\,off}_{t+1}$ evaluates the performance of weaning ventilation at time t + 1:

$r^{vent\,off}_{t+1} = \mathbb{1}_{[s_{t+1}(vent\,on)=0]}\left[C_3\cdot\mathbb{1}_{[s_t(vent\,on)=1]} + C_4\cdot\mathbb{1}_{[s_t(vent\,on)=0]} - C_5\sum_{v^{ext}_t}\mathbb{1}_{[v^{ext}_t > v^{ext}_{max} \;\|\; v^{ext}_t < v^{ext}_{min}]}\right],$

where the $v^{ext}_t$ are the parameters related to the conditions for extubation (i.e., FiO2, SpO2, PEEP), and $\mathbb{1}_{[con.]}$ is an indicator function that returns 1 if the condition con. is true, and 0 otherwise. If $v^{ext}_t \notin [v^{ext}_{min}, v^{ext}_{max}]$, meaning the conditions are not suitable for extubation, the agent receives negative rewards when extubation is conducted. Otherwise, a positive reward is given at the time of successful extubation (i.e., the $C_3$ component).

The last reward component $r^{vent\,on}_{t+1}$ simply indicates the cost of each additional hour spent on the ventilator:

$r^{vent\,on}_{t+1} = \mathbb{1}_{[s_{t+1}(vent\,on)=1]}\left[C_6\cdot\mathbb{1}_{[s_t(vent\,on)=1]} - C_7\cdot\mathbb{1}_{[s_t(vent\,on)=0]}\right].$
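To make the composition of the reward concrete, the sketch below implements the three components directly from the formulas above. It is only illustrative: the weight values, the stability and extubation ranges, and the choice to sum the change penalty over the same vitals are our assumptions, not values from the paper (the weights C1 to C7 are exactly what the IRL procedure later tries to recover).

import numpy as np

# Placeholder weights; the paper leaves C1..C7 to be learned.
C = {f"C{i}": 1 / 7 for i in range(1, 8)}

# Hypothetical ranges: stability vitals (respiratory rate, heart rate, arterial pH)
# and extubation parameters (FiO2 %, SpO2 %, PEEP cmH2O).
STA_RANGES = {"resp_rate": (12, 30), "heart_rate": (50, 130), "art_ph": (7.3, 7.45)}
EXT_RANGES = {"fio2": (21, 50), "spo2": (88, 100), "peep": (0, 8)}

def r_vitals(vitals_t, vitals_t1):
    """Reward staying inside the stability ranges, penalize sharp (>20%) changes."""
    stay = sum(1 / (1 + np.exp(-(vitals_t[n] - lo)))
               - 1 / (1 + np.exp(-(vitals_t[n] - hi))) + 0.5
               for n, (lo, hi) in STA_RANGES.items())
    change = sum(max(0.0, abs(vitals_t1[n] - vitals_t[n]) / vitals_t[n] - 0.2)
                 for n in STA_RANGES)
    return C["C1"] * stay - C["C2"] * change

def r_vent_off(vent_on_t, vent_on_t1, ext_t):
    """Reward extubation; penalize it when the extubation parameters are out of range."""
    if vent_on_t1 != 0:
        return 0.0
    out_of_range = sum(ext_t[n] < lo or ext_t[n] > hi
                       for n, (lo, hi) in EXT_RANGES.items())
    return C["C3"] * (vent_on_t == 1) + C["C4"] * (vent_on_t == 0) - C["C5"] * out_of_range

def r_vent_on(vent_on_t, vent_on_t1):
    """Per-interval term while the patient remains on the ventilator."""
    if vent_on_t1 != 1:
        return 0.0
    return C["C6"] * (vent_on_t == 1) - C["C7"] * (vent_on_t == 0)

def reward(vitals_t, vitals_t1, vent_on_t, vent_on_t1, ext_t):
    return (r_vitals(vitals_t, vitals_t1)
            + r_vent_off(vent_on_t, vent_on_t1, ext_t)
            + r_vent_on(vent_on_t, vent_on_t1))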

Constants $C_1$ to $C_7$ are the weights of the reward function ($C_1 + \ldots + C_7 = 1$), which determine the relative importance of each reward component. Given a predefined weight vector, existing RL methods can be applied to learn an optimal policy for the decision making problem. Due to its data efficiency, we adopt an off-policy batch-mode RL method, Fitted Q-iteration (FQI) [27], to solve the learning problem. FQI uses a set of one-step transition tuples $F = \{(\langle s^n_t, a^n_t, s^n_{t+1}\rangle, r^n_{t+1}),\; n = 1, \ldots, |F|\}$ to learn a sequence of function approximators of the Q values (i.e., the expected values of state-action pairs) by iteratively solving supervised learning problems. The Q values are updated at each iteration according to the Bellman equation $\hat{Q}_k(s_t, a_t) \leftarrow r_{t+1} + \gamma \max_{a\in A} \hat{Q}_{k-1}(s_{t+1}, a)$, where $\hat{Q}_1(s_t, a_t) = r_{t+1}$. The approximation of the optimal policy after K iterations is then given by:

$\hat{\pi}^*(s) = \arg\max_{a\in A} \hat{Q}_K(s, a).$

Although various supervised learning methods are available for the regression step in FQI, including kernel-based methods and decision trees, in this paper we use the Gradient Boosting Decision Tree (GBDT) [28] method as the regressor due to its superior generalization performance.
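A minimal sketch of FQI with a GBDT regressor is shown below: state-action features are simply concatenated, one fresh regressor is fitted per iteration, and the tree hyper-parameters mirror those reported later in the Results. The helper names and this particular design are simplifications of ours, not the authors' implementation.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fitted_q_iteration(transitions, actions, n_iterations=40, gamma=0.99):
    """transitions: list of (s, a, r, s_next) with s a feature vector, a an action index."""
    X = np.array([np.append(s, a) for s, a, _, _ in transitions])
    rewards = np.array([r for _, _, r, _ in transitions])
    next_states = np.array([s_next for _, _, _, s_next in transitions])

    model = None
    for _ in range(n_iterations):
        if model is None:
            targets = rewards                                  # Q_1(s, a) = r_{t+1}
        else:
            # Q_k(s, a) = r + gamma * max_a' Q_{k-1}(s', a')
            q_next = np.column_stack([
                model.predict(np.hstack([next_states,
                                         np.full((len(next_states), 1), a)]))
                for a in actions])
            targets = rewards + gamma * q_next.max(axis=1)
        model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                          max_depth=3)
        model.fit(X, targets)
    return model

def greedy_action(model, s, actions):
    """pi_hat*(s) = argmax_a Q_K(s, a)."""
    q_vals = [model.predict(np.append(s, a).reshape(1, -1))[0] for a in actions]
    return actions[int(np.argmax(q_vals))]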

A Bayesian IRL solution

The direct application of RL approaches requires predefined weight parameters so that a feasible policy can be learned. Although the reward function can generally be formulated according to some domain knowledge, in many situations an explicit formulation of the reward function is difficult or even impossible, even with the guidance of experts. In response to this problem, an apprenticeship learning algorithm was proposed [11], which learns the reward function from the trajectories of an expert's demonstration, so that the learned policy matches the expert's policy as closely as possible. Although the basic reward components for the ventilation weaning and sedative dosing problem in ICUs have been formulated based on the guidance of expert doctors, how to derive the weights that trade off these components remains a challenging issue.

To this end, we first intended to use the apprenticeship learning algorithm to learn the complete reward function (i.e., the values of $C_1, \ldots, C_7$) from the cases treated by expert clinicians and then optimize the policies learned using this reward function. To use the apprenticeship learning algorithm, a concept called the feature expectation should be defined, which can be understood simply as the expectation of the corresponding environmental feature under the current policy. The algorithm then proceeds as follows. The reward function is initialized first, randomly or preferentially according to some prior knowledge, and any RL algorithm is used to compute an intermediate policy. By assuming an accurate model of the system's dynamics, either given a priori or estimated well enough from the data trajectories, the feature expectation of the intermediate policy can be calculated. The algorithm then updates the weights of the reward function so as to maximize the margin between the expert's feature expectations and the closest feature expectations of the intermediate policies. The new reward function is then applied to compute a new policy and a new feature expectation. This process iterates until the resulting policy is close enough to the expert's policy.
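The feature expectation referred to above can be estimated empirically from a set of trajectories as a discounted sum of state features; the sketch below is a generic estimator, with the feature map and γ as placeholders rather than quantities specified in the paper.

import numpy as np

def feature_expectation(trajectories, feature_fn, gamma=0.99):
    """Empirical mu = E[ sum_t gamma^t * phi(s_t) ] averaged over trajectories.

    trajectories: iterable of state sequences; feature_fn maps a state to a vector phi(s).
    """
    mus = [sum((gamma ** t) * np.asarray(feature_fn(s), dtype=float)
               for t, s in enumerate(states))
           for states in trajectories]
    return np.mean(mus, axis=0)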

Algorithm 1 Bayesian IRL with Fitted Q-Iteration

Input: one-step transitions F = {⟨s^n_t, a^n_t, s^n_{t+1}⟩, r^n_{t+1}}, n = 1:|F|; action space A; subset size N; regression parameters θ
Initialize: reward function R; policy π; probability P(O|R)
Repeat:
  Randomly choose an R′ from the neighbors of R in R^{|S|}_δ;
  Initialize Q_0(s_t, a_t) = 0 for all s_t ∈ F, a_t ∈ A;
  for iteration k = 1 → K do
    subset_N ∼ F; S ← [ ];
    for i ∈ subset_N do
      Q_k(s_i, a_i) ← r′_i + γ max_{a′∈A} predict(⟨s_{i+1}, a′⟩, θ);
      S ← append(S, ⟨(s_i, a_i), Q_k(s_i, a_i)⟩);
    end
    θ ← regress(S);
  end
  π ← classify(⟨s^n_t, a^n_t⟩, θ);
  Evaluate Q^π(s, a, R′) and compute P(O|R′);
  p ← P(O|R′) / P(O|R);
  R ← R′ with probability min{1, p};
Output: R

However, after applying the apprenticeship learning algorithm to the ICU problem, the learned policy did not converge at all. After a deeper analysis, we found that the reason was the correlation of the state features in the reward function with the length of the patient's stay in the ICU and the number of intubations and extubations. These factors are affected by many other issues, such as the patient's personal situation, the degree of shortage in ICU wards, and personal treatment strategy preferences, and therefore cannot effectively distinguish expert policies from non-expert policies. In particular, naive apprenticeship learning algorithms built on the comparison of feature expectations are unsuited to problems with bivariate features and trajectories of varying length, since this causes significant bias in computing the expectations of such features, leading to divergence of the final learning performance.

To avoid the above problems, we exploited the Bayesian IRL algorithm [24] to learn the reward function. The whole learning procedure is illustrated in Fig. 2. We assume that, under the reward function R, the probability of the agent producing the expert trajectory $O = \{(s_1, a_1), \ldots, (s_k, a_k)\}$ is given by $\Pr(O\mid R) = \prod_{i=1}^{k}\Pr((s_i, a_i)\mid R)$, in which the probability of each $(s_i, a_i)$ is assumed to follow the Boltzmann distribution $\Pr((s_i, a_i)\mid R) = \frac{1}{C_i}e^{\alpha Q^*(s_i, a_i, R)}$, where $Q^*(s, a, R)$ is a potential function (the action value function) under the optimal policy for R, $C_i = \sum_{a\in A} e^{\alpha Q^*(s_i, a, R)}$ is the normalization constant, and $\alpha$ is a parameter that adjusts the probability of the expert's choice of action. Combined with the prior distribution of the reward function R, the posterior probability of R given the observation and action sequence O can be computed using Bayes' theorem as $\Pr(R\mid O) = \frac{1}{Z}e^{\alpha\sum_i Q^*(s_i, a_i, R)}\Pr(R)$, where Z is another normalization constant. When no other information is given, we assume that the prior Pr(R) over reward functions is uniform.

Fig. 2 The process of IBL

In order to compute the posterior distribution of R, we use a Markov Chain Monte Carlo (MCMC) algorithm (GridWalk) as the sampling method, which generates a Markov chain on the intersection points of a grid of length δ in the region R^{|S|} (denoted as R^{|S|}_δ). Algorithm 1 gives the main procedure of the Bayesian IRL method, with FQI as the RL algorithm in each inner iteration of policy learning.
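A compact sketch of one sampling step of this procedure is given below: a candidate reward R′ is drawn from a grid neighbor of R, the inner RL problem is re-solved to obtain Q*, the Boltzmann likelihood Pr(O|R) is evaluated on the expert trajectory, and R′ is accepted with probability min{1, p}. The helper solve_rl stands in for the FQI step of Algorithm 1, and the values of δ and α are illustrative.

import numpy as np

def log_likelihood(expert_pairs, q_star, alpha=1.0):
    """log Pr(O|R) under Pr((s,a)|R) = exp(alpha*Q*(s,a,R)) / sum_a' exp(alpha*Q*(s,a',R)).

    q_star(s) returns a dict {action: Q*(s, a, R)} for the candidate reward R.
    """
    ll = 0.0
    for s, a in expert_pairs:
        q = q_star(s)
        logits = alpha * np.array(list(q.values()))
        log_norm = logits.max() + np.log(np.exp(logits - logits.max()).sum())
        ll += alpha * q[a] - log_norm
    return ll

def mcmc_step(R, expert_pairs, solve_rl, rng, delta=0.05, alpha=1.0):
    """One GridWalk/Metropolis step over the reward weight vector R."""
    R_new = R + rng.choice([-delta, delta], size=R.shape)     # neighbor on the delta-grid
    ll_old = log_likelihood(expert_pairs, solve_rl(R), alpha)
    ll_new = log_likelihood(expert_pairs, solve_rl(R_new), alpha)
    # Accept with probability min{1, P(O|R')/P(O|R)} (the uniform prior cancels).
    return R_new if np.log(rng.uniform()) < ll_new - ll_old else R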

Results

As there are six commonly used sedatives in the MIMIC III data set, we extract the 707 admissions in which propofol was applied as the sedative. These data are then split into a training set of 566 admissions and a test set of the remaining 141 admissions. The radial basis function is used as the kernel in the SVM, with regularization coefficient C set to 25. After data preprocessing, 285.5 thousand and 150.1 thousand one-step transitions are generated in the training set and test set, respectively. In order to ensure faster training, we take 10000 one-step transitions for training in each iteration of FQI. The number of boosting stages is 100 and the learning rate is 0.1. All the samples are used for fitting the individual base learners, and the least-squares loss function is optimized. For each base learner, all the features are considered when looking for the best split. The maximum depth is 3, the minimum number of samples required to split an internal node is 2, and at least one sample is required at a leaf node. Other hyper-parameters are set to their default values.
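The hyper-parameters listed above map naturally onto scikit-learn estimators; the sketch below is one plausible configuration (parameter names follow current scikit-learn and may differ from the code actually used in the study).

from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor

# Vital-sign interpolation: RBF kernel with regularization coefficient C = 25.
svr = SVR(kernel="rbf", C=25)

# FQI regressor: 100 boosting stages, learning rate 0.1, least-squares loss, max depth 3,
# min_samples_split 2, min_samples_leaf 1, all samples and all features used per learner.
gbdt = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    loss="squared_error",   # least-squares loss ("ls" in older scikit-learn releases)
    max_depth=3,
    min_samples_split=2,
    min_samples_leaf=1,
    subsample=1.0,
    max_features=None,
)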

First, we evaluate whether the FQI method combined with GBDT as the regressor, and its inverse version, are capable of learning effective solutions. For each weight Ci (i ∈ {1, . . . , 7}), we constrain its value to [0, 1] to indicate different levels of importance. We test FQI-GBDT using a weight vector of [1/7, 1/7, 1/7, 1/7, 1/7, 1/7, 1/7] (i.e., πBL), and three other weight vectors generated randomly from the range [0, 1], corresponding to πBL1, πBL2, and πBL3, respectively. Table 1 presents the weight settings for the RL policies. To use the Bayesian IRL with FQI-GBDT, we choose the initial weight vector [0, 0, 0, 0, 0, 0, 0] to indicate no prior knowledge about the value functions. After each exploration of the weights in the IRL process, the weights are normalized so that their sum equals 1. Figure 3 plots the convergence of the learning processes in terms of the difference of Q values between two consecutive iterations. Both the RL methods and the IRL method achieve convergence after around 40 iterations, which verifies the effectiveness of applying RL and IRL methods to the ventilation and sedative dosing problems in ICUs. Since the IRL method involves estimating the reward function during learning, it can bring about a more efficient and robust learning process than the RL methods based on predefined fixed reward functions.

Table 1 Weight vectors for different RL policies

Policy   Weight of reward function
πBL      [1/7, 1/7, 1/7, 1/7, 1/7, 1/7, 1/7]
πBL1     [0.14, 0.24, 0.15, 0.19, 0.07, 0.07, 0.14]
πBL2     [0.08, 0.17, 0.16, 0.18, 0.29, 0.10, 0.02]
πBL3     [0.07, 0.19, 0.12, 0.21, 0.26, 0.04, 0.11]

Fig. 3 Convergence of Q̂ using FQI-GBDT and the inverse FQI-GBDT methods

Figure 4 plots the convergence of the probability P(O|R) using Bayesian IRL, where W1 = [0, 0, 0, 0, 0, 0, 0] and W2 = [1/7, 1/7, 1/7, 1/7, 1/7, 1/7, 1/7]. As the number of iterations increases, the policies learned using πIBL get closer to the expert's policy. However, weight W2 enables better initial performance than W1 due to less exploration in the reward function space. Note that P(O|R) is not a probability converging to 1, since it is the proportion that an action's potential function (i.e., Q function) accounts for among the potential functions of all the actions. The results in Fig. 4 thus indicate that the efficiency of a Bayesian IRL method closely depends on the initial weights of the reward function. If some prior knowledge about the reward function is available, learning efficiency can be greatly improved by initializing the weights to those specified by this prior knowledge. Enabling the integration of domain knowledge into the learning process for performance improvement is also a major merit of Bayesian IRL methods.

Fig. 4 The convergence of P(O|R) using different initial values of the weights

In order to assess how well the learned policies match the true policies of the doctors, we validate all the policies on the test set of real medical data. As shown in Table 2, the performance of the RL methods heavily depends on the choice of initial reward weights. Policy πBL matches 53.5% of the doctors' joint actions, with 99.6% consistency on the ventilation action and 53.9% on the sedative action, while policy πBL2 matches only 14.1% of the joint actions on average. The IRL method is consistent with the doctors' actions in ventilation by 99.7% and in sedative dosing by 54.2%, achieving an overall consistency of 53.9%.
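The match rates in Table 2 can be computed as simple agreement fractions between the learned policy's actions and the clinicians' recorded actions; the sketch below follows the table's decomposition into ventilation, sedative, and overall (joint) consistency, with all names being our own.

def action_agreement(policy, test_steps):
    """policy(s) -> (vent_action, sedation_level); test_steps: [(s, (vent, sed)), ...]."""
    vent_hits = sed_hits = joint_hits = 0
    for s, (vent_clin, sed_clin) in test_steps:
        vent_pred, sed_pred = policy(s)
        vent_hits += vent_pred == vent_clin
        sed_hits += sed_pred == sed_clin
        joint_hits += (vent_pred, sed_pred) == (vent_clin, sed_clin)
    n = len(test_steps)
    return {"ventilation": vent_hits / n,
            "sedative": sed_hits / n,
            "overall": joint_hits / n}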

We further divide the test data set into two main sub-groups: an expert data set, in which intubation was conducted only once and the SBTs were successful, and a non-expert data set, in which intubation was conducted only once but the SBTs failed (i.e., Ordinary Single Intubation Data) or intubation was conducted more than once (i.e., Multiple Intubation Data). The latter two data sets are considered non-expert because wrong decisions about weaning of ventilation or sedative dosing caused the failure of SBTs or repeated intubation. Table 3 shows the results of πIBL and πBL on these test data sets in terms of the match of sedative dosing actions. As πBL is the best policy among all the RL policies, it achieves correctness comparable to πIBL. However, the correctness on the non-expert sets, particularly the multiple intubation data set, is much higher than on the expert test set. This is because it is more difficult to recover experts' reward functions than non-experts', since non-experts' reward functions (i.e., clinical decisions) usually deviate far from the true ones expressed by experts. The larger bias thus enables IRL methods to explore the whole reward function space more easily.

Table 2 The correctness of the learned policies using RL and IRL methods on the test data set

Policy   Overall action   Ventilation   Sedative
πIBL     53.9%            99.7%         54.2%
πBL      53.5%            99.6%         53.9%
πBL1     23.5%            45.7%         51.0%
πBL2     14.1%            35.5%         39.1%
πBL3     17.2%            34.9%         54.1%

Table 3 The correctness of the sedative dosing policies using RL and IRL methods on the test data set

Policy   Expert Data   Ordinary Single Intubation Data   Multiple Intubation Data
πIBL     44.5%         48.5%                             63.4%
πBL      44.4%         48.4%                             62.8%

Discussion

Current extubation guidelines provide precise conditions for clinicians to determine when extubation is most preferable. However, the priorities of these conditions are usually based on clinicians' personal experience and thus have not been explicitly specified. Figure 5 compares the importance of patients' physiological indicators and ventilator parameters under the policies learned by the four RL methods and the IRL method. Clearly, the feature importance of the policies learned with different reward weights and learning methods differs substantially. For example, the top three most important features are (FiO2, MAP and age), (FiO2, MAP and PEEP set), and (FiO2, MAP and SpO2) for policies πBL3, πBL2, and πBL1, respectively. However, the results in Fig. 3 show that these three RL methods perform poorly, with slow convergence and an unstable learning process, indicating the limitations of such feature priorities.

Fig. 5 Comparison of feature importance using different RL and IRL policies

To provide deeper insight, we compared the importance of related features under the two more efficient methods, πBL and πIBL. Figures 6 and 7 show that the importance of related features shares quite similar patterns. The top three most important features are age, heart rate, and respiratory rate, and together they account for a large proportion of the total feature importance. In particular, the age of a patient is strongly correlated with the patient's ability to recover, and is thus given the highest priority when considering ventilation and sedative treatment policies in ICUs. In addition, heart rate and respiration rate are two main factors in maintaining physiological stability. This emphasis contrasts with the other three RL methods (i.e., πBL1, πBL2, and πBL3), which pay more attention to the oxygenation criteria of FiO2, PEEP and SpO2.

Fig. 6 Feature importance using πBL

Although πIBL and πBL share similar learning performance, the learned weights surprisingly differ considerably. The weights of the reward function under πIBL eventually stabilize at [0.26, 0.06, 0.18, 0.12, 0.08, 0.28, 0.02]. Compared with the weights [1/7, 1/7, 1/7, 1/7, 1/7, 1/7, 1/7] used by πBL, weights C2, C5 and C7 under πIBL are much smaller, while C1 and C6 are much larger. This indicates that, rather than weighting all seven factors equally, doctors give higher priority to the patient's physiological stability in terms of staying within a reasonable range (i.e., higher C1) and to the cost of each additional hour spent on the ventilator (i.e., higher C6), but lower priority to the other factors. These results offer helpful insights for the development of new, effective treatment protocols for intelligent ventilation and sedative dosing in ICUs.

Fig. 7 Feature importance using πIBL

Conclusions

In this work, a data-driven approach is proposed for optimizing the weaning of mechanical ventilation and sedative dosing for patients in ICUs. We model the decision making problem as an MDP, and use a batch RL method, FQI with GBDT, to learn a suitable ventilator weaning policy from real trajectories in historical ICU data. A Bayesian IRL method is then applied to infer the latent reward functions in terms of the weights that trade off various aspects of the evaluation criteria. We demonstrate that the approach is capable of extracting meaningful indicators for recommending extubation readiness and sedative dosage, on average outperforming direct RL methods in terms of regulation of vitals and reintubations. Moreover, by discovering the optimal weights using IRL methods, new effective treatment protocols can be suggested for intelligent decision making in ventilation weaning and sedative dosing in future ICUs.

Although our work has verified the effectiveness of IRL methods in complex clinical decision making problems, a number of issues need to be carefully resolved before these methods can be meaningfully implemented in a clinical setting. First, in this paper the two main processes of data preprocessing and data learning are conducted separately. The errors introduced during preprocessing will inevitably affect the accuracy of the subsequent learning. It is thus necessary to enable IRL methods to work directly on the raw noisy and incomplete data. Moreover, most existing IRL methods require an accurate model to be given beforehand or estimated from data. This is problematic when such a model is lacking or cannot be accurately estimated directly from expert demonstrations, particularly in clinical settings where the model always involves a large volume of continuous states and actions. It is thus valuable to apply IRL methods that are capable of estimating the rewards and the model dynamics simultaneously. Some recent theoretical research on IRL [17, 29] has investigated these issues and could be explored in the clinical settings considered here. These issues are left for future work.

Acknowledgements
Not applicable.

Funding
This work is supported by the Hong Kong Scholar Program under Grant No. XJ2017028, and the Dalian High Level Talent Innovation Support Program under Grant 2017RQ008.

Availability of data and materials
The datasets used and/or analysed during the current study are available from the first author on reasonable request.

About this supplement
This article has been published as part of BMC Medical Informatics and Decision Making Volume 19 Supplement 2, 2019: Proceedings from the 4th China Health Information Processing Conference (CHIP 2018). The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-19-supplement-2.

Authors' contributions
YC proposed the idea, implemented the simulation and drafted the manuscript. ZH contributed to the collection, analysis, and interpretation of experimental data. LJ supervised the research and proofread the manuscript. All authors contributed to the preparation, review, and approval of the final manuscript and the decision to submit the manuscript for publication.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.


Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Published: 9 April 2019

References
1. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge: The MIT Press; 1998.
2. Parbhoo S, Bogojeska J, Zazzi M, Roth V, Doshi-Velez F. Combining kernel and model based learning for HIV therapy selection. AMIA Summits Transl Sci Proc. 2017;2017:239.
3. Tseng H-H, Luo Y, Cui S, Chien J-T, Ten Haken RK, Naqa IE. Deep reinforcement learning for automated radiation adaptation in lung cancer. Med Phys. 2017;44(12):6690–705.
4. Daskalaki E, Diem P, Mougiakakou SG. Model-free machine learning in biomedicine: Feasibility study in type 1 diabetes. PLoS ONE. 2016;11(7):e0158722.
5. Escandell-Montero P, Chermisi M, Martinez-Martinez JM, et al. Optimization of anemia treatment in hemodialysis patients via reinforcement learning. Artif Intell Med. 2014;62(1):47–60.
6. Shortreed SM, Laber E, Lizotte DJ, Stroup TS, Pineau J, Murphy SA. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Mach Learn. 2011;84(1-2):109–36.
7. Nagaraj V, Lamperski A, Netoff TI. Seizure control in a computational model using a reinforcement learning stimulation paradigm. Int J Neural Syst. 2017;27(07):1750012.
8. Padmanabhan R, Meskin N, Haddad WM. Closed-loop control of anesthesia and mean arterial pressure using reinforcement learning. Biomed Signal Process Control. 2015;22:54–64.
9. Weng W-H, Gao M, He Z, Yan S, Szolovits P. Representation and reinforcement learning for personalized glycemic control in septic patients. arXiv preprint arXiv:1712.00654. 2017.
10. Ng AY, Russell SJ, et al. Algorithms for inverse reinforcement learning. In: ICML. Omnipress; 2000. p. 663–670.
11. Abbeel P, Ng AY. Apprenticeship learning via inverse reinforcement learning. In: ICML. Omnipress; 2004. p. 1.
12. Kuderer M, Gulati S, Burgard W. Learning driving styles for autonomous vehicles from demonstration. In: ICRA. New York: IEEE; 2015. p. 2641–2646.
13. Shiarlis K, Messias J, Whiteson S. Inverse reinforcement learning from failure. In: AAMAS. New York: ACM; 2016. p. 1060–1068.
14. Li K, Burdick JW. Inverse reinforcement learning in large state spaces via function approximation. arXiv preprint arXiv:1707.09394. 2017.
15. Johnson AE, Ghassemi MM, Nemati S, Niehaus KE, Clifton DA, Clifford GD. Machine learning and decision support in critical care. Proc IEEE. 2016;104(2):444–66.
16. Vincent R. Reinforcement learning in models of adaptive medical treatment strategies. PhD thesis: McGill University Libraries; 2014.
17. Herman M, Gindele T, Wagner J, Schmitt F, Burgard W. Inverse reinforcement learning with simultaneous estimation of rewards and dynamics. In: Artificial Intelligence and Statistics. New York: ACM; 2016. p. 102–110.
18. Prasad N, Cheng L-F, Chivers C, Draugelis M, Engelhardt BE. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. arXiv preprint arXiv:1704.06300. 2017.
19. Nemati S, Adams R. Identifying outcome-discriminative dynamics in multivariate physiological cohort time series. Adv State Space Methods Neural Clin Data. 2015;283.
20. Nemati S, Ghassemi MM, Clifford GD. Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. In: EMBC. New York: IEEE; 2016. p. 2978–2981.
21. Sandu C, Popescu D, Popescu C. Post cardiac surgery recovery process with reinforcement learning. In: ICSTCC. New York: IEEE; 2015. p. 658–661.
22. Raghu A, Komorowski M, Ahmed I, Celi L, Szolovits P, Ghassemi M. Deep reinforcement learning for sepsis treatment. arXiv preprint arXiv:1711.09602. 2017.
23. Ziebart BD, Maas AL, Bagnell JA, Dey AK. Maximum entropy inverse reinforcement learning. In: AAAI, vol 8. Cambridge: AAAI Press; 2008. p. 1433–1438.
24. Ramachandran D, Amir E. Bayesian inverse reinforcement learning. Urbana. 2007;51(61801):1–4.
25. Levine S, Popovic Z, Koltun V. Nonlinear inverse reinforcement learning with Gaussian processes. In: NIPS. Cambridge: MIT Press; 2011. p. 19–27.
26. Johnson AE, Pollard TJ, Shen L, Li-wei HL, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035.
27. Ernst D, Geurts P, Wehenkel L. Tree-based batch mode reinforcement learning. J Mach Learn Res. 2005;6(Apr):503–56.
28. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–232.
29. Kangasrääsiö A, Kaski S. Inverse reinforcement learning from incomplete observation data. arXiv preprint arXiv:1703.09700. 2017.