
Risk-averse autonomous systems: A brief history and recent developments from the perspective of optimal control

Yuheng Wang and Margaret P. Chapman¹

Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, 10 King's College Road, Toronto, Ontario M5S 3G8, Canada

Abstract

We offer a historical overview of methodologies for quantifying the notion of risk and optimizing risk-aware autonomous systems, with emphasis on risk-averse settings in which safety may be critical. We categorize and present state-of-the-art approaches, and we describe connections between such approaches and ideas from the fields of decision theory, operations research, reinforcement learning, and stochastic control. The first part of the review focuses on model-based risk-averse methods. The second part discusses methods that blend model-based and model-free techniques for the purpose of designing policies with improved adaptive capabilities. We conclude by highlighting areas for future research.

Keywords: Autonomous systems, Intelligent systems, Risk analysis, Optimal control, Safe learning

1. Introduction

Although the precise meaning of risk is domain-dependent, there is an emerging appreciation for the importance of incorporating information about the chance and extent of unfortunate outcomes into risk assessments. For instance, the International Organization for Standardization (ISO) Risk Management Guidelines suggest that risk assessments examine "the likelihood of events and consequences" as well as "the nature and magnitude of consequences" [1, Sec. 6.4.3].

∗Submitted for the Special Issue on Risk-aware Autonomous Systems: Theory and Practice.
Email addresses: [email protected], [email protected] (Yuheng Wang and Margaret P. Chapman)
¹Y.W. and M.P.C. are with the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto.

Preprint submitted to Journal of Artificial Intelligence September 21, 2021

arXiv:2109.08947v1 [cs.AI] 18 Sep 2021

In the context of autonomous systems, two main paradigms have been used historically for quantifying and managing the risk that arises due to system behavior: the risk-neutral paradigm and the worst-case (i.e., robust) paradigm. The risk-neutral paradigm quantifies the risk of uncertain outcomes in terms of the average outcome, whereas the worst-case paradigm quantifies the risk of uncertain outcomes in terms of the most harmful outcome. A standard problem within the risk-neutral paradigm is to compute a policy that minimizes an average cumulative cost (i.e., maximizes an average cumulative reward) that arises as a system evolves over time. Comprehensive presentations of this problem from the reinforcement learning and stochastic control communities can be found in [2] and [3], respectively. A fundamental limitation of the risk-neutral paradigm is that it ignores all features of a cost distribution other than the average cost.

By quantifying the average cost only, the risk-neutral paradigm cannot encode preferences for distributions with more specific features, including a reduced average cost above a given quantile or a reduced standard deviation (Fig. 1). Such features may be important for systems with safety goals, e.g., systems that should avoid operating within particular regions of their respective state spaces. Systems with safety goals are ubiquitous. Stormwater infrastructure systems must minimize overflows; networks of human and autonomous vehicles must avoid collisions; cancer treatments must control cancer while minimizing adverse side effects (Fig. 2).

Figure 1: A probability density function of an absolutely continuous random variable Z representing a cost. Z can be a random cost that is incurred as an autonomous system operates over time. Z being absolutely continuous means that it admits a density function. Different characteristics of the distribution of Z are shown, including the magnitude of the standard deviation √variance(Z) (a), the magnitude of the expectation of |Z − E(Z)| (b), and the magnitude of the expectation of max{Z − E(Z), 0} (c). The illustration emphasizes that √variance(Z) and E(|Z − E(Z)|) do not differentiate between realizations above and below the mean, whereas E(max{Z − E(Z), 0}) does make this distinction. We also depict the left-side (1 − α)-quantile VaRα(Z) and the Conditional Value-at-Risk CVaRα(Z) at level α in the case when α is small enough so that E(Z) + √variance(Z) ≤ VaRα(Z). The area of the shaded region under the curve is equal to α ∈ (0, 1), and the area under the entire curve is equal to 1. We provide more details about VaRα(Z) and CVaRα(Z) in Sec. 2.


Figure 2: Examples of systems from different domains with safety goals. a) Urban water systems are required to minimize overflows while meeting other goals, such as providing sufficient irrigation for green roofs or improving water quality. b) Networks of autonomous and human vehicles must avoid collisions, but if collisions are unavoidable, then the accelerations that humans experience should be minimized to reduce the severity of injuries. c) Cancer treatments must manage the growth of cancer cells without posing too much harm to healthy cells. d) Agriculture systems must generate sufficient crop yields while conserving water and energy. For instance, energy may be required to pump water from reservoirs.

Traditionally, one computes policies for autonomous systems with safety goals by framing a problem within the worst-case paradigm. A standard problem within this paradigm is to compute a policy that minimizes a maximum cost, which depicts how bounded adversarial disturbances can inhibit satisfactory operation of an autonomous system [4, 5, 6, 7, 8, 9]. The worst-case paradigm may be useful when 1) it is possible to characterize the bounds of disturbances with a sufficient degree of certainty, 2) such bounds do not change in unanticipated ways over time, and 3) a policy can be computed that is not too cautious to apply in the real world. In particular, this paradigm has been used with tremendous success in the aerospace industry. A fundamental concern about the worst-case paradigm, however, is the typical assumption of bounded adversarial disturbances, which may not be suitable for every application.

The risk-neutral and worst-case paradigms represent two distinct perspectives regarding the future, that is, a future of averages versus a future of misfortune. An intermediary paradigm, called the risk-averse paradigm, forms a bridge between the risk-neutral and worst-case paradigms.²

²We use the term risk averse because we assume that uncertainty is more harmful than beneficial. The term risk seeking assumes that uncertainty is more beneficial than harmful. In contrast, the term risk aware (synonym: risk sensitive) is more general, as it refers to either risk averse or risk seeking. In this paper, we focus on risk-averse methods for autonomous systems and occasionally discuss risk-seeking methods where appropriate.


A typical problem within the risk-averse paradigm is to compute a policy that minimizes a criterion that encodes a preference for cost distributions with particular desired characteristics. For example, one may encode a preference for cost distributions with a small

1. linear combination of mean and variance,

2. average cost subject to a probabilistic constraint,

3. expected utility, where the utility function is selected based on a human's subjective preferences,

4. quantile, or

5. average cost above a quantile.

The five listed items are examples of different approaches for quantifying risk. Such approaches have diverse advantages and disadvantages regarding axiomatic justification, numerical tractability, and explainability, which we introduce in Sec. 2.

Our contribution is to present a unifying overview about what risk can mean and how this notion can be incorporated into the control of autonomous systems. We view risk as a complex notion that is difficult to define and optimize in general. We consider the most useful definition of risk to depend on the application of interest, whereas it is typical to define risk in terms of expectation, variance, or probability. We discuss the origins of approaches for quantifying risk and present core methodologies for the optimization of autonomous systems with respect to risk-averse criteria. We summarize the literature that we cover in Fig. 3 and Fig. 4.

We wish to bring the reader's attention to related topics that we do not emphasize in this review, including safe exploration, safe model predictive control, and inverse reinforcement learning. In 2014, Pecka and Svoboda [10] summarized safe exploration approaches in reinforcement learning (RL) using the classification from [11], which includes learning from demonstrations [12]. In 2015, García and Fernández surveyed RL algorithms that incorporate safety considerations into policy synthesis, with emphasis on model-free methods [13, Table 2]. In 2020, Hewing et al. reviewed approaches that combine learning with model predictive control and emphasized safety issues [14]. Recent surveys on inverse reinforcement learning can be found in [15, 16]. In contrast, our survey presents methodologies for quantifying risk aversion (Sec. 2) and optimizing the behavior of autonomous systems with respect to risk-averse criteria, with emphasis on the model-based viewpoint (Sec. 3). In addition to discussing contributions from the RL community, we present contributions from the stochastic control and operations research communities. Moreover, we discuss connections between the risk-averse autonomous systems literature and the safe learning literature, which includes a blend of model-based and model-free methods (Sec. 4).



Relevance to Artificial Intelligence. The broad problem of how to analyze and optimize the behavior of autonomous systems has significant relevance to the field of Artificial Intelligence. For example, designing algorithms to ensure that robots can safely and efficiently traverse uneven terrain is an instance of this problem. Moreover, optimizing treatment protocols for a particular cancer patient using data from their previous treatment cycle, expert knowledge from an oncologist, and machine learning techniques is another instance of this problem. Our survey offers a historical and modern overview of research at the intersection of risk analysis and autonomous systems. Naturally, this intersection is quite interdisciplinary and connects ideas from many domains, including reinforcement learning, decision theory, operations research, and stochastic control. We are hopeful that our survey about risk-averse autonomous systems will inspire future theoretical and applied research that enhances the operation of systems in practice and therefore the quality of life.

Organization. Sec. 2 presents approaches for quantifying risk aversion using maps that evaluate random variables, called risk functionals. Sec. 3 describes methods for optimizing systems with respect to risk-averse criteria, with emphasis on model-based methods. Sec. 4 presents methods that permit policies to adapt to changing situations in real time, which involves a combination of model-free and model-based methods. Sec. 5 concludes the paper with a discussion of open questions that are inspired by this review.

Notation. Rn+ := {x ∈ Rn : xi ≥ 0, i = 1, . . . , n} is the non-negative orthant in Rn. R∗ := R ∪ {−∞, +∞} is the extended real line. N := {1, 2, . . . } is the set of natural numbers. We introduce some measure-theoretic notation next. If M is a metrizable space, then BM is the Borel sigma algebra on M.³ The notation (Ω, F, µ) denotes a generic probability space. A random variable on (Ω, F, µ) is a function Z : Ω → R that is measurable relative to F and BR. If p ∈ [1, +∞), an element of Lp := Lp(Ω, F, µ) is a class of random variables with finite pth moment that are equal almost everywhere with respect to µ. However, it is conventional to abuse notation and use Z ∈ Lp to specify a random variable with finite pth moment. Z having finite pth moment means that the Lp-norm of Z, ‖Z‖p := (∫Ω |Z|p dµ)^{1/p}, is finite. The notation E(Z) := ∫Ω Z dµ denotes the expectation of Z. In what follows, we present terminology that relates to autonomous systems. A random cost is a random variable in which smaller realizations are better and larger realizations are worse. N ∈ N denotes the length of a finite-time horizon. Π is a class of history-dependent policies. S is a Borel state space, and A is a Borel action space.⁴ The abbreviation DP stands for dynamic programming, MDP stands for Markov decision process, and CDF stands for cumulative distribution function.

³An example of a metrizable space is Rn with the Euclidean metric. More generally, any metric space is a metrizable space. Intuitively, BM is a large collection of subsets of M that are regular enough to be measured. Mathematical details can be found in [3, Prop. 7.11, p. 118], for example.

⁴For instance, Rn with the Euclidean metric is a Borel space. Any B ∈ BRn with the Euclidean metric is a Borel space. For a formal definition of Borel spaces, the reader may refer to [3, Def. 7.7, p. 118].



2. Methods for Quantifying Risk Aversion

Suppose that Z is a random cost that is incurred as an autonomous system evolves over time. For example, if the system is a vehicle, then Z may be the maximum deviation between a desired path and the true path during a future trip. If the system is an urban water network, then Z may be the total volume of water that is released into a combined sewer during a future storm.⁵ If the system is a cancer tumor, then Z may depend on the size of the tumor, the magnitude of adverse side effects, and the doses of drugs during a future course of treatment.

The risk of a random outcome Z can be quantified by using a risk functional (also called a risk measure). The state-of-the-art axiomatic theory of risk functionals has been developed primarily in operations research and financial engineering. This theory has been motivated in particular by the challenge of optimizing financial portfolios despite uncertain markets. Specifically, a risk functional is a map from a set of random variables to the extended real line. A risk functional that characterizes a distribution under a pessimistic perspective that variability leads to harm is risk averse. In contrast, one that characterizes a distribution under an optimistic perspective that variability leads to benefit is risk seeking. The expectation is risk neutral, as it is indifferent to variability about the mean and therefore invokes neither perspective. In this paper, we are interested in quantifying risk aversion, which is relevant for autonomous systems with safety goals that operate under uncertainty (Fig. 2). In Sec. 2.1, we present an overview of common risk functionals, and in Sec. 2.2, we provide an extended discussion regarding their advantages and disadvantages.⁶

2.1. Common Risk Functionals

In 1994, Heger introduced several risk functionals to the machine learning community, including Expected Utility, Mean-Variance, and a functional that resembles Value-at-Risk [5], which are still popular in the literature. More broadly, common classes of risk functionals include: utility-based functionals, functionals that quantify deviations from the mean, quantile-based functionals, and nested risk functionals.

⁵A combined sewer can release a combination of stormwater and untreated wastewater into a natural waterway due to a sewer network's restricted capacity. Combined sewers can be found in older cities, including San Francisco and Toronto. In prior work, we developed a risk-averse trajectory-wise method for safety analysis, and we applied the method to the problem of quantifying the risk of combined sewer overflows [17, 18].

⁶Our presentation in this section focuses on risk functionals, which map measurable functions to the extended real line. A probability measure maps measurable sets, called events, to [0, 1]. Since a probability measure evaluates measurable sets, it is not a risk functional. However, it is common to use probabilities to assess the risk of autonomous systems. We describe methods that employ probabilistic risk assessments in Sec. 3.2 and Sec. 3.3.5.


Most of the risk functionals that we present in Sec. 2.1 are also presented by [19, Sec. 6.3.2], from which we take inspiration. However, our presentation is less technical and incorporates connections to the autonomous systems literature.

2.1.1. Expected Utility

The notion of a utility function is based on the idea that two different people can perceive the same outcome, event, or reward, etc. differently [20, Chap. 1]. This notion was first studied formally by D. Bernoulli in 1738, whose paper was translated from Latin to English in 1954 [21]. von Neumann and Morgenstern analyzed the use of Expected-Utility functionals to model subjective preferences from a decision-theoretic perspective in 1944 [22].

The Expected Utility of a non-negative random cost Z takes the form E(φ(Z)), where φ : R+ → R is a utility function. If φ is convex, then φ(z) represents how a risk-averse decision-maker perceives a realization z ∈ R+ of the random cost Z; analogously, if φ is concave, then φ(z) represents how a risk-seeking decision-maker perceives a realization z of Z [23, 20]. It is typical to assess a related quantity called a certainty equivalent φ−1(E(φ(Z))), where φ is strictly increasing and continuous and the inverse φ−1 exists [23, p. 107]. In particular, the inverse operation ensures that the units of φ−1(E(φ(Z))) are equivalent to the units of Z, which may be useful for practical applications. For simplicity, we informally call φ−1(E(φ(Z))) an Expected-Utility functional as well.

2.1.2. Exponential Utility

The Exponential-Utility functional of a non-negative random cost Z at level θ ∈ R with θ ≠ 0,

ρEU,θ(Z) := −(2/θ) log E(exp(−(θ/2) Z)), (1)

is a special case of φ−1(E(φ(Z))), where φ(z) = exp(−(θ/2) z). The setting of θ < 0 represents a risk-averse perspective, whereas the setting of θ > 0 represents a risk-seeking perspective. In particular, if θ < 0, then φ(z) = exp(−(θ/2) z) overestimates a large realization z ∈ R+ of Z using an exponential transformation. Under a restricted set of conditions, ρEU,θ(Z) approximates a linear combination of the mean E(Z) and the variance variance(Z). Specifically, it can be shown (under appropriate conditions) that

lim_{θ→0+} ρEU,θ(Z) = lim_{θ→0−} ρEU,θ(Z) = E(Z), (2)

and if θ is sufficiently close to 0, then a Taylor expansion yields ρEU,θ(Z) ≈ E(Z) − (θ/4) variance(Z) [24, p. 765]. We provide sufficient conditions to derive (2) in a footnote.⁷ We discuss the optimization of Exponential-Utility performance criteria for MDPs in Sec. 3.3.1.
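To make (1) and (2) concrete, the following minimal Python sketch (our own illustration, not code from the cited references) estimates ρEU,θ(Z) by Monte Carlo for a non-negative cost and compares it with the Taylor approximation E(Z) − (θ/4) variance(Z); the function name rho_eu and the Gamma-distributed cost are arbitrary choices.

```python
import numpy as np

def rho_eu(z, theta):
    """Sample estimate of Exponential Utility (1):
    -(2/theta) * log E[exp(-(theta/2) Z)]; theta < 0 is risk-averse."""
    return -(2.0 / theta) * np.log(np.mean(np.exp(-(theta / 2.0) * z)))

rng = np.random.default_rng(0)
z = rng.gamma(shape=2.0, scale=1.0, size=500_000)  # a non-negative random cost

for theta in (-0.5, -0.1, -0.01):
    taylor = np.mean(z) - (theta / 4.0) * np.var(z)  # approximation near theta = 0
    print(theta, rho_eu(z, theta), taylor)
# As theta approaches 0 from below, both columns approach E(Z) = 2,
# which is consistent with (2).
```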

⁷Suppose that Z ∈ L2 := L2(Ω, F, µ), Z is non-negative, and there exist a ∈ R and b ∈ R such that a < 0 < b and exp(−(θ/2)Z) ∈ L2 for each θ ∈ [a, b]. One can use [25, Thm. 2.27, p. 56] and Hölder's Inequality to show that (d/dθ) E(exp(−(θ/2)Z)) = −(1/2) E(Z exp(−(θ/2)Z)), lim_{θ→0} E(Z exp(−(θ/2)Z)) = E(Z), and lim_{θ→0} E(exp(−(θ/2)Z)) = 1. Then, one can complete the proof of (2) using L'Hôpital's Rule.


2.1.3. Mean-Variance, Mean-Deviation, Mean-Upper-Semideviation

The following risk functionals provide different ways to quantify the spread of a distribution around the mean. Let p ∈ [1, +∞), Z ∈ Lp, and λ ≥ 0 be given. If p = 2, define a linear combination of the mean and variance of Z as follows:

ρMV,λ(Z) := E(Z) + λ · variance(Z). (3)

Assessing mean-variance trade-offs has a well-established history in the financial literature. The famous works by H. Markowitz from the 1950s focus on portfolio selection under competing mean-variance objectives and justify using variance as a measurement of financial risk [26, 27]. More recently, this foundation has motivated the development of methods for autonomous systems that are designed to

1. minimize the variance such that the mean must equal a particular value, which is an inexpensive computation if the system has linear dynamics and quadratic costs [28], and

2. minimize a linear combination of the mean and variance, which is typically an expensive optimization problem [29, Example 1].

The Mean-Deviation of Z is a linear combination of the expectation of Z and the Lp-norm of Z − E(Z),

ρMD,λ(Z) := E(Z) + λ‖Z − E(Z)‖p. (4)

The Mean-Upper-Semideviation of Z emphasizes positive deviations of Z with respect to E(Z) by assessing the Lp-norm of max{Z − E(Z), 0},

ρMU,λ(Z) := E(Z) + λ‖max{Z − E(Z), 0}‖p. (5)
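The following sample-based sketch, written by us for illustration, evaluates (3)-(5) on the same data set; the right-skewed lognormal cost and λ = 0.5 are arbitrary choices. Note that only the upper semideviation (5) ignores realizations below the mean, in line with Fig. 1.

```python
import numpy as np

def mean_variance(z, lam):
    """Mean-Variance (3): E(Z) + lam * variance(Z)."""
    return np.mean(z) + lam * np.var(z)

def mean_deviation(z, lam, p=2):
    """Mean-Deviation (4): E(Z) + lam * ||Z - E(Z)||_p."""
    return np.mean(z) + lam * np.mean(np.abs(z - np.mean(z)) ** p) ** (1.0 / p)

def mean_upper_semideviation(z, lam, p=2):
    """Mean-Upper-Semideviation (5): E(Z) + lam * ||max{Z - E(Z), 0}||_p."""
    return np.mean(z) + lam * np.mean(np.maximum(z - np.mean(z), 0.0) ** p) ** (1.0 / p)

rng = np.random.default_rng(1)
z = rng.lognormal(mean=0.0, sigma=0.75, size=200_000)  # right-skewed cost samples
for f in (mean_variance, mean_deviation, mean_upper_semideviation):
    print(f.__name__, round(f(z, lam=0.5), 3))
```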

2.1.4. Value-at-Risk

The Value-at-Risk (VaR) of Z represents the smallest cost in a given fraction of the largest costs (Fig. 1). Formally, the Value-at-Risk of Z ∈ L1 := L1(Ω, F, µ) is defined by, for α ∈ (0, 1),

VaRα(Z) := inf{z ∈ R : µ(Z ≤ z) ≥ 1 − α}. (6)

Note that FZ(z) := µ(Z ≤ z) is the cumulative distribution function (CDF) of Z, which is right-continuous and increasing [30, p. 209]. For a given α ∈ (0, 1), there may be different points z ≠ z′ such that FZ(z) = FZ(z′) = 1 − α, or there may be no such point. The quantity VaRα(Z) = FZ⁻¹(1 − α) is the generalized inverse CDF of Z at level 1 − α; this is the smallest value z ∈ R such that the condition FZ(z) ≥ 1 − α holds (6). Synonymously, VaRα(Z) is the left-side (1 − α)-quantile corresponding to the probability distribution of Z [31, p. 721]. Fig. 5 provides a visual explanation for VaRα(Z).



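As a sketch of definition (6), the following Python fragment (ours; the sample distribution is arbitrary) computes an empirical VaR as the smallest order statistic whose empirical CDF reaches 1 − α, and checks it against a library quantile routine.

```python
import numpy as np

def var_alpha(z, alpha):
    """Empirical Value-at-Risk (6): the left-side (1 - alpha)-quantile,
    i.e., the smallest sample value whose empirical CDF is >= 1 - alpha."""
    z_sorted = np.sort(z)
    k = int(np.ceil((1.0 - alpha) * len(z_sorted))) - 1  # first index meeting the CDF condition
    return z_sorted[max(k, 0)]

rng = np.random.default_rng(2)
z = rng.normal(loc=1.0, scale=2.0, size=1_000_000)
alpha = 0.05
print(var_alpha(z, alpha))          # approx. 1 + 2 * 1.645 = 4.29 for this Gaussian
print(np.quantile(z, 1.0 - alpha))  # agrees up to the interpolation convention
```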

2.1.5. Conditional Value-at-Risk

The Conditional Value-at-Risk (CVaR) of Z represents the expectation of Z in a given proportion of the worst cases (Fig. 1). CVaR was developed in its modern form in the early 2000s, and the key works underlying this development include [32, 33, 34]. Additional names for CVaR are Average Value-at-Risk, Expected Shortfall, and Expected Tail Loss [31]. CVaR has various representations, and these representations offer different insights into how the functional quantifies risk.

First, we introduce the representation that explains the name Conditional Value-at-Risk. Suppose that Z ∈ L1 := L1(Ω, F, µ), α ∈ (0, 1), and the CDF FZ(z) := µ(Z ≤ z) is continuous at the point z = VaRα(Z). Under these conditions, CVaRα(Z) is equivalent to E(Z | Z ≥ VaRα(Z)), which is the expectation of Z conditioned on the event that Z exceeds the VaRα(Z) [19, Thm. 6.2, p. 259]. This representation is depicted in Fig. 1, in the special case when Z is absolutely continuous.

The next representation that we present explains the name Average Value-at-Risk. If Z ∈ L1 and α ∈ (0, 1), then CVaRα(Z) equals an average of the Value-at-Risk in which the integral is taken with respect to the level of the VaR,

CVaRα(Z) = (1/α) ∫_{1−α}^{1} VaR_{1−s}(Z) ds.

A proof is provided by [19, Thm. 6.2]. The last representation is often presented as the definition of CVaR. This representation expresses CVaR in terms of a convex optimization problem, which is useful for risk-averse control of autonomous systems in particular (to be described in Sec. 3.3.3). If Z ∈ L1 and α ∈ (0, 1], then CVaRα(Z) is defined by

CVaRα(Z) := inf_{s∈R} ( s + (1/α) E(max{Z − s, 0}) ). (7)

It can be shown, e.g., see [19, p. 258] or [31, p. 721], that for any α ∈ (0, 1), a minimizer is VaRα(Z), and thus,

CVaRα(Z) = VaRα(Z) + (1/α) E(max{Z − VaRα(Z), 0}). (8)

Eq. (8) means that the CVaR at level α ∈ (0, 1) encodes risk in terms of a linear combination of VaRα(Z) and the average amount by which Z exceeds VaRα(Z). Moreover, one can use (7) to prove that CVaRα(Z) equals the expectation of Z if α = 1.
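The three representations above can be checked against each other numerically. The sketch below (our own; it assumes NumPy and SciPy are available, and uses a standard normal cost for concreteness) computes CVaR as a tail expectation, via the optimization form (7), and via formula (8); the three estimates agree up to sampling error.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cvar_tail(z, alpha):
    """Tail form: E(Z | Z >= VaR_alpha(Z)), valid when the CDF is continuous."""
    v = np.quantile(z, 1.0 - alpha)
    return np.mean(z[z >= v])

def cvar_opt(z, alpha):
    """Optimization form (7): inf_s { s + (1/alpha) E(max{Z - s, 0}) }."""
    return minimize_scalar(lambda s: s + np.mean(np.maximum(z - s, 0.0)) / alpha).fun

def cvar_via_var(z, alpha):
    """Formula (8), using the minimizer s* = VaR_alpha(Z)."""
    v = np.quantile(z, 1.0 - alpha)
    return v + np.mean(np.maximum(z - v, 0.0)) / alpha

rng = np.random.default_rng(3)
z = rng.standard_normal(500_000)
for f in (cvar_tail, cvar_opt, cvar_via_var):
    print(f.__name__, round(f(z, alpha=0.05), 3))  # all approx. 2.06 for N(0, 1)
```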

2.1.6. Nested Risk Functionals

Consider a risk functional ρ applied to a cumulative cost Z := Z0 + Z1 + Z2 of the form

ρ(Z) := ρ0(Z0 + ρ1(Z1 + ρ2(Z2))). (9)

For simplicity of presentation, we consider a nested risk functional for a time horizon of length N = 2 in (9), whereas more generally, we may consider ρ(Z) = ρ0(Z0 + ρ1(Z1 + · · · + ρN(ZN))) with Z = Z0 + Z1 + · · · + ZN and N ∈ N. The problem of minimizing a random cost that is assessed using a nested risk functional for an autonomous system has been studied by several authors over the past decade. Example works between 2010 and 2021 include [35, 36, 37, 38, 39]. We describe methodology for this problem in Sec. 3.3.4.
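To illustrate how (9) is evaluated, the following sketch (our own construction on a hypothetical two-stage scenario tree; the costs, probabilities, and the choice ρt = CVaR at level α are arbitrary) applies the backward recursion described in Sec. 3.3.4: the innermost risk is evaluated per stage-1 node, and the result is rolled back to the root. Since CVaR is translation equivariant, the deterministic cost Z0 is simply added.

```python
import numpy as np

def cvar_discrete(values, probs, alpha):
    """CVaR at level alpha for a discrete distribution via (7); the objective is
    piecewise linear and convex in s, so a minimizer lies among the support points."""
    return min(s + np.dot(probs, np.maximum(values - s, 0.0)) / alpha for s in values)

alpha = 0.3
z0 = 1.0
# Hypothetical tree: (branch probability, z1, stage-2 values, stage-2 cond. probabilities)
branches = [
    (0.5, 2.0, np.array([0.0, 4.0]), np.array([0.8, 0.2])),
    (0.5, 3.0, np.array([1.0, 2.0]), np.array([0.5, 0.5])),
]

# Backward recursion for rho(Z) = rho0(Z0 + rho1(Z1 + rho2(Z2))), eq. (9):
inner = np.array([z1 + cvar_discrete(z2, p2, alpha)  # Z1 + rho2(Z2), per stage-1 node
                  for (_, z1, z2, p2) in branches])
probs1 = np.array([b[0] for b in branches])
print(z0 + cvar_discrete(inner, probs1, alpha))      # rho0 over the stage-1 nodes
```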

2.2. Advantages and Disadvantages

Having presented common risk functionals, now we discuss their respective advantages and disadvantages in terms of decision-theoretic axioms, computational tractability, and explainability.

2.2.1. Axiomatic Considerations

Some risk functionals have a strong axiomatic foundation, including CVaR on the domain L1 with the level α ∈ (0, 1] and Mean-Upper-Semideviation on the domain Lp with p ∈ [1, +∞) and the parameter λ ∈ [0, 1] [19, Example 6.20]. In particular, these two functionals satisfy four desirable properties that define the class of coherent risk functionals: monotonicity, subadditivity, translation equivariance, and positive homogeneity.⁸ The class of coherent risk functionals was originally proposed by Artzner et al. [40], who justified the importance of the four properties in financial engineering applications. Also, these properties can be justified more broadly. Monotonicity means that a cost that is larger than another almost everywhere incurs a larger risk, which is a rather logical condition. Subadditivity, ρ(Z1 + Z2) ≤ ρ(Z1) + ρ(Z2), means that "a merger does not create extra risk" in financial engineering [40]. More generally, a "merger" can be viewed as a system, network, or sequential process of non-adversarial interacting components, which may be relevant for applications in robotics, healthcare systems, renewable energy, natural resources management, etc. The properties of translation equivariance and positive homogeneity indicate that the risk of a translated or scaled random cost is related to the risk of the untransformed random cost in a natural way, that is, through the particular translation or scaling. Unfortunately, Value-at-Risk, Exponential Utility, and Mean-Variance, for example, are not coherent. Value-at-Risk (6) is not subadditive [40, 41]. Exponential Utility (1) is not positively homogeneous.⁹

⁸Here, we present these properties formally. Let Zi ∈ Lp := Lp(Ω, F, µ), p ∈ [1, +∞], a ∈ R, β ∈ R+, and ρ : Lp → R be given. The risk functional ρ having the domain Lp means that ρ evaluates random variables whose pth moments are finite and if Z1 equals Z2 µ-a.e., then ρ(Z1) = ρ(Z2). The abbreviation µ-a.e. specifies a property that holds almost everywhere with respect to the probability measure µ. ρ is monotonic if and only if Z1 ≤ Z2 µ-a.e. =⇒ ρ(Z1) ≤ ρ(Z2). ρ is subadditive if and only if ρ(Z1 + Z2) ≤ ρ(Z1) + ρ(Z2). ρ is translation equivariant if and only if ρ(Z1 + a) = ρ(Z1) + a. ρ is positively homogeneous if and only if ρ(βZ1) = βρ(Z1).

⁹For example, consider ρEU,θ(Z) (1) with θ = −2, where Z equals 1 with probability 0.2 and Z equals 2 with probability 0.8. If β = 2, then ρEU,θ(βZ) ≈ 3.81, but βρEU,θ(Z) ≈ 3.73.


Mean-Variance (3) with λ > 0 is not positively homogeneous and not monotonic.¹⁰
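The counterexamples in footnotes 9 and 10 are easy to reproduce numerically; the short script below (ours) evaluates both directly.

```python
import numpy as np

def rho_eu(values, probs, theta):
    """Exponential Utility (1) for a discrete random cost."""
    return -(2.0 / theta) * np.log(np.dot(probs, np.exp(-(theta / 2.0) * values)))

# Footnote 9: Exponential Utility is not positively homogeneous.
values, probs, theta, beta = np.array([1.0, 2.0]), np.array([0.2, 0.8]), -2.0, 2.0
print(rho_eu(beta * values, probs, theta), beta * rho_eu(values, probs, theta))
# approx. 3.81 versus 3.73, so rho(beta * Z) != beta * rho(Z)

# Footnote 10: Mean-Variance (3) with lam = 4 is not monotonic.
# Z1 ~ Uniform(-1, 1): E = 0, variance = 1/3, so rho_MV(Z1) = 4/3.
# Z2 = 1.2 with probability 1: rho_MV(Z2) = 1.2, although Z1 <= Z2 everywhere.
lam = 4.0
print(0.0 + lam * (1.0 / 3.0), 1.2 + lam * 0.0)  # 1.333... > 1.2
```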

2.2.2. Computational Tractability

If Z is a cumulative random cost incurred by an MDP, the Expectation of Z and the Exponential Utility of Z can be optimized by formulating a DP recursion on the state space S. A performance criterion that is defined using a nested risk functional can be optimized in this way as well, e.g., see [35]. However, optimizing performance criteria that are defined using other risk functionals, including CVaR, Mean-Variance, Mean-CVaR, and Expected Utility (for a non-exponential utility function), requires more expensive computations in general [18, 23, 29, 42, 43, 44, 45, 46]. To optimize a multi-stage random cost that is evaluated using one of these risk functionals for an MDP, one solves a series of sub-problems, where each sub-problem typically depends on the history of the process from time zero. This type of problem can require great computational resources (e.g., a long runtime, high energy consumption, a computing cluster). In contrast, each sub-problem corresponding to the optimization of the Expectation or the Exponential Utility of a cumulative cost for an MDP depends on the history only through the current state, which is more tractable. In addition to the inconvenience of long run times, large-scale computations may have negative environmental consequences; e.g., see [47] for a recent study about the carbon emissions of computing.
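For intuition about why the Exponential-Utility case is tractable, the following sketch (our own toy construction with randomly generated transition probabilities and costs, not code from the references) performs a multiplicative DP recursion on the state space of a small finite MDP. For θ < 0, minimizing (1) over policies is equivalent to minimizing E[exp(−(θ/2)Z)], and this expectation factors stage by stage, so each backward step depends on the history only through the current state.

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, N, theta = 5, 3, 10, -1.0        # small hypothetical MDP; theta < 0 (risk-averse)

P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)        # P[x, a, x'] transition probabilities
c = rng.random((nS, nA))                 # stage cost c(x, a)

V = np.ones(nS)                          # terminal value exp(0) = 1 (no terminal cost)
policy = np.zeros((N, nS), dtype=int)
for t in reversed(range(N)):
    Q = np.exp(-(theta / 2.0) * c) * (P @ V)  # (nS, nA): stage factor times E[V(X_{t+1})]
    policy[t] = Q.argmin(axis=1)
    V = Q.min(axis=1)

rho_opt = -(2.0 / theta) * np.log(V)     # optimal Exponential Utility per initial state
print(rho_opt)
```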

In prior work, we optimized a cumulative random cost that was evaluated using Exponential Utility or CVaR for a thermostatic regulator with S ⊆ R and a time horizon of length N = 12 and a stormwater system with S ⊆ R² and a time horizon of length N = 48 [46]. While our code was not designed for efficiency specifically, a qualitative comparison of resources highlights the disadvantage of MDP problems that require history-dependent algorithms. The computations for the one-dimensional system required 2 minutes for Exponential Utility and 25 minutes for CVaR on 4 cores; the computations for the two-dimensional system required 10 minutes for Exponential Utility and 136 hours for CVaR on 30 cores [46, Table IV]. We present numerical methods for risk-averse MDP problems in more detail in Sec. 3.

2.2.3. Explainability

CVaR, Value-at-Risk, and Mean-Variance have intuitive interpretations, whereas the interpretations of other risk functionals, including nested risk functionals, Mean-Deviation, Mean-Upper-Semideviation, and Exponential Utility, are less straightforward. In particular, if Z is a random variable with finite first moment and its cumulative distribution function is continuous, then the CVaR of Z at level α ∈ (0, 1] is the expectation of Z in the α · 100% worst cases. The parameter α in the definition of CVaR (7) has a quantitative interpretation as a fraction of the worst-case realizations of Z, which is illustrated when Z admits a density in Fig. 1. In contrast, the parameter θ in the definition of Exponential Utility (1) has a qualitative interpretation as a weight on the variance under specific conditions, which typically include θ being sufficiently near zero. More generally, the class of Expected-Utility functionals has a qualitative interpretation of representing subjective preferences (Sec. 2.1.1). Connections between nested risk functionals and distributionally robust performance criteria have been studied by [39, 31], for example; in [31], see Proposition 4.1 in particular. Risk functionals such as VaR and CVaR encode the notion of risk in terms of quantiles, whereas Mean-Deviation, Mean-Upper-Semideviation, Mean-Variance, and Exponential Utility encode the notion of risk in terms of deviations from the Expectation.

¹⁰Consider ρMV,λ(Z1) and ρMV,λ(Z2) (3) with λ = 4, where Z1 is uniformly distributed on (−1, 1) and Z2 equals 1.2 with probability 1. If β = 2, then ρMV,λ(βZ1) = 16/3, but βρMV,λ(Z1) = 8/3. While Z1 ≤ Z2 everywhere, it holds that ρMV,λ(Z1) = 4/3 > ρMV,λ(Z2) = 1.2.



A preferred approach for encoding risk is application-dependent. For example, during a typical drive of an autonomous vehicle, penalizing the variance of the distance between a desired path and the true path may be preferable, but penalizing the maximum acceleration experienced by humans during a collision, or a near-collision, may be critical to alleviate the severity of injuries. For another example, minimizing the CVaR of the maximum flood level may be useful for designing stormwater infrastructure systems [18]. However, minimizing the Mean-Variance of the flood level is not appropriate because the variance does not differentiate between deviations above and below the mean, and deviations below the mean are not important for flood control.

The intuitive interpretation of CVaR and its strong axiomatic justification have contributed to its growing popularity in recent years in the field of risk-averse autonomous systems. In contrast, prior to the early 2000s, Exponential Utility tended to be the risk functional of choice, and it is still popular. In the next section, we present an overview of numerical methods for optimizing risk-averse autonomous systems, which includes historical and recent developments with emphasis on model-based approaches.

3. Optimizing the Behavior of Autonomous Systems with respect to Risk-Averse Criteria

The problem of optimizing the behavior of autonomous systems with safety considerations despite uncertainties has received considerable research attention from various academic circles, starting at least in the 1970s. As we introduced previously, two main paradigms have emerged from this research activity, which may be called the risk-neutral paradigm and the worst-case paradigm. The somewhat secondary risk-averse paradigm forms a bridge between the two mainstream paradigms; a risk-averse problem typically reduces to a risk-neutral or worst-case problem as a special case. We organize this section into three parts. The first part concerns the worst-case paradigm, the second part concerns the risk-neutral paradigm (in the setting of systems with safety considerations), and the third part concerns the risk-averse paradigm.


In the first part, we present methods that use worst-case analyses to quantify risk and then optimize system behavior with respect to this risk assessment. The second part focuses on methods that use probabilistic analyses, and the last part describes methods that use risk functionals to optimize system behavior.

To facilitate our presentation, we briefly describe two different models for autonomous systems. A non-stochastic system is affected by bounded adversarial disturbances without probabilistic descriptions. In contrast, a stochastic system is affected by disturbances with probabilistic descriptions that need not be adversarial or bounded. The class of systems that is more useful depends on the application of interest. For example, if the goal is to design computer software to protect private consumer data, assuming adversarial disturbances to represent attackers may be appropriate. On the other hand, if the goal is to design a green roof to reduce ambient air temperature, assuming non-adversarial stochastic disturbances (i.e., random noise) to represent uncertain precipitation patterns may be suitable. Typically, a worst-case analysis is used to quantify risk for a non-stochastic system. In contrast, a probabilistic analysis or an analysis based on risk functionals is used to quantify risk for a stochastic system. An example of a stochastic system is a Markov decision process (MDP), which is a commonly used model in reinforcement learning, stochastic control, and operations research.

3.1. Using Worst-Case Analyses to Quantify Risk for Autonomous Systems

A standard worst-case analysis involves computing a policy that minimizes a maximum deterministic cost subject to assumptions about the dynamics of a non-stochastic system. This type of analysis can be used to compute subsets of the state space in which a system is guaranteed, at least in principle, to stay inside a desired region or reach a target within a given time horizon. The first problem concerns the verification of an invariance property, whereas the second problem concerns the verification of a reachability property (Fig. 6). Traditionally, exact knowledge of the dynamics equations and disturbance bounds is required for the guarantee to be true. Due to the underlying assumption of adversarial disturbances, a worst-case analysis is conservative but can be useful when constraint satisfaction is paramount, well-understood low-dimensional models exist, and disturbance bounds can be characterized accurately a priori.

One of the first reachability analysis methods for discrete-time non-stochastic systems was developed using indicator functions by Bertsekas and Rhodes in 1971 [4]. Methods have been devised for computing invariant sets or reaching targets using variations of sum-of-squares programming [48, 49], linear matrix inequalities [50, 51, 52, 53], and geometric approaches [54, 55, 56]. A reachability analysis method for continuous-time non-linear non-stochastic systems based on dynamic programming (DP) was developed in the mid-2000s using Hamilton-Jacobi partial differential equations [57, 8, 58, 59, 9]. While DP-based methods cannot scale to high-dimensional systems (e.g., robots with many joints, a city-wide power grid, a traffic network), some classes of systems have structural properties that permit more efficient DP-based computations, e.g., see [59].


A spectrum of desirable system behaviors can be specified using temporal logic. Originally, this framework was proposed for verifying computer programs in the 1970s [60], and a concise modern summary is provided by [61]. Requiring a system to remain inside a particular region always or to reach a target ultimately are examples of temporal logic specifications. Additional examples include requiring a system to exhibit a behavior infinitely many times, remain in a given location until some event occurs, and always visit a particular location after visiting another location. Formal methods with temporal logic specifications have been devised for non-stochastic discrete-time traffic flow networks [61] and non-stochastic continuous-time systems applied to robotics [62].

3.2. Using Probabilistic Analyses to Quantify Risk for Autonomous Systems

Systems that are affected by unbounded disturbances may benefit from risk assessments, but a worst-case analysis is not designed for this setting. Fortunately, methods have been devised to assess and optimize risk for stochastic systems, in which the notion of risk is described in terms of probabilities [63, 64, 11, 65, 66, 67, 68]. In the early 2000s, Geibel and colleagues formulated a constrained MDP problem, in which risk is defined in terms of the probability of the system entering a dangerous state [63, 64]. The authors proposed a variation of Q-learning to solve the problem of interest approximately [63, 64]. Building on this work, in 2012, García and Fernández assessed risk in terms of the probability of entering an unexplored, and therefore potentially dangerous, state [11, Definition 8]. Their approach for permitting exploration involved perturbing demonstrations from an expert with noise [11].

Under appropriate modeling assumptions, methods can verify probabilisticproperties by providing theoretically guaranteed answers to questions, such as:

1. What are the initial states from which the state trajectory stays within a desired operating region with sufficiently high probability? [65, 67, 68]

2. What are the initial states from which the state trajectory reaches a target set while remaining inside a desired operating region with sufficiently high probability? [66, 67]

The methods [65, 66, 67, 68] use DP with indicator stage-cost functions to optimize a probability of safety or performance and compute a policy that attains the optimal probability, provided that the system dynamics satisfy some assumptions. At a high level, the assumptions impose a sufficient degree of regularity on dynamical models so that integrals and minimizers are known to exist (by applying principles from measure theory and real analysis). Different methods permit different types of disturbances. The works [65, 66] from 2008 and 2010, respectively, concern non-adversarial probabilistic disturbances. That is, these disturbances behave as random noise, perturbing the system's trajectory without favoring particular outcomes. More recent methods have extended the previous works to permit adversarial probabilistic disturbances [67] and disturbances that are characterized by a family of plausible distributions [68]. Adversarial probabilistic disturbances compete against policies to degrade the performance or safety of the system, a setting that is inspired by game theory.
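As a sketch of the first question above, the following toy example (our own; random transition matrix, hypothetical safe set) implements a DP recursion with indicator stage costs in the spirit of [65]: the value V_t(x) is the maximal probability of remaining in the desired region K from time t through the end of the horizon.

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, N = 6, 2, 15
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)            # P[x, a, x'] transition probabilities
safe = np.array([1, 1, 1, 1, 0, 0], float)   # indicator of the desired region K

V = safe.copy()                              # V_N(x) = 1{x in K}
for t in range(N):
    # V_t(x) = 1{x in K} * max_a E[V_{t+1}(X_{t+1})]: multiplicative indicator DP
    V = safe * (P @ V).max(axis=1)

print(V)  # {x : V(x) >= 1 - epsilon} is a probabilistic safe set of initial states
```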

14

The formulation in [68] involves a related setting of distributional robustness, which permits inexact knowledge of the true disturbance distribution.

Probabilistic temporal logic extends temporal logic (Sec. 3.1) to a probabilistic setting. In 2016, Sadigh and Kapoor proposed this language, expressed specifications in this language using chance constraints, and demonstrated their applicability to safe vehicle control [69]. In earlier work by some of the same authors, the concept of probabilistic temporal logic was applied to finite-state MDPs with unknown transition probabilities, and a temporal difference learning algorithm was proposed [70]. A method that uses second-order cone programming has been devised to verify the satisfaction of probabilistic temporal logic specifications with applications to autonomous driving [71], and similar ideas have been investigated by [72].

3.3. Using Risk Functionals to Quantify Risk for Autonomous Systems

The methods in Sec. 3.2 are designed to quantify probabilities, but they are not designed to quantify the magnitude of harmful outcomes, which is an important aspect of risk analysis (recall the Introduction). For instance, if the probability of a system leaving a desired operating region is small, then the magnitude of the departure is expected to be small as well. However, knowing the probability does not provide a more explicit measure of the magnitude of the departure. More generally, quantifying the magnitude of harmful outcomes, as well as their likelihood, may be important in practice when harmful outcomes are inevitable, yet their severity should be minimized as much as possible. Example application domains include: renewable energy [73], water resources [74, 75], and biology [76, 77]. In these domains, noise is ubiquitous, as suggested by the above references, and reducing the magnitude of harmful outcomes is critical. For example, it is important to reduce the difference between energy generation and consumption, limit the volume of stormwater overflows, and minimize the extent of adverse medicinal side effects. The practical importance of assessing and minimizing the likelihood and magnitude of harmful outcomes motivates the study of methods for autonomous systems that assess random costs using risk functionals.

The problem of computing a policy for an autonomous system that optimizes a random cost, where the cost is evaluated using a risk functional, has been studied in various forms since the 1970s. Some approaches throughout the decades include [78, 24, 79, 80, 35, 43, 81, 23, 44, 29, 82]. The type of risk functional that is chosen and whether it defines an objective or a constraint depend on the desired characteristics of the cost distribution and on considerations of theoretical and computational complexity and explainability. We organize our presentation of the literature according to methodology. First, we introduce methods that are based on Exponential Utility, which has a rich history. Then, we present more modern techniques, which include methods based on state-space augmentation, inner-outer optimization, nested risk functionals, and risk constraints.


3.3.1. Methods based on Exponential Utility

One of the first studies was conducted by Howard and Matheson in 1972 in the setting of finite-state MDPs [78]. The authors were inspired by decision theory [83] to encode the subjective preferences of a risk-averse or risk-seeking decision-maker using the Exponential-Utility functional. Independently, Jacobson and colleagues studied an Exponential-Utility performance criterion in the setting of linear systems subject to Gaussian noise with quadratic costs (LEQG) in 1973 [84] and 1974 [85]. The initial investigation was motivated by the intuition that the linear-quadratic-Gaussian controller, which is equivalent to the linear-quadratic-regulator controller without noise, may neglect key characteristics of disturbance distributions [84]. In 1981, the optimization of an Exponential-Utility criterion for a continuous-time stochastic system was studied by Kumar and van Schuppen [86]. Also in 1981, Whittle published a seminal paper showing parallels between LEQG theory and linear-quadratic-regulator theory, in particular for partially observable systems [24]. In subsequent years, Whittle considered a Hamiltonian formulation [87] and developed a Pontryagin maximum principle for an Exponential-Utility criterion in a partially observable setting [88]. Researchers have drawn connections between LEQG theory and a) H∞ control [89], b) minimax linear-quadratic-Gaussian model predictive control [90, p. 99], and c) soft-constrained linear-quadratic games [91]; please see our prior work [82, Remarks 5 & 6] for an extended discussion about these connections. In 1999, Coraluppi and Marcus studied relations between the problems of optimizing Exponential-Utility and worst-case performance objectives for MDPs with finitely many states [6]. In the same year, di Masi and Stettner examined the problem of optimizing an infinite-time Exponential-Utility criterion for an MDP whose state space need not have finite cardinality [79]. In 2002, a Q-learning algorithm to optimize an infinite-time Exponential-Utility criterion for MDPs with finitely many states was studied by Borkar [80]. Since the late 1990s, numerous papers have examined MDPs with Exponential-Utility criteria in the settings of state spaces with finite cardinality, e.g., [92, 93, 94], and infinite-time horizons with more general state spaces, e.g., [95, 96, 97]. A more detailed survey on Exponential-Utility MDPs can be found in our recent work [98].

Despite the extensive research on using Exponential Utility to assess risk for autonomous systems, this functional has disadvantages that cannot be ignored. Exponential Utility is a special case of an Expected-Utility functional, there exist different utility functions to choose from, and an exponential one need not represent a user's subjective preferences appropriately. As we have discussed in Sec. 2.2, Exponential Utility lacks coherence and provides a theoretically justified approximation to a weighted sum of mean and variance only under limited conditions. In particular, choosing a more negative value of the parameter θ does not guarantee a reduced variance, as we demonstrated in prior work [46].

Therefore, naturally, different approaches for assessing and optimizing risk for autonomous systems have been studied, although such approaches tend to be more complex in theory or computation versus Exponential-Utility-based approaches.


As we have mentioned earlier, we organize our presentation into four categories of approaches: state-space augmentation, inner-outer optimization, nested risk functionals, and risk constraints. We note that some works use a combination of these approaches.

3.3.2. Methods based on State-Space Augmentation

The expectation of a cumulative cost for an MDP can be optimized tractably (e.g., using a DP recursion on the state space) because the expected cost-to-go depends on the past only through the current state when Markov policies are used. This fact stems from the law of iterated expectations. In this setting, the current state provides sufficient information about the past to make informed decisions, which is often called Bellman's Principle of Optimality. Unfortunately, the Principle does not hold for a risk functional in general. The term time inconsistent is used to describe problems in which Bellman's Principle is not satisfied [99].

A technique to alleviate the issue of time inconsistency is state-space augmentation. Typically, this technique involves 1) defining the dynamics of an extra state and 2) defining a larger (i.e., augmented) state space so that a DP recursion does hold on the augmented space. The extra state is defined to summarize enough information about the past so that an algorithm can use this information, instead of a complete record of the past, to make informed decisions for the present.¹¹ This concept dates back at least to a paper on MDPs by Kreps from 1977, in which a "summary space" for optimizing an Expected-Utility functional was proposed [100]. We shall see that using augmented state spaces has become more common in recent years, perhaps inspired by the advancements of modern computing.

One can guarantee the existence of an optimal policy that is Markovian on an augmented state space under appropriate conditions, e.g., see [43, 23, 45, 18], which were written between 2011 and 2021. Such policies may be called optimal pre-commitment policies to emphasize that they depend on the dynamics of an extra state. There is debate regarding the usefulness of pre-commitment policies; for instance, [101] and [102] provide different points of view. Nonetheless, algorithms that compute pre-commitment policies are receiving significant research attention, as they are topics of theoretical interest and have potential to become useful in practice.

In 2011 and 2014, Bäuerle and colleagues devised the cutting-edge approaches for optimizing CVaR and Expected-Utility criteria, respectively, for MDPs on Borel state spaces [43, 23]. There, key steps include defining an extra state to track the cumulative cost from the initial time to the current time and using these dynamics to devise a DP recursion on the augmented state space [43, 23].

¹¹The precise definition of the extra state depends on the problem at hand. For example, in the setting of a cumulative cost, an extra state of the form St+1 = St − Ct has been used [43], where St is the (random) extra state at time t and Ct is the (random) stage cost at time t. In our prior work involving a maximum cost, an extra state of the form St+1 = max{St, Ct} has been useful [18].


Building on these works, in 2015, Haskell and Jain formulated infinite-dimensional linear programs in occupation measures on an augmented state space [44]. Expected-Utility and CVaR objectives were considered, as well as minimizing an expected cost subject to a risk-aware constraint [44]. In related work, Borkar and Jain used an augmented state space to optimize an expected reward subject to a CVaR constraint for an MDP on a finite-time horizon [81]. The method uses a DP recursion with multiple loops, one of which updates a dual variable, and the domain of the value functions is an augmented state space [81]. Chow et al. augmented the state space using a time-varying risk-sensitivity level to optimize MDPs with CVaR performance criteria approximately [103]. The concept of a time-varying risk-sensitivity level comes from a history-dependent temporal decomposition for CVaR [42, 104]. The history dependence is removed in the DP algorithm proposed by [103] to provide a heuristic approach with computational benefits.
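For concreteness, the sketch below (our own; `step` and `policy` are hypothetical stand-ins rather than functions from the cited works) simulates the extra-state dynamics from footnote 11. A policy that is Markov in the augmented pair (x_t, s_t) uses summarized past-cost information without storing the full history, which is what enables a DP recursion on the augmented space.

```python
import numpy as np

def augmented_rollout(step, policy, x0, s0, N, mode="budget", rng=None):
    """Simulate an augmented trajectory (x_t, s_t). Following footnote 11,
    the extra state evolves as s_{t+1} = s_t - c_t (remaining cost budget,
    as in [43]) or s_{t+1} = max{s_t, c_t} (running maximum cost, as in [18]).
    `step(x, u, rng) -> (x_next, cost)` is a hypothetical one-step model."""
    x, s = x0, s0
    for t in range(N):
        u = policy(x, s, t)            # Markov in the augmented state (x, s)
        x, cost = step(x, u, rng)
        s = s - cost if mode == "budget" else max(s, cost)
    return x, s
```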

In prior work, we proposed the problem of optimizing the CVaR of a maximum cost to assess the magnitude of an MDP's constraint violation in a fraction of worst cases [17, 18]. A classical robust safety analysis method [57], advancements in the theory of optimizing CVaR criteria for MDPs [43], and the practical importance of including an assessment of consequences in risk analysis inspired us to formulate this problem. We proposed a DP-based method using state-space augmentation that provides an exact solution in principle for finite-time MDPs with Borel state and action spaces [18]. In addition, we devised a theoretically guaranteed under-approximation method [17], which does not require state-space augmentation and thus is considerably more efficient, at the expense of being less accurate, in comparison to [18].

3.3.3. Methods based on Inner-Outer Optimization

A popular technique for optimizing a risk-functional criterion for an MDP is to re-express the problem in terms of two problems: an inner optimization problem over policies and an outer optimization problem over R or a class of R-valued functions. This technique has been used with state-space augmentation by [43, 45, 18, 82] and without state-space augmentation by [29].¹² Next, we describe the inner-outer optimization technique in the context of CVaR for MDPs with Borel state and action spaces.

For any policy π ∈ Π and initial condition x ∈ S, let Pπx denote the probability measure on (Ω, BΩ), where 1) Ω is a Borel space containing all realizations of the system's random trajectory, and 2) Pπx(B) is the probability that the random trajectory is in a Borel-measurable rectangle B ⊆ Ω.¹³

¹²Ref. [29] considers a terminal cost rather than a sum of stage and terminal costs.
¹³The definition of Ω and Pπx depends on the problem at hand. If the goal is to minimize the expectation of a cumulative cost for an MDP (a classical risk-neutral problem), it is typical to define Ω := (S × A)∞. In this case, the random trajectory is (X0, U0, X1, U1, . . . ), where Xt is the random state at time t and Ut is the random action at time t. An element ω of Ω is a possible sequence of states and actions, which takes the form ω = (x0, u0, x1, u1, . . . ). A formal construction of Pπx can be found in [3, Prop. 7.28, pp. 140–141], and a high-level description can be found in [105, Remark C.11, p. 178]. For a risk-aware problem, a larger sample space Ω may be required to permit extra states that summarize the history of states and actions. For example, in prior work, we considered a risk-averse safety analysis problem that involves optimizing the CVaR of a maximum cost for an MDP on a finite-time horizon of length N ∈ N [18]. In this setting, the random trajectory is (X0, Z0, U0, . . . , XN−1, ZN−1, UN−1, XN, ZN), where Xt and Ut are as before, and Zt is the random maximum cost up to time t [18]. The sample space is Ω := (S × Z × A)N × S × Z, where Z ⊆ R is the range of a given stage cost. Any ω ∈ Ω is a tuple of augmented states and actions, which takes the form ω = (x0, z0, u0, . . . , xN−1, zN−1, uN−1, xN, zN). A formal construction of Pπx for this risk-aware setting is in the supplementary material of [18].


Define the expectation of a Borel-measurable function Y : Ω → R∗ with respect to Pπx by Eπx(Y) := ∫Ω Y dPπx. Now, let Z : Ω → R∗ be a Borel-measurable function representing a random cost incurred by a stochastic system. Suppose that there is a policy π ∈ Π such that Eπx(|Z|) < +∞ for all initial states x ∈ S. Then, use the definition of CVaR (7) and interchange the order of the infima over Π and R to derive the following expression:

inf_{π∈Π} CVaR^π_{α,x}(Z) = inf_{π∈Π} inf_{s∈R} ( s + (1/α) Eπx(max{Z − s, 0}) ) = inf_{s∈R} ( s + (1/α) inf_{π∈Π} Eπx(max{Z − s, 0}) ). (10)

Under certain conditions, the inner problem J∗s(x) := inf_{π∈Π} Eπx(max{Z − s, 0}) can be solved for each s exactly in principle, e.g., using DP, and subsequently, the family of functions {J∗s : s ∈ R} can be used to compute the optimal CVaR exactly in principle [43, 18]. In 2021, Bäuerle and Glauner generalized the technique (10) to optimize spectral risk functionals for MDPs [45].¹⁴

3.3.4. Methods based on Nested Risk Functionals

An alternative to the above two approaches is to optimize a nested (i.e., recursive, compositional) risk functional via DP on the original state space [35, 37, 38, 39]. In general, nested risk functionals have been criticized for being harder to interpret, but they have a numerical advantage of avoiding state-space augmentation. First, we summarize some key ideas from the seminal papers by Ruszczyński [35, 37], and then we briefly describe other works that employ nested risk functionals. Under appropriate conditions, a nested risk functional can be written as a composition of one-step conditional risk functionals, ρt : Zt+1 → Zt [35, Eq. 13], where Zt is a set of random variables and each random variable can depend on the system's state-history from time 0 to time t. Recall that a nested risk functional of a cumulative cost Z := Z0 + Z1 + Z2 takes the form ρ(Z) := ρ0(Z0 + ρ1(Z1 + ρ2(Z2))) (Sec. 2.1.6). A key insight from [35, 37] is that ρ(Z) can be optimized recursively backwards-in-time starting from the last stage; in our simplified discussion, the last stage is 2. At a high level, a nested risk functional mimics the structure of the law of iterated expectations, which permits a DP-based optimization approach on the state space without augmentation.

can be found in [105, Remark C.11, p. 178]. For a risk-aware problem, a larger sample spaceΩ may be required to permit extra states that summarize the history of states and actions.For example, in prior work, we considered a risk-averse safety analysis problem that involvesoptimizing the CVaR of a maximum cost for an MDP on a finite-time horizon of length N ∈ N[18]. In this setting, the random trajectory is (X0, Z0, U0, ..., XN−1, ZN−1, UN−1, XN , ZN ),where Xt and Ut are as before, and Zt is the random maximum cost up to time t [18].The sample space is Ω := (S × Z × A)N × S × Z, where Z ⊆ R is the range of a givenstage cost. Any ω ∈ Ω is a tuple of augmented states and actions, which takes the formω = (x0, z0, u0, . . . , xN−1, zN−1, uN−1, xN , zN ). A formal construction of Pπx for this risk-aware setting is in the supplementary material of [18].

14CVaR is an example of a spectral risk functional. More generally, a spectral risk functionalcan be written as an integral over CVaRα with respect to α [45, Prop. 2.5].

19

augmentation. The technical arguments underlying the correctness of this op-timization approach are quite involved, and details can be found in [35, 37].Algorithms adapted to a model-predictive-control setting were studied in 2018[38]. Connections between nested risk functionals and distributionally robustproblems for data processes were examined in 2012 [31, Prop. 4.1], and connec-tions between nested risk functionals and distributionally robust problems forMDPs were explored in 2021 [39].
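As an illustration of the backward recursion described above, the following sketch computes an optimal nested risk value for a hypothetical two-state controlled Markov chain, taking each one-step conditional risk functional ρ_t to be a CVaR over the next state's distribution. This is one common instantiation, not the general construction of [35, 37].

```python
import numpy as np

S, A, T, alpha = 2, 2, 3, 0.3
P = np.array([[[0.8, 0.2],
               [0.3, 0.7]],
              [[0.5, 0.5],
               [0.6, 0.4]]])          # P[a][x][x'] = Pr(x' | x, a), hypothetical
c = np.array([[0.0, 1.5],
              [2.0, 0.5]])            # c[x][a]: stage cost, hypothetical
cT = np.array([0.0, 3.0])             # terminal cost, hypothetical

def cvar_discrete(values, probs, a):
    # Rockafellar-Uryasev representation of CVaR; for a discrete distribution,
    # the minimizing s may be taken among the support points.
    return min(s + np.dot(probs, np.maximum(values - s, 0.0)) / a for s in values)

V = cT.astype(float)                  # V_T: terminal cost
for t in reversed(range(T)):          # backward recursion, one stage at a time
    V = np.array([min(c[x][a] + cvar_discrete(V, P[a][x], alpha) for a in range(A))
                  for x in range(S)])
print("optimal nested-CVaR values by initial state:", V)
```

Note that the recursion operates on the original state space: no augmentation variable is carried along, in contrast to the methods of Sec. 3.3.2.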

3.3.5. Methods based on Risk Constraints

The previous approaches study the problem of optimizing a random cost incurred by a stochastic system that is assessed in terms of a risk functional. A related problem is to optimize an expected cost incurred by a stochastic system subject to a risk constraint. Such risk-constrained problems have been studied, for example, by [106, 107, 44, 108, 81, 109]. In 2012, Moldovan and Abbeel defined a risk constraint to ensure that an MDP satisfies the property of ergodicity, i.e., each state can be visited from any other state, for a safe exploration problem [106]. A novel contribution of this work is forming a rigorous connection between the concepts of ergodicity and safety [106]. A chance-constrained problem is a special case of a risk-constrained problem, as the notion of risk is quantified in terms of probabilities. Incorporating chance constraints into multi-stage decision-making problems is popular, and example works include [107] from 2014, which concerns model predictive control, and [44] from 2015, which concerns MDPs. To alleviate computational complexity, an approach for approximating chance constraints is to replace a chance constraint by a Conditional-Value-at-Risk constraint [19, p. 258, Eq. 6.24]. For example, this technique was employed by van Parys et al. in a setting with linear systems, quadratic costs, and additive disturbances characterized by a family of distributions [108]. More broadly, CVaR constraints are popular in the autonomous systems literature, and we outline two examples briefly. In 2014, Borkar and Jain considered the problem of maximizing a cumulative reward for an MDP subject to a CVaR constraint applied to a noisy cost process [81]; the authors proposed an algorithm using state-space augmentation [81] (Sec. 3.3.2). In 2018, Samuelson and Yang utilized a CVaR constraint to define stage-wise safety specifications for stochastic systems, in which CVaR evaluates a stage cost at each time point [109]. To our knowledge, this is the first work to use CVaR to define a safety property for a stochastic system [109].15

15 We extended this idea to propose trajectory-wise safety specifications based on optimizing the CVaR of a maximum trajectory-wise cost for an MDP [110, 17, 18]. We have described some of this prior work in Sec. 3.3.2.
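The following Monte Carlo sketch illustrates why a CVaR constraint is a conservative surrogate for a chance constraint, in the spirit of [19, Eq. 6.24]: under our convention that α is the fraction of worst-case outcomes, CVaR_α(Z) ≤ 0 implies P(Z > 0) ≤ α. The Gaussian constraint residual below is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
Z = rng.normal(loc=-2.0, scale=1.0, size=100_000)   # hypothetical constraint residual

s_var = np.quantile(Z, 1 - alpha)                   # empirical VaR_alpha
cvar = s_var + np.mean(np.maximum(Z - s_var, 0.0)) / alpha
print(f"CVaR estimate: {cvar:.3f}, P(Z > 0): {np.mean(Z > 0):.4f}")
# Since CVaR_alpha(Z) >= VaR_alpha(Z), enforcing CVaR_alpha(Z) <= 0 forces
# P(Z > 0) <= alpha; the CVaR constraint is the more conservative, and convex,
# of the two.
```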

3.4. Disadvantages of Model-Based Approaches

Up until now, most methods that we have presented require mathematical assumptions about the system model, and these assumptions are not permitted to change in unanticipated ways over time. Many systems of interest operate according to known physical or chemical laws, and thus, assuming that the dynamics equations have a particular structure need not be limiting. However, systems can contain many interacting components, e.g., robots with many joints, city-wide water networks, or several cancer drugs applied in combination, leading to high-dimensional models that are challenging to analyze and expensive to optimize. In addition, even for systems that can be well-represented using low-dimensional models, system behavior can be affected by disturbances that are challenging to anticipate. While one can conduct experiments to estimate the distributions of disturbances, distributions can change in unexpected ways during a system's operation. Assuming exact knowledge of a disturbance distribution a priori may lead to poor outcomes in practice, if the distribution at run time differs significantly from the distribution that is assumed during planning. These issues are well-known in the autonomous systems community. In the following section, we present methods that have potential to alleviate some of these issues by improving the real-time adaptive capabilities of policies. Some papers from the previous sections are also included in Sec. 4, as their contributions are relevant to the discussion.

4. Towards Real-Time Adaptation to Real-World Situations

An approach for devising policies when there is limited prior mathematical knowledge is to use algorithms that rely on sampling rather than explicit mathematical models. Q-learning and temporal difference learning are classical examples of model-free algorithms, which require an experimental platform that can generate sufficiently many representative samples.16 Risk-aware model-free algorithms started receiving attention in the mid-1990s. In 1994, Heger proposed a Q-learning algorithm for finite-state finite-action MDPs with a maximum-cost criterion and also presented different approaches for defining risk to the machine learning community [5] (Sec. 2.1). In 2002, Mihatsch and Neuneier proposed Q-learning and temporal difference learning algorithms that encode risk-aware preferences for MDPs with finitely many states and actions [114].17 They modified the standard algorithms using a specific weighting function and showed that the modified versions satisfy certain convergence properties [114]. At the same time, Borkar independently proposed a risk-aware Q-learning algorithm for finite-state finite-action MDPs that uses the Exponential-Utility functional to assess risk [80] (Sec. 3.3).

16 An early exploration of (non-risk-aware) temporal difference learning was conducted by Witten in 1977 in an adaptive control setting [111]. The vanilla form of Q-learning is attributed to Watkins' seminal research from 1989 and 1992 [112, 113]. A more detailed history can be found in Sutton and Barto [2].

17 It is likely that the formulation first appeared in a conference paper by Neuneier and Mihatsch from 1998 [115], but we could not find the original paper to confirm.

In subsequent years, researchers developed Q-learning-like algorithms for additional risk-aware settings; example works include [64, 116, 117, 118]. In 2005, Geibel and Wysotzki proposed a simulation-based algorithm that resembles Q-learning to optimize a chance-constrained MDP approximately [64], extending earlier work by Geibel [63]. The authors quantified risk in terms of the probability of entering an undesirable region of the state space [63, 64], which is an example of a probabilistic risk assessment (Sec. 3.2). In 2014, Shen et al. devised a Q-learning algorithm for utility-based shortfall risk functionals and studied relations between the algorithm and human behavior [116]. Later, in 2017 and 2021, Huang and Haskell developed a model-free algorithm, which builds on Q-learning, for finite-state finite-action MDPs with a minimax Expected-Utility criterion [117, 118]. The authors showed how the proposed criterion applies to several well-established risk functionals [117, 118].
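The weighting-function idea of Mihatsch and Neuneier [114] can be sketched as follows: the temporal-difference error δ is rescaled by (1 − κ·sign(δ)) with κ ∈ (−1, 1), so that unpleasant surprises receive more weight when κ > 0. The two-armed example below is hypothetical, and the code is a simplified sketch rather than the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
kappa, lr, episodes = 0.5, 0.05, 20_000
Q = np.zeros(2)                       # single state, two actions (arms)

def reward(a):
    # arm 0: safe, mean 1.0; arm 1: risky, higher mean but occasional heavy losses
    return 1.0 if a == 0 else rng.choice([3.0, -5.0], p=[0.8, 0.2])

for _ in range(episodes):
    a = rng.integers(2) if rng.random() < 0.1 else int(np.argmax(Q))
    delta = reward(a) - Q[a]          # one-step TD error (no successor state here)
    Q[a] += lr * (1.0 - kappa * np.sign(delta)) * delta
print("risk-sensitive Q-values:", Q)  # the safe arm tends to win as kappa grows
```

With κ = 0 the update reduces to ordinary Q-learning, which prefers the risky arm (higher mean); with κ near 1, negative surprises dominate and the safe arm is preferred.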

While the trade-offs between model-free and model-based approaches are well-known, we would like to summarize some key aspects of the debate briefly. A disadvantage of algorithms that rely on simulators is that a simulator need not provide a sufficient characterization of the real physical world. Samples that come from real experiments, rather than from simulators, can be used to estimate average outcomes in practice. However, characterizing rarer outcomes requires more samples, and collecting a larger number of samples may or may not be feasible for the application of interest. In contrast to sampling-based approaches, model-based approaches require prior knowledge about the dynamics equations and plausible probability distributions for disturbances. Overall, model-free and model-based approaches have diverse requirements.

Motivated by these diverse requirements, hybrid approaches that combine features of model-free and model-based methods have been developed to devise policies with enhanced capabilities to adapt to unanticipated real-world situations safely, e.g., [119, 120, 121, 122, 10, 123, 124, 125, 126, 127, 128, 129, 130]. These approaches build on ideas from adaptive control, which dates back at least to the 1950s [131], and are now classified in the field of safe learning. We highlight some key papers in this area and emphasize their connections to the risk-averse systems literature.

Safe learning approaches may be classified as risk averse, although the notion of risk is defined indirectly. Typically, additional constraints are imposed on a standard problem formulation to restrict the class of policies that a system is permitted to consider. In 1996, Schneider examined an application of dynamic programming in which the assumed model is re-identified continually using up-to-date observations to promote safer operation [119]. Later on, policies have been constructed by maximizing an expected reward subject to an ergodic safety constraint [106], by applying robust safety constraints to a nominal model [122], and by using Lyapunov stability techniques [120, 127]. In [124], a policy is computed by optimizing a variance criterion subject to a safety constraint that depends on the moments of a Gaussian process. The class of policies is constrained using control barrier functions, which generalize Lyapunov functions, in [129, 130]. Wabersich et al. wrap a safety policy around an RL-based policy, where the safety policy is constructed using concepts from stochastic model predictive control and reachable tubes [132, 133].
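A minimal sketch of the safety-filter idea in the spirit of [132, 133] follows, assuming hypothetical scalar dynamics, a known safe set, and a backup control law; the actual constructions in [132, 133] use stochastic model predictive control and reachable tubes rather than the one-step check shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_nominal(x, u):
    # hypothetical scalar nominal dynamics
    return 0.9 * x + u

def safe(x):
    # hypothetical safe set K = [-1, 1]
    return abs(x) <= 1.0

def backup(x):
    # hypothetical backup law; it returns the state to the origin in one step
    return -0.9 * x

def safety_filter(x, u_learned):
    # certify one-step safety under the nominal model; otherwise fall back
    return u_learned if safe(f_nominal(x, u_learned)) else backup(x)

x = 0.5
for _ in range(10):
    u_rl = rng.normal(scale=1.5)      # stand-in for a learning-based policy
    x = f_nominal(x, safety_filter(x, u_rl))
    assert safe(x)                    # invariance of K is maintained
print("final state:", x)
```

The learning-based policy is free to explore whenever the filter can certify safety, and the backup law intervenes only when it cannot.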

A common thread of safe learning methods is using samples from the system's trajectory to improve knowledge about disturbances to the system or knowledge about where the system can operate safely. For example, researchers have incorporated online observations into Hamilton-Jacobi reachability, a robust optimal control method, to improve estimates of disturbance bounds [134, 128]. Some works provide probabilistic guarantees about the performance of policies or accuracy of estimated operating regions, e.g., see [127, Thm. 4], [128, Prop. 7], and [129, Lemma 1]. Such guarantees require statistical assumptions about the samples or modeling uncertainties. Assuming that an uncertain component (e.g., part of the dynamical model or performance criterion) is well-characterized using a Gaussian process is a popular approach [123, 124, 125, 126, 127, 128, 129].

5. Future Directions

We conclude the review by recommending directions for future research. Thus far, most studies have assumed a fixed degree of risk aversion, whereas there may be a practical need to update this degree iteratively as additional data is acquired in real time. That is, ideas from adaptive control and safe learning could be used to learn an appropriate degree of risk aversion for the present time period according to how accurate the current model is believed to be. In particular, the degree of risk aversion could be the parameter α that characterizes the Conditional Value-at-Risk functional, and we recall that α represents a fraction of worst-case outcomes. To reduce the gap between theoretical guarantees and satisfactory operation in practice, we require improved understanding about how to generate data samples with desired statistical properties at run time.
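As a purely illustrative sketch of this direction (not a method from the literature), one could track a target CVaR level α that shrinks, i.e., becomes more risk-averse, when recent model-prediction errors grow, and relaxes otherwise; every quantity below is a hypothetical placeholder.

```python
import numpy as np

alpha_min, alpha_max = 0.05, 0.5
alpha = 0.25
for err in [0.1, 0.1, 0.8, 1.2, 0.3, 0.1]:   # stand-in prediction-error stream
    confidence = np.exp(-err)                 # maps error to (0, 1]
    target = alpha_min + (alpha_max - alpha_min) * confidence
    alpha += 0.5 * (target - alpha)           # smooth tracking of the target
    print(f"model error {err:.1f} -> alpha = {alpha:.3f}")
```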

We see advantages and disadvantages to model-free and model-based approaches and argue that hybrid approaches with a well-chosen risk assessment technique may be beneficial for autonomous systems. Naturally, an appropriate combination of model-free and model-based characteristics is application-dependent, as different forms of a priori knowledge are available in different domains. For example, assuming the availability of representative samples to describe customer behavior may be appropriate in online shopping, whereas a blend of patient data, biological and chemical models, and expert advice from oncologists is needed to improve cancer treatments. Stronger interdisciplinary communication between domain experts and the reinforcement learning and stochastic control communities is important for realizing the potential of risk-averse methods in practice.

Implicitly or explicitly, most research on risk-averse autonomous systems has concerned applications related to robots or ground/aerial vehicles. While these applications are important, diversity of applications is also important. Particularly relevant domains for current and future research on risk-averse systems include: renewable energy [73], water resources [74, 75], and biology [76, 77]. These domains require autonomous systems to operate effectively despite uncertainties, and the mitigation of risk in these domains is paramount for human and environmental welfare.


Moreover, we see value in further studying the numerical efficiency-accuracy trade-offs that arise in risk-averse methods. Developing scalable risk-averse methods with finite-sample quality-of-approximation guarantees is required to examine how well the risk-averse paradigm applies to high-dimensional systems. It is well-known that purely model-based methods are restricted to low-dimensional systems, except in special cases such as linear-quadratic settings. On the other hand, state-of-the-art scalable methods typically lack finite-sample analysis; an exception is the contribution by Huang and Haskell from 2021 [118].

Due to a convenient parametrization, Gaussian processes are common in safe learning. However, this approach can conceal asymmetries or higher-order moments that may be present in real distributions. Developing a new technique that incorporates such potentially important information may be feasible by adopting ideas that underlie well-understood risk functionals. More broadly, further studies are needed to help select which risk functionals may be more appropriate for particular applications of autonomous systems.

Acknowledgements

We gratefully acknowledge discussions with Kevin M. Smith and Yuxi Han. We are thankful for comprehensive presentations by A. Shapiro, D. Dentcheva, and A. Ruszczynski [19] and N. Bauerle and U. Rieder [23], which provided a strong foundation for Sec. 2.1. In addition, we thank J. Kisiala [41] for offering intuitive introductions to CVaR, VaR, and coherent risk functionals, and we thank S. Coogan, M. Arcak, and C. Belta [61] for an informative summary about temporal logic. This research is supported by the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto.

References

[1] Risk management–Guidelines, https://www.iso.org/obp/ui/#iso:std:iso:31000:ed-2:v1:en, accessed: July 11, 2021 (2018).

[2] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd Edition, The MIT Press, Cambridge, 2014.

[3] D. P. Bertsekas, S. Shreve, Stochastic Optimal Control: The Discrete-Time Case, Athena Scientific, Belmont, 1996.

[4] D. P. Bertsekas, I. B. Rhodes, On the minimax reachability of target sets and target tubes, Automatica 7 (2) (1971) 233–247.

[5] M. Heger, Consideration of risk in reinforcement learning, in: Proceedings of the International Machine Learning Conference, 1994, pp. 105–111.

[6] S. P. Coraluppi, S. I. Marcus, Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes, Automatica 35 (2) (1999) 301–309.

[7] J. Morimoto, K. Doya, Robust reinforcement learning, Neural Computation 17 (2) (2005) 335–359.

[8] K. Margellos, J. Lygeros, Hamilton–Jacobi formulation for reach–avoid differential games, IEEE Transactions on Automatic Control 56 (8) (2011) 1849–1861.

[9] M. Chen, C. J. Tomlin, Hamilton–Jacobi reachability: Some recent theoretical advances and applications in unmanned airspace management, Annual Review of Control, Robotics, and Autonomous Systems 1 (2018) 333–358.

[10] M. Pecka, T. Svoboda, Safe exploration techniques for reinforcement learning–An overview, in: Proceedings of the International Workshop on Modelling and Simulation for Autonomous Systems, Springer, 2014, pp. 357–375.

[11] J. Garcia, F. Fernandez, Safe exploration of state and action spaces in reinforcement learning, Journal of Artificial Intelligence Research 45 (2012) 515–564.

[12] H. Ravichandar, A. S. Polydoros, S. Chernova, A. Billard, Recent advances in robot learning from demonstration, Annual Review of Control, Robotics, and Autonomous Systems 3 (2020) 297–330.

[13] J. Garcia, F. Fernandez, A comprehensive survey on safe reinforcement learning, Journal of Machine Learning Research 16 (1) (2015) 1437–1480.

[14] L. Hewing, K. P. Wabersich, M. Menner, M. N. Zeilinger, Learning-based model predictive control: Toward safe learning in control, Annual Review of Control, Robotics, and Autonomous Systems 3 (2020) 269–296.

[15] N. Ab Azar, A. Shahmansoorian, M. Davoudi, From inverse optimal control to inverse reinforcement learning: A historical review, Annual Reviews in Control 50 (2020) 119–138.

[16] S. Arora, P. Doshi, A survey of inverse reinforcement learning: Challenges, methods and progress, Artificial Intelligence 297 (2021) 103500.

[17] M. P. Chapman, R. Bonalli, K. M. Smith, I. Yang, M. Pavone, C. J. Tomlin, Risk-sensitive safety analysis using Conditional Value-at-Risk, under review, preprint arXiv:2101.12086.

[18] M. P. Chapman, M. Fauss, H. V. Poor, K. M. Smith, Risk-sensitive safety analysis via state-space augmentation, under review, preprint arXiv:2106.00776.

[19] A. Shapiro, D. Dentcheva, A. Ruszczynski, Lectures on Stochastic Programming: Modeling and Theory, Society for Industrial and Applied Mathematics, Mathematical Programming Society, Philadelphia, PA, 2009.

[20] L. Eeckhoudt, C. Gollier, H. Schlesinger, Economic and Financial Decisions under Risk, Princeton University Press, Princeton, NJ, 2005.

[21] D. Bernoulli, Exposition of a new theory on the measurement of risk, Econometrica 22 (1) (1954) 23–36, originally published in 1738 as “Specimen Theoriae Novae de Mensura Sortis,” in Latin, translated by L. Sommer.

[22] J. von Neumann, O. Morgenstern, Theory of Games and Economic Behavior, Princeton University Press, Princeton, NJ, 1944.

[23] N. Bauerle, U. Rieder, More risk-sensitive Markov decision processes, Mathematics of Operations Research 39 (1) (2014) 105–120.

[24] P. Whittle, Risk-sensitive linear/quadratic/Gaussian control, Advances in Applied Probability 13 (1981) 764–777.

[25] G. B. Folland, Real Analysis: Modern Techniques and Their Applications, 2nd Edition, John Wiley & Sons, Inc., New York, NY, 1999.

[26] H. Markowitz, Portfolio selection, The Journal of Finance 7 (1) (1952) 77–91.

[27] H. Markowitz, Portfolio selection: Efficient diversification of investments, John Wiley & Sons, Inc., New York, NY, 1959.

[28] C.-H. Won, Cost-cumulants and risk-sensitive control, in: W.-K. Chen (Ed.), The Electrical Engineering Handbook, Elsevier, Inc., 2005, Ch. 5, pp. 1061–1068.

[29] C. W. Miller, I. Yang, Optimal control of Conditional Value-at-Risk in continuous time, SIAM Journal on Control and Optimization 55 (2) (2017) 856–884.

[30] R. Ash, Real Analysis and Probability, Academic Press, Inc., New York, NY, 1972.

[31] A. Shapiro, Minimax and risk averse multistage stochastic programming, European Journal of Operational Research 219 (3) (2012) 719–726.

[32] R. T. Rockafellar, S. Uryasev, Optimization of Conditional Value-at-Risk, Journal of Risk 2 (3) (2000) 21–42.

[33] R. T. Rockafellar, S. Uryasev, Conditional Value-at-Risk for general loss distributions, Journal of Banking & Finance 26 (7) (2002) 1443–1471.

[34] C. Acerbi, D. Tasche, On the coherence of expected shortfall, Journal of Banking & Finance 26 (7) (2002) 1487–1503.

[35] A. Ruszczynski, Risk-averse dynamic programming for Markov decision processes, Mathematical Programming 125 (2) (2010) 235–261.

[36] Y. Shen, W. Stannat, K. Obermayer, Risk-sensitive Markov control processes, SIAM Journal on Control and Optimization 51 (5) (2013) 3652–3672.

[37] A. Ruszczynski, Erratum to: Risk-averse dynamic programming for Markov decision processes, Mathematical Programming 145 (1) (2014) 601–604.

[38] S. Singh, Y. Chow, A. Majumdar, M. Pavone, A framework for time-consistent, risk-sensitive model predictive control: Theory and algorithms, IEEE Transactions on Automatic Control 64 (7) (2018) 2905–2912.

[39] N. Bauerle, A. Glauner, Markov decision processes with recursive risk measures, European Journal of Operational Research (2021) in press.

[40] P. Artzner, F. Delbaen, J.-M. Eber, D. Heath, Coherent measures of risk, Mathematical Finance 9 (3) (1999) 203–228.

[41] J. Kisiala, Conditional Value-at-Risk: Theory and Applications, Master's thesis, The School of Mathematics, The University of Edinburgh, Edinburgh, Scotland (2015).

[42] G. C. Pflug, A. Pichler, Time-consistent decisions and temporal decomposition of coherent risk functionals, Mathematics of Operations Research 41 (2) (2016) 682–699.

[43] N. Bauerle, J. Ott, Markov decision processes with Average-Value-at-Risk criteria, Mathematical Methods of Operations Research 74 (3) (2011) 361–379.

[44] W. B. Haskell, R. Jain, A convex analytic approach to risk-aware Markov decision processes, SIAM Journal on Control and Optimization 53 (3) (2015) 1569–1598.

[45] N. Bauerle, A. Glauner, Minimizing spectral risk measures applied to Markov decision processes, Mathematical Methods of Operations Research (2021) in press.

[46] K. M. Smith, M. P. Chapman, On Exponential Utility and Conditional Value-at-Risk as risk-averse performance criteria, under review, preprint arXiv:2108.01771.

[47] U. Gupta, Y. G. Kim, S. Lee, J. Tse, H.-H. S. Lee, G.-Y. Wei, D. Brooks, C.-J. Wu, Chasing carbon: The elusive environmental footprint of computing, in: Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, IEEE, 2021, pp. 854–867.

[48] A. Majumdar, A. A. Ahmadi, R. Tedrake, Control and verification of high-dimensional systems with DSOS and SDSOS programming, in: Proceedings of the IEEE Conference on Decision and Control, IEEE, 2014, pp. 394–401.

[49] A. A. Ahmadi, A. Majumdar, DSOS and SDSOS optimization: More tractable alternatives to sum of squares and semidefinite optimization, SIAM Journal on Applied Algebra and Geometry 3 (2) (2019) 193–230.

[50] S. Boyd, L. el Ghaoui, E. Feron, V. Balakrishnan, Linear Matrix Inequalities in System and Control Theory, SIAM, 1994.

[51] F. Blanchini, Set invariance in control, Automatica 35 (11) (1999) 1747–1767.

[52] Z. Wan, M. V. Kothare, An efficient off-line formulation of robust model predictive control using linear matrix inequalities, Automatica 39 (5) (2003) 837–846.

[53] P. Nilsson, N. Ozay, Synthesis of separable controlled invariant sets for modular local control design, in: Proceedings of the American Control Conference, IEEE, 2016, pp. 5656–5663.

[54] B. Roszak, M. E. Broucke, Necessary and sufficient conditions for reachability on a simplex, Automatica 42 (11) (2006) 1913–1918.

[55] B. Roszak, M. E. Broucke, Reachability of a set of facets for linear affine systems with n−1 inputs, IEEE Transactions on Automatic Control 52 (2) (2007) 359–364.

[56] Z. Lin, M. E. Broucke, On a reachability problem for affine hypersurface systems on polytopes, Automatica 47 (4) (2011) 769–775.

[57] I. M. Mitchell, A. M. Bayen, C. J. Tomlin, A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games, IEEE Transactions on Automatic Control 50 (7) (2005) 947–957.

[58] J. F. Fisac, M. Chen, C. J. Tomlin, S. S. Sastry, Reach-avoid problems with time-varying dynamics, targets and constraints, in: Proceedings of the International Conference on Hybrid Systems: Computation and Control, 2015, pp. 11–20.

[59] M. Chen, S. L. Herbert, M. S. Vashishtha, S. Bansal, C. J. Tomlin, Decomposition of reachable sets and tubes for a class of nonlinear systems, IEEE Transactions on Automatic Control 63 (11) (2018) 3675–3688.

[60] A. Pnueli, The temporal logic of programs, in: Proceedings of the Symposium on Foundations of Computer Science, IEEE, 1977, pp. 46–57.

[61] S. Coogan, M. Arcak, C. Belta, Formal methods for control of traffic flow: Automated control synthesis from finite-state transition models, IEEE Control Systems Magazine 37 (2) (2017) 109–128.

[62] M. Srinivasan, S. Coogan, Control of mobile robots using barrier functions under temporal logic specifications, IEEE Transactions on Robotics 37 (2) (2021) 363–374.

[63] P. Geibel, Reinforcement learning with bounded risk, in: Proceedings of the International Conference on Machine Learning, 2001, pp. 162–169.

[64] P. Geibel, F. Wysotzki, Risk-sensitive reinforcement learning applied to control under constraints, Journal of Artificial Intelligence Research 24 (2005) 81–108.

[65] A. Abate, M. Prandini, J. Lygeros, S. Sastry, Probabilistic reachability and safety for controlled discrete time stochastic hybrid systems, Automatica 44 (11) (2008) 2724–2734.

[66] S. Summers, J. Lygeros, Verification of discrete time stochastic hybrid systems: A stochastic reach-avoid decision problem, Automatica 46 (12) (2010) 1951–1961.

[67] J. Ding, M. Kamgarpour, S. Summers, A. Abate, J. Lygeros, C. Tomlin, A stochastic games framework for verification and control of discrete time stochastic hybrid systems, Automatica 49 (9) (2013) 2665–2674.

[68] I. Yang, A dynamic game approach to distributionally robust safety specifications for stochastic systems, Automatica 94 (2018) 94–101.

[69] D. Sadigh, A. Kapoor, Safe control under uncertainty with probabilistic signal temporal logic, in: Proceedings of Robotics: Science and Systems, 2016.

[70] D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, S. A. Seshia, A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications, in: Proceedings of the IEEE Conference on Decision and Control, IEEE, 2014, pp. 1091–1096.

[71] S. Jha, V. Raman, D. Sadigh, S. A. Seshia, Safe autonomy under perception uncertainty using chance-constrained temporal logic, Journal of Automated Reasoning 60 (1) (2018) 43–62.

[72] M. Tiger, F. Heintz, Incremental reasoning in probabilistic signal temporal logic, International Journal of Approximate Reasoning 119 (2020) 325–352.

[73] A. Zakaria, F. B. Ismail, M. H. Lipu, M. A. Hannan, Uncertainty models for stochastic optimization in renewable energy applications, Renewable Energy 145 (2020) 1543–1571.

[74] P. Harremoes, Stochastic models for estimation of extreme pollution from urban runoff, Water Research 22 (8) (1988) 1017–1026.

[75] D. del Giudice, R. Lowe, H. Madsen, P. S. Mikkelsen, J. Rieckermann, Comparison of two stochastic techniques for reliable urban runoff prediction by modeling systematic errors, Water Resources Research 51 (7) (2015) 5004–5022.

[76] C. V. Rao, D. M. Wolf, A. P. Arkin, Control, exploitation and tolerance of intracellular noise, Nature 420 (6912) (2002) 231–237.

[77] N. Eling, M. D. Morgan, J. C. Marioni, Challenges in measuring and understanding biological noise, Nature Reviews Genetics 20 (9) (2019) 536–548.

[78] R. A. Howard, J. E. Matheson, Risk-sensitive Markov decision processes, Management Science 18 (7) (1972) 356–369.

[79] G. B. di Masi, L. Stettner, Risk-sensitive control of discrete-time Markov processes with infinite horizon, SIAM Journal on Control and Optimization 38 (1) (1999) 61–78.

[80] V. S. Borkar, Q-learning for risk-sensitive control, Mathematics of Operations Research 27 (2) (2002) 294–311.

[81] V. Borkar, R. Jain, Risk-constrained Markov decision processes, IEEE Transactions on Automatic Control 59 (9) (2014) 2574–2579.

[82] M. P. Chapman, L. Lessard, Toward a scalable upper bound for a CVaR-LQ problem, IEEE Control Systems Letters 6 (2021) 920–925.

[83] R. D. Luce, H. Raiffa, Games and Decisions, John Wiley & Sons, Inc., New York, NY, 1957.

[84] D. Jacobson, Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games, IEEE Transactions on Automatic Control 18 (2) (1973) 124–131.

[85] J. L. Speyer, J. Deyst, D. H. Jacobson, Optimization of stochastic linear systems with additive measurement and process noise using exponential performance criteria, IEEE Transactions on Automatic Control 19 (4) (1974) 358–366.

[86] P. R. Kumar, J. H. van Schuppen, On the optimal control of stochastic systems with an exponential-of-integral performance index, Journal of Mathematical Analysis and Applications 80 (2) (1981) 312–332.

[87] P. Whittle, J. Kuhn, A Hamiltonian formulation of risk-sensitive linear/quadratic/Gaussian control, International Journal of Control 43 (1) (1986) 1–12.

[88] P. Whittle, A risk-sensitive maximum principle: The case of imperfect state observation, IEEE Transactions on Automatic Control 36 (7) (1991) 793–801.

[89] K. Glover, J. C. Doyle, State-space formulae for all stabilizing controllers that satisfy an H∞-norm bound and relations to risk sensitivity, Systems & Control Letters 11 (3) (1988) 167–172.

[90] J. Lofberg, Minimax Approaches to Robust Model Predictive Control, Ph.D. thesis, Department of Electrical Engineering, Linkoping University, Linkoping, Sweden (2003).

[91] T. Basar, P. Bernhard, H∞-Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach, 2nd Edition, Birkhauser, Boston, 1995.

[92] T. Bielecki, D. Hernandez-Hernandez, S. R. Pliska, Risk sensitive control of finite state Markov chains in discrete time, with applications to portfolio management, Mathematical Methods of Operations Research 50 (2) (1999) 167–188.

[93] R. Cavazos-Cadena, D. Hernandez-Hernandez, Discounted approximations for risk-sensitive average criteria in Markov decision chains with finite state space, Mathematics of Operations Research 36 (1) (2011) 133–146.

[94] R. Blancas-Rivera, R. Cavazos-Cadena, H. Cruz-Suarez, Discounted approximations in risk-sensitive average Markov cost chains with finite state space, Mathematical Methods of Operations Research 91 (2) (2020) 241–268.

[95] G. B. di Masi, L. Stettner, Infinite horizon risk sensitive control of discrete time Markov processes under minorization property, SIAM Journal on Control and Optimization 46 (1) (2007) 231–252.

[96] A. Jaskiewicz, Average optimality for risk-sensitive control with general state space, The Annals of Applied Probability 17 (2) (2007) 654–675.

[97] V. Anantharam, V. S. Borkar, A variational formula for risk-sensitive reward, SIAM Journal on Control and Optimization 55 (2) (2017) 961–988.

[98] M. P. Chapman, K. M. Smith, Classical risk-averse control for a finite-horizon Borel model, IEEE Control Systems Letters, in press, preprint arXiv:2107.13981.

[99] K. Boda, J. A. Filar, Time consistent dynamic risk measures, Mathematical Methods of Operations Research 63 (1) (2006) 169–186.

[100] D. M. Kreps, Decision problems with expected utility criteria, II: Stationarity, Mathematics of Operations Research 2 (3) (1977) 266–274.

[101] B. Rudloff, A. Street, D. M. Valladao, Time consistency and risk averse dynamic decision models: Definition, interpretation and practical consequences, European Journal of Operational Research 234 (3) (2014) 743–750.

[102] P. A. Forsyth, Multiperiod mean Conditional Value at Risk asset allocation: Is it advantageous to be time consistent?, SIAM Journal on Financial Mathematics 11 (2) (2020) 358–384.

[103] Y. Chow, A. Tamar, S. Mannor, M. Pavone, Risk-sensitive and robust decision-making: A CVaR optimization approach, in: Proceedings of the Advances in Neural Information Processing Systems, 2015, pp. 1522–1530.

[104] G. C. Pflug, A. Pichler, Time-inconsistent multistage stochastic programs: Martingale bounds, European Journal of Operational Research 249 (1) (2016) 155–163.

[105] O. Hernandez-Lerma, J. B. Lasserre, Discrete-time Markov control processes: Basic optimality criteria, Springer Science & Business Media, New York, NY, 1996.

[106] T. M. Moldovan, P. Abbeel, Safe exploration in Markov decision processes, in: Proceedings of the International Conference on Machine Learning, 2012.

[107] G. Schildbach, L. Fagiano, C. Frei, M. Morari, The scenario approach for stochastic model predictive control with bounds on closed-loop constraint violations, Automatica 50 (12) (2014) 3009–3018.

[108] B. P. van Parys, D. Kuhn, P. J. Goulart, M. Morari, Distributionally robust control of constrained stochastic systems, IEEE Transactions on Automatic Control 61 (2) (2015) 430–442.

[109] S. Samuelson, I. Yang, Safety-aware optimal control of stochastic systems using Conditional Value-at-Risk, in: Proceedings of the American Control Conference, IEEE, 2018, pp. 6285–6290.

[110] M. P. Chapman, J. Lacotte, A. Tamar, D. Lee, K. M. Smith, V. Cheng, J. F. Fisac, S. Jha, M. Pavone, C. J. Tomlin, A risk-sensitive finite-time reachability approach for safety of stochastic dynamic systems, in: Proceedings of the American Control Conference, IEEE, 2019, pp. 2958–2963.

[111] I. H. Witten, An adaptive optimal controller for discrete-time Markov environments, Information and Control 34 (4) (1977) 286–295.

[112] C. Watkins, Learning from Delayed Rewards, Ph.D. thesis, Cambridge University, United Kingdom (1989).

[113] C. Watkins, P. Dayan, Q-learning, Machine Learning 8 (1992) 279–292.

[114] O. Mihatsch, R. Neuneier, Risk-sensitive reinforcement learning, Machine Learning 49 (2) (2002) 267–290.

[115] R. Neuneier, O. Mihatsch, Risk sensitive reinforcement learning, in: M. Kearns, S. Solla, D. Cohn (Eds.), Proceedings of the Advances in Neural Information Processing Systems, Vol. 11, The MIT Press, 1998.

[116] Y. Shen, M. J. Tobia, T. Sommer, K. Obermayer, Risk-sensitive reinforcement learning, Neural Computation 26 (7) (2014) 1298–1328.

[117] W. Huang, W. B. Haskell, Risk-aware Q-learning for Markov decision processes, in: Proceedings of the IEEE Conference on Decision and Control, IEEE, 2017, pp. 4928–4933.

[118] W. Huang, W. B. Haskell, Stochastic approximation for risk-aware Markov decision processes, IEEE Transactions on Automatic Control 66 (3) (2021) 1314–1320.

[119] J. G. Schneider, Exploiting model uncertainty estimates for safe dynamic control learning, in: Proceedings of the Advances in Neural Information Processing Systems, 1996, pp. 1047–1053.

[120] T. J. Perkins, A. G. Barto, Lyapunov design for safe reinforcement learning, Journal of Machine Learning Research 3 (2002) 803–832.

[121] P. Abbeel, M. Quigley, A. Y. Ng, Using inaccurate models in reinforcement learning, in: Proceedings of the International Conference on Machine Learning, 2006, pp. 1–8.

[122] A. Aswani, H. Gonzalez, S. S. Sastry, C. Tomlin, Provably safe and robust learning-based model predictive control, Automatica 49 (5) (2013) 1216–1226.

[123] Y. Sui, A. Gotovos, J. Burdick, A. Krause, Safe exploration for optimization with Gaussian processes, in: Proceedings of the International Conference on Machine Learning, PMLR, 2015, pp. 997–1005.

[124] J. Schreiter, D. Nguyen-Tuong, M. Eberts, B. Bischoff, H. Markert, M. Toussaint, Safe exploration for active learning with Gaussian processes, in: Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2015, pp. 133–149.

[125] M. Turchetta, F. Berkenkamp, A. Krause, Safe exploration in finite Markov decision processes with Gaussian processes, in: Proceedings of the Advances in Neural Information Processing Systems, Vol. 29, Curran Associates, Inc., 2016.

[126] F. Berkenkamp, A. P. Schoellig, A. Krause, Safe controller optimization for quadrotors with Gaussian processes, in: Proceedings of the IEEE International Conference on Robotics and Automation, IEEE, 2016, pp. 491–496.

[127] F. Berkenkamp, M. Turchetta, A. P. Schoellig, A. Krause, Safe model-based reinforcement learning with stability guarantees, in: Proceedings of the Advances in Neural Information Processing Systems, Vol. 30, Curran Associates, Inc., 2017.

[128] J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, C. J. Tomlin, A general safety framework for learning-based control in uncertain robotic systems, IEEE Transactions on Automatic Control 64 (7) (2018) 2737–2752.

[129] R. Cheng, G. Orosz, R. M. Murray, J. W. Burdick, End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, pp. 3387–3395.

[130] A. Taylor, A. Singletary, Y. Yue, A. Ames, Learning for safety-critical control with control barrier functions, in: Proceedings of the Conference on Learning for Dynamics and Control, Vol. 120, PMLR, 2020.

[131] S. Sastry, M. Bodson, Adaptive Control: Stability, Convergence, and Robustness, Prentice-Hall, Inc., Englewood Cliffs, 1989.

[132] K. P. Wabersich, L. Hewing, A. Carron, M. N. Zeilinger, Probabilistic model predictive safety certification for learning-based control, IEEE Transactions on Automatic Control (2021) in press.

[133] K. P. Wabersich, M. N. Zeilinger, A predictive safety filter for learning-based control of constrained nonlinear dynamical systems, Automatica 129 (2021) in press.

[134] J. H. Gillula, C. J. Tomlin, Reducing conservativeness in safety guarantees by learning disturbances online: Iterated guaranteed safe online learning, in: Proceedings of Robotics: Science and Systems, Vol. 8, The MIT Press, 2013, pp. 81–88.


Figure 3: This figure summarizes the literature that we cover in this review, and the summary continues in Fig. 4. The first category pertains to methods that use worst-case analyses to quantify risk for autonomous systems. The second category describes methods that use probabilistic analyses, which includes chance constraints, and the third category involves methods that use mainly risk functionals. Some papers belong to more than one category. The taxonomy depicted in the figure is reproduced below in outline form:

Worst-case analyses
- Reachability or invariance analysis: Bertsekas and Rhodes (1971)
- Sum-of-squares: Majumdar, Ahmadi, Tedrake (2014); Ahmadi and Majumdar (2019)
- Linear matrix inequalities: Blanchini (1999); Wan and Kothare (2003); Nilsson and Ozay (2016)
- Geometric approaches: Roszak and Broucke (2006, 2007); Lin and Broucke (2011)
- Hamilton-Jacobi PDEs: Mitchell, Bayen, Tomlin (2005); Margellos and Lygeros (2011); Fisac et al. (2015); Chen et al. (2018)
- Additional temporal logic specifications: Coogan, Arcak, Belta (2017); Srinivasan and Coogan (2021)
- Glover and Doyle (1988); Heger (1994); Coraluppi and Marcus (1999)

Probabilistic analyses
- Model-based
  - Probabilistic temporal logic: Sadigh and Kapoor (2016); Jha et al. (2018); Tiger and Heintz (2020)
  - Dynamic programming: Abate et al. (2008); Summers and Lygeros (2010); Ding et al. (2013); Yang (2018); Haskell and Jain (2015); van Parys et al. (2015)
- Model-free: Geibel (2001); Geibel and Wysotzki (2005); Garcia and Fernandez (2012); Sadigh et al. (2014)

Using risk functionals
- Exponential utility: Howard and Matheson (1972); Jacobson (1973); Kumar and van Schuppen (1981); Whittle (1981, 1991); Glover and Doyle (1988); Coraluppi and Marcus (1999); di Masi and Stettner (1999, 2007); Jaskiewicz (2007); Cavazos-Cadena and Hernandez-Hernandez (2011); Anantharam and Borkar (2017); Blancas-Rivera, Cavazos-Cadena, Cruz-Suarez (2020)
- State-space augmentation: Bauerle and Ott (2011); Bauerle and Rieder (2014); Borkar and Jain (2014); Chow et al. (2015); Haskell and Jain (2015); Bauerle and Glauner (Math. Methods Oper. Res., 2021); Chapman and Lessard (2021); Chapman, Fauss, et al. (under review)
- Inner-outer optimization: Bauerle and Ott (2011); Miller and Yang (2017); Bauerle and Glauner (Math. Methods Oper. Res., 2021); Chapman and Lessard (2021)
- Nested risk functionals: Ruszczynski (2010, 2014); Singh et al. (2018); Bauerle and Glauner (Eur. J. Oper. Res., 2021)
- Risk constraints: Moldovan and Abbeel (2012); Borkar and Jain (2014); Schildbach et al. (2014); Haskell and Jain (2015); van Parys et al. (2015); Samuelson and Yang (2018)
- Model-free: Heger (1994); Borkar (2002); Mihatsch and Neuneier (2002); Shen et al. (2014); Huang and Haskell (2017, 2021)


Figure 4: This figure summarizes the literature from Sec. 4 in the sub-field of safe learning.

Figure 5: An illustration of the Value-at-Risk (VaR) of a random cost Z ∈ L1(Ω, F, µ_i). We consider two different distributions of Z, F_i(z) := µ_i(Z ≤ z) for i = 1, 2, to explain the VaR. F_1 is continuous and strictly increasing (left panel), while F_2 is right-continuous and increasing (right panel). µ_i is a probability measure, F_i is the CDF that corresponds to µ_i, 0 < α_2 ≤ α_1 < 1, and z_j is a realization of Z. On the left, it holds that F_1(z_1) = 1 − α_1 and F_1(z_2) = 1 − α_2. Thus, when the VaR is defined using µ_1, it holds that VaR_{α_1}(Z) = z_1 and VaR_{α_2}(Z) = z_2. On the right, there is no z such that F_2(z) = 1 − α_1, but for any z ≥ z_3, it holds that F_2(z) ≥ 1 − α_1. Consequently, when the VaR is defined using µ_2, we have VaR_{α_1}(Z) = z_3. Also, on the right, it holds that F_2(z) = 1 − α_2 for all z ∈ [z_4, z_5). Since z_4 is the smallest value in [z_4, z_5), it follows that VaR_{α_2}(Z) = z_4 when the VaR is defined using µ_2.
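For a right-continuous CDF, the figure's convention corresponds to the generalized inverse VaR_α(Z) = inf{z : F(z) ≥ 1 − α}. The following small sketch evaluates this definition for a hypothetical discrete distribution that mimics the right panel.

```python
import numpy as np

atoms = np.array([0.0, 1.0, 2.0, 4.0])       # support of Z (hypothetical)
probs = np.array([0.4, 0.3, 0.2, 0.1])

def var(alpha):
    # smallest atom z with F(z) >= 1 - alpha, i.e., the generalized inverse
    cdf = np.cumsum(probs)
    return atoms[np.searchsorted(cdf, 1 - alpha)]

for a in (0.05, 0.2, 0.5):
    print(f"VaR at level {a}: {var(a)}")
```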


Figure 6: A desired operating region K ⊆ S and a target set T ⊆ K are depicted. Every trajectory that starts in the blue shaded region stays inside K (invariance). Every trajectory that starts in the purple shaded region enters the target set T and stays inside K (reachability and invariance). A trajectory that exhibits the reachability property but not the invariance property is also shown.