
Dynamic Pricing and Management for Electric Autonomous Mobility on Demand Systems Using Reinforcement Learning

Berkay Turan    Ramtin Pedarsani    Mahnoosh Alizadeh

Abstract—The proliferation of ride sharing systems is a major driver in the advancement of autonomous and electric vehicle technologies. This paper considers the joint routing, battery charging, and pricing problem faced by a profit-maximizing transportation service provider that operates a fleet of autonomous electric vehicles. We define a dynamic system model that captures the time-dependent and stochastic features of an electric autonomous-mobility-on-demand system. To accommodate the time-varying nature of trip demands, renewable energy availability, and electricity prices, and to optimally manage the autonomous fleet, a dynamic policy is required. In order to develop a dynamic control policy, we first formulate the dynamic progression of the system as a Markov decision process. We argue that it is intractable to solve for the optimal policy using exact dynamic programming methods and therefore apply deep reinforcement learning to develop a near-optimal control policy. Furthermore, we establish the static planning problem by considering time-invariant system parameters. We define the capacity region and determine the optimal static policy to serve as a baseline for comparison with our dynamic policy. While the static policy provides important insights on optimal pricing and fleet management, we show that in a real dynamic setting, it is inefficient to utilize a static policy. The two case studies we conducted in Manhattan and San Francisco demonstrate the efficacy of our dynamic policy in terms of network stability and profits, while keeping the queue lengths up to 200 times shorter than under the static policy.

I. INTRODUCTION

The rapid evolution of enabling technologies for autonomous driving, coupled with advancements in eco-friendly electric vehicles (EVs), has facilitated state-of-the-art transportation options for urban mobility. Owing to these developments in automation, it is possible for an autonomous-mobility-on-demand (AMoD) fleet of autonomous EVs to serve society's transportation needs, with multiple companies now heavily investing in AMoD technology [1].

The introduction of autonomous vehicles for mobility on demand services provides an opportunity for better fleet management. Specifically, idle vehicles can be rebalanced throughout the network in order to prevent them from accumulating at certain locations and to serve induced demand at every location. Autonomous vehicles allow rebalancing to be performed centrally by a platform operator who observes the state of all the vehicles and the demand, rather than locally by individual drivers. Furthermore, EVs provide opportunities for cheap and environment-friendly energy resources (e.g., solar energy). However, electricity supplies and prices differ across the network both geographically and temporally. As such, this diversity can be exploited for cheaper energy options when the fleet is operated by a platform operator that is aware of the electricity prices throughout the whole network. Moreover, a dynamic pricing scheme for rides is essential to maximize the profits earned by serving the customers. Coupling an optimal fleet management policy with a dynamic pricing scheme allows the revenues to be maximized while reducing the rebalancing cost and the waiting time of the customers by adjusting the induced demand.

B. Turan, R. Pedarsani, and M. Alizadeh are with the Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106 USA. E-mail: {bturan, ramtin, alizadeh}@ucsb.edu.

We consider a model that captures the opportunities and challenges of an AMoD fleet of EVs and involves complex state and action spaces. In particular, the platform operator has to consider the number of customers waiting to be served at each location (queue lengths), the electricity prices, and the states of the EVs (locations, battery energy levels) in order to make decisions. These decisions consist of prices for rides for every OD pair and routing/charging decisions for every vehicle in the network. Upon taking an action, the state of the network undergoes a stochastic transition due to the randomness in customer behaviour and exogenously determined electricity prices.

Due to the continuous and high-dimensional state-action spaces, it is infeasible to develop an optimal policy using exact dynamic programming algorithms. As such, we utilize deep reinforcement learning (RL) to develop a near-optimal policy. Specifically, we show that it is possible to learn a policy via Trust Region Policy Optimization (TRPO) [2] that increases the total profits generated by jointly managing the fleet of EVs (by making routing and charging decisions) and pricing for the rides. We demonstrate the performance of our policy by using the total profits generated and the queue lengths as metrics.

Our contributions are as follows:

• We formalize a vehicle and network model that captures the aforementioned characteristics of an AMoD fleet of EVs as well as the stochasticity in demand and electricity prices.

• We employ deep RL methods to learn a joint pricing, routing, and charging policy that effectively stabilizes the queues and increases the profits.

• We analyze the static problem, where we consider a time-invariant environment (time-invariant arrivals, electricity prices, etc.), to gain insight towards the actual dynamic problem and to further provide a baseline for comparison.



Fig. 1: The schematic diagram of our framework. Our deep RL agent processes the state of the vehicles, queues, and electricity prices and outputs a control policy for pricing as well as autonomous EVs' routing and charging.


Fig. 2: (a) The optimal static policy manages to stabilize the queues over a very long time period but is unable to clear them, whereas (b) the RL control policy stabilizes the queues and manages to keep them significantly low (note the scales).

We visualize our framework as a schematic diagram in Figure 1 and preview our results in Figure 2, showing that the RL policy successfully keeps the queue lengths 200 times lower than the static policy.

Related work: Comprehensive research covering various aspects of AMoD systems is being conducted in the literature. Studies surrounding fleet management focus on optimal EV charging in order to reduce electricity costs, as well as optimal vehicle routing in order to serve the customers and to rebalance the empty vehicles throughout the network so as to reduce the operational costs and the customers' waiting times. Time-invariant control policies adopting queueing theoretical [3], fluidic [4], network flow [5], and Markovian [6] models have been developed by using the steady state of the system. The authors of [7] consider ride-sharing systems with mixed autonomy. However, the proposed control policies in these papers are not adaptive to the time-varying nature of the future demand. As such, there is work on developing time-varying model predictive control (MPC) algorithms [8]–[12]. The authors of [10], [11] propose data-driven algorithms and the authors of [12] propose a stochastic MPC algorithm focusing on vehicle rebalancing. In [8], the authors also consider a fleet of EVs and hence propose an MPC approach that optimizes vehicle routing and scheduling subject to energy constraints. Using a fluid-based optimization framework, the authors of [13] investigate tradeoffs between fleet size, rebalancing cost, and queueing effects in terms of passenger and vehicle flows under time-varying demand. The authors in [14] develop a parametric controller that approximately solves the intractable dynamic program for rebalancing over an infinite horizon. Aside from these, there are studies that aim to develop dynamic policies for rebalancing as well as ride request assignment via decentralized reinforcement learning approaches [15]–[17]. In these works, the policies are developed and applied locally by each autonomous vehicle, and dynamic pricing and charging strategies are not considered. Dynamic routing of autonomous vehicles using reinforcement learning with the goal of reducing congestion in mixed-autonomy traffic networks is proposed in [18].

Regarding charging strategies for large populations of EVs, [19]–[21] provide in-depth reviews and studies of smart charging technologies. An agent-based model to simulate the operations of an AMoD fleet of EVs under various vehicle and infrastructure scenarios has been examined in [22]. The authors of [23] propose an online charge scheduling algorithm for EVs providing AMoD services. By adopting a static network flow model in [24], the benefits of smart charging have been investigated and approximate closed-form expressions that highlight the trade-off between operational costs and charging costs have been derived. Furthermore, [25] studies interactions between AMoD systems and the power grid. In addition, [26] studies the implications of pricing schemes on an AMoD fleet of EVs. In [27], the authors propose a dynamic joint pricing and routing strategy for non-electric shared mobility on demand services. [28] studies a quadratic programming problem in order to jointly optimize vehicle dispatching, charge scheduling, and charging infrastructure, while the demand is defined exogenously.

Paper Organization: The remainder of the paper is organized as follows. In Section II, we present the system model and define the platform operator's optimization problem. In Section III, we first formulate the dynamics of the system as a Markov decision process and then explain the idea of the reinforcement learning method as well as the algorithm we adopted. In Section IV, we discuss the static planning problem associated with the system model and characterize the capacity region as well as the optimal static policy. In Section V, we present the numerical results of the case studies we have conducted in Manhattan and San Francisco to demonstrate the performance of our dynamic control policy. Finally, we conclude the paper in Section VI.

II. SYSTEM MODEL AND PROBLEM DEFINITION

Network and Demand Models: We consider a fleet of AMoD EVs operating within a transportation network characterized by a fully connected graph consisting of m nodes M = {1, . . . , m} that can each serve as a trip origin or destination. We study a discrete-time system with time periods normalized to integral units t ∈ {0, 1, 2, . . . }. In this discrete-time system, we model the arrival of the potential customers with origin-destination (OD) pair (i, j) as a Poisson process with an arrival rate of λ_ij per period, where λ_ii = 0. Moreover, we assume that these riders are heterogeneous in terms of their willingness to pay. In particular, if the price for receiving a ride from node i to node j in period t is set to ℓ_ij(t), the induced arrival rate for rides from i to j is given by Λ_ij(t) = λ_ij(1 − F(ℓ_ij(t))), where F(·) is the cumulative distribution of riders' willingness to pay with a support of [0, ℓ_max]. Thus, the number of new ride requests in time period t is A_ij(t) ∼ Pois(Λ_ij(t)) for OD pair (i, j).

Vehicle Model: To capture the effect of trip demand and the associated charging, routing, and rebalancing decisions on the fleet size, we assume that each autonomous vehicle in the fleet has a per-period operational cost of β. Furthermore, as the vehicles are electric, they have to sustain charge in order to operate. Without loss of generality, we assume there is a charging station placed at each node i ∈ M. To charge at node i during time period t, the operator pays a price of electricity p_i(t) per unit of energy. We assume that all EVs in the fleet have a battery capacity denoted as v_max ∈ Z_+; therefore, each EV has a discrete battery energy level v ∈ V, where V = {v ∈ N | 0 ≤ v ≤ v_max}. In our discrete-time model, we assume each vehicle takes one period to charge one unit of energy and τ_ij(t) periods to travel between OD pair (i, j) if the ride starts at time period t, while consuming v_ij units of energy.

Ride Sharing Model: The platform operator dynamically routes the fleet of EVs in order to serve the demand at each node. Customers that purchase a ride are not immediately matched with a ride, but enter the queue for OD pair (i, j). After the platform operator executes routing decisions for the fleet, the customers in the queue for OD pair (i, j) are matched with rides and served in a first-come, first-served discipline. A measure of the expected wait time is not available to each arriving customer. However, the operator knows that longer wait times will negatively affect their business and hence seeks to minimize the total wait time experienced by users. Denote the queue length for OD pair (i, j) by q_ij(t). If, after serving the customers, the queue length q_ij(t) > 0, the platform operator is penalized by a fixed cost of w per person in the queue to account for the value of time of the customers.

Platform Operator's Problem: We consider a profit-maximizing AMoD operator that manages a fleet of EVs that make trips to provide transportation services to customers. The operator's goal is to maximize profits by 1) setting prices for rides and hence managing customer demand at each node; 2) optimally operating the AMoD fleet (i.e., charging, routing, and rebalancing) to minimize operational and charging costs. We will study two types of control policies the platform operator utilizes: 1) a dynamic policy, where the pricing, routing, and charging decisions depend on the system state (such as queue lengths, prices of electricity, and vehicle locations and energy levels); 2) a static policy, where the pricing, routing, and charging decisions are time invariant and independent of the state of the system.
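To make the demand model concrete, the following is a minimal sketch, written by us for illustration (not the authors' code), of how one period of induced arrivals could be sampled; it assumes the uniform willingness-to-pay distribution F(ℓ) = ℓ/ℓ_max that the paper adopts later, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_arrivals(lam, prices, ell_max=30.0):
    """Sample one period of induced ride requests A_ij ~ Pois(Lambda_ij).

    lam:    (m, m) array of potential arrival rates lambda_ij (lam[i][i] = 0)
    prices: (m, m) array of posted prices ell_ij(t) in [0, ell_max]
    Assumes riders' willingness to pay is uniform on [0, ell_max],
    so F(ell) = ell / ell_max and Lambda_ij = lambda_ij * (1 - F(ell_ij)).
    """
    induced = lam * (1.0 - np.clip(prices, 0.0, ell_max) / ell_max)
    return rng.poisson(induced)

# toy example with m = 3 regions
lam = np.array([[0.0, 4.0, 2.0], [3.0, 0.0, 1.0], [2.0, 2.0, 0.0]])
prices = np.full((3, 3), 15.0)   # pricing at half of ell_max halves the induced demand
print(sample_arrivals(lam, prices))
```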

III. THE PROPOSED DYNAMIC POLICY

In this section, we establish a dynamic control policy to optimize the decisions that the platform operator makes given full state information. We first formulate the dynamic evolution of the network state as an MDP. The solution of this MDP is the optimal policy that determines which action to take for each state the system is in, and can nominally be derived using classical exact dynamic programming algorithms (e.g., value iteration). However, considering the complexity and the scale of our dynamic problem, the curse of dimensionality renders the MDP intractable to solve with classical exact dynamic programming algorithms. As such, we resort to approximate dynamic programming methods. Specifically, we define the policy via a deep neural network that takes the current state of the network (such as prices of electricity, queue lengths, and vehicle locations and energy levels) as input and outputs the best action1 (such as prices for rides and vehicle routing and charging decisions). Subsequently, we apply a reinforcement learning algorithm to train the neural network in order to improve the performance of the policy.

A. The Dynamic Problem as MDP

We define the MDP by the tuple (S, A, T, r), where S is the state space, A is the action space, T is the state transition operator, and r is the reward function. We define these elements as follows:

1 In general, the policy is a stochastic policy and determines the probabilities of taking the actions rather than deterministically producing an action.


1) S: The state space consists of the prices of electricity at each node, the queue lengths for each origin-destination pair, and the number of vehicles at each node and each energy level. However, since travelling from node i to node j takes τ_ij(t) periods of time, we need to define intermediate nodes. For brevity of exposition, let us assume that τ_ij(t) is a constant τ_ij during the time period for which the dynamic policy is developed2. As such, we define τ_ij − 1 intermediate nodes between each origin and destination pair, for each battery energy level v. Hence, the state space consists of

s_d = m^2 + (v_{max} + 1)\left(\sum_{i=1}^{m}\sum_{j=1}^{m} \tau_{ij} - m^2 + 2m\right)

dimensional vectors in R^{s_d}_{\ge 0} (we include all non-negative valued vectors; however, only m^2 − m entries can grow to infinity because they are queue lengths, and the rest are always upper bounded by the fleet size or the maximum price of electricity). As such, we define the elements of the state vector at time t as s(t) = [p(t) q(t) s_veh(t)], where p(t) = [p_i(t)]_{i∈M} is the electricity prices state vector, q(t) = [q_ij(t)]_{i,j∈M; i≠j} is the queue lengths state vector, and s_veh(t) = [s^v_ijk(t)]_{∀i,j,k,v} is the vehicle state vector, where s^v_ijk(t) is the number of vehicles at vehicle state (i, j, k, v). The vehicle state (i, j, k, v) specifies the location of a vehicle that is travelling between OD pair (i, j) as the k'th intermediate node between nodes i and j, and specifies the battery energy level of the vehicle as v (the states of the vehicles at the nodes i ∈ M with energy level v are denoted by (i, i, 0, v)).
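As a quick sanity check of the dimension formula above, the short helper below (our own, not from the paper) reproduces the value s_d = 1240 quoted later in the text for m = 10, v_max = 5, and τ_ij = 3 for all i ≠ j (with τ_ii = 0).

```python
def state_dim(m, v_max, tau):
    """s_d = m^2 + (v_max + 1) * (sum_ij tau_ij - m^2 + 2m), with tau[i][i] = 0."""
    total_tau = sum(tau[i][j] for i in range(m) for j in range(m))
    return m * m + (v_max + 1) * (total_tau - m * m + 2 * m)

m, v_max = 10, 5
tau = [[0 if i == j else 3 for j in range(m)] for i in range(m)]
print(state_dim(m, v_max, tau))  # -> 1240
```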

2) A: The action space consists of prices for rides for each origin-destination pair and routing/charging decisions for vehicles at nodes i ∈ M at each energy level v. The price actions are continuous in the range [0, ℓ_max]. Each vehicle at state (i, i, 0, v) (∀i ∈ M, ∀v ∈ V) can either charge, stay idle, or travel to one of the remaining m − 1 nodes. To allow for different transitions for vehicles at the same state (some might charge, some might travel to another node), we define the action taken at time t for vehicles at state (i, i, 0, v) as an (m + 1)-dimensional probability vector with entries in [0, 1] that sum up to 1:

\alpha^v_i(t) = [\alpha^v_{i1}(t) \ \ldots \ \alpha^v_{im}(t) \ \alpha^v_{ic}(t)],

where α^{v_max}_ic(t) = 0 and α^v_ij(t) = 0 if v < v_ij. The action space is then all the vectors a of dimension a_d = m^2 − m + (v_max + 1)(m^2 + m), whose first m^2 − m entries are the prices and the rest are the probability vectors satisfying the aforementioned properties. As such, we define the elements of the action vector at time t as a(t) = [ℓ(t) α(t)], where ℓ(t) = [ℓ_ij]_{i,j∈M, i≠j} is the vector of prices and α(t) = [α^v_i(t)]_{∀i,v} is the vector of routing/charging actions.
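In practice, a raw neural-network output has to be mapped onto this action space. The sketch below is our own illustration of one way to do that (it is not taken from the authors' implementation): price outputs are squashed into [0, ℓ_max], and the routing/charging logits for a vehicle state (i, i, 0, v) are turned into a valid probability vector by masking charging at full battery and trips the current energy level cannot support.

```python
import numpy as np

def make_routing_action(logits, v, v_max, v_trip_from_i, ell_max=30.0, price_raw=None):
    """Turn raw outputs into a feasible action for vehicles at state (i, i, 0, v).

    logits:        length m+1 raw scores [go to node 1, ..., go to node m, charge]
    v:             current battery energy level
    v_trip_from_i: length m array, v_trip_from_i[j] = energy v_ij needed to reach node j+1
    Returns (prices, alpha) where alpha is a probability vector summing to 1.
    """
    mask = np.ones_like(logits, dtype=bool)
    mask[:-1] = v >= v_trip_from_i          # cannot start a trip without enough charge
    mask[-1] = v < v_max                    # cannot charge beyond a full battery
    z = np.where(mask, logits, -np.inf)
    z = z - z.max()                         # numerically stable softmax
    alpha = np.exp(z) / np.exp(z).sum()
    prices = None
    if price_raw is not None:               # squash raw price outputs into [0, ell_max]
        prices = ell_max / (1.0 + np.exp(-np.asarray(price_raw)))
    return prices, alpha

# toy example: m = 3, vehicle at node 1 with 1 unit of energy
_, alpha = make_routing_action(np.array([0.2, -0.1, 0.5, 0.3]), v=1, v_max=5,
                               v_trip_from_i=np.array([0, 2, 1]))
print(alpha)   # zero probability of travelling to node 2 (which needs 2 units)
```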

2 To account for different traffic conditions during different time periods of the day, we can define different sets of intermediate nodes. According to the traffic conditions, the vehicles take the longer route (higher traffic, more intermediate nodes) or the shorter route (less traffic, fewer intermediate nodes). Furthermore, to account for stochasticity on the routes (e.g., traffic lights), the number of intermediate nodes a vehicle traverses in one time period can be defined as a random variable.

3) T: The transition operator is defined as T_ijk = Pr(s(t + 1) = j | s(t) = i, a(t) = k). We can define the transition probabilities for electricity prices p(t + 1), queue lengths q(t + 1), and vehicle states s_veh(t + 1) as follows:

Electricity Price Transitions: Since we assume that the dynamics of the prices of electricity are exogenous to our AMoD system, Pr(p(t + 1) = p_2 | p(t) = p_1, a(t)) = Pr(p(t + 1) = p_2 | p(t) = p_1), i.e., the dynamics of the prices are independent of the action taken. Depending on the setting, new prices might either be deterministic or distributed according to some probability density function at time t: p(t) ∼ P(t), which is determined by the electricity provider.

Vehicle Transitions: For each vehicle at node i and energy level v, the transition probability is defined by the action probability vector α^v_i(t). Each vehicle transitions into state (i, j, 1, v − v_ij) with probability α^v_ij(t), stays idle in state (i, i, 0, v) with probability α^v_ii(t), or charges and transitions into state (i, i, 0, v + 1) with probability α^v_ic(t). The vehicles at intermediate states (i, j, k, v) transition into state (i, j, k + 1, v) if k < τ_ij − 1, or into (j, j, 0, v) if k = τ_ij − 1, with probability 1. The total transition probability to the vehicle states s_veh(t + 1) given s_veh(t) and α(t) is the sum of the probabilities of all feasible transitions from s_veh(t) to s_veh(t + 1) under α(t), where the probability of a feasible transition is the product of the individual vehicle transition probabilities (since the vehicle transition probabilities are independent). Note that instead of gradually dissipating the energy of the vehicles along their route, we immediately discharge the energy required for the trip from their batteries and keep it constant during the trip. This ensures that the vehicles have enough battery to complete the ride and does not violate the model, because the vehicles arrive at their destinations with the true value of energy and a new action will only be taken when they reach the destination.
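As an illustration of this transition rule for a single idle vehicle, the following sketch (ours, with hypothetical helper names) samples the next vehicle state from the action probability vector α^v_i(t) and applies the immediate-discharge convention described above.

```python
import numpy as np

rng = np.random.default_rng(1)

def step_idle_vehicle(i, v, alpha, v_trip, m):
    """Sample the next state of a vehicle currently at (i, i, 0, v).

    alpha:  length m+1 probabilities [alpha_i1, ..., alpha_im, alpha_ic], summing to 1
    v_trip: (m, m) energy requirements v_ij
    Returns the next vehicle state (i, j, k, v').
    """
    choice = rng.choice(m + 1, p=alpha)
    if choice == m:                        # charge: gain one unit of energy
        return (i, i, 0, v + 1)
    j = choice
    if j == i:                             # stay idle
        return (i, i, 0, v)
    # start a trip: discharge v_ij immediately, move to the first intermediate node
    return (i, j, 1, v - v_trip[i][j])
```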

Queue Transitions: The queue lengths transition according to the prices and the vehicle routing decisions. For prices ℓ_ij(t) and induced arrival rate Λ_ij(t), the probability that A_ij(t) new customers arrive in the queue (i, j) is:

Pr(A_{ij}(t)) = \frac{e^{-\Lambda_{ij}(t)} \Lambda_{ij}(t)^{A_{ij}(t)}}{A_{ij}(t)!}.

Let us denote the total number of vehicles routed from node i to j at time t as x_ij(t), which is given by:

x_{ij}(t) = \sum_{v=v_{ij}}^{v_{max}} x^v_{ij}(t) = \sum_{v=v_{ij}}^{v_{max}} s^{v-v_{ij}}_{ij1}(t+1).   (1)

Given s_veh(t + 1) and x_ij(t), the probability that the queue length q_ij(t + 1) = q is:

Pr(q_{ij}(t+1) = q \mid s(t), a(t), s_{veh}(t+1)) = Pr(A_{ij}(t) = q - q_{ij}(t) + x_{ij}(t))

if q > 0, and Pr(A_ij(t) ≤ −q_ij(t) + x_ij(t)) if q = 0. Since the arrivals are independent, the total probability that the queue vector q(t + 1) = q is:

Pr(q(t+1) = q \mid s(t), a(t), s_{veh}(t+1)) = \prod_{i=1}^{m} \prod_{\substack{j=1 \\ j \neq i}}^{m} Pr(q_{ij}(t+1) \mid s(t), a(t), s_{veh}(t+1)).
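Equivalently, these transition probabilities amount to the recursion q_ij(t + 1) = max(q_ij(t) + A_ij(t) − x_ij(t), 0). A minimal simulation step implementing it (our own sketch, not the authors' code) is given below.

```python
import numpy as np

rng = np.random.default_rng(2)

def step_queues(q, induced_rate, x):
    """One-period queue update for all OD pairs.

    q:            (m, m) current queue lengths q_ij(t)
    induced_rate: (m, m) induced arrival rates Lambda_ij(t)
    x:            (m, m) vehicles routed with passengers, x_ij(t)
    """
    arrivals = rng.poisson(induced_rate)          # A_ij(t) ~ Pois(Lambda_ij(t))
    return np.maximum(q + arrivals - x, 0), arrivals
```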


Fig. 3: The schematic diagram representing the state transition of our MDP. Upon taking an action, a vehicle at state (i, i, 0, v) charges for a price of p_i(t) and transitions into state (i, i, 0, v + 1) with probability α^v_ic(t), stays idle at state (i, i, 0, v) with probability α^v_ii(t), or starts traveling to another node j and transitions into state (i, j, 1, v − v_ij) with probability α^v_ij(t). Furthermore, A_ij(t) new customers arrive to the queue (i, j) depending on the price ℓ_ij(t). After the routing and charging decisions are executed for all the EVs in the fleet, the queues are modified.

Hence, the transition probability is defined as:

Pr(s(t+1) \mid s(t), a(t)) = Pr(p(t+1) \mid p(t)) \times Pr(s_{veh}(t+1) \mid s(t), \alpha(t)) \times Pr(q(t+1) \mid s(t), \alpha(t), s_{veh}(t+1)).   (2)

We illustrate how the vehicles and queues transition into new states consequent to an action in Figure 3.

4) r: The reward function r(t) is a function of state-action pairs at time t: r(t) = r(a(t), s(t)). Let x^v_ic(t) denote the number of vehicles charging at node i starting with energy level v at time period t. The reward function r(t) is defined as:

r(t) = \sum_{i=1}^{m} \sum_{\substack{j=1 \\ j \neq i}}^{m} \ell_{ij}(t) A_{ij}(t) - w \sum_{i=1}^{m} \sum_{\substack{j=1 \\ j \neq i}}^{m} q_{ij}(t) - \sum_{i=1}^{m} \sum_{v=0}^{v_{max}-1} (\beta + p_i) x^v_{ic}(t) - \beta \sum_{i=1}^{m} \sum_{\substack{j=1 \\ j \neq i}}^{m} x_{ij}(t) - \beta \sum_{i=1}^{m} \sum_{\substack{j=1 \\ j \neq i}}^{m} \sum_{k=1}^{\tau_{ij}-1} \sum_{v=0}^{v_{max}-1} s^v_{ijk}(t).

The first term corresponds to the revenue generated by the passengers that request a ride for a price ℓ_ij(t), the second term is the queue cost of the passengers that have not yet been served, the third term is the charging and operational costs of the charging vehicles, and the last two terms are the operational costs of the vehicles making trips. Note that the revenue generated is immediately added to the reward function when the passengers enter the network instead of after the passengers are served. Since the reinforcement learning approach is based on maximizing the cumulative reward gained, all the passengers eventually have to be served in order to prevent the queues from blowing up, and hence it does not violate the model to add the revenues immediately.
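For concreteness, a direct translation of the reward above into code could look like the sketch below; it is our own illustration, and the array names and shapes are assumptions rather than the authors' implementation.

```python
import numpy as np

def reward(ell, arrivals, q, x_charge, x_trip, s_enroute, p, beta=0.1, w=2.0):
    """Per-period reward r(t) of the MDP.

    ell:       (m, m) prices ell_ij(t)
    arrivals:  (m, m) new requests A_ij(t)
    q:         (m, m) queue lengths after serving, q_ij(t)
    x_charge:  (m, v_max) number of vehicles charging at node i from energy level v
    x_trip:    (m, m) vehicles dispatched on trips this period, x_ij(t)
    s_enroute: total number of vehicles currently at intermediate nodes
    p:         (m,) electricity prices p_i
    """
    revenue = np.sum(ell * arrivals)                       # fares collected on arrival
    queue_cost = w * np.sum(q)                             # waiting-time penalty
    charge_cost = np.sum((beta + p)[:, None] * x_charge)   # operating + energy cost of charging
    trip_cost = beta * (np.sum(x_trip) + s_enroute)        # operating cost of vehicles on trips
    return revenue - queue_cost - charge_cost - trip_cost
```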

Using the definitions of the tuple (S, A, T, r), we model the dynamic problem as an MDP. Observe that aside from having a large-dimensional state space (for instance, m = 10, v_max = 5, τ_ij = 3 ∀i, j gives s_d = 1240) and action space, the cardinality of these spaces is not finite (queues can grow unbounded, prices are continuous). As such, we cannot solve the MDP using exact dynamic programming methods. As a solution, we characterize the dynamic policy via a deep neural network and execute reinforcement learning in order to develop a dynamic policy.

B. Reinforcement Learning Method

In this subsection, we go through the preliminaries of reinforcement learning and briefly explain the idea of the algorithm we adopted.

1) Preliminaries: The dynamic policy associated with the MDP is defined as a function parameterized by θ: π_θ(a|s) = π : S × A → [0, 1], i.e., a probability distribution over the state-action space. Given a state s, the policy returns the probability of taking the action a (for all actions), and samples an action according to the probability distribution. The goal is to derive the optimal policy π*, which maximizes the discounted cumulative expected reward J_π:

J_{\pi^*} = \max_{\pi} J_{\pi} = \max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(t)\right], \qquad \pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(t)\right],

where γ ∈ (0, 1] is the discount factor. The value of taking an action a in state s and following the policy π afterwards is characterized by the value function Q_π(s, a):

Q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(t) \,\middle|\, s(0) = s, \ a(0) = a\right].

The value of being in state s is formalized by the value function V_π(s):

V_{\pi}(s) = \mathbb{E}_{a(0), \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(t) \,\middle|\, s(0) = s\right],


and the advantage of taking the action a in state s and following the policy π thereafter is defined as the advantage function A_π(s, a):

A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s).

The methods used by reinforcement learning algorithms can be divided into three main groups: 1) critic-only methods, 2) actor-only methods, and 3) actor-critic methods, where the word critic refers to the value function and the word actor refers to the policy [29]. Critic-only (or value-function based) methods (such as Q-learning [30] and SARSA [31]) improve a deterministic policy using the value function by iterating:

a^* = \arg\max_{a} Q_{\pi}(s, a), \qquad \pi(a^* \mid s) \leftarrow 1.

Actor-only methods (or policy gradient methods), such as Williams' REINFORCE algorithm [32], improve the policy by updating the parameter θ via gradient ascent, without using any form of a stored value function:

\theta(t+1) = \theta(t) + \alpha \nabla_{\theta} \mathbb{E}_{\pi_{\theta(t)}}\left[\sum_{\tau} \gamma^{\tau} r(\tau)\right].

The advantage of policy gradient methods is their ability to generate actions from a continuous action space by utilizing a parameterized policy.

Finally, actor-critic methods [33], [34] make use of both value functions and policy gradients:

\theta(t+1) = \theta(t) + \alpha \nabla_{\theta} \mathbb{E}_{\pi_{\theta(t)}}\left[Q_{\pi_{\theta(t)}}(s, a)\right].

Actor-critic methods are able to produce actions in a continuous action space, while reducing the high variance of the policy gradients by adding a critic (value function).

All of these methods aim to update the parameters θ (or directly update the policy π for critic-only methods) to improve the policy. In deep reinforcement learning, the policy π is defined by a deep neural network, whose weights constitute the parameter θ. To develop a dynamic policy for our MDP, we adopt a practical policy gradient method called Trust Region Policy Optimization (TRPO).

2) Trust Region Policy Optimization: TRPO is a practical policy gradient method developed in [2], and is effective for optimizing large nonlinear policies such as neural networks. It supports continuous state-action spaces and guarantees monotonic improvement.

Let π and π̃ be two different policies. Then, the following equality expresses the expected return of policy π̃ in terms of its advantage over π:

J_{\tilde{\pi}} = J_{\pi} + \mathbb{E}_{\tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_{\pi}(s, a)\right].   (3)

Let σ_π(s) be the discounted visitation frequency of state s under policy π:

\sigma_{\pi}(s) = \Pr(s(0) = s) + \gamma \Pr(s(1) = s) + \ldots,

where s(0) is distributed according to some initial distribution σ_0. Using σ_π̃(s) and writing the expectation explicitly, Equation (3) becomes:

J_{\tilde{\pi}} = J_{\pi} + \sum_{s} \sigma_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a).   (4)

This implies that any policy π̃ such that Σ_a π̃(a|s) A_π(s, a) ≥ 0 at every state s is at least as good as policy π. However, because of the dependency of σ_π̃ on π̃, it is difficult to optimize Equation (4). Thus, a local approximation to J_π̃ using the visitation frequencies σ_π(s) of the current policy is introduced:

L_{\pi}(\tilde{\pi}) = J_{\pi} + \sum_{s} \sigma_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a).

Using this approximation L_π(π̃), Algorithm 1 can be applied to perform policy iteration, with ε = max_{s,a} |A_π(s, a)| and D^max_KL(π_t, π̃) = max_s D_KL(π_t(· | s) ‖ π̃(· | s)) being the KL divergence between the two policies maximized over the states.

Algorithm 1: Policy iteration algorithm guaranteeing non-decreasing expected return

Initialize π_0.
for t = 0, 1, 2, . . . until convergence do
    Compute all advantage values A_{π_t}(s, a).
    Solve the constrained optimization problem:
        π_{t+1} = \arg\max_{\tilde{\pi}} \left[L_{\pi_t}(\tilde{\pi}) - C D^{max}_{KL}(\pi_t, \tilde{\pi})\right],
        where C = 4εγ/(1 − γ)^2 and L_{\pi_t}(\tilde{\pi}) = J_{\pi_t} + \sum_s \sigma_{\pi_t}(s) \sum_a \tilde{\pi}(a \mid s) A_{\pi_t}(s, a).
end

The key idea of Algorithm 1 is to perform policy iteration without changing the policy too much, by imposing a penalty on the KL divergence. This is the same idea that lies at the heart of TRPO. Instead of penalizing the KL divergence, TRPO imposes a constraint on the KL divergence and solves the constrained maximization problem using the conjugate gradient method. In that sense, it is similar to natural policy gradient methods. We refer the reader to [2] for a comprehensive study.

IV. ANALYSIS OF THE STATIC PROBLEM

In this section, we establish and discuss the static planning problem to provide a measure for comparison and to demonstrate the efficacy of the dynamic policy. To do so, we consider the fluid scaling of the dynamic network and characterize the static problem via a network flow formulation. Under this setting, we use the expected values of the variables (travel durations, arrivals, and prices of electricity) and ignore their time-dependent dynamics, while allowing the vehicle routing decisions to be flows (real numbers) rather than integers. The static problem is convenient for determining the so-called capacity region of the dynamic problem as well as determining the optimal static pricing, routing, and charging policy of the platform operator.


A. The Capacity Region

We formulate the static optimization problem via a network flow model that characterizes the capacity region of the network for a given set of prices ℓ_ij(t) = ℓ_ij ∀t (hence, Λ_ij(t) = Λ_ij ∀t). The capacity region is defined as the set of all arrival rates [Λ_ij]_{i,j∈M} for which there exists a charging and routing policy under which the queueing network of the system is stable [35]3. Let x^v_i be the number of vehicles available at node i, α^v_ij be the fraction of vehicles at node i with energy level v being routed to node j, and α^v_ic be the fraction of vehicles charging at node i starting with energy level v. We say the static vehicle allocation for node i and energy level v is feasible if:

\alpha^v_{ic} + \sum_{\substack{j=1 \\ j \neq i}}^{m} \alpha^v_{ij} \le 1.

The optimization problem that characterizes the capacity re-gion of the network ensures that the total number of vehiclesrouted from i to j is at least as large as the nominal arrival rateto the queue (i, j). Namely, the problem can be formulated asfollows:

minxvi ,α

vij ,α

vic

ρ (5a)

subject to Λij ≤vmax∑v=vij

xvi αvij ∀i, j ∈M, (5b)

ρ ≥ αvic +

m∑j=1j 6=i

αvij ∀i ∈M, ∀v ∈ V, (5c)

xvi = xv−1i αv−1

ic

+

m∑j=1

xv+vjii α

v+vjiji ∀i ∈M, ∀v ∈ V, (5d)

m∑i=1

m∑j=1

vmax∑v=vij

xvi αvijτij

+

m∑i=1

vmax−1∑v=0

xvi αvic ≤ N, (5e)

αvmaxic = 0 ∀i ∈M, (5f)αvij = 0 ∀v < vij , ∀i, j ∈M (5g)

xvi ≥ 0, αvij ≥ 0 αvic ≥ 0, ∀i, j ∈M, ∀v ∈ V,(5h)

xvi = αvic = αvij = 0 ∀v /∈ V, ∀i, j ∈M. (5i)

The constraint (5b) requires the platform to operate at least as many vehicles as needed to serve all the induced demand between any two nodes i and j (the rest are the vehicles travelling without passengers, i.e., rebalancing vehicles). We will refer to this as the demand satisfaction constraint. The constraint (5d) is the flow balance constraint for each node and each battery energy level, which restricts the number of available vehicles at node i and energy level v to be the sum of arrivals from all nodes (including idle vehicles) and vehicles that are charging with energy level v − 1. The constraint (5e) is the fleet size constraint, restricting the total number of operated vehicles in the network to be upper bounded by N. The constraint (5f) ensures that the vehicles with full battery do not charge further, and the constraint (5g) ensures the vehicles sustain enough charge to travel between OD pair (i, j). Finally, the constraint (5c) upper bounds the allocation of vehicles for each node i and energy level v.

3 The stability condition that we are interested in is rate stability of all queues. A queue for OD pair (i, j) is rate stable if \lim_{t \to \infty} q_{ij}(t)/t = 0.

Proposition 1. Let the optimal value of (5) be ρ*. Then, ρ* ≤ 1 is a necessary and sufficient condition for rate stability of the system under some routing and charging policy.

The proof of Proposition 1 is provided in Appendix A. By Proposition 1, the capacity region C_Λ of the network is the set of all Λ_ij ∈ R_+ for which the corresponding optimal solution to the optimization problem (5) satisfies ρ* ≤ 1. As long as ρ* ≤ 1, there exists a routing and charging policy such that the queues will be bounded away from infinity.

B. Static Profit Maximization Problem

The platform operator's goal is to maximize its profits by setting prices and making routing and charging decisions such that the system remains stable. Setting prices for rides allows the platform operator to shift the induced demand into the capacity region (higher prices decrease the arrival rate and thus maintain stability of the queues). In its most general form, the problem can be formulated as follows:

\max_{\ell_{ij}, x^v_i, \alpha^v_{ij}, \alpha^v_{ic}} \quad U(\Lambda_{ij}(\ell_{ij}), x^v_i, \alpha^v_{ij}, \alpha^v_{ic})

subject to \quad [\Lambda_{ij}(\ell_{ij})]_{i,j \in M} \in C_{\Lambda},   (6)

where U(·) is the utility function that depends on the prices, the demand for rides, and the vehicle decisions.

Next, we explicitly state the platform operator's profit maximization problem. Instead of imposing a fleet size constraint in the problem, we want to jointly optimize pricing, routing, and charging as well as the fleet size. To account for the effect of fleet size, we assign a per-vehicle operational cost of β. Let x^v_ic = x^v_i α^v_ic and x^v_ij = x^v_i α^v_ij. Using these new variables and noting that α^v_ic + Σ_{j=1}^m α^v_ij = 1 when ρ* ≤ 1, the platform operator's problem can be stated as:

\max_{x^v_{ic}, x^v_{ij}, \ell_{ij}} \quad \sum_{i=1}^{m} \sum_{j=1}^{m} \lambda_{ij} \ell_{ij} (1 - F(\ell_{ij})) - \sum_{i=1}^{m} \sum_{v=0}^{v_{max}-1} (\beta + p_i) x^v_{ic} - \beta \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{v=v_{ij}}^{v_{max}} x^v_{ij} \tau_{ij}   (7a)

subject to

\lambda_{ij}(1 - F(\ell_{ij})) \le \sum_{v=v_{ij}}^{v_{max}} x^v_{ij} \quad \forall i, j \in M,   (7b)

x^v_{ic} + \sum_{j=1}^{m} x^v_{ij} = x^{v-1}_{ic} + \sum_{j=1}^{m} x^{v+v_{ji}}_{ji} \quad \forall i \in M, \ \forall v \in V,   (7c)


x^{v_{max}}_{ic} = 0 \quad \forall i \in M,   (7d)

x^v_{ij} = 0 \quad \forall v < v_{ij}, \ \forall i, j \in M,   (7e)

x^v_{ic} \ge 0, \ x^v_{ij} \ge 0 \quad \forall i, j \in M, \ \forall v \in V,   (7f)

x^v_{ic} = x^v_{ij} = 0 \quad \forall v \notin V, \ \forall i, j \in M.   (7g)

The first term in the objective function in (7) accounts for the aggregate revenue the platform generates by providing rides to λ_ij(1 − F(ℓ_ij)) riders at a price of ℓ_ij. The second term is the operational and charging costs incurred by the charging vehicles (assuming that p_i(t) = p_i ∀t under the static setting), and the last term is the operational costs of the trip-making vehicles (including rebalancing trips). The constraints are similar to those of (5), with x^v_i = x^v_ic + Σ_{j=1}^m x^v_ij (excluding the fleet size constraint).

The optimization problem in (7) is non-convex for a general F(·). Nonetheless, when the platform's profits are affine in the induced demand λ_ij(1 − F(·)), it can be rewritten as a convex optimization problem. Hence, we assume that the riders' willingness to pay is uniformly distributed in [0, ℓ_max], i.e., F(ℓ_ij) = ℓ_ij / ℓ_max.

Marginal Pricing: The prices for rides are a crucial component of the profits generated. The next proposition highlights how the optimal prices ℓ*_ij for rides are related to the network parameters, the prices of electricity, and the operational costs.

Proposition 2. Let ν*_ij be the optimal dual variable corresponding to the demand satisfaction constraint for OD pair (i, j). The optimal prices ℓ*_ij are:

\ell^*_{ij} = \frac{\ell_{max} + \nu^*_{ij}}{2}.   (8)

These prices can be upper bounded by:

\ell^*_{ij} \le \frac{\ell_{max} + \beta(\tau_{ij} + \tau_{ji} + v_{ij} + v_{ji}) + v_{ij} p_j + v_{ji} p_i}{2}.   (9)

Moreover, with these optimal prices ℓ*_ij, the profit generated per period is:

P = \sum_{i=1}^{m} \sum_{j=1}^{m} \frac{\lambda_{ij}}{\ell_{max}} (\ell_{max} - \ell^*_{ij})^2.   (10)

The proof of Proposition 2 is provided in Appendix B. Observe that the profits in Equation (10) decrease as the prices for rides increase. Thus, expensive rides generate less profit than cheaper rides, and it is more beneficial if the optimal dual variables ν*_ij are small and the prices are close to ℓ_max/2. We can interpret the dual variable ν*_ij as the cost to the platform of providing a single ride between i and j. In the worst-case scenario, every single requested ride from node i requires rebalancing and charging both at the origin and the destination. Hence the upper bound in (9) includes the operational costs of passenger-carrying, rebalancing, and charging vehicles (both at the origin and the destination), and the energy costs of both passenger-carrying and rebalancing trips multiplied by the price of electricity at the trip destinations. Similar to taxes on products, whose burden is shared between the supplier and the customer, the costs associated with rides are shared between the platform operator and the riders (which is why the price paid by the riders includes half of the cost of the ride).
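As a small numerical illustration of Proposition 2 (the numbers below are hypothetical and chosen only to exercise the formulas): with ℓ_max = $30 and a per-ride cost ν*_ij = $6, Equation (8) gives an optimal price of $18, and an OD pair with potential rate λ_ij = 4 riders per period contributes $19.2 per period to the profit in Equation (10).

```python
ell_max = 30.0   # maximum willingness to pay (as in the case studies)
nu = 6.0         # hypothetical optimal dual variable nu*_ij (per-ride cost)
lam = 4.0        # hypothetical potential arrival rate lambda_ij

ell_star = (ell_max + nu) / 2                            # Eq. (8): 18.0
profit_ij = (lam / ell_max) * (ell_max - ell_star) ** 2  # Eq. (10) term: 19.2
print(ell_star, profit_ij)
```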

Even though the static planning problem provides important insights on the capacity region, fleet size, and pricing, a static policy does not perform well in a real dynamic setting because it does not acknowledge the time-varying dynamics of the system. We demonstrate the performance of both dynamic and static policies in the next section.

V. NUMERICAL STUDY

In this section, we discuss the numerical experiments and results for the performance of the reinforcement learning approach to the dynamic problem and compare it with the performance of several static policies, including the optimal static policy outlined in Section IV. We solved for the optimal static policy using CVX, a package for specifying and solving convex programs [36]. To implement the dynamic setting as an MDP compatible with reinforcement learning algorithms, we used the Gym toolkit [37] developed by OpenAI to create an environment. For the implementation of the TRPO algorithm, we used the Stable Baselines toolkit [38].
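For reference, a minimal training script in this toolchain could look like the sketch below. It is our own illustration under stated assumptions: AMoDEnv is a hypothetical Gym environment implementing the MDP of Section III, and the hyperparameters mirror the ones reported in this section (two 64-neuron hidden layers, value-function step size 0.001, roughly 5 million training iterations); it is not the authors' released code.

```python
from stable_baselines import TRPO
from stable_baselines.common.policies import MlpPolicy

from amod_env import AMoDEnv   # hypothetical module providing the custom Gym environment

env = AMoDEnv()
model = TRPO(
    MlpPolicy,
    env,
    policy_kwargs=dict(net_arch=[64, 64]),   # two hidden layers, 64 neurons each
    vf_stepsize=1e-3,                        # value-function step size
    verbose=1,
)
model.learn(total_timesteps=5_000_000)       # ~5 million environment steps
model.save("amod_trpo")

# roll out the learned policy
obs = env.reset()
for _ in range(288):                         # one day of 5-minute periods
    action, _ = model.predict(obs, deterministic=False)
    obs, reward, done, info = env.step(action)
```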

We chose an operational cost of β = $0.1 (by normalizing the average price of an electric car over 5 years [39]) and a maximum willingness to pay ℓ_max = $30. For the prices of electricity p_i(t), we generated random prices for different locations and different times using the statistics of locational marginal prices in [40]. We chose a maximum battery capacity of 20 kWh. We discretized the battery energy into 5 units, where one unit of battery energy is 4 kWh. The time it takes to deliver one unit of charge is taken as one time epoch, which is equal to 5 minutes in our setup. The waiting time cost for one period is w = $2 (the average hourly wage is around $24 in the United States [41]).

Observe that the dimension of the state space grows significantly with v_max and τ_ij (for instance, m = 10, v_max = 5, τ_ij = 3 ∀i, j gives s_d = 1240). Therefore, for computational purposes, we conducted two case studies: 1) a non-electric AMoD case study with a larger network in Manhattan, and 2) an electric AMoD case study with a smaller network in San Francisco. Both experiments were performed on a laptop computer with an Intel® Core™ i7-8750H CPU (6 × 2.20 GHz) and 16 GB DDR4 2666 MHz RAM.

A. Case Study in Manhattan

In a non-electric AMoD network, the energy dimension v vanishes. Because there is no charging action4, we can perform coarser discretizations of time. Specifically, we can allow each discrete time epoch to cover 5 × min_{i,j | i≠j} τ_ij minutes, and normalize the travel times τ_ij and w accordingly (for EVs, because charging takes a non-negligible but shorter time than travelling, in general we have τ_ij > 1, and a larger number of states).

4 The vehicles still refuel; however, this takes negligible time compared to the trip durations.

Fig. 4: Manhattan divided into m = 10 regions.

The static profit maximization problem in (7) for AMoD with non-electric vehicles can be rewritten as:

\max_{x_{ij}, \ell_{ij}} \quad \sum_{i=1}^{m} \sum_{j=1}^{m} \lambda_{ij} \ell_{ij} (1 - F(\ell_{ij})) - \beta_g \sum_{i=1}^{m} \sum_{j=1}^{m} x_{ij} \tau_{ij}

subject to \quad \lambda_{ij}(1 - F(\ell_{ij})) \le x_{ij} \quad \forall i, j \in M,

\sum_{j=1}^{m} x_{ij} = \sum_{j=1}^{m} x_{ji} \quad \forall i \in M,

x_{ij} \ge 0 \quad \forall i, j \in M.   (11)
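With the uniform willingness-to-pay assumption F(ℓ) = ℓ/ℓ_max, the objective of (11) is concave in the prices and all constraints are linear, so the problem can be handed to any convex solver. The paper reports solving the static problems with CVX; the snippet below is our own equivalent sketch in CVXPY on randomly generated toy data (all numbers are illustrative, not from the case studies).

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m = 4                                                              # toy network size
lam = rng.uniform(1.0, 5.0, (m, m)); np.fill_diagonal(lam, 0.0)    # potential arrival rates
tau = rng.integers(1, 4, (m, m)).astype(float); np.fill_diagonal(tau, 0.0)  # trip durations
ell_max = 30.0                                                     # maximum willingness to pay
beta_g = 2.5                                                       # per-trip-period operational cost

ell = cp.Variable((m, m), nonneg=True)             # prices ell_ij
x = cp.Variable((m, m), nonneg=True)               # passenger + rebalancing flows x_ij

revenue = cp.sum(cp.multiply(lam, ell - cp.square(ell) / ell_max))  # sum lam*ell*(1 - ell/ell_max)
cost = beta_g * cp.sum(cp.multiply(tau, x))

constraints = [
    cp.multiply(lam, 1 - ell / ell_max) <= x,      # demand satisfaction
    cp.sum(x, axis=1) == cp.sum(x, axis=0),        # flow balance at every node
    ell <= ell_max,
]
prob = cp.Problem(cp.Maximize(revenue - cost), constraints)
prob.solve()
print("optimal static profit per period:", round(prob.value, 2))
```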

The operational cost β_g = $2.5 (per 10 minutes, [42]) is different from that of electric vehicles. Because there is no "charging" (or refueling action, since it takes negligible time), β_g also includes the fuel cost. The optimal static policy is used to compare against and highlight the performance of the dynamic policy5.

We divided Manhattan into 10 regions as in Figure 4, and using the yellow taxi data from the New York City Taxi and Limousine Commission dataset [43] for May 04, 2019, Saturday between 18.00-20.00, we extracted the average arrival rates for rides and the average trip durations τ_ij between the regions (we exclude the rides occurring within the same region). To create the potential arrival rate λ_ij, we multiplied the average arrival rates by 1.5. We trained our model by creating new induced random arrivals with the same potential arrival rate, using prices determined by our policy. For the fleet size, we used a fleet of 4000 autonomous vehicles, according to the optimal fleet size of the static problem (11).

For training, we used a neural network with 2 hidden layers and 64 neurons in each hidden layer, and a value function step size of 0.001. The rest of the parameters were left at the defaults specified by the Stable Baselines toolkit [38]. We trained the model for 5 million iterations. The first 1 million iterations of the training phase are displayed in Figure 5a.

5 The solution of the static problem yields vehicle flows. In order to make the policy compatible with our environment and to generate integer actions that can be applied in a dynamic setting, we randomized the actions by dividing each flow for OD pair (i, j) (and energy level v) by the total number of vehicles at i (and energy level v) and used that fraction as the probability of sending a vehicle from i to j (with energy level v).


Fig. 5: The rewards during the training phases for (a) the Manhattan case study and (b) the San Francisco case study. At the beginning, the rewards for both case studies rapidly go down because the queues blow up. As the training process continues, the policy learns to stabilize the queues and hence the rewards increase.

Observe that, during the first phase of the iterations, the rewards go down rapidly (because the queues blow up). Hence, the policy moves towards higher prices to decrease the arrival rates. However, because the queues cannot be cleared with a single iteration of higher prices, the algorithm observes that negative rewards persist even with higher prices, and hence decreases the prices again. This causes the queues to blow up if we let the algorithm run as is. To overcome this issue, for the first five hundred thousand iterations only, the reward output of the environment was set to the difference between the current and the previous reward. This allows the algorithm to learn that decreasing the queue lengths is favorable. Furthermore, to reduce the variance and stabilize the algorithm, we subtracted a baseline value from the rewards6 after stabilizing the queues.
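One way to realize the two training tricks just described (a difference-of-rewards signal during an initial warm-up and a baseline subtraction afterwards) is a reward wrapper around the Gym environment. The sketch below is our own interpretation of that procedure; the class, its name, and the defaults are ours, with the warm-up length taken from the text.

```python
import gym

class DifferenceRewardWrapper(gym.Wrapper):
    """Reward shaping sketch: difference of consecutive rewards during warm-up,
    then raw reward minus a baseline value."""

    def __init__(self, env, warmup_steps=500_000, baseline=0.0):
        super().__init__(env)
        self.warmup_steps = warmup_steps
        self.baseline = baseline
        self._steps = 0
        self._prev_reward = 0.0

    def reset(self, **kwargs):
        self._prev_reward = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._steps += 1
        if self._steps <= self.warmup_steps:
            shaped = reward - self._prev_reward   # reward the *change*, favoring shrinking queues
        else:
            shaped = reward - self.baseline       # raw reward minus a baseline value
        self._prev_reward = reward
        return obs, shaped, done, info
```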

Next, we compare the performance of different policies using the rewards and the total queue length as metrics. The results are demonstrated in Figure 6. In Figure 6a, we compare the rewards generated and the total queue length obtained by applying the static and the dynamic policies as defined in Sections IV and III. We can observe that while the optimal static policy provides rate stability in a dynamic setting (since the queues do not blow up), it fails to generate profits as it is not able to clear the queues. On the other hand, the dynamic policy is able to keep the total length of the queues 50 times shorter than the static policy while generating higher profits.

The optimal static policy fails to generate profits and is not necessarily the best static policy to apply in a dynamic setting. As such, in Figure 6b we demonstrate the performance of a sub-optimal static policy, where the prices are slightly higher in order to reduce the arrival rates and hence the queue lengths. Observe that the profits generated are higher than the profits generated using the optimal static policy for the static planning problem, while the total queue length is smaller. This result indicates that under the stochasticity of the dynamic setting, a sub-optimal static policy can perform better than the optimal static policy. Nevertheless, this policy still does worse in terms of rewards and total queue length compared to the dynamic policy.

6 To get the baseline value, we tested the policy every one million iterations and subtracted the average value of the reward from the reward output during training.



Fig. 6: Comparison of different policies for the Manhattan case study. The legends for all figures are the same as the top left figure, where red lines correspond to the dynamic policy and blue lines correspond to the static policies (we excluded the running averages for (d) because the static policy diverges). In all scenarios, we use the rewards generated and the total queue length as metrics. In (a), we demonstrate the results from applying the dynamic and the optimal static policy. In (b), we compare the dynamic policy with a sub-optimal static policy, where the prices are higher than under the optimal static policy. In (c), we utilize a surge pricing policy along with the optimal static policy and compare with the dynamic policy. In (d), we employ the dynamic and static policies developed for May 4, 2019, Saturday on the arrivals of May 11, 2019, Saturday.

Fig. 7: San Francisco divided into m = 7 regions. We obtained the map from the San Francisco County Transportation Authority [44]. The map shows the number of Transportation Network Company (TNC) pickups and dropoffs. Darker colors mean more trips to/from an area, and we divided the city according to the number of trips rather than the geographical areas.

Next, we showcase that even some heuristic modifications resembling what is done in practice can do better than the optimal static policy. We utilize the optimal static policy, but additionally apply a surge-pricing policy. The surge-pricing policy aims to decrease the arrival rates for longer queues so that the queues stay shorter and the rewards increase. At each time period, the policy increases the prices for the queues longer than 2 people by $7.5, so that the arrival rates for those queues are decreased. The results are displayed in Figure 6c. New arrivals bring higher revenue per person and the total queue length is decreased, which stabilizes the network while generating more profits. The surge-pricing policy results in stable short queues and higher rewards compared to the other static policies, yet our dynamic policy beats it.

Finally, we test how robust the static and the dynamic policies are to variations in the input statistics. We compare the rewards generated and the total queue length when applying the static and the dynamic policies to the arrival rates of May 11, 2019, Saturday between 18.00-20.00. The results are displayed in Figure 6d. Even though the arrival rates between May 11 and May 4 do not differ much, the static policy is not resilient and fails to stabilize the network when there is a slight change. The dynamic policy, on the other hand, is still able to stabilize the network and generate profits. The neural-network based policy is able to determine the correct pricing and routing decisions by considering the current state of the network, even under different arrival rates.

B. Case Study in San Francisco

We conducted the case study in San Francisco by utilizing an EV fleet of 420 vehicles (according to the optimal fleet size for the static planning problem). We divided San Francisco into 7 regions as in Figure 7, and using the taxi cab mobility traceset from CRAWDAD [45], we obtained the average arrival rates and travel times between regions (we exclude the rides occurring within the same region).



Fig. 8: Comparison of different policies for the San Francisco case study. The legends for all figures are the same as the top left figure, where red lines correspond to the dynamic policy and blue lines correspond to the static policies. In all scenarios, we use the rewards generated and the total queue length as metrics. In (a), we demonstrate the results from applying the dynamic and the optimal static policy. In (b), we compare the dynamic policy with a sub-optimal static policy, where the prices are higher than under the optimal static policy. In (c), we utilize a surge pricing policy along with the optimal static policy and compare with the dynamic policy.

In Figure 8a, we compare the rewards and the total queue length resulting from the dynamic and the static policy. In Figure 8b, we again change the static policy so that the prices are slightly higher, as detailed in Section V-A. In Figure 8c, we use the static policy but also apply a surge pricing policy in order to keep the queues shorter (see Section V-A). Similar to the case study in Manhattan, the results demonstrate that the performance of the trained dynamic policy is superior to the other policies (we note that the performance can be further improved by longer training).

In Figure 9, we compare the charging costs paid under the dynamic and the static policies. The static policy is generated using the average values of the electricity prices, whereas the dynamic policy takes the current electricity prices into account before executing an action. Therefore, the dynamic policy obtains cheaper charging by exploiting a smart charging mechanism.
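To make the intuition concrete, the following toy sketch contrasts price-agnostic charging, which spreads the fleet's energy demand uniformly over the day (as a policy tuned only to average prices effectively does), with price-aware charging that shifts the same energy toward the cheapest hours; the price series and energy figures are illustrative and are not data from the case study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hourly electricity prices ($/kWh) over one day; not CAISO data.
prices = 0.20 + 0.10 * np.sin(np.linspace(0, 2 * np.pi, 24)) + 0.02 * rng.standard_normal(24)

total_energy_kwh = 240.0   # total energy the fleet must charge over the day
hours_needed = 8           # price-aware rule concentrates charging in 8 hours

# Price-agnostic charging: spread the energy uniformly, which is what a
# policy tuned only to the *average* price effectively pays.
uniform_cost = np.sum((total_energy_kwh / 24) * prices)

# Price-aware charging: put the same energy into the cheapest hours.
cheapest = np.argsort(prices)[:hours_needed]
aware_cost = np.sum((total_energy_kwh / hours_needed) * prices[cheapest])

print(f"price-agnostic cost: ${uniform_cost:.2f}")
print(f"price-aware cost:    ${aware_cost:.2f}")
```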

VI. CONCLUSION

In this paper, we developed a dynamic control policy based on deep reinforcement learning for operating an AMoD fleet of EVs and pricing rides. Our dynamic control policy jointly makes decisions for: 1) vehicle routing, in order to serve passenger demand and rebalance empty vehicles; 2) vehicle charging, in order to sustain energy for rides while exploiting the geographical and temporal diversity of electricity prices for cheaper charging; and 3) pricing of rides, in order to adjust the potential demand so that the network is stable and the profits are maximized.

Fig. 9: Charging costs for the optimal static policy and the dynamic policy in the San Francisco case study.

Furthermore, we formulated the static planning problem associated with the dynamic problem in order to define the capacity region of the dynamic problem and the optimal static policy. The static policy stabilizes the queues in the dynamic setting, yet it is not optimal in terms of profits or keeping the queues sufficiently short. Finally, we conducted case studies in Manhattan and San Francisco that demonstrate the performance of our developed algorithm.

REFERENCES

[1] [Online]. Available: https://www.cbinsights.com/research/autonomous-driverless-vehicles-corporations-list/.

[2] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, vol. abs/1502.05477, 2015. [Online]. Available: http://arxiv.org/abs/1502.05477

[3] R. Zhang and M. Pavone, “Control of robotic Mobility-on-Demand systems: A queueing-theoretical perspective,” Int. Journal of Robotics Research, vol. 35, no. 1–3, pp. 186–203, 2016.

[4] M. Pavone, S. L. Smith, E. Frazzoli, and D. Rus, “Robotic load balancing for Mobility-on-Demand systems,” Int. Journal of Robotics Research, vol. 31, no. 7, pp. 839–854, 2012.

[5] F. Rossi, R. Zhang, Y. Hindy, and M. Pavone, “Routing autonomous vehicles in congested transportation networks: Structural properties and coordination algorithms,” Autonomous Robots, vol. 42, no. 7, pp. 1427–1442, 2018.

[6] M. Volkov, J. Aslam, and D. Rus, “Markov-based redistribution policy model for future urban mobility networks,” Conference Record - IEEE Conference on Intelligent Transportation Systems, pp. 1906–1911, Sep. 2012.

[7] Q. Wei, J. A. Rodriguez, R. Pedarsani, and S. Coogan, “Ride-sharing networks with mixed autonomy,” arXiv preprint arXiv:1903.07707, 2019.

[8] R. Zhang, F. Rossi, and M. Pavone, “Model predictive control of autonomous mobility-on-demand systems,” in 2016 IEEE International Conference on Robotics and Automation (ICRA), May 2016.

[9] F. Miao, S. Han, S. Lin, J. A. Stankovic, H. Huang, D. Zhang, S. Munir, T. He, and G. J. Pappas, “Taxi dispatch with real-time sensing data in metropolitan areas: A receding horizon control approach,” CoRR, vol. abs/1603.04418, 2016. [Online]. Available: http://arxiv.org/abs/1603.04418

[10] R. Iglesias, F. Rossi, K. Wang, D. Hallac, J. Leskovec, and M. Pavone, “Data-driven model predictive control of autonomous mobility-on-demand systems,” CoRR, vol. abs/1709.07032, 2017. [Online]. Available: http://arxiv.org/abs/1709.07032

[11] F. Miao, S. Han, A. M. Hendawi, M. E. Khalefa, J. A. Stankovic, and G. J. Pappas, “Data-driven distributionally robust vehicle balancing using dynamic region partitions,” in 2017 ACM/IEEE 8th International Conference on Cyber-Physical Systems (ICCPS), April 2017, pp. 261–272.

[12] M. Tsao, R. Iglesias, and M. Pavone, “Stochastic model predictive control for autonomous mobility on demand,” CoRR, vol. abs/1804.11074, 2018. [Online]. Available: http://arxiv.org/abs/1804.11074

[13] K. Spieser, S. Samaranayake, and E. Frazzoli, “Vehicle routing for shared-mobility systems with time-varying demand,” in 2016 American Control Conference (ACC), July 2016, pp. 796–802.

[14] R. Swaszek and C. Cassandras, “Load balancing in mobility-on-demand systems: Reallocation via parametric control using concurrent estimation,” arXiv preprint arXiv:1904.03755, 2019.

[15] M. Han, P. Senellart, S. Bressan, and H. Wu, “Routing an autonomous taxi with reinforcement learning,” in CIKM, 2016.

[16] M. Guriau and I. Dusparic, “Samod: Shared autonomous mobility-on-demand using decentralized reinforcement learning,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Nov 2018, pp. 1558–1563.

[17] J. Wen, J. Zhao, and P. Jaillet, “Rebalancing shared mobility-on-demand systems: A reinforcement learning approach,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Oct 2017, pp. 220–225.

[18] D. A. Lazar, E. Bıyık, D. Sadigh, and R. Pedarsani, “Learning how to dynamically route autonomous vehicles on shared roads,” arXiv preprint arXiv:1909.03664, 2019.

[19] E. Veldman and R. A. Verzijlbergh, “Distribution grid impacts of smart electric vehicle charging from different perspectives,” IEEE Transactions on Smart Grid, vol. 6, no. 1, pp. 333–342, Jan 2015.

[20] W. Su, H. Eichi, W. Zeng, and M. Chow, “A survey on the electrification of transportation in a smart grid environment,” IEEE Transactions on Industrial Informatics, vol. 8, no. 1, pp. 1–10, Feb 2012.

[21] J. C. Mukherjee and A. Gupta, “A review of charge scheduling of electric vehicles in smart grid,” IEEE Systems Journal, vol. 9, no. 4, pp. 1541–1553, Dec 2015.

[22] T. D. Chen, K. M. Kockelman, and J. P. Hanna, “Operations of a Shared, Autonomous, Electric Vehicle Fleet: Implications of Vehicle & Charging Infrastructure Decisions,” Transportation Research Part A: Policy and Practice, vol. 94, pp. 243–254, 2016.

[23] N. Tucker, B. Turan, and M. Alizadeh, “Online charge scheduling for electric vehicles in autonomous mobility on demand fleets,” in Proc. IEEE Int. Conf. on Intelligent Transportation Systems, 2019.

[24] B. Turan, N. Tucker, and M. Alizadeh, “Smart charging benefits in autonomous mobility on demand systems,” in Proc. IEEE Int. Conf. on Intelligent Transportation Systems, 2019. [Online]. Available: https://arxiv.org/abs/1907.00106

[25] F. Rossi, R. Iglesias, M. Alizadeh, and M. Pavone, “On the interaction between autonomous mobility-on-demand systems and the power network: models and coordination algorithms,” Robotics: Science and Systems XIV, Jun 2018.

[26] T. D. Chen and K. M. Kockelman, “Management of a shared autonomous electric vehicle fleet: Implications of pricing schemes,” Transportation Research Record, vol. 2572, no. 1, pp. 37–46, 2016.

[27] Y. Guan, A. M. Annaswamy, and H. E. Tseng, “Cumulative prospect theory based dynamic pricing for shared mobility on demand services,” CoRR, vol. abs/1904.04824, 2019. [Online]. Available: http://arxiv.org/abs/1904.04824

[28] C. J. R. Sheppard, G. S. Bauer, B. F. Gerke, J. B. Greenblatt, A. T. Jenn, and A. R. Gopal, “Joint optimization scheme for the planning and operations of shared autonomous electric vehicle fleets serving mobility on demand,” Transportation Research Record, vol. 2673, no. 6, pp. 579–597, 2019. [Online]. Available: https://doi.org/10.1177/0361198119838270

[29] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, “A survey of actor-critic reinforcement learning: Standard and natural policy gradients,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1291–1307, Nov 2012.

[30] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992. [Online]. Available: https://doi.org/10.1007/BF00992698

[31] G. A. Rummery and M. Niranjan, “On-line Q-learning using connectionist systems,” Tech. Rep., 1994.

[32] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3, pp. 229–256, May 1992. [Online]. Available: https://doi.org/10.1007/BF00992696

[33] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13, no. 5, pp. 834–846, Sep. 1983.

[34] I. H. Witten, “An adaptive optimal controller for discrete-time Markov environments,” Information and Control, vol. 34, pp. 286–295, 1977.

[35] R. Pedarsani, J. Walrand, and Y. Zhong, “Robust scheduling for flexible processing networks,” Advances in Applied Probability, vol. 49, no. 2, pp. 603–628, 2017.

[36] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 2.1,” http://cvxr.com/cvx, Mar. 2014.

[37] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” 2016.

[38] A. Hill, A. Raffin, M. Ernestus, A. Gleave, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu, “Stable baselines,” https://github.com/hill-a/stable-baselines, 2018.

[39] The average electric car in the US is getting cheaper. [Online]. Available: https://qz.com/1695602/the-average-electric-vehicle-is-getting-cheaper-in-the-us/.

[40] [Online]. Available: http://oasis.caiso.com

[41] United States Average Hourly Wages. [Online]. Available: https://tradingeconomics.com/united-states/wages.

[42] How much does driving your car cost, per minute? [Online]. Available: https://www.bostonglobe.com/ideas/2014/08/08/how-much-driving-really-costs-per-minute/BqnNd2q7jETedLhxxzY2CI/story.html.

[43] [Online]. Available: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

[44] [Online]. Available: http://tncstoday.sfcta.org/

[45] M. Piorkowski, N. Sarafijanovic-Djukic, and M. Grossglauser, “CRAWDAD dataset epfl/mobility (v. 2009-02-24),” downloaded from https://crawdad.org/epfl/mobility/20090224, Feb. 2009.


[46] J. G. Dai, “On positive Harris recurrence of multiclass queueing networks: A unified approach via fluid limit models,” Annals of Applied Probability, vol. 5, pp. 49–77, 1995.

APPENDIX

A. Proof of Proposition 1

Consider the fluid scaling of the queueing network, $Q^{rt}_{ij} = \frac{q_{ij}(\lfloor rt \rfloor)}{r}$ (see [46] for more discussion on the stability of fluid models), and let $Q^t_{ij}$ be the corresponding fluid limit. The fluid model dynamics are
$$Q^t_{ij} = Q^0_{ij} + A^t_{ij} - X^t_{ij},$$
where $A^t_{ij}$ is the total number of riders from node $i$ to node $j$ that have arrived to the network until time $t$, and $X^t_{ij}$ is the total number of vehicles routed from node $i$ to $j$ up to time $t$.

Suppose that $\rho^* > 1$ and that there exists a policy under which $Q^t_{ij} = 0$ for all $t \geq 0$ and for all origin-destination pairs $(i,j)$. Pick a point $t_1$ at which $Q^{t}_{ij}$ is differentiable for all $(i,j)$. Then, for all $(i,j)$, $\dot{Q}^{t_1}_{ij} = 0$. Since $\dot{A}^{t_1}_{ij} = \Lambda_{ij}$, this implies $\dot{X}^{t_1}_{ij} = \Lambda_{ij}$. On the other hand, $\dot{X}^{t_1}_{ij}$ is the rate at which vehicles are routed from $i$ to $j$ at $t_1$. This implies $\Lambda_{ij} = \sum_{v=v_{ij}}^{v_{\max}} x^v_i \alpha^v_{ij}$ for all $(i,j)$, and there exist $\alpha^v_{ij}$ and $\alpha^v_{ic}$ at time $t_1$ such that the flow balance constraints hold and the allocation vector $[\alpha^v_{ij} \; \alpha^v_{ic}]$ is feasible, i.e., $\alpha^v_{ic} + \sum_{j=1, j \neq i}^{m} \alpha^v_{ij} \leq 1$. This contradicts $\rho^* > 1$.

Now suppose $\rho^* \leq 1$ and let $\alpha^* = [\alpha^{v*}_{ij} \; \alpha^{v*}_{ic}]$ be an allocation vector that solves the static problem. The cumulative number of vehicles routed from node $i$ to $j$ up to time $t$ is $S^t_{ij} = \sum_{v=v_{ij}}^{v_{\max}} x^v_i \alpha^v_{ij} t = \sum_{v=0}^{v_{\max}} x^v_i \alpha^v_{ij} t \geq \Lambda_{ij} t$. Suppose that for some origin-destination pair $(i,j)$, the queue satisfies $Q^{t_1}_{ij} \geq \epsilon > 0$ for some positive $t_1$ and $\epsilon$. By continuity of the fluid limit, there exists $t_0 \in (0, t_1)$ such that $Q^{t_0}_{ij} = \epsilon/2$ and $Q^t_{ij} > 0$ for $t \in [t_0, t_1]$. Then, $Q^t_{ij} > 0$ implies $\Lambda_{ij} > \sum_{v=0}^{v_{\max}} x^v_i \alpha^v_{ij}$, which is a contradiction.
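As a toy illustration of the stability dichotomy used in this proof (a single queue with constant arrival and service rates, not the paper's multi-class fluid limit), the sketch below integrates the single-queue fluid dynamics $\dot{Q}^t = \Lambda - \dot{X}^t$ and shows the queue growing linearly when the load exceeds one and staying at zero otherwise. The function and variable names are hypothetical.

```python
import numpy as np

# Toy fluid queue: dQ/dt = arrival_rate - service_rate whenever Q > 0.
# Illustrative only; the proof above works with the multi-class fluid limit.
def fluid_queue(arrival_rate: float, service_rate: float,
                horizon: float = 10.0, dt: float = 0.01) -> np.ndarray:
    q, trajectory = 0.0, []
    for _ in range(int(horizon / dt)):
        q = max(0.0, q + (arrival_rate - service_rate) * dt)
        trajectory.append(q)
    return np.array(trajectory)

print(fluid_queue(arrival_rate=1.2, service_rate=1.0)[-1])  # grows linearly (load > 1)
print(fluid_queue(arrival_rate=0.8, service_rate=1.0)[-1])  # stays at zero (load <= 1)
```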

B. Proof of Proposition 2

For brevity of notation, let $\beta + p_i = P_i$. Let $\nu_{ij}$ be the dual variables corresponding to the demand satisfaction constraints and $\mu^v_i$ be the dual variables corresponding to the flow balance constraints. Since the optimization problem (7) is a convex quadratic maximization problem (given a uniform $F(\cdot)$) and Slater's condition is satisfied, strong duality holds. We can write the dual problem as:
$$\min_{\nu_{ij}, \mu^v_i} \; \max_{\ell_{ij}} \; \sum_{i=1}^m \sum_{j=1}^m \lambda_{ij}\Big(1 - \frac{\ell_{ij}}{\ell_{\max}}\Big)\big(\ell_{ij} - \nu_{ij}\big)$$
$$\text{subject to } \nu_{ij} \geq 0, \quad \nu_{ij} + \mu^v_i - \mu^{v - v_{ij}}_j - \beta\tau_{ij} \leq 0, \quad \mu^v_i - \mu^{v+1}_i - P_i \leq 0 \;\; \forall i, j, v.$$

For fixed $\nu_{ij}$ and $\mu^v_i$, the inner maximization results in the optimal prices:
$$\ell^*_{ij} = \frac{\ell_{\max} + \nu_{ij}}{2}. \tag{13}$$

By strong duality, the optimal primal solution satisfies the dual problem with the optimal dual variables $\nu^*_{ij}$ and $\mu^{v*}_i$, which completes the first part of the proposition. The dual problem with the optimal prices in (13) can be written as:
$$\min_{\nu_{ij}, \mu^v_i} \; \sum_{i=1}^m \sum_{j=1}^m \frac{\lambda_{ij}}{\ell_{\max}} \Big(\frac{\ell_{\max} - \nu_{ij}}{2}\Big)^2 \tag{14a}$$
$$\text{subject to } \nu_{ij} \geq 0, \tag{14b}$$
$$\nu_{ij} + \mu^v_i - \mu^{v - v_{ij}}_j - \beta\tau_{ij} \leq 0, \tag{14c}$$
$$\mu^v_i - \mu^{v+1}_i - P_i \leq 0 \quad \forall i, j, v. \tag{14d}$$

The objective function in (14a) with the optimal dual variables, along with (13), gives
$$P = \sum_{i=1}^m \sum_{j=1}^m \frac{\lambda_{ij}}{\ell_{\max}} \big(\ell_{\max} - \ell^*_{ij}\big)^2,$$
where the profit $P$ is the optimal value of both the primal and the dual problems. To obtain the upper bound on the prices, we go through the following algebraic calculations using the constraints. The inequality (14d) gives
$$\mu^{v - v_{ji}}_i \leq v_{ji} P_i + \mu^v_i, \tag{15}$$
and equivalently
$$\mu^{v - v_{ij}}_j \leq v_{ij} P_j + \mu^v_j. \tag{16}$$
The inequalities (14c) and (14b) yield
$$\mu^v_i - \mu^{v - v_{ij}}_j - \beta\tau_{ij} \leq 0,$$
and equivalently
$$\mu^v_j - \mu^{v - v_{ji}}_i - \beta\tau_{ji} \leq 0. \tag{17}$$
Combining inequalities (15) and (17) gives
$$\mu^v_j \leq \mu^v_i + \beta\tau_{ji} + v_{ji} P_i. \tag{18}$$
Finally, the constraint (14c) gives
$$\nu_{ij} \leq \beta\tau_{ij} + \mu^{v - v_{ij}}_j - \mu^v_i \overset{(16)}{\leq} \beta\tau_{ij} + v_{ij} P_j + \mu^v_j - \mu^v_i \overset{(18)}{\leq} \beta\tau_{ij} + v_{ij} P_j + \beta\tau_{ji} + v_{ji} P_i.$$
Replacing $P_i = p_i + \beta$ and rearranging the terms:
$$\nu_{ij} \leq \beta(\tau_{ij} + \tau_{ji} + v_{ij} + v_{ji}) + v_{ij} p_j + v_{ji} p_i. \tag{19}$$
Using the upper bound on the dual variables $\nu_{ij}$ and (13), we can upper bound the optimal prices.
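Concretely, combining the optimal-price expression (13) with the bound (19) yields the explicit upper bound:
$$\ell^*_{ij} = \frac{\ell_{\max} + \nu^*_{ij}}{2} \;\leq\; \frac{\ell_{\max} + \beta(\tau_{ij} + \tau_{ji} + v_{ij} + v_{ji}) + v_{ij} p_j + v_{ji} p_i}{2}.$$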


BERKAY TURAN is pursuing the Ph.D. degree in Electrical and Computer Engineering at the University of California, Santa Barbara. He received the B.Sc. degree in Electrical and Electronics Engineering as well as the B.Sc. degree in Physics from Bogazici University, Istanbul, Turkey, in 2018. His research interests include optimization and reinforcement learning for the design, control, and analysis of smart infrastructure systems such as the power grid and transportation systems.

RAMTIN PEDARSANI is an Assistant Professor in the ECE Department at the University of California, Santa Barbara. He received the B.Sc. degree in electrical engineering from the University of Tehran, Tehran, Iran, in 2009, the M.Sc. degree in communication systems from the Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland, in 2011, and his Ph.D. from the University of California, Berkeley, in 2015. His research interests include machine learning, intelligent transportation systems, and information theory. Ramtin is a recipient of the IEEE International Conference on Communications (ICC) best paper award in 2014.

MAHNOOSH ALIZADEH is an Assistant Professor of Electrical and Computer Engineering at the University of California, Santa Barbara. Dr. Alizadeh received the B.Sc. degree in Electrical Engineering from Sharif University of Technology in 2009 and the M.Sc. and Ph.D. degrees from the University of California, Davis in 2013 and 2014, respectively, both in Electrical and Computer Engineering. From 2014 to 2016, she was a postdoctoral scholar at Stanford University. Her research interests are focused on designing scalable control and market mechanisms for enabling sustainability and resiliency in societal infrastructures, with a particular focus on demand response and electric transportation systems. Dr. Alizadeh is a recipient of the NSF CAREER award.