
Macroscopic Traffic Flow Modeling with Physics Regularized Gaussian Process: A New Insight into Machine Learning Applications

Yun Yuan a, Xianfeng Terry Yang* a, Zhao Zhang a, Shandian Zhe b

a Department of Civil & Environmental Engineering, University of Utah, Salt Lake City, UT 84112, USA
b School of Computing, University of Utah, Salt Lake City, UT 84112, USA
Email address: [email protected] (Xianfeng Terry Yang*)

Abstract

Despite the recent wide implementation of machine learning (ML) techniques in traffic flow modeling, those data-driven approaches often fall short of accuracy when the dataset is small or noisy. To address this issue, this study presents a new modeling framework, named physics regularized machine learning (PRML), to encode classical traffic flow models (referred to as physical models) into the ML architecture and to regularize the ML training process. More specifically, a stochastic physics regularized Gaussian process (PRGP) model is developed and a Bayesian inference algorithm is used to estimate the mean and kernel of the PRGP. A physical regularizer based on macroscopic traffic flow models is also developed to augment the estimation via a shadow GP, and an enhanced latent force model is used to encode physical knowledge into stochastic processes. Based on the posterior regularization inference framework, an efficient stochastic optimization algorithm is also developed to maximize the evidence lower bound of the system likelihood. To prove the effectiveness of the proposed model, this paper conducts empirical studies on a real-world dataset collected from a stretch of the I-15 freeway in Utah. Results show that the new PRGP model outperforms comparable existing methods, such as calibrated pure physical models and pure machine learning methods, in estimation precision and input robustness.

Keywords: macroscopic traffic flow model, physics regularized machine learning, multivariate Gaussian process, posterior regularization inference

1. Introduction

Traffic state (i.e. flow, speed, and density) estimation (TSE) is the precursor of a variety of advanced traffic operation tasks and plays a key role in traffic management. In the early stages, macroscopic traffic dynamics were found to be similar to hydrodynamics. By borrowing concepts from fluid mechanics, flow, speed, and density were defined and their relationship, named the fundamental diagram, was discovered. Based on these definitions, macroscopic traffic flow models were developed from the conservation law and momentum equations, and a set of kinematic wave models was also formulated (Seo et al., 2017). However, most models, derived under ideal theoretical conditions, require great effort for parameter calibration and are


difficult to apply to the noisy and fluctuating data collected by traffic sensors.

To capture measurement errors, stochastic traffic flow models were then developed for the investigation and explanation of a variety of observed traffic phenomena; such models are also better suited for real-time traffic state estimation and forecasting (Jabari et al., 2014). Since the prominent deterministic models and their higher-order extensions are ill-posed, researchers developed stochastic traffic flow models in two categories. The first category used stochastic extensions (Gazis and Knapp, 1971; Szeto and Gazis, 1972; Gazis and Liu, 2003; Wang and Papageorgiou, 2005; Wang et al., 2007), in which Gaussian noise is added to the model expressions and real-world data are used to quantify that noise. However, Jabari and Liu (2012) pointed out that those simply-noised models may (i) produce negative sample paths and (ii) yield mean dynamics that do not coincide with the original deterministic dynamics due to nonlinearity. The second category includes stochastic traffic models such as Boltzmann-based models (Prigogine and Herman, 1971; Paveri-Fontana, 1975), Markovian queuing network approaches (Davis and Kang, 1994; Kang, 1995; Di et al., 2010; Osorio et al., 2011; Jabari and Liu, 2012), and cellular automaton based models (Nagel and Schreckenberg, 1992; Gray and Griffeath, 2001; Sopasakis and Katsoulakis, 2006; Sopasakis, 2012). Stochastic traffic models do not share the concerns of the models in the first category; however, they may lose analytical tractability (Jabari and Liu, 2013), defined as the ability to obtain a mathematical solution such as a closed-form expression, and they are much more similar to data-driven approaches than to classical analytical models.

In view of the increasing data availability, many data-driven methods were developed because they do not require explicit theoretical assumptions and have a remarkably low computational cost in the testing phase. In the literature, data-driven approaches include autoregressive integrated moving average (Zhong et al., 2004), Bayesian networks (Ni and Leonard, 2005), kernel regression (Yin et al., 2012), fuzzy c-means clustering (Tang et al., 2015), k-nearest neighbors clustering (Tak et al., 2016), stochastic principal component analysis (Li et al., 2013; Tan et al., 2014), Tucker decomposition (Tan et al., 2013), deep learning (Duan et al., 2016; Polson and Sokolov, 2017b; Wu et al., 2018), Bayesian particle filters (Polson and Sokolov, 2017a), etc. However, due to their data-driven nature, those machine learning (ML) models fundamentally suffer in three scenarios: (i) training data are scarce and insufficient to reveal the complexity of the system, (ii) training data are noisy and include much incorrect/misleading information, and (iii) test data are far from the training examples, i.e., extrapolation. In these scenarios, which are unfortunately very common in the real world, their performance can drop dramatically and produce large and/or biased estimations. Fig. 1a shows an example of applying a pure ML method to a dataset that contains flawed data; its biased estimation (dashed line) diverges from that of an ML method trained on accurate data (solid line). Moreover, another deficiency of ML models is that they are developed as "black boxes", so their results are hard to interpret.

In summary, classical traffic flow models can effectively characterize the underlying mechanisms (i.e., physical processes of traffic) of transportation systems; however, they are usually developed with strong assumptions, require great effort in parameter calibration, and fall short of capturing data uncertainties.


Figure 1: Comparison between pure ML and the proposed PRML. (a) ML with flawed data; (b) PRML with flawed data. Each panel plots the flawed data, the accurate data, the model fitted with all data, and the ML fit obtained without the flawed data.

On the other hand, the performance of pure data-driven approaches such as ML models depends heavily on data quality, and their results are usually hard to interpret. Hence, recognizing those limitations, this research aims to develop an innovative approach, named physics regularized machine learning (PRML), to fill the gap between classical traffic flow (physical) models and ML methods. The contributions of this study are as follows. Compared with physical models, the PRML can (1) use the ML portion to capture the uncertainties in estimation that are beyond the capability of closed-form expressions; and (2) eliminate the effort of calibrating model parameters through a sequential learning process. Different from pure ML models, the PRML is (1) more robust to noisy/flawed datasets, as valuable knowledge from physical models can help regularize the fitting process (see Fig. 1b); and (2) more explainable in terms of model performance in estimation accuracy. With this innovative modeling framework, this research is expected to bring new insight into ML applications in transportation and to build a bridge connecting research on classical traffic flow models and more recent data-driven approaches.

More specifically, this study develops a physics regularized Gaussian process (PRGP) method for TSE by integrating three macroscopic traffic flow models with the Gaussian process (GP), implementing a shadow GP to regularize the original GP, and incorporating enhanced Latent Force Models (LFM) (Raissi et al., 2017) to encode the traffic flow model knowledge. To learn the GPs from data efficiently, this study also proposes an inference algorithm under the posterior regularization inference framework. To justify the effectiveness of the proposed methods, numerical experiments with field data are conducted on an I-15 freeway segment in Utah, and the performance of the PRGP models is compared with that of both classical traffic flow models and pure ML models. To further investigate the robustness of the PRGP, synthesized noise is also added to the training set, and the results show that the PRGP is much more resilient to noisy/flawed datasets.

The remainder of this paper is organized as follows. Section 2 reviews the existing studies regarding

the TSE modeling and estimation methods as well as the Gaussian process and inference methods. In

Section 3, the integrated GP and enhanced LFM for encoding physics knowledge into Bayesian statistics

and the posterior regularized inference algorithm are derived. In Section 4, a case study on real-world


data from the I-15 interstate freeway is conducted to justify the proposed methods. The concluding section summarizes the critical findings and future research directions.

2. Literature Review

2.1. Macroscopic traffic flow model

To effectively control traffic flows, TSE has been recognized in the literature as a critical fundamental task of freeway traffic management. TSE refers to estimating the complete traffic state based on limited traffic measurement data from stationary sensors. The key variables of the macroscopic traffic flow model, i.e. traffic flow, speed, and density, are used to approximate the continuous traffic state with the fundamental diagram. Deterministic traffic flow models usually consist of a conservation law equation and a fundamental relationship (Seo et al., 2017). For formalization, the key concepts, including cumulative flow, flow, density, and speed, are defined as follows.

Definition 1. The cumulative flow N(t, x) is defined as the number of vehicles that passed the position x

by the time t.

Definition 2. The flow q, density ρ, speed v are defined in Eqs. 1-3.

q(t, x) = ∂tN(t, x) (1)

ρ(t, x) = −∂xN(t, x) (2)

v(t, x) = q(t, x)/ρ(t, x) (3)

In traffic flow studies, researchers found the existence of the fundamental diagram (FD) to illustrate the

relationship among flow, speed and density:

Definition 3. The fundamental diagram is defined as the relationship among flow, speed, and density, as

shown in Eqs. 4-5.

v = V (ρ) (4)

q = ρV (ρ) (5)

where V (·) denotes the density-speed function. Macroscopic traffic flow models were proposed based on a continuum fluid approximation to describe the aggregate behavior of traffic, and they can generally be classified into three basic formulations. The well-known first-order Lighthill-Whitham-Richards (LWR) model (Lighthill and Whitham, 1955; Richards, 1956) is formulated in Eqs. 6-7.

∂tρ+ ∂x(ρv) = 0 (6)

v = V (ρ) (7)
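To make Definition 3 and Eq. 7 concrete, the short sketch below evaluates a density-speed function V(ρ) and the resulting flow q = ρV(ρ) numerically. The Greenshields form of V(ρ) and the parameter values are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

# Minimal numerical illustration of the fundamental diagram (Eqs. 4-5).
# The Greenshields form of V(rho) and the parameters are illustrative assumptions.
def greenshields_speed(rho, v_f=65.0, rho_jam=120.0):
    """Density-speed function V(rho): speed in mph, density in veh/mile/lane."""
    return v_f * (1.0 - rho / rho_jam)

def flow_from_density(rho, v_f=65.0, rho_jam=120.0):
    """Fundamental diagram q = rho * V(rho) (Eq. 5)."""
    return rho * greenshields_speed(rho, v_f, rho_jam)

rho = np.linspace(0.0, 120.0, 7)   # densities from free flow to jam
print(np.c_[rho, greenshields_speed(rho), flow_from_density(rho)])
```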


The LWR model can describe simple behaviors, such as traffic jams and shockwaves; however, it has limitations in reproducing more complex phenomena.

To overcome such limitations, second-order models use an additional momentum equation to describe the dynamics of speed. For example, the Payne-Whitham (PW) model (Payne, 1971; Whitham, 1975) is formulated by Eqs. 8-9, in which Eq. 9 is the momentum equation.

∂tρ + ∂x(ρv) = 0 (8)

∂tv + v∂xv = −(v − V (ρ))/τ0 − (c0²/ρ)∂xρ (9)

where τ0 denotes the relaxation time and c0² denotes a parameter related to driver anticipation. Despite the

success of the PW model and its extensions (Papageorgiou et al., 1989), the PW-like models may produce

non-realistic outputs, such as negative speed (Del Castillo et al., 1994; Daganzo, 1995; Papageorgiou, 1998;

Hoogendoorn and Bovy, 2001).

To overcome this limitation, the second-order Aw-Rascle-Zhang (ARZ) model (Aw and Rascle, 2000; Zhang, 2002) is formulated in Eqs. 10-11, where an alternative momentum equation is proposed in Eq. 11. The original ARZ model has been extended extensively in the literature (Colombo, 2003; Lebacque et al., 2007; Blandin et al., 2013; Fan et al., 2013).

∂tρ + ∂x(ρv) = 0 (10)

∂t(v − V (ρ)) + v∂x(v − V (ρ)) = −(v − V (ρ))/τ0 (11)

However, it should be noted that, despite the elegance of the differential-equation formalization, traffic flow models are difficult to estimate due to nonlinearity and the measurement errors of real-world observations. Thus, researchers have proposed advanced estimation methods to facilitate the application of these models.

2.2. Stochastic estimation methods

To use field data to capture traffic flow uncertainties, some estimation models with stochastic extensions were later derived (Seo et al., 2017). For example, TSE is defined as a Boundary Value Problem (BVP) based on partial observations (i.e. boundary conditions) (Coifman, 2002; Laval et al., 2012; Kuwahara, 2015; Blandin et al., 2013; Fan et al., 2013). In solving BVPs, the boundary conditions are assumed to be correct. However, the real-world measurement error cannot be ignored.

Considering system and observation noise, data assimilation or inverse modeling techniques were then developed for model estimation and calibration. In the literature, there exist three ways to add randomness to traffic models: (a) stochastic initial and boundary conditions, (b) stochastic source terms (e.g. inflows), and (c) a stochastic speed-density relationship or fundamental diagram (Sumalee et al., 2011). To capture the measurement error in data, a stochastic modeling approach adds Gaussian noise to the traffic state estimates (Gazis and Knapp, 1971; Szeto and Gazis, 1972; Gazis and Liu, 2003; Wang and Papageorgiou, 2005; Wang et al., 2007; Sumalee et al., 2011). For example, in view of the nonlinearity


of the second-order traffic flow model, Gazis and Liu (2003) and Wang and Papageorgiou (2005) assumed error terms in the formulation and developed an extended Kalman filter (EKF) to estimate a PW-like discrete model (Papageorgiou et al., 1989).

Note that applying the EKF to non-differentiable models (e.g. the Cell Transmission Model) is not rigorous (Blandin et al., 2012). The unscented Kalman filter (UKF) overcomes this shortcoming of the EKF by avoiding an analytical differentiation (Mihaylova et al., 2006). The ensemble Kalman filter (EnKF) employs Monte Carlo simulation to handle nonlinear and non-differentiable systems, but it is computationally costly (Work et al., 2008). The particle filter (PF) also uses Monte Carlo simulation and is similarly computation-intensive (Mihaylova and Boel, 2004). These simulation-based methods were further extended to reduce the computational cost.

In summary, despite the wide application of these methods, the stochastic extension models have two critical theoretical deficiencies: (a) negative sample paths and (b) mean dynamics that do not coincide with the original deterministic dynamics due to nonlinearity (Jabari and Liu, 2012, 2013; Jabari et al., 2014; Pascale et al., 2013; Wada et al., 2017). In view of such deficiencies, stochastic traffic flow models were proposed as a tradeoff between relaxing assumptions and preserving model tractability, such as (a) Boltzmann-based methods (Prigogine and Herman, 1971; Paveri-Fontana, 1975), (b) Markovian queuing methods (Davis and Kang, 1994; Kang, 1995; Di et al., 2010; Osorio et al., 2011; Jabari and Liu, 2012), and (c) cellular automaton based methods (Nagel and Schreckenberg, 1992; Gray and Griffeath, 2001; Sopasakis and Katsoulakis, 2006; Sopasakis, 2012).

2.3. Data-driven methods

More recently, with much richer data, researchers have started to seek data-driven methods, such as machine learning, Bayesian statistics, etc. Among the existing data-driven methods, the Gaussian process (GP) is a powerful non-parametric function estimator with various successful applications. In traffic modeling, GP-based methods have been applied to traffic speed imputation (Rodrigues and Pereira, 2018; Rodrigues et al., 2018), public transport flows (Neumann et al., 2009), traffic volume estimation and prediction (Xie et al., 2010), travel time prediction (Ide and Kato, 2009), driver velocity profiles (Armand et al., 2013), and traffic congestion (Liu et al., 2013). A GP can capture relationships between stochastic variables without requiring strong assumptions (such as memorylessness).

However, as data-driven estimators, GPs can perform poorly when the training data are scarce and insufficient to reflect the complexity of the system, or when testing inputs are far away from the training data. Few traffic estimation methods have been developed based on GPs because it is difficult to obtain deductive insights and leverage physics knowledge.

Taking advantage of valuable knowledge from physical models (i.e., classical traffic flow models), we aim to encode them into GPs to improve their performance, especially when training on scarce data and making estimations in areas with flawed observations. However, it shall be noted that using GPs to represent physical knowledge, modeled by differential equations, has two major difficulties: (a) differential equations are hard


to represent as probabilistic terms, such as priors and likelihoods; (b) in practice, physics knowledge is usually incomplete, and the differential equations can include latent functions and parameters (e.g. unobserved noise, inflows, and outflows), making their representation and joint estimation with GPs even more challenging.

To better encode differential equations in GPs, Alvarez et al. (2009, 2013) proposed Latent Force Models (LFM), in which the estimation of the GP is based on a kernel obtained by convolution with the Green's function. Later on, Raissi et al. (2017) extended the framework by assuming observable noise. However, the LFM assumption is too restrictive since many realistic differential equations are nonlinear, or are linear but do not have an analytical Green's function. Also, the complete kernel is still infeasible to obtain in some cases. Thus, it is more practical to use expressive kernels, e.g. deep kernels (Wilson et al., 2016).

In summary, there lacks a hybrid framework that combines physics knowledge (i.e. kinematic wave differential equations and the fundamental diagram) with data-driven methods under minimal assumptions and at reasonable computational cost. This paper aims to fill this gap by proposing a Gaussian process based data-driven method that incorporates tractable physics knowledge.

2.4. Gaussian process and Bayesian inference

The Gaussian process is a general framework that measures the similarity between training observations to estimate unobserved values. Rodrigues and Pereira (2018) and Rodrigues et al. (2018) applied multi-output Gaussian processes to model the complex spatiotemporal patterns of incomplete traffic speed data. The key task is to learn the kernel (i.e. covariance) function between the variables. Previous studies (Calderhead et al., 2009; Barber and Wang, 2014; Heinonen et al., 2018) investigated GPs coupled with ordinary differential equations. They assumed the noisy forces are observable, for example, observable noisy forces (Graepel, 2003), or observable noisy forces and solutions (Raissi et al., 2017).

To model the observable noisy forces, Latent Force Models (LFM) (Alvarez et al., 2009, 2013) first place a prior over the latent forces and then derive the covariance of the solution function via a convolution operation. Despite successful applications, such as transcriptional regulation modeling (Lawrence et al., 2007), the LFM method has two critical deficiencies: (a) it requires linear differential equations and analytical Green's functions, which is restrictive and does not fit traffic flow models; and (b) the convolution procedure is computationally demanding and restrictive.

To address these issues, this paper generalizes the LFM framework and enables nonlinear differential operators to encode the physics knowledge. The key task is to optimize the model likelihood on data together with a penalty term that encodes the constraints over the posterior of the latent variables. Via the penalty term, the domain knowledge or constraints are applied directly to the posteriors rather than through the priors and a complex, intermediate computing procedure, which can be more convenient and effective. For computational efficiency, this paper further employs posterior regularization algorithms to solve the likelihood optimization problem (Ganchev and Das, 2013; Zhu et al., 2014; Libbrecht et al., 2015; Song et al., 2016). To the best of the authors' knowledge, this modeling framework is novel and has not been developed in other transportation studies yet. The proposed method is designed to avoid error-prone simple stochastic


assumptions and to leverage physics knowledge in a data-driven framework, and it also performs well with scarce data and unobserved inflows and outflows (e.g. on an arterial stretch).

3. Methodology

3.1. Macroscopic traffic flow model with Physics Regularized Gaussian Process

3.1.1. Gaussian process

Suppose we aim to learn a machine f : Rd → Rd′ that maps a d-dimensional Euclidean space to a d′-dimensional Euclidean space from a training set D = (X, Y), where X = [x1, . . . , xN ]ᵀ is the input matrix, Y = [y1, . . . , yN ]ᵀ is the output matrix, x is a d-dimensional input vector, y is a d′-dimensional output vector, f = [f(x1), . . . , f(xN )]ᵀ is the vector of function values, and N refers to the sample size. Note that X, Y may have physical meanings only in their feasible domains.

Assumption 1. It is assumed that the input X and the true output f follow a multivariate Gaussian

distribution as shown in Eq. 12 , where N (·, ·) represents the Gaussian distribution, m denotes the mean

matrix, and K represents the covariance matrix.

p(f |X) = N (f |m,K) (12)

Note that a Gaussian process in d dimensions is also called a Gaussian random field, and the above definition involves multi-dimensional outputs.

Assumption 2. It is assumed that the observations Y have an isotropic Gaussian noise, as shown in

Eq. 13.

p(Y|f) = N (Y|f , τ−1I) (13)

where τ refers to the inverse variance, and isotropic noise means that the noise in each dimension is independent and identically distributed (i.i.d.) with the same variance τ−1.

Then, by marginalizing out f , we can obtain the marginal likelihood as shown in Eq. 14.

p(Y|X) = N (Y|0,K + τ−1I) (14)

where the kernel matrix K is defined in Eq. 15.

[K]ij = k(xi, xj) (15)

Commonly, Assumption 3, which requires that the kernel has derivatives of all orders in its domain, is also necessary; positive-definite kernels include the linear, polynomial, radial-basis, and Laplacian kernels, etc. (Fasshauer, 2011).

Assumption 3. The kernel k(·, ·) is assumed to be positive-definite and smooth.


Given the new input x∗, the f function value can be estimated based on Eq. 16.

p(f(x∗)|x∗,X,Y) = N (f(x∗)|µ(x∗), ν(x∗)) (16)

where the mean µ(x∗), variance ν(x∗), and kernel vector k∗ are calculated in Eqs. 17-19,

respectively.

µ(x∗) = kᵀ∗(K + τ−1I)−1Y (17)

ν(x∗) = k(x∗,x∗)− kᵀ∗(K + τ−1I)−1k∗ (18)

k∗ = [k(x∗, x1), . . . , k(x∗, xN )]ᵀ (19)

If the kernel K has been learned from data D, the estimated output matrix f(x∗) can be calculated

via the reparameterization (Kingma and Ba, 2014) as shown in Eqs. 20-21, where ε is standard normally

distributed.

ε ∼ N (0, 1) (20)

f(x∗) = µ(x∗) + ε √ν(x∗) (21)
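As a minimal sketch of Eqs. 14-21, the snippet below computes the GP posterior mean and variance at a new input and draws a reparameterized sample. The SE-ARD kernel, its hyperparameters, and the synthetic data are illustrative assumptions; learning the kernel itself is deferred to the inference algorithm of Section 3.3.

```python
import numpy as np

# Minimal GP posterior sketch for Eqs. 14-21; kernel and data are assumptions.
def se_ard_kernel(A, B, sigma=1.0, eta=np.array([1.0, 1.0])):
    d = A[:, None, :] - B[None, :, :]                       # pairwise differences
    return sigma**2 * np.exp(-np.einsum('ijk,k,ijk->ij', d, eta, d))

def gp_posterior(X, Y, x_star, tau=100.0):
    K = se_ard_kernel(X, X)                                  # Eq. 15
    k_star = se_ard_kernel(X, x_star)                        # Eq. 19 (N x 1)
    S_inv = np.linalg.inv(K + np.eye(len(X)) / tau)
    mu = k_star.T @ S_inv @ Y                                # Eq. 17
    nu = se_ard_kernel(x_star, x_star) - k_star.T @ S_inv @ k_star   # Eq. 18
    return mu, np.maximum(nu, 0.0)

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))                                # e.g. (x, t) inputs
Y = np.sin(6.0 * X[:, :1]) + 0.05 * rng.standard_normal((50, 1))
mu, nu = gp_posterior(X, Y, X[:1])
f_star = mu + rng.standard_normal(mu.shape) * np.sqrt(nu)    # Eqs. 20-21
print(mu.ravel(), nu.ravel(), f_star.ravel())
```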

Fig. 2 shows the structure of the conventional GP method, where the circled nodes denote random vectors, the shaded nodes represent known vectors, and the arrows indicate conditional probabilities.

Figure 2: The conventional framework for inferring a Gaussian process (a graphical model linking the data X, Y, the latent function values f, and the estimates f∗ at new inputs x∗ through the GP and the observation noise).

3.1.2. Latent Force Model

In many applications, physics knowledge, expressed as differential equations, provides insight into the system's mechanism and can be very useful for both estimation and prediction. In the seminal work of Alvarez et al. (2009, 2013), latent force models (LFM) were proposed that use convolution operations to encode physics into GP kernels. They assume the differential equations are linear and admit analytical Green's functions. Given this assumption, the kernel of the target function can be derived by convolving the Green's function with the kernel of the latent functions. LFM considers W output functions f1(x), . . . , fw(x), . . . , fW (x), and assumes each output function fw is governed by a linear differential equation.

L fw(x) = uw(x) (22)

where L is a linear differential operator (Courant and Hilbert, 2008) and u is a latent force function.


Lemma 1. If one side of Eq. 22 is one GP, the other side is another GP. The covariance of a GP’s

derivative and the cross-covariance between the GP and its derivative can be obtained by taking derivatives

over the original covariance function.

Lemma 1 is proven by Alvarez et al. (2009, 2013). The reasoning is based on the fact that applying a linear differential operator to a GP results in another GP (Graepel, 2003), because the derivative of a GP is still a GP (Williams and Rasmussen, 2006).

The latent force function u can be further decomposed as a linear combination of several common latent

force functions as follows.

uw(x) = ∑_{r=1}^{R} srw gr(x) (23)

where R is the number of decomposed force functions and s is the latent coefficient matrix. Since L is linear, if we assign a GP prior over u(x), then fw(x) has a GP prior as well. Moreover, if the Green's function, namely the solution of Eq. 24, is available, we can obtain Eq. 25.

of Eq. 24, is available, we can obtain Eq. 25.

L G(x, s) = δ(s− x) (24)

where δ is the Dirac delta function, G is the Green’s function.

fw(x) = ∫ G(x, s) uw(s) ds (25)

Hence, given the kernel for uw, we can derive the kernel for fw through a convolution operation which

is shown in Eq. 26.

kfw(x1, x2) = ∫∫ G(x1, s1) G(x2, s2) kuw(s1, s2) ds1 ds2 (26)

To deal with multiple outputs, we can place independent GP priors over the common latent functions gr; then each uw and fw obtains a GP prior in turn. Via a similar convolution, we can derive the kernel across different outputs (i.e. the cross-covariance) kfw,fw′. In this way, the physics knowledge in the Green's function is hybridized with the kernel for the latent forces. This procedure is used to learn the GP model with a convolved kernel from the training data.

3.1.3. Augmented Latent Force Model

Despite the elegance and success of LFM, the precondition for using LFM might be too restrictive.

To enable the kernel convolution, LFM requires that the differential equations must be linear and have

analytical Green’s functions. However, many realistic differential equations from traffic flow models are

either nonlinear or linear but do not possess analytical Green’s functions, and therefore, cannot be exploited.

In some other cases, even with a tractable Green’s function, the complete kernel of all the input variables is

still infeasible to obtain. In order to obtain an analytical kernel after the convolution, we have to convolve

Green’s functions with smooth kernels. This may prevent us from integrating the physics knowledge into

more complex yet highly flexible kernels, such as deep kernel (Wilson et al., 2016). To handle the intractable

integral, we need to develop extra approximation methods, such as Monte-Carlo approximation.


Given the differential equation that describes the physics knowledge, the proposed augmented LFM

equation is formulated in Eq. 27.

Ψf(x) = g(x) (27)

where the differential operator Ψ can be a linear, nonlinear, or numerical differential operator, g(·) represents the unknown latent force function, and f(x) is the function to be estimated from the data D. We aim to create a generative component that regularizes the original GP with a differential equation. Using the augmented LFM, the differential equation is encoded into another GP, which is called the shadow GP. To yield numerical outputs, the kernel of the shadow GP should be efficiently learnable.

Theorem 1. If one side of Eq. 27 is one GP, the other side is another GP.

Proof. The reasoning is based on the fact that applying a differential operator to a GP results in another GP. The regularization is fulfilled via a valid generative model component rather than through process differentiation, and hence it can be applied to any linear or nonlinear differential operator. Since the resultant covariance and cross-covariance are not available via analytical derivatives, expressive kernels can be learned from data empirically.

The original LFM starts with the RHS (right-hand side) of Eq. 22, assigns it a GP prior, and then uses the convolution operation to obtain the GP prior of the left-hand side (LHS) target function. Since the convolution operation is an integration procedure, it can be restrictive and challenging. In contrast, our approach chooses the reverse direction, i.e. from LHS to RHS. We first sample the target function with an expressive kernel, apply the differentiation operation to obtain the latent force, and then regularize it with another GP prior. The differentiation operation is more flexible and convenient (Baydin et al., 2018) and does not require restricting the operator and GP kernels to ensure tractable computation. The computational challenge can be overcome by using automatic differentiation libraries (Baydin et al., 2018) and deep learning techniques (e.g. deep kernels, TensorFlow, PyTorch). Therefore, the shadow GP can be efficiently learned from pseudo observations via differential computations.
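To make this "reverse direction" concrete, the sketch below applies a differential operator to a sampled candidate function via automatic differentiation, in the spirit of Eq. 27. The small neural surrogate f_hat (standing in for a posterior GP sample), the constant speed, and the LWR-style operator are illustrative assumptions. Because only differentiation (no convolution) is required, the same pattern applies to nonlinear operators.

```python
import tensorflow as tf

# Sample a candidate target function, then evaluate g = Psi f via autodiff.
# The surrogate network and the constant speed are assumptions for illustration.
f_hat = tf.keras.Sequential([tf.keras.layers.Dense(32, activation='tanh'),
                             tf.keras.layers.Dense(1)])    # maps (x, t) -> rho

def lwr_residual(z):
    """g = Psi f = d(rho)/dt + v * d(rho)/dx with a fixed speed v (assumption)."""
    v = 30.0
    with tf.GradientTape() as tape:
        tape.watch(z)
        rho = f_hat(z)
    grads = tape.gradient(rho, z)       # columns: [d rho/dx, d rho/dt]
    return grads[:, 1:2] + v * grads[:, 0:1]

Z = tf.random.uniform((10, 2))          # pseudo input locations (x, t)
print(lwr_residual(Z).numpy().ravel())
```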

3.2. Physics regularized Gaussian process (PRGP)

Involving the shadow GP, the design concept of the proposed PRGP is illustrated in Fig. 3. To enable a

Figure 3: The proposed framework for physics regularized Gaussian process learning (the data GP links X, Y, and the target function f, while the shadow GP links the pseudo inputs Z, the latent forces g = Ψf, and the pseudo observations ω).


Bayesian framework that incorporates the physics knowledge in Eq. 27, we introduce a set of m pseudo observations, ω = [0, . . . , 0]ᵀ, to construct a generative component p(ω|X,Y) that acts as a physics-knowledge-based regularizer on the GP model p(Y|X). To sample the pseudo observations ω, the input matrix Z of length m is given as follows:

Then, we sample the posterior function values at each zj , 1 ≤ j ≤ m as shown in Eq. 29.

p(f(zj)|zj ,X,Y) = N (f(zj)|µ(zj), ν(zj)) (29)

We apply the differentiation operator in Eq. 27 to obtain the latent function values at Z, g = [g(z1), . . . , g(zm)],

which is equivalent to sampling g(·) from the degenerate distribution in Eq. 30.

p(g|f) = δ(g −Ψf) (30)

Given the latent function values g, we sample the pseudo observations ω from another GP.

p(ω|g,Z) = N (ω|g, K) (31)

where K is the covariance matrix and each element is calculated from the kernel k(·, ·) in Eq. 32.

[K]ij = k(zi, zj) (32)

Considering the symmetry property of the Gaussian distribution shown in Eq. 33, the sampling of the pseudo

observations in essence is equivalent to placing another GP prior over the sampled latent force function g.

Therefore, this GP prior regularizes the sampled latent function. Through the differential operator Ψ, the

regularization propagates back to the target machine f(·).

p(ω|g, K) = p(g|ω, K) = p(Ψf |ω, K) (33)

Thus, the joint probability of the generative component is broken into four parts, as shown in Eq. 34.

p(ω,g, f ,Z|X,Y) = p(Z)p(f |Z,X,Y)p(g|f)p(ω|g) (34)

where the prior of the m input locations p(Z), the posterior p(f |Z,X,Y), and p(ω|g) are given by Eqs. 35-37, respectively. Also note that, when no extra knowledge is available, zj can be assumed to be uniformly distributed.

p(Z) = Π_{j=1}^{m} p(zj) (35)

p(f |Z,X,Y) = Π_{j=1}^{m} [N (f(zj)|µ(zj), ν(zj))] (36)

p(ω|g) = N (ω|g, K) (37)
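A minimal numpy sketch of this generative regularizer is given below, assuming the sampled latent forces g = Ψf at the pseudo inputs Z are already available (e.g. from an autodiff step as above). Since the pseudo observations ω are all zeros, the quantity computed is log N(ω = 0 | g, K) of Eq. 31; the RBF kernel, its length scale, and the random g used here are assumptions.

```python
import numpy as np

# Log density of the zero pseudo observations under the shadow GP (Eq. 31),
# with [K]ij = k(zi, zj) as in Eq. 32. Kernel and g are illustrative assumptions.
def rbf_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def shadow_gp_logpdf(g, Z, jitter=1e-6):
    K = rbf_kernel(Z, Z) + jitter * np.eye(len(Z))
    _, logdet = np.linalg.slogdet(K)
    quad = float((g * np.linalg.solve(K, g)).sum())        # g^T K^{-1} g
    return -0.5 * (logdet + quad + len(Z) * np.log(2.0 * np.pi))

rng = np.random.default_rng(1)
Z = rng.uniform(size=(10, 2))              # pseudo input locations (Eq. 28)
g = 0.1 * rng.standard_normal((10, 1))     # stands in for the latent forces Psi f(Z)
print(shadow_gp_logpdf(g, Z))              # larger when the physics residuals are near zero
```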


3.3. Posterior regularized inference algorithm

Posterior regularization is a powerful inference methodology in the Bayesian stochastic modeling framework (Ganchev et al., 2010). The objective includes the model likelihood on data and a penalty term that encodes the constraints over the posterior of the latent variables. Via the penalty term, we can incorporate our domain knowledge or constraints directly into the posteriors, rather than through the priors and a complex, intermediate computing procedure, which can be more convenient and effective. A variety of successful posterior regularization algorithms have been proposed (He et al., 2013; Ganchev and Das, 2013; Zhu et al., 2014; Libbrecht et al., 2015; Song et al., 2016). For efficient model inference, we marginalize out all latent variables in the joint probability to avoid estimating extra approximate posteriors. Then we derive a convenient evidence lower bound to enable the reparameterization. Using the reparameterization and auto-differentiation libraries, we develop an efficient stochastic optimization algorithm based on the posterior regularization inference framework (Ganchev et al., 2010).

The proposed inference algorithm is derived as follows. The generative component in Eq. 34 is combined with the original GP in Eq. 14 to obtain a new principled Bayesian model. The joint probability is given by Eq. 38.

p(Y, ω,g, f ,Z|X) = p(Y|X)p(ω,g, f ,Z|X,Y) (38)

We first marginalize out all the latent variables in the generative component to avoid approximating their

posterior in Eq. 39.

p(ω|X,Y) = ∫∫∫ p(ω, g, f ,Z|X,Y) dZ dg df
         = ∫∫ p(Z) p(f |Z,X,Y) p(ω|Ψf , K) dZ df
         = ∫∫ p(Z) p(f |Z,X,Y) N (ω|Ψf , K) dZ df
         = E_{p(Z)} E_{p(f |Z,X,Y)} [N (Ψf |0, K)] (39)

The parameter γ ≥ 0 is used to control the strength of the regularization effect.

p(Y, ω|X) = p(Y|X) p(ω|X,Y)^γ (40)

The objective is to maximize the log-likelihood in Eq. 41.

log[p(Y, ω|X)] = log[p(Y|X)] + γ log[p(ω|X,Y)]
               = log[N (Y|0, K + τ−1I)]
               + γ log[E_{p(Z)} E_{p(f |Z,X,Y)} [N (Ψf |0, K)]] (41)

However, the log-likelihood is intractable due to the expectation inside the logarithm term. To address this problem, Jensen's inequality is used to obtain an evidence lower bound L in Eq. 42.

log[p(Y, ω|X)] ≥ L = log[N (Y|ω, K + τ−1I)]
                   + γ E_{p(Z)} E_{p(f |Z,X,Y)} [log[N (Ψf |ω, K)]] (42)


The existence of the general evidence lower bound (ELBO) of a posterior distribution is proved by analyzing a decomposition of the Kullback-Leibler (KL) divergence (Bishop, 2006). Thus, we can obtain the ELBO of the log-likelihood in Eq. 42. However, the ELBO is still intractable due to the non-analytical expectation term. Since the expectation is now outside the logarithm, we can maximize L via the stochastic optimization shown in Alg. 1.

Algorithm 1: The stochastic inference algorithm
Result: Learned kernel parameters
1 Initialization;
2 while the stopping criteria are not reached do
3     Sample a set of input locations Z;
4     Estimate the mean µ and the variance ν of f via Eqs. 17-18;
5     Generate a parameterized sample of the posterior target function values f via the reparameterization in Eqs. 20-21;
6     Substitute the parameterized samples f to obtain the unbiased estimate of the ELBO L in Eq. 42;
7     Calculate ∇θL, an unbiased stochastic gradient of L, via automatic differentiation;
8     Update the parameters θ via the gradient step shown in Eq. 43;
9 end

θt+1 = θt + α∇θL (43)

where α refers to the learning rate and θ denotes all trainable parameters.
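The sketch below is a self-contained, simplified rendering of Algorithm 1 for a single-output GP with an SE-ARD kernel and an LWR-style residual (Eq. 47) as the physics term, using TensorFlow's automatic differentiation for both the residual and the parameter gradients. The synthetic data, the constant speed in the residual, γ, τ, the jitter, and the learning rate are all illustrative assumptions, not the paper's settings.

```python
import tensorflow as tf

tf.random.set_seed(0)
N, m, gamma, tau = 64, 10, 1.0, 100.0
X = tf.random.uniform((N, 2))                        # columns: (x, t)
Y = tf.sin(6.0 * X[:, :1]) + 0.05 * tf.random.normal((N, 1))

log_sigma = tf.Variable(0.0)                         # SE-ARD kernel parameters (Eq. 45)
log_eta = tf.Variable(tf.zeros(2))
opt = tf.keras.optimizers.Adam(1e-2)

def kernel(A, B):
    d = A[:, None, :] - B[None, :, :]
    return tf.exp(2.0 * log_sigma) * tf.exp(-tf.reduce_sum(tf.exp(log_eta) * d * d, -1))

for step in range(200):
    Z = tf.random.uniform((m, 2))                    # Alg. 1, step 3
    with tf.GradientTape() as outer:
        K = kernel(X, X) + tf.eye(N) / tau
        Lc = tf.linalg.cholesky(K)
        alpha = tf.linalg.cholesky_solve(Lc, Y)
        data_ll = -0.5 * tf.reduce_sum(Y * alpha) - tf.reduce_sum(
            tf.math.log(tf.linalg.diag_part(Lc)))    # log N(Y|0, K + 1/tau I), up to constants (Eq. 14)
        with tf.GradientTape() as inner:             # steps 4-5: posterior sample at Z
            inner.watch(Z)
            k_star = kernel(X, Z)                                              # (N, m)
            mu = tf.matmul(k_star, alpha, transpose_a=True)                    # Eq. 17
            v_star = tf.linalg.cholesky_solve(Lc, k_star)
            nu = tf.linalg.diag_part(kernel(Z, Z)) - tf.reduce_sum(k_star * v_star, 0)   # Eq. 18
            f_z = mu + tf.random.normal((m, 1)) * tf.sqrt(tf.maximum(nu, 1e-9))[:, None] # Eqs. 20-21
        g = inner.gradient(f_z, Z)                   # d f / d(x, t) at Z
        residual = g[:, 1:2] + 30.0 * g[:, 0:1]      # Psi f for LWR with a constant speed (assumption)
        K_shadow = kernel(Z, Z) + 1e-4 * tf.eye(m)
        reg_ll = -0.5 * tf.reduce_sum(residual * tf.linalg.solve(K_shadow, residual)) \
                 - 0.5 * tf.linalg.logdet(K_shadow)  # log N(Psi f | 0, K_shadow), up to constants
        loss = -(data_ll + gamma * reg_ll)           # negative ELBO estimate (Eq. 42), step 6
    grads = outer.gradient(loss, [log_sigma, log_eta])         # step 7
    opt.apply_gradients(zip(grads, [log_sigma, log_eta]))      # step 8 (Eq. 43)
print(float(loss))
```

In the paper's multi-output setting, the same loop would sum one data term per output dimension and one regularization term per differential equation, as in Eq. 44.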

To prove the correctness of Alg. 1, we need to prove the correctness of employing a regularization via

ELBO as follows.

Theorem 2. Maximizing the lowerbound of the log-likelihood is equivalent to a soft constraint over the

posterior of the target function in the original GP.

Proof. While the proposed inference algorithm is developed for a hybrid model rather than a pure GP (Ganchev et al., 2010), the evidence lower bound optimized by Alg. 1 is a typical posterior regularization objective: it estimates a pure GP model and meanwhile penalizes the posterior of the target function to encourage consistency with the differential equations. Jointly maximizing the term

E_{p(Z)} E_{p(f |Z,X,Y)} [log[N (Ψf |ω, K)]]

in the lower bound of the log-likelihood L encourages all the possible latent force functions obtained from the target function f(·) via the differential operator Ψ to be treated as samples from the same shadow GP. This can be viewed as a soft constraint over the posterior of the target function in the original GP model. Therefore, while being developed for the inference of a hybrid model, the algorithm is equivalent to estimating the original GP model with soft constraints on its posterior distribution. Thus, the physics knowledge regularizes the learning of the target function in the original GP.


To apply the proposed method with multiple differential equations (i.e. the FD, conservation law, and momentum equations), Fig. 4 shows the multi-equation multi-output framework for modeling the stochastic traffic flow process.

Figure 4: The proposed framework for multi-output multi-equation PRGP learning (the data GPs GP1(K1), . . . , GPd′(Kd′) map the inputs X = [(x, t)i]2×N to the outputs Y = [(q, ρ, v)i]3×N and to the posterior values [(q, ρ, v)j ]3×m at the pseudo inputs Z = [(x, t)j ]2×m; the differential operators Ψf1(q, ρ, v), . . . , Ψfd′(q, ρ, v) produce the latent forces [g1,j ]1×m, . . . , [gw,j ]1×m, which the shadow GPs regularize against the zero pseudo observations ω = [[0, . . . , 0]ᵀ]1×m).

The log-likelihood and the ELBO of the traffic flow model can be formulated in Eq. 44.

log[p(Y, ω|X)] ≥ L = ∑_{i=1}^{d′} log[N ([Y]i|ω, Ki + τ−1I)]
                   + ∑_{w=1}^{W} γw E_{p(Z)} E_{p(fw|Z,X,Y)} [log[N (Ψfw|ω, Kw)]] (44)

3.3.1. Expressive kernels

Expressive kernels are defined as non-parametric smooth covariance functions, such as the well-known Squared Exponential Automatic Relevance Determination (SE-ARD) kernel, the Radial Basis Function (RBF) kernel (Bishop, 2006), and deep kernels (Wilson et al., 2016). The employed kernel functions are as follows.

The SE-ARD kernel is formulated in Eq. 45.

k(xi, xj) = σ² exp(−(xi − xj)ᵀ diag(η) (xi − xj)) (45)

where diag(·) denotes the diagonal matrix formed from its vector argument, and σ and η are kernel parameters.

The RBF kernel is formulated in Eq. 46.

k(xi, xj) = exp(−||xi − xj||² / (2σ²)) (46)

where σ is the kernel parameter.
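As a companion to Eqs. 45-46, the sketch below implements the RBF kernel and composes it with a small feature network to form a deep kernel in the spirit of Wilson et al. (2016); the SE-ARD kernel of Eq. 45 follows the same pattern with a per-dimension length-scale vector η. The two-layer architecture and the parameter values are illustrative assumptions.

```python
import tensorflow as tf

def rbf_kernel(A, B, sigma=1.0):
    """RBF kernel of Eq. 46 evaluated between the rows of A and B."""
    d2 = tf.reduce_sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return tf.exp(-d2 / (2.0 * sigma ** 2))

# Deep kernel: apply the RBF kernel to features produced by a small network
# (architecture is an assumption for illustration).
feature_net = tf.keras.Sequential([tf.keras.layers.Dense(16, activation='tanh'),
                                   tf.keras.layers.Dense(4)])

def deep_kernel(A, B, sigma=1.0):
    return rbf_kernel(feature_net(A), feature_net(B), sigma)

X = tf.random.uniform((5, 2))            # e.g. (x, t) inputs
print(rbf_kernel(X, X).shape, deep_kernel(X, X).shape)   # both (5, 5)
```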

3.3.2. Algorithm complexity

The time complexity of the inference of the original GP is O(N3). The time complexity of the inference

of the shadow GP is O(m3). Thus, the total time complexity for the inference of two GPs is O((Nd′)3+m3).


To store the kernel matrices of the original GP and the shadow GP, the space complexity is O((Nd′)² + m²). In the testing phase, the time complexity of the model estimation is marginal (less than 1 ms) empirically.

3.4. Physics regularized traffic state estimation

To apply the proposed method, the traffic flow models need to be converted to the form of Eq. 27. In this study, we aim to encode three classical traffic flow models, LWR, PW, and ARZ, into the GP and compare their performance under the PRGP framework. More specifically, the converted LWR, PW, and ARZ models are presented as follows. In the PRGP, the stochastic conservation law of LWR is formulated in Eq. 47.

Ψf1(q, ρ, v) = ∂tρ+ ∂xq = g1 (47)

The stochastic PW model is formulated in Eqs. 48-49.

Ψf1(q, ρ, v) = ∂tρ+ ∂x(ρv) = g1 (48)

Ψf2(q, ρ, v) = ∂tv + v∂xv + (v − V (ρ))/τ0 + (c0²/ρ)∂xρ = g2 (49)

And the stochastic ARZ model is formulated in Eqs. 50-51.

Ψf1(q, ρ, v) = ∂tρ+ ∂x(ρv) = g1 (50)

Ψf2(q, ρ, v) = ∂t(v − V (ρ)) + v∂x(v − V (ρ)) + (v − V (ρ))/τ0 = g2 (51)
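The sketch below evaluates the two ARZ residuals of Eqs. 50-51 at pseudo inputs via automatic differentiation. The surrogate network producing (ρ, v), the Greenshields form of V(ρ), and the value of τ0 are illustrative assumptions standing in for the posterior GP samples of the traffic state; the LWR and PW residuals of Eqs. 47-49 can be evaluated in the same way.

```python
import tensorflow as tf

# Evaluate the ARZ residuals g1, g2 (Eqs. 50-51) via autodiff; the surrogate
# network, V(rho), and tau0 are illustrative assumptions.
state_net = tf.keras.Sequential([tf.keras.layers.Dense(32, activation='tanh'),
                                 tf.keras.layers.Dense(2, activation='softplus')])  # (rho, v) > 0
tau0, v_f, rho_jam = 10.0, 65.0, 120.0

def arz_residuals(z):                                # z columns: (x, t)
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(z)
        rho, v = tf.split(state_net(z), 2, axis=1)
        V = v_f * (1.0 - rho / rho_jam)              # density-speed function (assumption)
        q = rho * v
        w = v - V                                    # ARZ relative-speed variable
    d_rho = tape.gradient(rho, z)                    # [d rho/dx, d rho/dt]
    d_q = tape.gradient(q, z)
    d_w = tape.gradient(w, z)
    del tape
    g1 = d_rho[:, 1:2] + d_q[:, 0:1]                 # Eq. 50: conservation law
    g2 = d_w[:, 1:2] + v * d_w[:, 0:1] + w / tau0    # Eq. 51: momentum equation
    return g1, g2

Z = tf.random.uniform((10, 2))
g1, g2 = arz_residuals(Z)
print(g1.shape, g2.shape)
```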

4. Numerical Tests with Field Data

4.1. Case setting

To evaluate the performance of the proposed PRML framework, we applied the three PRGP models to estimate the traffic flow on a stretch of the I-15 interstate freeway in Utah, U.S. The Utah Department of Transportation (UDOT) has installed sensors every few miles along the freeway. Each sensor counts the number of vehicles passing every minute, measures the speed of each vehicle, and sends the data back to a central database, named the Performance Measurement System (PeMS). The collected real-time data and road conditions are available online and can be accessed by the public. For model evaluation, the data, from August 5, 2019 to August 11, 2019, were collected by four sensors on I-15, Utah. The input variables include the location coordinates of each sensor and the time of each reading. The studied stretch is illustrated in Fig. 5, where the yellow line indicates the studied freeway segments and the blue bars represent the locations of the traffic detectors. In this case, the data are shuffled and randomly split into training and testing sets.


Figure 5: The stretch of the studied freeway segment which includes four detectors

4.2. Implementation

The deep kernel can take any neural network structure, such as a feed-forward neural network, and can be fine-tuned to achieve better empirical results. Incorporating the SE-ARD and RBF kernels, the compound kernels of the d′-dimensional original GP and the W-dimensional shadow GP are computed as in Fig. 6. The procedure for estimating the target traffic state q, v for any given input x, t is illustrated in Fig. 7. In the multi-output multi-equation PRGP, the d′ dimensions mean that one compound kernel is created for each dimension of y, and the W dimensions mean that one compound kernel is created for each differential equation. Note that the structure of the GPs can be fine-tuned to achieve better empirical performance.

[Figure 6 diagrams the loss computation: the ARD kernel applied to the input data X yields K and S = K + τ−1I, giving the data term −log p(Yi|X) = 0.5 log|S| + 0.5 YiᵀS−1Yi; the RBF kernel applied to the randomized inputs Z, together with the physics residuals gw = Ψf̂ of the predicted f̂, gives the regularization term −log p(ω|X,Yi) ∝ ∑w γw [0.5 log|Cw| + 0.5 gwᵀCw−1gw]; the training objective minimizes L = −∑i log p(Yi|X) − ∑i log p(ω|X,Yi).]
Figure 6: The structure of the proposed loss function


[Figure 7 diagrams the estimation step: the ARD kernel applied to the input data X yields K and S = K + τ−1I; with the output data Y, the posterior mean µ = KS−1Y and the posterior standard deviation σ are computed from K and S, and the estimate is f̂ = µ + σε with ε ∼ N(0, 1).]
Figure 7: The structure of estimation

In the experiments, the parameters of the proposed method are set as follows: (a) the number of pseudo observations is m = 10, and (b) the strength of regularization γ is fine-tuned numerically. The proposed inference algorithm is implemented in the TensorFlow framework, where the Adam optimizer (Kingma and Ba, 2014) is chosen for updating the parameters.

4.3. Results Analysis

4.3.1. Comparison with Pure Machine Learning Models

To prove the superiority of the proposed PRML framework compared with pure ML models, this sub-

section aims to compare the three PRGP models, LWR-PRGP, PW-PRGP, and ARZ-PRGP, with pure

GP and other popular ML models such as multilayer perceptron, support vector machine, and random

forest (Bishop, 2006). Also recall that one main contribution of PRML is that it is more explainable in

terms of model performance. Hence, this study further adopts another physical model, the well-known heat

equation, to prove the indispensability of classical traffic flow models in the PRML framework, since the

heat equation is not suitable to model traffic flows. The heat equation is formulated in Eq. 52.

∂fh(x, t)/∂t = β1∇²fh(x, t) (52)

Note that the inputs of the proposed PRGP-based methods and of the classical traffic flow models are different. The latter often require on-ramp and off-ramp flow observations as inputs, while the proposed method treats on-ramp and off-ramp flows as unobserved within the framework and does not require such data. The training process of each model, with 500 iterations and 2,880 samples, takes 10,480 seconds on average on a workstation equipped with a 3.5 GHz 6-core CPU. In the testing phase, the time cost of the model estimation is marginal (less than 1 second) empirically, similar to all ML models. Note that the computation can be accelerated by about 5 times if an NVIDIA CUDA-capable GPU is used.


Figs. 8-9 compare the flow and speed estimations with the ground truth in the studied case. If the slope of the trend line is close to 1 and the intercept is close to 0, the estimation is considered accurate. The results show that both the pure GP and the proposed PRGP models perform well in estimating the flows and speeds.

Figure 8: Comparison between flow estimations by GP and PRGPs and the ground truth: (a) GP, (b) LWR-PRGP, (c) PW-PRGP, (d) ARZ-PRGP.

To quantify the precision of the outputs, the Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) of each dimension are used as the performance metrics, which are defined in Eqs. 53-54.

RMSEj = √( (1/N) ∑_{i=1}^{N} (([yj]i − [fj]i)/σi)² ), ∀j ∈ 1, . . . , d′ (53)

MAPEj = (100%/N) ∑_{i=1}^{N} |([yj]i − [fj]i)/[yj]i|, ∀j ∈ 1, . . . , d′ (54)
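A minimal sketch of the two metrics is given below; the per-sample scale σi in Eq. 53 is set to 1 here (an assumption), which reduces the metric to the ordinary RMSE, and the sample values are purely illustrative.

```python
import numpy as np

# RMSE (Eq. 53 with sigma_i = 1, an assumption) and MAPE (Eq. 54).
def rmse(y, f, sigma=1.0):
    return float(np.sqrt(np.mean(((y - f) / sigma) ** 2)))

def mape(y, f):
    return float(100.0 * np.mean(np.abs((y - f) / y)))

y = np.array([400.0, 380.0, 410.0])     # e.g. observed flow, veh/5min (illustrative)
f = np.array([390.0, 400.0, 405.0])     # estimated flow
print(rmse(y, f), mape(y, f))
```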

Table 1 summarizes the results of the comparable baselines and the proposed method on the same dataset. Among the four pure ML models, the GP clearly outperforms the other ML models in providing more accurate estimations of both flows and speeds. The GP yields an RMSE of 39.74 veh/5-min and a MAPE of 13.70% for flow, and an RMSE of 2.76 mph and a MAPE of 2.64% for speed, while the other three produce much higher RMSEs and MAPEs for both flow and speed estimates. Further


Figure 9: Comparison between speed estimations by GP and PRGPs and the ground truth: (a) GP, (b) LWR-PRGP, (c) PW-PRGP, (d) ARZ-PRGP.

comparison between the pure GP and the three PRGP models reveals that the PRGP models improve the accuracy of both flow and speed estimations. However, the improvement is not large, because the pure GP already achieves very good estimation performance and leaves limited room for improvement by the PRGP. Moreover, to validate the PRGP's contribution to making the results more explainable, the comparison with Heat-PRGP, which uses the physical knowledge from the heat equation, shows that a physical model that cannot precisely describe traffic flow patterns can even degrade the capability of the PRGP. Further evidence is that PW and ARZ, which are improved versions of LWR, improve the performance of the PRGP compared with LWR.

4.3.2. Comparison with physical models (Traffic Flow Models)

To provide physical baselines for the performance comparison, the LWR, PW, and ARZ models are calibrated with the obtained field data. For model calibration, we follow the method of Akwir et al. (2018), in which a hybrid scheme of a neural network and a nonlinear partial differential equation is used to dynamically adjust all outputs of the three models and obtain their calibrated parameters. Figs. 10-11 plot the estimated flow and speed from the three physical models versus the ground truth. The estimation results are clearly quite biased for both flow and speed.


Table 1: Comparison of the results of the proposed method and the baseline methods

Method                    Flow RMSE (veh/5min)   Flow MAPE   Speed RMSE (mph)   Speed MAPE
Multilayer perceptron     113.95                 30.80%      13.61              19.91%
Support Vector Machine    124.84                 34.24%      9.58               13.01%
Random Forest             108.24                 27.60%      8.66               12.02%
pure GP                   39.74                  13.70%      2.76               2.64%
LWR-PRGP                  37.19                  12.77%      2.96               2.65%
PW-PRGP                   35.45                  12.42%      3.02               2.68%
ARZ-PRGP                  34.75                  11.48%      2.90               2.72%
Heat-PRGP                 79.51                  23.49%      5.20               6.75%

Figure 10: Estimated flow by the calibrated physical models vs. ground truth: (a) LWR, (b) PW, (c) ARZ.

Figure 11: Estimated speed by the calibrated physical models vs. ground truth: (a) LWR, (b) PW, (c) ARZ.

To better assess the models' estimation accuracy, Table 2 reports the estimation errors of the proposed method and the calibrated physical models. The proposed method significantly outperforms the baseline methods, by around 80 veh/5min in flow RMSE, 18% in flow MAPE, 7 mph in speed RMSE, and 15% in speed MAPE. Hence, it can be concluded that the estimation performance of traffic flow models can be greatly improved when they are encoded into an ML framework. The real-world uncertainties of


flow and speed can be captured by the ML portion properly.

Table 2: Comparison of the results of the proposed methods and the physics-based methods

Method               Flow RMSE (veh/5min)   Flow MAPE   Speed RMSE (mph)   Speed MAPE
Calibrated LWR       115.75                 32.96%      9.88               14.4%
LWR-regularized GP   37.19                  12.77%      2.96               2.76%
Calibrated PW        115.80                 30.00%      10.41              18.2%
PW-regularized GP    35.5                   12.42%      3.02               2.68%
Calibrated ARZ       155.20                 32.00%      12.71              18.4%
ARZ-regularized GP   34.75                  11.48%      2.90               2.72%

4.3.3. Robustness study

As aforementioned, the proposed PRML framework is expected to be more robust than pure ML models on noisy datasets. Hence, in this subsection, 50% of the training data are replaced by flawed data, generated by adding noise of 100 veh/5min to the flows, while the testing data remain unchanged so that model evaluation is not affected by the noise. Since the GP outperforms the multilayer perceptron, support vector machine, and random forest in both flow and speed estimation, only the robustness of the GP and the PRGP models is examined here. Table 3 and Figs. 12-13 summarize their estimation performance when trained on the noised data. The results show that the pure GP has limited resistance to highly biased data, e.g., data caused by traffic detector malfunctions. The three PRGP models greatly outperform the pure GP, by roughly 170 veh/5min in flow RMSE and more than 100 percentage points in flow MAPE. Hence, it can be concluded that the proposed PRML framework is much more robust than pure ML models when the input data are subject to unobserved random noise, owing to PRML's capability of using physical knowledge to regularize the ML training process. The results also show that the heat equation does not capture the dynamics of traffic flow, and only well-developed traffic flow models can improve the accuracy of the Gaussian process.
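
As a concrete illustration of this setup, the sketch below perturbs a random half of a training flow series with 100 veh/5min noise. The noise distribution is not specified above, so zero-mean Gaussian noise with that magnitude as its standard deviation is assumed, and the function name and clipping step are illustrative.

# Sketch of the noise-injection setup; the Gaussian noise model is an assumption.
import numpy as np

def corrupt_training_flows(flow_train, noise_std=100.0, fraction=0.5, seed=0):
    # Replace a random 50% of the training flow observations with noised values.
    rng = np.random.default_rng(seed)
    flows = np.asarray(flow_train, dtype=float).copy()
    idx = rng.choice(flows.shape[0], size=int(fraction * flows.shape[0]), replace=False)
    flows[idx] += rng.normal(0.0, noise_std, size=idx.shape[0])
    return np.clip(flows, 0.0, None)  # keep flows non-negative

# the testing set is left untouched so that evaluation reflects the true states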

5. Conclusions and Future Research Directions

In the literature, traffic flow models have been well developed to explain traffic phenomena; however, they face theoretical difficulties in stochastic formulation and rigorous estimation. Given the increasing availability of data, data-driven methods are prevailing and fast-developing; however, they lack sensitivity to irregular events and their effectiveness is compromised on sparse data. To address the issues of both approaches, this paper investigates a hybrid framework that incorporates the advantages of both. Specifically, it proposes a stochastic modeling framework to capture the random detection noise and the latent


Table 3: Comparison of the estimation accuracy with noisy training dataset

Method     Flow RMSE (veh/5min)   Flow MAPE   Speed RMSE (mph)   Speed MAPE
Pure GP    212.17                 135.19%      5.96               3.35%
GP-LWR      41.78                   9.73%      6.01               3.46%
GP-PW       41.11                   9.60%      4.43               3.30%
GP-ARZ      35.37                   9.51%      3.06               2.72%
GP-HEAT    215.01                 138.29%      4.31              33.60%

Figure 12: Comparison between flow estimation and ground truth with noisy training dataset (panels: (a) GP, (b) GP-LWR, (c) GP-PW, (d) GP-ARZ)

unobserved states of traffic data, while leveraging the well-defined fundamental diagram, conservation law, and momentum conditions. The traffic state indicators (i.e., flow, speed, and density) are assumed to follow a multivariate Gaussian distribution. A physics regularized Gaussian process (PRGP) is proposed to encode the physics knowledge into the Bayesian inference structure as a shadow Gaussian process, which is proven to regularize the conventional constraint-free Gaussian process as a soft constraint. To estimate the proposed PRGP, a posterior regularized inference algorithm is derived and implemented with


Figure 13: Comparison between speed estimation and ground truth with noisy training dataset (panels: (a) GP, (b) GP-LWR, (c) GP-PW, (d) GP-ARZ)

auto-differentiation libraries. The computational complexity is cubic in the product of the sample size and the output dimension, i.e., O((Nd')^3 + m^3). A preliminary real-world case study is conducted on PeMS detection data collected from a freeway segment in Utah, and the well-known continuous traffic flow models (i.e., LWR, PW, and ARZ) are tested. In comparison with pure machine learning methods and pure physical models, the numerical results justify the effectiveness and robustness of the proposed method.
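
To make the flavor of this inference scheme more concrete, the following minimal PyTorch sketch trains GP hyperparameters by maximizing a log marginal likelihood penalized by a discretized LWR residual, using automatic differentiation and the Adam optimizer (Kingma and Ba, 2014). It illustrates only the general principle of physics-as-soft-regularizer under several assumptions (a single-output GP over density, a Greenshields flux, randomly drawn collocation points, and an arbitrary penalty weight); it is not the paper's posterior-regularization algorithm or its shadow-GP construction.

# Minimal sketch of physics-regularized GP training; all constants are illustrative.
import math
import torch

torch.manual_seed(0)

# toy space-time "detector" densities on [0, 1]^2 (placeholder data)
X = torch.rand(60, 2)                                    # columns: location x, time t
y = 0.3 * torch.sin(3 * X[:, 0]) * torch.cos(2 * X[:, 1]) + 0.4

log_ls = torch.zeros(1, requires_grad=True)              # log length-scale
log_sf = torch.zeros(1, requires_grad=True)              # log signal std
log_sn = torch.tensor([-2.0], requires_grad=True)        # log noise std
v_free, rho_jam, lam = 1.0, 1.0, 10.0                    # assumed physics constants and penalty weight

def rbf(A, B):
    # squared-exponential kernel with learnable signal variance and length-scale
    d2 = ((A.unsqueeze(1) - B.unsqueeze(0)) ** 2).sum(-1)
    return torch.exp(2 * log_sf) * torch.exp(-0.5 * d2 / torch.exp(2 * log_ls))

def kernel_matrix():
    # training kernel plus noise variance and a small jitter for stability
    return rbf(X, X) + (torch.exp(2 * log_sn) + 1e-4) * torch.eye(X.shape[0])

def gp_mean(Xq):
    # GP posterior mean of the density surface at query points Xq
    return (rbf(Xq, X) @ torch.linalg.solve(kernel_matrix(), y.unsqueeze(1))).squeeze(1)

def negative_objective():
    # GP log marginal likelihood of the observed densities
    L = torch.linalg.cholesky(kernel_matrix())
    alpha = torch.cholesky_solve(y.unsqueeze(1), L)
    loglik = (-0.5 * y.unsqueeze(0) @ alpha).squeeze() - torch.log(torch.diag(L)).sum() \
             - 0.5 * len(y) * math.log(2 * math.pi)
    # soft physics penalty: LWR residual d(rho)/dt + q'(rho) * d(rho)/dx at collocation points
    Xc = torch.rand(40, 2, requires_grad=True)
    rho = gp_mean(Xc)
    grads = torch.autograd.grad(rho.sum(), Xc, create_graph=True)[0]
    drho_dx, drho_dt = grads[:, 0], grads[:, 1]
    dq_drho = v_free * (1.0 - 2.0 * rho / rho_jam)       # derivative of the Greenshields flux
    residual = drho_dt + dq_drho * drho_dx
    return -(loglik - lam * (residual ** 2).mean())

optimizer = torch.optim.Adam([log_ls, log_sf, log_sn], lr=0.05)
for step in range(200):                                  # stochastic gradient-based training
    optimizer.zero_grad()
    loss = negative_objective()
    loss.backward()
    optimizer.step()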

Potential directions for future research include: (1) extending the proposed method to leverage other models for traffic state estimation, such as Gaussian processes regularized by discrete macroscopic traffic flow models; (2) extending the proposed method to other problems, such as Gaussian processes regularized by microscopic behavior models for vehicle trajectory prediction; and (3) extending the physics regularization methodology to other machine learning algorithms, such as random forests and support vector machines, to incorporate general physics knowledge into learning tasks.

References

Akwir, N.A., Chedjou, J.C., Kyamakya, K., 2018. Neural-network-based calibration of macroscopic traffic

flow models, in: Recent Advances in Nonlinear Dynamics and Synchronization. Springer, pp. 151–173.


Alvarez, M., Luengo, D., Lawrence, N.D., 2009. Latent force models, in: Artificial Intelligence and Statistics,

pp. 9–16.

Alvarez, M.A., Luengo, D., Lawrence, N.D., 2013. Linear latent force models using gaussian processes.

IEEE transactions on pattern analysis and machine intelligence 35, 2693–2705.

Armand, A., Filliat, D., Ibanez-Guzman, J., 2013. Modelling stop intersection approaches using gaussian

processes, in: 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013),

IEEE. pp. 1650–1655.

Aw, A., Rascle, M., 2000. Resurrection of "second order" models of traffic flow. SIAM Journal on Applied Mathematics 60, 916–938.

Barber, D., Wang, Y., 2014. Gaussian processes for bayesian estimation in ordinary differential equations,

in: International Conference on Machine Learning, pp. 1485–1493.

Baydin, A.G., Pearlmutter, B.A., Radul, A.A., Siskind, J.M., 2018. Automatic differentiation in machine

learning: a survey. Journal of machine learning research 18.

Bishop, C.M., 2006. Pattern recognition and machine learning. springer.

Blandin, S., Argote, J., Bayen, A.M., Work, D.B., 2013. Phase transition model of non-stationary traffic

flow: Definition, properties and solution method. Transportation Research Part B: Methodological 52,

31–55.

Blandin, S., Couque, A., Bayen, A., Work, D., 2012. On sequential data assimilation for scalar macroscopic

traffic flow models. Physica D: Nonlinear Phenomena 241, 1421–1440.

Calderhead, B., Girolami, M., Lawrence, N.D., 2009. Accelerating bayesian inference over nonlinear dif-

ferential equations with gaussian processes, in: Advances in neural information processing systems, pp.

217–224.

Coifman, B., 2002. Estimating travel times and vehicle trajectories on freeways using dual loop detectors.

Transportation Research Part A: Policy and Practice 36, 351–364.

Colombo, R.M., 2003. Hyperbolic phase transitions in traffic flow. SIAM Journal on Applied Mathematics

63, 708–721.

Courant, R., Hilbert, D., 2008. Methods of Mathematical Physics: Partial Differential Equations. John

Wiley & Sons.

Daganzo, C.F., 1995. Requiem for second-order fluid approximations of traffic flow. Transportation Research

Part B: Methodological 29, 277–286.


Davis, G.A., Kang, J.G., 1994. Estimating destination-specific traffic densities on urban freeways for advanced traffic management. Transportation Research Record 1457.

Del Castillo, J., Pintado, P., Benitez, F., 1994. The reaction time of drivers and the stability of traffic flow.

Transportation Research Part B: Methodological 28, 35–60.

Di, X., Liu, H.X., Davis, G.A., 2010. Hybrid extended kalman filtering approach for traffic density estimation

along signalized arterials: Use of global positioning system data. Transportation Research Record 2188,

165–173.

Duan, Y., Lv, Y., Liu, Y.L., Wang, F.Y., 2016. An efficient realization of deep learning for traffic data

imputation. Transportation research part C: emerging technologies 72, 168–181.

Fan, S., Herty, M., Seibold, B., 2013. Comparative model accuracy of a data-fitted generalized aw-rascle-

zhang model. arXiv preprint arXiv:1310.8219 .

Fasshauer, G.E., 2011. Positive definite kernels: past, present and future. Dolomite Research Notes on

Approximation 4, 21–63.

Ganchev, K., Das, D., 2013. Cross-lingual discriminative learning of sequence models with posterior regular-

ization, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,

pp. 1996–2006.

Ganchev, K., Gillenwater, J., Taskar, B., et al., 2010. Posterior regularization for structured latent variable

models. Journal of Machine Learning Research 11, 2001–2049.

Gazis, D., Liu, C., 2003. Kalman filtering estimation of traffic counts for two network links in tandem.

Transportation Research Part B: Methodological 37, 737–745.

Gazis, D.C., Knapp, C.H., 1971. On-line estimation of traffic densities from time-series of flow and speed

data. Transportation Science 5, 283–301.

Graepel, T., 2003. Solving noisy linear operator equations by gaussian processes: Application to ordinary

and partial differential equations, in: ICML, pp. 234–241.

Gray, L., Griffeath, D., 2001. The ergodic theory of traffic jams. Journal of Statistical Physics 105, 413–452.

He, L., Gillenwater, J., Taskar, B., 2013. Graph-based posterior regularization for semi-supervised structured

prediction, in: Proceedings of the Seventeenth Conference on Computational Natural Language Learning,

pp. 38–46.

Heinonen, M., Yildiz, C., Mannerstrom, H., Intosalmi, J., Lahdesmaki, H., 2018. Learning unknown ode

models with gaussian processes. arXiv preprint arXiv:1803.04303 .


Hoogendoorn, S.P., Bovy, P.H., 2001. State-of-the-art of vehicular traffic flow modelling. Proceedings of the

Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering 215, 283–303.

Ide, T., Kato, S., 2009. Travel-time prediction using gaussian process regression: A trajectory-based ap-

proach, in: Proceedings of the 2009 SIAM International Conference on Data Mining, SIAM. pp. 1185–

1196.

Jabari, S.E., Liu, H.X., 2012. A stochastic model of traffic flow: Theoretical foundations. Transportation

Research Part B: Methodological 46, 156–174.

Jabari, S.E., Liu, H.X., 2013. A stochastic model of traffic flow: Gaussian approximation and estimation.

Transportation Research Part B: Methodological 47, 15–41.

Jabari, S.E., Zheng, J., Liu, H.X., 2014. A probabilistic stationary speed–density relation based on Newell's simplified car-following model. Transportation Research Part B: Methodological 68, 205–223.

Kang, J.G., 1995. Estimation of destination-specific traffic densities and identification of parameters on

urban freeways using Markov models of traffic flow. University of Minnesota.

Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kuwahara, M., 2015. Theory, solution method and applications of kinematic wave. Interdisciplinary Infor-

mation Sciences 21, 63–75.

Laval, J.A., He, Z., Castrillon, F., 2012. Stochastic extension of newell’s three-detector method. Trans-

portation Research Record 2315, 73–80.

Lawrence, N.D., Sanguinetti, G., Rattray, M., 2007. Modelling transcriptional regulation using gaussian

processes, in: Advances in Neural Information Processing Systems, pp. 785–792.

Lebacque, J.P., Mammar, S., Salem, H.H., 2007. Generic second order traffic flow modelling, in: Transportation and Traffic Theory 2007 (Papers Selected for Presentation at ISTTT17).

Li, L., Li, Y., Li, Z., 2013. Efficient missing data imputing for traffic flow by considering temporal and

spatial dependence. Transportation research part C: emerging technologies 34, 108–120.

Libbrecht, M.W., Hoffman, M.M., Bilmes, J.A., Noble, W.S., 2015. Entropic graph-based posterior regu-

larization: Extended version, in: Proceedings of the International Conference on Machine Learning.

Lighthill, M.J., Whitham, G.B., 1955. On kinematic waves ii. a theory of traffic flow on long crowded roads.

Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences 229, 317–345.


Liu, S., Yue, Y., Krishnan, R., 2013. Adaptive collective routing using gaussian process dynamic congestion

models, in: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and

data mining, ACM. pp. 704–712.

Mihaylova, L., Boel, R., 2004. A particle filter for freeway traffic estimation, in: 2004 43rd IEEE Conference

on Decision and Control (CDC)(IEEE Cat. No. 04CH37601), IEEE. pp. 2106–2111.

Mihaylova, L., Boel, R., Hegyi, A., 2006. An unscented kalman filter for freeway traffic estimation, IFAC.

Nagel, K., Schreckenberg, M., 1992. A cellular automaton model for freeway traffic. Journal de physique I

2, 2221–2229.

Neumann, M., Kersting, K., Xu, Z., Schulz, D., 2009. Stacked gaussian process learning, in: 2009 Ninth

IEEE International Conference on Data Mining, IEEE. pp. 387–396.

Ni, D., Leonard, J.D., 2005. Markov chain monte carlo multiple imputation using bayesian networks for

incomplete intelligent transportation systems data. Transportation research record 1935, 57–67.

Osorio, C., Flotterod, G., Bierlaire, M., 2011. Dynamic network loading: a stochastic differentiable model

that derives link state distributions. Procedia-Social and Behavioral Sciences 17, 364–381.

Papageorgiou, M., 1998. Some remarks on macroscopic traffic flow modelling. Transportation Research

Part A: Policy and Practice 32, 323–329.

Papageorgiou, M., Blosseville, J.M., Hadj-Salem, H., 1989. Macroscopic modelling of traffic flow on the

boulevard peripherique in paris. Transportation Research Part B: Methodological 23, 29–47.

Pascale, A., Gomes, G., Nicoli, M., 2013. Estimation of highway traffic from sparse sensors: Stochastic

modeling and particle filtering, in: 2013 IEEE International Conference on Acoustics, Speech and Signal

Processing, IEEE. pp. 6158–6162.

Paveri-Fontana, S., 1975. On boltzmann-like treatments for traffic flow: a critical review of the basic model

and an alternative proposal for dilute traffic analysis. Transportation research 9, 225–235.

Payne, H., 1971. Models of freeway traffic and control. Mathematical Models of Public Systems.

Polson, N., Sokolov, V., 2017a. Bayesian particle tracking of traffic flows. IEEE Transactions on Intelligent

Transportation Systems 19, 345–356.

Polson, N.G., Sokolov, V.O., 2017b. Deep learning for short-term traffic flow prediction. Transportation

Research Part C: Emerging Technologies 79, 1–17.

Prigogine, I., Herman, R., 1971. Kinetic theory of vehicular traffic. Technical Report.


Raissi, M., Perdikaris, P., Karniadakis, G.E., 2017. Machine learning of linear differential equations using

gaussian processes. Journal of Computational Physics 348, 683–693.

Richards, P.I., 1956. Shock waves on the highway. Operations research 4, 42–51.

Rodrigues, F., Henrickson, K., Pereira, F.C., 2018. Multi-output gaussian processes for crowdsourced traffic

data imputation. IEEE Transactions on Intelligent Transportation Systems 20, 594–603.

Rodrigues, F., Pereira, F.C., 2018. Heteroscedastic gaussian processes for uncertainty modeling in large-scale

crowdsourced traffic data. Transportation research part C: emerging technologies 95, 636–651.

Seo, T., Bayen, A.M., Kusakabe, T., Asakura, Y., 2017. Traffic state estimation on highway: A compre-

hensive survey. Annual reviews in control 43, 128–151.

Song, Y., Zhu, J., Ren, Y., 2016. Kernel bayesian inference with posterior regularization, in: Advances in

Neural Information Processing Systems, pp. 4763–4771.

Sopasakis, A., 2012. Lattice free stochastic dynamics. Communications in Computational Physics 12,

691–702.

Sopasakis, A., Katsoulakis, M.A., 2006. Stochastic modeling and simulation of traffic flow: asymmetric

single exclusion process with arrhenius look-ahead dynamics. SIAM Journal on Applied Mathematics 66,

921–944.

Sumalee, A., Zhong, R., Pan, T., Szeto, W., 2011. Stochastic cell transmission model (sctm): A stochastic

dynamic traffic model for traffic state surveillance and assignment. Transportation Research Part B:

Methodological 45, 507–533.

Szeto, M.W., Gazis, D.C., 1972. Application of kalman filtering to the surveillance and control of traffic

systems. Transportation Science 6, 419–439.

Tak, S., Woo, S., Yeo, H., 2016. Data-driven imputation method for traffic data in sectional units of road

links. IEEE Transactions on Intelligent Transportation Systems 17, 1762–1771.

Tan, H., Feng, G., Feng, J., Wang, W., Zhang, Y.J., Li, F., 2013. A tensor-based method for missing traffic

data completion. Transportation Research Part C: Emerging Technologies 28, 15–27.

Tan, H., Wu, Y., Cheng, B., Wang, W., Ran, B., 2014. Robust missing traffic flow imputation considering

nonnegativity and road capacity. Mathematical Problems in Engineering 2014.

Tang, J., Zhang, G., Wang, Y., Wang, H., Liu, F., 2015. A hybrid approach to integrate fuzzy c-means based

imputation method with genetic algorithm for missing traffic volume data estimation. Transportation

Research Part C: Emerging Technologies 51, 29–40.


Wada, K., Usui, K., Takigawa, T., Kuwahara, M., 2017. An optimization modeling of coordinated traffic

signal control based on the variational theory and its stochastic extension. Transportation research

procedia 23, 624–644.

Wang, Y., Papageorgiou, M., 2005. Real-time freeway traffic state estimation based on extended kalman

filter: a general approach. Transportation Research Part B: Methodological 39, 141–167.

Wang, Y., Papageorgiou, M., Messmer, A., 2007. Real-time freeway traffic state estimation based on

extended kalman filter: A case study. Transportation Science 41, 167–181.

Whitham, G., 1975. Linear and nonlinear waves. Modern Book Incorporated.

Williams, C.K., Rasmussen, C.E., 2006. Gaussian processes for machine learning. volume 2. MIT press

Cambridge, MA.

Wilson, A.G., Hu, Z., Salakhutdinov, R., Xing, E.P., 2016. Deep kernel learning, in: Artificial Intelligence

and Statistics, pp. 370–378.

Work, D.B., Tossavainen, O.P., Blandin, S., Bayen, A.M., Iwuchukwu, T., Tracton, K., 2008. An ensemble

kalman filtering approach to highway traffic estimation using gps enabled mobile devices, in: 2008 47th

IEEE Conference on Decision and Control, IEEE. pp. 5062–5068.

Wu, Y., Tan, H., Qin, L., Ran, B., Jiang, Z., 2018. A hybrid deep learning based traffic flow prediction

method and its understanding. Transportation Research Part C: Emerging Technologies 90, 166–180.

Xie, Y., Zhao, K., Sun, Y., Chen, D., 2010. Gaussian processes for short-term traffic volume forecasting.

Transportation Research Record 2165, 69–78.

Yin, W., Murray-Tuite, P., Rakha, H., 2012. Imputing erroneous data of single-station loop detectors

for nonincident conditions: Comparison between temporal and spatial methods. Journal of Intelligent

Transportation Systems 16, 159–176.

Zhang, H.M., 2002. A non-equilibrium traffic model devoid of gas-like behavior. Transportation Research

Part B: Methodological 36, 275–290.

Zhong, M., Lingras, P., Sharma, S., 2004. Estimation of missing traffic counts using factor, genetic, neural,

and regression techniques. Transportation Research Part C: Emerging Technologies 12, 139–166.

Zhu, J., Chen, N., Xing, E.P., 2014. Bayesian inference with posterior regularization and applications to

infinite latent svms. The Journal of Machine Learning Research 15, 1799–1847.
