Modelling COVID-19 Infection Data With a Simple Gaussian Process · 2020-05-06 · Table 2:...



Modelling COVID-19 Infection Data With a Simple Gaussian Process

B.C. Allanach, T. Baldauf, H.M. Banks, S.C. Crew, J. Davighi, W. Haddadin, M. Madigan, M. McCullough∗, C. Turner, M. Ubiali

Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, United Kingdom, CB3 0WA

May 5, 2020

Abstract

We have been asked, as part of the RAMP initiative, to investigate non-parametric methods of characterising COVID-19 infection data, in the same spirit as Ref. [1]. Here, we model them with a simple Gaussian Process. Indeed, modelling infection data with a Gaussian Process is not theoretically unmotivated; in Ref. [2] it was shown that a stochastic Susceptible-Exposed-Infected-Recovered (SEIR) model for an epidemic can be well-approximated by a Gaussian Process under various sets of simplifying assumptions (such as linear noise [3], or a multivariate normal moment assumption [3]). The inferences of certain quantities of interest (such as growth rates) are qualitatively similar to those of Ref. [1], with some quantitative differences. We also suggest using the daily change in the logarithm of the number of daily cases, rather than its derivative, as an alternative, ‘integrated’ measure of growth.

1 Data Sources and Computer Program

We use the data from Our World in Data: https://github.com/owid/covid-19-data/tree/master/public/data/. Specifically, we make use of the ECDC files total_cases.csv and total_deaths.csv. The current version uses data up until 30 April 2020 (data obtained 18:30 GMT 4 May 2020). The source code can be found in the form of a Jupyter notebook on colab: https://colab.research.google.com/drive/1wlFY4Bdj8-6t3Qqh5Qv9LIzB-Ffg1FcJ.

2 Gaussian Processes: Kriging

In this section, we describe one approach to data smoothing allowing for the extraction of a time-dependent growth rate based on Kriging [4] – namely choosing an appropriate Gaussian Process [5] prior for the data and then taking the point-wise mean of the posterior function distribution given some observations indexed by the day they were taken. Whenever the number of daily cases is zero, we clip the logarithm to 0. We begin the fit on the day of the 20th recorded case in each country.
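The preprocessing just described can be sketched as follows. This is a minimal illustration, not the notebook's code: the function name `log_daily_cases`, the day indexing and the toy input are assumptions, as is the reading that the fit starts once the cumulative total reaches 20.

```python
import math

def log_daily_cases(cumulative, start_threshold=20):
    """Return (day_indices, log_counts) from a list of cumulative case totals.

    Assumptions for illustration: days are indexed 0, 1, 2, ...; the fit
    starts on the first day the cumulative total reaches start_threshold;
    days with zero (or negative, after data corrections) new cases have
    their logarithm clipped to 0.
    """
    days, logs = [], []
    previous = 0
    for day, total in enumerate(cumulative):
        new_cases = total - previous
        previous = total
        if total < start_threshold:
            continue  # before the 20th recorded case
        # ln(1) = 0, so clipping the count at 1 clips the logarithm at 0
        logs.append(math.log(max(new_cases, 1)))
        days.append(day)
    return days, logs

# Toy cumulative totals (hypothetical, not real data)
days, logs = log_daily_cases([3, 10, 25, 25, 40, 80])
```

In the toy input the fourth day has no new cases, so its logarithm is clipped to 0 rather than diverging.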

Concretely, we take a fixed mean function µ(t) = 0 and covariance matrix σ(t, t′) and suppose that ln I(t), the natural logarithm of the number of new cases I(t) recorded in a 24-hour period at a time t, is a Gaussian Process with such a mean and covariance,

$$\ln I(t) = s(t), \quad \text{where } s \sim \mathcal{N}(0, \sigma). \tag{1}$$

We choose a squared exponential kernel with t-independent normal noise,

$$\sigma(t, t') = a^2 \exp\left(-\frac{(t - t')^2}{2\ell^2}\right) + v\,\delta_{tt'} \tag{2}$$

which depends on three hyperparameters: $a$ is the amplitude of the squared exponential; $\ell$ is its characteristic time-scale over which ln I(t) varies significantly; and $v$ is the variance of observational noise. ($\delta_{tt'}$ is the Kronecker delta.) The assumption of independent and identically distributed (i.i.d.) normal noise is often used in training a Gaussian Process; while it is perhaps a crude approximation to the observational noise in infection data (which might be more accurately described by some quasi-Poisson noise model as in [1]), i.i.d. normal noise is straightforward to implement, and makes solving for the posterior distribution an analytically tractable problem. The values of the three hyperparameters $a$, $\ell$ and $v$ are optimised by using a gradient-descent method to extract the maximum marginalised likelihood values given the observed data (meaning that the likelihood marginalised over s is maximised over $a$, $\ell$ and $v$). From now on and for determinations of other quantities, we then fix to these optimised values, which we denote $\tilde{a}$, $\tilde{\ell}$ and $\tilde{v}$, respectively.

| Country | First daily decrease | 95% CI offset (days) |
|---|---|---|
| Austria | 28 March | +1 / −2 |
| Belgium | 7 April | +9 / −4 |
| Czech Republic | 1 April | +2 / −2 |
| France | 3 April | +2 / −2 |
| Germany | 1 April | +2 / −1 |
| Greece | 30 March | +4 / −4 |
| Ireland | 17 April | +2 / −2 |
| Italy | 27 March | +2 / −2 |
| Netherlands | 9 April | +6 / −6 |
| Poland | 12 April | +7 / −? |
| Portugal | 3 April | +5 / −2 |
| Romania | 12 April | +8 / −4 |
| Spain | 31 March | +4 / −3 |
| Switzerland | 1 April | +3 / −3 |
| United Kingdom | 15 April | +3 / −2 |

Table 1: The mean day on which the predicted growth rate first became negative, according to a Monte Carlo simulation of the posterior distribution after Kriging, for 15 countries. Also supplied are asymmetric 95% CI bands within which this date falls, given by reporting the upper and lower 2.5% quantiles. The ? indicates where this quantile extends beyond the end of the data set.

∗And CERN, Theoretical Physics Department, Geneva, Switzerland.
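The kernel of Eq. (2) and the marginalised likelihood that is maximised over the three hyperparameters can be sketched as below. This is an illustration under stated assumptions rather than the notebook's implementation: the function names, the toy counts and the trial hyperparameter values are all hypothetical, and the optimiser itself is left out.

```python
import numpy as np

def kernel(t1, t2, a, ell):
    """Squared-exponential part of Eq. (2): a^2 exp(-(t - t')^2 / (2 ell^2))."""
    diff = np.subtract.outer(np.asarray(t1, float), np.asarray(t2, float))
    return a**2 * np.exp(-diff**2 / (2.0 * ell**2))

def neg_log_marginal_likelihood(params, t, y):
    """Negative log marginal likelihood of a zero-mean GP with i.i.d. noise v."""
    a, ell, v = params
    y = np.asarray(y, float)
    n = len(y)
    # The Kronecker delta in Eq. (2) becomes v times the identity matrix
    K = kernel(t, t, a, ell) + v * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    # 0.5 y^T K^{-1} y + 0.5 log|K| + (n/2) log(2 pi)
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2.0 * np.pi)

# Toy data (hypothetical daily counts) and trial hyperparameters (a, ell, v)
t = np.arange(10.0)
y = np.log([20, 26, 35, 44, 60, 75, 90, 100, 105, 102])
nll = neg_log_marginal_likelihood((5.0, 30.0, 0.05), t, y)
```

Minimising this function over (a, ell, v) with any gradient-based optimiser, for example `scipy.optimize.minimize`, would give the optimised values; which optimiser the notebook uses is not stated here.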

3 Results

We choose a densely spaced set of points in t to display our results: for a given t, we compute the posterior distribution of the function s(t). As mentioned, this is an analytically solvable problem, and for a Gaussian Process the posterior s(t) is normally distributed. The mean of this distribution now gives a curve s(t) which we can consider to be a smoothed version of the initial data points. The derivative of this function, s′(t), defines an estimated growth rate. One can also compute credibility intervals (CIs) of both s(t) and its derivative s′(t) directly from the posterior distribution. (We compute derivatives by finite differences throughout; see the Appendix.)
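The analytic posterior referred to here is the standard Gaussian Process regression result [5]. A minimal sketch follows, assuming the zero mean and the kernel of Eq. (2); the function names, toy counts and hyperparameter values are illustrative assumptions, not fitted results.

```python
import numpy as np

def sq_exp(t1, t2, a, ell):
    """Squared-exponential kernel a^2 exp(-(t - t')^2 / (2 ell^2))."""
    diff = np.subtract.outer(np.asarray(t1, float), np.asarray(t2, float))
    return a**2 * np.exp(-diff**2 / (2.0 * ell**2))

def gp_posterior(t_obs, y_obs, t_new, a, ell, v):
    """Posterior mean and covariance of the noise-free s(t) at the points t_new."""
    y_obs = np.asarray(y_obs, float)
    K = sq_exp(t_obs, t_obs, a, ell) + v * np.eye(len(y_obs))  # data covariance
    K_star = sq_exp(t_new, t_obs, a, ell)                      # cross covariance
    K_ss = sq_exp(t_new, t_new, a, ell)                        # prior at t_new
    mean = K_star @ np.linalg.solve(K, y_obs)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov

# Toy observations (hypothetical) evaluated on a dense grid of 1000 points
t_obs = np.arange(10.0)
y_obs = np.log([20, 26, 35, 44, 60, 75, 90, 100, 105, 102])
t_new = np.linspace(0.0, 9.0, 1000)
mean, cov = gp_posterior(t_obs, y_obs, t_new, a=5.0, ell=30.0, v=0.05)
```

The diagonal of `cov` gives the observational-noise-free variance at each grid point; adding `v` to it gives the variance including observational noise.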

In Fig. 1 we depict the fit to the number of daily cases, as well as the inferred growth rate and doubling times. Fig. 2 shows the equivalent data and fits to daily reported COVID-19 related deaths for the same 15 European countries. The doubling time diverges as soon as the growth rate becomes negative and so loses its meaning. A clear limitation is that days when zero cases are recorded appear as conspicuous outliers in the raw data, and significantly inflate the predicted variance of the data (for example, this effect is clearly evident in Greece’s daily case fit in Fig. 1). However, save for the inferred observational noise, the conspicuous outliers have very little impact on quantities of interest, such as the inferred growth rates, as may be established by binning data into two day periods and then Kriging on the resulting data (thence ameliorating the effect of the conspicuous outliers). We show this in the Appendix.
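The loss of meaning of the doubling time at non-positive growth can be made concrete with the usual relation, doubling time = ln 2 / growth rate. The sketch below assumes that this standard definition is the one plotted; the function name is hypothetical.

```python
import math

def doubling_time(growth_rate):
    """Doubling time in days for an instantaneous growth rate per day.

    Returns math.inf when the growth rate is zero or negative, where a
    doubling time is no longer meaningful.
    """
    if growth_rate <= 0.0:
        return math.inf
    return math.log(2.0) / growth_rate

fast = doubling_time(0.2)     # about 3.5 days
slow = doubling_time(0.01)    # about 69 days: diverges as the rate approaches 0
none = doubling_time(-0.05)   # math.inf
```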

The uncertainties in the fits are under control because they are well covered by data. However, if one projects the fits into the future, one may verify that the uncertainties become progressively larger, as expected for a Gaussian Process. Projecting a time roughly $\tilde{\ell}$ into the future, the uncertainty in predicted ln I(t) becomes roughly of size $\tilde{a}$ (i.e. it is dominated by the assumption on the prior). We illustrate this effect in the case of Italy in Fig. 3 for the number of cases.
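The growth of the projected uncertainty can be illustrated with the standard posterior variance of a Gaussian Process [5]. The symbols $k_*$, $K$ and $\Delta$ below are introduced here for illustration, and the worked case is an idealisation with a single observation at $t_1$, so real fits with many data points will differ in detail.

```latex
\begin{aligned}
\sigma_*^2(t_*) &= a^2 - k_*^{T} K^{-1} k_*, \qquad
(k_*)_i = a^2 \exp\left(-\frac{(t_* - t_i)^2}{2\ell^2}\right), \\
\text{one observation at } t_1:\quad
\sigma_*^2 &= a^2 - \frac{a^4 \, e^{-\Delta^2/\ell^2}}{a^2 + v}, \qquad \Delta = t_* - t_1, \\
v \to 0,\ \Delta = \ell:\quad
\sigma_*^2 &= a^2 \left(1 - e^{-1}\right) \approx 0.63\, a^2
\quad\Rightarrow\quad \sigma_* \approx 0.80\, a .
\end{aligned}
```

As $\Delta$ grows further, $k_* \to 0$ and the variance returns to the prior value $a^2$.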

Figure 1: The number of new daily cases (i.e. daily ‘incidence’) for 15 countries around Europe along with estimates of growth rates and doubling times. The input data are shown by points whereas the mean estimate obtained by Kriging a Gaussian Process is shown by the red curves. Left panel: daily new cases, where the shaded regions show the 95% CI of the process with and without observational noise. Middle panel: the instantaneous growth rate, where the inner (outer) shaded region shows the observational-noise-free 68% (95%) CI. Right panel: the corresponding instantaneous doubling time and directly translated CI. The abscissa gives time in days since 21 February 2020.

Figure 2: The number of daily COVID-19 related deaths for 15 countries around Europe along with estimates of growth rates and doubling times. The input data are shown by points whereas the mean estimate obtained by Kriging a Gaussian Process is shown by the red curves. Left panel: daily deaths, where the shaded regions show the 95% CI of the process with and without observational noise. Middle panel: the instantaneous growth rate, where the inner (outer) shaded region shows the observational-noise-free 68% (95%) CI. Right panel: the corresponding instantaneous doubling time and directly translated CI. The abscissa gives time in days since 21 February 2020.

Figure 3: Future projection of modelled daily new cases in the case of Italy, displaying the increase in uncertainty. The region to the right-hand side of the dashed line is a projection into the future. Blue depicts the 95% CI including observation noise whereas purple depicts the 95% CI without including the noise. The abscissa gives time in days since 21 February 2020.

In Table 1, we supply the first day on which the model predicts a negative growth rate occurred, along with a confidence interval, obtained by identifying this date for 10000 curves sampled from the posterior distribution, and reporting the mean as well as the upper and lower 2.5% quantiles. The values of the hyperparameters in (2) are given in Table 2.
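The Monte Carlo procedure for the date of first negative growth can be sketched as follows, given a posterior mean and covariance on a time grid. The function name, the small diagonal jitter and the choice to drop curves that never turn negative are assumptions made for this illustration; the toy posterior is hypothetical.

```python
import numpy as np

def first_negative_growth(t_grid, mean, cov, n_samples=10000, seed=0):
    """Mean and 2.5% / 97.5% quantiles of the first time with negative growth."""
    t_grid = np.asarray(t_grid, float)
    rng = np.random.default_rng(seed)
    # Small jitter (an assumption) keeps the covariance numerically positive
    cov = np.asarray(cov, float) + 1e-9 * np.eye(len(t_grid))
    curves = rng.multivariate_normal(mean, cov, size=n_samples)
    growth = np.diff(curves, axis=1) / np.diff(t_grid)  # finite differences
    negative = growth < 0.0
    has_negative = negative.any(axis=1)
    first_index = negative.argmax(axis=1)               # first True per curve
    # Curves that never turn negative within the grid are dropped here
    times = t_grid[first_index][has_negative]
    lower, upper = np.percentile(times, [2.5, 97.5])
    return times.mean(), lower, upper

# Toy posterior: a curve peaking at t = 5 with very small uncertainty
t = np.arange(11.0)
toy_mean = -(t - 5.0) ** 2
toy_cov = 1e-6 * np.eye(11)
centre, lower, upper = first_negative_growth(t, toy_mean, toy_cov, n_samples=1000)
```

If many sampled curves never show negative growth inside the data range, the corresponding quantile is undefined within the data set, which is one way a missing bound such as the ? in Table 1 can arise.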

A comparison of the growth rates in reported daily new cases and in the reported daily COVID-19 related deaths is shown in Fig. 4. We note that the bands on the displayed growth rates quantify uncertainties on the instantaneous daily growth rates. We imagine that a more useful presentation for policy-makers would be CIs of the daily-integrated growth rate

$$\int_{t-1}^{t} dx \, \frac{ds(x)}{dx} = s(t) - s(t-1). \tag{3}$$

We call this quantity the daily change, and depict it in Fig. 5. The size of the uncertainties on the daily change is often significantly different to that of the uncertainties on the instantaneous growth rates.
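Because s(t) and s(t − 1) are jointly normal under the posterior, the daily change in (3) is itself normally distributed. Writing $m(t)$ and $C(t, t')$ for the posterior mean and covariance (symbols introduced here for illustration), its mean and variance are:

```latex
\begin{aligned}
\mathbb{E}\left[s(t) - s(t-1)\right] &= m(t) - m(t-1), \\
\mathrm{Var}\left[s(t) - s(t-1)\right] &= C(t,t) + C(t-1,t-1) - 2\,C(t,t-1).
\end{aligned}
```

A 95% CI on the daily change then follows as the mean plus or minus 1.96 standard deviations.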

4 Discussion

Kriging a Gaussian Process appears to provide a faithful characterisation of the data as well as simple probabilistic interpretations of various inferences at a time t. We estimate, as Table 1 shows, that the United Kingdom’s number of cases began to shrink around 15 April 2020. A quick perusal of Fig. 1 might persuade the reader of a large uncertainty in the date of negative growth onset. Yet in Table 1 we bound this date within a narrow 6 day window at the 95% level. Also, a reader may conclude from the extreme right-hand side of the United Kingdom’s case growth rate plot (corresponding to 30 April) that, within the model, we cannot infer with 95% certainty that we were in negative growth.

However, these are artefacts of the fact that it is the instantaneous growth rate that is plotted in Fig. 1. A reference to Fig. 5 shows that we are more than 95% confident (within the inferences of the model) that the daily change in the logarithm of the number of cases (equal to the daily-integrated growth rate) is negative on 30 April. A much tighter band on the time of zero-crossing is also offered. We advocate the plotting of the daily change rather than the instantaneous growth rate for ease of interpretation by policy-makers.

Comparison to Generalized Additive Model In [1], a Generalized Additive Model (GAM) was trained to accomplish the same task as our Gaussian Process. We extended the GAM code provided with Ref. [1] to run with more recent data, and in Figs. 6 and 7 we provide a comparison of the underlying model of daily incidences and the predicted growth rates, respectively. We see that the two models are qualitatively similar, with some quantitative differences.


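| Country | Cases $\tilde{a}$ | Cases $\tilde{\ell}$ | Cases $\tilde{v}$ | Deaths $\tilde{a}$ | Deaths $\tilde{\ell}$ | Deaths $\tilde{v}$ |
|---|---|---|---|---|---|---|
| Austria | 4.26 | 24.1 | 0.177 | 1.85 | 29.6 | 0.566 |
| Belgium | 4.37 | 21.4 | 0.104 | 3.87 | 29.0 | 0.112 |
| Czech Republic | 4.18 | 27.1 | 0.137 | 1.49 | 20.9 | 0.178 |
| France | 6.14 | 38.2 | 0.172 | 5.14 | 32.7 | 0.131 |
| Germany | 5.07 | 27.7 | 0.154 | 4.37 | 21.1 | 0.160 |
| Greece | 2.54 | 27.4 | 0.990 | 1.03 | 69.8 | 0.476 |
| Ireland | 4.11 | 19.8 | 0.103 | 2.53 | 37.7 | 0.220 |
| Italy | 5.18 | 31.8 | 0.0373 | 5.48 | 30.1 | 0.0379 |
| Netherlands | 4.25 | 23.7 | 0.0651 | 3.54 | 20.8 | 0.108 |
| Poland | 5.31 | 34.5 | 0.0433 | 2.39 | 45.7 | 0.215 |
| Portugal | 4.45 | 21.3 | 0.167 | 2.36 | 36.8 | 0.0394 |
| Romania | 3.96 | 24.9 | 0.109 | 2.27 | 48.2 | 0.218 |
| Spain | 6.12 | 30.6 | 1.17 | 4.81 | 27.8 | 0.805 |
| Switzerland | 4.59 | 31.7 | 0.753 | 2.38 | 37.9 | 0.747 |
| United Kingdom | 6.31 | 42.4 | 0.0885 | 4.60 | 26.6 | 0.0867 |

Table 2: Hyperparameters in (2) obtained as values for the corresponding Gaussian Process for 15 countries, where $\tilde{\ell}$ is measured in days.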

Figure 4: Estimated growth rates from Figs. 1 and 2. (Red: growth rate in daily new cases, Blue: growth rate in reported daily COVID-19 related deaths. Each is shown with its shaded observational-noise-free 95% CI.) The abscissa gives time in days since 21 February 2020.


Figure 5: The daily change defined in (3), with observational-noise-free 95% CI. The abscissa gives time in days since 21 February 2020.


Figure 6: Comparison of the daily case numbers extracted from the Generalized Additive Model of [1] (blue) and our Gaussian Process (red). The mean inference is shown as the curve, with 95% CIs depicted as bands. Models were trained on the data in the range depicted. The abscissa gives time in days since 21 February 2020.


Figure 7: Comparison of the daily instantaneous growth rates extracted from the Generalized Additive Model of [1] (blue) and our Gaussian Process (red). The mean inference is shown as the curve, with 95% CIs depicted as bands. The abscissa gives time in days since 21 February 2020. Models were trained on the data in the range depicted.


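| Country | Cases $\tilde{a}$ | Cases $\tilde{\ell}$ | Cases $\tilde{v}$ | Deaths $\tilde{a}$ | Deaths $\tilde{\ell}$ | Deaths $\tilde{v}$ |
|---|---|---|---|---|---|---|
| Austria | 4.44 | 24.6 | 0.0785 | 3.34 | * | 0.192 |
| Belgium | 5.11 | 42.7 | 0.0661 | 4.34 | 31.4 | 0.0432 |
| Czech Republic | 4.29 | 29.1 | 0.0841 | 1.96 | 24.3 | 0.0963 |
| France | 6.56 | 40.9 | 0.0599 | 4.92 | 31.3 | 0.0553 |
| Germany | 5.72 | 31.1 | 0.103 | 4.37 | 34.6 | 0.0597 |
| Greece | 3.27 | 29.2 | 0.263 | 1.75 | 90.2 | 0.163 |
| Ireland | 4.78 | 21.7 | 0.0724 | 3.18 | 43.7 | 0.176 |
| Italy | 5.40 | 30.2 | 0.0152 | 5.91 | 31.3 | 0.0124 |
| Netherlands | 4.82 | 20.2 | 0.0222 | 3.96 | 26.2 | 0.0450 |
| Poland | 5.50 | 35.6 | 0.0192 | 3.01 | 58.7 | 0.0810 |
| Portugal | 6.03 | 23.7 | 0.0399 | 2.75 | 44.8 | 0.0113 |
| Romania | 4.49 | 41.5 | 0.0883 | 3.52 | 173.0 | 0.133 |
| Spain | 6.13 | 33.6 | 0.145 | 5.06 | 24.3 | 0.0244 |
| Switzerland | 5.30 | 31.7 | 0.0629 | 2.98 | 35.2 | 0.184 |
| United Kingdom | 6.60 | 44.2 | 0.0416 | 5.11 | 29.4 | 0.0509 |

Table 3: Hyperparameters in (2) obtained as values for the corresponding Gaussian Process for 15 countries for binned data, where $\tilde{\ell}$ is measured in days.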

Appendix

Binning In Fig. 8, we show the difference in fitted growth rates if we put the data into bins of 2 days and then perform the Kriging in order to avoid the zero counts. We see from the figure that there are no particularly large changes induced by the binning. On the other hand, a comparison of the hyperparameters in the unbinned (Table 2) and binned (Table 3) cases shows that the observational noise is much reduced for the cases where zeroes were eliminated by binning.
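A minimal sketch of the 2 day binning is given below. The function name, the use of bin centres as time coordinates, the conversion to a mean daily count and the dropping of a trailing partial bin are all assumptions made for this illustration.

```python
def bin_counts(daily_counts, bin_size=2):
    """Sum daily counts into consecutive bins of bin_size days.

    Assumption for illustration: a trailing partial bin is dropped.
    Returns (bin_centres_in_days, mean_daily_count_per_bin).
    """
    centres, rates = [], []
    for start in range(0, len(daily_counts) - bin_size + 1, bin_size):
        total = sum(daily_counts[start:start + bin_size])
        centres.append(start + (bin_size - 1) / 2.0)
        rates.append(total / bin_size)
    return centres, rates

# The isolated zero on day 1 is absorbed into its bin
centres, rates = bin_counts([4, 0, 6, 2, 9])
```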

Finite Differences For our Gaussian Process s(t), one finds that the posterior values of s(t) and s(t + δt) follow a joint multivariate normal distribution. As such, the finite-difference estimator s′(t) ≈ (s(t + δt) − s(t))/δt has a normal distribution itself, and we can calculate both its mean and variance directly from standard expressions for the mean and covariance of the posterior distribution. In principle, we could avoid this prescription and compute an analytic result, but finite differences suffice for our purposes. In the paper, we choose δt so that there are 1000 equally spaced sample points across the whole time interval of the fit.
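The mean and variance of the finite-difference estimator follow from the rule for the variance of a difference of jointly normal variables. A short sketch, assuming the posterior mean vector and covariance matrix have already been evaluated on the grid (the function name and the clipping of tiny negative variances are assumptions):

```python
import numpy as np

def finite_difference_growth(t_grid, mean, cov):
    """Mean and variance of (s(t + dt) - s(t)) / dt between adjacent grid points."""
    t_grid = np.asarray(t_grid, float)
    mean = np.asarray(mean, float)
    cov = np.asarray(cov, float)
    dt = np.diff(t_grid)
    d_mean = np.diff(mean) / dt
    variances = np.diag(cov)
    cross = np.diag(cov, k=1)  # Cov[s(t), s(t + dt)]
    # Var[X - Y] = Var[X] + Var[Y] - 2 Cov[X, Y]
    d_var = (variances[1:] + variances[:-1] - 2.0 * cross) / dt**2
    # Guard against tiny negative values from floating-point round-off
    return d_mean, np.maximum(d_var, 0.0)

d_mean, d_var = finite_difference_growth(
    [0.0, 1.0, 2.0], [0.0, 1.0, 4.0], 0.5 * np.eye(3)
)
```

With uncorrelated points the variances simply add; on a dense grid neighbouring points are strongly correlated and the cross term cancels most of the variance.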

Sierra Leone Ebola Epidemic As an example of the application of this method to a historical data set, we study data (obtained in a private communication) from 2014-15 describing the Sierra Leone Ebola epidemic of that time. We will focus only on the data in this data set from the first 200 days.

It is instructive to visualize this data set. We depict the full set of raw data (giving the number of new cases in each report divided by the time since the previous report) in Fig. 9. This data set is quite different in character to the European COVID-19 data. The Sierra Leone Ebola data are clearly very noisy, and the large number of zero cases on the right is conspicuous. To aid the eye, we have drawn two simply-smoothed curves: one moving average over a t ± 7 day period, and one convolution with a Gaussian of length scale 7 days. These help clarify a gentle increase in slope up to a fairly long plateau, followed by a rapid decrease in new cases after 100 days.
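The two guide curves can be sketched for irregularly spaced data as follows. This is an illustration only: the function names are hypothetical, and normalising the Gaussian weights so that they sum to one is an assumption about how the convolution is applied.

```python
import math

def moving_average(times, values, half_width=7.0):
    """Average of all points within +/- half_width days of each time."""
    smoothed = []
    for t in times:
        window = [v for s, v in zip(times, values) if abs(s - t) <= half_width]
        smoothed.append(sum(window) / len(window))
    return smoothed

def gaussian_smooth(times, values, ell=7.0):
    """Weighted average with weights exp(-(dt)^2 / (2 ell^2)), normalised to 1."""
    smoothed = []
    for t in times:
        weights = [math.exp(-(s - t) ** 2 / (2.0 * ell ** 2)) for s in times]
        total = sum(w * v for w, v in zip(weights, values))
        smoothed.append(total / sum(weights))
    return smoothed

# Toy series (hypothetical): report days and new cases per day
times = [0, 3, 4, 8, 15, 16]
values = [2.0, 4.0, 0.0, 6.0, 5.0, 1.0]
average_curve = moving_average(times, values)
gaussian_curve = gaussian_smooth(times, values)
```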

To deal with the zeroes, we first (as we did above for the European data) bin together adjacent data points. For most of the duration of the epidemic, data points are still sampled somewhat erratically, with several days lapsing between updated cumulative totals. To handle this, we divide the number of new cases at some time t by the number of days since the last data point, to give an estimate of new cases per day at the point t.
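The conversion from erratically timed cumulative totals to a per-day estimate can be sketched as below; the function name and the toy report days are hypothetical.

```python
def daily_rates(report_days, cumulative_totals):
    """Estimate new cases per day at each report after the first.

    Each increment in the cumulative total is divided by the number of
    days elapsed since the previous report.
    """
    rates = []
    for i in range(1, len(report_days)):
        gap = report_days[i] - report_days[i - 1]
        new_cases = cumulative_totals[i] - cumulative_totals[i - 1]
        rates.append((report_days[i], new_cases / gap))
    return rates

# Reports on days 0, 3, 4 and 8 (toy values)
rates = daily_rates([0, 3, 4, 8], [10, 16, 20, 20])
```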

The Gaussian process trained on the resulting data has $\tilde{a} = 2.96$, $\tilde{\ell} = 88.7$ days and $\tilde{v} = 0.241$. The fit and instantaneous growth rates are given in Fig. 10. The corresponding daily changes from (3) are given in Fig. 11. Based on sampling from the posterior, we find that our Gaussian Process predicts that the first daily fall occurs on day $74^{+10}_{-17}$ (95% CI). This CI is depicted in Fig. 12, which also superimposes the Gaussian-smoothed curve from Fig. 9 on top of the left-hand plot of Fig. 10, on a linear scale. This suggests that the Gaussian process estimate is perhaps a little too early, because of the limitations of our simple noise model. We see especially some large model uncertainties in the red band (particularly obvious because of the linear scale in the plot).

References

[1] L. Pellis et al., Challenges in control of Covid-19: short doubling time and long delay to effect of interventions, arXiv:2004.00117.

[2] E. Buckingham-Jeffrey, V. Isham and T. House, Gaussian process approximations for fast inference from infectious disease, Mathematical Biosciences, Volume 301, 111–120 (2018).

[3] V. Isham, Assessing the variability of stochastic epidemics, Mathematical Biosciences, Volume 107 (2), 209–224 (1991).

[4] D.G. Krige, A statistical approach to some mine valuations and allied problems at the Witwatersrand, Master’s thesis of the University of Witwatersrand (1951).

[5] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, ISBN 026218253X (2006).


Figure 8: Comparison of estimated growth rates for unbinned and binned data along with 95% CI. Red (unbinned)/Blue (binned): growth rate in daily new cases. The binning was in 2 day periods, in order to ameliorate the effects of egregious statistical outliers.


Figure 9: Incidence of new cases during Sierra Leone Ebola epidemic on a linear scale. Black data points are the number of new cases per day as inferred from the raw data. Black dashed line: a convolution with a Gaussian $\exp(-\Delta t^2 / 2\ell^2)$ for $\ell = 7$. Green: moving average of all data points within ±7 days. The abscissa labels time in days.

Figure 10: Incidence of new cases for the Ebola outbreak in Sierra Leone, and estimate of corresponding growth rate after Kriging. The re-binned input data are shown by points whereas the mean estimate obtained by Kriging a Gaussian Process is shown by the red curve. Left panel: number of daily cases, where the shaded regions show the 95% CI of the process with (blue) and without (red) observational noise. Right panel: the instantaneous growth rate, where the inner (outer) shaded region shows the observational-noise-free 68% (95%) CI. The abscissa labels time in days.


Figure 11: The daily change defined in (3), with observational-noise-free 95% CI, for the Sierra Leone Ebola data. The abscissa labels time in days.

Figure 12: Incidence of new cases for the Ebola outbreak in Sierra Leone, given by the left panel of Fig. 10 on a linear scale, with the Gaussian-smoothed curve of Fig. 9 superimposed (black dashed line). Also shown is the 95% CI of the time of the first daily fall in new cases, as predicted by the posterior model (vertical green dotted lines). The abscissa labels time in days.
