Interpretable Hierarchical Forecasting of Infectious Diseases

Adrian Lison

Master Thesis

at the Chair for Information Systems and Supply Chain Management (University of Münster)

Principal Supervisor: Prof. Dr.-Ing. Bernd Hellingrath
Associate Supervisor: Johannes Ponge, M.Sc.
Tutor: Dr. Alexander Ulrich (Robert Koch-Institut)

Presented by: Adrian Lison [429175][email protected]

Submission: 29th September 2020

Contents

1 Introduction
2 Methodology
3 Forecasting in Infectious Disease Surveillance
    3.1 Related Work
    3.2 Setting and Motivation
    3.3 Objectives
4 Design Choices
    4.1 Modeling Paradigm
    4.2 Hierarchical Forecasting Strategy
    4.3 Explanation
5 Interpretable Hierarchical Forecasting
    5.1 Approach and Unified Notation
    5.2 Features and Forecasting Models
    5.3 Matrices for Projection
    5.4 Base Attribution Methods
    5.5 Prototype
6 Evaluation
    6.1 Flexibility
    6.2 Hierarchical Coherence
    6.3 Accuracy
    6.4 Interpretability
7 Discussion
    7.1 Approach
    7.2 Infectious Disease Forecasting via Statistical Models
    7.3 Reconciliation via Projection
    7.4 Interpretation via Feature Attribution
8 Conclusion
Appendix
References

Acknowledgement

This thesis was written as part of an internship in the Signale project group at the Robert Koch Institute in Berlin. I want to thank my first and second tutors, Dr. Alexander Ulrich and Stéphane Ghozzi, Ph.D., for their time, their continuous feedback, their crucial domain knowledge and the many insightful discussions. Moreover, I thank all members of the Signale team, including Auss Abbood, Fabian Eckelmann, Theresa Kocher, Knut Perseke and Birte Wagner, for their warm welcome and support. I very much enjoyed being part of this group during extraordinary times.

Furthermore, I would like to thank the Chair for Information Systems and Supply Chain Management at the University of Münster for providing the environment to conduct this thesis as an applied research project. I appreciate the spot-on feedback from my principal supervisor Prof. Dr.-Ing. Bernd Hellingrath, which has improved the rigor of my research process. Last but not least, my special thanks go to my associate supervisor Johannes Ponge for his regular coaching and research input, for guiding and structuring my work, and for his own fascination with the topic.

1 Introduction

Infectious disease surveillance is an integral function of public health, conducted worldwide through national public health institutes (NPHIs) like the Robert Koch Institute in Germany or the Centers for Disease Control and Prevention in the US (IANPHI 2020). The main goal of surveillance is to provide “information for action” (World Health Organisation 1968) through early identification of noteworthy health risks to the population and suggestion of appropriate countermeasures, thus constituting an important societal task (Centers for Disease Control and Prevention 2006). Beyond monitoring of the current situation, disease prevention and control can benefit from forecasts about the future case numbers of infectious diseases (Polonsky et al. 2019; Chretien et al. 2014; Manheim et al. 2017): Early forecasts can buy time and facilitate risk management through better resource allocation and staffing of health departments and healthcare providers (Doms et al. 2018). Moreover, information on the prospective development of an outbreak may help to choose countermeasures of adequate timing and scale (Lutz et al. 2019).

For these reasons, forecasting functionality is a desirable component of modern surveillance systems, reflected by the continued effort of health authorities to build and apply forecasting tools for endemic diseases (Biggerstaff et al. 2018; Mcgowan et al. 2019; Claus et al. 2017; Centers for Disease Control and Prevention 2016). Two special demands for forecasting can be identified in this context. On the one hand, forecasts of case numbers should be provided at different levels of aggregation (for example at the county, state and national level), a task also known as hierarchical forecasting. The underlying reason is that diseases often spread in certain geographical areas due to population contact patterns, but there can also be overarching and dispersed disease outbreaks, meaning that both detailed and aggregated forecasts will provide unique insights (Salmon et al. 2016; Unkel et al. 2012; Krause et al. 2007). On the other hand, there is a need for interpretable forecasts, which lend themselves to epidemiologically meaningful explanation (Doms et al. 2018; Driedger et al. 2014; Flahault et al. 2016). Such explanations about the epidemiological background on which predictions have been made may not only simplify scrutiny of forecasts but also provide additional information on contributing factors to current disease spreading that could aid outbreak control (Nsoesie et al. 2014; Lutz et al. 2019; Manheim et al. 2017).

While hierarchical forecasting and interpretability of forecasts have already been individually addressed in the context of infectious disease surveillance (Stojanovic et al. 2019; Reich et al. 2016a; Kane et al. 2014; Gibson et al. 2019; Manheim et al. 2017), their interplay and compatibility have not yet been examined. This work takes the stance that a conceptual integration of both could prove valuable for surveillance practice and contribute to health authorities’ ultimate goal of obtaining a “complete, multidimensional picture of the epidemiological situation” (Claus et al. 2017).

Accordingly, the guiding research question of this work is how infectious disease forecasting, hierarchical forecasting and interpretability can be combined in order to obtain forecasts and epidemiologically meaningful explanations at different aggregation levels. A design science research approach is adopted by building and evaluating an approach for interpretable and hierarchical forecasting of infectious diseases in public health surveillance. Through the process of designing and testing such a solution, answers to the following more detailed research questions should be found:

• What are the particular requirements in infectious disease surveillance and the resulting objectives for a forecasting approach?

• Which methods must be included in the approach, what are potential choices for each method and which choice is most suitable?

• How can the different methods be combined to produce the desired outputs?

• Does the approach designed meet the previously defined objectives?

The outline of this thesis is as follows. In chapter 2, the research methodology is presented, including the overall research process and the literature search. Next, chapter 3 gives an overview of previous work on the different topics, describes the setting of infectious disease surveillance and derives the design objectives for this work from the requirements identified. In chapter 4, the pivotal decision points during design are addressed by presenting and evaluating different options and justifying the final choice for this work. Based on these decisions, the approach for interpretable hierarchical forecasting is developed and its elements are presented in detail in chapter 5. Subsequently, the feasibility of the approach is demonstrated through prototype implementation and its quality is evaluated by comparison with the design objectives in chapter 6. The insights of the design process with respect to the research questions are then discussed in chapter 7. Finally, the main conclusions of this work are summarized and future research issues proposed in chapter 8.

2 Methodology

This thesis adopts a design science research approach (Hevner et al. 2004): In a structured process, an artefact is built and evaluated which addresses the special demands of forecasting in infectious disease surveillance. The artefact to be designed is not a novel method, but a configurable approach which allows existing methods to be combined in order to produce interpretable and hierarchical forecasts.

Research Process

In order to ensure rigour of the research process, a design research methodology as proposed by Peffers et al. (2007) is adopted. The different steps of the process and the research methods used in each step are described in the following.

Problem Identification: The research should be tailored to a practical problem and grounded in the existing body of knowledge. To gain an overview of infectious disease forecasting and the requirements in surveillance, guideline publications on epidemiology in public health practice are consulted and a high-level review of the literature on infectious disease forecasting with focus on applications in surveillance is conducted (details of the literature search process are given below). In addition, the Signale system, a decision support system for surveillance currently developed at the Robert Koch Institute, is used as an exemplary potential application context of the approach.

Definition of Objectives: Based on the previous characterization of infectious disease surveillance practice and the needs related to disease forecasting, several objectives for the envisioned design artefact are inferred and specified.

Design and Development: First, the major decision points in the design of the envisioned approach are identified and addressed (Ellis and Levy 2010). Based on three conceptual, integrative literature reviews with representative coverage (Cooper 1988), sets of conceptually similar methods for the individual tasks (infectious disease forecasting, hierarchical forecasting and model interpretation) are assessed regarding their suitability for the approach and a selection of the most promising and mutually compatible sets is made. This selection defines the constituting elements of the approach. Next, a unified mathematical notation is proposed which makes it possible to precisely define the elements’ interfaces and how they can be combined to produce the desired outputs. The notation is designed such that the individual elements remain configurable. This way, for instantiation of the approach, it should be possible to pick one method from the set of conceptually similar methods for each element. The available methods are harmonized with the proposed interfaces and described in detail.

Demonstration and Evaluation: The approach is demonstrated through a prototype. Concrete choices for the configurable parts are made and implemented in order to produce forecasts and explanations on real-world test data. For evaluation, the artefact is compared with the different design objectives using suitable evaluation methods, including logical reasoning, cross-validation, a simulation experiment, mathematical proofs and illustrative scenarios (Peffers et al. 2012).

Communication: The research process is summarized and documented in this thesis; the software code of the prototype serves as an implementation blueprint for the approach. The research questions answered through the design process are concluded and the novel theoretical and practical insights discussed. New questions which arose during the process are noted for future research.

Literature Search

Three topic searches are conducted, one on infectious disease forecasting focused on both theory and application, and two mostly theory-focused searches on hierarchical forecasting and model interpretation. Following the methodology by Vom Brocke et al. (2009), conceptual reviews are used to gain an initial understanding and identify suitable keywords. Then, various databases are searched and a two-phase evaluation (abstract scan, full text scan) of all publications according to predefined criteria is performed. The search is limited to anglophone peer-reviewed journals, except for the very young field of model interpretation, where preprints are included as well. The set of identified publications is extended through backward references search and forward and backward author search for certain influential scholars (Levy and Ellis 2006). A high-level overview of the searches is provided below; a more detailed documentation of the results can be found in appendix A.

Topic | Databases | Keywords | Criterion | Used
Infectious Disease Forecasting | ACM DL, EBSCO, Scopus | "infectious disease" AND forecast AND (review OR survey) | Review of methods or applications in surveillance. | 50
Hierarchical Forecasting | ACM DL, EBSCO, Scopus, WOS | (forecast "hierarchical time series") OR ("forecast reconciliation") | Either about methods or applications to public health. | 24
Model Interpretation | DBLP Computer Science | (explain OR interpret) AND predictions | On explanation methods, not models. | 32

ACM DL = ACM Digital Library | WOS = Web of Science

3 Forecasting in Infectious Disease Surveillance

3.1 Related Work

As mentioned before, the three central topics of this work are infectious disease forecasting, hierarchical forecasting and model interpretability. In the following, an overview of related contributions in the field of epidemiology and beyond is given.

Infectious Disease Forecasting

Infectious disease forecasting has been addressed as a research topic since the 1990s and has experienced considerable growth in the last two decades (Lauer et al. 2020). Its main goal is to provide predictions about the uncertain future development and outcomes of infectious disease spreading on a population level, making it a subfield of epidemiology and public health (Rivers et al. 2019; Flahault et al. 2016; Salathé et al. 2012). The forecasting of infectious diseases has been considered particularly challenging due to the complexity of disease and population dynamics (Scarpino and Petri 2019). Nonetheless, its value for planning and control in public health is stressed repeatedly (Doms et al. 2018; Lutz et al. 2019; Chretien et al. 2014; Manheim et al. 2017).

The main objects of forecasting are cases of infection with a disease; however, other aspects such as hospitalizations or related deaths are sometimes forecast too (Dembek et al. 2018). The typical time horizons for prediction are in the rather short-term range of several weeks or months into the future (Lauer et al. 2020). In the most basic setting, infectious disease forecasting is conducted as a simple time series prediction task, where the only available data are the past case counts of a disease (Allard 1998). In contrast, modern forecasting approaches increasingly make use of further information for model fitting, including census data, laboratory results, environmental measurements, participatory reporting and social media activity (Desai et al. 2019; Yang et al. 2016; Bansal et al. 2016).

A variety of forecasting models is known in the literature on infectious disease forecasting. Typical taxonomies distinguish between mechanistic approaches, which rely on explicit modeling of the biological dynamics of disease spread (Chretien et al. 2014; Nsoesie et al. 2014; Hazelbag et al. 2020), and statistical approaches, which use historical data to project past patterns of disease time series into the future (Allard 1998; Meyer et al. 2017; Lauer et al. 2020; Bansal et al. 2016; Chae et al. 2018; Kane et al. 2014). Recently, public health bodies have also made efforts to explore the strengths and weaknesses of different approaches through communities like the Epidemic Prediction Initiative (Rivers et al. 2019) and forecasting competitions such as FluSight for Influenza-like Illness (Mcgowan et al. 2019) or RAPIDD for Ebola (Desai et al. 2019).

Hierarchical Forecasting

The goal of hierarchical forecasting is to produce forecasts for so-called hierarchical time series, which can be hierarchically disaggregated into more detailed series, such as product groups or geographical regions (Athanasopoulos et al. 2020), where the value of a series is equal to the sum of its more disaggregated series. For example, at any time, the total number of cases of a specific disease in Germany is the sum of case numbers in all the states. In hierarchical forecasting, forecasts which fulfil this constraint and are thus coherent should be produced for all time series at all aggregation levels (Fliedner 2015). It is furthermore possible to disaggregate a time series along several dimensions, for example both by age group and gender. In such a case, several possible hierarchies exist, depending on the order of disaggregation, but the aggregation constraint still applies (Hyndman et al. 2016). Hierarchical forecasting has been addressed as a task in economic literature since the 1970s (Dunn et al. 1976). Since then, a number of different approaches have been proposed and thoroughly tested in various simulations and applications (Gross and Sohl 1990; Schwarzkopf et al. 1988; Shlifer and Wolff 1979; Hyndman and Athanasopoulos 2018; Athanasopoulos et al. 2020). All methods guarantee coherence by design but can differ in the accuracy of the final forecasts (Hyndman et al. 2011; Shang and Hyndman 2017).
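To make the aggregation constraint concrete, the following minimal sketch checks coherence for a toy hierarchy of one national series and three states. The summing matrix S, the function names and all case counts are illustrative assumptions and are not taken from this thesis; the bottom-up step shown is only one simple way to obtain coherent forecasts.

```python
import numpy as np

# Hypothetical two-level hierarchy: one national series, three state series.
# Each row of S expresses one series as a sum of bottom-level (state) series.
S = np.array([
    [1, 1, 1],  # national total = state A + state B + state C
    [1, 0, 0],  # state A
    [0, 1, 0],  # state B
    [0, 0, 1],  # state C
])


def is_coherent(y, S, tol=1e-9):
    """Check whether the stacked vector y (top level first) satisfies the
    aggregation constraint implied by S."""
    y = np.asarray(y, dtype=float)
    bottom = y[-S.shape[1]:]  # the last entries are the bottom-level series
    return bool(np.allclose(S @ bottom, y, atol=tol))


def bottom_up(bottom_forecasts, S):
    """Derive forecasts for all levels by summing bottom-level forecasts."""
    return S @ np.asarray(bottom_forecasts, dtype=float)


# Made-up numbers for illustration only.
# Forecasting each series separately can violate the constraint (100 != 95).
independent = np.array([100.0, 30.0, 45.0, 20.0])
# Summing the state forecasts yields a coherent set of forecasts.
coherent = bottom_up([30, 45, 20], S)
```

In this sketch, `is_coherent(independent, S)` is false because the separately produced national forecast differs from the sum of the state forecasts, whereas `is_coherent(coherent, S)` holds by construction.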

On the one hand, the relevance of hierarchical forecasting for infectious disease surveillance is quite obvious, as the investigation of infectious disease in a population stratified by “time, place or person” (Centers for Disease Control and Prevention 2006) at different levels of detail is a fundamental task in epidemiology and relevant for surveillance, for example because outbreaks can evolve both locally and globally and thus require heterogeneous action by health authorities (Salmon et al. 2016; Unkel et al. 2012; Krause et al. 2007). Accordingly, CDC’s FluSight and the RAPIDD Ebola forecasting challenge both require forecasts on national and regional (state or county) levels (Viboud et al. 2018; Biggerstaff et al. 2016). On the other hand, many authors in the field seem unaware of the concept of hierarchical forecasting, as forecasts for different levels are most often produced individually without consideration of coherence (Viboud et al. 2003; Osthus et al. 2017; Ertem et al. 2018) or by application of the bottom-up strategy without explicit designation (Kandula et al. 2017). Lately, however, initiators of the FluSight challenge noted this neglect of hierarchical forecasting and demonstrated its potential in improving flu forecasts (Gibson et al. 2019).

Model Interpretability

The concept of model interpretability can be roughly described as the capability to explain and understand a predictive model or its outputs (Christoph Molnar 2019; Miller 2019). Interpretability has been criticised as quasi-scientific by some authors, because scholars often fail to state a clear definition of interpretability and its intended purpose in their specific application (Lipton 2018; Doshi-Velez and Kim 2017). Two general notions of interpretability can be distinguished: transparency, which refers to an understanding of the structure and mechanisms of a predictive model, i.e. its algorithm, parameters and decision logic, and post-hoc interpretability, which refers to explanations that help humans to make sense of predictions without necessarily understanding the underlying model to its full extent (Lipton 2018). Moreover, scholars have proposed a categorization into global explanations, which apply to the full input and decision space of a model, and local explanations, which only elucidate an individual prediction (Mohseni et al. 2018; Christoph Molnar 2019; Berrada and Adadi 2018). Regarding the value and usage of interpretability, various motivations have been put forward, including end-user trust, individuals’ right to explanation, fairness and ethics, bias detection, provision of additional information, debugging, scientific discovery and hypothesis generation (Lipton 2018; Miller 2019; Doshi-Velez and Kim 2017).

The concept of interpretability originates from research on rule-based expert systems in the 1970s (Biran and Cotton 2017) but has recently attracted great attention in the field of machine learning and artificial intelligence, where a lack of interpretability due to high model complexity is assumed to be an obstacle to other goals of predictive analytics (Mohseni et al. 2018; Berrada and Adadi 2018). Nevertheless, interpretability can be of concern even with well-established and much less complex models, for example in linear regression (Lipton 2018; Schielzeth 2010). In the context of infectious disease surveillance, the ability to explain models and their predictions is perceived as important by public health practitioners (Doms et al. 2018; Driedger et al. 2014; Flahault et al. 2016). More specifically, the interpretability of forecasts has been directly linked to their practical value in outbreak control: It has been argued that "forecasts must be interpretable to be useful" (Nsoesie et al. 2014) because health practitioners can only act on forecasts with a meaningful epidemiological interpretation (Lutz et al. 2019). Currently, interpretability in infectious disease forecasting models is mostly addressed on an individual basis by scholars both for statistical (Stojanovic et al. 2019; Meyer et al. 2017; Reich et al. 2016a; Kane et al. 2014; Corberán-Vallet and Lawson 2014) and mechanistic models (Arık et al. 2020; Ghosh and Guha 2011). Moreover, the interpretability of different classes of forecasting models has been discussed to assess their value in decision support for surveillance (Manheim et al. 2017).

3.2 Setting and Motivation

Infectious disease surveillance is primarily conducted by dedicated public authorities ranging from local surveillance bodies in cities or counties to international organisations (Amato-Gauci and Ammon 2008). In Germany, the main surveillance actors are over 400 county-level health departments, 16 state-level departments and the Robert Koch Institute (RKI), the federal agency for disease prevention and control (Faensen et al. 2006). The latter agency, of which comparable institutions exist worldwide (IANPHI 2020), has the special role of continuously accumulating epidemiological information from all health departments and other sources, conducting analyses and disseminating the insights back to health departments and other stakeholders of the health system (Centers for Disease Control and Prevention 2016). This everyday surveillance practice, mostly carried out by epidemiologists at national public health institutes like the RKI, provides the main application context of this work.

While the vital societal function of surveillance attracts public attention mostly during high-profile events such as emerging pandemics, it is important to recognise that infectious disease surveillance is first and foremost a routine activity. Many infectious diseases in Germany and around the world exist in an endemic state, that is, they are constantly maintained at a low baseline level, but have the potential to become epidemic through outbreaks and sustained transmission (Centers for Disease Control and Prevention 2006). Such diseases must be continuously monitored but only require action in case of alarming developments. In Germany, the Robert Koch Institute lists over 80 different reporting categories of so-called notifiable diseases which are subject to ongoing surveillance (Faensen et al. 2006), constituting a considerable surveillance workload for epidemiologists. Accordingly, public health authorities strive for increased automation and algorithmic analysis to assist the work of epidemiologists through dedicated application systems for disease monitoring (Thorve et al. 2018; Naumova et al. 2005; Semenza 2015). In this context, forecasts can be one of many inputs to support epidemiological analysis and decision making. For example, depending on how quickly and how far case numbers are predicted to rise, different intensities of preparation or intervention of health departments may be considered appropriate by decision makers. In general, generic systems applicable to a large number of diseases are often preferred over case-specific solutions due to resource constraints (Salmon et al. 2016).

The main information source of modern surveillance systems is traditional reporting, in which cases of a notifiable disease are identified in patients through hospitals or general practitioners, confirmed through laboratory tests, reported to the responsible health department and subsequently escalated through the surveillance hierarchy until they are registered by the federal agency (Krause et al. 2007). In many countries, this process has been at least partially digitalized. Because endemic diseases have often been monitored for many years, even decades, substantial historical data is available (Robert Koch-Institut 2017). Aside from reports of confirmed cases, surveillance bodies increasingly strive for integration of additional sources of epidemiologically relevant information, also known as syndromic surveillance, for example certain symptoms encountered in hospital emergency rooms or population behaviour such as absence-from-work rates or over-the-counter drug sales (Corberán-Vallet and Lawson 2014). Moreover, news articles, social media and online search behaviour have been considered (Amato-Gauci and Ammon 2008). While the collection and usage of such additional, inconclusive information is still at an experimental stage, the integration of non-traditional data not primarily created for surveillance is a clear future trend in epidemiology (Salathé et al. 2012).

An illustrative example of digital instruments for surveillance is the “Signale” system by RKI, which also provides the context for this thesis. The system has offered functionality for statistical outbreak detection since its first version in 2017, but is currently being extended into a versatile "outbreak information" tool for epidemiologists (Claus et al. 2017). Through an interactive dashboard, descriptive information but also automatic analyses and predictions will be provided to the user, including short-term forecasts. The system is intended to support a large number of diseases. Moreover, state-of-the-art machine learning techniques are explored to be able to integrate and analyse external information sources such as social media activity in the future.

Regarding hierarchical forecasting, the ability to oversee disease progression at various levels of aggregation to detect both local and global phenomena has been described as a key value of centralized surveillance systems (Krause et al. 2007). The current design of the Signale system already takes the importance of multi-level, hierarchical analysis into consideration, as expressed both by the structure of the outbreak detection algorithm (aberration detection on variously aggregated and sliced time series) and the Signale user interface (visualisations at different levels, e.g. state or county) (Robert Koch Institut 2020). Correspondingly, further components such as forecasting functionality should be designed to integrate well with this structure.

Moreover, in line with the goal of Signale to provide a "complete, multidimensional picture of the epidemiological situation" (Claus et al. 2017), it seems plausible that interpretable forecasts with the ability to obtain an explanation for each prediction could provide additional value beyond a plain forecast. For example, given the mere predictions, one could only distinguish two outbreak forecasts by their size. However, if explanations with epidemiological meaning were available, they could be used to also distinguish between different kinds of outbreak dynamics, for example between local outbreaks and nation-wide dispersed outbreaks. On the one hand, this may have implications for appropriate reactions by health authorities. On the other hand, such explanations could help to identify similar epidemiological situations from the past, allowing for comparison and inference. Moreover, forecasts could also be put in relation to other analyses of surveillance data. For instance, epidemiologists may scrutinize and revise forecasts by combining the forecast explanation with other information that is not yet available in digital or structured format or with their general domain knowledge (Chretien et al. 2014).

3.3 Objectives

Based on the previously elaborated demands of infectious disease surveillance as an application context and a review of the literature, four key objectives of an approach for interpretable hierarchical forecasting are proposed in the following. These objectives serve both as guidance during the design of the intended artefact and as criteria in its subsequent evaluation.

Flexibility

The approach should be flexible in the sense that it can be adopted in different surveillance situations with varying conditions and opportunities. This objective follows mainly from the demands of routine surveillance and can be differentiated into three dimensions, hereafter termed disease flexibility, information flexibility and method flexibility:

• Disease Flexibility: The approach should not be so disease-specific that expensive tailoring to each disease selected for surveillance would be needed. As public health authorities monitor a large number of pathogens with limited resources, an approach readily suitable for various diseases is required.

• Information Flexibility: Health authorities may have knowledge and data of varying degrees of detail available for different notifiable diseases, for example because prominent diseases could be better investigated and researched than others. A flexible approach would take advantage of all information available for a specific disease but have easily satisfiable minimal requirements. Furthermore, as institutions strive for acquisition of new sources of surveillance data, an approach compatible with diverse information sources may prove more viable in the long term.

• Method Flexibility: The field of infectious disease forecasting has a variety of methods available. While some approaches are particularly popular, there does not seem to be a silver bullet (Polonsky et al. 2019; Lauer et al. 2020); the same applies to hierarchical forecasting (Athanasopoulos et al. 2020). Therefore, an approach with sufficiently standardized interfaces between its individual components, which allow exchanging or modifying one element without impairing the others, is favourable. It offers the opportunity to flexibly choose the best method for a certain case, potentially even through automatic or semi-automatic benchmarking. Obviously, this objective is limited by the requirement that all potential methods must be compatible with one another.

The evaluation of flexibility will be conducted qualitatively based on the properties of the components chosen and their interdependencies.


Hierarchical Coherence

As has been argued, the approach should yield forecasts for different levels of aggregation along one or several dimensions, because both a detailed prediction of the disease progression in sub-populations and an overview of global or dispersed disease dynamics are needed in surveillance. The focus here will be on forecasts of count data like disease case numbers. These different forecasts can be arranged in a tree-like hierarchy with the most aggregated, national forecast at the top and highly disaggregated forecasts at the bottom, where the value of each time series is equal to the sum of its child time series values (for illustration see Figure 1).

As previously noted, it is desirable that the forecasts for the different levels are "coherent", i.e. they add up accordingly as well. This requirement has the essential motivation that violations of coherence may result in ambiguity about the forecasts. In the case of incoherent forecasts, users could arrive at contradictory conclusions about the expected future by looking at different aggregation levels, which has been described as an obstacle to aligned decision making. In infectious disease surveillance, the possibility of incoherence would imply that epidemiologists always had to inspect the forecasts of all levels in order to obtain a complete overview of the epidemiological situation and progression of a disease. Moreover, it seems unclear how epidemiologists were to resolve strong contradictions between the forecasts (e.g. the national forecast predicts a rising trend while all state-level forecasts predict a falling trend) in order to arrive at actionable conclusions for disease control. Note that the notion of ambiguity is very different from the general concept of uncertainty in forecasting: While uncertainty simply expresses different eventualities of the future, ambiguous forecasts predict an impossible future.

Thus, hierarchical coherence is another objective of the approach envisioned here. It can be straightforwardly evaluated by checking the forecasts for any contradictions between aggregation levels. If the forecasting approach is explicitly designed to produce coherent forecasts, it may also be possible to formally prove the coherence property.
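
The coherence check described above can be sketched in a few lines of Python. The example assumes a hypothetical two-level hierarchy (one national series and three state series) encoded as a summing matrix; the names `S` and `is_coherent`, the ordering of the series and the tolerance are illustrative assumptions and not part of the approach itself.

```python
import numpy as np

# Hypothetical hierarchy: rows = [nation, state A, state B, state C],
# columns = bottom-level series (the three states).
S = np.array([
    [1, 1, 1],  # nation = A + B + C
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

def is_coherent(forecasts, S, n_bottom, tol=1e-9):
    """Return True if all forecasts equal the sums of their bottom-level forecasts."""
    forecasts = np.asarray(forecasts, dtype=float)
    bottom = forecasts[-n_bottom:]  # assumption: bottom-level forecasts come last
    return bool(np.allclose(S @ bottom, forecasts, atol=tol))

coherent = [60.0, 10.0, 20.0, 30.0]
incoherent = [75.0, 10.0, 20.0, 30.0]  # national forecast contradicts the states
```

In this sketch, `is_coherent(coherent, S, 3)` holds while `is_coherent(incoherent, S, 3)` does not, because the national value of 75 cannot be obtained by adding up the state forecasts.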

Accuracy

A very obvious goal of infectious disease forecasting is that the predicted values should be as close as possible to the true values in order to produce exact expectations of the future and allow for appropriate planning. Despite the clarity of this objective, various metrics for its evaluation have been proposed and used in the context of infectious disease forecasting, which will be introduced and judged in chapter 6.

In general, the accuracy of hierarchical forecasts can be evaluated individually for each time series of the hierarchy. Still, the errors for the different series are often summarized for overview purposes, which is usually achieved by simple averaging of error scores (Wickramasuriya et al. 2020; Taieb et al. 2020; Li et al. 2019; Rehman et al. 2019). In theory, the accuracy of hierarchical forecasts could thus be expressed in a single aggregated measure, but one should expect richer insights from an evaluation differentiated by the different hierarchy levels. Notably, hierarchical forecasting of infectious disease case numbers poses particular challenges to accuracy evaluation, because the case numbers may range from extremely small or zero counts at the lower levels up to thousands of cases at the top level. This makes comparison across hierarchy levels particularly challenging, as will be addressed in chapter 6 as well.
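
As a minimal sketch of an evaluation differentiated by hierarchy level, the following Python example averages the mean absolute error of the series within each level. The function name `mae_by_level`, the choice of the mean absolute error and the toy numbers are assumptions made for illustration; the metrics actually used are the subject of chapter 6.

```python
import numpy as np

def mae_by_level(y_true, y_pred, levels):
    """Mean absolute error per hierarchy level.

    y_true, y_pred: arrays of shape (n_series, n_timepoints)
    levels: one level label per series
    """
    abs_err = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    per_series = abs_err.mean(axis=1)  # one error score per time series
    labels = np.asarray(levels)
    # Simple averaging of the error scores of all series on the same level.
    return {str(lv): float(per_series[labels == lv].mean()) for lv in np.unique(labels)}

# Hypothetical toy data: one national series and two state series.
levels = ["nation", "state", "state"]
y_true = [[100, 120], [60, 70], [40, 50]]
y_pred = [[110, 110], [62, 68], [41, 49]]
scores = mae_by_level(y_true, y_pred, levels)
```

With these toy numbers, the unscaled error at the national level is several times larger than at the state level, which illustrates why a single average over all series would be dominated by the top of the hierarchy.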

Ideally, forecasts should not only be precise in terms of proximity to the true future value, but also provide a measure of uncertainty about the future (Doms et al. 2018; Gneiting and Raftery 2007). This is especially relevant in infectious disease forecasting due to the limited knowledge (epistemic uncertainty) and high stochasticity (aleatoric uncertainty) of disease dynamics in a population (Scarpino and Petri 2019; Centers for Disease Control and Prevention 2016). This demand is usually met by providing prediction intervals or full probability distributions as forecasts, also known as probabilistic forecasting (Lauer et al. 2020; Gibson et al. 2019). Because an adequate consideration of probabilistic hierarchical forecasting would be out of scope, this objective is, however, not included in the present design. Nevertheless, it will briefly be discussed in the outlook to future research how the approach designed could be extended to probabilistic forecasting.

An essential question regarding hierarchical forecasting is whether the consideration of hierarchical coherence of forecasts comes at the cost of accuracy or not. Ideally, the approach proposed would provide coherent forecasts which are at least as accurate as forecasts that ignore the aggregation constraints. Therefore, the accuracy evaluation in this work will be focused on comparisons between coherent and incoherent forecasts both obtained using the same basic forecasting method. In other words, the approach should provide hierarchically coherent forecasts without loss of accuracy.

Interpretability

Lastly, the approach should provide additional information on the forecasts in the form of epidemiologically meaningful explanations. The notion of interpretability taken here is that of post-hoc interpretability, i.e. explanation without elucidation of the inner workings of the forecasting model. Instead, local explanations for individual forecasts should be attained that help to characterise the underlying epidemiological situation. Thus, the goal of interpretability in this work is that of interpretability for informativeness (Lipton 2018), aligning well with the application context of infectious disease surveillance: the objective of the forecasting approach is to provide "useful information to decision makers" (Lipton 2018), which is of course mainly achieved through the forecast, but can be enriched through additional information in the form of an explanation. This framing means that the approach must explicitly produce meaningful explanations of individual forecasts; mere simplicity of the models used will not be sufficient to term the forecasts "interpretable" in the above sense. Furthermore, complex models are also permitted in principle if the complexity does not impede meaningful explanation.

Regarding the evaluation of interpretability, a distinction is often made between three general approaches (Doshi-Velez and Kim 2017; Samek et al. 2019; Mohseni et al. 2018). In application-grounded evaluation, the usefulness of explanations is tested through user experiments within the real application, i.e. in the present case with epidemiologists in surveillance practice. In human-grounded evaluation, humans are asked to carry out simplified proxy tasks with the help of explanations (for example making a judgemental forecast solely based on the explanation) and their performance is taken as an indicator of explanation quality. Lastly, in functionally-grounded evaluation, no humans are involved; instead, explanations are judged against formal criteria of interpretability or used to perform other computational tasks to test their informative value. Here, only functionally-grounded evaluation will be conducted. While not conclusive regarding the real-world use of explanations, it is important groundwork which ensures that explanations are sound and have information content before testing them with human users. Accordingly, quality is here defined along two dimensions. Firstly, while the explanations do not have to uncover the inner workings of the model, they must still reflect its decision logic appropriately (Christoph Molnar 2019). Otherwise, explanations for forecasts could be plausible but deceptive. However, what constitutes an appropriate coupling of explanation and model of course depends on the type of model used and will therefore be concretized in the course of the design process. Secondly, given a sufficiently calibrated model, explanations should provide additional information on the epidemiological situation which cannot be inferred from the forecast alone. This aspect will be evaluated by testing whether explanations for infectious disease forecasts allow identifying the underlying outbreak situation better than the forecast.


4 Design Choices

In the following, several fundamental design choices are made and justified. They restrict the applicable methods for interpretable hierarchical forecasting to a set of appropriate and mutually compatible elements.

4.1 Modeling Paradigm

The first essential choice for the envisioned approach is about the class of forecasting models to be used, because the model determines both the applicable hierarchical forecasting strategies and the mode of explanation. As previously discussed, the literature on infectious disease forecasting broadly distinguishes between mechanistic and statistical models, which will be compared in the following with regard to the present design objectives.

Mechanistic Models

The class of mechanistic models is based on explicit representation of the disease spreading dynamics at varying levels of detail and includes compartmental, meta-population and agent-based models (Nsoesie et al. 2014). Compartmental models separate the population into different stocks of disease status groups, called compartments, for example "susceptible", "infectious" or "recovered", and describe the evolution of compartment sizes using ordinary differential equations that represent the transition of individuals from one compartment to another (Carias et al. 2019). More detail can be added through additional compartments, for example for different age groups or genders. A typical and delicate assumption of such models is homogeneous mixing, which means that contacts between members of two compartments are equally likely (Manheim et al. 2017). In contrast, meta-population models consist of several similar subpopulations in a spatial arrangement which simulate the spreading of pathogens through the environment from one subpopulation to the other (Höhle 2016). Finally, agent-based models simulate the everyday life and behaviour of a large number of individuals with detailed demographic characteristics to reproduce disease spreading under complex contact patterns (Dembek et al. 2018). All mechanistic models require a number of parameters such as the contact rate or transmissibility of a disease, which are usually estimated using historical data (Hazelbag et al. 2020). Forecasts can either be produced analytically by solution of the model’s differential equations or, in the case of agent-based models, through simulation (Manheim et al. 2017).
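
To make the idea of a compartmental model concrete, the following Python sketch integrates the classical SIR equations with a simple explicit Euler scheme. The function name `simulate_sir`, the step size and all parameter values are hypothetical and chosen for illustration only; in practice, the transmission rate `beta` and the recovery rate `gamma` would have to be estimated from data.

```python
def simulate_sir(s0, i0, r0, beta, gamma, days, dt=0.1):
    """Integrate the SIR equations with the explicit Euler method.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I
    """
    n = s0 + i0 + r0  # total population, constant under homogeneous mixing
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(s, i, r)]
    steps_per_day = int(round(1 / dt))
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        trajectory.append((s, i, r))  # store one state per day
    return trajectory

# Hypothetical parameters: 10 initial infections in a population of 10,000.
traj = simulate_sir(s0=9990, i0=10, r0=0, beta=0.3, gamma=0.1, days=60)
```

Each entry of `traj` holds the sizes of the susceptible, infectious and recovered compartments at the end of a day. The term `beta * s * i / n` encodes the homogeneous mixing assumption mentioned above.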

Although some high-level standards such as the SIR template exist, mechanistic models are usually highly disease-specific. In order to achieve significant results, modelers often have to tailor their model to particular circumstances of the case and engage in elaborate parameter estimation (Hazelbag et al. 2020; Carias et al. 2019). Moreover, real-world disease dynamics may change and invalidate formerly reasonable model assumptions, requiring remodeling and leading to erroneous predictions if unnoticed (Lauer et al. 2020). Regarding information flexibility, mechanistic models can also work with only little historical data if experts can supply sufficient assumptions instead (Manheim et al. 2017). However, the set of parameters required in a mechanistic model is usually fixed, hence the model can only be applied if all parameters can be estimated (Carias et al. 2019). A further important downside is that only information with direct biological relevance for disease spreading can be integrated (Lauer et al. 2020). For example, unless a causal theory about the relationship between influenza infections and user behaviour on the web were available, a purely mechanistic model could not make use of Google search statistics for its forecasts. Lastly, regarding method flexibility, an important consideration may be whether it is possible to find a mode of explanation that is not too closely tied to the model and its assumptions, so that the model can be altered without invalidation of the explanation method.

In order to provide forecasts for different aggregation levels, the mechanistic model must have a sufficient degree of detail. While agent-based models usually provide very high detail per se, the resolution of compartmental or meta-population models largely depends on the modeler’s choice. Nevertheless, limitations are given through the fact that the number of parameters increases with the level of detail and that parameter estimation may be difficult at very high resolution (Lauer et al. 2020; Hazelbag et al. 2020). Ultimately, hierarchical forecasts can be obtained in two ways: Either all strata are forecasted together using a single, overarching model (e.g. a nation-wide, agent-based model) or several models are employed for the different strata (e.g. one compartmental model for each state and one for the nation) (Yang et al. 2016; Osthus et al. 2017). In the first case, the forecasts should naturally be coherent if the model mimics the real data-generating process. In the second case, coherence must be ensured post-hoc using an appropriate technique from the field of hierarchical forecasting (Fliedner 2015).

Given suitable modeling assumptions and good estimates of the model parameters, mechanistic models are expected to produce usable and accurate forecasts (Lauer et al. 2020; Dembek et al. 2018). A particular strength is that such models can also simulate dynamics which have not been observed in the past and are thus promising for emerging diseases and extraordinary outbreaks (Manheim et al. 2017). The uncertainty in parameter estimates can generally be incorporated in probabilistic forecasts through Monte Carlo simulation; nevertheless, a certain risk of model misspecification leading to flawed forecasts remains (Hazelbag et al. 2020; Manheim et al. 2017). A particular challenge arises when data or forecasting targets involve dynamics that are outside of disease transmission, for example when reported cases are used as a proxy for total cases, thus including the data-generating process of the reporting system (Lauer et al. 2020).

Since by design all parameters and starting conditions of mechanistic models have a relevance for disease transmission, meaningful explanations for the resulting forecasts are rather straightforward to derive (Manheim et al. 2017). Therefore, it can generally be expected that an approach with mechanistic models offers good post-hoc interpretability.

Statistical Models

Statistical models largely forgo an explicit representation of disease dynamics and try to infer the future trend through replication of historical patterns instead (Lauer et al. 2020). Classical time series models, for example exponential smoothing (Unkel et al. 2012) or autoregressive moving average (ARMA) models (Allard 1998), are built for typical patterns of many time series such as autoregression and seasonality. A more flexible class of forecasting models are generalized linear models, which relate a linear function of regressors, via a link function, to the conditional mean of a probability distribution from the exponential family (Zeger and Karim 1991). The regressors, also called features, can be past observations like in classical time series models but also other potential predictors of disease spreading. A further extension are generalized additive models, which are sums of potentially non-linear functions such as splines or kernels, allowing the modeling of complex relationships (Stojanovic et al. 2019). While the above models assume a specific probability distribution of the target to be forecasted, models from the field of machine learning fit arbitrary functions by minimizing error on a training set without a-priori consideration of a specific distribution (Flahault et al. 2016; Salathé et al. 2012). This offers maximum flexibility but requires more training data and can lead to very complex models. Examples of machine learning methods used in infectious disease forecasting are random forests (Kane et al. 2014), artificial neural networks (Chae et al. 2018) or Gaussian processes (Ak et al. 2018).
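
As an illustration of the generalized linear model described above, a Poisson regression for case counts with a logarithmic link can be written as follows. The symbols $y_t$ (case count at time $t$), $x_{t,j}$ (value of feature $j$), $\beta_j$ (coefficients) and $\mu_t$ (conditional mean) are introduced here for illustration only, and the Poisson distribution is just one possible member of the exponential family.

```latex
y_t \mid x_t \sim \mathrm{Poisson}(\mu_t),
\qquad
\log \mu_t = \beta_0 + \sum_{j=1}^{p} \beta_j \, x_{t,j}
```

If the features $x_{t,j}$ are lagged observations such as $y_{t-1}$, the model takes an autoregressive form; other predictors of disease spreading enter the linear term in exactly the same way.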

The disease flexibility of statistical methods is very high, as the input features are often generic and no detailed assumptions about disease spreading are made. This also has the advantage that shifts in reality can be addressed by refitting the model on new data (Lauer et al. 2020). Overall, the degree of manual modeling and expert input required will depend on the exact model type but should range from medium effort with detailed parametric methods to very low effort with general-purpose machine learning. Regarding information flexibility, statistical approaches have the strong prerequisite of historical data in order to be accurate. Other than that, the minimal data requirements of statistical methods are very low, as most approaches would already work with only a historical time series of the forecasting target. While the inclusion of further data sources in classical time series analysis methods such as ARIMA is only possible to a limited extent, regression-based methods can easily integrate additional features (Johnson et al. 2018; Meyer et al. 2017; Ertem et al. 2018). Most importantly, this also includes information which cannot be directly translated into disease dynamics such as syndromic surveillance data, environmental measurements or social media activity (Lauer et al. 2020). In this regard, some machine learning methods may be especially promising with their ability to perform automatic selection of features and operate with high-dimensional data and non-linear relationships (Kane et al. 2014; Chae et al. 2018). With respect to method flexibility, the template of supervised learning offers considerable standardization (Manheim et al. 2017). It should thus allow drawing from a large pool of available regression methods without changing the interfaces of the forecasting component.

Most statistical methods are tailored to univariate or low-dimensional multivariate predictions. Hence, hierarchical forecasts require individual predictions for each series of the hierarchy. Because forecasts for different levels obtained in this way are not guaranteed to be coherent, a hierarchical forecasting strategy must be used to ensure coherence.

Given sufficient historical data, statistical methods are capable of producing forecasts that are as accurate as or slightly more accurate than those of mechanistic models (Lauer et al. 2020). However, due to the general assumption that the future disease spreading will follow a similar pattern as in the past, statistical models are generally more suited for short-term forecasting and may be inaccurate in very extraordinary settings like pandemics (Scarpino and Petri 2019; Dembek et al. 2018; Manheim et al. 2017). Parametric statistical methods allow expressing the uncertainty of the forecasts through the variance of the predictive distribution and the standard error of model parameters (Paul and Meyer 2016; Stojanovic et al. 2019). Machine learning models are often restricted to point predictions, but some methods have been extended to also provide quantile forecasts, which can be used to construct prediction intervals (Meinshausen 2006; Gasthaus et al. 2020; Smyl 2020).

For local explanation of forecasts from statistical models, both the values of the input features and the model itself should be taken into account. Among the variety of known explanation methods for statistical models one can distinguish between model-specific approaches, which are tailored to a specific type of model and make explicit use of the model parameters to produce an explanation, and so-called model-agnostic approaches, which treat the forecasting model as a black box and rely on repeated sampling of the prediction model with slightly perturbed input features to establish a connection between input and forecast (Christoph Molnar 2019; Ribeiro et al. 2016a). Thus, in theory, explanations can be produced for arbitrary models, but the computational feasibility and quality of such explanations may be at risk from high model complexity (Mohseni et al. 2018). Moreover, in order to add value to disease surveillance, the components of the explanation should have epidemiological meaning, which could have implications for the suitability of certain model types and input features.
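
The principle behind model-agnostic approaches can be illustrated with a deliberately simple perturbation scheme. The following Python sketch perturbs one input feature at a time and records how strongly the prediction changes on average. It is a basic sensitivity analysis written for illustration, not an implementation of any of the cited explanation methods; the names `perturbation_sensitivity` and `black_box` as well as the noise scale are assumptions.

```python
import numpy as np

def perturbation_sensitivity(predict, x, scale=0.1, n_samples=200, seed=0):
    """Model-agnostic local sensitivity of predict() around the input x.

    For each feature, the feature is perturbed with Gaussian noise while all
    others are held fixed, and the mean absolute change of the prediction is
    recorded. Only calls to predict() are needed, not the model internals.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    baseline = predict(x)
    sensitivities = np.zeros(len(x))
    for j in range(len(x)):
        perturbed = np.tile(x, (n_samples, 1))
        perturbed[:, j] += rng.normal(0.0, scale, size=n_samples)
        changes = np.array([predict(row) for row in perturbed]) - baseline
        sensitivities[j] = np.mean(np.abs(changes))
    return sensitivities

# Hypothetical black-box forecaster: depends strongly on feature 0,
# weakly on feature 1 and not at all on feature 2.
def black_box(features):
    return 5.0 * features[0] + 0.5 * features[1] ** 2

sens = perturbation_sensitivity(black_box, x=[1.0, 1.0, 1.0])
```

The number of model evaluations grows with the number of features and samples, which hints at the computational concerns raised above for complex models and high-dimensional inputs.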


Choice

To summarize, both mechanistic and statistical models can provide accurate forecasts, with potentially larger resource requirements in mechanistic modeling. The performance of statistical models depends on the availability of historical data, which is usually given in the setting of routine surveillance of endemic diseases. Hierarchical forecasting is equally possible with mechanistic and statistical models and may require additional steps in both cases. Meaningful explanations for forecasts may be more straightforward to obtain from mechanistic models, but there also exist methods for explanation of statistical forecasts. Mechanistic models seem to be limited in their ability to handle many different diseases and information situations at once, while statistical models promise very high flexibility.

Given this comparison, the application context of infectious disease surveillance seems to favour the use of statistical models over mechanistic ones for the present approach, mainly because the higher flexibility and standardization of statistical models facilitate simultaneous surveillance of many diseases. A further advantage is that regression-based approaches provide a standardized interface, which increases method flexibility, and allow the integration of new, heterogeneous data sources without direct connection to disease transmission. They are thus well aligned with the apparent trend towards digital epidemiology in surveillance. Nevertheless, statistical models require a dedicated approach for hierarchical forecasting and model interpretation, which shall be investigated next.

4.2 Hierarchical Forecasting Strategy

Four different types of strategies for coherent hierarchical forecasting with statistical models have been identified in the literature and will be compared in the following.

Single-Level Base Forecasting

The earliest and simplest strategies for hierarchical forecasting all involve producing "base forecasts" for one level of the hierarchy and then completing the remaining levels based on this result (Schwarzkopf et al. 1988; Shlifer and Wolff 1979): The bottom-up strategy forecasts the bottom-level time series of the hierarchy and adds up the predictions to obtain the higher-level forecasts. In contrast, the top-down strategy, also known as "proration", forecasts only the single top-level time series and distributes the result to the lower levels using appropriate proportions, which are usually determined by the average historical proportions (Gross and Sohl 1990). Lastly, the middle-out approach forecasts neither the bottom nor the top level of the hierarchy but some level in between and uses both aggregation and disaggregation to obtain the remaining forecasts (Hyndman and Athanasopoulos 2018).
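
The bottom-up and top-down strategies can be sketched as follows for a hypothetical hierarchy of one nation with two states. The function names and the toy numbers are illustrative. For the top-down case, the sketch uses the time average of each state's historical share of the national total, which is one possible reading of "average historical proportions" and assumes that no historical total is zero.

```python
import numpy as np

def bottom_up(bottom_forecasts):
    """Top-level forecast as the sum of the bottom-level base forecasts."""
    bottom = np.asarray(bottom_forecasts, dtype=float)
    return bottom.sum(), bottom

def top_down(top_forecast, bottom_history):
    """Distribute a top-level base forecast using average historical proportions.

    bottom_history: array of shape (n_timepoints, n_bottom_series)
    """
    history = np.asarray(bottom_history, dtype=float)
    # Average over time of each series' share of the total at that time point.
    proportions = (history / history.sum(axis=1, keepdims=True)).mean(axis=0)
    return float(top_forecast), top_forecast * proportions

history = np.array([[10, 30], [20, 20]])  # two weeks, two states (toy data)
top_bu, states_bu = bottom_up([12.0, 28.0])
top_td, states_td = top_down(40.0, history)
```

Both variants are coherent by construction, since the national value always equals the sum of the state values. A middle-out strategy would combine the two functions, applying aggregation above and disaggregation below the chosen level.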


All these strategies are simple to implement and flexible because they are agnostic about the model used to produce the base forecast. The bottom-up strategy only requires the base forecasts as input, whereas the top-down and middle-out strategies may additionally require information on historical proportions.

Regarding accuracy, the bottom-up strategy can be problematic in case of strongly disaggregated series, where the bottom-level forecasts could suffer from a low signal-to-noise ratio, potentially leading to inaccurate higher-level forecasts (Shang and Hyndman 2017). This case is quite probable in infectious disease surveillance, as some endemic diseases can have very low incidence already at the county level. The top-down strategy may be favourable then, but it incurs a certain loss of information due to the aggregation of the top level, meaning that the fine-grained characteristics of the lower levels may not be adequately represented (Shlifer and Wolff 1979; Hyndman and Athanasopoulos 2018). The slightly more complex middle-out approach may alleviate but not completely remove these disadvantages. Another problem with the bottom-up and middle-out strategies is that if the base forecasts are produced using individual models for each time series of the level, which is convenient and often done, then potential correlations between the time series of this level are not taken into account (Athanasopoulos et al. 2020).

With respect to interpretability, however, the explanations for the higher-level forecasts should combine the explanations for their associated base forecasts. For example, if the forecast for a certain state is derived by adding up the base forecasts for its 17 counties in a bottom-up fashion, then the explanation for the state forecast should somehow take into account the explanations for all 17 county forecasts.

Reconciliation via Projection

The basic idea of reconciliation is to first produce base forecasts for all levels of the hierarchy and then combine them into a set of so-called reconciled forecasts which respect the coherence constraint (Athanasopoulos et al. 2020). Hyndman et al. (2011) describe the reconciliation as a two-step procedure: First, the reconciled value for each bottom-level forecast is computed as a linear combination of all base forecasts. For example, in a geographical hierarchy with two levels (state and nation), the reconciled forecast for a specific state is not only defined by its base forecast, but may also depend on the base forecasts for the other states and for the nation. The coefficients used for this linear combination must be predefined, and different choices to obtain optimal forecasts have been theoretically motivated in the literature (Wickramasuriya et al. 2019). Second, the resulting revised bottom-level forecasts are used to obtain the reconciled forecasts for the higher levels using the classical bottom-up strategy, i.e. by simply adding them up. In this way, information from all levels is taken into account but the final forecasts are still coherent. The total procedure can be geometrically framed as a linear projection of the base forecasts onto a coherent subspace (Panagiotelis et al. 2020).
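
The two-step procedure can be sketched in Python for a hypothetical hierarchy of one nation with two states. The sketch uses the orthogonal (ordinary least squares) projection as one simple choice of coefficients; the names `S`, `G` and `reconcile_ols` are illustrative, and other coefficient choices discussed in the literature would replace only the line that defines `G`.

```python
import numpy as np

# Hypothetical two-level hierarchy: nation = state A + state B.
# Rows: [nation, A, B]; columns: bottom-level series [A, B].
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

def reconcile_ols(base_forecasts, S):
    """Reconcile base forecasts by orthogonal (OLS) projection.

    Step 1: bottom-level values as a linear combination G @ y_hat of all base forecasts.
    Step 2: bottom-up aggregation S @ (G @ y_hat) to obtain all levels.
    """
    y_hat = np.asarray(base_forecasts, dtype=float)
    G = np.linalg.solve(S.T @ S, S.T)  # G = (S'S)^{-1} S'
    bottom = G @ y_hat
    return S @ bottom

y_hat = np.array([100.0, 40.0, 50.0])  # incoherent: 40 + 50 != 100
y_tilde = reconcile_ols(y_hat, S)
```

In this toy example, the gap of 10 cases between the national base forecast and the sum of the state base forecasts is spread over all three series, so that the reconciled state forecasts are adjusted upwards and the national forecast downwards.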

Reconciliation via projection is more complex than single-level strategies and always requires base forecasts for all levels. On the other hand, it is generic in the sense that the coefficients for projection, which ultimately determine how the information from the base forecasts is combined, are configurable. In fact, one can even design projections which represent the single-level strategies, thus including bottom-up, top-down and middle-out merely as special cases of reconciliation (Hyndman et al. 2011).
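
How single-level strategies arise as special cases can be seen from the coefficient matrix alone. In the following Python sketch for a hypothetical hierarchy of one nation with two states, the matrix names `G_bottom_up` and `G_top_down` and the proportions `p` are assumptions made for illustration.

```python
import numpy as np

S = np.array([[1.0, 1.0],   # nation = A + B
              [1.0, 0.0],   # state A
              [0.0, 1.0]])  # state B

# Bottom-up: ignore the top-level base forecast, keep the bottom-level ones.
G_bottom_up = np.array([[0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])

# Top-down: distribute the top-level base forecast with fixed proportions
# (hypothetical values) and ignore the bottom-level base forecasts.
p = np.array([0.4, 0.6])
G_top_down = np.column_stack([p, np.zeros((2, 2))])

y_hat = np.array([100.0, 40.0, 50.0])  # base forecasts [nation, A, B]
y_bu = S @ G_bottom_up @ y_hat
y_td = S @ G_top_down @ y_hat
```

The bottom-up matrix yields the forecasts 90, 40 and 50, whereas the top-down matrix yields 100, 40 and 60. Both results are coherent, but each uses the base forecasts of one level only.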

It has been empirically shown that reconciliation via projection can indeed take advantage of the information from all levels and yield better accuracy than the traditional single-level strategies (Athanasopoulos et al. 2017; Li and Tang 2019; Gibson et al. 2019; Shang and Smith 2013). However, because the theoretical assumptions used to propose optimal projections may not fully hold in practice, their superiority is not guaranteed (Shang and Hyndman 2017).

Because it combines base forecasts from all levels, reconciliation constitutes a considerable obfuscation of forecasting. In general, the explanation of a reconciled forecast must be an appropriate combination of explanations from all base forecasts. However, a mitigating aspect of reconciliation via projection is that the projections used are linear combinations (Hyndman et al. 2011). Thus, the individual explanations of the base forecasts can be considered in an additive manner, limiting the complexity of explanation.
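
One way to read this additivity argument is sketched below in Python. The sketch assumes that each base forecast comes with an additive explanation, i.e. a split of the forecast into contributions of the same set of components for every series; the component names, the toy values and the use of the OLS projection are hypothetical choices for illustration.

```python
import numpy as np

S = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # nation = A + B
G = np.linalg.solve(S.T @ S, S.T)  # OLS coefficients, one possible choice
P = S @ G                          # linear map from base to reconciled forecasts

# Hypothetical additive explanations: each row splits one base forecast into
# contributions of three components (e.g. level, trend, season).
A = np.array([[70.0, 20.0, 10.0],   # nation, sums to 100
              [30.0,  5.0,  5.0],   # state A, sums to 40
              [35.0, 10.0,  5.0]])  # state B, sums to 50
y_hat = A.sum(axis=1)

# Because the projection is linear, it can be applied to the contributions directly.
A_reconciled = P @ A
y_tilde = P @ y_hat
```

Under these assumptions, the rows of `A_reconciled` add up to the reconciled forecasts, and the contributions are themselves coherent across levels. Whether such a combination is appropriate for a given type of explanation is a separate question.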

Reconciliation via Optimization

van Erven and Cugliari (2015) propose an optimization-based strategy to obtain coherent forecasts. Similar to reconciliation via projection, base forecasts for all levels of the hierarchy are produced first. Next, the authors formulate a minimax optimization problem whose solution is a reconciled forecast that minimizes the maximal possible increase in aggregate error compared to the original base forecast. Coherence of the solution is ensured by defining the aggregation structure as optimization constraints. As the authors show, the maximum increase in aggregate error of the optimal solution is always smaller than or equal to zero, meaning that the reconciled forecasts are, in total, never worse than the base forecasts (van Erven and Cugliari 2015).
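
The structure of such a minimax problem can be paraphrased roughly as follows. The notation is introduced here for illustration and may differ from that of the original authors: $\hat{y}$ denotes the vector of base forecasts, $\mathcal{A}$ the set of coherent vectors, $\mathcal{B}$ an optional set encoding further constraints (the whole space if none are given) and $\ell$ an aggregate loss function.

```latex
\tilde{y} \;=\; \arg\min_{x \in \mathcal{A}} \;
\max_{y \in \mathcal{A} \cap \mathcal{B}} \;
\bigl\{ \ell(y, x) - \ell(y, \hat{y}) \bigr\}
```

The inner maximization considers the least favourable coherent outcome $y$, and the term in braces is the increase in aggregate error of a candidate $x$ over the base forecasts. An optimal value of at most zero thus corresponds to the guarantee described above.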

This optimization-based approach is especially flexible in the sense that it uses the hierarchy structure as the minimal set of constraints, but other potential constraints expressing further knowledge such as confidence intervals for the forecasts can be added (van Erven and Cugliari 2015). However, the required optimization routine must be applied to each forecast and can become computationally expensive.


Because no potentially violated theoretical assumptions about the base forecasts are made, optimization can provide better accuracy than projection-based approaches in practice (van Erven and Cugliari 2015). It is also safe insofar as the reconciled forecasts will never be worse than the base forecasts.

Unfortunately, the fact that the reconciled forecasts are obtained from the solution of an optimization problem poses a severe obstacle to interpretability. In general, no closed-form solution is available for the problem, so that quadratic programming must be used in order to obtain the reconciled forecasts (van Erven and Cugliari 2015). Thereby, the relationship between the base forecasts and the final reconciled forecasts becomes intractable, making meaningful explanation very difficult or even impossible.

Direct Hierarchical Forecasting

A very different approach which directly produces coherent forecasts is introduced by Ouyang et al. (2019). The authors propose an artificial neural network for forecasting which captures the hierarchical structure of the time series through its propagation logic: The network has a multivariate output layer with one neuron for each time series of the hierarchy, and the outputs for the higher-level series are defined as sums of the corresponding lower-level outputs; hence, the multivariate output is intrinsically coherent. The loss of the network can then be defined as the combined loss of all outputs, so that the objective function to be optimized takes into account not only the forecasting accuracy at the bottom level but at all levels (Ouyang et al. 2019). No ex-post reconciliation is required.
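
The principle of an intrinsically coherent output layer can be sketched in Python without any deep learning library. The sketch is not the architecture of Ouyang et al. (2019): a single linear layer stands in for the network, and the names `coherent_forward` and `combined_loss`, the squared-error loss and the random weights are assumptions made for illustration.

```python
import numpy as np

S = np.array([[1.0, 1.0, 1.0],   # nation
              [1.0, 0.0, 0.0],   # state A
              [0.0, 1.0, 0.0],   # state B
              [0.0, 0.0, 1.0]])  # state C

def coherent_forward(features, W, b):
    """Tiny stand-in for a network: a linear layer yields the bottom-level
    outputs, and a fixed summing layer S derives all higher-level outputs."""
    bottom = W @ features + b
    return S @ bottom

def combined_loss(y_pred, y_true):
    """Squared error summed over all series of the hierarchy, not only the bottom level."""
    return float(np.sum((y_pred - y_true) ** 2))

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), np.zeros(3)  # untrained, random weights
y_pred = coherent_forward(rng.normal(size=4), W, b)
loss = combined_loss(y_pred, np.array([60.0, 10.0, 20.0, 30.0]))
```

Whatever values the weights take, the first output always equals the sum of the other three, so coherence holds before, during and after training. Since the summing layer is linear, gradients of the combined loss can flow through it to the bottom-level outputs.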

While the flexibility regarding disease and available information of this statistical approach may be high due to the versatility of neural networks, it restricts the forecasting method to differentiable methods which can be expressed as an artificial neural network. It furthermore requires all time series to be forecasted in one model.

The general idea of jointly optimizing accuracy for all levels of the hierarchy is appealing, but the method has so far only been tested on one epidemiology-unrelated case and compared with bottom-up hierarchical forecasting using simple time series models (Ouyang et al. 2019). It is thus difficult to judge its overall performance.

Regarding post-hoc interpretability, methods to obtain explanations for artificial neural networks do exist (Lapuschkin 2019; Shrikumar et al. 2017). The quality of explanations may however depend on the activation functions used and the complexity of the layers.

Choice

The above insights suggest reconciliation via projection as the most promising strategy for the present application, because it has already been successfully tested in many applications and is sufficiently flexible. An optimization-based strategy may even yield more accurate forecasts but lacks interpretability due to the absence of a closed-form solution. On the other hand, the neural-network-based direct forecasting strategy would require producing all forecasts in one giant model and only works with differentiable components. Lastly, the traditional single-level strategies can even be included in the reconciliation via projection strategy as special cases.

4.3 Explanation

Last but not least, a type of local explanation to ensure post-hoc interpretability must be chosen. As previously discussed, the goal is not to fully elucidate the prediction model but to provide additional information about a specific forecast which helps to understand the underlying epidemiological situation. In the following section, different types of local explanations for statistical models which can be found in the literature are presented and evaluated with respect to their suitability.

Feature Attribution

The idea of feature attribution is to explain a forecast by assigning a "contribution score" to each input feature. The score describes how the current value of a feature influenced the forecast, i.e. the direction (positive or negative) and strength (absolute value) of the influence is indicated through the score (Sundararajan and Najmi 2019). Hence, the explanation is attached to the input features of the forecasting model. Because influence is measured as a change in prediction compared to some baseline, feature attributions are contrastive explanations (Merrick and Taly 2019). This property makes them intuitive to understand: As evidence from the social sciences suggests, human-friendly explanations should generally focus on the difference between the observed and some alternative reference scenario (Miller 2019). Moreover, because each feature receives its individual contribution score, the resulting explanations are additive in the sense that the influence of the individual features is assumed to add up (Janzing et al. 2019). This leads to simplicity but has the disadvantage that interactions between two or more features are not explicitly represented and instead divided between the individual features. However, it is also possible but computationally expensive to assign contribution scores to pairs or triplets of features to represent interactions between features (Lundberg et al. 2019).

Feature attribution methods have been proposed for a variety of models (Ribeiro et al. 2016a; Lundberg and Lee 2017; Lundberg et al. 2019; Sundararajan and Najmi 2019; Shrikumar et al. 2017). Lundberg and Lee (2017) proposed a mathematical notation which unifies all existing feature attribution methods and provided a theoretical foundation of feature attribution by linking it to a conceptually related problem in cooperative game theory. This latter connection has enabled an axiomatic approach in which desirable properties of the contribution scores are defined mathematically and used to evaluate and compare different feature attribution methods (Lundberg and Lee 2017; Sundararajan and Najmi 2019).

Counterfactual Explanation

Counterfactual explanations have in common with feature attribution that the explanations are contrastive and based on the input features. However, a counterfactual explanation does not indicate the general influence of the individual features. Instead, it proposes how the input features of the given instance would have to change in order to produce a different prediction than the current one (Wachter et al. 2018). Thus, explanations are of the form: “If the value of feature X was x′ instead of x, the prediction would be y′ instead of y”. Usually, the goal of counterfactual explanation is to find the minimal changes necessary to obtain a specific alternative prediction. This task is related to the concept of adversarial attack and generally involves a complicated search in the feature space around the present instance (Wachter et al. 2018). Here, a particular challenge is to decide how big or small a certain change in feature values would be in reality.

Counterfactual explanations are especially valuable in use cases where humans are interested in changing the current prediction to a more desired outcome and have the ability to tweak some of the features’ values to do so (Barocas et al. 2020). For the time series forecasting envisioned here, counterfactual explanations seem less appropriate because the features used are usually based on past observations which cannot be changed. Moreover, counterfactuals focus on small changes towards a predefined alternative prediction and may thus overemphasize features to which the model is locally sensitive. For illustration, consider a feature which indicates that the week to be forecasted is in the middle of the epidemic season, where case numbers are usually very high. Slight changes of the feature may not be enough to move the instance out of the epidemic season, so this feature might not be considered by a counterfactual explanation. Nevertheless, from an epidemiological standpoint, seasonality is a very important aspect of disease dynamics (Lauer et al. 2020) and should thus be part of an explanation.

Rule-Based Explanation

It has also been argued that shallow decision trees or short rule lists are easily understandable to humans (Christoph Molnar 2019). In such models, each prediction is defined through a set of conditions which must hold. The conditions are related to the input features of the model and usually of the form "a ≤ X ≤ b" for numerical variables, where a and b are interval boundaries, or "X = x" for categorical variables, where x is a concrete categorical value. An exemplary rule-based explanation would thus be: "The prediction is y because a ≤ X0 ≤ b ∧ X1 = x". Obviously, the set of conditions given in an explanation must not be too large to still be intelligible.

In practice, decision trees or rule lists with limited complexity may not provide sufficient accuracy (Christoph Molnar 2019), so it would not be acceptable to restrict forecasting to only such models. Nevertheless, rule-based explanations may be obtained from arbitrary models using an approach called Anchors: This method samples predictions from the neighborhood of the instance to be explained and identifies sets of conditions which "anchor" the current prediction, i.e. guarantee that the prediction will remain unchanged as long as the conditions hold (Ribeiro et al. 2018). The underlying goal of such an explanation is to point out the features which are decisive for the current prediction.

Two important limitations of such explanations must be considered. First, numeric predictions such as infectious disease forecasts are usually very sensitive to changes in feature values so that it may be impossible to identify a small set of anchoring conditions. For example, the forecast from a linear model will change with any feature which has a nonzero coefficient. Therefore, anchors for regression models must usually be binned, i.e. the prediction is only guaranteed to remain within a certain interval, creating a trade-off between sparsity and coverage of the explanation (Ribeiro et al. 2018). Second, rule-based explanations indicate the importance of features but not their effect on the forecast. For illustration, consider a simple linear model y = f(x0, x1) = x0 + x1. Here, two different forecasts f(10, −10) and f(−10, 10) would receive x0 = 10 ∧ x1 = −10 ⟹ 0 and x0 = −10 ∧ x1 = 10 ⟹ 0 as anchor explanation. From this explanation, it can only be inferred that x0 and x1 are influential, but not whether they increase or decrease the prediction.
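The second limitation can be made concrete with a small sketch. It assumes a linear model with unit coefficients, f(x0, x1) = x0 + x1, for which both instances yield a forecast of 0, and a single hypothetical reference point x′ = (0, 0). Under these assumptions, the contribution of a feature in a linear model is its coefficient times the deviation of its value from the reference, which exposes the sign that a rule-based explanation hides.

```python
import numpy as np

coef = np.array([1.0, 1.0])  # f(x0, x1) = x0 + x1 (assumed for illustration)
reference = np.zeros(2)      # hypothetical reference point x' = (0, 0)


def forecast(x):
    """Forecast of the linear model."""
    return float(coef @ np.asarray(x, dtype=float))


def attribution(x):
    """Per-feature contribution relative to the reference point."""
    return coef * (np.asarray(x, dtype=float) - reference)


first = attribution([10, -10])
second = attribution([-10, 10])
print(forecast([10, -10]), first)    # same forecast ...
print(forecast([-10, 10]), second)   # ... but opposite signs per feature
```

Both instances have the same forecast, yet the attributions show that x0 raises the forecast in the first case and lowers it in the second.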

Explanation by Example

Examples can be a human-friendly way of understanding through analogy, hence one explanation approach is to find instances which are “similar” to the current prediction, where similarity is defined with regard to the model’s decision logic (Kim et al. 2014; Lipton 2018). For example, an infectious disease forecast could be accompanied by references to previous points in time which are seen as similar by the model.

However, in order to provide such examples, a suitable method is required to measure the similarity of instances with respect to the model. Broadly speaking, similar instances should follow a similar decision path and arrive at a similar prediction (Caruana et al. 1999). For example, the latent representation of the input at a deep layer of a neural network could be used as distance space (Caruana et al. 1999). For other models, it may be more difficult to define a suitable distance measure. Lundberg et al. (2019) argue that similarity can also be identified using feature attribution, where the distance between two instances is defined by the distance of their features’ contribution scores.


Textual Explanation

If textual explanations from experts exist for past forecasts, it may also be possible to build an additional language model, for example using an artificial neural network, which maps textual explanations to the predictions (Krening et al. 2017). Thus, explanations would be produced by a separate "explanation model" which is tailored to the forecasting model and provides understandable explanations. An advantage of such an approach is that text is a very intuitive medium of explanation and that through appropriate choice of the "ground-truth" explanations used for training of the language model, meaningful and plausible explanations could be obtained. Nevertheless, the availability of a sufficiently large sample of explanations for training is a strong requirement and the construction of a suitable language model could be very complex. Moreover, a general risk may be that explanations are plausible but not well tied to the decision logic of the forecasting model and thus deceptive after all (Lipton 2018).

Choice

The concept of feature attribution appears to be the most viable form of explanation for the approach to be designed. Due to its generic formulation and the availability of both model-specific and model-agnostic methods for explanation, feature attribution can be applied to theoretically any regression model for forecasting, thus offering high method flexibility. The explanations are contrastive against a predefined baseline which could be chosen in an epidemiologically meaningful way. The direction and strength of influence appears suitable for interpretation in disease control: if a certain feature has high positive influence on the predicted number of cases, it should be relevant for the current outbreak dynamic. Counterfactual explanations seem more suited for a different use case, when the interest of the recipient is to change the model prediction to a different value and the feature values are directly modifiable. Textual explanations have the strong requirement of ground truth explanations for training of a language model and carry the risk of deception through misrepresentation of the model. Rule-based explanations are intuitive but less expressive (they only explain importance but not effect of features) and not well suited for numerical forecasts. Last but not least, feature attributions can also serve as a basis to provide explanation through examples because similarity in feature attribution can be used as a measure for similarity of instances with respect to the model.

5 Interpretable Hierarchical Forecasting

Based on the previous fundamental design choices (statistical methods for disease forecasting, reconciliation via projection for hierarchical forecasting and feature attribution for explanation), this chapter develops an approach for interpretable hierarchical forecasting of infectious diseases. To do so, the existing concepts are harmonized in a unified and abstract notation which allows the constituting elements of the approach to be stated in a precise manner while preserving the flexibility to choose the specific methods case by case. The notation is furthermore used to propose how the individual elements can be combined to produce hierarchical forecasts and corresponding explanations via feature attribution. Lastly, suitable choices for the configurable elements of the approach with respect to infectious disease surveillance are suggested.

5.1 Approach and Unified Notation

Hierarchical Time Series and Forecasting

As previously introduced, a hierarchical time series is a time series which can be disaggregated into a tree-like hierarchy with the total value at the top and the most disaggregated values at the bottom, as visualised in Figure 1. In the following, it will be assumed that this hierarchy has a set of $m$ different series, called $\mathcal{M} = \{1, 2, ..., m\}$. Each series $k \in \mathcal{M}$ stands for one individual time series that is a slice of the total time series at a specific aggregation level. For example, series $k = c_1$ in Figure 1 represents the time series aggregated at state level and filtered for “State A”. The values of the individual time series $k$ for a given discrete time interval from 1 to $T$ shall be denoted by $y_k = (y_k^1, y_k^2, ..., y_k^T)$. Furthermore, let $y^t = (y_1^t, y_2^t, ..., y_m^t)$ denote all individual time series values at time index $t$ stacked into a column vector. It is here assumed that the time step size is identical on all levels and defined by the resolution at the lowest level. For an intuitive order, assume that the time series are arranged starting with the top-level series and then traversing the tree breadth-first so that the bottom-level series come last.

[Figure 1: Geographical hierarchy of a nation with two states and three counties. Node labels: Nation, $y_{d_1} = y_{c_1} + y_{c_2}$; State A, $y_{c_1} = y_{b_1} + y_{b_2}$; State B, $y_{c_2} = y_{b_3}$; County 1, $y_{b_1}$. Source given in the figure: Wickramasuriya, Athanasopoulos, & Hyndman, 2019.]

Now, given this structure, a hierarchical time series has the constraint that for any point in time $t$, the value of any non-leaf series $k$ is equal to the sum of its child series’ values, i.e. $y_k^t = \sum_{j \in C} y_j^t$, where $C$ is the set of children of $k$. Recursive application of the above relationship reveals that the whole hierarchical time series is already sufficiently defined through its bottom-level series. To express this property formally, let $\mathcal{B} = \{b_1, b_2, ..., b_n\}$ be the set of the $n$ most disaggregated series at the bottom level and $y_{\mathcal{B}}^t = (y_{b_1}^t, y_{b_2}^t, ..., y_{b_n}^t)$ the vector of the corresponding time series values. Moreover, let $S$ be a so-called “summing matrix” of order $m \times n$. This matrix has one row for each time series $k$ of the hierarchy and one column for each bottom-level time series. Its binary entries can be read row-wise and indicate how the bottom-level series aggregate to the higher levels. The summing matrix thus represents the structure of the hierarchy, which allows the aggregation property of the hierarchical time series to be expressed concisely as $y^t = S y_{\mathcal{B}}^t$. An alternative formulation using only matrix operations is $y^t = S J y^t$, with a matrix $J = \begin{bmatrix} 0_{n \times (m-n)} & I_n \end{bmatrix}$ that simply selects the bottom-level time series from the full vector. This formulation will later be used to also define a coherence constraint for the explanations.

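For illustration, the summing matrix and vectors for the hierarchy in Figure 1 are:

\[
y^t = \begin{pmatrix} y_{d_1}^t \\ y_{c_1}^t \\ y_{c_2}^t \\ y_{b_1}^t \\ y_{b_2}^t \\ y_{b_3}^t \end{pmatrix}, \quad
y_{\mathcal{B}}^t = \begin{pmatrix} y_{b_1}^t \\ y_{b_2}^t \\ y_{b_3}^t \end{pmatrix}, \quad
S = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\]

so that

\[
\begin{pmatrix} y_{d_1}^t \\ y_{c_1}^t \\ y_{c_2}^t \\ y_{b_1}^t \\ y_{b_2}^t \\ y_{b_3}^t \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} y_{b_1}^t \\ y_{b_2}^t \\ y_{b_3}^t \end{pmatrix}
\]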
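The construction of the summing matrix and the selection matrix can also be sketched programmatically. The following Python snippet is a minimal illustration for the hierarchy of Figure 1; the dictionary `children`, the helper `leaves` and the county-level case counts are hypothetical names and numbers introduced only for this example.

```python
import numpy as np

# Hierarchy of Figure 1, series ordered breadth-first (top level first).
series = ["d1", "c1", "c2", "b1", "b2", "b3"]
children = {"d1": ["c1", "c2"], "c1": ["b1", "b2"], "c2": ["b3"]}
bottom = [k for k in series if k not in children]


def leaves(k):
    """Return the bottom-level series that aggregate into series k."""
    if k not in children:
        return [k]
    return [leaf for child in children[k] for leaf in leaves(child)]


m, n = len(series), len(bottom)

# Summing matrix S (m x n): entry is 1 if the bottom series feeds into the row series.
S = np.array([[1 if b in leaves(k) else 0 for b in bottom] for k in series])

# Selection matrix J (n x m) = [0 | I_n]: picks the bottom-level entries of y.
J = np.hstack([np.zeros((n, m - n), dtype=int), np.eye(n, dtype=int)])

y_B = np.array([4, 6, 5])  # made-up county-level case counts
y = S @ y_B                # values for the whole hierarchy
print(S)
print(y)
print(np.array_equal(S @ J @ y, y))  # y = S J y holds for a coherent vector
```

Reading `S` row-wise reproduces the aggregation rules of the figure, e.g. the second row states that State A is the sum of the first two counties.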

Next, the statistical forecasting method can be integrated into the notation. As already mentioned, the approach suggested here is that of reconciliation via projection, where base forecasts for the time series are first produced individually and then reconciled using a suitable transformation. Because the reconciliation step is independent of the base forecasts, each time series $k \in \mathcal{M}$ could be forecasted with a different model, represented by an arbitrary function $\hat{f}_k$ in the following. Moreover, all $p$ input features used by the different forecasting models are jointly labeled by the set $\mathcal{P} = \{1, 2, ..., p\}$ and a concrete input instance is written as $x = (x_1, x_2, ..., x_p) = (x_i)_{i \in \mathcal{P}}$. In practice, it is rather unrealistic that every forecasting model $\hat{f}_k$ uses the exact same set of features. Instead, models would be specialised on the features which describe the recent past of the time series they are meant to predict, so that a county-level model would primarily use information about its county and the national-level model would primarily use information aggregated at the national level. Of course, the accuracy of forecasts may be improved by also considering features outside of the subpopulation to predict (this possibility will be addressed in section 5.2), but a total match of features between all base forecast models is still unrealistic. Nevertheless, for notational purposes regarding the feature attribution which will become apparent soon, it is vital to assume that the individual functions $\hat{f}_k$ all take the same, complete set of features as input. One should therefore imagine that the functions usually ignore a large part of their input and only operate on a selected set of features that the respective model would use in practice.

The focus will here be on direct forecasting, i.e. forecasts for multiple steps into the future are produced directly and not through recursive application of single-step-ahead forecasts. Given this assumption, the base forecasts for a certain time in the future can be written as $\hat{y}_k^{t+h} = \hat{f}_k(x^t)$, where $t$ represents the current time and $h$ a fixed forecast horizon, so that $x^t$ is the current state of knowledge expressed through the input features for the model. Accordingly, $\hat{y}^{t+h} = (\hat{y}_1^{t+h}, \hat{y}_2^{t+h}, ..., \hat{y}_m^{t+h})$ is the stacked vector of all base forecasts. For conciseness, the time index will further be omitted. Moreover, the collection of individual forecasting functions $\hat{f}_k$ can be summarized in a vector function $\hat{f} : \mathbb{R}^p \to \mathbb{R}^m, x \mapsto \hat{f}(x)$, which shall be called base forecast function in the following.

Next, the reconciliation step is defined. As Hyndman et al. (2011) show, all existing linear reconciliation strategies can be expressed using the summing matrix $S$ and a second matrix $P$ of order $n \times m$ to obtain the coherent, reconciled forecasts $\tilde{y}$ with $\tilde{y} = S P \hat{y}$: First, the bottom-level forecasts are obtained as linear combinations of all base forecasts by multiplication of $\hat{y}$ with $P$. Suitable choices for $P$ are presented in section 5.3. The thereby revised bottom-level forecasts are then summed up according to the hierarchy structure through multiplication with the summing matrix $S$ in order to obtain the reconciled forecasts for all levels. This transformation can also be expressed through a second vector function $\tilde{f}$, called the reconciled forecast function, which produces the corresponding coherent, reconciled forecasts such that $\tilde{y} = S \tilde{y}_{\mathcal{B}}$:

\[
\tilde{f} : \mathbb{R}^p \to \mathbb{R}^m, \; x \mapsto \tilde{f}(x) \;\Big|\; \tilde{f}(x) = S P \hat{f}(x)
\]
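The two-step transformation can be illustrated numerically. The sketch below uses the hierarchy of Figure 1 and two example choices of the second matrix that are assumptions made here for illustration only: a bottom-up matrix that simply selects the bottom-level base forecasts, and the ordinary least squares projection $(S^\top S)^{-1} S^\top$. The base forecast values are made up, and the thesis defers the actual choice of this matrix to section 5.3.

```python
import numpy as np

# Summing matrix of the Figure 1 hierarchy (nation, 2 states, 3 counties).
S = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)
m, n = S.shape
J = np.hstack([np.zeros((n, m - n)), np.eye(n)])

# Two illustrative choices for the n x m matrix P (assumptions, see lead-in).
P_bottom_up = J                         # keep only the bottom-level base forecasts
P_ols = np.linalg.solve(S.T @ S, S.T)   # (S'S)^{-1} S', least squares projection


def reconcile(P, y_hat):
    """Map base forecasts to coherent forecasts: first P, then S."""
    return S @ P @ np.asarray(y_hat, dtype=float)


# Made-up, incoherent base forecasts: the national forecast (16)
# exceeds the sum of the county forecasts (15).
y_hat = np.array([16.0, 10.0, 5.0, 4.0, 6.0, 5.0])

y_bu = reconcile(P_bottom_up, y_hat)
y_ols = reconcile(P_ols, y_hat)
print(y_bu)
print(y_ols)
print(np.allclose(y_ols, S @ J @ y_ols))  # reconciled forecasts are coherent
```

The bottom-up variant discards the information in the higher-level base forecasts, whereas the least squares variant spreads the discrepancy over all series; both return a vector that satisfies the aggregation constraint.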

Hierarchical Feature Attribution

Given the input features and the models, represented as vector forecasting function, the concept of feature attribution can be introduced: Let $\phi(i, x, f, R)$ be called feature attribution (other names sometimes used are “contribution”, “effect”, “influence”, “relevance” or “importance”), defined as a function of four inputs²: the feature $i \in \mathcal{P}$ to attribute, the current instance $x$ to be explained, the forecasting function $f$, and a reference against which the current instance is compared, denoted $R$. The reference $R = (R_1, R_2, ..., R_p)$ is assumed to be a discrete, multivariate random variable which has the same dimension as the set of input features. Its distribution describes the probability of observing a specific input to the forecasting model and serves as a baseline that would be expected as “default”, if the concrete input was not known. Accordingly, the influence of a feature in a specific forecast will be its attributed change of the forecast compared to the reference. The corresponding expected default forecast is given by $E[f(R)]$. In practice, the distribution of $R$ may be provided empirically through a reference dataset, that is a set of observations which are defined as the default for the use case. A standard choice of the reference dataset is the whole training dataset. Nevertheless, $R$ may also represent a single instance $x'$ and thus have the density

\[
p(r) = \begin{cases} 1, & \text{if } r = x' \\ 0, & \text{else} \end{cases}
\]

so that $E[f(R)] = f(x')$, meaning that the reference value is essentially the value of $f$ at a reference point $x'$. For example, $x' = (0, 0, ..., 0)$ would mean that the default situation is when all features are set to zero.

² With the aim of a sound but flexible formulation, Lundberg and Lee (2017)’s notation is used here but extended to also accommodate other feature attribution approaches with different points of reference against which the current prediction is compared (e.g. Ribeiro et al. (2016b); Lapuschkin (2019); Shrikumar et al. (2017)). This furthermore follows Merrick and Taly (2019)’s call for a configurable reference distribution.
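The two ways of specifying the reference can be contrasted in a short sketch. The forecast function `f`, the reference dataset and the reference point below are hypothetical and serve only to show how the expected default forecast could be estimated empirically.

```python
import numpy as np


def f(x):
    """Hypothetical forecast function for two series and three features."""
    x = np.asarray(x, dtype=float)
    return np.array([2.0 * x[0] + x[1], x[0] * x[2]])


def expected_reference_forecast(f, reference):
    """Estimate E[f(R)] as the mean forecast over the rows of the reference."""
    reference = np.atleast_2d(np.asarray(reference, dtype=float))
    return np.mean([f(r) for r in reference], axis=0)


# Empirical reference distribution: a made-up reference dataset.
reference_data = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
])
default = expected_reference_forecast(f, reference_data)

# Single reference point x' = (0, 0, 0): E[f(R)] reduces to f(x').
x_prime = np.zeros(3)
point_default = expected_reference_forecast(f, x_prime)

print(default)
print(point_default)
```

Note that for the second, nonlinear component the mean forecast over the reference dataset differs from the forecast at the mean feature values, which is why the expectation is taken over forecasts rather than over inputs.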

Given the instance $x$, forecast function $f$ and reference $R$, an attribution method can be defined as a function $\phi : (i, x, f, R) \mapsto \phi(i, x, f, R)$, hereafter called feature attribution function, that computes the attribution for feature $i$. The explanation for a forecast $f(x)$ can then be summarized in an additive model of feature attributions with simplified input $z \in \{0, 1\}^p$ (Lundberg and Lee 2017):

\[
g(z) = E[f(R)] + \sum_{i \in \mathcal{P}} \phi(i, x, f, R) \cdot z_i
\]

This explanation model describes how each feature contributes (defined by the feature attributions as weights) to the difference between the present forecast and the default forecast. The simplified inputs $z_i$ indicate the selection of a subset of features. Thus, for example, $g(z)$ with $z = (1, 1, 0, 0, ..., 0)$ yields the default forecast plus the joint influence of the features 1 and 2.
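A minimal sketch of such an explanation model is given below for a single series. It assumes a hypothetical linear forecast function, for which the attribution of a feature can be written as its coefficient times the deviation of its value from the mean of the reference data; the weights, the reference dataset and the instance are made up for illustration.

```python
import numpy as np

weights = np.array([0.5, 2.0, -1.0])  # hypothetical linear model coefficients
intercept = 3.0


def f(x):
    """Hypothetical linear forecast function for one series."""
    return float(weights @ np.asarray(x, dtype=float) + intercept)


reference_data = np.array([[2.0, 1.0, 0.0], [4.0, 3.0, 2.0]])  # made-up reference
x = np.array([6.0, 1.0, 4.0])                                  # instance to explain

# Expected default forecast E[f(R)] over the reference rows.
default = float(np.mean([f(r) for r in reference_data]))

# Attribution per feature for a linear model (assumption stated in the lead-in).
phi = weights * (x - reference_data.mean(axis=0))


def g(z):
    """Additive explanation model: default forecast plus selected attributions."""
    return default + float(phi @ np.asarray(z, dtype=float))


print(g([0, 0, 0]))  # default forecast only
print(g([1, 1, 0]))  # default forecast plus influence of features 1 and 2
print(g([1, 1, 1]))  # all features selected
print(f(x))
```

In this linear sketch, selecting all features recovers the forecast for the instance, while selecting none returns the default forecast.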

In order to integrate with hierarchical forecasting, the concept of feature attribution must be extended so that a separate attribution for each individual forecast in the hierarchy is provided (because each individual forecast should be explainable). Therefore, it will be assumed that to explain a single forecast $y_k$ in the hierarchy, an appropriate feature attribution method $\phi_k$ is used to compute an attribution $\phi_k(i, x, f_k, R)$ for each feature $i$. Because even varying types of forecasting models could be used for the individual time series, the corresponding attribution methods may differ as well. However, the instance $x$ and also the reference $R$ are identical for the whole hierarchy. Then, the set of attribution functions can be summarized in a vector function $\phi$ which directly operates on the vector forecast function $f : \mathbb{R}^p \to \mathbb{R}^m$ (this could be the base or reconciled forecast function) to compute the attributions of feature $i$ for the complete hierarchical forecast $y$, i.e. $\phi(i, x, f, R) = (\phi_k(i, x, f_k, R))_{k \in \mathcal{M}} \in \mathbb{R}^m$. The resulting $m$-dimensional vector of attributions $\phi(i, x, f, R)$ for feature $i$ will hereinafter be called hierarchical feature attribution.


Desirable Properties of Hierarchical Feature Attributions

Given the above concise definition of feature attribution in the hierarchical case, it can now be specified what constitutes an appropriate coupling of explanation and model. To do so, a number of desirable properties which should be satisfied by the hierarchical feature attributions are presented in the following. These properties will later be used to evaluate the soundness of the approach proposed in this work.

Property 1 (Coherence): Coherence has already been defined as an objective of hierarchical forecasting in order to prevent contradictions between forecasts for different levels of the hierarchy. It is evident that the same applies to hierarchical attributions: if local explanations were incoherent, users could come to different conclusions about the importance of a feature value or even its direction of effect on the overall prediction when looking at explanations for different levels of the hierarchy. On the other hand, if hierarchical attributions are coherent, it is sufficient to know the attributions at one level in order to deduce the attributions for all higher levels.

Therefore, it will be demanded that the attributions of a feature $i$ for all individual forecasts follow the same aggregation structure as the hierarchical time series. This means for example that the sum of a feature’s contributions to the bottom-level forecasts must be equal to its contribution to the top-level forecast. The property of coherence can be expressed as follows:

Coherence.
Let $J$ be a bottom-level selection matrix of order $n \times m$ with $J = \begin{bmatrix} 0_{n \times (m-n)} & I_n \end{bmatrix}$. If an attribution function $\phi$ is coherent, the following equality holds for all features $i \in \mathcal{P}$:

\[
\phi(i, x, f, R) = S J \phi(i, x, f, R)
\]

The matrix $J$ selects only attributions of feature $i$ for the bottom-level forecasts. The equality thus means that the whole attributions are already sufficiently determined by the bottom-level attributions and the aggregation structure of the hierarchy.
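The coherence condition can be checked mechanically once the attributions of all features are arranged in a matrix with one row per series and one column per feature. The sketch below uses the hierarchy of Figure 1; the attribution values and the helper `is_coherent` are hypothetical, and aggregating the bottom-level rows is used here only as a simple way to construct a coherent example.

```python
import numpy as np

# Summing and selection matrix of the Figure 1 hierarchy.
S = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)
m, n = S.shape
J = np.hstack([np.zeros((n, m - n)), np.eye(n)])


def is_coherent(phi, tol=1e-9):
    """Check phi = S J phi for an (m x p) matrix with one column per feature."""
    phi = np.asarray(phi, dtype=float)
    return bool(np.allclose(phi, S @ J @ phi, atol=tol))


# Made-up attributions of two features for the six series. For feature 1,
# the top-level attribution (5) differs from the sum of the county-level
# attributions (3), so the matrix is incoherent.
phi_incoherent = np.array([
    [5.0, 1.0],
    [2.0, 0.0],
    [1.0, 1.0],
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 1.0],
])

# Coherent variant: keep the bottom-level rows and aggregate them upwards.
phi_coherent = S @ J @ phi_incoherent

print(is_coherent(phi_incoherent))
print(is_coherent(phi_coherent))
print(phi_coherent[0])  # top-level attributions implied by the bottom level
```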

Moreover, several further properties have already been suggested for normal feature attribution by other authors with the aim to ensure that the explanations obtained are closely tied to the decision logic of the model. Surprisingly, these properties have been borrowed from a different field, namely cooperative game theory: Feature attribution can be framed as a game-theoretic problem by viewing the input features $x = (x_1, x_2, ..., x_p)$ as players and $f(x)$ as the collective payoff for all players (Lundberg and Lee 2017). The problem of allocating a share of the total payoff to each feature/player $i$ due to their contribution to the collective result can then be seen as similar to feature attribution. This connection has not only been used to inspire feature attribution methods based on game-theoretic allocation schemes, but also to suggest several theoretical properties known in game theory as desirable in feature attribution too (Lundberg and Lee 2017; Sundararajan and Najmi 2019; Janzing et al. 2019). In the following, these properties will be transferred to the hierarchical setting and harmonized with the more generic notation of feature attribution used in this work.

Property 2 (Completeness): This property is analogous to the univariate case and demands that the attribution vectors for the features exactly sum up to the difference between the forecast and the reference forecast (Sundararajan et al. 2017). It ensures that the complete deviation of the forecast from the reference is attributed to the input features.

Completeness.
If an attribution function $\phi$ is complete, the sum of attributions for all features equals the difference between the prediction and the expected reference prediction³:

\[
\sum_{i \in \mathcal{P}} \phi(i, x, f, R) = f(x) - E[f(R)]
\]

It is very important to note that by defining completeness as above, all effects must be attributed to the individual features in order to fulfil the property. Therefore, interaction effects between features must somehow be distributed to the single features by the attribution method. An extension to pairs of features has been proposed (Lundberg et al. 2019), but is not included in this work in order to limit complexity. Nevertheless, the above definition of completeness is slightly more general than previous formulations, which have either been limited to a single reference forecast (Sundararajan et al. 2017) or to the expected value of the forecast (Lundberg and Lee 2017). Depending on the choice of $R$, both cases are represented here.

Property 3 (Strong Monotonicity): Strong monotonicity requires that given an instance $x$, if a feature $i$ has an equal or stronger “marginal contribution” to a forecast $f(x)$ than on some other forecast $f'(x)$, irrespective of the values of the other features, then the attribution of $i$ for $f$ should be at least as high as for $f'$ (Lundberg and Lee 2017). For example, consider two simple linear forecasting models $f$ and $f'$, which both have one feature $a$ describing the seasonal variation of disease case numbers and one feature $b$ describing the current number of cases. Now, if $f$ had a higher coefficient for $b$ than $f'$, $b$ would have a stronger influence on the numbers of cases predicted by $f$ compared to $f'$, regardless of the current seasonality $a$. In such a case, one would expect that $b$ should receive a higher attribution for a forecast $f(x)$ than for $f'(x)$.

In order to formally specify this property, a definition of “marginal contribution” of a feature is needed. Lundberg and Lee (2017) defined it as the difference in expected forecast when the feature is present or “missing”, given a set of already present features. However, they did not specify upfront how exactly absence of a feature is to be interpreted and what it means for the forecast. To avoid this ambiguity, an interventional perspective according to Janzing et al. (2019) is taken here explicitly: The default is defined as the expected forecast $E[f(X)]$ when all feature values are jointly drawn from the reference distribution, i.e. $X = R$. When a feature is “present”, its value is fixed to the value of the instance to be explained, i.e. $X_i = x_i$, without effect on the expected values of the other features. The resulting forecast can be represented by taking the marginal expectation over the missing features:

³ Note that $\phi(i, x, f, R)$, $f(x)$ and $E[f(R)]$ are $m$-dimensional vectors of the different time series $k \in \mathcal{M}$.

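Definition.
Let $v(\mathcal{S}, x, f, R) = E[f(x_{\mathcal{S}}, R_{\mathcal{P} \setminus \mathcal{S}})]$ be the marginal expectation of $f(X)$ over a reference distribution $R$ with a set of features $\mathcal{S} \subseteq \mathcal{P}$ fixed to $x_{\mathcal{S}}$⁴. Moreover, let $v^i(\mathcal{Z}, x, f, R) = v(\mathcal{Z} \cup \{i\}, x, f, R) - v(\mathcal{Z}, x, f, R)$ describe the expected marginal contribution of feature $i$ when features $\mathcal{Z}$ are already fixed.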

The difference of expectations $v^i(\mathcal{Z}, x, f, R) = v(\mathcal{Z} \cup \{i\}, x, f, R) - v(\mathcal{Z}, x, f, R)$ is used to describe the expected change in the forecast when feature $i$ is set to its value $x_i$, assuming that the features in $\mathcal{Z}$ are already fixed. Such a marginal contribution also takes into account potential interactions between feature $i$ and the features in $\mathcal{Z}$. With this interpretation of contribution, strong monotonicity can be defined as follows:
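The marginal expectation and the marginal contribution can be estimated from an empirical reference dataset by overwriting the fixed features in every reference row. The following sketch does this for a single series; the function names `v` and `v_i`, the forecast function with an interaction term and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def v(fixed, x, f, reference):
    """Estimate E[f(x_S, R_{P minus S})]: features in `fixed` take their value
    from x, all other features keep the values of the reference rows."""
    samples = np.array(reference, dtype=float)  # copy of the reference dataset
    idx = sorted(fixed)
    if idx:
        samples[:, idx] = np.asarray(x, dtype=float)[idx]
    return float(np.mean([f(row) for row in samples]))


def v_i(i, fixed, x, f, reference):
    """Marginal contribution of feature i when the features in `fixed` are set."""
    return v(set(fixed) | {i}, x, f, reference) - v(set(fixed), x, f, reference)


def f(x):
    """Hypothetical forecast function with an interaction between features 0 and 1."""
    return x[0] * x[1] + x[2]


reference = np.array([[0.0, 1.0, 0.0], [2.0, 3.0, 4.0]])  # made-up reference rows
x = np.array([1.0, 2.0, 3.0])                             # instance to explain

print(v(set(), x, f, reference))        # default forecast E[f(R)]
print(v({0, 1, 2}, x, f, reference))    # all features fixed: equals f(x)
print(v_i(0, set(), x, f, reference))   # contribution of feature 0 on its own
print(v_i(0, {1}, x, f, reference))     # contribution once feature 1 is fixed
```

Because of the interaction term, the marginal contribution of feature 0 depends on whether feature 1 has already been fixed, which is exactly the dependence on the set of fixed features that the definition captures.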

Strong Monotonicity.
Let $f'$ and $f''$ be two functions with similar inputs. If an attribution function $\phi$ is strongly monotonic, the following implication holds for all features $i \in \mathcal{P}$ and all individual time series $k \in \mathcal{M}$:

\[
\text{If } v_k^i(\mathcal{Z}, x, f', R) \geq v_k^i(\mathcal{Z}, x, f'', R) \text{ for all } \mathcal{Z} \subseteq (\mathcal{P} \setminus \{i\}),
\]
\[
\text{then } \phi_k(i, x, f', R) \geq \phi_k(i, x, f'', R)
\]

Therefore, if $f''$ changes to $f'$ and the contribution of feature $i$ does not decrease independent of the other features (all marginal contributions increase or remain unchanged irrespective of which features are considered as fixed), then its attribution must not decrease either.

⁴ To be precise, the marginal expectation here means $E[f(x_{\mathcal{S}}, R_{\mathcal{P} \setminus \mathcal{S}})] = \int f(x_{\mathcal{S}}, r_{\mathcal{P} \setminus \mathcal{S}}) \, p(r_{\mathcal{P} \setminus \mathcal{S}}) \, dr_{\mathcal{P} \setminus \mathcal{S}}$.

Property 4 (Symmetry): The symmetry property demands that, given an instance $x$, features which influence the function symmetrically receive equal attribution (Sundararajan and Najmi 2019; Janzing et al. 2019). This requirement is very intuitive. For illustration, take the example of the case numbers in two counties which both equally influence the forecast for a state. In an explanation for the state forecast, these features should have equal importance. Just like in the case of strong monotonicity, marginal contributions are used to define this property:

Symmetry. If an attribution function $\phi$ is symmetric, the following implication holds for two features $i, j \in \mathcal{P}$ and all individual time series $k \in \mathcal{M}$:

If $v_k(\mathcal{Z} \cup \{i\}, x, f, R) = v_k(\mathcal{Z} \cup \{j\}, x, f, R)$ for all $\mathcal{Z} \subseteq (\mathcal{P} \setminus \{i, j\})$,

then $\phi_k(i, x, f, R) = \phi_k(j, x, f, R)$

Property 5 (Dummy Null Effect): Features which are not at all used by the forecast function (that is, in game-theoretic parlance, they are dummy players) should receive an attribution of zero (Sundararajan and Najmi 2019) because any explanation which assigns importance to irrelevant features would be misleading. This property is also relevant with regard to the formulation of forecast functions in this work: It has been assumed that all base forecast functions $\hat{f}_k$ take the same feature vector $x$ as input, but ignore many of the features. The dummy property ensures that these features receive no attribution and are thus not part of the explanation.

If a feature $i$ is ignored by a forecast function, the forecast will be independent of the dummy feature’s value. This fact is represented below by stating that predictions where the dummy feature is set to arbitrary values $r'_i$ of the reference distribution are all equal to the predictions where the feature is set to its true value.

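Dummy Null Effect. If an attribution function $\phi$ has null effects for dummies, the following implication holds for all features $i \in \mathcal{P}$ and all individual time series $k \in \mathcal{M}$:

If $f_k(x_i, x_{\mathcal{Z} \setminus \{i\}}, r_{\mathcal{P} \setminus (\mathcal{Z} \cup \{i\})}) = f_k(r_i, x_{\mathcal{Z} \setminus \{i\}}, r_{\mathcal{P} \setminus (\mathcal{Z} \cup \{i\})})$ for all $r \in \mathrm{supp}(R)$, $\mathcal{Z} \subseteq \mathcal{P}$,

then $\phi_k(i, x, f, R) = 0$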

Property 6 (Additivity): Finally, an attribution method that satisfies additivity preserves the linearity of the function it is applied to by summing up the attributions for the function’s additive components (Janzing et al. 2019):

Additivity. Let $f$, $f'$ and $g$ be three functions with identical inputs and $g(x) = f(x) + f'(x)$. If an attribution function $\phi$ is additive, the following equality holds for all features $i \in \mathcal{P}$:

$$\phi(i, x, g, R) = \phi(i, x, f, R) + \phi(i, x, f', R)$$

Although widely used in the literature on feature attribution, this property is a rather questionable one, because no substantive reason can be given as to why attributions should always be additive to represent models in practice. On the contrary, while additivity of explanations would be a natural demand for additive models, it is a distorted representation of multiplicative models (Kumar et al. 2020). Instead, this property is rather desired from a methodical standpoint, because additive explanations exhibit convenient mathematical qualities and are thus easy to handle, for example because they can be summarized using aggregation functions like the mean. This latter point may be important to obtain more concise explanations. Hence, despite doubts about its relevance for truthful explanation, additivity is included here as well. In fact, this property will be of particular importance for computing attributions for reconciled forecasts, as will be addressed in the next section.

Computation of Hierarchical Feature Attributions

As already mentioned, existing methods for feature attribution are univariate and must be combined in some way to produce a hierarchical feature attribution function $\phi$. The question is therefore how feature attributions for the reconciled forecasts are best obtained.

Here, a first observation to make is that for the base forecasts $\hat{y} = \hat{f}(x)$, feature attribution can be performed straightforwardly by using univariate feature attribution methods $\hat{\phi}_k$ for each independent time series forecast and then stacking all attributions for a feature $i$ into an $m$-dimensional vector to obtain a hierarchical attribution. In order to obtain attributions for the reconciled forecasts, a naive approach would be to treat the individual reconciled forecast functions $\tilde{f}_k$ as arbitrary functions and proceed similarly to the case of base forecasts by first applying a univariate attribution method to the individual reconciled forecasts of the hierarchy and then stacking the results together. Without further assumptions, a model-agnostic attribution method would have to be used that approximates the influence of the features through sampling of $\tilde{f}_k$. However, this procedure has several important downsides:

First and foremost, model-agnostic attribution is generally inefficient compared to model-specific methods, which construct the attribution function directly from a model’s parameters. Because there is a clear trade-off between sampling effort and accuracy of the attributions, a considerable number of function calls will be necessary to obtain reliable estimates (Lundberg et al. 2019). Second, this general disadvantage of sampling-based feature attribution is substantially aggravated in the hierarchical setting, because each reconciled forecast $\tilde{y}_k$ is a combination of all base forecasts $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m$, so each base forecast function must be evaluated to compute one reconciled forecast. As $m$ can be very large, function calls of $\tilde{f}$ may be extremely costly, rendering sufficient sampling prohibitive. Third, not only may sampling become inefficient, but an exceptional number of samples could be required too. To understand why this is the case, recall that the base forecasts are independent and specialised on one time series $k$, so they will usually work with a limited set of features specific to $k$, such as the reproduction number in a specific state. The reconciled forecast functions $\tilde{f}_k$ however are constructed from the base forecasts for all time series and consequently depend on all features $\mathcal{P}$ used by any of the base forecast functions. A model-agnostic attribution method applied to $\tilde{f}_k$ would therefore have to explore a high-dimensional feature space and require extensive sampling. This is especially true for attribution methods which account for potential interactions between the features, leading to combinatorial explosion if too many features are present.

Considering these problems, it becomes apparent that a naive, model-agnostic approach to feature attribution for reconciled forecasts will fail in practice due to prohibitive computational effort, even with hierarchies of moderate size. In the following, a more efficient solution, which is aligned with the projection performed during forecast reconciliation, will be suggested.

The central notion of the proposed approach is that the reconciled forecasts are linear combinations of the base forecasts, $\tilde{f}(x) = SP\hat{f}(x)$. Hence, if one now confines the set of applicable attribution methods to those which fulfil the property of additivity, the following equality can be obtained, which provides a simple solution to the problem of feature attribution for reconciled forecasts:

$$\tilde{\phi}(i, x, \tilde{f}, R) = SP\,\hat{\phi}(i, x, \hat{f}, R)$$

In words, the attributions for reconciled forecasts can be obtained by first computing attributions for the base forecasts and then applying the same projection that has been used for forecast reconciliation. This procedure has several advantages. Most importantly, the method for attribution of the base forecasts is not constrained to model-agnostic approaches, meaning that efficient and accurate model-specific attribution methods could be used for the base forecast models. Moreover, even if no model-specific method exists for the base forecast models, there will be significant performance gains compared to the naive approach, since the base forecast models can be called independently for function evaluation and their attributions can be reused for all levels of the hierarchy. Furthermore, as has been argued, each base forecast function will usually operate only on a small subset $\mathcal{P}_k \subset \mathcal{P}$ of all features. Given the dummy null effect property, the attribution for $\hat{f}_k(x)$ will be quite sparse with $\hat{\phi}_k(i, x, \hat{f}_k, R) = 0 \;\, \forall i \notin \mathcal{P}_k$ and can be computed more efficiently on the small subset $\mathcal{P}_k$ of features.
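The projection of attributions can be sketched as follows. The example assumes a toy hierarchy with two bottom-level series, linear base models, and the attribution weight times (feature value minus reference mean), which is additive for linear models. The OLS projection is used only as a placeholder for $P$, and all numbers and variable names are hypothetical.

```python
import numpy as np

# Assumed toy hierarchy: total = region A + region B (m = 3 series, n = 2 bottom-level).
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
P = np.linalg.inv(S.T @ S) @ S.T  # placeholder projection matrix (OLS)

# Hypothetical linear base models: y_hat_k = weights[k] @ x + intercepts[k].
# Zero weights mark features that a base model ignores (dummy features).
weights = np.array([[0.5, 0.4, 0.0],
                    [0.9, 0.0, 0.0],
                    [0.0, 0.7, 0.2]])
intercepts = np.array([1.0, 0.5, 0.2])

x = np.array([10.0, 6.0, 3.0])              # instance to be explained
reference_mean = np.array([8.0, 5.0, 1.0])  # mean of the assumed reference distribution

def base_forecast(features):
    return weights @ features + intercepts

def reconciled_forecast(features):
    return S @ P @ base_forecast(features)

# Base attributions, one row per time series and one column per feature.
base_attributions = weights * (x - reference_mean)
# Reconciled attributions: apply the same projection SP as for the forecasts.
reconciled_attributions = S @ P @ base_attributions
```

In this sketch, the rows of `reconciled_attributions` sum to the difference between the reconciled forecast for `x` and the reconciled forecast at the reference mean, and the attributions themselves add up along the hierarchy in the same way as the reconciled forecasts.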

Summary

To summarize, the constituting elements of the approach have been defined as the base forecast function $\hat{f}$, the reconciled forecast function $\tilde{f}$ and the hierarchical feature attribution function $\phi$, which uses existing univariate attribution methods to compute attributions for the base forecasts and applies the linear projection $SP$ to them to obtain attributions for the reconciled forecasts. The elements and their interrelationships are visualised in Figure 2.

[Figure 2 Hierarchical forecasting with reconciliation and feature attribution. Diagram: the time series of a hierarchy with aggregation constraints are forecast individually, ignoring the constraints, which yields the base forecasts $\hat{f}(x)$ and the base attributions $\hat{\phi}(i, x, \hat{f}, R)$; the identical projection $SP$ is applied to both, yielding the reconciled forecasts $\tilde{f}(x) = SP\hat{f}(x)$ and the reconciled attributions $\tilde{\phi}(i, x, \tilde{f}, R) = SP\hat{\phi}(i, x, \hat{f}, R)$.]

Furthermore, six properties have been proposed that should be fulfilled by the attribution function $\phi$. However, the notation so far only provides a generic structure and the following elements are configurable:

• $\mathcal{P}$, $\hat{f}_k$: the set of features and the individual forecasting models

• $P$: the matrix for reconciliation

• $\hat{\phi}_k$, $R$: the individual base attribution methods and the reference distribution

In the next section, suitable choices for these aspects will be illustrated.

5.2 Features and Forecasting Models

The possibilities for statistical modeling of infectious disease time series are too manifold to be covered here in detail. The goal of this section is thus to provide an overview of the range of possible features and models and to describe how to harmonize them with the interpretable hierarchical forecasting approach proposed here.

Features

With respect to infectious disease surveillance, a central dichotomy used in modeling and also feature design is the division into “seasonal” or “non-epidemic” components on the one hand and “autoregressive” or “epidemic” components on the other (Corberán-Vallet and Lawson 2014). The underlying rationale is that non-epidemic components model the usual development of the case numbers over time, while epidemic components represent recent, abnormal behaviour. This is not only a promising strategy for accurate forecasting, but may also improve interpretability because the distinction between non-epidemic and epidemic elements in an explanation aligns well with outbreak detection: one could straightforwardly interpret a high contribution of epidemic features to a forecast as an indication of a potential outbreak.

The most basic set of features which can be used are raw values of the time series to be forecast (Desai et al. 2019). When forecasting count data like reported cases, hospitalizations or deaths, this simply means using past case numbers as input to the forecasting model. Such variables are also called lagged features (Kane et al. 2014). While features lagged a few time steps into the past are most prominent as epidemic features, one could also imagine using strongly lagged features (e.g. case numbers from the same week last year) as non-epidemic features.

Nevertheless, seasonality and large-scale trends are more often modeled explicitly. A typical approach is to use combinations of sines and cosines to represent the seasonal variation of case numbers throughout the year, which often follows a more or less regular wave with one peak per year (Unkel et al. 2012). Of course, sinusoidal curves with varying frequencies can be used to model more complex seasonal patterns. Moreover, a linear trend is often added to describe the general development of rising or falling case numbers over several years (Paul and Meyer 2016). An alternative for modeling seasonality or trends is to fit regression splines, which are functions consisting of piecewise polynomials connected at predefined points (called knots) (Reich et al. 2016a). The number of pieces and the placement of knots determine the variability of the spline function.
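As an illustration of the sinusoidal and trend features described above, the following minimal sketch constructs sine and cosine terms of several frequencies plus a linear trend from a time index. The weekly period of 52, the number of harmonics and the feature names are assumptions made for the example only.

```python
import math

def seasonal_trend_features(t, period=52, harmonics=2):
    """Return a linear trend term and sine/cosine pairs for time index t.

    period: assumed number of time steps per year (52 for weekly data).
    harmonics: number of sine/cosine pairs with increasing frequency.
    """
    features = {"trend": float(t)}
    for h in range(1, harmonics + 1):
        angle = 2.0 * math.pi * h * t / period
        features[f"sin_{h}"] = math.sin(angle)
        features[f"cos_{h}"] = math.cos(angle)
    return features

week_10 = seasonal_trend_features(10)
week_62 = seasonal_trend_features(62)  # same week of the following year
```

The seasonal terms repeat after one period while the trend term keeps increasing, so that the two effects enter a model as separate features.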

Regarding multivariate or hierarchical time series, it is often useful to include features which describe the epidemiological situation in other, related subpopulations. For example, models often use the past case numbers of adjacent geographical regions such as neighbouring counties or states as features (Corberán-Vallet and Lawson 2014). The simplest approach is to construct a network graph between the regions and include all neighbours of first, second etc. order as features (Paul and Meyer 2016). Alternatively, spatial kernels can be used which express the influence between regions as a smooth, continuous function that would usually be chosen to diminish with increasing distance between two points (Stojanovic et al. 2019). In the context of hierarchical forecasting, it may also be an option to include features which describe other higher- or lower-level time series, for example a county-level forecast which uses the case numbers of its state as input.

Of course, the recent development can be described not only by raw lagged variables but also by derived features. A good example is the effective reproduction number $R$, which describes the average number of secondary cases that an infected individual causes over their infectious period (Cori et al. 2013). This measure is well known in descriptive epidemiology and mechanistic modeling of disease spreading⁵ and can be estimated in a backward-looking manner using the instantaneous reproduction number

$$R_t = \frac{y_t}{\sum_{s=1}^{t} y_{t-s} w_s}$$

at time index $t$, where the denominator represents the total infectiousness of the currently infected individuals at time $t$ ($w_s$ describes the probability distribution of being infectious after $s$ time steps) (Cori et al. 2013). This estimate can be highly variable and is therefore often averaged over a certain time window in practice (Cori et al. 2013).
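The estimate above can be sketched in a few lines. The infectiousness profile `w` passed to the function is an assumption of the example, and the sketch returns `None` where the denominator is zero. It omits the averaging over a time window mentioned in the text.

```python
def instantaneous_reproduction_number(cases, w):
    """Compute R_t = y_t / sum_{s=1}^{t} y_{t-s} * w_s for every time index t.

    cases: list of case counts y_0, y_1, ...
    w: assumed infectiousness profile, w[s - 1] is the weight for lag s.
    """
    estimates = []
    for t in range(len(cases)):
        infectiousness = sum(
            cases[t - s] * w[s - 1] for s in range(1, min(t, len(w)) + 1)
        )
        estimates.append(cases[t] / infectiousness if infectiousness > 0 else None)
    return estimates

# Hypothetical series that doubles every step, with all infectiousness at lag 1.
doubling = instantaneous_reproduction_number([1, 2, 4, 8, 16], w=[1.0])
# Hypothetical constant series with infectiousness spread over three lags.
constant = instantaneous_reproduction_number([10] * 6, w=[0.2, 0.5, 0.3])
```

For the doubling series, the estimate is 2 from the second time step on; for the constant series, it settles at about 1 once the full infectiousness profile is covered by past observations.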

All of the above features can be constructed solely from the original time series. However, as has been argued, a strength of statistical modeling is that other information sources can be taken into account as well. In infectious disease surveillance, health authorities may have access to so-called line list data (Centers for Disease Control and Prevention 2006) which describe certain demographic and epidemiological details of each reported case and could also be used in forecasting (Reich et al. 2016a; Viboud et al. 2018). This could for example be information about symptoms, the setting in which infection occurred or the occupation of the infected person. Many other external data sources with potential epidemic relevance have been used in forecasting, including climate and weather data such as temperature, rainfall or humidity (Biggerstaff et al. 2018; Lauer et al. 2020; Chae et al. 2018), syndromic surveillance data such as electronic medical records of symptoms, medication sales or school absenteeism rates (Lutz et al. 2019; Ertem et al. 2018) or data about online and social media activities like Google or Wikipedia search trends or Twitter posts (Biggerstaff et al. 2018; Nsoesie et al. 2014; Chae et al. 2018; Ertem et al. 2018).

When statistical models are jointly fit on several time series, for example on all county-level time series, information about each specific region is often used to discriminate between the subpopulations. Corresponding features are for example the age and gender distribution, the overall population size (Stojanovic et al. 2019; Desai et al. 2019) or the regional vaccination coverage (Meyer et al. 2017). These features will usually be constant for each time series and thus only explain the variation between the jointly modeled time series.

Models

The most basic models used in infectious disease forecasting are classic time series analysis techniques such as exponential smoothing (ETS) and autoregressive moving average (ARMA), and extensions thereof. ETS predicts the future based on a weighted average of earlier values where the importance of past observations declines exponentially over time based on a decay factor alpha (Unkel et al. 2012). ARMA models combine weighted averages of past values and past error terms to infer the next time series value (Allard 1998). The basic versions have been extended to account for seasonal variation through a seasonal component (e.g. Holt-Winters’ ETS or SARMA) or external influence factors through additional regressors (e.g. ARMAX) (Unkel et al. 2012; Hyndman and Athanasopoulos 2018). On the one hand, these models are practical and well tested in a variety of fields; on the other hand, they are not specifically tailored to infectious disease forecasting or count data and their flexibility is rather limited (Lauer et al. 2020).

5 Or variants of it, like $R_0$, the basic reproduction number, which describes the average number of secondary cases caused by an individual if the whole population was susceptible to the disease.
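The exponentially declining weights of ETS described above can be illustrated with a minimal sketch of simple exponential smoothing written as a weighted sum of lagged values. The truncation to a fixed number of lags (`cutoff`) is an assumption made here so that the forecast depends on a feature vector of fixed size; the function names are hypothetical.

```python
def ses_lag_weights(alpha, cutoff):
    """Weights alpha * (1 - alpha)^j for lags j = 0, ..., cutoff - 1."""
    return [alpha * (1.0 - alpha) ** j for j in range(cutoff)]

def ses_forecast(lagged_values, alpha):
    """One-step forecast from lagged values, most recent observation first."""
    weights = ses_lag_weights(alpha, len(lagged_values))
    return sum(w * y for w, y in zip(weights, lagged_values))

weights = ses_lag_weights(alpha=0.5, cutoff=3)
forecast = ses_forecast([8.0, 4.0, 2.0], alpha=0.5)
```

Because of the truncation, the weights sum to slightly less than one; the longer the cutoff, the smaller the neglected remainder.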

Exactly for these reasons, a more popular class of methods are the generalized linear models (GLMs) or generalized additive models (GAMs). GLMs fit a linear model to the conditional mean of a predefined probability distribution from the exponential family (Zeger and Karim 1991). In infectious disease forecasting, the Poisson or negative binomial distribution, both of which are useful for modeling count data, is usually chosen as the probability distribution (Reich et al. 2016a; Stojanovic et al. 2019; Lawson and Song 2010). GAMs are a wider class in the sense that the predictor is not necessarily a linear function but instead a sum of several, potentially non-linear components. As such, all of the above features could theoretically be included in a GAM, given that they can be appropriately modeled as a non-linear component. Typically, lagged features, seasonal patterns and trends using sinusoidal functions or splines, and interactions with neighbouring geographical regions are included in such models (Meyer et al. 2017; Paul and Meyer 2016; Stojanovic et al. 2019; Reich et al. 2016a).

A further set of methods, which avoid explicit modeling, are techniques from the field of machine learning, especially supervised learning, which seeks to minimize the error on a given training set of observations (Flahault et al. 2016). Machine learning models can capture highly complex relationships but require extensive training data (Lauer et al. 2020). While they are currently not as popular in infectious disease forecasting, they have already been used in some applications and forecasting challenges (Mcgowan et al. 2019; Desai et al. 2019). For example, Kane et al. (2014) use random forests to predict avian influenza spreading. Chae et al. (2018) employ an artificial neural network (ANN) to forecast malaria, scarlet fever, and chickenpox. In time series forecasting, so-called recurrent neural networks such as the long short-term memory (LSTM) architecture are especially popular since they can operate on sequences of data and capture patterns in time very well (Chae et al. 2018).

Regarding the integration into the present approach, it is generally possible with only minor adaptations to represent all of the above models as base forecast functions $\hat{f}_k(x)$ with a set of input features.⁶ However, an important aspect of the present approach is that the definition of the forecast function $\hat{f}$ and the set of input features $\mathcal{P}$ do not necessarily have to be fully congruent with the underlying model. On the contrary, full congruence could even be a barrier to interpretability. For an intuitive example, consider a seasonal component which is modeled as a sine wave. Here, the respective input to the model would simply be a time index $t$ which is then transformed by the sine operation into a seasonal pattern. On the one hand, this leads to a complex relationship between the feature and the forecast which can impede accurate feature attribution. On the other hand, the time index may also be used to model an additional linear trend. The feature attribution for $t$ would therefore intermingle the seasonal and trend effect. In such a case, it is favourable to use not the raw input features but the epidemiologically meaningful components of the model as features and to define the forecast function only as a function of the components. The resulting explanation for a forecast will then consist of the individual components and their effects and not of the raw time series features.

6 In the case of exponential smoothing, a cutoff must be chosen to constrain the lagged features to a fixed size. In the case of recurrent neural networks, all steps of the input sequence must be chosen as features.

One further aspect to consider in the hierarchical setting is that if one model is jointly fit on several time series, all input features must still be clearly “identified” in $\hat{f}$. This means that a feature should always refer to a concrete entity (“case numbers in state A”) instead of a relative object (“case numbers in this state”). Otherwise, the feature attributions for the reconciled forecasts, which combine attributions from all base forecasts, will have confounded effects, because “this state” in fact refers to different states depending on the state for which the respective forecast was made.

5.3 Matrices for Projection

The matrix $P$ determines how the different base forecasts influence the final reconciled forecasts. $P$ is chosen such that the full reconciliation operation $SP$ is a projection (Panagiotelis et al. 2020). Theoretical considerations have motivated a number of projections which are assumed to be optimal or near-optimal under certain conditions related to the base forecasts. While some of the $P$ matrices can be constructed solely from the structure of the hierarchy, others are calibrated using historical error variances of the base forecasts. In the following, popular suggestions and their underlying rationale are presented.

Hyndman et al. (2011) propose that an optimal projection should fulfil the equality $SPS = S$: They make the assumption that the base forecasts are unbiased, i.e. $E[\hat{Y}] = E[Y] = S\,E[Y_B]$. The last equality follows from the additivity of the expectation and the aggregation structure of the hierarchy. Similarly, the expectation of the reconciled forecasts can be expressed as a function of the base forecasts and, given the above assumption, in terms of the bottom-level ground truth, so that $E[\tilde{Y}] = SP\,E[\hat{Y}] = SPS\,E[Y_B]$ (Hyndman et al. 2011). Hence, for the reconciled forecasts to be unbiased as well, that is $E[\tilde{Y}] = E[Y]$, it suffices if the equality $SPS = S$ holds, hereafter called the unbiasedness condition.


Moreover, to derive an optimal matrix $P$, the authors express the (supposedly unbiased) base forecasts as $\hat{y} = S\,E[Y_B] + \varepsilon$, where $\varepsilon$ is the “coherency” error of the base forecasts. The underlying rationale is that if the base forecasts were exactly the expected value of the ground truth, they would be coherent. Any deviation from this expectation thus introduces incoherence to the base forecasts. Then, the authors find that the unknown truth $E[Y_B]$ can be estimated from the base forecasts by treating the above equation as a regression problem and $E[Y_B]$ as the regression coefficients. The resulting generalized least squares estimate of the expected bottom-level time series values would thus be $\hat{E}[Y_B] = (S^T \Sigma^{-1} S)^{-1} S^T \Sigma^{-1} \hat{y}$, where $\Sigma = \mathrm{var}(\varepsilon)$ is the variance-covariance matrix of the coherency errors and $\Sigma^{-1}$ its generalized inverse (Hyndman et al. 2011). Given this result, the optimal matrix $P$ would be $P = (S^T \Sigma^{-1} S)^{-1} S^T \Sigma^{-1}$, which also fulfils the unbiasedness condition. Unfortunately, the variance-covariance matrix of coherency errors $\Sigma$ is generally unidentifiable, so Hyndman et al. (2011) make the additional assumption that $\varepsilon = S\varepsilon_B$, which only applies if the coherency errors add up in the same aggregation structure as the hierarchy. This assumption reduces the generalized least squares problem to ordinary least squares, essentially replacing $\Sigma$ by the identity matrix $I_m$. Consequently, the authors propose the following matrix $P$ as optimal under the assumptions:

$$P_{OLS} = (S^T S)^{-1} S^T$$

This matrix fulfils the unbiasedness condition as well and furthermore has the convenient property that it only depends on the structure of the hierarchical time series but not on the forecasts. On the other hand, the assumption about the aggregation structure of coherency errors will almost always be violated in practice, making the strategy at least suboptimal in real-world applications (Wickramasuriya et al. 2019).
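For illustration, the OLS matrix and the unbiasedness condition can be checked numerically. The sketch below assumes a small hierarchy with one top-level series, two states and three counties (state A with counties 1 and 2, state B with county 3); the summing matrix and the incoherent base forecasts are hypothetical values chosen for the example.

```python
import numpy as np

# Assumed summing matrix S (m = 6 series, n = 3 bottom-level series).
S = np.array([
    [1, 1, 1],  # total
    [1, 1, 0],  # state A
    [0, 0, 1],  # state B
    [1, 0, 0],  # county 1
    [0, 1, 0],  # county 2
    [0, 0, 1],  # county 3
], dtype=float)

P_ols = np.linalg.inv(S.T @ S) @ S.T

# Hypothetical base forecasts that violate the aggregation constraints.
y_hat = np.array([105.0, 58.0, 40.0, 30.0, 31.0, 42.0])
y_tilde = S @ P_ols @ y_hat  # reconciled forecasts
```

The reconciled forecasts in `y_tilde` add up along the hierarchy, and `S @ P_ols @ S` reproduces `S`, which is the unbiasedness condition stated above.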

Wickramasuriya et al. (2019) take a slightly different approach to the problem. First, they express the covariance of the reconciled forecast errors $\tilde{e} = y - \tilde{y}$ as a function of the covariance of the base forecast errors by proving that $\mathrm{var}(\tilde{e}) = SPWP^TS^T$, where $W = \mathrm{var}(\hat{e})$ is the variance-covariance matrix of the base forecast errors $\hat{e} = y - \hat{y}$. Moreover, they again assume that the base forecasts are unbiased and that $SPS = S$ holds, so that the reconciled forecasts $\tilde{y}$ will be unbiased too. Under this assumption, the expected value of the reconciled forecast errors $\tilde{e}$ would be zero, so in order to obtain minimal errors, one would only have to find a matrix $P$ which minimizes their variance, in other words the trace of the variance-covariance matrix $\mathrm{tr}(\mathrm{var}(\tilde{e})) = \mathrm{tr}(SPWP^TS^T)$. Wickramasuriya et al. (2019) prove that minimizing the trace of this matrix subject to $SPS = S$ yields the following matrix $P$:

$$P_{mintrace} = (S^T W^{-1} S)^{-1} S^T W^{-1}$$

Interestingly, this result has a similar form as the GLS estimator from Hyndman et al. (2011), but $W$, the variance-covariance matrix of the base forecast errors, can be much better estimated than $\Sigma$, the variance-covariance matrix of the coherency errors. The following estimates for $W$ have been proposed (Wickramasuriya et al. 2019):

• Structural scaling: Assumes that base forecast error variances add up hierarchically. Like OLS, this is independent of the forecasts. The matrix $W$ is then $W = \mathrm{diag}(S\mathbf{1})$, where $\mathbf{1}$ represents an $n$-dimensional vector of ones. In other words, $W$ is a diagonal matrix whose entries indicate the number of bottom-level time series that are aggregated in each higher-level series of the hierarchy.

• Sample: Uses the full sample variance-covariance matrix of base forecast errors on the training data, hence $W = \frac{1}{T} \sum_{t=1}^{T} \hat{e}_t \hat{e}_t^T$. The estimation quality of the sample covariances may depend strongly on the ratio between the size of the hierarchy and the number of observations which can be used to compute the residuals.

• Variance scaling: Only uses the diagonal of the sample variance-covariance matrix, assuming all covariances to be zero, hence $W = \mathrm{diag}\left(\frac{1}{T} \sum_{t=1}^{T} \hat{e}_t \hat{e}_t^T\right)$. This matrix can be reliably estimated with less training data but ignores the error covariance structure of the base forecasts.

• Shrinkage: A mix of the variance diagonal matrix and the full sample covariance matrix. The shrinkage parameter $\lambda$ is used to shrink the covariances towards zero: $W = \lambda \, \mathrm{diag}\left(\frac{1}{T} \sum_{t=1}^{T} \hat{e}_t \hat{e}_t^T\right) + (1 - \lambda) \left(\frac{1}{T} \sum_{t=1}^{T} \hat{e}_t \hat{e}_t^T\right)$. The underlying rationale is that in the face of limited training data, the full sample covariance may be overestimated and should be reduced to obtain a more realistic estimate. The parameter could be chosen via cross-validation or computed based on the variance of the residuals’ correlation coefficients.
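The four estimates and the resulting minimum trace matrices can be sketched as follows. The example assumes the small hierarchy with two states and three counties and uses synthetic random residuals as a stand-in for real base forecast errors; the shrinkage parameter of 0.5 is an arbitrary choice, and all function names are hypothetical.

```python
import numpy as np

def mint_projection(S, W):
    """P = (S^T W^-1 S)^-1 S^T W^-1 for a given estimate W."""
    W_inv = np.linalg.inv(W)
    return np.linalg.inv(S.T @ W_inv @ S) @ S.T @ W_inv

def w_structural(S):
    return np.diag(S @ np.ones(S.shape[1]))

def w_sample(residuals):
    # residuals: shape (T, m), one row of base forecast errors per time step
    return residuals.T @ residuals / residuals.shape[0]

def w_variance(residuals):
    return np.diag(np.diag(w_sample(residuals)))

def w_shrinkage(residuals, lam):
    return lam * w_variance(residuals) + (1.0 - lam) * w_sample(residuals)

# Assumed hierarchy: total, state A (counties 1 and 2), state B (county 3).
S = np.array([[1, 1, 1], [1, 1, 0], [0, 0, 1],
              [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)

rng = np.random.default_rng(0)
residuals = rng.normal(size=(50, S.shape[0]))  # synthetic stand-in for training errors

estimates = {
    "structural": w_structural(S),
    "sample": w_sample(residuals),
    "variance": w_variance(residuals),
    "shrinkage": w_shrinkage(residuals, lam=0.5),
}
projections = {name: mint_projection(S, W) for name, W in estimates.items()}
```

Each of the resulting matrices satisfies the unbiasedness condition by construction, and replacing $W$ by the identity matrix recovers the OLS matrix.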

An approach which goes without the assumption of unbiasedness is proposed by Ben Taieb and Koo (2019), who frame the search for an optimal matrix $P$ as an empirical risk minimization problem⁷ on a validation dataset from time $T_{train}$ to $T$, where $T_{train}$ is the number of training observations used⁸ (Ben Taieb and Koo 2019):

$$\min_{P_{ERM}} \frac{1}{(T - T_{train} + 1)\,m} \sum_{t=T_{train}}^{T} \lVert y_t - S P_{ERM} \hat{y}_t \rVert_2^2$$

As can be seen, the matrix $P_{ERM}$ is chosen to minimize the mean squared error of the reconciled forecasts on the validation data $y_{T_{train}}, y_{T_{train}+1}, \ldots, y_T$. The general solution to this optimization problem is $P_{ERM} = (S^T S)^{-1} S^T \mathbf{Y}^T \hat{\mathbf{Y}} (\hat{\mathbf{Y}}^T \hat{\mathbf{Y}})^{-1}$, where $\mathbf{Y}$ and $\hat{\mathbf{Y}}$ represent the validation data and the corresponding base forecasts, each stacked into a matrix with one row per time step. Moreover, Ben Taieb and Koo (2019) suggest adding a regularization term to the objective, thus introducing sparsity to $P$ in order to reduce estimation variance, and solving the problem via LASSO regression.

7 Note that this is still an instance of reconciliation via projection, because the solution of the optimization problem is the matrix $P$ and not the reconciled forecast $\tilde{y}$.

8 The original formulation has been slightly adapted here for notational compatibility.
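The unregularized solution can be sketched numerically. The example assumes synthetic validation data for the small hierarchy with two states and three counties, with observations and base forecasts stacked row-wise per time step; the noise model and all variable names are hypothetical. The regularized LASSO variant is not covered by this sketch.

```python
import numpy as np

# Assumed hierarchy: total, state A (counties 1 and 2), state B (county 3).
S = np.array([[1, 1, 1], [1, 1, 0], [0, 0, 1],
              [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)

rng = np.random.default_rng(1)
n_validation = 40
bottom = rng.poisson(lam=20.0, size=(n_validation, 3)).astype(float)
Y = bottom @ S.T                                  # coherent observations, shape (T, m)
Y_hat = Y + rng.normal(scale=3.0, size=Y.shape)   # synthetic stand-in for base forecasts

StS_inv = np.linalg.inv(S.T @ S)
P_erm = StS_inv @ S.T @ Y.T @ Y_hat @ np.linalg.inv(Y_hat.T @ Y_hat)
P_ols = StS_inv @ S.T

def validation_mse(P):
    """Mean squared error of the reconciled forecasts on the validation data."""
    reconciled = Y_hat @ P.T @ S.T
    return float(np.mean((Y - reconciled) ** 2))

mse_erm = validation_mse(P_erm)
mse_ols = validation_mse(P_ols)
```

Since the ERM matrix minimizes the validation error over all matrices $P$, its error on the validation data cannot exceed that of the OLS matrix; how the two compare on unseen data is a separate question.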

Despite these theoretical considerations, it is not clear which matrix $P$ will produce the best projection in practice, because the underlying assumptions are often violated and the effects of regularization or shrinkage are difficult to foresee in general (Ben Taieb and Koo 2019; Shang and Hyndman 2017; Wickramasuriya et al. 2019). Therefore, none of the above matrices is suggested here as superior for infectious disease forecasting. Instead, different projections should be compared via cross-validation on real-world data, as demonstrated in chapter 6.

Single-Level Strategies

For compatibility, the single-level strategies can also be implemented through an appropriate choice of $P$.⁹ In this case, the projection will only use base forecasts from one level:

First, the bottom-up strategy is given by $P_{BU} = \left[\, 0_{n \times (m-n)} \;\; I_n \,\right]$. This projection simply selects the $n$ bottom-level base forecasts and ignores all higher-level base forecasts.

Second, the top-down strategy is given by $P_{TD} = \left[\, p \;\; 0_{n \times (m-1)} \,\right]$, where $p$ is an $n$-dimensional proportion vector that divides the top-level base forecast among the time series at the bottom level. The proportions are usually determined using training data. Two common alternatives are (Gross and Sohl 1990):

$$p_k = \frac{1}{T} \sum_{t=1}^{T} \frac{y_{t,k}}{y_{t,0}} \quad \text{or} \quad p_k = \sum_{t=1}^{T} \frac{y_{t,k}}{T} \Big/ \sum_{t=1}^{T} \frac{y_{t,0}}{T}$$

Here, $y_{t,0}$ stands for the top-level time series value at time $t$. Therefore, the first version computes the average of the historical proportions and the second version computes the proportion of historical averages. Note that the sum of the above proportions will likely be close to one, but $\sum_{k=1}^{m} p_k = 1$ is neither guaranteed nor a requirement to obtain coherent reconciled forecasts.

Last, the middle-out strategy is given by $P_{MO} = \left[\, 0_{n \times l_{first}} \;\; A_{n \times (l_{last} - l_{first} + 1)} \;\; 0_{n \times (m - 1 - l_{last})} \,\right]$, where $l_{first}$ and $l_{last}$ are the first and last index of the time series at the chosen middle level and $A$ is a proportion matrix that divides the different middle-level base forecasts among the time series at the bottom level. Each bottom-level forecast only gets a share of its respective superordinate middle-level base forecast. For instance, a proportion matrix for the exemplary hierarchy in figure 1 could be:

$$A = \begin{pmatrix} 0.5 & 0 \\ 0.5 & 0 \\ 0 & 1 \end{pmatrix}$$

This divides the state A base forecast equally among county 1 and 2 but assigns the whole state B base forecast to its only child, county 3.

9 The formulations for $P_{BU}$ and $P_{TD}$ are taken from Hyndman et al. (2011), the formulation for $P_{MO}$ is newly derived.
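The three single-level matrices can be constructed directly from their block definitions. The sketch below assumes the hierarchy with one top-level series, state A (counties 1 and 2) and state B (county 3), ordered from top to bottom with zero-based indices; the proportion vector `p` and the base forecasts are hypothetical values.

```python
import numpy as np

# Assumed summing matrix: total, state A, state B, county 1, county 2, county 3.
S = np.array([[1, 1, 1], [1, 1, 0], [0, 0, 1],
              [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
m, n = S.shape

# Bottom-up: select the n bottom-level base forecasts.
P_bu = np.hstack([np.zeros((n, m - n)), np.eye(n)])

# Top-down: distribute the top-level base forecast with assumed proportions p.
p = np.array([0.3, 0.3, 0.4])
P_td = np.hstack([p.reshape(-1, 1), np.zeros((n, m - 1))])

# Middle-out: distribute the state-level base forecasts (indices 1 to 2) with A.
A = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 1.0]])
l_first, l_last = 1, 2
P_mo = np.hstack([np.zeros((n, l_first)), A, np.zeros((n, m - 1 - l_last))])

y_hat = np.array([105.0, 58.0, 40.0, 30.0, 31.0, 42.0])  # hypothetical base forecasts
reconciled = {name: S @ P @ y_hat
              for name, P in [("BU", P_bu), ("TD", P_td), ("MO", P_mo)]}
```

Each strategy reproduces the base forecasts of its own level unchanged and derives all other levels from them, so that all three results are coherent.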

Of course, producing base forecasts for all levels first and then applying such a selective projection is inefficient, but it allows the performance of more sophisticated projections to be compared with the single-level strategies in one standardized procedure. Even if different projections are tried, the base forecasts only have to be produced once.

Non-Negative Reconciliation

As the matrix $P$ may also contain negative coefficients, it is not guaranteed that the reconciled forecasts will be non-negative even if all base forecasts are positive (Wickramasuriya et al. 2020). This is obviously an issue in infectious disease forecasting, because negative case numbers are impossible in reality and may occur especially in low-level time series of the hierarchy with case numbers close to zero.

Unfortunately, all currently existing solutions which would allow non-negativity of the reconciled forecasts to be enforced are optimization-based approaches (Wickramasuriya et al. 2020; van Erven and Cugliari 2015) that provide no projection and are thus not interpretable. As an alternative, the following novel approach is proposed here:

Assume that some projection has been chosen via a matrix P, which however producesnegative reconciled forecasts for some given base forecasts y. Then, chose a revisedmatrix Pnon-neg = (1−α)P +αPBU with shrinkage parameter α ∈ [0, 1] that shrinks Pnon-neg

towards the bottom-up matrix PBU. The underlying rationale is that the bottom-up matrixalways produces non-negative reconciled forecasts given that the base forecasts are non-negative. Hence, assuming that at least one reconciled forecast is negative, α can bechosen using the following minimization problem:

$$\min_{\alpha} \; \alpha \quad \text{s.t.} \quad \left((1 - \alpha) P + \alpha P_{BU}\right) \hat{y} = P\hat{y} + \alpha (P_{BU} - P) \hat{y} \geq 0$$

with the corresponding closed-form solution:

$$\alpha_{min} = \max_{k \in \{k : (P\hat{y})_k < 0\}} \frac{(P\hat{y})_k}{(P\hat{y} - P_{BU}\hat{y})_k}$$

This $\alpha_{min}$ is the smallest $\alpha$ which ensures that all reconciled forecasts are non-negative; it follows directly from the tightest non-negativity constraint of the above minimization problem. The goal is therefore to ensure non-negativity while staying as close as possible to the original matrix $P$. Moreover, $0 < \alpha_{min} \leq 1$ is guaranteed, because for every $k$ such that $(P\hat{y})_k < 0$:

$$\frac{(P\hat{y})_k}{(P\hat{y} - P_{BU}\hat{y})_k} > 1 \iff (P_{BU}\hat{y})_k < 0$$

In other words, for the above ratio to be greater than one, the bottom-up forecast would have to be negative, which is impossible assuming non-negative base forecasts. Besides, $(P\hat{y})_k < 0$ ensures that both numerator and denominator are negative, so that the ratio is always positive.
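The closed-form choice of $\alpha_{min}$ can be sketched in a few lines of Python. The example assumes the standard ordinary least squares projection $P = (S^\top S)^{-1} S^\top$, the bottom-up matrix $P_{BU} = [0 \; I_n]$ and hypothetical base forecasts chosen such that one reconciled bottom-level forecast becomes negative; the function name `nonneg_projection` is likewise illustrative and not taken from the prototype.

```python
import numpy as np

# Summing matrix of the exemplary hierarchy (total, states A/B, counties 1-3).
S = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)
m, n = S.shape

P_bu = np.hstack([np.zeros((n, m - n)), np.eye(n)])  # bottom-up
P_ols = np.linalg.solve(S.T @ S, S.T)                # OLS projection (assumed)


def nonneg_projection(P, P_bu, y_hat):
    """Return (alpha_min, P_nonneg) for the given base forecasts."""
    b = P @ y_hat        # reconciled bottom-level forecasts under P
    b_bu = P_bu @ y_hat  # bottom-level base forecasts
    neg = b < 0
    if not neg.any():
        return 0.0, P    # nothing to repair
    # Tightest non-negativity constraint over all negative entries.
    alpha = float(np.max(b[neg] / (b[neg] - b_bu[neg])))
    return alpha, (1.0 - alpha) * P + alpha * P_bu


# Hypothetical non-negative base forecasts that are strongly incoherent.
y_hat = np.array([10.0, 10.0, 1.0, 12.0, 0.1, 1.0])

alpha_min, P_nonneg = nonneg_projection(P_ols, P_bu, y_hat)
bottom_before = P_ols @ y_hat    # contains one negative entry
bottom_after = P_nonneg @ y_hat  # negative entry is lifted to zero
print(alpha_min, bottom_before, bottom_after)
```

Because the summing matrix only has non-negative entries, non-negative bottom-level forecasts imply non-negative reconciled forecasts at all higher levels as well.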

Moreover, a convenient theoretical property of the revised matrix is that if $P$ fulfils the unbiasedness condition, then $P_{\text{non-neg}}$ will fulfil it as well, because

$$S P_{\text{non-neg}} S = S\left((1 - \alpha) P + \alpha P_{BU}\right) S = (1 - \alpha) S P S + \alpha S P_{BU} S = (1 - \alpha) S + \alpha S = S$$

However, this solution affects interpretability, because $P_{\text{non-neg}}$ also depends on the base forecasts $\hat{y}$ (via the choice of $\alpha_{min}$), while the procedure for obtaining attributions for the reconciled forecasts as proposed in Figure 5.1 assumes $P$ to be a fixed parameter of the reconciled forecast function. Hence, if $P_{\text{non-neg}}$ is used for reconciliation, the influence of the base forecasts on the shrinkage parameter $\alpha$ is neglected. Moreover, if different choices for $\alpha$ are made over time, the attributions for different forecasts could become more difficult to compare. The degree of distortion will of course depend on how large $\alpha$ must be chosen to ensure non-negativity. A more detailed investigation of the issue will be performed in chapter 6.

5.4 Base Attribution Methods

A variety of univariate attribution methods for statistical models has been proposed in the literature on model interpretation, some of which take a rather heuristic approach to the estimation of feature influence (Ribeiro et al. 2016b; Lapuschkin 2019; Shrikumar et al. 2017; Gosiewska and Biecek 2019). A detailed review of all existing methods is out of scope. The following section therefore presents two methods with theoretical grounding that have been proven to satisfy properties 2–6 in the univariate case and are thus promising candidates for the explanation of base forecasts.

SHAP

Lundberg and Lee (2017) propose an attribution method called SHAP which is based on the renowned Shapley value, a solution for the payoff allocation problem in cooperative game theory (Roth 1988). As mentioned earlier, the problem of feature attribution can be translated into a cooperative game, where the forecast function $f(x)$ is interpreted as the joint payoff for the coalition of players $x$. The Shapley value divides the payoff between the players based on their individual contributions to all possible joint payoffs. It is defined on binary games in which players are either part of the coalition or not (Roth 1988). Therefore, the authors use a special characteristic function $v(\mathcal{S}, x, f, R)$ which represents the forecast when only the features in $\mathcal{S}$ are given (thus forming the coalition) and the other features are “missing” (the interpretation of “missing” will be addressed below). The SHAP feature attribution function is then defined as$^{10}$:

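$$\varphi(i, x, f, R) = \sum_{\mathcal{Z} \subseteq \mathcal{P} \setminus \{i\}} \frac{|\mathcal{Z}|! \, (|\mathcal{P}| - |\mathcal{Z}| - 1)!}{|\mathcal{P}|!} \left( v(\mathcal{Z} \cup \{i\}, x, f, R) - v(\mathcal{Z}, x, f, R) \right)$$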

For illustration, imagine that starting with all input features set to “missing”, the input features are “added” (i.e. set from “missing” to their actual value) one by one in a specific order until all features are fixed at their actual value. At each step, the marginal change of the forecast, given by $v(\mathcal{Z} \cup \{i\}, x, f, R) - v(\mathcal{Z}, x, f, R)$, is attributed to the feature which was added last and thus defines the feature’s contribution to the forecast in this specific ordering. Now, the SHAP attribution method averages these marginal contributions over all possible orderings of the features. The weight $\frac{|\mathcal{Z}|! \, (|\mathcal{P}| - |\mathcal{Z}| - 1)!}{|\mathcal{P}|!}$ indicates the fraction of all orderings in which the feature is added after the features in $\mathcal{Z}$ but before the remaining features$^{11}$. An alternative interpretation is that the SHAP attribution computes the expected marginal contribution of feature $i$ when the features are added in a random order.

Regarding the representation of missing features, Lundberg and Lee (2017) originally propose $v(\mathcal{S}, x, f, R) = E[f(R) \mid R_{\mathcal{S}} = x_{\mathcal{S}}]$ as characteristic function. This is the expected forecast when the features in $\mathcal{S}$ are fixed at their actual values $x_{\mathcal{S}}$ and the remaining features are set to a random draw of the reference $R$, conditioned on $R_{\mathcal{S}} = x_{\mathcal{S}}$. Due to conditioning, the distribution of the remaining features is dependent on the features in $\mathcal{S}$. This interpretation is flawed in the sense that it can lead to non-zero attributions for dummy features, because conditioning on a dummy feature may induce changes in the reference distribution of other influential features that are only correlated with the dummy (Sundararajan and Najmi 2019; Janzing et al. 2019). As suggested by Janzing et al. (2019), it is therefore favourable to define the set function as follows:

$$v_{f,R}(\mathcal{S}) = E[f(x_{\mathcal{S}}, R_{\mathcal{P} \setminus \mathcal{S}})]$$

Here, the unconditional distribution of $R$ is used and the features in $\mathcal{S}$ are fixed independently. With this characteristic function, the properties completeness, strong monotonicity, symmetry, dummy null effect and additivity are satisfied by SHAP due to the theoretical properties of the Shapley value (Lundberg and Lee 2017), making it a promising method for the explanation of base forecasts.
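To make the definition concrete, the following brute-force sketch enumerates all subsets and evaluates the unconditional characteristic function on a finite reference sample. It is only feasible for a handful of features and is not the algorithm used by any SHAP library; the function `shap_exact`, the toy forecast function and the random reference data are hypothetical.

```python
import itertools
import math

import numpy as np


def shap_exact(f, x, R):
    """Exact Shapley attributions by subset enumeration.

    f maps a 2D array (one instance per row) to a 1D array of forecasts,
    x is the instance to explain, R is a sample from the reference.
    """
    p = len(x)

    def v(subset):
        # Fix the features in `subset` at their actual value and keep the
        # unconditional reference sample for all remaining features.
        Z = R.copy()
        idx = list(subset)
        if idx:
            Z[:, idx] = x[idx]
        return f(Z).mean()

    phi = np.zeros(p)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        for size in range(p):
            weight = (math.factorial(size) * math.factorial(p - size - 1)
                      / math.factorial(p))
            for Z in itertools.combinations(others, size):
                phi[i] += weight * (v(Z + (i,)) - v(Z))
    return phi


def f(Z):
    # Toy forecast function: interaction of features 0 and 1, linear effect
    # of feature 2, and feature 3 is a dummy without any influence.
    return Z[:, 0] * Z[:, 1] + 2.0 * Z[:, 2]


rng = np.random.default_rng(0)
R = rng.normal(size=(200, 4))        # hypothetical reference sample
x = np.array([1.0, -2.0, 0.5, 3.0])  # instance to explain

phi = shap_exact(f, x, R)
print(phi, phi.sum(), f(x[None, :])[0] - f(R).mean())
```

In this toy setting the attributions sum to the difference between the forecast and the expected reference forecast (completeness), and the dummy feature receives an attribution of zero.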

10 Adapted from Lundberg and Lee (2017).

11 This can be explained as follows: There are $|\mathcal{Z}|!$ different orderings of the features in $\mathcal{Z}$ before $i$ is added. These can be combined with another $(|\mathcal{P}| - |\mathcal{Z}| - 1)!$ different orderings of the remaining features after $i$ has been added. Thus, feature $i$ is added after the features in $\mathcal{Z}$ in $|\mathcal{Z}|! \, (|\mathcal{P}| - |\mathcal{Z}| - 1)!$ different orderings out of all $|\mathcal{P}|!$ potential orderings of the full set of features $\mathcal{P}$.

For the computation of SHAP feature attributions, efficient algorithms have been developed specifically for certain classes of statistical models. In the case of a simple linear model $f(x) = \beta_0 + \sum_{i \in \mathcal{P}} \beta_i x_i$, the exact attribution for feature $i$ can be derived as follows (Štrumbelj and Kononenko 2014):

$$f(x) - E[f(R)] = \sum_{i \in \mathcal{P}} \beta_i (x_i - E[R_i]) \;\rightarrow\; \varphi(i, x, f, R) = \beta_i (x_i - E[R_i])$$

Thus, the attribution for feature $i$ is the difference between its current value and the expected value of its reference distribution, weighted by the model coefficient. For generalized linear models with a link function $l(y) = \beta_0 + \sum_{i \in \mathcal{P}} \beta_i x_i$ other than the identity function (e.g. logarithmic), the SHAP value must be approximated using a rescaling approach as follows$^{12}$:

$$\varphi(i, x, f, R) = E\left[ \left( l^{-1}(f(x)) - l^{-1}(f(R)) \right) \frac{\beta_i (x_i - R_i)}{f(x) - f(R)} \right]$$

Here, $f$ denotes the linear predictor. This solution computes the contribution of feature $i$ to the change in output of the link function’s inverse compared to one specific reference instance, averaged over many instances drawn from the reference distribution. An alternative would be to only compute attributions for the linear prediction before application of the link function. In the case of a logarithmic link function, this would mean that the attributions explain not the forecast but its logarithm, which has an influence on the interpretation because additive relationships in logarithmic space are multiplicative relationships in identity space (Christoph Molnar 2019).
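Both cases can be illustrated numerically. The sketch below computes the exact attributions of a linear model and the rescaled attributions for a logarithmic link, under the assumption that the rescaling divides the change on the response scale by the change of the linear predictor, $f(x) - f(R)$, per reference instance. Coefficients, the instance and the reference sample are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(500, 3))        # hypothetical reference sample
beta0 = 0.5
beta = np.array([0.8, -0.3, 0.1])    # hypothetical coefficients
x = np.array([1.0, 2.0, -1.0])       # instance to explain

# Identity link: exact attribution beta_i * (x_i - E[R_i]).
phi_linear = beta * (x - R.mean(axis=0))

# Logarithmic link: the forecast is exp(linear predictor).
eta_x = beta0 + x @ beta             # linear predictor of the instance
eta_R = beta0 + R @ beta             # linear predictor of each reference draw

# Rescaling per reference draw (assumed form): ratio between the change on
# the response scale and the change on the scale of the linear predictor.
scale = (np.exp(eta_x) - np.exp(eta_R)) / (eta_x - eta_R)
phi_log = (scale[:, None] * beta * (x - R)).mean(axis=0)

print(phi_linear, phi_linear.sum(), eta_x - eta_R.mean())
print(phi_log, phi_log.sum(), np.exp(eta_x) - np.exp(eta_R).mean())
```

In both cases the attributions add up to the difference between the forecast and the expected reference forecast on the respective scale, which is the completeness property referred to in the text.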

For machine learning models based on decision trees, including ensemble methods such as random forests or gradient tree boosting (Banfield et al. 2007), Lundberg et al. (2019) propose an efficient algorithm which follows the decision paths in a tree and observes which subsets of features lead to which leaf nodes (based on the split conducted at each node) in order to simultaneously infer the characteristic function $v_{f,R}(\mathcal{S})$ for all possible subsets.

In order to compute SHAP values for features in neural networks, the recipe of an attribution method called DeepLIFT (Shrikumar et al. 2017) can be used. The method recursively distributes the difference between the current activation and a “reference” activation of each neuron in any given layer to the input neurons from the previous layer. The distributed shares are proportional to the input neurons’ deviations from their own reference activation, meaning that the influence of each neuron on its subsequent neurons is approximated by a linear function. Through repeated application of the chain rule, the difference from reference of the model’s prediction can be distributed backwards through the network to the input layer, whose neurons represent the input features (Shrikumar et al. 2017). Now, if one computes the SHAP values individually for all components of the network, where each neuron is treated as a prediction function and its inputs as the features, and uses them to distribute the contribution to the previous neurons (proportionally to their deviation from the expected value of their reference), the DeepLIFT scheme makes it possible to compute a SHAP feature attribution for the full network (Chen et al. 2019). Because neurons can be represented as linear functions, their SHAP values can be solved analytically as above. However, non-linear activation functions must be locally linearized like the link functions shown above, thus only providing an approximation of the contribution.

12 Adapted from Chen et al. (2019).

Aside from these model-specific methods, a model-agnostic approach which samples an arbitrary forecast function to approximate the influence of the input features on the prediction has been proposed as well (Lundberg and Lee 2017). More specifically, the approach uses a weighted quadratic loss between the forecast and the sum of the feature attributions over subsets $\mathcal{Z}$ of the features:

$$\sum_{\mathcal{Z} \subseteq \mathcal{P}} \frac{|\mathcal{Z}|! \, (|\mathcal{P}| - |\mathcal{Z}| - 1)!}{|\mathcal{P}|!} \left( E[f(x_{\mathcal{Z}}, R_{\mathcal{P} \setminus \mathcal{Z}})] - E[f(R)] - \sum_{i \in \mathcal{P}} \varphi(i, x, f, R) \, 1_{i \in \mathcal{Z}} \right)^2$$

As can be seen, the weight used here is the Shapley weight, similar to the earlier definition of SHAP values. This loss can be estimated by drawing samples of subsets $\mathcal{Z}$ as well as samples $f(x_{\mathcal{Z}}, r_{\mathcal{P} \setminus \mathcal{Z}})$ from the function, where the feature values outside the subset $\mathcal{Z}$ are replaced by realizations of the reference $R$. When this loss is minimized using weighted linear regression with the attributions $\varphi(i, x, f, R)$ as coefficients, the SHAP values are approximated (Lundberg and Lee 2017). The procedure hence makes it possible to compute attributions for an arbitrary function. However, the quality of the approximation depends on the number of samples drawn and may quickly degrade with an increasing number of features due to combinatorial explosion (Lundberg et al. 2019).

Integrated Gradients

Sundararajan et al. (2017) propose Integrated Gradients as a feature attribution technique for models with gradients, especially artificial neural networks. It is only defined for a single reference value as baseline, indicated in the following through the use of $r$ instead of $R$ as reference to the attribution function. To measure the contribution of a feature $i$ to the forecast, the method cumulates (through integration) the partial derivatives of the forecast function with respect to feature $x_i$ at different points on a straight line between the reference $r$ and the instance $x$ (Sundararajan et al. 2017):

$$\varphi(i, x, f, r) = (x_i - r_i) \int_{\alpha=0}^{1} \frac{\partial f(r + \alpha (x - r))}{\partial x_i} \, d\alpha$$

According to the gradient theorem, the line integral over the gradient of $f$ along a path from the reference $r$ to the instance $x$ yields the difference between the prediction $f(x)$ and the reference forecast $f(r)$ (Sundararajan et al. 2017). This difference is divided between the features using their partial derivatives. This can be imagined as not switching the feature values from the reference to their true value at once (like in SHAP) but instead slowly changing the input from $r$ to $x$ on a straight-line path in infinitesimal steps. At each step, the marginal contribution of feature $i$ is defined as the partial derivative of $f$ weighted by the difference in feature value between the instance and the reference, $x_i - r_i$ (this basically measures how $f$ would change if it had a gradient of $\nabla f(r + \alpha (x - r))$ everywhere and $x_i$ was set from $r_i$ to $x_i$). These marginal contributions are averaged over the whole path to compute the attribution for feature $i$.
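In practice, the path integral is typically approximated numerically. The following sketch uses a midpoint Riemann sum with an analytically known gradient; the toy forecast function, the number of steps and the all-zero reference are assumptions chosen for illustration.

```python
import numpy as np


def integrated_gradients(grad_f, x, r, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    alphas = (np.arange(steps) + 0.5) / steps   # midpoints of [0, 1]
    path = r + alphas[:, None] * (x - r)        # points on the straight line
    grads = np.stack([grad_f(z) for z in path]) # gradient at each point
    return (x - r) * grads.mean(axis=0)         # average gradient, rescaled


def f(z):
    # Toy forecast function; the fourth feature is a dummy.
    return z[0] * z[1] + np.exp(0.5 * z[2])


def grad_f(z):
    # Analytical gradient of the toy function above.
    return np.array([z[1], z[0], 0.5 * np.exp(0.5 * z[2]), 0.0])


x = np.array([1.0, 2.0, 0.5, 3.0])  # instance to explain
r = np.zeros(4)                     # single reference value (baseline)

attributions = integrated_gradients(grad_f, x, r)
print(attributions, attributions.sum(), f(x) - f(r))
```

Up to the discretization error, the attributions sum to $f(x) - f(r)$, and the dummy feature receives no attribution because its partial derivative is zero along the whole path.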

The Integrated Gradients method is based on the Aumann-Shapley value, an extension of the Shapley value to games with infinitely many players and thus another feature attribution method grounded in cooperative game theory (Friedman 2004; Sundararajan et al. 2017). As such, it also satisfies the previously defined properties completeness, strong monotonicity, symmetry, dummy null effect and additivity (Friedman 2004). In comparison to SHAP, the method yields smoother feature attributions by averaging over intermediate inputs between the reference $r$ and the instance $x$ (Sundararajan et al. 2017).

The method has been extended to also average over multiple reference values by computing an expectation over Integrated Gradients when $r$ is a realization of $R$, called “Expected Gradients” (Erion et al. 2019):

$$\varphi(i, x, f, R) = E\left[ (x_i - R_i) \int_{\alpha=0}^{1} \frac{\partial f(R + \alpha (x - R))}{\partial x_i} \, d\alpha \right]$$

5.5 Prototype

The first central functionality provided is a preprocessing function which constructs a hierarchical time series from input data as illustrated in Figure 3. This functionality is aligned with the format of typical OLAP database query results (Bulos and Forsman 2006). As such, the epidemiological count data is provided as a long-format table with one column for the time index, a number of columns for the dimensions of the hierarchy and one column for the measure of interest. This input table should only provide the data at the lowest level of the hierarchy and can therefore be constructed using a standard group-by query on a line list. During the preprocessing step, the hierarchy structure is automatically inferred from the unique combinations of the dimension values and the corresponding summing matrix is constructed. The bottom-level time series values are stacked together row-wise into a data matrix and the time stamps of the indices are stored in a separate vector. The higher-level time series are thus not stored explicitly, but computed on request by multiplying the corresponding row of the summing matrix with the full data matrix. The original dimension values are saved as well, so that a specific time series can be selected not only by its index but also by querying for its attributes. The function also allows for multiple columns with measurement data in order to store input features which are hierarchical time series too.



Input table (long format):

Time        State    County    Cases
2019-12-30  State A  County 1  12
2019-12-30  State A  County 2  5
2019-12-30  State B  County 3  25
2020-01-06  State A  County 1  19
2020-01-06  State A  County 2  7
2020-01-06  State B  County 3  16
…           …        …         …

Structure of hierarchy:

$$S = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Bottom-level time series (rows: county, columns: time):

$$y = \begin{bmatrix} 12 & 19 & \dots \\ 5 & 7 & \dots \\ 25 & 16 & \dots \end{bmatrix}$$

Time index:

$$t = (\text{2019-12-30}, \; \text{2020-01-06}, \; \dots)$$

Figure 3 Preprocessing of a long-format table into the constituent elements of a hierarchical time series (exemplary data similar to Figure 1).
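The preprocessing step can be sketched with pandas as follows. This is not the prototype’s actual code: the function name `build_hierarchy`, its signature and the ordering of the rows in the summing matrix (total first, then each intermediate level, then the bottom level) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def build_hierarchy(table, time_col, dims, measure):
    """Infer summing matrix, data matrix and time index from a long table."""
    # One row per bottom-level series, one column per time stamp.
    wide = table.pivot_table(index=dims, columns=time_col, values=measure,
                             aggfunc="sum", fill_value=0)
    bottom = wide.index.to_frame(index=False)  # dimension values per series
    n = len(bottom)

    rows = [np.ones(n)]                        # top level: sum of everything
    for depth in range(1, len(dims)):          # intermediate levels
        prefixes = list(zip(*[bottom[d] for d in dims[:depth]]))
        for key in sorted(set(prefixes)):
            rows.append(np.array([p == key for p in prefixes], dtype=float))
    S = np.vstack(rows + [np.eye(n)])          # bottom level: identity

    Y = wide.to_numpy(dtype=float)             # bottom-level data matrix
    t = wide.columns.to_numpy()                # time index
    return S, Y, t, bottom


table = pd.DataFrame({
    "Time": ["2019-12-30"] * 3 + ["2020-01-06"] * 3,
    "State": ["State A", "State A", "State B"] * 2,
    "County": ["County 1", "County 2", "County 3"] * 2,
    "Cases": [12, 5, 25, 19, 7, 16],
})

S, Y, t, bottom = build_hierarchy(table, "Time", ["State", "County"], "Cases")

# Higher-level series are computed on request, e.g. state A (row 1 of S).
state_a = S[1] @ Y
print(S, Y, t, state_a, sep="\n")
```

Keeping only the bottom-level data matrix and the summing matrix avoids storing redundant aggregates, since any higher-level series is a single row-times-matrix product away.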

For reconciliation via projection, a function which takes a base forecast vector as input and returns the reconciled vector is implemented with the following strategies to choose from: bottom-up, top-down, ordinary least squares, structural scaling, full sample, variance scaling and shrinkage. The first four projections depend solely on the hierarchy structure and are thus computed only once, whereas the other projections require residuals as input and are newly computed on every call. The non-negativity enforcement proposed in section 5.3 is implemented as an additional parameter to the function. If activated, the reconciled vector is checked for negative values, and $\alpha_{min}$ and $P_{\text{non-neg}}$ are computed and applied if needed.
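A reduced sketch of such a reconciliation function is shown below for three of the structure-based strategies. The function names and the `strategy` argument are hypothetical, and the formulas are the standard ones from the reconciliation literature, $P = (S^\top W^{-1} S)^{-1} S^\top W^{-1}$ with $W = I$ for ordinary least squares and $W = \mathrm{diag}(S \mathbf{1})$ for structural scaling; the residual-based strategies are omitted.

```python
import numpy as np


def projection_matrix(S, strategy):
    """Projection matrix P for strategies that depend only on S."""
    m, n = S.shape
    if strategy == "bottom_up":
        return np.hstack([np.zeros((n, m - n)), np.eye(n)])
    if strategy == "ols":
        W_inv = np.eye(m)
    elif strategy == "structural":
        # Weight of each series: number of bottom-level series it aggregates.
        W_inv = np.diag(1.0 / S.sum(axis=1))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)


def reconcile(y_hat, S, strategy="ols"):
    """Map base forecasts to coherent reconciled forecasts."""
    return S @ projection_matrix(S, strategy) @ y_hat


S = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)
y_hat = np.array([50.0, 30.0, 25.0, 12.0, 5.0, 25.0])  # hypothetical

results = {s: reconcile(y_hat, S, s)
           for s in ("bottom_up", "ols", "structural")}
for name, y_rec in results.items():
    print(name, np.round(y_rec, 2))
```

Since these matrices depend only on $S$, they could be cached after the first call, in line with the remark that the structure-based projections are computed only once.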

Aside from the representation of the hierarchical time series and the reconciliation function, software to manage the base forecasting models in a convenient way is needed. As the hierarchy may become very large, it is impractical to specify all models and features individually for each time series. Instead, a generic interface is implemented which makes it possible to define all models at once. Different statistical model classes including generalized linear models and several tree-based machine learning models such as random forests are implemented with the same interface for model fitting and prediction. In the current version, it is assumed that the same model type is used for all time series and each time series has its own model instance, but this aspect could be refined easily. The features can be defined in a relative notation by the user, i.e. conceptually of the form “case numbers of the present time series lagged by one time step” or “seasonal feature”. The concrete features are then defined automatically for each time series and named unambiguously using a time series identifier as prefix.

The training of the base forecast models and the forecasting are offered in one single function each, which fits or forecasts all time series in parallel, taking advantage of the full computational power available using multiprocessing. For feature attribution of the forecasts, SHAP values are computed using the open-source implementation provided by the authors of SHAP. In this prototype, the training dataset used for the models is chosen to define the reference $R$. The most suitable model-specific attribution algorithm available is selected automatically. The base attributions are computed for all individual forecasts and only for the features used by each model, so that dummy features are already excluded. Then, all the attributions are merged into an attribution matrix of order $m \times p$ so that similar features are in the same column. Finally, the reconciliation of both the base forecasts and the base attributions is done jointly so that the same projection matrix can be reused. Moreover, the vector of expected reference forecasts $E[f(R)]$ is reconciled as well to yield its reconciled counterpart.
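The joint reconciliation can be illustrated with a small numerical example, assuming that the reconciled attributions are obtained by applying the same linear map $SP$ to the attribution matrix and to the expected reference forecasts. All numbers are randomly generated placeholders, and the ordinary least squares projection is used only as an example.

```python
import numpy as np

S = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)
m = S.shape[0]
p = 4                                     # number of feature columns

P = np.linalg.solve(S.T @ S, S.T)         # example projection (OLS)
SP = S @ P                                # computed once, reused three times

rng = np.random.default_rng(1)
Phi = rng.normal(size=(m, p))             # placeholder base attributions
expected = rng.uniform(5, 20, size=m)     # placeholder E[f(R)] per series
y_hat = expected + Phi.sum(axis=1)        # completeness of base attributions

y_rec = SP @ y_hat                        # reconciled forecasts
Phi_rec = SP @ Phi                        # reconciled attributions
expected_rec = SP @ expected              # reconciled reference forecasts

# Completeness carries over because reconciliation is linear.
print(np.allclose(y_rec, expected_rec + Phi_rec.sum(axis=1)))
```

Because the projection is linear, the reconciled attributions of each series still sum to the difference between its reconciled forecast and its reconciled expected reference forecast.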

The summing matrix and especially the attribution matrix are very large and have many zero entries. Therefore, considerable efficiency gains are achieved by representing them in a sparse matrix format, which only stores the indices and data of non-zero entries. The same format is chosen for the $P$ matrices, allowing for efficient sparse matrix multiplication during reconciliation.
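As a brief illustration of this design choice, the following sketch stores the summing matrix and a bottom-up projection matrix in SciPy’s compressed sparse row format. The use of `scipy.sparse` and of the CSR format in particular is an assumption; the text does not state which sparse format the prototype uses.

```python
import numpy as np
from scipy import sparse

S_dense = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)

S = sparse.csr_matrix(S_dense)            # only non-zero entries are stored
m, n = S.shape

# Bottom-up projection [0 | I_n] assembled directly in sparse form.
P_bu = sparse.hstack(
    [sparse.csr_matrix((n, m - n)), sparse.identity(n, format="csr")],
    format="csr",
)

y_hat = np.array([50.0, 30.0, 25.0, 12.0, 5.0, 25.0])  # hypothetical
y_rec = S @ (P_bu @ y_hat)                # sparse matrix-vector products

stored = S.nnz + P_bu.nnz                 # number of stored entries
dense = S_dense.size + n * m              # entries of the dense equivalents
print(y_rec, stored, dense)
```

For this toy hierarchy the saving is small, but for hierarchies with hundreds of series the share of non-zero entries in $S$ shrinks quickly, since every bottom-level series contributes to only one series per level.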

The components described above provide all functionality necessary to perform hierarchical time series forecasting and obtain explanations in the form of SHAP feature attributions. In the context of this work, further functions for visualisation, cross-validation and post-processing of the feature attributions are used, but their implementation is not central to the design objectives and therefore not discussed in detail. Nevertheless, they are used for demonstration and evaluation of the prototype in the next chapter.

6 Evaluation

In the following sections, the approach designed in chapter 5 is evaluated with regard to the design objectives defined in chapter 3.

For several parts of the evaluation, the implemented prototype is tested on exemplary forecasting tasks for two infectious diseases: Norovirus, a frequent cause of gastroenteritis, and Salmonellosis, which is caused by bacteria of the genus Salmonella (Robert Koch-Institut 2019). Both diseases are endemic in Germany and have been reported as part of routine surveillance for two decades. For the present evaluation, weekly time series of the reported case numbers and further epidemiological information (see below) are provided by the Robert Koch Institute through the SurvNet surveillance system (Krause et al. 2007).

Forecasts and explanations are produced for a total of 429 time series at the county (“Landkreise” and “Stadtkreise”), state and national level. Two different forecasting model types are used, generalized linear models and random forests, illustrating the different degrees of model complexity which could be chosen. For the generalized linear models, the popular hhh4 modeling framework is utilized (Paul and Meyer 2016). The weekly count data is modeled individually for every time series $k$ of the hierarchy as conditionally Poisson distributed:

$$y_k^{t+h} \sim \text{Poisson}(\lambda y_k^t + \nu_k^{t+h}) \quad \text{with} \quad \log(\nu_k^{t+h}) = \alpha_k + \beta_k t + \gamma_k \sin\left(\frac{2\pi}{365.25/7}(t + h)\right) + \delta_k \cos\left(\frac{2\pi}{365.25/7}(t + h)\right)$$

The non-epidemic component $\nu_k^{t+h}$ describes a seasonal and trend pattern with an intercept $\alpha_k$, a log-linear trend $\beta_k$ and yearly variation in the form of a sinusoidal wave. The parameters $\gamma_k$ and $\delta_k$ together define the height and onset of the wave. The epidemic component $\lambda y_k^t$ only consists of the current case numbers. Given this model, the base forecast is defined as the conditional mean, thus $f_k(\nu_k^{t+h}, y_k^t) = \lambda y_k^t + \nu_k^{t+h}$. As can be seen, the non-epidemic component is here used as a single feature as proposed in section 5.2.

For simplicity, the random forest models are initialised mostly with out-of-the-box configurations (Scikit-learn 2020), including 100 trees as estimators, and no hyperparameter tuning is performed. The random forest models are provided with a variety of features, including the current case numbers, the calendar week, estimates of the instantaneous reproduction number with different smoothing windows up to 4 weeks, and demographic and infection information on the reported cases. The demographic information consists of the percentage of male, female and diverse patients among the cases reported in one week, as well as the percentages of different age groups. The infection information includes the percentages of patients who work or are hosted in special contexts such as healthcare, mass accommodations or youth care, as well as the percentages of special contexts in which infections occurred, such as at work, in public transport or in a medical facility.
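The conditional mean of the generalized linear model described above can be written down directly. The sketch below is not the hhh4 implementation (which is an R package) but a plain re-expression of the formula; all parameter values are hypothetical and would normally be estimated from data.

```python
import numpy as np

WEEKS_PER_YEAR = 365.25 / 7


def non_epidemic(t, h, alpha, beta, gamma, delta):
    """Non-epidemic component nu: intercept, log-linear trend, yearly wave."""
    omega = 2 * np.pi / WEEKS_PER_YEAR
    return np.exp(alpha + beta * t
                  + gamma * np.sin(omega * (t + h))
                  + delta * np.cos(omega * (t + h)))


def base_forecast(y_t, nu, lam):
    """Conditional mean: epidemic component plus non-epidemic component."""
    return lam * y_t + nu


# Without trend and seasonality, nu reduces to exp(alpha).
nu_flat = non_epidemic(t=10, h=1, alpha=np.log(3.0),
                       beta=0.0, gamma=0.0, delta=0.0)
forecast = base_forecast(y_t=20.0, nu=nu_flat, lam=0.5)

# With seasonality but no trend, nu repeats after one year.
nu_now = non_epidemic(t=10, h=1, alpha=0.0, beta=0.0, gamma=1.0, delta=1.0)
nu_next_year = non_epidemic(t=10 + WEEKS_PER_YEAR, h=1, alpha=0.0,
                            beta=0.0, gamma=1.0, delta=1.0)
print(forecast, nu_now, nu_next_year)
```

Treating $\nu_k^{t+h}$ as a single feature makes the forecast function linear in its two inputs, so the exact linear attribution from section 5.4 would apply without rescaling.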


6.1 Flexibility

The flexibility of the approach developed is mostly shaped by the central design choices, i.e. the classes of methods selected, and their interdependencies. In the following, flexibility along the predefined dimensions is assessed by reasoning from the insights gained about the individual methods during design.

Disease Flexibility

The statistical modeling paradigm allows for forecasting of many different diseases without extensive customization effort, because the main modeling concepts used, such as seasonality, trend, autoregression or the use of count data probability distributions, are not very disease-specific. This fact is also emphasized by the existence of generic software packages for infectious disease forecasting (Meyer et al. 2017). Therefore, in the best case, standard models could be deployed for many diseases without further configuration. Nevertheless, if the goal is to achieve outstanding forecast accuracy, one may still need to model disease-specific aspects and perform manual adjustments (Manheim et al. 2017). Moreover, the choice of the features could be disease-specific to a certain degree. The methods chosen for reconciliation and feature attribution are independent of the disease and therefore provide high flexibility. On the other hand, the explanations obtained via feature attribution can only be as disease-specific as the forecasting model. Therefore, if a particular aspect of a disease (e.g. the geographical distribution of ticks for tick-borne encephalitis, German: FSME (Robert Koch-Institut 2019)) is desired as part of the explanation, it must be included in the model.

Information Flexibility

As already discussed, a main requirement of statistical models, and especially machine learning models, is sufficient historical data. The minimal requirement is difficult to state generally. A theoretical limit is that the number of observations must be at least equal to the number of parameters, but randomness of the time series leads to much stronger requirements in practice (Hyndman and Kostenko 2007). Usually, several years of historical data are used (Reich et al. 2016a; Mcgowan et al. 2019). Therefore, emerging diseases for which reports have not yet been collected over a sufficiently long period cannot be forecast using the approach proposed here. Nevertheless, in the application context of routine surveillance, this requirement is satisfied for a large number of notifiable diseases (Robert Koch-Institut 2019).

Aside from the volume of data, the minimal requirement for forecasting is simply a time series of the quantity of interest, from which many important features can already be extracted. This makes the approach applicable to most surveillance scenarios. Nevertheless, as has been illustrated in section 5.2, statistical models also allow many more features to be integrated, even without direct biological interpretation, which provides high information flexibility. The ability to also use novel data sources such as online activity or syndromic surveillance data makes the approach well aligned with trends in digital epidemiology. The ability of certain machine learning models such as tree-based models to perform automatic feature selection also makes it possible to use large numbers of features. However, certain biological expert knowledge at the level of disease transmission could be difficult to integrate into such models if no data that represents this relationship is available.

Again, because the constituent elements of the explanations obtained are contribution scores for the individual features, the explanations of this approach are closely tied to the model and the features. The potential richness but also complexity of explanations thus increases with the number of features. A strong advantage is that if new, statistically relevant features are added to a model, they will automatically become part of the explanation without further modeling effort. Moreover, because less influential features can be expected to receive contribution scores closer to zero, attributions provide insights into the relative importance of the features currently in the model$^{13}$.

Method Flexibility

The approach is restricted to statistical models which can be framed as regression models with a fixed set of input features. As discussed in section 5.2, miscellaneous model types fall under this category, including the most popular statistical models currently used in infectious disease forecasting. Regarding hierarchical forecasting, the choice of reconciliation via projection offers the choice between different linear projections and classical single-level strategies. Because of the standardized interface, multiple projections can be tried out and compared on the same base forecasts. The approach works irrespective of the statistical model used, but the accuracy of the reconciled forecasts may depend on how closely the base forecasts match the theoretical assumptions of the different projections, such as unbiasedness or covariance structure (Panagiotelis et al. 2020). It is also possible that better linear projections could be proposed in the future. However, the design excludes optimization-based approaches. First, this forgoes the option of possibly obtaining even more accurate forecasts through sophisticated optimization. Second, it has the essential downside that further validity constraints such as non-negativity cannot be expressed directly in a linear projection, so that workarounds such as the one in section 5.3 are needed.

The choice of local explanation via feature attribution restricts the approach to statistical forecasting models and linear projections for reconciliation. On the other hand, within these limits, methods can be chosen quite freely. Feature attribution offers a very generic framework for attribution which applies to many forecasting models. Nevertheless, efficient model-specific algorithms only exist for certain model types, essentially linear models, tree-based models and neural networks. Feature attributions for generalized linear models with a link function other than the identity function need to be rescaled and are thus only approximations. For other model types, model-agnostic attribution via sampling is possible but may be computationally costly, which is especially relevant in hierarchical forecasting due to the large number of models. Many different feature attribution methods already exist, but only two with a theoretical foundation that can guarantee certain desirable properties have been identified in this work (SHAP and Integrated Gradients). Lastly, the design of feature attribution for the reconciled forecasts as developed in Figure 5.1 restricts the approach to additive explanations only, which may be convenient in some respects, but distorts explanations for multiplicative models (Kumar et al. 2020).

13 Note that this is not identical to the absolute importance of a feature, i.e. the change in forecast accuracy when adding or removing the feature (Christoph Molnar 2019).

6.2 Hierarchical Coherence

The fact that individual base forecasts for different levels of the hierarchy are not guaranteed to be coherent can also be verified empirically on forecasts for disease time series. Figure 4 shows the relative incoherence of random forest base forecasts for Norovirus at the state level. Incoherence is here calculated as the absolute difference between the forecast for a state and the sum of forecasts for all its counties. As can be seen, the incoherence can be up to several times larger than the true case numbers, which is also considerable in absolute terms (several 100 cases).

[Figure 4: plot titled “Incoherence between County and State Forecasts over Time”; x-axis: Date (2014 to 2019); y-axis: Incoherence (0% to 500%).]

Figure 4 Incoherence between county and state base forecasts over time, relative to the true case numbers (100 cases per week on average). Colors indicate the different states.

In contrast, all reconciled forecasts are coherent by definition. This follows directly from the projection applied to the base forecasts, which combines the forecasts from all levels into a reconciled bottom-level forecast via the matrix $P$ and then computes the higher-level reconciled forecasts through multiplication with the summing matrix $S$. This property can also be verified using the formal constraint $\tilde{y} = SJ\tilde{y}$ introduced in chapter 5, by noting the following equality:

$$SJ\tilde{y} = SJSP\hat{y} = S I_n P\hat{y} = SP\hat{y} = \tilde{y} \quad \text{with} \quad JS = \begin{bmatrix} 0_{n \times (m-n)} & I_n \end{bmatrix} \times \begin{bmatrix} S'_{(m-n) \times n} \\ I_n \end{bmatrix} = I_n.$$
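Both the incoherence measure and the coherence constraint can be checked numerically on the exemplary hierarchy. The base forecasts below are hypothetical, and the ordinary least squares projection is used only as an example of a projection matrix.

```python
import numpy as np

S = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)
m, n = S.shape
J = np.hstack([np.zeros((n, m - n)), np.eye(n)])  # selects the bottom level

# Hypothetical base forecasts: total, state A, state B, counties 1-3.
y_hat = np.array([50.0, 30.0, 25.0, 12.0, 5.0, 25.0])

# Incoherence at the state level: |state forecast - sum of its counties|.
incoherence_states = np.abs(y_hat[1:3] - S[1:3] @ y_hat[3:])

# Reconciliation with an example projection (OLS).
P = np.linalg.solve(S.T @ S, S.T)
y_tilde = S @ P @ y_hat

# Coherence constraint: aggregating the reconciled bottom level
# reproduces the full reconciled vector.
coherent = np.allclose(S @ J @ y_tilde, y_tilde)
print(incoherence_states, coherent)
```

The check holds for any projection matrix $P$, because it only relies on $JS = I_n$ and not on properties of $P$ itself.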

A further important aspect, albeit not originally related to hierarchical coherence, is the issue of negative reconciled forecasts. As already noted, the reconciliation procedure can yield negative forecasts even if all base forecasts are positive, because the projection matrix may have negative coefficients. This trait is obviously troublesome when forecasting epidemiological count data, but the reconciled forecasts produced with the prototype in this work indicate that the issue may not be too severe, as the negativity remains within narrow limits for certain projections. For example, forecast reconciliation via the $\mathrm{WLS}_V$ projection on the same base forecasts as in Figure 4 produced only forecasts above −1, as shown in Figure 5, so that upward rounding would already give non-negative results. Nonetheless, one may prefer to use a workaround as proposed in chapter 5, which shrinks the projection towards a bottom-up strategy and thus prevents negative forecasts altogether. As already discussed, this however comes at the cost of lower theoretical soundness of the feature attributions. The effect of this workaround on accuracy is evaluated in the next section. Lastly, it should be noted that the projections representing the single-level strategies do not exhibit this issue, because the entries of the corresponding projection matrices are all non-negative.

[Figure 5: plot titled “Negative Reconciled Forecasts over Time”; x-axis: Date (2014 to 2019); y-axis: Reconciled Forecast (−0.3 to 0).]

Figure 5 Negative reconciled forecasts obtained using $\mathrm{WLS}_V$ over time.

57
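How negative coefficients in a projection matrix can turn positive base forecasts into a negative reconciled forecast may be illustrated with a deliberately extreme toy case. This is a sketch only: the three-series hierarchy, the hard-coded OLS projection (S'S)^{-1}S' and the base forecast values are assumptions chosen for illustration, not results from the prototype.

```python
from fractions import Fraction as F

def matvec(A, v):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

S = [[1, 1], [1, 0], [0, 1]]                      # total = bottom 1 + bottom 2
P_ols = [[F(1, 3), F(2, 3), F(-1, 3)],            # contains negative entries
         [F(1, 3), F(-1, 3), F(2, 3)]]
P_bu = [[0, 1, 0], [0, 0, 1]]                     # bottom-up: all entries non-negative

# Strictly positive but strongly incoherent base forecasts (total, bottom 1, bottom 2)
y_hat = [F(1), F(1, 10), F(5)]

rec_ols = matvec(S, matvec(P_ols, y_hat))
rec_bu = matvec(S, matvec(P_bu, y_hat))

print("OLS reconciled:      ", [float(v) for v in rec_ols])
print("bottom-up reconciled:", [float(v) for v in rec_bu])
```

In this constructed case the OLS-reconciled forecast for the first bottom-level series is negative, while the bottom-up projection stays non-negative because none of its entries are negative.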

6.3 Accuracy

Because the approach designed in this work may embrace many different infectious disease forecasting methods, it is difficult to evaluate its general accuracy. In the following, the evaluation will therefore focus on a relative assessment by investigating the accuracy gains or losses of the reconciled forecasts compared to the base forecasts.

For evaluation of point forecast accuracy, popular absolute measures of accuracy are the mean squared error, $\mathrm{MSE}(y, \hat{y}) = \frac{1}{T}\sum_{t=1}^{T}(y_t - \hat{y}_t)^2$, and the mean absolute error, $\mathrm{MAE}(y, \hat{y}) = \frac{1}{T}\sum_{t=1}^{T}|y_t - \hat{y}_t|$ (Kane et al. 2014; Reich et al. 2016a; Hyndman and Koehler 2006). These measures summarize the deviation between ground truth and prediction over several observations and take both over- and underestimation into account. Often, the square root of the MSE (RMSE) is used as well, so that the metric scales proportionally to the individual errors (Nsoesie et al. 2014). Because the MSE disproportionately penalizes large deviations, it is more sensitive to outliers than the MAE (Hyndman and Koehler 2006).

A drawback of absolute error measures is that they are not well suited for comparison across different datasets: they are not scaled against the underlying ground truth data, so forecasts for time series with higher values tend to also have larger errors. In hierarchical forecasting, this may be a problem since the top-level time series can indeed have much higher values than the bottom-level series (Athanasopoulos et al. 2020). An alternative which provides scaling is the use of percentage errors such as the mean absolute percentage error, $\mathrm{MAPE}(y, \hat{y}) = \frac{1}{T}\sum_{t=1}^{T}\left|\frac{y_t - \hat{y}_t}{y_t}\right|$, which divides the forecast error by the ground truth value (Nsoesie et al. 2014). Nevertheless, percentage errors have the disadvantage that they are undefined for ground truth values equal to zero (Hyndman 2014). This property is especially troubling in the case of hierarchical forecasting of disease case numbers, because epidemic diseases can exhibit low incidence rates outside the epidemic season, so that the more detailed time series may often have zero new cases within a relevant time interval.

Another scaled alternative is given by so-called relative error measures, which divide the absolute error measure of the forecast (according to some metric like the MAE) by the absolute error measure of a baseline forecast, thus yielding a scaled and easily interpretable metric that is smaller than one if the forecast is better than the baseline and larger otherwise (Hyndman and Koehler 2006; Reich et al. 2016a). An example of such a measure is the relative mean absolute error (with baseline forecast $\bar{y}$):

\[
\mathrm{RMAE}(y, \hat{y}) = \frac{\mathrm{MAE}(y, \hat{y})}{\mathrm{MAE}(y, \bar{y})}
\]

This ratio is well defined unless the absolute error of the baseline forecast is always zero, which is very unlikely. Usually, the baseline forecast is derived from a naive model, which may be a simple statistical model of the seasonal case numbers in the case of infectious disease forecasting (Reich et al. 2016b).
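The error measures discussed above can be written down in a few lines. The following is a minimal sketch with invented example data; the function and variable names are chosen here for illustration and are not taken from the prototype.

```python
def mse(y, y_hat):
    """Mean squared error over all observations."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mae(y, y_hat):
    """Mean absolute error over all observations."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def mape(y, y_hat):
    """Mean absolute percentage error; fails if any true value is zero."""
    return sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)

def rmae(y, y_hat, y_baseline):
    """MAE of the forecast relative to the MAE of a baseline forecast."""
    return mae(y, y_hat) / mae(y, y_baseline)

# Invented weekly case counts including a week with zero cases
y_true = [0, 2, 4, 6]
forecast = [1, 2, 3, 6]
baseline = [2, 2, 2, 2]

print("MAE :", mae(y_true, forecast))
print("MSE :", mse(y_true, forecast))
print("RMAE:", rmae(y_true, forecast, baseline))  # below one: better than the baseline
```

Calling `mape(y_true, forecast)` on this data raises a `ZeroDivisionError` because of the zero-case week, which mirrors the problem described for low-incidence time series.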

Given the above considerations, the RMAE is chosen here as the main criterion to judge the overall accuracy of hierarchical forecasts, because it allows forecast accuracy to be compared even across hierarchy levels. Following the suggestion of Reich et al. (2016b), a generalized linear model with sinusoidal seasonal variation (similar to the seasonal component of hhh4 as proposed by Paul and Meyer (2016)) is fit on the training data to produce the naive baseline forecast for the RMAE.

For the validation of a forecasting method on empirical data, it is recommended to measure the performance of forecasts for new observations that were not used to calibrate the model (out-of-sample forecasts) in order to detect overfitting (Hyndman and Athanasopoulos 2018). Accordingly, a walk-forward cross-validation scheme is used here. The base forecast models are fit on training data from an input window with fixed size (3 years with 52 weeks each) and then used to forecast the subsequent 8 weeks. Afterwards, the training window is moved by eight weeks into the future. This procedure hence imitates a deployment in which the models are retrained every 2 months on the most recent data.¹⁴
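The windowing logic of such a walk-forward scheme can be sketched as follows. This is an illustration under assumptions: the function name, its parameters and the series length of 180 weeks are hypothetical, and only the window size (156 weeks) and step (8 weeks) are taken from the description above.

```python
def walk_forward_splits(n_weeks, train_size=3 * 52, horizon=8):
    """Yield (train, test) index ranges for a fixed-size window moved by `horizon` weeks."""
    start = 0
    while start + train_size + horizon <= n_weeks:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + horizon)
        yield train, test
        start += horizon  # retrain on the most recent data after every test block

# Hypothetical series of 180 weekly observations
splits = list(walk_forward_splits(n_weeks=180))
for train, test in splits:
    print(f"train weeks {train.start}-{train.stop - 1}, test weeks {test.start}-{test.stop - 1}")
```

Each test block starts directly after its training window, so no observation is used for calibration and evaluation within the same split.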

The cross-validation is performed for both exemplary diseases (Norovirus and Salmonellosis) and using both forecasting model types (generalized linear models and random forests). Moreover, different short-term forecasting horizons are tested (1, 2 and 4 weeks). The projections tested for reconciliation are bottom-up (BU), top-down (TD), ordinary least squares (OLS), structural scaling (WLS_S), variance scaling (WLS_V), full sample covariance (MinT_sample) and full sample covariance with shrinkage (MinT_shrink), as well as the corresponding non-negative versions. All reconciled forecasts are calculated using the same base forecasts, which is not only efficient but also reduces undesirable variance in relative accuracy which could result from variation in the forecasting models. For the top-down strategy, the average historical proportions (on the training data) of the bottom-level time series are used. Similarly, the residuals used for calculation of the WLS_V, MinT_sample and MinT_shrink projections are the residuals of the in-sample forecasts on the training data.

Table 6.3 gives a quantitative overview of the cross-validation results for Norovirus, where the average MAE within a hierarchy level for the different projections is compared against the average RMAE for the base forecasts. The table distinguishes between level, model and forecast horizon. For each set, the percentage change in average RMAE relative to the base forecasts is given and the best projection highlighted in boldface. The MAE and RMAE of the base forecasts are given for reference. Note that MinT_sample is excluded because of exceptionally bad performance (see explanation below). The results for Salmonellosis are qualitatively very similar and reported in section B of the appendix.

14 This resolution was chosen under consideration of the computational burden of cross-validation. Of course, one may obtain more accurate forecasts with a higher update frequency, but a tentative comparison indicated no qualitative differences between weekly and monthly updates for the evaluation of relative accuracy as intended here.

Change in average RMAE relative to the base forecasts in %; base MAE and base RMAE in absolute terms.

                                      GLM                    Random Forest
Level   Projection         h = 1      h = 2      h = 4    h = 1    h = 2    h = 4

Nation  BU                 +6.18      +6.18      +6.18    -2.96    -3.23    -4.07
        TD                 -0.00      -0.00      -0.00    -0.00    -0.00    -0.00
        OLS                -0.20      -0.20      -0.20    -0.60    -0.66    -0.70
        WLS_S              +0.20      +0.20      +0.20    -4.20    -4.05    -4.56
        WLS_V              +4.33      +4.33      +4.33    -3.28    -3.46    -4.18
        MinT_shrink      +697.80  +3,094.36  +1,498.64    -3.10    -1.98    -3.68
        OLS+             +132.05    +132.05    +132.05    +4.57    +4.04    +2.90
        WLS_S+            +79.80     +79.80     +79.80    +0.66    +0.20    -0.60
        WLS_V+           +130.50    +130.50    +130.50    -2.50    -2.69    -3.47
        MinT_shrink+       +7.05      +6.83      +6.05    -3.54    -3.29    -4.23
        Base MAE          355.65     355.65     355.65   401.42   399.55   404.82
        Base RMAE           0.93       0.93       0.93     1.05     1.04     1.06

State   BU                 +5.36      +5.36      +5.36    -2.10    -1.82    -2.37
        TD                +13.46     +13.46     +13.46    +9.35    +9.36    +9.42
        OLS                +1.31      +1.31      +1.31    +9.40    +9.03    +9.85
        WLS_S              +0.93      +0.93      +0.93    -2.35    -2.07    -2.32
        WLS_V              +3.74      +3.74      +3.74    -2.54    -2.31    -2.75
        MinT_shrink      +886.09  +3,492.71  +3,301.36    -1.49    -0.55   +18.32
        OLS+             +153.00    +153.00    +153.00    +5.17    +5.25    +4.77
        WLS_S+            +58.98     +58.98     +58.98    +0.11    +0.31    -0.13
        WLS_V+           +197.72    +197.72    +197.72    -2.01    -1.75    -2.28
        MinT_shrink+       +8.93      +8.05      +8.20    -2.50    -1.87    -2.56
        Base MAE           28.23      28.23      28.23    31.55    31.46    31.56
        Base RMAE           0.93       0.93       0.93     1.01     1.01     1.01

County  BU                 +0.00      +0.00      +0.00    +0.00    +0.00    +0.00
        TD                 +4.46      +4.46      +4.46    -2.03    -2.12    -2.04
        OLS                +6.05      +6.05      +6.05    +2.78    +2.49    +2.85
        WLS_S              +3.73      +3.73      +3.73    -0.06    -0.12    -0.03
        WLS_V              -0.30      -0.30      -0.30    -0.21    -0.21    -0.21
        MinT_shrink    +2,957.29 +12,797.12  +6,305.14    +4.78    +5.57   +43.17
        OLS+              +99.31     +99.31     +99.31    +2.53    +2.38    +2.36
        WLS_S+            +33.87     +33.87     +33.87    +1.35    +1.29    +1.27
        WLS_V+            +60.55     +60.55     +60.55    +0.07    +0.08    +0.07
        MinT_shrink+       +9.20     +10.60      +7.22    +0.10    +0.13    +0.12
        Base MAE            2.39       2.39       2.39     2.48     2.48     2.48
        Base RMAE           0.98       0.98       0.98     1.02     1.02     1.02

A first central observation to make is that while the performances of the projections are very stable over the different forecast horizons, there are extreme differences between the two model types. Improvements for random forest base forecasts often match deteriorations for the GLM, and it seems that reconciliation of the GLM forecasts generally comes at the cost of accuracy. However, a low degree of incoherence of the base forecasts does not explain this phenomenon, because the bottom-up method performs badly as well. On the other hand, the random forest base forecasts appear too noisy on the bottom level, because they can be improved by the top-down strategy, but a consideration of all levels leads to the best performance. Overall, the random forest models perform worse than the GLM on average.

The relative improvements achievable are generally more pronounced at the higher levels of the hierarchy, but still not very big in absolute terms (changes of 5% correspond to approx. 20 cases at the national level). In total, the weighted least squares estimators exhibit fairly good and stable performance compared to the other projections. MinT_shrink can perform extremely poorly in the case of the GLM, and the results for MinT_sample can be over 20 times worse. The reason for this complete breakdown of the method is that the number of training data observations used for the residuals (3 × 52 = 156) is less than the total number of time series (429), leading to a flawed sample covariance matrix (Wickramasuriya et al. 2019). Regarding the non-negative versions of the projections, it is surprising that the RMAE can be worse or better than the RMAE of both the original matrix and the bottom-up matrix. It thus seems that the projections obtained by shrinking from a matrix P towards P_BU are idiosyncratic in terms of performance.
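Why too few observations lead to a flawed sample covariance matrix can be made explicit with a short rank argument. The following is a sketch under the assumption that the covariance is estimated from the T = 156 in-sample residual vectors without any regularisation; the symbols $\hat{e}_t$, $E$ and $\hat{W}$ are introduced here for illustration only.

\[
\hat{W} = \frac{1}{T}\sum_{t=1}^{T}\hat{e}_t\hat{e}_t^{\top} = \frac{1}{T}E^{\top}E,
\quad E \in \mathbb{R}^{T\times m}
\quad\Longrightarrow\quad
\operatorname{rank}(\hat{W}) = \operatorname{rank}(E) \le \min(T, m) = \min(156, 429) = 156 < 429 .
\]

Under this assumption, the $429 \times 429$ matrix $\hat{W}$ is singular and therefore cannot be inverted directly, whereas projections that only use the diagonal of the covariance matrix do not require such an inverse of the full matrix.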

Nevertheless, an assessment of the performances should not only be based on the average change but also on the range of possible improvements or deteriorations, as shown in Figure 6. On the county level for example, the outstanding average RMAE of the top-down method is moderated by its very high variance. This may result from the aggregation at the top level, which yields a good average forecast but is less sensitive to lower-level variation. From a risk-averse standpoint, one may thus prefer reconciliation via OLS or WLS_S to minimize the maximal deterioration.

To summarize, the present results are heterogeneous, showing considerable variation depending on the base forecasts used. It is furthermore not possible to identify one superior method on all levels, and a weighing of average performance against robustness may be required. On the other hand, among the single-level strategies and the projections without full sample covariance matrix, the absolute differences in performance are not extreme.

Figure 6 Relative changes in MAE between base and reconciled forecasts for Norovirus with random forests. [Panels: (a) Nation, (b) States, (c) Counties; y-axis of each panel: −30% to 30%.]

6.4 Interpretability

Lastly, the quality of the explanations that can be obtained with the proposed approach must be judged in order to evaluate its post-hoc interpretability. As mentioned in section 3.3, a functionally-grounded approach is taken here (Doshi-Velez and Kim 2017; Samek et al. 2019), consisting of two parts. First, it is evaluated whether the explanations that can be obtained are sound with respect to the forecasting model. Because there is evidence that humans often intuitively choose simple explanations over faithful explanations (Mohseni et al. 2018), this aspect is best evaluated formally. Here, it is investigated to what extent the desirable properties proposed in section 5.1 can be fulfilled by hierarchical attributions. Second, the informative value of the explanations is studied by testing how well the feature attributions are suited to characterise the underlying epidemiological situation and thus provide insights in addition to the forecast.

Fulfilment of Desirable Properties

In chapter 5, it has been suggested to obtain attributions of a feature i for the base forecasts by computing a univariate attribution $\hat\phi_k(i, x, \hat f_k, R)$ for each time series $k \in \mathcal{M}$ individually and stacking all attributions into an m-dimensional vector $\hat\phi(i, x, \hat f, R)$. For the reconciled forecasts, feature attributions can then be obtained by applying the same linear projection to the base attributions that was used for reconciliation of the base forecasts, i.e. $\tilde\phi(i, x, \tilde f, R) = SP\hat\phi(i, x, \hat f, R)$. Moreover, six desirable properties of hierarchical feature attributions have been defined in section 5.1. In this section, the quality of such attributions with respect to the desirable properties is analysed, using the unified notation established in chapter 5. For conciseness, $\hat\phi(i, x, \hat f, R)$ will hereafter be called the base attribution and $\tilde\phi(i, x, \tilde f, R)$ the reconciled attribution.
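The stacking and projection step can be sketched for a toy hierarchy as follows. Everything concrete here is an assumption for illustration: the three-series hierarchy, the hard-coded OLS projection (S'S)^{-1}S', the feature names and the base attribution values.

```python
from fractions import Fraction as F

def matvec(A, v):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

S = [[1, 1], [1, 0], [0, 1]]                  # total = county 1 + county 2
P = [[F(1, 3), F(2, 3), F(-1, 3)],            # OLS projection for this S (assumption)
     [F(1, 3), F(-1, 3), F(2, 3)]]

# Hypothetical base attributions, one stacked m-vector per feature:
# entries refer to (total, county 1, county 2)
base_attr = {
    "recent_cases": [F(6), F(1), F(2)],
    "seasonality": [F(-2), F(1), F(0)],
}

# Reconciled attribution = S P applied to the stacked base attribution of each feature
reconciled_attr = {feat: matvec(S, matvec(P, phi)) for feat, phi in base_attr.items()}

for feat, phi in reconciled_attr.items():
    print(feat, [float(v) for v in phi])
```

In this example the base attributions of each feature are incoherent across levels (for instance 6 versus 1 + 2), while the projected attributions add up from the county level to the total.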

Base Attributions

Because properties 3-6 are defined separately for each time series $k \in \mathcal{M}$, they will hold if the univariate attribution methods $\hat\phi_k$ satisfy them individually. This also applies to property 2, completeness, because if for all time series k, the attributions for the base forecast $\hat{y}_k$ match the difference between the forecast and the expected reference forecast, the same will be true for the stacked vectors of attributions and forecasts. On the other hand, the coherence property does not have an equivalent in the univariate case and is not guaranteed to hold. On the contrary, if the base forecasts are incoherent and the attribution method fulfils completeness, then coherence cannot be satisfied. To see why this is the case, simply note that

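\begin{align*}
\hat\phi(i, x, \hat f, R) &= SJ\,\hat\phi(i, x, \hat f, R) \quad \forall i \in \mathcal{P} && \text{(Coherence)} \\
\sum_{i\in\mathcal{P}} \hat\phi(i, x, \hat f, R) &= SJ \sum_{i\in\mathcal{P}} \hat\phi(i, x, \hat f, R) \\
\hat f(x) - \hat f(R) &= SJ\,(\hat f(x) - \hat f(R)) && \text{(Completeness)}
\end{align*}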

In order for $\hat f(x) - \hat f(R) = SJ(\hat f(x) - \hat f(R))$ to always hold, $\hat f$ must produce coherent forecasts which respect the aggregation constraints of the hierarchical time series. Nevertheless, one would not expect a coherent explanation for an incoherent forecast, so the potential violation of coherence is negligible regarding the attribution of base forecasts.

Reconciled Attributions

Regarding the reconciled attributions, the main question is whether the desirable properties can be preserved under the linear projection SP. In the following, a series of formal propositions and proofs is provided that state under which conditions each of the properties is also satisfied by the reconciled attributions.

Coherence
A first convenient observation to make is that the reconciled attributions are guaranteed to be coherent, based on the same rationale as already discussed regarding the reconciled forecasts. The projection SP ensures coherence of the attributions for a feature i between all levels. For example, the sum of the bottom-level attributions will always be equal to the top-level attribution. The proof is analogous to the one presented for the reconciled forecasts in section 6.2.

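Proposition.
Reconciled attributions always satisfy coherence.

Proof.
\[
SJ\,\tilde\phi(i, x, \tilde f, R) = SJSP\,\hat\phi(i, x, \hat f, R) = S I_n P\,\hat\phi(i, x, \hat f, R) = SP\,\hat\phi(i, x, \hat f, R) = \tilde\phi(i, x, \tilde f, R)
\]
because
\[
JS = \begin{bmatrix} 0_{n\times(m-n)} & I_n \end{bmatrix} \times \begin{bmatrix} S'_{(m-n)\times n} \\ I_n \end{bmatrix} = I_n \qquad \square
\]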

Note that coherence only applies individually to each feature. In particular, the above property does not imply coherence between the attributions of different features, even if they are taken from a hierarchy. For example, consider a case where the base forecasts for state A and its counties 1 and 2 each use a specific feature $x_A$, $x_1$ and $x_2$, which describes the current number of cases in the respective area. Then, the attributions for the features describing the case numbers in the counties do not necessarily sum up to the attribution for the feature describing the case numbers in the whole state:

\[
x_A = x_1 + x_2 \;\not\Longrightarrow\; \tilde\phi(i, x_A, \tilde f, R) = \tilde\phi(i, x_1, \tilde f, R) + \tilde\phi(i, x_2, \tilde f, R)
\]

On the other hand, such a form of coherence may not even be desirable, as it would force the attributions for aggregated features to be essentially redundant with respect to the lower-level features. On the contrary, aggregated features may indicate more general trends than the disaggregated features and hence have their own explanatory value. The interplay between attributions for different features should thus only depend on the base forecast models used.

Completeness
As defined in chapter 5, if an attribution method satisfies the completeness property, the sum of attributions for all features will match the difference between the expected reference forecast and the present forecast. As can be easily shown, if one applies the linear projection SP to a set of base forecasts and corresponding base attributions which fulfil completeness, the reconciled attributions will inherit the property with respect to the reconciled forecasts.

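Proposition.
If the base attributions satisfy completeness, then the reconciled attributions satisfy it too.

Proof.
\begin{align*}
\sum_{i\in\mathcal{P}} \tilde\phi(i, x, \tilde f, R)
&= \sum_{i\in\mathcal{P}} SP\,\hat\phi(i, x, \hat f, R)
= SP \sum_{i\in\mathcal{P}} \hat\phi(i, x, \hat f, R)
= SP\,(\hat f(x) - E[\hat f(R)]) \\
&= SP\hat f(x) - SP\,E[\hat f(R)]
= SP\hat f(x) - E[SP\hat f(R)]
= \tilde f(x) - E[\tilde f(R)] \qquad \square
\end{align*}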

In practice, if the expected reference forecast $E[\tilde f(R)]$ is needed for the explanation, it can be deduced from the expected reference base forecast, using $E[\tilde f(R)] = SP\,E[\hat f(R)]$.
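The inheritance of completeness can be checked numerically with linear base models, for which a complete attribution is available in closed form. The sketch below makes several assumptions: a toy three-series hierarchy with a hard-coded OLS projection, invented weights and inputs, and base attributions of the form w_{k,i}(x_i − E[r_i]), which for linear models sum to f_k(x) − E[f_k(R)].

```python
from fractions import Fraction as F

def matvec(A, v):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

S = [[1, 1], [1, 0], [0, 1]]
P = [[F(1, 3), F(2, 3), F(-1, 3)], [F(1, 3), F(-1, 3), F(2, 3)]]

W = [[F(2), F(1)], [F(1), F(0)], [F(0), F(3)]]  # weights: rows = series, columns = features
b = [F(1), F(0), F(2)]                          # intercepts per series
x = [F(4), F(2)]                                # instance to be explained
mu = [F(1), F(1)]                               # expected feature values under the reference

def f_hat(z):
    """Stacked base forecasts of the three linear models."""
    return [bk + sum(w * zi for w, zi in zip(row, z)) for bk, row in zip(b, W)]

# Base attribution of feature i: one m-vector with entries w_{k,i} * (x_i - mu_i)
phi_hat = [[row[i] * (x[i] - mu[i]) for row in W] for i in range(len(x))]
phi_tilde = [matvec(S, matvec(P, phi)) for phi in phi_hat]

f_tilde_x = matvec(S, matvec(P, f_hat(x)))
f_tilde_ref = matvec(S, matvec(P, f_hat(mu)))   # equals S P E[f_hat(R)] for linear models

attr_sum = [sum(phi[k] for phi in phi_tilde) for k in range(3)]
difference = [a - r for a, r in zip(f_tilde_x, f_tilde_ref)]
print(attr_sum == difference)
```

The reference forecast of the reconciled model is obtained here exactly as stated above, by projecting the expected reference base forecast.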

Strong Monotonicity
In the following proofs, denote C = SP as the projection matrix used to compute the reconciled forecast, so that reconciliation can be written in shorter notation as $\tilde f(x) = C\hat f(x)$. Hence, for one specific time series k, the reconciled forecast is simply the k-th entry of the matrix-vector product, $\tilde f_k(x) = \sum_{j=1}^{m} C_{k,j}\,\hat f_j(x)$.

First, note that even if $v^i_k(\mathcal{Z}, x, \tilde f', R) \ge v^i_k(\mathcal{Z}, x, \tilde f'', R)$ for all $\mathcal{Z} \subseteq (\mathcal{P} \setminus \{i\})$, the inequality $v^i_j(\mathcal{Z}, x, \hat f', R) \ge v^i_j(\mathcal{Z}, x, \hat f'', R)$ for all $\mathcal{Z} \subseteq (\mathcal{P} \setminus \{i\})$ is not guaranteed. In other words, the marginal contribution of i to the reconciled forecast $\tilde f'_k$ can be equal to or greater than its contribution to $\tilde f''_k$ for some time series k, even if its contribution to the base forecast $\hat f'_j$ is smaller than to $\hat f''_j$ for some time series j. As a consequence, using the strong monotonicity property of the base attributions alone is not sufficient to prove that it is satisfied by the reconciled attributions as well, because the condition of the property may not even apply for the base forecasts. For some intuitive examples, consider that C may have zero entries or even negative entries, so that the effect of smaller contributions to base forecasts can be ignored or even inverted for the reconciled forecasts. But even if all entries of C were strictly positive, a decrease in contribution to one base forecast could be offset by an increase in contribution to other base forecasts, leading to an overall increase in contribution to the reconciled forecast.

From these examples it follows that strong monotonicity can only be guaranteed if the base attributions for the different time series are related to each other in a way proportional to the relationship between the expected marginal contributions. Consider a base forecast function $\hat f'$ which changes in a way that, given a set of other features $\mathcal{Z}$, the expected marginal contribution of feature i to the base forecast for time series j increases. Let $\Delta v^i_j(\mathcal{Z}, x, \hat f, R)$ be this increase. Then, the effect on the expected marginal contribution to the reconciled forecast for time series k is simply $\Delta v^i_{k,j}(\mathcal{Z}, x, \tilde f, R) = C_{k,j}\,\Delta v^i_j(\mathcal{Z}, x, \hat f, R)$. For strong monotonicity to hold, this effect must be proportional to the effect on the reconciled attribution $\Delta\tilde\phi_{k,j}(i, x, \tilde f_k, R) = C_{k,j}\,\Delta\hat\phi_j(i, x, \hat f_j, R)$. Otherwise, the increase in attribution for k could be offset by a decrease in attribution for another time series k′, even when the increase of the expected marginal contribution for k is larger than the decrease for k′.

Therefore, further assumptions about the base attribution method must be made which introduce a connection between the expected marginal contributions of features and the corresponding feature attributions:

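Definition.
Let an attribution be called a weighted sum of marginal contributions if and only if it can be expressed as $\phi(i, x, f, R) = \sum_{\mathcal{Z} \subseteq (\mathcal{P}\setminus\{i\})} \omega_{\mathcal{P}}(\mathcal{Z})\, v^i(\mathcal{Z}, x, f, R)$, where $\omega_{\mathcal{P}}(\mathcal{Z}) \ge 0$ is a weighting function which only depends on the set of features $\mathcal{P}$ and a specific subset of features $\mathcal{Z}$.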

It is easy to see that the feature attribution method SHAP is included in this definition by verifying that the Shapley value uses the weighting function $\omega_{\mathcal{P}}(\mathcal{Z}) = \frac{|\mathcal{Z}|!\,(|\mathcal{P}|-|\mathcal{Z}|-1)!}{|\mathcal{P}|!}$. The Integrated Gradients method does not fully fit the definition because it also integrates over values between the reference and the instance. However, if the data used to describe R is sufficiently diverse, for example when the full training dataset is used, the values in between $r \sim R$ and x tend to be again values from R. Hence, Integrated Gradients is at least an approximation in the case of diverse references R. With this definition, the following proposition holds:

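Proposition.
If the base attributions are weighted sums of marginal contributions, then the reconciled attributions satisfy strong monotonicity.

Proof.
Note that $v(\mathcal{S}, x, \tilde f, R) = E[\tilde f(x_{\mathcal{S}}, R_{\mathcal{P}\setminus\mathcal{S}})] = E[C\hat f(x_{\mathcal{S}}, R_{\mathcal{P}\setminus\mathcal{S}})] = C\,E[\hat f(x_{\mathcal{S}}, R_{\mathcal{P}\setminus\mathcal{S}})]$.

Therefore, the expected marginal contribution of feature i to the reconciled forecast, denoted $v^i(\mathcal{S}, x, \tilde f, R)$, is a projection of the expected marginal contributions to the base forecasts.

\begin{align*}
& v^i_k(\mathcal{Z}, x, \tilde f', R) \ge v^i_k(\mathcal{Z}, x, \tilde f'', R) && \forall \mathcal{Z} \subseteq (\mathcal{P}\setminus\{i\}) \\
\iff\; & \sum_{j=1}^{m} C_{k,j}\, v^i_j(\mathcal{Z}, x, \hat f', R) \ge \sum_{j=1}^{m} C_{k,j}\, v^i_j(\mathcal{Z}, x, \hat f'', R) && \forall \mathcal{Z} \subseteq (\mathcal{P}\setminus\{i\}) \\
\implies\; & \omega_{\mathcal{P}}(\mathcal{Z}) \sum_{j=1}^{m} C_{k,j}\, v^i_j(\mathcal{Z}, x, \hat f', R) \ge \omega_{\mathcal{P}}(\mathcal{Z}) \sum_{j=1}^{m} C_{k,j}\, v^i_j(\mathcal{Z}, x, \hat f'', R) && \forall \mathcal{Z} \subseteq (\mathcal{P}\setminus\{i\}) \\
\implies\; & \sum_{\mathcal{Z} \subseteq (\mathcal{P}\setminus\{i\})} \omega_{\mathcal{P}}(\mathcal{Z}) \sum_{j=1}^{m} C_{k,j}\, v^i_j(\mathcal{Z}, x, \hat f', R) \ge \sum_{\mathcal{Z} \subseteq (\mathcal{P}\setminus\{i\})} \omega_{\mathcal{P}}(\mathcal{Z}) \sum_{j=1}^{m} C_{k,j}\, v^i_j(\mathcal{Z}, x, \hat f'', R) \\
\iff\; & \sum_{j=1}^{m} C_{k,j} \sum_{\mathcal{Z} \subseteq (\mathcal{P}\setminus\{i\})} \omega_{\mathcal{P}}(\mathcal{Z})\, v^i_j(\mathcal{Z}, x, \hat f', R) \ge \sum_{j=1}^{m} C_{k,j} \sum_{\mathcal{Z} \subseteq (\mathcal{P}\setminus\{i\})} \omega_{\mathcal{P}}(\mathcal{Z})\, v^i_j(\mathcal{Z}, x, \hat f'', R) \\
\iff\; & \sum_{j=1}^{m} C_{k,j}\, \hat\phi_j(i, x, \hat f'_j, R) \ge \sum_{j=1}^{m} C_{k,j}\, \hat\phi_j(i, x, \hat f''_j, R) && \text{(attribution)} \\
\iff\; & \tilde\phi_k(i, x, \tilde f'_k, R) \ge \tilde\phi_k(i, x, \tilde f''_k, R) && \square
\end{align*}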

The proof uses the fact that if all the sums for sets $\mathcal{Z}$ of marginal contributions for $f'$ are greater than or equal to the ones for $f''$, the same applies for a weighted sum of these expressions, where the weights are the non-negative $\omega_{\mathcal{P}}(\mathcal{Z})$. The penultimate transformation builds on the assumption that the base attributions are weighted sums of marginal contributions.
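The role of the Shapley weighting function and of linearity can be illustrated by brute-force enumeration. This is a simplified sketch, not the SHAP implementation: it assumes a single reference point r instead of a reference distribution R, three features, two invented base forecast functions and one hypothetical row (c1, c2) of the matrix C.

```python
from fractions import Fraction as F
from itertools import combinations
from math import factorial

def shapley_weight(n_features, subset_size):
    """omega_P(Z) = |Z|! (|P| - |Z| - 1)! / |P|!"""
    return F(factorial(subset_size) * factorial(n_features - subset_size - 1),
             factorial(n_features))

def shapley(f, x, r, i):
    """Weighted sum of marginal contributions of feature i over all subsets Z of the others."""
    n = len(x)
    others = [j for j in range(n) if j != i]
    total = 0
    for size in range(n):
        for Z in combinations(others, size):
            with_i = [x[j] if (j in Z or j == i) else r[j] for j in range(n)]
            without_i = [x[j] if j in Z else r[j] for j in range(n)]
            total += shapley_weight(n, size) * (f(with_i) - f(without_i))
    return total

f1 = lambda z: z[0] * z[1] + z[2]          # invented base forecast functions
f2 = lambda z: z[0] - 2 * z[1] * z[2]
c1, c2 = F(2, 3), F(-1, 3)                 # hypothetical row of C, one negative entry
g = lambda z: c1 * f1(z) + c2 * f2(z)      # corresponding reconciled forecast function

x, r = [2, 3, 1], [0, 1, 0]
direct = [shapley(g, x, r, i) for i in range(3)]
projected = [c1 * shapley(f1, x, r, i) + c2 * shapley(f2, x, r, i) for i in range(3)]
weight_sum = sum(shapley_weight(3, size) for size in range(3) for _ in combinations(range(2), size))
print(direct == projected, weight_sum)
```

Attributing the reconciled function directly and projecting the base attributions give the same values here, which is the linearity that the proof relies on.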

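Symmetry
The fulfilment of the symmetry property can be proved by showing that the strong monotonicity axiom implies the symmetry axiom, similar to Lundberg and Lee (2017).

Proposition.
If the base attributions are weighted sums of marginal contributions, then the reconciled attributions satisfy symmetry.

Proof.
Let $\tilde f''$ be a function identical to $\tilde f'$, only with the inputs i and j swapped. Then, the symmetry condition can be reformulated as:

\begin{align*}
& v_k(\mathcal{Z}\cup\{i\}, x, \tilde f', R) = v_k(\mathcal{Z}\cup\{j\}, x, \tilde f', R) && \forall \mathcal{Z} \subseteq (\mathcal{P}\setminus\{i,j\}) \\
\iff\; & v_k(\mathcal{Z}\cup\{i\}, x, \tilde f', R) = v_k(\mathcal{Z}\cup\{i\}, x, \tilde f'', R) && \forall \mathcal{Z} \subseteq (\mathcal{P}\setminus\{i,j\}) \\
\iff\; & v_k(\mathcal{Z}\cup\{i\}, x, \tilde f', R) - v_k(\mathcal{Z}, x, \tilde f', R) = v_k(\mathcal{Z}\cup\{i\}, x, \tilde f'', R) - v_k(\mathcal{Z}, x, \tilde f'', R) && \forall \mathcal{Z} \subseteq (\mathcal{P}\setminus\{i,j\}) \\
\iff\; & v^i_k(\mathcal{Z}, x, \tilde f', R) = v^i_k(\mathcal{Z}, x, \tilde f'', R) && \forall \mathcal{Z} \subseteq (\mathcal{P}\setminus\{i,j\})
\end{align*}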

The same transformation can be carried out with substitution of i instead of j. Therefore, the symmetry property follows from a double application of the strong monotonicity property, only on the smaller subset $\mathcal{Z} \subseteq (\mathcal{P}\setminus\{i,j\})$.

This also means that the same consequences of the matrix projection apply, viz. that symmetry of the reconciled attributions does not follow directly from symmetry of the base attributions and that further assumptions are required.

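Dummy Null Effect
First of all, it can be noticed that the reconciliation yields a “downstream” null effect for dummy variables:

Proposition.
If a feature i is a dummy in all base forecast functions, then its reconciled attributions will be zero¹⁵:

\begin{align*}
& \forall k \in \mathcal{M},\ \forall \mathcal{Z} \subseteq \mathcal{P},\ \forall r \in \operatorname{supp}(R):\quad \hat f_k(x_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) = \hat f_k(r_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) \\
\implies\; & \tilde\phi(i, x, \tilde f, R) = 0
\end{align*}

Proof.
\begin{align*}
& \hat f_k(x_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) = \hat f_k(r_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) && \forall k \in \mathcal{M},\ \forall \mathcal{Z} \subseteq \mathcal{P},\ \forall r \in \operatorname{supp}(R) \\
\implies\; & \hat\phi(i, x, \hat f, R) = 0 && \text{(dummy null effect)} \\
\implies\; & SP\,\hat\phi(i, x, \hat f, R) = SP\, 0 \\
\iff\; & \tilde\phi(i, x, \tilde f, R) = 0 && \square
\end{align*}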

15 Note that 0 refers to the m-dimensional null vector.

On the other hand, because different base forecast functions could offset each other in the feature i, it is possible in theory to have a reconciled forecast function where i is essentially a dummy, but has nonzero base attributions. Hence, the dummy null effect property of base attributions does not directly guarantee the same property for reconciled attributions. However, the following structural assumption about the attributions still allows this property to be inferred:

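Definition.
Let an attribution be called a “weighted sum of approximated gradients” if and only if it can be expressed by the following equation:

\[
\phi(i, x, f, R) = \sum_{\mathcal{Z} \subseteq \mathcal{P}} \sum_{r \in \operatorname{supp}(R)} \tau_{\mathcal{P}}(x, r, i, \mathcal{Z}) \left[ f(x_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) - f(r_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) \right]
\]

with the weighting function $\tau_{\mathcal{P}}(x, r, i, \mathcal{Z}) \ge 0$.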

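This assumption is rather weak and fulfilled by any attribution method which samples predictions with feature values from the reference distribution R and the actual input x to approximate the gradient of f with respect to $x_i$ and computes the feature attribution as a weighted sum of such approximated gradients. Moreover, it is also implied by the earlier and stronger assumption that the base attributions are weighted sums of marginal contributions, viz. $\phi(i, x, f, R) = \sum_{\mathcal{Z} \subseteq (\mathcal{P}\setminus\{i\})} \omega_{\mathcal{P}}(\mathcal{Z})\, v^i(\mathcal{Z}, x, f, R)$. To see why this is the case, note that the expected marginal contribution is defined on the reference distribution:

\[
v^i_{f,R}(\mathcal{Z}) = E\left[ f(x_{\mathcal{Z}\cup\{i\}}, R_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) - f(x_{\mathcal{Z}}, R_{\mathcal{P}\setminus\mathcal{Z}}) \right] = \sum_{r \in \operatorname{supp}(R)} \operatorname{Prob}(r) \left[ f(x_{\mathcal{Z}\cup\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) - f(x_{\mathcal{Z}}, r_{\mathcal{P}\setminus\mathcal{Z}}) \right]
\]

Letting $\tau_{\mathcal{P}}(x, r, i, \mathcal{Z}) = \omega_{\mathcal{P}}(\mathcal{Z})\, \operatorname{Prob}(r)\, \mathbb{1}_{i \notin \mathcal{Z}}(\mathcal{Z})$ yields equality between both requirements; therefore, the stronger assumption is a special case of the weaker one.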

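Proposition.
If the base attributions are weighted sums of approximated gradients, then the reconciled attributions satisfy the dummy null effect property.

Proof.
\begin{align*}
& \tilde f_k(x_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) = \tilde f_k(r_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) && \forall \mathcal{Z} \subseteq \mathcal{P},\ \forall r \in \operatorname{supp}(R) \\
\iff\; & \sum_{j=1}^{m} C_{k,j}\, \hat f_j(x_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) = \sum_{j=1}^{m} C_{k,j}\, \hat f_j(r_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) && \forall \mathcal{Z} \subseteq \mathcal{P},\ \forall r \in \operatorname{supp}(R) \\
\iff\; & \sum_{j=1}^{m} C_{k,j} \left[ \hat f_j(x_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) - \hat f_j(r_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) \right] = 0 && \forall \mathcal{Z} \subseteq \mathcal{P},\ \forall r \in \operatorname{supp}(R)
\end{align*}

For simpler notation, denote $\Delta_{\mathcal{P}}(\hat f, x, r, i, \mathcal{Z}) = \hat f(x_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})}) - \hat f(r_i, x_{\mathcal{Z}\setminus\{i\}}, r_{\mathcal{P}\setminus(\mathcal{Z}\cup\{i\})})$.

\begin{align*}
& \sum_{j=1}^{m} C_{k,j} \left[ \Delta_{\mathcal{P}}(\hat f, x, r, i, \mathcal{Z})_j \right] = 0 && \forall \mathcal{Z} \subseteq \mathcal{P},\ \forall r \in \operatorname{supp}(R) \\
\implies\; & \sum_{j=1}^{m} C_{k,j} \sum_{\mathcal{Z} \subseteq \mathcal{P}} \sum_{r \in \operatorname{supp}(R)} \tau_{\mathcal{P}}(x, r, i, \mathcal{Z}) \left[ \Delta_{\mathcal{P}}(\hat f, x, r, i, \mathcal{Z})_j \right] = 0 && \text{(weighted sum)} \\
\iff\; & \sum_{j=1}^{m} C_{k,j}\, \hat\phi(i, x, \hat f, R)_j = 0 && \text{(attribution)} \\
\iff\; & \tilde\phi(i, x, \tilde f, R)_k = 0 && \square
\end{align*}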

Additivity
Proposition.
If the reconciled forecast function $\tilde g$ is a sum of two reconciled forecast functions $\tilde f$ and $\tilde f'$, the reconciled attribution of each feature i for $\tilde g$ will be equal to the sum of reconciled attributions for $\tilde f$ and $\tilde f'$ if the corresponding base attributions satisfy additivity.

Proof.
Assume a base forecast function $\hat g(x) = \hat f(x) + \hat f'(x)$ and the corresponding reconciled forecast function $\tilde g(x) = SP\hat g(x) = SP(\hat f(x) + \hat f'(x)) = SP\hat f(x) + SP\hat f'(x) = \tilde f(x) + \tilde f'(x)$.

\begin{align*}
\tilde\phi(i, x, \tilde g, R) &= SP\,\hat\phi(i, x, \hat g, R) \\
&= SP\,(\hat\phi(i, x, \hat f, R) + \hat\phi(i, x, \hat f', R)) && \text{(additivity)} \\
&= SP\,\hat\phi(i, x, \hat f, R) + SP\,\hat\phi(i, x, \hat f', R) \\
&= \tilde\phi(i, x, \tilde f, R) + \tilde\phi(i, x, \tilde f', R) && \square
\end{align*}

Summary
In this section, it has been shown that reconciled attributions defined as $\tilde\phi(i, x, \tilde f, R) = SP\,\hat\phi(i, x, \hat f, R)$ ...

• ... always satisfy coherence

• ... satisfy completeness and additivity if the base attributions satisfy them too

• ... satisfy dummy null effect if the base attributions are weighted sums of approximated gradients

• ... satisfy strong monotonicity, symmetry and dummy null effect if the base attributions are weighted sums of marginal contributions

These formal properties only depend on the attribution method and not on the model or the features used. Therefore, by appropriate choice of the base attribution method, the desirable properties of attribution methods can be guaranteed for reconciled attributions and thus ensure that the explanations are coupled with the decision logic of the model. As has been discussed above, the SHAP values are such an appropriate attribution method, because they satisfy completeness, are additive and can be expressed as a weighted sum of marginal contributions. Integrated Gradients does not fully match the last definition, but is still an approximation in case of a diverse R.

Informativeness

This section investigates the second dimension of interpretability as defined for this work, the information value of the explanations regarding epidemiological interpretation of forecasts. The evaluation is conducted by case study of the explanations obtained for forecasts of the exemplary disease Norovirus, as well as through a simulation experiment.

First, it quickly becomes apparent that the reconciled feature attributions obtained are not interpretable in raw form, simply due to the large number of features used by the base forecast models. In the setting tested here, where each model had features specific to the individual time series, between approx. 1,000 and 13,000 different features were used overall. The full set of attributions for these features is obviously an explanation of overwhelming complexity and would rather be an obstacle to understanding and acceptance (Mohseni et al. 2018). Therefore, the complexity is here reduced through grouping by feature type (e.g. recent case numbers, seasonality, reproduction number, ...) and hierarchy level. Due to the additivity property, the attributions of all features in a group can be summed up to obtain their joint influence on the forecast. Moreover, for the use case of outbreak detection, the features can be divided into non-epidemic and epidemic features, with the hope that when looking only at epidemic features, extraordinary developments can be identified more clearly. Lastly, a further step is to scale the attributions by the population size of the corresponding subpopulation to make them better comparable between differently sized regions or across hierarchy levels.
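The grouping and scaling steps could look roughly as follows. This is a sketch with invented data: the tuple layout, the feature types, the epidemic flag and the population size are hypothetical and only illustrate the aggregation logic, not the data structures of the prototype.

```python
from collections import defaultdict

# Hypothetical reconciled attributions for one forecast:
# (feature type, hierarchy level of the feature, is an epidemic feature, attribution in cases)
attributions = [
    ("recent_cases", "county", True, 1.5),
    ("recent_cases", "county", True, 0.5),
    ("recent_cases", "state", True, 2.0),
    ("seasonality", "state", False, 4.0),
    ("reproduction_number", "county", True, -1.0),
]
population = 250_000  # hypothetical size of the forecasted subpopulation

# Sum the attributions within each (feature type, hierarchy level) group
grouped = defaultdict(float)
for feature_type, level, _, value in attributions:
    grouped[(feature_type, level)] += value

# Joint influence of all epidemic features, scaled per 100,000 inhabitants
epidemic_total = sum(value for _, _, is_epidemic, value in attributions if is_epidemic)
epidemic_per_100k = epidemic_total * 100_000 / population

for group, value in sorted(grouped.items()):
    print(group, value)
print("epidemic attribution per 100,000 inhabitants:", epidemic_per_100k)
```

Summing within groups is only meaningful because the attributions are additive; the scaling makes the result comparable between regions of different size.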

Geographical Characterisation of Outbreaks

The evaluation conducted here is based on the recognition that while the forecast should predict the future spreading of a disease, an informative explanation would indicate its geographical origin. Accordingly, it is hypothesized that if a forecasting model predicts a disease proliferation across several subpopulations during an outbreak, the feature attributions will highlight the sources of the outbreak. In order to test this hypothesis, a ground truth definition of the outbreak status in every subpopulation is needed, which is unfortunately often missing for real-world epidemic time series. An example is given in Figure 7, where 2-week-ahead reconciled forecasts (base forecasts with GLM) and attributions of the epidemic features for the reconciled national forecast are scaled per 100,000 inhabitants and compared at the state level during the first week of October 2012. Both views highlight a large outbreak of Norovirus in schools and kindergartens which occurred in East Germany at the time (Robert Koch-Institut 2012), however with subtle differences. For example, the forecast predicts comparably few cases in Berlin, while it has a medium-high epidemic attribution. In contrast, the forecast for Saxony-Anhalt is high but the epidemic attribution low. While the descriptive epidemiological report by RKI indicates that Berlin was in fact an outbreak hot-spot, it states that it remains unclear whether the rise in case numbers in Saxony-Anhalt was due to the outbreak or simply the “normal” beginning of the winter season with higher case numbers (Robert Koch-Institut 2012). The feature attributions here indicate that the higher forecasts are indeed based on the seasonal variation and not on an outbreak, but this hypothesis cannot be verified with existing reports.¹⁶ This lack of ground truth is a common problem in outbreak detection, so simple simulations of epidemic time series are often used as a makeshift. Similarly, taking the simulation procedure of the popular R package “surveillance” (Höhle et al. 2020) as a blueprint, a simplistic simulation experiment is conducted here to provide a controlled setting in which only the disease spreading between subpopulations is modeled and other influencing factors are excluded.

Figure 7 Reconciled forecasts and attributions of epidemic features by state in the first reporting week of October 2012 highlight an outbreak in East Germany.

¹⁶ The evaluation was further impeded by the fact that outbreak reports include patient cases without laboratory test, while the data available here was limited to cases with laboratory confirmation.

The simulation is a combination of a Hidden Markov Model to simulate random outbreaks in time and a Poisson model to produce corresponding case numbers. First, a binary outbreak state (no outbreak vs. outbreak) is simulated using a Markov chain for every point in time, where the next state only depends on the previous state. Then, a time series of weekly case numbers is simulated using a Poisson distribution whose mean depends on the outbreak state. During no-outbreak times, a simple seasonal baseline model is used as the mean. During an outbreak, the seasonal baseline is multiplied by a predefined factor to simulate unusually high case numbers (Höhle et al. 2020). For the present purpose, several interdependent time series which compose the bottom level of the hierarchical time series must be simulated, so the Markov model is extended to the multivariate setting. It is assumed that the subpopulations are arranged in a grid-structured graph and that an outbreak currently in k can spread across the edges of the graph to the directly neighbouring populations, denoted by N(k). Therefore, at each point in time t, each time series k has a binary outbreak state $S_k^t \in \{0, 1\}$, with the following transition probabilities:

$$S_k^0 = 0$$

$$P(S_k^t = 0 \mid S_k^{t-1} = 0) = (1 - p_k) \prod_{j \in N(k)} (1 - p_{j,k} \, s_j)$$

$$P(S_k^t = 0 \mid S_k^{t-1} = 1) = (1 - q_k) \prod_{j \in N(k)} (1 - p_{j,k} \, s_j)$$

As can be seen, no outbreak is assumed at the first time step. Then, the probability of having no outbreak in each subsequent step depends on the previous state as well as the states of the neighbours of k, where $p_k$ is the probability of a new outbreak in region k if there is currently no outbreak in the region, $q_k$ is the probability of an outbreak to end in region k if there is currently an outbreak in the region, and $p_{j,k}$ is the probability of an outbreak to spread from region j to region k if there is currently an outbreak in region j. In other words, $p_k$ controls the frequency, $q_k$ the duration, and $p_{j,k}$ the spreading of outbreaks.
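The multivariate outbreak state model can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, $s_j$ is taken to be the state of neighbour j at the previous time step, a single spreading probability `p_spread` is shared by all edges, and its default value is made up. The sketch follows the two transition formulas literally as printed; if $q_k$ is read strictly as the probability of an outbreak ending, the first factor in the second formula would be $q_k$ instead of $(1 - q_k)$.

```python
import numpy as np

def grid_neighbours(rows, cols):
    """Orthogonal (non-diagonal) neighbours N(k) for each cell of a rows x cols grid."""
    neighbours = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            neighbours[r * cols + c] = [
                i * cols + j for i, j in candidates if 0 <= i < rows and 0 <= j < cols
            ]
    return neighbours

def prob_no_outbreak(prev_state, neighbour_states, p_k, q_k, p_spread):
    """P(S_k^t = 0 | previous own state, previous neighbour states), as printed."""
    own = (1 - p_k) if prev_state == 0 else (1 - q_k)
    spread = np.prod([1 - p_spread * s for s in neighbour_states])
    return own * spread

def simulate_states(n_steps, rows=3, cols=3, p_k=0.002, q_k=0.8, p_spread=0.1, seed=0):
    """Simulate binary outbreak states for all grid cells; all cells start at 0."""
    rng = np.random.default_rng(seed)
    neighbours = grid_neighbours(rows, cols)
    states = np.zeros((n_steps, rows * cols), dtype=int)  # S_k^0 = 0
    for t in range(1, n_steps):
        for k, nbrs in neighbours.items():
            p0 = prob_no_outbreak(states[t - 1, k], states[t - 1, nbrs], p_k, q_k, p_spread)
            states[t, k] = int(rng.random() >= p0)  # outbreak with probability 1 - p0
    return states

states = simulate_states(n_steps=260)
print(states.sum(axis=0))  # number of outbreak weeks per cell
```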

The time series model of the weekly case counts $y_k^t$ consists of a sinusoidal curve very similar to the seasonal models introduced earlier, and an outbreak component:

$$y_k^t \sim \mathrm{Poisson}(\mu_k^t), \qquad \log(\mu_k^t) = \alpha_k + \gamma_k \sin\left(\frac{2\pi}{52}(t + \delta_k)\right) + \sigma_k^t S_k^t$$

$$\alpha_k \sim U(1.5, 2.5), \qquad \gamma_k \sim U(1.5, 2.5), \qquad \delta_k \sim U(0, 3), \qquad \sigma_k^t \sim U(1.005, 1.015)$$

Here, $\alpha_k$ is an intercept which can be thought of as an indicator of population size. The height $\gamma_k$ and shift $\delta_k$ of the seasonal curve allow slightly varying seasonal patterns for each time series k. Lastly, the outbreak size factor $\sigma_k^t$ determines the size of an outbreak relative to the seasonal baseline. Each of the parameters is drawn from a uniform distribution as defined above. The period of the seasonality is set to 52 weeks.
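A minimal sketch of the case count model, assuming a given matrix of binary outbreak states (time steps in rows, bottom-level series in columns). The function name, the example outbreak block and the seed are hypothetical; the parameter ranges are those stated above.

```python
import numpy as np

def simulate_counts(states, seed=0):
    """Draw weekly Poisson counts with a seasonal baseline and an outbreak term."""
    states = np.asarray(states)
    rng = np.random.default_rng(seed)
    n_steps, n_series = states.shape
    alpha = rng.uniform(1.5, 2.5, size=n_series)   # intercept per series
    gamma = rng.uniform(1.5, 2.5, size=n_series)   # seasonal height per series
    delta = rng.uniform(0.0, 3.0, size=n_series)   # seasonal shift per series
    sigma = rng.uniform(1.005, 1.015, size=(n_steps, n_series))  # outbreak size
    t = np.arange(n_steps)[:, None]
    # The additive outbreak term on the log scale multiplies the mean by exp(sigma).
    log_mu = alpha + gamma * np.sin(2 * np.pi / 52 * (t + delta)) + sigma * states
    bottom = rng.poisson(np.exp(log_mu))
    top = bottom.sum(axis=1)  # top-level series is the sum of all bottom-level series
    return bottom, top, log_mu

# Hypothetical outbreak: four weeks in the centre cell of a 3x3 grid.
states = np.zeros((104, 9), dtype=int)
states[30:34, 4] = 1
bottom, top, log_mu = simulate_counts(states)
print(bottom[28:36, 4], top[28:36])
```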

While being a strong abstraction of reality, the above model depicts the following properties of infectious disease time series in an idealized form: seasonal variation, noise (via a Poisson distribution), outbreaks in the form of sudden spikes in case numbers and spreading of outbreaks between subpopulations (Höhle 2016). The proliferation to neighbouring populations is here only modeled via correlation between the binary outbreak states; it does not depend on the size of the outbreak in terms of case numbers. However, this only makes forecasting and explanation of outbreaks based on the case numbers more difficult, providing a lower bound for the quality of explanations. In the experiment, a simple two-level hierarchy is simulated with 9 bottom-level time series of subpopulations arranged as 3×3 regular grid cells and one top-level time series (see Figure 8 for illustration). The disease can spread between cells only in orthogonal direction and not diagonally. For base forecasting, random forests are used to avoid any implicit replication of the simulation assumptions. Hence, the calendar week is used as a seasonal, non-epidemic feature, and the current case numbers and the instantaneous reproduction number are used as autoregressive, epidemic features. To best capture the spreading dynamics during forecasting, features for the current case numbers of all neighbouring cells are included as well, as epidemic features. The parameters for the transition probabilities of the Markov model are assumed similar for all k and fixed at $p_k = 0.2\%$ and $q_k = 80\%$ but varied for $p_{j,k}$.

Figure 8 Simulated outbreak situation, reconciled forecasts and attributions of autoregressive features, all at bottom level, shown in panels (a) Outbreak States, (b) Forecasts and (c) Attributions. Colors represent the respective scaled values. The forecasts predict spreading, the attributions highlight the origin.

One-step-ahead base forecasts for the time series and corresponding attributions are generated and reconciled via projection with structural scaling ($\mathrm{WLS}_S$). Then, the epidemic attributions for the top-level forecast are grouped by time series and summed, yielding a total contribution score for each time series which takes into account all positive and negative contributions of its epidemic features to the reconciled forecast for the top level. These scores are then used to characterise the underlying outbreak situation. As shown in an example in Figure 8, it can be observed that while the reconciled forecasts predict which grid cells may experience an outbreak in the near future, the feature attributions for the autoregressive features indicate the underlying source of the outbreak. For example, if a base forecast for a cell k uses the high case numbers of a neighbouring cell j to predict an outbreak, the contribution to the reconciled forecast is still attributed to the case numbers of j. The explanation for the forecast therefore recovers the underlying reason related to epidemic spreading. To verify this presumption, the agreement between the true outbreak situation and the situation as characterised by the attributions is measured as follows: At each time step with an outbreak, the cells are divided into two clusters based on their true outbreak state (cluster “outbreak” and cluster “no-outbreak”). Then, the Silhouette coefficient (Rousseeuw 1987) for this clustering is computed, with the distance between two cells k and j defined by the distance between their feature attributions for the epidemic features, i.e. $d(k, j) = |\Phi_k - \Phi_j|$, where $\Phi_k$ is the sum of attributions of all epidemic features related to time series k. If furthermore $d(C_{this}, k)$ is the average distance between k and the other cells in its own cluster and $d(C_{other}, k)$ the average distance between k and the cells in the opposite cluster, then the Silhouette coefficient for cell k is:

$$\mathrm{Silhouette}(k) = \frac{d(C_{other}, k) - d(C_{this}, k)}{\max(d(C_{this}, k),\, d(C_{other}, k))}$$

The Silhouette score is scaled between −1 and 1, where a higher mean Silhouette score over all cells indicates dissimilarity between the two clusters. Hence, if the attributions allow differentiating well between cells with and without outbreak, the Silhouette score should be high. Figure 9 shows the distribution of Silhouette scores over five years of simulated out-of-sample forecasts (repeated 5 times with different draws of $\alpha_k$, $\gamma_k$ and $\delta_k$ each) for different probabilities of disease spreading $p_{j,k}$ between the cells. For comparison, the Silhouette score has also been calculated with the reconciled forecasts as distance measure instead of the attributions. As can be seen, the attributions provide a better separation of outbreak and no-outbreak cells, with the difference becoming more pronounced as $p_{j,k}$ grows. At the same time, the separation is not perfect, as Silhouette scores close to zero occur, indicating that the attributions of outbreak and no-outbreak cells are very similar.
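The Silhouette computation for one time step can be sketched as below, assuming one summed epidemic attribution $\Phi_k$ per cell and the true binary outbreak state as cluster label. The function name and the example values are hypothetical, and returning 0 for a cell without cluster mates or with zero distances is a convention chosen here, not something specified in the text.

```python
import numpy as np

def silhouette_scores(phi, outbreak):
    """Silhouette coefficient per cell with distance d(k, j) = |phi_k - phi_j|."""
    phi = np.asarray(phi, dtype=float)
    outbreak = np.asarray(outbreak, dtype=bool)
    dist = np.abs(phi[:, None] - phi[None, :])
    scores = np.zeros(len(phi))
    for k in range(len(phi)):
        same = outbreak == outbreak[k]
        same[k] = False                      # exclude the cell itself
        other = outbreak != outbreak[k]
        if not same.any() or not other.any():
            continue                         # convention: score 0 if a cluster is empty
        d_this = dist[k, same].mean()
        d_other = dist[k, other].mean()
        denom = max(d_this, d_other)
        scores[k] = (d_other - d_this) / denom if denom > 0 else 0.0
    return scores

# Hypothetical attributions for a 3x3 grid with an outbreak in cells 0 and 1.
phi = [5.0, 4.0, 0.5, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1]
outbreak = [1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = silhouette_scores(phi, outbreak)
print(scores.mean())
```

Replacing `phi` with the reconciled forecasts per cell gives the comparison measure used in Figure 9.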

Figure 9 Silhouette scores for different probabilities of disease spreading between cells, shown for both distance measures (forecasts and attributions).


Comparison of Projections and Models

The choice of projection for reconciliation also has a qualitative impact on the explanations obtained. First of all, assuming that the top-level model only uses a moderate number of features, the top-down strategy yields significantly less complex explanations than the other projections. A downside, however, is that apart from the proportion factors, all forecasts get identical feature attributions. What is missing is an explanation for the proportions used when distributing the top-level forecast to the lower levels; this can however not be provided through the approach designed here. On the other hand, while the explanations obtained when using a full projection matrix such as OLS or a variant of WLS are more detailed, they can be counterintuitive at lower levels, because the projection matrix SP usually has slightly negative coefficients for neighbouring time series at the same level. As a consequence, when the base forecast for a county has high epidemic attributions, they will enter the explanations for the neighbouring counties’ forecasts with negative sign and thus appear as negative epidemic attributions. At the higher levels, this negative contribution is compensated because the county base forecasts are also part of the reconciled state and nation forecasts, where the weights are positive, so that the overall epidemic attribution at the top level will still be clearly positive. Nevertheless, the explanations for the lower-level reconciled forecasts are still deceptive with respect to neighbouring regions. A remedy for this essential problem is to summarize the attributions of all neighbouring regions together with the attributions at the next higher level, leading to a reduced explanation as in Figure 10. Lastly, the bottom-up strategy leads to explanations which are complex as well, but avoids counterintuitive attributions at the lower levels, because all coefficients of SP are non-negative. However, the reconciled attributions for the bottom-level forecasts will be identical to the base attributions, because no influence from neighbouring regions or higher-level forecasts exists. The richness of the lower-level explanations therefore fully depends on the base forecast model and features.
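The sign behaviour of the projections can be reproduced on a toy hierarchy. The sketch below assumes a hypothetical two-level hierarchy with one top-level and three bottom-level series, the standard OLS projection $P = (S^\top S)^{-1} S^\top$ and the bottom-up projection, and a made-up base attribution matrix in which each column is one epidemic feature used by exactly one base model. It is an illustration of the principle, not the prototype's code.

```python
import numpy as np

# Summing matrix S: row 0 is the top-level series, rows 1-3 the bottom-level series.
S = np.array([[1, 1, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

P_ols = np.linalg.solve(S.T @ S, S.T)             # OLS projection (S'S)^-1 S'
P_bu = np.hstack([np.zeros((3, 1)), np.eye(3)])   # bottom-up: keep bottom forecasts only
SP_ols = S @ P_ols
SP_bu = S @ P_bu

# Hypothetical base attributions: row i = base forecast i, column j = epidemic
# feature of base model j (each feature is used by exactly one base model).
Phi_base = np.diag([4.0, 5.0, 0.5, 0.2])

# By additivity, the same projection is applied to the attributions.
Phi_ols = SP_ols @ Phi_base
Phi_bu = SP_bu @ Phi_base

print(np.round(SP_ols, 2))
print(np.round(Phi_ols, 2))
```

In this toy setting, the high attribution of region 1's feature enters the reconciled forecasts of regions 2 and 3 with negative sign under OLS, while its contribution to the top-level forecast stays positive; under bottom-up, the bottom-level attributions are unchanged.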

Figure 10 Exemplary reduced explanation for a state forecast, explaining the difference from the reference forecast (bottom) to the real forecast (top). Positive and negative contributions have been separated. “Nation” includes attributions from all other states and counties outside of the current state.

Regarding the forecast model type, the use of more complex machine learning models allows performing forecasts using features without prior modeling of their interactions, which could in theory also yield interesting insights into disease dynamics. Figure 11 shows an example of a potential usage, where the feature attributions of a random forest model have been grouped by feature type and then depicted using principal component analysis and hierarchical clustering. As can be seen, such post-processing may allow depicting different subtleties of the feature usage by the model. On the other hand, interaction effects between two features are equally distributed to both features by attribution methods like SHAP or Integrated Gradients, which can result in potentially misleading explanations. For example, a seasonal feature and a feature of the current case numbers may interact by indicating that the current case numbers are extraordinarily high given the current time of the year. The resulting increase in forecast would then be attributed equally to both features, increasing their contribution scores. However, an individual interpretation of the attribution for the seasonal feature could then lead to the wrong conclusion that the current time of the year indicates higher case numbers.
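The equal split of an interaction effect can be verified with exact Shapley values, the quantity that SHAP approximates. The sketch below uses a brute-force computation over feature orderings rather than the `shap` library, and the toy forecast function, feature encoding and baseline are hypothetical.

```python
from itertools import permutations

def shapley_values(f, x, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        z = list(baseline)
        prev = f(z)
        for i in order:
            z[i] = x[i]          # switch feature i from baseline to its actual value
            cur = f(z)
            phi[i] += cur - prev
            prev = cur
    return [v / len(orderings) for v in phi]

def toy_forecast(z):
    """Hypothetical model: 10 extra cases only if cases are high AND it is off-season."""
    off_season, cases_high = z
    return 10.0 * off_season * cases_high

phi = shapley_values(toy_forecast, x=[1, 1], baseline=[0, 0])
print(phi)
```

Although the increase only arises from the combination of both features, each of them receives half of it, so reading the seasonal attribution in isolation would suggest that the time of year by itself raises the forecast.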


Figure 11 Results of a principal component analysis of the feature attributions obtained with random forests for Norovirus forecasts. The first and second component are mere seasonality and current case numbers, the later components differentiate between percentages of age groups and information on patients and infection settings.


7 Discussion

In the following, the insights gained on the interpretable and hierarchical prediction of infectious diseases through the design, implementation and evaluation of the proposed approach are discussed, also considering the interplay between the different elements of the approach and reflections on some of the techniques used for evaluation of the design objectives.

7.1 Approach

While all methods used in the present approach already exist, the combination of hierarchical forecasting and feature attribution to obtain explanations for reconciled forecasts is novel. Through the development of a unified notation it could be shown that a theoretical harmonization of the concepts is indeed possible. The obtained notation is valuable in two regards. First, it allows specifying the approach in an abstract way, leaving the individual elements configurable while still describing their usage with precision. This separation of the concrete methods from the overall approach should serve the design objective of method flexibility. Moreover, the definition of hierarchical feature attributions as a new construct paved the way for an axiomatic, functionally-grounded evaluation of interpretability by specifying desirable properties which should be met by any attribution method used in order to ensure soundness of the explanations with regard to the model. The properties specified are transferred from the literature on univariate feature attribution to the hierarchical setting, with the exception of coherence, which is instead transferred from the literature on hierarchical forecasting to the realm of feature attribution. While sensible justifications for properties 1-5 exist also with respect to the application of infectious disease surveillance, property 6, additivity, had to be included in hindsight for methodical reasons. Only through the assumption of additivity can the projection for reconciliation of the forecasts also be applied to the feature attributions. It is clear that otherwise, precise computation of feature attributions for the reconciled forecasts would be computationally infeasible. Therefore, this last property is both a cornerstone and a weak point of the whole approach developed.

As has been demonstrated through a prototype, the approach can be implemented such that the disease data and forecasting models can be chosen flexibly. Because only one attribution method, SHAP, was used, the flexibility with respect to attribution was not demonstrated. The interface should nevertheless allow adding further attribution methods. During implementation, it became clear that the high complexity of hierarchical forecasting requires special consideration of computational efficiency, here achieved through sparse matrix multiplication or parallelization. Still, forecasting and explanation for all states and counties of Germany required considerable runtime with the present prototype. While this seems acceptable for weekly forecasts in practice, the high resource requirements could be a barrier to forecasting at higher frequencies or to detailed cross-validation. Moreover, users must be supported in the specification of models and features through a suitable interface, because it is not feasible to specify each model of the hierarchy individually.

7.2 Infectious Disease Forecasting via Statistical Models

Given a setting of epidemic diseases with much historical data, the choice of statistical models over mechanistic models seems reasonable with regard to flexibility and manual modeling effort. The review of statistical models and features in section 5.2 revealed the variety of possible choices available. Examples of integration of data sources outside of reporting such as weather information or online activity data into statistical models were found, although they have not been tested here.

For surveillance practice, a preselection of the methods available and guidelines for the choice of features would be needed. However, it is quite conceivable that beyond the basic features introduced in section 5.2, such as seasonality and current case numbers, choices for more sophisticated features are use-case- and disease-dependent. While a detailed review of this aspect was outside the scope of this thesis, it indicates that the customization effort required for specialised statistical models must not be underestimated. Furthermore, it is interesting to note that for complex generalized additive models as in (Stojanovic et al. 2019), the line between mechanistic and statistical modeling becomes less clear-cut, since the model components also embody assumptions about the biological dynamics of disease propagation.

As observed during the experiments on forecast accuracy, out-of-the-box machine learning models such as random forests do not necessarily provide a performance advantage over simple linear models. This means that in practice, modeling effort is also needed for machine learning models, beyond motivated selection or tuning of hyperparameters. Instead, hybrid architectures of a classical time series model for detrending and deseasonalization and a machine learning model for the remaining residuals may be required to achieve performance gains (Smyl 2020; Makridakis et al. 2018). The assumed flexibility of machine learning for integrating new data sources is further questioned regarding interpretability, as addressed in section 7.4.

Lastly, statistical forecasting methods are tied to the regression target used. Therefore, if weekly reported cases are used as target, the forecasts will also predict the reported case numbers and not the new infections. On the one hand, this will lead to an underestimation of the true case numbers due to underreporting. On the other hand, dynamics of the reporting system are included in the forecast, most prominently reporting delays. For example, a sharp drop in reported cases can be observed during Christmas holidays, which is most likely an artefact of reporting. But aside from such special occasions, an adjustment for the reporting delays through “nowcasting” of the current cases of infection or disease onsets would allow a more timely forecast with explanations that are closer to the underlying disease dynamics, because the variability of reporting is excluded (Lawless 1994; Höhle and An Der Heiden 2014). However, if nowcasts of the current case numbers are to be used as target and input features for further forecasting into the future, it seems advisable to also ensure coherence of the nowcasts beforehand.

7.3 Reconciliation via Projection

The choice of reconciliation via projection as a method for hierarchical forecasting is attractive because it embraces the classical single-level strategies but also more sophisticated projections as introduced in section 5.3. Nevertheless, the theoretical guarantees of optimality seem to provide limited value in practice since the underlying assumptions can be easily violated or a stable estimation of the quantities required (full sample covariance matrix) is difficult. As the results of the cross-validation of forecast accuracy indicate, no clearly superior reconciliation strategy can be identified and there may be interdependencies between the base forecast models used and the performance of the projection matrices. While the latter aspect deserves further theoretical analysis, cross-validation currently seems indispensable if one wants to select an optimal projection for a specific forecast model. Regarding the eventual choice of one projection, further subtleties are to be considered. Because relative performance varies across the levels, the relevance of accuracy at the county, state or national level must be weighted. Moreover, since the present results indicate a trade-off between average performance and variability of the projections, an evaluation of the risk of strong negative outliers should be conducted. If occasional flawed predictions are acceptable, the projection with optimal average performance is favourable, otherwise the strategy with minimal maximum error may be considered.

Two major disadvantages of reconciliation via projection only became apparent during the development and evaluation of the approach. First, the fact that reconciled forecasts can be negative even if the base forecasts are not is a troubling aspect which could be a clear barrier to acceptance in epidemiological decision support. A workaround using shrinkage towards the bottom-up projection was developed in this thesis, but did not appear very robust regarding accuracy. Further analysis of the properties of such shrunken projections would have been interesting but was beyond the present scope. Aside from this solution, a remedy may be to only choose single-level projections or projections with only slightly negative values as shown in section 6.2. The further effect of negative coefficients in the projection matrix regarding interpretability is discussed in section 7.4.
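How non-negative base forecasts can yield negative reconciled forecasts is shown in the following toy example. The hierarchy, the base forecast values and the weight `lam` are hypothetical. The convex combination with the bottom-up projection is only one conceivable way to shrink towards bottom-up and is not claimed to be the procedure developed in this thesis.

```python
import numpy as np

# Hypothetical hierarchy: one top-level series (row 0) and three bottom-level series.
S = np.array([[1, 1, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
SP_ols = S @ np.linalg.solve(S.T @ S, S.T)                 # OLS projection
SP_bu = S @ np.hstack([np.zeros((3, 1)), np.eye(3)])       # bottom-up projection

# Non-negative but incoherent base forecasts: top = 4, bottom = 10, 1, 1.
base = np.array([4.0, 10.0, 1.0, 1.0])

rec_ols = SP_ols @ base        # contains negative values for regions 2 and 3
rec_bu = SP_bu @ base          # non-negative by construction

# Assumed shrinkage: convex combination of both projections with weight lam.
lam = 0.5
rec_shrunk = ((1 - lam) * SP_ols + lam * SP_bu) @ base

print(rec_ols, rec_bu, rec_shrunk)
```

Since both projections produce coherent forecasts, any convex combination of them is coherent as well; whether it is non-negative depends on the weight and on the base forecasts.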


Given the above considerations, a pertinent question is whether the increased complexity and disadvantages of reconciliation via projection are worthwhile if only limited performance gains can be achieved compared to the single-level strategies and whether one should use a simple bottom-up or top-down approach by default. However, this question cannot be finally answered in this work. Not only was the sample of diseases and forecasting methods used here too small to allow such a general conclusion, but the evaluation procedure should also be more differentiated. The cross-validation conducted in section 6.3 compared the performance of the different projections over the full time horizon, while it may be important to evaluate accuracy separately during epidemic and non-epidemic phases. For example, the observation that the top-down strategy showed good average performance but also had large outliers is reasonable, assuming that highly aggregated case numbers are a suitable proxy for the normal seasonal pattern but an inaccurate measure of the lower-level variation during local outbreaks. In order to uncover these details, the validation data would need to be separated into verified outbreak and no-outbreak phases and the accuracy measured on these subsets.

Besides these frictions, the evaluation of the incoherence of base forecasts in section 6.2 strongly underpins the fundamental motivation of hierarchical forecasting that considerable contradictions can occur when all levels are forecasted individually, with differences of several hundred cases between state and county level. It therefore seems generally advisable to use reconciled forecasts instead of individual base forecasts.

7.4 Interpretation via Feature Attribution

As could be shown in section 6.4, the application of the projection SP to the feature attributions preserves the desirable properties when using a method like SHAP. Therefore, it is possible to obtain feature attributions for reconciled forecasts which are still theoretically sound. Here, the interesting insight was gained that some of the properties do not follow directly from the univariate case, but that additional assumptions were required to prove their fulfilment. Despite the seemingly flexible formulation of feature attribution, the choice of available methods appears very small, as only SHAP and Integrated Gradients are currently known to satisfy all desirable properties. This could motivate a reconsideration of the necessity of each property. However, while the additivity property is indispensable for the present approach, the other properties all seem very desirable for infectious disease forecasting, as has been argued in section 5.1. On the other hand, the present experiments have shown that feature attribution is computationally demanding in the hierarchical setting, even with model-specific methods. Although not tested, it seems unlikely in practice to obtain accurate attributions via the model-agnostic methods introduced in section 5.4. This introduces a strong interdependency between forecasting model and attribution method and currently limits the compatible forecasting methods to (generalized) linear models, tree-based learners and neural networks. While these classes still allow for a variety of models, method flexibility is certainly reduced.

Aside from the aspect of theoretical soundness, an evaluation of interpretability with regard to the informativeness of the explanations has proven to be difficult. Here, the most important barrier is a lack of ground truth explanations for epidemiological situations. Because epidemiological reports on historical outbreaks mostly feature descriptive epidemiology, an interpretation of the situation with regard to the future spreading of the disease is missing. What would instead be required are in fact “feature attributions” of the real-world outbreak development, i.e. analyses of the explanatory value of different contributing factors to the future development of case numbers. Here, a main challenge would be that explanations based on constructs of time series analysis, such as trend or seasonality, are not necessarily causal in the sense of individual disease transmission. The simulation experiment conducted here is too simplistic to be conclusive regarding the ability of feature attributions to characterise the spatial outbreak situation, but at least served as an illustration of the general principle. This evaluation strategy could be extended through more detailed and realistic simulation models. Here, mechanistic modeling may also provide a good testing ground, because detailed compartmental models or agent-based simulations could also provide ground truth explanations for more detailed features such as demography or information about the infection settings. A remaining risk however is to reproduce the simulation assumptions in the forecasting models and thus bias the explanations towards the simulation.

Nevertheless, the present experiments already uncovered several issues of feature attributions which must be addressed in order to obtain informative explanations. First of all, a post-processing of attributions is required in order to reduce the number of elements to an intelligible size. As argued in section 6.4, grouping and summing of attributions preserves the desirable properties, so the reduced explanations are still truthful to the forecasting model. However, this reduction also has the risk of concealing certain subtleties within the summarized groups. For example, a separate statement of positive and negative attributions is required to distinguish between a group of strong but opposing contributions and a group of simply weak contributions. Apart from this, the observation that negative coefficients in the projection matrix can lead to counterintuitive explanations at the lower levels is a further downside of reconciliation via projection. Unfortunately, attributions truthful to the reconciled forecasts will always inherit the characteristics of the projection. The only options are thus to either renounce some of the desirable properties and compute less truthful but intuitive explanations or to resolve the issue through aggregation to the higher level, with the disadvantage of less detail.


The design research in this thesis also provides insights into the relationship between local explanations for post-hoc interpretability and other notions of interpretability, such as transparency. First, because explanation methods such as feature attribution combine both the model and the data, they are conceptually different from explanations only of the model, as demanded for transparency (Molnar 2019). Therefore, even in case of a simple linear model, which is usually denoted as inherently transparent, feature attributions can provide additional interpretability by explaining individual predictions. Second, it has been argued in section 5.2 that explicit linear models have the advantage that they can be clearly broken down into epidemiologically meaningful components, which can then be defined as “features” for the present approach. At the same time, it has been observed that explanations for forecasts from machine learning models such as random forests can be deceptive if the interactions between the features are not sufficiently captured. A solution could be to compute attributions also for pairs, triplets etc. of features (Lundberg et al. 2019), but this comes with considerable additional complexity which may be prohibitive in the hierarchical setting. A generally increased risk of forecasting via machine learning is that even in the case of good performance, the models may not be structurally valid, i.e. they may capture statistical relationships which are not adequate representations of the underlying epidemiological situation (Troitzsch 2009). Therefore, it seems that even when post-hoc interpretability is defined as the design goal, transparency of the models can substantially improve local explanations because it allows selecting meaningful components of the models as explanation elements.

If a functionally-grounded testing of explanations produces promising results, the evaluation can be extended to human-grounded and application-grounded methods. Here, a first test could be to use the feature attributions as a similarity measure between different points in time to identify past epidemiological situations which are most similar with respect to the forecast and let experts judge whether the situations are really similar based on their domain knowledge. Eventually, an integration of the prototype into a surveillance tool like Signale and experiments with users would be needed to test whether the combination of forecast and explanation improves the decision making of epidemiologists in surveillance.


8 Conclusion

The central research question of this work was how infectious disease forecasting, hierarchical forecasting and model interpretation can be combined to obtain forecasts for infectious disease surveillance at different geographical aggregation levels, as well as epidemiologically meaningful explanations for the forecasts. Using a design science research methodology, an approach to produce the desired outputs has been developed and tested in several steps.

First, a literature review of the special requirements in infectious disease surveillance has been conducted, leading to the definition of four central design objectives for this work, namely flexibility with respect to disease, information richness and methods, hierarchical coherence, accuracy and post-hoc interpretability. These objectives were used as guidance in the subsequent design and development step. First, broad reviews of existing methods for infectious disease forecasting, hierarchical forecasting and model interpretation were conducted, leading to the justified design choice of statistical forecasting methods, forecast reconciliation via projection and feature attribution as constituting elements of the approach. Afterwards, these elements were conceptually integrated by developing a unified notation in which the individual methods are represented in a generic way to preserve method flexibility of the approach. Moreover, six desirable formal properties have been motivated and defined which should be fulfilled by the attributions, namely coherence, completeness, strong monotonicity, symmetry, dummy null effect and additivity, whereby the latter property has been identified as a methodically necessary restriction with potential downsides. Given the designed notation, it was then proposed how the elements can be efficiently combined to produce coherent hierarchical forecasts and corresponding explanations through feature attribution. Moreover, based on a more detailed literature review of the chosen elements, the range of suitable choices for the individual methods was defined. Finally, an instantiation of the approach in the form of a prototype implementation was created and crucial design aspects on an implementation level with regard to the intended application were identified.

Finally, the developed approach was evaluated with respect to the previously defined design objectives and the main insights from development and evaluation were discussed. It was found that the design generally allows an application to many different diseases, with varying detail of information available and different methods for forecasting, reconciliation and feature attribution. At the same time, interdependencies with the interpretability objective which limit this flexibility were identified: the existing efficient feature attribution methods currently support only certain types of forecasting models. Furthermore, it was observed that the choice of features and model has an impact on the quality of explanations with regard to epidemiological informativeness.


It could be shown that considerable incoherence arises when all levels are forecast individually and that the proposed approach successfully resolves this issue through forecast reconciliation. However, the problem of potentially negative reconciled forecasts was detected and different remedies were discussed. An evaluation of the accuracy of the reconciled forecasts showed that coherent forecasting is possible without a clear loss in accuracy, but the performance of the different reconciliation methods available has proven to be very heterogeneous, so a case-based cross-validation of the forecasts seems required in practice. Moreover, the simple single-level strategies for forecast reconciliation have not been clearly outperformed by the more sophisticated strategies. However, to conclude whether the complexity of the latter strategies is worthwhile or not, it was argued that a more differentiated comparison of the strategies’ relative performances during epidemic and non-epidemic phases would need to be conducted.
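The effect of reconciliation via projection can be illustrated with a minimal sketch. The toy hierarchy (one nation consisting of two states), the base forecast values and the function names are assumptions made for illustration only; the OLS projection is used here as one simple instance of a projection and is not meant to represent the prototype implementation.

```python
import numpy as np

# Toy hierarchy (assumed): rows are nation, state A, state B;
# columns are the two bottom-level series (the states).
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

def ols_projection(S):
    """Projection matrix S (S'S)^-1 S' onto the space of coherent forecasts."""
    G = np.linalg.solve(S.T @ S, S.T)  # maps base forecasts to bottom-level forecasts
    return S @ G

def incoherence(y):
    """Difference between the top-level value and the sum of the lower-level values."""
    return y[0] - y[1:].sum()

P = ols_projection(S)

# Individually produced base forecasts: 100 at the top, but 45 + 50 = 95 below.
base = np.array([100.0, 45.0, 50.0])
rec = P @ base  # reconciled forecasts, coherent by construction

# Non-negative base forecasts can still yield a negative reconciled forecast.
base_low = np.array([10.0, 1.0, 20.0])
rec_low = P @ base_low  # the reconciled forecast for state A is negative

print(incoherence(base), incoherence(rec))
print(rec_low)
```

The second example shows why negative reconciled forecasts can occur: the projection is a purely linear operation and does not know that case counts cannot be negative.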

Regarding interpretability, it was shown by proof that the feature attributions obtained through the developed approach can fulfil the desirable properties introduced earlier and thus be theoretically sound with respect to the forecasting model. For an evaluation of the informativeness of the explanations obtained, a lack of ground truth information was found. It was discussed what form of ground truth explanations would be required, and a simplistic time series simulation of an infectious disease was used instead to test the informativeness of the explanations in a controlled setting. While the results are far from conclusive, it seems that feature attributions have the potential to provide additional insights into the epidemiological situation, for example by highlighting the sources of an outbreak. Further details of interpretability, such as the reduction of explanation complexity through grouping of features, potentially counterintuitive explanations for lower-level forecasts and the pitfalls regarding more complex machine learning techniques, were discussed.
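The coherence and completeness properties can be checked numerically in a small sketch. It assumes, for illustration only, that attributions for the reconciled forecasts are obtained by applying the same linear projection to the base attributions; the attribution values, the reference expectations and all variable names are hypothetical.

```python
import numpy as np

# Toy hierarchy (assumed): nation = state A + state B.
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
P = S @ np.linalg.solve(S.T @ S, S.T)  # OLS projection as an example projection

# Hypothetical base attributions: one row per series, one column per feature.
phi_base = np.array([[12.0, -3.0, 1.0],
                     [ 4.0, -1.0, 0.5],
                     [ 6.0, -2.5, 0.0]])
# Hypothetical expected forecasts over the reference (the attribution baseline).
expected_base = np.array([90.0, 40.0, 45.0])
# Completeness of the base attributions: forecast = baseline + sum of attributions.
forecast_base = expected_base + phi_base.sum(axis=1)

# Because the projection is linear, it can be applied to each part separately.
phi_rec = P @ phi_base
expected_rec = P @ expected_base
forecast_rec = P @ forecast_base

# Completeness carries over to the reconciled forecasts.
completeness_gap = forecast_rec - (expected_rec + phi_rec.sum(axis=1))
# Coherence: national attributions equal the sum of the state attributions.
coherence_gap = phi_rec[0] - phi_rec[1:].sum(axis=0)
print(completeness_gap, coherence_gap)
```

Both gaps vanish up to floating point error, which reflects the role of additivity: a linear projection distributes over the sum of baseline and attributions.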

Lastly, several interesting research questions regarding interpretable hierarchical forecasting of infectious diseases could not be covered in this thesis and are stated here for future research. Regarding forecast reconciliation via projection, further theoretical and empirical analysis could investigate which projections perform best given a specific model and disease. In this work, experiments were only conducted with individual forecasting models for each time series of the hierarchy, but it seems promising to also fit models on several time series (e.g. one model per level) and examine the effects on coherence and accuracy. It also remains unclear how negative reconciled forecasts could be avoided while still using a closed-form projection which allows for model interpretation. Moreover, the empirical risk minimization strategy, another instance of reconciliation via projection, was introduced but not implemented here and it would be valuable to compare its performance with the other strategies. As discussed in section 4.2, direct forecasting using neural networks has been proposed as an alternative method for hierarchical forecasting. While it was not selected due to its requirement of forecasting all time series in one model, it is still an appealing option, because hierarchical feature attributions could be obtained from this method as well using SHAP or Integrated Gradients. In this work, hierarchical forecasting has been framed as the task of forecasting a tree-like hierarchy of time series and its application was focused on geographical hierarchies. Nevertheless, the approach could be modified to apply to other hierarchies, for example temporal hierarchies, where the levels are given by daily, weekly and monthly forecasts (Spiliotis et al. 2019), as well as extended to grouped time series, where disaggregation is performed by several dimensions in interchangeable order (for example disaggregation of the case numbers by gender and age group) (Shang and Hyndman 2017). In fact, the notation developed in this work would already embrace this latter application, because grouped time series can be represented through a summing matrix as well (Wickramasuriya et al. 2019).
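How a grouped structure fits into a summing matrix can be sketched as follows. The example assumes two genders and two age groups with made-up case counts; the ordering of the series and the variable names are choices made for this illustration.

```python
import numpy as np

n_gender, n_age = 2, 2  # assumed sizes of the two grouping dimensions

# Bottom-level series are ordered as
# (female, young), (female, old), (male, young), (male, old).
total = np.ones((1, n_gender * n_age))
by_gender = np.kron(np.eye(n_gender), np.ones((1, n_age)))  # sums over age groups
by_age = np.kron(np.ones((1, n_gender)), np.eye(n_age))     # sums over genders
bottom = np.eye(n_gender * n_age)

# Both disaggregation orders share the same bottom level, so one summing
# matrix stacks all aggregation constraints.
S = np.vstack([total, by_gender, by_age, bottom])

cases = np.array([3.0, 5.0, 2.0, 7.0])  # hypothetical bottom-level case counts
all_series = S @ cases
print(S.shape)
print(all_series)
```

Since only the summing matrix changes, any projection that is defined in terms of the summing matrix can be applied to grouped series in the same way as to a tree-like hierarchy.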

A further avenue of research not covered in this thesis is the extension from point forecasts to probabilistic forecasts, which are especially important in infectious disease forecasting due to the high uncertainty associated with disease spreading (Lauer et al. 2020). The developed approach provides a starting point for research in this direction, because probabilistic forecasts for hierarchical time series are usually obtained by drawing sets of samples from the predictive distributions of the base forecasts and applying the projection for reconciliation to each set of samples individually in order to obtain reconciled samples (Gibson et al. 2019; Taieb et al. 2017). Such an extension of the approach would, however, require a review of suitable base forecast models as well as a reconsideration of the explanation method, because the existing methods for feature attribution are currently limited to the explanation of point predictions.
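The sample-based procedure described above can be sketched in a few lines. The normal predictive distributions, their parameters and the OLS projection are assumptions chosen for illustration; in practice, the distributions would come from the base forecast models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hierarchy (assumed): nation = state A + state B.
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
P = S @ np.linalg.solve(S.T @ S, S.T)  # OLS projection as an example projection

# Hypothetical predictive distributions of the base forecasts (independent normals).
means = np.array([100.0, 45.0, 50.0])
sds = np.array([10.0, 6.0, 7.0])
n_samples = 1000

# Each row is one set of samples, containing one draw per series.
samples = rng.normal(loc=means, scale=sds, size=(n_samples, 3))

# Apply the projection to each set of samples individually.
reconciled = samples @ P.T

# Every reconciled set of samples is coherent.
gaps = reconciled[:, 0] - reconciled[:, 1:].sum(axis=1)

# Empirical 5%, 50% and 95% quantiles per series as a simple summary.
quantiles = np.quantile(reconciled, [0.05, 0.5, 0.95], axis=0)
print(np.abs(gaps).max())
print(quantiles)
```

Note that coherence holds for each set of samples, while summary statistics such as quantiles of the individual series do not have to add up across levels.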

Regarding feature attribution, aside from an in-depth evaluation of the informativeness of the general approach as well as the impacts of different forecasting models and features, three further research topics seem promising for surveillance practice. Firstly, the choice of the reference R for feature attribution has here been limited to the training data set used during model fitting, but other potential choices for R, such as only non-epidemic points in time, could provide explanations which are better aligned with specific use cases such as outbreak detection. Secondly, the post-processing of feature attributions into less complex explanations should be further investigated with respect to the tasks of epidemiologists, making it possible to emphasize the most important aspects for decision support in surveillance. Lastly, an interesting extension of the present approach would be to compute feature attributions not for the forecasts but for the forecast errors (Lundberg et al. 2019), to gain insights into the contribution of features to accuracy and identify deficiencies of the base forecast models over time.
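The influence of the reference R can be made concrete for the simplest case of a linear forecasting model, where the attribution of a feature is its coefficient times the deviation of the feature value from its mean over R. The following sketch rests on several assumptions that are not taken from the thesis: the simulated features, the coefficients and the median-based rule for labelling non-epidemic points in time are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training features, e.g. lagged case counts of two regions.
X_train = rng.poisson(lam=20, size=(200, 2)).astype(float)

# Hypothetical fitted linear forecasting model.
coef = np.array([0.6, 0.3])
intercept = 2.0

def predict(X):
    return intercept + X @ coef

def linear_attributions(x, reference):
    """Attribution per feature for a linear model: coef_j * (x_j - mean of x_j over R)."""
    return coef * (x - reference.mean(axis=0))

# Hypothetical rule: points in time with a below-median case load count as non-epidemic.
load = X_train.sum(axis=1)
non_epidemic = X_train[load <= np.median(load)]

x_new = np.array([45.0, 30.0])  # feature values during a hypothetical outbreak

phi_all = linear_attributions(x_new, X_train)       # R = full training data
phi_non = linear_attributions(x_new, non_epidemic)  # R = non-epidemic points only

# Attributions always sum to the forecast minus the mean forecast over R.
gap_all = predict(x_new) - predict(X_train).mean() - phi_all.sum()
gap_non = predict(x_new) - predict(non_epidemic).mean() - phi_non.sum()
print(phi_all, phi_non)
```

With a non-epidemic reference, the attributions explain the forecast relative to a quiet baseline situation, which may be closer to the question asked in outbreak detection.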


Appendix

A Literature Search Details

The table below gives an overview of the three literature searches which have been conducted. For infectious disease forecasting and hierarchical forecasting, databases of peer-reviewed journals were used. Because model interpretability is a very recent field, the DBLP Computer Science catalog was used instead, which also provides access to preprint papers. As can be seen, however, the large set of publications for model interpretation only yielded a few relevant papers in the end. For this topic, but for the other two as well, considerable author forward and backward search and reference backward search was performed to identify the relevant literature. The detailed results of the database searches, together with a documentation of the decisions from the title and abstract scan, are provided in the supplementary material.

| | Infectious Disease Forecasting | Hierarchical Forecasting | Model Interpretation |
|---|---|---|---|
| Databases | ACM Digital Library, EBSCO Host, Scopus | ACM Digital Library, EBSCO Host, Scopus, Web of Science | DBLP Computer Science Catalog |
| Search Terms | "infectious disease" AND forecast AND (review OR survey) | (forecast "hierarchical time series") OR ("forecast reconciliation") | (explain OR interpret) AND predictions |
| Inclusion Criteria | Review of methods or applications in surveillance. | Either about methods or applications to public health. | On explanation methods, not models. |
| Search Results | 96 | 56 | 72 |
| After Title and Abstract Scan | 21 | 18 | 21 |
| After Full Text Scan | 14 | 11 | 6 |
| Additional from Backward and Forward Search | 36 | 13 | 26 |
| Used in Total | 50 | 24 | 32 |

Figure 12 Literature search process following Vom Brocke et al. (2009)


B Cross-validation of Accuracy

The following table gives an overview of the cross-validation results for Salmonellosis, using the exact same procedure and models as for Norovirus. The relative change in error for the MinTsample projection is extreme and is here given in thousands (suffix T). The full data of the cross-validation for both diseases, including predictions and error measures, is provided in the supplementary material.


| Level | Projection | | | | | | |
|---|---|---|---|---|---|---|---|
| nation | BU | +5.18 | +5.18 | +5.18 | +10.42 | +11.09 | +12.82 |
| | TD | +0.00 | +0.00 | +0.00 | -0.00 | +0.00 | +0.00 |
| | OLS | -0.13 | -0.13 | -0.13 | -1.15 | -1.12 | -0.90 |
| | WLSS | -1.19 | -1.19 | -1.19 | -5.60 | -4.54 | -3.54 |
| | WLSV | +0.05 | +0.05 | +0.05 | +0.28 | +1.27 | +2.77 |
| | MinTsample | >+60T | >+80T | >+90T | >+440T | >+120T | >+110T |
| | MinTshrink | +2.44 | +2.30 | +16.89 | +0.89 | -0.21 | +14.39 |
| | OLS+ | +94.71 | +94.71 | +94.71 | +14.56 | +14.07 | +15.99 |
| | WLSS+ | +160.78 | +160.78 | +160.78 | +12.98 | +13.96 | +14.78 |
| | WLSV+ | +3,360.77 | +3,360.77 | +3,360.77 | +11.49 | +12.31 | +14.23 |
| | MinTshrink+ | +29.28 | +21.99 | +52.97 | +9.20 | +10.38 | +12.35 |
| | Base RMAE | 0.99 | 0.99 | 0.99 | 1.26 | 1.24 | 1.22 |
| state | BU | +5.05 | +5.05 | +5.05 | +0.81 | +1.26 | +0.60 |
| | TD | -0.02 | -0.02 | -0.02 | -2.60 | -2.40 | -3.44 |
| | OLS | +0.10 | +0.10 | +0.10 | +4.79 | +4.60 | +4.19 |
| | WLSS | +0.89 | +0.89 | +0.89 | -3.65 | -3.25 | -3.81 |
| | WLSV | +2.01 | +2.01 | +2.01 | -2.21 | -1.90 | -2.26 |
| | MinTsample | >+70T | >+80T | >+130T | >+330T | >+90T | >+140T |
| | MinTshrink | +16.35 | +15.68 | +47.92 | +0.30 | -0.02 | +29.50 |
| | OLS+ | +53.48 | +53.48 | +53.48 | +2.59 | +2.70 | +2.13 |
| | WLSS+ | +73.16 | +73.16 | +73.16 | +1.20 | +1.74 | +1.06 |
| | WLSV+ | +1,603.79 | +1,603.79 | +1,603.79 | +0.77 | +1.32 | +0.83 |
| | MinTshrink+ | +39.32 | +23.01 | +55.05 | +0.44 | +1.02 | +0.44 |
| | Base RMAE | 1.00 | 1.00 | 1.00 | 1.11 | 1.10 | 1.11 |
| county | BU | +0.00 | +0.00 | +0.00 | +0.00 | +0.00 | +0.00 |
| | TD | +0.52 | +0.52 | +0.52 | -3.73 | -3.65 | -3.75 |
| | OLS | -0.15 | -0.15 | -0.15 | +1.05 | +1.03 | +0.96 |
| | WLSS | -0.82 | -0.82 | -0.82 | -0.64 | -0.65 | -0.67 |
| | WLSV | -0.53 | -0.53 | -0.53 | -0.41 | -0.42 | -0.41 |
| | MinTsample | +109T | +131T | +186T | +393 | +138T | +207T |
| | MinTshrink | +34.49 | +31.99 | +72.01 | +4.16 | +4.30 | +79.04 |
| | OLS+ | +20.77 | +20.77 | +20.77 | +0.68 | +0.52 | +0.58 |
| | WLSS+ | +42.60 | +42.60 | +42.60 | +0.49 | +0.47 | +0.46 |
| | WLSV+ | +455.69 | +455.69 | +455.69 | +0.11 | +0.11 | +0.13 |
| | MinTshrink+ | +58.21 | +33.25 | +69.11 | -0.07 | -0.04 | -0.02 |
| | Base RMAE | 1.04 | 1.04 | 1.04 | 1.08 | 1.08 | 1.07 |


References

Ak, C., Ergönül, Ö., Sencan, I., Torunoglu, M. A., and Gönen, M. 2018. “Spatiotemporal prediction of infectious diseases using structured Gaussian processes with application to Crimean–Congo hemorrhagic fever,” PLoS Neglected Tropical Diseases (12:8), pp. 1–20.

Allard, R. 1998. “Use of time-series analysis in infectious disease surveillance,” Bulletin of the World Health Organization (76:4), pp. 327–333.

Amato-Gauci, A., and Ammon, A. 2008. “The surveillance of communicable diseases in the European Union - a long-term strategy,” Eurosurveillance (13:4-6), pp. 1–3.

Arık, S. Ö., Li, C.-L., Yoon, J., Sinha, R., Epshteyn, A., Le, L. T., Menon, V., Singh, S., Zhang, L., Yoder, N., Nikoltchev, M., Sonthalia, Y., Nakhost, H., Kanal, E., and Pfister, T. 2020. “Interpretable Sequence Learning for COVID-19 Forecasting.”

Athanasopoulos, G., Gamakumara, P., Panagiotelis, A., Hyndman, R. J., and Affan, M. 2020. “Hierarchical Forecasting,” Advanced Studies in Theoretical and Applied Econometrics (52:February), pp. 689–719.

Athanasopoulos, G., Hyndman, R. J., Kourentzes, N., and Petropoulos, F. 2017. “Forecasting with Temporal Hierarchies,” European Journal of Operational Research (262:1), pp. 60–74.

Banfield, R. E., Hall, L. O., Bowyer, K. W., and Kegelmeyer, W. P. 2007. “A comparison of decision tree ensemble creation techniques,” IEEE Transactions on Pattern Analysis and Machine Intelligence (29:1), pp. 173–180.

Bansal, S., Chowell, G., Simonsen, L., Vespignani, A., and Viboud, C. 2016. “Big data for infectious disease surveillance and modeling,” Journal of Infectious Diseases (214:Suppl 4), pp. S375–S379.

Barocas, S., Selbst, A. D., and Raghavan, M. 2020. “The Hidden Assumptions Behind Counterfactual Explanations and Principal Reasons,” in Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, New York, NY, USA, pp. 80–89.

Ben Taieb, S., and Koo, B. 2019. “Regularized regression for hierarchical forecasting without unbiasedness conditions,” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp. 1337–1347.

Berrada, M., and Adadi, A. 2018. “Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI),” IEEE Access (6), pp. 52,138–52,160.


Biggerstaff, M., Alper, D., Dredze, M., Fox, S., Fung, I. C.-h., and Hickmann, K. S. 2016. “Results from the centers for disease control and prevention’s predict the 2013–2014 Influenza Season Challenge,” BMC Infectious Diseases (16:357), pp. 1–10.

Biggerstaff, M., Johansson, M., Alper, D., Brooks, L. C., Chakraborty, P., Farrow, D. C., Hyun, S., Kandula, S., Mcgowan, C., Ramakrishnan, N., Rosenfeld, R., Tibshirani, R., Tibshirani, R. J., Vespignani, A., Yang, W., Zhang, Q., and Reed, C. 2018. “Results from the second year of a collaborative effort to forecast influenza seasons in the United States,” Epidemics (24:February), pp. 26–33.

Biran, O., and Cotton, C. 2017. “Explanation and Justification in Machine Learning: A Survey,” IJCAI Workshop on Explainable AI (XAI) (2017:August), pp. 8–14.

Bulos, D., and Forsman, S. 2006. “Getting started with ADAPT: OLAP Database Design,” Symmetry Corporation White Paper pp. 1–19.

Carias, C., O’Hagan, J. J., Gambhir, M., Kahn, E. B., Swerdlow, D. L., and Meltzer, M. I. 2019. “Forecasting the 2014 West African Ebola Outbreak,” Epidemiologic reviews (41:1), pp. 34–50.

Caruana, R., Kangarloo, H., David, J., Dionisio, N., Sinha, U., and Ms, D. J. 1999. “Case-Based Explanation of Non-Case-Based Learning Methods,” in Proceedings of the AMIA Symposium, pp. 212–215.

Centers for Disease Control and Prevention 2006. Principles of Epidemiology in Public Health Practice.

Centers for Disease Control and Prevention 2016. “Staying Ahead of the Curve: Modeling and Public Health Decision Making Modeling to Support Outbreak Preparedness, Surveillance and Response,” in CDC Public Health Grand Rounds, pp. 1–63.

Chae, S., Kwon, S., and Lee, D. 2018. “Predicting infectious disease using deep learning and big data,” International Journal of Environmental Research and Public Health (15:8).

Chen, H., Lundberg, S., and Lee, S.-I. 2019. “Explaining Models by Propagating Shapley Values of Local Components,” Arxiv Preprint pp. 1–6.

Chretien, J. P., George, D., Shaman, J., Chitale, R. A., and McKenzie, F. E. 2014. “Influenza forecasting in human populations: A scoping review,” PLoS ONE (9:4).

Christoph Molnar 2019. Interpretable machine learning.

Claus, H., Kirchner, G., Ullrich, A., and Ghozzi, S. 2017. “DEMIS – Ressortforschungsantrag – Signale 2.0,” DEMIS Ressortforschungsantrag pp. 1–28.


Cooper, H. M. 1988. “Organizing Knowledge Syntheses: A Taxonomy of Literature Reviews,” Knowledge in society (1:1), pp. 104–125.

Corberán-Vallet, A., and Lawson, A. B. 2014. “Prospective analysis of infectious disease surveillance data using syndromic information,” Statistical Methods in Medical Research (23:6), pp. 572–590.

Cori, A., Ferguson, N. M., Fraser, C., and Cauchemez, S. 2013. “A New Framework and Software to Estimate Time-Varying Reproduction Numbers During Epidemics,” American Journal of Epidemiology (178:9), pp. 1505–1512.

Dembek, Z. F., Chekol, T., and Wu, A. 2018. “Best practice assessment of disease modelling for infectious disease outbreaks,” Epidemiology and Infection (146:10), pp. 1207–1215.

Desai, A. N., Kraemer, M. U., Bhatia, S., Cori, A., Nouvellet, P., Herringer, M., Cohn, E. L., Carrion, M., Brownstein, J. S., Madoff, L. C., and Lassmann, B. 2019. “Real-time epidemic forecasting: challenges and opportunities,” Health Security (17:4), pp. 268–275.

Doms, C., Kramer, S. C., and Shaman, J. 2018. “Assessing the Use of Influenza Forecasts and Epidemiological Modeling in Public Health Decision Making in the United States,” Scientific Reports (8:1), pp. 1–7.

Doshi-Velez, F., and Kim, B. 2017. “Towards A Rigorous Science of Interpretable Machine Learning.”

Driedger, S. M., Cooper, E. J., and Moghadas, S. M. 2014. “Developing model-based public health policy through knowledge translation: the need for a ‘Communities of Practice’,” Public Health (2014:128), pp. 561–567.

Dunn, D. M., Williams, W. B. H., and DeChaine, T. L. 1976. “Aggregate Versus Subaggregate Models in Local Area Forecasting,” Journal of the American Statistical Association (71:353), pp. 68–71.

Ellis, T. J., and Levy, Y. 2010. “A Guide for Novice Researchers: Design and Development Research Methods,” in Proceedings of Informing Science & IT Education Conference, pp. 107–118.

Erion, G., Janizek, J. D., Sturmfels, P., Lundberg, S., and Lee, S.-I. 2019. “Learning Explainable Models Using Attribution Priors,” Arxiv Preprint.

Ertem, Z., Raymond, D., and Meyers, L. A. 2018. “Optimal multi-source forecasting of seasonal influenza,” PLoS Computational Biology (14:9), pp. 1–16.


Faensen, D., Claus, H., Benzler, J., Ammon, A., Pfoch, T., Breuer, T., and Krause, G. 2006. “SurvNet@RKI - A Multistate Electronic Reporting System for Communicable Diseases,” Euro Surveillance (11:4), pp. 100–103.

Flahault, A., Bar-Hen, A., and Paragios, N. 2016. “Public Health and Epidemiology Informatics,” IMIA Yearbook of Medical Informatics 2016 pp. 240–246.

Fliedner, G. 2015. “Hierarchical forecasting: issues and use guidelines,” Industrial Management & Data Systems (101:1), pp. 5–12.

Friedman, E. J. 2004. “Paths and consistency in additive cost sharing,” International Journal of Game Theory (32:4), pp. 501–518.

Gasthaus, J., Benidis, K., Wang, Y., Rangapuram, S. S., Salinas, D., Flunkert, V., and Januschowski, T. 2020. “Probabilistic forecasting with spline quantile function RNNs,” in AISTATS 2019 - 22nd International Conference on Artificial Intelligence and Statistics, vol. 89, pp. 1–10.

Ghosh, D., and Guha, R. 2011. “Using a neural network for mining interpretable relationships of West Nile risk factors,” Social Science and Medicine (72:3), pp. 418–429.

Gibson, G. C., Moran, K. R., Reich, N. G., and Osthus, D. 2019. “Improving Probabilistic Infectious Disease Forecasting Through Coherence,” Centers for Disease Control pp. 1–21.

Gosiewska, A., and Biecek, P. 2019. “iBreakDown: Uncertainty of Model Explanations for Non-Additive Predictive Models,” Journal of Open Source Software (4:43), pp. 1798–1808.

Gross, C. W., and Sohl, J. E. 1990. “Disaggregation methods to expedite product line forecasting,” Journal of Forecasting (9:3), pp. 233–254.

Hazelbag, C. M., Dushoff, J., Dominic, E. M., Mthombothi, Z. E., and Delva, W. 2020. “Calibration of individual-based models to epidemiological data: A systematic review,” PLoS Computational Biology (16:5), pp. 1–17.

Hevner, A. R., March, S. T., Park, J., and Ram, S. 2004. “Design Science Research in Information Systems Research,” MIS Quarterly (28:1), pp. 75–105.

Höhle, M. 2016. “Infectious Disease Modelling,” in Handbook of Spatial Epidemiology, R. H. M. U. AB Lawson, S Banerjee (ed.), Chapman & Hall/CRC, pp. 477–500.

Höhle, M., and An Der Heiden, M. 2014. “Bayesian nowcasting during the STEC O104:H4 outbreak in Germany, 2011,” Biometrics (70:4), pp. 993–1002.

Höhle, M., Meyer, S., and Paul, M. 2020. “Package ’surveillance’.”


Hyndman, R. J. 2014. “Errors on percentage errors,” Hyndsight Blog pp. 1–2.

Hyndman, R. J., Ahmed, R. A., Athanasopoulos, G., and Shang, H. L. 2011. “Optimal combination forecasts for hierarchical time series,” Computational Statistics and Data Analysis (55:9), pp. 2579–2589.

Hyndman, R. J., and Athanasopoulos, G. 2018. Forecasting: Principles and practice, OTexts, 2nd ed.

Hyndman, R. J., and Koehler, A. B. 2006. “Another look at measures of forecast accuracy,” International Journal of Forecasting (22:4), pp. 679–688.

Hyndman, R. J., and Kostenko, A. V. 2007. “Minimum sample size requirements for seasonal forecasting models,” Foresight: the International Journal of Applied Forecasting (6:6), pp. 12–15.

Hyndman, R. J., Lee, A. J., and Wang, E. 2016. “Fast computation of reconciled forecasts for hierarchical and grouped time series,” Computational Statistics and Data Analysis (97:October), pp. 16–32.

IANPHI 2020. “Members of the International Association of National Public Health Institutes.”

Janzing, D., Minorics, L., and Blöbaum, P. 2019. “Feature relevance quantification in explainable AI: A causal problem,” Arxiv Preprint.

Johnson, L. R., Gramacy, R. B., Cohen, J., Mordecai, E., Murdock, C., Rohr, J., Ryan, S. J., Stewart-Ibarra, A. M., and Weikel, D. 2018. “Phenomenological forecasting of disease incidence using heteroskedastic gaussian processes: A dengue case study,” Annals of Applied Statistics (12:1), pp. 27–66.

Kandula, S., Yang, W., and Shaman, J. 2017. “Type- and Subtype-Specific Influenza Forecast,” American Journal of Epidemiology (185:5), pp. 395–402.

Kane, M. J., Price, N., Scotch, M., and Rabinowitz, P. 2014. “Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks,” BMC Bioinformatics (15:1).

Kim, B., Rudin, C., and Shah, J. 2014. “The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification,” in Advances in neural information processing systems, pp. 1952–1960.

Krause, G., Altmann, D., Faensen, D., Porten, K., Benzler, J., Pfoch, T., Ammon, A., Kramer, M. H., and Claus, H. 2007. “SurvNet Electronic Surveillance System for Infectious Disease Outbreaks, Germany,” Emerging Infectious Diseases (13:10), pp. 1548–1555.


Krening, S., Harrison, B., Feigh, K. M., Isbell, C. L., Riedl, M., and Thomaz, A. 2017. “Learning from explanations using sentiment and advice in RL,” Transactions on Cognitive and Developmental Systems (9:1), pp. 44–55.

Kumar, I. E., Venkatasubramanian, S., Scheidegger, C., and Friedler, S. 2020. “Problems with Shapley-value-based explanations as feature importance measures,” Arxiv Preprint pp. 1–14.

Lapuschkin, S. 2019. Opening the Machine Learning Black Box with Layer-wise Relevance Propagation, Ph.D. thesis.

Lauer, S. A., Brown, A. C., and Reich, N. G. 2020. “Infectious Disease Forecasting for Public Health,” in Population Biology of Vector-Borne Diseases, J. Drake, M. Strand, and M. Bonsall (eds.), pp. 1–36.

Lawless, J. F. 1994. “Adjustments for Reporting Delays and the Prediction of Occurred but Not Reported Events,” The Canadian Journal of Statistics (22:1), pp. 15–31.

Lawson, A. B., and Song, H.-r. 2010. “Bayesian hierarchical modeling of the dynamics of spatio-temporal influenza season outbreaks,” Spatial and Spatio-temporal Epidemiology (1:2-3), pp. 187–195.

Levy, Y., and Ellis, T. J. 2006. “A Systems Approach to Conduct an Effective Literature Review in Support of Information Systems Research,” Informing Science Journal.

Li, H., Li, H., Lu, Y., and Panagiotelis, A. 2019. “A forecast reconciliation approach to cause-of-death mortality modeling,” Insurance: Mathematics and Economics (86), pp. 122–133.

Li, H., and Tang, Q. 2019. “Analyzing mortality bond indexes via hierarchical forecast reconciliation,” ASTIN Bulletin (49:3), pp. 823–846.

Lipton, Z. C. 2018. “The Mythos of Model Interpretability,” Queue (16:3), pp. 31–57.

Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., and Lee, S.-I. 2019. “Explainable AI for Trees: From Local Explanations to Global Understanding,” Nature Machine Intelligence (2), pp. 56–67.

Lundberg, S. M., and Lee, S. I. 2017. “A unified approach to interpreting model predictions,” Advances in Neural Information Processing Systems (2017-Dec:2), pp. 4766–4775.

Lutz, C. S., Huynh, M. P., Schroeder, M., Anyatonwu, S., Dahlgren, F. S., Danyluk, G., Fernandez, D., Greene, S. K., Kipshidze, N., Liu, L., Mgbere, O., Mchugh, L. A., Myers, J. F., Siniscalchi, A., Sullivan, A. D., West, N., Johansson, M. A., and Biggerstaff, M. 2019. “Applying infectious disease forecasting to public health: a path forward using influenza forecasting examples,” BMC Public health (19:1659), pp. 1–12.

Makridakis, S., Spiliotis, E., and Assimakopoulos, V. 2018. “The M4 Competition: Results, findings, conclusion and way forward,” International Journal of Forecasting (34:4), pp. 802–808.

Manheim, D., Chamberlin, M., Osoba, O., Vardavas, R., and Moore, M. 2017. Improving Decision Support for Infectious Disease Prevention and Control: Aligning Models and Other Tools with Policymakers’ Needs, National Defense Research Institute.

Mcgowan, C. J., Biggerstaff, M., Johansson, M., Apfeldorf, K. M., Ben-nun, M., Brooks, L., Convertin, M., Erraguntla, M., Farrow, D. C., Freeze, J., Ghosh, S., Sangwon, H., and Kandula, S. 2019. “Collaborative efforts to forecast seasonal influenza in the United States, 2015-2016,” Tech. Rep. (2019).

Meinshausen, N. 2006. “Quantile Regression Forests,” Journal of Machine Learning Research (7), pp. 983–999.

Merrick, L., and Taly, A. 2019. “The Explanation Game: Explaining Machine Learning Models with Cooperative Game Theory,” Arxiv Preprint.

Meyer, S., Held, L., and Höhle, M. 2017. “Spatio-temporal analysis of epidemic phenomena using the R package surveillance,” Journal of Statistical Software (77:1).

Miller, T. 2019. “Explanation in Artificial Intelligence: Insights from the Social Sciences,” Artificial Intelligence (267), pp. 1–38.

Mohseni, S., Zarei, N., and Ragan, E. D. 2018. “A Multidisciplinary Survey and Framework for Design and Evaluation of Explainable AI Systems,” ACM Trans interact intell syst (1:1), pp. 1–37.

Naumova, E. N., O’Neil, E., and MacNeill, I. 2005. “INFERNO: a system for early outbreak detection and signature forecasting.” MMWR Morbidity and mortality weekly report (54 Suppl), pp. 77–83.

Neiting, T. G., and Raftery, A. E. 2007. “Strictly Proper Scoring Rules, Prediction and Estimation,” Journal of the American Statistical Association (102:477), pp. 359–378.

Nsoesie, E. O., Brownstein, J. S., Ramakrishnan, N., and Marathe, V. 2014. “A systematic review of studies on forecasting the dynamics of influenza outbreaks,” Influenza and other respiratory viruses (8:3), pp. 309–316.

Osthus, D., Hickmann, K. S., Higdon, D., Del, S. Y., and Alamos, L. 2017. “Forecasting seasonal influenza with a state-space SIR model,” Annals of Applied Statistics (11:1), pp. 202–224.


Ouyang, W., Zhang, Y., Zhu, M., Zhang, X., Chen, H., Ren, Y., and Fan, W. 2019. “Interpretable Spatial-Temporal Attention Graph Convolution Network for Service Part Hierarchical Demand Forecast,” Lecture Notes in Computer Science (11839).

Panagiotelis, A., Athanasopoulos, G., Gamakumara, P., and Hyndman, R. J. 2020. “Forecast reconciliation: A geometric view with new insights on bias correction,” Monash University, Work Paper (23:20), pp. 1–33.

Paul, M., and Meyer, S. 2016. “hhh4: An endemic-epidemic modelling framework for infectious disease counts,” Journal of Statistical Software (2010:1), pp. 1–17.

Peffers, K., Rothenberger, M., Tuunanen, T., and Vaezi, R. 2012. “Design Science Research Evaluation,” in DESRIST12, pp. 398–410.

Peffers, K., Tuunanen, T., Rothenberger, M. A., and Chatterjee, S. 2007. “A design science research methodology for information systems research,” Journal of Management Information Systems (24:3), pp. 45–77.

Polonsky, J. A., Baidjoe, A., Kamvar, Z. N., Cori, A., Durski, K., John Edmunds, W., Eggo, R. M., Funk, S., Kaiser, L., Keating, P., Le Polain De Waroux, O., Marks, M., Moraga, P., Morgan, O., Nouvellet, P., Ratnayake, R., Roberts, C. H., Whitworth, J., and Jombart, T. 2019. “Outbreak analytics: A developing data science for informing the response to emerging pathogens,” Philosophical Transactions of the Royal Society B: Biological Sciences (374:1776).

Rehman, H. U., Wan, G., Ullah, A., and Shaukat, B. 2019. “Individual and combination approaches to forecasting hierarchical time series with correlated data: an empirical study,” Journal of Management Analytics (6:3), pp. 231–249.

Reich, N. G., Lauer, S. A., Sakrejda, K., Iamsirithaworn, S., Hinjoy, S., Suangtho, P., Suthachana, S., Clapham, H. E., Salje, H., Cummings, D. A., and Lessler, J. 2016a. “Challenges in Real-Time Prediction of Infectious Disease: A Case Study of Dengue in Thailand,” PLoS Neglected Tropical Diseases (10:6), pp. 1–17.

Reich, N. G., Lessler, J., Sakrejda, K., Lauer, S. A., Iamsirithaworn, S., and Cummings, D. A. T. 2016b. “Case study in evaluating time series prediction models using the relative mean absolute error,” The American Statistician (70:3), pp. 285–292.

Ribeiro, M. T., Singh, S., and Guestrin, C. 2016a. “Model-Agnostic Interpretability of Machine Learning,” Arxiv Preprint.

Ribeiro, M. T., Singh, S., and Guestrin, C. 2016b. “"Why Should I Trust You?" - Explaining the Predictions of Any Classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.


Ribeiro, M. T., Singh, S., and Guestrin, C. 2018. “Anchors: High-Precision Model-Agnostic Explanations,” in AAAI, pp. 1527–1535.

Rivers, C., Chretien, J.-p., Riley, S., Pavlin, J. A., Woodward, A., Brett-major, D., Berry, I. M., Morton, L., Jarman, R. G., Biggerstaff, M., Johansson, M. A., Reich, N. G., Meyer, D., Snyder, M. R., and Pollett, S. 2019. “Using “outbreak science” to strengthen the use of models during epidemics,” Nature Communications (10:3102), pp. 1–3.

Robert Koch-Institut 2012. “Darstellung und Bewertung der epidemiologischen Erkenntnisse im Ausbruch von Norovirus-Gastroenteritis in Einrichtungen mit Gemeinschaftsverpflegung, Ostdeutschland, September-Oktober 2012.” Tech. rep.

Robert Koch-Institut 2017. “Deutsches Elektronisches Melde- und Informationssystem für den Infektionsschutz (DEMIS),” Epidemiologisches Bulletin (30), pp. 291–293.

Robert Koch-Institut 2019. Infektionsepidemiologisches Jahrbuch meldepflichtiger Krankheiten für 2018, Berlin.

Robert Koch-Institut 2020. “Projekt "Signale",” RKI Homepage.

Roth, A. E. 1988. “Introduction to the Shapley Value,” in The Shapley Value: Essays in Honor of Lloyd S. Shapley, Cambridge University Press, pp. 1–35.

Rousseeuw, P. J. 1987. “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics (20:C), pp. 53–65.

Salathé, M., Bengtsson, L., Bodnar, T. J., Brewer, D. D., Brownstein, J. S., Buckee, C., Campbell, E. M., Cattuto, C., Khandelwal, S., Mabry, P. L., and Vespignani, A. 2012. “Digital epidemiology,” PLoS Computational Biology (8:7), pp. 1–5.

Salmon, M., Schumacher, D., and Höhle, M. 2016. “Monitoring count time series in R: Aberration detection in public health surveillance,” Journal of Statistical Software (70:10), pp. 1–35.

Samek, W., Montavon, G., Vedaldi, A., Hansen, L. K., and Muller, K.-R. 2019. “Explainable AI: Interpreting, Explaining and Visualizing Deep Learning,” Lecture Notes in Computer Science (11700), pp. 1–435.

Scarpino, S. V., and Petri, G. 2019. “On the predictability of infectious disease outbreaks,” Nature Communications (10:1).

Schielzeth, H. 2010. “Simple means to improve the interpretability of regression coefficients,” Methods in Ecology and Evolution (1), pp. 103–113.


Schwarzkopf, A. B., Tersine, R. J., Morris, J. S., Schwarzkopf, A. B., Tersine, R. J., andMorris, J. S. 1988. “Top-down versus bottom-up forecasting strategies,” InternationalJournal of Production Research (26:11), pp. 1833–184.

Scikit-learn 2020. “sklearn.ensemble.RandomForestRegressor,” scikit-learn 0.23.2 documentation, pp. 1–7.

Semenza, J. C. 2015. “Prototype early warning systems for vector-borne diseases in Europe,” International Journal of Environmental Research and Public Health (12:6), pp. 6333–6351.

Shang, H. L., and Hyndman, R. J. 2017. “Grouped Functional Time Series Forecasting: An Application to Age-Specific Mortality Rates,” Journal of Computational and Graphical Statistics (26:2), pp. 330–343.

Shang, H. L., and Smith, P. W. F. 2013. “Grouped time-series forecasting with an application to regional infant mortality counts,” Centre for Population Change Working Papers (:40).

Shlifer, E., and Wolff, R. 1979. “Aggregation and Proration in Forecasting,” Management Science (25:6), pp. 594–603.

Shrikumar, A., Greenside, P., and Kundaje, A. 2017. “Learning important features through propagating activation differences,” in 34th International Conference on Machine Learning, ICML 2017, pp. 4844–4866.

Smyl, S. 2020. “A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting,” International Journal of Forecasting (36:1), pp. 75–85.

Spiliotis, E., Petropoulos, F., and Assimakopoulos, V. 2019. “Improving the forecasting performance of temporal hierarchies,” PLoS ONE (14:10), pp. 1–21.

Stojanovic, O., Leugering, J., Pipa, G., Ghozzi, S., and Ullrich, A. 2019. “A Bayesian Monte Carlo approach for predicting the spread of infectious diseases,” PLoS ONE (14:12), pp. 1–20.

Štrumbelj, E., and Kononenko, I. 2014. “Explaining prediction models and individual predictions with feature contributions,” Knowledge and Information Systems (41:3), pp. 647–665.

Sundararajan, M., and Najmi, A. 2019. “The many Shapley values for model explanation,” arXiv Preprint, pp. 1–20.

Sundararajan, M., Taly, A., and Yan, Q. 2017. “Axiomatic attribution for deep networks,” in 34th International Conference on Machine Learning, ICML, vol. 70, pp. 3319–3328.

Taieb, S. B., Taylor, J. W., and Hyndman, R. J. 2017. “Coherent probabilistic forecasts for hierarchical time series,” 34th International Conference on Machine Learning, ICML 2017 (7), pp. 5143–5155.

Taieb, S. B., Taylor, J. W., and Hyndman, R. J. 2020. “Hierarchical Probabilistic Forecasting of Electricity Demand With Smart Meter Data,” Journal of the American Statistical Association (0:0), pp. 1–17.

Thorve, S., Wilson, M. L., Lewis, B. L., Swarup, S., Vullikanti, A. K. S., and Marathe, M. V. 2018. “EpiViewer: An epidemiological application for exploring time series data,” BMC Bioinformatics (19:1), pp. 1–10.

Troitzsch, K. G. 2009. “Not all explanations predict satisfactorily, and not all good predictions explain,” Journal of Artificial Societies and Social Simulation (12:1).

Unkel, S., Farrington, C. P., Garthwaite, P. H., and Robertson, C. 2012. “Statistical methods for the prospective detection of infectious disease outbreaks: a review,” Journal of the Royal Statistical Society A (175:1), pp. 49–82.

van Erven, T., and Cugliari, J. 2015. “Game-Theoretically Optimal reconciliation of contemporaneous hierarchical time series forecasts,” Lecture Notes in Statistics (217), pp. 297–317.

Viboud, C., Boëlle, P.-y., Carrat, F., Valleron, A.-j., and Flahault, A. 2003. “Prediction of the Spread of Influenza Epidemics by the Method of Analogues,” American Journal of Epidemiology (158:10), pp. 996–1006.

Viboud, C., Sun, K., Ga, R., Ajelli, M., Fumanelli, L., Merler, S., Zhang, Q., Chowell, G., Simonsen, L., and Vespignani, A. 2018. “The RAPIDD ebola forecasting challenge: Synthesis and lessons learnt,” Epidemics (22:2018), pp. 13–21.

Vom Brocke, J., Simons, A., Niehaves, B., Reimer, K., Plattfaut, R., and Cleven, A. 2009. “Reconstructing the Giant: On the Importance of Rigour in Documenting the Literature Search Process,” ECIS 2009 Proceedings.

Wachter, S., Mittelstadt, B., and Russell, C. 2018. “Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR,” Harvard Journal of Law and Technology (31:842-861), p. 841.

Wickramasuriya, S. L., Athanasopoulos, G., and Hyndman, R. J. 2019. “Optimal Forecast Reconciliation for Hierarchical and Grouped Time Series Through Trace Minimization,” Journal of the American Statistical Association (114:526), pp. 1–16.

Wickramasuriya, S. L., Turlach, B. A., and Hyndman, R. J. 2020. “Optimal non-negative forecast reconciliation,” Statistics and Computing (2020:February).

World Health Organisation 1968. “Report of the technical discussions at the twenty-first World Health Assembly on "national and global surveillance of communicable diseases",” Tech. rep., Geneva, Switzerland.

Yang, W., Olson, D. R., and Shaman, J. 2016. “Forecasting Influenza Outbreaks in Boroughs and Neighborhoods of New York City,” PLoS Computational Biology (12:11), pp. 1–19.

Zeger, S. L., and Karim, M. R. 1991. “Generalized linear models with random effects; a Gibbs sampling approach,” Journal of the American Statistical Association (86:413), pp. 79–86.

Declaration of Authorship

I hereby declare that, to the best of my knowledge and belief, this Master Thesis titled “Interpretable Hierarchical Forecasting of Infectious Diseases” is my own work. I confirm that each significant contribution to and quotation in this thesis that originates from the work or works of others is indicated by proper use of citation and references.

Münster, 29th September 2020

Adrian Lison

Consent Form

Last name: Lison
First name: Adrian
Student number: 429175
Course of study: Information Systems
Address: Feuerbachstr. 31, 14471 Potsdam
Title of the thesis: “Interpretable Hierarchical Forecasting of Infectious Diseases”

What is plagiarism? Plagiarism is defined as submitting someone else’s work or ideas as your own without a complete indication of the source. It is irrelevant whether the work of others is copied word for word without acknowledgment of the source, text structures (e.g. line of argumentation or outline) are borrowed, or texts are translated from a foreign language.

Use of plagiarism detection software The examination office uses plagiarism software to check each submitted bachelor and master thesis for plagiarism. For that purpose the thesis is electronically forwarded to a software service provider where the software checks for potential matches between the submitted work and work from other sources. For future comparisons with other theses, your thesis will be permanently stored in a database. Only the School of Business and Economics of the University of Münster is allowed to access your stored thesis. The student agrees that his or her thesis may be stored and reproduced only for the purpose of plagiarism assessment. The first examiner of the thesis will be advised on the outcome of the plagiarism assessment.

Sanctions Each case of plagiarism constitutes an attempt to deceive in terms of the examination regulations and will lead to the thesis being graded as “failed”. This will be communicated to the examination office where your case will be documented. In the event of a serious case of deception the examinee can be generally excluded from any further examination. This can lead to the exmatriculation of the student. Even after completion of the examination procedure and graduation from university, plagiarism can result in a withdrawal of the awarded academic degree.

I confirm that I have read and understood the information in this document. I agree to the outlined procedure for plagiarism assessment and potential sanctioning.

Münster, 29th September 2020

Adrian Lison