
Data-Driven Methods for Modeling and Predicting Multivariate Time Series using Surrogates

Prithwish Chakraborty

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science

Narendran Ramakrishnan, Chair
Madhav Marathe
Chang-Tien Lu
Ravi Tandon
John S. Brownstein

April 28, 2016
Arlington, VA

Keywords: Multivariate Time Series, Surrogates, Generalized Linear Models, Bayesian Sequential Analysis, Computational Epidemiology

Copyright © 2015, Prithwish Chakraborty

Data-Driven Methods for Modeling and Predicting Multivariate Time Series using Surrogates


(ABSTRACT)

Modeling and predicting multivariate time series data has been of prime interest to researchers for many decades. Traditionally, time series prediction models have focused on finding attributes that have consistent correlations with target variable(s). However, diverse surrogate signals, such as News data and Twitter chatter, are increasingly available which can provide real-time information albeit with inconsistent correlations. Intelligent use of such sources can lead to early and real-time warning systems such as Google Flu Trends. Furthermore, the target variables of interest, such as public health surveillance, can be noisy. Thus models built for such data sources should be flexible as well as adaptable to changing correlation patterns.

In this thesis we explore various methods of using surrogates to generate more reliable and timely forecasts for noisy target signals. We primarily investigate three key components of the forecasting problem, viz. (i) short-term forecasting, where surrogates can be employed in a now-casting framework, (ii) long-term forecasting, where surrogates act as forcing parameters to model system dynamics, and (iii) robust drift models that detect and exploit ‘changepoints’ in the surrogate-target relationship to produce robust models. We explore various ‘physical’ and ‘social’ surrogate sources to study these sub-problems, primarily to generate real-time forecasts for endemic diseases. On the modeling side, we employed matrix factorization and generalized linear models to detect short-term trends and explored various Bayesian sequential analysis methods to model long-term effects. Our research indicates that, in general, a combination of surrogates can lead to more robust models. Interestingly, our findings indicate that under specific scenarios, particular surrogates can decrease overall forecasting accuracy, thus providing an argument towards the use of ‘Good data’ against ‘Big data’.

This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337. The US Government is authorized to reproduce and distribute reprints of this work for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the US Government.

Data-Driven Methods for Modeling and Predicting Multivariate Time Series using Surrogates


(GENERAL AUDIENCE ABSTRACT)

In the context of public health, modeling and early forecasting of infectious diseases is of prime importance. Such efforts help agencies to devise interventions and implement effective counter-measures. However, disease surveillance is an involved process where agencies estimate the intensity of diseases in the public domain using various networks. The process involves various levels of data cleaning and aggregation, and as such the resultant surveillance data is inherently noisy (requiring several revisions to stabilize) and delayed. Thus real-time forecasting of such diseases necessitates stable and robust methods that can provide accurate public health information in a time-critical manner. This work focuses on data-driven modeling and forecasting of time series, especially infectious diseases, for a number of regions of the world including Latin America and the United States of America. With the increasing popularity of social media, real-time societal information can be extracted from various media such as Twitter and News. This work addresses this critical area by presenting a number of models that systematically integrate and compare the usefulness of such real-time information from both physical indicators (such as Temperature) and non-physical indicators (such as Twitter) towards robust disease forecasting. Specifically, this work focuses on three critical areas: (a) short-term forecasting of disease case counts to get better estimates of the current on-ground scenario, (b) long-term forecasting of disease season characteristics to help public health agencies plan and implement interventions, and finally (c) concept drift detection and adaptation to account for the ever-evolving relationship between the societal surrogates and public health surveillance and to lend robustness to the disease forecasting models. This work shows that such indicators can be useful for reliable estimation of disease characteristics, even when the ground truth itself is unreliable, and provides insights as to how such indicators can be integrated as part of public surveillance. This work has used principles from diverse fields spanning Bayesian Statistics, Machine Learning, Information Theory, and Public Health to analyze and characterize such diseases.

Acknowledgments

I extend my sincere thanks and gratitude to my advisor Dr. Naren Ramakrishnan for his continued encouragement and guidance throughout my work. His feedback, insights and inputs have contributed immensely to the final form of this work. He has been my mentor and my guide. I have always found in him a patient listener who rendered clarity to my thoughts and I have always come out of our discussions with renewed vigor and focus. It has been my utmost privilege to work with him for all these years.

I would also like to thank my entire committee. I sincerely thank Dr. Madhav Marathe and Dr. John Brownstein for their unique perspectives on public health without which this work wouldn’t have been complete. I have especially enjoyed my meetings with Dr. Madhav Marathe and our collaborations that have helped me to gain a broader understanding about the field of computational epidemiology. I cannot thank Dr. Ravi Tandon enough for his inputs and insights that ultimately materialized in the form of ‘concept drift’ - a crucial component of this work. Finally, Dr. C.T. Lu has always been welcoming and encouraging, and I thank him for his crucial feedback and inputs about this work. I consider myself fortunate to have received the guidance of such an esteemed and kind group of people.

I extend my heartfelt thanks and gratitude to Dr. Bryan Lewis, NDSSL at Virginia Tech, for all his encouragement, guidance and countless hours working with me on this work. I have been lucky to have him as my mentor.

I would also like to thank the Discovery Analytics Center at Virginia Tech, which has been my home-away-from-home for these past few years. I have found mentors like Tozammel Hossain and Patrick Butler who have immensely shaped my early PhD years. All my lab members have been crucial and I will miss my time with all of them. They have been my friends, my colleagues and more often than not my support group throughout this process. I wish all of you the best for your future. I have also been fortunate to work with a varied group of collaborators from NDSSL, HealthMap and YeLab as well as public health agencies such as IARPA and CDC, which has made my PhD a great experience that I will cherish forever.

I would also like to express my gratitude to my wonderful friends - Deba Pratim Saha, Gourab Ghosh Roy, Saurav Ghosh, Sathappan Muthiah, Arijit Chattopadhyay, Sayantan Guha and Abhishek Mukherjee, to name a few - with whom I have shared unique moments throughout this time. Thanks for being around and being there for me whenever I needed you all.

Thanking my family is perhaps not enough. My mother Mrs. Devyani Chakraborty and my brother Mr. Prasenjit Chakraborty have been my closest friends and confidants. This work, as well as I, owe everything to you. My late father Mr. Prasanta Kr. Chakraborty would have been happy to see me where I am today. My sister-in-law Mrs. Amrita Dhole Chakraborty and my cousins, I thank you for being the best family I could hope for and being there for me always.

Contents

1 Background and Motivation
1.1 Flu Surveillance Effects
1.2 Motivation towards using surrogates

I Short-term Forecasting using Surrogates

2 Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions
2.1 Related Work
2.2 Problem Formulation
2.2.1 Methods
2.3 Ensemble Approaches
2.3.1 Data level fusion
2.3.2 Model level fusion
2.4 Forecasting a Moving Target
2.5 Experimental Setup
2.5.1 Reference Data
2.5.2 Evaluation criteria
2.5.3 Surrogate data sources
2.6 Results
2.7 Discussion

3 Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction
3.1 Summary
3.1.1 Model Similarity
3.1.2 Forecasting Results
3.1.3 Seasonal Analysis
3.2 Discussion

II Long-term Forecasting using Surrogates

4 Curve-matching from library of curves

5 Data Assimilation methods for long-term forecasting
5.1 Data Assimilation
5.2 Data Assimilation Models in disease forecasting
5.3 Data Assimilation Using surrogate Sources
5.4 Experimental Results and Performance Summary
5.5 Discussion

III Detecting and Adapting to Concept Drift

6 Hierarchical Quickest Change Detection via Surrogates
6.1 HQCD–Hierarchical Quickest Change Detection
6.1.1 Quickest Change Detection (QCD)
6.1.2 Changepoint detection in Hierarchical Data
6.2 HQCD for Count Data via Surrogates
6.2.1 Hierarchical Model for Count Data
6.2.2 Changepoint Posterior Estimation
6.3 Experiments
6.3.1 Synthetic Data
6.3.2 Real life case study
6.4 Discussion

7 Concept Drift Adaptation for Google Flu Trends
7.1 Background
7.2 Robust Models via Concept Drift Adaptation
7.2.1 Experimental evaluation and comparing Surrogate Sources
7.3 Discussion

8 Conclusion
8.1 Importance of Open Source Indicators for Public Health
8.2 Guidelines for using surrogates for Health Surveillance
8.3 Future Work

A Data Assimilation: detailed performance

B Sequential Bayesian Inference
B.1 SMC2 algorithm traces
B.2 SMC2 priors

C HQCD: Additional Experimental Results

List of Figures

1.1 Epidemic Pyramid: Depicts the process of how disease exposure in general population goes through several stages of surveillance and gets reported as confirmed cases. Adapted and redrawn from “The public health officer - Antimicrobial Resistance Learning Site For Veterinary Students”, http://amrls.cvm.msu.edu/integrated/principles/meet-the-public-health-officer

1.2 Christmas Effect in USA: Number of people seeking care drops during Christmas holidays. However, the number of ILI-related visits doesn’t vary from non-Christmas times, leading to an inflated percent ILI in general population.

1.3 ILI Surveillance drop towards the end of ILI season in CDC ILINet system. Inflection point can be seen at week 33. Reduced surveillance may render reports from later parts less accurate.

1.4 ILI surveillance instability: percentage relative error of updates w.r.t. final value as a function of update horizon for PAHO ILI reports for several Latin American countries. Stability varies from one country to another.

2.1 Our ILI data pipeline, depicting six different data sources used in this chapter to forecast ILI case counts.

2.2 Average relative error of PAHO count values with respect to stable values. (a) Comparison between Argentina and Colombia (b) Comparison between different seasons for Argentina.

2.3 Average relative error of PAHO count values before and after correction for different countries.

2.4 Accuracy of different methods for each country.

3.1 The distance matrix obtained from our learned DPARX model (bottom figure), associated with the ground truth ILI case count series (top figure) on the AR dataset. We can observe the strong seasonality automatically inferred in the matrix. Each element in the matrix is the Euclidean distance between a pair of the learned models at two corresponding time points after training. For the top figure, the x axis is the index of the weeks; the y axis is the number of ILI cases. For the bottom figure, both x and y axes are the index of the time points. Note that the starting time point (index 0) for the distance matrix is week 15 of the ILI case count series.

3.2 Model distance matrices for US dataset. The three matrices are derived from the fully connected similarity graph, the 3-nearest neighbor similarity graph and the seasonal 3-nearest neighbor similarity graph, from left to right correspondingly.

3.3 Comparison of seasonal characteristics for Mexico using different algorithms for one-step ahead prediction. Blue vertical dashed lines indicate the actual start and end of the season. ILI season considered: 2013.

4.1 Filtering library of curves based on season size and season shape.

4.2 Example of seasonal forecasts for ILI using curve-matching methods.

4.3 Performance measures for ILI seasonal characteristics using curve-matching

5.1 Performance summary for (a) ILI and (b) CHIKV seasonal forecasts using Weather as a surrogate source under data assimilation framework

5.2 Comparison of forecasting accuracy for Date metrics using surrogates

5.3 Comparison of forecasting accuracy for Value metrics using surrogates

5.4 Comparison of forecasting accuracy for ‘Start Date’ using different surrogate sources

5.5 Comparison of forecasting accuracy for ‘End Date’ using different surrogate sources

5.6 Comparison of forecasting accuracy for ‘Peak Date’ using different surrogate sources

5.7 Comparison of forecasting accuracy for ‘Peak Value’ using different surrogate sources

5.8 Comparison of forecasting accuracy for ‘Season Value’ using different surrogate sources

6.1 Illustration of Quickest Change Detection (QCD): blue colored line represents the actual changepoint at time Γ = t4. (a) declaring a change at γ1 leads to a false alarm, whereas (b) declaring the change at γ2 leads to detection delay. QCD can strike a tradeoff between false alarm and detection delay.

6.2 Generative process for HQCD. As an example consider civil unrest protests. In the framework, different protest types (such as Education- and Housing-related protests) form the targets denoted by Si’s. The total number of protests will be denoted by the top-most variable E. Finally, the set of surrogates, such as counts of Twitter keywords, stock price data, weather data, network usage data etc. are denoted by Kj’s.

6.3 Histogram fit of (a) surrogate source (Twitter keyword counts) and (b) target source (Number of protests of different categories), for various temporal windows, under i.i.d. assumptions. These assumptions lead to satisfactory distribution fit, at a batch level, for both sources. The top-most row corresponds to the period before the Brazilian spring (pre 2013-05-25), the second row is for the period 2013-05-25 to 2013-10-20, and the third is for the period after 2013-10-20. The last row shows the fit for the entire period. These temporal fits are indicative of significant changes in distribution along the Brazilian Spring timeline, for both target and surrogates.

6.4 Computation time for one complete run of changepoint detection (in mins) on a 1.6 GHz quad-core Intel i5 processor with 8 GB of memory: Gibbs sampling [8] vs HQCD vs HQCD without surrogates. Gibbs sampling computation times are unsuitable for online detection.

6.5 Comparison of HQCD against state-of-the-art on simulated target sources. X-axis represents time and Y-axis represents actual value. Solid blue lines refer to the true changepoint, solid green refers to the ones detected by HQCD and brown refers to HQCD without surrogates. Dashed red, magenta, purple and gold lines refer to changepoints detected by RuLSIF, WGLRT, BOCPD and GLRT, respectively. HQCD shows better detection for most targets with low overall detection delay and false alarms.

6.6 False Alarm vs Delay trade-off for different methods. HQCD shows the best trade-off.

6.7 Comparison of detected changepoints at the sum-of-targets (all Protests). HQCD detections are shown in solid green while those from the state-of-the-art methods i.e. RuLSIF (red), WGLRT (magenta), BOCPD (purple) and GLRT (gold) are shown with dashed lines. HQCD detection is the closest to the traditional start date of Mass Protests in the three countries studied.

6.8 (Brazilian Spring) Heatmap of changepoint influences of targets on targets (a); and surrogates on targets (b). Darker (lighter) shades indicate higher (lesser) changepoint influence. (a) shows presence of strong off-diagonal elements indicating strong cross-target changepoint information. (b) shows a mixture of uninformative and informative surrogates.

7.1 Evidence of Concept Drift. In Google Flu Trends data for Argentina (left), the corresponding 52-week rolling mean (right) exhibits a saddle point in early 2012, which indicates a possible mean shift drift in GFT for Argentina.

7.2 Concept Drift Adaptation Framework. The framework ingests target sources such as CDC ILI case count data and surrogate sources such as GFT and detects changepoints via the ‘Concept Drift Detector’ stage. Drift probabilities are next passed on to the ‘Drift Adaptation’ stage where robust predictions are generated using resampling based methods.

7.3 Drift Adaptation for Mexico using GFT

7.4 Drift Adaptation for Mexico using GST

7.5 Drift Adaptation for Mexico using HealthMap

7.6 Drift Adaptation for Mexico using weather sources

7.7 Drift Adaptation for Mexico using All sources

8.1 Correlation of surrogate sources with disease incidence. Count of influenza related keywords from (a) HealthMap and (b) GST compared against influenza case counts for Argentina as available from PAHO. HealthMap keywords capture the start of the season more accurately, while GST keywords exhibit a sub-optimal but consistent correlation with PAHO counts.

C.1 Comparison of detected changepoints at the target sources (Protest types). HQCD detections are shown in solid green while those from the state-of-the-art methods i.e. RuLSIF (red), WGLRT (magenta), BOCPD (purple) and GLRT (gold) are shown with dashed lines.

List of Tables

2.1 Comparing forecasting accuracy of models using individual sources. Scores in this and other tables are normalized to [0,4] so that 4 is the most accurate.

2.2 Comparison of prediction accuracy while combining all data sources and using MFN regression.

2.3 Comparison of prediction accuracy while using model level fusion on MFN regressors and employing PAHO stabilization.

2.4 Discovering importance of sources in Model level fusion on MFN regressors by ablating one source at a time.

2.5 ILI case count prediction accuracy for Mexico using OpenTable data as a single source, and by combining it with all other sources using model level fusion on uncorrected ILI case count data.

3.1 Prediction accuracies for competing algorithms with different forecast steps over different countries using the GFT input source. GFT data is not available for other countries.

3.2 Prediction accuracies for competing algorithms with different forecast steps over different countries using the weather data source.

3.3 Prediction accuracies for competing algorithms with different forecast steps over different countries using the GST data source.

3.4 Prediction accuracies for competing algorithms with different forecast steps over different countries using the HealthMap data source.

5.1 Forecasting performance of seasonal characteristics using data assimilation methods

6.1 Comparison of state-of-the-art methods vs Hierarchical Quickest Change Detection

6.2 (Synthetic data) comparing true changepoint (Γ) for targets against detected changepoint (γ) by HQCD against state-of-the-art methods for false alarm (FA) and additive detection delay (ADD). Each row represents a target and best detected changepoint is shown in bold whereas false alarms are shown in red.

7.1 Comparison of surrogate sources pre- and post-drift adaptation.

A.1 Performance of Data assimilation methods using different surrogate sources w.r.t. seasonal characteristics

C.1 (Protest uprisings) Comparison of HQCD vs state-of-the-art with respect to detected changepoints

Chapter 1

Background and Motivation

The problem of multivariate time series forecasting has been studied extensively for several decades and has found use in diverse fields such as Economics and Statistics [9]. Some of the more popular methods that have been used in this sphere are Autoregressive (AR) models, Autoregressive Moving Average (ARMA) models and Vector Autoregressive (VAR) models for linear problems. For nonlinear problems, some of the more popular methods have been Kernel Regression and Gaussian Processes. However, the traditional approaches have focused on admitting only coherent time series and/or independent time series which exhibit a consistent causal relation with the target of interest. In recent years, ‘big data’ in the form of diverse real-time sources such as social media and news has been readily available. These data sources are in general noisy, and their relationships with any target source can change over time, for example as the search patterns of users shift. However, if used intelligently, such sources can aid in accurately modeling complex target sources, such as influenza case counts for a country, in near real-time. This thesis focuses on such noisy surrogates.

We explore the problem of flu forecasting in Section 1.1 to identify the key advantages in using surrogates and motivate our methods in Section 1.2.

1.1 Flu Surveillance Effects

Accurate and timely influenza (flu) forecasting has gained significant traction in recent times. If done well, such forecasting can aid in deploying effective public health measures. Unlike other statistical or machine learning problems, however, flu forecasting brings unique challenges and considerations stemming from the nature of the surveillance apparatus and the end-utility of forecasts. Flu surveillance is an inherently complex process, and identifying the quirks of this process can lead to a better understanding of the possible problems facing a forecasting model.

[Figure 1.1 stages, as labeled in the pyramid: Exposures in general population; Person becomes ill; Person seeks care; Specimen obtained; Surveillance estimations; Final reports to Health Agencies]

Figure 1.1: Epidemic Pyramid: Depicts the process of how disease exposure in general population goes through several stages of surveillance and gets reported as confirmed cases. Adapted and redrawn from “The public health officer - Antimicrobial Resistance Learning Site For Veterinary Students”, http://amrls.cvm.msu.edu/integrated/principles/meet-the-public-health-officer

Figure 1.2: Christmas Effect in USA: Number of people seeking care drops during Christmas holidays. However, the number of ILI-related visits doesn’t vary from non-Christmas times, leading to an inflated percent ILI in general population.

Influenza-like Illnesses (ILI), tracked by many agencies such as CDC, PAHO, and WHO [10, 44, 64], is a category designed to capture severe respiratory disease, like influenza (flu), but also includes many other less severe respiratory illnesses due to their similar presentation. Surveillance methods often vary between agencies. Even for a single agency, there may be different networks (such as outpatient based and lab sample based) tracking ILI/Flu. While outpatient reporting networks such as ILINet aim to measure exact case counts for the regions under consideration, lab surveillance networks such as WHO NREVSS (used by PAHO) seek to confirm and identify the specific strain. In the absence of a clinic based surveillance system, lab-based systems can provide estimates at per “X” population level; however, making an estimate of actual influenza cases from these systems is challenging [10]. Furthermore, surveillance reports are often non-representative of actual ILI incidence. Figure 1.1 shows a representative ‘epidemic pyramid’ which depicts the surveillance system. The entire process is inherently associated with possible reporting errors, starting from patients seeking care to the final determination of confirmed cases through laboratory tests. Surveillance networks are also affected by cultural phenomena such as holiday periods, when the behavior of people visiting hospitals changes from other weeks. Figure 1.2 depicts the ‘Christmas effect’ observed during the holidays, when people seek care from physicians only in emergency situations, leading to inflated ILI percentages.
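The inflation behind the ‘Christmas effect’ is simple arithmetic: percent ILI is the share of ILI-related visits among all patient visits, so it rises when the denominator falls while the numerator holds steady. The following is a minimal sketch with hypothetical visit counts; the function name `percent_ili` and the numbers are illustrative assumptions, not values taken from any surveillance network.

```python
def percent_ili(ili_visits: int, total_visits: int) -> float:
    """Return ILI-related visits as a percentage of all patient visits."""
    if total_visits <= 0:
        raise ValueError("total_visits must be positive")
    return 100.0 * ili_visits / total_visits


# Hypothetical counts: ILI-related visits hold steady while total visits
# halve over the holidays, because people postpone non-urgent care.
regular_week = percent_ili(ili_visits=200, total_visits=10_000)
holiday_week = percent_ili(ili_visits=200, total_visits=5_000)

print(f"regular week: {regular_week:.1f}% ILI")  # regular week: 2.0% ILI
print(f"holiday week: {holiday_week:.1f}% ILI")  # holiday week: 4.0% ILI
```

In this toy case the reported percentage doubles even though the number of ILI-related visits is unchanged, which is why a forecasting model that reads percent ILI at face value may mistake a holiday week for a surge.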

Such effects may render the surveillance reports non-representative of on-ground scenarios. In addition to these effects, surveillance systems are also affected by other systematic artifacts. Surveillance reporting has been known to taper off or stop altogether during the post-peak part of the season. For example, as is evident from Figure 1.3, the number of providers who reported to US CDC ILINet surveillance tapers off towards the end of the ILI season (for the US, calendar week 40 corresponds to the first ILI season week [10]). Specifically, the inflection point of the average curve occurs at season week 33. Such effects can possibly be attributed to resource re-allocation due to reduced interest in post-peak activities. A combination of such effects ultimately causes surveillance data to lag behind real time. Even after the reports are published, they can remain candidates for revision for several weeks after initial publication. The lag between initial publication and final revision can be as small as 2 weeks (e.g., for CDC ILINet data) or can fluctuate widely. For example, PAHO reports for some Latin American countries such as Argentina, Colombia and Mexico can take more than 10 weeks to settle. On the other hand, PAHO reports stabilize within 5 weeks for countries such as Chile, Costa Rica and Peru (see Figure 1.4). The reason for such discrepancies has to do with the maturity of the surveillance apparatus and the level of coordination underlying public health reporting.

Figure 1.3: ILI Surveillance drop towards the end of ILI season in CDC ILINet system. Inflection point can be seen at week 33. Reduced surveillance may render reports from later parts less accurate.

Figure 1.4: ILI surveillance instability: percentage relative error of updates w.r.t. final value as a function of update horizon for PAHO ILI reports for several Latin American countries. Stability varies from one country to another.
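The revision lag discussed above can be quantified from the successive revisions of a single weekly report. The sketch below computes the percentage relative error of each revision with respect to the final value (the quantity plotted in Figure 1.4) and the first update horizon from which the error stays within a tolerance. The revision values, the 5% tolerance, and the function names are hypothetical and chosen only for illustration; they are not taken from PAHO or CDC data.

```python
from typing import List, Sequence


def relative_error_by_horizon(revisions: Sequence[float]) -> List[float]:
    """Percentage relative error of each revision w.r.t. the final value.

    revisions[h] is the value reported h weeks after initial publication;
    the last entry is treated as the stable (final) value.
    """
    final = revisions[-1]
    if final == 0:
        raise ValueError("final value must be non-zero")
    return [100.0 * abs(value - final) / abs(final) for value in revisions]


def weeks_to_stabilize(revisions: Sequence[float],
                       tolerance_pct: float = 5.0) -> int:
    """First update horizon from which every later revision is within tolerance."""
    errors = relative_error_by_horizon(revisions)
    for horizon in range(len(errors)):
        if all(err <= tolerance_pct for err in errors[horizon:]):
            return horizon
    return len(errors) - 1  # not reached: the final revision has zero error


# Hypothetical revisions of one weekly case count, one entry per update week.
revisions = [60.0, 80.0, 90.0, 97.0, 99.0, 100.0, 100.0]
errors = relative_error_by_horizon(revisions)
settle_week = weeks_to_stabilize(revisions, tolerance_pct=5.0)

print(errors)
print(f"stable within 5% after {settle_week} update(s)")
```

Averaging such error curves over many reporting weeks for a country would give a per-country stability profile of the kind compared in Figure 1.4.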


1.2 Motivation towards using surrogates

The flu surveillance effects described above can be thought of as a representative scenariofor a large class of problems dealing with real-time surveillance where on-ground scenario isdifficult to ascertain. Most work on forecasting do not account for such instability. In essence,these problems requires forecasting a moving target. Real-time surrogates as outlined abovecan be useful in such scenarios to augment the surveillance mechanism with informationfrom general population. Thus motivating the problem of flu forecasting, this thesis outlinesthree key problems as follows:

• Short-term forecasts using surrogates to augment delayed surveillance reports and provide real-time information of on-ground scenarios.

• Long-term forecasts using surrogates as forcing parameters to determine long-term characteristics with increased accuracy.

• Identifying and adapting to Concept Drift to detect changing relationships of surrogates and increase robustness of short- and long-term forecasts using such surrogates.

Part I

Short-term Forecasting using Surrogates


The first problem of this thesis is aimed at short-term forecasting of often delayed and unstable target sources such as Influenza-Like-Illness (ILI) case counts as reported by surveillance agencies such as CDC [10] and PAHO [44]. We compared a range of surrogates encompassing physical sources such as humidity and temperature, and social sources such as Twitter and News in [12] under a Matrix Factorization framework for ILI prediction in 15 Latin American countries. We found that no single source is best suited to model ILI for all countries. However, physical sources were in general the most informative sources. Furthermore, combining the sources led to better forecasting accuracy in general. We present these considerations in Chapter 2.

We next focused on increasing the forecasting horizon and used Regularized Generalized Linear Models to capture dynamic trends of ILI data in [62]. Our experiments indicate that we can reliably forecast up to 4 weeks in advance, for a range of countries including the USA and several Latin American countries, using our proposed methods. We highlight the important aspects of our findings from the problem in Chapter 3.

Chapter 2

Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions

Traditionally, epidemiological forecasts of common illnesses, such as the flu, rely heavily on surveillance reports published by health organizations. However, as discussed in Chapter 1, traditional surveillance reports are often published with a considerable delay and thus recent research has focused on mining social signals from search engine query volume [67, 24] and social media chatter [27, 34, 39, 15, 56].

One of the pioneering works in this space is that of Ginsberg et al. [24], where ILI case counts are predicted from the volume of search engine queries. This work inspired significant follow-on work, e.g., [67], where Yuan et al. used search query data from Baidu (a popular search engine in China) to detect influenza outbreaks. More real-time ILI detection systems [34] have been proposed by modeling Twitter streams.

Apart from such social media sources, there has also been considerable research on exploiting physical indicators such as climate data. The primary advantage of such data sources is that the effects are much more causal and less noisy. Shaman et al. [57, 49, 51] explored this area in detail and found absolute humidity to be a good indicator of influenza outbreaks.

While the aforementioned efforts have made important strides, there are important areas that have been relatively less studied. First, only a few efforts have focused on combining multiple data sources [29, 27] to aid in forecasting. In particular, to the best of our knowledge there has been no work that investigates the combination of social indicators and physical indicators to forecast ILI incidence. Second, and more importantly, official estimates as reported by health organizations (e.g., WHO, PAHO) are often lagged by several weeks and even when reported are typically revised for several weeks before the case counts are finalized. Real-time prediction systems must be designed to handle the forecasting of such a ‘moving target’. Finally, most existing work has been retrospective and not set in the context of a formal data mining validation framework. To overcome these deficiencies, we propose a novel approach to ILI case count forecasting. Our contributions are:

• Our approach integrates both social indicators and physical indicators and thus leverages the selective superiorities of both types of feature sets. We systematize such integration using a novel matrix factorization-based regression approach using neighborhood embedding, thus helping account for non-linear relationships between the surrogates and the official ILI estimates.

• We investigate the efficacy of combining diverse sources at two levels: the data fusion level and the model level, and discuss the relative (de)merits.

• We propose different ways of handling uncertainties in the official estimates and factor these uncertainties into our prediction models.

• Finally, we present a detailed and prospective analysis of our proposed methods by comparing predictions from a near-horizon real-time prediction system to official estimates of ILI case counts in 15 countries of Latin America.

2.1 Related Work

Related work naturally falls into the categories of social media analytics, physical indicators, and event dynamics modeling. These are described next.

Social media analytics: Most relevant work using social media analytics focuses on Twitter, specifically by tracking a dictionary of ILI-related keywords in the data stream. Such investigations have often focused on the importance of diversity in keyword lists, e.g., [39, 15]. In [39], Kanhabua and Nejdl used clustering methods to determine important topics in Twitter data, constructed time series for matched keywords, and used Jaccard's coefficient to characterize the temporal diversity of tweets. They noted that such temporal diversity may be correlated with real-world ILI outbreaks. In [15] the authors studied the dynamics between the change in circulated tweets and the H1N1 virus. Inspired by these works, we curated a custom ILI-related keyword dictionary, which is described in detail in Section 2.5.3.

Physical indicators for detecting ILI incidence levels: Tamerius et al. [57] investigated the existence of seasonal cycles of influenza epidemics in different climate regions. For this work, they considered climatic information from 78 globally distributed sites. Using logistic regression, they found that strong correlations exist between influenza epidemics and weather conditions, especially when conditions are cold-dry or humid-rainy. Similarly exciting results were reported by Shaman et al. in [49, 51], where they discovered absolute humidity to be a key indicator of flu. To uncover these relationships they used non-linear regressors such as Kalman filters, and this was a key inspiration for us in finding a uniform model for the varied data sources as explained in Section 2.2.1.

[Figure 2.1 pipeline stages: Data Enrichment; Filtering for Flu Related Content; Time Series Surrogate Extraction; ILI Prediction. Sources shown: Twitter, Weather, Google Trends, Google Flu Trends, HealthMap, OpenTable.]

Figure 2.1: Our ILI data pipeline, depicting six different data sources used in this chapter to forecast ILI case counts.

Event dynamics modeling: Denecke et al. [27] proposed an event-based approach for early prediction of ILI threats. Their method (M-Eco) considers multiple resources such as Twitter, TV reports, online news articles, and blogs and uses clustering to identify signals for event detection. Network dynamic solutions have also been used [3] to study the behavior of an epidemic in a society.


2.2 Problem Formulation

In this section, we formally introduce the problem. Let $\mathcal{P} = \langle P_1, P_2, \ldots, P_T \rangle$ denote the known total weekly ILI case count for the country under consideration, where $P_t$ denotes the case count for time point $t$ and $T$ denotes the time point until which the ILI case count is known. Corresponding to the ILI case count data, let us denote the available surrogate information for the same country by $\mathcal{X} = \langle X_1, X_2, \ldots, X_{T_1} \rangle$, where $T_1$ is the time point until which the surrogate information is available and $X_t$ denotes the surrogate attributes for time point $t$. The problem we desire to solve is to find a predictive model $f$ for the case count data, as presented formally in equation 2.1.

\[ f : P_t = f(\mathcal{P}, \mathcal{X}) \tag{2.1} \]

In this chapter, in order to better understand the importance of different sources, we assume that the ILI activities in different countries are independent of each other.

2.2.1 Methods

Focusing on the methods, we employ non-linear temporal regressions over the surrogate attributes to forecast the case count using three models: (a) Matrix Factorization Based Regression (MF), (b) Nearest Neighbor Based Regression (NN), and (c) Matrix Factorization Regression using Nearest Neighbor embedding (MFN). For each of the methods, we define two parameters: $\beta$ and $\alpha$. $\alpha$ is the lookahead window length, denoting the distance of the time point for prediction from $T$; $\beta$ is the lookback window length, denoting the number of time points to look back in order to find the regression relation between the case count and the surrogate data.

We define regression vectors $V_t$ and labels $L_t$, $\forall t = 1, \ldots, T$, as below:

\[ V_t \equiv \langle P_{t-\beta-\alpha}, X_{t-\beta-\alpha}, P_{t+1-\beta-\alpha}, X_{t+1-\beta-\alpha}, \ldots, P_{t-\alpha}, X_{t-\alpha} \rangle \]

\[ L_t \equiv P_t \]

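The regression vector for predicting the case count at time point $T'$ ($T + \alpha > T' > T$) is given by equation 2.2.

\[ V_{T'} \equiv \langle P_{T'-\beta-\alpha}, X_{T'-\beta-\alpha}, P_{T'+1-\beta-\alpha}, X_{T'+1-\beta-\alpha}, \ldots, P_{T'-\alpha}, X_{T'-\alpha} \rangle \tag{2.2} \]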
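The construction of the regression vectors and labels can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `build_regression_data`, the 0-based indexing, and the NumPy array layout (one row of surrogate attributes per week) are not from the thesis.

```python
import numpy as np

def build_regression_data(P, X, alpha, beta):
    """Build (V_t, L_t) pairs from case counts P and surrogates X.

    P: 1-D sequence of length T (weekly case counts, 0-indexed).
    X: 2-D array of shape (T, d), surrogate attributes per week.
    alpha: lookahead window length; beta: lookback window length.
    """
    P = np.asarray(P, dtype=float)
    X = np.asarray(X, dtype=float)
    V, L = [], []
    # The earliest usable t must satisfy t - beta - alpha >= 0.
    for t in range(alpha + beta, len(P)):
        window = range(t - beta - alpha, t - alpha + 1)  # beta + 1 time points
        # Interleave P_s and X_s for every s in the window, as in V_t.
        row = np.concatenate([np.concatenate(([P[s]], X[s])) for s in window])
        V.append(row)
        L.append(P[t])  # L_t = P_t
    return np.array(V), np.array(L)
```

With `alpha = 2` and `beta = 3`, for example, each vector covers four consecutive weeks that end two weeks before the week being predicted.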

Under these definitions we describe the models as follows:

Matrix Factorization Based Regression (MF):

Matrix Factorization is a well accepted technique in the recommender systems literature to predict user preferences from incomplete user ratings/information. Typically [7], a user-preference matrix is factored into a user-factor and a factor-preference matrix. However, such factorizations are incognizant of any temporal continuity. To enforce temporal continuity when predicting for the time point $T'$ ($T + \alpha > T' > T$), we use the regression vectors and labels as defined earlier to define an $m \times n$ prediction matrix $\mathcal{M}$, as given in equation 2.3:

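\[ \mathcal{M} = \begin{bmatrix} V_{\alpha+\beta+1} & L_{\alpha+\beta+1} \\ \vdots & \vdots \\ V_T & L_T \\ V_{T'} & L_{T'} \end{bmatrix} \tag{2.3} \]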

The prediction matrix is factorized into an $f \times m$ factor-feature matrix $U$ and an $f \times n$ factor-prediction matrix $F$ as:

\[ \hat{\mathcal{M}}_{i,j} = b_{i,j} + U_i^T F_j \]

Here, $b_{i,j}$ is the baseline estimate given by:

\[ b_{i,j} = \bar{\mathcal{M}} + b_j \tag{2.4} \]

where $\bar{\mathcal{M}}$ represents the all-element average and $b_j$ represents the column-wise deviations from the average and is generally a free parameter, i.e., it is fitted as part of the optimization problem. The matrices $U$ and $F$ are estimated by minimizing the error function:

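\[ b^*, F, U = \arg\min \left( \sum_{i=1}^{m-1} \left( \mathcal{M}_{i,n} - \hat{\mathcal{M}}_{i,n} \right)^2 + \lambda_1 \left( \sum_{j=1}^{n} b_j^2 + \sum_{i=1}^{m-1} \|U_i\|^2 + \sum_{j=1}^{n} \|F_j\|^2 \right) \right) \tag{2.5} \]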

where $\lambda_1$ is a regularization parameter. An important design criterion in the error function of equation 2.5 is that we only compute the error between the predicted label values and the actual label values, i.e., the $n$-th column of the prediction matrix $\mathcal{M}$. The rationale behind this choice is that, unlike traditional recommender systems, we are only concerned with the label column and can sacrifice reconstruction accuracy for other columns.

The lookback window $\beta$, the factor size $f$ and the regularization parameter $\lambda_1$ are estimated using cross-validation, and the final prediction for time point $T'$ is given by:

\[ \hat{P}_{T'} = b_{m,n} + U_m^T F_n \]

Nearest Neighbor Based Regression (NN):

For our second class of models, viz. nearest neighbor models, we define a training set $\Gamma_{NN} = \{V_t, L_t\}$, where $V_t$ represents the regression attributes and $L_t$ denotes the corresponding labels. Also, let us define the set $\mathcal{N}(i) = \{k : V_k \text{ is one of the top } K \text{ nearest neighbors of } V_i\}$, where $K$ indicates the maximum number of nearest neighbors considered. The predicted count $\hat{P}_{T'}$ for the time point $T'$ is given as:

\[ \hat{P}_{T'} = \left( \sum_{k \in \mathcal{N}(T')} \theta_k L_k \right) \Big/ \sum_{k=1}^{K} \theta_k \tag{2.6} \]

Here $\theta_k$ indicates the weight assigned to the $k$-th nearest neighbor. Typically the inverse Euclidean distances to $V_{T'}$ are chosen as the weights.
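A minimal sketch of this weighted nearest neighbor prediction is given below, assuming inverse Euclidean distance weights. The function name `nn_predict` and the small constant `eps` (added to avoid division by zero when the query coincides with a training vector) are illustrative choices, not part of the method description.

```python
import numpy as np

def nn_predict(V_train, L_train, v_query, K=3, eps=1e-9):
    """Inverse-distance weighted average of the labels of the K nearest neighbors."""
    V_train = np.asarray(V_train, dtype=float)
    L_train = np.asarray(L_train, dtype=float)
    v_query = np.asarray(v_query, dtype=float)
    # Euclidean distance from the query vector to every training vector.
    dists = np.linalg.norm(V_train - v_query, axis=1)
    nearest = np.argsort(dists)[:K]
    # theta_k: inverse distance; eps is an assumption guarding against zero distance.
    theta = 1.0 / (dists[nearest] + eps)
    return float(np.sum(theta * L_train[nearest]) / np.sum(theta))
```

In practice `V_train` and `L_train` would be the regression vectors and labels defined earlier, and `v_query` would be the vector for the time point being predicted.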

Matrix Factorization Based Regression using Nearest Neighbor Embedding (MFN):

It has been shown in [28] that matrix factorization using nearest neighbor constraints can outperform the classical matrix factorization approach as well as traditional nearest neighbor approaches to recommender systems. Drawing inspiration from this result, we modify the method to suit the temporal nature of our problem in similar ways as described in Section 2.2.1. We again define a similar prediction matrix $\mathcal{M}$ (see equation 2.3). Following [28], we define the matrix decomposition rule as

\[ \hat{\mathcal{M}}_{i,j} = b_{i,j} + U_i^T F_j + F_j \, |\mathcal{N}(i)|^{-\frac{1}{2}} \sum_{k \in \mathcal{N}(i)} \left( \mathcal{M}_{i,k} - b_{i,k} \right) x_k \tag{2.7} \]

The key difference between equation 2.7 and the one proposed in [28] is that we do not have any term for implicit feedback and, further, only the top $K$ neighbors as found through Euclidean distance are used. The model is fitted using equation 2.8 as given below:

\[ b^*, F, U, x^* = \arg\min \left( \sum_{i=1}^{m-1} \left( \mathcal{M}_{i,n} - \hat{\mathcal{M}}_{i,n} \right)^2 + \lambda_2 \left( \sum_{j=1}^{n} b_j^2 + \sum_{i=1}^{m-1} \|U_i\|^2 + \sum_{j=1}^{n} \|F_j\|^2 + \sum_{k} \|x_k\|^2 \right) \right) \tag{2.8} \]

2.3 Ensemble Approaches

In the last section, we described different strategies to correlate a specific source with the ILI case count of a specific country and predict future ILI counts. In practice, we desire to work with a multitude of data sources and there are two broad ways to accomplish this objective: (a) data level fusion, where a single regressor is constructed from different data sources to the ILI case count, and (b) model level fusion, where we build one regressor for each data source and subsequently combine the predictions from the models. In this section, we describe these fusion methods. Experimental results with both methods are presented in Section 2.6.


2.3.1 Data level fusion:

Here we express the feature vector $\mathcal{X}$ as a tuple over all the different data sources and then proceed with any one of the regression methods as outlined in Section 2.2.1. For example, while combining Twitter and weather data sources (see Figure 2.1), the feature vector is given by:

\[ X_t = \langle T_t, W_t \rangle \]

where $T_t$ and $W_t$ denote attributes derived from Twitter and weather, respectively.
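In code, data level fusion amounts to concatenating the per-week attributes of each source column-wise. The following is a minimal sketch; the function name `fuse_sources` and the assumption that all sources are already aligned on the same weekly index are illustrative.

```python
import numpy as np

def fuse_sources(*sources):
    """Concatenate per-week attributes from several sources into one feature matrix.

    Each source is array-like with one entry (scalar or attribute vector) per week;
    all sources are assumed to cover the same weeks in the same order.
    """
    arrays = [np.asarray(s, dtype=float) for s in sources]
    # Promote 1-D series to single-column matrices so hstack works uniformly.
    arrays = [a.reshape(len(a), -1) for a in arrays]
    if len({a.shape[0] for a in arrays}) != 1:
        raise ValueError("all sources must cover the same number of weeks")
    return np.hstack(arrays)
```

The fused matrix can then be passed as the surrogate input to any of the three regressors.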

2.3.2 Model level fusion:

In this approach, the models are combined using matrix factorization regression with nearest neighbor embedding by comparing the prediction estimates from each model with the actual estimate (since the ground truth can change as well) and the average ILI case count for the month for the particular country (to help organize a baseline). Let us denote the average ILI case count for a particular calendar month $I$ for a given country by:

\[ \mu_I = \sum_{t \in I} P_t \,\Big/\, |\{t \in I\}| \]

Considering $C$ different sources and hence $C$ different models, let us denote the prediction for the $t$-th time point from the $c$-th model by ${}^{c}P_t$.

Using these definitions we can now proceed to describe the fusion model. Essentially, the model is similar to the one described in Section 2.2.1, where the differences can be found in the way we construct the feature vectors. Similar to equation 2.3, we construct an $m' \times n'$ prediction matrix for fusion, denoted ${}^{C}\mathcal{M}$, where the $t$-th row is represented by equation 2.9.

\[ {}^{C}\mathcal{M}_t = \begin{bmatrix} {}^{1}P_t & \ldots & {}^{C}P_t & P_t \end{bmatrix} \tag{2.9} \]
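The two ingredients of the fusion model, the monthly baseline and the rows of the fusion matrix, can be sketched as follows. The function names `monthly_means` and `fusion_rows`, and the representation of calendar months as integers, are assumptions made only for illustration.

```python
import numpy as np

def monthly_means(P, months):
    """Average case count per calendar month (mu_I).

    P: case counts per week; months: calendar month (1-12) of each week.
    """
    P = np.asarray(P, dtype=float)
    months = np.asarray(months)
    return {int(m): float(P[months == m].mean()) for m in np.unique(months)}

def fusion_rows(model_preds, P):
    """Stack C per-model prediction series and the actual counts as columns.

    Row t of the result is [1P_t, ..., CP_t, P_t], as in equation 2.9.
    """
    cols = [np.asarray(p, dtype=float) for p in model_preds]
    cols.append(np.asarray(P, dtype=float))
    return np.column_stack(cols)
```

For instance, two models and the actual counts over two weeks yield a 2 x 3 fusion matrix whose last column holds the labels.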

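Then, similar to equation 2.7, we factor this matrix into latent factors ${}^{C}U$, ${}^{C}F$, ${}^{C}b^*$ as given by equation 2.10:

\[ {}^{C}\hat{\mathcal{M}}_{i,j} = \mu_i + {}^{C}b_j + {}^{C}U_i^T \, {}^{C}F_j + {}^{C}F_j \, |{}^{C}\mathcal{N}(i)|^{-\frac{1}{2}} \sum_{k \in {}^{C}\mathcal{N}(i)} \left( {}^{C}\mathcal{M}_{i,k} - \mu_i - {}^{C}b_k \right) {}^{C}x_k \tag{2.10} \]

so that the final prediction for the $T$-th data point is given by

\[ \hat{P}_T = {}^{C}\hat{\mathcal{M}}_{T,n'}. \]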

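The fitting function is given by equation 2.11:

\[ {}^{C}b^*, {}^{C}F, {}^{C}U, {}^{C}x^* = \arg\min \left( \sum_{i=1}^{m'-1} \left( {}^{C}\mathcal{M}_{i,n'} - {}^{C}\hat{\mathcal{M}}_{i,n'} \right)^2 + \lambda_3 \left( \sum_{j=1}^{n'} {}^{C}b_j^2 + \sum_{i=1}^{m'-1} \|{}^{C}U_i\|^2 + \sum_{j=1}^{n'} \|{}^{C}F_j\|^2 + \sum_{k} \|{}^{C}x_k\|^2 \right) \right) \tag{2.11} \]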

As before, the free parameters are estimated through cross-validation.

2.4 Forecasting a Moving Target

One of the key challenges in creating a prospective ILI case count predictor is the fact that the official estimates are often delayed and, furthermore, even when published the estimates are revised over a number of weeks before they finally become stable. For this chapter, we concentrate on 15 Latin American countries as described in Section 2.5 and consider the official ILI estimates from the Pan American Health Organization (PAHO). We can categorize PAHO count values downloaded on any week into three different types: (a) the unknown PAHO counts, (b) the known and stable PAHO counts, denoted by $P_t$, and (c) the known and unstable PAHO counts. While we desire to predict the unknown counts, the uncertainty associated with the unstable counts introduces errors in the predictions. In this section, we study the effects of such unstable data and propose three different models to adjust these unstable values to more accurate ones.

Figure 2.2a plots the relative error of an unstable PAHO data series w.r.t. its final estimate, as a function of time. It can be seen that different countries have different stability characteristics: for some countries, PAHO count values stabilize very slowly whereas for others they stabilize faster (especially as the number of updates for a week increases). The stability behavior of PAHO count values was also found to depend on the time of the year, as shown in Figure 2.2b. To plot this curve for Argentina, we categorized any week with fewer than 100 cases as belonging to the low season, more than 300 as belonging to the high season, and the remaining weeks as mid season (the thresholds were different for different countries).
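The season categorization described above reduces to a simple threshold rule. The sketch below uses the Argentina thresholds quoted in the text as defaults; the function name `season_of` is hypothetical, and other countries would need their own thresholds.

```python
def season_of(case_count, low=100, high=300):
    """Classify a week as 'low', 'mid' or 'high' season from its case count.

    Defaults follow the thresholds quoted for Argentina; thresholds for
    other countries differ and would have to be supplied by the caller.
    """
    if case_count < low:
        return "low"
    if case_count > high:
        return "high"
    return "mid"
```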

At the same time, the PAHO official updates provide an indication of the number of samples used to generate the case count estimate. Preliminary experiments show that this size is correlated with the accuracy of ILI case counts. In other words, in general, larger values of statistical population size result in smaller relative errors for the ILI case count. Thus, using both the number of samples and the lag in uploading the week's data, we can use machine learning techniques to revise the officially published PAHO estimates. Preliminary results show that for different seasons and different countries, we encounter different stability patterns. Therefore, any PAHO count adjustment method should be customized for seasons and countries separately.

Let us assume that $\mathcal{P}$ is the set of stable PAHO counts for a specific country. Also, assume that the sequence of updates for each stable PAHO count value is available. In other words, for $P_i$ we have the following set:

\[ \mathcal{P}_i = \left\{ P_i^{(1)}, P_i^{(2)}, \ldots, P_i^{(m)}, \ldots \right\} \tag{2.12} \]

where $P_i^{(m)}$ is the value of $P_i$ after $m$ weeks of update.

Figure 2.2: Average relative error of PAHO count values with respect to stable values. (a) Comparison between Argentina and Colombia. (b) Comparison between different seasons for Argentina.

After recognizing high, low, and mid-season months for the country, we can categorize each $P_i$ as belonging to one of these categories. Then, for category $S$, an adjustment dataset named $PA_S$ is constructed, defined as follows:

\[ PA_S = \left\{ (1, P_i^{(1)}, P_i, N_i^{(1)}), \ldots, (m, P_i^{(m)}, P_i, N_i^{(m)}), \ldots \right\} \tag{2.13} \]

Each member of $PA_S$ is a tuple with four entries: the first entry denotes the time slot that the sample belongs to; the second entry is the actual unstable value of $P_i$; the third entry is the related stable value; and finally, $N_i^{(m)}$ is the size of the statistical population for that week.

In the next step, a linear regression algorithm is used to adjust unstable PAHO values. In order to adjust the PAHO values in the $m$-th time slot of season $S$, we use the $PA_S$ set to learn the coefficients $a_0$, $a_1$, $a_2$, and $a_3$ in the following equation:

\[ \hat{P}_i^{(m)} = a_0 + a_1 m + a_2 P_i^{(m)} + a_3 N_i^{(m)} \tag{2.14} \]

where $\hat{P}_i^{(m)}$ is the adjusted PAHO count value for the $m$-th time slot.
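The adjustment step can be sketched with an ordinary least-squares fit. This is only one way to learn the coefficients, assuming plain unregularized least squares; the function names `fit_adjustment` and `adjust` and the tuple layout (time slot, unstable value, stable value, population size) mirror the adjustment dataset but are otherwise illustrative.

```python
import numpy as np

def fit_adjustment(samples):
    """Learn (a0, a1, a2, a3) from tuples (m, unstable, stable, population_size)."""
    data = np.asarray(list(samples), dtype=float)
    m, unstable, stable, n = data.T
    # Design matrix with an intercept column, matching the linear adjustment form.
    design = np.column_stack([np.ones(len(data)), m, unstable, n])
    coeffs, *_ = np.linalg.lstsq(design, stable, rcond=None)
    return coeffs

def adjust(coeffs, m, unstable, n):
    """Apply the learned linear adjustment to an unstable count."""
    a0, a1, a2, a3 = coeffs
    return a0 + a1 * m + a2 * unstable + a3 * n
```

One set of coefficients would be fitted per country and per season category, in line with the observation that stability patterns differ across both.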

Experimental results show that this adjustment method results in more accurate known PAHO values. Average relative errors of the published unstable PAHO values before and after correction for each country are shown in Figure 2.3. While in a few cases we do not experience any improvement, in countries such as Argentina and Paraguay we experience significant improvements.

Finally, similar to equation 2.14, in addition to $P_i^{(m)}$, one can use only the time difference ($m$) or the size of the population ($N_i^{(m)}$) to correct unstable PAHO values. The effects of these corrections on the overall accuracy of predictions are explored in Section 2.6.

2.5 Experimental Setup

2.5.1 Reference Data.

Figure 2.3: Average relative error of PAHO count values before and after correction for different countries.

In this chapter, we focus on 15 Latin American countries, viz. Argentina, Bolivia, Costa Rica, Colombia, Chile, Ecuador, El Salvador, Guatemala, French Guiana, Honduras, Mexico, Nicaragua, Paraguay, Panama and Peru. We collected weekly ILI counts from the official Pan American Health Organization (PAHO) website (http://ais.paho.org/phip/viz/ed_flu.asp) every day from January 2013 to August 2013. The estimates downloaded every day for each country contain data from January 2010 to the latest available week on the day of collection. This dataset is stored in a database we refer to as the Temporal Data Repository (TDR). The TDR is also timestamped so that for any given day, we can readily retrieve the ILI case counts that were downloaded on that day. This is important as historic data may be updated by PAHO even a number of weeks after the first update. For the purpose of experimental validation we used the data for the period Jan 2010 to December 2012 as the static training set. We considered Wednesday as the reference day within a week. For each Wednesday from Jan 2013 to July 2013, we used the latest available PAHO data in the TDR for that day and predicted 2 weeks from the last available week for which the PAHO data was available. These predictions are next evaluated against the final ILI case count as downloaded on September 1, 2013, and we report the performance of our algorithms in Section 2.6.
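The timestamped retrieval described above can be illustrated with a small in-memory stand-in. The class below is a hypothetical sketch, not the actual TDR implementation: the class name, method names, and the use of a sorted list with binary search are all assumptions.

```python
from bisect import bisect_right, insort
from datetime import date

class TemporalRepository:
    """Toy stand-in for a timestamped store of daily downloads."""

    def __init__(self):
        self._days = []       # sorted download dates
        self._snapshots = {}  # download date -> {week label: case count}

    def store(self, download_day, counts_by_week):
        """Record the full series as it was published on download_day."""
        if download_day not in self._snapshots:
            insort(self._days, download_day)
        self._snapshots[download_day] = dict(counts_by_week)

    def as_of(self, day):
        """Return the most recent snapshot downloaded on or before day."""
        idx = bisect_right(self._days, day)
        if idx == 0:
            raise KeyError("no snapshot available on or before %s" % day)
        return self._snapshots[self._days[idx - 1]]
```

Querying with a reference Wednesday returns the counts exactly as they were known on that day, which is what keeps a retrospective experiment from leaking later revisions into the training data.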

2.5.2 Evaluation criteria.

We evaluate the prediction accuracy of the different algorithms using a modified version of percentage relative error:

\[ A = \frac{4}{N_p} \sum_{t=t_s}^{t_e} \left( 1 - \frac{|P_t - \hat{P}_t|}{\max(P_t, \hat{P}_t, 10)} \right) \tag{2.15} \]

where $t_s$ and $t_e$ indicate the starting and the ending time points for which predictions were generated. $N_p$ indicates the number of time points over the same time period (i.e., $N_p = t_e - t_s + 1$). Note that the measure is scaled to have values in [0, 4] and the denominator is designed to not over-penalize small deviations from the true ILI case count (e.g., when the true case count is 0 and the predicted count is 1). It is to be noted that the accuracy metric so defined is non-convex and is in general multi-modal.
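A short sketch of this accuracy measure follows, reading it as the average over predicted weeks of a per-week score 1 - |P - P_hat| / max(P, P_hat, 10), scaled by 4. That reading is an assumption chosen to be consistent with the stated [0, 4] range; the function name `accuracy` is illustrative.

```python
def accuracy(actual, predicted):
    """Scaled accuracy in [0, 4] over paired actual and predicted case counts."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for p, p_hat in zip(actual, predicted):
        # The floor of 10 in the denominator keeps tiny absolute errors
        # on near-zero counts from being over-penalized.
        total += 1.0 - abs(p - p_hat) / max(p, p_hat, 10)
    return 4.0 * total / len(actual)
```

For the example mentioned in the text (true count 0, predicted count 1), the per-week score is 1 - 1/10, giving an accuracy of 3.6 rather than 0.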

2.5.3 Surrogate data sources.

Before describing our data sources in detail, we describe our overall methodology for organizing a flu-related dictionary (for tracking in multiple media such as news, tweets, and search queries).

Dictionary creation.

The keywords relating to ILI were organized from a seed set of words and expanded using a combination of time series correlation analysis and pseudo-query expansion. The seed set of keywords (e.g., gripe) was constructed in Spanish, Portuguese, and English using feedback from our in-house subject matter experts.

Pseudo-query expansion. Using the seed set, we crawled the top 20 web sites (according to Google Search) associated with each word in this set. We also crawled some expert sites such as the official CDC website and equivalent websites of the countries under consideration, detailing the causes, symptoms and treatment for influenza. Additionally we crawled a few hand-picked websites such as http://www.flufacts.com and http://health.yahoo.net/channel/flu_treatments. We filtered the words from these sites using standard language processing filtering techniques such as stopword removal and Porter stemming. The filtered set of keywords was then ranked according to the absolute frequency of occurrence. The top 500 words for Spanish and English were then selected. For example, words such as enfermedad and pandemia were obtained from this step.
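The filtering and ranking step can be sketched with the Python standard library alone. This sketch covers only stopword removal and frequency ranking; Porter stemming is deliberately omitted, and the tiny stopword list, the tokenizer regular expression, and the function name `rank_keywords` are all illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use full per-language lists.
STOPWORDS = {"the", "a", "of", "and", "is", "la", "el", "de", "y", "es"}

def rank_keywords(documents, top_n=500, stopwords=STOPWORDS):
    """Rank candidate keywords by absolute frequency after stopword removal."""
    counts = Counter()
    for doc in documents:
        # Runs of letters (including accented ones), lowercased; digits are dropped.
        tokens = re.findall(r"[^\W\d_]+", doc.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return [word for word, _ in counts.most_common(top_n)]
```

Applied to crawled page text, the head of the returned list gives the most frequent candidate terms for manual review.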

Time series correlation analysis. Next we used Google Correlate (now a part of Google Trends) to identify keywords most correlated with the ILI case count time series for each country. Once again these words were found to be a mix of both English and Spanish. As an added step in this process, we also compared time-shifted ILI counts: left-shifted to capture the words searched leading up to the actual flu infection and right-shifted to capture the words commonly searched during the tail of the infection. This entire exercise provided us with some interesting terms like ginger, which has been used as a natural herbal remedy in the eastern world. We also found popular flu medications such as Acemuk and Oseltamivir (the latter also sold under the trade name Tamiflu) as highly correlated search queries, particularly for Argentina.
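The time-shifted comparison can be illustrated with a plain Pearson correlation at a given lag. The sketch below is not how Google Correlate works internally; it only shows the idea of shifting one series against the other. The function name `shifted_correlation` and the sign convention (positive shift means the keyword leads the ILI counts) are assumptions.

```python
import numpy as np

def shifted_correlation(keyword_series, ili_series, shift):
    """Pearson correlation between keyword volume at week t and ILI counts at week t + shift."""
    k = np.asarray(keyword_series, dtype=float)
    p = np.asarray(ili_series, dtype=float)
    if shift > 0:
        # Keyword leads: pair keyword[t] with ili[t + shift].
        k, p = k[:-shift], p[shift:]
    elif shift < 0:
        # Keyword lags: pair keyword[t] with ili[t - |shift|].
        k, p = k[-shift:], p[:shift]
    return float(np.corrcoef(k, p)[0, 1])
```

Scanning a small range of shifts for each candidate keyword shows whether it tends to lead or trail the case counts.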


Final filtering. The set of terms obtained from query expansion and correlation analysis was then pruned by hand to obtain a vocabulary of 151 words. We then performed a final correlation check and retained a final set of 114 words.

Google Flu Trends (F):

Google Flu Trends (GFT, http://www.google.org/flutrends) is a tool based on [24] and provided by Google.org which gives weekly and up-to-date ILI case count estimates using search query volumes. Of the countries under consideration, GFT provides weekly estimates for only 6 of them, viz. Argentina, Bolivia, Chile, Mexico, Peru and Paraguay. These estimates are typically at a different scale than the ILI case counts provided by PAHO and therefore need to be scaled accordingly. We collected this data weekly on Mondays from Jan 2013 to Aug 2013. (The data downloaded on a particular day contains the entire time series from 2004 to the corresponding week.)

Google Search Trends (S):

Google Search Trends (http://www.google.com/trends) is another tool provided by Google. Using this tool we can download an estimate of search query volume as a percentage over its own temporal history, filtered geographically. We download the search query volume time series for the 114 keywords described earlier and convert the percentage measures to absolute values using a static dataset we downloaded in Oct 2012, when Google Search Trends used to provide absolute query volumes.

Twitter (T ):

Twitter data was collected from Datasift.com and geotagged using an in-house geocoder. We lemmatized the tweet contents and used language detection and POS tagging to help differentiate relevant from irrelevant uses of our keywords (e.g., the Spanish word gripe, meaning flu, is part of our flu keyword list as opposed to the undesired and unrelated English word ‘gripe’). The resulting analysis yields a weekly occurrence count of our dictionary in tweets.

HealthMap (H):

Similar to Twitter, we also collect flu-related news stories using HealthMap (http://healthmap.org), an online global disease alert system capturing outbreak data from over 50,000 electronic sources. Using this service we receive flu-related news as a daily feed which is similarly enriched and filtered to obtain a multivariate time series over lemmatized versions of the keywords. While Twitter is more suitable to ascertain general public response, the HealthMap data provides more detailed information but may capture the trends at a slower rate. Thus each of these sources offers utility in capturing different surrogate signals: Twitter offers leading but noisy indicators whereas HealthMap provides a slightly delayed but more reliable indicator.

OpenTable (O):

We also use data on trends of restaurant table reservations, initially studied in [41] as a potential early indicator for outbreak surveillance, as another surrogate for ILI detection. This novel data stream is based on the postulate that a higher than average number of restaurants with table availability in a region can serve as an indicator of an event of interest, such as an increase in flu cases. Table availability was monitored using OpenTable (http://www.opentable.com), an online restaurant reservation site with 28,000 restaurants at the time of this writing. Daily searches were performed starting from September 2012 for a table for two persons at lunch and dinner, between 12:30-3pm and between 6-10:30pm. Data was collected for Mexico by city (Cancun, Mexico City, Puebla, Monterrey, and Guadalajara) and for the entire country. The daily proportion (a proportion is used due to changes in the number of restaurants in the system) of restaurants with available tables was aggregated as a weekly time series.
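The aggregation from daily proportions to a weekly series can be sketched as below. The aggregation rule is not specified in the text, so the sketch assumes a simple mean of daily proportions per ISO week; the function name `weekly_availability` and the record layout are likewise hypothetical.

```python
from collections import defaultdict
from datetime import date

def weekly_availability(daily_records):
    """Aggregate daily availability proportions into a weekly series.

    daily_records: iterable of (day, restaurants_with_tables, restaurants_searched).
    Returns {(iso_year, iso_week): mean daily proportion}; the mean is an assumption.
    """
    by_week = defaultdict(list)
    for day, available, total in daily_records:
        if total > 0:  # skip days on which no restaurants were searched
            iso_year, iso_week, _ = day.isocalendar()
            by_week[(iso_year, iso_week)].append(available / total)
    return {week: sum(v) / len(v) for week, v in sorted(by_week.items())}
```

Using proportions rather than raw counts keeps the series comparable as the number of restaurants listed in the system changes.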

Weather (W):

All of the previously described data sources can be termed non-physical indicators, which can work suitably as indirect indicators of the state of the population with respect to flu by exposing different population characteristics. On the other hand, meteorological data can be considered a more direct and physical driver of influenza transmission [65]. It has been shown in [49, 51, 57] that absolute humidity can be directly used to predict the onset of influenza epidemics. Here, we collect several other meteorological indicators such as temperature and rainfall in addition to humidity from the Global Data Assimilation System (GDAS). We accessed this data in GRIB format from http://ladsweb.nascom.nasa.gov/ at a resolution of 1 degree lat/long. However, looking at all the lat/long points for a country can often lead to noisy data. As such, we filtered the downloaded data and used the indicators only around the surveillance centers. We also aggregate this data using weekly averages and thus obtain a resultant time series for each country. We collected this data weekly from Jan 2013 to August 2013.


2.6 Results

In this section, we present an exhaustive set of experiments evaluating our algorithms over 6 months of predictions from Jan 2013 to August 2013. The final and stable estimates of ILI case counts are considered to be the estimates downloaded from PAHO on Oct 1, 2013. All models considered here were used to forecast 2 weeks beyond the latest available PAHO ILI estimates. Key findings are presented in Table 2.1. We analyze some important observations from this table next.

Figure 2.4: Accuracy of different methods for each country.

Can we ‘beat’ Google Flu Trends with our custom dictionary? The key difference between Google Flu Trends (which can be considered as a base rate) and Google Search Trends (GST) is that the former uses a closed dictionary whereas we constructed the dictionary to use with GST. As can be seen in Table 2.1, for the majority of the common countries (countries for which data from both GST and GFT is present), regressors running on GST outperform those running on GFT (with Mexico and Peru being the exceptions). Thus we posit that the GST model devised here is a sufficiently close approximation to GFT, with the added advantages of having access to raw level data and being available for more countries than GFT (among the 15 countries we consider, only 6 are present in the GFT database).

Which is the optimal regression model? From Table 2.1, we can also analyze thethree different regressors proposed in Section 2.2.1 with respect to overall accuracy. Withrespect to each individual source, we can see that matrix factorization with nearest neighborembedding (MFN) performs the best in average over the countries. For some countries such

24

Table 2.1: Comparing forecasting accuracy of models using individual sources. Scores in this and other tables are normalized to [0,4] so that 4 is the most accurate.

Model  Sources  AR    BO    CL    CR    CO    EC    GF    GT    HN    MX    NI    PA    PY    PE    SV    All
MF     W        2.78  2.46  2.39  2.14  2.70  2.22  2.12  2.63  2.52  2.73  2.31  2.21  2.49  2.77  2.61  2.47
       H        2.81  2.31  2.22  1.92  2.43  2.04  2.11  2.57  2.33  2.48  2.39  2.15  2.18  2.47  2.33  2.32
       T        2.37  2.35  2.18  2.03  2.21  2.12  1.83  2.12  2.29  2.03  1.89  2.06  1.96  2.20  2.21  2.12
       F        2.34  2.11  2.29  N/A   N/A   N/A   N/A   N/A   N/A   2.71  N/A   N/A   2.31  2.24  N/A   2.33
       S        2.48  2.21  2.33  2.04  2.31  2.21  1.93  2.03  2.15  2.51  2.42  2.52  2.33  1.93  2.30  2.24

NN


MFN


Table 2.2: Comparison of prediction accuracy while combining all data sources and using MFN regression.

Fusion Level  AR    BO    CL    CR    CO    EC    GF    GT    HN    MX    NI    PA    PY    PE    SV    All
Model         3.12  3.22  3.03  2.88  2.98  3.13  2.87  2.99  2.87  3.00  2.77  2.82  2.81  2.92  2.87  2.95
Data          3.01  2.97  3.13  2.87  2.86  3.04  2.91  2.88  2.72  2.89  2.70  2.60  2.88  2.81  2.92  2.88

Table 2.3: Comparison of prediction accuracy while using model level fusion on MFN regressors and employing PAHO stabilization.

Correction Method  AR    BO    CL    CR    CO    EC    GF    GT    HN    MX    NI    PA    PY    PE    SV    All
None               3.12  3.22  3.03  2.88  2.98  3.13  2.87  2.99  2.87  3.00  2.77  2.82  2.81  2.92  2.87  2.95
Weeks Ahead        3.15  3.24  3.04  2.87  2.97  3.17  2.87  2.99  2.88  3.05  2.77  2.91  3.02  2.91  2.88  2.98
Numb samples       3.20  3.24  3.03  2.88  2.96  3.12  2.87  3.01  2.89  3.12  2.78  2.92  3.04  2.91  2.87  2.99
Combined           3.21  3.24  3.05  2.89  2.96  3.19  2.87  3.00  2.89  3.13  2.77  2.93  3.08  2.92  2.88  3.00

Table 2.4: Discovering importance of sources in model level fusion on MFN regressors by ablating one source at a time.

Sources  AR    BO    CL    CR    CO    EC    GF    GT    HN    MX    NI    PA    PY    PE    SV    All
All      3.21  3.24  3.05  2.89  2.96  3.19  2.87  3.00  2.89  3.13  2.77  2.93  3.08  2.92  2.88  3.00
w/o W    2.91  2.99  2.77  2.71  2.61  2.59  2.66  2.69  2.49  2.78  2.62  2.87  2.60  2.43  2.67  2.69
w/o H    3.04  2.85  2.89  2.56  2.81  2.77  2.61  2.75  2.75  2.82  2.57  2.75  2.51  2.87  2.71  2.75
w/o T    2.92  3.14  2.95  2.61  2.72  2.81  2.88  2.79  2.61  2.93  2.74  2.63  2.79  2.74  2.81  2.80
w/o S    3.19  3.11  2.92  2.64  2.69  2.70  2.89  2.88  2.78  3.07  2.75  2.91  2.80  2.71  2.86  2.86
w/o F    3.20  3.12  2.88  2.89  2.96  3.19  2.87  3.00  2.83  3.02  2.77  2.93  2.98  2.88  2.88  2.96

as Panama, when using only GST, MFN performs worse than vanilla MF; nevertheless, the average accuracy over all countries for any given data source is best when using MFN.


Table 2.5: ILI case count prediction accuracy for Mexico using OpenTable data as a single source, and by combining it with all other sources using model level fusion on uncorrected ILI case count data.

Method        Lunch  Dinner  Lunch & Dinner
MF            1.92   2.23    2.31
NN            1.99   1.83    2.11
MFN           2.11   2.31    2.44
Model Fusion  2.96   2.87    2.99

Which is the best strategy to combine multiple data sources? As shown in Table 2.2, overall, model level fusion works better than data level fusion. For 8 of the 15 countries, model level fusion works appreciably better than data level fusion, while the reverse trend is seen for 4 other countries. This showcases the importance of considering both kinds of fusion depending on the country of interest.

How effective are we at forecasting a moving PAHO target? As shown in Table 2.3, our corrected estimates using both the number of samples and the weeks ahead from the upload date are generally better. Although our correction strategy increases the overall accuracy by a score of only approximately 0.05 over all the countries, for some countries such as Mexico and Argentina (for which the data update is typically noisy) we obtain a substantial improvement in scores. This suggests that the correction strategy may be selectively applied when forecasting for certain countries.

How do physical vs social indicators fare against each other? From Table 2.1, we see that the data source with the best single accuracy happens to be the physical indicator source, i.e., weather data. However, Table 2.4 conveys a mixed story. Here we conduct an ablation test, wherein we remove one data source at a time from our model level MFN fusion framework and contrast accuracies. While removing the weather data degrades the accuracy score the most, removing the social indicators also degrades the score to varying degrees. Thus we posit that it is important to consider both the physical and social indicators to get a refined signal about the prevalent ILI incidence in the population.
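The ablation reading can be checked directly against the ‘All’ column of Table 2.4. The snippet below is only an illustrative calculation: the scores are copied from that table, while the function name and the dictionary layout are hypothetical and not part of the original experiments.

```python
# Overall ('All' column) accuracies copied from Table 2.4.
ALL_SOURCES = 3.00
WITHOUT = {"W": 2.69, "H": 2.75, "T": 2.80, "S": 2.86, "F": 2.96}


def ablation_drops(full_score, ablated_scores):
    """Return the accuracy lost when each source is removed, largest loss first."""
    drops = {src: round(full_score - score, 2) for src, score in ablated_scores.items()}
    return dict(sorted(drops.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical helper applied to the table values.
drops = ablation_drops(ALL_SOURCES, WITHOUT)
for source, loss in drops.items():
    print(f"w/o {source}: overall accuracy drops by {loss:.2f}")
```

Sorting by loss puts the weather source (W) first, and every other source also shows a positive loss, which is the pattern the paragraph describes.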

How relevant is restaurant reservation data to forecasting ILI? All the results thus far do not consider the OpenTable reservation data, since this source is available only for Mexico (among the countries studied here). We considered table availability for different time ranges and compared performance using our MFN model. As Table 2.5 demonstrates, we obtain the best performance when considering both lunch and dinner reservation data. Nevertheless, we have observed that including this source as part of the ensemble decreases the overall accuracy by 0.01 over the uncorrected ILI case count data. Thus it is our opinion that although the reservation data could exhibit some signals about prevalent ILI conditions, it likely is also a surrogate for non-health conditions (e.g., social unrest) which must be factored out to make the data source more useful.

Finally, we present Figure 2.4, where we compare for each country the accuracies of prediction from the best individual source with those from both data level and model level fusion of the different sources, and with the model level fusion of MF regressors applied on the corrected PAHO estimates rather than the raw ones. As can be seen, we progressively increase our accuracies, with the corrected PAHO estimates providing the final increase in predictive power to our model level fusion framework.

2.7 Discussion

In this chapter, we have aimed to generate short-term ILI forecasts over a range of Latin American countries using a gamut of options pertaining to data sources, fusion possibilities, and corrections to track a moving target. Our results demonstrate that there are significant opportunities to improve forecasting performance and that selective superiority among data sources can be leveraged. However, the presented method works best for near-horizon forecasts, with a significant drop in accuracy (see Chapter 3) for longer range forecasts. Thus we will next explore methods to increase the forecasting horizon while adhering to the principle of using multiple sources to generate such forecasts.

Chapter 3

Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction

In Chapter 2, we presented our initial efforts at forecasting influenza-like-illness (ILI) case counts. Seasonal influenza regularly affects the global population, and improvements in forecasting capability can directly translate into tangible measures of public health. The methods presented in that chapter successfully incorporated surrogate sources of information to produce real-time forecasts. However, the ‘reliable’ forecasting horizon was limited, both by model complexity and by the inability to maintain coherence between successive forecasts. In this chapter, we aim to relax such limitations and increase the forecasting horizon without increasing the computational complexity of the model.

Traditionally, epidemiologists aim to predict several characteristics about ILI from surveillance reports. Such characteristics of interest can be broadly classified into: (a) seasonal characteristics and (b) short-term characteristics. Seasonal characteristics are concerned with the overall shape of ILI counts for the particular season (see Part II for more details). Such methods are generally trained by assigning greater importance to statistics of the ILI curve such as peak value and the peak size. Conversely, short-term characteristics are concerned with accurately predicting the next few data points in absolute value rather than aiming for an overall fit for the season. In this chapter we are motivated by the second problem, i.e., the short-term forecasting challenge (but we also evaluate our methods w.r.t. seasonal characteristics). As discussed earlier, among the several challenges in ILI case count forecasting, one of the most important is that the surveillance reports are often delayed by a number of weeks, and therefore estimating the current on-ground scenario is a crucial problem. The case count estimates for a given week can be delayed anywhere from 1 week to 4 weeks, depending on the quality of the surveillance apparatus in a given country. Thus in this chapter we aim to provide reliable short-term forecasts from the last available surveillance data such that we can estimate the on-ground case counts and increase our forecasting horizon to at least 4 weeks.

In traditional epidemiology, several models such as SEIR and SIRS [3] have been proposed to model the temporal profile of infectious diseases. In modern computational epidemiology, more advanced methods have been used. One of the currently popular methods is to fit prediction models by matching observational data against a large library of simulated curves [5, 40, 58]. The curve simulations are generated by using different epidemiological parameters and assumptions. Sometimes network-based models are used to generate the curves [3]. Partially observed influenza counts for a particular year can then be matched to a library of curves to produce the best set of predictions [40]. Closely related to such curve matching methods are filtering-based methods that dynamically fit epidemic models onto observed data by letting the base epidemic parameters vary over time. Yang et al. [66] provide an excellent survey of filtering-based methods used for influenza forecasting and also present a comparative analysis of such methods.

Concurrently, there has been a lot of interest in using indicator data sources to predict seasonal influenza. In [24], Ginsberg et al. presented a method of estimating weekly influenza counts based on search query volumes (Google Flu Trends). Following this seminal work, researchers have investigated a wide variety of data sources such as Wikipedia [25], Twitter [12, 34, 46], and online restaurant reservations [41]. Weather has been found to be a significant indicator of seasonal influenza [49, 50, 51, 57]. In [12], different indicator sources are contrasted to understand their relative influence on short-term forecasting quality.

As rich and varied as the above approaches are, most approaches in the literature aim to use the same model to predict for the entire influenza season. This is not entirely desirable, as ‘in-season’ ILI characteristics may vary significantly from the ‘out-of-season’ characteristics (see Section 3.1.3). While researchers appreciate the need for dynamic models (e.g., [12]), constraints on temporal consistency are never explicitly imposed in current models. Thus in this chapter we aim to propose a general purpose time series prediction model allowing external factors from indicator sources to produce robust short-term forecasts in a consistent manner.

A popular model for analyzing time series data is the autoregressive exogenous (ARX) model [4, 36]. The ARX model has also been adopted by Paul et al. [46] to predict ILI case counts by using Twitter and Google Flu Trends (GFT) as the indicator sources. However, the underlying static autoregressive model may not be suitable for flu trend forecasting, as the activity of the disease and the human living environment evolve over time. Ohlsson et al. [42] have designed a more flexible ARX model for time-varying systems based on model segmentation. It allows the weight of the autoregressive model to be temporally piecewise constant. In this chapter, we further relax this requirement. We build separate models for each time point, but we constrain the models to share common characteristics. To capture such characteristics, we build a graph over models at different time points and embed the prior knowledge on model similarity in terms of the structure of the graph. Then we formulate the dynamic ARX model learning problem as a convex optimization problem, whose objective balances the autoregressive loss and the model similarity regularization induced by the graph structure. In this optimization problem, the variables have a natural block structure. Thus we apply a block coordinate descent method to solve this problem. We further extend our dynamic ARX modeling to the Poisson regression model for a better fit to count data [4, 14], as is relevant for ILI case count forecasting. We perform extensive experimental studies to evaluate the effectiveness of the proposed model and the corresponding learning algorithm. We use various real world datasets in the experiments, including different types of indicator data sources from 15 countries around the world. Our experimental studies illustrate that the dynamic modeling of the linear Poisson autoregressive model captures the underlying progression of disease counts well. Further, our results also show that our proposed method outperforms state-of-the-art ILI case count forecasting methods.

Our main contributions are summarized as follows:

• We propose a new dynamic ARX model for the task of ILI case count forecasting. This approach incorporates a linear Poisson regression model with non-negativity constraints into an ARX model, ideal for case count modeling.

• Prior domain knowledge can be encoded as structural relationships among different time points in a graph, which is embedded into the objective as a regularization term while still ensuring that the optimization problem is convex.

• We evaluate the proposed method using various real world datasets, including different types of indicator data sources from the USA and 14 Latin American countries.
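To make the idea of graph-regularized, per-time-point models concrete, the following is a minimal sketch and not the formulation of [62]. It assumes a squared loss with one observation per time point, a squared Euclidean penalty on each edge of the similarity graph, and that every time point has at least one neighbor; the function names, the argument layout, and the edge-list encoding are all hypothetical. The Poisson variant is not shown. Each block update solves the per-time-point subproblem exactly, so the (convex) objective never increases from one sweep to the next.

```python
import numpy as np


def dynamic_arx_bcd(X, y, edges, eta=1.0, n_sweeps=50):
    """Block coordinate descent for an assumed objective:
        sum_t (y[t] - W[t] @ X[t])**2 + eta * sum_{(s,t) in edges} ||W[s] - W[t]||**2
    X: (T, d) regressors (lagged counts and indicator features), y: (T,) targets,
    edges: undirected similarity graph given as unique (s, t) index pairs.
    """
    T, d = X.shape
    W = np.zeros((T, d))
    neighbors = [[] for _ in range(T)]
    for s, t in edges:
        neighbors[s].append(t)
        neighbors[t].append(s)
    for _ in range(n_sweeps):
        for t in range(T):
            nb = np.array(neighbors[t], dtype=int)
            # Setting the gradient w.r.t. W[t] to zero gives a d x d linear system.
            A = np.outer(X[t], X[t]) + eta * len(nb) * np.eye(d)
            b = X[t] * y[t] + eta * W[nb].sum(axis=0)
            W[t] = np.linalg.solve(A, b)
    return W


def objective(W, X, y, edges, eta=1.0):
    """Value of the assumed objective, useful for monitoring convergence."""
    fit = np.sum((y - np.einsum("td,td->t", W, X)) ** 2)
    smooth = sum(np.sum((W[s] - W[t]) ** 2) for s, t in edges)
    return fit + eta * smooth
```

Larger values of `eta` pull the per-time-point models toward each other, while `eta` close to zero lets each model fit its own observation.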

3.1 Summary

We present a brief summary of our findings here. For a more detailed treatment we ask the reader to refer to [62]. We developed two dynamic generalized linear models, viz. the Dynamic Autoregressive model (DARX) and the Dynamic Poisson Autoregressive model (DPARX), and compared their forecasting performance for various sources against a number of state-of-the-art algorithms. We highlight some of our interesting findings here.

3.1.1 Model Similarity

First, we conduct experiments to investigate the model similarities posited by our proposed algorithm. In this experiment, we calculate the distance between all pairs of models learned by DPARX during a period of time on the AR dataset. We present the distance matrix associated with the ground truth ILI case count series in Figure 3.1. We see that the distance matrix has a strong seasonal pattern, which is consistent with the pattern of the ILI case count series. At the beginning of each flu season, the model is significantly different from the rest of the models at other time points. This result demonstrates that ILI case counts have a strong periodic pattern and that the dynamic modeling approach successfully captures this pattern. It also validates the necessity of conducting this level of modeling for flu forecasting.
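The distance matrix discussed here is the matrix of pairwise Euclidean distances between learned models. A minimal sketch of that computation follows, assuming the learned models are stacked as rows of an array `W` (a hypothetical layout; the function name is also illustrative).

```python
import numpy as np


def model_distance_matrix(W):
    """Pairwise Euclidean distances between per-time-point models.

    W: (T, d) array whose row t holds the coefficients of the model at time t
    (an assumed storage layout). Returns a symmetric (T, T) matrix.
    """
    diff = W[:, None, :] - W[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

Entry (s, t) is small when the models at times s and t are similar, so seasonal structure shows up as a repeating pattern of low and high distances.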

In the next experiment, we run our proposed DPARX method on the US dataset under three different model similarity graphs: the fully connected graph, the 3-nearest neighbor graph, and the seasonal 3-nearest neighbor graph. We then calculate the three corresponding distance matrices of the learned models, which are shown in Figure 3.2. The patterns in the three distance matrices are very similar. However, the distances between the pairs of models are smaller for the fully connected similarity graph. Without strong prior knowledge, the fully connected similarity graph is preferred, as during different seasons the target signal may still be very different. In the following experiments, we will use the fully connected similarity graph for the regularization term.
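The three similarity graphs can be pictured as edge lists over time indices. The exact constructions are not given in this summary, so the sketch below rests on loudly stated assumptions: "k-nearest neighbor" is read as linking each time point to the k time points immediately before it, and "seasonal" is read as additionally linking points whose lag is within k of a multiple of an assumed season length `period`. All function names and parameters are hypothetical.

```python
def fully_connected_edges(n_points):
    """Every pair of time points is linked."""
    return [(s, t) for s in range(n_points) for t in range(s + 1, n_points)]


def knn_edges(n_points, k=3):
    """Assumed reading: link each time point to the k points immediately before it."""
    return [(s, t) for t in range(n_points) for s in range(max(0, t - k), t)]


def seasonal_knn_edges(n_points, k=3, period=52):
    """Assumed reading: k-nearest links plus links between points whose lag
    is within k of a positive multiple of the season length."""
    edges = set(knn_edges(n_points, k))
    for s in range(n_points):
        for t in range(s + 1, n_points):
            lag = t - s
            offset = lag % period
            if lag >= period - k and min(offset, period - offset) <= k:
                edges.add((s, t))
    return sorted(edges)
```

Under these readings the fully connected graph imposes the weakest prior about which models should be alike, which matches the preference stated in the text when no strong prior knowledge is available.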

3.1.2 Forecasting Results

In the ILI case count forecast experiments, we use the data record from all 15 countries. All the case count data are associated with several data sources similar to the ones in Section 2. We start with 50 given time points and test the prediction result on the remaining time points. We run all the competing methods in an online manner: the models are re-trained and updated after the arrival of values at every additional time point. For the DARX and DPARX models, we use the same parameter settings: p = 1, b = 15 for GFT and Weather data sources, as these data sources have relatively small dimension; p = 1, b = 4 for GST and HealthMap data sources, as these data sources have relatively high dimension. The ARX model does not provide numerically stable results for high dimensional data. Thus we present its results only on GFT and Weather data sources with p = 1, b = 15. Likewise, the training of the SARX model is very time consuming, especially for high dimensional data. We thus only present its results using the GFT data source with the same setting (p = 1 and b = 15). The remaining parameter in our model is the regularization parameter that controls the variation of the model. We fix it as η = 1 for the DARX model and η = 5 for the DPARX model during all experiments. For the MFN algorithm, we follow the same procedure and parameter setting as in [12].
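The online protocol described above (start from 50 points, then re-train after each new observation) can be sketched as a walk-forward loop. This is an illustration only: the loop and its `fit`/`predict` callback interface are hypothetical, and a plain least-squares AR(1) model stands in for the actual forecasting models.

```python
import numpy as np


def walk_forward_forecasts(y, fit, predict, n_init=50):
    """Re-train on all data seen so far, then forecast the next point (1-step ahead)."""
    preds = []
    for t in range(n_init, len(y)):
        model = fit(y[:t])               # uses observations 0..t-1 only
        preds.append(predict(model, y[:t]))
    return np.array(preds)               # aligned with y[n_init:]


def fit_ar1(history):
    """Stand-in model (an assumption): least-squares AR(1) with intercept."""
    x, target = history[:-1], history[1:]
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef


def predict_ar1(coef, history):
    """Forecast the next value from the last observed value."""
    return coef[0] + coef[1] * history[-1]
```

A call such as `walk_forward_forecasts(series, fit_ar1, predict_ar1)` returns one forecast per held-out time point, each produced without access to the value being predicted.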

We present the results of short-term ILI case count forecasting for different countries with both 1-step forecasts and multi-step forecasts with step sizes of 2, 3, and 4. The prediction accuracies on data sources GFT, Weather, GST, and HealthMap are presented in Tables 3.1, 3.2, 3.3, and 3.4, respectively.

The experiments show that our models yield better prediction accuracy, especially for multi-step forecasting. Multi-step forecasting is a much harder task than 1-step forecasting. The dynamic modeling of ARX provides more flexibility in handling the uncertainty associated with the target signal.

Table 3.1: Prediction accuracies for competing algorithms with different forecast steps over different countries using the GFT input source. GFT data is not available for other countries.

Step  Method  AR    BO    CL    MX    PE    PY    US
1     ARX     2.85  2.63  3.18  2.61  2.51  2.82  3.71
      MFN     2.33  2.41  2.34  2.69  2.48  2.54  3.73
      SARX    3.02  2.42  3.11  2.90  2.81  2.69  3.67
      DARX    3.05  2.74  3.12  2.78  2.50  2.65  3.71
      DPARX   3.13  2.82  3.18  2.97  2.64  2.81  3.72
2     ARX     2.38  2.22  2.83  1.88  1.90  2.57  3.47
      MFN     2.12  2.00  2.13  2.33  2.21  2.19  3.63
      SARX    2.75  2.03  2.76  2.64  2.43  2.43  3.64
      DARX    2.94  2.68  3.02  2.58  2.38  2.58  3.60
      DPARX   2.86  2.70  2.89  2.64  2.52  2.65  3.61
3     ARX     2.11  1.86  2.61  1.28  1.44  2.31  3.19
      MFN     1.99  1.87  2.11  2.14  2.10  2.09  3.33
      SARX    2.33  1.61  2.46  2.42  2.16  2.23  3.40
      DARX    2.66  2.36  2.77  2.37  2.26  2.46  3.41
      DPARX   2.58  2.53  2.56  2.45  2.37  2.52  3.42
4     ARX     1.84  1.61  2.39  0.88  1.12  2.22  2.92
      MFN     1.85  1.83  2.00  2.05  2.01  1.94  3.15
      SARX    2.12  1.41  2.30  2.22  2.02  2.09  3.30
      DARX    2.34  2.21  2.52  1.98  2.19  2.22  3.18
      DPARX   2.29  2.35  2.32  2.26  2.29  2.40  3.20

3.1.3 Seasonal Analysis

In this chapter, we have not trained the models to predict the seasonal metrics. However, we can construct ILI prediction curves for each ‘step-ahead’, i.e., a 1-step ILI prediction curve, a 2-step ILI prediction curve, and so on. From these prediction curves we can then calculate the seasonal characteristics and compare them against those calculated from the observed PAHO (or CDC) ILI counts.

We compare the predicted and observed seasonal characteristics for the last ILI year in our set for each country. Our experimental results [62] show that the proposed algorithms work well for a number of countries. In general, DPARX performs better in terms of the overall prediction characteristics. This is consistent with our results for near-term forecasts. For seasonal characteristics, Weather and GFT seem to be the most important sources for prediction. We also present the predicted and real curves for Mexico for the ILI season 2013 in Figure 3.3, based on 1-step ahead predictions. Except for GST and HealthMap data with some of the state-of-the-art methods, all the curves match up closely to the observed ILI curve.


3.2 Discussion

In this chapter, we presented a practical short-term ILI case count forecasting method using multiple digital data sources. One of the main contributions of the proposed model is that the underlying autoregressive model is allowed to change over time. In order to control the variation of the model, we built a model similarity graph to indicate the relationship between each pair of models at two different time points and embedded the prior knowledge as the structure of the graph. The experiments demonstrate that our proposed algorithm provides consistently better forecasting results than state-of-the-art time series models used for short-term ILI case count forecasting. We also observed that the dynamic model successfully captures the seasonal pattern of flu activity. Finally, while these techniques were applied to the relatively specialized field of ILI case count forecasting, the methods presented are generic enough that they may be adapted to other similar count prediction problems.


Figure 3.1: The distance matrix obtained from our learned DPARX model (bottom figure), associated with the ground truth ILI case count series (top figure) on the AR dataset. We can observe the strong seasonality automatically inferred in the matrix. Each element in the matrix is the Euclidean distance between a pair of the learned models at two corresponding time points after training. For the top figure, the x axis is the index of the weeks; the y axis is the number of ILI cases. For the bottom figure, both x and y axes are the index of the time points. Note that the starting time point (index 0) for the distance matrix is week 15 of the ILI case count series.


Figure 3.2: Model distance matrices for the US dataset. The three matrices are derived from the fully connected similarity graph, the 3-nearest neighbor similarity graph, and the seasonal 3-nearest neighbor similarity graph, from left to right respectively.

[Figure 3.3 plot. Title: MX for 2013. x axis: Weeks; y axis: ILI Count. Vertical markers: Season Start, Season End. Legend: Actual, HM-DARX, HM-DPARX, GST-DARX, GST-DPARX, Weather-DARX, Weather-DPARX, GFT-SARX, GFT-DARX, GFT-DPARX.]

Figure 3.3: Comparison of seasonal characteristics for Mexico using different algorithms for one-step ahead prediction. Blue vertical dashed lines indicate the actual start and end of the season. ILI season considered: 2013.


Table 3.2: Prediction accuracies for competing algorithms with different forecast steps over different countries using the weather data source.

Step  Method  AR    BO    CL    CO    CR    EC    GT    HN    MX    NI    PA    PE    PY    SV    US
1     ARX     2.94  2.51  3.10  2.90  2.21  2.81  2.83  2.96  2.25  2.18  2.78  2.51  2.84  2.83  3.51
      MFN     2.99  3.01  2.88  2.53  2.78  2.81  2.77  2.83  2.61  2.70  2.56  2.82  2.66  2.79  3.81
      DARX    3.09  2.84  3.17  2.84  2.57  2.94  2.83  2.89  2.91  2.77  2.72  2.67  2.79  2.72  3.71
      DPARX   2.98  2.84  3.07  3.01  2.70  2.97  2.87  2.93  2.84  2.86  2.82  2.78  2.86  2.77  3.72
2     ARX     2.56  2.05  2.63  2.71  1.61  2.56  2.63  2.76  1.15  1.36  2.56  2.05  2.62  2.64  3.21
      MFN     2.86  2.89  2.81  2.49  2.71  2.67  2.72  2.41  2.55  2.31  2.50  2.59  2.71  2.30  3.75
      DARX    2.98  2.69  3.00  2.69  2.63  2.79  2.72  2.81  2.66  2.28  2.55  2.49  2.68  2.66  3.60
      DPARX   2.67  2.73  2.86  2.83  2.66  2.79  2.78  2.78  2.62  2.49  2.71  2.63  2.64  2.68  3.61
3     ARX     2.25  1.65  2.21  2.50  1.06  2.30  2.39  2.59  0.60  0.94  2.42  1.72  2.39  2.46  2.92
      MFN     2.49  2.38  2.41  2.33  2.45  2.31  2.32  2.10  2.21  2.11  2.19  2.22  2.40  2.08  3.64
      DARX    2.68  2.32  2.68  2.57  2.52  2.72  2.50  2.65  2.47  2.00  2.52  2.32  2.54  2.53  3.41
      DPARX   2.33  2.44  2.63  2.70  2.58  2.66  2.59  2.61  2.36  2.31  2.75  2.44  2.51  2.55  3.42
4     ARX     1.98  1.37  1.73  2.31  0.72  2.07  2.22  2.41  0.39  0.83  2.21  1.46  2.21  2.30  2.56
      MFN     2.10  2.13  2.15  2.04  2.25  2.11  2.22  1.94  1.99  1.87  2.01  1.86  2.10  1.77  3.54
      DARX    2.42  2.12  2.39  2.49  2.34  2.52  2.42  2.51  2.17  1.74  2.38  2.27  2.30  2.42  3.18
      DPARX   2.10  2.23  2.32  2.64  2.38  2.52  2.55  2.45  2.06  2.15  2.72  2.38  2.27  2.53  3.20

Table 3.3: Prediction accuracies for competing algorithms with different forecast steps over different countries using the GST data source.

Step  Method  AR    BO    CL    CO    CR    EC    GT    HN    MX    NI    PA    PE    PY    SV
1     MFN     2.61  2.44  2.55  2.22  2.61  2.52  2.31  2.62  2.48  2.61  2.31  2.23  2.53  2.13
      DARX    2.99  2.65  3.09  2.74  2.41  2.86  2.72  2.83  2.82  2.84  2.59  2.56  2.75  2.63
      DPARX   3.07  2.74  3.15  2.85  2.72  2.80  2.51  2.80  2.96  2.77  2.59  2.66  2.82  2.61
2     MFN     2.50  2.33  2.31  2.10  2.44  2.29  2.11  2.43  2.37  2.39  2.20  2.01  2.27  2.00
      DARX    2.83  2.54  2.94  2.57  2.53  2.69  2.58  2.72  2.59  2.40  2.35  2.40  2.54  2.51
      DPARX   2.78  2.59  2.86  2.67  2.63  2.67  2.35  2.71  2.60  2.48  2.43  2.53  2.57  2.59
3     MFN     2.33  2.10  2.16  1.99  2.21  2.03  1.99  2.14  2.20  2.14  2.02  1.91  2.13  1.92
      DARX    2.51  2.07  2.69  2.45  2.36  2.47  2.41  2.54  2.34  2.06  2.48  2.10  2.49  2.44
      DPARX   2.46  2.41  2.53  2.56  2.48  2.51  2.26  2.58  2.38  2.30  2.41  2.34  2.49  2.51
4     MFN     1.99  2.00  2.01  1.82  1.97  1.88  1.92  1.93  1.81  1.77  1.79  1.70  1.82  1.71
      DARX    2.16  1.91  2.36  2.24  2.20  2.17  2.28  2.40  1.80  1.86  2.40  2.06  2.23  2.36
      DPARX   2.17  2.21  2.29  2.46  2.35  2.33  2.14  2.46  2.10  2.13  2.33  2.21  2.30  2.44

Table 3.4: Prediction accuracies for competing algorithms with different forecast steps over different countries using the HealthMap data source.

Step  Method  AR    BO    CL    CO    CR    EC    GT    HN    MX    NI    PA    PE    PY    SV    US
1     MFN     2.81  3.13  2.63  2.58  2.91  2.77  2.63  2.73  2.50  2.61  2.54  2.69  2.51  2.61  3.78
      DARX    3.00  2.69  3.11  2.79  2.44  2.89  2.75  2.91  2.85  2.86  2.60  2.65  2.75  2.64  3.71
      DPARX   3.07  2.74  3.15  2.84  2.69  2.83  2.58  2.82  2.95  2.79  2.59  2.70  2.83  2.62  3.72
2     MFN     2.71  2.91  2.30  2.21  2.77  2.49  2.40  2.38  2.44  2.36  2.15  2.33  2.22  2.33  3.64
      DARX    2.86  2.60  3.01  2.62  2.54  2.74  2.64  2.77  2.66  2.47  2.37  2.47  2.53  2.58  3.60
      DPARX   2.78  2.60  2.88  2.67  2.62  2.71  2.44  2.72  2.60  2.50  2.45  2.58  2.58  2.60  3.61
3     MFN     2.44  2.30  2.42  2.07  2.31  2.14  2.28  2.01  2.19  2.12  1.99  2.00  1.97  1.95  3.35
      DARX    2.58  2.18  2.78  2.49  2.35  2.63  2.51  2.62  2.48  2.15  2.49  2.33  2.48  2.51  3.41
      DPARX   2.46  2.42  2.55  2.56  2.47  2.58  2.36  2.59  2.38  2.31  2.45  2.37  2.49  2.50  3.42
4     MFN     1.93  1.99  2.20  1.88  2.00  1.95  2.15  1.95  1.89  1.85  1.72  1.78  1.91  1.81  3.13
      DARX    2.28  2.02  2.46  2.39  2.19  2.37  2.39  2.45  2.22  1.97  2.45  2.26  2.20  2.42  3.18
      DPARX   2.17  2.21  2.30  2.44  2.34  2.42  2.25  2.47  2.12  2.14  2.37  2.25  2.30  2.47  3.21

Part II

Long-term Forecasting using Surrogates


We discussed several facets of short-term forecasting, especially with respect to ILI, in Part I. Alongside short-term forecasting, which provides real-time insights about the current on-ground scenario, long-term characteristics of targets are often of prime interest. Considering the example of epidemic diseases, surveillance agencies are interested in identifying seasonal characteristics such as the following:

1. Start week: Within a particular ILI year (which may not be a calendar year, e.g., in the USA, the ILI year spans from Epi Week 40 to Epi Week 39 [11]), the ‘start week’ is the week from which ILI is said to be in season. We define the start week for an ILI year to be the first week where the ILI count for 3 consecutive past weeks (including itself) is greater than a pre-defined threshold.

2. Peak week: Within a particular ILI year, the peak week is the week for which the ILI count is highest for that ILI year.

3. Peak Size: Peak size is the ILI count observed on the peak week.

4. End week: Within a particular ILI year, the end week is the first week after the peak week such that the ILI count for 3 consecutive past weeks (including itself) is lower than a pre-defined threshold. The end week signifies the end of the ILI season and is thus of interest to epidemiologists.

5. Season Size: Season size is used as a proxy for the size of the epidemic. It is calculated by summing up the total ILI count from the start week to the end week.
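The five definitions above translate fairly directly into code. The sketch below is one possible reading, with assumptions stated plainly: weeks are 0-based positions within a single ILI year, the same threshold is used for start and end, the season size includes both the start and end weeks, and the function name and return layout are hypothetical.

```python
import numpy as np


def seasonal_characteristics(counts, threshold, run=3):
    """Compute start week, peak week, peak size, end week and season size
    for one ILI year of weekly counts (0-based week positions)."""
    counts = np.asarray(counts, dtype=float)

    def first_run(condition, begin):
        # First week t >= begin whose last `run` weeks (including t) all satisfy the condition.
        for t in range(max(begin, run - 1), len(counts)):
            if condition[t - run + 1 : t + 1].all():
                return t
        return None

    peak_week = int(np.argmax(counts))
    start_week = first_run(counts > threshold, 0)
    end_week = first_run(counts < threshold, peak_week + 1)
    season_size = None
    if start_week is not None and end_week is not None:
        # Assumed inclusive of both the start and the end week.
        season_size = float(counts[start_week : end_week + 1].sum())
    return {
        "start_week": start_week,
        "peak_week": peak_week,
        "peak_size": float(counts[peak_week]),
        "end_week": end_week,
        "season_size": season_size,
    }
```

When the counts never stay above (or below) the threshold for `run` consecutive weeks, the corresponding week is returned as `None` and the season size is left undefined.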

In traditional epidemiology, several models such as SEIR and SIRS [3] have been proposed to model the temporal profile of infectious diseases. In modern computational epidemiology, more advanced methods have been used. One of the currently popular methods is to fit prediction models by matching observational data against a large library of simulated curves [5, 40, 58]. The curve simulations are generated by using different epidemiological parameters and assumptions. Sometimes network-based models are used to generate the curves [3]. Partially observed influenza counts for a particular year can then be matched to a library of curves to produce the best set of predictions [40]. Closely related to such curve matching methods are filtering-based methods that dynamically fit epidemic models onto observed data by letting the base epidemic parameters vary over time. Yang et al. [66] provide an excellent survey of filtering-based methods used for influenza forecasting and also present a comparative analysis of such methods. We present our efforts at disease forecasting using curve matching methods in Chapter 4 and subsequently present our data assimilation based models, aimed at easier integration of surrogates, in Chapter 5.

Chapter 4

Curve-matching from library of curves

One of the simplest and more intuitive strategies for long-term forecasting is based on curve matching from a library of curves [3, 5]. Typically, a library of curves can be generated using various parameter choices of compartmental models such as SEIR.
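As an illustration of how such a library might be generated, the sketch below simulates a standard deterministic SEIR model with a simple Euler scheme and sweeps over a grid of parameters. It is not the simulator used in this work: the function names, the default population, the initial number of infectious individuals, the per-week rate units, and the choice to report weekly new infectious cases are all assumptions made for the example.

```python
import numpy as np


def seir_curve(beta, sigma, gamma, n_weeks=52, population=1e6, i0=10.0, steps_per_week=7):
    """Weekly incidence (new infectious cases) from a deterministic SEIR model.

    beta, sigma, gamma are transmission, incubation and recovery rates per week
    (assumed units); integration uses explicit Euler steps of 1/steps_per_week.
    """
    dt = 1.0 / steps_per_week
    s, e, i = population - i0, 0.0, i0
    weekly = []
    for _ in range(n_weeks):
        new_cases = 0.0
        for _ in range(steps_per_week):
            infections = beta * s * i / population * dt   # S -> E
            onsets = sigma * e * dt                       # E -> I
            recoveries = gamma * i * dt                   # I -> R
            s -= infections
            e += infections - onsets
            i += onsets - recoveries
            new_cases += onsets
        weekly.append(new_cases)
    return np.array(weekly)


def build_library(betas, sigmas, gammas, **kwargs):
    """One simulated curve per parameter combination, keyed by (beta, sigma, gamma)."""
    return {
        (b, s, g): seir_curve(b, s, g, **kwargs)
        for b in betas
        for s in sigmas
        for g in gammas
    }
```

For example, `build_library([1.5, 2.0], [1.0], [1.0])` yields two candidate seasonal curves that differ only in their transmission rate.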

Curves can also be generated from agent based models that are informed through a combination of diverse sources such as census and road-network data. Curves from such a library can then be matched against specific epidemic surveillance data to predict the seasonal curves. Seasonal characteristics can then be identified from these detected curves using the definitions outlined earlier. Some of the considerations in this process can be identified as follows:

Figure 4.1: Filtering library of curves based on season size and season shape.

1. Appending Short-term forecasts to surveillance data: Surveillance reports are typically delayed. As presented in Part I, we can use surrogates to generate robust predictions for the short term. In general, these predictions are robust, i.e., stable w.r.t. surveillance updates. Short-term forecasts can also provide measures of uncertainty about current surveillance. We append these predictions to the last-available surveillance data so that the partial time series to match against the curves is longer and hence more accurate. This is especially useful during the initial part of the season, when only a few data points are available from the surveillance reports to match against the library of curves.

2. Filtering Library of Curves: Typically, the library may contain a wide variety of curves corresponding to various kinds of diseases, identified through the various epidemiological parameters used to simulate the curves. Many of these curves can be unsuitable for matching against the disease of interest (such as ILI). Moreover, admitting these curves for matching may lead to increased false detections. As such, we filter the curves using historical trends of the disease of interest by the following factors:

• Filter curves by average season size of the disease.

• Filter curves by average peak-to-season size ratio. Effectively, this strategy filters according to the shape of the epidemic curve.

Figure 4.1 shows examples of curves that were filtered out from such a library while matching against ILI data for Latin America.
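The two considerations above can be sketched as a filter followed by a nearest-curve match. Everything specific here is an assumption made for illustration: the relative tolerance `tol`, the use of a curve's total count as its season size, the Euclidean distance on the observed prefix as the matching criterion, and the function names.

```python
import numpy as np


def filter_library(curves, mean_season_size, mean_peak_ratio, tol=0.5):
    """Keep curves whose season size and peak-to-season-size ratio lie within a
    relative tolerance of the historical averages for the disease of interest."""
    kept = []
    for curve in curves:
        size = curve.sum()                 # total count used as season size (assumption)
        if size <= 0:
            continue
        ratio = curve.max() / size
        size_ok = abs(size - mean_season_size) <= tol * mean_season_size
        shape_ok = abs(ratio - mean_peak_ratio) <= tol * mean_peak_ratio
        if size_ok and shape_ok:
            kept.append(curve)
    return kept


def best_match(partial_series, curves):
    """Return the library curve closest (Euclidean distance) to the partial series,
    i.e., surveillance data with short-term forecasts appended."""
    partial_series = np.asarray(partial_series, dtype=float)
    n = len(partial_series)
    distances = [np.linalg.norm(curve[:n] - partial_series) for curve in curves]
    return curves[int(np.argmin(distances))]
```

The matched curve then serves as the forecast for the remainder of the season, from which the seasonal characteristics can be read off.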

Performance Highlights We used the aforementioned curve-matching strategy to predict ILI seasonal characteristics for 15 Latin American countries. Figure 4.2 gives an example of such forecasts and Figure 4.3 outlines the results for several countries against the aforementioned metrics as reported by IARPA. As can be seen, the framework works well for a few metrics and for a few countries such as Ecuador for total RSV counts. However, our performance was poor for several other metrics. Furthermore, we found curve matching models to be inconsistent with respect to the week of the season when the forecasts were generated. Also, this method admits the use of surrogates only for increasing the length of the time series to match against, and fails to use them for determining more interesting facets such as the disease transmission rate.


Figure 4.2: Example of seasonal forecasts for ILI using curve-matching methods.

Figure 4.3: Performance measures for ILI seasonal characteristics using curve-matching

Chapter 5

Data Assimilation methods for long-term forecasting

Chapter 4 outlined our current efforts at seasonal forecasting using curve-matching models. As identified in that chapter, such methods make sub-optimal use of surrogate information in determining seasonal characteristics. Motivated by the efforts of Shaman et al. [51], we developed data assimilation models where surrogates are used to force the disease parameters and seasonal characteristics are found by optimizing over the most probable seasonal curves. We present our efforts in some detail in the following sections, first describing some of the relevant data assimilation models in Section 5.1 and then presenting our disease forecasting models using data assimilation methods in Sections 5.2 and 5.3.

5.1 Data Assimilation

Originally proposed in the 1960s [26], the Kalman filter (KF) has rapidly gained a reputation in a myriad of applications [17] that feature estimation and forecasting. The original KF has since evolved and given rise to an entire class of dynamic estimation/forecasting algorithms that recursively estimate and forecast: they optimize the estimates by data assimilation using noisy measurements (observations), and forecast using a presumed process model.

We first consider linear process models. Such systems can be expressed as a pair of linear stochastic process and measurement equations as shown below:

$$
\begin{cases}
x_{k+1} = A x_k + B u_k + w_k, & w_k \sim \mathcal{N}(0, Q) \\
z_k = H x_k + v_k, & v_k \sim \mathcal{N}(0, R)
\end{cases}
\tag{5.1}
$$

where $x \in \mathbb{R}^n$ is the state vector, $z \in \mathbb{R}^m$ is the measurement vector, $A \in \mathbb{R}^{n \times n}$ is called the process matrix, $B$ is the matrix that relates the optional control input $u$ to the state, and $H \in \mathbb{R}^{m \times n}$ is the measurement matrix. The process noise $w_k$ and measurement noise $v_k$ are assumed to be mutually independent random variables with zero mean, normally distributed with noise covariances $Q$ and $R$, respectively.

The classic Kalman Filter was developed to estimate the hidden states as well as forecast the observed targets of such linear processes, and the relevant equations can be outlined in the following two groups:

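Estimation

$$
\begin{aligned}
K_k &= P^f_k H^T \left( H P^f_k H^T + R \right)^{-1} \\
x^a_k &= x^f_k + K_k \left( z_k - H x^f_k \right) \\
P^a_k &= \left( I - K_k H \right) P^f_k
\end{aligned}
\tag{5.2}
$$

Forecast

$$
\begin{cases}
x^f_{k+1} = A x^a_k + B u_k \\
P^f_{k+1} = A P^a_k A^T + Q
\end{cases}
\tag{5.3}
$$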

where $x^a_k$ is the optimal state estimate given the measurement vector $z_k \in \mathbb{R}^m$, $P^a_k$ is the analysis state error covariance, and $R$ is the measurement noise covariance. Equation 5.2 assimilates measurements into the estimate via the Kalman gain matrix $K$, which weighs the impact of the measurement against that of the prediction. A larger $R$ or smaller $Q$ increases the weight of the prediction, while a smaller $R$ or larger $Q$ increases the weight of the measurement. $x^f_{k+1} \in \mathbb{R}^n$ is the forecast state, $P^f_{k+1} \in \mathbb{R}^{n \times n}$ is the forecast state error covariance, and $Q$ is the process noise covariance. Equation 5.3 forecasts $x$ and $P$ for time step $k + 1$.
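The estimation and forecast steps can be made concrete with a small sketch. To keep it self-contained it assumes the scalar special case ($n = m = 1$), so every matrix in Equations 5.2 and 5.3 reduces to a number. The function names are illustrative only.

```python
def kf_estimate(x_f, P_f, z, H, R):
    """Estimation step (Equation 5.2) for a scalar state and measurement."""
    K = P_f * H / (H * P_f * H + R)   # Kalman gain
    x_a = x_f + K * (z - H * x_f)     # analysis state
    P_a = (1.0 - K * H) * P_f         # analysis error variance
    return x_a, P_a, K


def kf_forecast(x_a, P_a, A, Q, B=0.0, u=0.0):
    """Forecast step (Equation 5.3) for a scalar state."""
    x_f = A * x_a + B * u
    P_f = A * P_a * A + Q
    return x_f, P_f
```

With equal forecast and measurement variances the gain is 0.5 and the analysis lies midway between forecast and measurement. Raising `R` lowers the gain, which matches the weighting argument above.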

In reality, the process and measurement models are often nonlinear. A nonlinear system can be modeled by nonlinear stochastic equations:

$$
\begin{cases}
x_{k+1} = a(x_k, u_k) + w_k, & w_k \sim \mathcal{N}(0, Q) \\
z_k = h(x_k) + v_k, & v_k \sim \mathcal{N}(0, R)
\end{cases}
\tag{5.4}
$$

One of the more popular solutions to such non-linear systems is the extended Kalman filter (EKF), which is essentially a Kalman filter modified to linearize the estimation about the current mean and covariance [63]. EKF equations are similar to KF equations, except that EKF needs to compute Jacobian matrices at each time step:

$$
A = \left. \frac{\partial a(x)}{\partial x} \right|_{x}, \qquad
H = \left. \frac{\partial h(x)}{\partial x} \right|_{x}.
\tag{5.5}
$$

EKF came to prominence in aerospace and robotics applications where the state space is small; however, in more complex systems with high-dimensional state spaces, such as those in weather and disease prediction, it falls short due to the intractable computational burden associated with Jacobian matrices, as well as with maintaining and evolving a separate covariance matrix at each time step.
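To give a feel for the Jacobian cost, the following sketch approximates the matrices of Equation 5.5 by central finite differences. This is a stand-in for the analytic Jacobians that an EKF would normally use. The helper name `numerical_jacobian` and the step size `eps` are assumptions made for illustration.

```python
def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x.

    f maps a list of n floats to a list of m floats; the result is an
    m-by-n nested list with entry [i][j] approximating d f_i / d x_j.
    """
    fx = f(x)
    jac = [[0.0] * len(x) for _ in fx]
    for j in range(len(x)):
        up = list(x)
        down = list(x)
        up[j] += eps
        down[j] -= eps
        f_up, f_down = f(up), f(down)   # two model evaluations per state variable
        for i in range(len(fx)):
            jac[i][j] = (f_up[i] - f_down[i]) / (2 * eps)
    return jac
```

Each Jacobian requires $2n$ evaluations of the model for an $n$-dimensional state, repeated at every time step, which is one way to see why the approach scales poorly with state dimension.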

The Ensemble Kalman filter (EnKF) [21] was thus developed to alleviate computational complexity. It is related to the Particle Filter (PF) [20] in the sense that each ensemble member can be considered a particle aimed at estimating the relevant probability distributions using Monte Carlo procedures. However, contrary to PF, EnKF assumes Gaussian-distributed noise characteristics and is thus computationally more efficient than the PF, albeit with stricter assumptions about the underlying process. The essential steps of EnKF are: 1) maintaining an ensemble of state estimates instead of a single estimate, 2) simply advancing each member of the ensemble, and 3) calculating the mean and error covariance matrix directly from this ensemble.

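Assuming that we have an ensemble of $q$ state estimates with random sample errors, EnKF steps can be expressed via the following equations:

$$
\begin{aligned}
K_k &= E^f_{x_k} \left( E^f_{z_k} \right)^T \left( E^f_{z_k} \left( E^f_{z_k} \right)^T + R \right)^{-1} \\
x^{a_i}_k &= x^{f_i}_k + K_k \left( z_k + v^i_k - z^{f_i}_k \right), \quad i = 1, 2, \ldots, q \\
x^a_k &= \frac{1}{q} \sum_{i=1}^{q} x^{a_i}_k
\end{aligned}
\tag{5.6}
$$

$$
\begin{aligned}
x^{f_i}_{k+1} &= a\left( x^{a_i}_k, u_k \right) + w^i_k, \quad i = 1, 2, \ldots, q \\
z^{f_i}_{k+1} &= h\left( x^{f_i}_{k+1} \right), \quad i = 1, 2, \ldots, q \\
E^f_{x_{k+1}} &= \frac{1}{\sqrt{q-1}} \left[ x^{f_1}_{k+1} - x^f_{k+1} \;\; \ldots \;\; x^{f_q}_{k+1} - x^f_{k+1} \right] \\
E^f_{z_{k+1}} &= \frac{1}{\sqrt{q-1}} \left[ z^{f_1}_{k+1} - z^f_{k+1} \;\; \ldots \;\; z^{f_q}_{k+1} - z^f_{k+1} \right]
\end{aligned}
\tag{5.7}
$$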

where $x^f_{k+1} = \frac{1}{q} \sum_{i=1}^{q} x^{f_i}_{k+1}$ and $z^f_{k+1} = \frac{1}{q} \sum_{i=1}^{q} z^{f_i}_{k+1}$ are the means of the state forecast ensemble and the measurement forecast ensemble. $E^f_{x_{k+1}}$ and $E^f_{z_{k+1}}$ are the corresponding ensemble perturbation matrices. $w^i_k$ and $v^i_k$ are generated random noise variables that follow the normal distributions $\mathcal{N}(0, Q)$ and $\mathcal{N}(0, R)$, respectively. EnKF offers great ease of implementation and handling of non-linearity due to the absence of Jacobian calculations; on the other hand, it is critical to choose an ensemble size that is large enough to be statistically representative. The details of EnKF can be found in [21, 37].
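The analysis step of Equation 5.6 can be sketched in a few lines. For brevity the sketch assumes a scalar measurement ($m = 1$), so that $E^f_z (E^f_z)^T$ is a sample variance and the gain is a vector of length $n$. The function name and argument layout are illustrative only and do not reflect our implementation.

```python
import random


def enkf_analysis(ensemble, z, h, R, rng=random):
    """One EnKF analysis step (Equation 5.6) for a scalar measurement.

    ensemble: list of forecast state vectors (lists of floats)
    z: observed scalar; h: measurement function; R: measurement noise variance
    """
    q = len(ensemble)
    n = len(ensemble[0])
    z_f = [h(x) for x in ensemble]                        # measurement forecasts
    x_mean = [sum(x[d] for x in ensemble) / q for d in range(n)]
    z_mean = sum(z_f) / q
    # E_x E_z^T and E_z E_z^T with the 1/(q-1) normalisation of Equation 5.7
    cov_xz = [
        sum((x[d] - x_mean[d]) * (zf - z_mean) for x, zf in zip(ensemble, z_f)) / (q - 1)
        for d in range(n)
    ]
    var_z = sum((zf - z_mean) ** 2 for zf in z_f) / (q - 1)
    gain = [c / (var_z + R) for c in cov_xz]              # Kalman gain K_k
    analysis = []
    for x, zf in zip(ensemble, z_f):
        v = rng.gauss(0.0, R ** 0.5)                      # perturbed observation noise
        analysis.append([x[d] + gain[d] * (z + v - zf) for d in range(n)])
    return analysis
```

No Jacobian appears anywhere: the measurement function `h` is only evaluated on ensemble members, and the covariances are estimated from the ensemble itself.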

Some of the more popular variations of EnKF, viz. the Ensemble Adjustment Kalman Filter (EAKF) [2] and the Ensemble Transform Kalman Filter (ETKF) [61], do not add Gaussian noise to form measurement ensembles and instead deterministically adjust each ensemble member so that the posterior variance is identical to that predicted by Bayes' theorem under Gaussian distribution assumptions, while keeping the ensemble mean unchanged. Compared with ETKF, EAKF shows better numerical stability but requires extra SVD operations and is thus computationally more expensive. In EAKF, the estimated state perturbation matrix can be written in the pre-multiplier form:

$$
E^a_x = A E^f_x
\tag{5.8}
$$

Compared to ETKF, in which $E^a_x$ can only be expressed in post-multiplier form, EAKF does not suffer from the two issues that may appear in ETKF: 1) producing analysis ensembles with inconsistent statistics, such as a biased mean and/or small standard deviations of the coordinates; and 2) a collapse, with each assimilation of an observation, in the number of distinct values of the observed coordinates in the ensemble. More discussion of EAKF and ETKF can be found in [2, 61, 37].


5.2 Data Assimilation Models in disease forecasting

We described some of the more practical and popular data assimilation models in Section 5.1. In this section, we present our disease-specific data assimilation model, which we used to generate seasonal forecasts for ILI and subsequently extended to CHIKV forecasts. To build such data assimilation models, we need to specify disease spread processes whose parameters are learned through the data assimilation algorithm of choice (see [51]). For our purpose, we chose a dynamic data-driven SIRS model. This dynamic model is inspired by Shaman et al. [51] and aims to use a Bayesian filter to continuously assimilate observed data sources into the model characteristics and generate an ensemble of models. A key distinguishing feature of our work is the diversity of syndromic surveillance sources used. The spread of the ensemble predictions also reveals the underlying probability distribution of various seasonal characteristics such as start week and peak week. The model used for ILI can be formally described as follows.

Let us denote the observed ILI percentage for the region of interest (including national level data) by $y_t$. We choose as a candidate the well-defined SIRS model, where $S_t$ and $I_t$ denote the number of people in the ‘Susceptible’ and ‘Infectious’ compartments at time $t$. Let us also denote the new infections moving into the $I$ bucket at time $t$ by $\mathrm{newI}_t$, which can be directly computed from $I_t$. Let us denote the population size by $N$, the mean infectious period by $D$, the mean resistance period by $L$, and the basic reproductive rate at time $t$ by $R_{0,t}$. Then the basic SIRS equations at time $t$ can be given as

$$
\begin{aligned}
\frac{dS_t}{dt} &= \frac{N - S_t - I_t}{L} - \frac{\beta(t) I_t S_t}{N} - \alpha \\
\frac{dI_t}{dt} &= \frac{\beta(t) I_t S_t}{N} - \frac{I_t}{D} + \alpha
\end{aligned}
\tag{5.9}
$$

where $\beta(t) = R_{0,t} / D$.
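As an illustration, Equation 5.9 can be advanced numerically with a single forward-Euler step. The integration scheme, the step size `dt`, and the choice to report new infections as the incidence term $\beta(t) I_t S_t / N$ over the step are assumptions of this sketch and are not prescribed by the model description above.

```python
def sirs_step(S, I, N, R0, D, L, alpha=0.0, dt=1.0):
    """Advance S and I by one forward-Euler step of Equation 5.9.

    Returns the updated (S, I) and the new infections generated in the step
    (taken here, as an assumption, to be the incidence term times dt).
    """
    beta = R0 / D                                  # beta(t) = R_{0,t} / D
    incidence = beta * I * S / N
    dS = ((N - S - I) / L - incidence - alpha) * dt
    dI = (incidence - I / D + alpha) * dt
    return S + dS, I + dI, incidence * dt
```

For example, with $N = 1000$, $S = 900$, $I = 100$, $R_0 = 2$, $D = 4$ and $L = 100$, one unit step moves 45 people out of the susceptible compartment and removes 25 from the infectious compartment.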

Let us denote by $x_t$ a hidden layer of variables that connects the SIRS model with the observed ILI percentages. The hidden variable set can be thought of as an n-tuple $x_t$, as

$$
x_t = (S_t, I_t, R_0, D, L, f, r)
$$

The equations governing the Bayesian filter can be given as:

$$
\begin{aligned}
y_t &= f \cdot \mathrm{newI}_t + \mathcal{N}(0, r) \\
x_t &= g(x_t \mid x_{t-1})
\end{aligned}
\tag{5.10}
$$

where $g$ denotes the dynamic model transition from time $t - 1$ to $t$.

$g$ can be a general-purpose transition function. For our purpose, we perturb $S$ and $I$ via the SIRS equation and the remaining state parameters using a random walk model within specified bounds. We studied a number of data assimilation models as presented in Section 5.1 and selected EnKF to allow for greater flexibility in modeling, with a stated goal of comparing different sources with respect to their relative importance in disease forecasting. We used an EnKF with 10000 ensemble members to estimate the disease parameters. The distribution of the ensemble provides the posterior distribution over the SIRS parameters and can be used to directly infer the parameters.
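The bounded random walk used for the non-compartment parameters can be sketched as follows. The bounds and step sizes in `BOUNDS` and `STEP` are hypothetical values chosen only to make the example runnable, and clipping at the bounds is one of several possible ways to enforce them.

```python
import random

# Hypothetical bounds and step sizes, for illustration only.
BOUNDS = {"R0": (1.0, 4.0), "D": (2.0, 7.0), "L": (200.0, 3650.0)}
STEP = {"R0": 0.05, "D": 0.05, "L": 5.0}


def random_walk_params(params, rng=random):
    """Perturb each parameter with Gaussian noise and clip it to its bounds."""
    new = {}
    for name, value in params.items():
        low, high = BOUNDS[name]
        proposal = value + rng.gauss(0.0, STEP[name])
        new[name] = min(high, max(low, proposal))
    return new
```

In an ensemble setting such a function would be applied independently to every member, so that the spread of the parameters across members is maintained between assimilation steps.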

5.3 Data Assimilation Using Surrogate Sources

The method described above can be thought of as a general-purpose algorithm where we can introduce information about different sources by modifying equation 5.10. Earlier research [51] has shown that surrogate sources such as absolute humidity can be used to locally modify disease parameters and generate more robust forecasts. However, such methods have mainly focused on allowing a single surrogate and/or using custom state transition equations which are not easily generalizable to other sources. We focused on extending such methods to more generic sources and on studying the relative importance of such sources towards long-term forecasting. We have used a number of surrogate sources such as Weather, Google Flu Trends, Google Search Trends, HealthMap, and Twitter chatter. For the sake of simplicity, we explain our model using Google Flu Trends (GFT) as the illustrative source. Additional data sources can be incorporated following similar equations.

As discussed in Part I, surrogate sources were found to encode disease transmission information but also to exhibit significant noise. However, from our experiments we found that although absolute surrogate counts are noisy, their rolling covariance can be used to inform a sudden increase/decrease of disease incidence in the population. Thus surrogate information was used to modify the transition equation for other latent variables such as $R_0$ as:

$$
R_{0,t} = R_{0,t-1} + \mathcal{N}\left(0, \mathrm{cov}(GFT_{t-1}, GFT_t)\right)
\tag{5.11}
$$
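One possible reading of Equation 5.11 is sketched below. It assumes that $\mathrm{cov}(GFT_{t-1}, GFT_t)$ is the sample covariance between two equal-length windows of the surrogate series, one ending at $t-1$ and one ending at $t$, and that negative covariances are clamped to zero so they can serve as a variance. The window length and both function names are hypothetical.

```python
import random


def lagged_rolling_cov(series, t, window):
    """Sample covariance between the windows ending at t-1 and at t."""
    a = series[t - window:t]            # values up to index t-1
    b = series[t - window + 1:t + 1]    # values up to index t
    mean_a, mean_b = sum(a) / window, sum(b) / window
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (window - 1)


def update_r0(r0_prev, series, t, window=4, rng=random):
    """Random-walk update of R0 whose variance follows the surrogate signal."""
    var = max(lagged_rolling_cov(series, t, window), 0.0)  # clamp: an assumption
    return r0_prev + rng.gauss(0.0, var ** 0.5)
```

A flat surrogate series yields zero covariance and leaves $R_0$ unchanged, while a rapidly moving series widens the random walk and lets the filter explore larger changes in transmissibility.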

Following Chakraborty et al. [12] we intend to analyze a myriad of data sources to train amore precise model with lower uncertainty bounds.

5.4 Experimental Results and Performance Summary

We used our data assimilation model to generate forecasts for ILI and CHIKV for various regions of the world. While ILI is a human-to-human transmitted infectious disease, CHIKV is a vector-driven disease, and forecasting models for CHIKV need to be cognizant of this difference. For both diseases, weather attributes such as temperature and humidity could be argued to be important transmission modulators. We applied data assimilation methods as outlined in Section 5.3 using weather as a surrogate source. These forecasts were generated continuously for CHIKV in the Americas and for ILI in the US. As can be seen in Figure 5.1, data assimilation methods were able to forecast several seasonal characteristics more accurately for ILI than for CHIKV. CHIKV, being a newly introduced disease in the Americas, was characterized by more noise, and our results also indicate the possible importance of modeling the vectors (mosquitoes) in addition to surrogate sources, which may improve our forecasting performance.

Figure 5.1: Performance summary for (a) ILI and (b) CHIKV seasonal forecasts using Weather as a surrogate source under the data assimilation framework

Table 5.1: Forecasting performance of seasonal characteristics using data assimilation methods

Metric        BO        CL        MX        PE
start date    8.000     3.000     11.000    5.000
end date      16.000    14.000    35.000    14.000
peak date     17.000    2.000     28.000    4.000
peak val      2.005     3.413     0.145     2.908
season val    2.889     2.958     0.117     2.255

Similar to our efforts in short-term forecasting, we compared the importance of each individual surrogate source towards long-term forecasts. We applied our data assimilation model as outlined in Section 5.3 to ILI incidence for the 2014-2015 season over four Latin American countries, viz. Bolivia, Chile, Mexico and Peru. We chose these countries because Google Flu Trends was available for them and because they exhibit different modes of seasonality in Latin America. We generated seasonal forecasts using data present at weeks 4 through 8 of the flu season for each of these countries. Table 5.1 summarizes the performance of our forecasts. The complete performance summary for these forecasts can be seen in Appendix A. Figure 5.2 and Figure 5.3 plot the distribution of forecasting accuracy for dates (deviation in days) and values (quality score), respectively. As can be seen, HealthMap sources perform the best for both categories, indicating that the news media captures long-term signals about the season. The combination of all sources performs with accuracy similar to HealthMap, indicating that the competing sources could potentially be used, especially to improve accuracy against local variations.

We further analyze the forecasting performance by examining the change in forecasting accuracy over the number of season weeks used to generate the forecasts in Figures 5.4, 5.5, 5.6, 5.7 and 5.8. As can be seen, a combination of all sources shows the most consistent performance over the season weeks compared to a single source. Furthermore, forecasting accuracy over value metrics (such as peak value and season value) benefits more from observation of a number of seasonal weeks compared to dates. Our results indicate that the shape of the disease curve can be forecasted with better accuracy than the actual size when only a few data points are observable for the season. Furthermore, the temporal accuracy plots indicate that surrogate sources such as HealthMap and GST contribute more heavily in the initial part of the disease season compared to the later part.

Figure 5.2: Comparison of forecasting accuracy for Date metrics using surrogates. [Plot omitted. Panels: start_date, end_date, peak_date; x-axis: source (Weather, gft, gst, hmap, merged, twitter); y-axis: Score.]


Figure 5.3: Comparison of forecasting accuracy for Value metrics using surrogates. [Plot omitted. Panels: peak_val, season_val; y-axis: Score.]

Figure 5.4: Comparison of forecasting accuracy for ‘Start Date’ using different surrogate sources. [Plot omitted. Panels: Country: BO, CL, MX, PE; x-axis: curr_week (4 to 9); legend (source): twitter, hmap, gst, gft, Weather, merged.]


Figure 5.5: Comparison of forecasting accuracy for ‘End Date’ using different surrogate sources. [Plot omitted. Panels: Country: BO, CL, MX, PE.]

Figure 5.6: Comparison of forecasting accuracy for ‘Peak Date’ using different surrogate sources. [Plot omitted. Panels: Country: BO, CL, MX, PE.]


Figure 5.7: Comparison of forecasting accuracy for ‘Peak Value’ using different surrogate sources. [Plot omitted. Panels: Country: BO, CL, MX, PE.]

Figure 5.8: Comparison of forecasting accuracy for ‘Season Value’ using different surrogate sources. [Plot omitted. Panels: Country: BO, CL, MX, PE.]

5.5 Discussion

We have presented our work on long-term forecasts using both data assimilation methods and curve matching. Our results indicate that data assimilation methods are in general more flexible and robust for long-term forecasts. Surrogate sources such as HealthMap are important factors for such forecasts, especially during the initial part of the season. Our future research will focus on systematically including other infectious diseases within the framework and on sparse selection of surrogates for more robust forecasting.

Part III

Detecting and Adapting to Concept Drift

Part I and Part II outlined our efforts at short-term and long-term forecasting using surrogates. However, surrogates are typically noisy and their relationships to targets may be dynamic in nature. The changes in surrogate-target relationships can be significant and, if undetected, may subsequently render any model developed on these surrogates ineffective. This motivates the third problem of this thesis, where we first try to identify such major changes under the concept of ‘changepoints’. For this, in Chapter 6 we develop a hierarchical changepoint detection framework which can inform the changepoints in targets using information from the surrogate layers. We also propose the use of such changepoints towards adaptive target forecasting in Chapter 7.

Chapter 6

Hierarchical Quickest Change Detection via Surrogates

With the increasing availability of digital data sources, there is a concomitant interest in using such sources to understand and detect events of interest, reliably and rapidly.

For instance, protest uprisings in unstable countries can be better analyzed by considering a variety of sources such as economic indicators (e.g. inflation, food prices) and social media indicators (e.g. Twitter and news activity). Concurrently, detecting the onset of such events with minimal delay is of critical importance. For instance, detecting a disease outbreak [45] in real time can help in triggering preventive measures to control the outbreak. Similarly, early alerts about possible protest uprisings can help in designing traffic diversions and enhanced security to ensure peaceful protests.

Motivated by similar real-life scenarios where significant events can be argued to be observable in the social sphere, we propose Hierarchical Quickest Change Detection (HQCD) for online change detection across multiple sources, viz. target and surrogates. Typically, targets are sources of imminent interest (such as disease outbreaks or civil unrest), whereas surrogates (such as counts of the word ‘protesta’ in Twitter) are not of significant interest by themselves. Thus, HQCD is aimed at continuously utilizing both categories, with a focus on early (or quickest) detection of significant changes across the target sources. Traditional event (or change) detection approaches are not suitable for such problems. These are either a) offline approaches [43, 60, 52, 8] that use the entire data retrospectively and are thus not applicable to real-time scenarios, or b) online detection approaches [53, 54, 30, 31, 1, 35] that focus primarily on the target source of interest and do not utilize other correlated sources. Table 6.1 shows a comparison of HQCD and several state-of-the-art methods in terms of the desirable attributes.

The main contributions of the work presented in this chapter are:

• HQCD formalizes a hierarchical structure which, in addition to the observed set of target sources (i.e., $S_i$'s), incorporates additional surrogates, denoted by $K_j$'s, and encodes propagation of change from surrogate to target sources.

• HQCD presents a specialized change detection metric that guarantees a maximum level of false alarm rate while reducing the detection delay in the quickest detection framework. In addition, HQCD yields a natural methodology for analyzing the causality of change in a particular target source through a sequence of change propagation in other sources.

• HQCD presents a specialized sequential Monte Carlo based change detection framework that, along with specialized change detection metrics, enables hierarchical data to be analyzed in an online fashion.

• We extensively test HQCD on both synthetic and real-world data. We compare against state-of-the-art methods and illustrate the robustness of our methods and the usefulness of surrogates. Moreover, we analyze target-surrogate relationships and uncover important propagation patterns that led to such uprisings.

Table 6.1: Comparison of state-of-the-art methods vs Hierarchical Quickest Change Detection

Methods (columns, left to right): Sequential GLRT [53]; Window-Limited GLRT [54]; Bayesian Online CPD [30] [1] [31]; Relative Density-ratio Estimation (RuLSIF) [35]; Hierarchical Bayesian Analysis of Change Point Problems [8]; HQCD (This Paper)

Desirable Properties                          Marks
Online                                        X X X X X
Hierarchical                                  X X
Bounded False Alarm Rate / Detection delay    X X X X
Handles Non-IID data                          X

6.1 HQCD–Hierarchical Quickest Change Detection

We first provide a brief overview of the classical QCD problem and then present the HQCD framework.

6.1.1 Quickest Change Detection (QCD)

Let us consider a data source $S$ changing over time and following different stochastic processes before and after an unknown time $\Gamma$ (changepoint). The task of QCD is to produce an estimate $\gamma$ of $\Gamma$ in an online setting (i.e., at time $t$, only $S_1, \ldots, S_t$ is available). Figure 6.1 illustrates the two fundamental performance metrics related to this problem. In the figure, $\Gamma = t_4$ is the actual time-point when the change happened. An early estimate such as $\gamma_1 = t_1$ leads to a false alarm, whereas a later estimate such as $\gamma_2 = t_6$ leads to an ‘additive delay’ of $\gamma_2 - \Gamma = t_6 - t_4$. The goal of QCD is to design an online detection strategy which minimizes the expected additive detection delay (EADD) while not exceeding a maximum pre-specified probability of false alarm (PFA). QCD has been studied in various contexts. Some of the foremost methods have considered i.i.d. distributions with known (or unknown) parameters before and after unknown changepoints [59]. Some of the more popular methods have used CUSUM (cumulative sum of likelihood) based tests, while more general approaches are adopted in GLRT (generalized likelihood ratio test) based methods [19].

Figure 6.1: Illustration of Quickest Change Detection (QCD): the blue line represents the actual changepoint at time $\Gamma = t_4$. (a) Declaring a change at $\gamma_1$ leads to a false alarm, whereas (b) declaring the change at $\gamma_2$ leads to detection delay. QCD can strike a tradeoff between false alarm and detection delay. [Diagram omitted. Timeline $t_0, \ldots, t_{10}$ marking the false alarm at $\gamma_1$, the true changepoint $\Gamma$, and the delayed detection at $\gamma_2$, with detection delay $= \gamma_2 - \Gamma$.]
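The false alarm and delay notions, together with a CUSUM-style detector of the kind mentioned above, can be sketched as follows. The sketch assumes i.i.d. Gaussian observations with known pre-change mean `mu0`, post-change mean `mu1` and common standard deviation `sigma`. The threshold and function names are illustrative choices, and a declaration made exactly at the changepoint is treated here as a detection with zero delay.

```python
def cusum_detect(xs, mu0, mu1, sigma, threshold):
    """Return the first index at which the CUSUM statistic crosses the
    threshold, or None if it never does."""
    stat = 0.0
    for t, x in enumerate(xs):
        # log-likelihood ratio of N(mu1, sigma^2) against N(mu0, sigma^2)
        llr = (mu1 - mu0) * (x - (mu0 + mu1) / 2.0) / sigma ** 2
        stat = max(0.0, stat + llr)
        if stat >= threshold:
            return t
    return None


def evaluate_detection(declared, true_change):
    """Classify a declared changepoint against the true one."""
    if declared < true_change:
        return ("false_alarm", 0)
    return ("detected", declared - true_change)
```

Raising the threshold makes false alarms rarer at the price of a longer delay, which is the tradeoff depicted in Figure 6.1.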

6.1.2 Changepoint detection in Hierarchical Data

We next present our approach to generalize QCD to a hierarchical setting. We first describe a generic hierarchical model and then propose the QCD statistics for such models in Section 6.1.2. For computational feasibility, we present a bounded approximation of the same and our multilevel changepoint algorithm in Section 6.1.2.

Generic Hierarchical Model

Figure 6.2: Generative process for HQCD. As an example consider civil unrest protests. In the framework, different protest types (such as Education- and Housing-related protests) form the targets denoted by $S_i$'s. The total number of protests will be denoted by the top-most variable $E$. Finally, the set of surrogates, such as counts of Twitter keywords, stock price data, weather data, network usage data etc. are denoted by $K_j$'s. [Diagram omitted. Nodes: $E$ (sum of target sources), target sources $S_1, S_2, S_3, \ldots, S_I$, and surrogate sources $K_1, K_2, K_3, \ldots, K_J$.]

Let us consider $S^{(T)}$, a set of $I$ correlated temporal sequences $\{S^{(T)}_1, S^{(T)}_2, \ldots, S^{(T)}_I\}$, where $S^{(T)}_i$ represents the $i$th target data sequence $S^{(T)}_i = [s_i(1), s_i(2), \ldots, s_i(T)]$ for $i = 1, \ldots, I$, collected up to some time $T$. The cumulative sum of the target sources $S_i$ at time $t$ is given by $E(t)$, i.e., $E(t) = \sum_{i=1}^{I} s_i(t)$. Concurrent to target sources, we also observe a set of $J$ surrogate sources, $K^{(T)} = \{K^{(T)}_1, K^{(T)}_2, \ldots, K^{(T)}_J\}$, where $K^{(T)}_j = [k_j(1), k_j(2), \ldots, k_j(T)]$ for $j = 1, \ldots, J$, which may have either a causal or an effectual relationship with the target source set $S^{(T)}$ (see Figure 6.2). We assume that targets and surrogates follow a stochastic Markov process as follows:

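$$
\begin{aligned}
P\left(S^{(T)}, K^{(T)}\right) &= P\left(S^{(T)}_1, \ldots, S^{(T)}_I, K^{(T)}_1, \ldots, K^{(T)}_J\right) \\
&= \prod_{t=1}^{T} \left\{ \prod_{j=1}^{J} P^{\phi_{K_j}}_t\left(K_j(t)\right) \times \prod_{i=1}^{I} P^{\phi_{S_i}}_t\left(S_i(t) \mid S^{(t-1)}, K^{(t-1)}\right) \right\}.
\end{aligned}
$$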

The binary variables $\phi_{K_j}, \phi_{S_i} \in \{0, 1\}$ capture the notion of significant changes in events through changes in the distribution of the generative process as follows: if the surrogate source $K_j$ undergoes a change in distribution at some time $t$, then $\phi_{K_j}$ changes from 0 to 1. In other words, $P^0_t(K_j)$ (respectively $P^1_t(K_j)$) denotes the pre-change (post-change) distribution of the $j$th surrogate source. Similarly, if the target source $S_i$ undergoes a change in distribution at some time $t$, then $\phi_{S_i}$ changes from 0 to 1. In other words, $P^0_t(S_i \mid \cdot)$ (respectively $P^1_t(S_i \mid \cdot)$) denotes the pre-change (post-change) conditional distribution of the $i$th target data source. We denote by $\Gamma_{K_j}$ (respectively $\Gamma_{S_i}$) the random variable for the time at which $\phi_{K_j}$ (respectively $\phi_{S_i}$) changes from 0 to 1. We write $\Gamma_K = (\Gamma_{K_1}, \ldots, \Gamma_{K_J})$ and $\Gamma_S = (\Gamma_{S_1}, \ldots, \Gamma_{S_I})$ as the collective sets of changepoints in the surrogate and target sources, respectively. Finally, we denote by $\Gamma_E$ the changepoint random variable for the top layer, $E$, which represents the sum total of all target sources.

From QCD to HQCD

We extend the concepts of QCD presented in Section 6.1.1 to the multilevel setting by formalizing the problem as the earliest detection of the set of all $(J + I + 1)$ changepoints, i.e., $\Gamma = \{\Gamma_K, \Gamma_S, \Gamma_E\}$, having observed the target and surrogate sources, i.e., $(S^{(T)}, K^{(T)})$. Let $\gamma = \{\gamma_K, \gamma_S, \gamma_E\}$ be the $(J + I + 1)$ vector of decision variables for the changepoints. To measure detection performance, we define the following two novel performance criteria:

Multi-Level Probability-of-False-Alarm (ML-PFA):

$$
\text{ML-PFA}(\gamma) = \mathbb{P}(\gamma \preceq \Gamma),
\tag{6.1}
$$

where, for any two $N$-length vectors $a$ and $b$, the notation $a \preceq b$ implies $a_i \le b_i$ for $i = 1, \ldots, N$. For instance, consider the example of $I = 1$ target and $J = 1$ surrogate. Then $\Gamma = (\Gamma_{K_1}, \Gamma_{S_1})$ and $\gamma = (\gamma_{K_1}, \gamma_{S_1})$, and the probability of multi-level false alarm is given by $\text{ML-PFA}(\gamma) = \mathbb{P}(\gamma_{K_1} \le \Gamma_{K_1}, \gamma_{S_1} \le \Gamma_{S_1})$. This definition of ML-PFA declares a false alarm only if all the $(J + I + 1)$ change decision variables are no larger than the true changepoints.

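Expected Additive Detection Delay (EADD):

$$
\text{EADD}(\gamma) = \mathbb{E}\left(|\gamma - \Gamma|_1\right)
= \underbrace{\sum_{j=1}^{J} \mathbb{E}\left(|\gamma_{K_j} - \Gamma_{K_j}|\right)}_{\text{Surrogate layer delay}}
+ \underbrace{\sum_{i=1}^{I} \mathbb{E}\left(|\gamma_{S_i} - \Gamma_{S_i}|\right)}_{\text{Target layer delay}}
+ \underbrace{\mathbb{E}\,|\gamma_E - \Gamma_E|}_{\text{Top layer delay}}
\tag{6.2}
$$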
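Both criteria can be estimated empirically when paired samples of decisions and true changepoints are available, for example from simulation. The sketch below assumes each sample is a tuple of $(J + I + 1)$ time indices in a fixed order. The function names are illustrative.

```python
def precedes(gamma, Gamma):
    """Componentwise gamma_i <= Gamma_i for all i (the relation in Equation 6.1)."""
    return all(g <= G for g, G in zip(gamma, Gamma))


def additive_delay(gamma, Gamma):
    """L1 distance |gamma - Gamma|_1, the quantity averaged in Equation 6.2."""
    return sum(abs(g - G) for g, G in zip(gamma, Gamma))


def empirical_ml_pfa_and_eadd(decisions, truths):
    """Monte Carlo estimates of ML-PFA and EADD over paired samples."""
    n = len(decisions)
    pfa = sum(precedes(g, G) for g, G in zip(decisions, truths)) / n
    eadd = sum(additive_delay(g, G) for g, G in zip(decisions, truths)) / n
    return pfa, eadd
```

Note that a sample counts towards ML-PFA only when every component of the decision vector is early, whereas every component contributes to EADD regardless of sign.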

Given the observations, i.e., all target and surrogate sources $(S^{(T)}, K^{(T)})$ till time $T$ governed by unknown changepoints $\Gamma$, we aim to make an optimal decision $\gamma$ about these changepoints under the following criterion:

$$
\gamma^*(\alpha) = \arg\min_{\gamma} \text{EADD}(\gamma) \quad \text{s.t.} \quad \text{ML-PFA}(\gamma) \le \alpha.
\tag{6.3}
$$

In other words, $\gamma^*(\alpha)$ is the optimal change decision vector which minimizes the EADD while guaranteeing that the ML-PFA is no more than a tolerable threshold $\alpha$. We note that the above optimal test is challenging to implement for real-world data sets due to the following issues: a) it requires knowledge of the pre- and post-change distributions (for all sources) and of the distribution of the changepoint random vector $\Gamma$; b) unlike single-source QCD, finding the optimal $\gamma^*(\alpha)$ requires a multi-dimensional search over multiple sources, making it computationally expensive; and c) it does not discriminate between false alarms across different sources. For instance, declaring a false alarm at a target source (such as a premature declaration of the onset of protests or disease outbreaks) must be penalized more than declaring a false alarm at a surrogate source (such as incorrectly declaring a rise in Twitter activity).

Bounded approximation of HQCD

We can circumvent problem (b) of the original definition of ML-PFA as given in equation 6.1 by upper bounding it in Theorem 6.1.

Theorem 6.1 (Modified-PFA). Let $\gamma = \{\gamma_S, \gamma_K, \gamma_E\}$ be a set of estimates of the true changepoints for targets, surrogates and sum-of-targets, respectively. Then, under the condition of greater importance to accurate target layer detections, ML-PFA (see 6.1) is upper-bounded by Modified-PFA, where:

$$
\text{Modified-PFA}(\gamma) \triangleq I \times \max_i \mathbb{P}(\gamma_{S_i} \le \Gamma_{S_i}) + \min_j \mathbb{P}(\gamma_{K_j} \le \Gamma_{K_j}) + \mathbb{P}(\gamma_E \le \Gamma_E)
\tag{6.4}
$$

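Proof. We can prove the upper bound of ML-PFA with the following reductions:

$$
\begin{aligned}
\text{ML-PFA}(\gamma) &= \mathbb{P}(\gamma \preceq \Gamma) \\
&= \mathbb{P}(\gamma_S \preceq \Gamma_S, \gamma_K \preceq \Gamma_K, \gamma_E \le \Gamma_E) \\
&\overset{(a)}{\le} \mathbb{P}(\gamma_S \preceq \Gamma_S) + \mathbb{P}(\gamma_K \preceq \Gamma_K) + \mathbb{P}(\gamma_E \le \Gamma_E) \\
&\overset{(b)}{\le} \sum_{i=1}^{I} \mathbb{P}(\gamma_{S_i} \le \Gamma_{S_i}) + \mathbb{P}(\gamma_K \preceq \Gamma_K) + \mathbb{P}(\gamma_E \le \Gamma_E) \\
&\le I \times \max_i \mathbb{P}(\gamma_{S_i} \le \Gamma_{S_i}) + \mathbb{P}(\gamma_K \preceq \Gamma_K) + \mathbb{P}(\gamma_E \le \Gamma_E) \\
&\overset{(c)}{\le} I \times \max_i \mathbb{P}(\gamma_{S_i} \le \Gamma_{S_i}) + \min_j \mathbb{P}(\gamma_{K_j} \le \Gamma_{K_j}) + \mathbb{P}(\gamma_E \le \Gamma_E),
\end{aligned}
\tag{6.5}
$$

where (a) and (b) follow from the union bound on probability and (c) follows from the fact that the joint probability of a set of events is no greater than the probability of any one event, i.e., $\mathbb{P}(\gamma_K \preceq \Gamma_K) \le \mathbb{P}(\gamma_{K_j} \le \Gamma_{K_j})$ for any $j = 1, \ldots, J$, and then taking the minimum over all $j$. The resulting upper bound in (6.5) leads to the basis of the modification of the multi-level PFA:

$$
\text{Modified-PFA}(\gamma) \triangleq I \times \max_i \mathbb{P}(\gamma_{S_i} \le \Gamma_{S_i}) + \min_j \mathbb{P}(\gamma_{K_j} \le \Gamma_{K_j}) + \mathbb{P}(\gamma_E \le \Gamma_E)
$$

□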

The Modified-PFA expression leads to intuitive interpretations as follows: (i) as false alarms at targets can have a higher impact, it is desirable to keep the worst-case PFA across these to be the smallest, or equivalently, $\max_i \mathbb{P}(\gamma_{S_i} \le \Gamma_{S_i})$ should be minimized; (ii) false alarms at surrogates are not as important, and we can declare a false alarm if all of the surrogate-level detection(s) are unreliable, or equivalently, $\min_j \mathbb{P}(\gamma_{K_j} \le \Gamma_{K_j})$ needs to be minimized; (iii) notably, the above modification leads to a low-complexity change detection approach across multiple sources by locally optimal detection strategies, avoiding a multi-dimensional search.
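As a small numerical illustration, Modified-PFA can be evaluated directly from per-source false alarm probabilities. The input values in the usage note are made up for the example, and the function name is hypothetical.

```python
def modified_pfa(target_pfas, surrogate_pfas, top_pfa):
    """Equation 6.4: I * max_i P(target i) + min_j P(surrogate j) + P(top layer)."""
    num_targets = len(target_pfas)
    return num_targets * max(target_pfas) + min(surrogate_pfas) + top_pfa
```

For two targets with false alarm probabilities 0.01 and 0.02, three surrogates with 0.1, 0.05 and 0.2, and a top layer with 0.01, the worst target dominates the first term and only the most reliable surrogate enters the second, which mirrors interpretations (i) and (ii) above.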

Based on Modified-PFA, we next present a compact test suite to declare changes at pre-specified levels of maximum PFA, as given in Theorem 6.2, and incorporate the specificity issues pointed out in problem (c) of the original formulation of PFA.

Theorem 6.2 (Multi-level Change Detection). Let ΓSi be the true change point random variable for the ith target source, Si. Let ΓKj and ΓE represent the same for the jth surrogate and the sum-of-targets, respectively. Let the data observed till time T be D(T) ≜ (S(T), K(T)) and let P(Γ|D(T)) denote the estimate of the conditional distribution (see Section 6.2.2). Then, if αi, βj, λ represent the PFA thresholds for Si, Kj, E, the changepoint tests can be given as:

\[
\begin{aligned}
\gamma_{S_i}(\alpha_i) &= \inf\left\{ n : TS_{S_i}(D^{(T)}) \ge \frac{\alpha_i}{1+\alpha_i} \right\}, \quad i = 1, \ldots, I && \text{(6.6a)} \\
\gamma_{K_j}(\beta_j) &= \inf\left\{ n : TS_{K_j}(D^{(T)}) \ge \frac{\beta_j}{1+\beta_j} \right\}, \quad j = 1, \ldots, J && \text{(6.6b)} \\
\gamma_{E}(\lambda) &= \inf\left\{ n : TS_{E}(D^{(T)}) \ge \frac{\lambda}{1+\lambda} \right\}, && \text{(6.6c)}
\end{aligned}
\]

where TSX(D(T)) = P(ΓX ≤ n | D(T)) is the test statistic (TS) for a source X.
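The threshold tests in (6.6a)-(6.6c) share one form, which can be illustrated with a minimal Python sketch. It assumes the changepoint posterior is available as a discrete probability mass function over candidate times; the function name `detect_change` and the example posterior are hypothetical and are not part of the original work.

```python
from itertools import accumulate


def detect_change(posterior_pmf, alpha):
    """Return the smallest n with P(Gamma <= n | D) >= alpha / (1 + alpha).

    posterior_pmf[n] is assumed to hold P(Gamma = n | D) for n = 0, 1, ..., T.
    Any probability mass not listed is read as "no change up to T".
    Returns None when the threshold is never reached (no change declared).
    """
    threshold = alpha / (1.0 + alpha)
    # accumulate() yields the running sum, i.e. the test statistic P(Gamma <= n | D).
    for n, test_statistic in enumerate(accumulate(posterior_pmf)):
        if test_statistic >= threshold:
            return n
    return None


# Hypothetical posterior over changepoint times 0..4 (0.10 mass left on "no change yet").
pmf = [0.05, 0.05, 0.10, 0.50, 0.20]
print(detect_change(pmf, alpha=1.0))   # threshold 0.5
print(detect_change(pmf, alpha=99.0))  # threshold 0.99, never reached here
```

A larger threshold parameter makes the test more conservative: the declaration happens later, or not at all, in exchange for a smaller bound on the false alarm probability.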

Proof. In quickest change detection, our goal at time T is to decide if a change should be declared for some n ≤ T for a particular data source. To this end, we can use the following change detection test:

\[
\gamma_{S_i}(\alpha_i) = \inf\left\{ n : \log \frac{P(\Gamma_{S_i} \le n \mid D^{(T)})}{P(\Gamma_{S_i} > n \mid D^{(T)})} \ge \log(\alpha_i) \right\},
\]

which is equivalent to the following test:

\[
\gamma_{S_i}(\alpha_i) = \inf\left\{ n : P(\Gamma_{S_i} \le n \mid D^{(T)}) \ge \frac{\alpha_i}{1+\alpha_i} \right\}.
\tag{6.7}
\]

Intuitively, the above test declares the change for the ith target source Si at the smallest time n for which the test statistic (i.e., the posterior probability of the change point random variable being no greater than n) reaches a threshold. The probability of false alarm for the above test can be bounded in terms of the threshold αi as:

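\[
\begin{aligned}
P(\gamma_{S_i} \le \Gamma_{S_i}) &= \sum_{D^{(T)}} \sum_{n} P(D^{(T)}, \gamma_{S_i} = n)\, P(\Gamma_{S_i} > n \mid D^{(T)}, \gamma_{S_i} = n) \\
&\overset{(d)}{\le} \underbrace{\sum_{D^{(T)}} \sum_{n} P(D^{(T)}, \gamma_{S_i} = n)}_{=1} \left( \frac{1}{1+\alpha_i} \right) \\
&= \frac{1}{1+\alpha_i},
\end{aligned}
\tag{6.8}
\]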

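where (d) follows from the fact that, given the observed data and the event γSi = n (i.e., the change is declared at n), equation 6.7 implies

\[
P(\Gamma_{S_i} > n \mid D^{(T)}, \gamma_{S_i} = n) \le \frac{1}{1+\alpha_i}.
\]

Let us denote the test statistic (TS) for a data source X as:

\[
TS_X(D^{(T)}) = P(\Gamma_X \le n \mid D^{(T)}).
\]

Then the multi-level change detection test is:

\[
\begin{aligned}
\gamma_{S_i}(\alpha_i) &= \inf\left\{ n : TS_{S_i}(D^{(T)}) \ge \frac{\alpha_i}{1+\alpha_i} \right\}, \quad i = 1, \ldots, I \\
\gamma_{K_j}(\beta_j) &= \inf\left\{ n : TS_{K_j}(D^{(T)}) \ge \frac{\beta_j}{1+\beta_j} \right\}, \quad j = 1, \ldots, J \\
\gamma_{E}(\lambda) &= \inf\left\{ n : TS_{E}(D^{(T)}) \ge \frac{\lambda}{1+\lambda} \right\}
\end{aligned}
\]

□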

From Theorem 6.2, we can infer the following boundedness property of Modified-PFA, expressed in the following lemma.

Lemma 6.3. If we define α ≜ mini(αi) and β ≜ maxj(βj), then Modified-PFA in equation 6.4 can be bounded as:

\[
\text{Modified-PFA}(\gamma) \le I \times \frac{1}{1+\alpha} + \frac{1}{1+\beta} + \frac{1}{1+\lambda}
\tag{6.10}
\]
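As a quick numeric illustration of the bound in (6.10), the following Python sketch evaluates its right-hand side for a given choice of thresholds. The function name `modified_pfa_bound` and the threshold values are hypothetical and chosen only for illustration.

```python
def modified_pfa_bound(alphas, betas, lam):
    """Right-hand side of (6.10): I/(1+alpha) + 1/(1+beta) + 1/(1+lambda).

    alphas: per-target thresholds alpha_i (alpha is their minimum).
    betas:  per-surrogate thresholds beta_j (beta is their maximum).
    lam:    threshold for the sum-of-targets.
    """
    alpha = min(alphas)  # worst case over targets
    beta = max(betas)    # best case over surrogates
    num_targets = len(alphas)
    return num_targets / (1.0 + alpha) + 1.0 / (1.0 + beta) + 1.0 / (1.0 + lam)


# Hypothetical thresholds: two targets, two surrogates.
bound = modified_pfa_bound(alphas=[99.0, 199.0], betas=[9.0, 19.0], lam=99.0)
print(round(bound, 6))  # 2/100 + 1/20 + 1/100
```

Because the target term is multiplied by I, the target thresholds typically have to grow with the number of targets to keep the overall bound at a fixed level.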

6.2 HQCD for Count Data via Surrogates

In this section we discuss the HQCD framework for count data sources that may be observed in real life. For example, we can analyze the number of protests to detect protest uprisings early via surrogate sources. Protests can happen in civil society for various reasons, such as protests against a fare hike or protests demanding more job opportunities. Such protests, especially major changes in protest base levels, are potentially interlinked. However, explaining such interactions is a non-trivial process. Ramakrishnan et al. [48] found several social sources, especially Twitter chatter, to capture protest-related information. We apply HQCD to find significant changes in protests concurrent with changes in Twitter chatter. Here, detecting changes in protests accurately is of primary importance, in contrast to changes in the chatter, which can be influenced by a range of factors, including protests. In general, HQCD can be applied to similar events, such as disease outbreaks, to find significant changes in targets using information from noisy surrogates.

Algorithm 1: HQCD Multi-level Change Point Detection Algorithm

Input: At time T, target and surrogate sources D(T) = (S(T), K(T))
Parameters: PFA thresholds for targets (α), surrogates (β), and sum-of-targets (λ)
Output: Changepoint decisions γS, γK, γE at each time point T

for each T do
    Update joint posterior P(ΓK, ΓS, ΓE | D(T))
    // target change detection
    for i ← 1 to I do
        Compute target marginal P(ΓSi | D(T))
        Find γSi(α) using 6.6a
    γS ← {γS1(α), . . . , γSI(α)}
    // surrogate change detection
    for j ← 1 to J do
        Compute surrogate marginal P(ΓKj | D(T))
        Find γKj(β) using 6.6b
    γK ← {γK1(β), . . . , γKJ(β)}
    // sum-of-targets change detection
    Compute sum-of-targets marginal P(ΓE | D(T))
    Find γE(λ) using 6.6c
    Return decisions γS, γK, γE(λ) at T

6.2.1 Hierarchical Model for Count Data

In general, HQCD can be applied to any count data source. However, the exact specification may depend on the application. For example, considering protest uprisings, we first note that surrogate sources such as Twitter are in general noisy and involve a complex interplay of several factors, one of which could be protest uprisings. Furthermore, for protest uprisings, we are more concerned with using the surrogates (Twitter chatter) to help declare changes at the target level (protest counts) than with accurately identifying the changes in surrogates. Thus, without loss of generality, we model the surrogates as i.i.d. variables. Figure 6.3 evaluates the i.i.d. assumptions for both protest counts and Twitter chatter. Our results indicate that Log-normal is a reasonable fit for Twitter chatter.

[Figure 6.3 (plots omitted). Panel (a), Twitter keyword counts, with fit parameters listed in panel order: Pre 2013-04-01 (LogNorm fit, M = 409126.93, s = 1.77); 2013-04-01 - 2013-09-01 (LogNorm fit, M = 574497.65, s = 1.48); Post 2013-09-01 (Norm fit, µ = 730181.74, σ = 178042.92); Full Data (LogNorm fit, M = 565888.93, s = 1.60). Panel (b), protest counts: Pre 2013-05-25, 2013-05-25 - 2013-10-20 and Post 2013-10-20 (each with a LogNorm fit), and the full period.]

Figure 6.3: Histogram fit of (a) surrogate source (Twitter keyword counts) and (b) target source (number of protests of different categories), for various temporal windows, under i.i.d. assumptions. These assumptions lead to a satisfactory distribution fit, at a batch level, for both sources. The top-most row corresponds to the period before the Brazilian Spring (pre 2013-05-25), the second row is for the period 2013-05-25 to 2013-10-20, and the third is for the period after 2013-10-20. The last row shows the fit for the entire period. These temporal fits are indicative of significant changes in distribution along the Brazilian Spring timeline, for both target and surrogates.

Surrogate Sources: Formally, we assume that the jth surrogate source Kj is generated i.i.d. from a distribution fK w.r.t. the associated changepoint ΓKj as:

\[
k_j(t) \overset{\text{i.i.d.}}{\sim}
\begin{cases}
f_K(\phi^{K_j}_0) & t \le \Gamma_{K_j} \\
f_K(\phi^{K_j}_1) & t > \Gamma_{K_j}
\end{cases}
\tag{6.11}
\]

where $\phi^{K_j}_0$ and $\phi^{K_j}_1$ are the pre- and post-change parameters. Following our earlier discussion, we select fK as Log-normal (with location and scale parameters φKj = {cKj, dKj}) for Twitter counts.
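The surrogate model in equation 6.11 can be illustrated by simulating one surrogate series with a single changepoint. This is a minimal sketch using only the Python standard library. It assumes that the location and scale parameters map to the `mu` and `sigma` arguments of `random.lognormvariate`, and the function name and parameter values are hypothetical.

```python
import random


def simulate_surrogate(num_steps, changepoint, pre_params, post_params, seed=0):
    """Draw k_j(t), t = 1..num_steps, i.i.d. Log-normal on each side of the changepoint.

    pre_params / post_params are (location, scale) pairs of the underlying normal
    (an assumed parameterization, chosen for illustration).
    """
    rng = random.Random(seed)
    series = []
    for t in range(1, num_steps + 1):
        # Pre-change parameters apply up to and including the changepoint, as in (6.11).
        mu, sigma = pre_params if t <= changepoint else post_params
        series.append(rng.lognormvariate(mu, sigma))
    return series


# Hypothetical example: 50 time steps with a change after t = 30.
series = simulate_surrogate(50, changepoint=30, pre_params=(1.0, 0.1), post_params=(5.0, 0.1))
print(len(series), min(series) > 0)
```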

Target Sources: Target sources can in general depend on both the past values of targets and the surrogates. Here, we restrict the target source process to be a first-order Markov process. Under this assumption, we formalize the ith target source Si as following a Markov process $f^S_t$ w.r.t. its changepoint ΓSi as:

\[
s_i(t) \sim
\begin{cases}
f^S_t(\phi^{S_i}_0(t)) & t \le \Gamma_{S_i} \\
f^S_t(\phi^{S_i}_1(t)) & t > \Gamma_{S_i}
\end{cases}
\tag{6.12}
\]

where $\phi^{S_i}_0$ and $\phi^{S_i}_1$ are the pre- and post-change parameters of the process. A Poisson process with dynamic rate parameters has been shown [8] to be effective in specifying hierarchical count data w.r.t. changepoints. Here, we model the rate parameters as a nested autoregressive process [22, 8] given as:

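\[
\begin{aligned}
\phi^{S_i}_{0/1}(t) &= \phi^{S_i}_{0/1}(t-1) + \frac{A^i_{0/1}(t)}{|A^i_{0/1}(t)|} \begin{pmatrix} S(t-1) \\ K(t-1) \end{pmatrix} + \mathcal{N}(0, \sigma_S) \\
A^i_{0/1}(t) &= A^i_{0/1}(t-1) + \mathcal{N}(0, \Sigma_{A^i})
\end{aligned}
\tag{6.13}
\]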

Here, $\phi^{S}_{0/1}(t)$ captures the latent rate and σS denotes the error variance. $A^i_{0/1}(t)$ captures the variation due to the observed values of the target and surrogate sources.

Changepoint Priors: Following our prior discussion, surrogate changepoints can be assumed to have an uninformative prior, and we model ΓKj via a memoryless arrival distribution (a static probability of observing a change given that it has not occurred earlier) as:

\[
\Gamma_{K_j} \sim \mathrm{Geom}(\rho_{K_j}) \;\Rightarrow\; P(\Gamma_{K_j} = t \mid \Gamma_{K_j} \ge t) = \rho_{K_j}
\tag{6.14}
\]

Conversely, target changepoints can be influenced by surrogate changepoints, as their generative process is dependent on the surrogates. Specifically, whenever we observe a changepoint in the surrogates, we assume that the base rate of changepoint for a target increases for a certain period of time. Formally, target changepoint priors are assumed to follow a dynamic process as:

\[
\begin{aligned}
\Gamma_{S_i} &\sim \mathrm{Geom}(\rho_{S_i}(t)) \\
\rho_{S_i}(t) &= \rho_{S_i} + \sum_j \mathbb{I}(\Gamma_{K_j} < t)\, \mu_{1j}\, e^{-\mu_{2j}(t - \Gamma_{K_j})}
\end{aligned}
\tag{6.15}
\]

where I is the indicator function and ρSi represents the nominal base rate for the changepoint. As can be seen, a change in the jth surrogate source is modeled as an exponentially decaying 'impulse' of amplitude µ1j. The summation of targets, E(t), is known deterministically given Si(t). Moreover, given Si(t − 1), E(t) can be considered a summation of independent Poisson processes following dynamics similar to equation 6.13, which is omitted due to limited space. Similarly, the changepoint ΓE can be modeled as dependent on K, analogous to equation 6.15.
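The dynamic base rate in equation 6.15 can be made concrete with a short Python sketch. The function name `target_hazard`, the use of `None` for a surrogate that has not yet changed, and the clipping of the rate at 1 are assumptions made for this illustration; they are not stated in the text.

```python
import math


def target_hazard(t, base_rate, surrogate_changepoints, mu1, mu2):
    """Evaluate rho_{S_i}(t) = rho_{S_i} + sum_j I(Gamma_{K_j} < t) mu1_j exp(-mu2_j (t - Gamma_{K_j})).

    surrogate_changepoints[j] is the changepoint of surrogate j, or None if
    that surrogate has not changed (an assumed convention).
    """
    rate = base_rate
    for gamma_k, amplitude, decay in zip(surrogate_changepoints, mu1, mu2):
        if gamma_k is not None and gamma_k < t:
            # Each past surrogate change adds an exponentially decaying impulse.
            rate += amplitude * math.exp(-decay * (t - gamma_k))
    # Clipping keeps the value a valid Geometric parameter (assumption).
    return min(rate, 1.0)


# Hypothetical example: one surrogate changes at time 10.
for t in (5, 11, 20):
    print(t, round(target_hazard(t, 0.01, [10], mu1=[0.2], mu2=[0.5]), 5))
```

Before the surrogate change the rate stays at its nominal value; immediately afterwards it jumps and then decays back towards the base rate.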

6.2.2 Changepoint Posterior Estimation

Algorithm 1 involves posterior estimation of the changepoints given the data at a particular time point. Earlier work has focused mainly on offline methods such as Gibbs Sampling [8]. Online posterior estimation for such problems has been studied extensively in the context of Sequential Bayesian Inference [9], such as Kalman filters [26, 55, 2] (Gaussian transitions) and Particle Filters [18, 47, 20]. Recently, Chopin et al. [16] proposed a robust Particle Filter, SMC², which is well suited for fitting the parameters of the non-linear hierarchical model described in Section 6.2.1. In this section we formulate a Sequential Bayesian Algorithm that makes HQCD tractable under real-world constraints (see Figure 6.4).

[Figure 6.4 (bar chart omitted). X-axis: Dataset (Simulated, Brazil, Venezuela, Uruguay); y-axis: Time (in min); series: Gibbs Sampling, HQCD, HQCD without surrogates.]

Figure 6.4: Computation time for one complete run of changepoint detection (in mins) on a 1.6 GHz quad-core Intel i5 processor with 8 GB RAM: Gibbs sampling [8] vs HQCD vs HQCD without surrogates. Gibbs sampling computation times are unsuitable for online detection.

To find the posterior P(ΓS, ΓK, ΓE | D(T)) at any time T using SMC², we first cast the model parameters and variables into the following three categories:

Observations (yT): In the context of SMC², these are the parameters that correspond to observed variables at each time point T. For HQCD we can model yT as:

\[
y_T \triangleq \{S(T), K(T)\}
\tag{6.16}
\]

Hidden States (xT): SMC² estimates the observations based on interaction with hidden states, which are dynamic and unobserved, and are sufficient to describe yT at T. For HQCD, we can express xT as follows:

\[
x_T \triangleq \{\Gamma_S, \Gamma_K, \Gamma_E, \phi^{S}_{0/1}(T-1), \phi^{K}_{0/1}, \rho_K(T), \rho_S(T), A_{0/1}, S(T-1), K(T-1)\}
\tag{6.17}
\]

Static Parameters (θ): Finally, SMC² also accommodates the concept of static parameters which do not change over time, such as the base probabilities of changepoint ρS and the noise matrix ΣA in HQCD. We can express θ as:

\[
\theta \triangleq \{\sigma_S, \Sigma_A, \rho_S, \mu_1, \mu_2\}
\tag{6.18}
\]

For a given set of such parameters, SMC² works by first generating Nθ samples of θ using the prior distribution P(θ). For each of these samples of θ, SMC² draws Nx samples of x0 from its prior P(x0|θ). Following standard practice, we use conjugate distributions [9] for the priors.

Algorithm 2: HQCD Changepoint Posterior Estimation via SMC²

Input: At time T, yT as given in equation 6.16
Parameters: Prior distributions P(θ) and P(x0|θ); hyperparameters for P(θ) and P(x0|θ)
Output: Joint posterior P(ΓK, ΓS, ΓE | D(T))

Define xT as given in equation 6.17
Define θ as given in equation 6.18
// Initialization
Sample Nθ number of θq using P(θ)
Sample Nx number of x0q,r using P(x0|θq)
Update weights w(0) // See Appendix
// Online learning
for each T do
    // State updates
    for each q ∈ Nθ do
        for each r ∈ Nx do
            Update states: xTq,r from xT−1q,r
            Compute importance weights wq,r(T)
        Compute observation probability P(yT | yT−1, θq)
        // Incorporate observation at time T
        Update importance weight wq,r(T) ← wq,r(T) P(yT | yT−1, θq)
    // Test premature convergence
    Test degeneracy conditions using effective sample size
    if degeneracy then
        // Markov kernel jumps
        Update xTq,r by multiplying a Markov kernel KT
        // Recomputing weights
        Exchange xTq,r and set wq,r ∝ 1
    // Find joints
    Return: update P(ΓS, ΓK, ΓE | D(T)) using equation 6.19
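The degeneracy test in Algorithm 2 relies on the effective sample size (ESS) of the importance weights. The sketch below uses the standard ESS formula, (Σw)² / Σw². The function names and the threshold of half the particle count are assumptions for illustration, since the text does not state the cutoff used.

```python
def effective_sample_size(weights):
    """Standard ESS estimate for (possibly unnormalized) importance weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def is_degenerate(weights, fraction=0.5):
    """Flag degeneracy when ESS falls below a fraction of the particle count.

    The default fraction of 0.5 is a common convention, assumed here.
    """
    return effective_sample_size(weights) < fraction * len(weights)


uniform = [0.25, 0.25, 0.25, 0.25]    # all particles contribute equally
collapsed = [0.97, 0.01, 0.01, 0.01]  # one particle dominates
print(effective_sample_size(uniform), is_degenerate(uniform))
print(round(effective_sample_size(collapsed), 3), is_degenerate(collapsed))
```

When the weights collapse onto a few particles, the ESS approaches 1 and the rejuvenation branch of the algorithm (Markov kernel move followed by weight reset) is triggered.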

Table 6.2: (Synthetic data) Comparison of the true changepoint (Γ) for targets against the changepoint (γ) detected by HQCD and by state-of-the-art methods, in terms of false alarm (FA) and additive detection delay (ADD). Each row represents a target; the best detected changepoint is shown in bold, whereas false alarms are shown in red.

        True   GLRT       WGLRT      BOCPD      RuLSIF     HQCD       HQCD w/o surr.
Target  Γ      γ    ADD   γ    ADD   γ    ADD   γ    ADD   γ    ADD   γ    ADD
S1      29     7    –     10   –     13   –     36   7     33   4     32   3
S2      6      11   5     14   8     16   10    28   22    8    2     9    3
S3      24     7    –     16   –     15   –     29   5     22   –     26   2
S4      26     5    –     11   –     11   –     38   12    27   1     31   5
S5      47     40   –     15   –     8    –     26   –     50   3     55   8

At each time point T, the samples are perturbed using the model equations given in Section 6.2.1 and associated with weights w to estimate the joint posteriors as:

\[
\begin{aligned}
P(\theta, x_T \mid y_T) &= \sum_{q=1}^{N_\theta} \sum_{r=1}^{N_x} w_{q,r}\, \delta(\theta, x_T) \\
P(\Gamma_S, \Gamma_K, \Gamma_E \mid D^{(T)}) &\propto \sum_{q=1}^{N_\theta} \sum_{r=1}^{N_x} w_{q,r}\, \delta(\Gamma_S, \Gamma_K, \Gamma_E)
\end{aligned}
\tag{6.19}
\]

where δ is the Kronecker-delta function. Algorithm 2 outlines the steps involved in this process. For more details on SMC², see the Appendix.
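The weighted sum of delta functions in equation 6.19 amounts to a weighted histogram over the particles. The following Python sketch shows this for the marginal of a single changepoint. The flat list of particles (one changepoint value per particle, pooling the q and r indices) and the function name are simplifying assumptions made for illustration.

```python
from collections import defaultdict


def changepoint_marginal(particle_changepoints, weights):
    """Estimate P(Gamma = g | D) as the normalized total weight of particles at g."""
    mass = defaultdict(float)
    for changepoint, weight in zip(particle_changepoints, weights):
        # The Kronecker delta places each particle's weight on its own changepoint value.
        mass[changepoint] += weight
    total = sum(mass.values())
    return {g: m / total for g, m in sorted(mass.items())}


# Hypothetical particles: four samples of a target changepoint with their weights.
marginal = changepoint_marginal([12, 12, 15, 20], [0.4, 0.1, 0.3, 0.2])
for g, p in marginal.items():
    print(g, round(p, 3))
```

Accumulating this marginal over g ≤ n gives the test statistic used by the threshold tests of Theorem 6.2.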

6.3 Experiments

We present experimental results for both synthetic and real-world datasets, and compare HQCD against several state-of-the-art online change detection methods (see Table 6.1), specifically GLRT [54], W-GLRT [31], BOCPD [1] and RuLSIF [35]. To further analyze the effects of surrogates in detecting changepoints, we compare against HQCD without surrogates, where K(t − 1) is dropped from equation 6.13 and ρSi(t) is made static (i.e., independent of changepoints from surrogates) in equation 6.15.

6.3.1 Synthetic Data

In this section, we validate against synthetic datasets with known changepoint parameters. For this, we pick 5 targets (I = 5) and 10 surrogates (J = 10). The surrogates were generated from i.i.d. Log-normal distributions (see equation 6.11), while the targets were generated using a Poisson process (see equation 6.12). The changepoints for surrogates were sampled from a fixed Gamma distribution (see equation 6.14), while the associated changepoints for target sources were simulated via equation 6.15.

[Figure 6.5 (plots omitted). Five panels, Target-1 through Target-5; x-axis: time; y-axis: actual value.]

Figure 6.5: Comparison of HQCD against state-of-the-art on simulated target sources. The X-axis represents time and the Y-axis represents the actual value. Solid blue lines refer to the true changepoint, solid green to the ones detected by HQCD, and brown to HQCD without surrogates. Dashed red, magenta, purple and gold lines refer to changepoints detected by RuLSIF, WGLRT, BOCPD and GLRT, respectively. HQCD shows better detection for most targets, with low overall detection delay and few false alarms.

Comparisons with state-of-the-art

As true changepoints are known for the synthetic dataset, we can compare HQCD against the state-of-the-art methods on the detected changepoints, as shown in Figure 6.5. Table 6.2 presents the results in terms of false alarm (FA) and additive detection delay (ADD). From the table, we can see that HQCD is able to detect the changepoints with fewer false alarms. HQCD also has the lowest delay across all methods for all targets except Target-1, for which HQCD without surrogates achieved a better delay, indicating that the surrogates are not informative for this target source.

Usefulness of Surrogates

Our comparisons with the state-of-the-art show significant improvements achieved by HQCD, both in terms of FA and ADD, and showcase the importance of systematically admitting surrogate information to attain quicker change detection with low false alarm. We compare HQCD with surrogates against HQCD without surrogates (Table 6.2) and find that admitting surrogates significantly improves the average delay (2.5 compared to 4.2). We also plot the average false alarm rate against the detection delay in Figure 6.6 and find that HQCD results generally offer the best trade-off between FA and ADD.

[Figure 6.6 (plot omitted). Axis regions: False Alarm and Detection Delay; methods shown: HQCD Without Surrogates, HQCD, BOCPD, RuLSIF, GLRT, W-GLRT.]

Figure 6.6: False Alarm vs Delay trade-off for different methods. HQCD shows the best trade-off.

6.3.2 Real life case study

In real-life scenarios, the true changepoint is typically unknown. One representative example is the onset of major civil unrest related protests and uprisings. We present an analysis of three major uprisings: (i) in Brazil around mid 2013 (often termed the Brazilian Spring), (ii) in Venezuela around early 2014, and (iii) in Uruguay around late 2013. We first describe the data collection procedure and follow up with a comparative analysis of detected changepoints.

Weekly counts of civil unrest events from Nov. 2012 to Dec. 2014 were obtained as part of a database of discrete unrest events (Gold Standard Report, GSR) prepared by human analysts by parsing news articles for civil unrest content. Among other annotations, the GSR classifies each event into one of 6 possible event types based on the reason ('why') behind the protest: a) Employment and Wages, b) Housing, c) Energy and Resources, d) Other government, e) Other economic and f) Other. Each of these event types bears a certain societal importance. We treat the weekly counts of each of these event types as target sources (S) and the sum total of all protests for a week as the sum-of-targets (E). We also collected geo-fenced tweets for each country over the same time period. We used a human-annotated dictionary of 962 keywords/phrases that contains several identifiers of protest in the languages spoken in the countries of interest (similar to Ramakrishnan et al. [48]). As most of these keywords could have similar trends, we cluster them using k-means into 30 clusters (i.e., we have J = 30 surrogates). To account for scaling effects while preserving temporal coherence, each keyword time series was normalized to zero mean and unit variance.
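The normalization step mentioned above can be sketched in a few lines of Python. The function name `z_normalize`, the use of the population standard deviation, and the handling of constant series are assumptions for illustration; the clustering step itself is not shown.

```python
import statistics


def z_normalize(series):
    """Rescale a keyword time series to zero mean and unit variance.

    Uses the population standard deviation (an assumed choice). A constant
    series has no variance to rescale and is mapped to all zeros.
    """
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return [0.0 for _ in series]
    return [(x - mean) / std for x in series]


# Hypothetical weekly counts for one keyword.
weekly_counts = [120, 135, 128, 400, 390, 410]
normalized = z_normalize(weekly_counts)
print([round(v, 3) for v in normalized])
```

Because the rescaling is applied per series and preserves the ordering in time, keywords with very different volumes become comparable by shape before clustering.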

[Figure 6.7 (plots omitted). Panels: (a) Brazil Total Protests (Jan to Sep 2013), (b) Venezuela Total Protests (Jan to Mar 2014), (c) Uruguay Total Protests (Dec 2013 to Jan 2014); y-axis: Event Counts.]

Figure 6.7: Comparison of detected changepoints at the sum-of-targets (all protests). HQCD detections are shown in solid green, while those from the state-of-the-art methods, i.e., RuLSIF (red), WGLRT (magenta), BOCPD (purple) and GLRT (gold), are shown with dashed lines. HQCD detection is the closest to the traditional start date of mass protests in the three countries studied.

Changepoint Across layers

We show the changepoints detected by HQCD (bold green) and the state-of-the-art methods (dashed lines) for the sum of all protests in Figure 6.7 (see Figure C.1 in Appendix C for individual protest types). We can observe that HQCD, which uses the surrogate information sources and exploits the hierarchical structure, finds indicators of change that are visually better as well as more aligned with the dates of major events (see the demo at https://prithwi.github.io/hqcd_supplementary). In contrast, the state-of-the-art methods arguably show a significantly higher false alarm rate. For such real-world data sources, the notion of a true changepoint is difficult to ascertain. We can instead consider, for example, the onset of the Brazilian Spring protests (2013-06-01) as an underlying changepoint to compare against at the sum-of-targets and to interpret notions of false alarm. Table C.1 tabulates these inferences for the targets as well as the sum-of-targets. Although a true changepoint is unknown, we note that for HQCD the expected additive detection delay (EADD) can be estimated according to equation 6.2 (from P(Γ|D(T)) in Algorithm 2).

Changepoint influence analysis

[Figure 6.8 (heatmaps omitted). Panel (a): influence of lagged targets on current targets; both axes list the six protest types (Employment, Energy and Resources, Housing, Other, Other Economic, Other Government). Panel (b): influence of lagged surrogates on current targets; rows list the six protest types and columns the surrogate clusters 0 to 29. Color scale: 0 to 40.]

Figure 6.8: (Brazilian Spring) Heatmap of changepoint influences of targets on targets (a) and surrogates on targets (b). Darker (lighter) shades indicate higher (lesser) changepoint influence. (a) shows the presence of strong off-diagonal elements, indicating strong cross-target changepoint information. (b) shows a mixture of uninformative and informative surrogates.

The experiments presented in the previous section can be further analyzed to ascertain the nature of the progression of significant events that lead to a protest. Here we present our analysis for the Brazilian Spring. We found that the detected changepoints (see Table C.1 in Appendix C) for Brazil reveal an interesting progression: significant changes in Energy related unrest (06/02) propagated to Housing/Other Govt. unrest (06/16) and culminated in mass Employment related unrest (08/18). Interestingly, we can analyze the fitted parameters of the weight vector $A^i_{0/1}$ of the rate updates (see equation 6.13) to quantify the changepoint influence of a source (target/surrogate) at time T − 1 on a target at time T. For each target Si, we can compute the average value of the weight vector component of each target/surrogate separately. Let h0 and h1 denote these averages for one such source. Effectively, h0 then measures the effect of the source at time t − 1 on Si at t before the change, while h1 captures the same after the change. Their percentage relative change can then be used as a measure of the changepoint influence of a particular target/surrogate source on Si. We plot a heatmap of these percentages in Figure 6.8 for targets and surrogates separately. From Figure 6.8a, we can see that 'Other Economic' and 'Employment' related protests had strong influences from 'Housing' related protests. Furthermore, from Figure 6.8b we can see that 'Housing' and 'Employment' related protests were influenced by similar Twitter chatter clusters (cluster-01 and cluster-26), indicating that the interaction between these protest subtypes can be inferred from the social domain. Conversely, 'Housing' and 'Other Economic' related protests are only weakly correlated through Twitter chatter, thus exhibiting the robustness of HQCD, which can still detect interactions between targets when surrogates fail to explain them. In general, for a particular target we can see linked precursors in other targets (strong off-diagonal elements in Figure 6.8a) and highly specific informative surrogates (few strong cells per row in Figure 6.8b).
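The changepoint influence measure described in this analysis can be sketched as follows. The text does not give the exact formula, so this sketch assumes the percentage relative change is 100 × (h1 − h0) / |h0|; the function name and the example weights are hypothetical.

```python
def changepoint_influence(pre_change_weights, post_change_weights):
    """Percentage relative change between average pre- and post-change weights.

    h0 and h1 are the averages of one source's weight component before and
    after the change. The formula 100 * (h1 - h0) / |h0| is an assumption;
    it is undefined when h0 is zero.
    """
    h0 = sum(pre_change_weights) / len(pre_change_weights)
    h1 = sum(post_change_weights) / len(post_change_weights)
    return 100.0 * (h1 - h0) / abs(h0)


# Hypothetical fitted weights of one lagged source on one target.
print(changepoint_influence([2.0, 2.0], [3.0, 3.0]))  # 50.0
```

Computing this value for every source and target pair gives one cell of the heatmap per pair.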

6.4 Discussion

We have shown HQCD to be an effective framework for detecting changepoints in an online manner while accommodating multiple sources in a hierarchical framework. HQCD has been validated against both synthetic sources and real-life scenarios. In the next chapter, we present our efforts at utilizing these changepoints towards robust forecasting models.

Supporting Information: A demo of HQCD and the datasets used in this chapter can be found at https://prithwi.github.io/hqcd_supplementary. The attached appendix provides additional details on SMC².


Chapter 7

Concept Drift Adaptation for Google Flu Trends

Early detection of disease outbreaks can lead to prompt response strategies and effective implementation of counter-measures. Syndromic surveillance mechanisms hold great promise in improving lead time to detection. Google Flu Trends (GFT) was one of the most celebrated examples of syndromic surveillance and emerged as one of the most popular mechanisms involving non-clinical data. Recent work, including at Google, has shown that systems like GFT, just like other surveillance and forecasting strategies, require periodic re-training and adaptation every year. In particular, GFT estimates tend to be locally spiky in nature, which often leads to difficulties in regression w.r.t. CDC ILI surveillance data. In addition to local variations, we posit that the fundamental cause of major seasonal performance variations of GFT is dynamic patterns in user search behavior. Such a phenomenon can be analyzed under the framework of concept drift. Our proposed approach is to explicitly model concept drift so as to make ILI estimates from surrogate sources such as GFT more robust, in an online manner.

7.1 Background

Google Flu Trends first came into the limelight with Ginsberg et al.'s seminal work [24] on mining indicators for disease surveillance from search engine query data. This work has spurred a flurry of research in this domain, such as [38]. GFT ILI estimates were available for several countries and regions, and can be used by epidemiologists to gain quick insight into the prevalent influenza state. However, as noted in some recent studies such as [6, 33, 32], GFT is under-performing against official surveillance data. In spite of updates to the GFT system, which attempt to rescale search query terms in response to sudden spikes in the search data, the drifting performance issue hasn't been completely resolved [32].

[Figure 7.1 (plots omitted). Panels: GFT-Argentina (left) and Rolling mean (right).]

Figure 7.1: Evidence of Concept Drift. In Google Flu Trends data for Argentina (left), the corresponding 52-week rolling mean (right) exhibits a saddle point in early 2012, indicating a possible mean-shift drift in GFT for Argentina.

Parts I and II have outlined our efforts at short-term and long-term forecasting of infectious diseases using surrogate sources. Such forecasts were also generated for the IARPA OSI nationwide challenge, for which our winning team developed Early Model Based Event Recognition using Surrogates (EMBERS) [48], an automated continuous surveillance and predictive system that monitors, among other things, epidemic and rare disease outbreaks. During this effort we came to better understand the inherent drift in surrogate-target relationships for diseases, and we had to continuously monitor and adapt our models, paying equal attention to robustness and efficacy. From this experience, we learned that the effective usage of open source data, in the presence of ever-changing data patterns, necessitates incorporating adaptivity in models. Specifically, for ILI we have been monitoring a set of keywords in several media such as Google search data, news and Twitter, and found evidence of evolving correlations of such keyword counts with surveillance data [12].

We have also been closely collaborating with CDC for the past three years, providing forecasts of US national and region-level ILINet percentages as well as seasonal indicators such as peaks. Such efforts led us to run a market for flu predictions under Scicast (https://scicast.org/flu). These experiences corroborate our EMBERS observations, and we have made similar observations about ILI disease surveillance in general.

[Figure 7.2 (diagram omitted). Block labels: GST Data, GFT Data, Weather, HealthMap, Surveillance Data, Concept Drift Detector, Target/Source Mismatch, Drift Adaptation, Drift Adaptive Resampling, Model Retargetting, Robust Forecasts.]

Figure 7.2: Concept Drift Adaptation Framework. The framework ingests target sources such as CDC ILI case count data and surrogate sources such as GFT, and detects changepoints via the 'Concept Drift Detector' stage. Drift probabilities are then passed on to the 'Drift Adaptation' stage, where robust predictions are generated using resampling-based methods.

Focusing on GFT, we conducted experiments for six Latin American countries, namely Argentina, Bolivia, Chile, Mexico, Peru and Paraguay. Figure 7.1 shows the GFT data for Argentina (from 2010 to 2014) and the corresponding rolling mean (over a 52-week window). As can be seen, the rolling mean indicates that the average activity of flu trends showed a major shift around 2012. Apart from this major change, other local shifts in the mean can also be observed. Rolling statistics over standard deviation and kurtosis provide similar insights. In general, a combination of these measures indicates that the GFT data distribution is non-stationary. From a machine learning perspective, such non-stationarity in the independent variables leads to varying statistical correlation with the target variable (here, official surveillance data), also referred to as concept drift. Concept drift is known to cause predictions to become less accurate over time, and identification and handling of such drifts can yield significant improvements in models. We observed similar trends for concept drift in GFT data for the other five Latin American countries.
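The rolling-mean diagnostic used above to expose non-stationarity can be sketched in plain Python. The function name `rolling_mean` and the toy series are hypothetical; the window of 52 matches the weekly window mentioned in the text.

```python
def rolling_mean(series, window):
    """Mean over each trailing window of the given length (full windows only)."""
    means = []
    running_sum = 0.0
    for i, value in enumerate(series):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= series[i - window]
        if i >= window - 1:
            means.append(running_sum / window)
    return means


# Hypothetical weekly series with a level shift halfway through (3 years of data).
series = [10.0] * 78 + [25.0] * 78
means = rolling_mean(series, window=52)
print(means[0], means[-1])  # mean level before and after the shift
```

A stationary series would give a roughly flat rolling mean; a sustained rise or fall, as in this toy example, points to a mean shift of the kind seen in Figure 7.1.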

7.2 Robust Models via Concept Drift Adaptation

Concept drift is an actively studied problem, and researchers have proposed many different methods to handle it [23]. Some of the more popular methods focus on ensemble models, where ensembles can be created either at the model level or via random resampling of data points to constitute a drift-adapted dataset, which can then be passed on to machine learning algorithms. We focus on the random resampling approaches, aiming for a computationally inexpensive and generic approach, and propose a two-step formalism to handle concept drift towards a robust GFT estimate.

First, we detect concept drifts in the surrogate-target data relationships using an online nonparametric changepoint detection test (see Chapter 6). We use a windowed GLRT approach: we fit a Poisson regression model from the surrogate data sources to the ILI surveillance data and analyze the regression errors (slack) for changes in distribution. Following the classical CUSUM test and our experiences (see Chapter 6), we apply a rolling window over the series of slacks and identify change points based on log-likelihood ratios. These log-likelihood ratios can then be used as probabilities of concept drift for each time point, and we can use weighted resampling of past data, where the weight for sampling time point t is given as:

\[
w_t = \frac{1 - L_{\text{drift}}(t)}{\sum_{t'} \left(1 - L_{\text{drift}}(t')\right)}
\tag{7.1}
\]

where Ldrift(t) quantifies the drift at time t in terms of the likelihood of a change at that time point.

The second component involves fitting a Poisson regression once more, this time on the resampled dataset, to find updated model parameters and generate the adapted GFT estimates. We use random resampling without replacement with the drift probabilities from equation 7.1 and fit our Poisson regression model on the resampled data.

The framework is outlined in Figure 7.2. We can also employ a feedback mechanism where past accuracies of adapted GFT w.r.t. ILI surveillance data are used to update the computed log-likelihood for drift.
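The resampling weights of equation 7.1 and the subsequent weighted draw without replacement can be sketched as follows. This is an illustrative sketch, not the thesis implementation: it assumes the drift likelihoods have already been scaled to [0, 1], and it uses the Efraimidis-Spirakis key method as one possible way to sample without replacement. All function names are hypothetical.

```python
import math
import random


def drift_weights(drift_likelihood):
    """Equation 7.1: w_t = (1 - L(t)) / sum over t' of (1 - L(t'))."""
    raw = [1.0 - value for value in drift_likelihood]
    total = sum(raw)
    return [r / total for r in raw]


def resample_without_replacement(weights, sample_size, seed=0):
    """Weighted sampling without replacement via Efraimidis-Spirakis keys.

    Each index with positive weight gets the key log(u) / w with u uniform in
    (0, 1]; the indices with the largest keys are selected. Zero-weight time
    points are never selected.
    """
    rng = random.Random(seed)
    keyed = []
    for index, weight in enumerate(weights):
        if weight > 0:
            u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
            keyed.append((math.log(u) / weight, index))
    keyed.sort(reverse=True)
    return sorted(index for _, index in keyed[:sample_size])


# Hypothetical drift likelihoods for four past time points.
weights = drift_weights([0.0, 0.5, 1.0, 0.0])
print(weights)
print(resample_without_replacement(weights, sample_size=2))
```

Time points with a high likelihood of drift receive low weight, so the resampled dataset passed to the second Poisson regression is dominated by observations that are consistent with the current surrogate-target relationship.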

7.2.1 Experimental evaluation and comparing Surrogate Sources

The proposed method, outlined in the previous section, can capture drifts using aggregated surrogate activity. Similar to Part I and Part II, we intend to compare different surrogate sources for their drift correction ability. Our results from Part II have indicated that long-term forecasts for Mexico were especially noisy. As such, we focus on Mexico for the following analysis. We applied our drift adaptation framework to the 2014-2015 season, and Table 7.1 presents our findings. As can be seen, incorporating surrogate sources via drift adapters significantly improves forecasting accuracy. GST contributes most significantly towards the drift adaptation, while a combination of all sources produces the best overall forecasting accuracy. Significant drift adaptation can also be seen for HealthMap; however, the low absolute value of its forecasting accuracy renders the HealthMap source insignificant for the country of interest.

Table 7.1: Comparison of surrogate sources pre- and post-drift adaptation.

Source      Pre Drift Correction   Post Drift Correction   Percentage correction
GST         2.801                  3.125                   10.372
GFT         2.533                  2.741                   7.561
HealthMap   1.364                  1.815                   24.874
Weather     2.242                  2.499                   10.271
All         3.082                  3.496                   11.836
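The text does not state how the percentage correction is computed. The short check below tests one plausible reading, namely the gain measured relative to the post-correction score; this formula is an assumption inferred from the printed values, and the function and variable names are illustrative. The small residual differences are presumably due to rounding of the printed scores.

```python
def relative_gain_percent(pre, post):
    """Assumed formula: gain relative to the post-correction score, in percent."""
    return 100.0 * (post - pre) / post


# Values copied from Table 7.1: (pre, post, reported percentage correction).
table_7_1 = {
    "GST": (2.801, 3.125, 10.372),
    "GFT": (2.533, 2.741, 7.561),
    "HealthMap": (1.364, 1.815, 24.874),
    "Weather": (2.242, 2.499, 10.271),
    "All": (3.082, 3.496, 11.836),
}

# Largest absolute gap between the assumed formula and the reported column.
max_gap = max(
    abs(relative_gain_percent(pre, post) - reported)
    for pre, post, reported in table_7_1.values()
)
```

Under this reading, the recomputed values agree with the reported column to within a few hundredths of a percentage point for every source.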


We also plot the quality score and deviance plots for pre drift-corrected and post drift-corrected forecasts for GFT, GST, HealthMap, Weather and all sources in Figures 7.3, 7.4, 7.5, 7.6 and 7.7, respectively. As can be seen, the quality score distribution of forecasts shows a marked improvement, both in terms of higher absolute value and tighter bounds, for post-drift corrected models. The figures also show the distribution of residual deviance. In terms of concept drift, a narrow distribution indicates a well-fitted problem and hence better drift correction, whereas a more spread out deviance distribution indicates a sub-optimal correction. The deviance plots also exhibit the efficacy of our methods: GST and combined sources, in particular, show marked improvement, indicating that these are the best methods of correcting for drift.

7.3 Discussion

We have proposed a computationally inexpensive method of drift adaptation for disease sources for Mexico. Our results indicate that significant improvement in forecasting, as well as modeling, accuracy could be achieved by including surrogates via the proposed framework. Furthermore, a combination of all sources performs best in terms of drift adaptation, thus exhibiting the importance of considering diverse sources. In future, we would extend this analysis to more regions and ascertain the relative importance of such sources with respect to the regions.

Figure 7.3: Drift Adaptation for Mexico using GFT. (a) Quality Score (QS) distribution of forecasts before and after drift correction (corrected vs. uncorrected). (b) Residual Deviance Distribution (Frequency vs. Deviance); drift uncorrected residual deviance: 0.829, drift corrected residual deviance: 0.469.

Figure 7.4: Drift Adaptation for Mexico using GST. Residual deviance histograms (Frequency vs. Deviance); drift uncorrected residual deviance: -0.398, drift corrected residual deviance: -0.250.


Figure 7.5: Drift Adaptation for Mexico using HealthMap. Quality Score (QS) panel and residual deviance histograms (Frequency vs. Deviance); drift uncorrected residual deviance: 6.201, drift corrected residual deviance: 3.650.


Figure 7.6: Drift Adaptation for Mexico using weather sources. Quality Score (QS) panel and residual deviance histograms (Frequency vs. Deviance).


Figure 7.7: Drift Adaptation for Mexico using All sources. Quality Score (QS) panel and residual deviance histograms (Frequency vs. Deviance).

Chapter 8

Conclusion

We have presented the problem of time series prediction using surrogates and motivated our efforts by examining the particular case of influenza forecasting. We identified three major thrusts for this problem, viz. (i) short-term forecasting, (ii) long-term forecasting and (iii) concept drift. We presented our approaches for each of these thrusts in this thesis and communicated our findings in [12, 62, 13]. Our results showcase the efficacy of using surrogates to forecast disease characteristics. In the following section, we discuss the importance of surrogate information available from open source indicators for public health surveillance and conclude with some key insights on how such surrogates can be used towards an integrated surveillance mechanism.

8.1 Importance of Open Source Indicators for Public Health

Our results indicate that open source indicators (OSI) are extremely useful for forecasting various facets of disease characteristics, such as peak intensity and case counts, in the short term. One of the key advantages of using surrogates can be attributed to the real-time nature of such sources as well as their ready availability. However, such surrogates are in general noisy and may exhibit changing relationships with the disease characteristics of interest. For example, the volume of search queries for the term ‘flu’ may have been more indicative of ILI case counts in the population for the years preceding 2011 than post-2012, for the United States. Thus, this work motivates the use of algorithms that in principle are aware of the possibility of such changing patterns and, more importantly, are adaptable to such circumstances. In general, surrogate sources, especially the non-physical ones, can be considered to be ‘sensors’ of disease spread in the population rather than actual indicators of disease characteristics. Surrogates from a particular source (see Part I) may contain more information about a certain stage of the disease spread than about others. For example, Figure 8.1 indicates that disease keywords from the HealthMap news corpus are more indicative during the start of the season, whereas search query volumes as accessed by Google Search Trends exhibit a sub-optimal but stable correlation throughout the season. Thus a single OSI source may not be suitable for robust disease forecasting. However, as seen from Part I and Part II, combining multiple surrogates can lead to a more robust and stable forecasting framework. It can be argued that multiple surrogates may provide better coverage over the different stages of the season. Also, noise such as spikes in search query activity may be better compensated for by using a variety of OSI sources, and a consensus of increased/decreased activity may better inform a forecasting framework.

Another crucial aspect of public health surveillance is the fact that the ‘ground truth’ information available at a particular point in time is subject to noise. Consequently, models for disease forecasting should be aware of such noise, which can often be systematic. Such flexibility in modeling is even more important while using OSI sources, as such sources themselves may be subject to noise. In this work, we have shown that forecasting with an ability to model the surveillance uncertainty increases the final forecasting accuracy manifold.

This work has focused on influenza as a primer for endemic disease forecasting. One of the key advantages of using influenza as an application is the fact that it is one of the most common infectious diseases worldwide, exhibits evolving patterns over regions and time and, more importantly, has significant public health impact. In this work, we have found physical sources such as temperature and humidity to be more useful in forecasting influenza conditions in the population. Non-physical sources were found to contribute to the overall forecasting accuracy to varying degrees with respect to countries and disease characteristics. For short-term forecasts, Twitter chatter and disease-related news were found to be significantly useful for a number of countries of interest. For long-term forecasts, surrogates were found to be more useful at the initial stages of the disease season, where disease information from traditional surveillance is sparse and noisier. For the later part of the season, simulation-based models worked better than data assimilation models, and inclusion of surrogates improved the overall forecasting accuracy to a lesser degree.

Figure 8.1: Correlation of surrogate sources with disease incidence. Count of influenza-related keywords from (a) HealthMap and (b) GST compared against influenza case counts for Argentina as available from PAHO. HealthMap keywords capture the start of the season more accurately, while GST keywords exhibit a sub-optimal but consistent correlation with PAHO counts.

8.2 Guidelines for using surrogates for Health Surveillance

In this section, we combine the insights presented in the previous section with our experience in forecasting infectious diseases into a list of guidelines that may be followed while using surrogates for disease surveillance, as follows:

• Surrogates are more useful for forecasting diseases in regions where historical data for such sources, as well as surveillance data for the said diseases, are available for at least a few disease seasons. For emerging diseases, such surrogates may still be useful but may require different stochastic models than the regression/assimilation based models presented in this work.

• As discussed in Section 8.1, multiple surrogate sources are more useful for disease forecasting than a single source. In general, a variety of heterogeneous sources, such as Twitter chatter and news, which may encode different information about the disease spread, should be preferred over a number of homogeneous sources, such as disease news information from multiple sources.

• Use of OSI must also be justified by being aware of spurious correlation, and the usability of such sources towards a real-time system must be backed by both quantitative (such as forecasting accuracy) as well as qualitative (such as expert insights) measures. For example, use of search queries as surrogates for influenza forecasting can be justified by an improved forecasting performance as well as by recognizing the fact that people may search about flu symptoms and remedies as possible measures of self-diagnosis.

• Finally, surrogates that are available at a regular and steady interval with wide coverage are preferable to other sources that may be available only at sporadic intervals. For example, Twitter chatter, albeit noisy, is preferable from a public health surveillance standpoint to other sources such as telephone surveys.

8.3 Future Work

The research presented in this thesis has been mainly aimed at disease forecasting using data from a single region of interest. In future, we aim to generate spatially aware models such that the increase/decrease of disease incidence in neighboring regions can be used to modulate forecasts for the region of interest. Furthermore, we will expand the frameworks presented in this thesis to simultaneously study multiple diseases with underlying similarities, either via common transmission methods (such as Dengue and Chikungunya) or via similar exposed populations. Such combined studies can lead to more robust forecasts, especially for diseases with sparse data (such as Chikungunya), and provide a deeper understanding of the spread of such diseases.

Bibliography

[1] R. P. Adams and D. J. MacKay. Bayesian Online Changepoint Detection. arXiv preprint arXiv:0710.3742, 2007.

[2] J. L. Anderson. An Ensemble Adjustment Kalman Filter for Data Assimilation. Monthly Weather Review, 129(12):2884–2903, 2001.

[3] A. Apolloni, V. A. Kumar, M. V. Marathe, and S. Swarup. Computational Epidemiology in a Connected World. Computer, 42(12):83–86, 2009.

[4] M. T. Bahadori, Y. Liu, and E. P. Xing. Fast Structure Learning in Generalized Stochastic Processes with Latent Factors. In Proceedings of KDD ’13, 2013.

[5] K. R. Bisset, J. Chen, X. Feng, V. Kumar, and M. V. Marathe. EpiFast: A Fast Algorithm for Large Scale Realistic Epidemic Simulations on Distributed Memory Systems. In Proceedings of the ICS ’09, 2009.

[6] D. Butler. When Google got Flu Wrong. Nature, 494(7436):155, 2013.

[7] J. Canny. Collaborative Filtering with Privacy via Factor Analysis. In Proceedings of SIGIR ’02, pages 238–245, 2002.

[8] B. P. Carlin, A. E. Gelfand, and A. F. Smith. Hierarchical Bayesian Analysis of Changepoint Problems. Applied Statistics, pages 389–405, 1992.

[9] G. Casella and R. L. Berger. Statistical Inference, volume 2. Duxbury Pacific Grove, CA, 2002.

[10] CDC. Influenza (flu). www.cdc.gov/flu/index.htm. Accessed: 2015-09-17.

[11] P. Chakraborty. U.S. Flu Forecasting 2014 - SciCast. https://scicast.org/flu. Last Accessed: 2015-02-20.

[12] P. Chakraborty, P. Khadivi, B. Lewis, A. Mahendiran, J. Chen, P. Butler, E. O. Nsoesie, S. R. Mekaru, J. S. Brownstein, M. V. Marathe, and N. Ramakrishnan. Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions. In Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, Pennsylvania, USA, April 24-26, 2014, pages 262–270, 2014.

[13] P. Chakraborty, S. Muthiah, R. Tandon, and N. Ramakrishnan. Hierarchical Quickest Change Detection via Surrogates. arXiv preprint arXiv:1603.09739, 2016.

[14] Y. Chen, D. Pavlov, and J. F. Canny. Large-Scale Behavioral Targeting. In Proceedings of KDD ’09, 2009.

[15] C. Chew and G. Eysenbach. Pandemics in the Age of Twitter: Content Analysis of Tweets during the 2009 H1N1 Outbreak. PLOS One, 5(11):e14118, 2013.

[16] N. Chopin, P. E. Jacob, and O. Papaspiliopoulos. SMC2: An Efficient Algorithm for Sequential Analysis of State Space Models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):397–426, 2013.

[17] J. L. Crassidis and J. L. Junkins. Optimal Estimation of Dynamic Systems. CRC Press, 2011.

[18] P. Del Moral. Non-linear Filtering: Interacting Particle Resolution. Markov Processes and Related Fields, 2(4):555–581, 1996.

[19] A. Dessein and A. Cont. Online Change Detection in Exponential Families with Unknown Parameters. In F. Nielsen and F. Barbaresco, editors, Geometric Science of Information, volume 8085 of Lecture Notes in Computer Science, pages 633–640. Springer Berlin Heidelberg, 2013.

[20] A. Doucet and A. M. Johansen. A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later. Handbook of Nonlinear Filtering, 12:656–704, 2009.

[21] G. Evensen. The Ensemble Kalman Filter: Theoretical Formulation and Practical Implementation. Ocean Dynamics, 53(4):343–367, 2003.

[22] K. Fokianos, A. Rahbek, and D. Tjøstheim. Poisson Autoregression. Journal of the American Statistical Association, 104(488):1430–1439, 2009.

[23] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A Survey on Concept Drift Adaptation. ACM Computing Surveys (CSUR), 46(4):44, 2014.

[24] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant. Detecting Influenza Epidemics using Search Engine Query Data. Nature, 457(7232):1012–1014, 2008.

[25] K. S. Hickmann, G. Fairchild, R. Priedhorsky, N. Generous, J. M. Hyman, A. Deshpande, and S. Y. Del Valle. Forecasting the 2013–2014 Influenza Season using Wikipedia. arXiv preprint arXiv:1410.7716, 2014.

[26] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Journal of Fluids Engineering, 82(1):35–45, 1960.

[27] K. Denecke, P. Dolog, and P. Smrz. Making Use of Social Media Data in Public Health. In Proceedings of WWW ’12, pages 243–246, 2012.

[28] Y. Koren. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. In Proceedings of KDD ’08, pages 426–434, 2008.

[29] P. Kostkova. A Roadmap to Integrated Digital Public Health Surveillance: The Vision and The Challenges. In Proceedings of WWW ’13, pages 687–694, 2013.

[30] T. L. Lai. Sequential Changepoint Detection in Quality Control and Dynamical Systems. Journal of the Royal Statistical Society. Series B (Methodological), pages 613–658, 1995.

[31] T. L. Lai and H. Xing. Sequential Change-point Detection when the Pre- and Post-change Parameters are Unknown. Sequential Analysis, 29(2):162–175, 2010.

[32] D. Lazer, R. Kennedy, G. King, and A. Vespignani. Google Flu Trends Still Appears Sick: An Evaluation of the 2013-2014 Flu Season. Available at SSRN 2408560, 2014.

[33] D. Lazer, R. Kennedy, G. King, and A. Vespignani. The Parable of Google Flu: Traps in Big Data Analysis. Science, 343(6176):1203–1205, 2014.

[34] K. Lee, A. Agrawal, and A. Choudhary. Real-time Disease Surveillance using Twitter Data: Demonstration on Flu and Cancer. In Proceedings of the KDD ’13, pages 1474–1477, 2013.

[35] S. Liu, M. Yamada, N. Collier, and M. Sugiyama. Change-point Detection in Time-series Data by Relative Density-Ratio Estimation. Neural Networks, 43(0):72–83, 2013.

[36] Y. Liu, M. T. Bahadori, and H. Li. Sparse-GEV: Sparse Latent Space Model for Multivariate Extreme Value Time Series Modelling. In Proceedings of ICML ’12, 2012.

[37] D. Livings. Aspects of the Ensemble Kalman Filter. Reading University Masters Thesis, 2005.

[38] D. J. McIver and J. S. Brownstein. Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time. PLOS Computational Biology, 10(4):e1003581, 2014.

[39] N. Kanhabua and W. Nejdl. Understanding the Diversity of Tweets in the Time of Outbreaks. In Proceedings of WWW ’13, pages 1335–1342, 2013.

[40] E. Nsoesie, M. Marathe, and J. Brownstein. Forecasting Peaks of Seasonal Influenza Epidemics. PLOS Currents, 5, 2013.

[41] E. O. Nsoesie, D. L. Buckeridge, and J. S. Brownstein. Who’s Not Coming to Dinner? Evaluating Trends in Online Restaurant Reservations for Outbreak Surveillance. Online Journal of Public Health Informatics, 5(1), 2013.

[42] H. Ohlsson, L. Ljung, and S. Boyd. Segmentation of ARX-models using Sum-Of-Norms Regularization. Automatica, 46:1107–1111, 2010.

[43] E. Page. Continuous Inspection Schemes. Biometrika, pages 100–115, 1954.

[44] PAHO. Influenza and other Respiratory Viruses. http://ais.paho.org/phip/viz/ed_flu.asp. Accessed: 2015-09-01.

[45] I. Painter, J. Eaton, and B. Lober. Using Change Point Detection for Monitoring the Quality of Aggregate Data. Online Journal of Public Health Informatics, 5(1), 2013.

[46] M. J. Paul, M. Dredze, and D. Broniatowski. Twitter Improves Influenza Forecasting. PLOS Currents, 6, 2014.

[47] M. K. Pitt and N. Shephard. Filtering via Simulation: Auxiliary Particle Filters. Journal of the American Statistical Association, 94(446):590–599, 1999.

[48] N. Ramakrishnan, P. Butler, S. Muthiah, et al. ‘Beating the News’ with EMBERS: Forecasting Civil Unrest Using Open Source Indicators. In Proceedings of the 20th ACM SIGKDD, KDD, pages 1799–1808, New York, NY, USA, 2014. ACM.

[49] J. Shaman, E. Goldstein, and M. Lipsitch. Absolute Humidity and Pandemic Versus Epidemic Influenza. American Journal of Epidemiology, 173(2):127–135, 2010.

[50] J. Shaman and A. Karspeck. Forecasting Seasonal Outbreaks of Influenza. Proceedings of the National Academy of Sciences, 109(50):20425–20430, 2012.

[51] J. Shaman, V. E. Pitzer, C. Viboud, B. T. Grenfell, and M. Lipsitch. Absolute Humidity and the Seasonal Onset of Influenza in the Continental United States. PLOS Biology, 8(2):e1000316, 2010.

[52] W. A. Shewhart. The Application of Statistics as an Aid in Maintaining Quality of a Manufactured Product. Journal of the American Statistical Association, 20(152):546–548, 1925.

[53] A. N. Shiryaev. On Optimum Methods in Quickest Detection Problems. Theory of Probability & Its Applications, 8(1):22–46, 1963.

[54] D. Siegmund and E. Venkatraman. Using the Generalized Likelihood Ratio Statistic for Sequential Detection of a Change-point. The Annals of Statistics, pages 255–271, 1995.

[55] D. Simon. Kalman Filtering with State Constraints: A Survey of Linear and Nonlinear Algorithms. IET Control Theory & Applications, 4:1303–1318(15), August 2010.

[56] R. Sugumaran and J. Voss. Real-time Spatio-temporal Analysis of West Nile Virus using Twitter Data. In Proceedings of COM.Geo ’12, pages 1335–1342, 2012.

[57] J. D. Tamerius, J. Shaman, W. J. Alonso, K. Bloom-Feshbach, C. K. Uejio, A. Comrie, and C. Viboud. Environmental Predictors of Seasonal Influenza Epidemics across Temperate and Tropical Climates. PLOS Pathog., 9(3):68–72, 2013.

[58] M. Tizzoni, P. Bajardi, C. Poletto, J. J. Ramasco, D. Balcan, B. Goncalves, N. Perra, V. Colizza, and A. Vespignani. Real-Time Numerical Forecast of Global Epidemic Spreading: Case Study of 2009 A/H1N1pdm. BMC Medicine, 10(1):165, 2012.

[59] V. V. Veeravalli and T. Banerjee. Quickest Change Detection. Academic Press Library in Signal Processing: Array and Statistical Signal Processing, 3:209–256, 2013.

[60] A. Wald. Sequential Tests of Statistical Hypotheses. The Annals of Mathematical Statistics, 16(2):117–186, 1945.

[61] X. Wang and C. H. Bishop. A Comparison of Breeding and Ensemble Transform Kalman Filter Ensemble Forecast Schemes. Journal of the Atmospheric Sciences, 60(9):1140–1158, 2003.

[62] Z. Wang, P. Chakraborty, S. R. Mekaru, J. S. Brownstein, J. Ye, and N. Ramakrishnan. Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 1285–1294, New York, NY, USA, 2015. ACM.

[63] G. Welch and G. Bishop. An Introduction to the Kalman Filter. Department of Computer Science, University of North Carolina, 2006.

[64] WHO. Surveillance and Monitoring. http://www.who.int/influenza/surveillance_monitoring/en/. Accessed: 2015-09-17.

[65] W. Yang, S. Elankumaran, and L. C. Marr. Relationship between Humidity and Influenza A Viability in Droplets and Implications for Influenza’s Seasonality. PLOS One, 7(10):e46789, 2012.

[66] W. Yang, A. Karspeck, and J. Shaman. Comparison of Filtering Methods for the Modeling and Retrospective Forecasting of Influenza Epidemics. PLOS Computational Biology, 10(4):e1003583, 2014.

[67] Q. Yuan, E. O. Nsoesie, B. Lv, G. Peng, R. Chunara, and J. S. Brownstein. Monitoring Influenza Epidemics in China with Search Query from Baidu. PLOS One, 8(5):e64323, 2013.

Appendix A

Data Assimilation: detailed performance

In this appendix, we present the detailed performance of the data assimilation model presented in Chapter 5, with respect to several seasonal characteristics, using different sources individually as well as in a combined manner. The metrics are presented in Table A.1. Quality score is used to evaluate the value metrics (peak value and season value), while the number of days offset is used to evaluate the date metrics (start date, peak date, and end date). Combined sources show the best performance overall.

Table A.1: Performance of Data assimilation methods using different surrogate sources w.r.t. seasonal characteristics

Metric      Current week  Country  Source     Actual   Predicted  Score
end date    4             BO       Weather    36.0     52.000     16.000
                                   GFT        36.0     52.000     16.000
                                   GST        36.0     52.000     16.000
                                   HealthMap  36.0     47.455     11.455
                                   Twitter    36.0     52.000     16.000
                                   Merged     36.0     52.000     16.000
                          CL       Weather    28.0     42.000     14.000
                                   GFT        28.0     42.000     14.000
                                   GST        28.0     42.000     14.000
                                   HealthMap  28.0     42.000     14.000
                                   Twitter    28.0     42.000     14.000
                                   Merged     28.0     42.000     14.000
                          MX       Weather    11.0     46.000     35.000
                                   GFT        11.0     46.000     35.000
                                   GST        11.0     46.000     35.000
                                   HealthMap  11.0     46.000     35.000
                                   Twitter    11.0     46.000     35.000
                                   Merged     11.0     46.000     35.000
                          PE       Weather    28.0     42.000     14.000
                                   GFT        28.0     42.000     14.000
                                   GST        28.0     42.000     14.000
                                   HealthMap  28.0     35.000     7.000
                                   Twitter    28.0     42.000     14.000
                                   Merged     28.0     42.273     14.273
            5             BO       Weather    36.0     52.000     16.000
                                   GFT        36.0     52.000     16.000
                                   GST        36.0     52.000     16.000
                                   HealthMap  36.0     47.000     11.000
                                   Twitter    36.0     52.000     16.000
                                   Merged     36.0     52.000     16.000
...
            6             BO       Weather    36.0     52.000     16.000
                                   GFT        36.0     52.000     16.000
                                   GST        36.0     52.000     16.000
                                   HealthMap  36.0     47.000     11.000
                                   Twitter    36.0     52.000     16.000
                                   Merged     36.0     52.000     16.000
...
                          MX       Weather    11.0     46.000     35.000
                                   GFT        11.0     46.000     35.000
                                   GST        11.0     46.000     35.000
                                   HealthMap  11.0     46.545     35.545
                                   Twitter    11.0     46.000     35.000
                                   Merged     11.0     46.000     35.000
...
peak date   4             BO       Weather    22.0     39.000     17.000
                                   GFT        22.0     39.000     17.000
                                   GST        22.0     39.000     17.000
                                   HealthMap  22.0     24.000     2.000
                                   Twitter    22.0     39.000     17.000
                                   Merged     22.0     39.000     17.000
...
                          PE       Weather    25.0     21.000     4.000
                                   GFT        25.0     21.000     4.000
                                   GST        25.0     21.000     4.000
                                   HealthMap  25.0     21.000     4.000
                                   Twitter    25.0     21.000     4.000
                                   Merged     25.0     21.000     4.000
...
                          CL       Weather    19.0     17.000     2.000
                                   GFT        19.0     15.000     4.000
                                   GST        19.0     19.000     0.000
                                   HealthMap  19.0     16.000     3.000
                                   Twitter    19.0     17.000     2.000
                                   Merged     19.0     19.000     0.000
...
                          PE       Weather    25.0     21.000     4.000
                                   GFT        25.0     21.000     4.000
                                   GST        25.0     21.000     4.000
                                   HealthMap  25.0     21.000     4.000
                                   Twitter    25.0     21.000     4.000
                                   Merged     25.0     21.000     4.000
            8             BO       Weather    22.0     39.000     17.000
                                   GFT        22.0     39.000     17.000
                                   GST        22.0     39.000     17.000
                                   HealthMap  22.0     24.000     2.000
                                   Twitter    22.0     39.000     17.000
                                   Merged     22.0     39.000     17.000
...
                          MX       Weather    4.0      32.000     28.000
...
peak val    4             BO       Weather    74.0     215.983    1.370
                                   GFT        74.0     141.080    2.098
                                   GST        74.0     154.167    1.920
                                   HealthMap  74.0     48.998     2.649
                                   Twitter    74.0     178.928    1.654
                                   Merged     74.0     145.784    2.030
...
            5             BO       Weather    74.0     183.187    1.616
                                   GFT        74.0     135.428    2.186
                                   GST        74.0     149.081    1.985
                                   HealthMap  74.0     46.160     2.495
                                   Twitter    74.0     182.980    1.618
                                   Merged     74.0     144.726    2.045
...
                          MX       Weather    25.0     697.397    0.143
                                   GFT        25.0     717.214    0.139
                                   GST        25.0     710.721    0.141
                                   HealthMap  25.0     574.125    0.174
                                   Twitter    25.0     669.851    0.149
                                   Merged     25.0     687.034    0.146
                          PE       Weather    83.0     112.842    2.942
...
                          CL       Weather    1004.0   833.944    3.322
...
                          PE       Weather    83.0     115.505    2.874
                                   GFT        83.0     93.706     3.543
                                   GST        83.0     117.460    2.826
...
season val  4             BO       Weather    674.0    1002.532   2.689
                                   GFT        674.0    907.496    2.971
                                   GST        674.0    942.767    2.860
                                   HealthMap  674.0    712.464    3.784
                                   Twitter    674.0    964.909    2.794
                                   Merged     674.0    930.938    2.896
...
                          CL       Weather    8752.0   11716.956  2.988
                                   GFT        8752.0   11474.547  3.051
                                   GST        8752.0   11931.396  2.934
                                   HealthMap  8752.0   11246.845  3.113
                                   Twitter    8752.0   11834.536  2.958
                                   Merged     8752.0   12086.014  2.897
                          MX       Weather    158.0    5533.380   0.114
...
            7             BO       Weather    674.0    949.164    2.840
...
                          MX       Weather    158.0    5445.851   0.116
                                   GFT        158.0    5340.681   0.118
                                   GST        158.0    5695.009   0.111
...
start date  4             BO       Weather    17.0     9.000      8.000
                                   GFT        17.0     9.000      8.000
                                   GST        17.0     9.000      8.000
                                   HealthMap  17.0     8.000      9.000
                                   Twitter    17.0     9.000      8.000
                                   Merged     17.0     9.000      8.000
                          CL       Weather    10.0     7.000      3.000
...
                          PE       Weather    6.0      1.000      5.000
...
                          CL       Weather    10.0     7.000      3.000
                                   GFT        10.0     7.000      3.000
                                   GST        10.0     7.000      3.000
...
                          PE       Weather    6.0      1.000      5.000
                                   GFT        6.0      1.000      5.000
                                   GST        6.0      1.000      5.000
                                   HealthMap  6.0      1.000      5.000
                                   Twitter    6.0      1.000      5.000
                                   Merged     6.0      1.000      5.000
            9             BO       Weather    17.0     9.000      8.000
...





Appendix B

Sequential Bayesian Inference

In this appendix, we briefly describe the methodology of Sequential Bayesian Inference and present some of the details relevant to the methods presented in Chapter 6.

Consider a stochastic process where an observed temporal data sequence $y = \{y_1, y_2, \ldots, y_t\}$ depends on unobserved latent states $x = \{x_1, x_2, \ldots, x_t\}$ such that the following formulation holds:

$$
\begin{aligned}
P(y_t \mid y_{1:t-1}, x_{1:t}, \theta) &= f_\theta(y_t \mid x_t) \\
P(x_t \mid x_{1:t-1}, \theta) &= g_\theta(x_t \mid x_{t-1}) \\
P(x_1 \mid \theta) &= \mu_\theta(x_1) \\
\Pi_0(\theta) &= P(\theta)
\end{aligned}
\tag{B.1}
$$

i.e. $y_t$ depends only on the current state $x_t$. On the other hand, $x_t$ depends only on $x_{t-1}$, thus exhibiting a first-order Markov property. $\theta$ denotes the set of parameters of the described process, which are constant over time. For a given $\theta$, $f_\theta$ and $g_\theta$ describe the observation probability and the state transition probability, respectively. $P(\theta)$ is the prior distribution for the static parameter $\theta$, while $\mu_\theta$ is the prior for the initial state $x_1$ given a particular $\theta$. Typically, at any time point $t-1$ the observation values are known but the latent states and the parameter $\theta$ are unknown. The problem of interest is then to estimate the posterior probability

$$ P_\theta(\{x_1, x_2, \ldots, x_{t-1}\} \mid \{y_1, y_2, \ldots, y_{t-1}\}) $$

This problem has been studied extensively in the context of Sequential Bayesian Inference [9]. Kalman filters [26], a class of such algorithms, are very popular when $f_\theta$ and $g_\theta$ describe linear Gaussian transitions. There have been efforts [55, 2] at relaxing these restrictions using methods such as Taylor series expansion and ensemble averages. However, for arbitrary forms of $f_\theta$ and $g_\theta$, Sequential Monte Carlo methods, and more specifically Particle Filters, are more popular. Particle Filters [18] estimate the posteriors using a large number of Monte Carlo samples from the observation and state transition models. At any time $t$, these algorithms only need to draw new samples for time $t$ using data from time $t-1$. Thus these methods are ideally suited for online learning. Standard Particle Filters are known to suffer from premature convergence (particle degeneracy) [20] and to be unsuitable for unknown static variables [47, 20]. Recently, Chopin et al. [16] proposed SMC2, a hybrid particle filter which interleaves iterated batch resampling with particle filter updates to handle both static and state parameters. Given an observed sequence $y_{1:t}$, SMC2 can be used to find the best posterior fit of the static and state parameters as given below:

$$ P\left(\phi, \{x_{1:t}\}_\phi \mid y_{1:t}\right) $$
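To make the particle filtering idea concrete, the following is a minimal sketch of a bootstrap particle filter for a toy linear-Gaussian state space model with the static parameters held fixed. The model, its parameter values and all function names are illustrative assumptions and do not correspond to the models used in this thesis. In the notation of (B.1), the state transition plays the role of $g_\theta$ and the observation density plays the role of $f_\theta$.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bootstrap_particle_filter(ys, n_particles, a, sd_x, sd_y, rng):
    """Toy model: x_t = a * x_{t-1} + N(0, sd_x^2), y_t = x_t + N(0, sd_y^2).

    Returns the filtered means E[x_t | y_{1:t}] and an estimate of
    log P(y_{1:T} | theta) for fixed theta = (a, sd_x, sd_y).
    """
    # Initial particles drawn from the prior of x_1, here N(0, sd_x^2).
    particles = [rng.gauss(0.0, sd_x) for _ in range(n_particles)]
    filtered_means, log_lik = [], 0.0
    for t, y in enumerate(ys):
        if t > 0:
            # Propagate with the state transition (proposal = transition).
            particles = [a * x + rng.gauss(0.0, sd_x) for x in particles]
        # Weight each particle by the observation density of y_t.
        w = [normal_pdf(y, x, sd_y) for x in particles]
        total = sum(w)
        if total == 0.0:
            raise ValueError("all particle weights underflowed to zero")
        # The average weight estimates P(y_t | y_{1:t-1}, theta).
        log_lik += math.log(total / n_particles)
        norm_w = [wi / total for wi in w]
        filtered_means.append(sum(wi * x for wi, x in zip(norm_w, particles)))
        # Multinomial resampling to counter weight degeneracy.
        particles = rng.choices(particles, weights=norm_w, k=n_particles)
    return filtered_means, log_lik


# Simulate 50 observations from the toy model, then filter them.
rng = random.Random(42)
a, sd_x, sd_y = 0.9, 0.5, 1.0
ys, x = [], rng.gauss(0.0, sd_x)
for _ in range(50):
    ys.append(x + rng.gauss(0.0, sd_y))
    x = a * x + rng.gauss(0.0, sd_x)
means, log_lik = bootstrap_particle_filter(ys, 500, a, sd_x, sd_y, rng)
```

Each update uses only the particles from the previous time step, which is what makes the method suitable for online learning. The sketch treats the static parameters as known, which is exactly the limitation that SMC2 addresses.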

B.1 SMC2 algorithm traces

We present the traces of the SMC2 algorithm below. For a more detailed treatment of thesame (including theoretical proofs of convergence) we ask the readers to refer to [16].

SMC2 typically starts with two parameters: (a) $N_\theta$, the number of static parameters sampled from the prior of $\theta$, and (b) $N_x$, the number of particles initialized for each $\theta$.

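Then the algorithm can be given as follows:

1. Sample $N_\theta$ parameter particles $\theta^m \sim P(\theta)$.

2. For every $\theta^m$, run the following particle filter:

(a) Initialization, $t = 1$:

i. $x_1^{1:N_x,m} \sim \mu_{\theta^m}$

ii. $w_{1,\theta}(x_1^{n,m}) = \dfrac{\mu_{\theta^m}(x_1^{n,m})\, f_\theta(y_1 \mid x_1^{n,m})}{q_{1,\theta}(x_1^{n,m})}$

iii. $W_{1,\theta}^{n,m} = \dfrac{w_{1,\theta}(x_1^{n,m})}{\sum_i w_{1,\theta}(x_1^{i,m})}$

iv. $P(y_1 \mid \theta^m) = \dfrac{1}{N_x} \sum_{n=1}^{N_x} w_{1,\theta}(x_1^{n,m})$

(b) Recursion, $t \geq 2$:

i. Auxiliary variable: $a_{t-1}^{n,m} \sim \mathrm{Multinomial}\left(W_{t-1,\theta}^{1:N_x,m}\right)$

ii. State proposal: $x_t^{n,m} \sim q_{t,\theta}\left(\cdot \mid x_{t-1}^{a_{t-1}^{n,m}}\right)$

iii. Weight update: $W_{t,\theta}^{n,m} = \dfrac{w_{t,\theta}\left(x_{t-1}^{a_{t-1}^{n,m}}, x_t^{n,m}\right)}{\sum_i w_{t,\theta}\left(x_{t-1}^{a_{t-1}^{i,m}}, x_t^{i,m}\right)}$

iv. Observation probability: $P(y_t \mid y_{1:t-1}, \theta^m) = \dfrac{1}{N_x} \sum_{n=1}^{N_x} w_{t,\theta}\left(x_{t-1}^{a_{t-1}^{n,m}}, x_t^{n,m}\right)$

3. Update importance weights: for every $\theta^m$, $w^m \leftarrow w^m \, P(y_t \mid y_{1:t-1}, \theta^m)$.

4. Under the degeneracy criterion, move particles using the kernel $K_t$:

$$ \left(\tilde{\theta}^m, \tilde{x}_{1:t}^{1:N_x,m}, \tilde{a}_{1:t-1}^{1:N_x,m}\right) \overset{\text{i.i.d.}}{\sim} \frac{\sum_m w^m K_t\left(\theta^m, x_{1:t}^{1:N_x,m}, a_{1:t-1}^{1:N_x,m}\right)}{\sum_m w^m} $$

5. Weight exchange:

$$ \left(\theta^m, x_{1:t}^{1:N_x,m}, a_{1:t-1}^{1:N_x,m}\right) \leftarrow \left(\tilde{\theta}^m, \tilde{x}_{1:t}^{1:N_x,m}, \tilde{a}_{1:t-1}^{1:N_x,m}\right) $$

Here, $w_{t,\theta}$ denotes the unnormalized importance weight of a particle at time $t$, defined analogously to step 2(a)ii, and $K_t$ is a Markov kernel targeting the posterior distribution. It can be shown that such Markov moves do not change the target distribution and can alleviate the problem of particle degeneracy.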
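The outer layer of the algorithm (steps 3 and 4) can be sketched as follows. The traces above do not name the degeneracy criterion; the sketch assumes the commonly used effective sample size (ESS) test with a threshold of half the number of parameter particles. The threshold, the function names and the log-space bookkeeping are illustrative assumptions.

```python
import math


def update_theta_weights(log_weights, incremental_log_liks):
    """Step 3 in log space: log w^m += log P(y_t | y_{1:t-1}, theta^m)."""
    return [lw + ll for lw, ll in zip(log_weights, incremental_log_liks)]


def effective_sample_size(log_weights):
    """ESS = (sum_m w^m)^2 / sum_m (w^m)^2, computed stably from log-weights."""
    top = max(log_weights)
    w = [math.exp(lw - top) for lw in log_weights]
    return sum(w) ** 2 / sum(wi * wi for wi in w)


def is_degenerate(log_weights, threshold=0.5):
    """Assumed criterion: trigger the move step when ESS < threshold * N_theta."""
    return effective_sample_size(log_weights) < threshold * len(log_weights)


# Four parameter particles that start with equal weights.
log_w = [0.0, 0.0, 0.0, 0.0]
ess_start = effective_sample_size(log_w)

# Hypothetical incremental log-likelihoods: one theta explains y_t far better.
log_w = update_theta_weights(log_w, [0.0, -50.0, -50.0, -50.0])
ess_after = effective_sample_size(log_w)
needs_move = is_degenerate(log_w)
```

With equal weights the ESS equals the number of particles. After a single highly informative observation almost all weight sits on one particle, the ESS drops close to one, and the move step with the Markov kernel would be triggered.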

B.2 SMC2 priors

We used conjugate distributions to model the priors. For $P(\theta)$, we used a mixture of Latin hypercube sampling (LHS) and conjugate priors as follows:

$$ \sigma_S, \rho_S, \mu_1, \mu_2 \sim \mathrm{LHS}, \qquad \Sigma_A \sim \mathrm{InverseWishart} \tag{B.2} $$

Similar to $P(\theta)$, we model the initial distribution $P(x_0 \mid \theta)$ via LHS sampling for the base values and by using the model equations as presented in Section 3.1, as follows:

$$ c_K \sim \mathrm{Normal}, \qquad \phi_k, \rho_s \sim \mathrm{Gamma} \tag{B.3} $$

The parameters of the distributions of $P(\theta)$ and $P(x_0 \mid \theta)$ are called hyperparameters in the general domain of Bayesian inference and, following standard practice, are found via cross-validation.
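Latin hypercube sampling can be illustrated with a short sketch. Each dimension is divided into as many equal-width strata as there are samples, one point is drawn per stratum, and the strata are shuffled independently per dimension. The bounds assigned below to the four LHS-sampled parameters are hypothetical placeholders, not the ranges used in this work.

```python
import random


def latin_hypercube(n_samples, bounds, rng):
    """Return n_samples points; every dimension is hit once per stratum.

    bounds is a list of (low, high) pairs, one pair per dimension.
    """
    columns = []
    for low, high in bounds:
        # One uniform draw inside each of the n_samples equal-width strata.
        strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(strata)  # decouple the stratum order across dimensions
        columns.append([low + u * (high - low) for u in strata])
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


# Hypothetical ranges for (sigma_S, rho_S, mu_1, mu_2).
bounds = [(0.1, 2.0), (0.0, 1.0), (-5.0, 5.0), (-5.0, 5.0)]
theta_particles = latin_hypercube(10, bounds, random.Random(7))
```

Compared with independent uniform draws, this design spreads a small number of parameter particles evenly over each marginal range, which is useful when the number of particles is limited.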

Appendix C

HQCD: Additional Experimental Results

In this appendix, we present some additional experimental results that complement the summary results presented in Section 6.3.

[Figure C.1 panels: Event Counts over time for each protest subtype (Employment, Other Government, Energy & Resources, Other Economic, Housing, Other). (a) Brazil Subtypes (time axis labeled Jan 2013); (b) Venezuela Subtypes (time axis Jan to Mar 2014); (c) Uruguay Subtypes (time axis Dec to Jan 2014).]

Figure C.1: Comparison of detected changepoints at the target sources (Protest types). HQCD detections are shown in solid green, while those from the state-of-the-art methods, i.e., RuLSIF (red), WGLRT (magenta), BOCPD (purple), and GLRT (gold), are shown with dashed lines.


Table C.1: (Protest uprisings) Comparison of HQCD vs. state-of-the-art methods with respect to detected changepoints

Country    Event-Type          GLRT γ  WGLRT γ  BOCPD γ  RuLSIF γ  HQCD γ  HQCD EADD
Brazil     Employment & Wages  02/10   03/17    06/16    05/26     08/18   4
Brazil     Energy & Resources  02/10   03/17    06/09    05/19     06/02   6
Brazil     Housing             03/24   03/31    07/28    05/19     06/16   8
Brazil     Other Economic      03/24   03/24    06/23    05/19     06/30   5
Brazil     Other Government    02/17   06/23    04/07    05/19     06/16   4
Brazil     Other               03/03   03/17    06/30    05/19     06/23   6
Brazil     All                 02/17   04/28    05/19    06/16     06/16   8
Venezuela  Employment & Wages  01/14   01/13    01/28    01/25     01/27   3
Venezuela  Energy & Resources  01/20   01/11    02/28    01/20     02/24   7
Venezuela  Housing             -       -        -        -         -       -
Venezuela  Other Economic      01/31   01/31    01/28    -         01/27   9
Venezuela  Other Government    01/22   01/11    02/03    01/20     02/10   4
Venezuela  Other               01/14   01/12    01/25    01/30     01/24   5
Venezuela  All                 01/26   01/11    01/30    01/20     02/12   3
Uruguay    Employment & Wages  12/06   12/08    12/13    12/03     12/10   3
Uruguay    Energy & Resources  12/04   12/05    12/10    -         12/09   4
Uruguay    Housing             12/21   12/06    11/30    -         11/28   2
Uruguay    Other Economic      12/20   12/06    -        -         11/26   2
Uruguay    Other Government    11/25   12/05    12/16    11/29     12/15   3
Uruguay    Other               12/05   12/09    12/03    -         01/14   10
Uruguay    All                 12/05   12/09    12/03    11/29     12/10   3
