Assessing the impact of a health intervention via user-generated Internet content

ECML PKDD 2015, Porto, Portugal

Assessing the impact of a health intervention via user-‐generated Internet data

Data Mining and Knowledge Discovery 29(5), pp. 1434–1457, 2015

Vasileios Lampos, Elad Yom-‐Tov, Richard Pebody and Ingemar J. Cox

STATUTORY NOTIFICATIONS OF INFECTIOUS DISEASES

WEEK 2015/33 week ending 16/08/2015

in ENGLAND and WALES

Table 1 Statutory notifications of infectious diseases in the past 6 weeks with totals for the current year compared with corresponding periods of the two preceding years

CONTENTS

Table 2 Statutory notifications of infectious diseases for diseases for WEEK 2015/33 by PHE Region, county, local and unitary authority including additional diseases notifiable from 6th April 2010

Registered Medical Practioner in England and Wales have a statutory duty to notify a Proper Officer of the local authority, often the CCDC (Consultant in Communicable Disease Control), of suspected cases of certain infectious diseases:

Cholera Meningococcal septicaemia Viral haemorrhagic feverBrucellosis * Measles Typhus

Diphtheria Mumps Whooping cough

Food poisoning Rabies

Enteric fever (typhoid or paratyphoid) Plague Yellow fever

Botulism * Malaria Tuberculosis

Acute infectious hepatitis Infectious bloody diarrhoea * SARS *Acute encephalitis Haemolytic uraemic syndrome * Rubella

Acute meningitis Invasive group A Streptococcal disease

Scarlet fever

Anthrax Leprosy TetanusAcute poliomyelitis Legionnaires disease * Smallpox

* Notifiable from 6th April 2010

Notifications of infectious diseases, some of which are later microbiologically confirmed, prompt local investigation and action to control the diseases. Proper officers are required every week to inform the PHE (formerly the Registrar General) anonymised details of each case of each disease that has been notified. PHE has responsibilty of collating the weekly returns from proper officers and publishing analyses of local and national trends.

All weekly data are Provisional© Public Health England - Information Management Department

NOIDs WEEKLY REPORTTurning ideas into reality at

Microsoft Research Cambridge

๏ Background and motivation

๏ Nowcasting disease rates from online text

๏ Estimating the impact of a health intervention

๏ Case study: influenza vaccination impact

๏ Conclusions & future work

1%

Assessing the impact of a health intervention via online content

Online, user-‐generated data

+ Social media, blogs, search engine query logs

+ Proxy of real-‐world (online+offline) behaviour

+ Complementary information sensors to more ‘traditional’ crowdsourcing efforts

+ Can answer questions difficult to resolve otherwise

+ Strong predictive power

Online, user-‐generated data — Applications

+ Politics • voting intention • result of an election

+ Finance • financial indices • tourism patterns

+ User profiling • age • gender • occupation (Preotiuc-‐Pietro, Lampos & Aletras, 2015)

(Burger et al., 2011)

(Rao et al., 2010)

(Bollen, Mao & Zeng, 2011)

(Choi & Varian, 2012)

(Lampos, Preotiuc-‐Pietro & Cohn, 2013)

(Tumasjan et al., 2010)

Online, user-‐generated data for health

Traditional disease surveillance - does not cover the entire population - not present everywhere (cities / countries) - not always timely

Digital disease surveillance + different or better population coverage + better geographical granularity + useful in underdeveloped parts of the world + almost instant - noisy, unstructured information

e.g. (Lampos & Cristianini, 2010 & 2012), (Lamb, Paul & Dredze, 2013), (Lampos et al., 2015)

What this work is all about

Health intervention

disease rates

(Pebody & Cox, 2015

impact ?

What this work is all about

Health intervention

disease rates

(Lampos, Yom-‐Tov, Pebody & Cox, 2015)

impact ?

✓ Background and motivation

๏ Estimating disease rates from online text





15%

Estimating disease rates from online text

Ridge regression

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+

MX

j=1

w2j

1

A

Elastic net

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+ �1

MX

j=1

|wj |+ �2

MX

j=1

w2j

1

A

Variables

NMX 2 RN⇥M

y 2 RN

1

time intervalsn-‐grams

frequency of n-‐grams during the time intervals

disease rates during the time intervals

Ridge regression

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+

MX

j=1

w2j

1

A

Elastic net

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+ �1

MX

j=1

|wj |+ �2

MX

j=1

w2j

1

A

Variables

NMX 2 RN⇥M

y 2 RN

1

(Hoerl & Kennard, 1970)

Ridge regressionRidge regression

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+

MX

j=1

w2j

1

A

Elastic net

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+ �1

MX

j=1

|wj |+ �2

MX

j=1

w2j

1

A

Variables

NMX 2 RN⇥M

y 2 RN

1

(Zou & Hastie, 2005)

Elastic net


Ridge regression

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+

MX

j=1

w2j

1

A

Elastic net

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+ �1

MX

j=1

|wj |+ �2

MX

j=1

w2j

1

A

Variables

NMX 2 RN⇥M

y 2 RN

GPs can be defined as sets of random variables, any finite number of

which have a multivariate Gaussian distribution.

In GP regression, for the inputs x, x

0 2 RM(both expressing rows of

the observation matrix X) we want to learn a function f : RM ! R that is

drawn from a GP prior

f(x) ⇠ GP �µ(x) = 0, k(x,x0

)

�

kSE(x,x0) = �2

exp

✓�kx� x

0k222`2

◆

where �2describes the overall level of variance and ` is referred to as the

characteristic length-scale parameter.

An infinite sum of SE kernels with di↵erent length-scales results to an-

other well studied covariance function, the Rational Quadratic (RQ) kernel.

kRQ(x,x0) = �2

✓1 +

kx� x

0k222↵`2

◆�↵

↵ is a parameter that determines the relative weighting between small

and large-scale variations of input pairs. The RQ kernel can be used to model

functions that are expected to vary smoothly across many length-scales.

1

Gaussian Process

Ridge regression

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+

MX

j=1

w2j

1

A

Elastic net

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+ �1

MX

j=1

|wj |+ �2

MX

j=1

w2j

1

A

Variables

NMX 2 RN⇥M

y 2 RN







f(x) ⇠ GP �µ(x) = 0, k(x,x0

)

�

kSE(x,x0) = �2

exp

✓�kx� x

0k222`2

◆





kRQ(x,x0) = �2

✓1 +

kx� x

0k222↵`2

◆�↵




1

Rational Quadratic covariance function (kernel)

infinite sum of squared exponential (RBF) kernels

k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)

One kernel per n-‐gram category varied usage patterns, increasing semantic value

(Rasmussen & Williams, 2006)

see also (


Ridge regression

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+

MX

j=1

w2j

1

A

Elastic net

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+ �1

MX

j=1

|wj |+ �2

MX

j=1

w2j

1

A

Variables

NMX 2 RN⇥M

y 2 RN







f(x) ⇠ GP �µ(x) = 0, k(x,x0

)

�

kSE(x,x0) = �2

exp

✓�kx� x

0k222`2

◆





kRQ(x,x0) = �2

✓1 +

kx� x

0k222↵`2

◆�↵




1

Gaussian Process

Ridge regression

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+

MX

j=1

w2j

1

A

Elastic net

argmin

w,�

0

@NX

i=1

(xiw + � � yi)2+ �1

MX

j=1

|wj |+ �2

MX

j=1

w2j

1

A

Variables

NMX 2 RN⇥M

y 2 RN







f(x) ⇠ GP �µ(x) = 0, k(x,x0

)

�

kSE(x,x0) = �2

exp

✓�kx� x

0k222`2

◆





kRQ(x,x0) = �2

✓1 +

kx� x

0k222↵`2

◆�↵




1

Rational Quadratic covariance function (kernel)

infinite sum of squared exponential (RBF) kernels

k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)

where gn is used to express the features of each n-gram category, i.e.,

x = {g1, g2, g3, g4}, C is equal to the number of n-gram categories (in our

experiments, C = 4) and kN(x,x0) = �2

N ⇥ �(x,x0) models noise (� being a

Kronecker delta function).

Denoting the disease rate time series as y = (y1, . . . , yN ), the GP re-

gression objective is defined by the minimization of the following negative

log-marginal likelihood function

argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�

where K holds the covariance function evaluations for all pairs of inputs,

i.e., (K)i,j = k(xi,xj), and µ = (µ(x1), . . . , µ(xN )).

Based on a new observation x⇤, a prediction is conducted by computing

the mean value of the posterior predictive distribution, E[y⇤|y,O,x⇤].

2

One kernel per n-‐gram category varied usage patterns, increasing semantic value

(Rasmussen & Williams, 2006)

see also (Lampos et al., 2015)

Estimating influenza-‐like illness (ILI) rates — Data

2012 2013 20140

0.01

0.02

0.03

0.04

ILI r

ate

per 1

00 p

eopl

e

ILI rates (PHE)

Bing

User-‐generated data, geolocated in England • Twitter: May 2011 to April 2014 (308 million tweets) • Bing: end of December 2012 to April 2014

ILI rates from Public Health England (PHE)

Estimating ILI rates — Feature extraction

• Start with a manually crafted list of 36 textual markers, e.g. flu, headache, doctor, cough

• Extract frequent co-‐occurring n-‐grams from a corpus of 30 million UK tweets (February & March, 2014) after removing stop-‐words

• Set of markers expanded to 205 n-‐grams (n ≤ 4) e.g. #flu, #cough, annoying cough, worst sore throat

• Relatively small set of features motivated by previous work (Culotta, 2013)

Estimating ILI rates — Experimental setup

Two time intervals based on the different temporal coverage of Twitter and Bing data • Dt1: 154 weeks (May 2011 to April 2014) • Dt2: 67 weeks (December 2012 to April 2014)

Stratified 10-‐fold cross validation

Error metrics • Pearson correlation (r) • Mean Absolute Error (MAE)

Pearso

n co

rrelation (r)

0.5

0.6

0.7

0.8

0.9

1

User-‐generated data source

Twitter (Dt1) Twitter (Dt2) Bing (Dt2)

0.9520.924

0.8450.867

0.7440.718

0.814

0.698

0.64

Ridge Regression Elastic Net Gaussian Process

Estimating ILI rates — Performance

MAE

1

1.64

2.28

2.92

3.56

4.2

User-‐generated data source

Twitter (Dt1) Twitter (Dt2) Bing (Dt2)

1.598

1.9992.196

2.564

3.198

2.828 2.963

4.084

3.074

Ridge Regression Elastic Net Gaussian Process

Estimating ILI rates — Performancex 10

3


✓ Estimating disease rates from online text





41%

Estimating the impact of a health intervention

1. Disease intervention launched (to a set of areas)

2. Define a distinct set of control areas

3. Estimate disease rates in all areas

4. Identify pairs of areas with strong historical correlation in their disease rates

5. Use this relationship during and slightly after the intervention to infer diseases rates in the affected areas had the intervention not taken place


k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)









argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�




the mean value of the posterior predictive distribution, E[y⇤|y,O,x⇤].—

⌧ = {t1, . . . , tN}vc

r(q⌧v ,q

⌧c )

f(w,�) : R ! R

argmin

w,�

NX

i=1

�qtic w + � � qtiv

�2

q

⇤v = q

⇤cw + b

qv

�v = qv � q

⇤v

✓v =

qv � q

⇤v

q

⇤v

.

2

time interval(s) before the interventionlocation(s) where the intervention took placecontrol location(s)

k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)









argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�





⌧ = {t1, . . . , tN}vc

r(q⌧v ,q

⌧c )

f(w,�) : R ! R

argmin

w,�

NX

i=1


�2

q

⇤v = q

⇤cw + b

qv

�v = qv � q

⇤v

✓v =

qv � q

⇤v

q

⇤v

.

2

k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)









argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�





⌧ = {t1, . . . , tN}vc

r(q⌧v ,q

⌧c )

f(w,�) : R ! R

argmin

w,�

NX

i=1


�2

q

⇤v = q

⇤cw + b

qv

�v = qv � q

⇤v

✓v =

qv � q

⇤v

q

⇤v

.

2

such that

k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)









argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�





⌧ = {t1, . . . , tN}vc

r(q⌧v ,q

⌧c )

f(w,�) : R ! R

argmin

w,�

NX

i=1


�2

q

⇤v = q

⇤cw + b

qv

�v = qv � q

⇤v

✓v =

qv � q

⇤v

q

⇤v

.

2

disease rate(s) in affected location

before intervention

disease rate(s) in control location

before intervention

high


k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)









argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�





⌧ = {t1, . . . , tN}vc

r(q⌧v ,q

⌧c )

f(w,�) : R ! R

argmin

w,�

NX

i=1


�2

q

⇤v = q

⇤cw + b

qv

�v = qv � q

⇤v

✓v =

qv � q

⇤v

q

⇤v

.

2

k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)









argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�





⌧ = {t1, . . . , tN}vc

r(q⌧v ,q

⌧c )

f(w,�) : R ! R

argmin

w,�

NX

i=1


�2

q

⇤v = q

⇤cw + b

qv

�v = qv � q

⇤v

✓v =

qv � q

⇤v

q

⇤v

.

2

such that

qvdisease rate(s) in affected location during/after intervention

�v = qv � q

⇤v

absolute difference

✓v =

qv � q

⇤v

q

⇤v

relative difference (impact)

(Lambert & Pregibon, 2008

estimate projected rate(s) in affected location during/after intervention

k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)









argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�





⌧ = {t1, . . . , tN}vc

r(q⌧v ,q

⌧c )

f(w,�) : R ! R

argmin

w,�

NX

i=1


�2

q

⇤v = q

⇤cw + b

q

⇤v = qcw + b

qv

�v = qv � q

⇤v

2


k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)









argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�





⌧ = {t1, . . . , tN}vc

r(q⌧v ,q

⌧c )

f(w,�) : R ! R

argmin

w,�

NX

i=1


�2

q

⇤v = q

⇤cw + b

qv

�v = qv � q

⇤v

✓v =

qv � q

⇤v

q

⇤v

.

2

k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)









argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�





⌧ = {t1, . . . , tN}vc

r(q⌧v ,q

⌧c )

f(w,�) : R ! R

argmin

w,�

NX

i=1


�2

q

⇤v = q

⇤cw + b

qv

�v = qv � q

⇤v

✓v =

qv � q

⇤v

q

⇤v

.

2

such that

k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)









argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�





⌧ = {t1, . . . , tN}vc

r(q⌧v ,q

⌧c )

f(w,�) : R ! R

argmin

w,�

NX

i=1


�2

q

⇤v = q

⇤cw + b

qv

�v = qv � q

⇤v

✓v =

qv � q

⇤v

q

⇤v

.

2

disease rate(s) in affected location during/after intervention

k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)









argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�





⌧ = {t1, . . . , tN}vc

r(q⌧v ,q

⌧c )

f(w,�) : R ! R

argmin

w,�

NX

i=1


�2

q

⇤v = q

⇤cw + b

qv

�v = qv � q

⇤v

✓v =

qv � q

⇤v

q

⇤v

2

absolute difference

k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)









argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�





⌧ = {t1, . . . , tN}vc

r(q⌧v ,q

⌧c )

f(w,�) : R ! R

argmin

w,�

NX

i=1


�2

q

⇤v = q

⇤cw + b

qv

�v = qv � q

⇤v

✓v =

qv � q

⇤v

q

⇤v

2

relative difference (impact)

(Lambert & Pregibon, 2008)

estimate projected rate(s) in affected location during/after intervention

k(x,x0) =

CX

n=1

kRQ(gn,g0n)

!+ kN(x,x

0)









argmin

�1,...,�C ,`1,...,`C ,↵1,...,↵C ,�N

�(y � µ)|K�1

(y � µ) + log |K|�





⌧ = {t1, . . . , tN}vc

r(q⌧v ,q

⌧c )

f(w,�) : R ! R

argmin

w,�

NX

i=1


�2

q

⇤v = q

⇤cw + b

q

⇤v = qcw + b

qv

�v = qv � q

⇤v

2



✓ Estimating the impact of a health intervention




52%

Live Attenuated Influenza Vaccine (LAIV) campaign

2012 2013 20140

0.01

0.02

0.03

ILI r

ate

per 1

00 p

eopl

e

PHE/RCGP LAIV Post LAIV

∆tv

• LAIV programme for children (4 to 11 years) in pilot areas of England during the 2013/14 flu season

• Vaccination period (blue): Sept. 2013 to Jan. 2014 • Post-‐vaccination period (green): Feb. to April 2014

Target (vaccinated) & control areas

Vaccination impact paper

Vaccinated Locations

Bury

Cumbria

Gateshead

Leicester

East Leicestershire

Rutland

Havering

Newham

South-East Essex

Control Locations

Brighton

Bristol

Cambridge

Exeter

Leeds

Liverpool

Norwich

Nottingham

Plymouth

Sheffield

Southampton

York

Brighton • Bristol • Cambridge Exeter • Leeds • Liverpool Norwich • Nottingham • Plymouth Sheffield • Southampton • York

Control areas

Bury • Cumbria • Gateshead Leicester • East Leicestershire Rutland • South-‐East Essex Havering (London) Newham (London)

Vaccinated areas

Applying the impact estimation framework

Target vs. control areas • Use previous flu season only to establish relationships • Find the best correlated areas or supersets of them

Confidence intervals • Bootstrap sampling of the regression residuals

(mapping function of control to vaccinated areas) • Bootstrap sampling of data prior to the application of

the bootstrapped regressor • 105 bootstraps; use the .025 and .975 quantiles

Statistical significance assessment • Impact estimate (abs.) > 2σ of the bootstrap estimates

Relationship between vaccinated & control areas

Twitter — All areas

Bing — All areas

0 0.25 0.5 0.75 1

0

0.25

0.5

0.75

1

ILI r

ates

in v

acci

nate

d ar

eas

ILI rates in control areas

pre−vaccination periodduring/after LAIV

0 0.25 0.5 0.75 10

0.25

0.5

0.75

1

ILI r

ates

in v

acci

nate

d ar

eas



axes normalised from 0 to 1

r = .86

r = .87

Relationship between vaccinated & control areas

Twitter — London areas

Bing — London areas

0 0.25 0.5 0.75 10

0.25

0.5

0.75

1

ILI r

ates

in v

acci

nate

d ar

eas



0 0.25 0.5 0.75 10

0.25

0.5

0.75

1IL

I rat

es in

vac

cina

ted

area

s



axes normalised from 0 to 1

r = .74

r = .85

Impact estimation results (strongly correlated controls)

Source Target r δ x 103 θ (%)

Twitter All areas .861 -‐2.5 (-‐4.1, -‐1.0) -‐32.8 (-‐47.4, -‐15.6)

Bing All areas .866 -‐1.9 (-‐3.2, -‐0.7) -‐21.7 (-‐32.1, -‐9.10)

TwitterLondon areas .738 -‐1.7 (-‐2.5, -‐0.9) -‐30.5 (-‐41.8, -‐17.5)

BingLondon areas .848 -‐2.8 (-‐4.1, -‐1.6) -‐28.4 (-‐36.7, -‐17.9)

Source Target r δ x 103 θ (%)

Twitter All areas .861 -‐2.5 (-‐4.1, -‐1.0) -‐32.8 (-‐47.4, -‐15.6)

Bing All areas .866 -‐1.9 (-‐3.2, -‐0.7) -‐21.7 (-‐32.1, -‐9.10)

TwitterLondon areas .738 -‐1.7 (-‐2.5, -‐0.9) -‐30.5 (-‐41.8, -‐17.5)

BingLondon areas .848 -‐2.8 (-‐4.1, -‐1.6) -‐28.4 (-‐36.7, -‐17.9)

Impact estimation results (strongly correlated controls)

Impact estimation results (stat. sig.)-‐θ (%

)

0

7

14

21

28

35

All areas London areas Newham Cumbria Gateshead

30.228.7

21.7 21.1

30.430.532.8

Twitter Bing

Projected vs. inferred ILI rates in vaccinated locations

Twitter — All areas

Bing — All areas

Oct Nov Dec Jan Feb Mar Apr0

0.005

0.01

0.015

0.02

ILI r

ates

per

100

peo

ple

weeks during and after the vaccination programme

inferred ILI ratesprojected ILI rates


0.005

0.01

0.015

0.02

ILI r

ates

per

100

peo

ple



Projected vs. inferred ILI rates in vaccinated locations

Twitter — London areas

Bing — London areas


0.005

0.01

ILI r

ates

per

100

peo

ple




0.005

0.01

0.015

ILI r

ates

per

100

peo

ple



Sensitivity of impact estimates to variable controls

• Repeat the impact estimation for the N controls (up to a 100) with r ≥ 95% of the best r —> μ(δ) and μ(θ) (%)

• Measure % of difference, Δ(θ), between θ and μ(θ)

Source Target N μ(r) μ(δ) x 103 μ(θ) (%) Δθ (%)

Twitter All areas 100 0.84 -‐2.5 (0.2) -‐32.7 (2.1) 0.10Bing All areas 46 0.85 -‐1.4 (0.4) -‐16.4 (3.6) 24.4

Twitter London areas 79 0.70 -‐1.5 (0.1) -‐27.9 (2.0) 8.32

Bing London areas 100 0.84 -‐1.4 (0.2) -‐16.9 (1.8) 40.4





Twitter All areas 100 0.84 -‐2.5 (0.2) -‐32.7 (2.1) 0.10

Bing All areas 46 0.85 -‐1.4 (0.4) -‐16.4 (3.6) 24.4







Twitter All areas 100 0.84 -‐2.5 (0.2) -‐32.7 (2.1) 0.10Bing All areas 46 0.85 -‐1.4 (0.4) -‐16.4 (3.6) 24.4





✓ Estimating the impact of a health intervention

✓ Case study: influenza vaccination impact



89%

Conclusions & points for discussion

• Framework for estimating the impact of a health intervention based on online content

• Access to different & larger parts of the population

Evaluation is hard, however: • PHE’s impact estimates: -‐66% based on sentinel

surveillance, -‐24% laboratory confirmed • Correlation between actual vaccination uptake and our

study’s estimated impacts

Why are Bing and Twitter estimations different? • Different user demographics (?) — this can be useful • Different temporal resolution

(Pebody et al., 2014)

Potential future work directions

• Improve supervised learning models - better natural language processing / machine

learning modelling - combination of different data sources

• Work on unsupervised techniques - inferring / understanding the demographics of the

online medium will be essential

• More rigorous evaluation

Collaborators, acknowledgements & material

Elad Yom-‐Tov, Microsoft Research Richard Pebody, Public Health England Ingemar J. Cox, UCL & University of Copenhagen

Jens Geyti, UCL (Software Engineer) Simon de Lusignan, University of Surrey & RCGP

Slides: ow.ly/RN7MZPaper: ow.ly/RN9J2

i-‐sense.org.uk

Bollen, Mao & Zeng. Twitter mood predicts the stock market. J Comp Science, 2011. Burger, Henderson, Kim & Zarrella. Discriminating Gender on Twitter. EMNLP, 2011. Choi & Varian. Predicting the Present with Google Trends. Economic Record, 2012. Culotta. Lightweight methods to estimate influenza rates and alcohol sales volume from Twitter messages. Lang Resour Eval, 2013. Hoerl & Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 1970. Lamb, Paul & Dredze. Separating Fact from Fear: Tracking Flu Infections on Twitter. NAACL, 2013. Lambert & Pregibon. Online effects of offline ads. Data Mining & Audience Intelligence for Advertising, 2008. Lampos & Cristianini. Tracking the flu pandemic by monitoring the Social Web. CIP, 2010. Lampos & Cristianini. Nowcasting Events from the Social Web with Statistical Learning. ACM TIST, 2012. Lampos, Miller, Crossan & Stefansen. Advances in nowcasting influenza-‐like illness rates using search query logs. Sci Rep, 2015. Lampos, Yom-‐Tov, Pebody & Cox. Assessing the impact of a health intervention via user-‐generated Internet content. DMKD, 2015. Pebody et al. Uptake and impact of a new live attenuated influenza vaccine programme in England: early results of a pilot in primary school-‐age children, 2013/14 influenza season. Eurosurveillance, 2014. Preotiuc-‐Pietro, Lampos & Aletras. An analysis of the user occupational class through Twitter content. ACL, 2015. Rao, Yarowsky, Shreevats & Gupta. Classifying Latent User Attributes in Twitter. SMUC, 2010. Rasmussen & Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. Tumasjan, Sprenger, Sandner & Welpe. Predicting Elections with Twitter: What 140 characters Reveal about Political Sentiment. ICWSM, 2010. Zou & Hastie. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol, 2005.

References

Assessing the impact of a health intervention via user-generated Internet content

Science

Transcript of Assessing the impact of a health intervention via user-generated Internet content