WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Page 1: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS

WHAT KIND OF POLITICAL CAMPAIGNS/USES HAVE THERE BEEN?

▸ #thisundocumentedlife

▸ Undocumented teenagers living in the US shared aspects of their lives and struggles through an Instagram campaign

▸ #blacklivesmatter

▸ Used to instantly document violence towards black people, as part of the larger campaign

Stefano M. Iacus

University of Milan VOICES from the Blogs

R Foundation for Statistical Computing

To what extent can social media help migration monitoring?

(ongoing work with L. Curini, R. Impicciatore, Y. Theocharis)

Measuring Migration: NTTS 2017 satellite event, Brussels, 13 March 2017

Page 2: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Let us start from the very beginning…

Page 3: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

1 exabyte = 1000^6 bytes (a trillion on the long scale): a billion billion bytes

Timeline: dawn of civilization → 2003

“There was 5 exabytes of information created between the dawn of civilization through 2003… but that much information is now created every 2 days” (Eric Schmidt, Google, 2010)

How big are Big data?
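To give the quoted figure a scale, the short sketch below converts "5 exabytes every 2 days" into an average data rate. It only uses the definition of the exabyte given on the slide (1000^6 bytes); the function name is illustrative.

```python
EXABYTE = 1000 ** 6  # bytes, as defined on the slide


def bytes_per_second(exabytes: float, days: float) -> float:
    """Average data rate implied by `exabytes` being created every `days` days."""
    seconds = days * 24 * 60 * 60
    return exabytes * EXABYTE / seconds


rate = bytes_per_second(5, 2)
print(f"{rate / 1000**4:.1f} terabytes per second")  # prints "28.9 terabytes per second"
```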

Page 4: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

2013: 2.7 billion Internet users, 37.9% of the world population

2014: 2.9 billion users, 40.4% of the world population

2015: 3.2 billion users, +11.6% increase

2016: 46.1% of the world population (as of July 2016)

Source:

Internet growth rate

Page 6: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

2013: 2.7 billion Internet users, 37.9% of the world population

2014: 2.9 billion users, 40.4% of the world population

2015: 3.18 billion users, +11.6% increase

2016: 46.1% of the world population (as of July 2016)

Source:

Internet growth rate

unfortunately crashed on Sep 1st 2016

Page 7: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Why Social Media?

Page 8: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Take advantage of the specificity of these data:

• Social media are Big Data: we have many observations in time and space

• They exhibit nowcasting properties which can be exploited

• There are limitless ways to look at these data

• They are usually less expensive and faster to collect than survey data

Page 9: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Limits of Social Media data

• The real profiles behind social media accounts are not known in most cases;

• The population on social media is a biased sample of the demographic population;

• The social media population under observation changes according to the topic;

• Social media are not the same everywhere (no Facebook but VK in Russia, no Twitter but Sina Weibo in China, etc.)

(possible solutions to some of these issues at the end of the talk)

Page 10: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Twitter numbers

Number of (monthly) active users

Page 11: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Instagram numbers

WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS

YANNIS THEOCHARIS BI-ANNUAL SMAPP GLOBAL CONFERENCE, NEW YORK UNIVERSITY FLORENCE, MAY 23-24, 2016

Number of (monthly) active users

Instagram users have shared over 30 billion photos to date, and now share an average of 70 million photos per day

70 percent of Instagram users come from outside of the U.S.

Page 12: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Observing the unobservable

Page 13: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

In November 2014 VOICES published the article “Support for Isis stronger in Arabic social media in Europe than in Syria” for The Guardian. The analysis of 2 million online posts found that those originating in Europe were more favourable to Isis than those from the frontline of the conflict.

[Figure: Total ISIS mentions and sentiment on social media from July to October 2014]

In December 2015 VOICES published the article “Here’s a paradox: Shutting down the Islamic State on Twitter might help it recruit” for the Washington Post.

“[…] limiting debate in a digital forum could further radicalize and isolate possible Islamic State sympathizers. The resulting “loneliness effect” can be dangerous”

“We examined nearly 13 million tweets in Arabic from 53 countries published between July 2014 and January 2015. We examined the ratio of positive to negative tweets about the Islamic State, by country.”

Observing the unobservable

Page 14: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

As of Nov 2014

Belgium Isis attack: 22 March 2016

Page 15: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

The “loneliness effect” and its risks (…) limiting debate in a digital forum could further radicalize and isolate possible Islamic State sympathizers.

Page 16: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Why Social Media and Refugees?

Page 17: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Page 18: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

• Refugees* are coming from highly wired countries, with youths that are experts in ICT use and high levels of social media use (Howard, 2011; Howard & Hussain, 2014)

• Smartphones are the most important items in most refugees’ luggage: about the only thing available to them to keep in touch with people at home

• The best possible tool in their hands to document the conditions they live in and their struggle for physical survival

• Help people create and share their own narratives through systematic visual and textual documentation

• Allow them to present themselves as human beings rather than as “others” or “hostiles” or whatever else

* At least those from the Syrian crisis

Why Social Media and Refugees?

Page 19: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Part of the Twitter and Instagram data carries geo-reference metadata

For Twitter, this proportion is around 1% to 3% of the total accounts

{ "geo": { "type": "Point", "coordinates": [40.0160921, -105.2812196] }, "coordinates": { "type": "Point", "coordinates": [-105.2812196, 40.0160921] } }
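A minimal sketch of how the payload above could be read. Note that the two fields list the same point in opposite orders: `coordinates` follows the GeoJSON convention (longitude, latitude), while `geo` gives (latitude, longitude). The helper name `extract_lat_lon` is illustrative and not part of any Twitter library.

```python
import json

# The example payload from the slide.
tweet_json = """
{ "geo": { "type": "Point", "coordinates": [40.0160921, -105.2812196] },
  "coordinates": { "type": "Point", "coordinates": [-105.2812196, 40.0160921] } }
"""


def extract_lat_lon(tweet: dict):
    """Return (lat, lon) for a geotagged tweet, or None if it carries no point."""
    point = tweet.get("coordinates")
    if point and point.get("type") == "Point":
        lon, lat = point["coordinates"]  # GeoJSON order: longitude first
        return lat, lon
    legacy = tweet.get("geo")
    if legacy and legacy.get("type") == "Point":
        lat, lon = legacy["coordinates"]  # "geo" order: latitude first
        return lat, lon
    return None


print(extract_lat_lon(json.loads(tweet_json)))
```

Tweets without either field return `None`, which is the common case given the 1% to 3% figure above.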

Do we have enough data then?

Page 20: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Part of the Twitter and Instagram data carries geo-reference metadata

For Instagram, this proportion grows to 30%

{ "data": [{ "id": "788029", "latitude": 48.858844300000001, "longitude": 2.2943506, "name": "Eiffel Tower, Paris" }]}

Page 21: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Movements in a day in Rome

[Figure: map of movements; axes lon (12.3 to 12.7) and lat (41.8 to 42.0)]

Application from tourism study

Page 22: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Density along the year in Milan

[Figure: tweet-density panels for 21/02/2014, 08/04/2014 and 25/04/2014. Labels on the panels: Concert, Salone Mobile, April 25th; Forum Assago, Fiera Milano Rho, Duomo Square. Scale: Tweet.]

Page 23: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Travels of Italian Twitter accounts through 2012-2016

Page 24: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Travel in time

[Diagram: time axis from the past through t0, the date of the crisis event; labels: “Backward looking” and “Monitoring activity”]

Page 25: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Can we also exploit information coming from networks? Which are the hubs of information?

Would tracking them help in forecasting the flows?

Page 26: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

The ongoing Refugee project

COLLECTING DATA USING GEOLOCATION FROM IDOMENI, GREECE

About 5000 Instagram posts limited to the Idomeni camp area in Greece

Page 27: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

The ongoing Syrian Refugee project

About 5000 Instagram posts limited to the Idomeni camp area in Greece

Period 11-21 Feb 2016

Plan: track the accounts generating these posts a year later

Issue: Instagram severely restricted API usage in late April 2016, which makes this repository very valuable.
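Restricting a collection to a camp area amounts to a distance filter on the post coordinates. The sketch below is one way to do it, not the project's actual code: the centre coordinates for Idomeni are approximate, the 2 km radius is an arbitrary illustrative value, and the post records are made up, using the `latitude` and `longitude` keys from the Instagram example earlier in the talk.

```python
import math

EARTH_RADIUS_KM = 6371.0
IDOMENI = (41.12, 22.51)  # approximate (lat, lon), for illustration only


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def posts_near(posts, centre, radius_km):
    """Keep only the posts whose coordinates fall within radius_km of centre."""
    lat0, lon0 = centre
    return [p for p in posts
            if haversine_km(p["latitude"], p["longitude"], lat0, lon0) <= radius_km]


# Hypothetical posts: one close to the camp, one at the Eiffel Tower.
posts = [
    {"id": "p1", "latitude": 41.121, "longitude": 22.512},
    {"id": "p2", "latitude": 48.8588443, "longitude": 2.2943506},
]
in_camp = posts_near(posts, IDOMENI, radius_km=2.0)
print([p["id"] for p in in_camp])
```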

Page 28: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

The ongoing Syrian Refugee project

Page 29: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

COLLECTING DATA USING #REFUGEES (POSTS LIKED)

Activity around the 5000 posts

Page 30: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

The ongoing Syrian Refugee project

1. Extract the Instagram accounts present in the data collection from the Idomeni camp (Greece)

2. Follow them in the future (i.e. the present): are they still in Greece, or have they moved around Europe?

3. Check whether, before the crisis, the accounts were posting from, e.g., Syria, to tell “real” from suspect refugees

4. Confirm the analysis by manually looking at the profiles
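Steps 1 to 3 above can be sketched as a small bookkeeping routine over geolocated posts. Everything below is an assumption made for illustration: the record keys (`user`, `date`, `country`), the function name and the sample data are hypothetical, and only the collection window (11-21 Feb 2016) comes from the slides. Step 4 remains a manual check.

```python
from datetime import date

# Collection window stated on the slides: 11-21 Feb 2016.
WINDOW_START = date(2016, 2, 11)
WINDOW_END = date(2016, 2, 21)


def track_accounts(camp_posts, history):
    """Summarise where each camp account posted before and after the window.

    camp_posts: posts collected in the camp area (each has a "user" key).
    history:    geolocated posts by any account, with hypothetical keys
                "user", "date" (datetime.date) and "country".
    """
    accounts = {post["user"] for post in camp_posts}  # step 1
    summary = {}
    for user in sorted(accounts):
        posts = sorted((h for h in history if h["user"] == user),
                       key=lambda h: h["date"])
        before = [h["country"] for h in posts if h["date"] < WINDOW_START]
        after = [h["country"] for h in posts if h["date"] > WINDOW_END]
        summary[user] = {
            "latest_country": after[-1] if after else None,  # step 2
            "earlier_countries": sorted(set(before)),         # step 3
        }
    return summary


# Made-up example data.
camp_posts = [{"user": "a"}, {"user": "b"}]
history = [
    {"user": "a", "date": date(2015, 6, 1), "country": "Syria"},
    {"user": "a", "date": date(2016, 2, 15), "country": "Greece"},
    {"user": "a", "date": date(2017, 2, 1), "country": "Germany"},
    {"user": "b", "date": date(2016, 2, 12), "country": "Greece"},
    {"user": "b", "date": date(2017, 1, 10), "country": "Greece"},
    {"user": "c", "date": date(2017, 1, 10), "country": "Italy"},
]
summary = track_accounts(camp_posts, history)
print(summary)
```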

Page 31: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

?

Page 32: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Three possible opportunities/challenges:

• small area estimation approach (Rao, 2003): use social media data as IV

• anchoring social media data to official statistics (Bayesian/multilevel approach)

• build composite statistics (survey + social media) to nowcast migration phenomena

The new magic word in Social Data Science: “data mashup” = mix Big Data with official statistics

Page 33: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Small area estimation idea: HCR (head count ratio, an at-risk-of-poverty index) vs. mobility data (Md). Marchetti et al. (2015), JOS, 31(2), 263-281.

[…] a measure of entropy where $(l_1, l_2)$ represents a pair of locations, $p_v(l_1, l_2)$ is the probability of observing a movement of vehicle $v$ between the locations $l_1$ and $l_2$, and $L$ is the total number of locations. The probability $p_v(l_1, l_2)$ is given by the ratio between the number of trips of $v$ between $l_1$ and $l_2$ and the total number of trips of $v$. When $l_1$ is equal to $l_2$, $p_v(l_1, l_2)$ is set to 0. Then, we define the mobility of an area $d$ as:

$$M_d = \frac{1}{V_d} \sum_{v \in d} M_v, \qquad (3)$$

where $V_d$ is the number of vehicles resident in area $d$. A vehicle is considered resident in the area where it most frequently stops during the night. The mobility value tends to zero when the vehicle $v$ visits few distinct locations, showing low mobility diversity. On the other hand, when the mobility measure (2) increases, it means that the vehicle $v$ makes journeys with several locations as destinations. We calculate the standard deviation of the mobility $M_d$ for each area. For a given area $d$ we measure the standard deviation of the mobility by:

$$s_{M_d} = \left\{ \frac{\sum_{v \in d} (M_v - M_d)^2}{V_d - 1} \right\}^{1/2}, \qquad (4)$$

where $M_v$ and $M_d$ are defined by (2) and (3).
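The quantities in (3) and (4) can be computed as in the sketch below. Equation (2) is not reproduced on the slide, so the per-vehicle measure here is assumed to be a plain Shannon entropy over the trip probabilities described in the text; the paper's exact form (for instance a normalisation by the number of locations) may differ. Function names and the toy trips are illustrative.

```python
import math
from collections import Counter
from statistics import mean, stdev


def vehicle_mobility(trips):
    """Entropy-type mobility M_v of one vehicle from its (origin, destination) trips.

    Assumed form: -sum p * log(p), with p = trips on a pair / total trips.
    """
    total = len(trips)
    if total == 0:
        return 0.0
    # Trips with identical origin and destination get probability 0, as in the text.
    counts = Counter((a, b) for a, b in trips if a != b)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def area_mobility(vehicle_trips):
    """Return (M_d, s_Md): mean and sample standard deviation of M_v in one area."""
    values = [vehicle_mobility(trips) for trips in vehicle_trips]
    m_d = mean(values)                                  # equation (3)
    s_md = stdev(values) if len(values) > 1 else 0.0    # equation (4), divisor V_d - 1
    return m_d, s_md


# Toy area with two resident vehicles.
area = [
    [("A", "B")] * 4,            # always the same journey: M_v = 0
    [("A", "B"), ("B", "A")],    # two equally likely journeys: M_v = log 2
]
m_d, s_md = area_mobility(area)
print(round(m_d, 4), round(s_md, 4))
```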

Figure 1 shows the scatterplot of the HCR values plotted against the $s_{M_d}$ values computed for the ten provinces of the Tuscany region. Their linear correlation coefficient, used as a mere descriptive index, is equal to −0.74. This result suggests that higher levels of heterogeneity of mobility ($M_d$), expressed by the standard deviation $s_{M_d}$, are in the provinces where there are lower levels of poverty. In other words, the diversification of mobility within an area with respect to its mean value can be a proxy […]

[Figure: Fig. 1. Scatterplot of the standard deviation of the mobility ($s_{M_d}$) vs. estimates of the HCR at province level in Tuscany. Source: Journal of Official Statistics, p. 270.]

Model HCR as a function of Md for the data at hand, then use Md to estimate the unobserved HCR in the region d.

Use the estimate of HCR in further analysis.

Extension: here d is space (province), but we can extend the model to time as well using a time-series approach

Goal: extend/estimate official statistics data
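The idea of modelling HCR as a function of a mobility indicator and then predicting it where it is unobserved can be illustrated with a one-covariate least-squares fit. This is only a sketch: the slide does not specify the model form, a straight line is assumed here, and all numbers below are made up (chosen so that HCR decreases with mobility, in line with the negative correlation reported above).

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict(intercept, slope, x_new):
    """Predicted HCR for an area whose mobility indicator is x_new."""
    return intercept + slope * x_new


# Hypothetical areas where both the mobility indicator and HCR are observed.
mobility = [100.0, 110.0, 120.0, 130.0]
hcr = [0.30, 0.25, 0.20, 0.15]

intercept, slope = fit_line(mobility, hcr)
# Area with mobility data but no survey-based HCR.
hcr_hat = predict(intercept, slope, 115.0)
print(round(slope, 4), round(hcr_hat, 3))
```

The predicted value would then stand in for the missing HCR in further analysis, as the slide suggests.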

Page 34: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

On a continuous-time & space model for small area estimation (C-SAE)

(ongoing project called SWBI)

The continuous-time and space model is stated as follows.

Let $Y_t^d$ be a direct measure of a proxy of some sort of well-being (e.g. “hospital admissions”) for region $d = 1, \ldots, D$, at time $t$.

Let $\mu_t^d$ be the target latent measure of well-being (e.g. “health-related quality of life”) for region $d = 1, \ldots, D$.

Assume there are other covariates, e.g. official statistics, say $X_{j,t}^d$, for $j = 1, \ldots, K$, region $d$ at time $t$, that summarize the “standard of living”, so that we can write

$$dY_t^d = \mu_t^d\,dt + \sum_{j=1}^{K} \beta_j X_{j,t}^d\,dt + dB_t^d$$

OBS = UNOBS + COVARIATES + NOISE

Then assume that there are $m_1$ other variables $Z_t^d$ which are expressions of $\mu$. We assume that these covariates are also geographically dependent. Finally, $S_t^d$ are $m_2$ variables that come from social media (e.g. SWBI components). We assume that $Z_t^d$ and $S_t^d$ contribute to $\mu_t^d$ in this way:

$$d\mu_t^d = \left(\theta_0^d + \sum_{i=1}^{m_1} \theta_i^d Z_{i,t}^d + \sum_{i=1}^{m_2} \gamma_i^d S_{i,t}^d\right)dt + \sum_{l=1}^{D} w_{ld}\,dL_t^d$$

where $w = (w_{d,l})$ is a proximity matrix detected among the different areas, which is used to control for spatial correlation.

$$dX_{i,t}^d = \zeta_i^d\,(\nu_i^d - X_{i,t}^d)\,dt + d\Pi_{i,t}^d, \quad i = 1, \ldots, K$$

$$dZ_{i,t}^d = \alpha_i^d\,(\delta_i^d - Z_{i,t}^d)\,dt + dW_{i,t}^d, \quad i = 1, \ldots, m_1$$

$$dS_{\ell,t}^d = \lambda_\ell^d\,(\eta_\ell^d - S_{\ell,t}^d)\,dt + d\Gamma_{\ell,t}^d, \quad \ell = 1, \ldots, m_2$$

where $Z_{i,t}^d$ represent the quality of life variables and $B_t^d$, $L_t^d$, $W_t^d$ and $\Gamma_t^d$ are vectors of independent Brownian motions.

As quality of life variables we consider the SWBI index and its components, some weather and pollution indicators, the consumer price index for blue and white-collar worker households, and a proxy of the job market.

Provinces of Lombardy: $d = 1, \ldots, 11$

Page 35: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

On a continuous-time & space model for small area estimation (C-SAE)

[Figure: time series for Milan from Dic2013 to Dic2015 (December 2013 to December 2015). Panels: Ricoveri, Consumi, Casa, Pensioni, ValoreAggiunto, TenoreVita, ServiziAmbiente, AffariLavoro, OrdinePubblico, Popolazione, TempoLibero, disoccupati, Prezzi, pop_media, num_avviamenti, pm10, pm2.5, tmedia, umidita. Social media estimates: emo, fun, rel, res, sat, tru, vit, wor, swbi.]

(ongoing project called SWBI)

Page 36: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

On a continuous-time & space model for small area estimation (C-SAE)

provinces of Lombardy

d=1,…, 11

The continuous-time and space model is stated as on Page 34. Putting all together:

$$dY_t^d = \left(\theta_0^d + \sum_{i=1}^{m_1} \theta_i^d Z_{i,t}^d + \sum_{i=1}^{m_2} \gamma_i^d S_{i,t}^d\right)dt + \sum_{l=1}^{D} w_{ld}\,dL_t^d + \sum_{j=1}^{K} \beta_j X_{j,t}^d\,dt + dB_t^d$$

Page 37: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

On a continuous-time & space model for small area estimation (C-SAE)

d=1,…, 11

With the continuous-time and space model stated as on Page 34, once the parameters are estimated we can predict the underlying measure as follows:

$$d\mu_t^d = \left(\theta_0^d + \sum_{i=1}^{m_1} \theta_i^d Z_{i,t}^d + \sum_{i=1}^{m_2} \gamma_i^d S_{i,t}^d\right)dt + \sum_{l=1}^{D} w_{ld}\,dL_t^d$$

In our case we have $D = 11$, $m_1 = 4$, $m_2 = 8$, $K = 2$, which means $D \cdot (1 + K + m_1 + m_2) = 11 \cdot 15 = 165$ equations and 473 parameters!

Page 38: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Advantages of the continuous-time & space model for small area estimation (C-SAE)

Using official statistics for Y, X and Z means that we have at most monthly data, while social media data S are intraday or daily data

Once the model is estimated at “low frequency” (e.g., monthly), we can simulate it at any frequency/time t.
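The covariate equations of the model have the mean-reverting form $dX = \zeta(\nu - X)\,dt + d\Pi$, so simulating "at any frequency" amounts to choosing the time step of a discretisation. The sketch below uses an Euler–Maruyama scheme, which is a standard choice and not necessarily the one used in the project. The parameter values are invented, and `sigma` is an extra noise-scale parameter not present on the slide (the default 1.0 matches the unit noise of the model).

```python
import math
import random


def simulate_ou(x0, rate, level, dt, n_steps, rng, sigma=1.0):
    """Euler-Maruyama path of dX = rate * (level - X) dt + sigma dB.

    dt is the step length in the time unit used for estimation (here: months).
    """
    path = [x0]
    for _ in range(n_steps):
        x = path[-1]
        shock = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x + rate * (level - x) * dt + shock)
    return path


rng = random.Random(42)
# Same hypothetical parameters, two sampling frequencies over one year.
monthly = simulate_ou(x0=100.0, rate=0.5, level=110.0, dt=1.0, n_steps=12, rng=rng)
daily = simulate_ou(x0=100.0, rate=0.5, level=110.0, dt=1.0 / 30, n_steps=360, rng=rng)
print(len(monthly), len(daily))  # prints "13 361"
```

The parameters stay expressed per month; only `dt` changes when moving from monthly to daily simulation.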

Page 39: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Anchoring Big Data to Official Statistics/Survey data

Anchoring social media data to official statistics: ongoing.

$X_t^s \sim L(\theta_t^s)$: social media statistics, $t$ = time, $s$ = space

$\theta_t^s \sim Q(\phi_t^s)$: $Q$ is the prior calibrated on official statistics

Outcome: posterior distribution of $X_t^s$, $P(X_t^s; \theta_t^s \mid \phi_t^s)$.

Goal 1: adjust social media statistics (or indirectly estimate bias?) to obtain high-frequency and space-distributed social info in real time, long before official statistics or survey data are collected in the next wave.

Goal 2: make official-statistics subjects (institutions, politicians, academics, etc.) happier with social media data.
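The slide leaves the likelihood L and the prior Q unspecified. As a toy illustration of the anchoring idea only, the sketch below assumes a binomial likelihood for a social media share and a Beta prior centred on an official rate; the conjugate update then pulls the raw social media estimate towards the official figure. The `strength` parameter (prior pseudo-observations) and all numbers are hypothetical.

```python
def beta_prior_from_official(rate, strength):
    """Beta(a, b) prior with mean `rate`, worth `strength` pseudo-observations."""
    return rate * strength, (1.0 - rate) * strength


def update(a, b, positives, total):
    """Conjugate Beta-Binomial update with `positives` out of `total` posts."""
    return a + positives, b + (total - positives)


def beta_mean(a, b):
    return a / (a + b)


# Official statistic: 10% (hypothetical); social media sample: 30 of 100 posts.
a0, b0 = beta_prior_from_official(rate=0.10, strength=100)
a1, b1 = update(a0, b0, positives=30, total=100)
print(round(beta_mean(a1, b1), 3))  # anchored estimate, between 0.10 and 0.30
```

A larger `strength` trusts the official statistic more; a smaller one lets the high-frequency social media signal dominate.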

Page 40: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Goal: nowcast overall expectations and act accordingly. Example: watch for switches in the WNI’s trend and speculate on the market.

Combining official statistics, survey and social media: Wired Next Index (WNI)

WNI (“measures” expectations about the economic wealth of a country, well... Italy) combines different time series:

• Off. Stat: GDP, Import/Export, Unemployment rate (low frequency, backward looking)

• Survey Data: consumer expectations, entrepreneurs expectations (low freq., forward looking)

• Social Media Data: sentiment data on economy, politics, personal wealth (high frequency, geo-referenced, nowcasting)
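The slide does not say how WNI combines its inputs. One generic way to mix series of different frequencies, shown here purely as an assumption, is to carry each low-frequency value forward to the daily grid, standardise every series, and take a weighted average. The weights, series and function names below are all made up.

```python
from statistics import mean, pstdev


def forward_fill(values_by_day, n_days):
    """Expand sparse {day_index: value} data to a daily list.

    The last known value is carried forward; day 0 must be present.
    """
    filled, last = [], None
    for day in range(n_days):
        last = values_by_day.get(day, last)
        filled.append(last)
    return filled


def zscores(series):
    """Standardise a series to mean 0 and (population) standard deviation 1."""
    m, s = mean(series), pstdev(series)
    return [(v - m) / s if s else 0.0 for v in series]


def composite(series_list, weights):
    """Weighted average of standardised series of equal length."""
    standardised = [zscores(s) for s in series_list]
    n = len(series_list[0])
    return [sum(w * s[t] for w, s in zip(weights, standardised)) for t in range(n)]


# Hypothetical inputs: a low-frequency official series and a daily sentiment series.
official = forward_fill({0: 1.0, 2: 3.0}, n_days=4)   # [1.0, 1.0, 3.0, 3.0]
sentiment = [0.0, 2.0, 0.0, 2.0]
index = composite([official, sentiment], weights=[0.5, 0.5])
print(index)  # prints [-1.0, 0.0, 0.0, 1.0]
```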

Page 41: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

What about Replicability?

Replicability of this particular experiment: given the structure of the query used to download the data and the “post ID”, anyone can replicate the analysis.

Replicability of a similar idea: as the model is very easy to understand (although computationally intensive), the same idea can be applied to other situations. We are indeed working on clandestine migrants in Italy, and soon Spain, using mainly Twitter data.

Page 42: WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS To …

Operational headquarters: Via Gaspare Bugatti 7/A, Milano

Tel. +39 3661652058/61/64 Fax +39 0269000855

voices-int.com

[email protected]

@blogsvoices

Thanks!

For further information: [email protected]