Big Data Visualization

Edwin de Jonge, December 3, 2013 Big Data Visualization “Turning Statistics into Knowledge”, Aguascalientes With thanks to Piet Daas, Martijn Tennekes and Alex Priem

description

Presentation given at Innovative Approaches to Turn Statistics into Knowledge, 2 - 4 December 2013, Aguascalientes, Mexico

Transcript of Big Data Visualization

Page 1: Big Data Visualization

Edwin de Jonge, December 3, 2013

Big Data Visualization

“Turning Statistics into Knowledge”, Aguascalientes

With thanks to Piet Daas, Martijn Tennekes and Alex Priem

Page 2: Big Data Visualization

Overview

• Big Data
  • Research ‘theme’ at Stat. Netherlands
  • Data driven approach
• Visualization as a tool
  • Why?
  • Examples in our office
    • Census
    • Social Security
    • Social Media
    • Not shown: Traffic loops, Mobile phone data

Page 3: Big Data Visualization

Why Visualization?

October 1st 2013, Statistics Netherlands

Page 4: Big Data Visualization

Effective Display!

(see Tor Nørretranders, “Bandwidth of our senses”)

Page 5: Big Data Visualization

Anscombe’s quartet…

DS1           DS2           DS3           DS4
x    y        x    y        x    y        x    y
10   8.04     10   9.14     10   7.46     8    6.58
8    6.95     8    8.14     8    6.77     8    5.76
13   7.58     13   8.74     13   12.74    8    7.71
9    8.81     9    8.77     9    7.11     8    8.84
11   8.33     11   9.26     11   7.81     8    8.47
14   9.96     14   8.1      14   8.84     8    7.04
6    7.24     6    6.13     6    6.08     8    5.25
4    4.26     4    3.1      4    5.39     19   12.5
12   10.84    12   9.13     12   8.15     8    5.56
7    4.82     7    7.26     7    6.42     8    7.91
5    5.68     5    4.74     5    5.73     8    6.89

Page 6: Big Data Visualization

Anscombe’s quartet

Property                                    Value
Mean of x1, x2, x3, x4                      All equal: 9
Variance of x1, x2, x3, x4                  All equal: 11
Mean of y1, y2, y3, y4                      All equal: 7.50
Variance of y1, y2, y3, y4                  All equal: 4.1
Correlation for ds1, ds2, ds3, ds4          All equal: 0.816
Linear regression for ds1, ds2, ds3, ds4    All equal: y = 3.00 + 0.500x

Looks the same, right?
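The summary values can be checked directly from the quartet data. The following is a minimal sketch in plain Python; the names `QUARTET`, `summarize` and `SUMMARY` are illustrative and not part of the original slides. It uses sample variance (dividing by n - 1) and ordinary least squares, which is assumed to be how the slide values were obtained.

```python
from math import sqrt

# Anscombe's quartet, as listed on the data slide.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "DS1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "DS2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "DS3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "DS4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return sample statistics and the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),   # sample variance
        "mean_y": mean_y,
        "var_y": syy / (n - 1),   # sample variance
        "corr": sxy / sqrt(sxx * syy),
        "intercept": mean_y - slope * mean_x,
        "slope": slope,
    }


SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARY.items():
        print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four data sets agree only after rounding; the underlying numbers differ slightly in the later decimals.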

Page 7: Big Data Visualization

Let’s plot!

Page 8: Big Data Visualization

Visualization

For Big Data:

Use appropriate:

- Summarization

- Granularity

- Noise filtering

Research: What works for big data?

Page 9: Big Data Visualization


Scatter plot with 100 data points

Page 10: Big Data Visualization


Scatter plot with 100 000 data points

Page 11: Big Data Visualization


Example 1: Census

Page 12: Big Data Visualization

Example Virtual Census

‐ Every 10 years a Census needs to be conducted

‐ No longer with surveys in the Netherlands
  • Last traditional census was in 1971

‐ Now by (re-)using existing information
  • Linking administrative sources and available sample survey data at a large scale
  • Check result
  • How?
  • With a visualisation method: the Tableplot

Page 13: Big Data Visualization

Making the Tableplot

1. Load file (17 million records)
2. Sort records according to key variable (17 million records)
   • Age in this example
3. Combine records (100 groups, 170,000 records each)
   • Numeric variables
     • Calculate average (avg. age)
   • Categorical variables
     • Ratio between categories present (male vs. female)
4. Plot figure of a selected number of variables (up to 12)
   • Colours used are important
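Steps 2 and 3 above can be sketched in a few lines. This is an illustrative sketch only, not the code used at Statistics Netherlands: the function `tableplot_bins`, its parameters and the toy `PEOPLE` data are assumptions made for the example, and the ascending sort order is an arbitrary choice.

```python
from collections import Counter


def tableplot_bins(records, sort_key, numeric, categorical, n_bins=100):
    """Aggregate records into equally sized row bins, tableplot-style.

    records: list of dicts; sort_key: name of the variable to sort on.
    Returns one summary dict per bin, in sorted order.
    """
    ordered = sorted(records, key=lambda r: r[sort_key])
    n = len(ordered)
    bins = []
    for b in range(n_bins):
        # Integer arithmetic spreads any remainder evenly over the bins.
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        summary = {"size": len(chunk)}
        for var in numeric:
            # Numeric variables: average per bin.
            summary[var] = sum(r[var] for r in chunk) / len(chunk)
        for var in categorical:
            # Categorical variables: share of each category per bin.
            counts = Counter(r[var] for r in chunk)
            summary[var] = {cat: c / len(chunk) for cat, c in counts.items()}
        bins.append(summary)
    return bins


# Hypothetical toy data: 1,000 people with ages 0-99 (not census data).
PEOPLE = [{"age": i % 100, "sex": "male" if i % 2 else "female"}
          for i in range(1000)]
BINS = tableplot_bins(PEOPLE, sort_key="age", numeric=["age"],
                      categorical=["sex"], n_bins=10)
```

Each entry of `BINS` corresponds to one row of the plot: a bar length for every numeric variable and a stacked bar for every categorical variable. With 17 million records and 100 bins, each row would summarize 170,000 records.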


Page 14: Big Data Visualization
Page 15: Big Data Visualization

Tableplot of the census test file

Page 16: Big Data Visualization

Tableplot: Monitor data quality


– All data in the office pass through these stages:

‐ Raw data (collected)

‐ Preprocessed (technically correct)

‐ Edited (completed data)

‐ Final (removal of outliers etc.)

Page 17: Big Data Visualization

Processing of data

Raw (unedited) data

Edited data

Final data

Page 18: Big Data Visualization

Example 2: Social Security Register

Page 19: Big Data Visualization

Social Security Register

– Contains all financial data on jobs, benefits and pensions in the Netherlands

‐ Collected by the Dutch Tax office

‐ A total of 20 million records each month

‐ How to obtain insight into so much data?
  • With a visualisation method: a heat map
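A heat map of this kind rests on two-dimensional binning: every record is assigned to a grid cell and only the count per cell is drawn. The sketch below shows that step in plain Python. The function `bin_2d`, the bin widths and the toy `POINTS` are assumptions for illustration; the actual binning used for the register is not stated in the slides.

```python
def bin_2d(points, x_min, x_width, nx, y_min, y_width, ny):
    """Count (x, y) points per cell of an nx-by-ny grid.

    Cells are half-open intervals of fixed width; points that fall
    outside the grid are ignored. Returns grid[row][col] with rows
    indexed by y and columns by x.
    """
    grid = [[0] * nx for _ in range(ny)]
    for x, y in points:
        col = int((x - x_min) // x_width)
        row = int((y - y_min) // y_width)
        if 0 <= col < nx and 0 <= row < ny:
            grid[row][col] += 1
    return grid


# Hypothetical (age, monthly amount in euro) pairs, not register data.
POINTS = [(23, 1200), (27, 1800), (25, 1500), (41, 3200), (44, 3900), (67, 900)]

# Ages 20-70 in 10-year columns, amounts 0-4000 in rows of 1000 euro.
GRID = bin_2d(POINTS, x_min=20, x_width=10, nx=5, y_min=0, y_width=1000, ny=4)
```

The resulting counts are then mapped to colour. The size of the grid stays fixed however many records are added, which is what makes the approach usable for 20 million records a month.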


Page 20: Big Data Visualization

Heat map: Age vs. ‘Income’

[Figure: heat map with axes Age and Income (euro)]

Page 21: Big Data Visualization


Page 22: Big Data Visualization


Example 3: Social media

Page 23: Big Data Visualization

Daily Sentiment in Dutch Social Media

Social media: daily sentiment in Dutch messages


Page 24: Big Data Visualization

Granularity: From day to week

Social media: daily & weekly sentiment in Dutch messages

Page 25: Big Data Visualization

Granularity: From day to month

Social media: daily, weekly & monthly sentiment in Dutch messages
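Changing the granularity amounts to averaging the daily values per week or per month. A minimal sketch follows, assuming ISO weeks and calendar months as the periods; the helper names and the toy `DAILY` series are hypothetical and do not reproduce the sentiment data.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate(daily, key):
    """Average daily values per period; key maps a date to a period label."""
    groups = defaultdict(list)
    for day, value in daily.items():
        groups[key(day)].append(value)
    return {period: sum(values) / len(values) for period, values in groups.items()}


def iso_week(day):
    """Label a date by its ISO (year, week number)."""
    year, week, _ = day.isocalendar()
    return (year, week)


def month(day):
    """Label a date by its calendar (year, month)."""
    return (day.year, day.month)


# Hypothetical toy series: a level of 10 with alternating +/-1 daily noise.
START = date(2013, 1, 7)  # a Monday
DAILY = {START + timedelta(days=i): 10 + (1 if i % 2 else -1) for i in range(28)}

WEEKLY = aggregate(DAILY, iso_week)
MONTHLY = aggregate(DAILY, month)
```

In this toy series the daily values swing by 2 points while the weekly averages differ by less than 0.3, which is the noise-filtering effect the coarser granularity is meant to show.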


Page 26: Big Data Visualization

Enter: Consumer confidence!

Social media: monthly sentiment in Dutch messages & Consumer confidence

Corr: 0.88

Page 27: Big Data Visualization

Conclusions

Big data is a very interesting data source for official statistics

Visualisation is a great way of getting/creating insight

Not only for data exploration, but also for finding errors

Page 28: Big Data Visualization

The future of statistics?