Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3....

29
Visual Profiling of Large Statistical Datasets Martijn Tennekes Edwin de Jonge, Piet Daas February 10, 2011

Transcript of Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3....

Page 1: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Visual Profiling of LargeStatistical Datasets

Martijn TennekesEdwin de Jonge, Piet Daas

February 10, 2011

Page 2: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Outline

Introduction

Tableplot description

Applications

Implementation in R

Visual Profiling of Large Statistical Datasets 2

Page 3: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Introduction

Large statistical dataset

Administrative sources

Survey data

Quality assessment at a technical level

Step 1: Technical checks (e.g. readability and convertability)

Step 2: Data profiling

Representation and distribution of valuesStrange data patternsOccurrence of missing values

Visual Profiling of Large Statistical Datasets 3

Page 4: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Introduction

Large statistical dataset

Administrative sources

Survey data

Quality assessment at a technical level

Step 1: Technical checks (e.g. readability and convertability)

Step 2: Data profiling

Representation and distribution of valuesStrange data patternsOccurrence of missing values

Visual Profiling of Large Statistical Datasets 3

Page 5: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Introduction

Large statistical dataset

Administrative sources

Survey data

Quality assessment at a technical level

Step 1: Technical checks (e.g. readability and convertability)

Step 2: Data profiling

Representation and distribution of valuesStrange data patternsOccurrence of missing values

Visual Profiling of Large Statistical Datasets 3

Page 6: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Introduction

Large statistical dataset

Administrative sources

Survey data

Quality assessment at a technical level

Step 1: Technical checks (e.g. readability and convertability)

Step 2: Data profiling

Representation and distribution of valuesStrange data patternsOccurrence of missing values

Visual Profiling of Large Statistical Datasets 3

Page 7: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

IntroductionTraditional approach:

Visual Profiling of Large Statistical Datasets 4

Page 8: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

IntroductionNew approach:

Visual Profiling of Large Statistical Datasets 5

Page 9: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Tableplot description

DATA

v1 v2 .... v12

100,000 records

12 variables

Visual Profiling of Large Statistical Datasets 6

Page 10: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Tableplot description

DATA

v1 v2 .... v12

100,000 records

12 variables

Visual Profiling of Large Statistical Datasets 7

Page 11: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Tableplot description

DATA

v1 v2 .... v12

100,000 records

12 variables

Visual Profiling of Large Statistical Datasets 8

Page 12: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Tableplot description

DATA

v1 v2 .... v12

100,000 records

12 variables123

100

...

...

100 row bins

Visual Profiling of Large Statistical Datasets 9

Page 13: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Tableplot description

Visual Profiling of Large Statistical Datasets 10

Page 14: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Tableplot description

DATA

v1 v2 .... v12

100,000 records

12 variables123

100

...

...

100 row bins

Visual Profiling of Large Statistical Datasets 11

Page 15: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Tableplot description

DATA

v1 v2 .... v12

100,000 records

12 variables

100...

123

100 row bins...

Visual Profiling of Large Statistical Datasets 12

Page 16: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Tableplot description

Visual Profiling of Large Statistical Datasets 13

Page 17: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Tableplot description

Visual Profiling of Large Statistical Datasets 14

Page 18: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Tableplot description

Quality measures:

1 Smoothness of a data distribution

2 Selectivity of missing values

3 Distribution of correlated variables

Visual Profiling of Large Statistical Datasets 15

Page 19: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Applications

Structural Business Statistics (SBS)

Large business survey

Circa 50,000 respondents

Data editing and analysis process:1 Unprocessed data2 Edited data3 Data prepared for analysis

Visual Profiling of Large Statistical Datasets 16

Page 20: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Unprocessed data

Visual Profiling of Large Statistical Datasets 17

Page 21: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Edited data

Visual Profiling of Large Statistical Datasets 18

Page 22: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Data prepared for analysis

Visual Profiling of Large Statistical Datasets 19

Page 23: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Comparison with other sources

Comparison:

SBS turover

VAT turnover

STS turnover

Visual Profiling of Large Statistical Datasets 20

Page 24: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Comparison with other sources

Visual Profiling of Large Statistical Datasets 21

Page 25: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Implementation in R

Package tabplot

Available on CRAN

Functions

tableplot tableplot(myDataFrame,colNames = myColumnNames,nBins = 100)

num2fac num2fac(myNumericVector,method=“pretty”,n=5)

tabGUI tableGUI()

Supports very large datasets (up to 2 · 109 records)

Visual Profiling of Large Statistical Datasets 22

Page 26: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Conclusion

Quality assessment

Existing data sourcesNew data sources

Effective method to support top-down data analysis

Apply tableplot to other sources

Further improve tableplots

Visual Profiling of Large Statistical Datasets 23

Page 27: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Conclusion

Quality assessment

Existing data sourcesNew data sources

Effective method to support top-down data analysis

Apply tableplot to other sources

Further improve tableplots

Visual Profiling of Large Statistical Datasets 23

Page 28: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Conclusion

Quality assessment

Existing data sourcesNew data sources

Effective method to support top-down data analysis

Apply tableplot to other sources

Further improve tableplots

Visual Profiling of Large Statistical Datasets 23

Page 29: Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3. Introduction Large statistical dataset Administrative sources Survey data Quality

Conclusion

Quality assessment

Existing data sourcesNew data sources

Effective method to support top-down data analysis

Apply tableplot to other sources

Further improve tableplots

Visual Profiling of Large Statistical Datasets 23