Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3....

Post on 22-May-2020

4 views 0 download

Transcript of Visual Profiling of Large Statistical Datasets · Visual Pro ling of Large Statistical Datasets 3....

Visual Profiling of LargeStatistical Datasets

Martijn TennekesEdwin de Jonge, Piet Daas

February 10, 2011

Outline

Introduction

Tableplot description

Applications

Implementation in R

Visual Profiling of Large Statistical Datasets 2

Introduction

Large statistical dataset

Administrative sources

Survey data

Quality assessment at a technical level

Step 1: Technical checks (e.g. readability and convertability)

Step 2: Data profiling

Representation and distribution of valuesStrange data patternsOccurrence of missing values

Visual Profiling of Large Statistical Datasets 3

Introduction

Large statistical dataset

Administrative sources

Survey data

Quality assessment at a technical level

Step 1: Technical checks (e.g. readability and convertability)

Step 2: Data profiling

Representation and distribution of valuesStrange data patternsOccurrence of missing values

Visual Profiling of Large Statistical Datasets 3

Introduction

Large statistical dataset

Administrative sources

Survey data

Quality assessment at a technical level

Step 1: Technical checks (e.g. readability and convertability)

Step 2: Data profiling

Representation and distribution of valuesStrange data patternsOccurrence of missing values

Visual Profiling of Large Statistical Datasets 3

Introduction

Large statistical dataset

Administrative sources

Survey data

Quality assessment at a technical level

Step 1: Technical checks (e.g. readability and convertability)

Step 2: Data profiling

Representation and distribution of valuesStrange data patternsOccurrence of missing values

Visual Profiling of Large Statistical Datasets 3

IntroductionTraditional approach:

Visual Profiling of Large Statistical Datasets 4

IntroductionNew approach:

Visual Profiling of Large Statistical Datasets 5

Tableplot description

DATA

v1 v2 .... v12

100,000 records

12 variables

Visual Profiling of Large Statistical Datasets 6

Tableplot description

DATA

v1 v2 .... v12

100,000 records

12 variables

Visual Profiling of Large Statistical Datasets 7

Tableplot description

DATA

v1 v2 .... v12

100,000 records

12 variables

Visual Profiling of Large Statistical Datasets 8

Tableplot description

DATA

v1 v2 .... v12

100,000 records

12 variables123

100

...

...

100 row bins

Visual Profiling of Large Statistical Datasets 9

Tableplot description

Visual Profiling of Large Statistical Datasets 10

Tableplot description

DATA

v1 v2 .... v12

100,000 records

12 variables123

100

...

...

100 row bins

Visual Profiling of Large Statistical Datasets 11

Tableplot description

DATA

v1 v2 .... v12

100,000 records

12 variables

100...

123

100 row bins...

Visual Profiling of Large Statistical Datasets 12

Tableplot description

Visual Profiling of Large Statistical Datasets 13

Tableplot description

Visual Profiling of Large Statistical Datasets 14

Tableplot description

Quality measures:

1 Smoothness of a data distribution

2 Selectivity of missing values

3 Distribution of correlated variables

Visual Profiling of Large Statistical Datasets 15

Applications

Structural Business Statistics (SBS)

Large business survey

Circa 50,000 respondents

Data editing and analysis process:1 Unprocessed data2 Edited data3 Data prepared for analysis

Visual Profiling of Large Statistical Datasets 16

Unprocessed data

Visual Profiling of Large Statistical Datasets 17

Edited data

Visual Profiling of Large Statistical Datasets 18

Data prepared for analysis

Visual Profiling of Large Statistical Datasets 19

Comparison with other sources

Comparison:

SBS turover

VAT turnover

STS turnover

Visual Profiling of Large Statistical Datasets 20

Comparison with other sources

Visual Profiling of Large Statistical Datasets 21

Implementation in R

Package tabplot

Available on CRAN

Functions

tableplot tableplot(myDataFrame,colNames = myColumnNames,nBins = 100)

num2fac num2fac(myNumericVector,method=“pretty”,n=5)

tabGUI tableGUI()

Supports very large datasets (up to 2 · 109 records)

Visual Profiling of Large Statistical Datasets 22

Conclusion

Quality assessment

Existing data sourcesNew data sources

Effective method to support top-down data analysis

Apply tableplot to other sources

Further improve tableplots

Visual Profiling of Large Statistical Datasets 23

Conclusion

Quality assessment

Existing data sourcesNew data sources

Effective method to support top-down data analysis

Apply tableplot to other sources

Further improve tableplots

Visual Profiling of Large Statistical Datasets 23

Conclusion

Quality assessment

Existing data sourcesNew data sources

Effective method to support top-down data analysis

Apply tableplot to other sources

Further improve tableplots

Visual Profiling of Large Statistical Datasets 23

Conclusion

Quality assessment

Existing data sourcesNew data sources

Effective method to support top-down data analysis

Apply tableplot to other sources

Further improve tableplots

Visual Profiling of Large Statistical Datasets 23