WP-K Methodology and Quality · 2. The quality report template should be tested by the other WPs,...

WP-K Methodology and Quality Magdalena Six, Alexander Kowarik, Sonia Quaresma, Piet Daas Pilots Track KickOff Meeting, Vienna, 5-6th of December, 2018


Page 1

WP-K Methodology and Quality

Magdalena Six, Alexander Kowarik, Sonia Quaresma, Piet Daas

Pilots Track KickOff Meeting,

Vienna, 5-6th of December, 2018

Page 2

Partners / Input

– Austria (Magdalena will talk about Quality)

– Netherlands (Piet will talk about Methodology)

– Portugal (Sonia will talk about the Typification)

– Poland

– Spain

– Italy

– Literature

– Experiences obtained in the pilots of WPB to WPE and WPG to WPJ and WPL are used as input to this WP.

Page 3

Main outputs

• M9: First draft of the quality guidelines

• M13: Updated literature review; revised version of quality guidelines; quality report template draft

• M17: First draft of methodological report; revised quality report template; updated and extended literature review

• M18: Typification Matrix for big data projects

• M24: Evolution roadmap between the areas of the typification matrix

• M25: Report describing quality aspects of the different pilots; revised literature overview; report describing the methodological steps of using big data in official statistics, with a section on the most important questions for the future, including guidelines

Page 5

WP-K Methodology and Quality Overview of Quality

Magdalena Six, Sonia Quaresma, Piet Daas, Alexander Kowarik

Pilots Track KickOff Meeting,

Vienna, 5-6th of December, 2018

Page 6

• A quick recap of ESSnet Big Data I

• Quality Framework vs Quality Aspects

• The 7 Quality Aspects

• Structure of the Report

• Examples

• Quality Measures Wanted!

• Summary of ESSnet BD I

• ESSnet BD II: Expected Outcome of WPK

• ESSnet BD II: Time line and Deliverables of WPK

• Open Questions w.r.t. Cooperation with other WPs

Overview of the Presentation

Page 7

ESSnet Big Data I (01/2016 – 05/2018)

• 7 WPs related to data sources

• WP 8 on cross-cutting subjects: Methodology, Quality and IT, WP-lead: Netherlands

• Deliverable 8.1 of WP8: Literature Overview

• Deliverable 8.2 of WP 8: Report describing the quality aspects of Big Data for Official Statistics

• Deliverable 8.3 of WP 8: Report describing the IT aspects of Big Data for Official Statistics

• Deliverable 8.4 of WP 8: Report describing the methodological aspects of Big Data for Official Statistics

ESSnet Big Data I

Page 8

• Workshop in April 2017 in the Netherlands:

• 18 experts identified the most important Quality topics (as well as Methodology and IT topics) for the use of Big Data in Official Statistics, in the context of WP1 to WP7

• No claims that these aspects are generally the most important ones

• Based on experiences gathered in pilots, not in production of Official Statistics

• Data driven approach, focus on exploration and use of new data sources

• Quality (on the output side) was not the main focus from the beginning

• Deliberate choice of wording: Quality Aspect vs Quality Dimension

Why This Form?

Page 9

In relation to cause(s) of errors:

• Coverage, Accuracy and Selectivity

• Processing errors

• Linkability

• Measurement errors

• Model errors and precision

In relation to changes in the composition of the source

• Comparability over time

• Process chain control

7 Quality Aspects of Big Data

Page 10

7 Quality Aspects in the Context of UNECE’s Quality Framework for BD

• 3 Phases of the business process: Input, Throughput, Output

• 3 Hyperdimensions: Source, Data, Metadata
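The two bullets above span a 3 x 3 grid. As a rough sketch only, the grid can be built and used as a checklist for locating quality aspects. The helper name `empty_grid` is hypothetical, and the placement of "Coverage" in the example is purely illustrative and not taken from the slide or from the UNECE framework.

```python
from itertools import product

# Taken from the slide: phases of the business process and hyperdimensions.
PHASES = ("Input", "Throughput", "Output")
HYPERDIMENSIONS = ("Source", "Data", "Metadata")


def empty_grid():
    """Return one empty list per (phase, hyperdimension) cell."""
    return {cell: [] for cell in product(PHASES, HYPERDIMENSIONS)}


grid = empty_grid()

# Hypothetical placement, for illustration only: where an aspect belongs
# is exactly the kind of question the slide raises.
grid[("Input", "Data")].append("Coverage")

for (phase, hyperdimension), aspects in grid.items():
    print(f"{phase:<10} / {hyperdimension:<8}: {aspects}")
```

A dictionary keyed by (phase, hyperdimension) keeps all nine cells explicit, so empty cells stay visible when aspects are assigned.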

Page 11

Structure of the Report on Quality in the ESSnet Big Data

7 chapters according to the 7 identified quality aspects. Same structure for each chapter:

1. Introduction: meaning of the respective quality aspect in the context of Big Data

2. Examples and Methods: role of the respective quality aspect in WP1-WP7

3. Discussion: challenges for the quality aspect, cross connections to other chapters in the Quality Report, but also to the IT and Methodology Reports
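The chapter structure described above can be written out mechanically. The following is a minimal sketch; the function name `build_outline` is hypothetical, and the chapter order is an assumption that simply follows the order in which the 7 aspects are listed on the earlier slide.

```python
# The 7 quality aspects, in the order listed on the "7 Quality Aspects" slide
# (assumed here to be the chapter order as well).
QUALITY_ASPECTS = [
    "Coverage, Accuracy and Selectivity",
    "Processing errors",
    "Linkability",
    "Measurement errors",
    "Model errors and precision",
    "Comparability over time",
    "Process chain control",
]

# The three sections every chapter shares.
CHAPTER_SECTIONS = ["Introduction", "Examples and Methods", "Discussion"]


def build_outline(aspects, sections):
    """Return numbered outline lines: one chapter per aspect, same sections in each."""
    outline = []
    for chapter_no, aspect in enumerate(aspects, start=1):
        outline.append(f"{chapter_no}. {aspect}")
        for section_no, section in enumerate(sections, start=1):
            outline.append(f"{chapter_no}.{section_no} {section}")
    return outline


outline = build_outline(QUALITY_ASPECTS, CHAPTER_SECTIONS)
print("\n".join(outline))
```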

Page 12

Quality Aspects for Big Data: “Same same but different”?

Example: Measurement Error

In survey-based statistics: measurement error (ME) is the degree to which a collected value differs from its “true” value due to imperfections in the data collection, reluctance and/or inability of respondents to provide honest and unbiased responses, or the influence of interviewers on responses.

In Big Data based statistics: errors are of a more technical nature; measurement errors in Big Data often happen while machines collect data from other machines.

• WP4 AIS Data/Ship positions: Error source: scrambling of the Automatic Identification System (AIS) signal of ships -> measurement error or coverage error?

• WP2 Job Vacancies: Scraping of a deceptive job vacancy ad -> measurement or coverage error?

Page 13

Examples for new (?), Big Data specific (?) Error Sources

Other examples:

• Non-stable access to the BD source

• Change in technological process generating the BD, change in use of BD-generating devices -> comparability over time

• Multiple layers of (new) processing steps required (advanced techniques for editing, imputation, linking techniques, text mining algorithms…) including new error sources

• Deduction of information about target variable from other variables via modelling

• Models based on small-sample statistical inference don’t work

Page 14

Quality Measures: Challenges from the Past and Challenges Ahead

Feedback from the Reviewers: Quality measures would have been very useful

BUT:

• Still in the experimenting phase

• Often no routine, no regular access to Big Data source

• Focus in WPs more on potential sources and potential access to sources than on a standardized reporting of quality measures

• Experimental phase shows: Big Data sources, as well as processes needed to work with these sources, are so diverse that the development of standardized quality measures / a quality framework will be challenging

Page 15

Summary of Quality Report from ESSnet BD 1

What can the report provide?

• Overview of the most important quality aspects with respect to the ESSnet Big Data I

• Overview of which quality aspects have been considered in WP1-WP7, and how

• Good discussion of the meaning of quality aspects in the context of Big Data, based on actual examples

What can the report not provide?

• No quality framework: the report does not cover the measurement of quality in a systematic and exhaustive way

• No quality guidelines

• No quality reporting

Page 16

ESSnet Big Data II

What is our “comparative advantage”?

We are close to the actual big data projects!

Quality within ESSnet Big Data

- Quality Aspects extracted from the other WPs

- Quality Reporting tested on the other WPs

- Quality Indicators can actually be tested in other WPs

Quality Framework for Big Data (e.g. UNECE)

- Generic

- Exhaustive

- Systematic

Page 17

ESSnet Big Data II WPK: Methodology and Quality

Expected Outcome with respect to Quality:

• Updated and extended literature review, with a focus on quality indicators described in the literature

• Update of Report on Quality from ESSnet Big Data I with input from current WPs of ESSnet Big Data II

• Quality Guidelines for the usage of Big Data in Official Statistics, based on know-how from within ESSnet as well as from outside sources (e.g. UNECE quality framework)

• Template for quality report when using big data in the production of official statistics, potentially including suggestions for quantitative quality indicators

Page 18

Timeline & Deliverables w.r.t. Quality

• M9: First draft of the quality guidelines

• M13: Updated literature review; revised version of quality guidelines; quality report template draft

• M17: First draft of methodological report; revised quality report template; updated and extended literature review

• M18: Typification Matrix for big data projects

• M24: Evolution roadmap between the areas of the typification matrix

• M25: Report describing quality aspects of the different pilots; revised literature overview; report describing the methodological steps of using big data in official statistics, with a section on the most important questions for the future, including guidelines

Page 19

Communication & Cooperation of WPK with other WPs

[Diagram linking WP K with the other WPs. Boxes: WP G, WP H, WP I, WP J, WP L, WP B, WP C, WP D, WP E. Track labels: Pilot Track, Smart Statistics, Implementation Track. Arrow labels: "Quality Guidelines, Quality Reporting (Suggestions), Template" and "Extract Quality Aspects, Test Template, Indicator".]

Page 20

Open Questions

1. Experiences from all WPs (not only from the Pilot Track) are input for WPK. How should the communication and the exchange of information be organised?

2. The quality report template should be tested by the other WPs, how should this be organised?

3. How close should Quality and Methodology work together? Separate reports -> Yes. Separate guidelines?

4. Time plan? Quality report template draft: M13. Revised quality report template: M18. By when should the other WPs test the template so that feedback can be incorporated?

Page 21

WP-K Methodology and Quality Big Data Typification

Magdalena Six, Sonia Quaresma, Piet Daas, Alexander Kowarik

Pilots Track KickOff Meeting,

Vienna, 5-6th of December, 2018

Page 22

ESSnet Big Data II WPK: Methodology and Quality

Big Data...

Page 23

ESSnet Big Data II WPK: Methodology and Quality

Why typify?

• Big diversity of data sources

• Different kinds of problems accessing the sources

– Technical

– Legal and ethical

– Understanding the data, extracting its metadata

• Different kinds of problems posed by the type of data

– Text data requires NLP, feature extraction, entity recognition, text analysis, and so on...

– Images may require GIS, perhaps munging, visualization...

– Sensor data surely requires ETL but maybe also Transformation and Enrichment, data integration, data fusion...

• Can a list of treatments be “prescribed” based on the type of source and kind of big data being addressed?
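One way to picture such a "prescription" is a plain lookup from the kind of data to the candidate treatments named on this slide. This is a sketch only: the dictionary keys and the function name `prescribe` are hypothetical, and the slide itself hedges several entries ("maybe", "perhaps"), so the lists are candidates rather than requirements.

```python
from typing import List

# Candidate treatments per kind of data, copied from the bullets above.
TREATMENTS_BY_DATA_TYPE = {
    "text": ["NLP", "feature extraction", "entity recognition", "text analysis"],
    "images": ["GIS", "munging", "visualization"],
    "sensor": ["ETL", "transformation and enrichment", "data integration", "data fusion"],
}


def prescribe(data_type: str) -> List[str]:
    """Return candidate treatments for a kind of data, or an empty list if unknown."""
    return list(TREATMENTS_BY_DATA_TYPE.get(data_type.lower(), []))


print(prescribe("Text"))
print(prescribe("video"))  # not covered on the slide, so nothing is prescribed
```

An unknown kind of data returns an empty list rather than raising an error, which mirrors the open question on the slide: the lookup can only prescribe what has already been typified.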

Page 24

ESSnet Big Data II WPK: Methodology and Quality

Once we know:

• The challenges posed by the source

– Technical

– Legal and ethical

– Understanding the data, extracting its metadata

• The tools/treatments required by the data

– NLP, feature extraction, entity recognition, text analysis...

– GIS, munging, visualization...

– ETL, Transformation and Enrichment, data integration, data fusion...

• We can estimate the investment/resources needed for the source exploration

Page 25

ESSnet Big Data II WPK: Methodology and Quality

Let’s try to typify our big data projects in a Matrix!

Page 26

ESSnet Big Data II WPK: Methodology and Quality

From a certain point on it’s possible/desirable that the same methods/processes may be applicable to all/most/some of the big data sets...

How do we get there?

Page 27

ESSnet Big Data II WPK: Methodology and Quality

Is it possible to know the way from A1, A2, ..., An towards B?

• Ax: Given a new big data project, can we:

– Characterize it

– Describe it in our typification matrix

– Know which tools will probably be required/suitable to make the data explorable

– Anticipate the problems this kind of source/data will pose to us?

• B: Desired state:

– In terms of data

– In terms of quality requirements being met

– In terms of applicable methods

• Is there a way to go from Ax to B?

Page 28

ESSnet Big Data II WPK: Methodology and Quality

Moreover, if A1, A2, ..., An express different levels of maturity in addressing the same data source, can we evolve from Ax to Ax+1?

Can we prescribe a way to do it? Can we estimate the costs of doing it?

Can we anticipate the difficulties of the journey?

Page 29

ESSnet Big Data II WPK: Methodology and Quality

Can we, in a systematic way, present a road map or some alternative paths to assist the traveller?
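If the states A1 ... An and B were written down as a graph with an estimated cost per step, a road map could in principle be computed rather than drawn by hand. The sketch below uses Dijkstra's shortest-path algorithm as one possible technique; it is not something the slides propose. The graph `ROADMAP`, its step costs and the function name `cheapest_path` are all hypothetical.

```python
import heapq


def cheapest_path(graph, start, goal):
    """Return (total_cost, path) for the cheapest route, or None if goal is unreachable."""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, step_cost in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + step_cost, neighbour, path + [neighbour]))
    return None


# Hypothetical maturity states and step costs (e.g. person-months); invented for illustration.
ROADMAP = {
    "A1": {"A2": 3, "A3": 7},
    "A2": {"A3": 2, "B": 8},
    "A3": {"B": 4},
}

result = cheapest_path(ROADMAP, "A1", "B")
print(result)  # (9, ['A1', 'A2', 'A3', 'B'])
```

Alternative paths (here A1 -> A3 -> B or A1 -> A2 -> B, each costing 11) remain in the graph, so the same structure could also list the "alternative paths" the slide mentions.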

Page 30

Timeline & Deliverables w.r.t. Typification

• M9: First draft of the quality guidelines

• M13: Updated literature review; revised version of quality guidelines; quality report template draft

• M17: First draft of methodological report; revised quality report template; updated and extended literature review

• M18: Typification Matrix for big data projects

• M24: Evolution roadmap between the areas of the typification matrix

• M25: Report describing quality aspects of the different pilots; revised literature overview; report describing the methodological steps of using big data in official statistics, with a section on the most important questions for the future, including guidelines

Page 31

Thank you for your attention!

Any questions or comments?

Contact: Magdalena Six, Statistics Austria

Email: [email protected]
