ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*,...

287
1 ESSNET USE OF ADMINISTRATIVE AND ACCOUNTS DATA IN BUSINESS STATISTICS WP6 Quality Indicators when using Administrative Data in Statistical Outputs Deliverable 6.5 / 2011: Final list of quality indicators and associated guidance July, 2013

Transcript of ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*,...

Page 1: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

1

ESSNET

USE OF ADMINISTRATIVE AND ACCOUNTS DATA

IN BUSINESS STATISTICS

WP6 Quality Indicators when using Administrative Data

in Statistical Outputs

Deliverable 6.5 / 2011:

Final list of quality indicators and associated guidance

July, 2013

Page 2: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

2

Final list of quality indicators and associated guidance

John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger†, Humberto Pereira°, Sofia Rodrigues°, Jennifer Davies*, Salvatore Cusimano~, Alessandra Fiori~, Mirella Morrone~, Arnout van Delden†, Piet Daas†, Ana Chumbau°, Jorge Mendes°,

*ONS UK; †CBS Netherlands; °INE Portugal; ~ ISTAT Italy

Executive Summary:

With the increasing use of administrative data in the production of business statistics comes the challenge for statistical producers of how to assess quality. The ESSnet Admin Data has set out to aid members of the ESS meet this challenge by developing quality indicators for business statistics involving administrative data. The team, from across a number of National Statistical Institutes has worked on three interlinked areas, developing: a list of basic quality indicators including quantitative indicators and complementary qualitative indicators; a set of composite indicators which draws together the basic quality indicators into ‘themes’ in line with the ESS dimensions of output quality, to provide a more holistic view of the quality of a statistical output; and guidance on the accuracy of mixed source (survey and administrative data) statistics. This document pulls together the three main outputs of the quality indicators work package (WP6) of the ESSnet Admin Data. We believe that it provides a valuable resource to producers of statistics, which can be implemented as part of a quality management system and used to inform users of the quality of the statistics produced.

Page 3: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

3

Contents Page

Chapter 1: Introduction to WP6........................................................................................ 4

Chapter 2: Basic Quality Indicators

2.1 A quick guide to the quality indicators......................................................... 7

2.2 Quality indicators when using administrative data in statistical outputs

2.2.1 Quantitative quality indicators.............................................. 8

2.2.2 Qualitative quality indicators................................................ 10

2.2.3 Using the list of quality indicators........................................ 10

2.3 List of basic quality indicators

2.3.1 Background information (indicators).................................... 13

2.3.2 Quality indicators................................................................. 31

2.3.3 List of qualitative indicators by quality theme...................... 65

2.4 References.................................................................................................. 72 2.5 Appendix A: Notation for quality indicators.................................................. 73

Chapter 3: Composite quality indicators

3.1 Introduction................................................................................................. 75

3.2 Grouping indicators into quality themes..................................................... 76

3.3 Methods for calculating composite indicators............................................. 76

3.4 Development of composite quality indicators............................................. 77

3.5 Developing a composite indicator for Accuracy......................................... 83

3.6 Developing a composite indicator for Timeliness and Punctuality............. 88

3.7 Developing a composite indicator for Comparability.................................. 89

3.8 Developing a composite indicator for Coherence....................................... 90

3.9 Conclusion.................................................................................................. 94

3.10 References................................................................................................. 95

3.11 Appendix A: Grouping of basic quality indicators into quality themes....... 96

3.12 Appendix B: Literature review on methods for developing composite

indicators.................................................................................................... 98

Chapter 4: Quality guidance relating to mixed source outputs

4.1 Introduction................................................................................................ 103

4.2 Uses of admin data.................................................................................... 103

4.3 Measuring error......................................................................................... 107

4.4 Case studies

4.4.1 Cut-off sampling (ONS case study).................................... 107

4.4.2 Two-strata mix (CBS case study)....................................... 122

4.5 Conclusions and recommendations................................................................. 130

4.6 References....................................................................................................... 131

4.7 Appendix A....................................................................................................... 133

Chapter 5: Conclusions................................................................................................... 134

Annex 1: Tailored lists of indicators (SBS and STS)...................................................

Annex 2: Glossary...........................................................................................................

Page 4: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

4

Chapter 1: Introduction to Work Package 6

Administrative data are increasingly being used by National Statistical Institutes (NSIs) in the production of their statistics; for example, in the Nordic countries where admin data have been the main data source for the production of official statistics for a number of years (UNECE, 2007). Historically, the fields of social and population statistics have been more advanced in using admin data but their use is becoming more prevalent within business statistics (Orjala, 2008, Statistics New Zealand, 2009). This increase is mainly due to the pressure to reduce costs and minimise burden on respondents (Eurostat, 2003, Daas & Fonville, 2007). However, despite this increasing use of admin data, there is little in the way of formal best practice and recommendations which differ from those for the production of statistics based on survey data. For example, although the European Statistical System (ESS) dimensions of quality apply to all statistics, not all elements of these dimensions are appropriate for statistics that are fully or partly based on admin data. It was to address this lack of best practice that the European Statistical System network project on the Use of Administrative and Accounts Data in Business Statistics (ESSnet Admin Data) was established. One of the work packages (WP6) within the ESSnet Admin Data was designed to address the lack of quality indicators within this field, with a particular focus on developing quantitative quality indicators specifically relating to admin data and complementary qualitative indicators which provide additional, often descriptive and explanatory, information. The purpose of this project was not to reinvent the wheel but to build on work already in place. Consequently, some investigative and research work was initially conducted. Although there was some work already done in the area of quality of business statistics involving admin data, much of this work either referred to qualitative indicators or was based more on a descriptive analysis of admin data (Eurostat, 2003). The quantitative indicators that have been produced have been concerned with the quality of the admin sources (Daas, Ossen & Tennekes, 2010) or designed as part of a quality framework for the evaluation of admin data (Ossen, Daas & Tennekes, 2011). However, these do not address the quality of the production of the statistic directly, although quality of the admin data is obviously a crucial factor (which is addressed in other work of the ESSnet Admin Data1). In reality, almost no work had been done on quantitative indicators for business statistics involving admin data, which was the main focus of this project. At this stage, it is important to note that the work being carried out under this project was not independent of other work already in place. These indicators are for the benefit of the members of the ESS, and producers of statistics more widely. Thus, the end result of the ESSnet Admin Data’s work in this area should be integrated with the work already in place, for example, on the production of quality reports. It is also important to note that quality indicators that can be applied in the same way when using admin data or survey data have been excluded. Many of these latter indicators are those specifically related to the statistical output or the publication. For example, indicators in relation to accessibility of the statistics were out-of-scope because accessibility of the output is not normally influenced by whether survey or admin data are used in its production. Similarly, although it may be more challenging to explain clearly to users the methods adopted when there is a complex combination of admin and survey data, these clarity indicators are not quantifiable and are more related to the quality of the presentation than the use of the admin data.

1 Work Package 2 of the ESSnet Admin Data: http://essnet.admindata.eu/WorkPackage?objectId=4253

Page 5: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

5

In contrast, this project focused on the statistical output, taking the quality of the input and process into account. This is because the input and process are critical to the work of NSIs and it is the input and process in particular that differ when using admin data. This is, to an extent, a similar approach to that taken with survey-based statistics; indicators such as non-response and level of imputation (input and process indicators) are accepted as indicators of the quality of the statistical output. In line with this, the indicators produced must be considered with a specific output in mind; for example, the extent of undercoverage of the admin data will vary depending on the population required for the statistical output – eg.exclusion of part of the services sector of the economy is not a quality concern if the output is in relation to the manufacturing sector. Therefore, all indicators should be calculated with specific statistical outputs in mind. In developing the indicators, the first step for WP6 was to establish relevant, useful and simple indicators that took account of the use of admin data, and then to look at how some of these could be combined or grouped together to provide a more holistic view of quality. However, with the increasing use of admin data comes increasingly complex ways of using the data. In these contexts, the simple input and process indicators are less informative. WP6 therefore also developed guidance on assessing the accuracy of mixed source statistical outputs. The remaining chapters of this document include the main outcomes of the WP6 part of the ESSnet Admin Data project. Much of the work is based on initial research to establish what was already being done in this area. The basic list of quantitative quality indicators was then developed with complementary qualitative indicators incorporated as well as a framework for examples and individual examples for each quantitative indicator. The inclusion of examples was found to aid statistical producers in understanding and implementing the indicators. Consequently, tailored lists were developed with specific examples relating to either Structural Business Statistics (SBS) or Short Term (business) Statistics (STS) - again to aid understanding and implementation across the ESS. The main list of basic quality indicators is included in Chapter 2 with the tailored lists (SBS and STS) available in Annex 1. Chapter 3 includes guidance on calculating composite quality indicators for relevant dimensions of the ESS quality framework. The aim of these composite indicators is to do more than simply provide the ESS with a list of 23 separate quantitative indicators. Instead WP6 has provided guidance for producers of statistics to collate some of these indicators into ‘themes’ (based on the dimensions of the ESS quality framework) to provide a more general indication of the quality of the statistical output. Chapter 4 provides guidance to producers on the accuracy of mixed source outputs. As mentioned above, despite the increasing use of admin data in the production of business statistics, the admin data are largely still combined with survey data. Since sampling theory typically does not apply to admin data, the assessment of the accuracy of these mixed source statistics is a challenge. This chapter discusses a number of possibilities for using admin data when survey data are also available. These include: comparing the mean square error between a survey estimate and an admin data estimate, combining them into a weighted sum, or developing composite estimates after data integration. Four situations where the sources are integrated are discussed, two of which are developed as case studies. The bulk of the chapter outlines these two main case studies, establishing means of estimating variance and bias. Finally, Chapter 5 aims to pull together the three main work areas of WP6; highlighting the developments made during the project and drawing conclusions. The outcome of the testing of the WP6 deliverables is summarised and an overview of the feedback from ESS members is provided. Potential areas for future consideration are also identified.

Page 6: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

6

At the outset of this document, it should be noted that the work undertaken by WP6 has been to aid Member States. The basic and composite quality indicators and the more complex guidance on establishing the accuracy of mixed source outputs are for NSIs to assess the quality of their statistics involving admin data. We hope that our work will help producers of statistics to assess and improve the quality of their outputs over time. This work was not intended to establish cross-European quality indicators. In fact, one of the main lessons learned through the work of the ESSnet Admin Data is the diverse way that admin data are used across and even within NSIs. The indicators and guidance produced by WP6 are to aid producers of statistics within the context of how they use admin data. Considerable further work would be required to overcome these national and domain specific differences and develop indicators of the quality of statistical outputs at the European level. We feel that the case studies and guidance outlined in Chapter 4 of this document may provide the first step in this direction, but further work in this area was out of the scope and remit of the ESSnet Admin Data. Nevertheless, we hope that our work will be a building block to improve and strengthen ESS statistics, through the efforts of Member States to assess and improve the quality of their statistics involving admin data.

References

Daas, P.J.H. & Fonville, T.C. (2007). Quality control of Dutch administrative registers : An inventory of quality aspects. Paper presented at the Seminar on Registers in Statistics – methodology and quality. Helsinki, Finland. Daas, P.J.H., Ossen, S.J.L. & Tennekes, M. (2010). Determination of administrative data quality: recent results and new developments. Paper and presentation for the European Conference on Quality in Official Statistics 2010. Helsinki, Finland. Eurostat, (2003). Item 6: Quality assessment of administrative data for statistical purposes. Luxembourg, Working group on assessment of quality in statistics, Eurostat. Orjala, H. (2008). Potential of administrative data in business statistics – a special focus in improvements in short term statistics. Paper presented at the IAOS Conference on reshaping Official Statistics. Shanghai, China. Ossen, S.J.L., Daas, P.J.H. & Tennekes, M. (2011). Overall Assessment of the Quality of Administrative Data Sources. Paper accompanying the poster at the 58th Session of the International Statistical Institute. Dublin, Ireland. Statistics New Zealand (2009). Managing the quality of administrative data in the production of economic statistics. Auckland, New Zealand. UNECE (2007). Register-based statistics in Nordic countries – review of best practices with focus on population and social statistics. Geneva: United Nations Publication.

Page 7: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

7

Chapter 2: Basic Quality Indicators

2.1 A Quick Guide to the Quality Indicators

What are the quality indicators? The European Statistical System network project on admin data (ESSnet Admin Data) has developed a list of quality indicators, for use with business statistics involving admin data. The indicators provide measures of the quality of the statistical output, taking input and process into account. They are based on the ESS dimensions of statistical output quality and other characteristics considered within the ESS Handbook for Quality Reports2. Who are they for? The list of quality indicators has been developed primarily for producers of statistics, within the ESS and more widely. The indicators can also be used for quality reporting, thus benefiting users of the statistical outputs. They provide the user with an indication of the quality of the output, and an awareness of how the admin data have been used in the production of the output. When can they be used? The list of quality indicators is particularly useful for two broad situations:

1. When planning to start using admin data as a replacement for, or to supplement, survey data. In this scenario, the indicators can be used to assess the feasibility of increasing the use of admin data, and the impact on output quality.

2. When admin data are already being used to produce statistical outputs. In this scenario, the indicators can be used to gauge and report on the quality of the output, and to monitor it over time. Certain indicators will be suitable to report to users, whilst others will be most useful for the producers of the statistics only.

How should they be used? There are 23 basic quantitative quality indicators and 46 qualitative quality indicators in total, but not all indicators will be relevant to all situations. Therefore, a statistical producer should select the indicators relevant to its output. The table in Section 2.2.3 shows which of the quantitative indicators relate to which dimension or ‘theme’ of quality, which may be useful in identifying which indicators to use. Indicators 1 to 8 are background indicators, which provide general information on the use of admin data in the statistical output in question but do not, directly, relate to the quality of the statistical output. Indicators 9 to 23 provide information directly addressing the quality of the statistical output.

2

More information on the ESS Handbook for quality reports can be found here:

http://epp.eurostat.ec.europa.eu/portal/page/portal/product_details/publication?p_product_code=KS-RA-08-016

Page 8: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

8

2.2 Quality Indicators when using Administrative Data in Statistical Outputs

2.2.1 Quantitative quality indicators

One of the aims of the ESSnet Admin Data is the development of quality indicators for business statistics involving admin data, with a particular focus on developing quantitative quality indicators and qualitative indicators to complement them.

Some work has already been done in the area of quality of business statistics involving admin data and some indicators have been produced. However, the work conducted thus far refers to qualitative indicators or is based more on a descriptive analysis of admin data (see Eurostat, 2003). The quantitative indicators that have been produced have been more to do with the quality of the admin sources (Daas, Ossen & Tennekes, 2010) or have been to develop a quality framework for the evaluation of admin data (Ossen, Daas & Tennekes, 2011) 3. These do not address the quality of the production of the statistical output however. In fact, almost no work has been done on quantitative indicators of business statistics involving admin data, which is the main focus of this project (for further discussion on this topic see Frost, Green, Pereira, Rodrigues, Chumbau & Mendes, 2010).

The ESSnet aims to develop quality indicators of statistical outputs that involve admin data. These indicators are for the use of members of the European Statistical System; producers of statistics. Therefore, the list contains indicators on input and process because these are critical to the work of the National Statistical Institutes and it is the input and process in particular that are different when using admin data. Moreover, the list of indicators developed is specifically in relation to business statistics involving admin data. Indicators (e.g. on accessibility) that do not differ for admin vs. survey based statistics are not included in this work because they fall outside the remit of the ESSnet Admin Data project.

To address some issues of terminology, a few definitions are provided below to clarify how these terms are used in this document and throughout the ESSnet Admin Data.

What is administrative data? Administrative data are the data derived from an administrative source, before any processing or validation by the NSIs.

What is an administrative source? An administrative source is a data holding containing information collected and maintained for the purpose of implementing one or more administrative regulations. In a wider sense, any data source containing information that is not primarily collected for statistical purposes.

Further information on terminology and useful links to other, related work is available on the ESSnet Admin Data Information Centre4.

A list of quantitative quality indicators has been developed on the basis of research which took stock of work being conducted in this field across Europe5. This list was then user tested within five European NSIs, before testing across Member States6. Feedback from this

3

More information on the BLUE-ETS project and the associated deliverables can be found here:

http://www.blue-ets.istat.it/index.php?id=7 4 ESSnet Admin Data Glossary: http://essnet.admindata.eu/Glossary/List

ESSnet Admin Data Reference Library: http://essnet.admindata.eu/ReferenceLibrary/List 5 A summary of the main findings of this stock take research (Deliverable 2010/6.1) is available here:

http://essnet.admindata.eu/WikiEntity?objectId=4696 6 The outcome of this testing is reported on the ESSnet Information Centre (included within the SGA 2010 final

report) and is available here: http://essnet.admindata.eu/WikiEntity?objectId=4751

A summary of the 2011 user testing is reported in the Phase 2 User Testing Report, available here:

http://essnet.admindata.eu/WikiEntity?objectId=4696

Page 9: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

9

testing was used to improve the list of quality indicators during its development (2010/11). Subsequent testing of all the outputs of this workpackage of the ESSnet has also been conducted (during 2013) and is discussed in Chapter 5 of this document. The entry for each quantitative indicator is self-contained in the attached list (see Section 2.3), including a description, information on how the indicator can be calculated and one or two examples. Qualitative (or descriptive) indicators have also been developed to complement the quantitative indicators and are also included in Section 2.3. Further information on the qualitative indicators is included in Section 2.2.2. The quantitative indicators have been developed so that a low indicator score denotes high quality, and a high indicator score denotes low quality. This is consistent with the concept of error, where high errors signify low quality. In essence, the indicators measure quality risks – for example, the higher the level of non-response, the higher the risk to the quality of the output. The exceptions to this rule are the background indicators (1 to 8), where the score provides information rather than a quality ‘rating’; and indicators 20 and 23, where a high indicator score denotes high quality, and a low indicator score denotes low quality. Examples are also given for weighted indicators, for example weighting the indicators by turnover or number of employees. Caution needs to be taken when considering these weighted indicators in case of bias caused by the weighting.

A framework for the basic quantitative quality indicator examples

The calculation of an indicator needs some preliminary steps. Some or all of the following steps will be used for each example of the indicators to ensure consistency of the examples, and to aid understanding of the indicators themselves. A simple framework to aid calculating the quantitative indicators is included here:

A. Define the statistical output B. Define the relevant units C. Define the relevant variables D. Adopt a schema for calculation E. Declare the tolerance for quantitative and qualitative variables

The list is not created so that producers calculate all indicators but rather as a means of enabling producers to select those that a most relevant to their output. Not all indicators will apply in all situations and it is not recommended that they are all calculated on an ongoing, regular basis. Whilst, some may be useful for exactly this purpose, others may only be used when considering options for increasing the use of admin data or when undergoing or evaluating changes in the process of producing the statistical output.

Links between this and other work on Quality

The work being carried out under this project should not be seen as independent of other work already in place. When analysing the list of indicators, one can conclude that some other information is useful in regard to the quality of the output. However, some of that very useful information is not specific to the use of admin data and thus is out of scope for the work of this ESSnet.

This work is for the benefit of the members of the European Statistical System (ESS); the producers of statistics. Consequently, the end result of the ESSnet Admin Data work in this area should be integrated with the work already in place in NSIs and Eurostat.

Page 10: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

10

2.2.2 Qualitative quality indicators

While much of the focus of the ESSnet Admin Data work on quality has been around the development of quantitative quality indicators, the project also required the development of qualitative quality indicators to complement the quantitative indicators. Quantitative and qualitative indicators can be thought of as numerical and descriptive quality indicators respectively: the quantitative indicators provide a numerical measure around the quality of the output, whereas the qualitative indicators provide further descriptive information that cannot be obtained from observing a numerical value. Many of the qualitative indicators have been taken from a UK document entitled ‘Guidelines for Measuring Statistical Output Quality’, which serves as a comprehensive list of quality measures and indicators for reporting on the quality of a statistical output. Others have been developed as part of the work of the ESSnet Admin Data. Beneath each quantitative indicator in Section 2.3 is a table which displays any potentially relevant qualitative indicators, a description of each indicator and the quality theme with which they are associated. Some of the qualitative indicators are repeated in Section 2.3 as they are related to more than one quantitative indicator. Section 2.4 contains a complete list of all qualitative indicators, grouped by theme, and also references the quantitative indicators to which they have been linked in Section 2.3.

2.2.3 Using the list of quality indicators

The list of indicators has been grouped into two main areas:

1. Background Information – these are ‘indicators’ in the loosest sense. They provide general information on the use of admin data in the statistical output in question but do not, directly, relate to the quality of the statistical output. This information is often crucial in understanding better those indicators that measure quality more directly.

2. Quality Indicators – these provide information directly addressing the quality of the statistical output.

The background information indicators and the quality indicators are further grouped by quality ‘theme’. These quality themes are based on the ESS dimensions of output quality, with some additional themes which relate specifically to admin data and are consistent with quality considerations as outlined in the ESS Handbook on Quality Reports. These themes also appear in the composite quality indicators that are included in Section 3 of this report. The quality themes are:

Page 11: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

11

Quality theme Description

Relevance

Relevance is the degree to which statistical outputs meet current and potential user needs. Note: only a subset of potential relevance quality indicators are considered within this document given the scope of the ESSnet project (eg. differences between statistical and admin data definitions). All relevance indicators are qualitative.

Accuracy The closeness between an estimated result and the unknown true value.

Timeliness and punctuality The lapse of time between publication and the period to which the data refer, and the time lag between actual and planned publication dates.

Comparability The degree to which data can be compared over time and domain.

Coherence The degree to which data that are derived from different sources or methods, but which refer to the same phenomenon, are similar.

Other relevant considerations

Cost and efficiency The cost of incorporating admin data into statistical systems, and the efficiency savings possible when using admin data in place of survey data.

Use of administrative data Background information relating to admin data inputs.

Page 12: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

12

The following table shows which quantitative indicators are relevant to each of the quality

themes.

Quality theme Quantitative indicators relevant to that theme

Accuracy 9, 10, 11, 12, 13, 14, 15, 16, 17.

Timeliness and punctuality

4, 18.

Comparability 19.

Coherence 5, 6, 20, 21.

Cost and efficiency 7, 8, 22, 23.

Use of administrative

data 1, 2, 3.

Reminder:

The quantitative indicators have been developed so that a low indicator score denotes high quality, and a high indicator score denotes low quality. Thus, the indicators measure quality risks – for example, the higher the level of non-response, the higher the risk to the quality of the output. The exceptions to this rule are the background indicators (1 to 8), where the score provides information rather than a quality ‘rating’; and indicators 20 and 23, where a high indicator score denotes high quality, and a low indicator score denotes low quality.

Each individual indicator will not apply in all situations. The list is not created so that producers calculate all indicators but rather as a means of enabling producers to select those that a most relevant to their output.

Page 13: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

13

2.3 List of Basic Quality Indicators

2.3.1 Background Information (indicators)

Use of administrative data:

1 Number of admin sources used

Description This indicator provides information on the number of administrative sources used in each statistical output. The number of sources should include all those used in the statistical output whether the admin data are used as raw data, in imputation or to produce estimates. In general, administrative data sources used for updating base registers (where available) should not be included in this indicator.

How to calculate Note. Where relevant, a list of the admin sources may also be helpful for users, along with a list of the variables included in each source. Alternatively, the number of admin sources used can be specified by variable.

Note: all examples use the relevant parts of the examples framework set out in Section 2.2.1.

Example 1

A. Statistical output: The Business Register (BR) Enterprise units updating/identification

B. Relevant units: 10+ employees enterprises (relevant for a specific survey or as base for the HG firms)

D. Steps for calculation: Identify the relevant admin sources Let S1 be the Fiscal Register source. Let S2 be the Chamber of Commerce source. Let S3 be the Social Security source. Let S4 be the Yellow Pages source.

I(1) = 4 sources.

Example 2

A. Statistical output: The BR Local units updating/identification

B. Relevant units: The local units of enterprises with more than one local unit

D. Steps for calculation: Identify the relevant admin sources. Let S1 be the Chamber of Commerce source. Let S2 be the Social Security source.

I’(1) = 2 sources.

For further clarification on

terminology and definitions of

terms used, please refer to the

Glossary included in Annex 2.

Page 14: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

14

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

A - Name each admin source used as an input into the statistical product and describe the primary purpose of the data collection for each source

Relevance Name all administrative sources and their providers. This information will assist users in assessing whether the statistical product is relevant for their intended use. Providing information on the primary purpose of data collection also enables users to assess whether the data are relevant to their needs.

B - Describe the main uses of each of the admin sources and, where possible, how the data relate to the needs of users

Relevance Include all the main statistical processes and/or outputs known to require data from the administrative source This indicator should also capture how well the data support users’ needs. This information can be gathered from user satisfaction surveys and feedback.

Page 15: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

15

2 % of items obtained exclusively from admin data

Description This indicator provides information on the proportion of items only obtained from admin data, whether directly or indirectly, and where survey data are not collected. This includes where admin data are used as raw data, as proxy data, in calculations, etc. This indicator should be calculated on the basis of the statistical output – the number of items obtained exclusively from admin data (not by survey) should be considered.

How to calculate

%100 items of no. Total

dataadmin fromy exclusivel obtained items of No.

This indicator could also be weighted in terms of whether or not the variables are key to the statistical output.

Example

A. Statistical output: Monthly manufacturing

B. Relevant units: Units in the statistical population

C. Relevant variables Number of employees

D. Steps for calculation:

D1. For the relevant variable, calculate the number of items for which the variable is obtained exclusively from admin data (items with non missing variable);

D2. Divide the sum of numbers of items for which the variables are obtained exclusively from admin data by the sum of numbers of items for which the variable is not missing

D3. Calculate the indicator as follows:

Let S1 be the Social Security source.

Page 16: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

16

I(2)=itemsofnumberTotal

dataminadfromyexclusivelobtaineditemsofNo.*100= %7.35100*

12

5

Weighted by turnover:

I(2)W=

100*277,923,1

518,27860,27000,58632,79652,12816.7%

Units

Number of

employees-S1

Number of

employees

(survey)

Items obtained

exclusively from admin

data (1/0)=(Y/N) Items non missing Turnover

X1 27 1 1 128,652

X2 512 518 0 1 759,830

X3 missing 2 0 0 14,000

X4 28 27 0 1 253,000

X5 11 1 1 79,632

X6 3 2 0 1 22,536

X7 118 120 0 1 123,412

X8 123 123 0 1 237,523

X9 1 1 1 58,000

X10 missing 1 0 0 39,800

X11 1 1 1 27,860

X12 3 3 0 1 79,845

X13 28 30 0 1 125,469

X14 2 1 1 27,518

Sum 5 12 1,977,077

Page 17: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

17

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

A - Name each admin source used as an input into the statistical product and describe the primary purpose of the data collection for each source

Relevance Name all administrative sources and their providers. This information will assist users in assessing whether the statistical product is relevant for their intended use. Providing information on the primary purpose of data collection also enables users to assess whether the data are relevant to their needs.

B - Describe the main uses of each of the admin sources and, where possible, how the data relate to the needs of users

Relevance Include all the main statistical processes and/or outputs known to require data from the administrative source This indicator should also capture how well the data support users’ needs. This information can be gathered from user satisfaction surveys and feedback.

AQ – Describe reasons for significant overlap in admin data and survey data collection for some items.

Cost and efficiency

Where items are not obtained exclusively from admin data, reasons for the overlap between admin data and survey data should be described.

Page 18: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

18

3 % of required variables which are derived using admin data as a proxy

Description This indicator provides information on the extent that admin data are used in the statistical output as a proxy or are used in calculations rather than as raw data. A proxy variable can be defined as a variable that is related to the required variable and is used as a substitute when the required variable is not available. This indicator should be calculated on the basis of the statistical output – the number of required variables derived indirectly from admin data (because not available directly from admin or survey data) should be considered.

How to calculate

%100variables required of No.

proxy a as data admin using derived are whichvariables required of No.

Note. If a combination of survey and admin data is used, this indicator would need to be weighted (by number of units). If double collection is necessary (e.g. to check quality of admin data), some explanation should be provided. This indicator could also be weighted in terms of whether or not the variables are key to the statistical output.

Example

A. Statistical output: Annual structural data on performance in trade sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover; Personnel costs; Wages and salaries; Number of persons employed; Number of employees; Number of employees in full time equivalent; Production value.

D. Steps for calculation:

D1. Number of required variables (Denominator) D2. Number of required variables used as a proxy (Numerator) Calculate the indicator as follows:

%100)3( variables required of No.

proxy a as data admin using derived are whichvariables required of No.I

Let S1 be the Balance Sheet source. From this source we can obtain directly the variables Personnel costs; Wages and salaries; and Production value. Let S2 be the VAT Turnover source. From this source we can obtain indirectly Turnover, using VAT turnover as proxy. Let S3 be the Social Security source: from this source we can obtain directly the Number of employees and indirectly the Number of employees in full time equivalent. Let S4 be the Shareholders and Associates Data Bank: from this source we can obtain indirectly the variable Number of Self Employed which is a component of the variable Number of persons employed.

Page 19: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

19

So I(3)=(3/7)*100=42.9% Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

C – Describe the extent to which the data from the administrative source meet statistical requirements

Relevance Statistical requirements of the output should be outlined and the extent to which the administrative source meets these requirements stated. Gaps between the administrative data and statistical requirements can have an effect on the relevance to the user. Any gaps and reasons for the lack of completeness should be described, for example, if certain areas of the target population are missed or if certain variables that would be useful are not collected. Any methods used to fill the gaps should be stated.

D – Describe constraints on the availability of administrative data at the required level of detail

Relevance Some administrative microdata have restricted availability or may only be available at aggregate level. Describe any restrictions on the level of data available and their effects on the statistical product.

E – Describe reasons for use of admin data as a proxy.

Relevance Where admin data are used in the statistical output as a proxy or are used in calculations rather than as raw data, information should be provided in terms of why the admin data have been used as a proxy for the required variables.

Page 20: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

20

Timeliness and punctuality: 4 Periodicity (frequency of arrival of the admin data)

Description This indicator provides information about how often the admin data are received by the NSI. This indicator should be provided for each admin source.

How to calculate Note. If data are provided via continuous feed from the admin source, this should be stated in answer to this indicator. Only data you receive for statistical purposes should be considered.

Example 1

A. Statistical output: The Business Register (BR) Enterprise units

D: Steps for calculation: Record periodicity for each source Let S1 be the Fiscal Register source. Let S2 be the Chamber of Commerce source. Let S3 be the Social Security source. Let S4 be the Yellow Pages source.

IS1(4) = 1; IS2(4) = 2; IS3(4) = 2; IS4(4) = 1

Example 2

A. Statistical output: OROS Survey (Employment, earnings and social security contributions) based on the Social Security administrative data.

B. Relevant units: small enterprises with employees

D: Steps for calculation: Record periodicity for each source

Let S1 be the VAT Turnover source.

Let S2 be the Social Security source.

Yellow Pages data 1

Chamber of Commerce data 2

Social Security data 2

Fiscal Register data 1

Frequency of arrival of the admin data (respect to BR

reference year) - Per yearType of admin data

Page 21: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

21

I’S1(4) = 4; I’S2(4) = 4

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

G – Describe the timescale since the last update of data from the administrative source

Timeliness An indication of the timescale since the last update from administrative sources will provide the user with an indication of whether the statistical product is timely enough to meet their needs.

H – Describe the extent to which the administrative data are timely

Timeliness Provide information on how soon after their collection the statistical institution receives the administrative data. The effects of any lack of timeliness on the statistical product should be described.

I – Describe any lack of punctuality in the delivery of the administrative data source

Timeliness Give details of the time lag between the scheduled and actual delivery dates of the data. Any reasons for the delay should be documented along with their effects on the statistical product.

Fiscal Register data 4

Social Security data 4

Type of admin data

Frequency of arrival of the admin data -Per

year

Page 22: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

22

Coherence: 5 % of common units across two or more admin sources

Description

This indicator relates to the combination of two or more admin sources. This indicator provides information on the proportion of common units across two or more admin sources. Only units relevant to the statistical output should be considered. This indicator should be calculated pairwise for each pair of admin sources and then averaged. If only one admin source is available, this indicator is not relevant.

How to calculate

%100 units uniquerelevant of No.

sourcesadmin in the unitscommon relevant of No.

Note. The “unique units” in the denominator means that units should only be counted once, even if they appear in multiple sources. This indicator should be calculated separately for each variable. If the sources are designed to cover different populations and are combined to provide an overall picture, this should be explained. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution of these units to the statistical output.

Example

A. Statistical output: The Business Register Enterprise units updating/identification

B. Relevant units: The NACE sector = Construction

D. Steps for calculation:

D1. Identify the statistical unit (enterprise) for each source (i.e. group the administrative records in one source at id code level)

D2. Match all sources with each other by id code D3. Attribute a Presence(1)/Absence(0) indicator to the unit with regard to the specific

source D4. Calculate the number of possible pairings between sources (i.e. when there are n

sources, it is the combination of n sources taken 2 at a time) Cn,2= n!/(n-2)!* 2! Cn= n/2*(n-1) Let’s suppose 4 sources, the possible combinations will be: C4 =2*3=6 D5. Multiply the Presence(1)/Absence(0) indicator to obtain the

Presence(1)/Absence(0) indicator for each pair D6. Sum up the Presence(1)/Absence(0) indicators at pair level and divide by Cn*no. of relevant units (m)

Let A be the Social Security source.

Let B be the Chamber of Commerce source.

Let C be the Yellow Pages source.

Page 23: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

23

Let D be the VAT Turnover source.

Ind(XiA)=1 if Xi is present in the source A; Ind(XiA)=0 if Xi is absent in the source A

Numerator = ∑ijind(Xij) = 34

Denominator = m*Cn,2 =10*6 = 60

I(5) = (Numerator/Denominator)*100 = (34/60)*100 = 57%

The following picture illustrates the meaning of the result:

Weighted by turnover:

UNIT A B C D AB AC AD BC BD CD Sum

X1 0 0 1 1 0 0 0 0 0 1 1

X2 0 1 0 1 0 0 0 0 1 0 1

X3 0 1 1 1 0 0 0 1 1 1 3

X4 1 1 1 1 1 1 1 1 1 1 6

X5 0 1 0 1 0 0 0 0 1 0 1

X6 1 1 1 1 1 1 1 1 1 1 6

X7 1 1 1 1 1 1 1 1 1 1 6

X8 0 1 0 1 0 0 0 0 1 0 1

X9 1 1 0 1 1 0 1 0 1 0 3

X10 1 1 1 1 1 1 1 1 1 1 6

Sum 5 4 5 5 9 6 34

Page 24: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

24

Numerator = ∑ijind(Xij)*wi = 43,722,364

Denominator = Cn,k*∑wi = 6*7,514,804 = 45,088,824

I(5) = (Numerator/Denominator)*100 = 97%

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

O – Describe the common identifiers of population units in administrative data

Coherence Different administrative sources often have different population unit identifiers. The user can utilise this information to match records from two or more sources. Where there is a common identifier, matching is generally more successful.

U – Describe the record matching methods and processes used on the administrative data sources

Accuracy Record matching is when different administrative records for the same unit are matched using a common, unique identifier or key variables common to both datasets. There are many different techniques for carrying out this process. A description of the technique (e.g. automatic or clerical matching) should be provided along with a description (qualitative or quantitative) of its effectiveness.

UNIT

(2)

Turnover AB*(2) AC*(2) AD*(2) BC*(2) BD*(2) CD*(2) Sum

X1 15,020 0 0 0 0 0 15,020 15,020

X2 28,340 0 0 0 0 28,340 0 28,340

X3 57,812 0 0 0 57,812 57,812 57,812 173,436

X4 1,167,584 1,167,584 1,167,584 1,167,584 1,167,584 1,167,584 1,167,584 7,005,504

X5 21,333 0 0 0 0 21,333 0 21,333

X6 5,767,853 5,767,853 5,767,853 5,767,853 5,767,853 5,767,853 5,767,853 34,607,118

X7 153,000 153,000 153,000 153,000 153,000 153,000 153,000 918,000

X8 63,021 0 0 0 0 63,021 0 63,021

X9 184,818 184,818 0 184,818 0 184,818 0 554,454

X10 56,023 56,023 56,023 56,023 56,023 56,023 56,023 336,138

Sum 7,514,804 7,329,278 7,144,460 7,329,278 7,202,272 7,499,784 7,217,292 43,722,364

Page 25: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

25

6 % of common units when combining admin and survey data

Description This indicator relates to the combination of admin and survey data. This indicator provides information on the proportion of common units across admin and survey data. Linking errors should be detected and resolved before this indicator is calculated. This indicator should be calculated for each admin source and then aggregated based on the number of common units (weighted by turnover) in each source. (For information on the progressive nature of administrative data, see Appendix A.)

How to calculate

%100survey in units of No.

datasurvey andadmin in unitscommon of No.

Note. If there are few common units due to the design of the statistical output (e.g. a combination of survey and admin data), this should be explained. If the sources are designed to cover different populations and are combined to provide an overall picture, this should also be explained. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution of these units to the statistical output.

Example A. Statistical output: Annual data on structure and performance of credit institutions

B. Relevant units: Units in the survey

D. Steps for calculation:

D1. Match each source with survey(s) by the common identification code (if available) or by other methods

D2. Attribute a Presence(1)/Absence(0) indicator to the unit if it belongs at least to one survey (sum up to obtain denominator)

D3. Attribute a Presence(1)/Absence(0) indicator to the unit if it belongs both to the survey and to each source (sum up by source to obtain numerator)

D4. Calculate the indicator as follows:

%100)6( survey in units of No.

data survey and admin in units common of No.I

Let A be the Yellow Pages source. Let B the National Bank source.

Page 26: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

26

I(6)=[(1+5-1)/5]*100=100%

Weighted by number of employees:

I(6)W=[(32,584+39,272-32,584)/39,272]*100=100%

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

O - Describe the common identifiers of population units in administrative data

Coherence

Different administrative sources often have different population unit identifiers. The user can utilise this information to match records from two or more sources. Where there is a common identifier, matching is generally more successful.

U – Describe the record matching methods and processes used on the administrative data sources

Accuracy Record matching is when different administrative records for the same unit are matched using a common, unique identifier or key variables common to both datasets. There are many different techniques for carrying out this process. A description of the technique (e.g. automatic or clerical matching) should be provided along with a description (qualitative or quantitative) of its effectiveness.

Units (1) (2) (3) (4)=(1)*(2) (5)=(1)*(3) (6)=(1)*(2)*(3) (7) (8)=(4)*(7) (9)=(5)*(7) (10)=(6)*(7) (11)=(1)*(7)

X1 1 0 1 0 1 0 1,899 0 1,899 0 1,899

X2 0 1 1 0 0 0 249 0 0 0 0

X3 0 1 1 0 0 0 186 0 0 0 0

X4 0 0 1 0 0 0 48 0 0 0 0

X5 0 0 1 0 0 0 225 0 0 0 0

X6 1 0 1 0 1 0 1,536 0 1,536 0 1,536

X7 1 0 1 0 1 0 2,986 0 2,986 0 2,986

X8 1 1 1 1 1 1 32,584 32,584 32,584 32,584 32,584

X9 0 0 1 0 0 0 69 0 0 0 0

X10 1 0 1 0 1 0 267 0 267 0 267

X11 0 0 1 0 0 0 25 0 0 0 0

X12 0 1 1 0 0 0 46 0 0 0 0

Sum 5 4 12 1 5 1 40,120 32,584 39,272 32,584 39,272

Presence in

Source B ∩

Survey

Number of

employees

Presence in

Source A ∩ B

∩ Survey

Presence

in survey

Presence

in Source

A

Presence

in Source

B

Presence

in Source A

∩ Survey

Page 27: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

27

Cost and efficiency: 7 % of items obtained from admin source and also collected by survey

Description This indicator relates to the combination of admin and survey data. This indicator provides information on the double collection of data, both admin source and surveys. Thus, it provides an idea of redundancy as the same data items are being obtained more than once. This indicator should be calculated for each admin source and then aggregated. Note. Double collection is sometimes conducted for specific reasons, e.g. to measure quality or because admin data is not sufficiently timely for the requirements of the statistical output. If this is the case, this should be explained.

How to calculate

%100survey in itemsrelevant of No.

datasurvey andadmin by obtained itemscommon relevant of No.

Only admin data which meet the definitions and timeliness requirements of the output should be included.

Example

A. Statistical output: Monthly data on the industrial sector

B. Relevant units: Units in the survey

C. Relevant variable: Number of employees

D. Steps for calculation:

D1. Match each source with survey(s) by the common id code (if available) or by other methods

D2. Attribute a Presence(1)/Absence(0) indicator to relevant units in the survey (sum up for obtaining denominator)

D3. Attribute a Presence(1)/Absence(0) indicator for common items in the survey and in the source (sum up for obtaining numerator)

D4. Calculate the indicator as follows: Let EMP be the Social Security source. Let STS1 be the survey.

100*)(.

.)7(

ssurveyinitemsrelevantofofNo

datasurveyandminadbyobtaineditemscommonrelevantofNoI

Page 28: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

28

I(7)=(8/9)*100=89% Weighted by turnover: I(7)W=(422,426,156/422,444,698)*100=100% Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AR – Comment on the types of items that are being obtained by the admin source as well as the survey

Cost and Efficiency

If the same data items are collected from both the admin source and the survey, this can lead to duplication when combining the sources. This indicator should highlight to users the variables that are being collected across both sources.

AS – If items are purposely collected by both the admin source and the survey, describe the reason for this duplication (eg validity checks)

Cost and Efficiency

In some instances it may be beneficial to collect the same variables from both the admin source and the survey, such as to validate the micro level data. The reason for the double collection should be described to users.

Units Turnover

Number of employees (EMP)

Number of employees (STS1)

Number of items in STS1

Presence/Absence (1/0) index of items in STS1 and EMP

(A) (B) (C) (D) (E)=(B) ∩

(C) (A)*(D) (A)*(E)

X1 2,157,322 15 18 1 1 2,157,322 2,157,322 X2 14,000 0 0 0 0 X3 3,458,610 27 25 1 1 3,458,610 3,458,610 X4 358,987,462 587 600 1 1 358,987,462 358,987,462 X5 22,125 2 0 0 0 0 X6 5,027,321 34 34 1 1 5,027,321 5,027,321 X7 32,154 1 0 0 0 0 X8 18,542 1 1 0 18,542 0 X9 27,854 5 5 1 1 27,854 27,854 X10 52,369,584 965 962 1 1 52,369,584 52,369,584 X11 20,154 1 0 0 0 0 X12 153,000 2 2 1 1 153,000 153,000 X13 87,965 7 0 0 0 0 X14 245,003 17 18 1 1 245,003 245,003

Sum 9 8 422,444,698 422,426,156

Page 29: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

29

8 % reduction of survey sample size when moving from survey to admin data

Description This indicator relates to the combination of admin and survey data. This indicator provides information on the reduction in survey sample size because of an increased use of admin data. Only changes to the sample size due to using admin data should be included in this calculation. The indicator should be calculated for each survey and then aggregated (if applicable). This can be viewed as a one-off indicator when moving from a survey based output to an output involving admin data.

How to calculate

%100dataadmin of usein increase before size Sample

after size sample - dataadmin of usein increase before size Sample

Note. This indicator is likely to be calculated once, when making the change from survey to admin data.

Example 1 A. Statistical output: Annual data on structure and competitiveness of companies of

one NACE Division of industry sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover from industrial activities; turnover from service activities; turnover from trading activities of purchase and resale and from intermediary activities

D. Steps for calculation:

D1. Calculate sample size before increase in use of administrative data D2. Calculate sample size after increase in use of administrative data D3. Calculate the indicator as follows: In order to obtain the desired precision and reliability, we calculate three different sample sizes depending on the examined variable; so it would be suitable using at least the largest sample size dimension (amongst the three), which assures good results for all the three variables. Let the admin source be the Balance Sheet source. Let A be turnover from industrial activities. Let U be turnover from service activities. Let E be turnover from trading activities of purchase and resale and from intermediary activities. Survey sample size before increase in use of admin data: Average A (Ā): 10,263,650 Average U (Ū): 1,023,652 Average E (Ē): 3,976,678

Page 30: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

30

Correct sample variance of A (S2A): 34,350,734,902,500 = 5,860,9502

Correct sample variance of U (S2U): 40,000,000,000 = 200,0002

Correct sample variance of E (S2E): 62,500,000,000 = 250,0002

Variable Requested precision (ε) Requested reliability (1- α) Survey sample size (n)

A U E

205,273 10,000 11,000

95% 95% 95%

1,598 784

1,012

Sample size (A) = nA = (z2

α*S2A)/ε

2A = (1.962*34,350,734,902,500)/205,2732 = 3,134

Sample size (U) = nU = (z2α*S

2U)/ε

2U = (1.962*40,000,000,000)/10,0002 = 1,537

Sample size (E) = nE = (z2α*S

2E)/ε

2E = (1.962*62,500,000,000)/11,0002 = 1,984

Survey sample size after increase in use of admin data: Average A (Ā’): 9,852,347 Average U (Ū’): 1,000,365 Average E (Ē’): 3,782,658 Correct sample variance of A (S’2A): 28,718,452,281,600 = 5,358,9602 Correct sample variance of U (S’2U): 32,400,000,000 = 180,0002

Correct sample variance of E (S’2E): 52,900,000,000 = 230,0002

Variable Requested precision (ε) Requested reliability (1- α) Survey sample size (n)

A U E

205,273 10,000 11,000

95% 95% 95%

1,336 635 857

Sample size (A) = nA = (z2

α*S’2A)/ε2A = (1.962*28,718,452,281,600)/205,2732 = 2,618

Sample size (U) = nU = (z2α*S’2U)/ε

2U = (1.962*32,400,000,000)/10,0002 = 1,245

Sample size (E) = nE = (z2α*S’2E)/ε

2E = (1.962*52,900,000,000)/11,0002 = 1,680

%100)8(data admin of use in increase before sizeSample

after size sample- data admin of use in increase before size SampleI

%46.16100*3134

26183134

Thus, due to an increase in the use of admin data, the survey sample size has decreased by 16.46%.

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

L – Describe the impact of moving from a survey based output to an admin-data based output

Comparability Provide information on how the data are affected when moving from a survey based output to an admin-data based output. Comment on any reasons behind differences in the output and describe how any inconsistencies are dealt with.

Page 31: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

31

2.3.2 Quality Indicators

Accuracy: 9 Item non-response (% of units with missing values for key variables)

Description Although there are technically no ‘responses’ when using admin data, non-response (missing values at item or unit level) is an issue in the same way as with survey data. This indicator provides information on the extent of missing values for the key variables. The higher the level of missing values, the poorer the quality of the data (and potentially the statistical output). However, other indicators should also be considered, eg. the level of imputation and also the means of imputation used to address this missingness. This indicator should be calculated for each of the key variables and for each admin source and then aggregated based on the contributions of the variables to the overall output.

How to calculate

%100 variableXfor relevant units of No.

variableXfor valuemissing with dataadmin in the unitsrelevant of No.

This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output

Example

A. Statistical output: Quarterly construction

B. Relevant units: Units of the statistical population

C. Relevant variables: Number of employees

D. Steps for calculation:

D1. Match source A with units in the statistical population and take the common units D2. Calculate number of units in D1 with missing value for source A D4. Calculate the indicator as follows:

Let A be the Social Security source.

Page 32: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

32

I(9) = 100*.

.

iablevartheforrelevantunitsofNo

iablevartheforvaluesgsinmiswithdataminadtheinunitsrelevantofNo

I(9) = %10100*10

1

Weighted by turnover:

I(9)W= %59.0100*072,191,4

880,24

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

V – Describe the data processing known to be required on the administrative data source to deal with non-response

Accuracy Data processing is often required to deal with non-response. The user should be made aware of how and why particular data processing methods are used.

W – Describe differences between responders and non-responders

Accuracy This indicates to users how significant the non-response bias is likely to be. Where response is high, non-response bias is likely to be less of a problem than when there are high rates of non-response. NB: There may be instances where non-response bias is high even with very high response rates, if there are large differences between responders and non-responders.

X – Assess the likely impact of non-response/imputation on final estimates

Accuracy Non-response error may reduce the representativeness of the data. An assessment of the likely impact of non-response/imputation on final estimates allows users to gauge how reliable the key estimates are as estimators of population values. This assessment may draw upon the indicator: ‘% of imputed values (items) in the admin data’.

Units

Number of

employees (Source

A)

Number of units in

source A with missing values for employees

(1/0)=(Y/N) Number of relevant

units for the variable Turnover

X1 15 0 1 158,325

X2 1 1 24,880

X3 25 0 1 233,541

X4 178 0 1 780,251

X5 52 0 1 200,320

X6 1 0 1 18,000

X7 1 0 1 15,358

X8 37 0 1 785,423

X9 19 0 1 185,320

X10 612 0 1 1,789,654

Sum 1 10 4,191,072

Page 33: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

33

10 Misclassification rate

Description This indicator provides information on the proportion of units in the admin data which are incorrectly coded. For simplicity and clarity, activity coding as recorded on the Business Register (BR) can be considered to be correct – the example in this report makes this assumption (the validity of this assumption will depend on the systems used within different countries; other sources may be used if there is evidence they are more accurate than the BR). The level of coding used should be at a level consistent with the level used in the statistical output (e.g. if the statistical output is produced at the 3-digit level, then the accuracy of the coding should be measured at this level). This indicator should be calculated for each admin source and then aggregated based on the number of relevant units (weighted by turnover) in each source.

How to calculate

%100 dataadmin in unitsrelevant of No.

BR tocode NACEdifferent with dataadmin in unitsrelevant of No.

Note. If the activity code from the admin data is not used by the NSI (e.g. if coding from BR is used), details of the misclassification rate for the BR should be provided instead. If a survey is conducted to check the rate of misclassification, the rate from this survey should be provided and a note added to the indicator. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example 1

A. Statistical output: Business Register (BR) unit

B. Relevant units: Units in Construction sector

C. Relevant variables: Economic Activity code (NACE, 4 digits or 3 digits)

D. Steps for calculation:

D1. Match each source (VAT (file of “Value – Added Taxes” model) and/or CCIAA (file of declaration to ‘Chambers of Commerce’) with BR by the common id code (if available) or by other methods

D2. Attribute a Presence(1)/Absence(0) indicator to items of variable in each admin data (sum up for obtaining denominator)

D3. Attribute a value=1(0) for Inconsistency(Consistency) item between BR and source (sum up for obtaining numerator)

D4. Calculate the indicator (simple or aggregated, weighted or not weighted) as follows:

sourceCCIAA or VATin items #

)( digit) 3or 4 (NACE variableof nciesinconsiste #)10(

VATorCCIAASourceBRI

E. Tolerance: Consistency means equal NACE at 4 digits

Page 34: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

34

I(10)VAT=(2/3)*100=67%

Weighted by employment:

I(10)VATw=(2+5.42)/(2+16+5.42)*100=32%

I(10)CCIAA=(3/5)*100=60%

Weighted by employment:

I(10)CCIAA w=(2+1+5.42)/(2+2.25+1+16+5.42)*100=32%

I(10)aggregated VAT – CCIAA =(67*3+60*5)/(3+5)=63%

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

R - Describe differences in concepts, definitions and classifications between the administrative source and the statistical output

Coherence

There may be differences in concepts, definitions and classifications between the administrative source and statistical product. Concepts include the population, units, domains, variables and time reference and the definitions of these concepts may vary between the administrative data and the statistical product. Time reference problems occur when the statistical institution requires data from a certain time period but can only obtain them for another. Any effects on the statistical product need to be made clear along with any techniques used to remedy the problem.

Z – Describe how the misclassification rate is determined

Accuracy It is often difficult to calculate the misclassification rate.Therefore, where this is possible, a description of how the rate has been calculated should also be provided.

AA – Describe any issues with classification and how these issues are dealt with

Accuracy Whereas a statistical institution can decide upon and adjust the classifications used in its own surveys to meet user needs, the institution usually has little or no influence over those used by administrative sources. Issues with classification and how these issues are dealt with should be described so that the user can decide whether the source meets their needs.

unit NACE -BR

Nace-

Source

VAT

#items in VAT

Source

Inconsistency

VAT-BR (4

digits)

Nace-

Source

CCIAA

#items in

CCIAA

Source

Inconsistency

CCIAA-BR (4

digits)

Persons

employed

X1 43910 41200 1 1 412 1 1 2

X2 41200 0 0 1

X3 41200 0 412 1 0 2.25

X4 41200 0 0 2

X5 43390 0 41 1 1 1

X6 43290 43290 1 0 43290 1 0 16

X7 43220 0 0 1

X8 43120 41200 1 1 412 1 1 5.42

X9 432 0 0 1

X10 43290 0 0 1

Sum 3 2 5 3 32.67

Page 35: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

35

11 Undercoverage

Description This indicator provides information on the undercoverage of the admin data. That is, units in the reference population that should be included in the admin data but are not (for whatever reason). This indicator should be calculated for each admin source and then aggregated based on the number of relevant units (weighted by turnover) in each source. (For information on the progressive nature of administrative data, see Appendix A at the end of this chapter.)

How to calculate

%100 population referencein unitsrelevant of No.

dataadmin in NOTbut population referencein unitsrelevant of No.

Note. This could be calculated for each relevant publication of the statistical output, e.g. first and final publication. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example

A. Statistical output: Annual data on structure and competitiveness of craft business of the industrial sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Craft nature (Y/N=1/0) of the enterprise.

D. Steps for calculation:

D1. Identify units in reference population, i.e. population of craft enterprises of industry sector (e.g. using Business Register)

D2. Match source A with the units in D1 by the common identification code and take the units which are in D1 but not in A (relevant units in reference population but not in source A)

D4. Calculate the indicator as follows.

%100)11( population reference in units relevant of No.

data admin in NOT but population reference in units relevant of No.I

Let A be the Register of craft enterprises of Chamber of Commerce.

Page 36: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

36

Presence of unit in reference population taken from BR (i.e. craft business of industry sector): (Y/N)=(1/0)

Presence of unit in source A (Y/N)=(1/0)

Units not in Source A but in reference population Turnover

Units (1) (2) (3) (4) (5)=(3)*(4)

X1 1 1 0 1,532,620 0

X2 1 1 0 758,900 0

X3 1 1 0 256,300 0

X4 1 1 0 1,025,890 0

X5 1 0 1 650,000 650,000

X6 1 1 0 475,620 0

X7 1 1 0 965,002 0

X8 1 1 0 1,487,500 0

X9 1 1 0 325,640 0

X10 1 0 1 265,400 265,400

X11 1 1 0 654,250 0

X12 1 1 0 1,596,300 0

Sum 12 10 2 9,993,422 915,400

%67.16100*12

2)11( I

Weighted by turnover:

%16.9100*422,993,9

400,915)11( WI

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AB – Describe the extent of coverage of the administrative data and any known coverage problems

Accuracy This information is useful for assessing whether the coverage is sufficient. The population that the administrative data covers should be included along with all known coverage problems. There could be overcoverage (where duplicate records are included) or undercoverage (where certain records are missed).

AC – Describe methods used to deal with coverage issues

Accuracy Updating procedures and cleaning procedures should be described. Also, information should be provided on edit checks and /or imputations that are carried out because of coverage error. This information indicates to users the resultant robustness of the administrative data, after these procedures have been carried out to improve coverage.

AD – Assess the likely impact of coverage error on key estimates

Accuracy Coverage issues could relate to undercoverage and overcoverage (or both). Special studies can sometimes be carried out to assess the impact of undercoverage and overcoverage. An assessment of the likely impact of coverage error indicates to users how reliable the key estimates are as estimators of population values.

Page 37: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

37

12 Overcoverage

Description This indicator provides information on the overcoverage of the admin data. That is, units that are included in the admin data but should not be (e.g. are out-of-scope, outside the reference population). Note that when overcoverage is identified, quite often it can be addressed by removing these units when calculating the statistical output. However, in cases where overcoverage is identified but cannot be addressed, it is this estimate of ‘uncorrected’ overcoverage that should be provided for this indicator. This indicator should be calculated for each admin source and then aggregated based on the number of relevant units (weighted by turnover) in each source. (For information on the progressive nature of administrative data, see Appendix A at the end of this chapter.)

How to calculate

%100 population referencein units of No.

population referencein NOTbut dataadmin in units of No.

This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example

A. Statistical output: Annual data on structure and competitiveness of craft business of industrial sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Craft nature (Y/N=1/0) of the enterprise.

D. Steps for calculation:

D1. Identify units in reference population i.e. craft enterprises of industrial sector (e.g. using Business Register)

D2. Match source A with units in D1 by the common identification code and take the units which are in A but not in D1 (units in source A but not in reference population)

D3. Calculate the indicator as follows:

%100)12( population reference in units of No.

population reference in NOT but data admin in units of No.I

Let A be the Register of craft enterprises of the Chamber of Commerce.

Note: sometimes some enterprises are struck from the Register of craft business because they don’t have the legal requisites for admission anymore (e.g. they change legal status or the number of their employees increases too much etc.) but in the available version of data this fact is still not recorded because of a delay in the registration of these data in the Register.

Page 38: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

38

Presence of unit in reference population taken from BR (i.e. craft business of industry sector): (Y/N)=(1/0)

Presence of unit in source A (Y/N)=(1/0)

Units in Source A but not in reference population Turnover

Turnover in reference population

Units (1) (2) (3) (4) (5)=(1)*(4) (5)=(3)*(4)

X1 1 1 0 1,532,620 1,532,620 0

X2 0 1 1 758,900 0 758,900

X3 1 0 0 256,300 256,300 0

X4 1 1 0 1,025,890 1,025,890 0

X5 1 1 0 650,000 650,000 0

X6 1 1 0 475,620 475,620 0

X7 1 1 0 965,002 965,002 0

X8 1 1 0 1,487,500 1,487,500 0

X9 0 1 1 325,640 0 325,640

X10 1 0 0 265,400 265,400 0

X11 1 1 0 654,250 654,250 0

X12 1 1 0 1,596,300 1,596,300 0

Sum 10 10 2 9,993,422 9,908,882 1,084,540

%00.20100*10

2)12( I

Weighted by turnover:

%17.12100*882,908,8

540,084,1)12( WI

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AB – Describe the extent of coverage of the administrative data and any known coverage problems

Accuracy This information is useful for assessing whether the coverage is sufficient. The population that the administrative data covers should be included along with all known coverage problems. There could be overcoverage (where duplicate records are included) or undercoverage (where certain records are missed).

AC – Describe methods used to deal with coverage issues

Accuracy Updating procedures and cleaning procedures should be described. Also, information should be provided on edit checks and /or imputations that are carried out because of coverage error. This information indicates to users the resultant robustness of the administrative data, after these procedures have been carried out to improve coverage.

AD – Assess the likely impact of coverage error on key estimates

Accuracy Coverage issues could relate to undercoverage and overcoverage (or both). Special studies can sometimes be carried out to assess the impact of undercoverage and overcoverage. An assessment of the likely impact of coverage error indicates to users how reliable the key estimates are as estimators of population values.

Page 39: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

39

13 % of units in the admin source for which reference period differs from the required reference period

Description This indicator provides information on the proportion of units that provide data for a different reporting period than the required period for the statistical output. If the periods are not those required, then some imputation is necessary, which may impact quality. This indicator should be calculated for each admin source and then aggregated based on the number of relevant units (weighted by turnover) in each source.

How to calculate

%100 dataAdmin in unitsrelevant of No.

period required from period reporting

different with dataAdmin in unitsrelevant of No.

This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Note: In some cases, 'calenderization' adjustments must be made to get the administrative data to the correct periodicity - for example, converting quarterly data to monthly data. If this is required, it may be helpful to calculate an additional indicator covering the proportion of units for which calenderization adjustments have taken place.

Example

A. Statistical output: The Business Register (BR) Enterprise units

B. Relevant units: Corporations in Admin data with different reporting period from required period

D: Steps for calculation:

D1. From Balance Sheet source take all corporations with different reporting period

with respect to the required period. For example, required period is 01.01.2009-

31.12.2009 while a different reporting period could be 30.06.2008-30.06.2009

D2. Match all BR corporations with Balance Sheet by the common id code (if available)

or by other methods

D3. Calculate the indicator as follows:

Let A be the Balance Sheet source.

I(13)=(Numerator/Denominator)*100=(11.341/607.899)*100=1.87%

Numerator: No of relevant corporations with different required period;

Denominator: No of relevant corporations in BR.

Units Employees Turnover

607,899 7,980,361 2,088,997,442,877

11,341 410,777 140,954,727,363No of BR corporations present in Balance Sheet for which

reference period is different from Br required reference period (2)

No of BR corporations present in Balace Sheet source (1)

Page 40: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

40

Weighted by employees:

I(13)w(E)=[Employees(2)/Employees(1)]*100=(410,777/7,980,361)*100=5.15%

Weighted by turnover:

I(13) w (T)

=[Turnover(2)/Turnover(1)]*100=(140,954,727,363/2,088,997,442,877)*100=6.75%

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AE – Describe the data processing known to be required on the administrative data source to address instances where the reference period differs from the required reference period.

Accuracy Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

Page 41: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

41

14 Size of revisions from the different versions of the admin data RAR – Relative Absolute Revisions

Description This indicator assesses the size of revisions from different versions of the admin data, providing information on the reliability of the data received. With this indicator it is possible to understand the impact of the different versions of admin data on the results for a certain reference period. When data is revised based on other information (e.g. survey data) this should not be included in this indicator. The indicator should be calculated for each admin source and then aggregated.

How to calculate

%100

1

1

T

t Pt

T

t PtLt

X

XX

= Latest data for X variable

= First data for X variable If only one version of the admin data is received, this indicator is not relevant. Note. This indicator should only be calculated for estimates based on the same units (not including any additional units added in a later delivery of the data). This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example A. Statistical output: Monthly manufacturing

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover

D. Steps for calculation:

D1. Identify the statistical unit (enterprise) in the first and in the second version of data coming from the same source

D2. Match the source with the units in the statistical population by the common identification code (if available) or by other methods, and take the units in common

D3. Take the non missing values (XPt) from the first data version D4. Take the non missing values (XLt) from the second data version for the same units

received in the first data version D5. Calculate the difference (absolute value) between the latest data and the first data

version for each unit D6. Sum up the differences and divide it by the sum of the absolute values of the first

data D7. Calculate the indicator as follows:

LtX

PtX

Page 42: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

42

Let A be the VAT Turnover source.

Units

Turnover 1st data version

Turnover 2nd

data version Employment

(A) (B) (P) (D)=|(B)-(A)| (E)=(D)*(P) (F)=(A)*(P)

X1 15,860 18,362 0 2,502 0 0

X2 596,321 597,523 25 1,202 30,050 14,908,025

X3 1,500,693 1,500,693 63 0 0 94,543,659

X4 276,365 276,527 12 162 1,944 3,316,380

X5 56,321 56,321 2 0 0 112,642

X6 159,632 160,523 6 891 5,346 957,792

X7 1,895,471 1,925,632 132 30,161 3,981,252 250,202,172

X8 15,630 15,630 0 0 0 0

X9 28,963 30,213 0 1,250 0 0

X10 58,741 58,967 1 226 226 58,741

X11 41,205 41,205 1 0 0 41,205

Sum 4,645,202 4,681,596 242 36,394 4,018,818 364,140,616

I(14)= (36,394/4,645,202)*100 = 0.78%

Weighted by employment:

I(14)w=(4,018,818/364,140,616)*100 =1.10%

100 *

| |

| |

I(14)

1

1

T

t

T

t

XPt

XPt XLt

Page 43: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

43

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AF - Comment on the impact of the different versions of admin data on the results

Accuracy When commenting on the size of the revisions of the different versions of the admin data, information on the impact of the revisions on the statistical product for the relevant reference period should also be explained to users.

AG – Flag any published data that are subject to revision and data that have already been revised

Accuracy This indicator alerts users to published data that may be, or have already been, revised. This will enable users to assess whether provisional data will be fit for their purposes.

AH – For ad hoc revisions, detail revisions made and provide reasons

Accuracy Where revisions occur on an ad hoc basis to published data, this may be because earlier estimates have been found to be inaccurate. Users should be clearly informed of the revisions made and why they occurred. Clarifying the reasons for revisions guards against any misinterpretation of why revisions have occurred, while at the same time making processes (including any errors that may have occurred) more transparent to users.

AP – Reference/link to detailed revisions analyses

Accessibility Users should be directed to where detailed revisions analyses are available.

Page 44: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

44

15 % of units in admin data which fail checks

Description This indicator provides information on the extent to which data fail some elements of the checks (automatic or manual) and are flagged by the NSI as suspect. This does not mean that the data are necessarily adjusted (see Indicator 16), simply that they fail one or more check(s). This checking can either be based on a model, checking against other data sources (admin or survey), internet research or through direct contact with the businesses. This indicator should be calculated for each of the key variables and aggregated based on the number of relevant units (weighted by turnover) in each source.

How to calculate

%100 checked unitsrelevant of no. Total

failed and checked dataadmin in unitsrelevant of No.

Note. If the validation is done automatically and the system does not flag or record this in some way, this should be noted. Users should state the number of checks done, and the proportion of data covered by these checks. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

A. Statistical output: Quarterly data on Transportation and storage

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover, NACE code.

D. Steps for calculation:

D1: Identify for each key variable the number of units checked in admin data. D2: Identify for each key variable the number of units in admin data that fail checks. D3: Average the proportions of units that fail checks by weighting by the numbers of

units.

I(15) = %100 checked units relevant of no. Total

failed and checked data admin in units relevant of No.

Let the source of Turnover be the VAT Turnover source. Let the source of NACE code be the Chamber of Commerce source.

Page 45: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

45

I(15) = [(3+1)/(12+10)]*100 = 18.18%

Weighted by employment: I(15)w = [(15+32)/(265+262)]*100 = 8.92%

Units Units checked Units checked

Units failing check (Y/N)=(1/0)

Units failing check (Y/N)=(1/0)

var=Turnover var=NACE code Var=Turnover

Var=NACE code Employees

(A) (B) (C) (D) (E) (A)*(E) (B)*(E) (C)*(E) (D)*(E) X1 1 1 1 0 15 15 15 15 0 X2 1 1 0 1 0 0 0 0 0 X3 1 0 0 0 3 3 0 0 0 X4 1 1 0 1 0 0 0 0 0 X5 1 1 0 0 1 1 1 0 0 X6 1 1 0 0 5 5 5 0 0 X7 1 1 0 0 14 14 14 0 0 X8 1 0 0 0 0 0 0 0 0 X9 1 1 0 0 150 150 150 0 0 X10 1 1 0 0 27 27 27 0 0 X11 1 1 0 0 18 18 18 0 0 X12 1 1 0 1 32 32 32 0 32 Sum 12 10 1 3 265 265 262 15 32

Page 46: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

46

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AI – Describe the known sources of error in administrative data

Accuracy Metadata provided by the administrative source and/or information from other reliable sources can be used to assess data errors. The magnitude of any errors (where known) that have a significant impact on the administrative data should be made available to users. This will help the user to understand how accurate the administrative data are.

AJ – Describe the data processing known to be required on the administrative data source in terms of the types of checks carried out

Accuracy Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

AK – Describe processing systems and quality control

Accuracy This informs users of the mechanisms in place to minimise processing error by ensuring accurate data transfer and processing.

AL – Describe the main sources of measurement error

Accuracy Measurement error is the error that occurs from failing to collect the true data values from respondents. Measurement error cannot usually be calculated. However, outputs should contain a definition of measurement error and a description of the main sources of the error when the admin data holder gathers the data.

AM – Describe processes employed by the admin data holder to reduce measurement error

Accuracy Describing processes to reduce measurement error indicates to users the accuracy and reliability of the measures.

AN – Describe the main sources of processing error

Accuracy Processing error is the error that occurs when processing data. It includes errors in data capture, coding, editing and tabulation of the data. It is not usually possible to calculate processing error exactly. However, outputs should be accompanied by a definition of processing error and a description of the main sources of the error.

Page 47: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

47

16 % of units for which data have been adjusted

Description This indicator provides information about the proportion of units for which the data have been adjusted (a subset of the units included in Indicator 15). These are units that are considered to be erroneous and are therefore adjusted in some way (missing data should not be included in this indicator – see Indicator 9). Any changes to the admin data before arrival with the NSI should not be considered in this indicator. This indicator should be calculated for each of the key variables and aggregated based on the number of relevant units (weighted by turnover) in each source.

How to calculate

%100 DataAdmin in unitsrelevant of No.

data adjusted with dataAdmin in the unitsrelevant of No.

This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example

A. Statistical output: Quarterly data on Transportation and storage

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover, NACE code.

D. Steps for calculation:

D1: Identify for each key variable the number of units in admin data D2: Identify for each key variable the number of units in admin data that have been

adjusted D3: Average the proportions of units that have been adjusted by weighting by the

numbers of units

I(16) = %100 Data Admin in units relevant of No.

data adjusted withdata Admin the in units relevant of No.

Let the source of Turnover be the VAT Turnover source. Let the source of NACE code be the Chamber of Commerce source.

Page 48: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

48

I(16) = [(1+1)/(12+10)]*100=9.09%

Weighted by employment: I(16)w = [(15+0)/(265+262)]*100=2.85%

Units Units in admin data

Units in admin data

Units adjusted (Y/N)=(1/0)

Units adjusted (Y/N)=(1/0)

var=Turnover var=NACE code Var=Turnover Var=NACE code Employees

(A) (B) (D) (E) (F) (A)*(F) (B)*(F) (D)*(F) (E)*(F) X1 1 1 1 0 15 15 15 15 0 X2 1 1 0 1 0 0 0 0 0 X3 1 0 0 0 3 3 0 0 0 X4 1 1 0 0 0 0 0 0 0 X5 1 1 0 0 1 1 1 0 0 X6 1 1 0 0 5 5 5 0 0 X7 1 1 0 0 14 14 14 0 0 X8 1 0 0 0 0 0 0 0 0 X9 1 1 0 0 150 150 150 0 0 X10 1 1 0 0 27 27 27 0 0 X11 1 1 0 0 18 18 18 0 0 X12 1 1 0 0 32 32 32 0 0 Sum 12 10 1 1 265 265 262 15 0

Page 49: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

49

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

S - Describe any adjustments made for differences in concepts and definitions between the administrative source and the statistical output

Coherence Adjustments may be required as a result of differences in concepts and definitions between the administrative data and the requirements of the statistical product. A description of why the adjustment needed to be made and how the adjustment was made should be provided to the users so they can assess the coherence of the statistical product with other sources.

AK – Describe processing systems and quality control

Accuracy This informs users of the mechanisms in place to minimise processing error by ensuring accurate data transfer and processing.

AL – Describe the main sources of measurement error

Accuracy Measurement error is the error that occurs from failing to collect the true data values from respondents. Measurement error cannot usually be calculated. However, outputs should contain a definition of measurement error and a description of the main sources of the error when the admin data holder gathers the data.

AM – Describe processes employed by the admin data holder to reduce measurement error

Accuracy Describing processes to reduce measurement error indicates to users the accuracy and reliability of the measures.

AN – Describe the main sources of processing error

Accuracy Processing error is the error that occurs when processing data. It includes errors in data capture, coding, editing and tabulation of the data. It is not usually possible to calculate processing error exactly. However, outputs should be accompanied by a definition of processing error and a description of the main sources of the error.

AO – Describe the data processing known to be required on the administrative data source in terms of the types of edits carried out

Accuracy Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

Page 50: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

50

17 % of imputed values (items) in the admin data

Description This indicator provides information on the impact of the values imputed by the NSI. These values are imputed because data are missing (see Indicator 9) or data items are unreliable (see Indicator 16). This indicator should be calculated by variable for each admin source and then aggregated based on the contributions of the variables to the overall output.

How to calculate

%100 dataadmin in itemsrelevant of No.

dataadmin relevant in the items imputed of No.

This indicator should be weighted (e.g. by turnover or number of employees) in terms of the % contribution of the imputed values to the statistical output.

Example

A. Statistical output: The results of a sectoral survey

B. Relevant units: All the units in a specific NACE activity code

C. Relevant variables: The variables NACE activity Code; number of employees; turnover

D. Steps for calculation:

D1: For each source identify the variables which are used for the statistical output

D2. For each variable in the source calculate the number of items in admin data

D3. For each variable in the source identify all the units with items present in admin

data which are afterwards imputed

D4: For each variable in the source calculate the non-missing items in the statistical

output

D5. For each variable calculate the proportion of D3 on D2

D6. Calculate the indicator for each source weighting the proportions with the items of

D4

D7. Calculate the general indicator weighting the indicators of D6 for the data

X1 16231 2 Absent in the source Absent in the source

X2 16232 0 16232 16231 80,305

X3 missing 16231 3 16231 127,118

X4 16231 0 Absent in the source Absent in the source

X5 17110 16231 10 16231 335,550

X6 16231 15 missing 25,332

X7 16231 1 47112 16231 118,125

X8 missing 16231 0 5 16231 63,212

X9 16231 0 16231 missing 7550

X10 16291 0 16291 18,123

10 3 10 1 8 2 8 1

Nace

activity

code

source A

Nace activity

code source

A afterwards

imputed

Number

of units

Turnover

Source B

afterwards

imputed

Units Number of

employees source A

afterwards imputed

Nace activity code

source B

Nace activity

code source B

afterwards

imputed

Turnover Source BNumbers

of

employees

Source A

Page 51: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

51

Percentage of units in source A with Nace activity codes imputed: (3/10)*100=30%

Percentage of units in Source A with number of employees imputed: (1/10)*100=10%

Percentage of units in Source B with Nace activity codes imputed: (2/8)*100=25%

Percentage of units in Source B with Turnover imputed:(1/8)*100=12.5%

I(17)Source A=[(3+1)/(10+10)]*100=20%

I(17)Source B=[(2+1)/(8+8)]*100=18.75%

I(17)Sources A and B=(20*10+18.75*8)/(10+8)=19.4% Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

X – Assess the likely impact of non-response/imputation on final estimates

Accuracy Non-response error may reduce the representativeness of the data. An assessment of the likely impact of non-response/imputation on final estimates allows users to gauge how reliable the key estimates are as estimators of population values. This assessment may draw upon the indicator: ‘% of imputed values (items) in the admin data’.

Y – Comment on the imputation method(s) in place within the statistical process

Accuracy The imputation method used can determine how accurate the imputed value is. Information should be provided on why the particular method(s) was chosen and when it was last reviewed.

Page 52: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

52

Timeliness and punctuality: 18 Delay to accessing / receiving data from Admin Source

Description This indicator provides information on the proportion of the time from the end of the reference period to the publication date that is taken up waiting to receive the admin data. This is calculated as a proportion of the overall time between reference period and publication date to provide comparability across statistical outputs. This indicator should be calculated for each admin source and then aggregated.

How to calculate

%100daten publicatio toperiod reference of end thefrom Time

dataAdmin receiving toperiod reference of end thefrom Time

Note. Include only the final dataset used for the statistical output. If a continuous feed of data is received, the ‘last’ dataset used to calculate the statistical output should be used in this indicator. If more than one source is used, an average should be calculated, weighted by the sources’ contributions to the final estimate. If the admin data are received before the end of the reference period, this indicator would be 0. This indicator applies to the first publication only, not to revisions.

Example A. Statistical output: Annual data on structure and competitiveness of

enterprises with employees

B. Relevant units: Units in the statistical population

C. Relevant variables: Number of employees; Turnover; Production value.

D. Steps for calculation:

D1. Match each source with the units in the statistical population by the common identification code (if available) or by other methods, obtaining the number of common units

D2. Calculate for each source the number of days from the end of reference period to the arrival of admin data

D3. Calculate the number of days from the end of reference period to dissemination date

D4. Calculate the indicator as follows:

%100)18( datenpublicatiotoperiodreferenceofend the from Time

dataAdmin receivingtoperiod referenceofendthe fromTimeI

Let A be the Social Security source (from which we take the number of employees). Let B be the Fiscal register (from which we take the VAT proxy of turnover). Let C be Balance Sheet source (from which we take the Production Value).

Page 53: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

53

Number of units in source and statistical population

Number of days from the end of the reference period to receiving Admin data

Number of days from the end of reference period to publication date

I(18) for each source

Weighting for contributions

Source (1) (2) (3) (4)=(2)/(3) (5)=(1)*(4)

A 2,598 186 291 63.92% 1660.64

B 1,962 123 291 42.27% 829.34

C 2,241 235 291 80.76% 1809.83

Sum 6,801 4299.81

%92.63100*291

186)18( EmployeesI

%27.42100*291

123)18( TurnoverI

%76.80100*291

235)18( Pr valueoductionI

Weighted by the contributions of the source to the statistical output:

%22.63100*801,6

81.4299)18( AggregateI

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

G – Describe the timescale since the last update of data from the administrative source

Timeliness An indication of the timescale since the last update from administrative sources will provide the user with an indication of whether the statistical product is timely enough to meet their needs.

H – Describe the extent to which the administrative data are timely

Timeliness Provide information on how soon after their collection the statistical institution receives the administrative data. The effects of any lack of timeliness on the statistical product should be described.

I – Describe any lack of punctuality in the delivery of the administrative data source

Timeliness Give details of the time lag between the scheduled and actual delivery dates of the data. Any reasons for the delay should be documented along with their effects on the statistical product.

J – Frequency of production

Timeliness This indicates how timely the outputs are, as the frequency of publication indicates whether the outputs are up to date with respect to users’ needs.

K – Describe key user needs for timeliness of data and how these needs have been addressed

Timeliness This indicates how timely the data are for specified needs, and how timeliness has been secured, eg by reducing the time lag to a number of days rather than months for monthly releases.

Page 54: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

54

Comparability:

19 Discontinuity in estimate when moving from a survey-based output to an

output involving admin data

Description This indicator measures the impact on the level of the estimate when changing from a survey-based output to an output involving admin data (either entirely admin based or partly). This indicator should be calculated separately for each key estimate included in the output. This can be viewed as a one-off indicator when moving from a survey based output to an output involving admin data.

How to calculate

%100 survey fromEstimate

surveyfrom Estimate - data Admininvolving Estimate

Note. This indicator should be calculated using survey and admin data which refer to the same period.

Example

A. Statistical output: Monthly manufacturing

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover; number of employees

D. Steps for calculations: D1. Compute the estimate of the variable(s) for the survey based output D2. Compute the estimate of the variable(s) for the admin-data based output D3. Calculate the indicator as follows:

%100 survey fromEstimate

surveyfrom Estimate - data Admininvolving EstimateI(19)

Let A be the VAT Turnover source. Let B be the Social Security source. Estimator: sample mean.

Page 55: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

55

Units (A) (B) (D) (E)

X1 56,321 75,210 1 0

X2 118,948 120,321 4 4

X3 658,362 658,362 20 22

X4 29,632 31,550 0 0

X5 85,690 102,362 3 3

X6 522,360 522,360 30 30

X7 14,520,369 14,554,320 153 155

X8 99,652 101,520 0 0

X9 369,584 369,584 8 8

X10 887,456 890,630 22 22

X11 58,630 61,230 0 0

X12 741,252 741,550 6 6

Sum 18,148,256 18,228,999 247 250

Turnover

Source A

Turnover

Survey

Employees

Source B

Employees

Survey

Estimate Turnover (Source A) = 18,148,256/12 = 1,512,355

Estimate Turnover (Survey) = 18,228,999/12 = 1,519,083

I(19) Turnover = [(1,512,355-1,519,083)/1,519,083]*100 = -0.44%

Estimate Employees (Source B) = 247/12 = 20.6 Estimate Employees (Survey) = 250/12 = 20.8 I(19)Employees = [(20.6-20.8)/20.8]*100 = -0.96%

Page 56: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

56

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

L – Describe the impact of moving from a survey based output to an admin-data based output

Comparability Provide information on how the data are affected when moving from a survey based output to an admin-data based output. Comment on any reasons behind differences in the output and describe how any inconsistencies are dealt with.

M – Describe any method(s) used to deal with discontinuity issues

Comparability Where the level of the estimate is impacted when moving from a survey based output to an admin-data based output, it may be possible to address this discontinuity. In this case a description of the method(s) used to deal with the discontinuity should be provided.

N - Describe the reasons behind discontinuities when moving from survey based estimates to admin-data based estimates

Comparability Where it is not possible to address a discontinuity in estimates when moving from a survey based output to an admin-data based output, the reasons behind the discontinuity should be provided along with commentary around the impact on the level of the estimate. Any changes in terms of the advantages and limitations of the differences should also be highlighted.

Page 57: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

57

Coherence: 20 % of consistent items for common variables in more than one source7

Description This indicator provides information on consistent items for any common variables across sources (either admin or survey). Only variables directly required for the statistical output should be considered – basic information (e.g. business name and address) should be excluded. Values within a tolerance should be considered consistent – the width of this tolerance (1%, 5%, 10%, etc.) would depend on the variables and methods used in calculating the statistical output. This indicator should be calculated for each of the key variables and aggregated based on the contributions of the variables to the overall output.

How to calculate

%100 variable X for required items of no. Total

variable X for tolerance) (within items consistent of No.

Note. If only one source is available or there are no common variables, this indicator is not relevant. Please state the tolerance used. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example A. Statistical output: Annual data on structure and competitiveness of enterprises

of trade sector

B. Relevant units: Units in the survey

C. Relevant variables: Number of employees

D. Steps for calculation:

D1. Match each source with the survey by the common identification code (if available) or by other methods D2. Attribute a Presence(1)/Absence(0) indicator to items of the variables in the survey (sum up to obtain denominator) D3. Attribute a value 1(0) for consistent (not consistent) items in the survey and in the source (it is considered “consistent” if the percentage difference is less than 3%) D4. Calculate the indicator as follows:

%100)20( variable Xfor required items of no.Total

variableX for tolerance) (within items consistentofNo.I

E. Tolerance method: Max Difference = 3%

Let A be the Social Security source. Let B be the Nielsen data bank data.

7Indicators 20 and 23 are the only indicators in Section 2.3.2 for which a high indicator score denotes high

quality and a low indicator score denotes low quality.

Page 58: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

58

%29.64100*14

9)20( I

Weighted by turnover:

%54.37100*541,619,281

830,724,105)20( WI

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

O – Describe the common identifiers of population units in administrative data

Coherence Different administrative sources often have different population unit identifiers. The user can utilise this information to match records from two or more sources. Where there is a common identifier, matching is generally more successful.

Q - Describe the width of the tolerance and the reasons for this

Coherence Where values within a particular tolerance are considered consistent for common variables across more than one source, the width of the tolerance should be stated, along with a brief explanation as to why this particular tolerance width was chosen.

U – Describe the record matching methods and processes used on the administrative data sources

Accuracy Record matching is when different administrative records for the same unit are matched using a common, unique identifier or key variables common to both datasets. There are many different techniques for carrying out this process. A description of the technique (e.g. automatic or clerical matching) should be provided along with a description (qualitative or quantitative) of its effectiveness.

Units (1) (2) (3)={[(2)-(1)]/(1)}*100 (4) (5) (6)=(4)*(5)

X1 152 154 1.32 1 38,985,610 38,985,610

X2 335 352 5.07 0 58,945,620 0

X3 15 15 0.00 1 7,540,210 7,540,210

X4 29 40 37.93 0 48,540,210 0

X5 2 2 0.00 1 298,540 298,540

X6 0 0 0.00 1 680,000 680,000

X7 11 12 9.09 0 1,548,760 0

X8 18 18 0.00 1 1,800,000 1,800,000

X9 60 61 1.67 1 9,856,410 9,856,410

X10 71 70 1.41 1 17,564,280 17,564,280

X11 29 27 6.90 0 6,985,471 0

X12 569 600 5.45 0 59,874,650 0

X13 235 240 2.13 1 26,541,780 26,541,780

X14 11 11 0.00 1 2,458,000 2,458,000

Sum 1,537 1,602 9 281,619,541 105,724,830

Turnover

Number of

employees source A

Number of employees

source B

Percentage

difference <3%

(Y/N)=(1/0) A and B

Percentage

difference between

Page 59: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

59

oSS

S

U+U

U

21 % of relevant units in admin data which have to be adjusted to create statistical units

Description This indicator provides information on the proportion of units that have to be adjusted in order to create statistical units. For example, the proportion of data at enterprise group level which therefore need to be split to provide reporting unit data.

How to calculate

Relevant units in the reference population that are adjusted to the statistical concepts by the use of statistical methods

Relevant units in the reference population that correspond to the statistical concepts This indicator should be weighted (e.g. by turnover or number of employees) in terms of the % contribution of these units to the statistical output. Note: Frequently, administrative units must be aggregated into 'composite units' before being disaggregated into statistical units. If this is required, it may be helpful to calculate an additional indicator covering the proportion of administrative units which can be successfully matched, or 'aligned', with composite units.

Example

A. Statistical output: Annual data on structure and competitiveness in industry sector

B. Relevant units: Enterprises in the statistical population (but statistical units are enterprises groups)

D. Steps for calculation:

D1. Identify the units in admin data which need to be adjusted in order to obtain the relevant statistical units

D2. Identify the relevant units in admin data that correspond to the statistical concepts. D3. Divide D1 by (D1+D2) to calculate the indicator as follows:

100*)21(

UosUs

UsI

Where:

Us= Relevant units in the reference population that are adjusted to the statistical concepts by the use of statistical methods.

Uos= Relevant units in the reference population that correspond to the statistical concepts.

SS UU

oSU

Page 60: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

60

%25.81100*313

13)21(

I

Weighted by number of employees;

%45.96100*116156,3

156,3)21(

WI

Relevant Units (1) (2) (3) (4) (5)=(2)*(4) (5)=(3)*(4)

X1 A1 0 1 26 0 26

X2 A1 0 1 369 0 369

X3 A1 0 1 856 0 856

X4 A2 1 0 96 96 0

X5 A3 0 1 15 0 15

X6 A3 0 1 27 0 27

X7 A3 0 1 100 0 100

X8 A3 0 1 25 0 25

X9 A3 0 1 38 0 38

X10 A4 1 0 2 2 0

X11 A5 0 1 2 0 2

X12 A5 0 1 0 0 0

X13 A6 0 1 15 0 15

X14 A6 0 1 985 0 985

X15 A6 0 1 698 0 698

X16 A7 1 0 18 18 0

Sum 3 13 3,272 116 3,156

Enterprises

groups code

Relevant unit

corresponds to

statistical unit

(Y/N)=(1/0)

Number of

employees

Relevant unit

corresponds to

statistical unit

(N/Y)=(1/0)

Page 61: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

61

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

D – Describe constraints on the availability of administrative data at the required level of detail

Relevance Some administrative microdata have restricted availability or may only be available at aggregate level. Describe any restrictions on the level of data available and their effects on the statistical product.

R - Describe differences in concepts, definitions and classifications between the administrative source and the statistical output

Coherence

There may be differences in concepts, definitions and classifications between the administrative source and statistical product. Concepts include the population, units, domains, variables and time reference and the definitions of these concepts may vary between the administrative data and the statistical product. Time reference problems occur when the statistical institution requires data from a certain time period but can only obtain them for another. Any effects on the statistical product need to be made clear along with any techniques used to remedy the problem.

S - Describe any adjustments made for differences in concepts and definitions between the administrative source and the statistical output

Coherence

Adjustments may be required as a result of differences in concepts and definitions between the administrative data and the requirements of the statistical product. A description of why the adjustment needed to be made and how the adjustment was made should be provided to the users so they can assess the coherence of the statistical product with other sources.

Page 62: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

62

Cost and efficiency:

22 Cost of converting admin data to statistical data

Description This indicator provides information on the estimated cost (in person hours) of converting admin data to statistical data. It can be considered in two ways: either as a one-off indicator to identify the set-up costs of moving from survey data to administrative data (as such it should include set-up costs, monitoring of data sources, negotiating with data providers, etc.), or as a regular indicator to identify the ongoing running costs of the system that converts the administrative data to statistical data (which should include costs of technical processing, monitoring of the data, ongoing liaison with data providers, etc.). The indicator should be calculated for each admin source and then aggregated based on the contribution of the admin source to the statistical output.

How to calculate

(Estimated) Cost of conversion in person hours

Note. This should only be calculated for parts of the admin data relevant to the statistical output.

Example

A. Statistical output: A sectoral output

B. Relevant units: Enterprises with commercial area greater than 400 m2

D. Steps for calculation:

D1.Identify the time in person hours necessary to convert the admin data in order to obtain statistical data as a function of admin source size and complexity in the treatment of admin data.

Let c1=number of records in admin data.

Let c2=number of records processed per hour=complexity coefficient.

I(22)=Cost of conversion in person hours=f(no. of record in admin data, no. of records

processed per hour)

= 2

1

c

cNumber of person hours= H36

83

000,3

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AT - Describe the processes required for converting admin data to statistical data and comment on any potential issues

Cost and efficiency

Data processing may be required to convert the admin data to statistical data. This indicator should provide the user with information on the types of processes and techniques used, along with a description of any issues that are dealt with, or perhaps still remain.

Page 63: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

63

23 Efficiency gain in using admin data8

Description This indicator provides information on the efficiency gain in using admin data rather than simply using survey data. For example, collecting admin data is usually cheaper than collecting data through a survey but this benefit might be offset by higher processing costs. This indicator should consider the total estimated costs of producing the output when using survey data (potentially a few years ago if the move was gradual) and then compare this to the total estimated costs of producing the output when using admin data or a combination of both. Production cost should include all costs the NSI is able to attribute to the production of the statistical output. (For example, this may include the cost of the use of computers and electrical equipment, staff costs, cost of data processing, cost of results dissemination, etc.) This can be viewed as a one-off indicator when moving from a survey based output to an output involving admin data.

How to calculate

%100statistic basedsurvey ofcost Production

statistic basedadmin ofcost production - statistic basedsurvey ofcost Production

Note. Estimated costs are acceptable.

This indicator is likely to be calculated once, when making the change from survey to admin data.

Example

A. Statistical output: Quarterly data on the manufacture of machinery and equipment

B. Relevant units: Units in the statistical population

D. Steps for calculation:

D1. Quantify costs of survey based statistic (total cost of the survey including questionnaires, mailing, recalling, staff etc.)

D2. Quantify cost of statistic when based on admin data (cost of admin source acquisition, processing costs, staff etc.)

%100 statisticbased survey of cost Production

statisticbased admin of cost production - statisticbased survey of cost ProductionI(23)

8 Indicators 20 and 23 are the only indicators in Section 3.2 for which a high indicator score denotes high quality

and a low indicator score denotes low quality.

Page 64: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

64

I(23) = [(89,086-46,000)/89,086]*100 = 48.36% Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AT - Describe the processes required for converting admin data to statistical data and comment on any potential issues

Cost and efficiency

Data processing may be required to convert the admin data to statistical data. This indicator should provide the user with information on the types of processes and techniques used, along with a description of any issues that are dealt with, or perhaps still remain.

Cost of survey based statistics = f(c1,c2,c3,c4,c5,c6,c7,c8,n)

where:

c1=cost of survey planning € 18,000

c2=cost of use of computers and electrical equipment € 786

c3=cost of each questionnaire € 3

c4=cost of mailing for each questionnaire € 3

c5=cost of staff employed in survey € 30,000

c6=cost of telephone calls for survey requirements € 250

c7=cost of possible website ad hoc for the survey € 2,300

c8=cost of results dissemination € 6,000

c9=cost of data processing € 750

c10=other costs € 10,000

n=number of questionnaires 3,500

Cost of survey based statistic = c1+c2+c5+c6+c7+c8+c9+c10+n*(c3+c4)

= 18,000+786+30,000+250+2,300+6,000+750+10,000+3,500*(3+3) = € 89,086

Cost of admin based statistics = f(c1,c2,c3,c4,c5,c6,c7)

c1=cost of planning € 5,000

c2=cost of admin source € 10,000

c3=cost of use of computers and electrical equipment € 2,000

c4=cost of staff employed € 7,000

c5=cost of data processing € 15,000 c6=cost of results dissemination € 6,000 c7=other costs € 1,000

Cost of admin based statistics = C1+c2+c3+c4+c5+c6+c7

€ 46,000 = 5,000+10,000+2,000+7,000+15,000+6,000+1,000 =

Page 65: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

65

2.3.3 List of Qualitative Indicators by Theme

Relevance

Qualitative indicator Description Related

quantitative indicator(s)

A - Name each admin source used as an input into the statistical product and describe the primary purpose of the data collection for each source

Name all administrative sources and their providers. This information will assist users in assessing whether the statistical product is relevant for their intended use. Providing information on the primary purpose of data collection also enables users to assess whether the data are relevant to their needs.

1, 2

B – Describe the main uses of each of the admin sources and, where possible, how the data relate to the needs of users

Include all the main statistical processes and/or outputs known to require data from the administrative source This indicator should also capture how well the data support users’ needs. This information can be gathered from user satisfaction surveys and feedback.

1, 2

C – Describe the extent to which the data from the administrative source meet statistical requirements

Statistical requirements of the output should be outlined and the extent to which the administrative source meets these requirements stated. Gaps between the administrative data and statistical requirements can have an effect on the relevance to the user. Any gaps and reasons for the lack of completeness should be described, for example if certain areas of the target population are missed or if certain variables that would be useful are not collected. Any methods used to fill the gaps should be stated.

3

D – Describe constraints on the availability of administrative data at the required level of detail

Some administrative microdata have restricted availability or may only be available at aggregate level. Describe any restrictions on the level of data available and their effects on the statistical product.

3, 21

E – Describe reasons for use of admin data as a proxy

Where admin data are used in the statistical output as a proxy or are used in calculations rather than as raw data, information should be provided in terms of why the admin data have been used as a proxy for the required variables.

3

F - Identify known gaps between key user needs, in terms of coverage and detail, and current data

Data are complete when they meet user needs in terms of coverage and detail. This indicator allows users to assess, when there are gaps, how relevant the outputs are to their needs.

N/A

Page 66: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

66

Timeliness and punctuality

Qualitative indicator Description Related quantitative indicator(s)

G – Describe the timescale since the last update of data from the administrative source

An indication of the timescale since the last update from administrative sources will provide the user with an indication of whether the statistical product is timely enough to meet their needs.

4, 18

H – Describe the extent to which the administrative data are timely

Provide information on how soon after their collection the statistical institution receives the administrative data. The effects of any lack of timeliness on the statistical product should be described.

4, 18

I – Describe any lack of punctuality in the delivery of the administrative data source

Give details of the time lag between the scheduled and actual delivery dates of the data. Any reasons for the delay should be documented along with their effects on the statistical product.

4, 18

J – Frequency of production

This indicates how timely the outputs are, as the frequency of publication indicates whether the outputs are up to date with respect to users’ needs.

18

K – Describe key user needs for timeliness of data and how these needs have been addressed

This indicates how timely the data are for specified needs, and how timeliness has been secured, eg by reducing the time lag to a number of days rather than months for monthly releases.

18

Comparability

Qualitative indicator Description Related quantitative indicator(s)

L – Describe the impact of moving from a survey based output to an admin-data based output

Provide information on how the data are affected when moving from a survey based output to an admin-data based output. Comment on any reasons behind differences in the output and describe how any inconsistencies are dealt with.

8, 19

M – Describe any method(s) used to deal with discontinuity issues

Where the level of the estimate is impacted when moving from a survey based output to an admin-data based output, it may be possible to address this discontinuity. In this case a description of the method(s) used to deal with the discontinuity should be provided.

19

N - Describe the reasons behind discontinuities when moving from survey based estimates to admin-data based estimates

Where it is not possible to address a discontinuity in estimates when moving from a survey based output to an admin-data based output, the reasons behind the discontinuity should be provided along with commentary around the impact on the level of the estimate. Any changes in terms of the advantages and limitations of the differences should also be highlighted.

19

Page 67: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

67

Coherence

Qualitative indicator Description Related quantitative indicator(s)

O – Describe the common identifiers of population units in administrative data

Different administrative sources often have different population unit identifiers. The user can utilise this information to match records from two or more sources. Where there is a common identifier, matching is generally more successful.

5, 6, 20

P – Provide a statement of the nationally/internationally agreed definitions, classifications and standards used

This is an indicator of clarity, in that users are informed of concepts and classifications used in compiling the output. It also indicates geographical comparability where the agreed definitions and standards are used.

N/A

Q - Describe the width of the tolerance and the reasons for this

Where values within a particular tolerance are considered consistent for common variables across more than one source, the width of the tolerance should be stated, along with a brief explanation as to why this particular tolerance width was chosen.

20

R - Describe differences in concepts, definitions and classifications between the administrative source and the statistical output

There may be differences in concepts, definitions and classifications between the administrative source and statistical product. Concepts include the population, units, domains, variables and time reference and the definitions of these concepts may vary between the administrative data and the statistical product. Time reference problems occur when the statistical institution requires data from a certain time period but can only obtain them for another. Any effects on the statistical product need to be made clear along with any techniques used to remedy the problem.

10, 21

S - Describe any adjustments made for differences in concepts and definitions between the administrative source and the statistical output

Adjustments may be required as a result of differences in concepts and definitions between the administrative data and the requirements of the statistical product. A description of why the adjustment needed to be made and how the adjustment was made should be provided to the users so they can assess the coherence of the statistical product with other sources.

16, 21

T - Compare estimates with other estimates on the same theme

This statement advises users whether estimates from other sources on the same theme are coherent (ie they ‘tell the same story’), even where they are produced in different ways. Any known reasons for lack of coherence should be given.

N/A

Page 68: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

68

Accuracy

Qualitative indicator Description Related quantitative indicator(s)

U – Describe the record matching methods and processes used on the administrative data sources

Record matching is when different administrative records for the same unit are matched using a common, unique identifier or key variables common to both datasets. There are many different techniques for carrying out this process. A description of the technique (e.g. automatic or clerical matching) should be provided along with a description (qualitative or quantitative) of its effectiveness.

5, 6, 20

V – Describe the data processing known to be required on the administrative data source to deal with non-response

Data processing is often required to deal with non-response. The user should be made aware of how and why particular data processing methods are used.

9

W – Describe differences between responders and non-responders

This indicates to users how significant the non-response bias is likely to be. Where response is high, non-response bias is likely to be less of a problem than when there are high rates of non-response. NB: There may be instances where non-response bias is high even with very high response rates, if there are large differences between responders and non-responders.

9

X – Assess the likely impact of non-response/imputation on final estimates

Non-response error may reduce the representativeness of the data. An assessment of the likely impact of non-response/imputation on final estimates allows users to gauge how reliable the key estimates are as estimators of population values. This assessment may draw upon the indicator: ‘% of imputed values (items) in the admin data’.

9, 17

Y – Comment on the imputation method(s) in place within the statistical process

The imputation method used can determine how accurate the imputed value is. Information should be provided on why the particular method(s) was chosen and when it was last reviewed.

17

Z – Describe how the misclassification rate is determined

It is often difficult to calculate the misclassification rate.Therefore, where this is possible, a description of how the rate has been calculated should also be provided.

10

AA – Describe any issues with classification and how these issues are dealt with

Whereas a statistical institution can decide upon and adjust the classifications used in its own surveys to meet user needs, the institution usually has little or no influence over those used by administrative sources. Issues with classification and how these issues are dealt with should be described so that the user can decide whether the source meets their needs.

10

AB – Describe the extent of coverage of the administrative data and any known coverage problems

This information is useful for assessing whether the coverage is sufficient. The population that the administrative data covers should be included along with all known coverage problems. There could be overcoverage (where duplicate records are included) or undercoverage (where certain records are missed).

11, 12

Page 69: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

69

Qualitative indicator Description Related quantitative indicator(s)

AC – Describe methods used to deal with coverage issues

Updating procedures and cleaning procedures should be described. Also, information should be provided on edit checks and /or imputations that are carried out because of coverage error. This information indicates to users the resultant robustness of the administrative data, after these procedures have been carried out to improve coverage.

11, 12

AD – Assess the likely impact of coverage error on key estimates

Coverage issues could relate to undercoverage and overcoverage (or both). Special studies can sometimes be carried out to assess the impact of undercoverage and overcoverage. An assessment of the likely impact of coverage error indicates to users how reliable the key estimates are as estimators of population values.

11, 12

AE – Describe the data processing known to be required on the administrative data source to address instances where the reference period differs from the required reference period

Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

13

AF - Comment on the impact of the different versions of admin data on the results

When commenting on the size of the revisions of the different versions of the admin data, information on the impact of the revisions on the statistical product for the relevant reference period should also be explained to users.

14

AG – Flag any published data that are subject to revision and data that have already been revised

This indicator alerts users to published data that may be, or have already been, revised. This will enable users to assess whether provisional data will be fit for their purposes.

14

AH – For ad hoc revisions, detail revisions made and provide reasons

Where revisions occur on an ad hoc basis to published data, this may be because earlier estimates have been found to be inaccurate. Users should be clearly informed of the revisions made and why they occurred. Clarifying the reasons for revisions guards against any misinterpretation of why revisions have occurred, while at the same time making processes (including any errors that may have occurred) more transparent to users.

14

AI – Describe the known sources of error in administrative data

Metadata provided by the administrative source and/or information from other reliable sources can be used to assess data errors. The magnitude of any errors (where known) that have a significant impact on the administrative data should be made available to users. This will help the user to understand how accurate the administrative data are.

15

AJ – Describe the data processing known to be required on the administrative data source in terms of the types of checks carried out

Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

15

Page 70: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

70

Qualitative indicator Description Related quantitative indicator(s)

AK – Describe processing systems and quality control

This informs users of the mechanisms in place to minimise processing error by ensuring accurate data transfer and processing.

15, 16

AL – Describe the main sources of measurement error

Measurement error is the error that occurs from failing to collect the true data values from respondents. Measurement error cannot usually be calculated. However, outputs should contain a definition of measurement error and a description of the main sources of the error when the admin data holder gathers the data.

15, 16

AM – Describe processes employed by the admin data holder to reduce measurement error

Describing processes to reduce measurement error indicates to users the accuracy and reliability of the measures.

15, 16

AN – Describe the main sources of processing error

Processing error is the error that occurs when processing data. It includes errors in data capture, coding, editing and tabulation of the data. It is not usually possible to calculate processing error exactly. However, outputs should be accompanied by a definition of processing error and a description of the main sources of the error.

15, 16

AO – Describe the data processing known to be required on the administrative data source in terms of the types of edits carried out

Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

16

Accessibility and clarity

Qualitative indicator Description Related quantitative indicator(s)

AP – Reference/link to detailed revisions analyses

Where published data have been revised, users should be directed to where detailed revisions analyses are available.

14

Page 71: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

71

Cost and efficiency

Qualitative indicator Description Related quantitative indicator(s)

AQ – Describe reasons for significant overlap in admin data and survey data collection for some items

Where items are not obtained exclusively from admin data, reasons for the overlap between admin data and survey data should be described.

2

AR – Comment on the types of items that are being obtained by the admin source as well as the survey

If the same data items are collected from both the admin source and the survey, this can lead to duplication when combining the sources. This indicator should highlight to users the variables that are being collected across both sources.

7

AS – If items are purposely collected by both the admin source and the survey, describe the reason for this duplication (eg validity checks)

In some instances it may be beneficial to collect the same variables from both the admin source and the survey, such as to validate the micro level data. The reason for the double collection should be described to users.

7

AT - Describe the processes required for converting admin data to statistical data and comment on any potential issues

Data processing may be required to convert the admin data to statistical data. This indicator should provide the user with information on the types of processes and techniques used, along with a description of any issues that are dealt with, or perhaps still remain.

22, 23

Page 72: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

72

2.4 References Daas, P.J.H., Ossen, S.J.L. & Tennekes, M. (2010). Determination of administrative data quality: recent results and new developments. Paper and presentation for the European Conference on Quality in Official Statistics 2010. Helsinki, Finland. Eurostat, (2003). Item 6: Quality assessment of administrative data for statistical purposes. Luxembourg, Working group on assessment of quality in statistics, Eurostat. Frost, J.M., Green, S., Pereira, H., Rodrigues, S., Chumbau, A. & Mendes, J. (2010). Development of quality indicators for business statistics involving administrative data. Paper presented at the Q2010 European Conference on Quality in Official Statistics. Helsinki, Finland. Ossen, S.J.L., Daas, P.J.H. & Tennekes, M. (2011). Overall Assessment of the Quality of Administrative Data Sources. Paper accompanying the poster at the 58th Session of the International Statistical Institute. Dublin, Ireland.

European Commission, Eurostat, (2007). Handbook on Data Quality Assessment Methods and Tools.

Page 73: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

73

2.5 Appendix A: Notation for quality indicators

Administrative datasets are often ‘progressive’ – data for a given reference period can differ when measured at different time-points. This can present challenges when specifying and implementing quality indicators. This appendix outlines notation which may be helpful in specifying these kinds of problems and presents some possible solutions. For a more extensive treatment of the concept and a prediction framework for progressive data, see Zhang (2013). Notation to help deal with progressive nature of admin data It is important to be able to distinguish reference period – the time-point of interest - from measurement periods – the time-points at which we measure the time-point of interest. The following notation is suggested:

U(a ; b | c) – the population at time-period ‘a’ measured at time-period ‘b’ according to data source ‘c’

yi(a ; b | c) – value of interest for unit ‘i’ in U(a; b | c) So, for example, U(t ; t+α | Fiscal Register) refers to the population according to the ‘Fiscal Register’ admin source for time-point ‘t’ measured ‘α’ periods after ‘t’. A characteristic of many admin datasets is that the value for a given reference period depends on the measurement period: this can be referred to as progressiveness. This means that, for a lag ‘α’, both the number of units in U(t ; t+ α | c) and their total of any variable of interest will keep evolving over time, until α =∞ in principle. This characteristic is often true of business registers as well as admin datasets, particularly when business registers are maintained using admin sources. Implication for the implementation of the quality indicators When calculating quality indicators, results from an early version of the admin data may produce very different results from a later version. The decision as to which version of an admin dataset to use is therefore important and should be documented when the quality indicators are reported. The notation above may be useful in making and reporting this decision. Several quality indicators call for comparison with the business register. In this case, the choice of which version of the business register to use is equally important. Choice of datasets The version of the admin data used in the estimation is usually the best one to use. Frequently, this will be a dataset for the correct reference period. Where the reference period of the admin data differs from the statistical reference period – for example, where employment statistics for February use admin data with a reference period of January - it may be informative to calculate an alternative set of indicators using the admin data with the correct reference period. In our example, the ‘correct’ reference period would be February.

Page 74: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

74

This can help identify the quality impact of using admin data with an incorrect reference period. In terms of the choice of business register, it may be preferable to use the most up-to-date version of the business register with the correct reference period. However, it may happen that the business register is updated using the admin source under evaluation. In such cases, it may be preferable to use an earlier vintage of the business register, before this updating has taken place, but retaining the reference period of interest. It should be noted that this choice may be limited by practical constraints regarding what versions of the business register are stored. Concluding Remarks In general, it is important to consider the impact of the progressiveness of both admin data and business registers and to record which versions are used in the calculation of the quality indicators. The notation set out above may be helpful when doing so. Reference Zhang, L-C. (2013). Towards VAT register-based monthly turnover statistics. Development report available on request.

Page 75: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

75

Chapter 3: Guidance on calculating composite quality indicators for outputs based on administrative data

3.1 Introduction As set out in Chapter 2, Work Package 6 (WP6) of the ESSnet Admin Data has already developed a list of 23 basic quality indicators9. Whilst these indicators are useful in assessing the quality of outputs using admin data, it would be helpful for users to be able to see this information in summarised form. This chapter describes work to investigate methods for developing composite quality indicators to provide a more general overview of quality. Extensive consideration was given to developing one global composite indicator. However, the nature of ‘quality’ is complex – hence the six ESS dimensions of quality and the other considerations included in the European Handbook for Quality Reporting10. In fact, one of the key issues for producers and users of statistics is the trade-offs that are made between quality dimensions (eg timeliness vs. accuracy). It is recognised and accepted that statistical outputs cannot be ‘high’ in all dimensions; there have to be trade-offs between them. Any single, global composite indicator would mask these trade-offs and thus would be less meaningful and less useful to users and producers. Consequently, it was decided that WP6 would focus on developing separate composite indicators for a range of quality ‘themes’, based on the dimensions of the ESS quality framework.

The aim of a composite quality indicator is to provide useful, summarised information to users on the quality of a particular output. To be effective, it is important for the composite indicators developed to reflect user requirements. For this reason, it will be necessary for any specific parameters needed in calculating a composite indicator to be set based on the needs of an output, rather than fixing them as standard across all outputs and statistical organisations. It is important to note that the aim of creating a composite quality indicator is to assist users and not to accommodate any comparison between organisations. The composite indicators are however also useful for producers of statistics, enabling them to make comparisons over time to understand and improve the quality of the statistical outputs being produced. The first step in this work was to group the basic indicators into quality dimensions. In doing this, it was discovered that some of the indicators do not fit readily in the ESS quality dimensions and so extra quality themes (based on the characteristics described in the European Handbook for Quality Reporting) were identified to cover all indicators. Following this, appropriate methods to calculate composite indicators for each of those themes were considered with reference to the literature. A general approach has been chosen and developed along with practical examples of its use for each relevant quality theme. These steps are described in the remainder of this chapter.

9 The list of basic quantitative quality indicators has been tested across the European Statistical System and

feedback has confirmed that they are relevant and useful to producers of statistics. Although a number are

quality indicators relating directly to ESS dimensions of quality, some of the indicators provide background

information or relate more to other characteristics of quality (eg the latter ones listed in the European Handbook

for Quality Reports, such as performance, cost and respondent burden). However, given the feedback from

members of the ESS the decision was taken to include all of them in the current list. It should be noted that not

all indicators will be relevant in all contexts and some are more specific (one-off indicators), eg relating to the

situation when an NSI starts to use admin data to produce an output and no longer solely uses survey data. 10 See http://epp.eurostat.ec.europa.eu/portal/page/portal/ver-1/quality/documents/EHQR_FINAL.pdf

Page 76: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

76

3.2 Grouping indicators into quality themes The basic quality indicators developed by WP6 have been grouped into quality themes, based on the ESS quality dimensions and two extra groupings. The full list of grouped indicators can be found in Appendix A (at the end of this chapter). The basic quality indicators fit into the following quality themes:

Accuracy

Timeliness and punctuality

Comparability

Coherence

Cost and efficiency

Use of administrative data The scope of WP6 is to develop quality indicators specific to the use of admin data. Thus, indicators and dimensions / themes that are relevant to all statistics have not been included. As a consequence, there are two ESS quality dimensions which are not covered by the basic quantitative indicators: Accessibility and Clarity, and Relevance. This is because quality with regard to these dimensions is not normally impacted by whether the outputs are compiled using admin or survey data. Although there are caveats to this (eg different concepts – potentially less relevant to users – may be used due to the concepts available in the admin data), the ways in which admin and survey data differ in terms of these two dimensions are not quantifiable. Given the focus of this work on developing quantitative composite quality indicators, these dimensions have therefore not been considered further in this chapter11. The indicators that fit in the themes Use of admin data, and Cost and efficiency are mostly background information and all present information that is more easily understood separately. Therefore, it is not useful to develop composite indicators for these themes12. Composite indicators will be developed for Accuracy, Timeliness and Punctuality, Comparability, and Coherence. 3.3 Methods for calculating composite indicators Appendix B (at the end of this chapter) contains a review of existing literature on calculating composite indicators. Two main approaches can be identified. The first approach is to normalise the component indicators in some way and aggregate them using a weighted or unweighted average. The second approach is to model the data in some way to assess the impact of each component indicator on quality. The second approach includes methods such as Principal Component Analysis (PCA), factor analysis and structural equation models. Whilst this second approach is attractive from a theoretical point of view, it is often difficult to implement successfully in practice. For example, Brancato and Simeoni (2008) developed one structural equation model with

11 Establishing and addressing differences between administrative data definitions and statistical definitions is

very important however and thus work on this is being undertaken as part of WP3 (Methods of Estimation for

Variables) and WP7 (Statistics and Accounting Standards) of the ESSnet Admin Data. Further information on

both of these WPs is available here: http://essnet.admindata.eu/

12 The indicators included in these themes are listed in Appendix A (at the end of this chapter) and are described

in detail in Chapter 2.

Page 77: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

77

reasonable results, but noted that the model was unable to properly represent the Accuracy dimension. Even assuming a successful model can be identified for a specific data set, it is unlikely that this model will be suitable for other data or in other organisations. Furthermore, it is likely the model will need to be continually re-specified to remain useful. For these reasons, it is considered that the simpler, first approach is more suitable for developing a generic method to calculate composite indicators for outputs based on admin data. For further information, see Appendix B. Note that much of the literature on this topic concentrates on indicators which compare performance across countries or regions. Methods specific to this context are not covered in the literature review as this is not the purpose of WP6. 3.4 Development of composite quality indicators 3.4.1 Normalisation of basic quality indicators The basic quantitative quality indicators measure a range of different quality concepts. Where possible, the indicators have deliberately been expressed so that lower quality is reflected by a higher value (since most of the indicators measure errors and so are naturally in this direction). This removes one possible inconsistency, but it remains the case that the various indicators are on different scales. Superficially, it is apparent that many of the indicators are percentages. However, even the indicators that are expressed as percentages are not necessarily directly comparable. For example, a non-response rate of 20% is not of equivalent quality to 20% overcoverage. A range of options were investigated for normalising the basic indicators, so that they are on the same scale and can be combined more easily. This work was a collaboration between Portugal and the UK. Many of the methods discussed in the literature (see, for example, Nardo et al (2008)) relate only to the situation where indicators are being compared across geographies and so are not appropriate for this purpose. There are two main methods that could be more generally applicable:

Standardisation - converting the indicators to a common scale by subtracting a mean value for the indicator and dividing by a standard deviation.

Min-Max - converting the indicators to a common scale by subtracting a minimum value and dividing by the difference between a maximum and minimum value for the indicator.

It is difficult to implement either of these methods to normalise the basic quality indicators, since it is not immediately obvious how to calculate mean, minimum, maximum or standard deviation. It may be possible to compare indicator values over time, but this will not always be practicable. It is therefore necessary to adapt the concept of normalisation. The following formula adapts a typical standardisation method so that it can be applied to quality indicators.

Indicator value - Reference valueStandardised value =

Maximum - Minimum

The reference value in this formula is intended to denote the point at which the value of the quality indicator changes from being acceptable to unacceptable. For example, if a non-response rate of 20% is acceptable, but anything larger is unacceptable then the reference value for non-response rate would be 20%. This means that when the value of the basic

Page 78: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

78

indicator exceeds the reference value (which denotes unacceptable quality) the standardised value will be positive. Negative standardised values indicate that the quality is acceptable. It should be noted that this method will only work if it is possible to meaningfully define the reference values. The maximum and minimum values in the formula are used to transform the different indicators onto the same scale. Many of the basic indicators are defined as percentages and have physical maximum and minimum values of 100% and 0% respectively. However, dividing all percentage indicators by 100 ignores the fact that a particular percentage value does not have the same quality implication for all indicators. To properly standardise the indicators, we need to divide by a quantity that reflects the range of likely values for the indicator. In order to properly standardise the basic quality indicators, it is necessary to define reference, minimum and maximum values. The next section explores options for setting the reference values. Minimum and maximum values are considered afterwards. 3.4.2 Setting reference values The reference value denotes the point at which the value of a particular basic indicator changes from unacceptable to acceptable quality. For some indicators, it may be possible to make an educated guess at where this happens from a theoretical point of view. However, it is important to remember that acceptable quality for an output is driven by the uses of the output and the quality requirements of those uses. Reference values should therefore ideally be developed in consultation with users. In some cases, survey managers might already have a good idea of user needs and be able to set suitable reference values for the indicators. Once set, reference values should be kept constant unless there is a genuine change in user needs. Reference values should never be altered to mask any deterioration in the quality of outputs. Even with appropriate input from users and survey managers, the setting of reference values is likely to be subjective to some degree. It is important to consider how sensitive the final composite indicator is to the reference value. One way to do this is to calculate the composite indicator using a range of different reference values and examine the effect. Figure 1 shows values of an example composite indicator for the Accuracy dimension based on standardising each of the basic quality indicators relating to that dimension and calculating the mean of those values. Minimum and maximum values were set based on the likely range of acceptable values for the indicators. The point 0 on the x-axis denotes the value of the composite indicator for the best estimate of the reference values. The other points on the line show what happens to the value of the composite indicator when the reference values are decreased or increased by up to 100%. The line is straight because the reference values were changed uniformly across the different basic indicators.

Page 79: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

79

Figure 1: Sensitivity of reference values for an example composite indicator

In this example, the graph shows us that the composite indicator continues to be positive (implying unacceptable quality) until we increase the reference values by around 50%. If we believe that the reference values genuinely denote the point at which the quality changes from unacceptable to acceptable within a tolerance of 50% then we can be confident in saying that the accuracy of this output is unacceptable. If it is not possible to define the reference values that precisely, then we would have to conclude that it is not possible to make a definitive statement about the quality of the output. 3.4.3 Setting minimum and maximum values We can refine the standardisation of basic indicators by thinking about the lowest and highest values we would realistically expect for the reference value (the point at which quality changes from being acceptable to unacceptable). The likely range of values may be easier to define than the reference value itself. For example we might be confident that the true reference value for non-response rate is somewhere between 10% and 40%, but only be able to make an educated guess at where in that range it lies. The lowest and highest values of the reference value will differ between the basic indicators and give an indication of the expected spread of those indicators. For this reason, this range of values can be used in place of an educated guess for the minimum and maximum indicator values in the denominator of the normalisation formula. This allows us to take account of the fact that percentage values have different quality implications for different basic indicators. We can plot the composite indicators that result from using the minimum and maximum reference values along with the best estimates of those reference values (denoted as “Ref” in the graphs below) to better understand the meaningfulness of the composite indicator.

Page 80: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

80

Figure 2 shows an example composite indicator derived from the basic indicators in the Accuracy dimension, using best estimate, minimum and maximum reference values (that is, calculating the composite indicator using each of these three sets of reference values, and joining those points together). In each case, the normalised indicators were combined using a simple mean. Figure 2: Example composite indicator using minimum, mean and maximum reference values

In this example, the composite indicator has negative values for the whole range of likely reference values. We can therefore confidently say that the output has an acceptable level of accuracy. 3.4.4 Use of weighted and unweighted versions of the indicators For most of the basic quality indicators, it is possible to calculate weighted and unweighted versions. The weights are used to give a more direct idea of the impact of quality on the statistical output. For example, an unweighted non-response rate indicates the number of businesses that have missing values. A weighted non-response rate, using register Turnover as the weight, indicates the proportion of Turnover that is missing. If Turnover is related to the variables in the statistical output, this provides an indication of the impact of non-response on those statistics. As part of a Principal Component Analysis of quality indicators, Smith and Weir (2000) found that weighted indicators contain different information to unweighted indicators. It is therefore necessary to decide whether it is more useful to include the weighted or unweighted version of each indicator, or both.

Page 81: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

81

3.4.5 Combining and weighting the indicators The final step in calculating a composite indicator is to decide how to combine the basic indicators. This is related to the choice of which indicators should be included. The simplest option is to take a simple mean of all of the indicators – either the unweighted or weighted versions, or both. However, it may be necessary to use a weighted mean to produce a more meaningful composite indicator. The weights used when combining indicators have the purpose of allowing different indicators to have differing degrees of impact on the composite indicator. Higher weights should be given to any indicators that are more important to the quality needs of the outputs. This could include giving higher weights to weighted versions of the indicators, for example, if they are more important than the unweighted versions (for some or all of the basic indicators). Weighting can also be used to ensure that each aspect of the quality dimension gets equal consideration in the composite indicator, since it may be the case that some of the basic indicators are related to each other. The choice of appropriate weights needs to be handled carefully. Cecconi et al (2004) prefer using an unweighted average, since it removes the necessity to make a judgement on weights. However, Nardo et al (2008) suggest a practical method to develop suitable weights by asking relevant experts to allocate a budget of 100 points to the set of indicators and derive weights by taking the average of those allocations. Figure 3 shows an example of the range of values for a composite indicator (using best estimate, minimum and maximum reference values) for a particular output, again using basic indicators from the Accuracy dimension. Three versions of the composite indicator are plotted; one using unweighted versions of the indicators (“unwt”), one using weighted versions of the indicators (“wt”), and one using both unweighted and weighted versions (“comb”). In each case, the indicators are combined using a weighted mean, with higher weights given to indicators which are considered to be more important. The “comb” indicator is an unweighted mean of the “unwt” and “wt” indicators. It would be possible to use a weighted mean if either of the unweighted or weighted indicators were considered more important.

Page 82: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

82

Figure 3: Example composite indicators using three options for the versions of indicators included

For these examples, the lines cross from being positive to negative fairly near to the best estimate (“Ref”) values. This suggests that the realised values of the indicators are too close to the reference values to be able to make definitive conclusions about the quality of the output. Note that, in this example, the composite indicator using weighted versions of the indicators is the most clearly positive. We would be slightly more confident concluding that the quality is unacceptable if the weighted indicators were more important to users. However, for any of these examples, quality statements should be presented very carefully, making note of the uncertainty in the composite indicator. It is recommended to avoid using composite indicators when the result is ambiguous and to concentrate on the constituent basic indicators instead. The final weighting of the individual indicators should be decided based on the importance of different aspects of quality to the users of the output. In the same way, the final choice on whether to use unweighted indicators, weighted indicators or some combination of both should be addressed with reference to users. 3.4.6 Conclusions It is possible to derive a composite quality indicator for a quality theme by standardising the values of the basic indicators and combining them using a mean (weighted or unweighted). Composite indicators should be calculated separately for each important output. The standardisation relies on defining a reference value for each indicator, the point at which quality becomes unacceptable. It is important to consider the quality needs of the output for users in defining these reference values. If the resulting composite indicator is positive, that implies that the level of quality for that dimension is unacceptable for the output. Negative values imply acceptable quality.

Page 83: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

83

The sensitivity of the composite indicator can be tested by defining minimum and maximum values for the reference values and plotting the range of resulting composite indicators between these extremes. If the values are either all positive or all negative, the outcome of the composite indicator will be meaningful. If part of the range of values is positive and part negative then it may not be possible to comment on the quality with complete confidence and care should be taken. In some cases, the only reliable outcome may be to publish the individual basic quality indicators separately (or those that are considered to be of greatest importance to users). Note that the plots are intended to assist producers in deciding whether it is meaningful to denote an aspect of quality as being acceptable or unacceptable for an output. When it is meaningful, the published composite indicator should simply state that the Accuracy, for example, is of an acceptable level based on a range of indicators. The plots themselves are not intended to accompany published outputs. When combining the standardised indicators, weighting can be used to give the correct emphasis to the indicators, based on user needs for quality and ensuring that no aspects of quality are given disproportionate emphasis in the composite indicator. For some outputs, it may be decided that only a subset of the available basic indicators are relevant to reflect the quality. For other outputs, it may not be possible to calculate all of the basic indicators. It is not always necessary to use all of the basic indicators when compiling a composite indicator – those indicators that are both relevant and available should be used. When multiple admin sources are used to produce a statistical output, it will be necessary to calculate the basic quality indicators taking this into account. The important consideration is the effect on the output. For example, when calculating the misclassification rate, it is necessary to consider the effects of misclassifications from all sources on the output. A simple way to address this would be to take the average misclassification rate from each of the admin sources, but there is also scope to add more weight to misclassifications from those sources that have a larger impact on the output. As long as the basic indicators properly reflect multiple admin sources in this way, it will be straightforward to use the method described for creating composite indicators. The analysis above shows that it can be difficult to derive a meaningful composite indicator even when the only gradation is between acceptable and unacceptable quality. It is therefore not recommended to define composite indicators that attempt to grade the quality in any more detail than this. For example, trying to distinguish between acceptable and good quality will add further complications and is likely to lead to spurious results. The following sections of this chapter consider how composite indicators can be developed in practice for the different quality themes: Accuracy, Timeliness and Punctuality, Comparability, and Coherence. 3.5 Developing a composite indicator for Accuracy 3.5.1 Choice of indicators The first step in creating a composite indicator is to decide which of the basic quality indicators are useful or important for the particular quality theme. Table 1 lists the nine basic quality indicators that relate to Accuracy.

Page 84: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

84

Table 1: List of basic quality indicators in the Accuracy dimension

Indicator

9 Item non-response (% of units with missing values for key variables)

10 Misclassification rate

11 Undercoverage

12 Overcoverage

13 % of units in the admin source for which reference period differs from the required reference period

14 Size of revisions from the different versions of admin data - RMAR (Relative Mean Absolute Revisions)

15 % of units in admin data which fail checks

16 % of units for which data have been adjusted

17 % of imputed values (items) in the admin data

It is important to consider whether all of these indicators are needed for the Accuracy composite indicator and also whether there are any important concepts of Accuracy missing. The formula descriptions in the list of basic indicators note that it is possible to weight eight of these indicators (“Size of revisions” is the only one for which this would not make sense). Therefore, we also need to consider whether weighted, unweighted or both versions of the indicators should be used in the composite indicator. Some of the Accuracy indicators are related to each other: “% of imputed values (items) in the admin data” is directly related to “% of units for which data have been adjusted” and “Item non-response”, since adjusting suspect data and dealing with non-response in admin data are both commonly done using imputation. The indicator “% of units in admin data which fail checks” is also related to “% of units for which data have been adjusted”, since the data adjustments will generally be a consequence of failing checks. Including all four indicators in the composite indicator with the same weighting as the others will give disproportionate emphasis to this aspect of accuracy. Therefore, it will probably be necessary to combine the normalised indicators using a weighted mean to produce a meaningful composite indicator. 3.5.2 Example construction of composite indicator for Accuracy Table 2 contains examples of values for unweighted and weighted (where appropriate) values for each of the basic indicators belonging to the Accuracy dimension. The figures are illustrative only, but based on values that could typically be expected, for example when estimating annual Turnover using VAT data. Note that the values for indicators 15 (“% of units in admin data which fail checks”) and 16 (“% of units for which data have been adjusted”) are identical. This reflects the fact that in many statistical offices it is not possible to re-contact businesses to confirm suspicious values, so that the natural action for businesses which fail checks is to automatically adjust them. However, there are other options for dealing with businesses that fail checks, so it will not always be the case that these values are the same. Table 2 also contains reference values (best estimate (Ref), minimum and maximum) for each of the indicators. For the purposes of the example, a set of values have been chosen for purely illustrative purposes. These example reference values should not be used in practice. Note that for simplicity we have used the same reference values for the unweighted and weighted versions of the basic indicators. It is entirely possible to use different reference values for the two types of indicator and indeed this will often produce a better composite

Page 85: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

85

indicator. Reference values should always be set based on consultation with survey managers and users. Table 2: Example values for basic indicators in the Accuracy dimension

Indicator

Indicator value (%) Reference value (%)

Unweighted Weighted Min Ref Max

9 Item non-response 15 12 20 25 30

10 Misclassification rate 5 8 2.5 5 15

11 Undercoverage 10 15 20 25 30

12 Overcoverage 5 10 20 25 30

13 % of units with different reference period

20 7 20 30 40

14 Size of revisions 1 n/a 0.5 2 5

15 % units failing checks 7 11 2.5 5 10

16 % units with data adjusted 7 11 2.5 5 10

17 % imputed values (items) 22 23 22.5 30 40

Figure 4 shows the values of composite indicators for the range of minimum to maximum reference values using the three choices “unwt”, “wt” and “comb”. The indicator values are combined using a simple mean. Because there is no weighted version of indicator 14 (“Size of revisions”), the unweighted value is used when compiling the “wt” indicator and is used twice in the “comb” indicator. Figure 4: Example composite indicators for the Accuracy dimension

As previously mentioned, some of the basic indicators in the Accuracy dimension are related to each other. To produce a more representative composite indicator, it makes sense to reduce the weights of these indicators. Figure 5 shows composite indicators derived from the

Page 86: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

86

same data, but using a weighted mean where indicators 9, 15, 16 and 17 are given half the weight of the other indicators. That is, composite indicators are calculated as:

9 10 11 12 13 14 15 16 17

9 17

(0.5 I ) I I I I I (0.5 I ) (0.5 I ) (0.5 I )Composite indicator =

7

where I to I are the normalised values of indicators 9 to 17 respectively.

Figure 5: Example composite indicators for the Accuracy dimension, with weighting to reduce impact of related indicators

Weighting can also be used to give a more useful composite indicator if there are some of the basic indicators that are of more importance to users. Figure 6 shows composite indicators using a weighting where “Overcoverage” is given little importance (since it can be dealt with easily if it is identified), “Size of revisions” and “% units failing checks” are given higher importance. “% units with data adjusted” is excluded from the composite indicator (or, equivalently, given zero weight), since the same information is contained in “% units failing checks”. The composite indicator is calculated as:

9 10 11 12 13 14 15 17I I I (0.1 I ) I (2 I ) (2 I ) IComposite indicator =

9.1

Page 87: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

87

Figure 6: Example composite indicators for the Accuracy dimension, with weighting to reflect importance of basic indicators to users

Figures 4 to 6 show that the choice of weights can affect the values of composite indicators. In this example, the outcome changes from being clearly acceptable quality (figures 4 and 5) to having some doubt for the unweighted and combined versions of the composite indicator (figure 6). More extreme cases are of course possible. These illustrations show how it is possible to construct a composite indicator for Accuracy based on the proposed method. The final choice on weighting and choice of indicators should be made based on consultation with users.

Page 88: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

88

3.6 Developing a composite indicator for Timeliness and Punctuality Table 3 lists the basic quality indicators relating to Timeliness and Punctuality. Table 3: List of basic quality indicators in the Timeliness and Punctuality dimension

Number Indicator

4 Periodicity (frequency of arrival of the admin data)

18 Delay to accessing / receiving data from admin source

The ultimate decision on whether an output is timely and punctual is when it is published, compared to when the output was due to be published. Bearing this in mind, it is relatively straightforward to set reference values for these two indicators. “Periodicity” measures the frequency of arrival of admin data and a natural reference value would therefore be the frequency required by the statistical output. “Delay to accessing / receiving data from admin source” is calculated as:

Time from the end of reference period to receiving Admin data×100%

Time from the end of reference period to publication date

Quality in relation to this indicator is clearly unacceptable when data are received too late to be able to publish the output to schedule. The change from acceptable to unacceptable quality therefore happens when the time from the end of the reference period to receiving Admin data (including time to process the data within the NSI) is the same as the time from the end of the reference period to publication date. In the formula above, this implies a reference value of 100%. For both of these indicators, it is more difficult to define minimum and maximum values. Because the reference value is so clear cut, it is not meaningful to create upper and lower bounds for its value. One plausible option would be to put the minimum and maximum equal to the reference values described above. This would result in a denominator of zero in the normalisation formula, which illustrates the difficulty. However, using these indicators it is possible to create a simpler composite indicator describing whether the Timeliness and Punctuality is acceptable or not. If the data do not arrive with the desired frequency or on time to be used in the output, then the consequences for the output are serious. For either of these indicators, a failure to meet the minimum requirement would result in an output of unacceptable quality. A composite indicator for Timeliness and Punctuality can therefore be calculated by comparing each of the basic indicators to their reference value. If either of the indicators have unacceptable quality, then the composite indicator should state that the output has unacceptable Timeliness and Punctuality. If both indicators are acceptable, then the output can be said to have acceptable Timeliness and Punctuality. Table 4 contains example basic indicator values and accompanying reference values for Timeliness and Punctuality. The example is fictitious, but based on the concept of using quarterly admin data to estimate a quarterly output.

Page 89: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

89

Table 4: Example values for basic indicators in the Timeliness and Punctuality dimension

Indicator Indicator value Reference value

4 Periodicity 4 times a year 4 times a year

18 Delay to accessing data 106.7% (delay of 32 days)

100% (delay of 30 days)

In this example, we would conclude that the output is of unacceptable Timeliness and Punctuality, since the admin data are not available on time. Note that if a method was developed to estimate the output using forecast data from the previous quarter, it might be possible to reduce the indicator value to be less than 100%. This would allow us to create an output of acceptable Timeliness and Punctuality but may raise complications in terms of the other dimensions (e.g. accuracy) because of the estimate being based on the model, not the raw data. 3.7 Developing a composite indicator for Comparability Table 5 lists the basic quality indicators relating to Comparability. Table 5: List of basic quality indicators in the Comparability dimension

Number Indicator

19 Discontinuity in estimate when moving from a survey-based output to an output involving admin data

There is only one basic indicator in the Comparability dimension, so it is not necessary to calculate a composite indicator to gain an overall measure of the Comparability. However, it could still be useful to normalise the indicator, by comparison with a reference value, to determine whether the quality is acceptable or unacceptable. Table 6 contains example indicator and reference values for the Comparability dimension. Table 6: Example values for basic indicator in the Comparability dimension

Indicator Indicator value (%)

Reference value (%)

Min Ref Max

19 Discontinuity 0.7 0.5 1.0 2.0

Figure 7 displays the resulting “composite” indicator for the range of reference values. Since there is only one indicator in the Comparability dimension and it is an indicator which already takes account of survey weights, there is only one version of the indicator to plot.

Page 90: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

90

Figure 7: Example composite indicator for the Comparability dimension

In this example, it is not quite clear that the output is of acceptable quality with respect to Comparability. Since there is only one basic indicator, it is possible to make this deduction directly from Table 6. 3.8 Developing a composite indicator for Coherence Table 7 lists the basic quality indicators relating to Coherence. Table 7: List of basic quality indicators in the Coherence dimension

Number Indicator

5 % of common units across two or more admin sources

6 % of common units when combining admin and survey data

20 % of consistent items for common variables in more than one source

21 % of relevant units in admin data which have to be adjusted to create statistical units

Indicators 5 (“% of common units across two or more admin sources”) and 6 (“% of common units when combining admin and survey data”) both give useful background information, but neither directly measures the Coherence. However, there are particular situations where a higher proportion of common units across different sources would lead to higher quality; for example, where the multiple sources are used to validate data. When setting reference values for these indicators, this context should be taken into account. For many outputs, the quality may not be directly impacted by the values of indicators 5 and 6. In these cases, it would be sensible to not include those indicators when compiling the composite indicator (or equivalently to give them a weight of zero).

Page 91: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

91

Note that for indicators 5, 6 and 20 a higher indicator value implies higher quality. When creating composite indicators, we are assuming that higher basic indicator values imply lower quality. It is possible to deal with this by careful choice of the minimum and maximum reference values. The minimum reference values should be larger than the maximum reference values, to reflect the fact that a higher reference value is a tighter restriction. This will result in a negative value for the denominator of the normalised indicator (since the maximum value minus the minimum value will be negative). The negative denominator will have the effect of converting the normalised indicator to the correct scale. For example if the basic indicator value for “% of consistent items for common variables in more than one source” is 50% and the (best estimate) reference value is 60% then the indicator value minus the reference value is -10%. The negative value would imply acceptable quality, despite the fact that the indicator is below the reference value and a higher proportion of consistent values would be expected to give higher quality. Dividing this -10% by a negative denominator has the effect of converting this -10% into a positive normalised value, to reflect the unacceptable quality. By switching the direction of the minimum and maximum reference values, we can appropriately normalise indicators for which a higher value implies higher quality. Table 8 contains example indicator and reference values for the basic indicators in the Coherence dimension. Table 8: Example values for basic indicators in the Coherence dimension

Indicator

Indicator value (%) Reference value (%)

Unweighted Weighted Min Ref Max

5 Common units across sources

48 60 60 50 30

6 Common units combining admin and survey data

71 92 85 80 60

20 Consistent items 50 75 80 70 60

21 Units needing adjusting 32 8 5 10 20

Figures 8 and 9 show resulting composite indicators, both combine the normalised indicators using a simple mean. The composite indicators in Figure 8 use all four basic indicators, whereas those in Figure 9 only use indicators 20 and 21.

Page 92: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

92

Figure 8: Example composite indicator for the Comparability dimension, using all basic indicators

Page 93: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

93

Figure 9: Example composite indicator for the Comparability dimension, using only indicators 20 and 21

There is a large difference between using the unweighted and weighted versions of the indicators when constructing these composite indicators. Using the weighted versions, the Coherence is of acceptable quality for most of the range of reference values. However, using the unweighted versions, the Coherence is clearly of unacceptable quality. As with the other composite indicators, the choice of which versions to use depends on the needs of the users and producers of the data. These graphs demonstrate how important it is to get that choice right. In this example, there is relatively little difference whether including or excluding the background information indicators, 5 and 6. This will not be the case for all outputs.

Page 94: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

94

3.9 Conclusion This chapter has considered methods for calculating composite quality indicators for outputs based on admin data. Whilst various methods are discussed in the literature, none of them produce easily interpretable results that are relevant for this purpose. Therefore, a simple method has been developed and described to create composite quality indicators for four separate quality dimensions: Accuracy, Timeliness and Punctuality, Comparability, and Coherence. The other two quality themes covered in the list of basic quality indicators for outputs based on admin data are more related to background information and would not benefit from being summarised in composite indicators. It has been decided not to attempt to produce a single composite indicator covering all aspects of quality. Whilst this is mathematically possible, there is significant doubt that such an indicator would be meaningful and it would certainly mask the important issue of trade-offs between quality dimensions, a crucial consideration for users and producers as set out in section 3.1. It is important to note that the composite indicators described in this report are intended to assist users in understanding whether the quality attributes of particular outputs are acceptable or unacceptable. The composite indicators have not been developed with the purpose of allowing comparison between countries or outputs and are not designed to enable such comparisons. Rather, they are useful for producers of statistics, enabling them to make comparisons over time to understand and improve the quality of the statistical outputs being produced. As a result, composite indicators should be calculated for each important output. This chapter gives details of the recommended method for calculating composite quality indicators and examples for each of the four quality dimensions covered. The method involves selecting the basic quality indicators relating to the dimension that are both relevant and can be calculated for the output. These indicators need to be standardised so that they all have the same meaning for the quality of the output. The key part of standardisation is to compare the achieved value of the basic indicators with a reference value, which denotes the minimum quality requirement. Following standardisation, the indicators are combined using weighted or unweighted means. If the resulting indicator is positive, it can be concluded that the output is of insufficient quality. A negative composite indicator value implies acceptable quality. Composite indicators should be calculated for a range of plausible reference values to ensure that the conclusions are valid. In every case, the setting of parameters for the composite indicators should be based on user requirements for the quality of the particular output. If it is not possible to set meaningful reference values for an output, it is not recommended to pursue the calculation of composite indicators.

Page 95: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

95

3.10 References Brancato G. and Simeoni G. Modelling Survey Quality by Structural Equation Models. Proceedings of Q2008 European Conference on Quality in Survey Statistics, Rome, July 2008: Web. Cecconi C., Polidoro F. and Ricci R. Indicators to define a territorial quality profile for the Italian consumer price survey. Proceedings of Q2004 European Conference on Quality in Survey Statistics, Mainz, May 2004: CD-ROM. Munda G. and Nardo M. Weighting and Aggregation for Composite Indictors: A Non-compensatory Approach. Proceedings of Q2006 European Conference on Quality in Survey Statistics, Cardiff, 2006: Web. Nardo M., Saisana M., Saltelli A., Tarantola S., Hoffman A. and Giovannini E. Handbook on constructing composite indicators: methodology and user guide, OECD (2008): Web. Smith P. and Weir P. Characterisation of quality in sample surveys using principal components analysis. Proceedings of UNECE Work session on Statistical Data Editing, Cardiff, October 2000: Web.

Page 96: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

96

3.11 Appendix A: Grouping of basic quality indicators into quality themes The tables below list the basic quality indicators matched to each of the quality themes. For reference, the tables include the indicator numbers as shown in the WP6 list of quality indicators13. Accuracy

Number Indicator

9 Item non-response (% of units with missing values for key variables)

10 Misclassification rate

11 Undercoverage

12 Overcoverage

13 % of units in the admin source for which reference period differs from the required reference period

14 Size of revisions from the different versions of admin data – RMAR (Relative Mean Absolute Revisions)

15 % of units in admin data which fail checks

16 % of units for which data have been adjusted

17 % of imputed values (items) in the admin data

Timeliness and punctuality

Number Indicator

4 Periodicity (frequency of arrival of the admin data)

18 Delay to accessing / receiving data from admin source

Comparability

Number Indicator

19 Discontinuity in estimate when moving from a survey-based output to an output involving admin data

13 See: http://essnet.admindata.eu/WikiEntity?objectId=5452

Page 97: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

97

Coherence

Number Indicator

5 % of common units across two or more admin sources

6 % of common units when combining admin and survey data

20 % of consistent items for common variables in more than one source

21 % of relevant units in admin data which have to be adjusted to create statistical units

Cost and efficiency

Number Indicator

7 % of items obtained from admin source and also collected by survey

8 % reduction of sample size when moving from survey to admin data

22 Cost of converting admin data to statistical data

23 Efficiency gain in using admin data

Use of administrative data

Number Indicator

1 Number of admin sources used

2 % of items obtained exclusively from admin data

3 % of required variables derived from admin data that are used as a proxy

Page 98: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

98

3.12 Appendix B: Literature review on methods for developing composite indicators Prepared by Carys Davies, UK. Brancato G. and Simeoni G. “Modelling Survey Quality by Structural Equation Models”. Proceedings of Q2008 European Conference on Quality in Survey Statistics, Rome, July 2008: Web http://q2008.istat.it/sessions/paper/09Brancato.pdf This paper investigates the capacity of standard quality indicators to reflect quality components and overall quality, using structural equation models. The paper applies confirmatory factor analysis first-order and second-order models. Structural equation models provide measures of the impact of each manifest variable (e.g. quality indicators) on the relative latent factor (e.g. quality or quality components) as well as measures of reliability, such as the Squared Multiple Correlation. The paper evaluates the goodness of fit of the models using the Santorra-Bentler scaled Χ 2 statistic, instead of the standard Χ 2 statistic, since the standard Χ 2 statistic tends to be erroneously too high in the case of non-normality. In cases of unfavourable indicators of fit, inspection of modification indices can help guide model re-specification. Section 4 presents theoretical structural equation models. The paper evaluates overall quality as a second order latent factor, where no relationships among quality components are assumed. Two different theoretical first order models are also considered. The first model evaluates quality components as latent factors, where correlations between quality components can be assumed. The second model considers quality as a general latent dimension, which derives from all quality indicators; no quality components are included in the latent structure. The three theoretical models described in the paper were tested with real data. The models were then evaluated using the Goodness of fit statistics and Squared multiple correlations, to identify the best measurements of the common factor and loadings, to evaluate relationships in the model. The analysis showed that the second-order model did not converge and the simple first-order quality model did not produce interpretable results. The more reasonable model was the first order latent factor model on quality components. However, this model was not able to represent more complex quality components, such as Accuracy.

Page 99: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

99

Cecconi C., Polidoro F. and Ricci R. “Indicators to define a territorial quality profile for the Italian consumer price survey”. Proceedings of Q2004 European Conference on Quality in Survey Statistics, Mainz, May 2004: CD-ROM. This paper details a methodological approach to synthesising basic indicators in order to compare territorial data collection quality, for the Italian consumer price survey. Section 4 examines four main standardisation methods. Standardising the basic indicators helps to eliminate the influence of the unit of measure, making them more comparable. The main standardisation methods which were evaluated are:

Method 1 – the ratio between the indicators and the mean of the series

Method 2 – the ratio between the indicators and the maximum of the series

Method 3 – the ratio between the differences of the indicators with respect to the average of the distribution and the standard deviation

Method 4 – the ratio between the indicators with respect to the minimum of the distribution and its range Method 2 was chosen for the analysis as it offers easy interpretation of results since the range varies between 0 and 1 or 0 and 100. The method also provides the possibility to evaluate the classification of the areas in cardinal and ordinal views. Of particular interest is Section 5, which details the synthesis of the basic indicators. Since the basic indicators have been normalised and standardised they can be grouped. A non-weighted average was preferred to group the indicators since a weighted average introduces a judgemental criterion in selecting the system of weights. Due to the limited number of basic indicators, a geometric mean was used to calculate the synthetic indicators. Whereas, an arithmetic mean was used to group the indicators for regions and macro areas. The synthetic measures were transformed into spatial indices in order to rank and compare chief towns, regions and macro areas. Munda G. and Nardo M. “Weighting and Aggregation for Composite Indictors: A Non-compensatory Approach”. Proceedings of Q2006 European Conference on Quality in Survey Statistics, Cardiff, 2006: Web http://www.ons.gov.uk/ons/media-centre/events/past-events/q2006---european-conference-on-quality-in-survey-statistics-24-26-april-2006/agenda/index.html This paper evaluates the consistency between the mathematical aggregation rule, used to construct composite indicators and the meaning of weights. Section 2 formally proves that equal importance is incompatible with linear aggregation; since in a linear aggregation weights have the meaning of a trade-off ratio. The paper states that when using a linear aggregation rule, the only method which computes weights as scaling constants, with no ambiguous interpretation, is the trade-off method. Consider two countries differing only for the scores of two variables. The problem is then to adjust one of the scores for one of the countries so the two countries become indifferent. In order to compute N weights as trade-offs, it is necessary to assess N-1 equivalence relations. However, operationally this method is very complex. The assumption that the variable scores are measured on an interval or ratio scale of measurement must always hold. However, this is rarely the case in practice.

Page 100: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

100

It is concluded that whenever weights have the meaning of importance coefficients, it is essential to use non-compensatory aggregation rules to construct composite indicators. Nardo M., Saisana M., Saltelli A., Tarantola S., Hoffman A. and Giovannini E. “Handbook on constructing composite indicators: methodology and user guide”, OECD (2008): Web http://www.oecd.org/std/42495745.pdf This handbook provides a guide on constructing and using composite indicators, with a focus on composite indicators which compare and rank countries’ performances. Part 1 focuses on methodology for constructing composite indicators. Of particular interest are Sections 1.5 and 1.6 which detail normalisation, weighting and aggregation methods. Section 1.5 details nine different normalisation methods and provides formulas in table 3. Some of the methods included in this section are; standardisation, min-max and distance to reference.

Standardisation converts indicators to a common scale, with a mean of zero and standard deviation of one.

Min-max normalises indicators to have an identical range, by subtracting the minimum value and dividing by the range of the indicator values.

Distance to reference measures the relative position of a given indicator to a reference point i.e. a target or benchmark.

Of the nine methods described, some may only be suitable for composite indicators which compare/rank countries’ performances. Section 1.6 presents methods for weighting and aggregation, including a table detailing compatibility of aggregation and weighting methods. This section also briefly describes some of the pros and cons of the methods. Further details and practical applications are given in Part 2, Step 6. The paper mostly focuses on the weighting and aggregation methods in terms of composite indicators which compare countries’ performance. However, the methodology for some of these methods could be applicable to other types of indicators.

For principle components or factor analysis weights are only introduced to correct for overlapping information between correlated indicators, they are not used to measure theoretical importance. If there is no correlation, weights cannot be estimated with this method.

In the unobserved components model, individual indicators are assumed to depend on an unobserved variable plus an error term. The weight obtained is set to minimise the error and depends on the variance of an indicator, say q and the sum of the variances of all other indicators including q. This method resembles regression analysis.

For the budget allocation process, experts allocate a ‘budget’ of 100 points to a set of indicators. The weights are calculated as the average budgets.

Weights for the analytic hierarchy process represent the trade-off across indicators. The process compares pairs of indicators and assigns a preference. The relative weights of the individual indicators are calculated using an eigenvector.

Conjoint analysis asks for an evaluation of a set of alternative scenarios e.g. a given set of values for the individual indicators. The preference is then decomposed. A preference function is then estimated using the information emerging from the different scenarios. The derivatives with respect to the individual indicators of the preference function are used as weights.

The aggregation methods discussed in Part 2, Step 6 are geometric methods, Non-compensatory multi-criteria approach and additive methods; the difference between the number of indicators above and below a threshold (around the mean), summation of

Page 101: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

101

weighted and normalised indicators. More information on the non-compensatory multi-criteria approach can be found in Munda and Nardo (2006). Part 2, Step 4, looks at multivariate analysis techniques. It is noted that the methods are mostly for data expressed in an interval or ratio scale. However, some of the methods are suitable for ordinal data, for example, principle components analysis. Four main methods are considered, including; principal components analysis, factor analysis, Cronbach coefficient alpha and cluster analysis, as well as a few others.

Principal components analysis aims to explain the variance of observed data through a few linear combinations of the original data.

Factor analysis is similar to principal components analysis. The aim of the method is to describe a set of variables in terms of a smaller number of factors and to highlight the relationships between variables.

The Cronbach coefficient alpha (c-alpha) assesses how well a set of items (individual indicators) measures a single uni-dimensional object (e.g. attitude, phenomenon). C-alpha is a coefficient of reliability based on the correlation between individual indicators.

Cluster analysis uses algorithms to group items (individual indicators) into clusters, where items in the same cluster are more similar to each other than to those in other clusters.

Polidoro F., Ricci R. and Sgamba A.M. “The relationship between Data Quality and Quality Profile of the Process of Territorial Data Collection in Italian Consumer Price Survey”. Proceedings of Q2006 European Conference on Quality in Survey Statistics, Cardiff, October 2006: Web http://www.ons.gov.uk/ons/media-centre/events/past-events/q2006---european-conference-on-quality-in-survey-statistics-24-26-april-2006/agenda/index.html The methodology discussed in this paper expands on the methods detailed in Cecconi et al (2004). The paper details the methodology used to synthesise the indicators for sample coverage, data collection infrastructure and micro data accuracy as well as creating an overall synthetic indicator. Section 3.2 provides the methodology for standardising and synthesising the basic indicators. The standardisation method detailed in the paper is the one which was chosen in Cecconi et al (2004), the ratio between the indicator and the maximum value. In addition this paper has developed some mathematical notation for the chosen method. This paper examines the methods used for synthesising the basic indicators in more detail than in Cecconi et al (2004) and also provides notation and formulas. Firstly the basic indicators are grouped by town, for each component e.g. sample coverage and then for all the basic indicators (overall), using a geometric mean. However, regional and geographic synthetic indicators are calculated using a weighted arithmetic mean. Again, these are calculated for each component and then for all the basic indicators. Smith P. and Weir P. “Characterisation of quality in sample surveys using principal components analysis”. Proceedings of UNECE Work session on Statistical Data Editing, Cardiff, October 2000: Web http://www.unece.org/fileadmin/DAM/stats/documents/2000/10/sde/4.e.pdf This paper describes how to obtain some overall measure of quality by considering quality as a multivariate measure for any dataset, where each quality indicator represents one

Page 102: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

102

dimension of quality. This is an alternative approach to evaluating the total survey error, since total survey error evaluates quality in terms of overall accuracy but is very costly. The paper focuses on the use of principal components analysis to find the measures which best capture the underlying variation in the data quality measures. The analysis is used to try and obtain a small number of indicators which provide the most data quality information, in order to make the assessment of data quality more straight forward. Variables from the UK Monthly Inquiry into the distribution and Services Sector survey were used for the analysis. A relatively wide-ranging set of indicators were included covering sampling, response rates and data editing. The indicators were also calculated by stratum. The method was also applied to data from the U.S. Energy Information Administration’s Annual Fuel Oil and Kerosene Sales report, using the same indicators where possible. Before principal components analysis can be performed, the variables need to be standardised by subtracting the mean and dividing by the standard deviation. This removes the variability in the measures. The results detail the proportions of variation in the data, explained by the principle components and also provide the loadings (coefficients to derive the principle components) for the first five principle components. The larger coefficients highlight which variables are most important, in each principle component. The paper concludes that, for this set of indicators:

Most of the variation is explained by response rates.

Weighted indicators contain different information to unweighted indicators.

Some of the related indicators (e.g. sampling fraction and sampling errors) contain very similar information.

Page 103: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

103

Chapter 4: Guidance on the accuracy of mixed source statistics 4.1 Introduction The increasing use of admin data, as discussed in previous chapters, has not resulted in all statistics being based on admin data alone. In fact, although some National Statistical Institutes (NSIs) do produce a range of statistics entirely based on admin data, a larger proportion of statistics are produced by combining admin and survey data. It is the assessment of the quality of these statistics that is a particular challenge and is the focus of this chapter. Chapters 2 and 3 have set out basic quality indicators applicable to statistics developed using admin data and guidance on creating composite quality indicators for the different quality dimensions / themes. These quantitative indicators apply (at least to some extent) irrespective of how admin data are used in the production of the statistical outputs. However, those NSIs that do use admin data in the production of business statistics do so in different ways, both within the NSI (across statistical outputs) and across NSIs. Consequently, there are other (more complex) indicators that would be useful to members of the ESS but which vary, depending how the NSI uses the admin data (e.g. survey data for large businesses, admin data for small businesses and some estimation modelling for medium-sized businesses). Developing detailed indicators that capture the important elements of all of these different processes and combinations would be near impossible. Therefore, this chapter includes guidance that can be applied to these situations and which outlines important areas for consideration. As revealed in work to establish quality considerations in using admin data in the domain of business statistics across the ESS, it is particularly the dimension of accuracy that is of interest; the use of admin data does not mean that the data is error free. Therefore, this work focuses on the ‘accuracy’ quality dimension to assist NSIs when combining admin and survey data. Chapters 2 and 3 cover other quality dimensions / themes. The guidance includes case studies covering both Structural Business Statistics (SBS) and Short-Term business Statistics (STS) data and using information and methods from other work packages within the ESSnet (WP3 and WP4 respectively).14 4.2 Uses of admin data

Given the increasing availability of admin data, it is crucial to consider how these data can be used to estimate a population parameter15. In this chapter we consider statistics that are based on a combination of survey and admin data. Note that we define survey data as data collected by the NSI using a sample survey or a census, thus excluding register-based surveys. This is a narrower definition than that given by Wallgren and Wallgren (2007). We discern three ways of using admin data when survey data are also available. First, problems associated with integrating sources can be avoided by comparing a survey estimator with an admin data estimator (Laitila and Holmberg 2010). The mean square error (MSE) is a common measure of accuracy, combining structural error (bias) and random error

14 Work Packages 3 (Methods of Estimation for Variables) and 4 (Timeliness of Administrative Data):

http://essnet.admindata.eu/ 15 For a full discussion of model-based and design-based methods, see Little (2012) and the corresponding

discussion papers. The choice of method selected may lead to different methodologies for evaluating the

quantified indicators (bias, variance).

Page 104: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

104

(variance). The estimator that yields the smallest MSE is thus preferred. Laitila and Holmberg showed that an admin data estimator may have a higher relevance bias than a survey estimator, but this can be offset by the absence of sampling variance. They also showed that this trade-off depends on the bias from other sources, such as frame errors, non-response errors and measurement errors. Recently, Scholtus and Bakker (2013) described a quantitative method to assess the bias of an admin or survey variable. Their method requires record linkage of data from different sources measuring the same concept, but they relax the common assumption that one source can be used as an error-free reference. Second, one can combine a survey estimator and an admin data estimator while still avoiding data integration. Moore et al. (2008) defined a composite estimator as a weighted sum of a survey estimator and an admin data estimator, where the weight is equal to the contribution of each estimator’s MSE to the sum of both MSEs. Thus, the smaller the MSE of the admin data estimate relative to the MSE of the survey estimate, the more the composite estimator is based on the admin data estimate. The authors presented three issues when combining the estimates. If the admin data estimate is less timely than the survey estimate, the first needs to be forecasted. If the admin data estimate is less frequent, it needs to be temporally disaggregated. If both sources use different definitions, they need to be reconciled. The challenge is to minimise the MSE of the admin data estimate, maximising the benefit of the admin data information. Third, survey and admin data can be combined at unit level through data integration techniques, such as record linkage, statistical matching and micro-integration processing (ESSnet Data Integration 2011). In this report, we focus on this third option for using admin data as the previous methods are described in the literature referenced above. Furthermore, evidence16 shows that this is of particular interest to the ESS and is where guidance is currently lacking. In particular we discuss four situations in which admin data and survey data have been combined through data integration (Fig. 1). The first two situations (A and B) have been published, the latter two (C and D) are unpublished case studies developed in this project.

16 Work Package 1 of the ESSnet Admin Data – data collection providing information on the use of

administrative data for business statistics: http://essnet.admindata.eu/WorkPackage?objectId=4251

Page 105: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

105

Figure 1 Four situations in which survey and admin data have been combined at micro level: A) overlapping sources: a random sample survey overlaps with an admin source that covers a selective part of the target population, B) two-phase sampling: only a subset of a random sample from the target population is surveyed, and the remaining units in the sample and the non-responding units are imputed using admin data, C) cut-off sampling: large enterprises are observed through a census survey, medium enterprises are observed through a sample survey, small enterprises are not observed, and admin data is used as auxiliary information, D) two-strata mix: large enterprises are observed through a census survey, information on small and medium enterprises is derived from admin data, and missing data is imputed. Black frame is target population.

Overlapping sources At CBS Netherlands, Kuijvenhoven and Scholtus (2010, 2011) developed two combined estimators, and compared their accuracy with that of an ordinary regression estimator using a bootstrap method (see below). They considered the case where 1) the sample survey and admin data overlap, because admin data become available only after the sample has to be drawn, and 2) the admin data cover only a selective part of the target population (Fig. 1A). The first, “additive” combined estimator uses the admin data z if available (UR), but these are replaced by the survey data y if both are available(sR), because the authors consider the survey data more accurate in social surveys. The target population that is not covered by the admin data (UNR) is estimated by the regression estimator, using the sample data that do not overlap with the register data (sNR):

NRRRR sk

kk

sk

k

sk

k

Uk

k ywyzz ,

where wk is the weight of unit k and the product of the inclusion weight and a correction weight to make the sample representative with respect to auxiliary information. The second, “regression-based” combined estimator is an ordinary regression estimator, but the auxiliary information comes from the admin data for the part of the sample that overlaps with the register:

sk

kk yw ,

where the auxiliary vector x is defined by

NRNRk

Rkk

Uk

Ukz

if,0

if,

x

0x .

Survey data

Admin data

A B

C D

Time

Siz

e

Siz

e

y x

Survey data

Admin data

Survey data

Admin data

A B

C D

Time

Siz

e

Time

Siz

e

Siz

e

y x

Siz

e

y x

Page 106: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

106

Kuijvenhoven and Scholtus applied their methods to the Dutch educational attainment file, which is a relatively recent admin dataset that contains mainly information on young people. Simulations showed that the combined estimators are more precise than the ordinary regression estimator, that the additive combined estimator is more biased but more precise than the regression-based combined estimator, and that the bootstrap provides valid variance estimates. Based on the work by Kuijvenhoven and Scholtus, it can be assumed that an admin data value zk differs from a survey data value yk by a measurement error uk:

kkk uyz ,

where the measurement error is 0 if a Bernoulli random variable k with mean k equals 0,

and a random variable ek with mean ke and variance

2

ke if k equals 1:

1if

0if0

kk

k

ke

u

.

The variance of an admin data value ( ke zV ) can then be derived as a function of the

probability of a measurement error ( k ), and the mean (ke ) and variance (

2

ke ) of the

measurement error distribution:

22

kk ekekke zV .

The three parameters in this variance equation can be estimated from the data when the sample survey and admin data overlap. Two-phase sampling At Statistics Canada, Demnati and Rao (2009) developed a naive and design-based estimator and associated variance estimators for two-phase sampling. They considered the case where only a subset of a random sample from the target population is surveyed, and the remaining units in the sample and the non-responding units are imputed using admin data (Fig. 1B). For variance estimation, they used a generalised Taylor linearisation method extended to two-phase sampling (Demnati and Rao 2004). Cut-off sampling A common design that has the possibility to improve the efficiency and reduce the burden of Structural Business Statistics (SBS) involves splitting the population into three broad strata (Fig. 1C). This is based on UK work from Work Package 3 of the ESSnet Admin Data, which tests more efficient estimation approaches than those currently used in the UK. In the method tested, data for the largest businesses are directly observed through a census survey. Data for medium-sized businesses are also directly observed, but through a sample survey, with weighting used to provide inference for the whole population. Note that admin data are often used as auxiliary data in the weighting for this part of the survey to improve the accuracy of estimates. Data for the smallest businesses are not directly observed. For each industry, the data are modelled using available admin data. Various modelling techniques are available, see section 4.1 for more detail. Two-strata mix At CBS, quarterly turnover for the Short-Term business Statistics (STS) is based on a mix of survey and admin data. Business size does not determine the survey design but it determines the data source: most businesses are indirectly observed through the VAT admin data and only the statistical units (enterprises) underlying the largest businesses are directly

Page 107: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

107

observed through a census survey (Fig. 1D). The rationale of this design is that the larger the business, the weaker the link between the statistical and admin unit. The CBS STS case study also considers the timeliness of the sources, which is the main focus of Work Package 4 of the ESSnet Admin Data. Early estimates often need to be produced before all survey and admin data are available. CBS uses imputation to complete the missing quarterly data (de Waal et al. 2012; Israëls et al. 2011; EDIMBUS 2007).

4.3 Measuring error

The mean square error (MSE) is a measure of accuracy combining both the random error

(variance) and the structural error (bias): 2MSE = Variance + Bias . Statistical outputs based

on sample surveys are often accompanied by estimates of variance (or coefficients of variation), which give users an indication of the impact of sampling error on the accuracy of the statistics. In many common scenarios, standard formulae exist for calculating these variance estimates. The focus is on variance estimates since estimation methods used are often (approximately) unbiased, so the variance estimate can also be seen as an estimate of the MSE. It is highly desirable that mixed source statistics should be accompanied by a similar measure of accuracy. For estimates based on a combination of admin and survey data, it is less straightforward to estimate the sampling variance of the survey part, because of interaction with the admin data. The admin dataset itself is not subject to sampling variance, since it is a full enumeration of the admin population (ignoring any missingness). However, using raw or modelled admin data for part of a statistical output can introduce bias. Sources of bias include differences in the definition of variables, coverage errors, measurement error and misclassifications. It is therefore necessary to find methods to estimate each of the components of MSE. The method for estimating MSE will depend on the nature of the mixed source estimation. It is not possible to define a single approach for all estimation methods. Instead, this chapter investigates a variety of mixed source estimation approaches covering Structural Business Statistics and Short-Term business Statistics, using examples from the UK and the Netherlands. 4.4 Case studies As outlined in previous chapters, the ESSnet Admin Data is working to develop best practice in the use of admin and accounts data for business statistics. In addition to WP6 (focusing on the development of quality indicators), two other WPs are undertaking particularly relevant work linked to the work described in this chapter. Work package 3 (WP3) looks at estimating Structural Business Statistics (SBS) data when no admin data is directly available. Work package 4 (WP4) looks at timeliness issues, particularly in the context of Short-Term business Statistics (STS) data. To provide a coherent picture of the work being done in these areas and the impact of quality considerations across the board, the following methods use case studies using SBS data (WP3) and STS data (WP4). 4.4.1 Cut-off sampling (ONS case study)

4.4.1.1 Overview of ONS Case Study

The WP3 report of Sanderson et al. (2012), presents several methods for estimation in the presence of cut-off sampling. Cut-off sampling means that only part of the population is

Page 108: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

108

surveyed (Fig. 1C). Estimation for the part of the population below the cut-off has to be carried out by some other means. These methods are applied to data collected from the Annual Business Survey (ABS), the main structural business survey conducted by the Office for National Statistics (ONS) (Office for National Statistics 2012). If cut-off sampling were to be used for the ABS or indeed other surveys then appropriate quality measures would be required to accompany the estimates. Two such quality measures are bias and variance. Two approaches towards bias and variance estimation considered in this report are the use of analytical expressions where they exist and using bootstrap techniques. Bootstrap methods involve resampling from sample data in order to approximate the population distribution. The information from the distribution of bootstrap sample data can then be used to estimate variance and bias. More information on specific implementations of this approach is given below. This case study will apply techniques for estimating the bias and variance of estimates of gross investment in tangible goods using different methods considered by Sanderson et al. (2012) and data from the ABS for the years 2004 to 2008. The report in particular focuses on whether these techniques can be used after cut-off sampling has been implemented. While the variable considered in this report is gross investment in tangible goods, the variance and bias estimation techniques can be used more widely for other variables and on other surveys. The first method for estimating in the presence of cut-off sampling modifies the current estimation methodology. Estimation of gross investments in tangible goods is carried out using ratio estimation with turnover, which is available on the Inter-Departmental Business Register (IDBR) as the auxiliary variable. Under cut-off sampling the current methodology can be modified by adjusting the weighting. The variance of this estimator can be calculated using an analytical expression. The bias is calculated using bootstrap techniques; this gives a more robust estimate than approximating the bias from a single dataset. The second method, simple ratio adjustment, uses sample data from units above the cut-off to estimate a ratio of the variable of interest to an auxiliary variable. The ratio is then applied to the aggregate of the auxiliary variable in the cut-off band. Three choices of auxiliary variable are available – register (IDBR) turnover, register employment and VAT turnover. The investigations look at both calculating one overall ratio and at calculating ratios at lower levels of aggregation. In this report both the bias and variance of these estimators are calculated using bootstrap techniques. The third method uses regression models. There are two approaches – one to fit a regression model to sample data collected for units above the cut-off, the other to fit a regression model using past data for the cut-off band and carry this model forward. In this report both the bias and variance of these estimators are calculated using bootstrap techniques. The bias and variance of the different methods can be used to indicate how the different methods are performing as well as to provide quality information to users on the method that is ultimately implemented. The consistency of the estimates of bias over time will be assessed to determine whether the bias or a function of the bias can be assumed to be constant going forward when no survey data will be available from the cut-off band to estimate this quantity. Notation Some notation which will be used consistently throughout this section is outlined below.

Page 109: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

109

hN denotes the population size in stratum h

hn denotes the sample size in stratum h

hh

h

nf

N denotes the sampling fraction in stratum h

hU denotes the set of all elements in the universe in stratum h

hs denotes the set of all elements in the sample in stratum h

( )g h denotes the ‘g-weight’ band associated with stratum h

jU denotes the set of all elements in the universe belonging to a group j

Current estimation methodology The current method for estimating totals on the Annual Business Survey (ABS) in the UK is ratio estimation applied to a sample that covers the whole population. For a variable of interest y —in this case study, gross investment in tangible goods—we wish to estimate the

population total . ii UY y

A sample is taken and is observed for the sampled units.

The estimate of the population total is then taken to be i i ii sY ag y

where ia is the design

weight, defined in equation (1) and also known as the ‘a-weight’ and ig is the calibration

weight, defined in equation (2) and also known as the ‘g-weight’. Also define

cc i i ii s

Y ag y

to be the estimate for the cut-off band from the original survey data where

the sum is over all units cs which are units in the sample that would fall below the cut-off.

for hi

h

Na i h

n (1)

( ) for g h

g h

ii U

ih

ii U i sh

x

g i gN

xn

(2)

Where g is a particular calibration group and ix is the value of the auxiliary variable for unit

.i The auxiliary variable used for the ABS is Business Register turnover. An analytical expression exists for the approximate variance of a ratio estimator (Cochran

1977), detailed in equation (3):

22

1 1

(1 )( ) where for

( 1)

hnHjj Ugh h

i ii ih ih h jj Ug

yN fV y R x R i g

n n x

(3)

4.4.1.2 Methods used in WP3 of the ESSnet Admin Data

The work carried out in WP3 considers a cut-off band consisting of businesses with less than 10 employment from which a sample would no longer be taken. A sample is still conducted of businesses with employment of 10 or greater, and either these data or historic data from the cut-off band are used to predict for the variable of interest in the cut-off band. The

Page 110: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

110

estimate for businesses with employment of 10 or greater remains calculated under the current methodology of ratio estimation. The WP3 methods considered in this report are modifying the current estimation method by inflating the ‘g-weights’, the simple ratio adjustment method and two regression models. These methods are described below. Modifying the current estimation method The current methodology can be adapted to incorporate cut-off sampling by inflating the ‘g-weights’ to account for the absence of sampling in the cut-off band. Businesses in the cut-off band are assigned to a ‘g-weight’ band which includes businesses above the cut-off. The ‘g-weights’ are still calculated as in equation (2), however now the numerator contains auxiliary information for the businesses in the cut-off band assigned to this ‘g-weight’ band, while the denominator no longer has any survey data relating to elements in the cut-off band. This method is similar to the simple ratio adjustment method (see below) with register turnover as the auxiliary variable; however the ratios are calculated at a much lower level of aggregation than division level or the overall level. For more details on this method see Sanderson et al. (2012). Simple ratio adjustment The simple ratio adjustment method uses sample data from businesses above the cut-off band to estimate a ratio of the variable of interest to an auxiliary variable. The ratio is then applied to the aggregate of the auxiliary variable in the cut-off band to achieve an estimate for the total in the cut-off band. Ratios can be calculated at either the overall level or the division level. Division is a high level of aggregation of the UK’s SIC (Standard Industrial Classification), an equivalent of NACE. This report works with SIC03 in which there are 57 divisions. In this report the auxiliary variables used are register turnover, register employment and VAT turnover. For more details on this method see Sanderson et al. (2012). Unit-level linear regression modeling Two approaches were used to model estimates for the cut-off band, one was to use past data from the cut-off band to fit a linear regression model which would then be carried forward each year with the parameters remaining unchanged. The second was to fit a model each year to the sample data for businesses above the cut-off band and apply this model to predict for businesses in the cut-off band. For more details on the models and the variables included see Sanderson et al. (2012). 4.4.1.3 Bias and variance estimation

This section outlines the techniques used to estimate the bias and variance of the estimates for gross investment in tangible goods under the methods outlined in section 4.1.2. Modifying the current estimation approach There is an analytical expression for the approximate variance of a ratio estimator (Cochran 1977) which can be adjusted to estimate the variance of the estimator using ‘g-weight’ inflation. The formula is detailed in equation (4).

Page 111: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

111

\

22 2

( )

1 1

(1 )

(1 )

hg

gg c

nG ii U h hg hi i

g i U ii h hi U

x N fV y R x

x n n (4)

Where, iy is the observed value for sampled unit ,i ix is the value of the auxiliary variable

for unit ,i gU is the set of all elements in the universe in ‘g-weight’ band ,g \g cU is the set of

all elements in the universe in ‘g-weight’ band g excluding those in the cut-off band, and

( )ii Ug

g h

ii Ug

yR

x is an estimate of the ratio of the variable of interest to the auxiliary variable in

the appropriate ‘g-weight’ band. Bootstrapping can be used to estimate the bias of the ‘g-weight’-inflated estimator. The inflated ‘g-weights’ are calculated to include auxiliary information on the cut-off band as

described above. Denote these inflated ‘g-weights’ by *.ig The bootstrapping is carried out

using the Rao-Wu method (Girard 2009) which accounts for a finite population correction. In

stratum h , a sample size of 1hn (or 1 if 1hn ) is taken with replacement. It is recorded

how many times each businesses is re-sampled and this is denoted by ik . Each business in

the original sample is then assigned a modified ‘a-weight’ defined in equation (5).

* (1 1 1 )1

hi i ih h

h

na f f k a

n

(5)

The *ig remain the same. There is variation in these between different bootstrap samples,

but computing these inflated weights each time the bootstrapping is run would be even more computationally intensive. The estimate for total gross investment in tangible goods is then given by equation (6) where the sum is now only over businesses in the sample which are above the cut-off.

* *gwi i i ii s

Y a g y

(6)

The process of re-sampling and re-calculating the *ia and the total estimate is then repeated

many times. The bias is estimated as the average difference between the estimate coming from a bootstrap replicate and the estimate calculated from the full survey sample, given by the expression

,

1

1( )

K

gwi k

k

B Y YK

(7)

Where K is the number of bootstrap runs. Note that the original ratio estimator is only approximately unbiased, this report is not assessing the bias of the ratio estimator, but the additional bias that is introduced by the cut-off sampling compared with the current estimation method. Simple ratio adjustment The Rao-Wu bootstrap method presented in Girard (2009) is used for bias and variance

estimation of the simple ratio adjustment method. In stratum h , a sample size of 1hn (or 1

if 1hn ) is taken with replacement. It is recorded how many times each business is re-

Page 112: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

112

sampled and this is denoted by ik . Each business in the original sample is then assigned a

modified ‘a-weight’ defined in equation (5) with the ‘g-weights’ remaining the same.

The ratio

*

*m

m

i i ii s

i i ii s

a g yr

a g x

is then calculated where the summation is over those businesses

in the sample with 10 to 19 employment. The ratio can also be calculated at division level by taking the summations over those businesses in a given division with 10 to 19 employment. This ratio is then applied to the aggregate of the auxiliary variable for the businesses in the

cut-off to provide an estimate of total gross investment in tangible goods for the cut-off band,

given by sra ii c

Y r x

.

The process of resampling and recalculating the estimate is repeated many times. The bias is then estimated using equation (7). Again this is not looking at the bias caused by the ratio estimator, but the additional bias brought about by cut-off sampling. The variance of the estimate for the cut-off band is estimated as

2,

1

1( )

1

K

c sra k sra

k

V Y YK

(8)

Where sraY is the mean of the estimates from the bootstrap replications.

The total variance of all gross investments in tangible goods is then the sum of the estimated variance in the cut-off band and the analytical variance computed for the businesses not in the cut-off band, calculated using equation (3). These variances can be added because units above and below the cut-off band are independent. Unit-level linear regression modeling Bootstrapping is used to obtain estimates of both the bias and variance of the two regression models. This method is used as there are no analytical formulae available to estimate these quantities. A model was fitted to historic data from the cut-off band in 2004 and then carried forward each year until 2008. As the regression model is based on 2004 data, resampling of the cut-off businesses in the original 2004 sample is carried out.

In stratum ,h a sample size of 1hn (or 1 if 1hn ) is taken with replacement. This is then

treated as our sample and estimates for the parameters of the regression model are estimated using this sample (see Sanderson et al. 2012 for full details of the models used). The same variables that Sanderson et al. (2012) decided to use in their final regression model are retained in the model regardless of their significance. The resulting model is then applied to all units in the cut-off band in subsequent years to predict for total gross investment in tangible goods in the cut-off band. The estimator used is then detailed in equation (9), where is the fitted regression model and iz is the auxiliary

information for unit i .

( )c

reg ii U

Y z

(9)

Page 113: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

113

This process is repeated many times to obtain different estimates for the cut-off band. For each year the average and variance of these estimates is computed to estimate the expected value of the estimator and its variance respectively. The bias and variance are then estimated using analogous methods to equations (7) and (8). The total variance for the whole population is then the sum of this estimated variance and the analytical variance computed for the businesses not in the cut-off band using equation (3). The model fitted to sample data assumes that businesses with less than 10 employment have not been sampled, but a sample exists of businesses with employment of 10 or greater. For a given year a regression model is fitted to the sample of businesses with employment between 10 and 249 for that year and then applied to the businesses in the cut-off. The variance of this estimator comes from the variance in the sample of businesses with employment between 10 and 249, so the bootstrapping will be carried out on the sample of these businesses. 4.4.1.4 Estimating these quantities in the future without data

In the presence of cut-off sampling the variance of these estimators can be computed using the methods outlined in Section 4.1.3. However, the methods used to estimate bias all require survey data from the cut-off band to calculate an approximately unbiased estimate for these businesses. These will no longer be available under cut-off sampling. One idea for estimating the bias going forward when no data are available is to look at whether the bias or a function of the bias is constant over time. This quantity could then be assumed to remain constant going forward in time. Table 1 explains the notation used in the titles of figures 1 to 4.

Page 114: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

114

Table 1 Explanation of the titles of the graphs in figures 1 to 4.

Title Description

sra turnover overall The simple ratio adjustment method calculating the ratio at the overall level using register turnover as the auxiliary variable

sra turnover division The simple ratio adjustment method calculating the ratio at division level using register turnover as the auxiliary variable

sra employment overall

The simple ratio adjustment method calculating the ratio at the overall level using register employment as the auxiliary variable

sra employment division

The simple ratio adjustment method calculating the ratio at division level using register employment as the auxiliary variable

sra vat overall The simple ratio adjustment method calculating the ratio at the overall level using VAT turnover as the auxiliary variable

sra vat division The simple ratio adjustment method calculating the ratio at division level using VAT turnover as the auxiliary variable

gwt inflation The ‘g-weight’ inflation method

band 1 model The method fitting a regression model to historic cut-off band data and carrying it forward

band 2 model The method fitting a regression model to the survey data for businesses above the cut-off

The biases over time of the overall estimates for gross investment in tangible goods produced by the different methods considered are presented in Figure 2. The biases of the simple ratio adjustment methods and the ‘g-weight’ inflation method can be volatile, in particular looking at the trends between 2004 and 2006 and when using VAT turnover as the auxiliary variable. The bias from the regression model fitted to historic cut-off band data is generally lower and more consistent than the bias from the simple ratio adjustment methods. The regression model fitted to historic cut-off band data appears to have the lowest, most consistent bias.

Page 115: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

115

Figure 2 Bias of the estimates produced using the methods outlined in Section 4.1.2 for 2004 to 2008.

Another option is to consider whether the ratio of the bias to another quantity is constant over time. One choice is to look at the ratio of the absolute bias to the standard error. The standard error can be estimated using the methods discussed in Section 4.1.3 for estimating the variance. Figure 3 looks at the bias ratio at the overall level for each of the methods tested. The regression model fitted to past data and the method of ‘g-weight’ inflation have the lowest relative biases and these appear to be the most consistent from 2005 onwards.

Page 116: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

116

Figure 3 Ratio of bias to standard error of the estimates produced using the methods outlined in Section 4.1.2 for 2004 to 2008.

Figure 4 looks at the variance over time of the different methods considered. As the variance can be computed using the techniques outlined in Section 4.1.3 once cut-off sampling has been implemented, this comparison serves more as an evaluation of the methods of Sanderson et al. (2012). For ease of comparison the scales are all the same for each method, which highlights that both simple ratio adjustment methods using VAT data produce quite volatile variances, which make the other methods more difficult to compare. However, again it is the two regression models which have the lowest, consistent variance of all estimators compared.

Page 117: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

117

Figure 4 Variance of the estimates produced using the methods outlined in Section 4.1.2 for 2004 to 2008.

The variance of an estimator gives an indication of its quality; however an estimator could have a low variance but large bias, which should be reflected in measures of quality. The mean square error (MSE) gives an indication of a combination of the bias and variance of an estimator. Figure 5 plots the MSE for the different methods from 2004 to 2008. The simple ratio adjustment methods using the VAT turnover data has the most volatile MSEs, while the regression methods, and after 2005 the ‘g-weight’ inflation method, have the lowest, most consistent MSEs.

Page 118: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

118

Figure 5 Mean Squared Error (MSE) of the estimates produced using the methods outlined in Section 4.1.2 for 2004 to 2008.

At the overall level the bias and bias ratios can be quite volatile. The WP3 report of Sanderson et al. (2012) looked at using cut-off sampling in divisions where the new estimate was within ±5% of the original estimate. Therefore let us consider whether there are any divisions in which the bias or bias ratio could be treated as constant. The regression model using historic data from the cut-off band produced division level estimates within ±5% of original sample estimates in 16 of the 57 divisions. Of all methods, this method yielded the largest savings in terms of sample size if cut-off sampling were to be implemented across these 16 divisions. Figure 6 looks at the bias in these 16 divisions for this regression model. In several divisions the bias appears to be constant, so if these divisions were to adopt cut-off sampling and use this regression model for estimation then the assumption of constant bias going forward would hold. However, this is not true for all divisions.

Page 119: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

119

Figure 6 Bias in the 16 divisions under which the regression fitted to historic cut-off band data produced an estimate within ±5% of the original sample estimate.

The bias ratios for these divisions are plotted in Figure 7, with the exception of divisions 41 and 64 to allow for a more comparable scale. This ratio appears to be rather volatile and in some divisions the bias can be larger than the standard error.

Page 120: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

120

Figure 7 Bias ratio in 14 of the 16 divisions under which the regression fitted to historic cut-off band data produced an estimate within ±5% of the original sample estimate.

4.4.1.5 Practical implications

Calculations using bootstrap methods can be computationally intensive. In this work, 200 bootstrap iterations were carried out to calculate estimates for each expected value and variance. Computation times will vary according to processing power, but an idea of the amount of time to carry out one iteration on a standard desktop computer for each method is detailed in Table 2. The times ignore the time spent on formatting the data ready for bootstrapping and just look at how long the resampling and recalculation of the new estimate takes. Note that the regression model using past data for the cut-off band simultaneously computes estimates for four years. It would not be fair to compare a quarter of this time with the others, but would be fair to compare four times the length of time taken for the other methods with this time.

Page 121: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

121

Table 2 Time taken to run one bootstrap iteration for the methods where the variance is calculated using bootstrap techniques.

Method Time

Regression model using sample data for businesses above the cut-off band

2 minutes 9 seconds

Regression model using past data for the cut-off band 5 minutes 38 seconds

Simple ratio adjustment 13 seconds

The regression models take the most time to run because the models need to be fitted each time to the re-sampled data to obtain parameters, then the resultant models need to be fitted to all businesses in the universe below the cut-off. The actual bootstrapping part of the simple ratio adjustment method is relatively simple. If bootstrapping were used in practice to calculate variances, the number of iterations required would need to be investigated by looking at how long the estimates of the variance took to converge. More iterations than the 200 used in this report could well be needed, which would make the process even more computationally intensive. 4.4.1.6 Conclusions and recommendations

The bias estimation approach detailed above requires an estimate of the expected value of the method used, which is obtained through bootstrapping and can be calculated in the same way going forwards; however, an (approximately) unbiased estimate of the true value of the variable of interest in the cut-off band is also required. In this case study this estimate was calculated using survey data available for the cut-off band. In the future, in the presence of cut-off sampling, such data will not be available. Therefore going forward an alternative method for estimating this quantity should be used. The case study addressed this by looking for quantities which are constant over time. The quantities looked at were the bias, the MSE and the ratio of the bias to the standard error. If consistency across years is not found at the overall level, then one idea is to look at a lower level of aggregation; in this case the bias and bias ratio at division level were considered. If the bias were to be assumed constant over time, it would be sensible to take a sample periodically to check that the assumption still holds. If the regression model based on historic data from the cut-off band were used, then it would be worth taking a sample periodically anyway to update the model parameters. A further alternative for estimating the bias when there is no data available for the cut-off band is to use data collected from other surveys that sample for the whole population: for example, the ONS runs a Monthly Business Survey which is used to estimate Short-Term business Statistics (STS) and these estimates could be aggregated to an annual level. For variance estimation, the methods discussed above can be used going forward in the absence of data in the cut-off band. Two approaches were tested – using an analytical expression and using bootstrap methods. The bootstrap methods can be computationally intensive. In this report, 200 iterations were used, however in practice the number of iterations would have to be determined by the number of iterations it takes for the estimates to converge. Using an analytical expression is not so computationally intensive. However, bootstrapping can be applied for methods where no explicit formula for the variance is available, for example in the case of regression modelling, meaning that quality indicators for a wider range of methods can be produced.

Page 122: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

122

Although these bias and variance estimation techniques have been applied to methods for estimating gross investments in tangible goods, they can be used more widely to estimate these indicators for other variables and applied to other surveys. It is generally recommended that bootstrap methods can be used to estimate variance and bias in the case where mixed source estimation is based on a combination of sample surveys and modelled admin data and where no explicit formulae exist for estimating error. However, it is important to be aware of the computational restrictions and the difficulty of estimating bias when there is no gold standard with which to compare. 4.4.2 Two-strata mix (CBS case study)

In this section we describe the CBS case study (Fig. 1D) in more detail and provide guidance on how to estimate the accuracy of the mixed source statistic and on how to assess the sensitivity of early and late estimates to different sources of error. 4.4.2.1 Estimator

Because no samples are drawn, no complicated design-based or model-based estimators are required to make inferences about the target population. The estimator for the total

quarterly turnover, Y , in a given industry is simply the sum over all units in both strata (H = 2), i.e. the enterprises underlying the large businesses (survey data) and the enterprises underlying the small and medium businesses (admin data):

H

h

N

k

hk

h

yY1 1

ˆ ,

where yhk is the turnover of enterprise k in stratum h. When missing values are imputed, the total turnover simply becomes the sum over M observed and N−M imputed values:

H

h

MN

j

hj

M

i

hi

hhh

yyY1 1

imp

1

obs ˆˆ ,

where subscripts i and j refer to the enterprise. The CBS case study deals with four releases of quarterly turnover data for the third quarter of 2011 (Fig. 8). In addition to this quarterly turnover level, an important statistic is the change in turnover (quarter on quarter or year on year). The methods described here can also be applied to changes in turnover, but we limit the case study to turnover level. We will focus on nine industries of economic activity, defined by the Dutch particularisation of NACE Rev.2 within Division 45 (“Wholesale and retail trade and repair of motor vehicles and motorcycles”).

Page 123: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

123

Figure 8 Mixed source estimates of quarterly turnover as a function of time since end of reference period (third quarter of 2011) for nine industries within the Dutch particularisation of NACE Rev.2 Division 45. Note that the y-axes are scaled independently.

In most industries, turnover estimates are based on a combination of survey and admin data (Fig. 8). In some industries, such as 45111 (“Import of new cars and light motor vehicles”), estimates are based mainly on survey data. In other industries, such as 45194 (“Wholesale and retail trade and repair of caravans”) and 45402 (“Retail trade and repair of motorcycles and related parts and accessories”), estimates are completely admin data-based. The proportion of values that are imputed instead of observed can be substantial for early estimates (30 days after the end of the reference period) but is almost negligible for final estimates (one year after the end of the reference period). 4.4.2.2 Bias and variance estimation

Another consequence of the absence of sampling is that there is no sampling error. This does not imply, however, that the estimate is error-free. The simplest approach to bias and variance estimation would be to assume that the target population is a realisation of a sample from a super-population. This is illustrated below. More realistically, other sources of non-sampling error remain. These can be divided into measurement errors and representation errors (Zhang 2012). We illustrate one from each: measurement errors which occur when the measured target value differs from the true target value, and misclassification errors, which occur when the unit is assigned an incorrect category of a

Page 124: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

124

classification variable such as economic activity or size class. In all three approaches, we use resampling methods to estimate the accuracy (bias and variance) of the mixed source estimator of quarterly turnover. Since the latter two approaches require some error model as input, these should be considered sensitivity analyses. We use 10,000 simulations per estimate, which seems sufficient for confidence intervals to stabilise (Appendix A at the end of this chapter). The bias of our estimator is estimated as the difference between the average of the simulated replicates and the total turnover according to this estimator:

YYYBias ˆˆˆ * , where

B

b

bYB

Y1

** ˆ1ˆ and

*ˆbY is the estimated total turnover in simulated sample b.

The central limit theorem states that the distribution of the simulated replicates will tend towards a normal distribution. Therefore, the variance between simulated replicates is an estimate of the variance of the estimator:

B

b

b YYB

YV1

2*** ˆˆ

1

1ˆˆ .

The mean square error (MSE) is a measure of accuracy combining both the structural error (bias) and the random error (variance):

YVYBiasYMSE ˆˆˆˆ *2

.

In analogy of the square root of the variance (standard error), the square root of the mean square error (RMSE) is expressed in the same units as our target variable turnover, i.e. in euro. To make estimates comparable between releases and industries, we normalise the RMSE to the total turnover estimated from observed and imputed data, in analogy of the coefficient of variation (CV):

Y

YMSERMSECV

ˆ

ˆ .

This is the measure of accuracy that we use to compare releases and industries. Super-population In the bootstrap method (Efron and Tibshirani 1993), a large number (B) of bootstrap samples are drawn with replacement from the original sample, i.e. the target population. The population parameter is re-calculated for each subset b. Each sample has the same size as the target population. We therefore assume here that turnover is a random variable and that the target population is a realisation of a sample from a super-population. The bootstrap method provides unbiased estimates of the quarterly turnover (Fig. 9). According to this method, late estimates are usually no more precise than early estimates. Imputed values are replaced by observed values in later releases (Fig. 8), but this has

Page 125: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

125

generally no net effect on the variance between units within industries (Fig. 9). This suggests that there is no need to replace the current deterministic imputation process by a stochastic imputation process. In industry 45401 (“Wholesale trade of motorcycles and related parts and accessories”) there was a large increase in observed survey data between the second and third release (Fig. 8). This increased both the absolute and relative variance between units within this industry (Fig. 9). The least accurate estimates tend to occur in industries that are based mainly on survey data. Note, however, that turnover in industry 45191X is also based to a large extent on survey data, yet is considered to be fairly accurate. Note that source also correlates with business size. Bootstrapping thus suggests that small and medium enterprises may be more alike than large enterprises, but only in certain industries. In summary, the accuracy of the estimates according to this method depends on the variance between units within industries. This method allows us to determine which sources may require attention.

Figure 9 Bootstrapped confidence intervals of quarterly turnover per release and industry. Left: estimated turnover (line), bootstrapped mean ± SD (thick bars) and 95%-confidence interval (thin bars) using B = 10,000 bootstrap replicates. Note that the y-axes are scaled independently. Right: root mean square error normalised to the total turnover estimated from observed and imputed data.

Page 126: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

126

Source-specific measurement error Bootstrapping does not take non-sampling error into account. In this section we explore the sensitivity of our quarterly turnover estimates to source-specific measurement error. Suppose the measurement error in turnover value y of unit i is normally distributed with the mean equal to yi, and the standard deviation proportional to yi, where the coefficient of variation (CV) is a chosen parameter depending on both the source (survey or admin data) and the data (observed or imputed) (Fig. 10). In scenario 1, we assume that imputed values are the biggest source of error, and that survey data are more reliable than admin data. In scenario 2, we still assume that imputed values are more prone to error than observed values, but that survey data are the biggest source of error. Knowledge about the accuracy of survey and admin data might be obtained from data editing actions or from comparing initially imputed data with finally observed data (Aelen 2004; van der Stegen 2005). Preliminary analysis of data editing actions (not shown) suggests that scenario 2 might be more realistic. Conclusions from data editing actions are risky, however, partly because surveyed enterprises receive more attention in the editing process than secondarily observed enterprises. On the other hand, survey data might indeed be less reliable than admin data for another reason: the legal consequences of errors are less severe.

Figure 10 Two scenarios of source-specific measurement errors, shown for a business unit with a turnover of 1000 euro. Coefficients of variation for {observed admin data, imputed admin data, observed survey data, imputed survey data} are {0.05, 0.5, 0.01, 0.1} (scenario 1) and {0.01, 0.05, 0.1, 0.5} (scenario 2).

Using this input, we can draw a new turnover value for each unit from these probability density functions, recalculate the population parameter, repeat this a large number of times and use the distribution of the simulated replicates as a measure of the sensitivity of our quarterly turnover estimates to source-specific measurement error. Simulations show that the quarterly turnover estimates are fairly robust for source-specific measurement error (Fig. 11). In scenario 1, the accuracy is high even when almost half of the turnover estimate is based on an unreliable source with a CV of 50%, i.e. imputed admin

Page 127: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

127

data (industry 45194, t + 30). In scenario 2, early turnover estimates of industries 45310 and 45320 are the least accurate, because they are based substantially on the least reliable source, i.e. imputed survey data. In summary, the accuracy according to this method depends on the relative contribution of each source. Late estimates are more accurate than early estimates if imprecise sources are replaced by more precise sources. This method allows us to quantify the sensitivity of the target variable to source-specific measurement error. The method also allows the effects of other probability density functions, e.g. log-normally distributed measurement errors to be tested.

Figure 11 Sensitivity of mixed source estimates to source-specific measurement error. Left: estimated quarterly turnover (line), simulated mean ± SD (thick bars) and 95%-confidence interval (thin bars) using 10,000 simulations. Note that the y-axes are scaled independently. Right: root mean square error normalised to the total turnover estimated from observed and imputed data. Measurement error is assumed largest in imputed values (scenario 1) or survey data (scenario 2).

Source-specific representation error In this section we explore the sensitivity of our quarterly turnover estimates to source-specific representation error. Representation error can result from under-coverage or over-coverage, or from misclassification of auxiliary variables such as economic activity or size class. Here we will illustrate the effects of misclassifying the economic activity of a unit. CBS has a Service Level Agreement stating that the 3-digit NACE code should be correct for at

Page 128: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

128

least 95% of large enterprises and 65% of small and medium enterprises. We apply these figures at industry level. Assuming that the first two digits of the NACE code in our 9 industries are correct and that the probability of moving from industry i to j is the same for all j ≠ i, we define two source-specific transition matrices of dimension 9 × 9:

95.001.0

01.095.0SurveyP and

65.004.0

04.065.0AdminP .

This is scenario 1. In scenario 2, we switch the matrices by assuming that 65% of large enterprises (survey stratum) and 95% of small and medium enterprises (admin stratum) is correctly classified for economic activity. Using this input, we can draw a new industry code for each unit from these transition matrices, recalculate the population parameter per industry, repeat this a large number of times and use the distribution of the simulated replicates as a measure of the sensitivity of our quarterly turnover estimates to source-specific misclassification error. Simulations under scenario 1 show that source-specific misclassification can result in strongly biased estimates (Fig. 12). Our data set contains one large industry (45112), which is overestimated if some units in the small industries are misclassified. The small industries are underestimated if some units in the large industry are misclassified. In industry 45401 late estimates are more accurate than early estimates because they are based on more units with a likely correct NACE code (Fig. 8). In the other industries there is no effect of release on accuracy because the ratio between survey and admin data remains fairly constant. So far, simulations have confirmed mostly our intuition, given the input. The last simulation, however, shows that simulations can reveal counterintuitive results (Fig. 12). When we assume that the economic activity is more reliable for small and medium enterprises than for large enterprises (scenario 2), our estimates are less precise indeed, yet also less biased. This suggests that shifting the focus of editing the industry classification from small and medium to large enterprises can result in more biased estimates. Such a shift in resources has virtually no net effect on accuracy, because the gain in precision is offset by the creation of bias. In summary, the accuracy according to this method depends on the variation between industries. Misclassification errors can seriously affect the accuracy of our estimates.

Page 129: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

129

Figure 12 Sensitivity of mixed source estimates to source-specific classification error. Left: estimated quarterly turnover (line), simulated mean ± SD (thick bars) and 95%-confidence interval (thin bars) using 10,000 simulations. Note that the y-axes are scaled independently. Right: root mean square error normalised to the total turnover estimated from observed and imputed data. Classification error is assumed largest in admin stratum (scenario 1) or survey stratum (scenario 2).

4.4.2.3 Conclusions and recommendations

In this section, we have described the CBS case study where all large enterprises are surveyed and turnover of all small and medium enterprises is derived from admin data. Sampling theory does not apply here, because there is no sampling error. The estimate is not error-free because non-sampling errors still occur. Resampling methods can be used to compare the sensitivity of mixed source statistics to non-sampling errors between releases and industries. Such sensitivity analyses can guide the decision making about resource investment. The examples we have shown suggest that our estimates of turnover are much more sensitive to source-specific classification error than to measurement error. Moreover, shifting classification resources from small and medium to large enterprises has virtually no net effect on accuracy, because the gain in precision is offset by the creation of bias. On the other hand, this shift in resource allocation might improve the accuracy of temporal changes in turnover, because the creation of bias in both time points is annihilated, whereas the gain in precision is not. In addition, simple bootstrapping showed that replacing imputed values by observed values does not increase the variance. This suggests that there is no need to replace the current deterministic imputation method by a stochastic imputation method. Simple bootstrapping

Page 130: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

130

also suggested that small and medium enterprises may be more alike than large enterprises, but only in certain industries. 4.5 Conclusions and recommendations

In this chapter, we have discussed a number of possibilities for estimating the accuracy of statistics based on a combination of survey and admin data. We have cited studies comparing MSEs between a survey estimate and an admin data estimate, or combining them into a weighted sum, or developing composite estimates after integrating survey and admin data. We have discussed four situations where both sources are integrated, two of which were developed as case studies. In conclusion, the bootstrap re-sampling methods described above provide methods for estimating variance and bias, as well as giving insight into the sensitivity of mixed source statistics to non-sampling errors. Although note that there are significant difficulties in continuing to estimate the bias after cut-off sampling is implemented. Analytical methods can occasionally be used to estimate variance, but only for a limited range of estimators. Sensitivity analyses do not provide absolute estimates of accuracy, but can be used to compare outputs, and can assist in deciding where to invest editing resources. The normalised root mean square error is a useful measure to compare the accuracy of estimates between time points or domains. These re-sampling methods could be adapted to specific situations or needs. For example, the normally distributed measurement errors could be replaced by non-normally distributed errors. The methods could also be used to study the sensitivity of estimates to other sources of error, such as coverage errors, or to a combination of error sources. Another extension could be to assess the effect of source-specific error on the accuracy of changes over time in turnover. For instance, shifting classification resources from small and medium to large enterprises may increase the accuracy of turnover change, because the creation of bias in both time points is annihilated, whereas the gain in precision is not. Two important issues to bear in mind when using re-sampling methods are that the number of replications required can be computationally prohibitive in some instances, and the method for estimating bias by comparison to true data is difficult to maintain in the future.

Page 131: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

131

4.6 References

Aelen, F., 2004. Improving timeliness of industrial short-term statistics using time series analysis. Discussion paper 04005. Statistics Netherlands, The Hague/Heerlen. Cochran, W., 1977. Sampling Techniques. Wiley, New York. Demnati. A. and J.N.K. Rao, 2004. Linearisation variance estimators for survey data. Survey Methodology 30: 17–26. Demnati, A. and J.N.K. Rao, 2009. Linearisation variance estimation and allocation for two-phase sampling under mass imputation. Paper for the Federal Committee on Statistical Methodology Research Conference, 2-4 November, Washington DC. EDIMBUS, 2007. Recommended practices for editing and imputation in cross-sectional business surveys. Available at: http://epp.eurostat.ec.europa.eu/portal/page/portal/quality/documents/RPM_EDIMBUS.pdf. ESSnet Data Integration, 2011. State of the art on statistical methodologies for data integration. Final report WP1. http://www.cros-portal.eu/sites/default/files//FinalReport_WP1.pdf. Girard, C., 2009. The Rao-Wu rescaling bootstrap: from theory to practice. Paper for the Federal Committee on Statistical Methodology Research Conference, 2–4 November, Washington DC. Israëls, A., L. Kuijvenhoven, J. van der Laan, J. Pannekoek and E. Schulte Nordholt, 2011. Imputation. Statistical methods 201112. Statistics Netherlands, The Hague/Heerlen. Kuijvenhoven, L. and S. Scholtus, 2010. Estimating accuracy for statistics based on register and survey data. Discussion paper 10007. Statistics Netherlands, The Hague/Heerlen. Kuijvenhoven, L. and S. Scholtus, 2011. Bootstrapping combined estimator based on register and sample survey data. Discussion paper 201123. Statistics Netherlands, The Hague/Heerlen. Laitila, T. and A. Holmberg, 2010. Comparison of sample and register survey estimators via MSE decomposition. Paper for the European Conference on Quality in Official Statistics, 4–6 May, Helsinki. Little, R., 2012. Calibrating Bayes, an Alternative Inferential paradigm for Official Statistics. Journal of Official Statistics, 28, 3: 309-334. Moore, K., Brown, G. and Buccellato, T., 2008. Combining sources: a reprise. Paper for the CENEX-ISAD workshop Combination of surveys and admin data, 29–30 May, Vienna. Office for National Statistics, Annual Business Survey. Available at: http://www.ons.gov.uk/ons/guide-method/method-quality/specific/business-and-energy/annual-business-survey/index.html (accessed on 18/12/12). van der Stegen, R.H.M., 2005. Improving the quality of statistics of manufacturing turnover growth: timeliness and accuracy. Discussion paper 05002. Statistics Netherlands, The Hague/Heerlen.

Page 132: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

132

Sanderson R., J. Woods, T. Jones, G. Morgan, D. Lewis and J. Davies, 2012. Development of estimation methods for business statistics variables which cannot be obtained from admin sources. Report for Work Package 3 of the ESSnet on the Use of Admin and Accounts Data for Business Statistics. Available at: http://www.cros-portal.eu/content/admindata-sga-3. Scholtus, S. and B. Bakker, 2013. Estimating the validity of administrative and survey variables by means of structural equation models. Paper for the NTTS conference on New Techniques and Technologies for Statistics, 5–7 March, Brussels. de Waal, T., J. Pannekoek and S. Scholtus, 2012. Handbook of Statistical Data Editing and Imputation. Wiley, Hoboken. Wallgren, A. and B. Wallgren, 2007. Register-based Statistics. Administrative Data for Statistical Purposes. Wiley, Chichester. Zhang, L.C., 2012. Topics of statistical theory for register-based statistics and data integration. Statistica Neerlandica 66: 41–63.

Page 133: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

133

4.7 Appendix A

Effect of number of simulations on distribution of simulated estimates for total turnover per release and industry. Mean ± SD (thick bars) and 95%-confidence intervals (thin bars) from simulations of source-specific representation error using scenario 1.

Page 134: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

134

Chapter 5: Conclusions

Administrative data are increasingly being used by National Statistical Institutes (NSIs) in the production of their statistics. This has resulted in the need to develop and disseminate ‘best practice’ and recommendations which differ from those for the production of statistics based on survey data. The ESSnet Admin Data was established to do just that and WP6 specifically was designed to address the lack of quality indicators and guidance within this field. As set out in previous chapters, the purpose of this project was not to reinvent the wheel but to build on work already in place, with a particular focus on developing quantitative quality indicators for business statistics involving admin data. This work is for the benefit of the members of the ESS, and producers of statistics more widely. Thus, our hope is that the end result of the ESSnet Admin Data’s work in this area should be integrated with the work already underway on quality within NSIs and by Eurostat. It is important to note that this work focussed specifically on quality indicators for statistics involving admin data. Indicators that can be applied in the same way when using admin or survey data have been excluded as work on this is already in place. This project addressed the quality of the statistical output, taking input and process into account because input and process are critical to the work of NSIs and it is the input and process in particular that differ when using admin data. This focus on aiding NSIs has been central to the work of WP6. As set out in Chapter 1, broader work to establish cross-European measures of output quality were outside the scope of this work. The previous chapters of this document include the main outcomes of the WP6 part of the ESSnet Admin Data project. In addition to the development of these indicators and the associated guidance, WP6 has also engaged others to assess its work. As such, Statistics Portugal (INE) and a colleague from Statistics Norway and Southampton University (Li-Chun Zhang) have worked with the WP6 team to test the applicability and useability of the indicators and guidance within diverse statistical production contexts. This testing covered SBS and STS considerations in NSIs at very different stages in their use of admin data. The outcomes of this testing have resulted in further developments of the WP6 outputs which are reflected in this document17. They have also identified some areas for potential further developments which are outlined below. In June 2013, WP6 also circulated the output of its work to members of the ESS requesting feedback and comments. Feedback was received from 32 different respondents covering 19 Member States. The feedback was generally very positive and is summarised below. The members of the ESS also provided some suggestions for improvements to the guidance and identified potential future developments. As many as possible of these improvements have been incorporated in the final version of the ESSnet Admin Data guidance included within the earlier chapters of this document.

17 The testing reports are available (with limited access) on the ESSnet Information Centre.

http://essnet.admindata.eu/WikiEntity?objectId=5452

The findings of the reports have largely been addressed within this document

Page 135: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

135

Basic quality indicators Chapter 2 of this document contains the list of basic quality indicators (numeric and descriptive) as well as a framework for examples and individual examples for each quantitative indicator (tailored lists for Structural Business Statistics (SBS) and Short Term (business) Statistics (STS) are included in Annex 1) to aid understanding and implementation. In-depth testing Testing of this list of indicators revealed that they provided a useful ‘checklist’ of the major determinants of quality and that the process of calculating them can help identify the strengths and weaknesses of an output which uses admin data. The indicators were informative and comprehensive, although some initial work is required before they can be used for the first time. This involves clearly defining the domain (eg key variables) to be examined and then there may be some work to alter or adapt the data before the basic quality indicators can be calculated. However, once this initial work to calculate the indicators is completed (eg programmed in SAS), they are easy to re-run and were thought to be a useful tool in investigating the quality of the statistical output. In general, the NSIs involved in the in-depth testing were able to calculate the indicators and many were informative and raised issues for consideration. There was also an understanding that not all indicators are applicable to all situations and nor are they designed to be. Where difficulties were encountered, many of these were due to the progressive nature of statistical data based on admin sources. Consequently, further guidance on this issue has been provided in Appendix A of Chapter 2 – which outlines a number of challenges that producers of statistics need to consider when dealing with admin data. ESS Feedback Feedback elicited from across the ESS revealed that over 90% of respondents felt that the quality indicators were understandable and 88% expected the list to be useful in their measuring or assessing of quality. 66% anticipated using the indicators in their work. Where improvements were identified, this feedback has largely been taken on board within Chapter 2 of this document and thus these comments are not discussed further here. Individual comments received summarised this chapter as:

“A comprehensive list that will clearly be of use for the relevant subject matter areas.”

“The indicators are described clearly and understandably. Most of them are well-illustrated by examples.”

“The list of indicators gives a good framework of what can be used for assessing the quality of admin data.”

“Clear structure with short definitions, comprehensive explanations and good examples.”

“The document is very fruitful and helpful.” Composite quality indicators Chapter 3 sets out a simple method to create composite quality indicators for statistics involving admin data. Guidance on how to calculate composite indicators for a number of quality dimensions is included. The aim is to assist producers and users in understanding whether the quality attributes of particular outputs are acceptable or unacceptable. We hope

Page 136: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

136

that these composite indicators will be used to enable producers to improve the quality of their outputs by providing a more holistic insight into their quality. In-depth testing: Testing of the guidance on calculating composite indicators found that the method is robust in terms of simplicity and transparency, and provides holistic indicators of the quality for the different themes. The main difficulty experienced with the composite quality indicators was the challenge of setting meaningful reference values. The guidance does suggest ways of establishing these values but it is clear that composite indicators should only be calculated when producers have some idea of the range of values which might be considered acceptable for an indicator (eg based on existing survey data or from experience of producing the output). If this is not possible, then composite indicators should not be calculated. In summary, despite the initial work involved in calculating the basic quality indicators and setting the reference values to calculate the composite quality indicators, the testing revealed that the benefits of doing this work were considerable. ESS Feedback Feedback from across the ESS in July 2013 revealed that over 85% of respondents felt that the guidance was useful although there was some uncertainty over the added value of the composite indicators over and above the individual basic indicators. As such, a smaller proportion (36% of respondents) anticipated using the composite indicators. Despite this, the feedback received was positive:

“The discussion of composite indicators has been going on for a long time now. Your presentation shows one of the most useful and understandable descriptions I have seen. It will also provide valuable input for the discussion of composite indicators also for ‘normal’ surveys.”

“The guidance is useful, but it seems to be an extra burden for NSIs to develop other indicators.”

Where issues or lack of clarity were identified, this has been addressed in Chapter 3 of this document. Quality guidance relating to mixed source outputs Chapter 4 focuses on the important area of accuracy when using mixed source designs. Specifically, it looks at ways of measuring the bias and variance (Mean Squared Error) surrounding estimates involving survey and admin data. In-depth testing Testing of this guidance was limited, given its more complex nature and the more theoretical considerations required. Nevertheless, the feedback received identified the usefulness of this guidance, if the NSI is in the position to utilise it. We hope that this guidance will be particularly useful in highlighting areas for consideration and challenges that producers of statistics may face in their developments. Some additional potential areas for investigation are identified in Section 4.5 of this document. ESS Feedback The ESS feedback revealed mixed feelings about this chapter. Almost 85% of respondents found the guidance useful and the vast majority of those (90%+) found the case studies particularly useful in understanding the issues discussed. Around half of respondents (54%)

Page 137: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

137

anticipated using the guidance in considering the quality of mixed source statistical outputs. Of those who did not, the main reasons were that this work seemed to have a more “academic approach, compared to chapters 2 and 3.” Similarly, another respondent stated:

“Some of the proposed quality indicators are too complicated and it is difficult to be used in everyday work.”

Nevertheless, feedback on this area was generally positive and comments included:

“It is very important, because the necessity of using administrative data is increasing and therefore methods examined in this document are very actual.”

“The explanations given in chapter 4 are certainly interesting, because they fit to (our) situation. To assess, whether the methods for measuring accuracy explained in the document are transferable and can be used in practice, it is necessary to go deeper into the subject. But the chapter gives a good introduction to what is possible to do.”

This section also provides producers of statistics with new ways of considering the accuracy of mixed source statistics. Although this is not the approach commonly implemented, it is a useful consideration and may be a first step, particularly given the increasingly complex combinations of survey and admin data which are being implemented across Europe and beyond. Overall Assessing the quality of statistics involving admin data is a challenging problem. We feel that the basic, composite and mixed source quality indicators presented in this report are an important step in the right direction and should provide a useful resource for the members of the ESS. The basic indicators provide an informative and fairly comprehensive ‘checklist’ of the determinants of quality in an output which uses admin data. The composite indicators are fairly simple, transparent and account for uncertainty surrounding quality thresholds, but should only be used when these quality thresholds can be meaningfully defined. The mixed source indicators are useful examples of Mean Squared Error estimators in particular contexts and set out a potential new way of assessing quality, which may also be the first step in establishing cross-European quality indicators – a challenge that will require considerable further work. Feedback from across the ESS has suggested that the work of the ESSnet Admin Data in this area is welcomed by NSIs who are often struggling to incorporate and consider the increasingly available admin data within their existing systems: “By developing and disseminating good practices, countries can implement more efficient and effective ways of producing statistics using administrative data.” In feedback, the WP6 document has been referred to as “valueable”, “very useful” and “a successful piece of work”. Respondents also appreciated the pragmatic approach of the document (especially chapters 2 and 3) and commented that this will provide the “ability to use these methods practically.” There was a range of views on whether the document was too long and should be divided into its component parts. To address this, elements of the WP6 outputs will be available individually but this overall document will also be available to provide the advantages outlined by one respondent: “A substantial merit of the document is that the existing findings and new thoughts are brought together in one central document.”

Page 138: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

138

Overall, respondents felt that the WP6 outputs provided them with a useful starting point for further work in this area and ensured that they considered the range of issues involved in using admin data in the production of statistical outputs. Not all of the indicators or guidance are applicable to all outputs or NSIs, so it will be crucial for producers to select those that are most applicable and relevant to their work. Although the various deliverables of WP6 of the ESSnet Admin Data have been positively received, further developments and enhancements are possible, particularly in terms of changing the underlying assumptions and systems within NSIs. With the increasing use of admin data and increasing integration of sources (survey and admin), it will be important that the output of this project is integrated with others to ensure broader quality considerations are addressed and progressed. For example, integration with the other outputs of the ESSnet Admin Data (WPs 2, 3, 4 and 7) as well as other work, such as the BLUE-ETS project. More broadly still, there is the need to integrate with work on the quality of survey data to ensure a more holistic approach to the quality of ESS statistics is undertaken, irrespective of the source of the data. We hope that the work of the ESSnet Admin Data will be a significant step in this direction and the work of ongoing and future European collaborations will be able to facilitate and effectively implement further developments.

Page 139: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

139

Annex 1: Tailored lists of indicators

Annex 1a – SBS indicators

ESSNET

USE OF ADMINISTRATIVE AND ACCOUNTS DATA

IN BUSINESS STATISTICS

WP6 Quality Indicators when using Administrative Data

in Statistical Outputs

Tailored list of basic quality indicators: Structural Business Statistics (SBS)

July, 2013

Page 140: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

140

1. Introduction

With the increasing use of administrative data in the production of business statistics comes the challenge for statistical producers of how to assess quality. The European Statistical System network (ESSnet) project on the Use of Admin and Accounts data in Business Statistics was established to develop best practice in the use of admin data for business statistics. One work package of the ESSnet Admin Data focusses on quality and has developed quality indicators in this area. The current document provides a list of basic quality indicators specifically in relation to the use of admin data. More generic considerations of quality are available (see European Commission, Eurostat, 2007, for the Handbook on Data Quality Assessment Methods and Tools) but these have not specifically considered quality in the context of the increasing use of admin data – which has an impact on quality as not all the attributes of the quality framework can be applied in the same way to statistics involving admin data. Both quantitative and qualitative indicators are included in this list, which focusses on assessing the quality of the statistical output, taking the input and process into consideration. To aid statistical producers in their use of this list, tailored versions have been developed for the main statistical regulations: Structural Business Statistics – SBS – and Short Term (Business) Statistics – STS. This is in order to aid the understanding and application of this work within these areas. This document is the list of indicators including SBS specific examples.

Page 141: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

141

2. A Quick Guide to the Quality Indicators

What are the quality indicators?

The European Statistical System network project on admin data (ESSnet Admin Data) has developed a list of quality indicators, for use with business statistics involving admin data. The indicators provide a measure of quality of the statistical output, taking input and process into account. They are based on the ESS dimensions of statistical output quality and other characteristics considered within the ESS Handbook for Quality Reports18. Who are they for?

The list of quality indicators has been developed primarily for producers of statistics, within the ESS and more widely. The indicators can also be used for quality reporting, thus benefiting users of the statistical outputs. They provide the user with an indication of the quality of the output, and an awareness of how the admin data have been used in the production of the output. When can they be used?

The list of quality indicators is particularly useful for two broad situations: 1. When planning to start using admin data as a replacement for, or to supplement,

survey data. In this scenario, the indicators can be used to assess the feasibility of increasing the use of admin data, and the impact on output quality.

2. When admin data are already being used to produce statistical outputs. In this scenario, the indicators can be used to gauge and report on the quality of the output, and to monitor it over time. Certain indicators will be suitable to report to users, whilst others will be most useful for the producers of the statistics only.

How should they be used?

There are 23 basic quantitative quality indicators and 46 qualitative quality indicators in total, but not all indicators will be relevant to all situations. Therefore, a statistical producer should select the indicators relevant to its output. The table in Section 3.3 shows which of the quantitative indicators relate to which dimension or ‘theme’ of quality, which may be useful in identifying which indicators to use Indicators 1 to 8 are background indicators, which provide general information on the use of admin data in the statistical output in question but do not, directly, relate to the quality of the statistical output.

Indicators 9 to 23 provide information directly addressing the quality of the statistical output.

18 More information on the ESS Handbook for quality reports can be found here:

http://epp.eurostat.ec.europa.eu/portal/page/portal/product_details/publication?p_product_code=KS-RA-08-016

Page 142: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

142

3. Quality Indicators when using Administrative Data in Statistical Outputs

3.1 Quantitative quality indicators

One of the aims of the ESSnet Admin Data is the development of quality indicators for business statistics involving admin data, with a particular focus on developing quantitative quality indicators and qualitative indicators to complement them.

Some work has already been done in the area of quality of business statistics involving admin data and some indicators have been produced. However, the work conducted thus far refers to qualitative indicators or is based more on a descriptive analysis of admin data (see Eurostat, 2003). The quantitative indicators that have been produced have been more to do with the quality of the admin sources (Daas, Ossen & Tennekes, 2010) or have been to develop a quality framework for the evaluation of admin data (Ossen, Daas & Tennekes, 2011) 19. These do not address the quality of the production of the statistical output however. In fact, almost no work has been done on quantitative indicators of business statistics involving admin data, which is the main focus of this project (for further discussion on this topic see Frost, Green, Pereira, Rodrigues, Chumbau & Mendes, 2010).

The ESSnet aims to develop quality indicators of statistical outputs that involve admin data. These indicators are for the use of members of the European Statistical System; producers of statistics. Therefore, the list contains indicators on input and process because these are critical to the work of the National Statistical Institutes and it is the input and process in particular that are different when using admin data. Moreover, the list of indicators developed is specifically in relation to business statistics involving admin data. Indicators (e.g. on accessibility) that do not differ for admin vs. survey based statistics are not included in this work because they fall outside the remit of the ESSnet Admin Data project.

To address some issues of terminology, a few definitions are provided below to clarify how these terms are used in this document and throughout the ESSnet Admin Data.

What is administrative data? Administrative data are the data derived from an administrative source, before any processing or validation by the NSIs.

What is an administrative source? An administrative source is a data holding containing information collected and maintained for the purpose of implementing one or more administrative regulations. In a wider sense, any data source containing information that is not primarily collected for statistical purposes.

Further information on terminology and useful links to other, related work is available on the ESSnet Admin Data Information Centre20.

A list of quantitative quality indicators has been developed on the basis of research which took stock of work being conducted in this field across Europe21. This list was

19 More information on the BLUE-ETS project and the associated deliverables can be found here:

http://www.blue-ets.istat.it/index.php?id=7 20 ESSnet Admin Data Glossary: http://essnet.admindata.eu/Glossary/List

ESSnet Admin Data Reference Library: http://essnet.admindata.eu/ReferenceLibrary/List

Page 143: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

143

then user tested within five European NSIs, before testing across Member States22. Feedback from this testing was used to improve the list of quality indicators during its development (2010/11). The entry for each quantitative indicator is self-contained in the attached list (see Section 4), including a description, information on how the indicator can be calculated and one or two examples. As this document is tailored to aid producers involved in the SBS regulation, all the examples are in this domain. Qualitative (or descriptive) indicators have also been developed to complement the quantitative indicators and are included in Section 4. Further information on the qualitative indicators is included in Section 3.2. The quantitative indicators have been developed so that a low indicator score denotes high quality, and a high indicator score denotes low quality. This is consistent with the concept of error, where high errors signify low quality. In essence, the indicators measure quality risks – for example, the higher the level of non-response, the higher the risk to the quality of the output. The exceptions to this rule are the background indicators (1 to 8), where the score provides information rather than a quality ‘rating’; and indicators 20 and 23, where a high indicator score denotes high quality, and a low indicator score denotes low quality. Examples are also given for weighted indicators, for example weighting the indicators by turnover or number of employees. Caution needs to be taken when considering these weighted indicators in case of bias caused by the weighting. A framework for the basic quantitative quality indicator examples

The calculation of an indicator needs some preliminary steps. Some or all of the following steps will be used for each example of the indicators to ensure consistency of the examples, and to aid understanding of the indicators themselves. A simple framework to aid calculating the quantitative indicators is included here:

A. Define the statistical output B. Define the relevant units C. Define the relevant variables D. Adopt a schema for calculation E. Declare the tolerance for quantitative and qualitative variables

The list is not created so that producers calculate all indicators but rather as a means of enabling producers to select those that a most relevant to their output. Not all indicators will apply in all situations and it is not recommended that they are all calculated on an ongoing, regular basis. Whilst, some may be useful for exactly this purpose, others may only be used when considering options for increasing the use of

21 A summary of the main findings of this stock take research (Deliverable 2010/6.1) is available on the ESSnet

Infomration Centre here: http://essnet.admindata.eu/WikiEntity?objectId=4696 22 The outcome of this testing is reported on the ESSnet Information Centre (included within the SGA 2010 final

report) and is available here: http://essnet.admindata.eu/WikiEntity?objectId=4751

A summary of the 2011 user testing is reported in the Phase 2 User Testing Report, available here:

http://essnet.admindata.eu/WikiEntity?objectId=4696

Page 144: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

144

admin data or when undergoing or evaluating changes in the process of producing the statistical output.

Links between this and other work on Quality

The work being carried out under this project should not be seen as independent of other work already in place. When analysing the list of indicators, one can conclude that some other information is useful in regard to the quality of the output. However, some of that very useful information is not specific to the use of admin data and thus is out of scope for the work of this ESSnet. This work is for the benefit of the members of the European Statistical System (ESS); the producers of statistics. Consequently, the end result of the ESSnet Admin Data work in this area should be integrated with the work already in place in NSIs and Eurostat.

3.2 Qualitative quality indicators While much of the focus of the ESSnet Admin Data work on quality has been around the development of quantitative quality indicators, the project also required the development of qualitative quality indicators to complement the quantitative indicators. Quantitative and qualitative indicators can be thought of as numerical and descriptive quality indicators respectively: the quantitative indicators provide a numerical measure around the quality of the output, whereas the qualitative indicators provide further descriptive information that cannot be obtained from observing a numerical value. Many of the qualitative indicators have been taken from a UK document entitled ‘Guidelines for Measuring Statistical Output Quality’, which serves as a comprehensive list of quality measures and indicators for reporting on the quality of a statistical output. Others have been developed as part of the work of the ESSnet Admin Data. Beneath each quantitative indicator in Section 4 is a table which displays any potentially relevant qualitative indicators, a description of each indicator and the quality theme with which they are associated. Some of the qualitative indicators are repeated in Section 4 as they are related to more than one quantitative indicator. Appendix A contains a complete list of all qualitative indicators, grouped by theme, and also references the quantitative indicators to which they have been linked in Section 4.

Page 145: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

145

3.3 Using the list of quality indicators

The list of indicators has been grouped into two main areas:

1. Background Information – these are ‘indicators’ in the loosest sense. They provide general information on the use of admin data in the statistical output in question but do not, directly, relate to the quality of the statistical output. This information is often crucial in understanding better those indicators that measure quality more directly.

2. Quality Indicators – these provide information directly addressing the quality of the statistical output.

The background information indicators and the quality indicators are further grouped by quality ‘theme’. These quality themes are based on the ESS dimensions of output quality, with some additional themes which relate specifically to admin data and are consistent with quality considerations as outlined in the ESS Handbook on Quality Reports. The quality themes are:

Quality theme Description

Relevance

Relevance is the degree to which statistical outputs meet current and potential user needs. Note: only a subset of potential relevance quality indicators are considered within this document given the scope of the ESSnet project (eg. differences between statistical and admin data definitions). All relevance indicators are qualitative.

Accuracy The closeness between an estimated result and the unknown true value.

Timeliness and punctuality The lapse of time between publication and the period to which the data refer, and the time lag between actual and planned publication dates.

Comparability The degree to which data can be compared over time and domain.

Coherence The degree to which data that are derived from different sources or methods, but which refer to the same phenomenon, are similar.

Page 146: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

146

Other relevant considerations

Cost and efficiency The cost of incorporating admin data into statistical systems, and the efficiency savings possible when using admin data in place of survey data.

Use of administrative data Background information relating to admin data inputs.

The following table shows which quantitative indicators are relevant to each of the

quality themes.

Quality theme Quantitative indicators relevant to that theme

Accuracy 9, 10, 11, 12, 13, 14, 15, 16, 17.

Timeliness and punctuality

4, 18.

Comparability 19.

Coherence 5, 6, 20, 21.

Cost and efficiency 7, 8, 22, 23.

Use of administrative

data 1, 2, 3.

Reminder:

The quantitative indicators have been developed so that a low indicator score denotes high quality, and a high indicator score denotes low quality. Thus, the indicators measure quality risks – for example, the higher the level of non-response, the higher the risk to the quality of the output. The exceptions to this rule are the background indicators (1 to 8), where the score provides information rather than a quality ‘rating’; and indicators 20 and 23, where a high indicator score denotes high quality, and a low indicator score denotes low quality.

Each individual indicator will not apply in all situations. The list is not created so that producers calculate all indicators but rather as a means of enabling producers to select those that a most relevant to their output.

Page 147: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

147

4.1 Background Information (indicators)

Use of administrative data:

1 Number of admin sources used

Description This indicator provides information on the number of administrative sources used in each statistical output. The number of sources should include all those used in the statistical output whether the admin data are used as raw data, in imputation or to produce estimates. In general, administrative data sources used for updating base registers (where available) should not be included in this indicator.

How to calculate Note. Where relevant, a list of the admin sources may also be helpful for users, along with a list of the variables included in each source. Alternatively, the number of admin sources used can be specified by variable.

Note: all examples use the relevant parts of the examples framework set out in Section 3.1.

Example A. Statistical output: Annual structural data on performance in industry sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Personnel costs; Number of employees

D. Steps for calculation: Identify all the relevant administrative sources

Let S1 be the Balance Sheet source. Let S2 be the Social Security source. I(1)=2 (two sources)

For further clarification on

terminology and definitions of

terms used, please refer to the

Glossary included in Appendix C.

Page 148: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

148

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

A - Name each admin source used as an input into the statistical product and describe the primary purpose of the data collection for each source

Relevance Name all administrative sources and their providers. This information will assist users in assessing whether the statistical product is relevant for their intended use. Providing information on the primary purpose of data collection also enables users to assess whether the data are relevant to their needs.

B - Describe the main uses of each of the admin sources and, where possible, how the data relate to the needs of users

Relevance Include all the main statistical processes and/or outputs known to require data from the administrative source This indicator should also capture how well the data support users’ needs. This information can be gathered from user satisfaction surveys and feedback.

Page 149: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

149

2 % of items obtained exclusively from admin data

Description This indicator provides information on the proportion of items only obtained from admin data, whether directly or indirectly, and where survey data are not collected. This includes where admin data are used as raw data, as proxy data, in calculations, etc. This indicator should be calculated on the basis of the statistical output – the number of items obtained exclusively from admin data (not by survey) should be considered.

How to calculate

%100 items of no. Total

dataadmin fromy exclusivel obtained items of No.

This indicator could also be weighted in terms of whether or not the variables are key to the statistical output.

Example

A. Statistical output: Annual structural data on performance in industry sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover; Personnel costs; Wages and salaries

D. Steps for calculation:

D1. For the relevant variables, calculate the number of items for which the variable is obtained exclusively from Admin data (items with non-missing values)

D2. Divide the sum of number of items whose variables are obtained exclusively from Admin Data by the sum of numbers of items where variables are not missing

D3. Calculate the indicator as follows:

%100)2( items of no. Total

data admin from yexclusivel obtained items of No.I

The Balance Sheet source is available only for companies, so the items of enterprises which are not companies and the items of companies for which admin data are missing are taken from a survey.

Page 150: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

150

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

A - Name each admin source used as an input into the statistical product and describe the primary purpose of the data collection for each source

Relevance Name all administrative sources and their providers. This information will assist users in assessing whether the statistical product is relevant for their intended use. Providing information on the primary purpose of data collection also enables users to assess whether the data are relevant to their needs.

B - Describe the main uses of each of the data sources and, where possible, how the data relate to the needs of users

Relevance Include all the main statistical processes and/or outputs known to require data from the administrative source This indicator should also capture how well the data support users’ needs. This information can be gathered from user satisfaction surveys and feedback.

AQ – Describe reasons for significant overlap in admin data and survey data collection for some items.

Cost and efficiency

Where items are not obtained exclusively from admin data, reasons for the overlap between admin data and survey data should be described.

I(2) Turnover = (4/7)*100 = 57.1%

I(2) Personnel costs = (3/6)*100 = 50.0%

I(2) Wages and salaries = (5/7)*100 = 71.4%

I(2) Total = [(4+3+5)/(7+6+7)]*100 = 60.0%

Units (1) (2) (3) (4) (5) (6) (7) (8) (9) X1 1,540,362 1 95,632 1 66,942 1 X2 96,321 0 0 0 0 0 X3 15,236,300 16,589,630 0 589,641 622,300 0 422,756 466,725 0 X4 20,360,200 20,360,200 0 986,577 1,032,530 0 700,012 1 X5 145,200 0 19,650 0 15,023 0 X6 154,063 1 0 19,250 1 X7 845,630 1 89,640 1 61,589 1 X8 256,300 0 52,321 0 39,240 0 X9 158,463 0 0 0 0 0 X10 8,564,030 1 518,600 1 364,000 1 X11 19,856,320 20,031,250 0 1,754,890 1,754,890 0 1,228,534 1,316,190 0 Sum 66,556,905 57,637,364 4 4,034,980 3,481,691 3 2,863,083 1,837,178 5

Wages and salaries (Survey)

Wages and salaries: Items in (7) and not in (8)

Turnover: Items in (1) and not in (2)

Personnel costs: Items in (4) and not in (5)

Personnel costs (Balance Sheet Source)

Turnover (Balance Sheet Source)

Wages and salaries (Balance Sheet Source)

Turnover (Survey)

Personnel costs (Survey)

Page 151: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

151

3 % of required variables which are derived using admin data as a proxy

Description This indicator provides information on the extent that admin data are used in the statistical output as a proxy or are used in calculations rather than as raw data. A proxy variable can be defined as a variable that is related to the required variable and is used as a substitute when the required variable is not available. This indicator should be calculated on the basis of the statistical output – the number of required variables derived indirectly from admin data (because not available directly from admin or survey data) should be considered.

How to calculate

%100variables required of No.

proxy a as data admin using derived are whichvariables required of No.

Note. If a combination of survey and admin data is used, this indicator would need to be weighted (by number of units). If double collection is necessary (e.g. to check quality of admin data), some explanation should be provided. This indicator could also be weighted in terms of whether or not the variables are key to the statistical output.

Example

A. Statistical output: Annual structural data on performance in trade sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover; Personnel costs; Wages and salaries; Number of persons employed; Number of employees; Number of employees in full time equivalent; Production value.

D. Steps for calculation:

D1. Number of required variables (Denominator) D2. Number of required variables used as a proxy (Numerator) Calculate the indicator as follows:

%100)3( variables required of No.

proxy a as data admin using derived are whichvariables required of No.I

Let S1 be the Balance Sheet source. From this source we can obtain directly the variables Personnel costs; Wages and salaries; and Production value. Let S2 be the VAT Turnover source. From this source we can obtain indirectly Turnover, using VAT turnover as proxy. Let S3 be the Social Security source: from this source we can obtain directly the number of employees and indirectly the Number of employees in full time equivalent. Let S4 be the Shareholders and Associates Data Bank: from this source we can obtain indirectly the variable Number of Self Employed which is a component of the variable Number of persons employed.

Page 152: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

152

So I(3)=(3/7)*100=42.9%

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

C – Describe the extent to which the data from the administrative source meet statistical requirements

Relevance Statistical requirements of the output should be outlined and the extent to which the administrative source meets these requirements stated. Gaps between the administrative data and statistical requirements can have an effect on the relevance to the user. Any gaps and reasons for the lack of completeness should be described, for example, if certain areas of the target population are missed or if certain variables that would be useful are not collected. Any methods used to fill the gaps should be stated.

D – Describe constraints on the availability of administrative data at the required level of detail

Relevance Some administrative microdata have restricted availability or may only be available at aggregate level. Describe any restrictions on the level of data available and their effects on the statistical product.

E – Describe reasons for use of admin data as a proxy.

Relevance Where admin data are used in the statistical output as a proxy or are used in calculations rather than as raw data, information should be provided in terms of why the admin data have been used as a proxy for the required variables.

Page 153: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

153

Timeliness and punctuality:

4 Periodicity (frequency of arrival of the admin data)

Description This indicator provides information about how often the admin data are received by the NSI. This indicator should be provided for each admin source.

How to calculate Note. If data are provided via continuous feed from the admin source, this should be stated in answer to this indicator. Only data you receive for statistical purposes should be considered.

Example

A. Statistical output: Annual data on structure and competitiveness of Construction sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover; Production value; Number of employees.

D. Steps for calculation: Assess the frequency of arrival of data relating to the reference period of the statistical data and used to construct the output

Let A be the VAT Turnover source (to obtain Turnover Variable). Let B be the Balance Sheet source (to obtain Production value). Let C be the Social Security source (to obtain Number of employees).

I(4)A=2; I(4)B=1; I(4)C=2

Source

Frequency of arrival

(per year=reference

period):

A 2

B 1

C 2

Page 154: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

154

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

G – Describe the timescale since the last update of data from the administrative source

Timeliness An indication of the timescale since the last update from administrative sources will provide the user with an indication of whether the statistical product is timely enough to meet their needs.

H – Describe the extent to which the administrative data are timely

Timeliness Provide information on how soon after their collection the statistical institution receives the administrative data. The effects of any lack of timeliness on the statistical product should be described.

I – Describe any lack of punctuality in the delivery of the administrative data source

Timeliness Give details of the time lag between the scheduled and actual delivery dates of the data. Any reasons for the delay should be documented along with their effects on the statistical product.

Page 155: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

155

Coherence:

5 % of common units across two or more admin sources

Description

This indicator relates to the combination of two or more admin sources. This indicator provides information on the proportion of common units across two or more admin sources. Only units relevant to the statistical output should be considered. This indicator should be calculated pairwise for each pair of admin sources and then averaged. If only one admin source is available, this indicator is not relevant.

How to calculate

%100 units uniquerelevant of No.

sourcesadmin in the unitscommon relevant of No.

Note. The “unique units” in the denominator means that units should only be counted once, even if they appear in multiple sources. This indicator should be calculated separately for each variable. If the sources are designed to cover different populations and are combined to provide an overall picture, this should be explained. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution of these units to the statistical output.

Example A. Statistical output: Annual data on structure and competitiveness of insurance

enterprises B. Relevant units: Units in the statistical population

D. Steps for calculation:

D1: Identify the statistical units for each source (i.e. group the administrative records in one source at identification code level)

D2: Match each source with each other by identification code (if available) or by other methods

D3: Attribute a Presence(1)/Absence(0) indicator to the unit with regard to the specific source

D4: Calculate the number of possible pairings between sources (i.e. when there are n sources it is the combinations of n sources taken 2 at a time):

Cn,2=n!/2!*(n-2)! Cn=n/2*(n-1) D5. Multiply the Presence(1)/Absence(0) indicator to obtain the Presence(1)/Absence(0)

indicator for each pair D6. Sum up the Presence(1)/Absence(0) indicators at pair level and divide by Cn* no. of relevant units (m) Let A be the Balance Sheet source. Let B be the Isvap source (Supervisory Authority for the Insurance companies).

Page 156: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

156

Let C be the Chamber of Commerce source.

n = 3, m = 11 Numerator = 7 + 2 + 3 = 12 Denominator = Cn * no.of relevant units (m) Cn = n/2*(n-1) = 3/2*(3-1) = 3 No. of relevant units = 11 Denominator = 3 * 11 = 33 I(5) = 12 / 33 = 36.4% Weighted for turnover: I(5)w = [(29,773,475 + 20,221,727 + 20,915,290)/(3*31,022,657)]*100 = 76.2% Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

O – Describe the common identifiers of population units in administrative data

Coherence Different administrative sources often have different population unit identifiers. The user can utilise this information to match records from two or more sources. Where there is a common identifier, matching is generally more successful.

U – Describe the record matching methods and processes used on the administrative data sources

Accuracy Record matching is when different administrative records for the same unit are matched using a common, unique identifier or key variables common to both datasets. There are many different techniques for carrying out this process. A description of the technique (e.g. automatic or clerical matching) should be provided along with a description (qualitative or quantitative) of its effectiveness.

Presence in

source A

(1/0)

Presence in

source B

(1/0)

Presence in

source C

(1/0) Turnover

Presence

in A∩B

Presence

in A∩C

Presence

in B∩C

Units (1) (2) (3) (4) (5) (6) (7) (8)=(4)*(5) (9)=(4)*(6) (10)=(4)*(7)

X1 1 1 1 1,526,365 1 1 1 1,526,365 1,526,365 1,526,365

X2 0 1 0 232,654 0 0 0 0 0 0

X3 1 1 0 596,325 1 0 0 596,325 0 0

X4 1 1 0 3,658,960 1 0 0 3,658,960 0 0

X5 1 1 0 4,658,963 1 0 0 4,658,963 0 0

X6 1 1 1 18,695,362 1 1 1 18,695,362 18,695,362 18,695,362

X7 0 1 0 256,985 0 0 0 0 0 0

X8 1 1 0 487,500 1 0 0 487,500 0 0

X9 0 1 1 693,563 0 0 1 0 0 693,563

X10 0 1 0 65,980 0 0 0 0 0 0

X11 1 1 0 150,000 1 0 0 150,000 0 0

Sum 7 11 3 31,022,657 7 2 3 29,773,475 20,221,727 20,915,290

Page 157: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

157

6 % of common units when combining admin and survey data

Description This indicator relates to the combination of admin and survey data. This indicator provides information on the proportion of common units across admin and survey data. Linking errors should be detected and resolved before this indicator is calculated. This indicator should be calculated for each admin source and then aggregated based on the number of common units (weighted by turnover) in each source. (For information on the progressive nature of administrative data, see Appendix B.)

How to calculate

%100survey in units of No.

datasurvey andadmin in unitscommon of No.

Note. If there are few common units due to the design of the statistical output (e.g. a combination of survey and admin data), this should be explained. If the sources are designed to cover different populations and are combined to provide an overall picture, this should also be explained. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution of these units to the statistical output.

Example A. Statistical output: Annual data on structure and performance of credit institutions B. Relevant units: Units in the survey

D. Steps for calculation:

D1. Match each source with survey(s) by the common identification code (if available) or by other methods

D2. Attribute a Presence(1)/Absence(0) indicator to the unit if it belongs at least to one survey (sum up to obtain denominator)

D3. Attribute a Presence(1)/Absence(0) indicator to the unit if it belongs both to the survey and to each source (sum up by source to obtain numerator)

D4. Calculate the indicator as follows:

%100)6( survey in units of No.

data survey and admin in units common of No.I

Let A be the Yellow Pages source. Let B the National Bank source.

Page 158: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

158

I(6)=[(1+5-1)/5]*100=100%

Weighted by number of employees:

I(6)W=[(32,584+39,272-32,584)/39,272]*100=100%

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

O - Describe the common identifiers of population units in administrative data

Coherence

Different administrative sources often have different population unit identifiers. The user can utilise this information to match records from two or more sources. Where there is a common identifier, matching is generally more successful.

U – Describe the record matching methods and processes used on the administrative data sources

Accuracy Record matching is when different administrative records for the same unit are matched using a common, unique identifier or key variables common to both datasets. There are many different techniques for carrying out this process. A description of the technique (e.g. automatic or clerical matching) should be provided along with a description (qualitative or quantitative) of its effectiveness.

Units (1) (2) (3) (4)=(1)*(2) (5)=(1)*(3) (6)=(1)*(2)*(3) (7) (8)=(4)*(7) (9)=(5)*(7) (10)=(6)*(7) (11)=(1)*(7)

X1 1 0 1 0 1 0 1,899 0 1,899 0 1,899

X2 0 1 1 0 0 0 249 0 0 0 0

X3 0 1 1 0 0 0 186 0 0 0 0

X4 0 0 1 0 0 0 48 0 0 0 0

X5 0 0 1 0 0 0 225 0 0 0 0

X6 1 0 1 0 1 0 1,536 0 1,536 0 1,536

X7 1 0 1 0 1 0 2,986 0 2,986 0 2,986

X8 1 1 1 1 1 1 32,584 32,584 32,584 32,584 32,584

X9 0 0 1 0 0 0 69 0 0 0 0

X10 1 0 1 0 1 0 267 0 267 0 267

X11 0 0 1 0 0 0 25 0 0 0 0

X12 0 1 1 0 0 0 46 0 0 0 0

Sum 5 4 12 1 5 1 40,120 32,584 39,272 32,584 39,272

Presence in

Source B ∩

Survey

Number of

employees

Presence in

Source A ∩ B

∩ Survey

Presence

in survey

Presence

in Source

A

Presence

in Source

B

Presence

in Source A

∩ Survey

Page 159: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

159

Cost and efficiency:

7 % of items obtained from admin source and also collected by survey

Description This indicator relates to the combination of admin and survey data. This indicator provides information on the double collection of data, both admin source and surveys. Thus, it provides an idea of redundancy as the same data items are being obtained more than once. This indicator should be calculated for each admin source and then aggregated. Note. Double collection is sometimes conducted for specific reasons, e.g. to measure quality or because admin data is not sufficiently timely for the requirements of the statistical output. If this is the case, this should be explained.

How to calculate

%100survey in itemsrelevant of No.

datasurvey andadmin by obtained itemscommon relevant of No.

Only admin data which meet the definitions and timeliness requirements of the output should be included.

Example

A. Statistical output: Annual data on structure and competitiveness of credit institutions

B. Relevant units: All units in statistical population

C. Relevant variables: Number of employees; Number of female employees

D. Steps for calculation: D1. Match each source with survey(s) by the common identification code (if

available) or by other methods D2. Attribute a Presence(1)/Absence(0) indicator for items of variable(s) in the survey

(sum up to obtain denominator) D3. Attribute a value 1(0) for common(not common) items in the survey and in the

source (sum up to obtain numerator) D4. Calculate the indicator as follows:

Let CCIAA be the Chamber of Commerce source. Let EM be the Social Security source.

%100)7( survey in items relevant of No.

data survey and adminby obtained items common relevant of No.I

Page 160: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

160

I(7) Female Employees = (3/3)*100 = 100%

Weighted by turnover:

I(7) W(Female Employees) = (2,498,607,893/2,498,607,893)*100=100%

Units (1) (2) (3)=(1)*(2) (4) (5)=(1)*(4) (6)=(3)*(4)

X1 1 1 1 656,589,643 656,589,643 656,589,643 X2 1 1 1 189,652,300 189,652,300 189,652,300 X3 0 1 0 56,896,236 0 0 X4 0 0 0 35,698,420 0 0 X5 0 1 0 96,584,200 0 0 X6 1 1 1 1,652,365,950 1,652,365,950 1,652,365,950 X7 0 0 0 95,300 0 0 X8 0 0 321,056 0 0 X9 0 0 0 269,850 0 0 X10 0 1 0 341,258 0 0 X11 0 0 0 465,800 0 0

X12 0 1 0 333,652 0 0 Sum 3 7 3 2,689,613,665 2,498,607,893 2,498,607,893

Common items in survey ∩

source EM

Turnover

0

Female

employees in

survey

(Y/N)=(1/0)

Number of

female

employees

in source EM

172 15

201

220

43

201

15

173

323

9

I(7) Employees = (3/4)*100 = 75.0%

Weighted by turnover:

I(7) W(Employees) = (2,498,607,893/2,499,073,693)*100 = 99.98%

Units (1) (2) (3)=(1)*(2) (4) (5)=(1)*(4) (6)=(3)*(4)

X1 1 1 1 656,589,643 656,589,643 656,589,643 X2 1 1 1 189,652,300 189,652,300 189,652,300 X3 0 1 0 56,896,236 0 0 X4 0 0 0 35,698,420 0 0 X5 0 1 0 96,584,200 0 0 X6 1 1 1 1,652,365,950 1,652,365,950 1,652,365,950 X7 0 0 0 95,300 0 0 X8 1 0 321,056 0 0 X9 0 1 0 269,850 0 0 X10 0 1 0 341,258 0 0 X11 1 0 0 465,800 465,800 0

X12 0 1 0 333,652 0 0 Sum 4 9 3 2,689,613,665 2,499,073,693 2,498,607,893

Number of

employees in

source EM

(Y/N)=(1/0)

Common items in survey ∩

source EM

Turnover

0

Number of

employees in

survey

(Y/N)=(1/0)

Number of

employees

in survey

Number of

employees

in source

EM

290 27

357

8 498

126

58

1002

357

27

294

701

14

Female

employees in

source EM

(Y/N)=(1/0)

Number of

female

employees

in survey

Page 161: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

161

And the total indicator:

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AR – Comment on the types of items that are being obtained by the admin source as well as the survey

Cost and Efficiency

If the same data items are collected from both the admin source and the survey, this can lead to duplication when combining the sources. This indicator should highlight to users the variables that are being collected across both sources.

AS – If items are purposely collected by both the admin source and the survey, describe the reason for this duplication (eg validity checks)

Cost and Efficiency

In some instances it may be beneficial to collect the same variables from both the admin source and the survey, such as to validate the micro level data. The reason for the double collection should be described to users.

I(7) Total = [(3+3)/(4+3)]*100 = 85.71%

Weighted by turnover: I(7) W(Total) =[(2,498,607,893+2,498,607,893)/(2,499,073,693+2,498,607,893)]*100

=99.99%

Page 162: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

162

8 % reduction of survey sample size when moving from survey to admin data

Description This indicator relates to the combination of admin and survey data. This indicator provides information on the reduction in survey sample size because of an increased use of admin data. Only changes to the sample size due to using admin data should be included in this calculation. The indicator should be calculated for each survey and then aggregated (if applicable). This can be viewed as a one-off indicator when moving from a survey based output to an output involving admin data.

How to calculate

%100dataadmin of usein increase before size Sample

after size sample - dataadmin of usein increase before size Sample

Note. This indicator is likely to be calculated once, when making the change from survey to admin data.

Example 1 A. Statistical output: Annual data on structure and competitiveness of companies of

one NACE Division of industry sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover from industrial activities; turnover from service activities; turnover from trading activities of purchase and resale and from intermediary activities

D. Steps for calculation:

D1. Calculate sample size before increase in use of administrative data D2. Calculate sample size after increase in use of administrative data D3. Calculate the indicator as follows: In order to obtain the desired precision and reliability, we calculate three different sample sizes depending on the examined variable; so it would be suitable using at least the largest sample size dimension (amongst the three), which assures good results for all the three variables. Let the admin source be the Balance Sheet source. Let A be turnover from industrial activities Let U be turnover from service activities Let E be turnover from trading activities of purchase and resale and from intermediary activities Survey sample size before increase in use of admin data: Average A (Ā): 10,263,650 Average U (Ū): 1,023,652 Average E (Ē): 3,976,678

Page 163: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

163

Correct sample variance of A (S2A): 34,350,734,902,500 = 5,860,9502

Correct sample variance of U (S2U): 40,000,000,000 = 200,0002

Correct sample variance of E (S2E): 62,500,000,000 = 250,0002

Variable Requested precision (ε) Requested reliability (1- α) Survey sample size (n)

A U E

205,273 10,000 11,000

95% 95% 95%

1,598 784

1,012

Sample size (A) = nA = (z2

α*S2A)/ε

2A = (1.962*34,350,734,902,500)/205,2732 = 3,134

Sample size (U) = nU = (z2α*S

2U)/ε

2U = (1.962*40,000,000,000)/10,0002 = 1,537

Sample size (E) = nE = (z2α*S

2E)/ε

2E = (1.962*62,500,000,000)/11,0002 = 1,984

Survey sample size after increase in use of admin data: Average A (Ā’): 9,852,347 Average U (Ū’): 1,000,365 Average E (Ē’): 3,782,658 Correct sample variance of A (S’2A): 28,718,452,281,600 = 5,358,9602 Correct sample variance of U (S’2U): 32,400,000,000 = 180,0002

Correct sample variance of E (S’2E): 52,900,000,000 = 230,0002

Variable Requested precision (ε) Requested reliability (1- α) Survey sample size (n)

A U E

205,273 10,000 11,000

95% 95% 95%

1,336 635 857

Sample size (A) = nA = (z2

α*S’2A)/ε2A = (1.962*28,718,452,281,600)/205,2732 = 2,618

Sample size (U) = nU = (z2α*S’2U)/ε

2U = (1.962*32,400,000,000)/10,0002 = 1,245

Sample size (E) = nE = (z2α*S’2E)/ε

2E = (1.962*52,900,000,000)/11,0002 = 1,680

%100)8(data admin of use in increase before sizeSample

after size sample- data admin of use in increase before size SampleI

%46.16100*3134

26183134

Thus, due to an increase in the use of admin data, the survey sample size has decreased by 16.46%.

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

L – Describe the impact of moving from a survey based output to an admin-data based output

Comparability Provide information on how the data are affected when moving from a survey based output to an admin-data based output. Comment on any reasons behind differences in the output and describe how any inconsistencies are dealt with.

Page 164: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

164

4.2 Quality Indicators Accuracy:

9 Item non-response (% of units with missing values for key variables)

Description Although there are technically no ‘responses’ when using admin data, non-response (missing values at item or unit level) is an issue in the same way as with survey data. This indicator provides information on the extent of missing values for the key variables. The higher the level of missing values, the poorer the quality of the data (and potentially the statistical output). However, other indicators should also be considered, eg. the level of imputation and also the means of imputation used to address this missingness. This indicator should be calculated for each of the key variables and for each admin source and then aggregated based on the contributions of the variables to the overall output.

How to calculate

%100 variableXfor relevant units of No.

variableXfor valuemissing with dataadmin in the unitsrelevant of No.

This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output

Example

A. Statistical output: Annual data on structure and competitiveness of companies in the industry sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Production value; Wages and salaries; Social Security costs

D. Steps for calculation:

D1. Match source A with units in the statistical population and take the common units D2. Calculate number of units in D1 with missing value for source A D3. Calculate the indicator as follows:

%100 variable X for relevant units of No.

variable X for value missing withdata admin the in units relevant of No.

Let A be the Balance Sheet source.

Page 165: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

165

I(9)Production value = %0100*12

0 I(9)Social Security costs = %67.16100*

12

2

I(9)Wages and salaries = %25100*12

3 I(9)Total = %89.13100*

36

5100*

)121212(

)320(

Weighted by number of employees:

I(9)W(Production value) = %0100*788

0 I(9)W(Social Security costs)= %36.7100*

788

58

I(9)W(Wages and salaries) = %33.5100*788

42

I(9)W(Total )= %23.4100*2364

100100*

)788788788(

)42580(

Units (1) (2) (3) (4) (5) (6) (7) (8)=(2)*(7) (9)=(4)*(7) (10)=(6)*(7)

X1 12,365,980 0 6,274,189 0 1,330,889 0 38 0 0 0

X2 36,589,740 0 21,749,124 0 6,242,027 0 139 0 0 0

X3 6,958,450 0 2,454,837 0 572,089 0 25 0 0 0

X4 1,526,980 0 1 186,185 0 8 0 8 0

X5 85,964,150 0 40,514,474 0 11,493,837 0 296 0 0 0

X6 7,541,110 0 3,419,064 0 1,067,897 0 30 0 0 0

X7 2,589,630 0 1 1 6 0 6 6

X8 3,569,800 0 1,600,841 0 380,398 0 9 0 0 0

X9 7,854,200 0 2,958,481 0 772,264 0 20 0 0 0

X10 48,658,960 0 30,175,854 0 8,021,430 0 137 0 0 0

X11 14,586,250 0 7,110,797 0 1 52 0 0 52

X12 8,455,600 0 1 915,699 0 28 0 28 0

Sum 236,660,850 0 116,257,661 3 30,982,713 2 788 0 42 58

Number of

employees

Missing

values in (1)

(Y/N)=(1/0)

Production

Value Source A

Wages and

salaries

Source A

Social

Security

costs

Source A

Missing

values in (5)

(Y/N)=(1/0)

Missing

values in

(3)

(Y/N)=(1/0)

Page 166: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

166

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

V – Describe the data processing known to be required on the administrative data source to deal with non-response

Accuracy Data processing is often required to deal with non-response. The user should be made aware of how and why particular data processing methods are used.

W – Describe differences between responders and non-responders

Accuracy This indicates to users how significant the non-response bias is likely to be. Where response is high, non-response bias is likely to be less of a problem than when there are high rates of non-response. NB: There may be instances where non-response bias is high even with very high response rates, if there are large differences between responders and non-responders.

X – Assess the likely impact of non-response/imputation on final estimates

Accuracy Non-response error may reduce the representativeness of the data. An assessment of the likely impact of non-response/imputation on final estimates allows users to gauge how reliable the key estimates are as estimators of population values. This assessment may draw upon the indicator: ‘% of imputed values (items) in the admin data’.

Page 167: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

167

10 Misclassification rate

Description This indicator provides information on the proportion of units in the admin data which are incorrectly coded. For simplicity and clarity, activity coding as recorded on the Business Register (BR) can be considered to be correct – the example in this report makes this assumption (the validity of this assumption will depend on the systems used within different countries; other sources may be used if there is evidence they are more accurate than the BR). The level of coding used should be at a level consistent with the level used in the statistical output (e.g. if the statistical output is produced at the 3-digit level, then the accuracy of the coding should be measured at this level). This indicator should be calculated for each admin source and then aggregated based on the number of relevant units (weighted by turnover) in each source.

How to calculate

%100 dataadmin in unitsrelevant of No.

BR tocode NACEdifferent with dataadmin in unitsrelevant of No.

Note. If the activity code from the admin data is not used by the NSI (e.g. if coding from BR is used), details of the misclassification rate for the BR should be provided instead.

If a survey is conducted to check the rate of misclassification, the rate from this survey should be provided and a note added to the indicator. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example

A. Statistical output: Annual structural data on the enterprises of trade sector

B. Relevant units: Units in the statistical population

C. Relevant variables: NACE activity code (5 digits)

D. Steps for calculation: D1. Match each source with Business Register by the common identification code (if

available) or by other methods D2. Attribute a Presence(1)/Absence(0) indicator to items of the variables in each

admin data source and sum up to obtain denominator D3. Attribute a value 1/0 for inconsistency/consistency between the items of the

admin data source(s) and the items in the source(s) D4. Calculate the indicator as follows:

I(10) = %100 data admin in units relevant of No.

BR to code NACE different withdata admin in units relevant of No.

Page 168: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

168

E. Tolerance: The class (4 digits) of NACE activity code must be the same in order for items to be considered consistent

Let A be the VAT Turnover source. Let B be the Chamber of Commerce source.

%27.27100*11

3)10( I

Weighted by turnover:

%43.11100*700,820,36

400,208,4)10( WI

Units (1) (2) (3) (4)=(1)~(2) (5) (6)=(3)*(5) (7)=(4)*(5)

X1 46311 46311 1 0 5,236,550 5,236,550 0 X2 46721 46721 1 0 3,254,200 3,254,200 0 X3 47511 46423 1 1 2,586,900 2,586,900 2,586,900 X4 46732 0 9,236,500 0 X5 47532 47599 1 1 1,256,300 1,256,300 1,256,300 X6 47610 47621 1 1 365,200 365,200 365,200 X7 47114 47112 1 0 265,000 265,000 0 X8 47114 47114 1 0 186,500 186,500 0 X9 47112 47112 1 0 8,789,650 8,789,650 0 X10 46493 46493 1 0 2,586,950 2,586,950 0 X11 46321 46321 1 0 7,569,850 7,569,850 0 X12 46321 46321 1 0 4,723,600 4,723,600 0

Sum 11 3 46,057,200 36,820,700 4,208,400

Turnover NACE Code BR

NACE Code Source A

Inconsistency source A-BR

(4 digits) (Y/N)=(1/0)

Presence item in source A (Y/N)=(1/0)

Page 169: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

169

%33.8100*12

1)10(' I

Weighted by turnover:

%62.5100*200,057,46

950,586,2)10(' WI

%39.17100*23

4100*

)1211(

)13()10(''

TotI

Weighted by turnover:

%20.8100*900,877,82

350,795,6100*

)200,057,46700,820,36(

)950,586,2400,208,4()10('' )(

TotWI

Units (1) (2) (3) (4)=(1)~(2) (5) (6)=(3)*(5) (7)=(4)*(5)

X1 46311 46311 1 0 5,236,550 5,236,550 0

X2 46721 46721 1 0 3,254,200 3,254,200 0

X3 47511 47511 1 0 2,586,900 2,586,900 0

X4 46732 46732 1 0 9,236,500 9,236,500 0

X5 47532 47531 1 0 1,256,300 1,256,300 0

X6 47610 47610 1 0 365,200 365,200 0

X7 47114 47112 1 0 265,000 265,000 0

X8 47114 47114 1 0 186,500 186,500 0

X9 47112 47112 1 0 8,789,650 8,789,650 0

X10 46493 46193 1 1 2,586,950 2,586,950 2,586,950

X11 46321 46321 1 0 7,569,850 7,569,850 0

X12 46321 46321 1 0 4,723,600 4,723,600 0

Sum 12 1 46,057,200 46,057,200 2,586,950

NACE

Code

Source B

NACE

Code BR

Presence

item in

source B

(Y/N)=(0/1)

Consistency

source B-BR

(Y/N)=(0/1) Turnover

Page 170: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

170

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

R - Describe differences in concepts, definitions and classifications between the administrative source and the statistical output

Coherence

There may be differences in concepts, definitions and classifications between the administrative source and statistical product. Concepts include the population, units, domains, variables and time reference and the definitions of these concepts may vary between the administrative data and the statistical product. Time reference problems occur when the statistical institution requires data from a certain time period but can only obtain them for another. Any effects on the statistical product need to be made clear along with any techniques used to remedy the problem.

Z – Describe how the misclassification rate is determined

Accuracy It is often difficult to calculate the misclassification rate.Therefore, where this is possible, a description of how the rate has been calculated should also be provided.

AA – Describe any issues with classification and how these issues are dealt with

Accuracy Whereas a statistical institution can decide upon and adjust the classifications used in its own surveys to meet user needs, the institution usually has little or no influence over those used by administrative sources. Issues with classification and how these issues are dealt with should be described so that the user can decide whether the source meets their needs.

Page 171: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

171

11 Undercoverage

Description This indicator provides information on the undercoverage of the admin data. That is, units in the reference population that should be included in the admin data but are not (for whatever reason). This indicator should be calculated for each admin source and then aggregated based on the number of relevant units (weighted by turnover) in each source. (For information on the progressive nature of administrative data, see Appendix B.)

How to calculate

%100 population referencein unitsrelevant of No.

dataadmin in NOTbut population referencein unitsrelevant of No.

Note. This could be calculated for each relevant publication of the statistical output, e.g. first and final publication. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example

A. Statistical output: Annual data on structure and competitiveness of craft business of the industrial sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Craft nature (Y/N=1/0) of the enterprise.

D. Steps for calculation:

D1. Identify units in reference population, i.e. population of craft enterprises of industry sector (e.g. using Business Register)

D2. Match source A with the units in D1 by the common identification code and take the units which are in D1 but not in A (relevant units in reference population but not in A)

D4. Calculate the indicator as follows.

%100)11( population reference in units relevant of No.

data admin in NOT but population reference in units relevant of No.I

Let A be the Register of craft enterprises of Chamber of Commerce.

Page 172: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

172

Presence of unit in reference population taken from BR (i.e. craft business of industry sector): (Y/N)=(1/0)

Presence of unit in source A (Y/N)=(1/0)

Units not in Source A but in reference population Turnover

Units (1) (2) (3) (4) (5)=(3)*(4)

X1 1 1 0 1,532,620 0

X2 1 1 0 758,900 0

X3 1 1 0 256,300 0

X4 1 1 0 1,025,890 0

X5 1 0 1 650,000 650,000

X6 1 1 0 475,620 0

X7 1 1 0 965,002 0

X8 1 1 0 1,487,500 0

X9 1 1 0 325,640 0

X10 1 0 1 265,400 265,400

X11 1 1 0 654,250 0

X12 1 1 0 1,596,300 0

Sum 12 10 2 9,993,422 915,400

%67.16100*12

2)11( I

Weighted by turnover:

%16.9100*422,993,9

400,915)11( WI

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AB – Describe the extent of coverage of the administrative data and any known coverage problems

Accuracy This information is useful for assessing whether the coverage is sufficient. The population that the administrative data covers should be included along with all known coverage problems. There could be overcoverage (where duplicate records are included) or undercoverage (where certain records are missed).

AC – Describe methods used to deal with coverage issues

Accuracy Updating procedures and cleaning procedures should be described. Also, information should be provided on edit checks and /or imputations that are carried out because of coverage error. This information indicates to users the resultant robustness of the administrative data, after these procedures have been carried out to improve coverage.

AD – Assess the likely impact of coverage error on key estimates

Accuracy Coverage issues could relate to undercoverage and overcoverage (or both). Special studies can sometimes be carried out to assess the impact of undercoverage and overcoverage. An assessment of the likely impact of coverage error indicates to users how reliable the key estimates are as estimators of population values.

Page 173: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

173

12 Overcoverage

Description This indicator provides information on the overcoverage of the admin data. That is, units that are included in the admin data but should not be (e.g. are out-of-scope, outside the reference population). Note that when overcoverage is identified, quite often it can be addressed by removing these units when calculating the statistical output. However, in cases where overcoverage is identified but cannot be addressed, it is this estimate of ‘uncorrected’ overcoverage that should be provided for this indicator. This indicator should be calculated for each admin source and then aggregated based on the number of relevant units (weighted by turnover) in each source. (For information on the progressive nature of administrative data, see Appendix B.)

How to calculate

%100 population referencein units of No.

population referencein NOTbut dataadmin in units of No.

This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example

A. Statistical output: Annual data on structure and competitiveness of craft business of industrial sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Craft nature (Y/N=1/0) of the enterprise.

D. Steps for calculation:

D1. Identify units in reference population i.e. craft enterprises of industrial sector (e.g. using Business Register)

D2. Match source A with units in D1 by the common identification code and take the units which are in A but not in D1 (units in source A but not in reference population)

D3. Calculate the indicator as follows:

%100)12( population reference in units of No.

population reference in NOT but data admin in units of No.I

Let A be the Register of craft enterprises of the Chamber of Commerce.

Note: sometimes some enterprises are struck from the Register of craft business because they don’t have the legal requisites for admission anymore (e.g. they change legal status or the number of their employees increases too much etc.) but in the available version of data this fact is still not recorded because of a delay in the registration of these data in the Register.

Page 174: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

174

Presence of unit in reference population taken from BR (i.e. craft business of industry sector): (Y/N)=(1/0)

Presence of unit in source A (Y/N)=(1/0)

Units in Source A but not in reference population Turnover

Turnover in reference population

Units (1) (2) (3) (4) (5)=(1)*(4) (5)=(3)*(4)

X1 1 1 0 1,532,620 1,532,620 0

X2 0 1 1 758,900 0 758,900

X3 1 0 0 256,300 256,300 0

X4 1 1 0 1,025,890 1,025,890 0

X5 1 1 0 650,000 650,000 0

X6 1 1 0 475,620 475,620 0

X7 1 1 0 965,002 965,002 0

X8 1 1 0 1,487,500 1,487,500 0

X9 0 1 1 325,640 0 325,640

X10 1 0 0 265,400 265,400 0

X11 1 1 0 654,250 654,250 0

X12 1 1 0 1,596,300 1,596,300 0

Sum 10 10 2 9,993,422 9,908,882 1,084,540

%00.20100*10

2)12( I

Weighted by turnover:

%17.12100*882,908,8

540,084,1)12( WI

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AB – Describe the extent of coverage of the administrative data and any known coverage problems

Accuracy This information is useful for assessing whether the coverage is sufficient. The population that the administrative data covers should be included along with all known coverage problems. There could be overcoverage (where duplicate records are included) or undercoverage (where certain records are missed).

AC – Describe methods used to deal with coverage issues

Accuracy Updating procedures and cleaning procedures should be described. Also, information should be provided on edit checks and /or imputations that are carried out because of coverage error. This information indicates to users the resultant robustness of the administrative data, after these procedures have been carried out to improve coverage.

AD – Assess the likely impact of coverage error on key estimates

Accuracy Coverage issues could relate to undercoverage and overcoverage (or both). Special studies can sometimes be carried out to assess the impact of undercoverage and overcoverage. An assessment of the likely impact of coverage error indicates to users how reliable the key estimates are as estimators of population values.

Page 175: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

175

13 % of units in the admin source for which reference period differs from the required reference period

Description This indicator provides information on the proportion of units that provide data for a different reporting period than the required period for the statistical output. If the periods are not those required, then some imputation is necessary, which may impact quality. This indicator should be calculated for each admin source and then aggregated based on the number of relevant units (weighted by turnover) in each source.

How to calculate

%100 dataAdmin in unitsrelevant of No.

period required from period reporting

different with dataAdmin in unitsrelevant of No.

This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Note: In some cases, 'calenderization' adjustments must be made to get the administrative data to the correct periodicity - for example, converting quarterly data to monthly data. If this is required, it may be helpful to calculate an additional indicator covering the proportion of units for which calenderization adjustments have taken place.

Example

A. Statistical output: Annual data on structure and competitiveness of enterprises of industry sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Production value; Wages and salaries; Social Security costs

D. Steps for calculation:

D1. Identify all the units in the source with different reporting period from the required period of the statistical output

D2. Calculate the indicator as follows.

%100)13( data Admin in units relevant of No.

period required from period reporting different withdata Admin in units relevant of No.I

Let A be the Balance Sheet source.

The usual reporting period for the yearly budget of enterprises is the solar year; but sometimes, for various reasons, the reporting period is different (e.g. it could be from 1st June to 31st May).

Page 176: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

176

%25100*12

3)13( I

Weighted by turnover:

%24.30100*090,422,86

440,138,26)13( WI

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AE – Describe the data processing known to be required on the administrative data source to address instances where the reference period differs from the required reference period.

Accuracy Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

Units (1) (2) (3) (4)=(2)*(3)

X1 01/01/2011-12/31/2011 0 1,586,900 0

X2 03/01/2011-02/28/2012 1 18,965,320 18,965,320

X3 01/01/2011-12/31/2011 0 22,563,850 0

X4 01/01/2011-12/31/2011 0 258,000 0

X5 01/01/2011-12/31/2011 0 789,600 0

X6 01/01/2011-12/31/2011 0 7,778,540 0

X7 01/01/2011-12/31/2011 0 15,635,800 0

X8 05/01/2011-04/30/2012 1 6,584,120 6,584,120

X9 01/01/2011-12/31/2011 0 654,000 0

X10 01/01/2011-12/31/2011 0 758,000 0

X11 03/01/2011-02/28/2012 1 589,000 589,000

X12 01/01/2011-12/31/2011 0 10,258,960 0

Sum 3 86,422,090 26,138,440

Units with different

reporting period from

solar year (Y/N)=(1/0)

Reporting period of

source A Turnover

Page 177: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

177

14 Size of revisions from the different versions of the admin data RAR – Relative Absolute Revisions

Description This indicator assesses the size of revisions from different versions of the admin data, providing information on the reliability of the data received. With this indicator it is possible to understand the impact of the different versions of admin data on the results for a certain reference period. When data is revised based on other information (e.g. survey data) this should not be included in this indicator. The indicator should be calculated for each admin source and then aggregated.

How to calculate

%100

1

1

T

t Pt

T

t PtLt

X

XX

= Latest data for X variable

= First data for X variable If only one version of the admin data is received, this indicator is not relevant. Note. This indicator should only be calculated for estimates based on the same units (not including any additional units added in a later delivery of the data). This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example A. Statistical output: Multiannual data on structure and competitiveness of enterprises

of NACE division 47

B. Relevant units: Units in the statistical population

C. Relevant variables: Number of retail stores.

D. Steps for calculation:

D1. Identify the statistical unit (enterprise) in the first and in the second version of data coming from the same source

D2. Match the source with the units in the statistical population by the common identification code (if available) or by other methods, and take the units in common

D3. Take the non missing values (XPt) from the first data version D4. Take the non missing values (XLt) from the second data version for the same units

received in the first data version D5. Calculate the difference (absolute value) between the latest data and the first data

version for each unit D6. Sum up the differences and divide it by the sum of the absolute values of the first

version of data D7. Calculate the indicator as follows:

LtX

PtX

Page 178: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

178

Let A be the Nielsen Data Bank of retail sector.

%10100*50

5)14( I

Weighted by turnover:

%70.7100*322,814,699,1

690,608,128)14( WI

Units (1) (2) (3)=|(2)-(1)| (4) (5)=(1)*(4) (6)=(3)*(4)X1 3 4 1 2,356,980 7,070,940 2,356,980X2 5 6 1 5,684,230 28,421,150 5,684,230X3 1 1 0 652,000 652,000 0X4 1 1 0 412,530 412,530 0X5 1 1 0 722,410 722,410 0X6 27 25 2 59,658,740 1,610,785,980 119,317,480X7 2 3 1 1,250,000 2,500,000 1,250,000X8 4 4 0 2,154,003 8,616,012 0X9 1 1 0 231,000 231,000 0X10 1 1 0 365,430 365,430 0X11 2 2 0 1,528,690 3,057,380 0X12 2 2 0 3,489,750 6,979,500 0Sum 50 51 5 78,505,763 1,669,814,332 128,608,690

Number of

retail stores of

the unit (first

data version)

(XPt)

Number of retail

stores of the unit

(second data version)

(XLt)

Absolute value of

the difference

between XLt and

XPt: |XLt-XPt|

Turnover

(second

version)

100*

||

||

I(14)

1

1

T

t

T

t

XPt

XPtXLt

Page 179: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

179

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AF - Comment on the impact of the different versions of admin data on the results

Accuracy When commenting on the size of the revisions of the different versions of the admin data, information on the impact of the revisions on the statistical product for the relevant reference period should also be explained to users.

AG – Flag any published data that are subject to revision and data that have already been revised

Accuracy This indicator alerts users to published data that may be, or have already been, revised. This will enable users to assess whether provisional data will be fit for their purposes.

AH – For ad hoc revisions, detail revisions made and provide reasons

Accuracy Where revisions occur on an ad hoc basis to published data, this may be because earlier estimates have been found to be inaccurate. Users should be clearly informed of the revisions made and why they occurred. Clarifying the reasons for revisions guards against any misinterpretation of why revisions have occurred, while at the same time making processes (including any errors that may have occurred) more transparent to users.

AP – Reference/link to detailed revisions analyses

Accessibility Users should be directed to where detailed revisions analyses are available.

Page 180: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

180

15 % of units in admin data which fail checks

Description This indicator provides information on the extent to which data fail some elements of the checks (automatic or manual) and are flagged by the NSI as suspect. This does not mean that the data are necessarily adjusted (see Indicator 16), simply that they fail one or more check(s). This checking can either be based on a model, checking against other data sources (admin or survey), internet research or through direct contact with the businesses. This indicator should be calculated for each of the key variables and aggregated based on the number of relevant units (weighted by turnover) in each source.

How to calculate

%100 checked unitsrelevant of no. Total

failed and checked dataadmin in unitsrelevant of No.

Note. If the validation is done automatically and the system does not flag or record this in some way, this should be noted. Users should state the number of checks done, and the proportion of data covered by these checks. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example A. Statistical output: Annual data on structure and competitiveness of enterprises of

industry sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Legal status; Production value

D. Steps for calculation:

D1. Identify for each key variable the number of units checked in admin data. D2. Identify for each key variable the number of units in admin data which fail checks. D3. Average the proportions of units which fail checks by weighting by the number of

units. D4. Calculate the indicator as follows:

%100)15( checked units relevant of no. Total

failed and checked data admin in units relevant of No.I

Let A be the Chamber of Commerce source (we take from A the variable Legal Status). Let B be the Balance Sheet source (we take from B the variable Production value).

Page 181: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

181

Unit checked for variable Legal Status (Y/N)=(1/0)

Unit fails checks for variable Legal Status (Y/N)=(1/0)

Unit checked for variable Production value (Y/N)=(1/0)

Unit fails checks for variable Production value (Y/N)=(1/0)

Number of employees

Units (1) (2) (3) (4) (5) (6)=(1)*(5) (7)=(2)*(6) (8)=(3)*(5) (9)=(4)*(8)

X1 1 1 1 0 18 18 18 18 0

X2 1 0 1 0 32 32 0 32 0

X3 1 0 1 1 327 327 0 327 327

X4 1 0 0 0 0 0 0 0 0

X5 1 1 1 1 27 27 27 27 27

X6 1 1 1 0 2 2 2 2 0

X7 1 0 0 0 1 1 0 0 0

X8 1 0 1 0 985 985 0 985 0

X9 1 0 0 0 1,008 1,008 0 0 0

X10 1 0 1 0 15 15 0 15 0

X11 1 0 0 0 0 0 0 0 0

X12 1 0 0 0 0 0 0 0 0

Sum 12 3 7 2 2,415 2,415 47 1,406 354

%25100*12

3)15( statusLegalI %57.28100*

7

2)15( valueoductionPrI

%32.26100*19

5100*

)712(

)23()15(

TotalI

Weighted by number of employees:

%95.1100*2415

47)15( )( statusLegalWI %18.25100*

1406

354)15( )(Pr valueoductionWI

%49.10100*3821

401100*

)14062415(

)35447()15( )(

TotalWI

Page 182: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

182

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AI – Describe the known sources of error in administrative data

Accuracy Metadata provided by the administrative source and/or information from other reliable sources can be used to assess data errors. The magnitude of any errors (where known) that have a significant impact on the administrative data should be made available to users. This will help the user to understand how accurate the administrative data are.

AJ – Describe the data processing known to be required on the administrative data source in terms of the types of checks carried out

Accuracy Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

AK – Describe processing systems and quality control

Accuracy This informs users of the mechanisms in place to minimise processing error by ensuring accurate data transfer and processing.

AL – Describe the main sources of measurement error

Accuracy Measurement error is the error that occurs from failing to collect the true data values from respondents. Measurement error cannot usually be calculated. However, outputs should contain a definition of measurement error and a description of the main sources of the error when the admin data holder gathers the data.

AM – Describe processes employed by the admin data holder to reduce measurement error

Accuracy Describing processes to reduce measurement error indicates to users the accuracy and reliability of the measures.

AN – Describe the main sources of processing error

Accuracy Processing error is the error that occurs when processing data. It includes errors in data capture, coding, editing and tabulation of the data. It is not usually possible to calculate processing error exactly. However, outputs should be accompanied by a definition of processing error and a description of the main sources of the error.

Page 183: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

183

16 % of units for which data have been adjusted

Description This indicator provides information about the proportion of units for which the data have been adjusted (a subset of the units included in Indicator 15). These are units that are considered to be erroneous and are therefore adjusted in some way (missing data should not be included in this indicator – see Indicator 9). Any changes to the admin data before arrival with the NSI should not be considered in this indicator. This indicator should be calculated for each of the key variables and aggregated based on the number of relevant units (weighted by turnover) in each source.

How to calculate

%100 DataAdmin in unitsrelevant of No.

data adjusted with dataAdmin in the unitsrelevant of No.

This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example

A. Statistical output: Annual data on structure and competitiveness of

enterprises of industry sector

B. Relevant variables: Legal status; Production value

C. Relevant units: Units in the statistical population

D. Steps for calculation:

D1. Identify for each key variable the number of units in admin data D2. Identify for each key variable the number of units in admin data that have been

adjusted D3. Average the proportions of units that have been adjusted by weighting by the

numbers of units D4. Calculate the indicator as follows:

%100)16( Data Admin in units relevant of No.

data adjusted withdata Admin the in units relevant of No.I

Let A be the Chamber of Commerce source (we take from A the variable Legal Status). Let B be the Balance Sheet source (we take from B the variable Production value).

Page 184: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

184

Units in admin data for variable Legal Status (Y/N)=(1/0)

Units adjusted for variable Legal Status (Y/N)=(1/0)

Units in admin data for variable Production value (Y/N)=(1/0)

Units adjusted for variable Production value (Y/N)=(1/0)

Number of employees

Units (1) (2) (3) (4) (5) (6)=(1)*(5) (7)=(2)*(6) (8)=(3)*(5) (9)=(4)*(8)

X1 1 0 1 0 18 18 0 18 0

X2 1 0 1 0 32 32 0 32 0

X3 1 0 1 0 327 327 0 327 0

X4 1 0 0 0 0 0 0 0 0

X5 1 1 1 0 27 27 27 27 0

X6 1 1 1 0 2 2 2 2 0

X7 1 0 0 0 1 1 0 0 0

X8 1 0 1 0 985 985 0 985 0

X9 1 0 0 0 1,008 1,008 0 0 0

X10 1 0 1 0 15 15 0 15 0

X11 1 0 0 0 0 0 0 0 0

X12 1 0 0 0 0 0 0 0 0

Sum 12 2 7 0 2,415 2,415 29 1,406 0

%67.16100*12

2)16( statusLegalI %0100*

7

0)16( valueoductionPrI

%53.10100*19

2100*

)712(

)02()16(

TotalI

Weighted by number of employees:

%20.1415,2

29)16( )( statusLegalWI %0100*

1406

0)16( )( valueoductionPrWI

%76.0100*821,3

29100*

)14062415(

)029()16( )(

TotalWI

Page 185: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

185

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

S - Describe any adjustments made for differences in concepts and definitions between the administrative source and the statistical output

Coherence Adjustments may be required as a result of differences in concepts and definitions between the administrative data and the requirements of the statistical product. A description of why the adjustment needed to be made and how the adjustment was made should be provided to the users so they can assess the coherence of the statistical product with other sources.

AK – Describe processing systems and quality control

Accuracy This informs users of the mechanisms in place to minimise processing error by ensuring accurate data transfer and processing.

AL – Describe the main sources of measurement error

Accuracy Measurement error is the error that occurs from failing to collect the true data values from respondents. Measurement error cannot usually be calculated. However, outputs should contain a definition of measurement error and a description of the main sources of the error when the admin data holder gathers the data.

AM – Describe processes employed by the admin data holder to reduce measurement error

Accuracy Describing processes to reduce measurement error indicates to users the accuracy and reliability of the measures.

AN – Describe the main sources of processing error

Accuracy Processing error is the error that occurs when processing data. It includes errors in data capture, coding, editing and tabulation of the data. It is not usually possible to calculate processing error exactly. However, outputs should be accompanied by a definition of processing error and a description of the main sources of the error.

AO – Describe the data processing known to be required on the administrative data source in terms of the types of edits carried out

Accuracy Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

Page 186: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

186

17 % of imputed values (items) in the admin data

Description This indicator provides information on the impact of the values imputed by the NSI. These values are imputed because data are missing (see Indicator 9) or data items are unreliable (see Indicator 16). This indicator should be calculated by variable for each admin source and then aggregated based on the contributions of the variables to the overall output.

How to calculate

%100 dataadmin in itemsrelevant of No.

dataadmin relevant in the items imputed of No.

This indicator should be weighted (e.g. by turnover or number of employees) in terms of the % contribution of the imputed values to the statistical output.

Example

A. Statistical output: Annual data on structure and competitiveness of enterprises of industry sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Number of employees; craft nature of the enterprises

D. Steps for calculation:

D1. For each variable in the source calculate the number of relevant items D2. For each variable identify all the units with items either missing or present in admin

data which are afterwards imputed D3. For each variable calculate the proportion of D2 on D1 D4. Calculate the indicator for each source weighting the proportions with the items

D5. Calculate the indicator as follows:

%100)17( data admin in items relevant of No.

data admin relevant the in items imputed of No.I

Let A be the Chamber of Commerce source.

Page 187: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

187

Variable NACE code: items in admin data

Items afterwards imputed in Source A

Relevant items in Source A (Y/N)=(1/0)

Items afterwards imputed in Source A (Y/N)=(1/0) Turnover

Units (1) (2) (3) (4) (5) (6)=(4)*(5)

X1 32300 1 0 15,863,200 0

X2 27520 1 0 10,547,210 0

X3 10130 1 0 785,400 0

X4 15201 1 0 657,400 0

X5 25999 1 0 458,700 0

X6 Missing 17120 1 1 18,520,630 18,520,630

X7 10712 1 0 239,500 0

X8 Missing 15201 1 1 2,587,400 2,587,400

X9 25991 1 0 7,584,120 0

X10 26302 1 0 3,254,120 0

X11 25121 1 0 1,547,200 0

X12 32201 32200 1 1 547,800 547,800

X13 15201 1 0 540,210 0

X14 25999 1 0 787,000 0

Sum 14 3 63,919,890 21,655,830

%43.21100*14

3)17( codeactivityNACEI

Weighted by turnover:

%88.33100*890,919,63

830,655,21)17( )( codeactivityNACEWI

Variable craft nature code: items in admin data

Items afterwards imputed in Source A

Relevant items in Source A (Y/N)=(1/0)

Items afterwards imputed in Source A (Y/N)=(1/0) Turnover

Units (1) (2) (3) (4) (5) (6)=(4)*(5)

X1 0 1 0 15,863,200 0

X2 0 1 0 10,547,210 0

X3 1 1 0 785,400 0

X4 1 1 0 657,400 0

X5 Missing 1 1 1 458,700 458,700

X6 1 0 1 1 18,520,630 18,520,630

X7 1 1 0 239,500 0

X8 0 1 0 2,587,400 0

X9 0 1 0 7,584,120 0

X10 0 1 0 3,254,120 0

X11 0 1 0 1,547,200 0

X12 0 1 0 547,800 0

X13 1 1 0 540,210 0

X14 0 1 0 787,000 0

Sum 14 2 63,919,890 18,979,330

Page 188: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

188

%29.14100*14

2)17( natureCraftI

Weighted by turnover:

%69.29100*890,919,63

330,979,18)17( )( natureCraftWI

%86.17100*28

5100*

)1414(

)23()17(

TotalI

Weighted by turnover:

%79.31100*780,839,127

160,635,40100*

)890,919,63890,919,63(

)330,979,18830,655,21()()17(

TotalWI

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

X – Assess the likely impact of non-response/imputation on final estimates

Accuracy Non-response error may reduce the representativeness of the data. An assessment of the likely impact of non-response/imputation on final estimates allows users to gauge how reliable the key estimates are as estimators of population values. This assessment may draw upon the indicator: ‘% of imputed values (items) in the admin data’.

Y – Comment on the imputation method(s) in place within the statistical process

Accuracy The imputation method used can determine how accurate the imputed value is. Information should be provided on why the particular method(s) was chosen and when it was last reviewed.

Page 189: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

189

Timeliness and punctuality: 18 Delay to accessing / receiving data from Admin Source

Description This indicator provides information on the proportion of the time from the end of the reference period to the publication date that is taken up waiting to receive the admin data. This is calculated as a proportion of the overall time between reference period and publication date to provide comparability across statistical outputs. This indicator should be calculated for each admin source and then aggregated.

How to calculate

%100daten publicatio toperiod reference of end thefrom Time

dataAdmin receiving toperiod reference of end thefrom Time

Note. Include only the final dataset used for the statistical output. If a continuous feed of data is received, the ‘last’ dataset used to calculate the statistical output should be used in this indicator. If more than one source is used, an average should be calculated, weighted by the sources’ contributions to the final estimate. If the admin data are received before the end of the reference period, this indicator would be 0. This indicator applies to the first publication only, not to revisions.

Example A. Statistical output: Annual data on structure and competitiveness of

enterprises with employees

B. Relevant units: Units in the statistical population

C. Relevant variables: Number of employees; Turnover; Production value.

D. Steps for calculation:

D1. Match each source with the units in the statistical population by the common identification code (if available) or by other methods, obtaining the number of common units

D2. Calculate for each source the number of days from the end of reference period to the arrival of admin data.

D3. Calculate the number of days from the end of reference period to dissemination date.

D4. Calculate the indicator as follows:

%100)18( datenpublicatiotoperiodreferenceofend the from Time

dataAdmin receivingtoperiod referenceofendthe fromTimeI

Let A be the Social Security source (from which we take the number of employees). Let B be the Fiscal register (from which we take the VAT proxy of turnover). Let C be Balance Sheet source (from which we take the Production Value).

Page 190: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

190

Number of units in source and statistical population

Number of days from the end of the reference period to receiving Admin data

Number of days from the end of reference period to publication date

I(18) for each source

Weighting for contributions

Source (1) (2) (3) (4)=(2)/(3) (5)=(1)*(4)

A 2,598 186 291 63.92% 1660.64

B 1,962 123 291 42.27% 829.34

C 2,241 235 291 80.76% 1809.83

Sum 6,801 4299.81

%92.63100*291

186)18( EmployeesI

%27.42100*291

123)18( TurnoverI

%76.80100*291

235)18( Pr valueoductionI

Weighted by the contributions of the source to the statistical output:

%22.63100*801,6

81.4299)18( AggregateI

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

G – Describe the timescale since the last update of data from the administrative source

Timeliness An indication of the timescale since the last update from administrative sources will provide the user with an indication of whether the statistical product is timely enough to meet their needs.

H – Describe the extent to which the administrative data are timely

Timeliness Provide information on how soon after their collection the statistical institution receives the administrative data. The effects of any lack of timeliness on the statistical product should be described.

I – Describe any lack of punctuality in the delivery of the administrative data source

Timeliness Give details of the time lag between the scheduled and actual delivery dates of the data. Any reasons for the delay should be documented along with their effects on the statistical product.

J – Frequency of production

Timeliness This indicates how timely the outputs are, as the frequency of publication indicates whether the outputs are up to date with respect to users’ needs.

K – Describe key user needs for timeliness of data and how these needs have been addressed

Timeliness This indicates how timely the data are for specified needs, and how timeliness has been secured, eg by reducing the time lag to a number of days rather than months for monthly releases.

Page 191: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

191

Comparability: 19 Discontinuity in estimate when moving from a survey-based output to an

output involving admin data

Description This indicator measures the impact on the level of the estimate when changing from a survey-based output to an output involving admin data (either entirely admin based or partly). This indicator should be calculated separately for each key estimate included in the output. This can be viewed as a one-off indicator when moving from a survey based output to an output involving admin data.

How to calculate

%100 survey fromEstimate

surveyfrom Estimate - data Admininvolving Estimate

Note. This indicator should be calculated using survey and admin data which refer to the same period.

Example

A. Statistical output: An annual survey on structure and competitiveness of companies of trade sector

B. Relevant units: Units in the statistical population

C. Relevant variables: Personnel costs

D. Steps for calculation:

D1. Compute the estimate of the variable(s) for the survey based output D2. Compute the estimate of the variable(s) for the admin-data based output D3. Calculate the indicator as follows:

100*)19(surveyfromEstimate

datasurveyfromestimatedataminadinvolvingEstimateI

Let A be the Balance Sheet source.

%27.0100*180,54

180,54036,54)19(

I

Survey-based output € 35,000.00 2,840 4,323.15 € 54,180.00

Admin-based output € 35,000.00 179,500 3,242.41 € 54,036.00

Unbiased sample

standard deviation

Sample

meanCost Sample size

Page 192: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

192

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

L – Describe the impact of moving from a survey based output to an admin-data based output

Comparability Provide information on how the data are affected when moving from a survey based output to an admin-data based output. Comment on any reasons behind differences in the output and describe how any inconsistencies are dealt with.

M – Describe any method(s) used to deal with discontinuity issues

Comparability Where the level of the estimate is impacted when moving from a survey based output to an admin-data based output, it may be possible to address this discontinuity. In this case a description of the method(s) used to deal with the discontinuity should be provided.

N - Describe the reasons behind discontinuities when moving from survey based estimates to admin-data based estimates

Comparability Where it is not possible to address a discontinuity in estimates when moving from a survey based output to an admin-data based output, the reasons behind the discontinuity should be provided along with commentary around the impact on the level of the estimate. Any changes in terms of the advantages and limitations of the differences should also be highlighted.

Page 193: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

193

Coherence: 20 % of consistent items for common variables in more than one source23

Description This indicator provides information on consistent items for any common variables across sources (either admin or survey). Only variables directly required for the statistical output should be considered – basic information (e.g. business name and address) should be excluded. Values within a tolerance should be considered consistent – the width of this tolerance (1%, 5%, 10%, etc.) would depend on the variables and methods used in calculating the statistical output. This indicator should be calculated for each of the key variables and aggregated based on the contributions of the variables to the overall output.

How to calculate

%100 variable X for required items of no. Total

variable X for tolerance) (within items consistent of No.

Note. If only one source is available or there are no common variables, this indicator is not relevant. Please state the tolerance used. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example A. Statistical output: Annual data on structure and competitiveness of enterprises

of trade sector

B. Relevant units: Units in the survey

C. Relevant variables: Number of employees

D. Steps for calculation:

D1. Match each source with the survey by the common identification code (if available) or by other methods

D2. Attribute a Presence(1)/Absence(0) indicator to items of the variables in the survey (sum up to obtain denominator)

D3. Attribute a value 1(0) for consistent (not consistent) items in the survey and in the source (it is considered “consistent” if the percentage difference is less than 3%)

D4. Calculate the indicator as follows:

%100)20( variable Xfor required items of no.Total

variableX for tolerance) (within items consistentofNo.I

E. Tolerance: Max Difference = 3%

Let A be the Social Security source. Let B be the Nielsen data bank data.

23 Indicators 20 and 23 are the only indicators in Section 4.2 for which a high indicator score denotes high

quality and a low indicator score denotes low quality.

Page 194: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

194

%29.64100*14

9)20( I

Weighted by turnover:

%54.37100*541,619,281

830,724,105)20( WI

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

O – Describe the common identifiers of population units in administrative data

Coherence Different administrative sources often have different population unit identifiers. The user can utilise this information to match records from two or more sources. Where there is a common identifier, matching is generally more successful.

Q - Describe the width of the tolerance and the reasons for this

Coherence Where values within a particular tolerance are considered consistent for common variables across more than one source, the width of the tolerance should be stated, along with a brief explanation as to why this particular tolerance width was chosen.

U – Describe the record matching methods and processes used on the administrative data sources

Accuracy Record matching is when different administrative records for the same unit are matched using a common, unique identifier or key variables common to both datasets. There are many different techniques for carrying out this process. A description of the technique (e.g. automatic or clerical matching) should be provided along with a description (qualitative or quantitative) of its effectiveness.

Units (1) (2) (3)={[(2)-(1)]/(1)}*100 (4) (5) (6)=(4)*(5)

X1 152 154 1.32 1 38,985,610 38,985,610

X2 335 352 5.07 0 58,945,620 0

X3 15 15 0.00 1 7,540,210 7,540,210

X4 29 40 37.93 0 48,540,210 0

X5 2 2 0.00 1 298,540 298,540

X6 0 0 0.00 1 680,000 680,000

X7 11 12 9.09 0 1,548,760 0

X8 18 18 0.00 1 1,800,000 1,800,000

X9 60 61 1.67 1 9,856,410 9,856,410

X10 71 70 1.41 1 17,564,280 17,564,280

X11 29 27 6.90 0 6,985,471 0

X12 569 600 5.45 0 59,874,650 0

X13 235 240 2.13 1 26,541,780 26,541,780

X14 11 11 0.00 1 2,458,000 2,458,000

Sum 1,537 1,602 9 281,619,541 105,724,830

Turnover

Number of

employees source A

Number of employees

source B

Percentage

difference <3%

(Y/N)=(1/0) A and B

Percentage

difference between

Page 195: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

195

oSS

S

U+U

U

21 % of relevant units in admin data which have to be adjusted to create statistical units

Description This indicator provides information on the proportion of units that have to be adjusted in order to create statistical units. For example, the proportion of data at enterprise group level which therefore need to be split to provide reporting unit data.

How to calculate

Relevant units in the reference population that are adjusted to the statistical concepts by the use of statistical methods

Relevant units in the reference population that correspond to the statistical concepts This indicator should be weighted (e.g. by turnover or number of employees) in terms of the % contribution of these units to the statistical output. Note: Frequently, administrative units must be aggregated into 'composite units' before being disaggregated into statistical units. If this is required, it may be helpful to calculate an additional indicator covering the proportion of administrative units which can be successfully matched, or 'aligned', with composite units.

Example

A. Statistical output: Annual data on structure and competitiveness in industry sector

B. Relevant units: Enterprises in the statistical population (but statistical units are enterprises groups)

D. Steps for calculation:

D1. Identify the units in admin data which need to be adjusted in order to obtain the relevant statistical units.

D2. Identify the relevant units in admin data that correspond to the statistical concepts. D3. Divide D1 by (D1+D2) to calculate the indicator as follows:

100*)21(

UosUs

UsI

Where:

Us= Relevant units in the reference population that are adjusted to the statistical concepts by the use of statistical methods.

Uos= Relevant units in the reference population that correspond to the statistical concepts.

SS UU

oSU

Page 196: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

196

%25.81100*313

13)21(

I

Weighted by number of employees;

%45.96100*116156,3

156,3)21(

WI

Relevant Units (1) (2) (3) (4) (5)=(2)*(4) (5)=(3)*(4)

X1 A1 0 1 26 0 26

X2 A1 0 1 369 0 369

X3 A1 0 1 856 0 856

X4 A2 1 0 96 96 0

X5 A3 0 1 15 0 15

X6 A3 0 1 27 0 27

X7 A3 0 1 100 0 100

X8 A3 0 1 25 0 25

X9 A3 0 1 38 0 38

X10 A4 1 0 2 2 0

X11 A5 0 1 2 0 2

X12 A5 0 1 0 0 0

X13 A6 0 1 15 0 15

X14 A6 0 1 985 0 985

X15 A6 0 1 698 0 698

X16 A7 1 0 18 18 0

Sum 3 13 3,272 116 3,156

Enterprises

groups code

Relevant unit

corresponds to

statistical unit

(Y/N)=(1/0)

Number of

employees

Relevant unit

corresponds to

statistical unit

(N/Y)=(1/0)

Page 197: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

197

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

D – Describe constraints on the availability of administrative data at the required level of detail

Relevance Some administrative microdata have restricted availability or may only be available at aggregate level. Describe any restrictions on the level of data available and their effects on the statistical product.

R - Describe differences in concepts, definitions and classifications between the administrative source and the statistical output

Coherence

There may be differences in concepts, definitions and classifications between the administrative source and statistical product. Concepts include the population, units, domains, variables and time reference and the definitions of these concepts may vary between the administrative data and the statistical product. Time reference problems occur when the statistical institution requires data from a certain time period but can only obtain them for another. Any effects on the statistical product need to be made clear along with any techniques used to remedy the problem.

S - Describe any adjustments made for differences in concepts and definitions between the administrative source and the statistical output

Coherence

Adjustments may be required as a result of differences in concepts and definitions between the administrative data and the requirements of the statistical product. A description of why the adjustment needed to be made and how the adjustment was made should be provided to the users so they can assess the coherence of the statistical product with other sources.

Page 198: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

198

Cost and efficiency:

22 Cost of converting admin data to statistical data

Description This indicator provides information on the estimated cost (in person hours) of converting admin data to statistical data. It can be considered in two ways: either as a one-off indicator to identify the set-up costs of moving from survey data to administrative data (as such it should include set-up costs, monitoring of data sources, negotiating with data providers, etc.), or as a regular indicator to identify the ongoing running costs of the system that converts the administrative data to statistical data (which should include costs of technical processing, monitoring of the data, ongoing liaison with data providers, etc.). The indicator should be calculated for each admin source and then aggregated based on the contribution of the admin source to the statistical output.

How to calculate

(Estimated) Cost of conversion in person hours

Note. This should only be calculated for parts of the admin data relevant to the statistical output.

Example

A. Statistical output: Annual data on structure and competitiveness of companies of industry sector which are part of enterprise groups

B. Relevant units: Units in the statistical population

C. Relevant variables: Production value

D. Steps for calculation:

D1.Identify the time in person hours necessary to convert the admin data in order to obtain statistical data as a function of admin source size and complexity in the treatment of admin data.

Let c1=number of records in admin data. Let c2=estimated average number of minutes necessary to split the Production value variable to obtain the items related to each enterprise of the group. I(22)=Cost of conversion in person hours = f(#of records in admin data, estimated average number of minutes necessary to process one admin unit) I(22) = c1*c2 = 2172*15min = 32,580 min = 543 person hours. Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AT - Describe the processes required for converting admin data to statistical data and comment on any potential issues

Cost and efficiency

Data processing may be required to convert the admin data to statistical data. This indicator should provide the user with information on the types of processes and techniques used, along with a description of any issues that are dealt with, or perhaps still remain.

Page 199: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

199

23 Efficiency gain in using admin data24

Description This indicator provides information on the efficiency gain in using admin data rather than simply using survey data. For example, collecting admin data is usually cheaper than collecting data through a survey but this benefit might be offset by higher processing costs. This indicator should consider the total estimated costs of producing the output when using survey data (potentially a few years ago if the move was gradual) and then compare this to the total estimated costs of producing the output when using admin data or a combination of both. Production cost should include all costs the NSI is able to attribute to the production of the statistical output. (For example, this may include the cost of the use of computers and electrical equipment, staff costs, cost of data processing, cost of results dissemination, etc.) This can be viewed as a one-off indicator when moving from a survey based output to an output involving admin data.

How to calculate

%100statistic basedsurvey ofcost Production

statistic basedadmin ofcost production - statistic basedsurvey ofcost Production

Note. Estimated costs are acceptable.

This indicator is likely to be calculated once, when making the change from survey to admin data.

Example A. Statistical output: Annual data on structure and competitiveness in trade sector

B. Relevant units: Units in the statistical population

D. Steps for calculation:

D1. Quantify costs of survey based statistics (as a function of p=preparatory work of survey; s=survey costs; w=personnel costs; e=processing costs; d=dissemination costs)

D2. Quantify costs of admin based statistics (as a function of p=preparatory work of admin data; a=cost of obtaining admin data; w=personnel costs; e=processing costs; d=dissemination costs)

D3. Calculate the indicator as follows:

%100)23( statisticbased survey of cost Production

statisticbased admin of cost production - statisticbased survey of cost ProductionI

24 Indicators 20 and 23 are the only indicators in Section 3.2 for which a high indicator score denotes high

quality and a low indicator score denotes low quality.

Page 200: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

200

%15.53100*120,103

310,48120,103)23(

I

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AT - Describe the processes required for converting admin data to statistical data and comment on any potential issues

Cost and efficiency

Data processing may be required to convert the admin data to statistical data. This indicator should provide the user with information on the types of processes and techniques used, along with a description of any issues that are dealt with, or perhaps still remain.

Cost of survey based output=f(p,s,w,e,d)= =f(3,650, 45,890, 12,480, 18,540, 22,560)=3,650+45,890+12,480+18,540+22,560 = € 103,120.00

Cost of admin based output=f(p,a,w,e,d)= =f(2,540, 4,210, 8,500, 10,500, 22,560)=2,540+4,210+8,500+10,500+22,560

= € 48,310.00

Page 201: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

201

References Daas, P.J.H., Ossen, S.J.L. & Tennekes, M. (2010). Determination of administrative data quality: recent results and new developments. Paper and presentation for the European Conference on Quality in Official Statistics 2010. Helsinki, Finland. Eurostat, (2003). Item 6: Quality assessment of administrative data for statistical purposes. Luxembourg, Working group on assessment of quality in statistics, Eurostat. Frost, J.M., Green, S., Pereira, H., Rodrigues, S., Chumbau, A. & Mendes, J. (2010). Development of quality indicators for business statistics involving administrative data. Paper presented at the Q2010 European Conference on Quality in Official Statistics. Helsinki, Finland. Ossen, S.J.L., Daas, P.J.H. & Tennekes, M. (2011). Overall Assessment of the Quality of Administrative Data Sources. Paper accompanying the poster at the 58th Session of the International Statistical Institute. Dublin, Ireland.

European Commission, Eurostat, (2007). Handbook on Data Quality Assessment Methods and Tools.

Page 202: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

202

Appendix A: List of Qualitative Indicators by Theme

Relevance

Qualitative indicator Description Related

quantitative indicator(s)

A - Name each admin source used as an input into the statistical product and describe the primary purpose of the data collection for each source

Name all administrative sources and their providers. This information will assist users in assessing whether the statistical product is relevant for their intended use. Providing information on the primary purpose of data collection also enables users to assess whether the data are relevant to their needs.

1, 2

B – Describe the main uses of each of the admin sources and, where possible, how the data relate to the needs of users

Include all the main statistical processes and/or outputs known to require data from the administrative source This indicator should also capture how well the data support users’ needs. This information can be gathered from user satisfaction surveys and feedback.

1, 2

C – Describe the extent to which the data from the administrative source meet statistical requirements

Statistical requirements of the output should be outlined and the extent to which the administrative source meets these requirements stated. Gaps between the administrative data and statistical requirements can have an effect on the relevance to the user. Any gaps and reasons for the lack of completeness should be described, for example if certain areas of the target population are missed or if certain variables that would be useful are not collected. Any methods used to fill the gaps should be stated.

3

D – Describe constraints on the availability of administrative data at the required level of detail

Some administrative microdata have restricted availability or may only be available at aggregate level. Describe any restrictions on the level of data available and their effects on the statistical product.

3, 21

E – Describe reasons for use of admin data as a proxy

Where admin data are used in the statistical output as a proxy or are used in calculations rather than as raw data, information should be provided in terms of why the admin data have been used as a proxy for the required variables.

3

F - Identify known gaps between key user needs, in terms of coverage and detail, and current data

Data are complete when they meet user needs in terms of coverage and detail. This indicator allows users to assess, when there are gaps, how relevant the outputs are to their needs.

N/A

Page 203: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

203

Timeliness and punctuality

Qualitative indicator Description Related quantitative indicator(s)

G – Describe the timescale since the last update of data from the administrative source

An indication of the timescale since the last update from administrative sources will provide the user with an indication of whether the statistical product is timely enough to meet their needs.

4, 18

H – Describe the extent to which the administrative data are timely

Provide information on how soon after their collection the statistical institution receives the administrative data. The effects of any lack of timeliness on the statistical product should be described.

4, 18

I – Describe any lack of punctuality in the delivery of the administrative data source

Give details of the time lag between the scheduled and actual delivery dates of the data. Any reasons for the delay should be documented along with their effects on the statistical product.

4, 18

J – Frequency of production

This indicates how timely the outputs are, as the frequency of publication indicates whether the outputs are up to date with respect to users’ needs.

18

K – Describe key user needs for timeliness of data and how these needs have been addressed

This indicates how timely the data are for specified needs, and how timeliness has been secured, eg by reducing the time lag to a number of days rather than months for monthly releases.

18

Comparability

Qualitative indicator Description Related quantitative indicator(s)

L – Describe the impact of moving from a survey based output to an admin-data based output

Provide information on how the data are affected when moving from a survey based output to an admin-data based output. Comment on any reasons behind differences in the output and describe how any inconsistencies are dealt with.

8, 19

M – Describe any method(s) used to deal with discontinuity issues

Where the level of the estimate is impacted when moving from a survey based output to an admin-data based output, it may be possible to address this discontinuity. In this case a description of the method(s) used to deal with the discontinuity should be provided.

19

N - Describe the reasons behind discontinuities when moving from survey based estimates to admin-data based estimates

Where it is not possible to address a discontinuity in estimates when moving from a survey based output to an admin-data based output, the reasons behind the discontinuity should be provided along with commentary around the impact on the level of the estimate. Any changes in terms of the advantages and limitations of the differences should also be highlighted.

19

Page 204: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

204

Coherence

Qualitative indicator Description Related quantitative indicator(s)

O – Describe the common identifiers of population units in administrative data

Different administrative sources often have different population unit identifiers. The user can utilise this information to match records from two or more sources. Where there is a common identifier, matching is generally more successful.

5, 6, 20

P – Provide a statement of the nationally/internationally agreed definitions, classifications and standards used

This is an indicator of clarity, in that users are informed of concepts and classifications used in compiling the output. It also indicates geographical comparability where the agreed definitions and standards are used.

N/A

Q - Describe the width of the tolerance and the reasons for this

Where values within a particular tolerance are considered consistent for common variables across more than one source, the width of the tolerance should be stated, along with a brief explanation as to why this particular tolerance width was chosen.

20

R - Describe differences in concepts, definitions and classifications between the administrative source and the statistical output

There may be differences in concepts, definitions and classifications between the administrative source and statistical product. Concepts include the population, units, domains, variables and time reference and the definitions of these concepts may vary between the administrative data and the statistical product. Time reference problems occur when the statistical institution requires data from a certain time period but can only obtain them for another. Any effects on the statistical product need to be made clear along with any techniques used to remedy the problem.

10, 21

S - Describe any adjustments made for differences in concepts and definitions between the administrative source and the statistical output

Adjustments may be required as a result of differences in concepts and definitions between the administrative data and the requirements of the statistical product. A description of why the adjustment needed to be made and how the adjustment was made should be provided to the users so they can assess the coherence of the statistical product with other sources.

16, 21

T - Compare estimates with other estimates on the same theme

This statement advises users whether estimates from other sources on the same theme are coherent (ie they ‘tell the same story’), even where they are produced in different ways. Any known reasons for lack of coherence should be given.

N/A

Page 205: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

205

Accuracy

Qualitative indicator Description Related quantitative indicator(s)

U – Describe the record matching methods and processes used on the administrative data sources

Record matching is when different administrative records for the same unit are matched using a common, unique identifier or key variables common to both datasets. There are many different techniques for carrying out this process. A description of the technique (e.g. automatic or clerical matching) should be provided along with a description (qualitative or quantitative) of its effectiveness.

5, 6, 20

V – Describe the data processing known to be required on the administrative data source to deal with non-response

Data processing is often required to deal with non-response. The user should be made aware of how and why particular data processing methods are used.

9

W – Describe differences between responders and non-responders

This indicates to users how significant the non-response bias is likely to be. Where response is high, non-response bias is likely to be less of a problem than when there are high rates of non-response. NB: There may be instances where non-response bias is high even with very high response rates, if there are large differences between responders and non-responders.

9

X – Assess the likely impact of non-response/imputation on final estimates

Non-response error may reduce the representativeness of the data. An assessment of the likely impact of non-response/imputation on final estimates allows users to gauge how reliable the key estimates are as estimators of population values. This assessment may draw upon the indicator: ‘% of imputed values (items) in the admin data’.

9, 17

Y – Comment on the imputation method(s) in place within the statistical process

The imputation method used can determine how accurate the imputed value is. Information should be provided on why the particular method(s) was chosen and when it was last reviewed.

17

Z – Describe how the misclassification rate is determined

It is often difficult to calculate the misclassification rate.Therefore, where this is possible, a description of how the rate has been calculated should also be provided.

10

AA – Describe any issues with classification and how these issues are dealt with

Whereas a statistical institution can decide upon and adjust the classifications used in its own surveys to meet user needs, the institution usually has little or no influence over those used by administrative sources. Issues with classification and how these issues are dealt with should be described so that the user can decide whether the source meets their needs.

10

AB – Describe the extent of coverage of the administrative data and any known coverage problems

This information is useful for assessing whether the coverage is sufficient. The population that the administrative data covers should be included along with all known coverage problems. There could be overcoverage (where duplicate records are included) or undercoverage (where certain records are missed).

11, 12

Page 206: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

206

Qualitative indicator Description Related quantitative indicator(s)

AC – Describe methods used to deal with coverage issues

Updating procedures and cleaning procedures should be described. Also, information should be provided on edit checks and /or imputations that are carried out because of coverage error. This information indicates to users the resultant robustness of the administrative data, after these procedures have been carried out to improve coverage.

11, 12

AD – Assess the likely impact of coverage error on key estimates

Coverage issues could relate to undercoverage and overcoverage (or both). Special studies can sometimes be carried out to assess the impact of undercoverage and overcoverage. An assessment of the likely impact of coverage error indicates to users how reliable the key estimates are as estimators of population values.

11, 12

AE – Describe the data processing known to be required on the administrative data source to address instances where the reference period differs from the required reference period

Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

13

AF - Comment on the impact of the different versions of admin data on the results

When commenting on the size of the revisions of the different versions of the admin data, information on the impact of the revisions on the statistical product for the relevant reference period should also be explained to users.

14

AG – Flag any published data that are subject to revision and data that have already been revised

This indicator alerts users to published data that may be, or have already been, revised. This will enable users to assess whether provisional data will be fit for their purposes.

14

AH – For ad hoc revisions, detail revisions made and provide reasons

Where revisions occur on an ad hoc basis to published data, this may be because earlier estimates have been found to be inaccurate. Users should be clearly informed of the revisions made and why they occurred. Clarifying the reasons for revisions guards against any misinterpretation of why revisions have occurred, while at the same time making processes (including any errors that may have occurred) more transparent to users.

14

AI – Describe the known sources of error in administrative data

Metadata provided by the administrative source and/or information from other reliable sources can be used to assess data errors. The magnitude of any errors (where known) that have a significant impact on the administrative data should be made available to users. This will help the user to understand how accurate the administrative data are.

15

AJ – Describe the data processing known to be required on the administrative data source in terms of the types of checks carried out

Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

15

Page 207: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

207

Qualitative indicator Description Related quantitative indicator(s)

AK – Describe processing systems and quality control

This informs users of the mechanisms in place to minimise processing error by ensuring accurate data transfer and processing.

15, 16

AL – Describe the main sources of measurement error

Measurement error is the error that occurs from failing to collect the true data values from respondents. Measurement error cannot usually be calculated. However, outputs should contain a definition of measurement error and a description of the main sources of the error when the admin data holder gathers the data.

15, 16

AM – Describe processes employed by the admin data holder to reduce measurement error

Describing processes to reduce measurement error indicates to users the accuracy and reliability of the measures.

15, 16

AN – Describe the main sources of processing error

Processing error is the error that occurs when processing data. It includes errors in data capture, coding, editing and tabulation of the data. It is not usually possible to calculate processing error exactly. However, outputs should be accompanied by a definition of processing error and a description of the main sources of the error.

15, 16

AO – Describe the data processing known to be required on the administrative data source in terms of the types of edits carried out

Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

16

Accessibility and clarity

Qualitative indicator Description Related quantitative indicator(s)

AP – Reference/link to detailed revisions analyses

Where published data have been revised, users should be directed to where detailed revisions analyses are available.

14

Page 208: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

208

Cost and efficiency

Qualitative indicator Description Related quantitative indicator(s)

AQ – Describe reasons for significant overlap in admin data and survey data collection for some items

Where items are not obtained exclusively from admin data, reasons for the overlap between admin data and survey data should be described.

2

AR – Comment on the types of items that are being obtained by the admin source as well as the survey

If the same data items are collected from both the admin source and the survey, this can lead to duplication when combining the sources. This indicator should highlight to users the variables that are being collected across both sources.

7

AS – If items are purposely collected by both the admin source and the survey, describe the reason for this duplication (eg validity checks)

In some instances it may be beneficial to collect the same variables from both the admin source and the survey, such as to validate the micro level data. The reason for the double collection should be described to users.

7

AT - Describe the processes required for converting admin data to statistical data and comment on any potential issues

Data processing may be required to convert the admin data to statistical data. This indicator should provide the user with information on the types of processes and techniques used, along with a description of any issues that are dealt with, or perhaps still remain.

22, 23

Page 209: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

209

Appendix B: Notation for quality indicators

Administrative datasets are often ‘progressive’ – data for a given reference period can differ when measured at different time-points. This can present challenges when specifying and implementing quality indicators. This appendix outlines notation which may be helpful in specifying these kinds of problems and presents some possible solutions. For a more extensive treatment of the concept and a prediction framework for progressive data, see Zhang (2013). Notation to help deal with progressive nature of admin data It is important to be able to distinguish reference period – the time-point of interest - from measurement periods – the time-points at which we measure the time-point of interest. The following notation is suggested:

U(a ; b | c) – the population at time-period ‘a’ measured at time-period ‘b’ according to data source ‘c’

yi(a ; b | c) – value of interest for unit ‘i’ in U(a; b | c) So, for example, U(t ; t+α | Fiscal Register) refers to the population according to the ‘Fiscal Register’ administrative source for time-point ‘t’ measured ‘α’ periods after ‘t’. A characteristic of many administrative datasets is that the value for a given reference period depends on the measurement period: this can be referred to as progressiveness. This means that, for a lag ‘α’, both the number of units in U(t ; t+ α | c) and their total of any variable of interest will keep evolving over time, until α =∞ in principle. This characteristic is often true of business registers as well as administrative datasets, particularly when business registers are maintained using administrative sources. Implication for the implementation of the quality indicators When calculating quality indicators, results from an early version of the administrative data may produce very different results from a later version. The decision as to which version of an administrative dataset to use is therefore important and should be documented when the quality indicators are reported. The notation above may be useful in making and reporting this decision. Several quality indicators call for comparison with the business register. In this case, the choice of which version of the business register to use is equally important. Choice of datasets The version of the administrative data used in the estimation is usually the best one to use. Frequently, this will be a dataset for the correct reference period. Where the reference period of the administrative data differs from the statistical reference period – for example, where employment statistics for February use administrative data with a reference period of January - it may be informative to calculate an alternative set of indicators using the administrative data with the correct reference period. In our example, the ‘correct’ reference period would be February. This can help identify the quality impact of using administrative data with an incorrect reference period.

Page 210: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

210

In terms of the choice of business register, it may be preferable to use the most up-to-date version of the business register with the correct reference period. However, it may happen that the business register is updated using the administrative source under evaluation. In such cases, it may be preferable to use an earlier vintage of the business register, before this updating has taken place, but retaining the reference period of interest. It should be noted that this choice may be limited by practical constraints regarding what versions of the business register are stored. Concluding Remarks In general, it is important to consider the impact of the progressiveness of both administrative data and business registers and to record which versions are used in the calculation of the quality indicators. The notation set out above may be helpful when doing so. Reference Zhang, L-C. (2013). Towards VAT register-based monthly turnover statistics. Development report available on request.

Page 211: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

211

Appendix C: Glossary25

Term Definition

1. administrative data

The data derived from an administrative source, before any processing or validation by the NSIs.

2. administrative source

A data holding containing information collected and maintained for the purpose of implementing one or more administrative regulations.

In a wider sense, any data source containing information that is not primarily collected for statistical purposes.

3. common units

Within a process of matching, those units that are identified in more than one source.

4. consistent items

Within a process of matching, the values of a variable, referring to the same unit, that are logically and/or numerically coherent across different sources.

According to the level of accuracy required, values can be considered consistent even within a certain tolerance.

5. item A ‘value’ for a variable for a specific unit.

6. key variables Within the ESSnet Admin Data, this term is used to refer to the statistical variables that are most important and have the largest impact on a statistical output (e.g. turnover, number of employees, wages and salaries, etc.).

7. reference population

The set of units about which information is wanted and estimates are required.

8. relevant units

Businesses that are within the scope of the statistical output (e.g. units from the services sector should be excluded from manufacturing statistics).

9. relevant items

‘Values’ for units on relevant variables that should be included in calculating the statistical output.

10. required period

The reporting period used within the statistical output.

11. required variables

Variables necessary to calculate the statistical output.

12. statistical output

A statistic produced by the NSI – whether based on a specific variable (e.g. no. of employees) or a set of related variables (e.g. total turnover; domestic market turnover; external market turnover). In the broadest sense, statistical output would also apply to the whole STS or SBS output.

25 Work Package 1 (WP1) of the ESSnet AdminData has developed an ‘Admin Data Glossary’. To access the

glossary, please follow this link: http://essnet.admindata.eu/Glossary/List

Page 212: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

212

13. unit Refers to statistical units – enterprise, legal unit, local unit, etc.

14. weighted A number of the quality indicators described in this document can be calculated in unweighted or weighted versions. Formulae are given for the unweighted versions of the indicators. Weighting can be beneficial as the weighted indicator will often better describe the quality of the statistical output. For example, the unweighted item non-response will inform users what proportion of valid units did not respond for a particular variable, whereas the weighted item non-response will estimate the proportion of the output variable affected by non-response. A non-response rate of 30% is of less concern if those 30% of units only cover 1% of the output variable. In practice, we do not know the values of the output variable for non-responders, so we use a related variable instead. Business register variables such as Turnover or Employment are often used as proxies.

The weighted indicators are calculated as follows:

Page 213: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

213

Annex 1b – STS indicators

ESSNET

USE OF ADMINISTRATIVE AND ACCOUNTS DATA

IN BUSINESS STATISTICS

WP6 Quality Indicators when using Administrative Data

in Statistical Outputs

Tailored list of basic quality indicators:

Short Term (Business) Statistics (STS)

July, 2013

Page 214: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

214

1. Introduction

With the increasing use of administrative data in the production of business statistics comes the challenge for statistical producers of how to assess quality. The European Statistical System network (ESSnet) project on the Use of Admin and Accounts Data in Business Statistics was established to develop best practice in the use of admin data for business statistics. One work package of the ESSnet Admin Data focusses on quality and has developed quality indicators in this area. The current document provides a list of basic quality indicators specifically in relation to the use of admin data. More generic considerations of quality are available (see European Commission, Eurostat, 2007, for the Handbook on Data Quality Assessment Methods and Tools) but these have not specifically considered quality in the context of the increasing use of admin data – which has an impact on quality as not all the attributes of the quality framework can be applied in the same way to statistics involving admin data. Both quantitative and qualitative indicators are included in this list, which focusses on assessing the quality of the statistical output, taking the input and process into consideration. To aid statistical producers in their use of this list, tailored versions have been developed for the main statistical regulations: Structural Business Statistics – SBS – and Short Term (Business) Statistics – STS. This is in order to aid the understanding and application of this work within these areas. This document is the list of indicators including STS specific examples.

Page 215: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

215

2. A Quick Guide to the Quality Indicators

What are the quality indicators?

The European Statistical System network project on admin data (ESSnet Admin Data) has developed a list of quality indicators, for use with business statistics involving admin data. The indicators provide a measure of quality of the statistical output, taking input and process into account. They are based on the ESS dimensions of statistical output quality and other characteristics considered within the ESS Handbook for Quality Reports26. Who are they for?

The list of quality indicators has been developed primarily for producers of statistics, within the ESS and more widely. The indicators can also be used for quality reporting, thus benefiting users of the statistical outputs. They provide the user with an indication of the quality of the output, and an awareness of how the admin data have been used in the production of the output. When can they be used?

The list of quality indicators is particularly useful for two broad situations: 1. When planning to start using admin data as a replacement for, or to supplement,

survey data. In this scenario, the indicators can be used to assess the feasibility of increasing the use of admin data, and the impact on output quality.

2. When admin data are already being used to produce statistical outputs. In this scenario, the indicators can be used to gauge and report on the quality of the output, and to monitor it over time. Certain indicators will be suitable to report to users, whilst others will be most useful for the producers of the statistics only.

How should they be used?

There are 23 basic quantitative quality indicators and 46 qualitative quality indicators in total, but not all indicators will be relevant to all situations. Therefore, a statistical producer should select the indicators relevant to its output. The table in Section 3.3 shows which of the quantitative indicators relate to which dimension or ‘theme’ of quality, which may be useful in identifying which indicators to use. Indicators 1 to 8 are background indicators, which provide general information on the use of admin data in the statistical output in question but do not, directly, relate to the quality of the statistical output. Indicators 9 to 23 provide information directly addressing the quality of the statistical output.

26 More information on the ESS Handbook for quality reports can be found here:

http://epp.eurostat.ec.europa.eu/portal/page/portal/product_details/publication?p_product_code=KS-RA-08-016

Page 216: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

216

3. Quality Indicators when using Administrative Data in Statistical Outputs

3.1 Quantitative quality indicators

One of the aims of the ESSnet Admin Data is the development of quality indicators for business statistics involving admin data, with a particular focus on developing quantitative quality indicators and qualitative indicators to complement them.

Some work has already been done in the area of quality of business statistics involving admin data and some indicators have been produced. However, the work conducted thus far refers to qualitative indicators or is based more on a descriptive analysis of admin data (see Eurostat, 2003). The quantitative indicators that have been produced have been more to do with the quality of the admin sources (Daas, Ossen & Tennekes, 2010) or have been to develop a quality framework for the evaluation of admin data (Ossen, Daas & Tennekes, 2011) 27. These do not address the quality of the production of the statistical output however. In fact, almost no work has been done on quantitative indicators of business statistics involving admin data, which is the main focus of this project (for further discussion on this topic see Frost, Green, Pereira, Rodrigues, Chumbau & Mendes, 2010).

The ESSnet aims to develop quality indicators of statistical outputs that involve admin data. These indicators are for the use of members of the European Statistical System; producers of statistics. Therefore, the list contains indicators on input and process because these are critical to the work of the National Statistical Institutes and it is the input and process in particular that are different when using admin data. Moreover, the list of indicators developed is specifically in relation to business statistics involving admin data. Indicators (e.g. on accessibility) that do not differ for admin vs. survey based statistics are not included in this work because they fall outside the remit of the ESSnet Admin Data project.

To address some issues of terminology, a few definitions are provided below to clarify how these terms are used in this document and throughout the ESSnet Admin Data.

What is administrative data? Administrative data are the data derived from an administrative source, before any processing or validation by the NSIs.

What is an administrative source? An administrative source is a data holding containing information collected and maintained for the purpose of implementing one or more administrative regulations. In a wider sense, any data source containing information that is not primarily collected for statistical purposes.

Further information on terminology and useful links to other, related work is available on the ESSnet Admin Data Information Centre28.

A list of quantitative quality indicators has been developed on the basis of research which took stock of work being conducted in this field across Europe29. This list was

27 More information on the BLUE-ETS project and the associated deliverables can be found here:

http://www.blue-ets.istat.it/index.php?id=7 28 ESSnet Admin Data Glossary: http://essnet.admindata.eu/Glossary/List

ESSnet Admin Data Reference Library: http://essnet.admindata.eu/ReferenceLibrary/List

Page 217: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

217

then user tested within five European NSIs, before testing across Member States30. Feedback from this testing was used to improve the list of quality indicators during its development (2010/11). The entry for each quantitative indicator is self-contained in the attached list (see Section 4), including a description, information on how the indicator can be calculated and one or two examples. As this document is tailored to aid producers involved in the STS regulation, all the examples are in this domain. Qualitative (or descriptive) indicators have also been developed to complement the quantitative indicators and are included in Section 4. Further information on the qualitative indicators is included in Section 3.2. The quantitative indicators have been developed so that a low indicator score denotes high quality, and a high indicator score denotes low quality. This is consistent with the concept of error, where high errors signify low quality. In essence, the indicators measure quality risks – for example, the higher the level of non-response, the higher the risk to the quality of the output. The exceptions to this rule are the background indicators (1 to 8), where the score provides information rather than a quality ‘rating’; and indicators 20 and 23, where a high indicator score denotes high quality, and a low indicator score denotes low quality. Examples are also given for weighted indicators, for example weighting the indicators by turnover or number of employees. Caution needs to be taken when considering these weighted indicators in case of bias caused by the weighting. A framework for the basic quantitative quality indicator examples

The calculation of an indicator needs some preliminary steps. Some or all of the steps will be used for each example of the indicators to ensure consistency of the examples, and to aid understanding of the indicators themselves. A simple framework to aid calculating the quantitative indicators is included here:

A. Define the statistical output B. Define the relevant units C. Define the relevant variables D. Adopt a schema for calculation E. Declare the tolerance for quantitative and qualitative variables

The list is not created so that producers calculate all indicators but rather as a means of enabling producers to select those that a most relevant to their output. Not all indicators will apply in all situations and it is not recommended that they are all calculated on an ongoing, regular basis. Whilst, some may be useful for exactly this purpose, others may only be used when considering options for increasing the use of

29 A summary of the main findings of this stock take research (Deliverable 2010/6.1) is available on the ESSnet

Infomration Centre here: http://essnet.admindata.eu/WikiEntity?objectId=4696 30 The outcome of this testing is reported on the ESSnet Information Centre (included within the SGA 2010 final

report) and is available here: http://essnet.admindata.eu/WikiEntity?objectId=4751

A summary of the 2011 user testing is reported in the Phase 2 User Testing Report, available here:

http://essnet.admindata.eu/WikiEntity?objectId=4696

Page 218: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

218

admin data or when undergoing or evaluating changes in the process of producing the statistical output.

Links between this and other work on Quality

The work being carried out under this project should not be seen as independent of other work already in place. When analysing the list of indicators, one can conclude that some other information is useful in regard to the quality of the output. However, some of that very useful information is not specific to the use of admin data and thus is out of scope for the work of this ESSnet. This work is for the benefit of the members of the European Statistical System (ESS); the producers of statistics. Consequently, the end result of the ESSnet Admin Data work in this area should be integrated with the work already in place in NSIs and Eurostat.

3.2 Qualitative quality indicators

While much of the focus of the ESSnet Admin Data work on quality has been around the development of quantitative quality indicators, the project also required the development of qualitative quality indicators to complement the quantitative indicators. Quantitative and qualitative indicators can be thought of as numerical and descriptive quality indicators respectively: the quantitative indicators provide a numerical measure around the quality of the output, whereas the qualitative indicators provide further descriptive information that cannot be obtained from observing a numerical value. Many of the qualitative indicators have been taken from a UK document entitled ‘Guidelines for Measuring Statistical Output Quality’, which serves as a comprehensive list of quality measures and indicators for reporting on the quality of a statistical output. Others have been developed as part of the work of the ESSnet Admin Data. Beneath each quantitative indicator in Section 4 is a table which displays any potentially relevant qualitative indicators, a description of each indicator and the quality theme with which they are associated. Some of the qualitative indicators are repeated in Section 4 as they are related to more than one quantitative indicator. Appendix A contains a complete list of all qualitative indicators, grouped by theme, and also references the quantitative indicators to which they have been linked in Section 4.

Page 219: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

219

3.3 Using the list of quality indicators

The list of indicators has been grouped into two main areas:

3. Background Information – these are ‘indicators’ in the loosest sense. They provide general information on the use of admin data in the statistical output in question but do not, directly, relate to the quality of the statistical output. This information is often crucial in understanding better those indicators that measure quality more directly.

4. Quality Indicators – these provide information directly addressing the quality of the statistical output.

The background information indicators and the quality indicators are further grouped by quality ‘theme’. These quality themes are based on the ESS dimensions of output quality, with some additional themes which relate specifically to admin data and are consistent with quality considerations as outlined in the ESS Handbook on Quality Reports. The quality themes are:

Quality theme Description

Relevance

Relevance is the degree to which statistical outputs meet current and potential user needs. Note: only a subset of potential relevance quality indicators are considered within this document given the scope of the ESSnet project (eg. differences between statistical and admin data definitions). All relevance indicators are qualitative.

Accuracy The closeness between an estimated result and the unknown true value.

Timeliness and punctuality The lapse of time between publication and the period to which the data refer, and the time lag between actual and planned publication dates.

Comparability The degree to which data can be compared over time and domain.

Coherence The degree to which data that are derived from different sources or methods, but which refer to the same phenomenon, are similar.

Page 220: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

220

Other relevant considerations

Cost and efficiency The cost of incorporating admin data into statistical systems, and the efficiency savings possible when using admin data in place of survey data.

Use of administrative data Background information relating to admin data inputs.

The following table shows which quantitative indicators are relevant to each of the

quality themes.

Quality theme Quantitative indicators relevant to that theme

Accuracy 9, 10, 11, 12, 13, 14, 15, 16, 17.

Timeliness and punctuality

4, 18.

Comparability 19.

Coherence 5, 6, 20, 21.

Cost and efficiency 7, 8, 22, 23.

Use of administrative

data 1, 2, 3.

Reminder:

The quantitative indicators have been developed so that a low indicator score denotes high quality, and a high indicator score denotes low quality. Thus, the indicators measure quality risks – for example, the higher the level of non-response, the higher the risk to the quality of the output. The exceptions to this rule are the background indicators (1 to 8), where the score provides information rather than a quality ‘rating’; and indicators 20 and 23, where a high indicator score denotes high quality, and a low indicator score denotes low quality.

Each individual indicator will not apply in all situations. The list is not created so that producers calculate all indicators but rather as a means of enabling producers to select those that a most relevant to their output.

Page 221: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

221

4.1 Background Information (indicators) Use of administrative data:

1 Number of admin sources used

Description This indicator provides information on the number of administrative sources used in each statistical output. The number of sources should include all those used in the statistical output whether the admin data are used as raw data, in imputation or to produce estimates. In general, administrative data sources used for updating base registers (where available) should not be included in this indicator.

How to calculate Note. Where relevant, a list of the admin sources may also be helpful for users, along with a list of the variables included in each source. Alternatively, the number of admin sources used can be specified by variable.

Note: all examples use the relevant parts of the examples framework set out in Section 3.1.

Example

A. Statistical output: Quarterly construction

B. Relevant units: Units in the statistical population

C. Steps for calculation: Identify the relevant admin sources.

Let S1 be the Social Security source.

I(1) = 1 source.

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

A - Name each admin source used as an input into the statistical product and describe the primary purpose of the data collection for each source

Relevance Name all administrative sources and their providers. This information will assist users in assessing whether the statistical product is relevant for their intended use. Providing information on the primary purpose of data collection also enables users to assess whether the data are relevant to their needs.

B - Describe the main uses of each of the admin sources and, where possible, how the data relate to the needs of users

Relevance Include all the main statistical processes and/or outputs known to require data from the administrative source This indicator should also capture how well the data support users’ needs. This information can be gathered from user satisfaction surveys and feedback.

For further clarification on

terminology and definitions of

terms used, please refer to the

Glossary included in Appendix C.

C.

Page 222: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

222

2 % of items obtained exclusively from admin data

Description This indicator provides information on the proportion of items only obtained from admin data, whether directly or indirectly, and where survey data are not collected. This includes where admin data are used as raw data, as proxy data, in calculations, etc. This indicator should be calculated on the basis of the statistical output – the number of items obtained exclusively from admin data (not by survey) should be considered.

How to calculate

%100 items of no. Total

dataadmin fromy exclusivel obtained items of No.

This indicator could also be weighted in terms of whether or not the variables are key to the statistical output.

Example

A. Statistical output: Monthly manufacturing

B. Relevant units: Units in the statistical population

C. Relevant variables Number of employees

D. Steps for calculation:

D1. For the relevant variable, calculate the number of items for which the variable is obtained exclusively from admin data (items with non missing variable);

D2. Divide the sum of numbers of items for which the variables are obtained exclusively from admin data by the sum of numbers of items for which the variable is not missing

D3. Calculate the indicator as follows:

Let S1 be the Social Security source.

Page 223: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

223

I(2)=itemsofnumberTotal

dataminadfromyexclusivelobtaineditemsofNo.*100= %7.35100*

12

5

Weighted by turnover:

I(2)W=

100*277,923,1

518,27860,27000,58632,79652,12816.7%

Units

Number of

employees-S1

Number of

employees

(survey)

Items obtained

exclusively from admin

data (1/0)=(Y/N) Items non missing Turnover

X1 27 1 1 128,652

X2 512 518 0 1 759,830

X3 missing 2 0 0 14,000

X4 28 27 0 1 253,000

X5 11 1 1 79,632

X6 3 2 0 1 22,536

X7 118 120 0 1 123,412

X8 123 123 0 1 237,523

X9 1 1 1 58,000

X10 missing 1 0 0 39,800

X11 1 1 1 27,860

X12 3 3 0 1 79,845

X13 28 30 0 1 125,469

X14 2 1 1 27,518

Sum 5 12 1,977,077

Page 224: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

224

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

A - Name each admin source used as an input into the statistical product and describe the primary purpose of the data collection for each source

Relevance Name all administrative sources and their providers. This information will assist users in assessing whether the statistical product is relevant for their intended use. Providing information on the primary purpose of data collection also enables users to assess whether the data are relevant to their needs.

B - Describe the main uses of each of the admin sources and, where possible, how the data relate to the needs of users

Relevance Include all the main statistical processes and/or outputs known to require data from the administrative source This indicator should also capture how well the data support users’ needs. This information can be gathered from user satisfaction surveys and feedback.

AQ – Describe reasons for significant overlap in admin data and survey data collection for some items.

Cost and efficiency

Where items are not obtained exclusively from admin data, reasons for the overlap between admin data and survey data should be described.

Page 225: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

225

3 % of required variables which are derived using admin data as a proxy

Description This indicator provides information on the extent that admin data are used in the statistical output as a proxy or are used in calculations rather than as raw data. A proxy variable can be defined as a variable that is related to the required variable and is used as a substitute when the required variable is not available. This indicator should be calculated on the basis of the statistical output – the number of required variables derived indirectly from admin data (because not available directly from admin or survey data) should be considered.

How to calculate

%100variables required of No.

proxy a as data admin using derived are whichvariables required of No.

Note. If a combination of survey and admin data is used, this indicator would need to be weighted (by number of units). If double collection is necessary (e.g. to check quality of admin data), some explanation should be provided. This indicator could also be weighted in terms of whether or not the variables are key to the statistical output.

Example

A. Statistical output: Monthly manufacturing

B. Relevant unit: Units in the statistical population

C. Let the list of relevant variables31 be as follows:

1) Production; 2) Turnover; 3) Number of persons employed; 4) Hours worked; 5) Gross wages and salaries

The relevant variable obtained from the Fiscal source is the VAT turnover, proxy of the Turnover.

D: Steps for calculation:

D1: Number of required variables derived from admin data D2: Number of variables of D1 used as a proxy (i.e. the variable is derived indirectly from admin data) D3: Number of required variables by the STS Regulation

31

The variables are required by the Regulation (EC) No 1165/98 of the European Parliament and of the Council,

but each country will decide which of the required variables will be relevant for this indicator.

Page 226: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

226

Formula: I(3)=

%20100*5

1100*

.

.

iablesvarrequiredofNo

proxyaasdataminadgsinuderivedarewhichblesvariarequiredofNo

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

C – Describe the extent to which the data from the administrative source meet statistical requirements

Relevance Statistical requirements of the output should be outlined and the extent to which the administrative source meets these requirements stated. Gaps between the administrative data and statistical requirements can have an effect on the relevance to the user. Any gaps and reasons for the lack of completeness should be described, for example, if certain areas of the target population are missed or if certain variables that would be useful are not collected. Any methods used to fill the gaps should be stated.

D – Describe constraints on the availability of administrative data at the required level of detail

Relevance Some administrative microdata have restricted availability or may only be available at aggregate level. Describe any restrictions on the level of data available and their effects on the statistical product.

E – Describe reasons for use of admin data as a proxy.

Relevance Where admin data are used in the statistical output as a proxy or are used in calculations rather than as raw data, information should be provided in terms of why the admin data have been used as a proxy for the required variables.

Page 227: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

227

Timeliness and punctuality:

4 Periodicity (frequency of arrival of the admin data)

Description This indicator provides information about how often the admin data are received by the NSI. This indicator should be provided for each admin source.

How to calculate Note. If data are provided via continuous feed from the admin source, this should be stated in answer to this indicator. Only data you receive for statistical purposes should be considered.

Example

A. Statistical output: OROS Survey (Employment, earnings and social security

contributions) based on the Social Security administrative data.

B. Relevant units: Small enterprises with employees

D: Steps for calculation: Record periodicity for each source

Let S1 be the VAT Turnover source.

Let S2 be the Social Security source.

I’S1(4) = 4; I’S2(4) = 4

VAT Turnover data 4

Social Security data 4

Type of admin data

Frequency of arrival of the admin data

- Per year

Page 228: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

228

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

G – Describe the timescale since the last update of data from the administrative source

Timeliness An indication of the timescale since the last update from administrative sources will provide the user with an indication of whether the statistical product is timely enough to meet their needs.

H – Describe the extent to which the administrative data are timely

Timeliness Provide information on how soon after their collection the statistical institution receives the administrative data. The effects of any lack of timeliness on the statistical product should be described.

I – Describe any lack of punctuality in the delivery of the administrative data source

Timeliness Give details of the time lag between the scheduled and actual delivery dates of the data. Any reasons for the delay should be documented along with their effects on the statistical product.

Page 229: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

229

Coherence:

5 % of common units across two or more admin sources

Description

This indicator relates to the combination of two or more admin sources. This indicator provides information on the proportion of common units across two or more admin sources. Only units relevant to the statistical output should be considered. This indicator should be calculated pairwise for each pair of admin sources and then averaged. If only one admin source is available, this indicator is not relevant.

How to calculate

%100 units uniquerelevant of No.

sourcesadmin in the unitscommon relevant of No.

Note. The “unique units” in the denominator means that units should only be counted once, even if they appear in multiple sources. This indicator should be calculated separately for each variable. If the sources are designed to cover different populations and are combined to provide an overall picture, this should be explained. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution of these units to the statistical output.

Example

A. Statistical output: Feasibility study of a quarterly sample survey on the retail sector.

B. Relevant units: Enterprises in the feasibility study;

C. Relevant variables: Number of employees of enterprises of retail trade sector ;

D. Steps for calculation:

D1. Identify the statistical unit (enterprise) for each source (i.e. group the administrative records in one source at id code level)

D2. Match all sources with each other by id code D3. Attribute a Presence(1)/Absence(0) indicator to the unit with regard to the specific

source D4. Calculate the number of possible pairings between sources (i.e. when there are n

sources, it is the combination of n sources taken 2 at a time) Cn,2= n!/(n-2)!* 2! Cn= n/2*(n-1) Let’s suppose 2 sources, the possible combinations will be: C2 =2/2=1 D5. Multiply the Presence(1)/Absence(0) indicator to obtain the

Presence(1)/Absence(0) indicator for each pairwise D6. Sum up the Presence(1)/Absence(0) indicators at pair level and divide by Cn* no. of relevant units (m)

Page 230: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

230

Let A be the Social Security source. Let B be a data bank on retail trade sector (e.g. Nielsen) Formula:

I(5) = %100 units unique relevant of No.

sourcesadmin the in units common relevant of No.

n = 2, m = 15 Numerator = 5 Denominator = Cn* no. of relevant units (m) = 1*15 = 15 I(5)= (Numerator/Denominator)*100 = (5/15)*100=33.3% Weighted by turnover: Numeratorw = 5,320,088 Denominatorw = 1*6,030,179=6,030,179 I(5)w= (Numeratorw / Denominatorw )*100 = (5,320,088/6,030,179)*100=88.2%

UNITS

Number of

employees -

Source A

Number of

employees

- Source B A B AB Turnover (1) AB*(1)

X1 125 97 1 1 1 1,700,254 1,700,254

X2 45 48 1 1 1 526,032 526,032

X3 1 0 1 0 32,100 0

X4 18 1 0 0 87,000 0

X5 1 1 0 0 6,212 0

X6 2 1 0 0 39,254 0

X7 1 1 0 0 65,958 0

X8 2 1 0 0 58,951 0

X9 15 1 0 0 254,100 0

X10 1 1 0 0 38,000 0

X11 538 500 1 1 1 2,536,897 2,536,897

X12 2 2 1 1 1 29,874 29,874

X13 7 1 0 0 89,562 0

X14 1 1 0 0 38,954 0

X15 324 326 1 1 1 527,031 527,031

Sum 5 6,030,179 5,320,088

Page 231: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

231

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

O – Describe the common identifiers of population units in administrative data

Coherence Different administrative sources often have different population unit identifiers. The user can utilise this information to match records from two or more sources. Where there is a common identifier, matching is generally more successful.

U – Describe the record matching methods and processes used on the administrative data sources

Accuracy Record matching is when different administrative records for the same unit are matched using a common, unique identifier or key variables common to both datasets. There are many different techniques for carrying out this process. A description of the technique (e.g. automatic or clerical matching) should be provided along with a description (qualitative or quantitative) of its effectiveness.

Page 232: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

232

6 % of common units when combining admin and survey data

Description This indicator relates to the combination of admin and survey data. This indicator provides information on the proportion of common units across admin and survey data. Linking errors should be detected and resolved before this indicator is calculated. This indicator should be calculated for each admin source and then aggregated based on the number of common units (weighted by turnover) in each source. (For information on the progressive nature of administrative data, see Appendix B.)

How to calculate

%100survey in units of No.

datasurvey andadmin in unitscommon of No.

Note. If there are few common units due to the design of the statistical output (e.g. a combination of survey and admin data), this should be explained. If the sources are designed to cover different populations and are combined to provide an overall picture, this should also be explained. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution of these units to the statistical output.

Example

A. Statistical output: Monthly manufacturing

B. Relevant units: Enterprises of the survey

C. Relevant variables: Turnover

D. Steps for calculation:

D1. Match each source with survey(s) by the common id code (if available) or by other methods

D2. Attribute a Presence(1)/Absence(0) indicator to the unit if it belongs at least to a survey (sum up for obtaining denominator)

D3. Attribute a Presence(1)/Absence(0) indicator to the unit if it belongs both to the survey and to each source (sum up by source for obtaining numerator)

D4. Calculate the aggregate indicator as follows:

Let A be the VAT Turnover source.

100*.

.)6(

SurveyunitsofNo

SurveyAunitsofNoI

Page 233: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

233

I(6)=(7/7)*100=100% Weighted by employment: I(6)w=(181/188)*100=96.3% Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

O - Describe the common identifiers of population units in administrative data

Coherence

Different administrative sources often have different population unit identifiers. The user can utilise this information to match records from two or more sources. Where there is a common identifier, matching is generally more successful.

U – Describe the record matching methods and processes used on the administrative data sources

Accuracy Record matching is when different administrative records for the same unit are matched using a common, unique identifier or key variables common to both datasets. There are many different techniques for carrying out this process. A description of the technique (e.g. automatic or clerical matching) should be provided along with a description (qualitative or quantitative) of its effectiveness.

Units

Turnover

source A

Turnover

survey Source A Ind_Survey Ind_Survey ∩ A

No. of

Employees (B)*(C)

(B) (C)

X1 257896 259632 1 1 1 33 33

X2 58211 57917 1 1 1 2 2

X3 25632 1 0 0 2 0

X4 18000 1 0 0 0 0

X5 789654 789654 1 1 1 78 78

X6 587224 586947 1 1 1 45 45

X7 28777 28000 1 1 1 1 1

X8 128000 128125 1 1 1 22 22

X9 51420 1 0 0 3 0

X10 15420 1 0 0 2 0

X11 23456 25550 1 1 1 0 0

Sum 11 7 7 188 181

Page 234: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

234

Cost and efficiency:

7 % of items obtained from admin source and also collected by survey

Description This indicator relates to the combination of admin and survey data. This indicator provides information on the double collection of data, both admin source and surveys. Thus, it provides an idea of redundancy as the same data items are being obtained more than once. This indicator should be calculated for each admin source and then aggregated. Note. Double collection is sometimes conducted for specific reasons, e.g. to measure quality or because admin data is not sufficiently timely for the requirements of the statistical output. If this is the case, this should be explained.

How to calculate

%100survey in itemsrelevant of No.

datasurvey andadmin by obtained itemscommon relevant of No.

Only admin data which meet the definitions and timeliness requirements of the output should be included.

Example

A. Statistical output: Monthly data on the industrial sector

B. Relevant units: Units in the survey

C. Relevant variable: Number of employees

D. Steps for calculation:

D1. Match each source with survey(s) by the common id code (if available) or by other methods

D2. Attribute a Presence(1)/Absence(0) indicator to relevant units in the survey (sum up for obtaining denominator)

D3. Attribute a Presence(1)/Absence(0) indicator for common items in the survey and in the source (sum up for obtaining numerator)

D4. Calculate the indicator as follows: Let EMP be the Social Security source. Let STS1 be the survey.

100*)(.

.)7(

ssurveyinitemsrelevantofofNo

datasurveyandminadbyobtaineditemscommonrelevantofNoI

Page 235: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

235

I(7)=(8/9)*100=89% Weighted by turnover: I(7)W=(422,426,156/422,444,698)*100=100% Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AR – Comment on the types of items that are being obtained by the admin source as well as the survey

Cost and Efficiency

If the same data items are collected from both the admin source and the survey, this can lead to duplication when combining the sources. This indicator should highlight to users the variables that are being collected across both sources.

AS – If items are purposely collected by both the admin source and the survey, describe the reason for this duplication (eg validity checks)

Cost and Efficiency

In some instances it may be beneficial to collect the same variables from both the admin source and the survey, such as to validate the micro level data. The reason for the double collection should be described to users.

Units Turnover

Number of employees (EMP)

Number of employees (STS1)

Number of items in STS1

Presence/Absence (1/0) index of items in STS1 and EMP

(A) (B) (C) (D) (E)=(B) ∩

(C) (A)*(D) (A)*(E)

X1 2,157,322 15 18 1 1 2,157,322 2,157,322 X2 14,000 0 0 0 0 X3 3,458,610 27 25 1 1 3,458,610 3,458,610 X4 358,987,462 587 600 1 1 358,987,462 358,987,462 X5 22,125 2 0 0 0 0 X6 5,027,321 34 34 1 1 5,027,321 5,027,321 X7 32,154 1 0 0 0 0 X8 18,542 1 1 0 18,542 0 X9 27,854 5 5 1 1 27,854 27,854 X10 52,369,584 965 962 1 1 52,369,584 52,369,584 X11 20,154 1 0 0 0 0 X12 153,000 2 2 1 1 153,000 153,000 X13 87,965 7 0 0 0 0 X14 245,003 17 18 1 1 245,003 245,003

Sum 9 8 422,444,698 422,426,156

Page 236: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

236

8 % reduction of survey sample size when moving from survey to admin data

Description This indicator relates to the combination of admin and survey data. This indicator provides information on the reduction in survey sample size because of an increased use of admin data. Only changes to the sample size due to using admin data should be included in this calculation. The indicator should be calculated for each survey and then aggregated (if applicable). This can be viewed as a one-off indicator when moving from a survey based output to an output involving admin data.

How to calculate

%100dataadmin of usein increase before size Sample

after size sample - dataadmin of usein increase before size Sample

Note. This indicator is likely to be calculated once, when making the change from survey to admin data.

Example 1 A. Statistical output: Monthly manufacturing

B. Relevant units: Units in the statistical population

D. Steps for calculation: D1. Identify sample size before use of admin data D2. Identify sample size after use of admin data D3. Calculate the indicator as follows:

Sample size before increase in use of administrative data: 5231; Sample size after increase in use of administrative data: 4536. Formula:

I(8)= 100*dataminadofuseinincreasebeforesizeSample

aftersizeSampledataminadofuseinincreasebeforesizeSample

I(8) = %2.13100*5231

45365231

Example 2 A. Statistical output: Monthly manufacturing

B. Relevant units: Units in the statistical population

C. Relevant variables: Number of employees

D. Steps for calculation: D1. Identify sample size before use of admin data D2. Identify sample size after use of admin data D3. Calculate the indicator as follows:

Page 237: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

237

Let A be the Social Security source. We identify the sample size before the use of admin data basing the estimation of variability on the sample survey of the same quarter of the previous year. We want a maximum error

(Ɛ) of 1.5 employees and a reliability of 95%. S2=Correct sample variance for the same quarter of the previous year: 78.32. n=sample size before use of Admin data=(z2

α*s2/ε2)=1.962*78.32/1.52=10,468.

We identify the sample size after the use of admin data basing the estimation of variability on the Admin data of source A in the previous quarter for the entire manufacturing sector. S2=Variance of admin data in source A for the same quarter of the previous year: 72.32. n’=sample size after use of Admin data=(z2

α*s2/ε2)=1.962*72.32/1.52=8925.

I(8)=

468,10

8925468,10'

n

nn14.74%

Thus, due to an increase in the use of admin data, the survey sample size has decreased by 14.74%.

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

L – Describe the impact of moving from a survey based output to an admin-data based output

Comparability Provide information on how the data are affected when moving from a survey based output to an admin-data based output. Comment on any reasons behind differences in the output and describe how any inconsistencies are dealt with.

Page 238: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

238

4.2 Quality Indicators Accuracy:

9 Item non-response (% of units with missing values for key variables)

Description Although there are technically no ‘responses’ when using admin data, non-response (missing values at item or unit level) is an issue in the same way as with survey data. This indicator provides information on the extent of missing values for the key variables. The higher the level of missing values, the poorer the quality of the data (and potentially the statistical output). However, other indicators should also be considered, eg. the level of imputation and also the means of imputation used to address this missingness. This indicator should be calculated for each of the key variables and for each admin source and then aggregated based on the contributions of the variables to the overall output.

How to calculate

%100 variableXfor relevant units of No.

variableXfor valuemissing with dataadmin in the unitsrelevant of No.

This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output

Example

A. Statistical output: Quarterly construction

B. Relevant units: Units of the statistical population

C. Relevant variables: Number of employees

D. Steps for calculation:

D1. Match source A with units in the statistical population and take the common units D2. Calculate number of units in D1 with missing value for source A D4. Calculate the indicator as follows:

Let A be the Social Security source.

Page 239: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

239

I(9) = 100*.

.

iablevartheforrelevantunitsofNo

iablevartheforvaluesgsinmiswithdataminadtheinunitsrelevantofNo

I(9) = %10100*10

1

Weighted by turnover:

I(9)W= %59.0100*072,191,4

880,24

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

V – Describe the data processing known to be required on the administrative data source to deal with non-response

Accuracy Data processing is often required to deal with non-response. The user should be made aware of how and why particular data processing methods are used.

W – Describe differences between responders and non-responders

Accuracy This indicates to users how significant the non-response bias is likely to be. Where response is high, non-response bias is likely to be less of a problem than when there are high rates of non-response. NB: There may be instances where non-response bias is high even with very high response rates, if there are large differences between responders and non-responders.

X – Assess the likely impact of non-response/imputation on final estimates

Accuracy Non-response error may reduce the representativeness of the data. An assessment of the likely impact of non-response/imputation on final estimates allows users to gauge how reliable the key estimates are as estimators of population values. This assessment may draw upon the indicator: ‘% of imputed values (items) in the admin data’.

Units

Number of

employees (Source

A)

Number of units in

source A with missing values for employees

(1/0)=(Y/N) Number of relevant

units for the variable Turnover

X1 15 0 1 158,325

X2 1 1 24,880

X3 25 0 1 233,541

X4 178 0 1 780,251

X5 52 0 1 200,320

X6 1 0 1 18,000

X7 1 0 1 15,358

X8 37 0 1 785,423

X9 19 0 1 185,320

X10 612 0 1 1,789,654

Sum 1 10 4,191,072

Page 240: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

240

10 Misclassification rate

Description This indicator provides information on the proportion of units in the admin data which are incorrectly coded. For simplicity and clarity, activity coding as recorded on the Business Register (BR) can be considered to be correct – the example in this report makes this assumption (the validity of this assumption will depend on the systems used within different countries; other sources may be used if there is evidence they are more accurate than the BR). The level of coding used should be at a level consistent with the level used in the statistical output (e.g. if the statistical output is produced at the 3-digit level, then the accuracy of the coding should be measured at this level). This indicator should be calculated for each admin source and then aggregated based on the number of relevant units (weighted by turnover) in each source.

How to calculate

%100 dataadmin in unitsrelevant of No.

BR tocode NACEdifferent with dataadmin in unitsrelevant of No.

Note. If the activity code from the admin data is not used by the NSI (e.g. if coding from BR is used), details of the misclassification rate for the BR should be provided instead.

If a survey is conducted to check the rate of misclassification, the rate from this survey should be provided and a note added to the indicator. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example

B. Statistical output: Monthly data on enterprises

C. Relevant units: Units of the statistical population

D. Relevant variables: ATECO (5-digits, Italian version of NACE classification).

E. Steps for calculation: D1. Match each source with the Business Register by the common ID code (if

available) or by other methods D2. Attribute a Presence(1)/Absence(0) indicator to items of the variable in each

admin data source (sum up for obtaining denominator) D3. Attribute a value=1/0 for inconsistency/consistency between the items of the

admin source(s) and the items in the source(s) (sum up for obtaining numerator)

D4. Calculate the indicator as follows:

I(10) = dataminadinunitsrelevantofNo

BRtocodeATECOdifferentwithdataminadinunitsrelevantofNo

.

.

Page 241: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

241

E. Tolerance: The division of the ATECO code (first two digits) must be the same to consider the items “consistent”

Let CCIAA be the Chamber of Commerce source.

%30100*10

3)10( I

Weighted by employment:

%2.2100*90

2)10( WI

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

R - Describe differences in concepts, definitions and classifications between the administrative source and the statistical output

Coherence

There may be differences in concepts, definitions and classifications between the administrative source and statistical product. Concepts include the population, units, domains, variables and time reference and the definitions of these concepts may vary between the administrative data and the statistical product. Time reference problems occur when the statistical institution requires data from a certain time period but can only obtain them for another. Any effects on the statistical product need to be made clear along with any techniques used to remedy the problem.

Z – Describe how the misclassification rate is determined

Accuracy It is often difficult to calculate the misclassification rate.Therefore, where this is possible, a description of how the rate has been calculated should also be provided.

AA – Describe any issues with classification and how these issues are dealt with

Accuracy Whereas a statistical institution can decide upon and adjust the classifications used in its own surveys to meet user needs, the institution usually has little or no influence over those used by administrative sources. Issues with classification and how these issues are dealt with should be described so that the user can decide whether the source meets their needs.

Units CCIAA-ATECO BR-ATECO # items in CCIAA data Inconsistency (2-digits) Employees

X1 13910 13910 1 0 25

X2 23701 23702 1 0 33

X3 28999 68200 1 1 0

X4 22290 22290 1 0 15

X5 32300 32300 1 0 3

X6 21200 20412 1 1 2

X7 10130 10120 1 0 5

X8 32401 32402 1 0 0

X9 28220 58290 1 1 0

X10 26301 26301 1 0 7

10 3 90

Page 242: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

242

11 Undercoverage

Description This indicator provides information on the undercoverage of the admin data. That is, units in the reference population that should be included in the admin data but are not (for whatever reason). This indicator should be calculated for each admin source and then aggregated based on the number of relevant units (weighted by turnover) in each source. (For information on the progressive nature of administrative data, see Appendix B.

How to calculate

%100 population referencein unitsrelevant of No.

dataadmin in NOTbut population referencein unitsrelevant of No.

Note. This could be calculated for each relevant publication of the statistical output, e.g. first and final publication. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example

A. Statistical output: Quarterly data on enterprises of NACE section I (Accommodation and food service activities) with 10 or more employees

B. Relevant units: Units in the statistical population

C. Relevant variables: Number of employees

D. Steps for calculation:

D1. Identify units in reference population i.e. population of enterprises of NACE section I with 10 or more employees (e.g. using Business Register).

D2. Match source A with the units in D1 by the common identification code and take the units which are in D1 but not in source A (relevant units in reference population but not in A);

D3. Calculate the indicator as follows Let A be the Social Security source.

100*)11(population reference in units relevant of No.

A sourcein NOT but population reference in units relevant of No.I

Page 243: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

243

Reference population units

Units in Source A

Relevant units in reference population but NOT in source A Turnover

Turnover in reference population

Units (A) (B) (C) (D) (A)*(D) (C)*(D)

1 X X 58,965 58,965

2 X X 25,369 25,369

3 X

1,580

4 X X 56,321 56,321

5 X X 14,587 14,587

6 X X 2,541 2,541 2,541

7 X X 5,750 5,750

8 X 5,214

9 X X 98,547 98,547

10 X X 14,000 14,000

11 X X 15,420 15,420

12 X X 23,000 23,000

13 X

18,002

14 X X 54,723 54,723 54,723

15 X X 85,471 85,471

16 X X 1,500 1,500

17 X X 2,410 2,410

18 X X 2,317 2,317

19 X X 54,710 54,710

20 X X 51,000 51,000

21 X X 52,145 52,145

22 X 2,300

23 1,025

24 X X 1,084 1,084

25 X X 2,369 2,369

26 X X 18,231 18,231

27 X X 2,201 2,201

28 X X 1,201 1,201

29 X X 58,641 58,641 58,641

Sum 24 25 3 730,624 702,503 115,905

Num = No. of relevant units in reference population but not in source A = 3

Denom = No. of relevant units in reference population = 24

I(11) = (Num/Denom)*100 = (3/24)*100 = 12.5%

Weighted by turnover:

I(11)w = (115,905/702,503)*100 = 16.5%

Page 244: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

244

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AB – Describe the extent of coverage of the administrative data and any known coverage problems

Accuracy This information is useful for assessing whether the coverage is sufficient. The population that the administrative data covers should be included along with all known coverage problems. There could be overcoverage (where duplicate records are included) or undercoverage (where certain records are missed).

AC – Describe methods used to deal with coverage issues

Accuracy Updating procedures and cleaning procedures should be described. Also, information should be provided on edit checks and /or imputations that are carried out because of coverage error. This information indicates to users the resultant robustness of the administrative data, after these procedures have been carried out to improve coverage.

AD – Assess the likely impact of coverage error on key estimates

Accuracy Coverage issues could relate to undercoverage and overcoverage (or both). Special studies can sometimes be carried out to assess the impact of undercoverage and overcoverage. An assessment of the likely impact of coverage error indicates to users how reliable the key estimates are as estimators of population values.

Page 245: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

245

12 Overcoverage

Description This indicator provides information on the overcoverage of the admin data. That is, units that are included in the admin data but should not be (e.g. are out-of-scope, outside the reference population). Note that when overcoverage is identified, quite often it can be addressed by removing these units when calculating the statistical output. However, in cases where overcoverage is identified but cannot be addressed, it is this estimate of ‘uncorrected’ overcoverage that should be provided for this indicator. This indicator should be calculated for each admin source and then aggregated based on the number of relevant units (weighted by turnover) in each source. (For information on the progressive nature of administrative data, see Appendix B.)

How to calculate

%100 population referencein units of No.

population referencein NOTbut dataadmin in units of No.

This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example

A. Statistical output: Monthly data on enterprises of NACE section I (Accommodation and food service activities) with 10 or more employees

B. Relevant units: Units in the statistical population

C. Relevant variables: Number of employees

D. Steps for calculation:

D1. Identify units in reference population i.e. population of enterprises of NACE section I with 10 or more employees (e.g. using Business Register).

D2. Match source A with the units in D1 by the common identification code and take the units which are in source A but not in D1 (units in source A but not in reference population);

D3. Calculate the indicator as follows:

I(12)= 100*.

.

populationreferenceinunitsrelevantofNo

populationreferenceinnotbutdataminadinunitsrelevantofNo

Let Source A be the Social Security source.

Page 246: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

246

Reference population units

Units in Source A

Units in source A but NOT in reference population Turnover

Turnover in reference population

Units (A) (B) (C) (D) (A)*(D) (C)*(D)

1 X X 58,965 58,965

2 X X 25,369 25,369

3 X X 1,580 1,580

4 X X 56,321 56,321

5 X X 14,587 14,587

6 X 2,541 2,541

7 X X 5,750 5,750

8 X X 5,214 5,214

9 X X 98,547 98,547

10 X X 14,000 14,000

11 X X 15,420 15,420

12 X X 23,000 23,000

13 X X 18,002 18,002

14 X 54,723 54,723

15 X X 85,471 85,471

16 X X 1,500 1,500

17 X X 2,410 2,410

18 X X 2,317 2,317

19 X X 54,710 54,710

20 X X 51,000 51,000

21 X X 52,145 52,145

22 X X 2,300 2,300

23 1,025

24 X X

1,084 1,084

25 X X 2,369 2,369

26 X X 18,231 18,231

27 X X 2,201 2,201

28 X X 1,201 1,201

29 X

58,641 58,641

Sum 24 25 4 730,624 702,503 27,096

Num = No. of units in Admin data but not in reference population = 4

Denom = No. of units in reference population = 24

I(12) = (Num/Denom)*100 = (4/24)*100 = 16.67%

Weighted by turnover:

I(12)w = (27,096/702,503)*100 = 3.86%

Page 247: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

247

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AB – Describe the extent of coverage of the administrative data and any known coverage problems

Accuracy This information is useful for assessing whether the coverage is sufficient. The population that the administrative data covers should be included along with all known coverage problems. There could be overcoverage (where duplicate records are included) or undercoverage (where certain records are missed).

AC – Describe methods used to deal with coverage issues

Accuracy Updating procedures and cleaning procedures should be described. Also, information should be provided on edit checks and /or imputations that are carried out because of coverage error. This information indicates to users the resultant robustness of the administrative data, after these procedures have been carried out to improve coverage.

AD – Assess the likely impact of coverage error on key estimates

Accuracy Coverage issues could relate to undercoverage and overcoverage (or both). Special studies can sometimes be carried out to assess the impact of undercoverage and overcoverage. An assessment of the likely impact of coverage error indicates to users how reliable the key estimates are as estimators of population values.

Page 248: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

248

13 % of units in the admin source for which reference period differs from the required reference period

Description This indicator provides information on the proportion of units that provide data for a different reporting period than the required period for the statistical output. If the periods are not those required, then some imputation is necessary, which may impact quality. This indicator should be calculated for each admin source and then aggregated based on the number of relevant units (weighted by turnover) in each source.

How to calculate

%100 dataAdmin in unitsrelevant of No.

period required from period reporting

different with dataAdmin in unitsrelevant of No.

This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Note: In some cases, 'calenderization' adjustments must be made to get the administrative data to the correct periodicity - for example, converting quarterly data to monthly data. If this is required, it may be helpful to calculate an additional indicator covering the proportion of units for which calenderization adjustments have taken place.

Example A. Statistical output: Quarterly data on enterprises in NACE section M (Professional,

scientific and technical activities)

B. Relevant units: Units in the pilot statistical population

C. Relevant variables: Number of employees

D: Steps for calculation:

D1. Identify all the units in the source with different reporting period from the required period of the statistical output (quarterly).

D2. Calculate the indicator as follows:

I(13)=

%100 data Admin in units relevant of No.

period required from period reporting different withdata Admin in units relevant of No.

Let A be the Social Security source which has a monthly reporting period for the employees variable.

Page 249: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

249

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AE – Describe the data processing known to be required on the administrative data source to address instances where the reference period differs from the required reference period.

Accuracy Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

Relevant units

Units in

source A

Units with

different

reporting period Turnover

Units (A) (B) (D) (B)*(D)

1 X X X 26,598 26,598

2 X X X 59,863 59,863

3 X X X 128,475 128,475

4 X 63,000

5 X 15,000

6 X X X 456,236 456,236

7 X 18,320

8 X 2,501

9 X X X 578,632 578,632

10 X 18,600

11 X 69,350

Sum 11 5 5 1,436,575 1,249,804

I(13)=(5/11)*100=45.4%

Weighted by turnover:

I(13)w=(1,249,804/1,436,575)*100=87%

Page 250: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

250

14 Size of revisions from the different versions of the admin data RAR – Relative Absolute Revisions

Description This indicator assesses the size of revisions from different versions of the admin data, providing information on the reliability of the data received. With this indicator it is possible to understand the impact of the different versions of admin data on the results for a certain reference period. When data is revised based on other information (e.g. survey data) this should not be included in this indicator. The indicator should be calculated for each admin source and then aggregated.

How to calculate

%100

1

1

T

t Pt

T

t PtLt

X

XX

= Latest data for X variable

= First data for X variable If only one version of the admin data is received, this indicator is not relevant. Note. This indicator should only be calculated for estimates based on the same units (not including any additional units added in a later delivery of the data). This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example A. Statistical output: Monthly manufacturing

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover

D. Steps for calculation:

D1. Identify the statistical unit (enterprise) in the first and in the second version of data coming from the same source

D2. Match the source with the units in the statistical population by the common identification code (if available) or by other methods, and take the units in common

D3. Take the non missing values (XPt) from the first data version D4. Take the non missing values (XLt) from the second data version for the same units

received in the first data version D5. Calculate the difference (absolute value) between the latest data and the first data

version for each unit D6. Sum up the differences and divide it by the sum of the absolute values of the first

data D7. Calculate the indicator as follows:

LtX

PtX

Page 251: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

251

Let A be the VAT Turnover source.

Units

Turnover 1st data version

Turnover 2nd

data version Employment

(A) (B) (P) (D)=|(B)-(A)| (E)=(D)*(P) (F)=(A)*(P)

X1 15,860 18,362 0 2,502 0 0

X2 596,321 597,523 25 1,202 30,050 14,908,025

X3 1,500,693 1,500,693 63 0 0 94,543,659

X4 276,365 276,527 12 162 1,944 3,316,380

X5 56,321 56,321 2 0 0 112,642

X6 159,632 160,523 6 891 5,346 957,792

X7 1,895,471 1,925,632 132 30,161 3,981,252 250,202,172

X8 15,630 15,630 0 0 0 0

X9 28,963 30,213 0 1,250 0 0

X10 58,741 58,967 1 226 226 58,741

X11 41,205 41,205 1 0 0 41,205

Sum 4,645,202 4,681,596 242 36,394 4,018,818 364,140,616

I(14)= (36,394/4,645,202)*100 = 0.78%

Weighted by employment:

I(14)w=(4,018,818/364,140,616)*100 =1.10%

100 *

| |

| |

I(14)

1

1

T

t

T

t

XPt

XPt XLt

Page 252: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

252

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AF - Comment on the impact of the different versions of admin data on the results

Accuracy When commenting on the size of the revisions of the different versions of the admin data, information on the impact of the revisions on the statistical product for the relevant reference period should also be explained to users.

AG – Flag any published data that are subject to revision and data that have already been revised

Accuracy This indicator alerts users to published data that may be, or have already been, revised. This will enable users to assess whether provisional data will be fit for their purposes.

AH – For ad hoc revisions, detail revisions made and provide reasons

Accuracy Where revisions occur on an ad hoc basis to published data, this may be because earlier estimates have been found to be inaccurate. Users should be clearly informed of the revisions made and why they occurred. Clarifying the reasons for revisions guards against any misinterpretation of why revisions have occurred, while at the same time making processes (including any errors that may have occurred) more transparent to users.

AP – Reference/link to detailed revisions analyses

Accessibility Users should be directed to where detailed revisions analyses are available.

Page 253: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

253

15 % of units in admin data which fail checks

Description This indicator provides information on the extent to which data fail some elements of the checks (automatic or manual) and are flagged by the NSI as suspect. This does not mean that the data are necessarily adjusted (see Indicator 16), simply that they fail one or more check(s). This checking can either be based on a model, checking against other data sources (admin or survey), internet research or through direct contact with the businesses. This indicator should be calculated for each of the key variables and aggregated based on the number of relevant units (weighted by turnover) in each source.

How to calculate

%100 checked unitsrelevant of no. Total

failed and checked dataadmin in unitsrelevant of No.

Note. If the validation is done automatically and the system does not flag or record this in some way, this should be noted. Users should state the number of checks done, and the proportion of data covered by these checks. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

A. Statistical output: Quarterly data on Transportation and storage

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover, NACE code.

D. Steps for calculation:

D1: Identify for each key variable the number of units checked in admin data. D2: Identify for each key variable the number of units in admin data that fail checks. D3: Average the proportions of units that fail checks by weighting by the numbers of

units.

I(15) = %100 checked units relevant of no. Total

failed and checked data admin in units relevant of No.

Let the source of Turnover be the VAT Turnover source. Let the source of NACE code be the Chamber of Commerce source.

Page 254: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

254

I(15) = [(3+1)/(12+10)]*100 = 18.18%

Weighted by employment: I(15)w = [(15+32)/(265+262)]*100 = 8.92%

Units Units checked Units checked

Units failing check (Y/N)=(1/0)

Units failing check (Y/N)=(1/0)

var=Turnover var=NACE code Var=Turnover

Var=NACE code Employees

(A) (B) (C) (D) (E) (A)*(E) (B)*(E) (C)*(E) (D)*(E) X1 1 1 1 0 15 15 15 15 0 X2 1 1 0 1 0 0 0 0 0 X3 1 0 0 0 3 3 0 0 0 X4 1 1 0 1 0 0 0 0 0 X5 1 1 0 0 1 1 1 0 0 X6 1 1 0 0 5 5 5 0 0 X7 1 1 0 0 14 14 14 0 0 X8 1 0 0 0 0 0 0 0 0 X9 1 1 0 0 150 150 150 0 0 X10 1 1 0 0 27 27 27 0 0 X11 1 1 0 0 18 18 18 0 0 X12 1 1 0 1 32 32 32 0 32 Sum 12 10 1 3 265 265 262 15 32

Page 255: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

255

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AI – Describe the known sources of error in administrative data

Accuracy Metadata provided by the administrative source and/or information from other reliable sources can be used to assess data errors. The magnitude of any errors (where known) that have a significant impact on the administrative data should be made available to users. This will help the user to understand how accurate the administrative data are.

AJ – Describe the data processing known to be required on the administrative data source in terms of the types of checks carried out

Accuracy Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

AK – Describe processing systems and quality control

Accuracy This informs users of the mechanisms in place to minimise processing error by ensuring accurate data transfer and processing.

AL – Describe the main sources of measurement error

Accuracy Measurement error is the error that occurs from failing to collect the true data values from respondents. Measurement error cannot usually be calculated. However, outputs should contain a definition of measurement error and a description of the main sources of the error when the admin data holder gathers the data.

AM – Describe processes employed by the admin data holder to reduce measurement error

Accuracy Describing processes to reduce measurement error indicates to users the accuracy and reliability of the measures.

AN – Describe the main sources of processing error

Accuracy Processing error is the error that occurs when processing data. It includes errors in data capture, coding, editing and tabulation of the data. It is not usually possible to calculate processing error exactly. However, outputs should be accompanied by a definition of processing error and a description of the main sources of the error.

Page 256: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

256

16 % of units for which data have been adjusted

Description This indicator provides information about the proportion of units for which the data have been adjusted (a subset of the units included in Indicator 15). These are units that are considered to be erroneous and are therefore adjusted in some way (missing data should not be included in this indicator – see Indicator 9). Any changes to the admin data before arrival with the NSI should not be considered in this indicator. This indicator should be calculated for each of the key variables and aggregated based on the number of relevant units (weighted by turnover) in each source.

How to calculate

%100 DataAdmin in unitsrelevant of No.

data adjusted with dataAdmin in the unitsrelevant of No.

This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example

A. Statistical output: Quarterly data on Transportation and storage

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover, NACE code

D. Steps for calculation:

D1: Identify for each key variable the number of units in admin data D2: Identify for each key variable the number of units in admin data that have been

adjusted D3: Average the proportions of units that have been adjusted by weighting by the

numbers of units

I(16) = %100 Data Admin in units relevant of No.

data adjusted withdata Admin the in units relevant of No.

Let the source of Turnover be the VAT Turnover source. Let the source of NACE code be the Chamber of Commerce source.

Page 257: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

257

I(16) = [(1+1)/(12+10)]*100=9.09%

Weighted by employment: I(16)w = [(15+0)/(265+262)]*100=2.85%

Units Units in admin data

Units in admin data

Units adjusted (Y/N)=(1/0)

Units adjusted (Y/N)=(1/0)

var=Turnover var=NACE code Var=Turnover Var=NACE code Employees

(A) (B) (D) (E) (F) (A)*(F) (B)*(F) (D)*(F) (E)*(F) X1 1 1 1 0 15 15 15 15 0 X2 1 1 0 1 0 0 0 0 0 X3 1 0 0 0 3 3 0 0 0 X4 1 1 0 0 0 0 0 0 0 X5 1 1 0 0 1 1 1 0 0 X6 1 1 0 0 5 5 5 0 0 X7 1 1 0 0 14 14 14 0 0 X8 1 0 0 0 0 0 0 0 0 X9 1 1 0 0 150 150 150 0 0 X10 1 1 0 0 27 27 27 0 0 X11 1 1 0 0 18 18 18 0 0 X12 1 1 0 0 32 32 32 0 0 Sum 12 10 1 1 265 265 262 15 0

Page 258: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

258

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

S - Describe any adjustments made for differences in concepts and definitions between the administrative source and the statistical output

Coherence Adjustments may be required as a result of differences in concepts and definitions between the administrative data and the requirements of the statistical product. A description of why the adjustment needed to be made and how the adjustment was made should be provided to the users so they can assess the coherence of the statistical product with other sources.

AK – Describe processing systems and quality control

Accuracy This informs users of the mechanisms in place to minimise processing error by ensuring accurate data transfer and processing.

AL – Describe the main sources of measurement error

Accuracy Measurement error is the error that occurs from failing to collect the true data values from respondents. Measurement error cannot usually be calculated. However, outputs should contain a definition of measurement error and a description of the main sources of the error when the admin data holder gathers the data.

AM – Describe processes employed by the admin data holder to reduce measurement error

Accuracy Describing processes to reduce measurement error indicates to users the accuracy and reliability of the measures.

AN – Describe the main sources of processing error

Accuracy Processing error is the error that occurs when processing data. It includes errors in data capture, coding, editing and tabulation of the data. It is not usually possible to calculate processing error exactly. However, outputs should be accompanied by a definition of processing error and a description of the main sources of the error.

AO – Describe the data processing known to be required on the administrative data source in terms of the types of edits carried out

Accuracy Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

Page 259: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

259

17 % of imputed values (items) in the admin data

Description This indicator provides information on the impact of the values imputed by the NSI. These values are imputed because data are missing (see Indicator 9) or data items are unreliable (see Indicator 16). This indicator should be calculated by variable for each admin source and then aggregated based on the contributions of the variables to the overall output.

How to calculate

%100 dataadmin in itemsrelevant of No.

dataadmin relevant in the items imputed of No.

This indicator should be weighted (e.g. by turnover or number of employees) in terms of the % contribution of the imputed values to the statistical output.

Example

A. Statistical output: Monthly data on manufacture of computer, electronic and optical products

B. Relevant units: Units in the statistical population

C. Relevant variables: Number of employees

D. Steps for calculation:

D1: For each source identify the variables which are used for the statistical output D2. For each variable in the source calculate the number of items in admin data D3. For each variable in the source identify all the units with items either missing or

present in the admin data which are afterwards imputed D4. For each variable calculate the proportion of D3 on D2 D5. Calculate the indicator for each source weighting the proportions with the items D6. Calculate the general indicator weighting the indicators of D5 for the data

% data admin in items relevant of No.

data admin relevant the in items imputed of No.100 I(17)

Let A be the Social Security source.

Page 260: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

260

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

X – Assess the likely impact of non-response/imputation on final estimates

Accuracy Non-response error may reduce the representativeness of the data. An assessment of the likely impact of non-response/imputation on final estimates allows users to gauge how reliable the key estimates are as estimators of population values. This assessment may draw upon the indicator: ‘% of imputed values (items) in the admin data’.

Y – Comment on the imputation method(s) in place within the statistical process

Accuracy The imputation method used can determine how accurate the imputed value is. Information should be provided on why the particular method(s) was chosen and when it was last reviewed.

I(17) = (2/14)*100 = 14.29%

Weighted by turnover:

I(17)w = (17,626,728/24,034,571)*100=73.34%

(A) (B) (C)=(Y/N)=1/0 in (A) (D)=(Y/N)=1/0 in (B) (E) (G)=(C)*(E) (H)=(D)*(E) X1 0 1 0 18,632 18,632 0 X2 25 1 0 658,362 658,362 0 X3 187 1 0 4,501,259 4,501,259 0 X4 0 1 0 45,236 45,236 0 X5 3 1 0 105,641 105,641 0 X6 1 1 0 26,547 26,547 0 X7 0 1 0 54,710 54,710 0 X8 missing 1 1 1 63,478 63,478 63,478 X9 22 1 0 745,862 745,862 0 X10 584 570 1 1 17,563,250 17,563,250 17,563,250 X11 2 1 0 98,654 98,654 0 X12 1 1 0 15,420 15,420 0 X13 0 1 0 41,200 41,200 0 X14 3 1 0 96,320 96,320 0 Sum 828 571 14 2 24,034,571 24,034,571 17,626,728

No. of employees Source A Units

No. of employees of Source A imputed

No. of relevant items in Source A

No. of imputed items in Source A Turnover

Page 261: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

261

Timeliness and punctuality: 18 Delay to accessing / receiving data from Admin Source

Description This indicator provides information on the proportion of the time from the end of the reference period to the publication date that is taken up waiting to receive the admin data. This is calculated as a proportion of the overall time between reference period and publication date to provide comparability across statistical outputs. This indicator should be calculated for each admin source and then aggregated.

How to calculate

%100daten publicatio toperiod reference of end thefrom Time

dataAdmin receiving toperiod reference of end thefrom Time

Note. Include only the final dataset used for the statistical output. If a continuous feed of data is received, the ‘last’ dataset used to calculate the statistical output should be used in this indicator. If more than one source is used, an average should be calculated, weighted by the sources’ contributions to the final estimate. If the admin data are received before the end of the reference period, this indicator would be 0. This indicator applies to the first publication only, not to revisions.

Example

A. Statistical output: Monthly data on the manufacture of machinery and equipment

B. Relevant units: Units in the statistical population

D: Steps for calculation:

D1. Take the units in the statistical population D2. Match each source with the units in the statistical population by the common id

code obtaining the number of common units D3. Calculate for each source the number of days from the end of the reference period

to the arrival of Admin data D4: Calculate the number of days from the end of the reference period to the

dissemination date D4. Calculate the indicator as follows:

%100date npublicatio to period reference of end the from Time

data Admin receiving to period reference of end the from TimeI(18)

Example: Let A be the VAT Turnover source. Let B be the Social Security source.

Page 262: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

262

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

G – Describe the timescale since the last update of data from the administrative source

Timeliness An indication of the timescale since the last update from administrative sources will provide the user with an indication of whether the statistical product is timely enough to meet their needs.

H – Describe the extent to which the administrative data are timely

Timeliness Provide information on how soon after their collection the statistical institution receives the administrative data. The effects of any lack of timeliness on the statistical product should be described.

I – Describe any lack of punctuality in the delivery of the administrative data source

Timeliness Give details of the time lag between the scheduled and actual delivery dates of the data. Any reasons for the delay should be documented along with their effects on the statistical product.

J – Frequency of production

Timeliness This indicates how timely the outputs are, as the frequency of publication indicates whether the outputs are up to date with respect to users’ needs.

K – Describe key user needs for timeliness of data and how these needs have been addressed

Timeliness This indicates how timely the data are for specified needs, and how timeliness has been secured, eg by reducing the time lag to a number of days rather than months for monthly releases.

I(18)Source A=(25/37)*100=67.57%

I(18)Source B=(18/37)*100=48.65%

I(18) aggregated weighting for the contributions of the source to statistical output:

I(18)agg=(1737.84/2929)*100=59.33%

Weighted by turnover:

I(18)w=(6,716,555,009/11,341,511,679)*100=59.22%

Source (A) (B) (C) (D) (E)=(B)/(C) (F)=(E)*(A) (G)=((D)*(E) A 1654 25 37 6,337,903,585 67.57% 1117.57 4,282,367,287 B 1275 18 37 5,003,608,094 48.65% 620.27 2,434,187,721 Sum 2929 43 74 11,341,511,679 1737.84 6,716,555,009

Turnover I(18) for each Source

Weighting for contributions

Weighting for turnover

Number of common units in source and Statistical population

Number of days from the end of reference period to receiving Admin data

Number of days from the end of reference period to publication date

Page 263: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

263

Comparability: 19 Discontinuity in estimate when moving from a survey-based output to an

output involving admin data

Description This indicator measures the impact on the level of the estimate when changing from a survey-based output to an output involving admin data (either entirely admin based or partly). This indicator should be calculated separately for each key estimate included in the output. This can be viewed as a one-off indicator when moving from a survey based output to an output involving admin data.

How to calculate

%100 survey fromEstimate

surveyfrom Estimate - data Admininvolving Estimate

Note. This indicator should be calculated using survey and admin data which refer to the same period.

Example

A. Statistical output: Monthly manufacturing

B. Relevant units: Units in the statistical population

C. Relevant variables: Turnover; number of employees

D. Steps for calculations:

D1. Compute the estimate of the variable(s) for the survey based output D2. Compute the estimate of the variable(s) for the admin-data based output D3. Calculate the indicator as follows:

%100 survey fromEstimate

surveyfrom Estimate - data Admininvolving EstimateI(19)

Let A be the VAT Turnover source. Let B be the Social Security source. Estimator: sample mean.

Page 264: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

264

Units (A) (B) (D) (E)

X1 56,321 75,210 1 0

X2 118,948 120,321 4 4

X3 658,362 658,362 20 22

X4 29,632 31,550 0 0

X5 85,690 102,362 3 3

X6 522,360 522,360 30 30

X7 14,520,369 14,554,320 153 155

X8 99,652 101,520 0 0

X9 369,584 369,584 8 8

X10 887,456 890,630 22 22

X11 58,630 61,230 0 0

X12 741,252 741,550 6 6

Sum 18,148,256 18,228,999 247 250

Turnover

Source A

Turnover

Survey

Employees

Source B

Employees

Survey

Estimate Turnover (Source A) = 18,148,256/12 = 1,512,355 Estimate Turnover (Survey) = 18,228,999/12 = 1,519,083

I(19) Turnover = [(1,512,355-1,519,083)/1,519,083]*100 = -0.44%

Estimate Employees (Source B) = 247/12 = 20.6

Estimate Employees (Survey) = 250/12 = 20.8

(19)Employees = [(20.6-20.8)/20.8]*100 = -0.96%

Page 265: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

265

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

L – Describe the impact of moving from a survey based output to an admin-data based output

Comparability Provide information on how the data are affected when moving from a survey based output to an admin-data based output. Comment on any reasons behind differences in the output and describe how any inconsistencies are dealt with.

M – Describe any method(s) used to deal with discontinuity issues

Comparability Where the level of the estimate is impacted when moving from a survey based output to an admin-data based output, it may be possible to address this discontinuity. In this case a description of the method(s) used to deal with the discontinuity should be provided.

N - Describe the reasons behind discontinuities when moving from survey based estimates to admin-data based estimates

Comparability Where it is not possible to address a discontinuity in estimates when moving from a survey based output to an admin-data based output, the reasons behind the discontinuity should be provided along with commentary around the impact on the level of the estimate. Any changes in terms of the advantages and limitations of the differences should also be highlighted.

Page 266: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

266

Coherence: 20 % of consistent items for common variables in more than one source32

Description This indicator provides information on consistent items for any common variables across sources (either admin or survey). Only variables directly required for the statistical output should be considered – basic information (e.g. business name and address) should be excluded. Values within a tolerance should be considered consistent – the width of this tolerance (1%, 5%, 10%, etc.) would depend on the variables and methods used in calculating the statistical output. This indicator should be calculated for each of the key variables and aggregated based on the contributions of the variables to the overall output.

How to calculate

%100 variable X for required items of no. Total

variable X for tolerance) (within items consistent of No.

Note. If only one source is available or there are no common variables, this indicator is not relevant. Please state the tolerance used. This indicator could also be weighted (e.g. by turnover or number of employees) in terms of the % contribution to the output.

Example

A. Statistical output: Monthly data on NACE divisions 22 and 23 (Manufacture of rubber and plastics products, and other non-metallic mineral products)

B. Relevant units: Units in the survey

C. Relevant variables: Turnover

D. Steps for calculation:

D1. Match each source with the survey by the common id code (if available) or by other methods

D2. Attribute a Presence(1)/Absence(0) indicator to items of the relevant variables in the survey (sum up for obtaining denominator)

D3. Attribute a value=1(0) for consistent (not) item in survey and in the source (it is considered as consistent if difference between vari and vari (survey) is less than 5%.

D4. Calculate the indicator as follows

%100 variableXfor requireditemsofno.Total

variable X for tolerance) (within items consistent of No.I(20)

E. Tolerance: Max difference=5%

32 Indicators 20 and 23 are the only indicators in Section 4.2 for which a high indicator score denotes high

quality and a low indicator score denotes low quality.

Page 267: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

267

Let A be the VAT Turnover source.

I(20) = (11/14)*100 = 78.57% Weighted by employment: I(20)W = (605/639)*100 = 94.68%

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

O – Describe the common identifiers of population units in administrative data

Coherence Different administrative sources often have different population unit identifiers. The user can utilise this information to match records from two or more sources. Where there is a common identifier, matching is generally more successful.

Q - Describe the width of the tolerance and the reasons for this

Coherence Where values within a particular tolerance are considered consistent for common variables across more than one source, the width of the tolerance should be stated, along with a brief explanation as to why this particular tolerance width was chosen.

U – Describe the record matching methods and processes used on the administrative data sources

Accuracy Record matching is when different administrative records for the same unit are matched using a common, unique identifier or key variables common to both datasets. There are many different techniques for carrying out this process. A description of the technique (e.g. automatic or clerical matching) should be provided along with a description (qualitative or quantitative) of its effectiveness.

Turnover Source A

Turnover Survey

Percentage difference between (A) and (B)

No. of items In survey

(Turnover) Consistent items (Turnover) Employees

Units (A) (B) (C)=[|(A)-(B)|/(B)]*100 (D) (E)=(C)<5%(Y/N) =1/0 (F) (G)=(D)*(F) (H)=(E)*(F) X1 100,230 101,542 1.29% 1 1 2 2 2 X2 227,630 227,630 0.00% 1 1 3 3 3 X3 8,563,230 8,563,230 0.00% 1 1 58 58 58 X4 202,310 205,841 1.72% 1 1 21 21 21 X5 563,210 596,323 5.55% 1 0 34 34 0 X6 100,287 108,695 7.74% 1 0 0 0 0 X7 128,965 145,896 11.60% 1 0 0 0 0 X8 11,685,987 11,658,987 0.23% 1 1 150 150 150 X9 16,258,965 16,267,000 0.05% 1 1 223 223 223 X10 65,232 65,232 0.00% 1 1 0 0 0 X11 169,852 169,900 0.03% 1 1 0 0 0 X12 1,232,525 1,232,525 0.00% 1 1 22 22 22 X13 2,895,478 3,000,120 3.49% 1 1 36 36 36 X14 9,658,741 10,000,258 3.42% 1 1 90 90 90 Sum 51,852,642 52,343,179 14 11 639 639 605

Page 268: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

268

oSS

S

U+U

U

oSS

S

U+U

U

21 % of relevant units in admin data which have to be adjusted to create statistical units

Description This indicator provides information on the proportion of units that have to be adjusted in order to create statistical units. For example, the proportion of data at enterprise group level which therefore need to be split to provide reporting unit data.

How to calculate

Relevant units in the reference population that are adjusted to the statistical concepts by the use of statistical methods

Relevant units in the reference population that correspond to the statistical concepts This indicator should be weighted (e.g. by turnover or number of employees) in terms of the % contribution of these units to the statistical output. Note: Frequently, administrative units must be aggregated into 'composite units' before being disaggregated into statistical units. If this is required, it may be helpful to calculate an additional indicator covering the proportion of administrative units which can be successfully matched, or 'aligned', with composite units.

Example

A. Statistical output: Quarterly manufacturing

B. Relevant units: Kind of Activity Units (KAU) of the statistical population

C. Relevant variables: Turnover

D. Steps for calculation:

D1. Identify the units in the admin data which need to be adjusted in order to obtain the relevant statistical unit

D2. Identify the relevant units in the admin data that correspond to the statistical concepts

D3. Divide D1 by (D1+D2)

I(21)=

Where Us= Relevant units in the reference population that are adjusted to the statistical concepts by the use of statistical methods U0S= Relevant units in the reference population that correspond to the statistical concepts

Let A be the VAT Turnover source. The source provides data about the whole enterprise and not about the KAU (Kind of activity unit).

S U

oSU

Page 269: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

269

Let us suppose that we can assign a proportion of the turnover of the enterprise to the KAU basing the adjustment on supplementary information.

Units in the admin data (Enterprises)

Turnover enterprise (admin data)

Relevant units (KAU) obtained splitting data of enterprises Turnover KAU

Relevant units which correspond to statistical concepts (1/0)=(Y/N)

Relevant units which need to be adjusted to correspond to statistical concepts (1/0)=(Y/N) Employees

(A) (B) (C) (D) (E) (F) (G) (H)=(E)*(G) (I)=(F)*(G)

X1 85,632 K1 85,632 1 0 0 0 0

X2 1,587,463 K2 804,859 0 1 22 0 22

K3 523,641 60 0 0

K4 258,963 3 0 0

X3 8,954,122 K5 4,526,850 0 1 63 0 63

K6 1,896,540 16 0 0

K7 1,425,107 20 0 0

K8 1,105,625 12 0 0

X4 74,321 K9 74,321 1 0 1 1 0

X5 90,000 K10 90,000 1 0 2 2 0

X6 158,693 K11 96,500 0 1 0 0 0

K12 62,193 0 0 0

X7 487,520 K13 358,410 0 1 3 0 3

K14 129,110 1 0 0

X8 500,210 K15 452,100 0 1 15 0 15

K16 48,110 0 0 0

X9 4,582,310 K17 2,258,965 0 1 45 0 45

K18 1,152,362 27 0 0

K19 896,500 32 0 0

K20 274,483 7 0 0

X10 85,412 K21 85,412 1 0 0 0 0

X11 954,850 K22 754,820 0 1 5 0 5

K23 200,030 2 0 0

X12 100,236 K24 100,236 1 0 0 0 0

Sum 17,660,769 17,660,769 5 7 336 3 153

I(21) = [7/(5+7)]*100 = 58.33% Weighted by employees: I(21)w = [153/(3+153)]*100 = 96.84%

Page 270: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

270

Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

D – Describe constraints on the availability of administrative data at the required level of detail

Relevance Some administrative microdata have restricted availability or may only be available at aggregate level. Describe any restrictions on the level of data available and their effects on the statistical product.

R - Describe differences in concepts, definitions and classifications between the administrative source and the statistical output

Coherence

There may be differences in concepts, definitions and classifications between the administrative source and statistical product. Concepts include the population, units, domains, variables and time reference and the definitions of these concepts may vary between the administrative data and the statistical product. Time reference problems occur when the statistical institution requires data from a certain time period but can only obtain them for another. Any effects on the statistical product need to be made clear along with any techniques used to remedy the problem.

S - Describe any adjustments made for differences in concepts and definitions between the administrative source and the statistical output

Coherence

Adjustments may be required as a result of differences in concepts and definitions between the administrative data and the requirements of the statistical product. A description of why the adjustment needed to be made and how the adjustment was made should be provided to the users so they can assess the coherence of the statistical product with other sources.

Page 271: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

271

Cost and efficiency:

22 Cost of converting admin data to statistical data

Description This indicator provides information on the estimated cost (in person hours) of converting admin data to statistical data. It can be considered in two ways: either as a one-off indicator to identify the set-up costs of moving from survey data to administrative data (as such it should include set-up costs, monitoring of data sources, negotiating with data providers, etc.), or as a regular indicator to identify the ongoing running costs of the system that converts the administrative data to statistical data (which should include costs of technical processing, monitoring of the data, ongoing liaison with data providers, etc.). The indicator should be calculated for each admin source and then aggregated based on the contribution of the admin source to the statistical output.

How to calculate

(Estimated) Cost of conversion in person hours

Note. This should only be calculated for parts of the admin data relevant to the statistical output.

Example

A. Statistical output: Quarterly data on the manufacture of electrical equipment

B. Relevant units: Kind of Activity Units (KAU) of the statistical population

C. Relevant variables: Number of employees.

D. Steps for calculation:

D1.Identify the time in person hours necessary to convert the admin data in order to obtain statistical data as a function of admin source size and complexity in the treatment of admin data.

Let c1=number of records in admin data. Let c2=estimated average number of minutes necessary to split the total number of employees of enterprise on each KAU, basing the attribution on supplementary information. I(22)=Cost of conversion in person hours = f(#of records in admin data, estimated average number of minutes necessary to process one admin unit) I(22) = c1*c2 = 1550*20min = 31,000 min = 517 person hours. Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AT - Describe the processes required for converting admin data to statistical data and comment on any potential issues

Cost and efficiency

Data processing may be required to convert the admin data to statistical data. This indicator should provide the user with information on the types of processes and techniques used, along with a description of any issues that are dealt with, or perhaps still remain.

Page 272: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

272

23 Efficiency gain in using admin data33

Description This indicator provides information on the efficiency gain in using admin data rather than simply using survey data. For example, collecting admin data is usually cheaper than collecting data through a survey but this benefit might be offset by higher processing costs. This indicator should consider the total estimated costs of producing the output when using survey data (potentially a few years ago if the move was gradual) and then compare this to the total estimated costs of producing the output when using admin data or a combination of both. Production cost should include all costs the NSI is able to attribute to the production of the statistical output. (For example, this may include the cost of the use of computers and electrical equipment, staff costs, cost of data processing, cost of results dissemination, etc.) This can be viewed as a one-off indicator when moving from a survey based output to an output involving admin data.

How to calculate

%100statistic basedsurvey ofcost Production

statistic basedadmin ofcost production - statistic basedsurvey ofcost Production

Note. Estimated costs are acceptable.

This indicator is likely to be calculated once, when making the change from survey to admin data.

Example

A. Statistical output: Quarterly data on the manufacture of machinery and equipment

B. Relevant units: Units in the statistical population

D. Steps for calculation:

D1. Quantify costs of survey based statistic (total cost of the survey including questionnaires, mailing, recalling, staff etc.)

D2. Quantify cost of statistic when based on admin data (cost of admin source acquisition, processing costs, staff etc.)

%100 statisticbased survey of cost Production

statisticbased admin of cost production - statisticbased survey of cost ProductionI(23)

33 Indicators 20 and 23 are the only indicators in Section 3.2 for which a high indicator score denotes high

quality and a low indicator score denotes low quality.

Page 273: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

273

I(23) = [(89,086-46,000)/89,086]*100 = 48.36% Related qualitative indicators:

Qualitative indicators

Quality Theme

Description

AT - Describe the processes required for converting admin data to statistical data and comment on any potential issues

Cost and efficiency

Data processing may be required to convert the admin data to statistical data. This indicator should provide the user with information on the types of processes and techniques used, along with a description of any issues that are dealt with, or perhaps still remain.

Cost of survey based statistics = f(c1,c2,c3,c4,c5,c6,c7,c8,n)

where:

c1=cost of survey planning € 18,000

c2=cost of use of computers and electrical equipment € 786

c3=cost of each questionnaire € 3

c4=cost of mailing for each questionnaire € 3

c5=cost of staff employed in survey € 30,000

c6=cost of telephone calls for survey requirements € 250

c7=cost of possible website ad hoc for the survey € 2,300

c8=cost of results dissemination € 6,000

c9=cost of data processing € 750

c10=other costs € 10,000

n=number of questionnaires 3,500

Cost of survey based statistic = c1+c2+c5+c6+c7+c8+c9+c10+n*(c3+c4)

= 18,000+786+30,000+250+2,300+6,000+750+10,000+3,500*(3+3) = € 89,086

Cost of admin based statistics = f(c1,c2,c3,c4,c5,c6,c7)

c1=cost of planning € 5,000

c2=cost of admin source € 10,000

c3=cost of use of computers and electrical equipment € 2,000

c4=cost of staff employed € 7,000

c5=cost of data processing € 15,000 c6=cost of results dissemination € 6,000 c7=other costs € 1,000

Cost of admin based statistics = C1+c2+c3+c4+c5+c6+c7

€ 46,000 = 5,000+10,000+2,000+7,000+15,000+6,000+1,000 =

Page 274: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

274

References Daas, P.J.H., Ossen, S.J.L. & Tennekes, M. (2010). Determination of administrative data quality: recent results and new developments. Paper and presentation for the European Conference on Quality in Official Statistics 2010. Helsinki, Finland. Eurostat, (2003). Item 6: Quality assessment of administrative data for statistical purposes. Luxembourg, Working group on assessment of quality in statistics, Eurostat. Frost, J.M., Green, S., Pereira, H., Rodrigues, S., Chumbau, A. & Mendes, J. (2010). Development of quality indicators for business statistics involving administrative data. Paper presented at the Q2010 European Conference on Quality in Official Statistics. Helsinki, Finland. Ossen, S.J.L., Daas, P.J.H. & Tennekes, M. (2011). Overall Assessment of the Quality of Administrative Data Sources. Paper accompanying the poster at the 58th Session of the International Statistical Institute. Dublin, Ireland.

European Commission, Eurostat, (2007). Handbook on Data Quality Assessment Methods and Tools.

Page 275: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

275

Appendix A: List of Qualitative Indicators by Theme

Relevance

Qualitative indicator Description Related

quantitative indicator(s)

A - Name each admin source used as an input into the statistical product and describe the primary purpose of the data collection for each source

Name all administrative sources and their providers. This information will assist users in assessing whether the statistical product is relevant for their intended use. Providing information on the primary purpose of data collection also enables users to assess whether the data are relevant to their needs.

1, 2

B – Describe the main uses of each of the admin sources and, where possible, how the data relate to the needs of users

Include all the main statistical processes and/or outputs known to require data from the administrative source This indicator should also capture how well the data support users’ needs. This information can be gathered from user satisfaction surveys and feedback.

1, 2

C – Describe the extent to which the data from the administrative source meet statistical requirements

Statistical requirements of the output should be outlined and the extent to which the administrative source meets these requirements stated. Gaps between the administrative data and statistical requirements can have an effect on the relevance to the user. Any gaps and reasons for the lack of completeness should be described, for example if certain areas of the target population are missed or if certain variables that would be useful are not collected. Any methods used to fill the gaps should be stated.

3

D – Describe constraints on the availability of administrative data at the required level of detail

Some administrative microdata have restricted availability or may only be available at aggregate level. Describe any restrictions on the level of data available and their effects on the statistical product.

3, 21

E – Describe reasons for use of admin data as a proxy

Where admin data are used in the statistical output as a proxy or are used in calculations rather than as raw data, information should be provided in terms of why the admin data have been used as a proxy for the required variables.

3

F - Identify known gaps between key user needs, in terms of coverage and detail, and current data

Data are complete when they meet user needs in terms of coverage and detail. This indicator allows users to assess, when there are gaps, how relevant the outputs are to their needs.

N/A

Page 276: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

276

Timeliness and punctuality

Qualitative indicator Description Related quantitative indicator(s)

G – Describe the timescale since the last update of data from the administrative source

An indication of the timescale since the last update from administrative sources will provide the user with an indication of whether the statistical product is timely enough to meet their needs.

4, 18

H – Describe the extent to which the administrative data are timely

Provide information on how soon after their collection the statistical institution receives the administrative data. The effects of any lack of timeliness on the statistical product should be described.

4, 18

I – Describe any lack of punctuality in the delivery of the administrative data source

Give details of the time lag between the scheduled and actual delivery dates of the data. Any reasons for the delay should be documented along with their effects on the statistical product.

4, 18

J – Frequency of production

This indicates how timely the outputs are, as the frequency of publication indicates whether the outputs are up to date with respect to users’ needs.

18

K – Describe key user needs for timeliness of data and how these needs have been addressed

This indicates how timely the data are for specified needs, and how timeliness has been secured, eg by reducing the time lag to a number of days rather than months for monthly releases.

18

Comparability

Qualitative indicator Description Related quantitative indicator(s)

L – Describe the impact of moving from a survey based output to an admin-data based output

Provide information on how the data are affected when moving from a survey based output to an admin-data based output. Comment on any reasons behind differences in the output and describe how any inconsistencies are dealt with.

8, 19

M – Describe any method(s) used to deal with discontinuity issues

Where the level of the estimate is impacted when moving from a survey based output to an admin-data based output, it may be possible to address this discontinuity. In this case a description of the method(s) used to deal with the discontinuity should be provided.

19

N - Describe the reasons behind discontinuities when moving from survey based estimates to admin-data based estimates

Where it is not possible to address a discontinuity in estimates when moving from a survey based output to an admin-data based output, the reasons behind the discontinuity should be provided along with commentary around the impact on the level of the estimate. Any changes in terms of the advantages and limitations of the differences should also be highlighted.

19

Page 277: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

277

Coherence

Qualitative indicator Description Related quantitative indicator(s)

O – Describe the common identifiers of population units in administrative data

Different administrative sources often have different population unit identifiers. The user can utilise this information to match records from two or more sources. Where there is a common identifier, matching is generally more successful.

5, 6, 20

P – Provide a statement of the nationally/internationally agreed definitions, classifications and standards used

This is an indicator of clarity, in that users are informed of concepts and classifications used in compiling the output. It also indicates geographical comparability where the agreed definitions and standards are used.

N/A

Q - Describe the width of the tolerance and the reasons for this

Where values within a particular tolerance are considered consistent for common variables across more than one source, the width of the tolerance should be stated, along with a brief explanation as to why this particular tolerance width was chosen.

20

R - Describe differences in concepts, definitions and classifications between the administrative source and the statistical output

There may be differences in concepts, definitions and classifications between the administrative source and statistical product. Concepts include the population, units, domains, variables and time reference and the definitions of these concepts may vary between the administrative data and the statistical product. Time reference problems occur when the statistical institution requires data from a certain time period but can only obtain them for another. Any effects on the statistical product need to be made clear along with any techniques used to remedy the problem.

10, 21

S - Describe any adjustments made for differences in concepts and definitions between the administrative source and the statistical output

Adjustments may be required as a result of differences in concepts and definitions between the administrative data and the requirements of the statistical product. A description of why the adjustment needed to be made and how the adjustment was made should be provided to the users so they can assess the coherence of the statistical product with other sources.

16, 21

T - Compare estimates with other estimates on the same theme

This statement advises users whether estimates from other sources on the same theme are coherent (ie they ‘tell the same story’), even where they are produced in different ways. Any known reasons for lack of coherence should be given.

N/A

Page 278: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

278

Accuracy

Qualitative indicator Description Related quantitative indicator(s)

U – Describe the record matching methods and processes used on the administrative data sources

Record matching is when different administrative records for the same unit are matched using a common, unique identifier or key variables common to both datasets. There are many different techniques for carrying out this process. A description of the technique (e.g. automatic or clerical matching) should be provided along with a description (qualitative or quantitative) of its effectiveness.

5, 6, 20

V – Describe the data processing known to be required on the administrative data source to deal with non-response

Data processing is often required to deal with non-response. The user should be made aware of how and why particular data processing methods are used.

9

W – Describe differences between responders and non-responders

This indicates to users how significant the non-response bias is likely to be. Where response is high, non-response bias is likely to be less of a problem than when there are high rates of non-response. NB: There may be instances where non-response bias is high even with very high response rates, if there are large differences between responders and non-responders.

9

X – Assess the likely impact of non-response/imputation on final estimates

Non-response error may reduce the representativeness of the data. An assessment of the likely impact of non-response/imputation on final estimates allows users to gauge how reliable the key estimates are as estimators of population values. This assessment may draw upon the indicator: ‘% of imputed values (items) in the admin data’.

9, 17

Y – Comment on the imputation method(s) in place within the statistical process

The imputation method used can determine how accurate the imputed value is. Information should be provided on why the particular method(s) was chosen and when it was last reviewed.

17

Z – Describe how the misclassification rate is determined

It is often difficult to calculate the misclassification rate.Therefore, where this is possible, a description of how the rate has been calculated should also be provided.

10

AA – Describe any issues with classification and how these issues are dealt with

Whereas a statistical institution can decide upon and adjust the classifications used in its own surveys to meet user needs, the institution usually has little or no influence over those used by administrative sources. Issues with classification and how these issues are dealt with should be described so that the user can decide whether the source meets their needs.

10

AB – Describe the extent of coverage of the administrative data and any known coverage problems

This information is useful for assessing whether the coverage is sufficient. The population that the administrative data covers should be included along with all known coverage problems. There could be overcoverage (where duplicate records are included) or undercoverage (where certain records are missed).

11, 12

Page 279: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

279

Qualitative indicator Description Related quantitative indicator(s)

AC – Describe methods used to deal with coverage issues

Updating procedures and cleaning procedures should be described. Also, information should be provided on edit checks and /or imputations that are carried out because of coverage error. This information indicates to users the resultant robustness of the administrative data, after these procedures have been carried out to improve coverage.

11, 12

AD – Assess the likely impact of coverage error on key estimates

Coverage issues could relate to undercoverage and overcoverage (or both). Special studies can sometimes be carried out to assess the impact of undercoverage and overcoverage. An assessment of the likely impact of coverage error indicates to users how reliable the key estimates are as estimators of population values.

11, 12

AE – Describe the data processing known to be required on the administrative data source to address instances where the reference period differs from the required reference period

Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

13

AF - Comment on the impact of the different versions of admin data on the results

When commenting on the size of the revisions of the different versions of the admin data, information on the impact of the revisions on the statistical product for the relevant reference period should also be explained to users.

14

AG – Flag any published data that are subject to revision and data that have already been revised

This indicator alerts users to published data that may be, or have already been, revised. This will enable users to assess whether provisional data will be fit for their purposes.

14

AH – For ad hoc revisions, detail revisions made and provide reasons

Where revisions occur on an ad hoc basis to published data, this may be because earlier estimates have been found to be inaccurate. Users should be clearly informed of the revisions made and why they occurred. Clarifying the reasons for revisions guards against any misinterpretation of why revisions have occurred, while at the same time making processes (including any errors that may have occurred) more transparent to users.

14

AI – Describe the known sources of error in administrative data

Metadata provided by the administrative source and/or information from other reliable sources can be used to assess data errors. The magnitude of any errors (where known) that have a significant impact on the administrative data should be made available to users. This will help the user to understand how accurate the administrative data are.

15

AJ – Describe the data processing known to be required on the administrative data source in terms of the types of checks carried out

Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

15

Page 280: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

280

Qualitative indicator Description Related quantitative indicator(s)

AK – Describe processing systems and quality control

This informs users of the mechanisms in place to minimise processing error by ensuring accurate data transfer and processing.

15, 16

AL – Describe the main sources of measurement error

Measurement error is the error that occurs from failing to collect the true data values from respondents. Measurement error cannot usually be calculated. However, outputs should contain a definition of measurement error and a description of the main sources of the error when the admin data holder gathers the data.

15, 16

AM – Describe processes employed by the admin data holder to reduce measurement error

Describing processes to reduce measurement error indicates to users the accuracy and reliability of the measures.

15, 16

AN – Describe the main sources of processing error

Processing error is the error that occurs when processing data. It includes errors in data capture, coding, editing and tabulation of the data. It is not usually possible to calculate processing error exactly. However, outputs should be accompanied by a definition of processing error and a description of the main sources of the error.

15, 16

AO – Describe the data processing known to be required on the administrative data source in terms of the types of edits carried out

Data processing may sometimes be required to check or improve the quality of the administrative data or create new variables to be used for statistical purposes. The user should be made aware of how and why data processing is used.

16

Accessibility and clarity

Qualitative indicator Description Related quantitative indicator(s)

AP – Reference/link to detailed revisions analyses

Where published data have been revised, users should be directed to where detailed revisions analyses are available.

14

Page 281: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

281

Cost and efficiency

Qualitative indicator Description Related quantitative indicator(s)

AQ – Describe reasons for significant overlap in admin data and survey data collection for some items

Where items are not obtained exclusively from admin data, reasons for the overlap between admin data and survey data should be described.

2

AR – Comment on the types of items that are being obtained by the admin source as well as the survey

If the same data items are collected from both the admin source and the survey, this can lead to duplication when combining the sources. This indicator should highlight to users the variables that are being collected across both sources.

7

AS – If items are purposely collected by both the admin source and the survey, describe the reason for this duplication (eg validity checks)

In some instances it may be beneficial to collect the same variables from both the admin source and the survey, such as to validate the micro level data. The reason for the double collection should be described to users.

7

AT - Describe the processes required for converting admin data to statistical data and comment on any potential issues

Data processing may be required to convert the admin data to statistical data. This indicator should provide the user with information on the types of processes and techniques used, along with a description of any issues that are dealt with, or perhaps still remain.

22, 23

Page 282: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

282

Appendix B: Notation for quality indicators

Administrative datasets are often ‘progressive’ – data for a given reference period can differ when measured at different time-points. This can present challenges when specifying and implementing quality indicators. This appendix outlines notation which may be helpful in specifying these kinds of problems and presents some possible solutions. For a more extensive treatment of the concept and a prediction framework for progressive data, see Zhang (2013). Notation to help deal with progressive nature of admin data It is important to be able to distinguish reference period – the time-point of interest - from measurement periods – the time-points at which we measure the time-point of interest. The following notation is suggested:

U(a ; b | c) – the population at time-period ‘a’ measured at time-period ‘b’ according to data source ‘c’

yi(a ; b | c) – value of interest for unit ‘i’ in U(a; b | c) So, for example, U(t ; t+α | Fiscal Register) refers to the population according to the ‘Fiscal Register’ administrative source for time-point ‘t’ measured ‘α’ periods after ‘t’. A characteristic of many administrative datasets is that the value for a given reference period depends on the measurement period: this can be referred to as progressiveness. This means that, for a lag ‘α’, both the number of units in U(t ; t+ α | c) and their total of any variable of interest will keep evolving over time, until α =∞ in principle. This characteristic is often true of business registers as well as administrative datasets, particularly when business registers are maintained using administrative sources. Implication for the implementation of the quality indicators When calculating quality indicators, results from an early version of the administrative data may produce very different results from a later version. The decision as to which version of an administrative dataset to use is therefore important and should be documented when the quality indicators are reported. The notation above may be useful in making and reporting this decision. Several quality indicators call for comparison with the business register. In this case, the choice of which version of the business register to use is equally important. Choice of datasets The version of the administrative data used in the estimation is usually the best one to use. Frequently, this will be a dataset for the correct reference period. Where the reference period of the administrative data differs from the statistical reference period – for example, where employment statistics for February use administrative data with a reference period of January - it may be informative to calculate an alternative set of indicators using the administrative data with the correct reference period. In our example, the ‘correct’ reference period would be February. This can help identify the quality impact of using administrative data with an incorrect reference period.

Page 283: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

283

In terms of the choice of business register, it may be preferable to use the most up-to-date version of the business register with the correct reference period. However, it may happen that the business register is updated using the administrative source under evaluation. In such cases, it may be preferable to use an earlier vintage of the business register, before this updating has taken place, but retaining the reference period of interest. It should be noted that this choice may be limited by practical constraints regarding what versions of the business register are stored. Concluding Remarks In general, it is important to consider the impact of the progressiveness of both administrative data and business registers and to record which versions are used in the calculation of the quality indicators. The notation set out above may be helpful when doing so. Reference Zhang, L-C. (2013). Towards VAT register-based monthly turnover statistics. Development report available on request.

Page 284: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

284

Appendix C: Glossary34

Term Definition

15. administrative data

The data derived from an administrative source, before any processing or validation by the NSIs.

16. administrative source

A data holding containing information collected and maintained for the purpose of implementing one or more administrative regulations.

In a wider sense, any data source containing information that is not primarily collected for statistical purposes.

17. common units

Within a process of matching, those units that are identified in more than one source.

18. consistent items

Within a process of matching, the values of a variable, referring to the same unit, that are logically and/or numerically coherent across different sources.

According to the level of accuracy required, values can be considered consistent even within a certain tolerance.

19. item A ‘value’ for a variable for a specific unit.

20. key variables Within the ESSnet Admin Data, this term is used to refer to the statistical variables that are most important and have the largest impact on a statistical output (e.g. turnover, number of employees, wages and salaries, etc.).

21. reference population

The set of units about which information is wanted and estimates are required.

22. relevant units

Businesses that are within the scope of the statistical output (e.g. units from the services sector should be excluded from manufacturing statistics).

23. relevant items

‘Values’ for units on relevant variables that should be included in calculating the statistical output.

24. required period

The reporting period used within the statistical output.

25. required variables

Variables necessary to calculate the statistical output.

26. statistical output

A statistic produced by the NSI – whether based on a specific variable (e.g. no. of employees) or a set of related variables (e.g. total turnover; domestic market turnover; external market turnover). In the broadest sense, statistical output would also apply to the whole STS or SBS output.

34 Work Package 1 (WP1) of the ESSnet AdminData has developed an ‘Admin Data Glossary’. To access the

glossary, please follow this link: http://essnet.admindata.eu/Glossary/List

Page 285: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

285

27. unit Refers to statistical units – enterprise, legal unit, local unit, etc.

28. weighted A number of the quality indicators described in this document can be calculated in unweighted or weighted versions. Formulae are given for the unweighted versions of the indicators. Weighting can be beneficial as the weighted indicator will often better describe the quality of the statistical output. For example, the unweighted item non-response will inform users what proportion of valid units did not respond for a particular variable, whereas the weighted item non-response will estimate the proportion of the output variable affected by non-response. A non-response rate of 30% is of less concern if those 30% of units only cover 1% of the output variable. In practice, we do not know the values of the output variable for non-responders, so we use a related variable instead. Business register variables such as Turnover or Employment are often used as proxies.

The weighted indicators are calculated as follows:

Page 286: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

286

Annex 2: Glossary35

Term Definition

29. administrative data The data derived from an administrative source, before any processing or validation by the NSIs.

30. administrative source A data holding containing information collected and maintained for the purpose of implementing one or more administrative regulations.

In a wider sense, any data source containing information that is not primarily collected for statistical purposes.

31. common units Within a process of matching, those units that are identified in more than one source.

32. consistent items Within a process of matching, the values of a variable, referring to the same unit, that are logically and/or numerically coherent across different sources.

According to the level of accuracy required, values can be considered consistent even within a certain tolerance.

33. item A ‘value’ for a variable for a specific unit.

34. key variables Within the ESSnet Admin Data, this term is used to refer to the statistical variables that are most important and have the largest impact on a statistical output (e.g. turnover, number of employees, wages and salaries, etc.).

35. reference population The set of units about which information is wanted and estimates are required.

36. relevant units Businesses that are within the scope of the statistical output (e.g. units from the services sector should be excluded from manufacturing statistics).

37. relevant items ‘Values’ for units on relevant variables that should be included in calculating the statistical output.

35 Work Package 1 (WP1) of the ESSnet AdminData has developed an ‘Admin Data Glossary’. To access the glossary, please follow this link: http://essnet.admindata.eu/Glossary/List

Page 287: ESSNET - European Commission 2011_Deliverable… · John-Mark Frost*, Emma Newman*, Ceri Lewis*, Daniel Lewis*, Joep Burger ... business statistics involving administrative data.

287

38. required period The reporting period used within the statistical output.

39. required variables Variables necessary to calculate the statistical output.

40. statistical output A statistic produced by the NSI – whether based on a specific variable (e.g. no. of employees) or a set of related variables (e.g. total turnover; domestic market turnover; external market turnover). In the broadest sense, statistical output would also apply to the whole STS or SBS output.

41. unit Refers to statistical units – enterprise, legal unit, local unit, etc.

42. weighted A number of the quality indicators described in this document can be calculated in unweighted or weighted versions. Formulae are given for the unweighted versions of the indicators. Weighting can be beneficial as the weighted indicator will often better describe the quality of the statistical output. For example, the unweighted item non-response will inform users what proportion of valid units did not respond for a particular variable, whereas the weighted item non-response will estimate the proportion of the output variable affected by non-response. A non-response rate of 30% is of less concern if those 30% of units only cover 1% of the output variable. In practice, we do not know the values of the output variable for non-responders, so we use a related variable instead. Business register variables such as Turnover or Employment are often used as proxies.

The weighted indicators are calculated as follows: