
Commission européenne, 2920 Luxembourg, LUXEMBOURG - Tel. +352 43011 Office: BECH A3/122 - Tel. direct line +352 4301-35285 - Fax +352 4301-31092 http://epp.eurostat.ec.europa.eu [email protected]

EUROPEAN COMMISSION EUROSTAT Directorate B: Corporate statistical and IT services Unit B-3: IT and standards for data and metadata exchange

SISAI - STATISTICAL INFORMATION SYSTEMS ARCHITECTURE AND INTEGRATION

WORKING GROUP

3rd MEETING

13-14 MAY 2013

BECH BUILDING

ROOM AMPÈRE LUXEMBOURG

ITEM 2.7

WORKING DOCUMENT – Pending further analysis and improvements


European Commission – Eurostat/B1, Eurostat/E1, Eurostat/E6

WORKING DOCUMENT – Pending further analysis and improvements

Based on deliverable 2.4 Contract No. 40107.2011.001-2011.567

‘VIP on data validation general approach’

2.4 - Exhaustive and detailed typology of validation rules – v 0.1304

April 2013


Project: ESS.VIP.BUS Common data validation policy

Document: 2.4 - Exhaustive and detailed typology of validation rules

(3rd_main)

Version: 0.1304

April 2013 Page ii

Document Service Data

Type of Document Deliverable

Reference: 2-4 EXHAUSTIVE AND DETAILED TYPOLOGY OF VALIDATION RULES

Version: 0.1304 Status: Draft

Created by: Angel SIMÓN Date: 23.04.2013

Distribution: European Commission – Eurostat/B1, Eurostat/E1, Eurostat/E6

For Internal Use Only

Reviewed by: Angel SIMÓN

Approved by:

Remark: Pending further analysis and improvements

Document Change Record

| Version | Date | Change |
|---|---|---|
| 0.1304 | 23.04.2013 | Initial release based on deliverable from contractor AGILIS |

Contact Information

EUROSTAT

Ángel SIMÓN

Unit E-6: Transport statistics

BECH B4/334

Tel.: +352 4301 36285

Email: [email protected]


Table of contents

1 Introduction

2 Validation rules

2.1 File structure
    2.1.1 Filename check
    2.1.2 File type check
    2.1.3 Allowed character checks
    2.1.4 Format check

2.2 Checks within and between datasets
    2.2.1 Type Check
    2.2.2 Length Check
    2.2.3 Presence Check
    2.2.4 Allowed character checks
    2.2.5 Uniqueness Check
    2.2.6 Referential integrity
    2.2.7 Code List Check
    2.2.8 Consistency checks
    2.2.9 Cardinality checks
    2.2.10 Mirror checks
    2.2.11 Range Check
    2.2.12 Control Check
    2.2.13 Conditional Checks
    2.2.14 Time series checks
    2.2.15 Revised data integrity Check
    2.2.16 Model-based Consistency Check


1 Introduction

The aim of this document is to present, as exhaustively as possible, all the validation rules currently applied to the data received by Eurostat. It develops the document ‘Typology of data validation rules and imputation methods’ completed in task 1 of the project (deliverable 1.4 Typology of data validation rules and imputation methods).

2 Validation rules

The validation rules can be split into two categories:

a. Checks on the file structure: consistency and reasonability tests applied by the data manager before the data are integrated into the database system. Consistency tests verify that file naming conventions, data formats, field names and file structure follow project conventions. Discrepancies are reported to the measurement investigator for remediation.

b. Intra-dataset and inter-dataset checks: these take place after the data have been assembled in the database [1] and form the first step in data analysis. Tests in this category cover measurement assumptions, comparisons of measurements and internal consistency.

A failed validation produces one of two types of error [2]:

Fatal: the data is rejected;

Warning: the record can be accepted, with some corrections or explanations from the data provider.

The presentation of the rules is structured as follows: a short description of each rule type is followed by examples of its application in several domains. The examples are drawn from the inventory of validation rules (deliverables 1.5 – 1.6 of the project).

2.1 File structure

2.1.1 Filename check

Checks that the filename is consistent with the file naming conventions, which are based on predefined rules, for example CENSUS_2011_LU_SEX. This validation also implicitly checks that the filename length is consistent with the conventions agreed for each domain; for example, Windows imposes a maximum length of 260 characters on the path plus filename. The tables below list the filename check rules applied in the different domains:

[1] Intra-dataset checks take place before the data is assembled in the database as well. Some basic checks – consistency, integrity – can be performed on the incoming file.

[2] Eurostat unit B3 proposes three levels of error with progressively increasing impact on the quality of input data: warning, error and fatal error. Further developments will take this into account depending on the needs. Moreover, there are ideas about error weights to be added up over a whole file and the report to contain these kinds of 'measurements' or 'indicators'.

2.1.1.1 Road freight transport statistics

Type of error: Fatal

| Table | File naming convention | File name | File name length (characters) |
|---|---|---|---|
| A2 | Country Code (2 characters) + Year (2 digits) + Quarter Code (2 characters) + ’ROAD’ + Table Name | LU09Q3ROADA2.dat | 16 |
| A1 | Country Code (2 characters) + Year (2 digits) + Quarter Code (2 characters) + ’ROAD’ + Table Name | IT07Q2ROADA1.dat | 16 |
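A filename check of this kind can be sketched with a regular expression. The pattern below is only an illustration built from the road freight convention above; the function name `check_filename`, the restriction of table names to A1–A3 and the fixed `.dat` extension are assumptions, not part of the source rules.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern derived from the convention above:
# country (2 letters) + year (2 digits) + quarter code (Q1-Q4) + 'ROAD' + table name.
# Assumption: table names are A1, A2 or A3 and the extension is '.dat'.
ROAD_FILENAME = re.compile(
    r"^(?P<country>[A-Z]{2})(?P<year>\d{2})(?P<quarter>Q[1-4])ROAD(?P<table>A[1-3])\.dat$"
)


def check_filename(name: str) -> Optional[Dict[str, str]]:
    """Return the parsed filename parts, or None if the name breaks the convention."""
    match = ROAD_FILENAME.match(name)
    return match.groupdict() if match else None
```

Because the pattern is anchored at both ends, it also enforces the 16-character length implicitly. The parsed parts can be reused later, for instance to compare the filename year with the year stored in the records.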

2.1.2 File type check

Checks the type of the data file. This validation matters because both sender and receiver rely on the compatibility and integrity of the data file; for example, a system may require input data in csv format.

2.1.2.1 Road freight transport statistics

Type of error: Fatal

| Table | File | File format |
|---|---|---|
| A1 | All data files referring to table A1 | DAT format (a generic "data" file) or ZIP format (used for data file compression) |
| A2 | All data files referring to table A2 | DAT format (a generic "data" file) or ZIP format (used for data file compression) |

2.1.3 Allowed character checks

Checks that only the expected characters are present as field or record separators. For example, a csv file may allow only a comma as field separator. The tables below list the fields to which allowed character check rules are applied in the different domains:

2.1.3.1 Farm structure survey statistics

Type of error: Fatal

| Table | Field | Valid character checks |
|---|---|---|
| Any table | Any data field | A plus sign ‘+’ used as field separator |
| Any table | Any data record | A plus sign ‘+’ is used as record separator followed by a line feed character |

2.1.3.2 Rail transport statistics

Type of error: Fatal

| Table | Field | Valid character checks |
|---|---|---|
| A1 – A9 | Any data field | A semicolon ‘;’ used as field separator |
| A1 – A9 | Any data record | A semicolon ‘;’ used as field separator |
| Any table | Any data record | A plus sign ‘+’ is used as record separator followed by a line feed character |

2.1.4 Format check

Checks that the data is in a specified format (template); for example, each record must contain ten fields. The table below lists the format check rules applied in the different domains:

2.1.4.1 Rail transport statistics

Type of error: Fatal

| Table | Field | Format check |
|---|---|---|
| A1 – A9 | Any data record | Each record must include 18 fields |
| A1 – A9 | Any data file corresponding to tables A1 – A9 | Each file must include the correct field names, in the specified order |
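The two rail transport format rules can be illustrated with a short sketch. The function names and the idea of passing the expected header as a list are assumptions made for the example; only the figure of 18 fields and the semicolon separator come from the tables above.

```python
from typing import List

EXPECTED_FIELD_COUNT = 18  # from the rail transport rule above


def check_record_format(line: str, separator: str = ";",
                        expected: int = EXPECTED_FIELD_COUNT) -> bool:
    """True when the record splits into exactly the expected number of fields."""
    return len(line.rstrip("\n").split(separator)) == expected


def check_header(header_line: str, expected_names: List[str],
                 separator: str = ";") -> bool:
    """True when the header holds the expected field names in the expected order."""
    return header_line.rstrip("\n").split(separator) == expected_names
```

A record failing `check_record_format` would be reported as a fatal error under the rule above, since the remaining field-level checks cannot be aligned to the right columns.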

2.2 Checks within and between datasets

2.2.1 Type Check

A type check ensures that the correct type of data is entered into a field. If the data type is set to number, only numbers can be entered (e.g. 10, 12, 14), which prevents anyone from entering text such as ‘ten’ or ‘twelve’. The table below lists the fields to which type check rules are applied in the different domains:

2.2.1.1 Road freight transport statistics

Type of error: Fatal

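| Table | Field | Valid Type |
|---|---|---|
| A1 | Rcount | Text |
| A1 | A1 | Text |
| A1 | Year | Text |
| A1 | Quarter | Text |
| A1 | QuestN | Text |
| A1 | A1.1 | Text |
| A1 | A1.3 | Number |
| A1 | A1.6 | Text |
| A1 | A1.8.1 | Number |
| A1 | A1.8.2 | Number |
| A1 | A1.9 | Number |
| A1 | Stratum | Text |
| A1 | A2link | Text |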

2.2.2 Length Check

Sometimes a set of data always has the same number of characters. For example, if alpha-2 codes are adopted for countries, a length check can ensure that exactly 2 characters are entered into the field. This type of validation cannot check that the 2 characters are correct, but it can ensure that 1 or 4 characters are not entered. A length check can also be set up to accept any number of characters within a certain range. The table below lists the fields to which length check rules are applied in the different domains:

2.2.2.1 Road freight transport statistics

Type of error: Fatal

| Table | Field | Valid Length (in characters/digits) |
|---|---|---|
| A1 | Rcount | 2 |
| A1 | A1 | 2 |
| A1 | Year | 4 |
| A1 | Quarter | 2 |
| A1 | QuestN | 9 |
| A1 | A1.1 | 1 |
| A1 | A1.3 | 2 |
| A1 | A1.6 | 5 |
| A1 | A1.8.1 | 5 |
| A1 | A1.8.2 | 4 |
| A1 | A1.9 | 8 |
| A1 | Stratum | 7 |
| A1 | A2link | 5 |
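Type and length checks are often applied together on the raw text of each field. The sketch below uses a subset of the A1 rules from the two tables above; it assumes that the listed length is a maximum (the source does not say whether it is exact) and that a "Number" is anything Python's `float` can parse. The names `A1_RULES` and `check_type_and_length` are illustrative.

```python
# Subset of the A1 rules: field -> (type, length).
# Assumption: the length is treated as a maximum, not an exact length.
A1_RULES = {
    "Rcount": ("text", 2),
    "Year": ("text", 4),
    "Quarter": ("text", 2),
    "A1.3": ("number", 2),
    "A1.8.1": ("number", 5),
}


def check_type_and_length(field, value):
    """Return a list of error messages for one raw field value (empty list = valid)."""
    kind, max_len = A1_RULES[field]
    errors = []
    if kind == "number":
        try:
            float(value)
        except ValueError:
            errors.append(f"{field}: '{value}' is not a number")
    if len(value) > max_len:
        errors.append(f"{field}: length {len(value)} exceeds {max_len}")
    return errors
```

For instance, the value `ten` in field A1.3 fails both checks, while `LUX` in Rcount fails only the length check.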

2.2.3 Presence Check

Checks that important data are actually present and have not been left out; for example, the survey year is mandatory in road freight transport data files. The check does not ensure that a field is filled in correctly. The table below lists the fields to which presence check rules are applied in the different domains:

2.2.3.1 Road freight transport statistics

Type of error: Fatal

| Table | Field | Mandatory Presence |
|---|---|---|
| A1 | Year | Yes |
| A1 | Quarter | Yes |
| A1 | QuestN | Yes |
| A1 | A1.9 | Yes |
| A1 | Stratum | Yes |
| A1 | A1.3 | Yes |

2.2.4 Allowed character checks

Checks that only the expected characters are present in a field. For example, a numeric field may allow only the digits 0-9, the decimal point and perhaps a minus sign or commas. The tables below list the fields to which allowed character check rules are applied in the different domains:


2.2.4.1 Road freight transport statistics

Type of error: Fatal

| Table | Field | Valid character checks |
|---|---|---|
| A1 | A1.9 | A comma as decimal separator instead of full stop |

2.2.4.2 Farm structure survey statistics

Type of error: Fatal

| Table | Field | Valid character checks |
|---|---|---|
| Any table | Any numerical field | A full stop as decimal separator instead of comma |
| Any table | Any data field | The character ‘:’ is used for non available data |

2.2.5 Uniqueness Check

Uniqueness checks are integrity rules that check that each value in specific fields is unique. They can be applied to a combination of several fields (e.g. Country, Year, Type of transport). This type of validation detects duplicate values in certain combinations of fields, which may have been created by mistake during the data import process. The table below lists the field combinations to which uniqueness check rules are applied in the different domains:

2.2.5.1 Road freight transport statistics

Type of error: Fatal

| Table | Table key (fields combination) | Unique |
|---|---|---|
| A1 | Rcount + Year + Quarter + QuestN | Yes |
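As an illustration, duplicates on a composite key can be found by counting key tuples. The sketch assumes records are held as dictionaries keyed by field name; `find_duplicate_keys` is a hypothetical helper, while the key fields are those of table A1 above.

```python
from collections import Counter

A1_KEY = ("Rcount", "Year", "Quarter", "QuestN")  # table key from the example above


def find_duplicate_keys(records, key_fields=A1_KEY):
    """Return the key tuples that occur more than once in the records."""
    counts = Counter(tuple(rec[f] for f in key_fields) for rec in records)
    return [key for key, n in counts.items() if n > 1]


records = [
    {"Rcount": "LU", "Year": "2009", "Quarter": "Q3", "QuestN": "000000001"},
    {"Rcount": "LU", "Year": "2009", "Quarter": "Q3", "QuestN": "000000001"},
    {"Rcount": "LU", "Year": "2009", "Quarter": "Q3", "QuestN": "000000002"},
]
duplicates = find_duplicate_keys(records)
```

Here `duplicates` holds the single key that appears twice; an empty list means the uniqueness rule is satisfied.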

2.2.6 Referential integrity

Referential integrity is a data quality concept. Data quality is a common concern of information systems; the first line of defense for data quality is a series of human controls. Once data have been entered into the database, computer-based controls are used to eliminate problems that reduce data quality. Referential integrity is a computer-based control that ensures that relationships between tables remain consistent. When one table has a foreign key to another table, referential integrity means that a record may not be added to the table containing the foreign key unless there is a corresponding record in the linked table. A violation would be, for example, a Journey record in the Journey file with no corresponding key in the Vehicle file. The tables below list the referential integrity rules applied in the different domains:


2.2.6.1 Road freight transport statistics

Type of error: Fatal

| Table1 | Table2 | Foreign key |
|---|---|---|
| A1 | A2 | Rcount + Year + Quarter + QuestN |
| A2 | A3 | Rcount + Year + Quarter + QuestN + JourN |

2.2.6.2 Inland Waterways transport statistics

Type of error: Fatal

| Table1 | Table2 | Foreign key |
|---|---|---|
| A1 | B1 | Reporting Country + Year + Type of Transport |
| A1 | C1 | Reporting Country + Year + Type of Transport |
| B1 | D1 | Reporting Country + Year + Type of Transport |
| C1 | D2 | Reporting Country + Year |
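A referential integrity check amounts to looking for "orphan" records, i.e. child records whose foreign key has no match in the parent table. The following sketch assumes in-memory lists of dictionaries and uses a hypothetical helper name; in a database the same rule would normally be declared as a foreign key constraint.

```python
def find_orphans(child_records, parent_records, key_fields):
    """Return child records whose foreign key has no match in the parent table."""
    parent_keys = {tuple(p[f] for f in key_fields) for p in parent_records}
    return [c for c in child_records
            if tuple(c[f] for f in key_fields) not in parent_keys]


# Illustrative data: A1 (vehicle) records as parents, A2 (journey) records as children.
KEY = ("Rcount", "Year", "Quarter", "QuestN")
a1 = [{"Rcount": "LU", "Year": "2009", "Quarter": "Q3", "QuestN": "1"}]
a2 = [
    {"Rcount": "LU", "Year": "2009", "Quarter": "Q3", "QuestN": "1", "JourN": "1"},
    {"Rcount": "LU", "Year": "2009", "Quarter": "Q3", "QuestN": "2", "JourN": "1"},
]
orphans = find_orphans(a2, a1, KEY)
```

The second A2 record refers to questionnaire 2, which has no A1 record, so it is the only orphan returned.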

2.2.7 Code List Check

A code list check (table look-up) takes the entered data item and compares it to a list of valid entries stored in a database table. The tables below list the fields to which code list check rules are applied in the different domains:

2.2.7.1 Road freight transport statistics

Type of error: Fatal

| Table | Field | Valid List of entries |
|---|---|---|
| A1 | Rcount | All country codes in table COUNTRY where Year equals the reference Year |
| A1 | A1 | A1 |
| A1 | Quarter | Q1, Q2, Q3, Q4 |
| A1 | A1.6 | All NACE codes where Year equals the reference Year |
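In code, a code list check is a membership test against the set of valid entries. The sketch below hard-codes the Quarter list from the table above; in practice the list would be read from a reference table, and the helper name `invalid_codes` is only illustrative.

```python
VALID_QUARTERS = {"Q1", "Q2", "Q3", "Q4"}  # from the Quarter row above


def invalid_codes(values, code_list):
    """Return the values that are not in the code list, preserving input order."""
    return [v for v in values if v not in code_list]
```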


2.2.7.2 Maritime transport statistics

Type of error: Warning

| Dataset | Field | Valid List of entries |
|---|---|---|
| A1 | Reporting port | All available codes for survey year |
| A1 | National partner port | |
| A1 | Non sea partner countries | |
| F1 | Size of vessel | |
| F1 | Type of vessel | |
| B1 | Type of cargo | |

2.2.7.3 External Trade statistics

Type of error: Warning

| Dataset | Field | Valid List of entries |
|---|---|---|
| INTRASTAT | Commodity Code against partner country | All available codes for survey year |
| INTRASTAT | Country of origin | Valid ISO country codes |
| INTRASTAT | Region of origin/Destination | All acceptable codes |
| INTRASTAT | Nature of transaction | All acceptable codes |

2.2.8 Consistency checks

Checks that the data in related fields correspond. For example, if the file naming convention includes the country code, as in ‘LU09Q3ROADA2.dat’, then the reporting country code in the dataset should be Country = "LU". This validation applies not only to categorical fields but also to numerical fields, e.g. V13310 < 1.18 * V16130. The tables below list the fields to which consistency check rules are applied in the different domains:

2.2.8.1 Road freight transport statistics

Type of error: Fatal

| Table | Field1 | Field2 | Consistency Rule |
|---|---|---|---|
| A1 | Year | Filename Year | Field1 = Field2 |
| A1 | Quarter | Filename Quarter | Field1 = Field2 |
| A1 | A1.8.1 | A2.6 | Field1 = SUM(A2.6) for the A2 linked records |
| A1 | A1.8.2 | A2.6 | Field1 >= SUM(A2.6) for the A2 linked records |

2.2.8.2 Structural Business Statistics

Type of error: Fatal or Warning

| Table | Variable | Consistency Rule |
|---|---|---|
| Series 1A | V12150/V12120 | 0.85 < V12150/V12120 < 1.15 |
| Series 1A | V13110/V12120 | 0.85 < V13110/V12120 < 1.15 |
| Series 1A | V12110/V16110 | 0.85 < V12110/V16110 < 1.18 |
| Series 1A | V12150/V16110 | 0.82 < V12150/V16110 < 1.22 |
| Series 1A | V13310/V16130 | 0.85 < V13310/V16130 < 1.18 |
| Series 1A | V13310/V12120 | 0.85 < V13310/V12120 < 1.15 |
| Series 1A | V13320/V13310 | 0.85 < V13320/V13310 < 1.15 |
| Series 2A | V12150/V12120 | 0.85 < V12150/V12120 < 1.15 |
| Series 2A | V13110/V12120 | 0.85 < V13110/V12120 < 1.15 |
| Series 2A | V12110/V16110 | 0.85 < V12110/V16110 < 1.18 |
| Series 2A | V12150/V16110 | 0.85 < V12150/V16110 < 1.22 |
| Series 2A | V13310/V16130 | 0.85 < V13310/V16130 < 1.18 |
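Ratio rules like these lend themselves to a table-driven check. The sketch below takes two of the Series 1A rules as data; the function names are illustrative, and treating a zero denominator as a failure is an assumption, since the source does not say how an undefined ratio should be handled.

```python
def check_ratio(numerator, denominator, lower, upper):
    """True when lower < numerator/denominator < upper; False if the ratio is undefined."""
    if denominator == 0:
        return False  # assumption: an undefined ratio counts as a failed check
    return lower < numerator / denominator < upper


# Two of the Series 1A rules above: (numerator, denominator, lower bound, upper bound).
SBS_RULES = [
    ("V12150", "V12120", 0.85, 1.15),
    ("V12110", "V16110", 0.85, 1.18),
]


def failed_rules(record):
    """Return the 'numerator/denominator' labels of the rules the record violates."""
    return [f"{n}/{d}" for n, d, lo, hi in SBS_RULES
            if not check_ratio(record[n], record[d], lo, hi)]


record = {"V12150": 100, "V12120": 100, "V12110": 130, "V16110": 100}
violations = failed_rules(record)
```

In this example the first ratio is 1.0 and passes, while the second is 1.3 and falls outside the (0.85, 1.18) interval.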

2.2.8.3 Rail transport statistics

Type of error: Warning

| Table | Variable | Consistency Rule |
|---|---|---|
| C1, E2 | C1-11 – E2-12 | 0.05 <= C1-11 – E2-12 <= 0.2 |
| C3, E2 | C3-12 – E2-09 | 0.05 <= C3-12 – E2-09 <= 0.2 |

Type of error: Fatal

| Table | Variable | Consistency Rule |
|---|---|---|
| C1, E2 | C1-11 – E2-12 | C1-11 – E2-12 > 0.2 |
| C3, E2 | C3-12 – E2-09 | C3-12 – E2-09 > 0.2 |


2.2.8.4 Farm structure survey

Type of error: Fatal or Warning

| Table | Variable | Consistency Rule |
|---|---|---|
| Table2 | A_3_2_4_1 – C_2 | A_3_2_4_1 – C_2 <= 0 |
| Table2 | A_3_2_4_2 – C_4 | A_3_2_4_2 – C_4 <= 0 |
| Table2 | A_3_2_4_4 – C_5 | A_3_2_4_4 – C_5 <= 0 |
| Table2 | B_6_2_1 – A_3_1 | B_6_2_1 – A_3_1 <= 0 |
| Table2 | B_5_2 – B_5_2_1 | B_5_2 – B_5_2_1 >= 0 |

2.2.9 Cardinality checks

Checks that a record has a valid number of related records. For example, in an imaginary census with household data and personal data, if the household record states that three persons live in the household, there must be three associated records in the personal data for that household (cardinality = 3). The table below lists the fields to which cardinality check rules are applied in the different domains:

2.2.9.1 Road freight transport statistics

Type of error: Fatal

| Table | Field1 | Field2 | Cardinality Check |
|---|---|---|---|
| A1 | A1.8.1 | A2link | If Field1 = 0 then Field2 = 0 |
| A1 | A1.8.1 | A2link | If Field1 <> 0 then Field2 <> 0 |
| A1 | A1.8.2 | A2link | If Field1 = 0 then Field2 = 0 |
| A2 | A2.1 | A3link | If Field1 = 4 then Field2 = 0 |
| A2 | A2.1 | A3link | If Field1 <> 4 then Field2 <> 0 |
| A2 | A2.1 | A2.2 | If Field1 = 4 then Field2 = 0 |
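The imaginary census example from the description can be sketched as follows. The field names `household_id` and `n_persons` and the helper name are hypothetical; the point is only to compare a declared count with the number of related records actually present.

```python
from collections import Counter


def cardinality_errors(households, persons):
    """Return ids of households whose declared size differs from the person record count."""
    counts = Counter(p["household_id"] for p in persons)
    return [h["household_id"] for h in households
            if counts[h["household_id"]] != h["n_persons"]]


households = [
    {"household_id": "H1", "n_persons": 3},
    {"household_id": "H2", "n_persons": 2},
]
persons = [
    {"household_id": "H1"}, {"household_id": "H1"}, {"household_id": "H1"},
    {"household_id": "H2"},
]
mismatches = cardinality_errors(households, persons)
```

Household H1 declares three persons and has three person records, so it passes; H2 declares two but has only one, so it is reported.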


2.2.10 Mirror checks

These quality checks compare two partner declarations for consistency. Depending on the data category under consideration, mirror validation may entail the reconciliation of stocks and transactions data, or the analysis of differences with partner data or preshipment inspection data. The tables below list the fields to which mirror check rules are applied in the different domains:

2.2.10.1 Road freight transport statistics

Type of error: Warning

| Table | Field | Mirror field1 | Mirror field2 | Mirror Check |
|---|---|---|---|---|
| Table A2 | A2.2 (Weight of goods) | A2.3 (Place of loading (for a laden journey): either country code or full region code with country) | A2.4 (Place of unloading (for a laden journey): either country code or full region code with country) | A2.2[A2.3] = A2.2[A2.4] + Round_Err |
| Table A2 | A2.2 (Weight of goods) | A2.8 (Place of loading of the goods road motor vehicle on another means of transport) | A2.9 (Place of unloading of the goods road motor vehicle on another means of transport) | A2.2[A2.8] = A2.2[A2.9] + Round_Err |
| Table A3 | A3.2 (Weight of goods) | A3.5 (Place of loading (for a laden journey): either country code or full region code with country) | A3.6 (Place of unloading (for a laden journey): either country code or full region code with country) | A3.2[A3.5] = A3.2[A3.6] + Round_Err |

2.2.10.2 Rail transport statistics

Type of error: Warning

| Table | Field | Mirror field1 | Mirror field2 | Mirror Check |
|---|---|---|---|---|
| Table A1 | A1.6 (Weight of goods) | A1.5 (Place of loading (for a laden journey): outward international transport, A1.5=3) | A1.5 (Place of unloading (for a laden journey): inward international transport, A1.5=4) | A1.6[A1.5=3] = A1.6[A1.5=4] + Round_Err |


2.2.10.3 Air transport statistics

Type of error: Warning

| Table | Field | Mirror field1 | Mirror field2 | Mirror Check |
|---|---|---|---|---|
| Table A1 | Passengers | Total Passengers on board at Departure (Reporting country) | Total Passengers on board at Arrival (Partner country) | Passengers [Mirror field1] = Passengers [Mirror field2] + Deviation |
| Table B1 | Passengers | Total Passengers carried at Departure (Reporting country) | Total Passengers carried at Arrival (Partner country) | Passengers [Mirror field1] = Passengers [Mirror field2] + Deviation |
| Table A1 | Freight and mail | Total tons of freight and mail on board at Departure (Reporting country) | Total tons of freight and mail on board at Arrival (Partner country) | Passengers [Mirror field1] = Passengers [Mirror field2] + Deviation |
| Table B1 | Freight and mail | Total tons of freight and mail carried at Departure (Reporting country) | Total tons of freight and mail carried at Arrival (Partner country) | Passengers [Mirror field1] = Passengers [Mirror field2] + Deviation |

2.2.10.4 Inland waterways transport statistics

Type of error: Warning

| Table | Field | Mirror field1 | Mirror field2 | Mirror Check |
|---|---|---|---|---|
| Table A1 | Weight of goods | Place of loading: either country code or full region code with country | Place of unloading: either country code or full region code with country | Weight of goods [Mirror field1] = Weight of goods [Mirror field2] + Round_Err |

2.2.10.5 Maritime transport statistics

Type of error: Warning

| Table | Field | Mirror field1 | Mirror field2 | Mirror Check |
|---|---|---|---|---|
| Table D1 | Passengers | Total Passengers embarked at Departure (Reporting country) | Total Passengers disembarked at Arrival (Partner country) | Passengers [Mirror field1] = Passengers [Mirror field2] + Deviation |
| Table A1 | Freight | Total tons of freight on board at Departure (Reporting country) | Total tons of freight on board at Arrival (Partner country) | Passengers [Mirror field1] = Passengers [Mirror field2] + Deviation |

2.2.10.6 External Trade

Type of error: Warning

The classic example of mirror checks comes from INTRASTAT.

Member States (MSs) report monthly the arrivals of goods from other MSs and the dispatches of goods to other MSs. Therefore, for each combination of [Reference month, Type of goods, Dispatching MS, Receiving MS] there are two data items: one for the dispatch, declared by the dispatching country as reporting country, and one for the arrival, declared by the receiving country as reporting country. In principle, the statistical values and quantities in the two items must be equal.

This is not always the case, because reporting thresholds differ between MSs and because the recording dates of a shipment can straddle two months. The principle nevertheless holds and is used in actual validation.
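The INTRASTAT principle can be sketched as a comparison of two dictionaries keyed by the combination described above. The 5 % relative tolerance, the function name and the sample keys are assumptions chosen for illustration; the source states only that the two declarations should be equal in principle.

```python
def mirror_discrepancies(dispatches, arrivals, tolerance=0.05):
    """Compare mirror declarations.

    Both arguments map (month, goods, dispatching_ms, receiving_ms) to a value.
    Returns the keys where one side is missing or where the relative difference
    exceeds the tolerance (assumed here to be 5 %).
    """
    flagged = []
    for key in sorted(set(dispatches) | set(arrivals)):
        d, a = dispatches.get(key), arrivals.get(key)
        if d is None or a is None:
            flagged.append(key)  # only one partner declared this flow
            continue
        reference = max(abs(d), abs(a))
        if reference > 0 and abs(d - a) / reference > tolerance:
            flagged.append(key)
    return flagged


dispatches = {("2013-01", "0101", "LU", "BE"): 100.0,
              ("2013-01", "0102", "LU", "BE"): 100.0}
arrivals = {("2013-01", "0101", "LU", "BE"): 103.0,
            ("2013-01", "0102", "LU", "BE"): 150.0}
flagged = mirror_discrepancies(dispatches, arrivals)
```

The first flow differs by about 3 % and is accepted; the second differs by a third and is flagged, which under the rule above would produce a warning rather than a rejection.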

2.2.11 Range Check

Checks that the data lie within a specified range of values; for example, the month of a person's date of birth should lie between 1 and 12. The check may also apply a single limit, upper or lower; for example, data should not be greater than 2 (<=2). The table below lists the fields to which range check rules are applied in the different domains:

2.2.11.1 Road freight transport statistics

Type of error: Fatal or Warning

| Table | Field | Valid Range (Minimum, Maximum) |
|---|---|---|
| A1 | A1.3 | (0,30) |
| A1 | A1.9 | (0,99999.9999) |
| A2 | A1.4 | (5,700) |
| A2 | A1.5 | (3,400) |
| A1 | Year | >1998 |
| A1 | A1.8.1 | <10000 |
| A2 | A1.5 | <= 1200 |
| A2 | A1.5 | <= 0.7 * A1.4 |
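A range check with optional bounds might look like the sketch below. It assumes the (Minimum, Maximum) pairs are inclusive, which the source does not state, and expresses the strict rule Year > 1998 as a minimum of 1999 for whole-number years; `in_range` is an illustrative name.

```python
def in_range(value, minimum=None, maximum=None):
    """True when value respects whichever bounds are given (bounds assumed inclusive)."""
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True
```

For example, `in_range(12, 0, 30)` covers the A1.3 rule, and `in_range(year, minimum=1999)` covers the one-sided rule on Year.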


2.2.12 Control Check

A control check computes a total over one or more numeric fields that appear in every record. The field holding this total is called the control totals key figure field. The total is meaningful, e.g. the total payment for a number of customers. Control totals are used to verify the integrity of the contents of the data. The tables below list the fields to which control check rules are applied in the different domains:

2.2.12.1 Farm structure survey

Type of error: Warning

| Table | Key fields | Control Check |
|---|---|---|
| Table2 | C_5_1, C_5_2, C_5_3 | C_5 = C_5_1 + C_5_2 + C_5_3 + Round_Err |
| Table2 | C_4_1, C_4_2, C_4_99 | C_4 = C_4_1 + C_4_2 + C_4_99 + Round_Err |
| Table2 | C_3_2_1, C_3_2_99 | C_3_2 = C_3_2_1 + C_3_2_99 + Round_Err |
| Table2 | C_2_1, C_2_2, C_2_3, C_2_4, C_2_5, C_2_6, C_2_99 | C_2 = C_2_1 + C_2_2 + C_2_3 + C_2_4 + C_2_5 + C_2_6 + C_2_99 + Round_Err |

2.2.12.2 Agricultural statistics

Type of error: Warning

| Table | Key fields | Control Check |
|---|---|---|
| Data | Bullocks, Bulls, Cows, Heifers, Calves | Bovines = Bullocks + Bulls + Cows + Heifers + Calves + Round_Err |
| Data | Bullocks, Bulls, Cows, Heifers | Adult cattle = Bullocks + Bulls + Cows + Heifers + Round_Err |
| Data | Sheep, Goats | Sheep = Sheep + Goats + Round_Err |
| Data | Slaughtering, Exports of live animals, Imports of live animals | Gross Indigenous production = Slaughtering + Exports of live animals - Imports of live animals + Round_Err |


2.2.12.3 Migration statistics

Type of error: Warning

Table     Key fields                                                    Control Check
IMM1CTZ   Sex (Total, Male, Female)                                     Total = Male + Female
IMM7CTB   Citizenship (TOTAL, NATIONALS, NON-NATIONALS, UNK_GR)         TOTAL = NATIONALS + NON-NATIONALS + UNK_GR
IMM6CTZ   Country of birth (TOTAL, NATIVE-BORN, FOREIGN-BORN, UNK_GR)   TOTAL = NATIVE-BORN + FOREIGN-BORN + UNK_GR
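The control checks of section 2.2.12 all compare a reported total with the sum of its components. A minimal Python sketch is given below. It assumes that Round_Err can be modelled as a bounded rounding tolerance; the function name `control_total_ok` and the `tolerance` parameter are hypothetical and do not come from the rules themselves.

```python
from typing import Sequence


def control_total_ok(total: float,
                     components: Sequence[float],
                     tolerance: float = 0.0) -> bool:
    """Check that `total` equals the sum of `components`.

    `tolerance` stands in for Round_Err (assumption): the largest absolute
    difference that is still attributed to rounding.
    """
    return abs(total - sum(components)) <= tolerance


# Migration statistics rule: Total = Male + Female (no rounding term).
sex_ok = control_total_ok(100, [48, 52])

# Farm structure survey style rule with a rounding allowance of 0.5.
c5_ok = control_total_ok(10.0, [3.3, 3.3, 3.3], tolerance=0.5)
```

A failed check would be reported as a Warning, in line with the error type stated for these domains.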

2.2.13 Conditional Checks

Conditional checks perform different checks depending on whether a pre-specified condition evaluates to true or false. The table below lists the conditions and the conditional check validation rules applied in the distinct domains:

2.2.13.1 Road freight transport statistics

Type of error: Fatal or Warning

Table   Condition Type   Condition         Conditional Check
A2      Format check     A1.2 like ‘1XX’   A1.5 <= 0.7*A1.4
A2      Format check     A1.2 like ‘2XX’   A1.5 <= 0.8*A1.4
A2      Format check     A1.2 like ‘3XX’   A1.5 <= 0.85*A1.4
A2      Limit check      A2.1=1            A2.2<=A1.5
A2      Limit check      A2.1=2            A2.2<=A1.5
A2      Limit check      A2.1=3            A2.2<=A1.5
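The format-conditioned rules in the table above can be sketched in Python as follows. The sketch assumes that "like ‘1XX’" means a three-character code whose first character is 1, and it represents a record as a plain dictionary keyed by field name; both choices, and the names `MAX_RATIO` and `conditional_check`, are illustrative assumptions.

```python
# Maximum allowed ratio A1.5 / A1.4, keyed by the leading character of A1.2
# (values taken from the table; the lookup structure is an assumption).
MAX_RATIO = {"1": 0.7, "2": 0.8, "3": 0.85}


def conditional_check(record: dict) -> bool:
    """Apply the A1.5 <= ratio * A1.4 rule selected by the format of A1.2.

    Returns True when no condition matches, because no check applies then.
    """
    code = str(record["A1.2"])
    if len(code) != 3 or code[0] not in MAX_RATIO:
        return True
    return record["A1.5"] <= MAX_RATIO[code[0]] * record["A1.4"]


passes = conditional_check({"A1.2": "120", "A1.4": 100, "A1.5": 60})
fails = conditional_check({"A1.2": "310", "A1.4": 100, "A1.5": 90})
```

Keeping the thresholds in a lookup table rather than in branches makes it straightforward to add further code patterns.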

2.2.14 Time series checks

Time series checks are implemented to detect a suspicious evolution of the data over time. They can be associated with outlier detection. A second type of time series check takes the seasonality of the data into account. The tables below list the time series validation rules applied in the distinct domains:

2.2.14.1 Maritime transport statistics

Type of error: Warning

Table   Field                                                             Indicator         Valid Range (Minimum, Maximum)
A1      Gross weight of goods                                             A1(t) / A1(t-1)   (Low limit, High Limit)
C1      Number of TEU's and number of units for cargo type 5, 6 and X     C1(t) / C1(t-1)   (Low limit, High Limit)
D1      Passengers excluding cruise passengers                            D1(t) / D1(t-1)   (Low limit, High Limit)
F2      Gross tonnage and number of vessels                               F2(t) / F2(t-1)   (Low limit, High Limit)

2.2.14.2 Inland Waterways transport statistics

Type of error: Warning

Table   Field                                                           Indicator         Valid Range (Minimum, Maximum)
A1      Tonnes by type of transport                                     A1(t) / A1(t-1)   (Low limit, High Limit)
A1      Tonnes by type of transport and type of goods                   A1(t) / A1(t-1)   (Low limit, High Limit)
A2      Tonnes by type of transport                                     A2(t) / A2(t-1)   (Low limit, High Limit)
B1      Tonnes by type of vessel                                        B1(t) / B1(t-1)   (Low limit, High Limit)
B1      Tonnes by nationality of vessel                                 B1(t) / B1(t-1)   (Low limit, High Limit)
B2      Movements of vessels by type of transport and loading status    B2(t) / B2(t-1)   (Low limit, High Limit)

2.2.14.3 Structural Business Statistics

Type of error: Fatal

Table       Variable               Valid Range (Minimum, Maximum)
Series 1A   V11110(t)/V11110(t-1)  (0.82, 1.22)
Series 1A   V12110(t)/V12110(t-1)  (0.82, 1.22)
Series 1A   V12120(t)/V12120(t-1)  (0.82, 1.22)
Series 1A   V12150(t)/V12150(t-1)  (0.77, 1.30)
Series 1A   V13110(t)/V13110(t-1)  (0.82, 1.22)
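The period-on-period ratio rules above can be illustrated with a short Python sketch. The function name `ratio_in_limits`, the inclusive treatment of the limits and the decision to flag a zero previous value are assumptions made for the example; the rules themselves only give the ratio and its valid range.

```python
def ratio_in_limits(current: float, previous: float,
                    low: float, high: float) -> bool:
    """Check that current / previous lies within [low, high].

    A zero previous value makes the ratio undefined; the sketch reports it
    as a failure so that the record is reviewed (assumption).
    """
    if previous == 0:
        return False
    return low <= current / previous <= high


# Structural Business Statistics rule: V11110(t)/V11110(t-1) in (0.82, 1.22).
stable = ratio_in_limits(110.0, 100.0, low=0.82, high=1.22)
jump = ratio_in_limits(130.0, 100.0, low=0.82, high=1.22)
```

The same function covers the maritime and inland waterways rules once their low and high limits are supplied.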


2.2.14.4 Air transport statistics

Type of error: Fatal or Warning

Table   Condition Type   Condition                                Conditional Check
A1      Range check      10000 <= Passengers < 100000             Passengers(t)/Passengers(t-1) >= 0.6
A1      Range check      10000 <= Passengers < 100000             Passengers(t)/Passengers(t-1) <= 1.4
A1      Range check      50 ton <= Freight transport < 1500 ton   Freight transport(t)/Freight transport(t-1) <= 2
A1      Range check      100 <= Flights < 1200                    Flights(t)/Flights(t-1) <= 1.7
A1      Range check      100 <= Flights < 1200                    Flights(t)/Flights(t-1) >= 0.3

2.2.14.5 External Trade

Type of error: Warning

Table       Variable                        Valid Range (Minimum, Maximum)³   Model specifies Valid Range
INTRASTAT   Statistical Value               (Low limit, Higher Limit)         MAD
INTRASTAT   Invoice Value                   (Low limit, Higher Limit)         MAD
INTRASTAT   Quantity (supplementary unit)   (Low limit, Higher Limit)         MAD
INTRASTAT   Total value declared            (Low limit, Higher Limit)         MAD
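How limits might be derived by a MAD routine can be sketched as follows. The document does not expand the acronym; the sketch assumes that MAD stands for median absolute deviation and that the limits are placed `k` deviations on either side of the median. The function name `mad_limits` and the value of `k` are hypothetical and are not taken from the INTRASTAT rules.

```python
from statistics import median
from typing import Sequence, Tuple


def mad_limits(values: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (low, high) limits as median -/+ k * MAD.

    MAD is taken to be the median of absolute deviations from the median
    (assumption). No scaling constant is applied.
    """
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    return centre - k * mad, centre + k * mad


history = [10, 12, 11, 13, 12, 11, 100]
low, high = mad_limits(history)
outliers = [v for v in history if not low <= v <= high]
```

Because the median and the MAD are barely affected by the extreme value 100, the resulting limits stay tight around the bulk of the data, which is the usual reason for preferring them to the mean and standard deviation in outlier detection.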

2.2.14.6 National Accounts

Type of error: Warning

Table                      Variable              Valid Range
V101.EE.B1GM.CLV00MF.QNW   B1GM (t)/B1GM (t-1)   Average Growth Rate²
V102.EE.P3.CLV00MF.QSW     P3 (t)/P3 (t-1)       Average Growth Rate²
V102.EE.P5.CLV00MF.QSW     P5 (t)/P5 (t-1)       Average Growth Rate²

3 The lower and higher limits of the valid range are defined by the limits specified in the MAD routine for the detection of outliers.


2.2.15 Revised data integrity Check

The revised data integrity check applies to revised datasets. This validation compares revised data with initial data and, if necessary⁴, investigates the sources of significant discrepancies. The levels of acceptable discrepancy are either ad hoc or model-specified. The table below lists the revised data integrity validation rules applied in the distinct domains:

2.2.15.1 National Accounts

Type of error: Warning

Table1 (Revised data)      Table2 (Initial data)    Condition
V101.EE.B1GM.CLV00MF.QSW   V.EE.B1GM.CLV00MF.QSW    (ValueT1 - ValueT2) <= 0.005*ValueT2
V101.EE.B1GM.CLV05MF.QSW   V.EE.B1GM.CLV05MF.QSW    (ValueT1 - ValueT2) <= 0.005*ValueT2
V101.EE.B1GM.KPM95F.QSW    V.EE.B1GM.KPM95F.QSW     (ValueT1 - ValueT2) <= 0.005*ValueT2
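The revision condition can be sketched in Python as shown below. Note one deliberate difference: the table writes the difference without an absolute value, whereas the sketch compares absolute values so that downward revisions are caught as well. That choice, the function name `revision_within_tolerance` and the parameter `rel_tol` are assumptions made for illustration.

```python
def revision_within_tolerance(revised: float, initial: float,
                              rel_tol: float = 0.005) -> bool:
    """Check that the revision stays within rel_tol of the initial value.

    Uses absolute values on both sides (assumption); the default 0.005
    is the 0.5 % threshold given in the table.
    """
    return abs(revised - initial) <= rel_tol * abs(initial)


small_revision = revision_within_tolerance(100.4, 100.0)
large_revision = revision_within_tolerance(101.0, 100.0)
```

A revision that fails the check would raise a Warning and prompt an investigation of the source of the discrepancy.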

2.2.16 Model-based Consistency Check

These rules compare quantitative data with limits derived from other data of the same reference

period, e.g. with limits set at a number of standard deviations around the data mean or limits derived

from a regression model that connects two variables.
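The first variant, limits set at a number of standard deviations around the mean of data from the same reference period, can be sketched as follows. Since no concrete rule of this type was found, everything in the sketch is illustrative: the function names, the sample data and the choice of `k = 2` standard deviations are assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def sigma_limits(values: Sequence[float], k: float = 2.0) -> Tuple[float, float]:
    """Return limits at k sample standard deviations around the mean."""
    centre = mean(values)
    spread = stdev(values)
    return centre - k * spread, centre + k * spread


def outside_limits(values: Sequence[float], k: float = 2.0) -> List[float]:
    """List the values of one reference period that fall outside the limits."""
    low, high = sigma_limits(values, k)
    return [v for v in values if not low <= v <= high]


# Hypothetical values reported for the same reference period.
period_values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
suspicious = outside_limits(period_values)
```

The regression-based variant would replace the mean by the value predicted from a second variable and the standard deviation by the residual spread.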

Note: models are also used to derive limits from historical data, against which current data are then compared. Such rules are listed under the type “Time series checks”, presented in section 2.2.14. For the time being, we have not found any specific example among the rules of deliverables 1.5 – 1.6.

4 For some datasets, revision is a normal part of the process, so the detection of revisions is a 'processing' step rather than a pure validation step. However, in order to validate the data, the revised figures should be detected and tested against some thresholds.