ec.europa.eu… · XLS file · Web view Foglio3 Foglio2 Foglio1 Manuel Herrera Espiñeira...

Page 1: ec.europa.eu… · XLS file · Web view Foglio3 Foglio2 Foglio1 Manuel Herrera Espiñeira mherrera@ine.es People Crossing files. Search records in a database. Search of duplicates

Name country office email

Manuel Herrera Espiñeira ES INE mherrera@ine.es

Bart Bakker NL CBS [email protected]

Amy Large UK ONS [email protected]

Olivier Goddeeris BE SPF Economie, PME, Classes moyennes et Energie, Direction générale Statistique et Information économique [email protected]

Nico Weydert LU STATEC [email protected]

Pieter Dewitte BE Census 2011 [email protected]

Bettina Gerber CH SFSO [email protected]

Elina Merkaine LV Central Statistical Bureau [email protected]

Intars Abrazuns LV Central Statistical Bureau [email protected]

Miroslawa Grebowiec PL CSO [email protected]

Kai Kaarna EE Statistics Estonia [email protected]

Elisa Martín ES INE [email protected]

Ioannis Nikolaidis HE EL.STAT [email protected]

Zrinka Pavlović HR Central Bureau of Statistics [email protected]

Željko Jelovečki HR Central Bureau of Statistics [email protected]

Snježana Varga HR Central Bureau of Statistics [email protected]

Martin Ribe SE Statistics Sweden [email protected]

Andej Vallo SK Statistical Office of the Slo [email protected]

1. Describe the general purpose of the work

2. Describe the integration method used

3. Describe briefly the data sets that you combined

People Crossing files. Search records in a database. Search of duplicates

probabilistic record linkage

Individuals and households

Business data

Creating the Social Statistical Database to publish all social statistics. It is cheaper, it can produce more detailed information, it can produce longitudinal and intergenerational information. However, it was not necessary

Linking by a unique identifier, micro-integration, consistent repeated weighting

All information of approximately 60 registers

Matching Census data to Coverage survey data in order to produce population estimates.

Mixture of probabilistic record linkage and clerical matching

Census data and Census Coverage Survey data at household and person level. (Mixture of variables such as name and address data, gender, age as well as derived variables like household structure.)

Integration is necessary to: lower the burden on enterprises; support estimation; support imputation; determine the population.

Since we have a unique enterprise number, the integration was based on it.

Business: VAT, business register, annual accounts, social security

1. Population Census 2001: the name of the employer is given on the questionnaires, and in order to assign the activity (NACE) code we tried record linkage between these data and the Business register data (linkage of names). 2. Association of (commercially available) telephone numbers with enterprises in the Business register (linkage of names and addresses). 3. The Commerce register uses a specific identifier, and a different identifier is used internally by administrations. For the Eurostat EGR project we tried to match both files on names and addresses.

Record linkage based on names and addresses. We used the trigram algorithm to compute the probability that names and addresses are the same.
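A trigram comparison of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the padding, the lower-casing and the Dice coefficient are assumptions, and the score is a similarity between 0 and 1 rather than a calibrated probability.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of character trigrams of a normalised, padded string."""
    cleaned = " ".join(text.lower().split())  # lower-case, collapse whitespace
    padded = f"  {cleaned} "                  # padding so word edges form trigrams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Dice coefficient on trigram sets: 1.0 = same trigrams, 0.0 = none shared."""
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb))


if __name__ == "__main__":
    # Hypothetical enterprise names, for illustration only.
    print(trigram_similarity("Boulangerie Schmit", "Boulangerie Schmitt"))
```

Pairs scoring above a chosen cut-off would be treated as candidate links; the cut-off itself has to be set by inspecting real data.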

use of a unique identifier

Our purpose of integrating different data sets is to derive new variables based on data available in different data sources. A second purpose is to cross variables that come from different data sources.

For most of the data sources, we can use a unique identifier to integrate the data sets. On the other hand, we also have data sets that don’t contain a unique identifier, e.g. where linkage is based on an address or on names. In these cases, we use a special algorithm that can match text strings that aren’t exactly identical.

We combined the following data sources: Population register (units = individuals / households), General Social and Economical Survey 2001 (units = individuals / households), Register of enterprises (units = enterprises), Register of buildings (units = parcels), Social security register (units = individuals), Educational registers (units = individuals)

Up to now we have used neither record linkage nor statistical matching. But in the coming years we will have to do record linkage for three purposes: - to link data sets over the years (longitudinal analysis); - to link data sets of the same year (cross-section analysis) to identify people who are part of two or more data sets; - to link people who appear two or more times in one data set, in order to build households

I guess we will have to follow this procedure: - first: use the unique identifier we have; - second: for every person whose unique identifier is missing, do probabilistic record linkage

The data sets consist of cases (dossiers); each case contains at least one person (up to 10 persons); the information is collected either at the dossier level or at the personal level
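The two-step procedure described here can be sketched roughly as below. The field names (`id`, `name`, `birth_date`) and the 0.85 cut-off are hypothetical, and the second step uses a simple string-similarity threshold as a stand-in; a real probabilistic linkage would estimate match weights for each variable instead.

```python
from difflib import SequenceMatcher


def link_records(left, right, threshold=0.85):
    """Return (left_index, right_index, method) triples.

    Step 1 links on the unique identifier. Step 2 handles left records whose
    identifier is missing by comparing names among unused right records with
    the same birth date (an illustrative stand-in for probabilistic linkage).
    """
    by_id = {r["id"]: j for j, r in enumerate(right) if r.get("id")}
    links, used = [], set()

    # Step 1: exact linkage on the unique identifier.
    for i, rec in enumerate(left):
        j = by_id.get(rec.get("id"))
        if j is not None:
            links.append((i, j, "identifier"))
            used.add(j)

    # Step 2: similarity-based linkage for records without an identifier.
    for i, rec in enumerate(left):
        if rec.get("id"):
            continue
        best_j, best_score = None, threshold
        for j, cand in enumerate(right):
            if j in used or cand.get("birth_date") != rec.get("birth_date"):
                continue
            score = SequenceMatcher(
                None, rec["name"].lower(), cand["name"].lower()
            ).ratio()
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            links.append((i, best_j, "fuzzy"))
            used.add(best_j)
    return links
```

Keeping the linkage method in each triple makes it possible to review the similarity-based links separately from the identifier-based ones.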

- Imputation of non-respondents and small enterprises, which are not surveyed with statistical questionnaires but estimated exhaustively from administrative sources - Updating of population units - Editing of the data -- Target: to reduce the response burden (especially for small enterprises), to avoid collecting similar variables twice, to reduce the statistical office's data collection costs, and to increase the reliability of the data

Different enterprise data from administrative sources are combined; financial data, data on wages and salaries and information on number of persons employed. Data from administrative sources on individuals are used and linked with enterprises. Data from different statistical questionnaires are combined (business questionnaires).

Data are integrated using u

In the preparation of EU-SILC (Community Statistics on Income and Living Conditions) data files, several data sets are used. We use data from administrative registers: the Population Register, the State Revenue Service (SRS) and the State Social Insurance Agency (SSIA). Information is also obtained from the survey. Administrative data are used to improve the quality of the information obtained and also to reduce the burden on respondents.

Demographic information such as a person's name and surname, sex, date of birth and personal identification number is taken from the Population Register. Practically all data on government transfers, such as pensions and state social benefits, are obtained from SSIA. Only information about some minor benefits administered by local municipalities, pensions paid by other countries, and service pensions not administered by SSIA is asked for in questionnaires. The exception is net employee cash or near cash income, which is also available from SRS, but it was decided to use information from questionnaires. Gross employee cash or near cash income was obtained by adding the taxes paid, taken from SRS, to net employee cash or near cash income from questionnaires. Information from SRS is also used for imputation purposes if the amount of net employee cash or near cash income is missing in

Statistical matching

The general aim of the activities is to apply administrative information systems to the needs of public statistics. This means that administrative data sets should be prepared in a way that allows them to be used as a source of data for statistical surveys and for the National Census.

Detailed aims (examples): • national accounts: For national accounts purposes, data from various sources are used: results of statistical surveys, administrative information, BoP data and supplementary data from other sources. For individual sector accounts, some of the data used in compiling the accounts are prepared by CSO branch divisions according to national accounts methodology.

• demography: Data integration covers both the integration of data from a single type of source (e.g. only statistical surveys or only registers) and from a combined system, i.e. statistical surveys and registers.

• enterprises: Preparation of SBS data.

• national accounts: In compiling the national accounts, derived, aggregated data are used. Statistical data are aggregated by branch divisions according to the ESA methodology. Administrative data are collected from suitable units and partly aggregated by the Computing Centre. The main task is to maintain correct calculation according to the ESA methodology and to ensure the exhaustiveness of the accounts. The data sources, procedure and methodology used in the Polish national accounts calculation for GNI purposes are described in detail in the Gross National Income Inventory.

• demography: Currently the issue of combined data integration refers to the National Census of People and Dwellings 2011. The methodology for combining data from registers and statistical surveys is being worked out. Experiences

See points 1 and 2. • demography: In current statistics on international migration (for compiling the data on immigration and emigration) and internal migration for permanent residence, two types of data sources are used, i.e. the PESEL register and the register systems of gminas. Probably, when the new Act on Registration of Population enters into force (in 2014), combining data from several information systems and registers will be necessary. Moreover, it is also essential to extend the scope of information on migrants with socio-economic data. Preparations for the National Census 2011 allowed a detailed identification of administrative sources that can be used for migration statistics purposes. However, the main problem will be to obtain personalised data, i.e. data containing the PESEL number, from the register holders, which would allow for a combination and

The purpose of the work was to analyse data on dwellings from different sources. We had data from the last census and data from a register, but we did not have a unique identifier for the dwellings.

There were data of dwellings in both datasets.

Data on annual earnings were needed for the years between two consecutive Structure of Earnings Surveys, which are carried out every four years. The Annual Structure of Earnings Survey combines information from Social Security files, data from the Quarterly Labour Cost Survey, data from a small survey conducted by the INE, and the information on income from form ‘190: Annual Summary of the Tax Agency Personal Income Tax (IRPF) Withholdings and Advance Payments on Account’ to obtain annual earnings per employee classified by gender, age, occupation, nationality and type of contract, without increasing the informative burden on enterprises. The method used is an

exact record linkage.

One of the basic aims of the ASES is to obtain up-to-date earnings results without entailing an informative burden for the respondent. For this reason, it is necessary to use a range of information sources. Three different sources are used in this survey: 1. INE - Quarterly Labour Cost Survey. The Quarterly Labour Costs Survey (QLCS) is a continuous short-term statistic, produced quarterly by the INE. The population scope is the Social Security contribution accounts whose economic activity is related to industry, construction or services. It covers the whole country. A contribution account can be defined as a local unit. For each account, all employees associated with the account are investigated. The sample size is 19,500 establishments. The QLCS provides levels and indicators on the average cost of labour per employee and month, the average cost of labour per hour

The integration of two or more data sets is carried out for updating the Greek business register. The business register (BS) is updated using data from both sampling surveys and administrative sources, such as the Social Insurance Foundation and the Value Added Tax (VAT) Business Register of the Ministry of Finance.

The integration method is based on the use of a unique identifier, such as the VAT registration number.

The data sets combined for updating the business register are the sampling surveys conducted by the Hellenic Statistical Authority (EL.STAT) and the above-mentioned administrative sources. The surveys conducted are: the Structural Business Surveys (SBS) and the Short Term Statistics (STS) surveys.

In the SBR, data from different administrative sources are being used in order to have good coverage of units in the population of businesses or to connect already registered units in the SBR with the data on employment and turnover.

In the SBR, we use a unique identifier for linking data from different sources. Businesses have only one ID, but natural persons (crafts, freelancers) have both a “business” ID and the personal ID of the owner.

In the SBR, we combine data about businesses (legal and natural persons), using the data from e.g. the Register of annual accounts (RAA) and the Tax Authority (corporate tax, income tax, VAT) for comparing activity status, data on employees and turnover. The main source is the RAA; when some data are missing, tax data are used.
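The "main source first, second source as fallback" rule can be sketched as below. The dictionaries keyed by unit ID and the variable names `employees` and `turnover` are assumptions made for the example; the actual register layout is not described here.

```python
def combine_sources(raa, tax, variables=("employees", "turnover")):
    """Combine two sources keyed by unit ID.

    Values from the main source (raa) are preferred; the fallback source (tax)
    is used only where the main source has no value for a variable.
    """
    combined = {}
    for unit_id in raa.keys() | tax.keys():
        primary = raa.get(unit_id, {})
        fallback = tax.get(unit_id, {})
        combined[unit_id] = {
            var: primary[var] if primary.get(var) is not None else fallback.get(var)
            for var in variables
        }
    return combined


if __name__ == "__main__":
    # Hypothetical units and values, for illustration only.
    raa = {"100": {"employees": 12, "turnover": None}}
    tax = {"100": {"employees": 10, "turnover": 5000},
           "200": {"employees": 3, "turnover": 800}}
    print(combine_sources(raa, tax))
```

Units present only in the fallback source are kept, which matches the stated aim of good coverage of the business population.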

CBS has been implementing a new approach and calculating Structural business statistics (SBS) by combining different administrative and statistical sources.

In the SBS, we use a unique identifier (ID number).

In the SBS, we combine and use the data from: the SBR, the Register of annual accounts (reports: balance sheet, profit and loss account, etc.), and regular CBS data, e.g. the Gross Investment survey and the Investment in Environmental Protection survey. The main source is the RAA; when some data are missing, we use data on turnover and employees from the SBR, estimate the other SBS variables and impute SBS enterprise data.

In the Labour Market Statistics Department, data from administrative sources are used to estimate the number of employees in legal entities with up to 10 employees for which reports were not submitted.

The Labour Market Statistics Department uses a unique identifier (ID number).

In Labour Market Statistics Department we combine and use the data from annual survey on persons in paid employment and data from FINA (financial agency).

Unique identifier.

Statistics Sweden maintains several registers by data integration in the form of record-linking and processing of administrative registers from various sources. An example briefly treated in the following is the Income and Taxation Register (I&T) and the Income Statistics based on Administrative Register. The purpose is to produce statistics on income and transfers. The target population comprises all individuals who by law shall be covered by the Swedish continuous population registration. The use of combined registers serves to yield essentially complete data for statistics on income. The advantages of using combined administrative registers are that data are collected in a cost-efficient way with minimal respondent burden, and that statistics with a detailed breakdown can be produced, as there are no sampling errors.

Unique identifier (Person Number, assigned to each individual in the population by the continuous population registration)

The data combined are records on individuals in administrative register data from the National Tax Board, the Swedish Social Insurance Agency, the Swedish National Board of Student Aid (CSN), and other central government agencies. These registers provide data on income from various income sources, and other taxation data etc.

Comparison of data from survey (annual structural business statistics) and administrative source (income tax returns data) for the purpose of evaluation of the data quality with the ultimate goal to replace the survey by administrative data or at least to use it to improve quality of the estimates.

The units were small entrepreneurs – the microenterprises

see the document "matching methodologies in use within ONS.doc" to have an overview of the UK practices on integration

With the single exception of the Census to CCS linkage, probabilistic matching has not been used. Instead, the linkage tends to rely on a combination of deterministic and "fuzzy", i.e. rules-based, matching. SAS is often the tool used to carry out these procedures. Despite these similarities, those carrying out the various projects are generally not conversant with the work of their colleagues in other parts of the office and there is no manual of best practice for them to follow. It is partly to address these issues that my post was created!

For data quality checking purposes, reported Intrastat data are integrated with administrative data from VAT returns received from the Tax authority (VAT data are also used for register and estimation purposes). Another example of data integration for quality checking purposes is the integration of Foreign Trade Statistics in Goods and SBS exports data.

Data are integrated by using the common identifier, i.e. the tax number of the companies.

The VAT data on intra-Community acquisitions and deliveries are combined with Intrastat reported data. The Intrastat and Extrastat exports data are combined with SBS exports data.

use of a unique identifier

The general purpose is to impute missing data and to correct false data. As the KSH data collection is a voluntary sample survey, income data must be checked and corrected using, among other methods, integration with an external information source. The selected external data collection approaches personal incomes via enterprises and budgetary institutions.

A mix of the hot deck and cold deck versions of statistical matching.
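A distance hot deck of the general kind referred to here can be sketched as follows. It is only an illustration: the matching classes (`sex`), the distance variable (`age`) and the donated variable (`wage`) are hypothetical, and the actual combination of hot deck and cold deck used is not specified in this answer.

```python
def hot_deck_match(recipients, donors, class_vars, distance_var, target_var):
    """Nearest-neighbour (distance) hot deck within matching classes.

    For each recipient, the donor in the same class with the closest value of
    distance_var donates its target_var. Recipients without any donor in their
    class receive None.
    """
    matched = []
    for rec in recipients:
        pool = [d for d in donors if all(d[v] == rec[v] for v in class_vars)]
        out = dict(rec)  # copy, so the input records are left untouched
        if pool:
            donor = min(pool, key=lambda d: abs(d[distance_var] - rec[distance_var]))
            out[target_var] = donor[target_var]
        else:
            out[target_var] = None
        matched.append(out)
    return matched
```

In a cold deck variant the donor values would come from a fixed external source rather than from the current donor file; the selection logic can stay the same.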

1. Income Survey (IS) of KSH, connected to the Microcensus 2005: a voluntary sample survey of almost 19000 households with the purpose of obtaining accurate, micro-level income data. 2. Tariff Survey (TS): a yearly statistical data collection among enterprises and budgetary institutions, including data on more than 500 thousand employees, covering individual wages and demographic data as well as characteristics of employers; carried out by the Public Employment Service (Ministry of Labour).

Purpose of the integration is to obtain more information completing the nomenclature on government units.

Two data sets are integrated: 1. the register of government units made by the Treasury; 2. the business register made by the Statistical Office.

The general purpose of linking the two files is to present Hungarian unemployment in a more complex context than the administrative data show. The administrative data received on unemployment do not include characteristics of settlements as detailed as those in the regional database of KSH.

Unemployment data of a settlement in the administrative register are linked to the size and population data of the same settlement using a unique ID number developed by HCSO.

The two linked files are as follows: 1. unemployment data of the Public Employment Service; 2. the regional database of KSH, which contains geographical and public administration data as well as characteristics of the permanent population (small area, county, region, legal status of the settlement, size of the population, area). Data are aggregated at settlement level in both files.

It serves as basis for general government planning. Editing and imputation are done by the data manager. As a result of joining the data with settlement (municipality) identification numbers the data in the relevant theme will be available by settlements.

The data of settlements are joined by the individual identifiers within territorial codes, which are maintained by the HCSO

The character of data is of no significance since although the data are individual ones, there is only one self-government in each settlement, so all in all only settlements are presented

It serves as basis for central government planning. The register is also used for checking the willingness of private individuals to submit tax returns. Editing and imputation are done by the data manager. As a result of joining the data with settlement (municipality) identification numbers the data in the relevant theme will be available by settlements.

The data of settlements are joined by the individual identifiers within territorial codes, which are maintained by the HCSO

The character of data is of no significance since the data are aggregated at the level of settlements

Observation of the travel agent activities as well as the revenues and motivation of travel of their associated. The purpose of the integration is to supplement, check and verify the data received, and to correct data if deemed necessary.

The identifier used is a unique reference number.

The data of the enterprises from the report of the annual economic statistics are combined with the revenues of travel organization and travel agent activities according to NACE Rev.2 79.11, 79.12, 79.90.

4a 4b 4c 4d 4e

Actions able to harmonize Case conversion Other

y y y y -

y sometimes n y

n y y y y

y / / / y

y y y

Conversion/erasure of acronyms

Splitting/merge of variables

y y y y

Yes (reference periods, classif y y y

y y y y

y

could help y y y

y

n n n n

n n n

Harmonisation/selection of the units falling in the target population (problem of overcoverage of the administrative data set)

n n n n n

y

Due to frequent changes in the administrative system, settlement code numbers have to be switched to the current ones.
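Switching outdated settlement codes to the current ones can be sketched with a simple change table. The table format (old code mapped to its successor) and the code values in the example are hypothetical; mergers and splits of settlements would need a richer structure than this one-to-one mapping.

```python
def current_code(code, changes):
    """Follow a chain of code changes until a code with no successor is reached.

    changes maps an outdated settlement code to the code that replaced it.
    The 'seen' set guards against accidental cycles in the change table.
    """
    seen = set()
    while code in changes and code not in seen:
        seen.add(code)
        code = changes[code]
    return code


if __name__ == "__main__":
    # Hypothetical code changes: 0101 became 0150, which later became 0177.
    changes = {"0101": "0150", "0150": "0177"}
    print(current_code("0101", changes))
```

Applying the function to the settlement code of both files before linking puts them on the same code vintage.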

Correct spelling of settlement names is important for the identification

Because of frequent changes in the administrative status the territorial codes applied in the relevant year have to be used in all cases.

The correct spelling of settlement names is important for identification

Because of frequent changes in the administrative status the territorial codes applied in the relevant year have to be used in all cases.

The correct spelling of settlement names is important for identification

4f 5

Describe 'in broad lines' the subsequent steps

-

In order to get good

When the trigram algorithm gives high percentages records can be considered a

Data cleaning, synonyms, parsing (HMM), blocking (adapted SQL queries) and fuzzy matching with reweighting of variables for each record

It is the integration of 60 registers. This can’t be answered briefly.
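Blocking followed by weighted fuzzy comparison can be sketched as below. Everything specific in it is an assumption made for illustration: the blocking key (first letter of the name plus birth year), the field weights, and the interpretation of "reweighting" as renormalising the weights over the fields that are filled in for a given pair of records.

```python
from collections import defaultdict
from difflib import SequenceMatcher

WEIGHTS = {"name": 0.6, "address": 0.4}  # hypothetical field weights


def block_key(rec):
    """Hypothetical blocking key: first letter of the name and birth year."""
    return (rec["name"][:1].lower(), rec["birth_year"])


def candidate_pairs(file_a, file_b):
    """Yield index pairs (i, j) only for records that share a blocking key."""
    blocks = defaultdict(list)
    for j, rec in enumerate(file_b):
        blocks[block_key(rec)].append(j)
    for i, rec in enumerate(file_a):
        for j in blocks.get(block_key(rec), []):
            yield i, j


def weighted_score(a, b, weights=WEIGHTS):
    """Weighted string similarity; weights are renormalised per record pair
    over the fields that are present in both records."""
    usable = {f: w for f, w in weights.items() if a.get(f) and b.get(f)}
    total = sum(usable.values())
    if total == 0:
        return 0.0
    score = sum(
        w * SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f, w in usable.items()
    )
    return score / total
```

Blocking keeps the number of comparisons manageable: only pairs inside the same block are scored, at the cost of missing true matches whose blocking variables disagree.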

Data undergoes optical character recognition before it enters matching

Results of estimates of overcount and undercount are validated by: - Clerical quality assurance processes - Checking estimates of over and under count against the Longitudinal Study estimates.

The periods covered by the different data sets should be brought into line with each other.

1. checking definition used 2. population datasets 3. checking accuracy 4. other detection by linking data

Cannot be answered up to now – practice will show

We have to clean some files (see also 5). The algorithm mentioned in point 2 recognizes automatically some acronyms (e.g. str. for street).
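Automatic recognition of abbreviations such as "str." can be sketched as a token-level replacement applied before matching. The abbreviation table below is hypothetical and would in practice be language-specific; the algorithm mentioned in the answer is not described in enough detail to reproduce.

```python
import re

# Hypothetical abbreviation table; a real one would be built per language.
ABBREVIATIONS = {"str.": "street", "ave.": "avenue", "rd.": "road"}


def normalise_address(text):
    """Lower-case, collapse whitespace and expand known abbreviations."""
    tokens = re.sub(r"\s+", " ", text.strip().lower()).split(" ")
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


if __name__ == "__main__":
    print(normalise_address("  Main  Str. 12 "))
```

Running both files through the same normalisation before comparison removes differences that are purely a matter of spelling convention.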

1. Depending on the quality of the files, we check whether there are any inconsistencies and make corrections if necessary. For some registers we can skip this step because of their very good quality. 2. Automatic linking of the data sets based on an algorithm. 3. Manual checking of special cases (e.g. names of streets that we didn’t find in the other register with the automatic procedure).

It depends on which variables would be used for the probabilistic record linkage; but the demographic variables all have the same format, so I think that no pre-processing phases are necessary.

- Merging of records to correct and harmonise periods and to avoid double records in one period - Correction of national language (if necessary) - Counting individual employees and linking them with enterprises - Preparation of the linked data set and validation checks (with corrections if necessary, according to the methodology prepared for each statistical product) - Preparation of output tables and verification

Subsequent steps are not used.

We used corrections for some values

Examples: • demography: Currently various data sources are used, both non-statistical (national and foreign registers, databases, information systems) and national statistical surveys (LFS and EU-SILC). Nevertheless, the following circumstances are taken into consideration: • the specific character of the data sources, i.e. the purposes for which the holder (or the country, in the case of foreign sources) uses the register or database; this also applies to statistical surveys, e.g. the LFS, which aims to survey the economic activity of the population, and EU-SILC, which aims to

• national accounts: E.g. in the process of balancing GDP it is necessary to balance both the production and expenditure sides. In this case all inconsistencies are analysed and, in the end, corrections are introduced.

• enterprises: Check for population coverage. • STS comparison information: check for definitions, population coverage, frequency of collection, analysis of the quality of selected data sources, corrections for incompatible values or inconsistencies

• services: 1. Analysing the data sets collected within the statistical survey and selecting the REGON of units that exceeded the thresholds set for the quarterly survey. 2. Selecting, from the register provided by the National Bank of Poland, the REGON of units that exceeded the thresholds set for the quarterly survey. 3. Selecting, from the VAT system, the NIP of units that exceeded the thresholds set for import of services. 4. Matching the NIP of the units selected in point 3 with the NIP in the base of statistical units (BJS) in order to obtain the REGON. Combining the REGONs of all units selected in points 1-3 and eliminating duplicates.
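The selection steps listed for services can be sketched as set operations. The record layout (`regon`, `nip`, `value`), the two threshold parameters and the lookup table standing in for the base of statistical units (BJS) are all hypothetical; only the order of the steps follows the description.

```python
def select_units(survey, nbp_register, vat_imports, nip_to_regon,
                 survey_threshold, import_threshold):
    """Return the sorted, de-duplicated REGONs of units above the thresholds.

    survey, nbp_register: lists of {"regon": ..., "value": ...}
    vat_imports: list of {"nip": ..., "value": ...}
    nip_to_regon: lookup from NIP to REGON (stand-in for the BJS).
    """
    # Steps 1 and 2: units above the quarterly-survey threshold.
    from_survey = {u["regon"] for u in survey if u["value"] > survey_threshold}
    from_nbp = {u["regon"] for u in nbp_register if u["value"] > survey_threshold}
    # Step 3: NIPs above the import-of-services threshold in the VAT system.
    nips = {u["nip"] for u in vat_imports if u["value"] > import_threshold}
    # Step 4: translate NIP to REGON; NIPs without a match are left out here.
    from_vat = {nip_to_regon[n] for n in nips if n in nip_to_regon}
    # Final step: the set union combines the sources and removes duplicates.
    return sorted(from_survey | from_nbp | from_vat)
```

NIPs that cannot be translated to a REGON are silently dropped in this sketch; in practice they would probably be listed for manual follow-up.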

• another: It was the first experiment in which we tried to merge two data sets, and several steps were considered: corrections for incompatible values and inconsistencies, reweighting, imputation and other procedures ensuring the quality of the data integration result. As mentioned above, reweighting was used because the donor data set was a sample survey. Imputation as a method for treating nonresponse was not used.

Once all the data sets have been integrated, the final data set consists of a sample of employees from a sample of local units, that is, a two-stage stratified sample. The following stages are:
- Validation process: consists in using filters that allow us to separate valid records from those with inconsistencies to be revised.
- Imputation process: invalid data are corrected or removed
- Calculation of the final grossing up factors
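As a rough illustration of the validation and grossing-up stages, the sketch below separates valid records with a list of filters and computes a factor for a two-stage sample. The filters, field names and the weight formula (the usual inverse of the two sampling fractions) are assumptions made for the example; the questionnaire does not state how the factors are actually calculated.

```python
def split_valid(records, filters):
    """Separate records passing every filter from those to be revised."""
    valid, to_revise = [], []
    for record in records:
        if all(check(record) for check in filters):
            valid.append(record)
        else:
            to_revise.append(record)
    return valid, to_revise


def grossing_up_factor(units_in_stratum, units_sampled,
                       employees_in_unit, employees_sampled):
    """Two-stage design weight: inverse of both sampling fractions (assumed)."""
    first_stage = units_in_stratum / units_sampled
    second_stage = employees_in_unit / employees_sampled
    return first_stage * second_stage


# Hypothetical filters: wage must be positive, hours must be plausible.
filters = [lambda r: r["wage"] > 0, lambda r: 0 < r["hours"] <= 80]
records = [{"wage": 1200, "hours": 40}, {"wage": -5, "hours": 40}]
valid, to_revise = split_valid(records, filters)
factor = grossing_up_factor(200, 20, 50, 10)
```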

Corrections for incompatible values or inconsistencies
Check for population coverage

For SBR needs we actually have to “clean” the data, because sometimes double records appear or there is a problem with ID numbers: some units submit their reports under someone else’s ID number, and the delivered data sets are not cleaned in that sense.

Different data sets are compared in order to identify double entries, which are then checked to analyse the problem. Data sets are also compared with the previous year’s data to check the completeness of coverage: if some units are missing, they are investigated to find out why they are not covered in this year’s data…
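Both checks can be sketched with plain sets and counters. The function names and the use of bare unit identifiers are assumptions for the illustration; the questionnaire does not say which tools are used.

```python
from collections import Counter


def find_double_entries(*datasets):
    """Return identifiers that occur more than once across the given data sets."""
    counts = Counter(unit_id for data in datasets for unit_id in data)
    return sorted(unit_id for unit_id, n in counts.items() if n > 1)


def missing_since_last_year(last_year_ids, this_year_ids):
    """Return units present last year but absent from this year's data."""
    return sorted(set(last_year_ids) - set(this_year_ids))


# Hypothetical identifiers
doubles = find_double_entries(["A", "B"], ["B", "C", "C"])
missing = missing_since_last_year(["A", "B", "D"], ["A", "B", "C"])
```

The lists only point to the cases to look at; deciding whether a double entry is an error or a missing unit has really ceased to exist remains a manual investigation, as described above.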

For SBS needs, cooperation with FINA (the financial agency which collects RAA) has been established, and the annual financial reports of enterprises were therefore extended with additional information needed for statistics, for which CBS defined the content of the variables needed and the related methodological explanations. Before using the input files we have to carry out the technical preparation (unified file and field format). To calculate SBS variables we have to share or merge some accounting information.

Different data sets are compared in order to identify double entries, which are then checked to analyse the problem. Data sets are also compared with the previous year’s data to check the completeness of coverage: if some units are missing, they are investigated to find out why they are not covered in this year’s data…

For the Labour Market Statistics Department, all pre-processing phases are carried out in the IT Sector. The definition of the variables is the same.

Different data sets are compared in order to identify double entries, which are then checked to analyse the problem. Data sets are also compared with the previous year’s data to check the completeness of coverage: if some units are missing, they are investigated to find out why they are not covered in this year’s data…

In I&T there are few text variables; the text is already customized and in capital letters. In I&T there are several monetary variables; these are corrected so that all monetary values are numerical and not blank or missing.
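A correction of this kind could look like the sketch below. It is only an assumption about the rule applied: blank, missing or unreadable values are replaced by a default of 0, and a decimal comma is accepted. The questionnaire does not say which replacement value is actually used.

```python
def to_amount(raw, default=0.0):
    """Convert a raw monetary field to a float; blanks and junk become `default`.

    Assumes a decimal comma or point and no thousands separator other than spaces.
    """
    if raw is None:
        return default
    text = str(raw).strip().replace(" ", "")
    if not text:
        return default
    try:
        return float(text.replace(",", "."))
    except ValueError:
        # Non-numeric content (e.g. "n/a") is treated like a missing value.
        return default


cleaned = [to_amount(v) for v in [" 12,5 ", "", None, "n/a", 7, "1 234,50"]]
```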

In the production of I&T we match several data sets and create new variables. In order to have the "right" population, we create population codes in the register.

Identification of double and missing identifiers (problem of undercoverage of the register), check for population coverage

The office has little expertise in the micro integration processing stage, which is sometimes left out altogether in our work. We need to become better educated in this aspect.

Several lines of the VAT return are merged to harmonise with Intrastat data. Intrastat and Extrastat data are merged to be combined with SBS data.

Integrated data can be queried together to check the consistency of Foreign Trade Statistics data. Bias between the data sources might have methodological reasons; therefore corrections are not always necessary.

To obtain supplementary information

A simple record linkage. Defective connections are corrected manually

Step 1: The dataset is completed with so-called I-variables (Imputed) and C-variables (Corrected) as new variables, in order to be able to check the effect of the correction later.
Step 2: Imputation of income data of persons not responding in the IS, using TS data. Methodology used: Statistical Matching, Cold Deck version. The primary data set is the IS and the secondary is the TS.
Step 3: Imputation of missing data on income coming from a second job or side-job. Methodology used: Statistical Matching, Cold Deck version. The primary data set is the IS and the secondary is the TS.
Step 4: Imputation of missing data on income coming from enterprises. Methodology used: Statistical Matching, Hot Deck version. The primary data set consists of the IS respondents who provided the data, and the secondary consists of the IS respondents who did not answer this item.
Step 5: Correction of the data sets with data obtained from the statistical matching procedures.
Step 6: Replacement of social incomes at personal and household income level using simulation techniques. 4 micro-modules for households and 12 for persons are used to correct social income data that are missing according to macro-level estimations. The distributions in the parameter tables come from external data sources (Tax, social security, National Bank, ...).
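The idea behind Step 1 and Step 4 can be illustrated with a minimal random hot deck. This is not the procedure actually used: the matching variable (`class_key`), the random draw within classes and the `I_` prefix for the new variable are assumptions made for the sketch. Donors are the respondents of the same data set who answered the item, and the imputed value is written to a separate I-variable so that the original value stays available for checking.

```python
import random


def hot_deck_impute(records, target, class_key, seed=0):
    """Fill missing `target` values from donors in the same class.

    The result is stored in a new variable "I_<target>"; the original
    `target` field is left untouched.
    """
    rng = random.Random(seed)
    donors = {}
    for rec in records:
        if rec[target] is not None:
            donors.setdefault(rec[class_key], []).append(rec[target])

    completed = []
    for rec in records:
        new = dict(rec)
        value = rec[target]
        pool = donors.get(rec[class_key])
        if value is None and pool:
            value = rng.choice(pool)  # draw one donor value at random
        new["I_" + target] = value    # stays None when no donor exists
        completed.append(new)
    return completed


# Hypothetical records
data = [
    {"id": 1, "region": "N", "income": 100},
    {"id": 2, "region": "N", "income": None},
    {"id": 3, "region": "S", "income": None},
]
result = hot_deck_impute(data, "income", "region")
```

A Cold Deck variant, as in Steps 2 and 3, would take the donor values from a different source (here the TS) instead of from respondents in the same data set.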

Conversion of txt format into SAS format

These steps are made by the data manager

These steps are made by the data manager

n/a

The two data sets are integrated with the help of the reference number in Excel (using the VLOOKUP function); wrong figures can be recognized by comparison with preceding elements of the time series respective
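Outside Excel, the same two operations could be sketched as follows. The field name `ref`, the function names and the tolerance `max_ratio` are hypothetical; the comparison rule (flag a value that differs from its predecessor by more than a fixed ratio) is one possible reading of the check described above, not the documented one.

```python
def lookup_join(primary, secondary, key="ref"):
    """Attach secondary fields to primary rows by reference number.

    Like a lookup with exact match, the first matching row wins;
    rows without a match are returned unchanged.
    """
    index = {}
    for row in secondary:
        index.setdefault(row[key], row)
    return [{**row, **index.get(row[key], {})} for row in primary]


def suspicious_points(series, max_ratio=2.0):
    """Return positions whose value differs from the preceding one by more than max_ratio."""
    flagged = []
    for i in range(1, len(series)):
        previous, current = series[i - 1], series[i]
        if previous and not (1 / max_ratio <= current / previous <= max_ratio):
            flagged.append(i)
    return flagged


# Hypothetical data
joined = lookup_join(
    [{"ref": 1, "a": 10}, {"ref": 2, "a": 20}],
    [{"ref": 1, "b": "x"}, {"ref": 1, "b": "y"}],
)
flags = suspicious_points([100, 105, 1050, 110])
```

Note that a single wrong figure is flagged twice, once when the series jumps to it and once when it returns to the normal level.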

6 7 8

Were the results published? Comments

No published not accessible

www.cbs.nl

Unknown.

Nothing published

Have integrated microdata been released?

It is published in hundreds of tables in StatLine and hundreds of articles in journals and yearbooks.

Yes – official ONS population estimates and subsequent reports. Documentation can be provided on request. (Technical documentation has been produced for internal consumption and may be available on request.)

No, it concerns a pilot project

No Not available at micro data level

No publication up to now as record linkage has not yet been done

Microdata cannot be accessed as there is the highest level of data protection to be followed

No
http://www.csb.gov.lv/csp/content/?cat=5413
http://epp.eurostat.ec.europa.eu/portal/page/portal/microdata/eu_silc

not applicable

Examples:

• national accounts: The data sources, procedures and methodology used in the Polish national accounts calculation for GNI purposes are described in detail in the Gross National Income Inventory.

• enterprises: EUROSTAT SBS database
• STS: The results were published in the final technical report of the project action, whose aim was to analyse the possibility of obtaining data on turnover for short-term statistics in the field of trade and services from administrative data sources, and to evaluate their practical application in the statistical system

• another: All results were published in a technical report prepared by members of the so-called subgroup for statistical and mathematical methods for the census. The report was written only in Polish and is available in the Statistical Office in Poznan and the Central Statistical Office in Warsaw.

Examples:

• regional statistics: Yes, because it is very important for the development of regional statistics; it reduces the burden on respondents and the costs of the surveys conducted. Integration of two or more data sets could provide a better data source for public statistics. Through this, we will gain the possibility to aggregate data at lower administrative levels.

• STS: The work connected with introducing tax data for small enterprises into the statistical system is being carried on.

Preparations for the 2011 Population and Housing Census. Coherence and quality analysis of the 2000 Census and the Register of Construction Works; Pindmaa, K., Kaarna, K.; MONTHLY BULLETIN OF ESTONIAN STATISTICS No. 11/2007 Pp. 147–151, 153–157

Microdata are not available

Yes, there is an annual release and the main results of the survey are published in tables that can be consulted on www.ine.es, Wage Structure Survey. The full link is:
http://www.ine.es/jaxi/menu.do?type=pcaxis&path=%2Ft22%2Fp133&file=inebase&L=1

no

SBR data is not published

Microdata are released for scientific purposes, in a format from which individuals cannot be identified either directly or indirectly. Data identifying statistical units other than individuals are similarly not released. The microdata are released by the Division of Statistical Information and Editions, and only after the opinion of the Committee of Statistical Confidentiality.
In EL.STAT, the main tools for protecting microdata are excluding obvious identifiers, limiting geographic detail and limiting the number of variables in the file, as follows:
- Sampling
- Issuing multiple files, one with more detailed geography and less detailed characteristics and another with less detailed geography and more detailed characteristics
- Grouping, by splitting continuous variables into ranges to reduce detail
- Grouping and recoding into broad categories
- Eliminating any variables that can be used to link to external sources that contain individual identifiers
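Two of the tools listed above, removing identifiers and splitting continuous variables into ranges, can be illustrated as follows. The field names and band widths are hypothetical, and the sketch is not a description of the procedure actually applied by EL.STAT.

```python
def to_range(value, width):
    """Replace an integer value by the range of the given width that contains it."""
    lower = (value // width) * width
    return f"{lower}-{lower + width - 1}"


def protect(records, identifiers, banded):
    """Drop identifying fields and band the listed continuous fields.

    `identifiers` is a set of field names to remove; `banded` maps a
    field name to the width of its ranges.
    """
    protected = []
    for record in records:
        safe = {k: v for k, v in record.items() if k not in identifiers}
        for field, width in banded.items():
            safe[field] = to_range(safe[field], width)
        protected.append(safe)
    return protected


# Hypothetical record
released = protect(
    [{"name": "X", "tax_id": "123", "age": 37, "income": 18250}],
    identifiers={"name", "tax_id"},
    banded={"age": 10, "income": 5000},
)
```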

SBS Preliminary results were published in the First Release form (available on our web site: http://www.dzs.hr/default_e.htm)

Employment data are published regularly in First Releases, Statistical Yearbook etc. (available on our web site http://www.dzs.hr/)

http://www.scb.se/Pages/Product____115933.aspx
http://www.scb.se/Pages/Product____116008.aspx
A. & B. Wallgren, Register-Based Statistics, 2007, Chichester: Wiley, Sect. 1.4.1.

Results will be published at the forthcoming Q2010 conference in Helsinki under the title: Vallo, A., Bielakova, A. (2010): Quality issues on the way from survey to administrative data: the case of SBS statistics of microenterprises in Slovakia.

The results are used internally
n/a
This data integration is essential to check the quality of Foreign Trade Statistics data

The microdata are not accessible

no no To compile the regional government data and to supply information to the social services statistics

n/a

A methodological documentation has been done but not published

Dissemination database – www.ksh.hu
Calculated indicators: Regional Statistical Yearbook, Megyei statisztikai évkönyv (Statistical Yearbook of the Counties, only in Hungarian) - Online store: http://portal.ksh.hu/pls/portal/CP.KSHSHOP_HTML_PRE_ENG?nn=231094903

www.ksh.hu/database

Regional Statistical Yearbook of Hungary: Online store
www.ksh.hu/database

http://portal.ksh.hu/pls/ksh/docs/hun/xftp/idoszaki/jeltur/jeltur08.pdf
http://portal.ksh.hu/pls/portal/docs/PAGE/SZAMOKBAN_UTAZUNK_UJ/ELEMZESEK/STAT%20TUKOR%201035%200607%20MT.DOC

Revenue related individual microdata are not released. Only the totals are published by activity types

This data integration is essential to check the quality of Foreign Trade Statistics data

To compile the regional government data and to supply information to the social services statistics

country         questionnaires
Belgium         2
Switzerland     1
Estonia         1
Spain           2
Greece          1
Croatia         3
Hungary         6
Luxemburg       1
Latvia          2
Netherlands     1
Poland          1
Sweden          1
Slovak Rep.     1
UK              2
Total           25