Metadata management in Statistics Canada’s business surveys

Invited paper to ICES IV, Montreal, Quebec, June 11 to 14, 2012. Submitted by Statistics Canada¹

1. Introduction

Statistics Canada is currently standardizing its statistical processing systems. This requires that statistical metadata be managed effectively in its new corporate business architecture. Specifications for survey design, sample design, content, edits, imputation, estimation, products and services will be created in metadata systems and then used to drive the various steps of the process directly, without re-work or re-entry of information.

In response, a metadata semantic model is being developed as a framework for common metadata terminology and understanding across the Agency. The model is based on the data and metadata requirements of the new Integrated Business Survey Program (IBSP), using the Generic Statistical Business Process Model (GSBPM) as a reference model. The goal is to establish common metadata definitions and central metadata repositories with standardized interfaces in order to reduce the number of redundant metadatabases, and to build on the metadata objects already in the Agency’s corporate metadata registry and repository, the Integrated Metadatabase (IMDB). The eventual plan is to align with the terminology of the Generic Statistical Information Model (GSIM), which is still under development.

Another initiative consists of the Directive on the Management of Microdata Files and a companion directive on Aggregate Data Files, both part of the Agency’s information management strategy for structured data from business, economic and social statistical programs. These two directives include guidelines with common metadata objects for describing the data files, and are based on the metadata objects from the IMDB and from the Data Documentation Initiative (DDI) metadata standard.

The international metadata standards DDI and Statistical Data and Metadata eXchange (SDMX) are used or being considered as part of the Agency’s metadata management strategy. Web services have been developed in DDI 3.0 format, which will expose metadata from the IMDB to be used in the Agency’s business surveys.

This paper provides some examples of how the Agency is implementing a standardized approach to metadata management in business surveys. Section 2 of the paper provides the principles of Statistics Canada’s new business architecture; section 3 provides a high-level description of the Agency’s metadata semantic model and its relationship to the IMDB; section 4 discusses the use of the GSBPM and metadata modelling in the Integrated Business Survey Program (IBSP); section 5 shows the use of common metadata objects as part of the Agency’s management of data files; and section 6 gives examples of where Statistics Canada is already implementing international metadata standards.

1 Prepared by Alice Born ([email protected]) and Tim Dunstan ([email protected]) of Standards Division, Statistics Canada, Ottawa (April 23, 2012).

2. Principles of Statistics Canada’s new business architecture

In 2010, Statistics Canada implemented a long-term agency-wide review of its business architecture. The review states that specifications for survey design, sample design, content, edits, imputation, estimation, products and services should be created and captured in metadata systems, and then used to drive the various steps of the process directly, without re-work or re-entry of information. The principles of this new business architecture that pertain to metadata management are:

a) Metadata-driven processes: rather than construct metadata at the end of the process, metadata should be integral to the process and, indeed, drive the process. Metadata should precede data.

b) Maximize re-use: this includes re-use of business processes, enabling computer systems, and information. If there is a metadata registry with specific metadata already in it, we need not develop another but should rather develop the potential of the existing registry. It also implies not maintaining redundant copies of specific data and metadata.

c) Statistical information management: a strong, corporate statistical information management framework is required and should include development of data service centres. This goal pertains to metadata management as well as to data management and other types of statistical information management, such as production information management.

d) Eliminate rework: there is a need to identify and remove instances of rework. For example, micro-editing repeated at collection, before processing, after processing, during estimation and prior to dissemination should no longer be supported. This includes elimination of redundant metadata.

These principles have guided the initiatives presented in this paper, including the newly integrated business survey program, and will guide the future development of a corporate metadata management strategy.
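Principle (a) can be made concrete with a small sketch. The Python fragment below is illustrative only: the `EditSpec` structure, the variable names and the range limits are hypothetical and are not taken from any Statistics Canada system. It shows an edit step whose behaviour is determined entirely by specifications registered before any data arrive, so that changing the metadata changes the processing without changing the code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class EditSpec:
    """One range edit, registered as metadata before collection (hypothetical)."""
    variable: str
    minimum: float
    maximum: float


# Hypothetical specification store; in a metadata-driven system these entries
# would be read from a metadata repository rather than written in code.
EDIT_SPECS: List[EditSpec] = [
    EditSpec("total_revenue", 0.0, 1_000_000_000.0),
    EditSpec("employees", 0.0, 500_000.0),
]


def apply_edits(record: Dict[str, Optional[float]], specs: List[EditSpec]) -> List[str]:
    """Return the names of the variables that fail their registered edit."""
    failures = []
    for spec in specs:
        value = record.get(spec.variable)
        # A missing value or a value outside the registered range is a failure.
        if value is None or not (spec.minimum <= value <= spec.maximum):
            failures.append(spec.variable)
    return failures
```

For example, `apply_edits({"total_revenue": -5.0, "employees": 12}, EDIT_SPECS)` returns `["total_revenue"]`. The same function serves any survey, because the survey-specific knowledge lives in the specifications.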

3. Statistics Canada’s Integrated Metadatabase and Metadata Semantic Model

The Integrated Metadatabase (IMDB)² is the only true corporate registry of statistical metadata in the Agency, and is a good example of metadata management at Statistics Canada. The IMDB data model is based on the international metadata standard ISO/IEC 11179 Metadata Registries and its extension, the Corporate Metadata Repository (CMR), which covers the metadata requirements over the survey life cycle. Using this standardized approach, the metadata in the IMDB are all structured the same way, and provide a source of common metadata objects for the Agency’s business architecture.

As part of the Agency’s effort to standardize metadata terminology and structure, Statistics Canada’s information architecture group is developing enterprise metadata semantic model diagrams to illustrate the metadata objects in the survey production process and their relationships (Figure 1). Most of the objects have been taken from the IMDB. Many of the individual objects have their own more detailed maps; others are yet to be developed. Figure 2 is an example of a detailed diagram for survey instrument, which is based on the IMDB’s broad definition of survey (i.e., direct survey, administrative file, and derived survey).

2 For more details on Statistics Canada’s Integrated Metadatabase, see Statistics Canada (2007): Integrated Metadatabase (IMDB) – A metadata repository to support the survey life cycle, Working Paper 4, UNECE Workshop on the Common Metadata Framework, Vienna, Austria, 4-6 July 2007.

Figure 1: Statistics Canada’s enterprise metadata semantic model diagram.

Figure 2: Metadata semantic model diagram for survey instrument.

The IMDB provides structure and rules for registering descriptions for each of Statistics Canada’s 900+ surveys, of which 342 are currently active. Of the currently active surveys, 180 are a mixture of establishment- or enterprise-based business surveys covering agriculture, manufacturing, energy, distributive trades, transportation and services. The descriptions (or reference metadata) are aimed at helping users interpret the statistical data, and include metadata on survey purpose, survey instruments (questionnaires), methodology and data accuracy. Also included for each business survey is definitional metadata – metadata on variables, classifications, questions and response choices. These metadata are versioned over time with each new release of data from the business survey. Below is an example of the metadata record generated from the IMDB for the Annual Survey of Service Industries: Food Services and Drinking Places (Figure 3).

Figure 3: Reference and definitional metadata for the Annual Survey of Service Industries: Food Services and Drinking Places

http://www23.statcan.gc.ca:81/imdb/p2SV.pl?Function=getSurvey&SDDS=4704&lang=en&db=imdb&adm=8&dis=2
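The versioning of definitional metadata with each data release can be sketched as a small data structure. This is a minimal illustration and not the IMDB schema: the SDDS number 4704 is taken from the URL above, while the class names, the fields, the variable identifier and the question wording are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class VariableVersion:
    """Definitional metadata for one variable as of one data release."""
    reference_period: str            # four-digit year, e.g. "2009"
    question_text: str
    response_choices: Tuple[str, ...]


@dataclass
class SurveyRecord:
    """Survey description plus its versioned variables (hypothetical layout)."""
    sdds: int
    title: str
    variable_versions: Dict[str, List[VariableVersion]] = field(default_factory=dict)

    def register(self, variable_id: str, version: VariableVersion) -> None:
        """Add a new version without overwriting earlier ones."""
        self.variable_versions.setdefault(variable_id, []).append(version)

    def as_of(self, variable_id: str, reference_period: str) -> Optional[VariableVersion]:
        """Return the latest version in force at the given reference period."""
        candidates = [
            v for v in self.variable_versions.get(variable_id, [])
            if v.reference_period <= reference_period
        ]
        return max(candidates, key=lambda v: v.reference_period, default=None)


record = SurveyRecord(
    4704, "Annual Survey of Service Industries: Food Services and Drinking Places"
)
# Hypothetical variable and wording, for illustration only.
record.register("revenue_type", VariableVersion("2009", "Report revenue by type.", ("Food", "Alcohol")))
record.register("revenue_type", VariableVersion("2011", "Report revenue by type.", ("Food", "Alcohol", "Other")))
```

Keeping every version, rather than overwriting, is what lets a user interpret an older data release with the definitions that applied at the time.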

In addition, the metadata in the IMDB support some of the metadata requirements for phases of the Generic Statistical Business Process Model (GSBPM). At the specify needs and design phases, the IMDB can be consulted and used by survey managers to identify existing variables and the questions for measuring them (Figure 4). As part of the dissemination phase, IMDB content is used by other public information dissemination systems in the following forms:

a) The information modules accessible on the Statistics Canada website containing statistical data.

They include The Daily, CANSIM, Summary tables, Statistics Canada Publications and Statistics by subject portals. The Daily articles offer a link to the related IMDB survey records describing the survey methodology, the variables estimated and data accuracy. In addition, the Summary tables, the Publications and Statistics by subject portals provide links to the related IMDB survey records. Another portal, Statistics by variable (Figure 5), allows users to access business statistics released on a specific variable (e.g., “type of revenue of establishment”).³ As a result of these linkages, users finding a relevant product in any one module can easily and quickly find related products and associated metadata.

b) Statistical products available in electronic form including Public Use Microdata Files (PUMFs).

The Data Liberation Initiative (DLI) uses the DDI standard to provide the documentation related to the PUMFs. The metadata already existing in IMDB are re-used instead of being re-created.

c) Publication appendices on methodology and data quality.

Authors are encouraged to re-use the content of the related IMDB survey records for these publication sections. More and more authors are doing so.

[Figure 4 diagram. Recoverable labels: GSBPM phases 1 Specify needs, 2 Design, 3 Build, 4 Collect, 5 Process, 6 Analyze, 7 Disseminate, 8 Archive; overarching process: Quality management and metadata management; data store labels: Operational data, Registers, Survey data, Administrative data, Data stores, Operational Data Stores, Input data, Micro-data, Confidential aggregate data, Public output data; metadata labels: Metadata/paradata, IMDB.]

Figure 4: Relationship of metadata content in the IMDB to the Generic Statistical Business Process Model (GSBPM) and the data stores in Statistics Canada.

3 See variables for business statistics at: http://www23.statcan.gc.ca:81/imdb/p3Variable.pl?Function=getStatDEIndex&CLItem_Id=112722&CE_Id=1065&CE_Start=01010001&t=t&db=imdb&dis=2&adm=8

Figure 5: Statistics by variable for business statistics.

4. The Generic Statistical Business Process Model in the Integrated Business Survey Program

Statistics Canada is currently undertaking a major redesign of its business statistics program. Once completed in 2016, the Integrated Business Statistics Program (IBSP) will provide a common framework for approximately 120 business surveys across ten different statistical programs. The IBSP will be implemented in several phases. Phase one will include integration of annual manufacturing surveys, services, distributive trades, and the Capital Expenditure Survey. Phase two will integrate transportation surveys and the energy program surveys. The final phase integrates industrial finance and agriculture programs into the IBSP.

The IBSP business development team is using the Generic Statistical Business Process Model (GSBPM) to help achieve many of these goals.⁴ The approach was to map the existing business processes (i.e., IBSP Business Processes) to the GSBPM (Figure 6). Business use cases were then produced for each of the IBSP business process groups. Finally, a detailed business process model was produced by expanding the second level of the GSBPM and writing names and descriptions for the customized sub-phases and sub-sub-phases of the model.

There were a number of benefits derived from using the GSBPM in developing the IBSP business processes:

• The GSBPM is a well-structured, high-level business process model that facilitates the description of coherent, detailed, standard business processes.
• Duplication and redundancy of processes are avoided by rigorous use of the GSBPM.
• The development of business activity descriptions was facilitated through use of the GSBPM.
• A description of information needs was a by-product of developing the business use cases.
• Use of the GSBPM promoted the use of common tools and generalized systems that fulfil the business requirements of many of the processes.
• Documentation of the detailed business process was completed as the business process analysis advanced.
• Management of future changes in a process or sub-process (change management) is greatly facilitated by use of the GSBPM.

As a by-product of doing a detailed business analysis using a standard reference model such as the GSBPM, the need for tools at each phase was determined and the number of tools minimized by making use of common tools. The IBSP has a set of common tools in use from the previous program, and these tools can now be updated and revised to reflect the business needs identified by the business analysis.

4.1 Metadata environment in the IBSP

Figure 7 shows the metadata environment of the IBSP, which includes definitional, operational/systems and methodological metadata. The IBSP metadata environment uses common (“generalized”) tools and metadata from the IMDB and other metadata repositories in its “metadata-driven” statistical processing. The IBSP metadata environment includes strategies (i.e., content, collection and methodology), tools to implement those strategies, and standard metadata to drive the processing. The metadata objects in the diagram are all used, transformed, or produced by the different tools interfacing with the environment.

With the development of the Generic Statistical Information Model (GSIM) in progress, the IBSP metadata objects and the IMDB metadata objects have been created with the model in mind. They will fit into the third or fourth level of the model directly or into a grouping in a higher level of the model. In this way, the IBSP vision is starting to converge with international metadata standards and reference models for systems and content development.

4 Statistics Canada adopted the Generic Statistical Business Process Model (GSBPM) as a reference model on March 15, 2010. It is based on version 4 (2009) developed by the Joint UNECE / Eurostat / OECD Work Session on Statistical Metadata (METIS).

[Figure 6 matrix. The cell positions were lost in extraction, so only the labels are reproduced here. An x after a sub-process indicates that its row carried a mark in the source; the column of each mark is not recoverable.]

Columns (IBSP business process model groupings, numbered 1 to 11 in the figure; order not recoverable): Specify Needs; Design Content and Collection; Design Methodology; Build Collection Instrument; Design/Build/Test System Component; Design/Build/Test System Workflow; Sampling and Collection Processes; Post Collection Processing; Analyze; Disseminate; Archive.

Expected # of workshops (one value per column): 2 4 3 3 3 3 3 5 4 3 2

Start Date and Finish Date rows: no values shown.

Rows (GSBPM Level 1 phase and Level 2 process):

1 Needs: 1.1 Determine needs for information (x); 1.2 Consult and confirm need (x); 1.3 Establish output objectives (x); 1.4 Identify concepts (x); 1.5 Check data availability (x); 1.6 Prepare business case (x)
2 Design: 2.1 Design outputs (x); 2.2 Design variable descriptions (x); 2.3 Design data collection methodology (x); 2.4 Design frame & sample methodology (x); 2.5 Design statistical processing methodology (x); 2.6 Design production systems & workflow (x)
3 Build: 3.1 Build data collection instrument (x); 3.2 Build or enhance process components (x); 3.3 Configure workflows (x); 3.4 Test production system (x); 3.5 Test statistical business process (x); 3.6 Finalize production system (x)
4 Collect: 4.1 Select sample (x); 4.2 Set up collection (x); 4.3 Run collection (x); 4.4 Finalize collection
5 Process: 5.1 Integrate data (x); 5.2 Classify & code (x); 5.3 Review, validate & edit (x); 5.4 Impute (x); 5.5 Derive new variables & statistical units (x); 5.6 Calculate weights (x); 5.7 Calculate aggregates (x); 5.8 Finalize data files (x)
6 Analyze: 6.1 Prepare draft output (x); 6.2 Validate outputs (x); 6.3 Scrutinize & explain (x); 6.4 Apply disclosure control (x); 6.5 Finalize outputs
7 Disseminate: 7.1 Update output systems; 7.2 Produce dissemination products (x); 7.3 Manage release of dissemination products; 7.4 Promote dissemination products; 7.5 Manage user support
8 Archive: 8.1 Define archive rules (x); 8.2 Manage archive repository (x); 8.3 Preserve data and associated metadata (x); 8.4 Dispose of data & associated metadata (x)
9 Evaluate: 9.1 Gather evaluation inputs; 9.2 Conduct evaluation; 9.3 Agree action plan

Figure 6: Map of the GSBPM phases and processes to the IBSP business process model.
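The mapping exercise behind Figure 6 can be sketched in code as a lookup from IBSP process groups to GSBPM sub-processes. The sketch below is an illustration under stated assumptions: the phase names are those shown in the paper's figures, but the group-to-sub-process assignments are invented for the example, because the figure's cell positions are not reproduced in this text. The two helper functions show the kind of check such a mapping enables, namely which phases a group touches and whether a sub-process has been claimed by more than one group.

```python
from typing import Dict, List

# Level 1 phase names as they appear in the paper's figures.
GSBPM_PHASES: Dict[int, str] = {
    1: "Specify needs", 2: "Design", 3: "Build", 4: "Collect", 5: "Process",
    6: "Analyze", 7: "Disseminate", 8: "Archive", 9: "Evaluate",
}

# ILLUSTRATIVE assignment only: these pairings are hypothetical and do not
# reproduce the actual marks in Figure 6.
IBSP_GROUP_TO_SUBPROCESSES: Dict[str, List[str]] = {
    "Specify Needs": ["1.1", "1.2", "1.3"],
    "Post Collection Processing": ["5.3", "5.4", "5.6"],
    "Analyze": ["6.1", "6.2", "5.6"],
}


def phase_of(sub_process_id: str) -> str:
    """Map a Level 2 identifier such as '5.4' to its Level 1 phase name."""
    return GSBPM_PHASES[int(sub_process_id.split(".")[0])]


def phases_touched(mapping: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For each process group, list the distinct GSBPM phases it draws on."""
    return {
        group: sorted({phase_of(sp) for sp in sub_processes})
        for group, sub_processes in mapping.items()
    }


def claimed_more_than_once(mapping: Dict[str, List[str]]) -> List[str]:
    """Return sub-processes assigned to two or more groups (possible duplication)."""
    counts: Dict[str, int] = {}
    for sub_processes in mapping.values():
        for sp in sub_processes:
            counts[sp] = counts.get(sp, 0) + 1
    return sorted(sp for sp, n in counts.items() if n > 1)
```

With the hypothetical assignments above, `claimed_more_than_once(IBSP_GROUP_TO_SUBPROCESSES)` returns `["5.6"]`, flagging a sub-process that two groups both claim and that may need review.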

[Figure 7 diagram. Recoverable labels: Tools and Interfaces; Outside IBSP Metadata Environment; IBSP Metadata Environment; Content Strategy; Sampling Strategy; Collection Strategy; PCP – Rolling Estimates/Common Editing/Estimation Strategy; Definitional Metadata; Operational/System Metadata; Survey Content Concept; Questionnaire, Modules, Blocs, Questions, Variables; Derived Variables; Capture Specifications for all modes; Edit Specifications for all modes; Output Specifications for all modes; GSAM Metadata; GES Metadata; Banff Metadata; Edit and Imputation Specs, Data Integration Specs, Rolling Estimates Parameters (Data Mappings, Supporting data set identifiers and location of origin, supporting paradata), Estimation Specs, Allocation or Roll Up Specs, Decumulation Specs, Reporting Specs for MFUL, RE Iterations, TE list; Quality Indicators, Data Integration Flags, Targeted Editing Reports, Audit Files, Recording of Parameters, Rolling Estimate Iteration Record, Revision Metadata; Output Parameters to SNA, SMA.]

Figure 7: The IBSP metadata environment

5. Information Management Directives

As part of the implementation of Statistics Canada’s Policy on Information Management and Information Management (IM) strategy, there are directives that describe the management of the department’s statistical microdata holdings (including business statistics) with respect to type of data file, required documentation and retention periods. These directives are an example of promoting common metadata objects for documenting data files in Statistics Canada. The directives were developed with regard to two metadata standards in particular, ISO/IEC 11179 and DDI. The goal is to automate as much of the documentation/metadata of data files as possible. The metadata objects in the directives were specified in a way that they can be mapped to the present or future content of the IMDB and they can also be mapped to the DDI metadata standard. Codebooks, Data Dictionaries, and Record Layouts are recognized as being important documents for documentation of data files. Although the general content of these documents has been specified, the structure and specific content has not. This is a good first step and leads the way for other IM directives. Below are the required metadata objects for microdata files (Table 1).

Table 1: Metadata requirement for microdata files

1. Identification and description

• Name of survey, project, program or other statistical activity
• Statistical Data Documentation System (SDDS) number from the IMDB, where applicable
• Reference period, where applicable
• File format (e.g., Excel, SAS, text, Oracle, SQL)
• Version identifier, where applicable
• Responsible manager
• Identification according to the categories of statistical microdata files
• Retention period
• Retention trigger date, where applicable
• Description to adequately distinguish among various files when there is more than one file within the same category of statistical microdata files for a single survey, project, program or other statistical activity
• Technical metadata using SAS syntax which will read the file, if applicable (sometimes referred to as “Control cards”, it includes the record layout, variable names, variable labels and value labels; other software syntax, such as SPSS and Stata, can be added or substituted as required)
• Control totals for verification purposes
• Important notes to data users, as appropriate (e.g., coverage, imputation, preliminary or revised data)
• Population or universe statement
• Users’ guide describing the data concepts, including data quality reports
• Description of disclosure control methods
• Logic for derived variables

2. For each variable within the file

• Name
• Type and position within file, where applicable
• Label or description
• Valid values and value labels, where applicable
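Since the stated goal is to automate as much of the file documentation as possible, the requirements in Table 1 lend themselves to a mechanical completeness check. The following Python sketch is one possible shape for such a check, not the Agency's implementation: the dictionary keys are hypothetical names chosen for this example, and only a subset of the unconditional items from Table 1 is covered (items marked "where applicable" are left out).

```python
from typing import Any, Dict, List

# Hypothetical key names for some unconditional file-level items in Table 1.
REQUIRED_FILE_KEYS = (
    "survey_name",
    "file_format",
    "responsible_manager",
    "file_category",
    "retention_period",
    "control_totals",
    "universe_statement",
)

# Hypothetical key names for the unconditional per-variable items in Table 1.
REQUIRED_VARIABLE_KEYS = ("name", "label")


def missing_metadata(file_record: Dict[str, Any]) -> List[str]:
    """List the required metadata objects that are absent or empty."""
    problems = [key for key in REQUIRED_FILE_KEYS if not file_record.get(key)]
    for index, variable in enumerate(file_record.get("variables", [])):
        for key in REQUIRED_VARIABLE_KEYS:
            if not variable.get(key):
                problems.append(f"variables[{index}].{key}")
    return problems
```

A file record that returns an empty list from `missing_metadata` carries at least the items checked here; conditional items would need rules of their own.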

6. Next steps

6.1 Generic Statistical Information Model

Statistics Canada is participating in the collaborative development of a Generic Statistical Information Model (GSIM) with Australia, New Zealand, Norway, Sweden and the United Kingdom, with input from other countries. This model will provide the common information/metadata objects used by all producers of statistics throughout the statistical business processes (Figure 8). The model will be most efficiently used along with the Generic Statistical Business Process Model (GSBPM). The aim is to enable development of information and statistical production systems using a common language and concepts from the GSIM and common processes from the GSBPM. The GSIM is independent from, though informed by, various metadata standards, making it a very useful reference framework for exchanging metadata and data with international organizations.

Statistics Canada is currently considering GSIM as its common metadata framework as part of its metadata management strategy. This requires extensive collaboration with the Agency’s experts in statistical metadata standards and concepts, business architects, methodologists and experts in survey processes.

Figure 8: Provisional version of the Generic Statistical Information Model (GSIM).

Statistics Canada has already started using the GSBPM to model our existing business survey processes, as described above. The next step will be to use the information objects from the GSIM and map them to the various processes to represent information flows and transformations. Once completed, the map would represent the whole statistical production system from start to finish with the information objects in place. Having such a map in a common language allows for development of tools for data sharing and exchange.

6.2 Metadata standards

The use of recognized, internationally approved metadata standards is essential for efficient metadata management and is an important part of the development of a metadata strategy at Statistics Canada. Statistics Canada is already providing some of its microdata files in DDI 2.0 and 3.1 formats, and is exploring the use of SDMX for data dissemination.

Statistics Canada has developed an IMDB to DDI XML web service to transform metadata presently in the IMDB to formats that can be readily consumed by researchers accessing microdata files in universities and in Statistics Canada’s Research Data Centres. Also, web services to expose metadata from the IMDB in DDI 3.1 XML format to our business survey processing systems, the IBSP, are being developed. This will ultimately reduce the need for redundant metadata systems in the Agency, particularly for definitional metadata such as variables, statistical classifications and questions.

Canadian Socio-economic Information Management System (CANSIM) tables are currently offered in Statistical Data and Metadata eXchange (SDMX) format as an option for those who wish to get data in XML.⁵ This format provides a flat file with a structural metadata file in SDMX. Presently, the SDMX Metadata Common Vocabulary (content standard) is not being used.
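To give a sense of what transforming definitional metadata into DDI XML involves, the sketch below serializes one variable with its value labels. It is a simplified illustration, not the Agency's web service: the element names (`codeBook`, `dataDscr`, `var`, `labl`, `catgry`, `catValu`) follow the DDI Codebook (2.x) vocabulary as an assumption to be checked against the published schema, no namespace or schema validation is applied, and the variable name and categories are hypothetical. DDI 3.x uses a different, more modular structure that is not shown.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple


def variable_to_ddi(name: str, label: str, categories: Iterable[Tuple[int, str]]) -> ET.Element:
    """Build a DDI-Codebook-style <var> element (element names assumed, see lead-in)."""
    var = ET.Element("var", {"name": name})
    ET.SubElement(var, "labl").text = label
    for value, value_label in categories:
        catgry = ET.SubElement(var, "catgry")
        ET.SubElement(catgry, "catValu").text = str(value)
        ET.SubElement(catgry, "labl").text = value_label
    return var


def codebook(variables: List[ET.Element]) -> str:
    """Wrap variable elements in <codeBook><dataDscr> and return the XML text."""
    root = ET.Element("codeBook")
    data_dscr = ET.SubElement(root, "dataDscr")
    data_dscr.extend(variables)
    return ET.tostring(root, encoding="unicode")


# Hypothetical variable, for illustration only.
xml_text = codebook([
    variable_to_ddi("REV_TYPE", "Type of revenue of establishment",
                    [(1, "Sales of food"), (2, "Other revenue")]),
])
```

Generating the XML from a single registered definition, rather than retyping it for each product, is the re-use that the web services described above are meant to provide.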

7. Conclusion

Statistics Canada is involved in various metadata management initiatives, including the development and implementation of a corporate metadata registry (IMDB), adoption of the GSBPM as a reference model for our business processes, the development of information management directives for the management of micro- and aggregate data files, the development of explicit metadata-related goals for business surveys, and participation in various international efforts to develop a statistical information model for national statistical offices (NSOs). Statistics Canada is involved internationally in the Statistical Network and METIS, the statistical metadata expert group for the UNECE. We have initiated internal projects to explore the use of various metadata standards such as SDMX and DDI in our business.

It is increasingly recognized that sound metadata management is a very important part of the statistical production process, and that metadata-driven systems using well-harmonized metadata lead to greater efficiency, flexibility and understanding of the data, and to an increased ability to share and upload data to international organizations. The pursuit of better metadata management will continue to be a priority for Statistics Canada as we develop a comprehensive corporate metadata management strategy.

5 For example, data from the monthly Wholesale trade, sales by the North American Industry Classification System (NAICS), can be downloaded in SDMX-ML: http://www5.statcan.gc.ca/cansim/a26;jsessionid=23F526CCDA378F09514A80386D059E52?lang=eng&retrLang=eng&id=0810011&pattern=081-0011..081-0013&tabMode=dataTable&srchLan=-1&p1=-1&p2=31.