Social Statistics Integrated Information Architecture and metadata driven services
Antti Santaharju & Toni Räikkönen
COE on S-DWH workshop, Warsaw 22.11.2018
Contents
• Social Statistics Integrated Information Architecture (Stiina)
• Metadata driven software architecture
• Using VTL in Editing and imputation service
• Software demonstrations
• Architectural aspects of Editing and imputation service
• Experiences so far
29.11.2018 Statistics Finland
STIINA Social Statistics Integrated Information Architecture
• The STIINA program aims to modernize the production of social statistics
• The aim is to build an integrated production system for about 70 official statistics, including both register-based and survey statistics
• Timeframe 2017–2020
• Work is based on international statistical models and standards:
• GSBPM
• GSIM
• CSPA
• VTL
STIINA goals
• Emphasize information content instead of organizational silos
• Integrate new data sources into statistical production (daily data deliveries, data from open APIs, web scraping…)
• Build faster, more automated and more transparent production processes (metadata, process control system)
• Find ways to integrate data from different types of statistics
• Ensure the flexibility of statistical production when data and information needs are constantly changing
Data repositories
• Data collection
  • Data Repositories for Data Collection
• Data processing
  • Population and Social Data Repository
  • Organization Data Repository
  • Built Environment Data Repository
  • Commodity Data Repository
  • Emissions and Energy Data Repository
  • Geospatial Data Repository
• Analysis and reporting
  • Micro Data Repository
  • Macro Data Repository
• Dissemination
  • Data Repositories for Dissemination
• Metadata Repository
Population and Social Data Repository
• Person
• Education
• Labor market
• Health
• Person relationships
• Housing
• Justice
• Income information
• Living conditions
• Related data: Built Environment, Organisation, Geospatial Data, Metadata
Social Statistics Integrated Information Architecture (STIINA)
Process phases: Data collection, Data Processing, Analysis and Reporting, Dissemination
• Sources
  • Direct Data Collection
  • Administrative Data
  • New Data resources
• Data Repositories for Data Collection
• Population and Social Data Repository
  • Person Data
  • Housing
  • Labor Market
  • Income and Consumption
  • Education
  • Living conditions
  • Health
  • Justice and Elections
• Built Environment Data
• Organization Data
• Geospatial Data
• Metadata Repository
• Micro Data and Macro Data Repositories, Data Repositories for Dissemination
• Data Warehouses
• Outputs
  • Statistical Releases
  • Statistical Databases
  • Research Data
  • Other Products and Services
• Services for statistical processes
• Information services
STIINA projects 2017–2020
Timeline: 2017, 2018, 2019, 2020
Legend: service-oriented projects, data-oriented projects
• Population I
• Editing and imputing
• Labor market and Income Information I
• Data warehouse and data confidentiality
• Education I
• Data collection processes
• Labor market and Income Information II
• Justice
• Living conditions
• Health
• Metadata I
• Statistical methods
• Dissemination
• Dissemination II
• Education II
• Population II and Housing
• Labor market and Income Information III
• Geospatial data and services
• Stiina Services 4: Metadata II
• Elections
• Data collection processes II
• Housing
• Editing and imputing II
Services as a part of statistical production process
Statistical production process
GSBPM-services toolbox
• GSBPM 5.3 Review & validate: services
• GSBPM 5.4 Edit & impute: services
• GSBPM 5.5 Derive new variables: services
• Metadata
• VTL
Software architecture in Statistics Finland
• GSBPM
• GSIM
• CSPA
• Microservices
• Metadata driven
• Re-use
• Technology neutrality
• Cloud native (1)
• Unified methods
• Unified metadata definitions
(1) whenever possible
From data acquisition to data analysis and dissemination – Case STIINA
• Automated Data Acquisition Process
• Raw data warehouse
• Operational data warehouse
• A&R and dissemination data warehouse
• Data marts
• Continuous ETL
• Metadata
The architectural style of the operational environment
• Clients
• Application Layer
• Data Virtualization Layer
  • Isolates the data storage
  • Offers services to clients
  • GSIM-based interfaces
• Data Storage Layer
  • Located in the on-premises environment or in the cloud
  • Accessible only through the virtualization layer
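As an illustration of the layering described on this slide, here is a minimal Python sketch in which clients request datasets by identifier and never touch the storage directly. The class names, the catalog structure, the identifier and the table location are all hypothetical; the slides do not describe how the virtualization layer is actually implemented.

```python
class InMemoryStorage:
    """Stand-in for an on-premises or cloud data store (hypothetical)."""

    def __init__(self, tables):
        self._tables = tables

    def read(self, location):
        return self._tables[location]


class VirtualizationLayer:
    """Maps dataset identifiers to storage locations and serves copies of the data."""

    def __init__(self):
        self._catalog = {}  # dataset id -> (storage backend, location)

    def register(self, dataset_id, storage, location):
        self._catalog[dataset_id] = (storage, location)

    def get_dataset(self, dataset_id):
        storage, location = self._catalog[dataset_id]
        # Return copies so clients cannot modify the stored rows directly.
        return [dict(row) for row in storage.read(location)]


# Hypothetical identifier and location, for illustration only.
on_prem = InMemoryStorage({"schema_a.persons": [{"person_id": 1, "age": 42}]})
layer = VirtualizationLayer()
layer.register("urn:x-stat:meta:dataset:persons", on_prem, "schema_a.persons")

rows = layer.get_dataset("urn:x-stat:meta:dataset:persons")
```

Because the client only knows the identifier, the storage backend could be moved between on-premises and cloud environments without changing client code.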
GSIM modeled metadata architecture
• Identifiers: administers unique identifiers of objects (URNs, HTTP URIs, DOIs etc.)
• Concepts: definitions of the concepts
• Variables: definitions of represented variables and variables
• Data Structures: definitions of data structures
• Classifications
• Population: definitions of populations
• Rules: different kinds of rules
  • value domains
  • data types
  • formation rules
  • VTL statements
• Process Output
• Process Execution
• Process Metrics
Relationships between the components:
• Links represented variables to a data structure
• Defines the rules for represented variables, e.g. value domains
• Defines the rules for instance variables, e.g. data types, precisions
• Defines a value domain for a classification
• Defines a variable
• Defines the formation rule of a population
• Defines a structure of a population
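To make the roles of variables, rules and data structures more concrete, here is a small Python sketch, assuming a much simplified model: a represented variable carries a value domain, an instance variable carries a data type, and a data structure links variables together. The class names, fields and URNs are hypothetical and are not taken from the Statistics Finland metamodel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepresentedVariable:
    urn: str
    name: str
    value_domain: frozenset = frozenset()  # empty set means "unrestricted"


@dataclass(frozen=True)
class InstanceVariable:
    represented: RepresentedVariable
    data_type: type


@dataclass
class DataStructure:
    urn: str
    variables: list

    def validate(self, record):
        """Return a list of rule violations for one record."""
        errors = []
        for var in self.variables:
            name = var.represented.name
            if name not in record:
                errors.append(f"{name}: missing")
                continue
            value = record[name]
            if not isinstance(value, var.data_type):
                errors.append(f"{name}: expected {var.data_type.__name__}")
            elif var.represented.value_domain and value not in var.represented.value_domain:
                errors.append(f"{name}: outside value domain")
        return errors


# Hypothetical URNs and variables, for illustration only.
sex = RepresentedVariable("urn:x-stat:meta:variable:sex", "sex", frozenset({1, 2}))
age = RepresentedVariable("urn:x-stat:meta:variable:age", "age")
structure = DataStructure(
    "urn:x-stat:meta:dataset:y",
    [InstanceVariable(sex, int), InstanceVariable(age, int)],
)

violations = structure.validate({"sex": 3, "age": "x"})
```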
Metadata driven APIs
• Data stores: Data Mart, Op. Data Vault, Data Mart
• Interfaces: Data In, Data Out
• Data Structure (shown for both Data In and Data Out)
  • ID: URN:x-stat:meta:dataset:y
  • Var1
  • Var2
  • Varx
• In order to use the APIs, the corresponding metadata definition must be included with the service call
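The requirement that a metadata definition accompanies every service call could look roughly like the following Python sketch. The payload layout (`metadata` and `data` keys) and the function name `build_service_call` are assumptions made for illustration; the slides do not show the real request format.

```python
def build_service_call(dataset_urn, variables, records):
    """Bundle records with the metadata definition that describes them.

    Raises ValueError if a record does not match the declared variables.
    """
    for index, record in enumerate(records):
        missing = [name for name in variables if name not in record]
        unknown = [name for name in record if name not in variables]
        if missing or unknown:
            raise ValueError(f"record {index}: missing={missing}, unknown={unknown}")
    return {
        # Hypothetical payload layout: metadata travels with the data.
        "metadata": {"id": dataset_urn, "variables": list(variables)},
        "data": list(records),
    }


call = build_service_call(
    "URN:x-stat:meta:dataset:y",
    ["Var1", "Var2"],
    [{"Var1": 1, "Var2": "a"}],
)
```

Checking the records against the declared variables before the call is one way to let the receiving service rely on the metadata alone to interpret the data.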
Editing and imputation service
• Edit Specification and Analysis
• Edit Summary Statistics Tables
• Error Localization
• Deterministic Imputation
• Donor Imputation
• Imputation Estimators
• Prorating
• Mass Imputation
• Outlier Detection
• Amendment
• Review
• Selection
Source: Generic Statistical Data Editing Models (UNECE)
Editing and imputation service methods
• Current methods
• If-then rule method (VTL)
• Banff Outlier Detection
• Banff Imputation estimators
• Planned Banff methods
• Donor imputation
• Error localisation
• Deterministic imputation
• Prorating
• Edit summary statistics
Metadata driven editing service
Editing service
• Process management
• Editing rules and parameters
• Data description
• Data
• Outputs: edited data, editing history, frequency reports, impact reports
Example method: Banff outlier
• Input:
• data to be edited (matrix form)
• parameters
• Output:
• status data with flagged cells (name-value form)
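The input and output shapes above can be illustrated with a short Python sketch. It does not reproduce Banff's outlier detection; as a stand-in it flags values that lie more than a chosen multiple of the median absolute deviation from the median. The function name, the `threshold` parameter, the field names and the `OUTLIER` status code are all assumptions.

```python
from statistics import median


def flag_outliers(rows, id_field, variables, threshold=5.0):
    """Take data in matrix form (one dict per unit) and return
    status records in name-value form (one dict per flagged cell)."""
    status = []
    for var in variables:
        values = [row[var] for row in rows if row[var] is not None]
        center = median(values)
        mad = median([abs(v - center) for v in values])
        if mad == 0:
            continue  # no spread, nothing can be flagged with this simple rule
        for row in rows:
            value = row[var]
            if value is not None and abs(value - center) > threshold * mad:
                status.append({"id": row[id_field], "field": var, "status": "OUTLIER"})
    return status


# Matrix form: one row per unit, one column per variable (made-up example values).
data = [
    {"id": 1, "turnover": 100, "staff": 5},
    {"id": 2, "turnover": 110, "staff": 6},
    {"id": 3, "turnover": 105, "staff": 5},
    {"id": 4, "turnover": 95, "staff": 4},
    {"id": 5, "turnover": 10000, "staff": 6},
]

flags = flag_outliers(data, "id", ["turnover", "staff"])
```

The name-value output holds one record per flagged cell, so unflagged cells take no space and results from several methods can be appended to the same status table.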
Screenshot: VTL input, SAS code preview, VTL functions, operators etc., variable list
Demo: Metadata management and VTL
• Demo
E&I Service – the architectural style
• Editing and Imputation Service
• API
• Process Engine
• Staging
• Metadata services
• Metadata editor
• Statistical Libraries
  • SAS BANFF
  • Python Pandas
  • R VIM
  • Library X
  • Library Y
Demo: Editing process
• Demo
E&I Service – process flow
Components: E&I Service, Process Engine, Method 1, Method 2, …, Method n, Staging, BI Web Services, Data In, Data Out
1) Invoke the API call with data
2) Load the data to the staging area
3) Signal the start event
4) Invoke method calls
5) Invoke SAS BI WS
6) Signal the end event
7) Load the results from staging
8) Return the results
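The eight steps of the process flow can be sketched in Python as follows. This is an in-memory illustration only: the `Staging` class, the event list and the two example methods are hypothetical, and the call to SAS BI Web Services in step 5 is represented by plain Python functions.

```python
class Staging:
    """In-memory stand-in for the staging area."""

    def __init__(self):
        self._store = {}

    def load(self, key, data):
        self._store[key] = data

    def fetch(self, key):
        return self._store[key]


def run_editing_process(data, methods, staging, events):
    """Step 1 is the call to this function with the input data."""
    staging.load("in", data)              # 2) load the data to the staging area
    events.append("start")                # 3) signal the start event
    current = staging.fetch("in")
    for method in methods:                # 4) invoke method calls
        current = method(current)         # 5) a method may call an external service here
    staging.load("out", current)
    events.append("end")                  # 6) signal the end event
    return staging.fetch("out")           # 7) load the results, 8) return them


def set_error_flag(rows):
    return [{**row, "error": 1 if row["a"] > row["b"] else 0} for row in rows]


def impute_c(rows):
    return [{**row, "c": 100 if row["c"] is None else row["c"]} for row in rows]


events = []
result = run_editing_process(
    [{"a": 2, "b": 1, "c": None}, {"a": 1, "b": 3, "c": 7}],
    [set_error_flag, impute_c],
    Staging(),
    events,
)
```

Passing data through a staging area rather than in each method call is one way to handle the "data by value / data by reference" trade-off when datasets are large.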
E&I Service - the role of SAS BI WS
• Components: E&I Service, Staging, BI Web Services (Transform, BANFF, Transform), Internal data area, BANFF
• Calls: Invoke E&I Service, Store data, Invoke SAS service
If-then-rule method using VTL
Editing meta
If-then-rule method
Rules: [
urn:stat-fi:meta:rule:9912,
urn:stat-fi:meta:rule:9937,
..
..
]
Rules
Id: urn:stat-fi:meta:rule:9912
Type: VTL
Value: error := if(a > b) then error = 1 else error = 0
Id: urn:stat-fi:meta:rule:9937
Type: VTL
Value: c := if(isnull(c)) then c = 100 else c
E&I Service, SAS BIWS
• in: VTL statements
• out: sasds2
• SAS code
VTL Translator
• VTL Parser
• SAS Data Step Code Generator
• SAS DS2 Code Generator
• R Code Generator
• [X] Code Generator
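To show the general idea of translating a rule into target-language code, here is a minimal Python sketch that parses the two rule strings shown on this slide and emits a SAS data step statement. It is not the VTL Translator from the presentation: the regular expression only handles the single `target := if(condition) then ... else ...` form, and the mapping of `isnull` to the SAS `missing` function is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class IfThenRule:
    target: str
    condition: str
    then_value: str
    else_value: str


# Matches: target := if(condition) then [target =] value else [target =] value
RULE_PATTERN = re.compile(
    r"^\s*(?P<target>\w+)\s*:=\s*if\s*\((?P<cond>.+)\)\s*then\s+"
    r"(?P<then>.+?)\s+else\s+(?P<else>.+?)\s*$"
)


def _strip_assignment(expr, target):
    """Drop an optional leading 'target =' from a branch expression."""
    return re.sub(rf"^{re.escape(target)}\s*=\s*", "", expr.strip())


def parse_rule(text):
    match = RULE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unsupported rule: {text!r}")
    target = match["target"]
    return IfThenRule(
        target=target,
        condition=match["cond"].strip(),
        then_value=_strip_assignment(match["then"], target),
        else_value=_strip_assignment(match["else"], target),
    )


def to_sas_data_step(rule):
    """Generate one SAS data step if/else statement for the rule."""
    condition = re.sub(r"\bisnull\s*\(", "missing(", rule.condition)
    return (
        f"if {condition} then {rule.target} = {rule.then_value}; "
        f"else {rule.target} = {rule.else_value};"
    )


sas_error = to_sas_data_step(parse_rule("error := if(a > b) then error = 1 else error = 0"))
sas_c = to_sas_data_step(parse_rule("c:= if(isnull(c)) then c = 100 else c"))
```

Keeping the parsed rule separate from the code generator mirrors the translator layout on the slide, where one parser feeds several generators (SAS Data Step, SAS DS2, R).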
“Social” challenges
• The change in perspective
• From customized solutions to unified methods and tools
• Difficult to please all users
• “My statistics is so special that I really can’t use that tool”
• Sometimes difficult to identify the product owner
IT challenges
• Microservices increase complexity considerably
• Orchestration / choreography
• Data by value / data by reference
• Requires a smooth DevOps process
• Performance with very large datasets is still unknown
“Social” success stories
• Appreciation of the GSIM model has increased greatly
• Users better understand why it is important to define the metadata for the data objects
• Generic tools for other projects to use
• E&I service enables a cumulative, standardized audit trail and reports
IT success stories
• Microservices = a really fast track to generic tools
• 15–20 services already in use or under construction
• The GSIM-based metamodel is a good fit for a metadata-driven architecture
• IT capabilities have grown considerably
• Requires a smooth DevOps process
Future
• New microservices under development
• Derivation of new variables
• Aggregation
• GSIM-based, cloud-native metadata system under development