data analytics Workshop for official statistics
Technical Report
Table of Contents

Introduction
Workshop objective
Report structure
Key conclusions and Recommendations from the Workshop
Summary conclusions
Recommendations
Annex 1 – Description of Workshop sessions
1.1 Opening session
1.2 General Sessions: Expert lectures
1.2.1 Organisation of the general sessions
1.2.2 Expert lecture: Use cases and best practices in data analytics
1.2.3 Expert lecture: Trends in data analytics architectures
1.2.4 Expert lecture: Future developments in data analytics and data science
1.2.5 Expert lecture: Engaging users and policymakers for data analytics
1.3 General session – Mapping of existing data analytics initiatives in the ESS as well as in comparable sectors
1.3.1 Organisation of the session
1.3.2 Expert lecture: The Big Data Platform initiative of the EC Joint Research Centre
1.3.3 Expert lecture: The Logical Statistical Data warehouse of the Centre of Excellence on Data Warehouse
1.3.4 Expert lecture: Methods and algorithms in the UN Global Platform
1.4 Parallel group session 1 – Data analytics in practice - real examples from the ESS
1.4.1 Objective of the session
1.4.2 Summary of the session
1.5 Parallel group session 2 – Statistical models and methods for data analytics
1.5.1 Objective of the session
1.5.2 Summary of the session
1.6 Parallel group session 3 – Engaging with external stakeholders
1.6.1 Objective of the session
1.6.2 Summary of the session
1.7 Parallel group session 4 – What are the success factors for data analytics?
1.7.1 Objective of the session
1.7.2 Summary of the session
1.8 Parallel group session 5 – Emerging data analytics tools and techniques
1.8.1 Objective of the session
1.8.2 Summary of the session
1.9 Parallel group session 6 – Facilitating the data analytics of others
1.9.1 Objective of the session
1.9.2 Summary of the session
1.10 Parallel group session 7 – Joint ESS data analysis undertakings
1.10.1 Objective of the session
1.10.2 Summary of the session
1.11 Parallel group session 8 – Infrastructure and architecture as a solid basis for data analytics
1.11.1 Objective of the session
1.11.2 Summary of the session
1.12 Parallel group session 9 – Data analytics in a changing environment
1.12.1 Objective of the session
1.12.2 Summary of the session
Annex 2 – Programme of the Workshop
Day 1 (10 September 2018)
Day 2 (11 September 2018)
Day 3 (12 September 2018)
Workshop programme flyer
Introduction

Inspired by the outcomes of a series of thematic workshops dedicated to the modernisation of official
statistics1, the "data analytics Workshop for official statistics" (daWos) was a two-day event organised by
Eurostat and held in Amsterdam (10-11 September 2018). It aimed to address the challenges and
opportunities, in the area of data analytics, for the National Statistical Institutes (NSIs), the European
Statistical System (ESS) and the broader community of Official Statistics.
Data analytics (DA) is the process of examining data sets in order to draw conclusions about the information
they contain, increasingly with the aid of specialised systems and software. Data analytics methodologies
include exploratory data analysis (EDA), which aims to find patterns and relationships in data, and
confirmatory data analysis (CDA), which applies statistical techniques to determine whether hypotheses
about a data set are true or false. EDA is often compared to detective work, while CDA is akin to the work of
a judge or jury during a court trial – a distinction first drawn by Tukey in his 1977 book on EDA2.
1 This includes among others: the 2014 and 2016 ESS Big Data Workshops, the 2016 ESS Visualisation Workshop and
the ESS Workshop on dissemination of Official Statistics as open data.
2 J.W. Tukey (1977): Exploratory Data Analysis, Pearson.
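The EDA/CDA distinction can be illustrated with a small, self-contained sketch using only the Python standard library. The data are synthetic and purely illustrative (not from any real survey): an exploratory look at summary statistics suggests a hypothesis, and a confirmatory permutation test then weighs the evidence for it.

```python
import random
import statistics

random.seed(42)

# Synthetic data standing in for two groups under comparison
# (purely illustrative; not from any real data set).
group_a = [random.gauss(200, 20) for _ in range(50)]
group_b = [random.gauss(210, 20) for _ in range(50)]

# Exploratory step ("detective work"): inspect the data for patterns.
print("mean A:", round(statistics.mean(group_a), 1))
print("mean B:", round(statistics.mean(group_b), 1))

# Confirmatory step ("the trial"): a permutation test of the hypothesis
# H0 that the two groups share the same distribution.
observed = statistics.mean(group_b) - statistics.mean(group_a)
pooled = group_a + group_b
extreme = 0
n_perm = 2000
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[50:]) - statistics.mean(pooled[:50])
    if abs(diff) >= abs(observed):
        extreme += 1
p_value = extreme / n_perm
print("two-sided p-value:", p_value)
```

The exploratory step generates a conjecture; the confirmatory step quantifies how often a difference at least as large would arise by chance alone under the null hypothesis.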
More advanced types of DA, sometimes also referred to as Data science3, include data mining, which involves
sorting through large data sets to identify trends, patterns and relationships; predictive analytics, which seeks
to predict customer behaviour, equipment failures and other future events; and machine learning, an
artificial intelligence technique that uses automated algorithms to churn through data sets more quickly than
data scientists can do via conventional analytical modelling. Big data analytics applies data mining, predictive
analytics and machine learning tools to sets of big data that often contain unstructured and semi-structured
data. Text mining provides a means of analysing documents, emails and other text-based content.
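As a minimal illustration of the text-mining step (the toy corpus and stop-word list below are invented for the example; Python standard library only), simple term-frequency counting already gives a first, crude picture of what a set of documents is about:

```python
import re
from collections import Counter

# Toy corpus standing in for documents, emails or other text-based content.
docs = [
    "Data analytics helps statistical offices extract information from data.",
    "Machine learning and data mining are advanced forms of data analytics.",
]

# A tiny, illustrative stop-word list; real systems use much larger ones.
stopwords = {"and", "are", "of", "from", "the", "helps"}

tokens = []
for doc in docs:
    tokens += [t for t in re.findall(r"[a-z]+", doc.lower()) if t not in stopwords]

# The most frequent remaining terms hint at the main themes of the corpus.
print(Counter(tokens).most_common(3))
```

Real text-mining pipelines add stemming or lemmatisation, weighting schemes such as TF-IDF, and statistical models on top of such counts, but the tokenise-filter-count pattern above is the common starting point.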
DA applications involve more than just analysing data. Particularly on advanced analytics projects, much of
the required work takes place upfront, in collecting, integrating and preparing data and then developing,
testing and revising analytical models to ensure that they produce accurate results. Altogether, DA
technologies and techniques are widely used in commercial industries to enable organisations to make
more-informed business decisions, and by scientists and researchers to verify or disprove scientific models, theories
and hypotheses.
Workshop objective

The daWos event was thematically focused. Beyond sharing actual experience and practice in DA from
different perspectives (e.g., methodology, technology, or policy) and across its many facets (e.g., data access
and sharing, applications and services, tools and techniques, architecture and technologies, skills and
know-how development), it also aimed to build capabilities in this area so as to use data in a better
way, or at least to provide new insights for Official Statistics.
The workshop aimed at increasing awareness of DA; identifying emerging best practices, synergies and
options for joint development efforts; and presenting and sharing best practices, original ideas, new tools,
and past (successful or failed) experiences. Overall, these objectives were meant to contribute to the
broader goal of facilitating cooperation within the ESS in the development and implementation of common
methodological solutions and practical applications, aligned with the ESS Vision 2020 flagship strategy.
To achieve all these objectives, the workshop included:
- Presentations from experts in different areas related to DA. These lectures aimed at exposing the
audience to current trends and state-of-the-art developments in the field of DA – not necessarily
connected to Official Statistics and drawing not only on experiences in the ESS, but also on other
statistical systems and the private sector – so as to provide useful insights into DA and related
issues.
- A set of parallel group sessions – the pillar of the workshop – devoted to the identification and
exchange of relevant experiences on specific issues of DA, within and outside the ESS.
In this setting, participants could share the results achieved through existing DA activities, i.e., in-house use
cases and projects. In addition, they were also invited to further detail the mainstreaming of these activities,
e.g., how they were implemented in practice and whether they were actually deployed in production. In this
way, needs and requirements regarding methodological development and technological investment
could be addressed, along with the other issues encountered, for instance in terms of management,
financing and human capital.
3 D. Donoho (2017): 50 Years of Data Science, doi:10.1080/10618600.2017.1384734.
Report structure

To make it easier to read and use, this report first presents the key conclusions from the Workshop
(next section).
The Workshop sessions are then described in Annex 1, summarising the presentations and discussions that
took place. Annex 2 includes the Workshop programme. All presentations are available on the dedicated
webpage on the CROS portal.
Key conclusions and Recommendations from the Workshop

This section recalls the main conclusions and recommendations arising from the parallel group and plenary
sessions, which are presented in detail in Annex 1.
Summary conclusions

In recent years, new sources of data have emerged, and with them the opportunity to deliver, in various
creative ways, information about the impact of certain policies. Still, whereas the statistical community has
made good progress on using these new sources, many questions and challenges remain when it comes to
applying DA so as to extract the relevant information. Advanced, state-of-the-art DA tools and techniques
are increasingly necessary to harness new data sources. Brand-new techniques and technologies make a
significant difference by leveraging today's accessible computing power to enable the processing of large
amounts and various types of data into relevant information through statistical analysis and modelling. By
creating faster, more agile and more tailored insights into data and content, it is – sometimes
prophetically – believed that they can help answer increasingly complex questions that were previously
considered beyond reach, moving from description ("what has happened?") to prescription ("what should
we do?"), through diagnosis ("why did it happen?") and prediction ("what will happen?").
Although the guiding principles of evidence-based policymaking seem not to change (as regards the
requirements in terms of transparency, privacy and ethics, quality and robustness, and timeliness), the rise
of DA – together with new technologies and trends in sharing, handling, processing and analysing
data – certainly calls for an upgrade of evidence-based policymaking practices in the NSIs. One of the main
challenges facing NSIs is not only to build the data analytics capabilities required to harness the new data
sources (e.g., machine learning, data mining, pattern recognition, natural language processing), but also to
understand how these capabilities can be deployed to help satisfy the demand from policymakers and
policy users for faster, more agile and more tailored insight. Beyond the need for new approaches and
technologies, job profiles and skill sets also have to be adapted for an effective use of data analytics
services. There seems to be a general understanding that new practices will require dynamic and innovative
partnerships, including public-private ones. This also represents a further challenge for NSIs, since their role
may decrease while that of independent data institutes might grow.
Furthermore, many users of Official Statistics are not willing to compromise at all on the quality and
trustworthiness of information for the promise of some improvements in timeliness, frequency and
granularity. In the context of a "post-truth" society, not only quality and trust, but also openness,
reproducibility, reliability and sustainability of statistical products and services are essential to ensure
that evidence-based policymaking is transparent and defensible. All such considerations raise a number of
significant issues that need to be addressed by NSIs before deciding whether to embrace DA in the actual
and regular production of Official Statistics.
Recommendations

Following the various discussions that took place at the daWos event, including the sharing of best
practices, practical experience and lessons learned by the participants, useful recommendations emerged
for addressing DA from different angles:
Applications and services: A catalogue identifying the different DA use cases implemented in the
NSIs could be useful. The applications and services should be mapped against existing standards
for comparability. This should contribute to the identification of emerging best practices and of
possible synergies and options for joint development efforts, addressing policy issues at
cross-border level.
Tools and techniques: A catalogue of the different DA software, algorithms and models that have
been or are still used in the NSIs – including the wealth of assets made freely available on the
market – could be established to provide new users with some guarantee regarding the quality of
the tools and techniques. This software and these algorithms and models could be considered for
further – possibly centralised – validation/certification by the NSIs and the community of Official
Statistics. When engaging in new software development, NSIs should start small and quickly build
software blocks/components that are modular, interoperable and reusable, to be shared within
the community and to help enrich an ecosystem of DA tools and techniques. Prior to using
black-box DA systems – e.g. those based on artificial intelligence – the issues of interpretability
and/or explainability for the final user of the product should be carefully considered. In general,
the reproducibility of production workflows should also be addressed to ensure transparency and
trust in the products.
Architecture and technologies: The data and process architecture needed to support future DA
capabilities is still under consideration. Besides traditional databases or new big data
infrastructure, logical data warehouses are worth exploring, since through data virtualisation they
make it possible to deal with legacy infrastructure without huge investment. Curation of metadata
and abstract knowledge is essential in that perspective. Virtualised, containerised environments
are also worth studying for serving applications.
Governance: Existing (good) governance regarding data should be extended to (good) governance
of methods and algorithms. Shared protocols leading towards a code of good practice for using
software resources could be developed or, more simply, already existing protocols, e.g. best
practices from the open-source community, could be adopted. The FAIR principles (findable,
accessible, interoperable, and reusable) could be enforced for tools and techniques, and a good
level of documentation should be required for methods and algorithms.
Skills and expertise: DA is an exciting field that combines scientific inquiry, statistical knowledge,
subject-matter expertise and computer programming. The necessary adaptation of job profiles
and skill sets to emerging roles for an effective use of DA services within the ESS needs to be
addressed (e.g., through the design of appropriate training courses).
Annex 1 – Description of Workshop sessions
This section first describes the contents of the opening session, then the general sessions and finally the
group sessions (followed by wrap-up sessions aiming at summarising the discussions that took place during
the group sessions and highlighting the main topics, concerns and decisions of interest). Sessions were
organised around specific topics selected by Eurostat, as follows (detailed descriptions are given in the
corresponding sections):
- Opening session
- The general sessions, which included presentations on use cases and best practices in data analytics,
trends in data analytics architectures, future developments in data analytics and data science, and
engaging users and policymakers for data analytics. Moreover, a general session containing three expert
presentations addressing other ongoing initiatives concluded the first day.
- Group session 1: Data analytics in practice - real examples from the ESS
- Group session 2: Statistical models and methods for data analytics
- Group session 3: Engaging external stakeholders
- Group session 4: What are the success factors for data analytics?
- Group session 5: Emerging data analytics tools and techniques
- Group session 6: Facilitating the data analytics of others
- Group session 7: Joint ESS data analysis undertakings
- Group session 8: Infrastructure and architecture as a solid basis for data analytics
- Group session 9: Data analytics in a changing environment
1.1 Opening session

Participants were welcomed, on behalf of Eurostat, by Ms Martina Hahn, Head of Unit Methodology and
Corporate Architecture in Eurostat. First, the participants were reminded that the event was organised
to contribute to the goals defined in the ESS Vision 2020. The event allowed participants to
discuss how well the ESS is equipped with the necessary DA skills and technology for Official Statistics,
how much this really helps users, and how efficient the investment by NSIs and Eurostat in this "new"
activity is.
Ms Hahn stressed that the presence at the event of a mix of statisticians and IT experts, as well as of
several international institutions (among others the OECD and the JRC), was a good sign of interest
in this topic. The objectives of the event were also recalled:
Identify and discuss the different use cases for DA by Official Statistics producers, in particular
the impact and opportunities created by external trends and developments to be identified in this
context (e.g. new user demands, the emergence of a rich DA market).
Take stock of the development of DA in the ESS and identify the main challenges ahead for
fulfilling the use cases for DA, in particular the challenges related to data confidentiality.
Discuss the data and process architecture needed to support these new use cases and the future
DA capabilities in the NSIs and other ESS members, identifying in particular emerging best
practices and possible synergies and options for joint development efforts.
1.2 General Sessions: Expert lectures
1.2.1 Organisation of the general sessions

The general sessions were organised as "expert lectures" delivered by four selected experts from
different fields (NSIs, academia and consultancy), offering diverse viewpoints and covering general issues
in order to give a fairly complete idea of the potential of DA. In addition, at the end of the second day,
a plenary session on the existing DA initiatives in the ESS was organised. The sessions were followed by
discussions with the participants. The four expert lectures are presented in the following.
1.2.2 Expert lecture: Use cases and best practices in data analytics

Speaker: Mr Cédric Archambeau | Principal Applied Scientist, Amazon

This presentation exposed a few of the data-driven approaches adopted by Amazon to make product
recommendations to its customers. The speaker showed how the expertise built over the years in
machine learning and data science helps Amazon scale in a wide range of other domains and create
better customer experiences. The application of machine learning to problems in natural language
processing, computer vision, search and forecasting was also discussed.
In the first part of the talk, an overview of a number of machine learning applications was given. The speaker
explained how these applications fit within the Amazon ecosystem to address the challenges the company
faces and how they help it scale. While machine learning is routinely used in recommendation, fraud
detection and ad allocation, it also plays a key role in devices such as the Kindle or the Echo, as well as in
the automation of Kiva-enabled fulfilment centres, statistical machine translation and automated
fresh-produce inspection. In the second part, the speaker discussed how machine learning is made more
accessible within the company by, for example, automating the fine-tuning of machine learning and, in
particular, deep learning algorithms. Applying complex predictive systems, such as machine learning-based
systems, in the wild requires manually tuning and adjusting knobs, broadly referred to as system
parameters or hyper-parameters. Black-box optimisation, and in particular Bayesian optimisation,
provides a natural framework for addressing this problem by taking the human expert out of the
fine-tuning loop. Bayesian optimisation was also briefly introduced in this context.
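As an illustration of the idea (a sketch, not the speaker's actual system), the following runs Bayesian optimisation on a one-dimensional toy problem: a Gaussian-process surrogate is fitted to past evaluations of a hypothetical validation-error curve, and a lower-confidence-bound acquisition picks the next hyper-parameter to try. The objective function and all names are invented for the example; numpy is assumed available.

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_error(x):
    # Hypothetical black-box objective: in practice, each call would mean
    # training a model with hyper-parameter x and measuring its error.
    return (x - 2.0) ** 2 + 0.5

def rbf(a, b, length=1.0):
    # Squared-exponential (RBF) kernel between two sets of 1-D points.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-5):
    # Standard zero-mean Gaussian-process regression equations.
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_query)
    mean = Ks.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

# A few random evaluations to start, then let the surrogate take over.
x_obs = rng.uniform(0.0, 5.0, size=3)
y_obs = validation_error(x_obs)
grid = np.linspace(0.0, 5.0, 201)

for _ in range(10):
    mean, std = gp_posterior(x_obs, y_obs, grid)
    # Lower-confidence-bound acquisition: prefer points with a low predicted
    # error (exploitation) or a high uncertainty (exploration).
    x_next = grid[np.argmin(mean - 2.0 * std)]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, validation_error(x_next))

best = x_obs[np.argmin(y_obs)]
print(f"best hyper-parameter found: {best:.2f}")
```

The loop spends only a handful of expensive objective evaluations, which is exactly why this approach is attractive for tuning deep-learning hyper-parameters, where each evaluation can take hours.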
1.2.3 Expert lecture: Trends in data analytics architectures

Speaker: Mr Rick van der Lans | Founder of R20/Consultancy BV, ambassador of Kadenza
To deliver data in support of their business intelligence needs, most organizations have developed a
classic data warehouse. New groups of business users with data science and data investigative needs
have also developed the data lake, which is most often a standalone system with almost no relationship
to the existing data warehouse. On top of this, some organizations have already started to develop a
third data delivery system for delivering data to specific business users—the data marketplace. Again,
this third data delivery system is being developed as an analytical island, not to mention the still-newer
world of streaming analytics.
Developing all these data delivery systems independently is far from ideal. Development-wise, wheels
are reinvented, resulting in low productivity, metadata replication, and inconsistencies across reports
and analyses. It is crucial for organisations to somehow bring these systems together. One solution is to
deploy a unified data delivery architecture based on data virtualisation technology. Such an architecture
can support a wide range of business users, from those demanding a highly agile environment such as
the marketplace to those requiring governable and auditable reports. This session discussed how
these environments can be merged into a unified architecture.
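The routing idea behind data virtualisation can be sketched in a few lines. This is a toy illustration with invented names and data, not any vendor's API: consumers query one logical layer, which delegates to whichever backend holds the data, without copying it into a central store.

```python
from typing import Callable, Dict, Iterable

class VirtualDataLayer:
    """A single logical access layer over heterogeneous data sources."""

    def __init__(self):
        self._sources: Dict[str, Callable[[], Iterable[dict]]] = {}

    def register(self, logical_name: str, reader: Callable[[], Iterable[dict]]):
        # A "reader" could wrap a SQL warehouse, a file-based lake, an API feed...
        self._sources[logical_name] = reader

    def query(self, logical_name: str, where=lambda row: True):
        # Consumers see one interface regardless of where the data lives.
        return [row for row in self._sources[logical_name]() if where(row)]

layer = VirtualDataLayer()
layer.register("warehouse.sales", lambda: [{"year": 2017, "amount": 120},
                                           {"year": 2018, "amount": 150}])
layer.register("lake.web_logs", lambda: ({"url": u} for u in ["/home", "/stats"]))

recent = layer.query("warehouse.sales", where=lambda r: r["year"] >= 2018)
print(recent)  # [{'year': 2018, 'amount': 150}]
```

Real data-virtualisation products add query pushdown, caching and shared metadata on top, but the design choice is the same: one logical schema, many physical systems.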
Figure 1 – Data virtualization overview.
1.2.4 Expert lecture: Future developments in data analytics and data science

Speaker: Mr Fernando Perez-Cruz | Chief Data Scientist at the Swiss Data Science Center
Implicit generative modelling has recently scratched the surface of how deep learning can be used as a
universal simulator. Until recently, deep learning had been used quite successfully to solve long-standing
discriminative problems in computer vision, speech and natural language processing, basically showing
that hand-coded, human-engineered features are suboptimal in the presence of complex problems in
which humans only have a basic understanding of the variability of the data, and given the availability of
large labelled data sets.
Figure 2 – Generative Adversarial Networks (GANs) in the bigger AI “big picture”. Flowcharts show how the different parts of an AI system relate to each other within different AI disciplines; shaded boxes indicate components that are able to learn from data.
Recently, Variational Auto-Encoders and Generative Adversarial Networks (GANs) have shown that the
same representation learning can be used for generative modelling. These implicit generative models do
not provide an interpretable model for the available data, but rather a universal simulator that is able to
generate data similar to those used for training. These tools can be used to simplify complex
simulations (e.g. climate models) or to complement limited observations (e.g. in cosmology or particle
physics), opening the door to Artificial Intelligence-powered advances in many different fields of science.
In this talk, the speaker first presented the general approaches and methods as well as their potential use
and current shortcomings. In the second part of the talk, the speaker described a recent application of GANs for
password guessing. This is an ideal application for understanding the need for GANs, why they work, and
what their limitations are.
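For reference, the adversarial game underlying GANs is usually written as the minimax objective from Goodfellow et al.'s original formulation, in which a discriminator $D$ and a generator $G$ are trained against each other:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

At the equilibrium of this game the generator's distribution matches the data distribution, which is what allows a trained $G$ to act as the "universal simulator" discussed by the speaker.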
1.2.5 Expert lecture: Engaging users and policymakers for data analytics

Speaker: Mr Franco Accordino | DG CONNECT
There is a growing need to improve forward thinking in policymaking practices and to enable citizens and
policy makers to co-create ideas and share evidence in order to feed policy reflections, and eventually
inform policy decisions. New policies are often thought up on the basis of current trends rather than by
capturing future opportunities offered, for instance, by long-term advances in science and technology.
The need to focus on short-term measures often prevents governments and businesses from orientating
their policy choices towards future possibilities, partly because governments are elected to come up with
tangible responses to current challenges that matter to citizens and partly because long-term investment
decisions may be too risky. This may make it difficult to put in place sustainable solutions to structural
problems.
Figure 3 – Building blocks for users', policymakers' and citizens' engagement.
The challenges can be articulated along two main axes, highlighting typical tensions between different
policymaking mindsets: (i) evidence about the status of the real world vs. inspiration from longer-term
thinking; (ii) delegated leadership vs. participatory leadership. Policy Making 3.0 is a participatory and
evidence-based model designed to provide an answer to the above challenges. The essential elements
of the Policy Making 3.0 process are the following:
1. The implementation of policies co-developed by policymakers and stakeholders has an impact
on the real world (individuals, society, economy, environment etc.).
2. The real world is monitored and data are gathered, measured and analysed through knowledge
mining and statistical tools, which makes it possible to identify trends, issues and challenges and
to elicit scientific evidence.
3. The scientific evidence provides information that stakeholders and policymakers can use to
reshape policies.
4. Stakeholders and policymakers interact in social networks where factors other than evidence
emerge, such as personal opinions, corporate interests, lobbying, ideological values and
other 'non-measurable' factors (i.e. factors that cannot be easily sensed and automatically captured).
Such factors often prevail over the scientific evidence. There are also boundary constraints that
come in the form of values and laws (e.g. constitutional rules).
5. Policies may also be inspired by desirable visions and aspirations that are not necessarily in line
with current, short-term trends and can also be considered as part of the ‘emotional’ and
intuitive factors that influence decisions.
The Policy Making 3.0 model is implemented by Futurium, an online laboratory set up to co-develop
futures and policy ideas. This laboratory combines the informal nature of social networks with a
methodological foresight approach to engage stakeholders in the co-creation of the futures that they
all want. The architecture consists of the following components: front-end participatory tools; knowledge
harvesting tools for both policymakers and stakeholders; data-crawling tools to extract knowledge from
popular social networks and embed it into the Futurium; and data-gathering tools to fetch real-world data
and input it into the Futurium.
DORIS - Data Oriented Services is a pool of services enabling policy and support departments to carry
out their daily work in an easy and cost-effective manner. DORIS is largely based on algorithms and data
from various sources and providers, meaningfully integrated and made accessible through tailored
interfaces (dashboards). It is customisable to specific use cases and available as a general-purpose
service (drive-in) via a web app or an Application Programming Interface (API).
Figure 4 – Example of data service: DORIS.
1.3 General session – Mapping of existing data analytics initiatives in the ESS as well as in
comparable sectors
1.3.1 Organisation of the session The general session at the end of the first day was devoted to mapping the existing DA initiatives in the
ESS as well as in comparable sectors. The presentations of the three experts discussed:
- The Big Data Platform initiative of the EC Joint Research Centre;
- The Logical Statistical Data warehouse of the Centre of Excellence on Data Warehouse;
- Methods and algorithms in the UN Global Platform.
1.3.2 Expert lecture: The Big Data Platform initiative of the EC Joint Research Centre Speaker: Mr Pierre Soille |EC – DG JRC
The project background is the explosion of digital data sources that led to the big data paradigm (Volume,
Velocity and Variety of data streams), with Earth observation (EO) entering the big data era thanks to the
Copernicus Sentinel satellites (full, free and open data). Indeed, the increasing amount of free and open
geospatial data of interest to major societal questions calls for the development of innovative
data-intensive computing platforms for the efficient and effective extraction of information from these
data. Accordingly, the big data task force of the Joint Research Centre (DG JRC) of the European Commission recommended
in late 2014 to start a big data pilot project on EO and Social Sensing. The development of the JRC Earth
Observation Data and Processing Platform (JEODPP) started in 2016.
The JEODPP platform is versatile in the sense that it accommodates different service levels to satisfy the
needs of a variety of users: batch processing, provision of legacy environments, and interactive
visualization and processing. All services are accessed through a web browser so that no dedicated client
software needs to be installed on the devices accessing the platform. A simplified representation of the
JEODPP architecture is shown in the figure below in the form of a three-layer stack with the resources
layer at its basis, followed by the service layer, and the client layer at its top. The platform already
supports a variety of projects serving policy areas in agriculture, forestry, environment, disaster risk
management, development, health, and energy.
Figure 5 – JEODPP platform architecture: simplified view with its main layers and components.
The exponential growth of data and data sources is a matter of fact, and the big data paradigm is
permeating all fields. FAIR data principles also apply to data analysis. The challenge of turning data into
insights is facilitated by platforms where the data are co-located with the processing. Jupyter notebooks
contribute to reproducible analysis as well as to knowledge sharing and collaborative working. The
importance of interactive analysis and visualisation should also be taken into account. Open standards,
including open APIs, are needed to avoid platform lock-in.
The project will evolve into Big Data Analytics (2019-2020), applying innovative approaches (AI/machine
learning) to combine large amounts of data originating from different sources. The project is supported by the
JEODPP and the initial focus will be on geospatial data and their combination with other data sources. It
will be the key enabler of data and knowledge sharing across JRC and towards partners and the link with
Copernicus Data and Information Access Services.
1.3.3 Expert lecture: The Logical Statistical Data warehouse of the Centre of Excellence on Data
Warehouse Speaker: Ms Sonia Quaresma |INE, PT
A Statistical Data Warehouse (S-DWH) can be defined as a single corporate data warehouse fully based
on metadata. An S-DWH is specialised in supporting production for multiple-purpose statistical
information. With an S-DWH different aggregate data on different topics should not be produced
independently from each other but as integrated parts of a comprehensive information system where
statistical concepts, micro data, macro data and infrastructures are shared. The information systems
connect the business to the infrastructure; this is represented by a conceptual organisation of the
effective S-DWH, which is able to support tactical demands.
In the layered architecture, in terms of data system, different assets are identified:
- the staging data are usually of a temporary nature, and their contents can be erased, or archived, after the
DW has been loaded successfully;
- the operational data store is a database designed to integrate data from multiple sources for additional
operations on the data. The data are then passed back to operational systems for further operations and
to the data warehouse for reporting;
- the data warehouse is the central repository of data, created by integrating data from one or
more disparate sources, and stores current as well as historical data;
- data marts are kept in the access layer and are used to get data out to the users. Data marts are derived
from the primary information of a data warehouse, and are usually oriented to specific business lines.
In this way, micro data, macro data and infrastructure are shared.
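The four data-system assets described above can be sketched as a toy end-to-end flow using an in-memory SQLite database. The table names, columns and figures below are invented purely for illustration:

```python
# Illustrative sketch of the four-layer S-DWH data flow: staging ->
# operational data store -> warehouse -> data mart. All names and
# numbers are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Staging layer: raw, temporary data as received from a source.
cur.execute("CREATE TABLE staging_turnover (firm_id TEXT, year INT, turnover REAL)")
cur.executemany("INSERT INTO staging_turnover VALUES (?, ?, ?)",
                [("A", 2018, 120.0), ("B", 2018, 95.5), ("A", 2017, 110.0)])

# Operational layer: integrated, cleaned records ready for further operations.
cur.execute("""CREATE TABLE ods_turnover AS
               SELECT firm_id, year, turnover FROM staging_turnover
               WHERE turnover IS NOT NULL""")

# Warehouse layer: the central repository holding current and historical data.
cur.execute("CREATE TABLE dwh_turnover AS SELECT * FROM ods_turnover")

# Access layer: a data mart with aggregates oriented to one business line.
cur.execute("""CREATE TABLE mart_turnover_by_year AS
               SELECT year, SUM(turnover) AS total
               FROM dwh_turnover GROUP BY year""")

# Staging is temporary: it can be erased once the DW has been loaded.
cur.execute("DELETE FROM staging_turnover")

totals = dict(cur.execute("SELECT year, total FROM mart_turnover_by_year"))
```

Note how erasing the staging table loses nothing once the warehouse and mart are loaded, mirroring the temporary nature of the staging layer described above.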
The management of the metadata used and produced in the different layers of the warehouse is
specifically defined in the metadata framework and in the micro data linking. Metadata are used for the
description, identification and retrieval of information and link the various layers of the S-DWH through
the mapping of the different metadata description schemes. The framework contains all statistical
actions, all classifiers in use, input and output variables, selected data sources, descriptions of output
tables, questionnaires and so on. All these meta-objects are collected during the design phase into one
metadata repository. This configures a metadata-driven system that is also well suited to supporting the
management of actions or IT modules in generic workflows.
A distributed computing platform leads to the Logical Statistical Data Warehouse (LSDW) of the future.
Moving to an LSDW means adding semantic data abstraction. The abstraction layer allows a conceptual
generalisation of the sources and of the integration outputs; being based on a semantic middleware that
supports context integration, it describes each data taxonomy and relates it to each use-case ontology in
place. Moreover, a virtualisation interface allows access to all corporate data in different contexts, as
well as the creation of new contexts.
Figure 6 – Logical Statistical Data Warehouse.
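As a rough illustration of the virtualisation interface, the toy sketch below exposes one logical view over two mock sources without copying the data. The source names, the variable and the records are invented; a real semantic middleware would resolve taxonomies and ontologies rather than simple keys:

```python
# Minimal sketch of a data-virtualization interface: a logical view that
# answers queries by delegating to heterogeneous sources at run time.

class LogicalView:
    def __init__(self):
        self.sources = {}

    def register(self, name, fetch):
        # 'fetch' is any callable returning records for that source.
        self.sources[name] = fetch

    def query(self, variable):
        # Resolve the variable against every registered source; no data
        # is copied into the view itself.
        records = []
        for fetch in self.sources.values():
            records.extend(fetch(variable))
        return records

# Two mock sources standing in for a survey database and an admin register.
survey = {"employment": [("firm_A", 10), ("firm_B", 25)]}
register = {"employment": [("firm_C", 7)]}

view = LogicalView()
view.register("survey_db", lambda v: survey.get(v, []))
view.register("admin_register", lambda v: register.get(v, []))

rows = view.query("employment")
```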
1.3.4 Expert lecture: Methods and algorithms in the UN Global Platform Speaker: Mr Joni Karanka |ONS, UK
The UN Global Working Group (GWG) on Big Data for official statistics was created in 2014 by the UN
Statistical Commission to explore the benefits and challenges of the use of new data sources and
technologies for official statistics and SDG indicators. The GWG addresses issues pertaining to
methodology, quality, technology, data access, legislation, privacy, management and finance, and
provides adequate cost-benefit analyses. The main assets of statistical offices are data and algorithms,
for exploration and data science.
The UN Global Platform is envisaged as a marketplace for sharing and developing core catalogues of
services, data, metadata, methods, APIs, information technology tools and training materials. The global
network operates as a federated network of platforms at the national, regional and global levels, which
ensures interoperability and information-sharing among the platforms in the network through agreed
and defined interfaces. The platform is intended for use by its trusted partners and should meet the
requirements for research and development in the use of multisource data. Transparent partnership
agreements will need to be developed with private- and public-sector organizations so that the platform
partners contribute and derive value through a business model that is individually sustainable for all
stakeholders. The platform offers technology infrastructure and constitutes a network for data
innovation to facilitate global collaboration of the official statistical community.
The presentation discussed algorithms in official statistics, methods as assets, and the methods
service of the UN Big Data Global Working Group, built on Algorithmia.
1.4 Parallel group session 1 – Data analytics in practice - real examples from the ESS Chair: Mr Mervyn Ó'Lúing | Central Statistics Office (IE)
1.4.1 Objective of the session While intuitively understandable, the concept of DA can mean different things to different stakeholders.
At this session, examples from ESS members illustrated what DA can actually achieve for a
statistical institute. A discussion on the lessons learned from these examples was also initiated.
1.4.2 Summary of the session Different use cases were discussed, e.g. some prototypes/pilots and experimental products. DA may
truly disrupt the way NSIs process data, from data collection through indicator estimation to
decision-making. However, there are many limitations to adopting and running those pilots in production,
and these were discussed. It was stressed that, prior to adopting a "full operational stack" in production,
DA tools/methods (or parts thereof), whose selection is often ad hoc, need to be made truly reusable and
applicable. Beyond development and deployment, the operational integration of DA is challenged by
cultural/corporate considerations, since DA is often seen as "yet another tool" with little added value
demonstrated to production units.
Figure 7 – Technology components and main features for methods and algorithms sharing.
1.5 Parallel group session 2 – Statistical models and methods for data analytics Chair: Susie Fortier | StatCan (CA)
1.5.1 Objective of the session Advanced methods (e.g. model-based estimation, multivariate methods, forecasting/nowcasting and
microsimulation) are already being used by statistical institutes - sometimes in regular production, other
times in supporting processes or in analytical projects.
At this session, the use of advanced statistical methods in DA for official statistics was discussed. The
session also included a presentation from Statistics Iceland on rare events in time series (outliers and
extreme values) and on models for population projections taking into account fertility, mortality and migration.
1.5.2 Summary of the session The presentation introduced the probabilistic methods used for the demographic projections of the
Icelandic population. The key points made during the presentation concerned the description of the data
and the formulation of the mathematical problem, recalling the definition of rare events in time series
(outliers and extreme values) and introducing methods and models such as autoregressive distributed lag
(ARDL) models (short term) and ARIMA/exponential smoothing models (long term). Finally, functional data
modelling with time-series coefficient functions of orthonormal function expansions was presented.
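To make the exponential-smoothing family concrete, here is a generic, self-contained sketch of simple exponential smoothing on invented population figures. It is a minimal illustration of the technique, not the model used by Statistics Iceland:

```python
# Simple exponential smoothing: s_t = alpha*y_t + (1-alpha)*s_{t-1}.
# The series values and alpha are invented for illustration.

def exponential_smoothing(series, alpha):
    """Return the smoothed series for smoothing factor 0 < alpha <= 1."""
    smoothed = [series[0]]                 # initialise with the first value
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

# Toy annual population counts (thousands).
population = [320.0, 322.5, 325.1, 329.0, 333.2]
fitted = exponential_smoothing(population, alpha=0.5)

# For simple exponential smoothing, the one-step-ahead forecast is the
# last smoothed value.
forecast = fitted[-1]
```

A larger alpha weights recent observations more heavily; production models would estimate alpha from the data rather than fix it.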
1.6 Parallel group session 3 – Engaging with external stakeholders Chair: Laust Hvas Mortensen | Statistics Denmark (DK).
1.6.1 Objective of the session External users of DA may come from many different stakeholder groups (policymakers, policy analysts,
NGOs, journalists, researchers, etc.). Moreover, requests for DA may arrive at a central contact point of a
statistical institute, or directly at the entity concerned. This raises the question of how to ensure
equitable treatment of stakeholders and a sound use of resources. At this session, different approaches
taken to the interaction with external DA stakeholders were discussed.
1.6.2 Summary of the session Some of the topics discussed during this session are:
- how to communicate probability concepts, in particular uncertainty. The way we communicate
uncertainty and change for different types of statistics needs to be tailored to suit the audience. We
should offer users the opportunity to “zoom in” according to their needs and be able to find detailed
information when they require it;
- transparency (assumptions to produce official statistics). Good statistical practice is fundamentally
based on transparent assumptions, reproducible results, and valid interpretations. In some situations,
guideline principles may conflict, requiring individuals to prioritize principles according to context.
However, in all cases, stakeholders have an obligation to act in good faith, to act in a manner that is
consistent with these guidelines, and to encourage others to do the same. Above all, professionalism
in statistical practice presumes the goal of advancing knowledge while avoiding harm; using statistics
in pursuit of unethical ends is inherently unethical;
- presenting data using infographics. These types of graphics present complex information quickly and
clearly. Infographics are increasingly popular because they can provide a great amount of complex
information succinctly, using visually appealing elements that draw attention and facilitate retention;
- importance of confidentiality in maintaining trust in official statistics. With Big Data, data subjects may
be unaware they are generating data and what it can be used for, despite the efforts of the social
media platforms in this respect. One of the challenges is to manage the acceptance of data re-use and
data linkage, which would not necessarily be expected by data subjects;
- identifying the users and providing them with specific products. Understanding how our statistics are used is
essential to maximising the public value of official statistics and ensuring that users are able to make
sound and informed judgements from official statistics.
1.7 Parallel group session 4 – What are the success factors for data analytics? Chair: Mervyn Ó'Lúing |Central Statistics Office (IE)
1.7.1 Objective of the session Due to their non-traditional nature, DA initiatives might face obstacles during development as well as
during subsequent deployment.
At this session, based on practical experience, the different aspects (organisational and infrastructural as
well as technical) to ensure the success of DA initiatives in NSIs were explored.
1.7.2 Summary of the session A presentation from Statistics Poland on what are not success factors for DA was discussed. The key
question was about using advanced DA and the issues around the publication of the results. Two cases were
described. The first case was about human capital composite indicators and the risk of government
authorities misinterpreting the data. The second case concerned a big data application within the
MapReduce framework, which raises questions regarding the quality of new data sources.
1.8 Parallel group session 5 – Emerging data analytics tools and techniques Chair: Joni Karanka | ONS (UK)
1.8.1 Objective of the session Countless advanced tools for DA are being developed. Some of them have a huge potential, whereas
others appear to not be applicable to official statistics (for instance because of their "black box" or
proprietary nature) - and yet others look impressive, but are in reality just statistical methods relabelled
as "data science".
At this session, it was discussed how to actually cut through the hype to find and deploy the modern data
analysis tools and techniques that are genuinely useful for official statistics.
1.8.2 Summary of the session The state-of-the-art and advanced DA tools and techniques already in use in the ESS (e.g. automatic data
retrieval, web scraping, data cleaning, natural language processing, machine learning, pattern
recognition) and some examples (CBS-NL) were presented. Moreover, technical solutions (tools and
software) to share data and enable users to analyse them, and the need for a common "playground"
platform (e.g. like the UNECE sandbox), were discussed.
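As a flavour of the web-scraping techniques mentioned, the sketch below extracts price quotes from an HTML fragment using only the standard library. The page structure and class names are invented, and the HTML is hard-coded so the example stays offline; a real pipeline would first download the page:

```python
# Toy web-scraping step: parse price quotes out of an HTML snippet.
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of every element carrying class="price"."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            # Strip the currency symbol and keep the numeric value.
            self.prices.append(float(data.replace("€", "").strip()))
            self.in_price = False

# Hard-coded stand-in for a downloaded retailer page.
page = """
<ul>
  <li><span class="name">Milk 1l</span><span class="price">€1.09</span></li>
  <li><span class="name">Bread</span><span class="price">€2.40</span></li>
</ul>
"""

parser = PriceParser()
parser.feed(page)
```

The extracted `parser.prices` list would then feed the data-cleaning and estimation steps of a price-statistics pipeline.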
1.9 Parallel group session 6 – Facilitating the data analytics of others Chair: Fabio Ricciato | Eurostat
1.9.1 Objective of the session NSIs – regardless of whether they focus on providing complete DA services – might wish to provide
standardised components (data access, analytical tools) – either for internal use or for external users of
data. By providing small components rather than complete deliverables, a community of data analysts
could be fostered. At the same time, while the reputational risks are mitigated by not providing any
analyses, the mere act of providing certain analysis tools might be construed as non-objectivity. At this
session, the nature and the approach to provide and share data analysis components were discussed.
1.9.2 Summary of the session NSIs should improve the accessibility of their data for analysts. The improvements involve further
development of APIs, metadata and open data standards. The access to microdata for external data
analysts should generally also be improved. Beyond data, NSIs should provide analytical tools that take
into account the typology of the user (everyday users vs. expert users) and the cost of the engagement.
In the end, DA services should be made available to all users. A collaborative/participative approach
consists in sharing the source code with users. Computing/testing platforms help to further engage
external users with methods and tools, but more resources are needed.
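A minimal sketch of what consuming such an NSI data API could look like: the endpoint named in the comment and the response shape are invented for illustration, and the HTTP call itself is replaced by a hard-coded reply so the example stays self-contained:

```python
# Parsing a (hypothetical) open-data API reply into a simple lookup table.
import json

# Hard-coded stand-in for the body a hypothetical
# GET /api/v1/dataset/unemployment?geo=EU call might return; a real
# client would fetch it over HTTP (e.g. with urllib).
response_body = """
{
  "dataset": "unemployment",
  "unit": "percent",
  "observations": [
    {"geo": "AT", "year": 2018, "value": 4.9},
    {"geo": "BE", "year": 2018, "value": 6.0}
  ]
}
"""

data = json.loads(response_body)
# Turn the reply into a lookup table keyed by country code.
rates = {obs["geo"]: obs["value"] for obs in data["observations"]}
```

Machine-readable replies of this kind, together with published metadata (here the `unit` field), are what let external analysts build on official data without manual downloads.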
1.10 Parallel group session 7 – Joint ESS data analysis undertakings Chair: Matyas Meszaros |Eurostat
1.10.1 Objective of the session Many statistical institutes face similar data analysis requests. As for many other areas, there might be a
potential for ESS members to collaborate on certain aspects of DA: sharing tools, approaches and
methods. This is all the more true as both the ESS Vision 2020 and the ESS priorities beyond 2020 set out needs
that could be tackled by DA initiatives.
At this session, the potential for ESS collaboration on DA was explored.
1.10.2 Summary of the session This session discussed some tangible actions and initiatives to launch for truly engaging NSIs in DA so as
to foster (or enhance when it already exists) cooperation. Examples from past collaborations run within
the ESS framework were regarded as best practices (e.g. JDemetra+, VTL, mu-Argus and tau-Argus). Some of
the suggested solutions are: the definition of an ESS-wide catalogue of ongoing projects and their use, the
adoption of a common repository for sharing resources, the use of generic and concise documentation
with examples, the running of training courses/online courses at basic level, the creation of a
helpdesk/community support, the sharing of small building blocks with enabled multilingualism.
1.11 Parallel group session 8 – Infrastructure and architecture as a solid basis for data
analytics Chair: Sonia Quaresma |INE (PT)
1.11.1 Objective of the session The support for systematic DA (beyond ad hoc one-shot exercises) could imply new architectural and
infrastructural requirements for a statistical institute. In some cases, this might be best tackled in a
general overhaul of legacy systems.
At this session, the landscape of present and future infrastructure to support DA was discussed.
1.11.2 Summary of the session The session included a presentation on data architecture by ISTAT, illustrating how the legacy
architecture is taken into account and integrated into the new model (e.g. by adopting virtual layers
within a logical warehouse). Existing Statistical Data Warehouses (SDWs) should be modernised if they are
to stay relevant. The Logical Statistical Data Warehouse (LSDW) is the next evolutionary step up from the
SDW. New sources increase the complexity of IT components and move DWH architectures toward logical
architectures. The Logical DWH is a new management architecture combining the strengths of traditional
repository warehouses with alternative data management and access strategies. A Logical DWH is an
evolution and augmentation of DWH practices, not a replacement, and data virtualisation is what enables
it. The Logical Statistical Data Warehouse is a virtual central statistical data store based on logical layers
for managing all available data of interest, making it possible to produce the necessary information,
(re)use data to create new data/new outputs, perform DA, execute analyses, produce reports and support
dashboard tools. Case study: SBS-ICT by web mining. The case study focuses on the use of survey data as
ground truth to create a classification model enabling the prediction of variables of the Enterprises
ICT Survey.
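The ground-truth idea of the case study can be sketched as follows. The web-mined features, the labels and the 1-nearest-neighbour rule below are all invented stand-ins for illustration, not the actual features or model of the case study:

```python
# Survey answers serve as ground truth (labels) for a classifier that
# predicts an ICT variable from features mined on enterprise websites.

def predict_1nn(train, label_of, x):
    """Return the label of the training point closest to x (squared Euclidean)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = min(train, key=lambda t: dist(t, x))
    return label_of[nearest]

# Web-mined features per enterprise: (has_webshop_keyword, n_contact_forms).
train = [(1, 3), (1, 2), (0, 0), (0, 1)]
# Ground-truth ICT-survey answers: does the enterprise sell online?
label_of = dict(zip(train, ["yes", "yes", "no", "no"]))

# Predict the survey variable for an enterprise that was not surveyed.
prediction = predict_1nn(train, label_of, (1, 1))
```

Once trained on the surveyed enterprises, such a model can extend the survey variable to the non-surveyed part of the business register.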
1.12 Parallel group session 9 – Data analytics in a changing environment Chair: Jacopo Grazzini | Eurostat
1.12.1 Objective of the session New challenges (such as the General Data Protection Regulation) and opportunities (such as the Third
Data Package) are entering the DA scene. Further initiatives are also appearing on the horizon.
At this session, how to position DA in this changing environment was discussed.
1.12.2 Summary of the session Though it is recognised that DA products are mature enough, the necessary investment by the NSIs to
integrate these solutions in production was discussed. Beyond the feasibility, the need for NSIs to adapt
is also acknowledged. In acknowledging the potential of DA, it was also mentioned that transformations
are actually needed for NSIs to adopt new roles/capabilities. The impact of algorithmic decision-making,
and how NSIs can communicate about it, e.g. to comply with the requirements for transparency,
accountability and traceability, was further discussed. In particular, the need for humans to be involved
in the decision-making process was stressed, since the adoption of DA methods and tools may leave little
control. Beyond the question of the data themselves, the ethical/legal issues preventing the adoption of
DA in production, as well as possible ways to facilitate them, were raised.
Annex 2 – Programme of the Workshop
Day 1 (10 September 2018)
09:00–12:00 daWos satellite meeting
Implementing and showcasing data analytics
12:00–13:00 Welcome and registration of participants
13:00–14:00 lunch
14:00–14:30 Opening session – Ice breaker
14:30–15:30 Expert lecture on use cases and best practices in data analytics
Cédric Archambeau – Principal Applied Scientist, Amazon
15:30–16:30 Group sessions I: Data analytics in practice - real examples from the ESS | Statistical models and methods for data analytics | Engaging with external stakeholders
16:30–17:00 coffee break
17:00–17:30 Plenary wrap-up session reporting on group sessions I
17:30–19:00 Report and mapping of existing initiatives in the ESS and in comparable sectors
20:00 dinner
Day 2 (11 September 2018)
9:00–10:00 Expert lecture on trends in data analytics architectures
Rick van der Lans – Founder of R20/Consultancy BV
10:00–11:00 Group sessions II: What are the success factors for data analytics? | Emerging data analytics tools and techniques | Facilitating the data analytics of others
11:00–11:30 coffee break
11:30–12:00 Plenary wrap-up session reporting on group sessions II
12:00–13:00 Expert lecture on future developments in data analytics and data science
Fernando Perez-Cruz – Chief Data Scientist at the Swiss Data Science Center
13:00–14:00 lunch
14:00–15:00 Group sessions III: Joint ESS data analysis undertakings | Infrastructure and architecture as a solid basis for data analytics | Data analytics in a changing environment
15:00–15:30 Plenary wrap-up session reporting on group sessions III
15:30–16:00 coffee break
16:00–17:00 Expert lecture on engaging users (and policymakers) for data analytics
Franco Accordino – Directorate-General for Communications Networks, Content and
Technology (European Commission)
17:00–18:00 Closing session
Day 3 (12 September 2018)
09:00–13:00 daWos satellite meeting: The geographical dimension of data analytics | daWos satellite meeting: Balancing confidentiality and utility in data analytics
13:00–14:00 lunch
Workshop programme flyer