Post on 08-Jun-2020

Analytics in Official Statistics: From Adaptive Survey Design to the U.S. 2020 Census

Michael T. Thieme

Assistant Director for Systems and Contracts

Decennial Census Programs

U.S. Census Bureau

Disclaimer

The thoughts and opinions in this presentation are those of the presenter and not necessarily those of the U.S. Census Bureau.

Where we are:

▪Survey costs are rising

▪Confidence in government is declining

▪With it, so is the public's willingness to participate in surveys

▪Current methods for producing official statistics are unsustainable

How did we get here? And how do we keep going?

One Barrier to Where We Want to Go

1 Source: Fostering Interoperability in Official Statistics: Common Statistical Production Architecture (UNECE, 2013)

Accidental Architecture

This is what Accidental Architecture looks like at Census:

The Result?

▪Higher system costs (development, operations and maintenance)

▪Nearly nonexistent interoperability

▪Less data accessibility, discoverability, usability

▪Much more difficult to use data analytics and adaptive survey design approaches

Part of the Answer: Adaptive Survey Design

Survey Data Collection Platform as a Service: A new approach at the U.S. Census Bureau

[Diagram: platform components]

▪Concurrent Analysis and Estimation System

▪Unified Tracking System (Paradata Repository)

▪Centralized Operational Analysis and Control (Multimode Operational Control System)

▪CaRDS (ACS/Decennial)

▪MAF/TIGER (Decennial)

▪Business Register (ECON)

▪Frame and Sample Systems

▪CaRDS (ACS and Decennial)

▪BR, MADB, StEPS II (ECON)

▪Integrated Field Operation Control Systems

▪Time & Attendance Systems

▪Response Processing Systems

▪Centurion/ISR

▪iCADE, ATAC, CQA/IVR, COMET, CLMS, Enumeration

Modest Beginnings

▪National Survey of College Graduates

▪Developed R-Indicator Model

▪Ran experiments

▪Built confidence
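The slide names an R-Indicator model without defining it. The sketch below assumes the commonly cited definition of the representativeness indicator, R = 1 - 2 * S(rho), where S(rho) is the standard deviation of estimated response propensities. The propensity values are made up for illustration, the function name `r_indicator` is hypothetical, and design weights and the propensity model itself are left out. It is not the Census Bureau's implementation.

```python
from statistics import stdev


def r_indicator(propensities):
    """Return R = 1 - 2 * S(rho) for a list of estimated response propensities.

    Assumes the unweighted sample standard deviation; a production version
    would normally apply survey design weights.
    """
    if len(propensities) < 2:
        raise ValueError("need at least two propensities")
    if any(p < 0.0 or p > 1.0 for p in propensities):
        raise ValueError("propensities must lie in [0, 1]")
    return 1.0 - 2.0 * stdev(propensities)


# Hypothetical propensities: every case equally likely to respond.
balanced = [0.6, 0.6, 0.6, 0.6]
# Hypothetical propensities: two subgroups with very different response rates.
skewed = [0.9, 0.9, 0.3, 0.3]

r_balanced = r_indicator(balanced)
r_skewed = r_indicator(skewed)
```

Under this definition, values near 1 mean propensities vary little across cases, and lower values signal that some groups respond much less than others, which is the kind of signal an adaptive design could use to redirect effort.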

Modest Beginnings

▪Census Tests

▪2014, 2015, 2016

▪Administrative record modeling

▪Optimization of field work

▪Changed the way we do Censuses

The U.S. 2020 Census

Using Analytics to:

▪Optimize the 2020 Census paid advertising campaign

▪Identify vacant housing units

▪Optimize the number of enumeration attempts

▪Identify best time to knock on doors

▪Optimize field worker efficiency

CAES 2020 Production Environment

1. Hortonworks Hadoop for storage and in-database processing

2. SAS 9.4 for non-distributed processing

3. SAS Viya for distributed processing

The Journey to CAES 2020

3 years of testing non-distributed versus distributed along 3 dimensions

1. Performance: How fast can we go?

2. Accuracy: When we go fast, do we come up with the same result?

3. Cost: What does it take to achieve better performance and the same level of precision?

Pilots (each assessed on Performance, Accuracy, Cost):

▪2015 Pilot: SAS LASR In-Memory

▪2016 Pilot: SAS In-Database (via MapReduce)

▪2018 Pilot: SAS Viya In-Memory (Performance ?, Accuracy ?, Cost ?)
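The first two test dimensions (performance and accuracy) can be illustrated with a small timing-and-comparison harness. This is a minimal sketch, not the Bureau's test procedure. The names `compare_runs`, `score_loop` and `score_comprehension` are hypothetical, and the two toy scoring functions merely stand in for a non-distributed and a distributed implementation. Cost is not measured here, since it depends on conversion effort rather than run time.

```python
import time


def compare_runs(baseline, candidate, data, tol=1e-9):
    """Time two implementations on the same data and compare their outputs."""
    start = time.perf_counter()
    expected = baseline(data)
    baseline_seconds = time.perf_counter() - start

    start = time.perf_counter()
    actual = candidate(data)
    candidate_seconds = time.perf_counter() - start

    if len(expected) != len(actual):
        raise ValueError("implementations returned different numbers of results")

    # Accuracy: largest absolute difference between paired results.
    max_abs_diff = max((abs(e - a) for e, a in zip(expected, actual)), default=0.0)
    # Performance: how many times faster the candidate ran.
    speedup = baseline_seconds / candidate_seconds if candidate_seconds > 0 else float("inf")
    return {
        "baseline_seconds": baseline_seconds,
        "candidate_seconds": candidate_seconds,
        "speedup": speedup,
        "max_abs_diff": max_abs_diff,
        "results_match": max_abs_diff <= tol,
    }


def score_loop(rows):
    """Toy stand-in for the existing implementation."""
    out = []
    for x in rows:
        out.append(2.0 * x + 1.0)
    return out


def score_comprehension(rows):
    """Toy stand-in for the converted implementation."""
    return [2.0 * x + 1.0 for x in rows]


report = compare_runs(score_loop, score_comprehension, [float(i) for i in range(1000)])
```

Comparing within a tolerance rather than for exact equality is a deliberate choice: distributed computation often sums values in a different order, which can change the last few floating-point digits without changing the result in any meaningful way.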

2018 Pilot in Detail

Business Goal: speed up Decennial Administrative Records process

1. Performance:

▪Non-Distributed Model Processing Time: 38 HOURS

▪Distributed Model Processing Time: 2 HOURS

2. Accuracy:

▪Non-distributed and distributed RESULTS MATCHED

3. Cost:

▪Roughly 4 HOURS required to convert and validate each model

▪Preserved existing code structure and Math-Stat way of working
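The pilot figures imply a simple back-of-the-envelope calculation, sketched below. The 38-hour, 2-hour and 4-hour values come from the slide. Treating conversion effort and processing time on the same hourly scale, and treating the processing times as applying to a single run, are assumptions made only for illustration. The function name `runs_to_recover_conversion` is hypothetical.

```python
import math

# Figures reported on the slide.
NON_DISTRIBUTED_HOURS = 38.0
DISTRIBUTED_HOURS = 2.0
CONVERSION_HOURS_PER_MODEL = 4.0

# Speedup factor and time saved each time the process runs.
speedup = NON_DISTRIBUTED_HOURS / DISTRIBUTED_HOURS
hours_saved_per_run = NON_DISTRIBUTED_HOURS - DISTRIBUTED_HOURS
reduction_pct = 100.0 * hours_saved_per_run / NON_DISTRIBUTED_HOURS


def runs_to_recover_conversion(conversion_hours, saved_per_run):
    """Number of runs before saved hours cover the one-time conversion effort.

    Assumption: staff hours and processing hours are compared one-to-one.
    """
    if saved_per_run <= 0:
        raise ValueError("no time is saved per run")
    return math.ceil(conversion_hours / saved_per_run)


break_even_runs = runs_to_recover_conversion(CONVERSION_HOURS_PER_MODEL, hours_saved_per_run)
```

On these numbers the distributed run is 19 times faster, a reduction of roughly 95 percent in processing time, and under the stated assumptions the conversion effort for one model is recovered within the first run.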

APPENDIX SLIDES


Performance of AdRec Modeling Programs

Performance of Long-Running Occupied Model Program

Accuracy of Scored Predictions from Occupied Model

Cost of Converting 9.4 LOGISTIC to Viya LOGSELECT

CAES 2020 Production Environment

[Architecture diagram: node, hardware and service labels]

CAES users: Business User, Developer, Admin, SAS Desktop Clients

CAES Cluster: High Speed Local Network (Communication, No Data Moved)

Worker Nodes 5 through 9

Services: NameNode 1, Resource Manager 2, Journal Keeper, Zookeeper, SAS Embedded Process

Master Node 1
20 CPU cores, 256 GB memory, 12x 2 TB disk storage

Master Node 2
20 CPU cores, 256 GB memory, 12x 2 TB disk storage

Services: Resource Manager 1, Hive Metastore 2, HiveServer 2, WebHCat 2, Journal Keeper, SAS Embedded Process

Master Node 3
20 CPU cores, 256 GB memory, 12x 2 TB disk storage

Services: NameNode 1, History Server, Timeline Server, Journal Keeper, Zookeeper, SAS Embedded Process

Master Node 4
20 CPU cores, 256 GB memory, 12x 2 TB disk storage

Services: Hive Metastore 1, HiveServer 1, WebHCat 1, Zookeeper, SAS Embedded Process

Worker Node 4, Worker Node 3
28 CPU cores, 384 GB memory, 16 TB disk storage

Services: DataNode, NodeManager, Open Source R, SAS Embedded Process

Worker Node 2, Worker Node 1
20 CPU cores, 256 GB memory, 16 TB disk storage

Services: Knox Gateway, HDP Clients, RStudio

Virtual Machine 2
8 CPU vCores, 32 GB memory, 1 TB vDisk storage

Services: MySQL Database Server

Virtual Machine 3
8 CPU vCores, 32 GB memory, 1 TB vDisk storage

Services: Ambari Server, Ranger Audit Server, Ranger Policy Server, Zeppelin, HST Server, Activity Analyzer

Virtual Machine 1
8 CPU vCores, 32 GB memory, 1 TB vDisk storage

SAS 9.4 Metadata Server, SAS 9.4 Compute Server

SAS Mid-Tier Server
28 CPU cores, 384 GB memory, 16 TB disk storage

SAS Viya Worker Node 4
28 CPU cores, 384 GB memory, 16 TB disk storage

SAS Viya Worker Node 3
28 CPU cores, 384 GB memory, 16 TB disk storage

SAS Viya Worker Node 2
28 CPU cores, 384 GB memory, 16 TB disk storage

SAS Viya Worker Node 1
28 CPU cores, 384 GB memory, 16 TB disk storage

Services: SAS Visual Analytics (Viya enabled), SAS Visual Statistics (Viya enabled), SAS Visual Data Mining and Machine Learning

SAS Viya Controller Node
28 CPU cores, 384 GB memory, 16 TB disk storage

SAS Viya Microservice Node
28 CPU cores, 384 GB memory, 16 TB disk storage

Legend:

▪Textured blue box: SAS Virtual Machine or Bare Metal

▪Solid blue boxes: SAS Viya servers (Bare Metal recommended)

▪Green boxes: Hortonworks servers

▪Green text: Hortonworks services

▪Blue text: SAS services

Services: SAS Metadata Server, SAS Web Server, SAS Web Application Server, SAS Web Clients, SAS Environment Manager, SAS Data Loader for Hadoop, SAS Scoring Accelerator for Hadoop, SAS Compute Server
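As a rough reading aid for the diagram, the hardware figures can be totalled per tier. This is a minimal sketch covering only the four master nodes and the six SAS Viya nodes (four workers, one controller, one microservice node), whose per-node figures are stated in the diagram. The `NodeSpec` class and `totals` function are hypothetical names, and other node types are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    role: str
    count: int
    cpu_cores: int
    memory_gb: int
    disk_tb: int


# Per-node figures as labeled in the diagram.
INVENTORY = [
    NodeSpec("Hortonworks master node", 4, 20, 256, 12 * 2),
    NodeSpec("SAS Viya worker node", 4, 28, 384, 16),
    NodeSpec("SAS Viya controller node", 1, 28, 384, 16),
    NodeSpec("SAS Viya microservice node", 1, 28, 384, 16),
]


def totals(nodes):
    """Sum cores, memory and disk across a group of node specifications."""
    return {
        "cpu_cores": sum(n.count * n.cpu_cores for n in nodes),
        "memory_gb": sum(n.count * n.memory_gb for n in nodes),
        "disk_tb": sum(n.count * n.disk_tb for n in nodes),
    }


master_totals = totals([n for n in INVENTORY if n.role.startswith("Hortonworks")])
viya_totals = totals([n for n in INVENTORY if n.role.startswith("SAS Viya")])
```

These totals describe raw labeled capacity only; usable memory and storage would be lower once operating system overhead and any storage replication are taken into account.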