1 — U.S. BUREAU OF LABOR STATISTICS • bls.gov
Exploring the Application of Machine Learning Techniques
to Construct R-indicators
Arcenis Rojas & Lucilla Tan
Division of Consumer Expenditure Surveys
2019 AAPOR Conference
May 17, 2019
Consumer Expenditure Survey (CE)
Motivation
Problem: Non-response violates many assumptions made during the sampling procedure and can lead to biased survey estimates.
Potential Solution: Develop an indicator of the representativeness of the respondent pool while data collection is still ongoing.
This can inform the allocation of recruitment resources toward under-represented groups.
Motivation
Develop a representativeness indicator for a specific expenditure category (food), so that the representativeness of the respondent pool with respect to selected characteristics can be monitored over the course of data collection during the survey year.
We are looking for variables that are associated with both food expenditures and survey participation.
Background
Representativeness Indicator (R-Indicator): Measures the risk of potential non-response bias based on weighted, estimated propensities of response.
$$R(x) = 1 - 2\,S(\rho)$$
$$S(\rho) = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} d_i \left(\rho_i - \bar{\rho}\right)^2}$$
$d_i$: design weight for sample unit $i$
$\bar{\rho}$: weighted sample average of response propensities
$\rho_i$: the estimated response propensity for unit $i$
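The R-indicator definition above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function name `r_indicator` is hypothetical, and because the slide does not define N, the sketch assumes N is estimated by the sum of the design weights unless a value is passed in.

```python
import math


def r_indicator(propensities, weights, population_size=None):
    """Return R = 1 - 2 * S(rho), where S is a design-weighted standard deviation."""
    if len(propensities) != len(weights):
        raise ValueError("propensities and weights must have the same length")
    total_weight = sum(weights)
    # Assumption: N defaults to the sum of the design weights.
    n = population_size if population_size is not None else total_weight
    # Weighted sample average of the response propensities.
    rho_bar = sum(d * p for d, p in zip(weights, propensities)) / total_weight
    # Weighted sum of squared deviations from the weighted average.
    weighted_ss = sum(d * (p - rho_bar) ** 2 for d, p in zip(weights, propensities))
    s = math.sqrt(weighted_ss / (n - 1))
    return 1.0 - 2.0 * s


# Identical propensities mean no variation, so R is exactly 1.
uniform_r = r_indicator([0.5, 0.5, 0.5], [1.0, 1.0, 1.0])
# More spread in the propensities lowers R.
spread_r = r_indicator([0.4, 0.7, 0.9], [10.0, 12.0, 8.0])
```

Values of R closer to 1 indicate propensities that vary little across units, which is the sense in which the respondent pool is called representative.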
Background
[Diagram labels: Sample Design, Data Collection, Production, Data Quality, R-Indicator]
The Pipeline
[Diagram: two modelling stages]
Food Model: fit on PDB [TRAIN] with food expenditure ($) as the outcome; predict food ($) for PDB [TEST]; accuracy check against Food ($) [TEST]; output: Food Variables
Survey Response Model: fit on CHI + CEQ + PDB{FOOD} [TRAIN] with response (1/0) as the outcome; predict class and response propensity (%) for CHI + CEQ + PDB{FOOD} [TEST]; accuracy check against Response [TEST]; output: Survey Response Variables
Data Description
2015 Consumer Expenditure Interview Survey (CEQ)
CE Interview Survey Contact History Instrument (CHI)
The 2015 Census Planning Database (PDB)
– Geographic aggregation: tract level (2010 Census boundaries)
– Incorporates the 2009-2013 American Community Survey (ACS) five-year estimates
– Latest data available at the time the study started
Data Description
Initial sample size: 36,226
After data cleaning: 32,255
21,546 (66.8%) were survey participants
Initial number of PDB variables: 114
After eliminating highly correlated and near-zero-variance PDB variables: 54
CHI variables: 2
CEQ variables: 6
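The screening step that reduced the PDB variables can be illustrated with a simplified sketch. The slides do not say which tool or thresholds were used, so everything here is an assumption: the function name `screen_columns`, the variance floor, and the correlation cut-off of 0.9 are all hypothetical.

```python
import math


def variance(xs):
    """Sample variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation; both inputs must have non-zero variance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def screen_columns(columns, min_variance=1e-8, max_abs_corr=0.9):
    """columns maps a variable name to its values; returns the names that survive."""
    kept = []
    for name, values in columns.items():
        if variance(values) < min_variance:
            continue  # near-zero variance: carries almost no information
        if any(abs(correlation(values, columns[k])) > max_abs_corr for k in kept):
            continue  # highly correlated with a variable already kept
        kept.append(name)
    return kept


# Toy data with made-up variable names.
toy = {
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "income_copy": [2.0, 4.0, 6.0, 8.0, 10.1],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}
survivors = screen_columns(toy)
```

In this sketch the first variable of a correlated pair is the one retained, so the result depends on column order; a production filter would usually apply a more principled tie-break.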
Data Description
Data covered 5 cumulative periods over 2015:
– Feb
– Feb-Mar
– Feb-Jun
– Feb-Sep
– Feb-Dec
This let us compute an R-Indicator over time
The goal of continuous monitoring is what motivated building a repeatable process
Selecting Algorithms
Desired model characteristics:
High prediction accuracy
Dimension reduction
Interpretability
Smooth propensity distribution
Selecting Algorithms
Models compared: Classification Tree, Random Forest, Logistic Regression, LASSO
Criteria: Prediction Accuracy, Dimension Reduction, Interpretability, Smooth Propensity Distribution
Classification Tree
Pros:
– Easy to interpret
– Good dimension reduction
Cons:
– Propensities are “chunky”
– Cannot always handle missing values
– Sensitive to tuning parameter specification
Random Forest
Pros:
– High accuracy
– Great for dimension reduction
Cons:
– Interpretation not as clear as other models
– Easily biased if not properly tuned
– Cannot handle missing values
Logistic Regression
Pros:
– Easy to interpret
– Propensities are smooth
– Good for explaining variance
Cons:
– No dimension reduction
– Does not necessarily predict well
LASSO
[Figure: plot annotated at “Min Lambda” and “Lambda 1SE”]
Least Absolute Shrinkage and Selection Operator (LASSO)
Pros:
– High prediction accuracy
– Easy interpretation
– Great for dimension reduction
Cons:
– Coefficients do not necessarily indicate the magnitude of an effect
– Narrative may not be intuitive
LASSO Explained
The lasso regression coefficient estimates are obtained by solving an optimization problem that can be characterized generally as:
$$\underset{\beta}{\text{minimize}} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
where $\lambda \ge 0$ is the shrinkage parameter that controls the relative impact of the two terms. The penalty trades a strict fit to the data for a more generalizable one, which reduces over-fitting.
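To make the role of the penalty concrete, here is a minimal coordinate-descent sketch for the squared-error objective written above. It is not the authors' implementation (a response model would normally use a logistic loss and a dedicated library); the function names are hypothetical, and the sketch assumes no predictor column is all zeros.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values inside [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - b0 - sum_j b_j x_ij)^2 + lam * sum_j |b_j|."""
    n, p = len(X), len(X[0])
    beta0, beta = sum(y) / n, [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of predictor j with the residual that ignores predictor j.
            rho = 0.0
            for i in range(n):
                fit_without_j = beta0 + sum(beta[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            z = sum(X[i][j] ** 2 for i in range(n))
            # The threshold is lam / 2 because the squared-error term has no 1/2 factor.
            beta[j] = soft_threshold(rho, lam / 2.0) / z
        # The intercept is unpenalized: set it to the mean residual.
        beta0 = sum(y[i] - sum(beta[k] * X[i][k] for k in range(p)) for i in range(n)) / n
    return beta0, beta


# Toy data: y depends on the first predictor only; the second is irrelevant.
X_toy = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y_toy = [-3.0, -1.0, 1.0, 3.0, 5.0]
intercept, coefs = lasso_coordinate_descent(X_toy, y_toy, lam=4.0)
```

With a positive λ the relevant coefficient is shrunk toward zero and the irrelevant one is set to exactly zero, which is the dimension-reduction property the slides rely on.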
Training the LASSO
Split the data 50-50 into a training set and a testing set
Used 10-fold cross-validation to find the best shrinkage parameter ($\lambda$) and used a mixing parameter of $\alpha = 1$
Selected the largest lambda whose mean cross-validated error is within one standard error of the minimum, which we call “Lambda 1SE”
Ran LASSO
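The “Lambda 1SE” rule can be written down directly once cross-validation has produced a mean error and a standard error for each candidate lambda. The sketch below assumes those three lists already exist; the function name and the toy numbers are hypothetical.

```python
def select_lambdas(lambdas, cv_mean_error, cv_std_error):
    """Return (lambda_min, lambda_1se) from per-lambda cross-validation summaries."""
    # Index of the lambda with the smallest mean cross-validated error.
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean_error[i])
    # Anything at or below this error is within one standard error of the minimum.
    threshold = cv_mean_error[i_min] + cv_std_error[i_min]
    lambda_min = lambdas[i_min]
    # Largest lambda (strongest shrinkage) that still meets the threshold.
    lambda_1se = max(
        lam for lam, err in zip(lambdas, cv_mean_error) if err <= threshold
    )
    return lambda_min, lambda_1se


# Made-up cross-validation results for four candidate lambdas.
candidates = [0.001, 0.01, 0.1, 1.0]
mean_errors = [0.52, 0.50, 0.51, 0.70]
std_errors = [0.02, 0.02, 0.02, 0.03]
lambda_min, lambda_1se = select_lambdas(candidates, mean_errors, std_errors)
```

Choosing the larger lambda trades a small amount of cross-validated accuracy for a sparser model, which fits the emphasis on dimension reduction.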
Predictors of Food Expenditure
Covariates for Food Expenditure:
Started with 54 variables from the PDB
After running LASSO with “Lambda 1SE” we were left with 2 variables with non-zero coefficients:
– Average Household Income (PDB)
– Average House Value (PDB)
Predicting Survey Participation
Inputs:
– Average Household Income (PDB)
– Average House Value (PDB)
– Census Region (CEQ)
– Dwelling Unit Structure Type (CEQ)
– Household Size (CEQ)
– Homeowner / Renter (CEQ)
– Urbanicity (CEQ)
– Survey Wave (CEQ)
– Number of Contact Attempts (CHI)
– Ever Changed Interviewer (CHI)
Predicting Survey Participation
After running LASSO with “Lambda 1SE” we were left with 6 variables with non-zero coefficients:
– Average Household Income (PDB)
– Household Size (CEQ)
– Homeowner / Renter (CEQ)
– Urbanicity (CEQ)
– Number of Contact Attempts (CHI)
– Ever Changed Interviewer (CHI)
Model Accuracy
Baseline was the unregularized GLM logistic model
Unit response model: model performance comparison using the Test subsample, with model parameters estimated on the Train subsample

| Model predictors | Regression | Proportion prediction accuracy (cut-off value prob > 0.5)* | Area under the ROC curve |
|---|---|---|---|
| 1SE-regularized (6 predictors) | GLM logistic | 0.7261 | 0.699 |
| Unregularized (10 predictors) | GLM logistic | 0.7263 | 0.703 |

* Units with predicted probabilities > 0.5 were classified as respondents.
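Both metrics in the comparison can be computed from test-set labels and predicted probabilities. The following is a small standard-library sketch, not the authors' code; the function names and toy values are hypothetical, and the AUC is computed by the pairwise definition, which is only practical for small samples.

```python
def accuracy_at_cutoff(y_true, probs, cutoff=0.5):
    """Share of units classified correctly when prob > cutoff means respondent."""
    predicted = [1 if p > cutoff else 0 for p in probs]
    return sum(int(a == b) for a, b in zip(y_true, predicted)) / len(y_true)


def roc_auc(y_true, probs):
    """Chance that a random respondent outscores a random non-respondent (ties count half)."""
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy test set: 1 = respondent, 0 = non-respondent.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.6, 0.55, 0.4, 0.2]
toy_accuracy = accuracy_at_cutoff(labels, scores)
toy_auc = roc_auc(labels, scores)
```

Accuracy depends on the chosen cut-off, while the AUC summarizes ranking quality across all cut-offs, which is why the table reports both.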
Period 5 Final Model

| Predictors | Coeff | SE | p-value |
|---|---|---|---|
| (Intercept) | 1.414 | 0.041 | 0.000 |
| Household income | -2.79E-06 | 3.49E-07 | 0.000 |
| No. contact attempts | -0.089 | 0.004 | 0.000 |
| HH size - one | -0.088 | 0.031 | 0.005 |
| HH size - three | 0.369 | 0.037 | 0.000 |
| HH size - 4+ | 0.214 | 0.040 | 0.000 |
| Changed interviewer | -1.508 | 0.034 | 0.000 |
| Renter | 0.179 | 0.028 | 0.000 |
| Rural area | 0.504 | 0.051 | 0.000 |

* Prediction Accuracy = 72.8%
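As a worked illustration of how these logistic coefficients turn into a response propensity, the sketch below plugs the Period 5 estimates into the logistic function. The dictionary keys are hypothetical labels, the input values are invented, and the reference categories (the omitted household size, owner, non-rural, no interviewer change) are assumed because the slide does not state them.

```python
import math

# Coefficients copied from the Period 5 final model; key names are made up.
COEF = {
    "intercept": 1.414,
    "household_income": -2.79e-06,
    "contact_attempts": -0.089,
    "hh_size_one": -0.088,
    "hh_size_three": 0.369,
    "hh_size_four_plus": 0.214,
    "changed_interviewer": -1.508,
    "renter": 0.179,
    "rural": 0.504,
}


def response_propensity(features):
    """features maps predictor names to values; predictors left out count as 0."""
    linear = COEF["intercept"] + sum(
        COEF[name] * value for name, value in features.items()
    )
    # Logistic (inverse logit) transform of the linear predictor.
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical unit: tract average income of 50,000 and 3 contact attempts.
base_case = {"household_income": 50_000, "contact_attempts": 3}
p_base = response_propensity(base_case)
# Same unit, but the interviewer changed during data collection.
p_changed = response_propensity({**base_case, "changed_interviewer": 1})
```

For these invented inputs the propensity falls from roughly 0.73 to roughly 0.38 when the interviewer changes, which shows how heavily that single coefficient weighs in the model.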
Period 5 Model Propensity Distribution
R-Indicators by Period and Model

| Period | Months of Data | Classification Tree | Logistic Regression | LASSO Regression |
|---|---|---|---|---|
| 1 | 1 | 0.730 | 0.668 | 0.734 |
| 2 | 2 | 0.702 | 0.658 | 0.729 |
| 3 | 5 | 0.690 | 0.632 | 0.682 |
| 4 | 8 | 0.671 | 0.617 | 0.653 |
| 5 | 11 | 0.678 | 0.622 | 0.657 |
R-Indicators
Lessons Learned
Distribution of estimated unit response propensities with variable Language included as a predictor (Accuracy > 98%)
Lessons Learned
No. of sample units with a missing value for LANGUAGE: 10,536
No. of those units that were survey non-respondents: 10,393
=> 98.6% (10,393 / 10,536) of sample units with a missing value for LANGUAGE were non-respondents
KNOW YOUR DATA!!!
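A quick check of this kind can catch a predictor whose missingness is nearly a proxy for the outcome before it inflates model accuracy. The sketch below is an assumed, simplified version of such a check; the function name and the toy records are hypothetical, and missing values are represented as `None`.

```python
def missingness_vs_outcome(values, responded):
    """Return the non-response rate among units with a missing value and among the rest."""

    def nonresponse_rate(flags):
        # flags holds 1 for respondents and 0 for non-respondents.
        if not flags:
            return float("nan")
        return sum(1 for r in flags if r == 0) / len(flags)

    missing = [r for v, r in zip(values, responded) if v is None]
    present = [r for v, r in zip(values, responded) if v is not None]
    return nonresponse_rate(missing), nonresponse_rate(present)


# Toy records: language is only recorded for most units that were interviewed.
language = ["en", None, None, "es", None, "en"]
responded = [1, 0, 0, 1, 0, 0]
rate_missing, rate_present = missingness_vs_outcome(language, responded)
```

A large gap between the two rates suggests the variable is recorded as a consequence of responding, so using it as a predictor would leak the outcome into the model.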
Contact Information
Division of Consumer Expenditure Surveys
www.bls.gov/cex
Arcenis Rojas
Economist
202-691-6884
Lucilla Tan
Senior Economist
202-691-5128