Population Estimation from Large Data: The Case of the BRFSS
Carol Pierannunzi, PhD
Lead Survey Methodologist, Population Health Surveillance Branch, CDC
National Center for Chronic Disease Prevention and Health Promotion
Division of Population Health
Purpose of Today’s Discussion: Population Estimation Using BRFSS as a Case
• Explain how large data sets can be used to generalize to a population
• Explain the BRFSS as part of public health surveillance
• Examine sampling for population estimation
• Illustrate problems in data collection
• Examine how probability of selection impacts weighting (design weighting)
• Examine weighting (post stratification, iterative proportional fitting/raking)
• Discuss impact of complex sampling in data analyses
• Take a look at the future of population estimation through new methods
Population Estimation
• Census
• Intercensal estimation
• Voting studies
• Consumer confidence studies
• Marketing studies
• Social and political attitudes and values
• Needs assessments, gap analyses
• and PUBLIC HEALTH surveillance
Public Health Surveillance
The U.S. is highly variable in terms of:
• Geography
• Population demographics
• Distribution of disease burden and risk factors
• Organization of state and local public health infrastructure
Public health programs are primarily designed and delivered by state and local jurisdictions, which address their unique needs within their unique contexts.
Public health surveillance provides data for needs assessments and program evaluation, contributing to the effectiveness and efficiency of public health programs.
What is the BRFSS?
BRFSS is a partnership between CDC and state health departments to produce data which benefit states, territories, localities and public health professionals.
BRFSS includes 57 state/territorial level telephone surveys on health status, health risk behaviors and chronic conditions. Most jurisdictions have collected data since the mid 1980s.
Collects data from approximately 450,000 persons each year.
It is the only source of public health behavior and risk factor data at state/local/territorial level for most states.
BRFSS Partnership Provides a Unique Dataset
State/territorial-level estimates and confidence intervals
Selected Metropolitan/Micropolitan Area Risk Trends (SMART)
• Direct estimates and confidence intervals for cities and counties where sample size is sufficient
• Over 200 MMSAs in 2012
County-level indicators (7-year aggregation) which are used by:
• Community Health Rankings
• Community Health Status Indicators
• Health Indicators Warehouse
• MedMap
Public datasets which are subsets of data collected by states
BRFSS Has Four Components:
1. Core Survey: implemented with standardized protocols. Includes regular and rotating core sections, required of all states and territories
2. Optional Survey Modules: proposed by CDC programs and other agencies (e.g., SAMHSA, Veterans Affairs)
3. State-Added Questions: developed by each state to meet its individual needs and issues; optional
4. Special Project Additions: proposed on an as-needed basis with dedicated funding (examples include Asthma call-back & H1N1)
BRFSS Core Survey
• Immunization
• HIV/AIDS
• Diabetes
• Asthma
• Cardiovascular Disease
• Alcohol consumption
• Exercise
• Health Status
• Health Care Access
• Healthy Days
• Disability
• Tobacco Use
• Sleep
BRFSS Rotating Core Questions
Even Years
• Breast/Cervical cancer screening
• Prostate screening
• Colorectal cancer screening
• Oral health
• Falls
• Seatbelt use
• Drinking & driving
Odd Years
• Fruits & vegetables
• Hypertension awareness
• Cholesterol awareness
• Arthritis burden
• Physical activity
Optional Survey Modules
• Adult asthma history
• Anxiety and depression
• Arthritis management
• Cardiovascular health
• Child immunization
• Childhood asthma
• Diabetes
• General preparedness
• Healthy days: symptoms
• Home environment
• Influenza
• Indoor air quality
• Intimate partner violence
• Osteoporosis
• Random child selection
• Reactions to race
• Secondhand smoke policy
• Sexual violence
• Smoking cessation
• Visual impairment
• Weight control
• And more…
Detecting Emerging Issues
Obesity Trends* Among U.S. Adults, BRFSS, 1990, 1999, 2008
(*BMI ≥30, or about 30 lbs. overweight for 5’4” person)
[Figure: three U.S. maps, one each for 1990, 1999 and 2008. Legend: No Data, <10%, 10%–14%, 15%–19%, 20%–24%, 25%–29%, ≥30%]
Steps in the BRFSS Data Production Process
• Designing the sampling
• Designing the questionnaire
• Collecting data
• Cleaning and processing data
• Weighting datasets
• Analyses with BRFSS using complex sampling
SAMPLING
Sampling (1): Using telephone numbers as sample units
• Sample-to-population linkage problems (see Kalsbeek)
  • Shared phones
  • Persons with multiple phones
  • Persons with no phone access (< 2%); see Mitofsky-Waksberg method for adjustment
Sampling (2): Unweighted Prevalence by Frame-to-Population Linkage (after adjusting for demographic variables)
[Figure: bar chart, vertical axis 0 to 80, comparing four linkage groups: one phone/one adult, multiple phones/one adult, one phone/multiple adults, multiple phones/multiple adults. Several bars are marked σ².]
*Prevalence estimates for ALL frame-to-population linkages, significantly different from one-to-one frame.
σ² indicates statistically significant increases in variance.
Sampling (3): Let’s make it more complicated
• Substate geostrata
  • Public health districts
  • Congressional districts
  • Counties
• Oversampling subpopulations
• Splitting samples to obtain more information
• Respondents could potentially be reached on more than one phone
  • Landline frame
  • Cell phone frame
• Difficult to estimate location based on phone number
  • Ported phone numbers
  • VOIP, security systems, OnStar: deterioration of confidence in “phone” numbers
Sample (4): How it finally comes out
[Diagram: landline sample, cell phone sample, and sample of persons living in other states with state phone numbers, organized by geostrata]
Sample of approximately 8,000,000 numbers = 450,000 interviews
Sample (5): A few comments on sampling
• Sample designs must take into account how the data will be analyzed
• Corrections to population-to-frame linkages must be made
  • Calculate the probability of selection for each potential respondent and adjust (design weighting)
• Sampling cannot account for lack of coverage
  • Surveys with only landline phone numbers
  • Persons without phones in all phone surveys
• Samples can be purchased, but take care
  • Phone samples are deteriorating (especially landline samples)
DESIGNING THE QUESTIONNAIRE
Designing the Questionnaire (1)
• Questions may be proposed by:
  • States
  • CDC programs (e.g., nutrition, chronic disease programs)
  • HHS
  • Other federal agencies (VA, SAMHSA)
• Questions are subjective things
• Validation
  • Norming from large populations
  • Validating against “gold standard”/other surveys
  • Test/retest reliability estimation
• Cognitive testing
  • Focus groups
  • Field testing
Designing the Questionnaire (2)
• Identical questions can be compared across surveys with different samples
• Identical questions can be compared across time
• Questions need periodic review
  • Example: eye dilation questions revised due to changes in medical technology
• Length of the questionnaire
• Too much jargon (e.g., diabetes v. “sugar”)
Designing the Questionnaire (3): How questions affect data
• Order of questions/order of responses can change response
  • Most CATI software will randomize order of response sets
• Language barriers can affect outcomes
• Questions adopted from clinical use are not always appropriate for phone interview
• Sensitive questions
  • Suicide
  • IPV
• Does respondent know the answer?
• Too much recall is problematic
• Behaviors are easier to measure than attitudes
Designing the Questionnaire (4): Sample questions
2.3 During the past 30 days, for about how many days did poor physical or mental health keep you from doing your usual activities, such as self-care, work, or recreation? (85-86)
_ _ Number of days
8 8 None
7 7 Don’t know / Not sure
9 9 Refused
Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, government plans such as Medicare, or Indian Health Service? (87)
1 Yes [If PPHF state go to Module 4, Question 1, else continue]
2 No
7 Don’t know / Not sure
9 Refused
Slide callouts: Open ended question; Closed ended question; Column numbers; Skip pattern
COLLECTING DATA
Collecting Data (1): Let’s not bore you with the details
• Specific guidelines are often found in technical documentation of surveys (see www.cdc.gov/BRFSS)
  • Calling times
  • Training/supervising interviewers
  • Software applications for data entry
  • Screening respondents
  • Maintaining data quality
  • Computing response and cooperation rates
• AAPOR, CASRO, ASA, AEA standards for data collection and reporting
Collecting Data (2): Missing data and statistical inference
• Respondents can refuse questions
  • Income is the most refused question on most surveys
  • Sensitive questions
• High levels of nonresponse may indicate that a question is poor
• Imputation of data needed for weighting (non-ignorable missing values); see Andridge, R. R. and Little, R. J. A. (2010), A Review of Hot Deck Imputation for Survey Non-response. International Statistical Review, 78: 40–64.
  • Nearest neighbor
  • Hot deck imputation
  • Mean/median value replacement
  • Predictive imputation methods
• Imputation of other data (ignorable missing values)
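To make the hot deck option concrete, here is a minimal sketch of random hot deck imputation within cells. It is an illustration only, not the BRFSS procedure: the function name, the field names (sex, region, age_group) and the choice of imputation cells are all assumptions made for the example.

```python
import random

def hot_deck_impute(records, target, cell_keys, seed=0):
    """Fill missing `target` values with a value drawn at random from a donor
    in the same imputation cell (same values on `cell_keys`)."""
    rng = random.Random(seed)
    donors = {}
    for rec in records:
        if rec[target] is not None:
            cell = tuple(rec[k] for k in cell_keys)
            donors.setdefault(cell, []).append(rec[target])
    imputed = []
    for rec in records:
        new = dict(rec)  # copy so the input records are left unchanged
        if new[target] is None:
            pool = donors.get(tuple(new[k] for k in cell_keys))
            if pool:  # leave the value missing when the cell has no donor
                new[target] = rng.choice(pool)
        imputed.append(new)
    return imputed

# Hypothetical respondents; None marks a refused or unknown age group.
respondents = [
    {"sex": "F", "region": 1, "age_group": "35-44"},
    {"sex": "F", "region": 1, "age_group": None},
    {"sex": "M", "region": 2, "age_group": "65-74"},
    {"sex": "M", "region": 2, "age_group": None},
    {"sex": "M", "region": 3, "age_group": None},
]
completed = hot_deck_impute(respondents, "age_group", ["sex", "region"])
```

A cell with no donor stays missing in this sketch; in practice such cells would be collapsed with neighboring cells, much as sparse categories are collapsed during weighting.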
Collecting Data (3): Consequences of missing data
• Reduce n in analyses
• May result in collapsing categories or geographic areas
• Bias weighting process
• Bias estimates
• Total nonresponse is measured by response rate
  • Some journals will not publish data from surveys where response rates are low
  • Low response rates are increasingly problematic
  • Check unweighted demographic characteristics against census data to determine whether there is a pattern of nonresponse
• Item nonresponse is refusal to answer a specific question
CLEANING AND PROCESSING DATA
Cleaning and Processing Data (1): Basics
• Out-of-range codes
• Checking responses against each other
  • Zip code and county responses should match
• Double checking column locations
  • Modules may mean lots of empty cells in the data layout
• Producing calculated variables
  • Calculated variables are noted in the dataset by a leading underscore (e.g., _BMICAT4)
  • BMI
  • Binge drinker
  • Everyday smoker
  • Persons under 65 without health insurance
Cleaning and Processing Data (2): It’s not as easy as it looks
• Automated processes do not eliminate data cleaning problems
  • Clean data continuously during collection period to avoid problems
• Watch for data patterns which are possible, but not likely
  • Unusual number of persons aged 77, but not 78 or 76
• Watch for latency in response
  • Clumping of responses around multiples of 5 or 7 when asked how many times per month
• When data are collected from a number of sources, there are likely to be response differences that must be standardized
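The "possible, but not likely" check can be sketched as a simple spike detector that compares each value's count with the counts of its immediate neighbors. The function name, the threshold of 3 and the sample ages are assumptions chosen for illustration, not part of any BRFSS cleaning specification.

```python
from collections import Counter

def flag_spikes(values, ratio=3.0):
    """Return values whose count is at least `ratio` times the mean count of
    their two neighbours (value - 1 and value + 1)."""
    counts = Counter(values)
    flagged = []
    for value in sorted(counts):
        neighbour_mean = (counts.get(value - 1, 0) + counts.get(value + 1, 0)) / 2
        # Floor the neighbour mean at 1 so sparse tails are not flagged by default.
        if counts[value] >= ratio * max(neighbour_mean, 1):
            flagged.append(value)
    return flagged

# Hypothetical reported ages with an implausible pile-up at 77.
ages = [75] * 4 + [76] * 5 + [77] * 20 + [78] * 4 + [79] * 5
spikes = flag_spikes(ages)
```

The same function could be pointed at "times per month" answers to look for clumping at multiples of 5 or 7, although a flagged value there may reflect rounding by respondents and not a processing error.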
WEIGHTING
Weighting (1):Overview
Weighting matches the respondents to the population using demographic characteristics which are KNOWN in the population and observed in the sampled respondents.
Generally weighting includes race, sex, age, by region.
BRFSS also uses Hispanic ethnicity, education, marital status, home ownership and phone ownership, but NOT income.
Each person interviewed is assigned a weight, which is the number of persons in the population represented by that single respondent
Weighting (2): Two part process
• Design weights account for the probability of selection
  • Based on the number of phones and eligible respondents at each phone number dialed, as well as accounting for the probability of selection of each individual given the number of phone numbers in the sample
• Post-stratification weights adjust the responses according to the race, age, marital status, home ownership category, education level, ethnicity and sex of each respondent and the corresponding proportions of persons who match their demographic characteristics in the population.
  • This requires that you know the proportions in the population
Weighting (3):Design and geostrata weighting
Takes into account the geographic region/strata of the sample.
Design weight uses number of adults in household and number of phones in household for landline sample.
BRFSS landline sample is drawn using low/high density strata within each of the geostrata (1-70+ per state)
Stratum weight (_STRWT) = NRECSTR (number of records in the strata)/ NRECSEL (number of records selected)
Weighting (4): Calculating the design weight
Design Weight = _STRWT * (1/NUMPHON2) * NUMADULT
• NUMPHON2 = number of phones within the household
• NUMADULT = number of adults eligible for the survey within the household
• Questions for the design weights are asked in screening questions and in demographic sections of the survey
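The two formulas on these slides (stratum weight and design weight) can be written out directly. The sketch below uses the variable names from the slides; the function names and the stratum counts in the example are made up for illustration.

```python
def stratum_weight(nrecstr, nrecsel):
    """_STRWT = NRECSTR / NRECSEL (records in the stratum / records selected)."""
    return nrecstr / nrecsel

def design_weight(strwt, numphon2, numadult):
    """Design weight = _STRWT * (1 / NUMPHON2) * NUMADULT."""
    if numphon2 < 1 or numadult < 1:
        raise ValueError("NUMPHON2 and NUMADULT must be at least 1")
    return strwt * (1.0 / numphon2) * numadult

# Hypothetical stratum: 50,000 numbers in the stratum, 500 selected.
strwt = stratum_weight(50_000, 500)
one_phone_two_adults = design_weight(strwt, numphon2=1, numadult=2)
two_phones_one_adult = design_weight(strwt, numphon2=2, numadult=1)
```

A household with two phones is twice as likely to be dialed, so its weight is halved; a respondent chosen from two eligible adults stands in for both, so the weight is doubled.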
Weighting (5): Calculating the post stratification weight
• Only one weight per data unit
• Combine design and post-stratification weights:
Total Weight = Dweight * PSweight
• For BRFSS we use iterative proportional fitting (IPF, also known as raking) to get the post stratification weight.
Weighting (6): Old methods using traditional post stratification
• Post stratification was based on known demographics of the population
• For BRFSS, post stratification included:
  · Regions within states
  · Race/Ethnicity (in detailed categories)
  · Gender
  · Age (in 7 categories)
• Post-stratification forces the sum of the weighted frequencies to equal the population estimates for the region or state by race, age, and gender
• Post stratification weights are applied to the responses, allowing for estimates of how groups of non-respondents would have answered survey questions
Weighting (7): Old methods of post-stratification
• Post-stratification Adjustment Factor is calculated for each race/ethnicity, gender, and age group combination.
  • Requires knowledge of each subset of each factor at the geographic level of interest; otherwise categories must be collapsed
  • Requires a minimum number of persons in each cell; otherwise categories must be collapsed
• All weighting variables were imposed on the process in a single step
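As a rough sketch of the single-step adjustment described here, the factor for each cell can be taken as the known population count divided by the sum of design weights in that cell. The cell definitions, weights and population totals below are invented for the example, and the function name is hypothetical.

```python
from collections import defaultdict

def poststrat_factors(respondents, population):
    """Adjustment factor per cell = population total / sum of design weights."""
    weighted = defaultdict(float)
    for cell, dweight in respondents:
        weighted[cell] += dweight
    return {cell: population[cell] / total for cell, total in weighted.items()}

# Hypothetical (sex, age group) cells with design weights and population totals.
respondents = [(("F", "18-34"), 100.0), (("F", "18-34"), 150.0),
               (("M", "18-34"), 200.0), (("M", "18-34"), 200.0)]
population = {("F", "18-34"): 1000.0, ("M", "18-34"): 1200.0}

factors = poststrat_factors(respondents, population)
# Total weight = design weight * post-stratification factor.
final = [dweight * factors[cell] for cell, dweight in respondents]
```

Because every cell needs its own population count and enough respondents, adding a variable multiplies the number of cells, which is why sparse cells had to be collapsed under this method.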
Weighting (8): Weight trimming with old methods of post stratification
• Sometimes post-stratification resulted in very small or disproportionately large weights within age/race/gender/region categories
• Weight trimming or category collapsing would be done if categories were disproportionately large or too small (< 50 responses)
Weighting (9): Iterative Proportional Fitting
Rather than adjusting weights to categories, IPF adjusts for each dimension separately in an iterative process. The process will continue up to 125 times, or until data converges to Census estimates.
[Diagram: raking dimensions: Region, Age, Race, Gender, Phone Type, Home Ownership, Education, Marital Status, Gender by Race, Age by Gender, Age by Race]
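A minimal sketch of the iterative process follows, assuming each dimension is a single categorical variable with known target totals. The 125-pass limit comes from this slide and the 0.025 percentage-point tolerance from the raking output shown later in the deck; the function names and toy data are assumptions, and this is not the production BRFSS code.

```python
def max_gap(records, w, targets):
    """Largest absolute gap, in percentage points, between weighted and target shares."""
    total = sum(w)
    gap = 0.0
    for var, target in targets.items():
        target_total = sum(target.values())
        for cat, t in target.items():
            share = sum(wi for rec, wi in zip(records, w) if rec[var] == cat) / total
            gap = max(gap, abs(100 * share - 100 * t / target_total))
    return gap

def rake(records, weights, targets, tol=0.025, max_iter=125):
    """Adjust weights one margin at a time until every margin is within `tol`
    percentage points of its target, or `max_iter` passes have been made.
    Assumes every target category has at least one respondent."""
    w = list(weights)
    for iteration in range(1, max_iter + 1):
        for var, target in targets.items():
            sums = {cat: 0.0 for cat in target}
            for rec, wi in zip(records, w):
                sums[rec[var]] += wi
            w = [wi * target[rec[var]] / sums[rec[var]] for rec, wi in zip(records, w)]
        if max_gap(records, w, targets) < tol:
            return w, iteration
    return w, max_iter

# Toy data: four respondents, two margins (sex and age) with known totals.
records = [{"sex": "M", "age": "18-44"}, {"sex": "M", "age": "45+"},
           {"sex": "F", "age": "18-44"}, {"sex": "F", "age": "45+"}]
targets = {"sex": {"M": 40.0, "F": 60.0}, "age": {"18-44": 50.0, "45+": 50.0}}
raked, passes = rake(records, [1.0, 2.0, 3.0, 4.0], targets)
```

Only marginal totals are needed for each dimension, which is what lets variables such as phone type be added without knowing their full cross-classification with every other variable.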
Weighting (10): New Variables Introduced as Controls With IPF
• Education
• Marital status
• Home ownership/renter
• Telephone source (cell phone or landline)
NOTE: It is not possible to get subcategories of other variables by phone ownership, so post stratification was no longer possible.
Weighting (11): Post stratification vs. iterative proportional fitting
Post Stratification
• Operates with less computer time
Iterative Proportional Fitting
• Allows for incorporation of new variables.
• Allows for incorporation of cell phone data.
• Seems to more accurately represent population data (reduces bias).
Weighting (12): Raking – Iteration 1
First Control Variable
Category | Output Weight Sum of Weights | Target Total Sum of Weights | Difference | % of Output Weights | Target % of Weights | Difference in %
Age 18-24, Male | 87122.60 | 95468 | -8345.40 | 6.533 | 7.159 | -0.626
Age 18-24, Female | 77180.40 | 90249 | -13068.60 | 5.788 | 6.768 | -0.980
Age 25-34, Male | 109419.36 | 118670 | -9250.64 | 8.206 | 8.899 | -0.694
Age 25-34, Female | 114395.17 | 112007 | 2388.17 | 8.579 | 8.400 | 0.179
Age 35-44, Male | 121328.71 | 117184 | 4144.71 | 9.099 | 8.788 | 0.311
Age 35-44, Female | 115609.98 | 113779 | 1830.98 | 8.670 | 8.533 | 0.137
Age 45-54, Male | 138658.26 | 127077 | 11581.26 | 10.398 | 9.530 | 0.869
Age 45-54, Female | 136904.33 | 127439 | 9465.33 | 10.267 | 9.557 | 0.710
Age 55-64, Male | 90338.77 | 95032 | -4693.23 | 6.775 | 7.127 | -0.352
Age 55-64, Female | 91693.43 | 97422 | -5728.57 | 6.876 | 7.306 | -0.430
Age 65-74, Male | 57475.54 | 54171 | 3304.54 | 4.310 | 4.062 | 0.248
Age 65-74, Female | 62709.50 | 61828 | 881.50 | 4.703 | 4.637 | 0.066
Age 75+, Male | 49772.58 | 46515 | 3257.58 | 3.733 | 3.488 | 0.244
Age 75+, Female | 80867.37 | 76635 | 4232.37 | 6.064 | 5.747 | 0.317
Note: each Difference in % should be |0.025| or less.
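As a worked illustration of how the table's columns relate, the sketch below recomputes the percentage gap for three rows and the multiplicative factor (target divided by current sum) that a raking pass would apply to bring each cell back to its target. The figures are copied from the table; the function names are hypothetical, and the total of 1,333,476 is the sum of the 14 output-weight values (the 14 targets sum to the same figure).

```python
# (output weight sum, target total) copied from the Iteration 1 table.
cells = {
    "Age 18-24, Male": (87122.60, 95468),
    "Age 18-24, Female": (77180.40, 90249),
    "Age 25-34, Female": (114395.17, 112007),
}
TOTAL = 1333476.00  # sum of all 14 output weights; the targets sum to the same value

def raking_factor(current, target):
    """Multiplier that scales a cell's weights so they sum to the target."""
    return target / current

def percent_gap(current, target, total=TOTAL):
    """Difference in %, valid here because both totals are equal."""
    return 100 * (current - target) / total

gaps = {name: round(percent_gap(c, t), 3) for name, (c, t) in cells.items()}
factors = {name: raking_factor(c, t) for name, (c, t) in cells.items()}
```

Cells that are under-represented (negative gap) get a factor above 1 and over-represented cells a factor below 1; adjusting the next margin then disturbs this one slightly, which is why several iterations are needed.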
Weighting (13): Raking – Iteration 1
Second Control Variable
Category | Output Weight Sum of Weights | Target Total Sum of Weights | Difference | % of Output Weights | Target % of Weights | Difference in %
WH NH | 1151321.16 | 1156947 | -5625.84 | 86.340 | 86.762 | -0.422
OT NH | 15305.42 | 12036 | 3269.42 | 1.148 | 0.903 | 0.245
HISP | 85300.51 | 84230 | 1070.51 | 6.397 | 6.317 | 0.080
BL NH, AS NH, AI NH | 81548.91 | 80263 | 1285.91 | 6.116 | 6.019 | 0.096
Weighting (14): Raking – Iteration 1
Third Control Variable
Category | Input Weight Sum of Weights | Target Total Sum of Weights | Difference | % of Input Weights | Target % of Weights | Difference in %
Less than HS | 89962.05 | 143928 | -53966.35 | 6.746 | 10.793 | -4.047
HS Grad | 412857.99 | 414505 | -1646.81 | 30.961 | 31.085 | -0.123
Some College | 388163.96 | 448218 | -60054.20 | 29.109 | 33.613 | -4.504
College Grad | 442492.00 | 326825 | 115667.37 | 33.183 | 24.509 | 8.674
Weighting (15): Raking – Iteration 1
Fourth Control Variable
Category | Output Weight Sum of Weights | Target Total Sum of Weights | Difference | % of Output Weights | Target % of Weights | Difference in %
Married | 816399.38 | 792326 | 24073.29 | 61.223 | 59.418 | 1.805
Never married, member unmarried couple | 277180.73 | 300111 | -22930.01 | 20.786 | 22.506 | -1.720
Divorced, Widowed, Separated | 239895.88 | 241039 | -1143.29 | 17.990 | 18.076 | -0.086
Fifth Control Variable
Category | Output Weight Sum of Weights | Target Total Sum of Weights | Difference | % of Output Weights | Target % of Weights | Difference in %
Phone interruption | 78558.62 | 82944 | -4385.49 | 5.891 | 6.220 | -0.329
No phone interruption | 1254917.38 | 1250532 | 4385.49 | 94.109 | 93.780 | 0.329
Weighting (16): Raking – Iteration 1
Sixth Control Variable
Category | Output Weight Sum of Weights | Target Total Sum of Weights | Difference | % of Output Weights | Target % of Weights | Difference in %
Male, WH NH | 553107.34 | 552171 | 936.34 | 41.479 | 41.408 | 0.070
Male, BL NH, AS NH, AI NH, OT NH, HISP | 101008.49 | 101946 | -937.51 | 7.575 | 7.645 | -0.070
Female, WH NH | 598213.82 | 604776 | -6562.18 | 44.861 | 45.353 | -0.492
Female, HISP | 38304.69 | 32837 | 5467.69 | 2.873 | 2.463 | 0.410
Female, BL NH, AS NH, AI NH, OT NH | 42841.66 | 41746 | 1095.66 | 3.213 | 3.131 | 0.082
Weighting (17): Raking – Iteration 1
Seventh Control Variable
Category | Output Weight Sum of Weights | Target Total Sum of Weights | Difference | % of Output Weights | Target % of Weights | Difference in %
18-34, WH NH | 308020.95 | 332809 | -24788.05 | 23.099 | 24.958 | -1.859
18-34, BL NH, AS NH, AI NH, OT NH, HISP | 80096.58 | 83585 | -3488.42 | 6.007 | 6.268 | -0.262
35-54, WH NH | 442299.71 | 421539 | 20760.71 | 33.169 | 31.612 | 1.557
35-54, BL NH, AS NH, AI NH, OT NH, HISP | 70201.57 | 63940 | 6261.57 | 5.265 | 4.795 | 0.470
55+, WH NH | 401000.50 | 402599 | -1598.50 | 30.072 | 30.192 | -0.120
55+, BL NH, AS NH, AI NH, OT NH, HISP | 31856.70 | 29004 | 2852.70 | 2.389 | 2.175 | 0.214
Weighting (18): Raking – Iteration 1
Eighth Control Variable
Category | Output Weight Sum of Weights | Target Total Sum of Weights | Difference | % of Output Weights | Target % of Weights | Difference in %
Cell Phone Only | 210390.11 | 197088 | 13302.35 | 15.778 | 14.780 | 0.998
Landline Only | 270206.34 | 280297 | -10090.31 | 20.263 | 21.020 | -0.757
Landline and Cell Phone | 852879.55 | 856092 | -3212.04 | 63.959 | 64.200 | -0.241
Weighting (19): Raking – Iteration 2
First Control Variable
Category | Output Weight Sum of Weights | Target Total | % of Output Weights | Target % of Weights | Difference in % from Iteration 1 | Difference in %
Age 18-24, Male | 94727.80 | 95468 | 7.104 | 7.159 | -0.626 | -0.056
Age 18-24, Female | 87222.36 | 90249 | 6.541 | 6.768 | -0.980 | -0.227
Age 25-34, Male | 116312.81 | 118670 | 8.723 | 8.899 | -0.694 | -0.177
Age 25-34, Female | 110348.83 | 112007 | 8.275 | 8.400 | 0.179 | -0.124
Age 35-44, Male | 118670.65 | 117184 | 8.899 | 8.788 | 0.311 | 0.111
Age 35-44, Female | 113723.15 | 113779 | 8.528 | 8.533 | 0.137 | -0.004
Age 45-54, Male | 130207.90 | 127077 | 9.765 | 9.530 | 0.869 | 0.235
Age 45-54, Female | 130419.01 | 127439 | 9.780 | 9.557 | 0.710 | 0.223
Age 55-64, Male | 93001.49 | 95032 | 6.974 | 7.127 | -0.352 | -0.152
Age 55-64, Female | 96092.37 | 97422 | 7.206 | 7.306 | -0.430 | -0.100
Age 65-74, Male | 54156.67 | 54171 | 4.061 | 4.062 | 0.248 | -0.001
Age 65-74, Female | 62303.45 | 61828 | 4.672 | 4.637 | 0.066 | 0.036
Age 75+, Male | 47039.67 | 46515 | 3.528 | 3.488 | 0.244 | 0.039
Age 75+, Female | 79249.83 | 76635 | 5.943 | 5.747 | 0.317 | 0.196
Weighting (20): Raking – Iteration 7
First Control Variable
Category | Output Weight Sum of Weights | Target Total | % of Output Weights | Target % of Weights | Difference in % from Iteration 1 | Difference in %
Age 18-24, Male | 95491.87 | 95468 | 7.161 | 7.159 | -0.626 | 0.002
Age 18-24, Female | 90265.83 | 90249 | 6.769 | 6.768 | -0.980 | 0.001
Age 25-34, Male | 118621.93 | 118670 | 8.896 | 8.899 | -0.694 | -0.004
Age 25-34, Female | 111985.21 | 112007 | 8.398 | 8.400 | 0.179 | -0.002
Age 35-44, Male | 117205.13 | 117184 | 8.789 | 8.788 | 0.311 | 0.002
Age 35-44, Female | 113769.71 | 113779 | 8.532 | 8.533 | 0.137 | -0.001
Age 45-54, Male | 127088.93 | 127077 | 9.531 | 9.530 | 0.869 | 0.001
Age 45-54, Female | 127437.46 | 127439 | 9.557 | 9.557 | 0.710 | -0.000
Age 55-64, Male | 95037.18 | 95032 | 7.127 | 7.127 | -0.352 | 0.000
Age 55-64, Female | 97426.08 | 97422 | 7.306 | 7.306 | -0.430 | 0.000
Age 65-74, Male | 54168.73 | 54171 | 4.062 | 4.062 | 0.248 | -0.000
Age 65-74, Female | 61831.76 | 61828 | 4.637 | 4.637 | 0.066 | 0.000
Age 75+, Male | 46503.23 | 46515 | 3.487 | 3.488 | 0.244 | -0.001
Age 75+, Female | 76642.96 | 76635 | 5.748 | 5.747 | 0.317 | 0.001
Note: all differences in % are now less than |0.025|.
Weighting (21): Raking – Iteration 7
Eighth Control Variable
Category | Output Weight Sum of Weights | Target Total | % of Output Weights | Target % of Weights | Difference in % at Iteration 1 | Difference in %
Cell Phone Only | 197101.32 | 197088 | 14.781 | 14.780 | 0.998 | 0.001
Landline Only | 280285.25 | 280297 | 21.019 | 21.020 | -0.757 | -0.001
Landline and Cell Phone | 856089.43 | 856092 | 64.200 | 64.200 | -0.241 | -0.000
**** Program terminated at iteration 7 because all current percents differ from target percents by less than 0.025 ****
Weighting (22): BRFSS weights for LL and CP interviews, 2012
Descriptive Statistics, FINAL WEIGHT: LAND-LINE AND CELL-PHONE DATA
N | Minimum | Maximum | Mean | Std. Deviation
475687 | 2.63 | 27725.38 | 510.9614 | 955.04937
• At least one person interviewed represented only 2.63 people in his/her state
• At least one person interviewed represented over 27,700 people in his/her state
• On average, each respondent represented about 511 people in his/her state
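One way to read the spread in these weights is Kish's approximation for the design effect due to unequal weighting, 1 + CV², where CV is the standard deviation of the weights divided by their mean. This approximation is not in the slides; it is a rough sketch that ignores stratification and pools all states together, so treat the result as illustrative only. The variable names are hypothetical.

```python
# Summary statistics copied from the 2012 final-weight table above.
n = 475687
mean_w = 510.9614
sd_w = 955.04937

cv = sd_w / mean_w        # coefficient of variation of the weights
deff_w = 1 + cv ** 2      # Kish's approximate design effect from weighting
n_eff = n / deff_w        # approximate effective sample size

print(f"CV = {cv:.3f}, weighting design effect = {deff_w:.2f}, effective n = {n_eff:,.0f}")
```

Under these assumptions the highly variable weights cost precision relative to an equal-weight sample of the same size, which is one reason analyses of BRFSS data need software that accounts for the complex sample design.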
Thank you
[email protected]
For more information please contact Centers for Disease Control and Prevention
1600 Clifton Road NE, Atlanta, GA 30333
Telephone: 1-800-CDC-INFO (232-4636)/TTY: 1-888-232-6348
E-mail: [email protected]
Web: http://www.cdc.gov
The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.