Population Estimation from Large Data: The Case of the BRFSS


Carol Pierannunzi, PhD

Lead Survey Methodologist, Population Health Surveillance Branch, CDC

National Center for Chronic Disease Prevention and Health Promotion

Division of Population Health

Purpose of Today’s Discussion: Population Estimation Using BRFSS as a Case

Explain how large data sets can be used to generalize to a population

Explain the BRFSS as part of public health surveillance

Examine sampling for population estimation

Illustrate problems in data collection

Examine how probability of selection impacts weighting (design weighting)

Examine weighting (post-stratification, iterative proportional fitting/raking)

Discuss impact of complex sampling in data analyses

Take a look at the future of population estimation through new methods

Population Estimation

Census

Intercensal estimation

Voting studies

Consumer confidence studies

Marketing studies

Social and political attitudes and values

Needs assessments, gap analyses

and PUBLIC HEALTH surveillance

Public Health Surveillance

The U.S. is highly variable in terms of:

• Geography
• Population demographics
• Distribution of disease burden and risk factors
• Organization of state and local public health infrastructure

Public health programs are primarily designed and delivered by state and local jurisdictions, which address their unique needs within their unique contexts.

Public health surveillance provides data for needs assessments and program evaluation, contributing to the effectiveness and efficiency of public health programs.

What is the BRFSS?

BRFSS is a partnership between CDC and state health departments to produce data which benefit states, territories, localities and public health professionals.

BRFSS includes 57 state/territorial level telephone surveys on health status, health risk behaviors and chronic conditions. Most jurisdictions have collected data since the mid 1980s.

Collects data from approximately 450,000 persons each year.

It is the only source of public health behavior and risk factor data at state/local/territorial level for most states.

BRFSS Partnership Provides a Unique Dataset

State/territorial-level estimates and confidence intervals

Selected Metropolitan/Micropolitan Area Risk Trends (SMART)

• Direct estimates and confidence intervals for cities and counties where sample size is sufficient

• Over 200 MMSAs in 2012

County-level indicators (7-year aggregation) which are used by:

• Community Health Rankings
• Community Health Status Indicators
• Health Indicators Warehouse
• MedMap

Public datasets which are subsets of data collected by states

BRFSS Has Four Components:

1. Core Survey: implemented with standardized protocols. Includes regular and rotating core sections; required of all states and territories

2. Optional Survey Modules: proposed by CDC programs and other agencies (e.g., SAMHSA, Veterans Affairs)

3. State-Added Questions: developed by each state to meet its individual needs and issues; optional

4. Special Project Additions: proposed on an as-needed basis with dedicated funding (examples include Asthma call-back and H1N1)

BRFSS Core Survey

• Immunization
• HIV/AIDS
• Diabetes
• Asthma
• Cardiovascular disease
• Alcohol consumption
• Exercise
• Health status
• Health care access
• Healthy days
• Disability
• Tobacco use
• Sleep

BRFSS Rotating Core Questions

Even Years

• Breast/cervical cancer screening
• Prostate screening
• Colorectal cancer screening
• Oral health
• Falls
• Seatbelt use
• Drinking & driving

Odd Years

• Fruits & vegetables
• Hypertension awareness
• Cholesterol awareness
• Arthritis burden
• Physical activity

Optional Survey Modules

• Adult asthma history
• Anxiety and depression
• Arthritis management
• Cardiovascular health
• Child immunization
• Childhood asthma
• Diabetes
• General preparedness
• Healthy days: symptoms
• Home environment
• Influenza
• Indoor air quality
• Intimate partner violence
• Osteoporosis
• Random child selection
• Reactions to race
• Secondhand smoke policy
• Sexual violence
• Smoking cessation
• Visual impairment
• Weight control
• And more…

Obesity Trends* Among U.S. Adults, BRFSS, 1990, 1999, 2008

(*BMI ≥30, or about 30 lbs. overweight for a 5’4” person)

[Figure: three U.S. maps of state obesity prevalence, one each for 1990, 1999, and 2008. Legend: No Data, <10%, 10%–14%, 15%–19%, 20%–24%, 25%–29%, ≥30%]

Detecting Emerging Issues

Steps in the BRFSS Data Production Process

Designing the sampling

Designing the questionnaire

Collecting data

Cleaning and processing data

Weighting datasets

Analyses with BRFSS using complex sampling

SAMPLING

Sampling (1): Using telephone numbers as sample units

Sample-to-population linkage problems (see Kalsbeek)

• Shared phones
• Persons with multiple phones
• Persons with no phone access (< 2%); see the Mitofsky-Waksberg method for adjustment

Sampling (2): Unweighted Prevalence by Frame-to-Population Linkage (after adjusting for demographic variables)

[Figure: bar chart of unweighted prevalence, vertical axis 0 to 80, comparing four linkage groups: one phone/one adult, multiple phones/one adult, one phone/multiple adults, multiple phones/multiple adults. Several bars are marked σ².]

*Prevalence estimates for ALL frame-to-population linkages are significantly different from the one-to-one frame.

σ² indicates statistically significant increases in variance.

Sampling (3): Let’s make it more complicated

Substate geostrata

• Public health districts
• Congressional districts
• Counties

Oversampling subpopulations

Splitting samples to obtain more information

Respondents could potentially be reached on more than one phone

• Landline frame
• Cell phone frame
• Difficult to estimate location based on phone number

Ported phone numbers

VOIP, security systems, OnStar: deterioration of confidence in “phone” numbers

Sample (4): How it finally comes out

[Diagram: the final sample combines a landline sample and a cell phone sample, organized by geostrata, plus a sample of persons living in other states with state phone numbers.]

Sample approximately 8,000,000 numbers = 450,000 interviews

Sample (5): A few comments on sampling

Sample designs must take into account how the data will be analyzed

Corrections to population-to-frame linkages must be made

• Calculate the probability of selection for each potential respondent and adjust (design weighting)

Sampling cannot account for lack of coverage

• Surveys with only landline phone numbers
• Persons without phones in all phone surveys

Samples can be purchased, but take care

• Phone samples are deteriorating (especially landline samples)

DESIGNING THE QUESTIONNAIRE

Designing the Questionnaire (1)

Questions may be proposed by:

• States
• CDC programs (e.g., nutrition, chronic disease programs)
• HHS
• Other federal agencies (VA, SAMHSA)

Questions are subjective things

Validation

• Norming from large populations
• Validating against “gold standard”/other surveys
• Test/retest reliability estimation

Cognitive testing

• Focus groups
• Field testing

Designing the Questionnaire (2)

Identical questions can be compared across surveys with different samples

Identical questions can be compared across time

Questions need periodic review

• Example: eye dilation questions revised due to changes in medical technology

Length of the questionnaire

Too much jargon (e.g., diabetes vs. “sugar”)

Designing the Questionnaire (3): How questions affect data

Order of questions/order of responses can change response

• Most CATI software will randomize order of response sets

Language barriers can affect outcomes

Questions adopted from clinical use are not always appropriate for phone interview

Sensitive questions

• Suicide
• IPV

Does the respondent know the answer?

Too much recall is problematic

Behaviors are easier to measure than attitudes

Designing the Questionnaire (4): Sample questions

[Slide callouts: open-ended question (first item), closed-ended question (second item), column numbers (in parentheses), skip pattern (in brackets)]

2.3 During the past 30 days, for about how many days did poor physical or mental health keep you from doing your usual activities, such as self-care, work, or recreation? (85-86)

_ _ Number of days
8 8 None
7 7 Don’t know / Not sure
9 9 Refused

Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, government plans such as Medicare, or Indian Health Service? (87)

1 Yes [If PPHF state go to Module 4, Question 1, else continue]
2 No
7 Don’t know / Not sure
9 Refused

COLLECTING DATA

Collecting Data (1): Let’s not bore you with the details

Specific guidelines are often found in the technical documentation of surveys (see www.cdc.gov/BRFSS)

• Calling times
• Training/supervising interviewers
• Software applications for data entry
• Screening respondents
• Maintaining data quality
• Computing response and cooperation rates

AAPOR, CASRO, ASA, AEA standards for data collection and reporting

Collecting Data (2): Missing data and statistical inference

Respondents can refuse questions

• Income is the most refused question on most surveys
• Sensitive questions

High levels of nonresponse may indicate that a question is poor

Imputation of data needed for weighting (non-ignorable missing values); see Andridge, R. R. and Little, R. J. A. (2010), A Review of Hot Deck Imputation for Survey Non-response. International Statistical Review, 78: 40–64.

• Nearest neighbor
• Hot deck imputation
• Mean/median value replacement
• Predictive imputation methods

Imputation of other data (ignorable missing values)
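As a rough illustration of hot deck imputation from the list above, the sketch below fills a missing value with a value borrowed from a randomly chosen donor in the same imputation class. The function name, the field names (`age_group`, `income`), and the choice of imputation class are hypothetical and are not taken from BRFSS documentation.

```python
import random

def hot_deck(records, target, class_var, seed=0):
    """Fill missing `target` values with a donor value from the same class.

    Hypothetical helper for illustration only; real surveys choose
    imputation classes and donor rules much more carefully.
    """
    rng = random.Random(seed)
    # Collect observed (non-missing) values by imputation class.
    donors = {}
    for rec in records:
        if rec[target] is not None:
            donors.setdefault(rec[class_var], []).append(rec[target])
    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input is not modified
        if rec[target] is None and donors.get(rec[class_var]):
            rec[target] = rng.choice(donors[rec[class_var]])
        filled.append(rec)
    return filled

records = [
    {"age_group": "18-24", "income": 30000},
    {"age_group": "18-24", "income": None},
    {"age_group": "65+", "income": 20000},
    {"age_group": "65+", "income": None},
]
filled = hot_deck(records, "income", "age_group")
```

A record whose class has no donors is left missing, which is one reason categories sometimes have to be collapsed before imputation.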

Collecting Data (3): Consequences of missing data

Reduce n in analyses

• May result in collapsing categories or geographic areas

Bias weighting process

Bias estimates

Total nonresponse is measured by response rate

• Some journals will not publish data from surveys where response rates are low
• Low response rates are increasingly problematic
• Check unweighted demographic characteristics against census data to determine whether there is a pattern of nonresponse

Item nonresponse is refusal to answer a specific question

CLEANING AND PROCESSING DATA

Cleaning and Processing Data (1): Basics

Out-of-range codes

Checking responses against each other

• Zip code and county responses should match

Double checking column locations

• Modules may mean lots of empty cells in the data layout

Producing calculated variables

Calculated variables are noted in the dataset by a leading underscore (e.g., _BMICAT4)

• BMI
• Binge drinker
• Everyday smoker
• Persons under 65 without health insurance
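A calculated variable is derived from raw responses rather than asked directly. The sketch below computes BMI from weight and height and assigns one of the standard BMI categories (obese at BMI ≥30). The function names and category labels are hypothetical and do not reproduce the actual coding of any BRFSS variable.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2

def bmi_category(value):
    """Standard BMI cut points; labels are illustrative, not BRFSS codes."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal weight"
    if value < 30:
        return "overweight"
    return "obese"

example_bmi = bmi(81.0, 1.80)            # roughly 25
example_category = bmi_category(31.2)    # "obese"
```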

Cleaning and Processing Data (2): It’s not as easy as it looks

Automated processes do not eliminate data cleaning problems

• Clean data continuously during the collection period to avoid problems

Watch for data patterns which are possible, but not likely

• Unusual number of persons aged 77, but not 78 or 76

Watch for latency in response

• Clumping of responses around multiples of 5 or 7 when asked how many times per month

When data are collected from a number of sources, there are likely to be response differences that must be standardized
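One simple way to look for the clumping described above is to measure what share of nonzero answers fall on a multiple of 5 or 7. This is a minimal sketch with made-up data; the function name and the example values are hypothetical.

```python
def heaping_share(values, multiple=5):
    """Share of nonzero responses that are exact multiples of `multiple`."""
    nonzero = [v for v in values if v > 0]
    if not nonzero:
        return 0.0
    return sum(1 for v in nonzero if v % multiple == 0) / len(nonzero)

# Made-up "times per month" responses.
times_per_month = [5, 10, 15, 3, 7, 20, 1, 30, 0]
share_on_fives = heaping_share(times_per_month, multiple=5)   # 5 of 8 nonzero answers
share_on_sevens = heaping_share(times_per_month, multiple=7)  # 1 of 8 nonzero answers
```

A share far above what a smooth distribution would give suggests respondents are rounding, which is worth flagging before analysis.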

WEIGHTING

Weighting (1): Overview

Weighting matches the respondents to the population using demographic characteristics which are KNOWN in the population and observed in the sampled respondents.

Generally, weighting uses race, sex, and age, by region.

BRFSS also uses Hispanic ethnicity, education, marital status, home ownership and phone ownership, but NOT income.

Each person interviewed is assigned a weight, which is the number of persons in the population represented by that single respondent
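To see why the weight matters, compare an unweighted proportion with a weighted one. The sketch below uses four made-up respondents; the function name and all numbers are hypothetical.

```python
def weighted_prevalence(outcomes, weights):
    """Share of the weighted population with outcome == 1."""
    total = sum(weights)
    return sum(w for y, w in zip(outcomes, weights) if y == 1) / total

# Made-up data: 1 = has the risk factor, 0 = does not.
outcomes = [1, 0, 0, 1]
weights = [100.0, 300.0, 500.0, 100.0]  # persons represented by each respondent

unweighted = sum(outcomes) / len(outcomes)         # 2 of 4 respondents
weighted = weighted_prevalence(outcomes, weights)  # 200 of 1,000 persons
```

Here half of the respondents report the risk factor, but they represent only a fifth of the population, so the weighted estimate is much lower.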

Weighting (2): Two-part process

Design weights account for the probability of selection

• Based on the number of phones and eligible respondents at each phone number dialed, as well as the probability of selection of each individual given the number of phone numbers in the sample

Post-stratification weights adjust the responses according to the race, age, marital status, home ownership category, education level, ethnicity and sex of each respondent and the corresponding proportions of persons who match their demographic characteristics in the population.

• This requires that you know the proportions in the population

Weighting (3): Design and geostrata weighting

Takes into account the geographic region/strata of the sample.

Design weight uses number of adults in household and number of phones in household for landline sample.

BRFSS landline sample is drawn using low/high density strata within each of the geostrata (1-70+ per state)

Stratum weight (_STRWT) = NRECSTR (number of records in the stratum) / NRECSEL (number of records selected)

Weighting (4): Calculating the design weight

Design Weight = _STRWT * (1/NUMPHON2) * NUMADULT

• NUMPHON2 = number of phones within the household
• NUMADULT = number of adults eligible for the survey within the household

Questions for the design weights are asked in screening questions and in demographic sections of the survey
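The stratum weight and design weight formulas can be worked through in a few lines. The function names and the example numbers below are hypothetical; only the formulas come from the slides.

```python
def stratum_weight(nrecstr, nrecsel):
    """_STRWT = records in the stratum / records selected from it."""
    return nrecstr / nrecsel

def design_weight(strwt, numphon2, numadult):
    """Design Weight = _STRWT * (1/NUMPHON2) * NUMADULT."""
    return strwt * (1 / numphon2) * numadult

# Made-up stratum: 10,000 phone records, 500 of them selected.
strwt = stratum_weight(10_000, 500)                       # 20.0
# Made-up household: 2 phones, 3 eligible adults.
dweight = design_weight(strwt, numphon2=2, numadult=3)    # 30.0
```

More phones in the household lower the weight (the household was more likely to be reached), while more eligible adults raise it (each adult was less likely to be the one selected).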

Weighting (5): Calculating the post-stratification weight

Only one weight per data unit

Combine design and post-stratification weights:

Total Weight = Dweight * PSweight

For BRFSS we use iterative proportional fitting (IPF, also known as raking) to get the post-stratification weight.

Weighting (6): Old methods using traditional post-stratification

Post-stratification was based on known demographics of the population

For BRFSS, post-stratification included:

• Regions within states
• Race/ethnicity (in detailed categories)
• Gender
• Age (in 7 categories)

Post-stratification forces the sum of the weighted frequencies to equal the population estimates for the region or state by race, age, and gender

Post-stratification weights are applied to the responses, allowing for estimates of how groups of non-respondents would have answered survey questions

Weighting (7): Old methods of post-stratification

Post-stratification Adjustment Factor is calculated for each race/ethnicity, gender, and age group combination.

• Requires knowledge of each subset of each factor at the geographic level of interest; otherwise categories must be collapsed
• Requires a minimum number of persons in each cell; otherwise categories must be collapsed

All weighting variables were imposed on the process in a single step
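A cell-based adjustment factor can be sketched as the population count for a cell divided by the sum of design weights of the respondents in that cell, so that the weighted frequencies add up to the population estimate. The function name, the cell definitions, and the numbers are hypothetical.

```python
from collections import defaultdict

def poststrat_factors(cells, design_weights, population):
    """Adjustment factor per cell = population count / sum of design weights."""
    sums = defaultdict(float)
    for cell, w in zip(cells, design_weights):
        sums[cell] += w
    return {cell: population[cell] / total for cell, total in sums.items()}

# Made-up respondents, each assigned to a (gender, age group) cell.
cells = [("F", "18-24"), ("F", "18-24"), ("M", "18-24")]
design_weights = [10.0, 30.0, 20.0]
population = {("F", "18-24"): 80, ("M", "18-24"): 50}

factors = poststrat_factors(cells, design_weights, population)
final_weights = [w * factors[c] for c, w in zip(cells, design_weights)]
```

A cell with no respondents has no factor at all, which is why sparse cells forced categories to be collapsed.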

Weighting (8): Weight trimming with old methods of post-stratification

Sometimes post-stratification resulted in very small or disproportionately large weights within age/race/gender/region categories

Weight trimming or category collapsing would be done if categories were disproportionately large or too small (< 50 responses)
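The slides do not give the trimming rule, so the following is only one simple possibility: clip each weight to a lower and upper bound, then spread the clipped-off amount across the untrimmed weights in proportion to their size so the total is preserved. The function name and the bounds are hypothetical.

```python
def trim_weights(weights, lower, upper):
    """One-pass trimming: clip to [lower, upper], then redistribute the
    difference proportionally over the weights that were not clipped.

    Simplification: redistributed weights are not re-checked against the
    bounds, so a production routine would repeat the pass if needed.
    """
    clipped = [min(max(w, lower), upper) for w in weights]
    excess = sum(weights) - sum(clipped)
    free = [i for i, w in enumerate(weights) if lower < w < upper]
    free_total = sum(clipped[i] for i in free)
    if free and free_total > 0:
        for i in free:
            clipped[i] += excess * clipped[i] / free_total
    return clipped

raw = [1, 10, 10, 100]                          # made-up weights
trimmed = trim_weights(raw, lower=2, upper=50)  # bounds are arbitrary here
```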

Weighting (9): Iterative Proportional Fitting

Rather than adjusting weights to categories, IPF adjusts for each dimension separately in an iterative process. The process will continue up to 125 times, or until the data converge to Census estimates.

[Diagram: cycle of raking dimensions: Region, Age, Race, Gender, Phone Type, Home Ownership, Education, Marital Status, Gender by Race, Age by Gender, Age by Race]
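The raking loop can be sketched in plain Python: adjust the weights to one margin at a time, cycle through all margins, and stop when every category's weighted percentage is within 0.025 percentage points of its target or after 125 passes. This is a minimal sketch, not the BRFSS production program; the function names, the two example margins, and all numbers are hypothetical.

```python
def max_pct_gap(records, weights, targets):
    """Largest absolute gap (in percentage points) between weighted and target shares."""
    total_w = sum(weights)
    worst = 0.0
    for var, target in targets.items():
        total_t = sum(target.values())
        for cat, t in target.items():
            w_cat = sum(w for rec, w in zip(records, weights) if rec[var] == cat)
            worst = max(worst, abs(100 * w_cat / total_w - 100 * t / total_t))
    return worst

def rake(records, weights, targets, max_iter=125, tol_pct=0.025):
    """Iterative proportional fitting over the margins listed in `targets`."""
    weights = list(weights)
    for iteration in range(1, max_iter + 1):
        for var, target in targets.items():
            # Current weighted total in each category of this margin.
            sums = {cat: 0.0 for cat in target}
            for rec, w in zip(records, weights):
                sums[rec[var]] += w
            # Scale every weight so this margin matches its target exactly.
            weights = [w * target[rec[var]] / sums[rec[var]]
                       for rec, w in zip(records, weights)]
        if max_pct_gap(records, weights, targets) < tol_pct:
            return weights, iteration
    return weights, max_iter

# Made-up respondents and design weights.
records = [
    {"sex": "M", "age": "18-44"}, {"sex": "M", "age": "45+"},
    {"sex": "F", "age": "18-44"}, {"sex": "F", "age": "45+"},
    {"sex": "F", "age": "45+"}, {"sex": "M", "age": "18-44"},
]
design_weights = [10, 20, 15, 25, 10, 20]
# Made-up population totals for each margin (both sum to 1,000).
targets = {"sex": {"M": 480, "F": 520}, "age": {"18-44": 450, "45+": 550}}

raked, iterations = rake(records, design_weights, targets)
```

Only the margins need to be known, not the full cross-classification, which is what lets variables such as phone type enter as controls.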

Weighting (10): New Variables Introduced as Controls With IPF

Education

Marital status

Home ownership/renter

Telephone source (cell phone or landline)

NOTE: It is not possible to get subcategories of other variables by phone ownership, so post-stratification was no longer possible.

Weighting (11): Post-stratification vs. iterative proportional fitting

Post-stratification

• Operates with less computer time

Iterative proportional fitting

• Allows for incorporation of new variables.
• Allows for incorporation of cell phone data.
• Seems to more accurately represent population data (reduces bias).

Weighting (12): Raking – Iteration 1

First Control Variable

Category | Output Weight Sum of Weights | Target Total | Sum of Weights Difference | % of Output Weights | Target % of Weights | Difference in %
Age 18-24, Male | 87122.60 | 95468 | -8345.40 | 6.533 | 7.159 | -0.626
Age 18-24, Female | 77180.40 | 90249 | -13068.60 | 5.788 | 6.768 | -0.980
Age 25-34, Male | 109419.36 | 118670 | -9250.64 | 8.206 | 8.899 | -0.694
Age 25-34, Female | 114395.17 | 112007 | 2388.17 | 8.579 | 8.400 | 0.179
Age 35-44, Male | 121328.71 | 117184 | 4144.71 | 9.099 | 8.788 | 0.311
Age 35-44, Female | 115609.98 | 113779 | 1830.98 | 8.670 | 8.533 | 0.137
Age 45-54, Male | 138658.26 | 127077 | 11581.26 | 10.398 | 9.530 | 0.869
Age 45-54, Female | 136904.33 | 127439 | 9465.33 | 10.267 | 9.557 | 0.710
Age 55-64, Male | 90338.77 | 95032 | -4693.23 | 6.775 | 7.127 | -0.352
Age 55-64, Female | 91693.43 | 97422 | -5728.57 | 6.876 | 7.306 | -0.430
Age 65-74, Male | 57475.54 | 54171 | 3304.54 | 4.310 | 4.062 | 0.248
Age 65-74, Female | 62709.50 | 61828 | 881.50 | 4.703 | 4.637 | 0.066
Age 75+, Male | 49772.58 | 46515 | 3257.58 | 3.733 | 3.488 | 0.244
Age 75+, Female | 80867.37 | 76635 | 4232.37 | 6.064 | 5.747 | 0.317

Difference in % should be 0.025 or less in absolute value


Second Control Variable

Category | Output Weight Sum of Weights | Target Total | Sum of Weights Difference | % of Output Weights | Target % of Weights | Difference in %
WH NH | 1151321.16 | 1156947 | -5625.84 | 86.340 | 86.762 | -0.422
OT NH | 15305.42 | 12036 | 3269.42 | 1.148 | 0.903 | 0.245
HISP | 85300.51 | 84230 | 1070.51 | 6.397 | 6.317 | 0.080
BL NH,AS NH,AI NH | 81548.91 | 80263 | 1285.91 | 6.116 | 6.019 | 0.096

Weighting (14): Raking - Iteration 1

Third Control Variable

Category | Input Weight Sum of Weights | Target Total | Sum of Weights Difference | % of Input Weights | Target % of Weights | Difference in %
Less than HS | 89962.05 | 143928 | -53966.35 | 6.746 | 10.793 | -4.047
HS Grad | 412857.99 | 414505 | -1646.81 | 30.961 | 31.085 | -0.123
Some College | 388163.96 | 448218 | -60054.20 | 29.109 | 33.613 | -4.504
College Grad | 442492.00 | 326825 | 115667.37 | 33.183 | 24.509 | 8.674


Fourth Control Variable

Category | Output Weight Sum of Weights | Target Total | Sum of Weights Difference | % of Output Weights | Target % of Weights | Difference in %
Married | 816399.38 | 792326 | 24073.29 | 61.223 | 59.418 | 1.805
Never married, member unmarried couple | 277180.73 | 300111 | -22930.01 | 20.786 | 22.506 | -1.720
Divorced, Widowed, Separated | 239895.88 | 241039 | -1143.29 | 17.990 | 18.076 | -0.086

Fifth Control Variable

Category | Output Weight Sum of Weights | Target Total | Sum of Weights Difference | % of Output Weights | Target % of Weights | Difference in %
Phone interruption | 78558.62 | 82944 | -4385.49 | 5.891 | 6.220 | -0.329
No Phone Interruption | 1254917.38 | 1250532 | 4385.49 | 94.109 | 93.780 | 0.329


Sixth Control Variable

Category | Output Weight Sum of Weights | Target Total | Sum of Weights Difference | % of Output Weights | Target % of Weights | Difference in %
Male, WH NH | 553107.34 | 552171 | 936.34 | 41.479 | 41.408 | 0.070
Male, BL NH,AS NH,AI NH,OT NH,HISP | 101008.49 | 101946 | -937.51 | 7.575 | 7.645 | -0.070
Female, WH NH | 598213.82 | 604776 | -6562.18 | 44.861 | 45.353 | -0.492
Female, HISP | 38304.69 | 32837 | 5467.69 | 2.873 | 2.463 | 0.410
Female, BL NH,AS NH,AI NH,OT NH | 42841.66 | 41746 | 1095.66 | 3.213 | 3.131 | 0.082


Seventh Control Variable

Category | Output Weight Sum of Weights | Target Total | Sum of Weights Difference | % of Output Weights | Target % of Weights | Difference in %
18-34, WH NH | 308020.95 | 332809 | -24788.05 | 23.099 | 24.958 | -1.859
18-34, BL NH,AS NH,AI NH,OT NH,HISP | 80096.58 | 83585 | -3488.42 | 6.007 | 6.268 | -0.262
35-54, WH NH | 442299.71 | 421539 | 20760.71 | 33.169 | 31.612 | 1.557
35-54, BL NH,AS NH,AI NH,OT NH,HISP | 70201.57 | 63940 | 6261.57 | 5.265 | 4.795 | 0.470
55+, WH NH | 401000.50 | 402599 | -1598.50 | 30.072 | 30.192 | -0.120
55+, BL NH,AS NH,AI NH,OT NH,HISP | 31856.70 | 29004 | 2852.70 | 2.389 | 2.175 | 0.214


Eighth Control Variable

Category | Output Weight Sum of Weights | Target Total | Sum of Weights Difference | % of Output Weights | Target % of Weights | Difference in %
Cell Phone Only | 210390.11 | 197088 | 13302.35 | 15.778 | 14.780 | 0.998
Landline Only | 270206.34 | 280297 | -10090.31 | 20.263 | 21.020 | -0.757
Landline and Cell Phone | 852879.55 | 856092 | -3212.04 | 63.959 | 64.200 | -0.241


First Control Variable (later iteration; slide title not recovered)

Category | Output Weight Sum of Weights | Target Total | % of Output Weights | Target % of Weights | Difference in % at Iteration 1 | Difference in %
Age 18-24, Male | 94727.80 | 95468 | 7.104 | 7.159 | -0.626 | -0.056
Age 18-24, Female | 87222.36 | 90249 | 6.541 | 6.768 | -0.980 | -0.227
Age 25-34, Male | 116312.81 | 118670 | 8.723 | 8.899 | -0.694 | -0.177
Age 25-34, Female | 110348.83 | 112007 | 8.275 | 8.400 | 0.179 | -0.124
Age 35-44, Male | 118670.65 | 117184 | 8.899 | 8.788 | 0.311 | 0.111
Age 35-44, Female | 113723.15 | 113779 | 8.528 | 8.533 | 0.137 | -0.004
Age 45-54, Male | 130207.90 | 127077 | 9.765 | 9.530 | 0.869 | 0.235
Age 45-54, Female | 130419.01 | 127439 | 9.780 | 9.557 | 0.710 | 0.223
Age 55-64, Male | 93001.49 | 95032 | 6.974 | 7.127 | -0.352 | -0.152
Age 55-64, Female | 96092.37 | 97422 | 7.206 | 7.306 | -0.430 | -0.100
Age 65-74, Male | 54156.67 | 54171 | 4.061 | 4.062 | 0.248 | -0.001
Age 65-74, Female | 62303.45 | 61828 | 4.672 | 4.637 | 0.066 | 0.036
Age 75+, Male | 47039.67 | 46515 | 3.528 | 3.488 | 0.244 | 0.039
Age 75+, Female | 79249.83 | 76635 | 5.943 | 5.747 | 0.317 | 0.196




First Control Variable (later iteration; slide title not recovered)

Category | Output Weight Sum of Weights | Target Total | % of Output Weights | Target % of Weights | Difference in % at Iteration 1 | Difference in %
Age 18-24, Male | 95491.87 | 95468 | 7.161 | 7.159 | -0.626 | 0.002
Age 18-24, Female | 90265.83 | 90249 | 6.769 | 6.768 | -0.980 | 0.001
Age 25-34, Male | 118621.93 | 118670 | 8.896 | 8.899 | -0.694 | -0.004
Age 25-34, Female | 111985.21 | 112007 | 8.398 | 8.400 | 0.179 | -0.002
Age 35-44, Male | 117205.13 | 117184 | 8.789 | 8.788 | 0.311 | 0.002
Age 35-44, Female | 113769.71 | 113779 | 8.532 | 8.533 | 0.137 | -0.001
Age 45-54, Male | 127088.93 | 127077 | 9.531 | 9.530 | 0.869 | 0.001
Age 45-54, Female | 127437.46 | 127439 | 9.557 | 9.557 | 0.710 | -0.000
Age 55-64, Male | 95037.18 | 95032 | 7.127 | 7.127 | -0.352 | 0.000
Age 55-64, Female | 97426.08 | 97422 | 7.306 | 7.306 | -0.430 | 0.000
Age 65-74, Male | 54168.73 | 54171 | 4.062 | 4.062 | 0.248 | -0.000
Age 65-74, Female | 61831.76 | 61828 | 4.637 | 4.637 | 0.066 | 0.000
Age 75+, Male | 46503.23 | 46515 | 3.487 | 3.488 | 0.244 | -0.001
Age 75+, Female | 76642.96 | 76635 | 5.748 | 5.747 | 0.317 | 0.001

All differences in % are less than 0.025 in absolute value


Eighth Control Variable

Category | Output Weight Sum of Weights | Target Total | % of Output Weights | Target % of Weights | Difference in % at Iteration 1 | Difference in %
Cell Phone Only | 197101.32 | 197088 | 14.781 | 14.780 | 0.998 | 0.001
Landline Only | 280285.25 | 280297 | 21.019 | 21.020 | -0.757 | -0.001
Landline and Cell Phone | 856089.43 | 856092 | 64.200 | 64.200 | -0.241 | -0.000

**** Program terminated at iteration 7 because all current percents differ from target percents by less than 0.025 ****
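The stopping rule can be checked directly against the figures in this deck. The sketch below recomputes the "Difference in %" column for the education margin at iteration 1 and for the phone-type margin at the final iteration. The function names are hypothetical; the numbers are copied from the tables above.

```python
def pct_differences(output_sums, target_totals):
    """Weighted share minus target share, in percentage points, per category."""
    out_total = sum(output_sums.values())
    tgt_total = sum(target_totals.values())
    return {k: 100 * output_sums[k] / out_total - 100 * target_totals[k] / tgt_total
            for k in target_totals}

def converged(output_sums, target_totals, tol=0.025):
    """True when every category is within `tol` percentage points of target."""
    return all(abs(d) < tol
               for d in pct_differences(output_sums, target_totals).values())

# Education margin at iteration 1 (third control variable).
education_iter1 = {"Less than HS": 89962.05, "HS Grad": 412857.99,
                   "Some College": 388163.96, "College Grad": 442492.00}
education_target = {"Less than HS": 143928, "HS Grad": 414505,
                    "Some College": 448218, "College Grad": 326825}

# Phone-type margin at the final iteration (eighth control variable).
phone_final = {"Cell Phone Only": 197101.32, "Landline Only": 280285.25,
               "Landline and Cell Phone": 856089.43}
phone_target = {"Cell Phone Only": 197088, "Landline Only": 280297,
                "Landline and Cell Phone": 856092}

education_gaps = pct_differences(education_iter1, education_target)
education_ok = converged(education_iter1, education_target)  # far from target
phone_ok = converged(phone_final, phone_target)              # within tolerance
```

The education margin starts well outside the tolerance (college graduates are over-represented by several percentage points), while the final phone-type margin is inside it.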

Weighting (22): BRFSS weights for LL and CP interviews, 2012

Descriptive Statistics, FINAL WEIGHT: LAND-LINE AND CELL-PHONE DATA

N | Minimum | Maximum | Mean | Std. Deviation
475687 | 2.63 | 27725.38 | 510.9614 | 955.04937

At least one person interviewed represented only 2.63 people in his/her state

At least one person interviewed represented over 27,700 people in his/her state

On average, each respondent represented about 511 people in his/her state

Thank you

[email protected]

For more information please contact Centers for Disease Control and Prevention

1600 Clifton Road NE, Atlanta, GA 30333

Telephone: 1-800-CDC-INFO (232-4636)/TTY: 1-888-232-6348

E-mail: [email protected]

Web: http://www.cdc.gov

The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
