Preparing Your Data for Analysis

September 13-14, 2005, St. Louis, Missouri

Sponsored by the U.S. Department of Housing and Urban Development

Steve Poulin, University of Pennsylvania

Larry Buron, Abt Associates Inc.


General Data Quality Issues

• Missing Data
  • Missing Cases
  • Missing Responses

• Inaccurate Data


Missing Data Effects

• Biases results if subjects with valid data are significantly different from those with missing data

• Concern about the effects of bias increases with the number of missing values

• Missing personal identifying data (Social Security Number, name, birth date, and gender) makes it more difficult to unduplicate client records, thereby inflating counts of homeless persons

• Missing Program Exit Dates make it appear that clients have never exited a shelter, thereby overstating their length of time in shelter and inflating the count of homeless persons in the time period

• “Don’t Know” and “Refused” responses have the same effect as blanks, although they may be useful for reducing the number of missing responses


Checking for Missing Data

Missing Cases

• A one-day census for any shelter can be calculated by selecting cases with a Program Entry Date less than or equal to a particular date AND either:
  • a Program Exit Date greater than or equal to the date (or possibly the next date, depending on how exit date is collected), or
  • a Program Exit Date that is null

• The occupancy rate for a date can be calculated by dividing the one-day census by the shelter’s capacity

• Lower than expected occupancy rates suggest that the shelter is not recording all persons served

• Higher than expected occupancy rates suggest a failure to enter exit dates
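
The census and occupancy checks can be sketched in a few lines of Python. This is a minimal illustration, not an HMIS feature: it assumes each stay is an (entry date, exit date) pair with `None` standing in for a null exit date, and the function names and sample dates are made up. It counts stay records, so a client with overlapping records would be counted more than once.

```python
from datetime import date


def one_day_census(stays, census_date):
    """Count stays that are active on census_date.

    A stay is active when its entry date is on or before census_date and
    its exit date is either null (None) or on or after census_date.
    """
    return sum(
        1
        for entry, exit_ in stays
        if entry <= census_date and (exit_ is None or exit_ >= census_date)
    )


def occupancy_rate(stays, census_date, capacity):
    """One-day census divided by the shelter's bed capacity."""
    return one_day_census(stays, census_date) / capacity


# Hypothetical stay records for one shelter with 4 beds.
stays = [
    (date(2005, 9, 1), date(2005, 9, 10)),
    (date(2005, 9, 5), None),               # exit date never entered
    (date(2005, 9, 12), date(2005, 9, 13)),
    (date(2005, 8, 1), date(2005, 8, 15)),
]
census = one_day_census(stays, date(2005, 9, 8))
rate = occupancy_rate(stays, date(2005, 9, 8), capacity=4)
```

If exit dates are recorded as the morning after the last night, the `>=` comparison on the exit date would become `>`; which one applies depends on local data entry practice.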


Checking for Missing Data

Missing Responses

• The potential for bias and reasons for missing responses can be explored by comparing profiles of persons with missing responses to those with valid responses

• Compare the percentage of missing responses across providers to determine if it is a site-specific training issue
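
A rough sketch of the provider comparison follows. It assumes client records are dictionaries with a `provider` key (a hypothetical layout) and treats blanks, “Don’t Know”, and “Refused” alike as missing, in line with the earlier point that they have the same effect.

```python
from collections import defaultdict

# Values counted as missing; adjust to the codes your HMIS actually stores.
MISSING_VALUES = {None, "", "Don't Know", "Refused"}


def percent_missing_by_provider(records, field):
    """Return {provider: percent of records where `field` is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        provider = rec["provider"]
        totals[provider] += 1
        if rec.get(field) in MISSING_VALUES:
            missing[provider] += 1
    return {p: 100.0 * missing[p] / totals[p] for p in totals}


# Hypothetical records for two providers.
records = [
    {"provider": "Shelter A", "veteran": "Yes"},
    {"provider": "Shelter A", "veteran": None},
    {"provider": "Shelter B", "veteran": "Refused"},
    {"provider": "Shelter B", "veteran": ""},
]
rates = percent_missing_by_provider(records, "veteran")
```

A provider whose rate stands well above the others is a candidate for targeted training.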


Inaccurate Data

• Misrepresents the description of homeless clients

• Inaccurate personal identifying data compromises the unduplication process


Checking for Inaccurate Data

• Out-of-range category codes

• Program entry dates greater than program exit dates
  • results in negative length-of-stay values

• Birth dates greater than program entry dates
  • results in negative ages

• Look at the distribution of each variable by provider type to look for values that do not make sense (e.g., over-age clients at a youth shelter or a person over 100 years old at any shelter)
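
These checks lend themselves to a simple per-record validation routine. The sketch below is illustrative only: the field names, the set of valid gender codes, and the 100-year cutoff are assumptions, not values taken from the Data Standards.

```python
from datetime import date

# Hypothetical code list; substitute the codes your HMIS actually uses.
VALID_GENDER_CODES = {0, 1, 2, 3}


def find_problems(record):
    """Return a list of data quality problems found in one client record."""
    problems = []
    if record["gender"] not in VALID_GENDER_CODES:
        problems.append("out-of-range gender code")
    entry, exit_, birth = record["entry_date"], record.get("exit_date"), record["birth_date"]
    if exit_ is not None and entry > exit_:
        problems.append("entry date after exit date")
    if birth > entry:
        problems.append("birth date after entry date")
    # Age in whole years at program entry.
    age = entry.year - birth.year - ((entry.month, entry.day) < (birth.month, birth.day))
    if age > 100:
        problems.append("age over 100")
    return problems


bad = {"gender": 9, "birth_date": date(2006, 1, 1),
       "entry_date": date(2005, 9, 13), "exit_date": date(2005, 9, 1)}
good = {"gender": 1, "birth_date": date(1970, 5, 1),
        "entry_date": date(2005, 9, 13), "exit_date": None}
bad_problems = find_problems(bad)
good_problems = find_problems(good)
```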


Special Data Standards Issues

Social Security Number

• If only a partial SSN is recorded, the database should fill in the missing numbers with blanks so that the provided numbers are saved in the correct place of the Social Security Number
  • This maximizes matching ability
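
One way to picture the blank-padding rule is shown below. The sketch assumes the known digits are the trailing ones (for example, the last four), which is common but not stated on the slide; the function names are hypothetical.

```python
SSN_LENGTH = 9


def store_partial_ssn(last_digits):
    """Right-align known trailing digits in a 9-character field, padding with blanks."""
    if not last_digits.isdigit() or len(last_digits) > SSN_LENGTH:
        raise ValueError("expected 1 to 9 digits")
    return last_digits.rjust(SSN_LENGTH)


def ssn_could_match(a, b):
    """True when two stored SSNs agree at every position where both have a digit."""
    return all(x == y for x, y in zip(a, b) if x != " " and y != " ")


partial = store_partial_ssn("6789")   # five blanks followed by "6789"
full = store_partial_ssn("123456789")
```

Because the digits sit in their correct positions, a partial SSN can still be compared position by position against a complete one.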


Special Data Standards Issues

Name, Date of Birth, and Gender

• A combination of these data elements can be used to create an alternative unique identifier if Social Security Number is missing

• Missing elements of Name (first name, middle name, last name) or Date of Birth (month, day, year) make the alternative identifier less unique

• Names must be recorded consistently to maximize matching
  • nicknames should be avoided

• Format for birth dates (MM/DD/YYYY) must be consistent to create alternative identifiers that match
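
As an illustration, an alternative identifier could be assembled as follows. The layout (last name, first name, birth date, gender, separated by `|`) is an assumption made for this sketch, not a format prescribed by the Data Standards.

```python
from datetime import date


def alternative_id(first_name, last_name, birth_date, gender):
    """Build a match key from normalized name, MM/DD/YYYY birth date, and gender."""
    parts = [
        last_name.strip().upper(),
        first_name.strip().upper(),
        birth_date.strftime("%m/%d/%Y"),
        gender.strip().upper()[:1],
    ]
    return "|".join(parts)


# Inconsistent spacing and capitalization still produce the same key.
id_a = alternative_id(" john ", "Smith", date(1970, 1, 5), "m")
id_b = alternative_id("John", "SMITH ", date(1970, 1, 5), "Male")
```

Normalizing case and whitespace handles entry inconsistencies, but it cannot reconcile a nickname with a legal name, which is why nicknames should be avoided at intake.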


Special Data Standards Issues

Race

• HMIS should allow more than one of the five race categories to be selected
  • This will require the creation of multiple fields in the HMIS (e.g., race1, race2, etc.)


Special Data Standards Issues

Residence Prior to Program Entry

• Residences can be recorded in greater detail than the 17 categories available in the Data Standards, but they must fit in one and only one of these categories

• Residences that do not fit in any of the 16 specified categories must be recorded as Other


Special Data Standards Issues

Program Entry Date & Program Exit Date

• Missing Program Exit Dates are a common problem, especially if exits are frequent or ambiguous
  • Frequent exits occur in shelters that do not reserve beds for more than one night; clients must be re-admitted every night. These shelters should think about automating the creation of an exit date or entering the exit date at the same time as the entry date.
  • Ambiguous exits occur when beds are reserved for clients who leave shelters temporarily

• When in doubt whether an exit has occurred, record the date of exit

• Shelter visits can be “reconstructed” by calculating the time between the last exit date and the next entry date
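
Reconstruction can be sketched as merging one client's consecutive stay records whenever the gap between an exit and the next entry is short. The one-day threshold below is an assumption to be tuned locally, and the sketch expects every record to have an exit date.

```python
from datetime import date


def reconstruct_visits(stays, max_gap_days=1):
    """Merge (entry, exit) records separated by at most max_gap_days."""
    visits = []
    for entry, exit_ in sorted(stays):
        if visits and (entry - visits[-1][1]).days <= max_gap_days:
            # Gap is short enough: extend the current visit.
            visits[-1] = (visits[-1][0], max(visits[-1][1], exit_))
        else:
            visits.append((entry, exit_))
    return visits


# Hypothetical nightly records for one client, followed by a later return.
nightly = [
    (date(2005, 9, 1), date(2005, 9, 2)),
    (date(2005, 9, 2), date(2005, 9, 3)),
    (date(2005, 9, 3), date(2005, 9, 4)),
    (date(2005, 9, 20), date(2005, 9, 21)),
]
visits = reconstruct_visits(nightly)
```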


Improving Data Quality

• Feedback

• Training

• Sanctions


Feedback

• Data quality problems should be regularly communicated to service providers

• Reports of value to the service providers should be provided to encourage a vested interest in the quality of the data collected


Training

• Commonly occurring data quality issues should be identified by HMIS administrators

• Training sessions for service provider staff should be organized to address these issues


Sanctions

• As the most extreme measure, shelter providers could be sanctioned for submitting poor quality data

• Sanctions may range from withholding funds to denying the provision of HMIS reports

• Gentle persuasion is always preferred


Data Manipulation

• This phase of data preparation usually takes longer than the analysis!

• Major steps:
  • Recoding data values
  • Computing new data values
  • Merging datasets


Recoding Data Values

• Categories may be collapsed into fewer, more meaningful categories
  • e.g., the 17 types of residences prior to program entry could be collapsed into the following categories: Street, Housed, Institutional, Other, and Unknown

• Discrete values can be collapsed into categories
  • e.g., specific ages can be recoded into age categories, such as 0-17, 18-30, 31-50, etc.

• Recoding data can improve categorical statistical analysis techniques, such as chi-square analysis
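
Both kinds of recoding can be sketched briefly. The age bands follow the example above, with "51+" added as an assumed top band; the residence labels in the mapping are illustrative stand-ins, not the official Data Standards categories.

```python
def age_category(age):
    """Recode an age in years into a category."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age <= 17:
        return "0-17"
    if age <= 30:
        return "18-30"
    if age <= 50:
        return "31-50"
    return "51+"


# Hypothetical detailed residence labels mapped to collapsed groups.
RESIDENCE_GROUPS = {
    "Place not meant for habitation": "Street",
    "Rented housing unit": "Housed",
    "Staying with family": "Housed",
    "Jail or prison": "Institutional",
    "Hospital": "Institutional",
}


def collapse_residence(residence):
    """Map a detailed residence to Street, Housed, Institutional, Other, or Unknown."""
    if residence is None:
        return "Unknown"
    return RESIDENCE_GROUPS.get(residence, "Other")
```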


Computing New Data Values

• Length of stay in shelter can be computed by subtracting Program Entry Date from Program Exit Date (and possibly adding “1” depending on how exit data are collected)

• Age at program entry is computed by subtracting Date of Birth from Program Entry Date

• Computing new data values creates more data for analysis
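
Both computations are short in Python. In this sketch the `count_exit_day` flag stands for the "possibly adding 1" adjustment, and age is expressed in whole years; the function names are hypothetical.

```python
from datetime import date


def length_of_stay(entry_date, exit_date, count_exit_day=False):
    """Days between entry and exit, plus 1 if the exit day itself was a night in shelter."""
    return (exit_date - entry_date).days + (1 if count_exit_day else 0)


def age_at_entry(birth_date, entry_date):
    """Age in whole years on the program entry date."""
    had_birthday = (entry_date.month, entry_date.day) >= (birth_date.month, birth_date.day)
    return entry_date.year - birth_date.year - (0 if had_birthday else 1)


los = length_of_stay(date(2005, 9, 1), date(2005, 9, 10))
los_inclusive = length_of_stay(date(2005, 9, 1), date(2005, 9, 10), count_exit_day=True)
age = age_at_entry(date(1970, 9, 14), date(2005, 9, 13))
```

A negative result from either function is a sign of the date errors described under Checking for Inaccurate Data.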


Merging Datasets

• Two types of merging data from different datasets may be necessary
  • Adding cases
  • Adding variables


Merging Datasets

• Adding cases may involve merging clients from different datasets into one dataset, creating a bigger dataset with more clients for analysis
  • This process is facilitated by the use of the same field names in each of the merged datasets

• Adding variables merges variables from different datasets for the same clients
  • Will require use of a key variable that connects data for the same clients, such as a client ID
  • The new dataset will contain more information per client than the original datasets, thereby expanding the opportunities for analysis
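
The two kinds of merge can be sketched with plain lists of dictionaries. The field names (`client_id`, `age`, `veteran`) are hypothetical, and statistical packages offer equivalent built-in operations.

```python
def add_cases(*datasets):
    """Stack datasets that share the same field names into one longer dataset."""
    merged = []
    for dataset in datasets:
        merged.extend(dataset)
    return merged


def add_variables(left, right, key="client_id"):
    """Attach fields from `right` to matching rows in `left`, joined on `key`."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key.get(row[key], {})} for row in left]


shelter_a = [{"client_id": 1, "age": 34}]
shelter_b = [{"client_id": 2, "age": 51}]
veteran_status = [{"client_id": 1, "veteran": "Yes"}]

all_clients = add_cases(shelter_a, shelter_b)
enriched = add_variables(all_clients, veteran_status)
```

Clients with no match in the second dataset keep their original fields only, so the new variable is missing for them.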


What Is Extrapolation?

• Extrapolation is a method for estimating the total number of people receiving homeless residential services when some, but not all, of the residential service providers participate in HMIS.


Why Do We Need To Extrapolate?

• Because otherwise we will undercount the number of people who use homeless residential services if some providers do not participate in HMIS

• If all providers participate in HMIS, extrapolation is not needed


Methods of Extrapolation

• Simple proportionate extrapolation works when providers who do not participate in HMIS can be considered “missing at random.”

• A regression-based extrapolation method can work if providers are not missing at random, as long as there is some overlap between the types of providers that do and do not participate.

• If non-participating providers are very different from all participating providers (in terms of bed utilization or clientele characteristics), it may not be possible to extrapolate accurately.


Simple Proportionate Extrapolation

• Proportionate extrapolation estimates the number of homeless persons served by non-participating providers as a proportion of the number of homeless persons served by participating providers. The proportion is determined by comparing the size of non-participating providers to participating providers.

• For example, if the group of non-participating providers is the same size as the group of participating providers, then estimate that the non-participating providers serve the same number of homeless persons as participating providers


What Measure Of Provider Size Should Be Used?

• Need a measure of size that is correlated with the number of homeless persons served and is measurable for non-participating providers

• For shelter providers, bed capacity—the total number of beds a provider has to serve homeless clients—has the needed characteristics for the measure of size

• In general, the higher the bed capacity, the more clients a provider serves

• Bed capacity of a provider can be obtained whether or not the provider participates in HMIS


How to Calculate Estimate for Non-Participants

• For HMIS participants, calculate the average number of clients served per bed as:

  Total # of unique persons sheltered ÷ Total bed capacity

• The estimated number of clients served by non-participants is:

  Average # of clients served per bed of the HMIS participants × Total bed capacity of the HMIS non-participants


Total Estimate of Homeless Persons

• Actual number of homeless persons using HMIS participating providers

+ Estimated number of homeless persons using non-participating providers


Example

                              # of Providers   # of Beds   # of Homeless Persons
Participating Providers             20             750            2,250
Non-Participating Providers         10             250              ?

a. For participating providers, the number of clients per available bed is 2,250/750 = 3.

b. Estimate of number of clients for non-participating providers is 3 * 250 = 750.

c. Total estimate of homeless persons is 2,250 + 750 = 3,000.
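
The same arithmetic can be written as a small function, shown here with the figures from this example. The function and parameter names are hypothetical.

```python
def proportionate_estimate(participant_clients, participant_beds, nonparticipant_beds):
    """Return (clients per bed, estimate for non-participants, total estimate)."""
    clients_per_bed = participant_clients / participant_beds
    nonparticipant_estimate = clients_per_bed * nonparticipant_beds
    total_estimate = participant_clients + nonparticipant_estimate
    return clients_per_bed, nonparticipant_estimate, total_estimate


per_bed, estimated, total = proportionate_estimate(
    participant_clients=2250, participant_beds=750, nonparticipant_beds=250
)
```

Note that the number of providers (20 and 10) plays no part in the calculation; only bed capacity is used as the measure of size.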


Conduct Extrapolation Separately for Different Types of Providers

• Extrapolation will be more accurate if it is done separately for different types of providers that are likely to have a different number of clients served per unit of size (because of different utilization or turnover rates)

• For example, for homeless residential service providers, you may want to separate as follows:
  • Emergency shelter beds for individuals
  • Emergency shelter beds for families
  • Transitional housing beds for individuals
  • Transitional housing beds for families
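
A stratified version of the proportionate calculation might look like the sketch below. The client and bed counts are invented purely to show the mechanics, and the dictionary layout is an assumption.

```python
def extrapolate_by_type(strata):
    """Apply proportionate extrapolation within each provider type.

    `strata` maps a provider type to its participants' client count and bed
    capacity plus the non-participants' bed capacity.
    """
    estimates = {}
    for provider_type, s in strata.items():
        clients_per_bed = s["participant_clients"] / s["participant_beds"]
        estimates[provider_type] = (
            s["participant_clients"] + clients_per_bed * s["nonparticipant_beds"]
        )
    return estimates


# Hypothetical figures for two provider types with different turnover.
strata = {
    "Emergency shelter, individuals": {
        "participant_clients": 1200, "participant_beds": 300, "nonparticipant_beds": 100,
    },
    "Transitional housing, families": {
        "participant_clients": 300, "participant_beds": 200, "nonparticipant_beds": 50,
    },
}
estimates = extrapolate_by_type(strata)
```

In this made-up case the emergency shelters serve 4 clients per bed and the transitional programs 1.5, which is exactly the kind of difference a pooled rate would blur.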


Total Number of People Using Homeless Services

• Once you have an estimate of homeless clients for different types of service providers, you may want to aggregate these estimates to arrive at an estimate of the total number of homeless clients served in your entire jurisdiction.

• To get a unique count of people using homeless services, you need to eliminate double counting of people who used more than one type of provider.


Double Counting Illustration

[Figure: two overlapping circles showing users of two provider types. A + C = unduplicated count of users of Provider Type X; B + C = unduplicated count of users of Provider Type Y; C = overlap (people who used both types).]
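
In code, eliminating the double count amounts to subtracting the overlap once, or taking the union of client IDs when client-level data are available. The client IDs below are made up for illustration.

```python
def unduplicated_total(count_x, count_y, overlap):
    """Unique users of either provider type: each count minus the shared overlap once."""
    return count_x + count_y - overlap


# Hypothetical unduplicated client IDs for each provider type.
users_x = {101, 102, 103, 104}
users_y = {103, 104, 105}

overlap = len(users_x & users_y)
total_from_counts = unduplicated_total(len(users_x), len(users_y), overlap)
total_from_ids = len(users_x | users_y)
```

Simply adding the two unduplicated counts would give 7 here, counting the two shared clients twice.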


Regression-Based Extrapolation

• With this technique, you estimate the number and characteristics of clients served based on the characteristics of the providers

• For example, you regress the number of clients served on the size of the provider, the type of provider, and other provider characteristics (e.g., special populations served) for participating providers.

• Then you apply the model to non-participating providers to estimate the number and characteristics of people they served.

• The benefit of this method is that it bases the extrapolation on more than just the size of non-participating providers.
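
A bare-bones sketch of the idea, using ordinary least squares written with the standard library only. The provider data are invented, the two predictors (bed capacity and an emergency shelter indicator) are assumptions chosen for illustration, and a real analysis would normally use a statistics package and examine model fit.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(features, targets):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, feature_row):
    """Predicted clients served for one provider."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], feature_row))


# Hypothetical participating providers: (bed capacity, 1 if emergency shelter else 0).
features = [(20, 0), (40, 0), (30, 1), (60, 1), (50, 0)]
clients_served = [50, 90, 120, 180, 110]

coefs = fit_ols(features, clients_served)
# Estimate for a hypothetical non-participating 25-bed emergency shelter.
estimate = predict(coefs, (25, 1))
```

The model is fitted on participating providers only and then applied to the known characteristics of each non-participating provider.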


Non-Participating Providers are “Missing at Random”

• Compare the non-size characteristics of participating and non-participating providers.

• For counts of the number of homeless people served, compare whether participating and non-participating providers serve the same number of people per unit of size.

• For the characteristics of homeless people, compare the characteristics of people served by participating and non-participating providers.
  • If you don’t have client-level data, you can compare the service populations. Men? Women? Veterans? DV Victims? Special needs populations? Age? Race/ethnicity?


Concluding Thoughts on Extrapolation

• Achieving 100-percent participation in your HMIS will result in more accurate estimates and eliminate the need to extrapolate.

• Extrapolation techniques are more accurate the higher your participation rate. The rules of thumb are:
  • 75% participation generally results in very good estimates;
  • with a participation rate below 50%, estimates are less reliable (unless the participants are truly a random sample)