Unit 5b: The Logistic Regression Approach to Life Table Analysis
© Andrew Ho, Harvard Graduate School of Education
Unit 5b – Slide 1
http://xkcd.com/881/


- Replicating life table analysis with logistic regression.
- Interpreting coefficients using the noconstant option.
- Fitting the hazard function with polynomial regression.

Course Roadmap: Unit 5b

Multiple Regression Analysis (MRA)

Do your residuals meet the required assumptions?
- Test for residual normality.
- Use influence statistics to detect atypical datapoints.

If your residuals are not independent, replace OLS by GLS regression analysis.
- Use individual growth modeling.
- Specify a multi-level model.

Today's topic area: If time is a predictor, you need discrete-time survival analysis.

If your outcome is categorical, you need to use
- Binomial logistic regression analysis (dichotomous outcome).
- Multinomial logistic regression analysis (polytomous outcome).

If you have more predictors than you can deal with,
- Create taxonomies of fitted models and compare them.
- Form composites of the indicators of any common construct.
- Conduct a Principal Components Analysis.
- Use Cluster Analysis.
- Use Factor Analysis: EFA or CFA?

If your outcome vs. predictor relationship is non-linear,
- Use non-linear regression analysis.
- Transform the outcome or predictor.

The Person-Period Dataset

ID  PERIOD  EVENT
 1       1      1
 2       1      0
 2       2      1
 3       1      1
 4       1      1
 5       1      0
 5       2      0
 5       3      0
 5       4      0
 5       5      0
 5       6      0
 5       7      0
 5       8      0
 5       9      0
 5      10      0
 5      11      0
 5      12      0
 6       1      1
 7       1      0
 7       2      0
 7       3      0
 7       4      0
 7       5      0
 7       6      0
 7       7      0
 7       8      0
 7       9      0
 7      10      0
 7      11      0
 7      12      0
Etc.

In a person-period dataset:
- Each person has one row of data in each time-period.
- Their data record continues until, and includes, the time-period in which they experience the event of interest, or are censored.
- A person cannot be present in a time-period unless they had a value of 0 for EVENT in the previous period. In other words, they must have survived the prior period.
- So, the person-period dataset has been formatted to permit each person to be present in a particular time period only if they are a legitimate member of the risk set in that period.

Notice how, in the person-period dataset, outcome EVENT has been encoded to embody the same conditionality present in the definition of the hazard probability.

In our earlier life-table analysis in the person-period dataset:
- EVENT recorded whether the teacher experienced the event of interest (quitting teaching) in each time PERIOD.

Conceptually, in these analyses:
- EVENT served as a (dichotomous) outcome.
- PERIOD served as a predictor.

So, why not replace life-table analysis by the logistic-regression analysis of EVENT on PERIOD in the person-period dataset?
- From a technical perspective, this turns out to be exactly the right thing to do.
- It's then called Discrete-Time Survival Analysis.

Loading in the dataset

*--------------------------------------------------------------------------------
* Input the person-period dataset, name and label the variables in the dataset.
* Note that this is a different input dataset -- in person-period format, rather
* than the prior person-level format -- than the one that was used in the previous
* data-analytic handout on life-table analysis, in Unit5a.do
*--------------------------------------------------------------------------------
* Input the person-period dataset:
  infile ID PERIOD EVENT P1-P12 ///
      using "C:\Users\Andrew Ho\Documents\Dropbox\S-052\Raw Data\SPEC_ED_PP.txt"

* Label the variables:
  label variable ID "Teacher Identification Code"
  label variable PERIOD "Current Time Period"
  label variable EVENT "Did Teacher Quit in this Time Period?"

* Label the values of important categorical variables:
* Dichotomous event occurrence variable EVENT:
  label define eventlbl 0 "No Quit" 1 "Quit"
  label values EVENT eventlbl

*--------------------------------------------------------------------------------
* Inspect the structure of the new person-period dataset.
* Notice that there is one row per discrete time-period for each person.
*--------------------------------------------------------------------------------
  list ID EVENT PERIOD P1-P12 in 1/40

Here's the Stata code that kicks off Data-Analytic Handout, Unit5b.do, in which I conduct the suggested logistic regression analyses of EVENT. In Unit5a.do, recall that I provided code that allows you to convert the person-level dataset to the person-period dataset.

Here I list the values of EVENT and P1 thru P12 for the few cases we inspected on the previous slide. Here are the time-period indicators -- P1 through P12 -- that were present in the person-period dataset, but were input and ignored up to this point.
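For readers who want to see the idea behind the person-level to person-period conversion, here is a minimal sketch. It is not the code from Unit5a.do. It assumes a person-level dataset in memory with one row per teacher and two hypothetical variables that do not appear in this handout: T (the last period in which the teacher was observed) and CENSOR (1 if the teacher was censored, 0 if the teacher quit in period T).

```stata
* Sketch only: T and CENSOR are hypothetical person-level variables.
* Create one row per teacher per period at risk.
expand T
bysort ID: generate PERIOD = _n

* EVENT is 1 only in a teacher's final row, and only if not censored.
bysort ID: generate EVENT = (_n == _N) & (CENSOR == 0)

* Build one indicator per observed period (P1, P2, ...).
quietly tabulate PERIOD, generate(P)
```

The key design point is that a teacher contributes rows only for periods in which they were still at risk, which is what makes the later logistic regression a model for the hazard.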

Calculating Hazard Probabilities in Person-Period Datasets

tabulate EVENT PERIOD, column

Collapsing the dataset by PERIOD reproduces the column percentages from this tabulation: count(ID) gives us our Total in each PERIOD, sum(EVENT) gives us the number who Quit by PERIOD, and NEVENT/NPERIOD gives us our Hazard Probabilities by PERIOD.
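The calculation described here can be sketched as follows. The variable names NPERIOD, NEVENT, and HAZARDP follow the text, but the exact commands are an assumption about how the slide computed them, and the sketch assumes the person-period dataset is already in memory.

```stata
* Sketch: hazard probabilities by PERIOD from the person-period dataset.
preserve

* One row per PERIOD: size of the risk set and number of quits.
collapse (count) NPERIOD = ID (sum) NEVENT = EVENT, by(PERIOD)

* Hazard probability = quits / teachers at risk in that period.
generate HAZARDP = NEVENT / NPERIOD
list PERIOD NPERIOD NEVENT HAZARDP

restore
```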


Calculating Survival Probabilities in Person-Period Datasets

Using preserve and, at the end, restore allows us to mess with our dataset and get it back at the end.

Output: our collapsed dataset, with HAZARDP (collapsed) and SURVIVEP (calculated).
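One way to get from HAZARDP to SURVIVEP is a running product: the survival probability at the end of a period is the previous survival probability times one minus the current hazard. The sketch below is one possible implementation under that definition, not necessarily the commands used on the slide, and it assumes the person-period dataset is in memory.

```stata
* Sketch: survival probabilities as a running product of (1 - hazard).
preserve

* The mean of the 0/1 variable EVENT within a period is the hazard.
collapse (mean) HAZARDP = EVENT, by(PERIOD)
sort PERIOD

* First period: survival is simply 1 - hazard.
generate SURVIVEP = 1 - HAZARDP in 1

* Later periods: prior survival times (1 - current hazard).
* replace works row by row, so each row sees the updated prior value.
replace SURVIVEP = SURVIVEP[_n-1] * (1 - HAZARDP) in 2/L

list PERIOD HAZARDP SURVIVEP
restore
```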

The "Discrete" of DTSA: The Person-Period Dummy Variables

Col | Var | Variable Description | Labels
1 | ID | Teacher identification code. | Integer
2 | PERIOD | Indicates discrete time period to which record refers. | Integer
3 | EVENT | Dummy variable indicating whether the teacher experienced the event of interest in this period. | 0 = no; 1 = yes
4 | P1 | Is this the first year of the teaching career? | 0 = no; 1 = yes
5 | P2 | Is this the second year of the teaching career? | 0 = no; 1 = yes
6 | P3 | Is this the third year of the teaching career? | 0 = no; 1 = yes
7 | P4 | Is this the fourth year of the teaching career? | 0 = no; 1 = yes
8 | P5 | Is this the fifth year of the teaching career? | 0 = no; 1 = yes
9 | P6 | Is this the sixth year of the teaching career? | 0 = no; 1 = yes
10 | P7 | Is this the seventh year of the teaching career? | 0 = no; 1 = yes
11 | P8 | Is this the eighth year of the teaching career? | 0 = no; 1 = yes
12 | P9 | Is this the ninth year of the teaching career? | 0 = no; 1 = yes
13 | P10 | Is this the tenth year of the teaching career? | 0 = no; 1 = yes
14 | P11 | Is this the eleventh year of the teaching career? | 0 = no; 1 = yes
15 | P12 | Is this the twelfth year of the teaching career? | 0 = no; 1 = yes

To conduct logistic regression analyses in the person-period dataset, we must think about how we represent time PERIOD in our models -- recall that the dataset contains a vector of predictors that we have not yet used.

General Specification of PERIOD
- Dichotomous predictors, P1 thru P12, are defined to distinguish among the discrete time periods.
- For each person in each period, each of the time period indicators, P1 thru P12, is set to 1 in the corresponding period, and 0 in other periods.
- Representing PERIOD by this vector of dummies in our logistic regression analysis provides the most general specification possible for any potential relationship between EVENT and PERIOD.

The "Discrete" of DTSA: Person-Period Dummies as Time Period Indicators

     +------------------------------------------------------------------------+
     | ID  EVENT    PERIOD  P1  P2  P3  P4  P5  P6  P7  P8  P9  P10  P11  P12 |
     |------------------------------------------------------------------------|
  1. |  1  Quit          1   1   0   0   0   0   0   0   0   0    0    0    0 |
  2. |  2  No Quit       1   1   0   0   0   0   0   0   0   0    0    0    0 |
  3. |  2  Quit          2   0   1   0   0   0   0   0   0   0    0    0    0 |
  4. |  3  Quit          1   1   0   0   0   0   0   0   0   0    0    0    0 |
  5. |  4  Quit          1   1   0   0   0   0   0   0   0   0    0    0    0 |
     |------------------------------------------------------------------------|
  6. |  5  No Quit       1   1   0   0   0   0   0   0   0   0    0    0    0 |
  7. |  5  No Quit       2   0   1   0   0   0   0   0   0   0    0    0    0 |
  8. |  5  No Quit       3   0   0   1   0   0   0   0   0   0    0    0    0 |
  9. |  5  No Quit       4   0   0   0   1   0   0   0   0   0    0    0    0 |
 10. |  5  No Quit       5   0   0   0   0   1   0   0   0   0    0    0    0 |
     |------------------------------------------------------------------------|
 11. |  5  No Quit       6   0   0   0   0   0   1   0   0   0    0    0    0 |
 12. |  5  No Quit       7   0   0   0   0   0   0   1   0   0    0    0    0 |
 13. |  5  No Quit       8   0   0   0   0   0   0   0   1   0    0    0    0 |
 14. |  5  No Quit       9   0   0   0   0   0   0   0   0   1    0    0    0 |
 15. |  5  No Quit      10   0   0   0   0   0   0   0   0   0    1    0    0 |
     |------------------------------------------------------------------------|
 16. |  5  No Quit      11   0   0   0   0   0   0   0   0   0    0    1    0 |
 17. |  5  No Quit      12   0   0   0   0   0   0   0   0   0    0    0    1 |
 18. |  6  Quit          1   1   0   0   0   0   0   0   0   0    0    0    0 |
 39. | 12  No Quit       4   0   0   0   1   0   0   0   0   0    0    0    0 |
 40. | 12  No Quit       5   0   0   0   0   1   0   0   0   0    0    0    0 |
     +------------------------------------------------------------------------+

Here are the values of the time-period indicators for a few teachers from the person-period dataset. Here are the original 12 years of data on the time periods in which Teacher #5 was present in the person-period dataset.

The time-period indicators, P1 - P12, identify each time-period in a very general way:
- In the 1st time period: P1 = 1, P2 thru P12 = 0.
- In the 2nd time period: P2 = 1, P1 & P3 thru P12 = 0.
- In the 12th time period: P12 = 1, P1 thru P11 = 0.
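A quick way to confirm that the indicators behave as described is to check that exactly one of them is switched on in every row. This is a small sanity-check sketch, assuming the person-period dataset is in memory; NINDICATORS is a hypothetical scratch variable, not part of the dataset.

```stata
* Sketch: each row should have exactly one time-period indicator set to 1.
egen NINDICATORS = rowtotal(P1-P12)
assert NINDICATORS == 1

* The indicator that is on should match PERIOD,
* e.g. P5 equals 1 exactly when PERIOD equals 5.
assert P5 == (PERIOD == 5)

drop NINDICATORS
```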

[Figure: Hazard Function]

tabulate EVENT PERIOD, column

You might notice that the Hazard Function shows the conditional means of the dichotomous variable, EVENT, at each value of the predictor variable, PERIOD. If we wanted to model these means, and test the null hypothesis that all means are equal, how might we do it?

In the population, are hazard probabilities different across years of teaching?

A Model for each of the Means

[Figure: Hazard Function]

tabulate EVENT PERIOD, column

We could fit this model with the dummy variables that we have:

regress EVENT P1-P12
// OR
// The i. notation auto-creates dummy variables
regress EVENT i.PERIOD

There are two problems with this statistical model as written. What are they?

A Model for each of the Logits?

A model for the log-odds (logits) of teachers exiting the system for the first time, given survival through a given number of years of teaching.

We might think of PERIOD as a continuous variable, but let's start by trying to reproduce the Hazard Probabilities at each discrete period, in the same way that we would estimate probabilities for a large number of racial/ethnic groups or polychotomies/categories.
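Written out, a model of this kind might look like the following. The notation is an assumption, since the slide's own equation is not in the transcript: h(t_ij) stands for the hazard probability for teacher i in period j, and the alphas are one logit per period, with no separate intercept.

```latex
\operatorname{logit} h(t_{ij})
  = \ln\!\left(\frac{h(t_{ij})}{1 - h(t_{ij})}\right)
  = \alpha_1 P1_{ij} + \alpha_2 P2_{ij} + \cdots + \alpha_{12} P12_{ij}
```

Because exactly one indicator equals 1 in any row, the fitted logit in period j is simply the coefficient for that period.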

The Discrete-Time Hazard Model: Reproducing Life Tables

Period | Percentage | Odds | Log-Odds
P1 | 11.57% | 0.131 | -2.034
P2 | 11.02% | 0.124 | -2.089
P3 | 11.58% | 0.131 | -2.033
P4 | 10.76% | 0.121 | -2.116
P5 | 8.91% | 0.098 | -2.325
P6 | 8.25% | 0.090 | -2.408
P7 | 6.01% | 0.064 | -2.749
P8 | 4.81% | 0.051 | -2.985
P9 | 4.22% | 0.044 | -3.122
P10 | 3.69% | 0.038 | -3.261
P11 | 2.47% | 0.025 | -3.676
P12 | 1.28% | 0.013 | -4.346

[Figure: Hazard Function]
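As a worked check on how the three rows of numbers relate, here is the conversion for the first period, using only the P1 values reported on this slide. The symbol h_1 for the period-1 hazard probability is notation introduced here for illustration.

```latex
h_1 = 0.1157, \qquad
\text{odds}_1 = \frac{h_1}{1 - h_1} = \frac{0.1157}{0.8843} \approx 0.131, \qquad
\text{logit}_1 = \ln(0.131) \approx -2.034

% and back again, from the logit to the probability:
h_1 = \frac{1}{1 + e^{-(-2.034)}} = \frac{1}{1 + e^{2.034}} \approx 0.1157
```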

A No-Constant (Zero-Constant) Model

[Figure: Hazard Function]
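A no-constant model of this kind might be fit as sketched below. This assumes the person-period dataset is in memory; the slide's actual output is not in the transcript, so treat this as an illustration of the noconstant option, not a reproduction of the slide.

```stata
* Sketch: the fully general discrete-time hazard model with no constant.
logit EVENT P1-P12, noconstant

* With no constant, each coefficient is the sample log-odds of quitting
* in that period. invlogit() maps it back to a hazard probability.
display invlogit(_b[P1])
display invlogit(_b[P12])
```

Dropping the constant is what lets all 12 indicators enter the model at once: each coefficient is then a period's own logit, not a difference from a reference period.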

How Logistic Models Replicate (and Extend?) Life Table Analyses

Logistic Regression provides us a statistical model for Hazard Probabilities and allows us to ask questions about differences in Hazard Probabilities in the population. Does the probability of exit really decline over time in the population (conditional on survival to that point)?

Now, we can extend this analysis by adding predictors (What about certified teachers? Age? The year that they started?). And, instead of modeling the logit at each PERIOD, we can use a more parsimonious model for the trajectory of Hazard Probabilities over time.
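One way to ask whether hazard probabilities differ across periods in the population is a joint test on the period indicators. The sketch below uses the i. notation with a constant, so the null hypothesis that all period effects are zero is the same as the null that every period has the same hazard. It assumes the person-period dataset is in memory and is offered as one reasonable approach, not as the handout's own code.

```stata
* Sketch: period 1 is the reference category; the remaining 11
* coefficients are differences in logits from period 1.
logit EVENT i.PERIOD

* Joint Wald test that all period differences are zero,
* i.e. that the hazard is the same in every period.
testparm i.PERIOD
```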


Instead of logit EVENT P2-P12, why not logit EVENT PERIOD?

What is the estimated change in the Hazard Probability (in logits) per unit PERIOD? Is this change different from 0 in the population?

Preparing for some polynomial regression: linear, quadratic, and cubic fits to the Hazard function.
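A sketch of the linear specification and the preparation for polynomial fits follows. PERIOD2 and PERIOD3 are hypothetical names for the squared and cubed terms; the handout may use different names. The person-period dataset is assumed to be in memory.

```stata
* Sketch: treat PERIOD as continuous, so one slope describes the
* change in the logit of the hazard per unit PERIOD.
logit EVENT PERIOD

* Wald test of whether that slope differs from 0 in the population.
test PERIOD

* Polynomial terms for the quadratic and cubic fits (hypothetical names).
generate PERIOD2 = PERIOD^2
generate PERIOD3 = PERIOD^3
```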


A linear model for the logits

When PERIOD = 0, the estimated logit of exiting the system is -1.76. Remember your logit scale: this is a fitted probability of 14.7%.

This is a linear model. Why are the fitted probabilities clearly curvilinear? And does this seem like a good fit to you?
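The conversion from the intercept to the fitted probability, and the reason the fitted probabilities curve, can both be read off the inverse-logit formula. The symbols below are notation introduced here; only the intercept value -1.76 comes from the slide, and the slope is left symbolic because it is not reported in the text.

```latex
\hat{p}(\text{PERIOD})
  = \frac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 \,\text{PERIOD})}}

% At PERIOD = 0, with \hat{\beta}_0 = -1.76:
\hat{p}(0) = \frac{1}{1 + e^{1.76}} \approx \frac{1}{1 + 5.81} \approx 0.147
```

The model is linear in the logit, but the inverse-logit transformation is nonlinear, so a straight line on the logit scale becomes a curve on the probability scale.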


Quadratic Fit

When PERIOD = 0, the estimated logit of exiting the system is -2.06. Remember your logit scale: this is a fitted probability of 11.3%.

Remember that coefficients from polynomial regression equations are, like coefficients from all interactions, difficult to interpret on their own, so we graph the fitted values.

Is this a quadratic function? Does this seem like a better fit to you?
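A quadratic fit and its graph might be produced as sketched below. This uses Stata's factor-variable notation in place of a hand-built squared term, which is a choice made here and not necessarily the handout's; PHAT_QUAD is a hypothetical name for the fitted probabilities. The person-period dataset is assumed to be in memory.

```stata
* Sketch: quadratic model for the logit of the hazard.
* c.PERIOD##c.PERIOD expands to PERIOD and PERIOD squared.
logit EVENT c.PERIOD##c.PERIOD

* Fitted hazard probabilities on the probability scale.
predict PHAT_QUAD, pr

* Plot fitted hazard against PERIOD.
twoway connected PHAT_QUAD PERIOD, sort
```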


Cubic Fit

When PERIOD = 0, the estimated logit of exiting the system is -2.14. Remember your logit scale: this is a fitted probability of 10.5%.

Remember that coefficients from polynomial regression equations are, like coefficients from all interactions, difficult to interpret on their own, so we graph the fitted values.

Is this a cubic function? Does this seem like a better fit to you?
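Beyond judging the graph by eye, one way to ask whether the cubic term improves on the quadratic is a likelihood-ratio test between the two nested models. This comparison is an addition for illustration, not something shown on the slide; QUAD and CUBIC are hypothetical names for the stored estimates, and the person-period dataset is assumed to be in memory.

```stata
* Sketch: compare nested quadratic and cubic models for the hazard.
logit EVENT c.PERIOD##c.PERIOD
estimates store QUAD

logit EVENT c.PERIOD##c.PERIOD##c.PERIOD
estimates store CUBIC

* Likelihood-ratio test of the added cubic term.
lrtest QUAD CUBIC
```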