An Introduction to Group-Based Trajectory Modeling and PROC TRAJ

Richard Charnigo
Professor of Statistics and Biostatistics
Director of Statistics and Psychometrics Core, [email protected]

Objectives

First ~80 minutes:

1. Be able to describe a group-based trajectory model and, in particular, distinguish it from a conventional regression model.

2. Be able to interpret results obtained from fitting a group-based trajectory model via PROC TRAJ.

Last ~40 minutes:

3. Be able to fit a group-based trajectory model via PROC TRAJ.

Motivating example

The Excel file at {www.richardcharnigo.net/traj} contains a simulated data set:

Five hundred college freshmen (“ID”) were asked to estimate how many times per month they consumed marijuana during their freshman (“Y1”), sophomore (“Y2”), junior (“Y3”), and senior (“Y4”) years of high school.

Later they were asked to estimate their marijuana use during freshman year of college (“Y5”).

They were also assessed on reward seeking; for ease of interpretation, we standardize this variable (“X”).


Two possible “research questions” are:

i. What are prototypical trajectories of marijuana use within the population of college students from which this sample was drawn?

ii. Is the trajectory that best describes the experience of a particular student associated with that student’s level of reward seeking?

We can develop more complicated and realistic scenarios ( e.g., with additional personality variables and/or interventions ), but this simple scenario will help us begin to understand group-based trajectory modeling and PROC TRAJ.

Exploratory data analysis

Before pursuing group-based trajectory ( or any other statistical ) modeling, we are well-advised to perform exploratory data analysis.

This can alert us to gross mistakes in the data set, heretofore undetected, which may otherwise threaten the validity of our results.

This can also suggest an appropriate probability distribution to use with the group-based trajectory model and help us to anticipate what the results may be.


Descriptive statistics for Y1

Quantiles (Definition 5)

Quantile      Estimate
100% Max             4
99%                  3
95%                  2
90%                  1
75% Q3               1
50% Median           0
25% Q1               0
10%                  0
5%                   0
1%                   0
0% Min               0

Basic Statistical Measures

Location                Variability
Mean      0.362000      Std Deviation          0.71553
Median    0.000000      Variance               0.51198
Mode      0.000000      Range                  4.00000
                        Interquartile Range    1.00000


Descriptive statistics for Y5

Quantiles (Definition 5)

Quantile      Estimate
100% Max            14
99%                 12
95%                  9
90%                  7
75% Q3               1
50% Median           0
25% Q1               0
10%                  0
5%                   0
1%                   0
0% Min               0

Basic Statistical Measures

Location                Variability
Mean      1.454000      Std Deviation          2.91563
Median    0.000000      Variance               8.50089
Mode      0.000000      Range                 14.00000
                        Interquartile Range    1.00000


The preceding slides show descriptive statistics for Y1 and Y5. ( We can similarly examine descriptive statistics for Y2, Y3, and Y4. ) Here are a few observations:

• As anticipated, the possible values of Y1 and Y5 are nonnegative, and they appear to have been recorded ( or rounded ) to the nearest integer.

• The distributions of Y1 and Y5 are right-skewed, and there are lots of 0’s.

• Both the mean and the variance for Y5 are greater than the corresponding quantities for Y1.


Our observations suggest the following:

• Because there are lots of 0’s, there is no transformation that will bring Y1 or Y5 to approximate normality.

• However, because Y1 and Y5 are integer-valued, a Poisson ( or similar ) probability distribution may be applicable.

• Since Y5 has greater mean and variance than Y1, we anticipate some divergence between trajectories over time and at least one trajectory showing increasing marijuana use over time.
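A quick numerical check can make these observations concrete. The sketch below uses only the means and variances reported in the descriptive tables. For a single Poisson distribution the variance equals the mean, so a variance-to-mean ratio well above 1 suggests that one Poisson distribution for the whole sample would be inadequate. A mixture of Poisson subpopulations is one way such overdispersion can arise. The function names are illustrative, not part of any SAS output.

```python
import math

# Summary statistics copied from the descriptive tables for Y1 and Y5.
summary = {
    "Y1": {"mean": 0.362, "variance": 0.51198},
    "Y5": {"mean": 1.454, "variance": 8.50089},
}


def dispersion_index(mean, variance):
    """Variance-to-mean ratio; equals 1 for a Poisson distribution."""
    return variance / mean


def poisson_zero_probability(mean):
    """P(Y = 0) under a single Poisson distribution with the given mean."""
    return math.exp(-mean)


for name, stats in summary.items():
    ratio = dispersion_index(stats["mean"], stats["variance"])
    p_zero = poisson_zero_probability(stats["mean"])
    print(f"{name}: variance/mean = {ratio:.2f}, single-Poisson P(0) = {p_zero:.3f}")
```

For Y5 the median is 0, so at least half of the responses are 0. A single Poisson distribution with mean 1.454 would put well under half of its probability on 0.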

A first trajectory model

Let t denote time in years. If we set time 0 to be high school graduation, then we have t = -3, -2, -1, 0, and 1 corresponding to Y1 through Y5.

Suppose for now --- the viability of this supposition can be assessed later --- that there are three subpopulations whose mean levels of marijuana use over time ( called “trajectories” ) are defined by exponentials of linear functions

f1(t) = exp(a1 + b1 t), f2(t) = exp(a2 + b2 t), and f3(t) = exp(a3 + b3 t).

The exponentials are needed because f1(t), f2(t), and f3(t) must be nonnegative.


Suppose that the distribution of Yk ( 1 ≤ k ≤ 5 ) in the first subpopulation is Poisson with mean f1( k-4 ), in the second is Poisson with mean f2( k-4 ), and in the third is Poisson with mean f3( k-4 ). Here k-4 is the time t at which Yk was reported, so that Y1 corresponds to t = -3 and Y5 to t = 1.

Finally, suppose that the probability of belonging to subpopulation j ( 2 ≤ j ≤ 3 ) divided by the probability of belonging to subpopulation 1 is of the form exp(cj + dj X). If dj > 0, then higher levels of reward seeking increase the above ratio; if dj < 0, then they decrease the above ratio.
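Because the three membership probabilities must sum to 1, the ratios above determine them completely: the probability of subpopulation 1 is 1 / (1 + exp(c2 + d2 X) + exp(c3 + d3 X)), and each other probability is that quantity times its ratio. The following is a minimal sketch of that calculation. The coefficient values are hypothetical, chosen only to show how positive dj shifts probability away from subpopulation 1 as X increases.

```python
import math


def membership_probabilities(x, coefficients):
    """Return [P(group 1), P(group 2), ...] for reward-seeking score x.

    coefficients is a list of (c_j, d_j) pairs for groups 2, 3, ...;
    group 1 is the reference group, so its ratio to itself is 1.
    """
    odds = [1.0] + [math.exp(c + d * x) for c, d in coefficients]
    total = sum(odds)
    return [o / total for o in odds]


# Hypothetical coefficients (c2, d2) and (c3, d3), for illustration only.
example_coefficients = [(-1.0, 1.0), (-2.0, 2.0)]

for x in (-1.0, 0.0, 1.0):
    probs = membership_probabilities(x, example_coefficients)
    print(x, [round(p, 3) for p in probs])
```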


A group-based trajectory model is thus distinguished from a conventional regression model in that a latent variable --- namely, the subpopulation to which one belongs --- is intermediate between what might be thought of as the independent variable (here, reward seeking) and the dependent variable (here, marijuana use).

Consequently, and importantly, the difference between two trajectories is typically much greater than the difference between mean levels among persons “high” on the independent variable versus persons “low” on the independent variable.


[Figure: mean marijuana use over time for subpopulations 1, 2, and 3; the plotting symbols 1, 2, and 3 identify the subpopulations.]


The preceding figure shows results from fitting the group-based trajectory model via PROC TRAJ.

Approximately 65.3% of persons belong to a subpopulation that is essentially abstinent from marijuana, about 19.4% to a subpopulation whose marijuana use increases and then decreases, and about 15.3% to a subpopulation whose marijuana use continually increases.

Dashed lines represent estimates of f1(t), f2(t), and f3(t) when they are assumed to be exponentials of linear functions; solid lines represent estimates without such a constraint.

Obs  ID  Y1  Y2  Y3  Y4  Y5  T1  T2  T3  T4  T5      X   GRP1PRB   GRP2PRB   GRP3PRB  GROUP
  5   5   0   0   1   0   0  -3  -2  -1   0   1   0.08  0.995814  0.004186  0.000000      1
  6   6   2   3   6   4   0  -3  -2  -1   0   1   2.75  0.000000  0.243606  0.756394      3
  7   7   0   0   0   0   1  -3  -2  -1   0   1  -0.97  0.998364  0.001636  0.000000      1
  8   8   2   4   3   8   8  -3  -2  -1   0   1    0.7  0.000000  0.000002  0.999998      3
  9   9   1   0   1   4   5  -3  -2  -1   0   1   2.78  0.000000  0.071390  0.928610      3
 10  10   1   4   0   0   1  -3  -2  -1   0   1   0.53  0.000634  0.999287  0.000078      2

Obs  _MODEL_  _MODEL2_  _TYPE_  _NAME_       INTERC1       LINEAR1       INTERC2
  1  ZIP                PARMS           -2.240945095  -0.061055892  0.4881041958

Obs       LINEAR2       INTERC3      LINEAR3        CONST2            X2        CONST3            X3
  1  0.0881527887  1.6413614616  0.404393847  -1.196677753  1.1816491462  -2.400466075  2.4141657075

Obs      _LOGLIK_        _BIC1_        _BIC2_         _AIC_  _CONVERGE_
  1  -2580.343083  -2611.416123  -2619.463313  -2590.343083           4

Obs         T     AVG1     AVG2     AVG3    PRED1    PRED2    PRED3
  1  -3.00000  0.13401  0.48801  1.17827  0.12774  1.25063  1.53446
  2  -2.00000  0.10507  1.68610  2.66516  0.12017  1.36588  2.29923
  3  -1.00000  0.13408  2.58710  3.57223  0.11305  1.49175  3.44515
  4   0.00000  0.08391  1.57138  5.24892  0.10636  1.62922  5.16219
  5   1.00000  0.11002  1.20619  7.52729  0.10006  1.77937  7.73500


The preceding tables display additional results.

The first table shows variable values for six subjects, along with the estimated probabilities that the subjects belong to the three subpopulations.

The second and third tables present estimates of a1, b1, a2, b2, a3, b3, c2, d2, c3, and d3. Companion output, which is displayed by PROC TRAJ on screen only, provides accompanying p-values.

The fourth table provides indices of model fit, and the fifth table specifies the numbers used to construct the figure displayed earlier.
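To see how the estimated membership probabilities in the first table relate to the parameter estimates in the second and third tables, one can apply Bayes' rule. The probability of each subpopulation given X is multiplied by the Poisson likelihood of the subject's five responses under that subpopulation's trajectory, and the products are normalized to sum to 1. The sketch below does this for the first (linear) model. The function and variable names are illustrative, and the calculation reflects the model as described in these slides rather than the internal code of PROC TRAJ.

```python
import math

# Estimates copied from the second and third tables (first model).
trajectory = {1: (-2.240945095, -0.061055892),   # (a_j, b_j)
              2: (0.4881041958, 0.0881527887),
              3: (1.6413614616, 0.404393847)}
membership = {1: (0.0, 0.0),                      # reference group
              2: (-1.196677753, 1.1816491462),    # (c_j, d_j)
              3: (-2.400466075, 2.4141657075)}
times = [-3, -2, -1, 0, 1]


def poisson_log_pmf(y, mean):
    """Log of the Poisson probability of count y given the mean."""
    return y * math.log(mean) - mean - math.lgamma(y + 1)


def posterior_probabilities(ys, x):
    """Probability of each group given responses ys (Y1..Y5) and covariate x."""
    log_weights = {}
    for j, (a, b) in trajectory.items():
        c, d = membership[j]
        log_lik = sum(poisson_log_pmf(y, math.exp(a + b * t))
                      for y, t in zip(ys, times))
        log_weights[j] = c + d * x + log_lik
    top = max(log_weights.values())  # subtract the maximum for numerical stability
    weights = {j: math.exp(lw - top) for j, lw in log_weights.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


# Subject 5 from the first table: Y1..Y5 = 0, 0, 1, 0, 0 and X = 0.08.
print(posterior_probabilities([0, 0, 1, 0, 0], 0.08))
```

For subject 5 this calculation agrees with the GRP1PRB and GRP2PRB values in the first table to about four decimal places. The GROUP column is then the subpopulation with the largest estimated probability.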


Visually, the estimate of f2(t) appears somewhat unsatisfactory. There are corresponding discrepancies between the “AVG2” and “PRED2” columns in the fifth table.
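The size of these discrepancies can be checked directly from the fifth table. The sketch below first confirms that the PRED2 column is exp(a2 + b2 t) evaluated at the reported estimates of a2 and b2. It then compares the largest gap between the "AVG" and "PRED" columns for the first and second subpopulations. Variable names are illustrative.

```python
import math

times = [-3, -2, -1, 0, 1]

# AVG and PRED columns for groups 1 and 2, copied from the fifth table.
avg = {1: [0.13401, 0.10507, 0.13408, 0.08391, 0.11002],
       2: [0.48801, 1.68610, 2.58710, 1.57138, 1.20619]}
pred = {1: [0.12774, 0.12017, 0.11305, 0.10636, 0.10006],
        2: [1.25063, 1.36588, 1.49175, 1.62922, 1.77937]}

# INTERC2 and LINEAR2 from the parameter tables.
a2, b2 = 0.4881041958, 0.0881527887
recomputed_pred2 = [math.exp(a2 + b2 * t) for t in times]


def max_abs_gap(group):
    """Largest absolute difference between AVG and PRED for one group."""
    return max(abs(a - p) for a, p in zip(avg[group], pred[group]))


for group in (1, 2):
    print(f"group {group}: largest |AVG - PRED| = {max_abs_gap(group):.3f}")
```

The gap for the second subpopulation exceeds 1 use per month at t = -1, while the gaps for the first subpopulation are all below 0.03.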

Therefore, let us consider a second group-based trajectory model in which the trajectories are defined by exponentials of quadratic functions

f1(t) = exp(a1 + b1 t + g1 t^2), f2(t) = exp(a2 + b2 t + g2 t^2), and f3(t) = exp(a3 + b3 t + g3 t^2).

A second trajectory model

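[Figure: mean marijuana use over time for subpopulations 1, 2, and 3 under the second model; the plotting symbols 1, 2, and 3 identify the subpopulations.]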

Obs  ID  Y1  Y2  Y3  Y4  Y5  T1  T2  T3  T4  T5      X   GRP1PRB   GRP2PRB   GRP3PRB  GROUP
  5   5   0   0   1   0   0  -3  -2  -1   0   1   0.08  0.992863  0.007137  0.000000      1
  6   6   2   3   6   4   0  -3  -2  -1   0   1   2.75  0.000000  0.868232  0.131768      2
  7   7   0   0   0   0   1  -3  -2  -1   0   1  -0.97  0.999285  0.000715  0.000000      1
  8   8   2   4   3   8   8  -3  -2  -1   0   1    0.7  0.000000  0.000000  1.000000      3
  9   9   1   0   1   4   5  -3  -2  -1   0   1   2.78  0.000000  0.008870  0.991130      3
 10  10   1   4   0   0   1  -3  -2  -1   0   1   0.53  0.001133  0.998748  0.000119      2

Obs  _MODEL_  _MODEL2_  _TYPE_  _NAME_       INTERC1       LINEAR1       QUADRA1
  1  ZIP                PARMS           -2.311574846  0.0558397704  0.0514642791

Obs       LINEAR2       QUADRA2       INTERC3       LINEAR3       QUADRA3        CONST2           X2
  1  -0.469526884  -0.296767096  1.6939847836  0.3771947256  -0.029055771  -1.157274099  1.197230375

Obs         T     AVG1     AVG2     AVG3    PRED1    PRED2    PRED3
  1  -3.00000  0.13401  0.48801  1.17827  0.12774  1.25063  1.53446
  2  -2.00000  0.10507  1.68610  2.66516  0.12017  1.36588  2.29923
  3  -1.00000  0.13408  2.58710  3.57223  0.11305  1.49175  3.44515
  4   0.00000  0.08391  1.57138  5.24892  0.10636  1.62922  5.16219
  5   1.00000  0.11002  1.20619  7.52729  0.10006  1.77937  7.73500

Obs        CONST3           X3      _LOGLIK_        _BIC1_        _BIC2_         _AIC_  _CONVERGE_
  1  -2.356619304  2.313769971  -2504.788285  -2545.183238  -2555.644584  -2517.788285           4


Some comments are in order:

• The estimate of f2(t) looks much better now.

• The guess about which subpopulation subject 6 belongs to has changed ( and appears more reasonable now ).

• The BIC1, BIC2, and AIC have increased by approximately 66, 64, and 73 points respectively. These are overwhelming changes, suggesting that the second group-based trajectory model provides a much better fit to the data than the first group-based trajectory model.
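The fit indices in the tables are on a scale where larger (less negative) values indicate better fit, which is why an increase is an improvement. The reported values are consistent with AIC = log-likelihood minus the number of parameters, and BIC = log-likelihood minus half the number of parameters times log(N). Here N = 500 subjects gives BIC1 and N = 2500 observations ( 500 subjects × 5 time points ) gives BIC2. These formulas are inferred from the numbers in the tables, not quoted from PROC TRAJ documentation. The parameter counts are 10 for the first model and 13 for the second, which adds three quadratic coefficients. A minimal sketch under those assumptions:

```python
import math


def fit_indices(log_lik, n_params, n_subjects, n_observations):
    """AIC and BIC on the 'larger is better' scale seen in the tables."""
    return {
        "BIC1": log_lik - 0.5 * n_params * math.log(n_subjects),
        "BIC2": log_lik - 0.5 * n_params * math.log(n_observations),
        "AIC": log_lik - n_params,
    }


# Log-likelihoods copied from the tables; parameter counts as stated above.
models = {
    "linear": {"log_lik": -2580.343083, "n_params": 10},
    "quadratic": {"log_lik": -2504.788285, "n_params": 13},
}

results = {name: fit_indices(m["log_lik"], m["n_params"], 500, 2500)
           for name, m in models.items()}

for key in ("BIC1", "BIC2", "AIC"):
    change = results["quadratic"][key] - results["linear"][key]
    print(f"{key}: linear {results['linear'][key]:.3f}, "
          f"quadratic {results['quadratic'][key]:.3f}, change {change:+.1f}")
```

The penalty for the three extra parameters is small relative to the gain of about 75.6 in the log-likelihood, so all three indices improve.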

Is that the best we can do?

Besides moving from linear functions to quadratic functions, other modifications are possible.

One, for which I provide SAS code at {www.richardcharnigo.net/traj}, entails replacing the ordinary Poisson probability distribution by the zero-inflated Poisson probability distribution. The idea is that, especially in the first subpopulation, there may be too many 0’s to be compatible with the ordinary Poisson probability distribution. Accounting for this zero inflation may provide a better fit to the data.
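The zero-inflated Poisson idea can be sketched in a few lines. With some probability the response is a "structural" 0; otherwise it is drawn from an ordinary Poisson distribution, which can itself produce a 0. The parameter names and values below are hypothetical, and how PROC TRAJ parameterizes the zero-inflation probability is not shown here.

```python
import math


def poisson_pmf(y, mean):
    """Ordinary Poisson probability of count y."""
    return math.exp(y * math.log(mean) - mean - math.lgamma(y + 1))


def zip_pmf(y, mean, extra_zero_prob):
    """Zero-inflated Poisson probability of count y.

    With probability extra_zero_prob the count is a structural 0;
    otherwise it follows Poisson(mean).
    """
    base = (1.0 - extra_zero_prob) * poisson_pmf(y, mean)
    return extra_zero_prob + base if y == 0 else base


# Hypothetical values, for illustration only.
mean, extra_zero_prob = 0.5, 0.3
print("P(0), ordinary Poisson:     ", round(poisson_pmf(0, mean), 3))
print("P(0), zero-inflated Poisson:", round(zip_pmf(0, mean, extra_zero_prob), 3))
```

Setting extra_zero_prob to 0 recovers the ordinary Poisson distribution, so the ordinary model is a special case of the zero-inflated one.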


Another possible modification is to change the quadratic functions to cubic or even quartic functions. ( With only five time points, we cannot go beyond polynomials of degree four. )

In fact, the polynomial degree need not be the same for each subpopulation. For instance, a linear function may suffice for the first and third subpopulations, while ( at least ) a quadratic function appears necessary for the second subpopulation.


We face the practical problem, though, of deciding which modifications to make.

Rather than consider dozens ( or hundreds ) of possible competing models, a more feasible approach may be to start with the most complicated model that one is willing to entertain ( for example, with quartic polynomials for each subpopulation ) and then perform “backward elimination”.


To do this, remove whichever model feature has the largest p-value, while respecting the hierarchical principle that simpler features cannot be removed before more complicated features.

Thus, for example, the linear term cannot be removed from a quadratic polynomial.

Once all remaining model features have p-values less than 0.05 ( or are ineligible for removal ), stop and create a table of model fit indices corresponding to the various steps of the backward elimination.
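The selection rule for one step of the backward elimination can be sketched as follows, restricted for simplicity to polynomial terms. Only the highest-order term of each subpopulation's polynomial is eligible, and the intercept is never removed. The p-values below are hypothetical. In practice the model would be refit ( for example, via PROC TRAJ ) after each removal to obtain new p-values, and that refitting is not shown.

```python
def removable_terms(orders):
    """Highest-order term of each group's polynomial, if above the intercept.

    orders maps group -> current polynomial degree (0 = intercept only).
    """
    return [(group, degree) for group, degree in orders.items() if degree > 0]


def next_term_to_remove(orders, p_values, alpha=0.05):
    """Return the eligible (group, degree) term with the largest p-value.

    p_values maps (group, degree) -> p-value of that polynomial term.
    Returns None when every eligible term has a p-value below alpha.
    """
    candidates = [(p_values[term], term) for term in removable_terms(orders)
                  if p_values[term] >= alpha]
    if not candidates:
        return None
    return max(candidates)[1]


# Hypothetical p-values for a model with quadratic trajectories in all groups.
orders = {1: 2, 2: 2, 3: 2}
p_values = {(1, 1): 0.60, (1, 2): 0.40,
            (2, 1): 0.01, (2, 2): 0.001,
            (3, 1): 0.0001, (3, 2): 0.20}

print(next_term_to_remove(orders, p_values))  # (1, 2)
```

In this example the linear term for the first subpopulation has the largest p-value overall, but it is ineligible while the quadratic term for that subpopulation remains. The quadratic term for the first subpopulation is therefore removed first.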


The step in the backward elimination at which the model fit indices are optimized can be used to select a final model. ( Matters become a bit more complicated, though, if the model fit indices are not in agreement about this. )

Also, if we are unsure whether three is the best number of groups, then the above process can be repeated with, say, two groups and four groups. Model fit indices can then be used to choose among the final two-group model, the final three-group model, and the final four-group model.

Other capabilities of PROC TRAJ

Worth mentioning here, though not illustrated in this presentation or in the SAS code at {www.richardcharnigo.net/traj}, are three additional capabilities of PROC TRAJ:

• The dependent variable need not have the (zero-inflated) Poisson probability distribution; the normal and Bernoulli probability distributions can be accommodated as well.

• Multiple independent variables can be accommodated.


• Multiple, related dependent variables can be accommodated. If there are two ( for instance, marijuana use and alcohol use ), then PROC TRAJ provides one latent variable defining subpopulations on the first dependent variable and a separate latent variable defining subpopulations on the second. Part of the output from PROC TRAJ then estimates the probabilities of membership in the subpopulations defined by the second latent variable given membership in a subpopulation defined by the first. If there are more than two, then PROC TRAJ provides a single latent variable defining subpopulations on all dependent variables simultaneously.

Trying out PROC TRAJ

With this background, let us open SAS and work our way through at least some of the SAS code at {www.richardcharnigo.net/traj}.

This is also an opportunity to experiment and make some changes to the SAS code. For instance, you can see what PROC TRAJ does when a quadratic function is replaced by a cubic function or when a quadratic function is retained for only one of the three subpopulations.
