
A Method for Estimating the Correlations Between Observed and IRT Latent Variables or Between Pairs of IRT Latent Variables

Alan Nicewander, Pacific Metrics

Presented at a conference to honor Dr. Michael W. Browne of the Ohio State University, September 9-10, 2010

• Using the factor analytic version of item response theory (IRT) models,

– estimates of the correlations between the latent variables measured by test items are derived, and

– estimates of the correlations between the latent variables measured by test items and external, observed variables are derived.

Brief Derivations of the Correlations

• The normal ogive model for multiple-choice, dichotomous items may be written as,

• where θ is the latent proficiency variable, ai is the item slope parameter, bi is the item location parameter, ci is a guessing parameter, and φ(t) is the normal density function.
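The display itself did not survive the transcript; a reconstruction of (1), assuming the usual three-parameter normal ogive form implied by the parameter definitions above, is:

```latex
P(u_i = 1 \mid \theta)
  = c_i + (1 - c_i) \int_{-\infty}^{a_i(\theta - b_i)} \varphi(t)\, dt . \tag{1}
```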

• Another useful version of this model is the so-called factor analytic representation: Let Yi be a latent response variable that is a linear function of θ plus error,

• where λi may be considered a factor loading and εi is an error variable. It is further assumed that Yi and θ are normally distributed with zero means and unit variances, and that εi is uncorrelated with θ and with the error terms of the other items.
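The missing display can be reconstructed from these assumptions (a sketch, not the original slide):

```latex
Y_i = \lambda_i \theta + \varepsilon_i ,
\qquad \operatorname{Var}(\varepsilon_i) = 1 - \lambda_i^{2} .
```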

• Let γi be a response threshold, defined so that the item is answered correctly if Yi > γi; then (1) may be rewritten in terms of Yi and γi.

• A graphical representation of this equation is given on the following slide.

• Then λi and γi may be rescaled into the IRT slope and location parameters ai and bi, so that (1) can be written in either parameterization.
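The rescaling displays are missing from the transcript; the following reconstruction uses the standard correspondence between the normal ogive and factor analytic parameterizations implied by the assumptions above:

```latex
P(u_i = 1 \mid \theta)
  = c_i + (1 - c_i)\,P(Y_i > \gamma_i \mid \theta)
  = c_i + (1 - c_i)\,\Phi\!\left(\frac{\lambda_i\theta - \gamma_i}{\sqrt{1-\lambda_i^{2}}}\right),
\qquad
a_i = \frac{\lambda_i}{\sqrt{1-\lambda_i^{2}}},\quad
b_i = \frac{\gamma_i}{\lambda_i},
\]
\[
\text{so that}\quad
P(u_i = 1 \mid \theta) = c_i + (1 - c_i)\,\Phi\bigl(a_i(\theta - b_i)\bigr).
```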

A Graph Showing Yi, the Latent Response Variable, Mapped into (1, 0) Using the Response Threshold, γi

Estimating correlations between the latent variables measured by dichotomous items

• Suppose one wants to determine the correlation between the latent variables, θi and θj, that underlie the observed item responses, ui and uj . Let Yi and Yj be the latent response variables for the two MC items.

• Taking the expected product of Yi and Yj and solving for ρ(θi, θj) yields an equation that does not seem useful, in that it involves two latent correlations, ρ(Yi, Yj) and ρ(θi, θj).

• However, from the definition of the tetrachoric correlation, ρ(Yi, Yj) may be replaced by ρ*tet(ui, uj), the guessing-corrected tetrachoric correlation coefficient, giving a solution in terms of observables.
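A reconstruction of the omitted displays, following the factor analytic setup above:

```latex
\rho(Y_i, Y_j)
  = E\bigl[(\lambda_i\theta_i + \varepsilon_i)(\lambda_j\theta_j + \varepsilon_j)\bigr]
  = \lambda_i \lambda_j\, \rho(\theta_i, \theta_j)
\quad\Longrightarrow\quad
\rho(\theta_i, \theta_j)
  = \frac{\rho(Y_i, Y_j)}{\lambda_i \lambda_j}
  = \frac{\rho^{*}_{\mathrm{tet}}(u_i, u_j)}{\lambda_i \lambda_j}.
```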

Estimating the correlations between the latent variables for dichotomous items and external, observed variables

• It is fairly easy to extend the logic above in order to derive a means for computing correlations between observed variables and IRT latent variables.

• First, define Zk as an observed variable scaled to have zero mean and unit variance, and consider the correlation between Zk and the latent response variable Yi assumed to underlie the MC item ui.

• Expressing this correlation in terms of the model, we once again have an equation with two latent correlation coefficients, ρ(Zk, Yi) and ρ(Zk, θi). However, from the definition of the biserial correlation, we may substitute ρ*bis(Zk, ui) for ρ(Zk, Yi) and obtain a solution involving only observables, where ρ*bis(Zk, ui) is the guessing-corrected biserial correlation between the observed variables Zk and ui.
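The displays are again missing; a reconstruction (assuming εi is uncorrelated with Zk) is:

```latex
\rho(Z_k, Y_i) = \lambda_i\, \rho(Z_k, \theta_i)
\quad\Longrightarrow\quad
\rho(Z_k, \theta_i)
  = \frac{\rho(Z_k, Y_i)}{\lambda_i}
  = \frac{\rho^{*}_{\mathrm{bis}}(Z_k, u_i)}{\lambda_i}.
```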

Extending the latent-variable correlations to polytomous test items.

• In order to simplify exposition, only polytomous items having three categories are modeled.

• Generalization of the methods described below to items with more than three categories is very straightforward.

• Let xij be the score for item i scored in category j (j = 1, 2, …, m). Under commonly-used scoring rules, a three-category item would be scored 0, 1, or 2.

• As was done above for the case of MC items having binary scores, let Y*i be the latent response variable underlying the polytomous item xij, written as a linear function of θi, with factor loading λi, plus error.

• Let γi1 and γi2 be two response thresholds that map Y*i into the three score categories; λi and these two thresholds may be rescaled into IRT slope and location parameters, viz.
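The threshold and rescaling displays are missing; a reconstruction under the usual graded-response conventions (scores 0, 1, 2) is:

```latex
Y_i^{*} = \lambda_i \theta_i + \varepsilon_i,
\qquad
x_{ij} =
\begin{cases}
0, & Y_i^{*} \le \gamma_{i1},\\
1, & \gamma_{i1} < Y_i^{*} \le \gamma_{i2},\\
2, & Y_i^{*} > \gamma_{i2},
\end{cases}
\qquad
a_i = \frac{\lambda_i}{\sqrt{1-\lambda_i^{2}}},\quad
b_{ij} = \frac{\gamma_{ij}}{\lambda_i}.
```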

• Fitting the previous model to data may be done with the normal ogive version of Samejima’s (1969) Graded Response model, or with the more commonly used logistic approximation thereof.

• However, the correlations we are seeking here do not depend on the locations, but only on the λi, which are obtained from the item slope parameters as λi = ai / (1 + ai^2)^(1/2).

• First, define the correlation between the two latent response variables that underlie two polytomous items, xij and xkj, using the previous logic.

• Then, solve for ρ(θi, θk) and substitute for λi and λk in terms of the item slopes.
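A reconstruction of the two missing displays (treating the polychoric correlation between the item scores as the estimate of ρ(Y*i, Y*k)):

```latex
\rho(Y_i^{*}, Y_k^{*}) = \lambda_i \lambda_k\, \rho(\theta_i, \theta_k)
\quad\Longrightarrow\quad
\rho(\theta_i, \theta_k)
  = \frac{\rho_{\mathrm{polychoric}}(x_{ij}, x_{kj})}{\lambda_i \lambda_k}
  = \rho_{\mathrm{polychoric}}(x_{ij}, x_{kj})\,
    \frac{\sqrt{(1+a_i^{2})(1+a_k^{2})}}{a_i a_k}.
```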

Computing the correlation between an external variable and the latent variable measured by a polytomous item

• From earlier derivations, it is fairly obvious that the correlation between an external variable Zk and the latent variable measured by a polytomous item xij is as given below, where ρpoly_s(Zk, xij) is the polyserial correlation between the external variable and the score on the polytomous item.
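The display is missing; by the same logic as the dichotomous case it should take the form:

```latex
\rho(Z_k, \theta_i)
  = \frac{\rho_{\mathrm{poly\_s}}(Z_k, x_{ij})}{\lambda_i}
  = \rho_{\mathrm{poly\_s}}(Z_k, x_{ij})\, \frac{\sqrt{1+a_i^{2}}}{a_i}.
```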

Some Numerical Examples

Computing the correlations between the latent variables measured by three polytomous items, each having three categories

• Ten replications of 300 observations were simulated using the values of a, b1 and b2 given below in Table 1, and with true values of ρ(θ1,θ2) = ρ(θ2,θ3) = .6, and ρ(θ1,θ3) = 1.

Table 1. Summary averages (std. error) and parameters for simulations of three polytomous items having three categories.*

Item    x1           x2           x3           locations   a-values   replications
x1      1            .198 (.04)   .295 (.04)   -1, 1        1          10
x2      .276 (.04)   1            .155 (.04)   -.5, 1.5     .8         10
x3      .394 (.04)   .193 (.05)   1            -.2, 1.5     .6         10

* Mean phi-correlations above the diagonal, mean polychoric correlations below the diagonal.
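As an illustration of how these quantities fit together, the following sketch (not code from the presentation; it assumes the table reconstruction above and the loading formula λi = ai/(1 + ai^2)^(1/2)) applies the disattenuation formula to the mean polychoric correlations and a-values in Table 1:

```python
import numpy as np

# A sketch (not code from the presentation): disattenuate the mean polychoric
# correlations in Table 1 using the item slopes, via
#   rho(theta_i, theta_k) = rho_polychoric(x_i, x_k) / (lambda_i * lambda_k),
#   lambda_i = a_i / sqrt(1 + a_i^2).

a = np.array([1.0, 0.8, 0.6])          # a-values for items x1, x2, x3 (Table 1)
lam = a / np.sqrt(1.0 + a**2)          # implied factor loadings

# Mean polychoric correlations (below the diagonal of Table 1, as reconstructed)
poly_12, poly_13, poly_23 = 0.276, 0.394, 0.193

print(poly_12 / (lam[0] * lam[1]))     # ~0.62, true value 0.6
print(poly_13 / (lam[0] * lam[2]))     # ~1.08, true value 1.0
print(poly_23 / (lam[1] * lam[2]))     # ~0.60, true value 0.6
```

The recovered values are close to the true correlations of .6, 1.0, and .6 used in the simulation.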

Estimating slopes in latent variable regression problems

Consider a multivariate multiple regression that involves regressing a multidimensional vector of latent variables, θ, onto a multidimensional vector of observed scores, z.

What are the slopes of the latent variables on the observed Z’s (moderated by observed scores on a test designed to measure the latent dimensions)?

The multivariate multiple regression model may be expressed as

θ = Γz + ε,

where Γ contains the regression slopes and ε the residuals.

A Bayesian solution to this regression problem was originally proposed by Mislevy (1985), and this solution was implemented using the EM-algorithm-based program called C-Group (ETS, 1993).

Notice that this regression system could be solved using ordinary least squares if the dispersion matrices Σθθ = E(θθ') and Σθz = E(θz') were known.

With a little reflection, the estimates derived here could be used to compute an estimate of Σθz (and Σθθ). These in turn could be used to estimate

Γ = Σθz Σzz^(-1).
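A minimal sketch (assumed, not code from the talk) of this OLS step, using NumPy and the population dispersion matrices described on the following slide:

```python
import numpy as np

# A minimal sketch (assumed, not code from the talk) of the OLS step just
# described: given estimates of Sigma_theta_z and Sigma_zz, the slopes in
# theta = Gamma z + eps are Gamma = Sigma_theta_z * inv(Sigma_zz).

def ols_latent_slopes(sigma_theta_z, sigma_zz):
    """Slopes of the latent variables theta on the observed variables z."""
    # Solve Gamma Sigma_zz = Sigma_theta_z instead of inverting explicitly.
    return np.linalg.solve(sigma_zz.T, sigma_theta_z.T).T

# Illustration with the population matrices used in the simulation that follows:
sigma_zz = np.eye(6)                                        # (6 x 6) identity
sigma_theta_z = np.tile([.4, .2, 0., .4, .2, 0.], (5, 1))   # (5 x 6), identical rows

print(ols_latent_slopes(sigma_theta_z, sigma_zz))           # equals Sigma_theta_z here
```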

In order to compare C-Group and OLS solutions for the latent-variable regression problem, a data set consisting of five θ's and six observed Z's was simulated, each with n=2,000, using the following population dispersion matrices:

Σθθ = a (5x5) constant-correlation matrix with ρ = .9.

Σzz = a (6x6) identity matrix.

Σθz = a (5x6) matrix with identical rows = (.4, .2, 0, .4, .2, 0).

It is easy to see that, given these dispersion matrices, the population value of Γ is equal to Σθz. Using well-known procedures, normal random numbers were generated for 2,000 cases and transformed to have the above dispersion matrices in expectation (Browne, 1969).

The population values of the slopes (rows = Zk, columns = θi) are:

        θ1    θ2    θ3    θ4    θ5
Z1      .4    .4    .4    .4    .4
Z2      .2    .2    .2    .2    .2
Z3       0     0     0     0     0
Z4      .4    .4    .4    .4    .4
Z5      .2    .2    .2    .2    .2
Z6       0     0     0     0     0

Summary of Regression Slopes of Five Latent Variables, θi, on Six Observed Variables, Zk

In addition to these simulated data, the responses to fifty NAEP Math items were simulated using 1990 NAEP item-parameter estimates.

These simulated data for 2,000 cases were then used to generate estimates of the regression slopes in Γ using both the C-Group EM solution and the OLS solution. The whole process was repeated ten times, and means and standard deviations of the estimates were computed. These summary statistics are shown in Table 2 on the following page:

              θ1          θ2          θ3          θ4          θ5
True slope    .40         .40         .40         .40         .40
OLS-LVC est.  .43 (.04)   .40 (.05)   .39 (.03)   .40 (.02)   .38 (.05)
C-GRP est.    .43 (.03)   .38 (.03)   .42 (.03)   .39 (.03)   .47 (.04)
True slope    .20         .20         .20         .20         .20
OLS-LVC est.  .21 (.04)   .21 (.04)   .21 (.04)   .21 (.02)   .21 (.04)
C-GRP est.    .22 (.02)   .21 (.03)   .22 (.04)   .21 (.03)   .24 (.04)
True slope    0           0           0           0           0
OLS-LVC est.  .01 (.04)   .04 (.05)   .01 (.03)   .02 (.03)   .01 (.05)
C-GRP est.    .01 (.03)   .03 (.03)   .01 (.02)   .02 (.03)   .01 (.04)

              θ1          θ2          θ3          θ4          θ5
True slope    .40         .40         .40         .40         .40
OLS-LVC est.  .44 (.02)   .38 (.04)   .39 (.02)   .43 (.02)   .40 (.04)
C-GRP est.    .43 (.03)   .39 (.03)   .42 (.02)   .41 (.02)   .47 (.03)
True slope    .20         .20         .20         .20         .20
OLS-LVC est.  .23 (.04)   .20 (.06)   .20 (.02)   .21 (.03)   .20 (.04)
C-GRP est.    .22 (.02)   .19 (.02)   .22 (.02)   .21 (.02)   .20 (.04)
True slope    0           0           0           0           0
OLS-LVC est.  .00 (.03)   .01 (.06)   -.01 (.03)  -.01 (.03)  .03 (.04)
C-GRP est.    .00 (.03)   .01 (.04)   -.01 (.03)  -.01 (.02)  .01 (.04)

Tabled estimates are the means of ten replications with n=2,000 each. Values in parentheses are estimated standard errors.

The coefficients developed in this inquiry have a very simple form for two basic reasons:

1. They make strong assumptions about the data, and

2. The exotic correlation coefficients on which they are based (tetrachoric-polychoric and biserial-polyserial) do the “heavy lifting” mathematically, because of the complex calculations entailed in their computation.

It is also the case that the standard errors of these coefficients are rather large, and they will almost certainly require large samples for accuracy.