Estimating a Latent Trait Model by Factor Analysis of Tetrachoric Correlations

2/8/12 Estimating a Latent Trait Model by Factor Analysis of Tetrachoric Correlations

1/13www.john-uebersax.com/stat/irt.htm

Estimating a Latent Trait Model byFactor Analysis of Tetrachoric

Correlations

Introduction Estimating threshold parameters Estimating correlation parameters

The tetrachoric correlation Factoring the matrix of tetrachoric correlations Factoring tetrachoric correlations with SAS

Example Calculating factor scores Final thoughts

Citation

References

Introduction

We consider here estimation of a fully Gaussian latent trait model. "FullyGaussian" means that both the latent trait and measurement error are assumed to be

normally distributed. This model, equivalent to what Item Response Theory (IRT)terms a two-parameter normal (2PN) model with a normally distributed latent trait,

is often used in educational testing, social science research, and, increasingly, in

health research. It is also useful--perhaps one of the best methods--for binary datafactor analysis.

For general background on latent trait and IRT models (we use these termsinterchangeably), the two classic texts are Lord and Novick (1968) and Lazarsfeldand Henry (1968; 1968 was a good year!). Heinen (1996) and Hulin, Drasgow

and Parsons (1983) are also good sources. For background on Gaussian latent traitmodels see Bock and Aitkin (1981), Bock and Lieberman (1970), and Bock,Gibbons and Muraki (1988)--the last reference is less technical. For interesting

discussion of newer, Bayesian methods for estimating IRT models, see Johnson andAlbert (1999).

The best way to estimate the fully Gaussian latent trait model, a method developedby Bock and colleagues, is FIML (full-information maximum likelihood) estimation.With this approach, all model parameters are estimated simultaneously by maximumlikelihood. The method is very effective, but, for multidimensional latent trait models,

requires specialized software, such as TESTFACT (Wilson, Wood & Gibbons,1998). [Unidimensional versions can be estimated with LEM (Vermunt, 1988), orLTM (Uebersax, 2000), both free programs.]



Another way to estimate the fully Gaussian latent trait model has long been known--

in fact, it was the only option for many years. The Bock method was rightly seen asan improvement. Fairly recently, though, studies (Knol & Berger, 1991; Parry &McArdle, 1991) reexamined the older method, and concluded that it is still very

viable.

This older method--sometimes called the heuristic method--is based on factor

analysis of the tetrachoric correlations between item or variable pairs. (Forconvenience, we will use the term 'item' herein to refer generically to any binaryvariable.)

It is no surprise that the FIML and heuristic estimation methods produce verysimilar results; despite apparent differences, both are based on the same

fundamental model. The FIML method formulates the problem as a standard latenttrait or IRT model, whereas the heuristic method views it as a factor analysisproblem. Mislevy (1986), and, more formally, Takane and de Leeuw (1987)describe how these models are the same.

In the following sections, we will describe the fully Gaussian latent trait model andits parameters, explain how to estimate the parameters by the heuristic method, and

consider an example.

This discussion will not consider the case of ordered-polytomous data. However,

methods similar to those described here are applicable to such data.

The Fully Gaussian Latent Trait Model

As noted above, the fully Gaussian latent trait model can be understood as a factor

analysis model (to be precise, a common-factor model with probit link functions;this will become clear), or a latent trait/IRT model. In the former case, the model

assumes that, underlying each dichotomous manifest variable (observed variable),Xi, is a corresponding latent continuous variable, Yi. Dichotomous responses

result from the operation of thresholds relative to the latent continuous variables.

The latent trait/IRT version is most easily described for the unidimensional case: (a)

there is a normally distributed latent trait; (b) associated with each item is an item

response function that gives the probability of a positive response for arespondent/examinee with each level of the latent trait; and (c) item response

functions have a characteristic "S" shape--specifically, that of a "normal ogive,"

better known as the cumulative distribution function (cdf) of a normal curve.

Both models--the factor analysis model and the latent trait/IRT model--have two

sets of parameters. These have various names; we shall simply call them thecorrelation parameters and the threshold parameters. We will consider these

parameters mainly in terms of the factor analysis model, which is consistent with

present aims.



Threshold parameters

These are the values of thresholds that discretize latent continuous variables to

produce dichtomous observed values. These correspond--that is, they are related ina simple way--to what are alternatively termed location, difficulty, or b parameters

in various latent trait/IRT models.

Correlation parameters

These are the correlations of latent continuous variables with the underlying factors.

That is, they are latent factor loadings. These correspond--that is, again, are related

by very simple functions--to what are variously known as slope, dispersion, a, ormeasurement error parameters in latent trait/IRT models.

We shall now consider how to estimate the threshold and correlation parameters bythe heuristic method.

Estimating Threshold Parameters

Threshold parameters are estimated by applying the probit function to the marginal

response proportions for each item.

The probit function

The probit function is the inverse of the cumulative distribution function (cdf) of the

standard normal curve. We typically denote the latter by Phi() [i.e., F()] and the

former function by Phi-1(), or probit(). The Phi function is familiar from its use insignificance testing. For some normal deviate, z, Phi(z) is the area under the

standard normal curve to the left of z. For example, if z = 0, then

Phi(z) = Phi(0) = .5.

because half the standard normal curve is to the left of 0.

The probit function does the reverse: one supplies the proportion of the curve to the

left of some z value, and the probit function gives the z value. So, for example,

probit(.5) = 0.,

Applying the probit function

The heuristic estimation of threshold parameters is, therefore, quite simple. First, for

each item, one notes the proportion of negative responses. This is the proportion of

cases that fall below the response threshold on the corresponding latent continuous

variable--for example, the proportion of patients who are below a physician's

threshold for making a positive diagnosis, or the proportion of respondents whose

opinion strength is below the threshold required to answer "yes" to a survey



question. Let pi denote this proportion. Letting ti denote the threshold for item i,

ti = probit(pi).

Doing this calculation for each item gives the heuristic item threshold estimates.

Calculating the probit function

The probit function is usually calculated with a polynomial approximation. Several

good routines are available. Some are described in Abramowitz and Stegun (1972)

and Kennedy and Gentle (1980).

See a simple BASIC probit function program.

Download an executable probit function program.

The thresholds calculated by this method are relative to the latent continuousvariables. For a unidimensional latent trait model, one can also consider the location

of the thresholds relative to the latent trait. These latter thresholds, sometimes called

the location parameters in IRT, correspond to the latent trait level where probability

of a positive response for an item is .5 (i.e., where the item response function is .5,

at the inflection point of the "S" curve.) Letting bi denote these thresholds,

bi = ti/rhoi,

where rhoi is the correlation parameter for item i, calculated as described in the next

section.

Estimating Correlation Parameters

To estimate the correlation parameters of the latent trait model, we will first

calculate the matrix of tetrachoric correlations for all item pairs, and will then factor

analyze this matrix.

The tetrachoric correlation

The tetrachoric correlation (Drasgow, 1988; Harris, 1988; the former reference is

more complete) between two dichotomous items estimates the Pearson correlation

one would obtain if the two constructs were measured continuously. In terms of our

definitions, the tetrachoric correlation for manifest variables Xi and Xj, which we

denote by r*ij, is equal to the Pearson correlation between the corresponding latent

continuous variables Y1 and Y2. That is:

r*ij = r(Y1, Y2).

To dispel any possible confusion, it bears mentioning that we have referred to two

different types of latent correlations. What we've termed the correlation parameters



of the latent trait model are the correlations of latent continuous variables with

factors. These we will derive from the tetrachoric correlations, which are the

correlations of latent continuous variables with each other.

An intuitive explanation of the tetrachoric correlation

Figure 1 portrays the joint distribution of the latent continuous variables, Y1 and Y2,

corresponding to manifest items X1 and X2.

Figure 1 (draft)

Bivariate normal distribution of latent continuousvariables Y1 and Y2 and discretizing thresholds

t1 and t2

Y2 | . | . . | . | . | . | t | . d | c . 2 +------------+---------------- | . a | b . | . | | . | +------------|-------------------> Y1 t1

a, b, c and d denote the proportions of cases that fall in each region defined by the

two items' thresholds. For example, a is the proportion below both items'

thresholds--e.g., the cases that fail or respond "no" to both items.

These proportions correspond to the summary of data as a 2 x 2 cross-classification table, as in Table 1.

Table 1 (draft)

Cross-classification of responses for Item 1 and Item 2

Item 1 0 1 +-------+-------+ 0 | a | b | p2 Item 2 +-------+-------+ 1 | c | d | q2 +-------+-------+ p1 q1 1



Note that a, b, c and d here are defined as proportions, not frequencies.

Once we know the observed cross-classification proportions a, b, c and d for a

study, it is a simple matter to estimate the model represented by Figure 1.

Specifically, we want to know the locations of the discretizing thresholds, t1 and t2,

and a third parameter, which determines the "fatness" of the ellipse: this third

parameter is the tetrachoric correlation for Items 1 and 2, or r*ij.

The principle of estimation is very simple: basically, a computer program tries

various combinations for t1, t2 and r*, until values are found for which the expected

proportions for a, b, c and d in Figure 1 are as close as possible to the actual,

observed proportions in Table 1. The parameter values that do so are regarded as

(estimates of) the true, population values.

In fact, our probit-based threshold estimates are exactly right (this is so for the

tetrachoric, but not the polychoric correlation, a generalization of the tetrachoric to

ordered category data), so we need only find the right value of r*.

Estimating tetrachoric correlations

A computer program is necessary to calculate the tetrachoric correlation. Several

algorithms are available. One method (the original approach) is based on a series

expansion, called the tetrachoric series. More recent approaches rely on numerical

integration. Nearly all modern algorithms are accurate to many decimal places.

Older algorithms that rely on a series expansion with too few terms (e.g., a sixth-

degree polynomial expansion) are potentially inaccurate. A good subroutine

(Brown, 1977) can be downloaded from the Applied Statistics section of StatLib.

Download TetMat program to estimate tetrachoric correlations.

For more information about software options, click here.

SAS will calculate tetrachoric correlations, as an option for PROC FREQ. The only

drawback is that it calculates these one at a time, rather than constructing a matrix

of tetrachoric correlations between all item pairs. A search of the SAS web site for

the term "polychoric", though, will reveal a macro for constructing such a matrix.

Once one has constructed a matrix of tetrachoric correlations between all item pairs

(i, j), the next step is to factor analyze the matrix.

Factoring the matrix of tetrachoric correlations

One may factor analyze the matrix of tetrachoric correlations just as one would amatrix of Pearson correlations. One can use any software that will estimate a

common factor model.



It is possible for a tetrachoric correlation matrix to not be positive definite. Knol and

Berger (1991) suggest that, with ML or Generalized (Weighted) least-squares(GLS/WLS) estimation of the common factor model, if the matrix is indefinite or ill-

conditioned with respect to inversion, a smoothing procedure is required; they also

describe a procedure for this.

The extent to which this extra step is required is unclear. It perhaps becomes more

of an issue as the number of items increases. If one uses iterated principal factor

analysis or unweighted least-squares estimation, then the smoothing procedure is

not required.

LISREL can be used for the factor analysis. LISREL works well for a 1-factor

model (see Example below). However, it may be awkward for multi-factor models;

LISREL is better suited for confirmatory than for exploratory factor analysis. IfLISREL is used, one should not use ML estimation, as the assumptions of that

method do not apply to tetrachoric correlations.

Factoring tetrachorics with SAS PROC FACTOR

Another possibility is to use SAS PROC FACTOR. For this, one reads the matrix

of tetrachoric correlations into a special SAS data set that is specified as a

correlation matrix. This data set is then used as the input for SAS PROC

FACTOR.

For example, let "matrix.in" be the name of an external file (e.g., a DOS, Windows,

or Unix file) that contains the tetrachoric correlations among five items, as follows:

ITEM1 1.000 . . . . ITEM2 .170 1.000 . . . ITEM3 .228 .189 1.000 . . ITEM4 .107 .111 .187 1.000 . ITEM5 .067 .172 .105 .201 1.000

Note that: (1) this matrix has the form of a lower-half matrix, with main diagonal

elements included; (2) the main diagonal elements are unities; (3) elements of the

upper half of the matrix are explicitly supplied as missing values ("."); and (4) avariable name begins each row of the matrix.

I believe one could also supply a full correlation matrix here. However it is

important to supply *some* values for the upper half of the matrix; otherwise when

reading the data SAS will advance to the next line looking for the values above the

main diagonal. (In theory you could use the SAS input option MISSOVER to

prevent that, but not if any row of the correlation matrix occupies more than one

physical line of the file.)

Also, this is just an example; in practice one might better supply the correlations to

more decimal places.



A sample SAS program to analyze such data is as follows:

* Input the matrix of tetrachoric correlations ; * and store as a SAS correlation data set (refer ; * to SAS manual under PROC CORR for details) ;

data one (type=corr); infile 'matrix.in'; _type_='corr'; input _name_ $ item1-item5; run;

* Do the factor analysis;

proc factor data = one method = prinit n = 2 rotate = v scree ; run;

The PROC FACTOR step requests estimation by the "PRINIT" (iterated principal

factor analysis or IPFA) method, a two-factor model, varimax rotation, and a scree

test of eigenvalues.

Based on limited experience, I have found the PRINIT method better for factoring

tetrachorics than most other SAS factoring methods (a comparable method is

available with SPSS). Knol and Berger found good results with this method, but

suggested that unweighted least-squares estimation may be preferable in order to

avoid possible "Heywood cases." (A Heywood case is an estimation problem

where a commonality estimate becomes 1.0 or greater).

Regardless of the program used, one will find on the output a listing of factor

loadings of each item on each factor. These loadings, the item--factor correlations,

are the correlation parameters we seek--that is, the estimated correlations of latent

continuous variables with the factors.

For a unidimensional model, we denote this value by rhoi (i.e., i).

For a multidimensional model, we could denote the latent correlation of item i with

Factor c by rhoic (i.e., ic). We could alternatively denote these as rho(Yi, thetac)

[i.e., (Yi, c)], that is, the correlation of latent continuous variable Yi with the c'th

factor/latent trait, thetac.

As with ordinary factor analysis, one may test models with varying numbers offactors. Or one may use analysis to confirm certain hypotheses. However, any

goodness-of-fit statistic that depends on the assumption of continuous variables

does not strictly apply; such statistics may roughly guide model selection and

comparison, but disregard the p-values.



Example

This example uses the LSAT-6 data analyzed by Bock and Lieberman (1970) and

others. The data are responses to five items of the Law School Admissions Test

(LSAT) test by 1000 examinees (see Bock & Lieberman for data). We proceed in

two steps.

Step 1: Calculate the threshold estimates and tetrachoriccorrelations.

For this, we use the PRELIS (Joreskog & Sorbom, 1988) program. The PRELIS

command file is as follows:

LSAT-6 data (Bock & Lieberman, 1970) da ni = 5 nobs = 1000 missing = -9 treatment = listwise raw-data-from file = lsat6.raw labels item1 item2 item3 item4 item5 ordinal item1 - item5 output matrix = pmatrix sm = lsat6.mat sa=test.acv

These commands are for PRELIS Version 1; commands for Version 2 might differ

slightly. The commands state that there are five variables (items 1 to 5), to be

treated as ordinal, not continuous data. A matrix of estimated Pearson correlations

(pmatrix) is requested; since the data are dichotomous, PRELIS will estimate the

Pearson correlations as tetrachoric correlations. Raw data are supplied in the file

LSAT6.RAW. The tetrachoric correlation matrix will be written to file

LSAT6.MAT. The asymptotic variance/covariance matrix for estimated parameters

is written to the file TEST.ACV. LISREL uses this matrix for Weighted Least

Squares estimation (if you plan to use Unweighted Least Squares estimation with

LISREL, this matrix is not needed). PRELIS will only calculate this matrix if listwise

(not pairwise) deletion of missing cases is specified.

The output file produced by these commands gives the one-way item marginal totals

as follows:

CATEGORY VARIABLE 1 2 ________

ITEM1 76 924 ITEM2 291 709 ITEM3 447 553 ITEM4 237 763 ITEM5 130 870

From these we calculate, for each item i, the marginal proportion pi and threshold ti= probit(pi). For example, for the first item,

p1 = 76 / (76 + 924) = 76/1000 = .076



and

t1 = probit(.076) = -1.4325.

Doing this for all items gives the heuristic threshold estimates shown in Table 2.

Note: It appears that PRELIS 2 supplies these thresholds directly; there is no need

to calculate them from marginal proportions.

The matrix of tetrachoric correlations, stored in file LSAT6.MAT, is as follows:

(6D13.6) .100000D+01 .170316D+00 .100000D+01 .227522D+00 .189091D+00 .100000D+01 .107186D+00 .111147D+00 .186680D+00 .100000D+01 .665002D-01 .172422D+00 .105492D+00 .200924D+00 .100000D+01

The matrix as shown on the printed output, more readable, is:

ESTIMATED CORRELATION MATRIX ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ________ ________ ________ ________ ________ ITEM1 1.000 ITEM2 .170 1.000 ITEM3 .228 .189 1.000 ITEM4 .107 .111 .187 1.000 ITEM5 .067 .172 .105 .201 1.000

Step 2. Factor analyze the matrix of tetrachoric correlations.

For this we use LISREL. The command file is:

FACTOR ANALYSIS OF DICHOTOMOUS VARIABLES: LSAT6 DATA -- WEIGHTEDLEAST SQUARESDA NI=5 NO=1000 MA=PMPM .100000D+01 .170316D+00 .100000D+01 .227522D+00 .189091D+00 .100000D+01 .107186D+00 .111147D+00 .186680D+00 .100000D+01 .665002D-01 .172422D+00 .105492D+00 .200924D+00 .100000D+01AC FI=TEST.ACVMO NX=5 NK=1 LX=FU,FR PH=ST TD=SY,FIFR TD 1 1 TD 2 2 TD 3 3 TD 4 4 TD 5 5OU

This simply requests a 1-common factor model of the data. The AC command tells

LISREL where to find an asymptotic variance/covariance matrix for used with

Weighted Least Squares (WLS) estimation. Because of this, LISREL will use WLS

estimation even though it is not explicitly requested.

The LISREL output shows the loadings of each latent continuous variable on the

common factor, as follows:

LISREL ESTIMATES (WEIGHTED LEAST SQUARES) LAMBDA X



KSI 1 ________ VAR 1 .389 VAR 2 .397 VAR 3 .471 VAR 4 .377 VAR 5 .342

These values (here labeled KSI 1) are the heuristic estimates of the correlation

parameters rhoi of the latent trait model.

Table 2 compares the heuristic threshold and correlation parameter estimates with

the FIML estimates reported by Bock and Lieberman (1970).

Table 2. Heuristic and Full Information Maximum Likelihood (FIML)Parameter Estimates for LSAT-6 Data

------------------------------------------------

Heuristic estimates FIML estimatesa

------------------- ---------------- Thresh- Corre- Thresh- Corre- Item olds lations olds lations (i) (ti) (rhoi) (ti) (rhoi)

------------------------------------------------- 1 -1.4325 .389 -1.4329 .3856 2 -.5505 .397 -.5505 .3976 3 -.1332 .471 -.1332 .4732 4 -.7160 .377 -.7159 .3749 5 -1.1264 .342 -1.1264 .3377 -------------------------------------------------

aBock & Lieberman (1970); p. 191.

The heuristic and FIML estimates are very close, especially for the threshold

parameters. Possibly, using more accurate FIML estimation than was available in

1970, the threshold estimates would be identical.

Calculating Factor Scores

For some studies one will be content simply to reveal the dimensional structure of

the data. Other times one will want to locate each object in the multidimensional

latent trait space. This is equivalent to calculating a factor score for each case,

relative to each factor.

Bock and Aitkin (1981) describe three methods to estimate such factor scores:Maximum Likelihood (ML), Maximum a posteriori (MAP), and Expected a

posteriori (EAP) estimation. The EAP estimates are fairly easily obtained,

especially for a one-factor model; for details, click here .

Final Thoughts



At the outset it was suggested that the heuristic method may be useful mainly to

those who do not have access to the specialized software needed for FIML

estimation of a multidimensional latent trait/IRT model. However, after writing this, it

occurs to me that the heuristic model may be somewhat more flexible than currentversions of the FIML approach. For example, with the heuristic method, one can

easily combine binary and continuous variables in the same analysis. One would

simply construct a correlation matrix where correlations are estimated as:

Tetrachoric/polychoric correlations, if both variables are dichtomous/orderedcategoricalBiserial/polyserial correlations, if one variable is continuous and one is

dichotomous/ordered categoricalPearson correlations, if both variables are continuous

One would then factor-analyze the resulting matrix.

Similarly, it would appear that the heuristic method can more easily permit certain

uses of covariates to improve estimation of latent trait scores for individual cases.

These same things can, in principle, be done with FIML estimation, but it is unclear

that software exists to do them.

To cite this article:

Uebersax JS. Estimating a latent trait model by factor analysis of tetrachoriccorrelations. 2000. Statistical Methods for Rater Agreement web site. Availableat: http://john-uebersax.com/stat/irt.htm. Accessed mmm dd, yyyy.

References

Abramowitz M, Stegun IA. Handbook of mathematical functions. Washington DC: U. S.Government Printing Office, 1972.

Bock RD, Aitkin M. Marginal maximum likelihood estimation of item parameters:application of an EM algorithm. Psychometrika, 1981, 46, 443-459.

Bock RD, Lieberman M. Fitting a response curve model for dichotomously scoreditems. Psychometrika, 1970, 35, 179-198.

Bock RD, Gibbons R, Muraki E. Full-information item factor analysis. AppliedPsychological Measurement, 1988, 12, 261-280.

Brown MB. Algorithm AS 116: the tetrachoric correlation and its standard error. AppliedStatistics, 1977, 26, 343-351.

Drasgow F. Polychoric and polyserial correlations. In Kotz L, Johnson NL (Eds.),Encyclopedia of statistical sciences. Vol. 7 (pp. 69-74). New York: Wiley, 1988.

Gibbons RD. TESTFACT Version 3 users manual. Chicago: Scientific SoftwareInternational, 1998.

Harris B. Tetrachoric correlation coefficient. In Kotz L, Johnson NL (Eds.),Encyclopedia of statistical sciences. Vol. 9 (pp. 223-225). New York: Wiley, 1988.

Heinen T. (1996). Latent class and discrete latent trait models: similarities and



differences. Thousand Oaks, California: Sage, 1996.

Hulin CL, Drasgow F, Parsons CK. Item response theory. Homewood, Illinois: DowJones-Irwin, 1983.

Johnson VE, Albert JH. Modeling ordinal data. New York: Springer, 1999.

Joreskog KG, Sorbom, D. PRELIS 1 User's Manual, Version 1. Chicago: ScientificSoftware, Inc., 1988.

Kennedy WJ Jr, Gentle JE. Statistical computing. New York: Dekker, 1980.

Knol DL, Berger MP. Empirical comparison between factor analysis andmultidimensional item response models. Multivariate Behavioral Research, 1991, 46,457-477.

Lazarsfeld PF, Henry NW. Latent Structure Analysis, Boston: Houghton Mifflin, 1968.

Lord FM, Novick MR. Statistical theories of mental test scores. Reading,Massachusetts: Addison-Wesley, 1968.

Mislevy RJ. Recent developments in the factor analysis of categorical variables. Journalof Educational Statistics, 1986, 11, 3-31.

Parry CD, McArdle JJ. An applied comparison of methods for least-squares factoranalysis of dichotomous variables. Applied Psychological Measurement, 1991, 15, 35-46

Takane Y, de Leeuw J. On the relationship between item response theory and factoranalysis of discretized variables. Psychometrika, 1987, 52, 393-408.

Uebersax J. LTM: a program for unidimensional latent trait modeling. In preparation,2000.

Vermunt J. LEM users manual. Tilburg, The Netherlands: Tilburg University, 1998.

Wichura MJ. Algorithm AS 241: Percentage points of the normal distribution. AppliedStatistics, 1988, 37(3) 477-484.

Go to Top of page Go to Latent Structure Analysis Go to Statistical Methods for Rater Agreement

This page maintained by John Uebersax PhD

Revised: 10 August 2006 (corrected example matrix, new software link)

Copyright (c) by John S. Uebersax. All rights reserved.

Estimating a Latent Trait Model by Factor Analysis of Tetrachoric Correlations

Documents

Transcript of Estimating a Latent Trait Model by Factor Analysis of Tetrachoric Correlations