Supplemental Materials
Comparative Evaluation of Three Situational Judgment Test Response Formats in Terms of Construct-Related Validity, Subgroup Differences, and Susceptibility to Response
Distortion
by W. Arthur Jr. et al., 2014, Journal of Applied Psychology
http://dx.doi.org/10.1037/a0035788
Study 2
Competing Explanations for Study 1’s Results
The gist of the shared-common-response-method explanation of Study 1’s (i.e., the field
study) results is that because the rate-SJT shared a similar response format with the personality
measure (i.e., a Likert-type response format), and the rank- and most/least-SJTs shared what
appears to be a similar response format with the GMA test (i.e., a multiple-choice-type response
format), the rate-SJT–FFM personality traits relationships and the rank-SJT–GMA relationship
are an artifact of the shared common response method instead of the posited differences in the
cognitive and information processing demands. Specifically, the posited shared-common-
response-method effect is a form of method bias in which at least part of the variance in the
relationship between two or more tests or measures can be attributed to the format similarities
between the measures (Podsakoff, MacKenzie, & Podsakoff, 2012). So, in the absence of a
design that controls for the response formats by holding them constant across all the measures,
the shared-common-response-method explanation cannot be fully discounted as an alternative
explanation. Therefore, given the increased sensitivity to threats associated with common-
method bias, a lab-based experiment (Study 2) that simulates the requisite conditions and
investigates their resultant effects was undertaken to address this concern. So, Study 2 crossed
the three response formats with the integrity-based SJT measure, and measures of GMA and the
specified FFM personality traits. Consequently, if the shared-common-response-method
explanation best accounts for the observed response format effects, then the highest positive
relationships should be obtained for the matched response formats (e.g., rank-SJT/rank-GMA,
rate-SJT/rate-GMA, and most/least-SJT/most/least-GMA) compared to the other (mismatched)
conditions. Specifically, the correlations for the matched response formats should be positive and
largest compared to the mismatched correlations, which ideally should all be zero. On the other hand, if
the differences-in-g-loading explanation best accounts for the results of Study 1, then said results
should be replicated such that the rank-SJT should display the strongest positive relationships
with GMA regardless of the GMA response format, and the rate-SJT should display the strongest
positive correlations with the specified FFM personality traits regardless of the personality
measure response format.
Response Distortion
Study 2 also presented an opportunity to investigate the comparative susceptibility of the
three SJT response formats to response distortion when they are used to measure a noncognitive
construct. Paulhus (2002) highlights the distinction between self-deception and impression
management as facets of social desirability responding. The focus of the present study is on
impression management or deliberate response distortion which pertains to individuals
consciously presenting themselves falsely to create a favorable impression.
Given the socially sensitive nature of the construct assessed in the present study, it is not
unreasonable to expect high levels of response distortion to threaten the efficacy and utility of an
integrity-based SJT measure. Consequently, an important question is whether the extent of this
threat varies as a function of the SJT response format. It is our proposition that the SJT response
format is a design feature that may influence the susceptibility of noncognitive SJTs to response
distortion. In some socially sensitive domains such as integrity testing and loss prevention, SJT
items may contain response options that are all considered undesirable behaviors to varying
degrees. In the context of a rate-SJT, test takers are able to rate all the responses uniformly high
or low. However, for the rank-SJT, and to a lesser extent the most/least-SJT, test takers must
make nuanced distinctions between undesirable response options. Specifically, for the rank-SJT,
test takers are forced to make distinctions between all response options. Consequently, it was
posited that the response formats differ in the degree of difficulty associated with responding in a
socially desirable manner such that because the rank- and most/least-SJTs engender making
comparative assessments between the response options, all of which may be undesirable to
varying degrees, it should be more difficult to engage in response distortion with these formats
than the rate response format.
Test Taker Reactions and Response Format Score Reliabilities
Study 2 also permitted the comparative evaluation of test takers’ reactions to the three
SJT response formats, along with the score reliabilities of the response formats. Increasing a
test’s complexity correspondingly increases its cognitive load, which is likely to exhibit an
inverse relationship with respondents’ reactions to the test. Specifically, it was posited that test
takers would react more favorably to the easier response formats. Prior work has shown that
different item formats can produce differences in test taker reactions (Shyamsunder & McCune,
2009). Furthermore, Bradshaw (1990) showed that the perceived difficulty of a placement test
was negatively related to individuals’ reactions to the test. Consequently, because of the
increased information processing demands associated with the rank response format, it was
anticipated that test takers will perceive this format as being more difficult.
Method
Participants. An initial sample of 505 students was recruited from the psychology
department human subjects pool of a large southwestern U.S. university. Complete data were
available for 492 and thus this represented the final study sample. The mean age of the sample
was 18.77 years (SD = 1.49), and 320 (65.0%) of the participants were female. Three hundred
and thirty (67.1%) self-reported their race as White. There were 18 African Americans, 86
Hispanics, and 35 Asian Americans. Five participants reported their race as “other,” 18 as multi-
racial, and one person did not report their race.
Measures.
SJT. The SJT items, response formats, scoring keys, and scoring methods were the same
as those used in Study 1. However, unlike Study 1, the SJT was administered in a paper-and-
pencil format (instead of via the internet) and under proctored conditions. Consequently,
although the SJT was still untimed, completion time was obtained via participant self-reports; the
process and procedure for obtaining these data are described in detail in the Design and
Procedure section.
GMA. The short form of the Raven’s Advanced Progressive Matrices (APM; Arthur &
Day, 1994; Arthur, Tubre, Paul, & Sanchez-Ku, 1999), which consists of 2 practice items and 12
test items, was used to operationalize GMA. Arthur et al. (1999) reported a 1-week test–retest
reliability of .76. Three different response formats for the APM that matched the SJT response
formats were developed for Study 2. Using a 5-point Likert scale (1 = very poor fit, 5 = very
good fit), the rate response format required test takers to rate how accurately each of the eight answer
options for each item fit the item pattern to correctly answer the item. For the rank response
format, test takers ranked the eight response options (1 = best fit, 8 = worst fit) on how accurately
they fit the item pattern to correctly answer the item. The most/least response format required
test takers to indicate which of the eight response options most accurately completed the item pattern
and which least accurately did so. On the basis of two rounds of pilot testing, the preceding three
response formats of the APM were administered with a 30-min time limit. However, the standard
APM, which was administered at Time 2, used the standard 15-min time limit.
The rate-, rank-, and most/least-APM scoring keys were developed for the present study
using a panel of seven upper-level industrial/organizational psychology PhD graduate students as
SMEs. The SMEs generated the keys via consensus. For the rate response format, the SMEs first
individually rated the response options for each test item on a 5-point scale (5 = very good fit)
prior to a consensus meeting. The response option with the best fit (i.e., the correct answer as per
the standard test scoring key) was rated a 5, and all other response options were assigned ratings
lower than this. Based on the initial level of SME agreement for each response option, the panel
discussed their ratings until they reached consensus. Thus, the SMEs’ consensus ratings served
as the answer key for the rate-APM. Test takers were awarded a point for each response rating
that matched the SME rating such that test takers could receive 0–8 points for each test item, and
test scores could range from 0 to 96 points (for the 12 items). However, for ease of interpretation
and comparative purposes, the scores were scaled to 100.
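The rate scoring procedure just described can be sketched as follows; the function and variable names are illustrative and are not from the original materials.

```python
def score_rate_item(ratings, key):
    """One point for each of the eight response-option ratings that
    exactly matches the SME consensus rating (0-8 points per item)."""
    return sum(1 for rating, keyed in zip(ratings, key) if rating == keyed)

def score_rate_test(responses, keys):
    """Sum the item scores across the 12 items (0-96 raw points) and
    rescale to the 0-100 metric used for reporting."""
    raw = sum(score_rate_item(r, k) for r, k in zip(responses, keys))
    return 100 * raw / 96
```

For example, a test taker whose ratings match the SME key on every option of every item would score 100, and one who matches on none would score 0.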
The SMEs’ consensus ratings for the rate response format were used to inform the
development of the rank scoring key. Specifically, the response option that was rated a 5 (i.e., the
correct answer as per the standard test scoring key) was assigned a rank of 1, and subsequent
options were ranked accordingly on the basis of their ratings. In the event of a tie in ratings, the
SMEs had to reach consensus on the appropriate rank order of the response options in question.
Like the rate-APM, the rank-APM used the same scoring method as its corresponding SJT. Thus,
test takers received a point for each response ranking that matched the SME ranking.
Consequently, test takers could receive 0–8 points for each item, and test scores could range
from 0 to 96. However, for ease of interpretation and comparative purposes, the scores were
scaled to 100.
For the most/least-APM response format, the highest and lowest ranked response options
from the rank-APM scoring key were designated as the most and least effective response
options, respectively, and the same scoring method as its corresponding SJT was used to score
test takers’ responses. Consequently, scores could range from −2 to 2 for each item, and the item
scores were summed across the 12 items such that test scores could range from −24 to 24. For
ease of interpretation and comparative purposes, the test scores were scaled to 100.
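The most/least scoring rule itself is not spelled out above (it is the same as the corresponding SJT's). One common rule consistent with the −2 to +2 per-item range awards a point for each keyed option correctly identified and deducts a point for each reversal; the sketch below implements that assumed rule, and the rescaling to 0–100 is likewise assumed to be linear.

```python
def score_most_least_item(most, least, keyed_most, keyed_least):
    """Assumed rule: +1 if 'most' is the keyed-most option, -1 if it is
    the keyed-least option; likewise for 'least'. Range: -2 to +2."""
    score = 0
    if most == keyed_most:
        score += 1
    elif most == keyed_least:
        score -= 1
    if least == keyed_least:
        score += 1
    elif least == keyed_most:
        score -= 1
    return score

def scale_to_100(total, lo=-24, hi=24):
    """Assumed linear rescaling of the summed score (-24 to 24 over
    12 items) onto a 0-100 metric."""
    return 100 * (total - lo) / (hi - lo)
```

Under this rule a fully correct item scores +2, a fully reversed item scores −2, and choosing two unkeyed options scores 0.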
Personality measure. A 50-item (10 items per dimension) FFM International Personality
Item Pool measure (Goldberg, 1999; Goldberg et al., 2006) was used to measure the personality
traits of interest. Thus, although the whole measure was administered, only the agreeableness,
conscientiousness, and emotional stability factors were used for the purposes of the present
study. Goldberg (1992, 1999) reported internal consistency reliability estimates of .82, .79,
and .86 for agreeableness, conscientiousness, and emotional stability scores, respectively.
Although the administration of the personality measures was not timed, participants were asked
to record their start and end times in their test booklets.
Three different response formats for the personality measure that matched the SJT and
APM response formats were used. The first was the standard Likert response format in which
participants rated the items on a 5-point scale (1 = very inaccurate, 5 = very accurate) in terms of
the extent to which each item statement was descriptive of them. After reverse scoring the
specified items, each dimension score was computed as the average of the responses to the items
that comprised the specified dimension. Dimension scores could therefore range from 1 to 5. The
internal consistency reliability estimates obtained for this response format were .80, .80, and .83
for agreeableness, conscientiousness, and emotional stability, respectively.
The rank response format used a frequency-based response format in which participants
estimated the relative frequency of how often each of the five response levels (ranging from very
inaccurate to very accurate) for each item was descriptive of their behavior (Edwards & Woehr,
2007). Thus, participants assigned a percentage to each response level accordingly, without ties,
such that the percentages summed to 100%. The resultant effect was that in the absence of ties,
participants had to rank order the response levels in terms of how they were descriptive of them.
Using the same scoring method reported by Edwards and Woehr (2007), the score for each
dimension could range from 1 to 5. The internal consistency reliability estimates obtained for the
rank response format were .77, .72, and .84 for agreeableness, conscientiousness, and emotional
stability, respectively.
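On one plausible reading of the Edwards and Woehr (2007) scoring method, an item's score is the percentage-weighted average of the five response levels, and a dimension score is the mean of its item scores; a minimal sketch under that assumption follows.

```python
def frequency_item_score(percentages, reverse=False):
    """Weighted average of the five response levels (1-5), weighted by
    the percentages (summing to 100) the respondent assigned to each
    level; reverse-keyed items are flipped on the 1-5 metric.
    Yields a score between 1 and 5."""
    score = sum(level * pct / 100 for level, pct in zip(range(1, 6), percentages))
    return 6 - score if reverse else score

def dimension_score(item_scores):
    """Dimension score = mean of the dimension's item scores (range 1-5)."""
    return sum(item_scores) / len(item_scores)
```

For instance, assigning 100% to the top level yields an item score of 5, and spreading the percentages evenly yields the scale midpoint of 3.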
For the most/least response format for the personality measure, participants indicated
which of the five response levels (ranging from very inaccurate to very accurate) was most
descriptive of them and which was least descriptive. The score for each item was computed as
the difference between the numerical value of the most and the least response. Consequently,
participants’ scores could range from −4 to 4 for each item. After reverse scoring the specified
items, the dimension score (i.e., agreeableness, conscientiousness, and emotional stability) was
computed as the average of the scores for the items that comprised that dimension. Therefore,
test scores could range from −4 to 4 for each dimension. For ease of interpretation and
comparative purposes, scores were scaled to a 1–5 point scale for each dimension to match the
score range of the rate and rank response formats. The internal consistency reliability estimates
obtained for the most/least response format were .74, .78, and .78 for agreeableness,
conscientiousness, and emotional stability, respectively.
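The item scoring and the rescaling from the −4-to-4 metric onto the 1-to-5 metric can be sketched as follows; the rescaling is assumed to be linear, which maps −4 → 1, 0 → 3, and 4 → 5.

```python
def most_least_item_score(most_level, least_level):
    """Item score = numerical value of the 'most' response level minus
    that of the 'least' response level; ranges from -4 to 4."""
    return most_level - least_level

def rescale_to_1_5(dimension_mean):
    """Assumed linear mapping of the -4..4 dimension mean onto the 1-5
    metric used by the rate and rank response formats."""
    return 1 + (dimension_mean + 4) / 2
```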
Response distortion. Response distortion was operationalized as scores on the 20-item
impression management scale of the Balanced Inventory of Desirable Responding, Version 6,
Form 40A (BIDR; Paulhus, 1991). Participants rated each item on a 7-point Likert scale (1 = not
true, 7 = very true), and after reverse scoring the specified items, the test taker’s score was
computed as the number of items rated as a 6 or 7 (Paulhus, 1991). Consequently, scores could
range from 0 to 20. Reported internal consistency reliability estimates range from .77 to .86
(Konstabel, Aavik, & Allik, 2006; Stober & Dette, 2002). An internal consistency reliability
of .72 was obtained for the present study.
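The dichotomous BIDR scoring described above reduces to a simple count; a sketch follows, in which the positions of the reverse-keyed items are supplied by the caller rather than reproduced here.

```python
def bidr_im_score(responses, reverse_keyed):
    """Paulhus (1991) dichotomous scoring: reverse-score the specified
    items on the 7-point scale (r -> 8 - r), then count the items rated
    6 or 7. For the 20-item IM scale, scores range from 0 to 20."""
    adjusted = [8 - r if rev else r for r, rev in zip(responses, reverse_keyed)]
    return sum(1 for r in adjusted if r >= 6)
```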
Test taker reactions. Three items, rated on a 5-point Likert scale (1 = strongly disagree, 5
= strongly agree), were created to measure the perceived difficulty of the SJT response format
that test takers completed at Time 2. The test taker’s score was the average of the three items,
and an internal consistency reliability estimate of .84 was obtained for the perceived difficulty
ratings.
A single item was developed to assess test takers’ preferences for each of the three SJT
response formats. Specifically, participants were presented with a single SJT sample item
depicting all three response formats. They were then instructed to rate (1–5) the extent to which
they would prefer to use each response format if they had to complete the SJT again. Thus,
unlike the difficulty ratings where participants rated only the SJT they completed at Time 2,
participants provided preference ratings for all three response formats.
Design and procedure. Study 2 used a two-part mixed factorial design, which is
illustrated in Table S1. Thus, participants were assessed at two time points (each lasting one and
a half hours). At Time 1, participants were randomly assigned to complete one of the three SJT
response formats and one of the three GMA/FFM response formats (i.e., the GMA and FFM
measure completed by the participant had the same response format). Therefore, the Time 1
design was a 3 (SJT: rate vs. rank vs. most/least) × 3 (GMA/FFM: rate vs. rank vs. most/least)
between-subjects design. After a 5–9 day interval (M = 6.99, SD = 0.74; 395 [80.3%] of the
sample had a 7-day retest interval), participants returned for the Time 2 session where they were
again randomly assigned to complete one of the three SJT response formats. Consequently, at
Time 2 approximately a third of the participants retested on the same version of the SJT that they
had completed at Time 1, while the remaining two thirds completed a different version. All
participants completed the standard APM, the BIDR, and the test taker reactions measure.
For both study sessions, participants were tested in groups of approximately 100. Each
participant received a test booklet that contained the measures for their condition for the
specified session. So, for the Time 1 session, the participants’ test booklets corresponded to one
of the nine conditions illustrated in Table S1. Because the APM was timed, it was always
administered first, and so all participants started the APM at the same time, and no one exceeded
the allotted time limit. A 30-min limit was used for the rate-, rank-, and most/least-APM at Time
1, and the standard 12-min limit was used for the standard APM at Time 2. After the
administration of the APM, the participants then completed the measures in their test booklets at
their own pace. A digital clock was displayed on an overhead projector screen, and participants
were instructed to record their start and end times in the specified section in their test booklets
for the SJT and personality measure. Concerning the presentation order, at Time 1, participants
completed the APM, then the SJT and FFM measure as listed. At Time 2, they again first
completed the APM, the BIDR next, and then the test taker reactions measure.
Results
Table S2 presents the descriptive statistics for the study variables that used the rate, rank,
and most/least response formats, and Table S3 presents the same statistics for those variables that
used their standard response formats. The results for the competitive tests for the differences-in-
g-loading versus shared-common-response-method explanations are presented in Table S4. For
these competitive tests, if the shared-common-response-method explanation best accounts for the
observed response format effects, then the highest positive correlations should be obtained for
the matched response formats (e.g., rank-SJT/rank-GMA, rate-SJT/rate-GMA, and most/least-
SJT/most/least-GMA) compared to the other (mismatched) conditions. Specifically, the matched
response formats (which are represented by the underlined correlations [the diagonal cells] in
Table S4) should be positive and largest compared to the off-diagonal correlations which ideally,
should all be zero.
The results reported in Table S4 did not indicate a pattern that is supportive of the shared-
common-response-method explanation. First, for the standard (multiple-choice) GMA, the
pattern of results replicated those for Study 1—the rank-SJT displayed the strongest relationship
with GMA scores, followed by the most/least-SJT, and then the rate-SJT. Second, in general, the
rank-SJT displayed the strongest relationships with GMA scores regardless of the GMA response
format; and much weaker relationships were obtained for the rate- and most/least-SJTs and
GMA scores, again, regardless of the response format.
A similar pattern of results was obtained in reference to the personality relationships as
well. Specifically, the results were not indicative of a pattern where the relationships for the
matched response formats (i.e., the underlined correlations in Table S4) were positive and largest
compared to the other (mismatched) conditions. However, it is also worth noting that the FFM
results for Study 2 did not unambiguously replicate those for Study 1 in that they were not all
statistically significant and their magnitudes were smaller than that obtained in Study 1.
Nevertheless, in terms of their patterns, the rate-SJT generally displayed higher relationships
with the standard FFM scores (average of Time 1 and Time 2 correlations = .30, .09, and .11 for
agreeableness, conscientiousness, and emotional stability, respectively) compared to the rank-
SJT (corresponding average of Time 1 and Time 2 correlations are .00, .19, and .13) and the
most/least-SJT (corresponding average of Time 1 and Time 2 correlations are .06, .02, and .10).
Thus, in summary, in terms of both the GMA and personality traits, the results were more
aligned with the differences-in-g-loading, and not the shared-common-response-method
explanation; however, they did not entirely eliminate the latter.
The completion time results for Study 1 were also replicated in Study 2. As the results in
Table S3 indicate, the rank-SJT had a completion time that was longer than the most/least-SJT,
which was in turn longer than the rate-SJT. It is also noteworthy that the completion times for the
FFM measure (i.e., rate mean = 3.73 min, SD = 1.12; rank mean = 22.23 min, SD = 9.86; and
most/least mean = 6.67 min, SD = 1.96) paralleled those for the SJT measure. This convergence of
results across these two measures suggests that at least in the context of noncognitive measures,
the differential cognitive load engendered by the different response formats may be construct
invariant; ranking is a more cognitively demanding task and accordingly takes longer to
complete.
Table S5 presents the sex-based subgroup differences for each SJT response format.
Consistent with the results for Study 1, these results indicate that women generally obtained
higher scores than men, with these differences being statistically significant with the rank-SJT at
both Time 1 and Time 2. The comparatively small number of non-White participants and the
restricted age of the sample did not permit a replicative investigation of race- and age-based
subgroup differences, respectively.
The correlations between the SJT and response distortion scores, which are presented in
Table S4, indicated a positive relationship such that individuals who were likely responding in a
socially desirable manner also had higher SJT scores. The largest correlation was obtained for
the rate-SJT (average of Time 1 and Time 2 correlation = .29), which was larger than that
obtained for the most/least-SJT (average of Time 1 and Time 2 correlation = .15). However, the
difference was not statistically significant (zr = 1.33, p > .05). And contrary to expectation, the
rank-SJT correlation was also larger (average of Time 1 and Time 2 correlation = .23) than that
for the most/least-SJT, but they were not significantly different (zr = 0.75, p > .05). Hence, it
would seem that in terms of the magnitude of the correlations, the rate-SJT was the most
susceptible to response distortion, and the most/least-SJT was the least susceptible.
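The reported z tests for differences between correlations (e.g., zr = 1.33) are consistent with Fisher's r-to-z comparison for correlations from independent samples. A minimal sketch of that test follows; the sample sizes passed in are placeholders, so this is not an attempt to reproduce the reported values.

```python
import math

def fisher_z(r):
    """Fisher r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def z_diff(r1, n1, r2, n2):
    """z statistic for the difference between two correlations obtained
    from independent samples of size n1 and n2."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se
```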
Concerning test taker reactions, the results presented in Table S3 indicate that the rank-
SJT was rated as the most difficult response format, with the rate- and most/least being of equal
difficulty. In addition, in terms of preferences, the most/least-SJT was the most preferred,
followed by the rate-SJT, and then the rank-SJT, which was the least preferred.
Table S6 presents the reliability estimates. As in Study 1, the SJT scores displayed high
levels of internal consistency with the rate-SJT having the highest reliability estimate (Time 1
= .94, Time 2 = .95), followed by the rank-SJT (Time 1 = .76, Time 2 = .81), and then the
most/least-SJT (Time 1 = .69, Time 2 = .76). The test–retest reliabilities also indicate that the
rate-SJT had higher levels of reliability (.69) than the most/least-SJT (.64), which was in turn
higher than the rank-SJT (.59). And finally, the rank-SJT and most/least-SJT displayed the
highest alternate-form reliabilities (.67 and .70), followed by the rate-SJT and rank-SJT (.47
and .35), and then the rate-SJT and most/least-SJT (.29 and .40).
Discussion
We acknowledge that the rate, rank, and most/least response formats for the GMA test,
and the rank and most/least for the personality measure are atypical. Nevertheless, they were
necessary to permit the comparative test of the differences-in-g-loading versus shared-common-
response-method explanations. That being said, the results of Study 2 are in accord with those of
Study 1 in terms of (a) the SJT completion times; (b) the sex-based subgroup differences; (c) the
internal consistency reliability estimates; and more importantly (d) the postulated relationships
with GMA and the specified FFM personality traits, a set of findings that were generally
replicated irrespective of the response format of the personality or GMA measure. In spite of the
noticeable differences in design, measures, and sample type (a summary of which is presented in
Table S7), the convergence between the results of the two studies that were undertaken to
investigate these issues was very high, lending further support to the robustness of the findings.
However, it is acknowledged that the FFM results for Study 2 did not unambiguously replicate
those for Study 1; nevertheless, the general pattern of results was the same.
Study 2 also permitted the investigation of additional issues that were not undertaken in
Study 1. These additional results indicated the rate-SJT was the most susceptible to response
distortion and the most/least-SJT the least. The rate- and rank-SJT correlations with response
distortion were quite similar in magnitude. Concerning test taker reactions, the rank-SJT
engendered the least favorable reactions, and the rate response format displayed comparatively
more favorable test taker reactions compared to the rank. This is probably because the rate
response format allows for ties between response options, such that test takers are not forced to
differentiate between similar response options (as may be the case with the rank, and to a lesser
extent, the most/least response formats); all response options can be assigned the same
effectiveness rating within a given item making it an easier cognitive task. In contrast, the rank
format forces test takers to make distinctions between response options that may be quite similar
(from their perspective).
Finally, the rate-SJT scores demonstrated the highest levels of internal consistency and
test–retest reliability, and again, the rank-SJT the least. However, the rank- and most/least-SJT
demonstrated the highest alternate-form reliability, and the rate-SJT displayed the lowest
correlation with the other two response formats. This lends additional credence to the
differences-in-g-loading explanation. Specifically, because the rank and most/least response
formats have similar levels of cognitive load, it would be expected that they would display the
highest inter-form correlation, as was obtained here.
References
Arthur, W., Jr., & Day, D. V. (1994). Development of a short form for the Raven Advanced
Progressive Matrices Test. Educational and Psychological Measurement, 54, 394–403.
Arthur, W., Jr., Tubre, T. C., Paul, D. S., & Sanchez-Ku, M. L. (1999). College-sample
psychometric and normative data on a short form of the Raven Advanced Progressive
Matrices Test. Journal of Psychoeducational Assessment, 17, 354–361.
Bradshaw, J. (1990). Test-takers’ reactions to a placement test. Language Testing, 7, 13–30.
Edwards, B. D., & Woehr, D. J. (2007). An examination and evaluation of frequency-based
personality measurement. Personality and Individual Differences, 43, 803–814.
Goldberg, L. R. (1992). The development of markers for the Big-Five factor structure.
Psychological Assessment, 4, 26–42.
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the
lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, &
F. Ostendorf (Eds.), Personality psychology in Europe (Vol. 7, pp. 7–28). Tilburg, the
Netherlands: Tilburg University Press.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., &
Gough, H. G. (2006). The international personality item pool and the future of public-
domain personality measures. Journal of Research in Personality, 40, 84–96.
Konstabel, K., Aavik, T., & Allik, J. (2006). Social desirability and consensual validity of
personality traits. European Journal of Personality, 20, 549–566.
Paulhus, D. L. (1991). Balanced Inventory of Desirable Responding (BIDR) reference manual
for Version 6. (Manual available from author at the Department of Psychology,
University of British Columbia, Vancouver, British Columbia V6T 1Y7, Canada.)
Paulhus, D. L. (2002). Socially desirable responding: The evolution of the construct. In H.
Braun, D. Jackson, & D. Wiley (Eds.), The role of constructs in psychological and
educational measurement (pp. 49–69). Mahwah, NJ: Erlbaum.
Podsakoff, P. M., MacKenzie, S. B., & Podsakoff, N. P. (2012). Sources of method bias in social
science research and recommendations on how to control it. Annual Review of
Psychology, 63, 539–569.
Shyamsunder, A., & McCune, E. A. (2009). Test-taker reactions to item formats used in online
selection assessments. Paper presented at the 24th Annual Conference of the Society for
Industrial and Organizational Psychology, Atlanta, GA.
Stober, J., & Dette, D. E. (2002). Comparing continuous and dichotomous scoring of the
Balanced Inventory of Desirable Responding. Journal of Personality Assessment, 78,
370–389.
Table S1
Study 2 Research Design and Participant Assignments
                                      SJT response format
                                   Rate      Rank      Most/least
TIME 1   GMA1/FFM response format
           Rate                     57        54          53
           Rank                     55        53          57
           Most/least               54        54          55
TIME 2   GMA2, RD, test taker      149       162         181
         reactions
Note. The numbers in the cells represent the number of participants in each condition. GMA1 = the short form of the Raven’s Advanced Progressive Matrices using either the rate, rank, or most/least response format; GMA2 = the short form of the Raven’s Advanced Progressive Matrices using the standard response format; FFM = personality measure (i.e., International Personality Item Pool) using either the rate, rank, or most/least response format; RD = response distortion measure (i.e., Balanced Inventory of Desirable Responding).
Table S2
Study 2 Descriptive Statistics for Variables That Used the Rate, Rank, and Most/Least Response Formats
                                        Response format
Variable                       Rate            Rank          Most/least
                             M      SD       M      SD       M      SD
SJT 1                      59.32  16.35    53.47  11.72    83.05   6.64
SJT 2                      56.86  18.25    53.38  13.61    83.92   7.40
GMA                        39.51   7.27    25.24  11.33    76.42   7.65
Agreeableness               4.11   0.53     3.62   0.46     4.14   0.57
Conscientiousness           3.69   0.63     3.52   0.43     3.77   0.66
Emotional stability         3.21   0.70     3.11   0.55     3.18   0.74
Note. SJT scores range from 0 to 100, GMA scores from 0 to 100, and FFM scores from 1 to 5. Because the absolute magnitudes of the means and standard deviations (but not the correlations) depend on the specific scoring algorithm used to score the tests, statistical tests for differences between the means of the different response formats are not presented.
Table S3
Study 2 Descriptive Statistics for Variables That Used Their Standard Response Formats
Variable M SD
GMA 68.31 19.52
Response distortion 6.85 3.48
SJT completion times (Time 1)
Rate 12.28a 3.47
Rank 14.75b 3.63
Most/least 13.29c 3.22
SJT completion times (Time 2)
Rate 8.84a 2.45
Rank 10.42b 2.98
Most/least 8.98a 3.08
FFM completion times
Rate 3.73a 1.12
Rank 22.23b 9.86
Most/least 6.67c 1.96
Perceived difficulty of SJT response format
Rate 2.08a 0.90
Rank 2.33b 0.81
Most/least 2.10a 0.87
Preference for SJT response format
Rate 3.33a 1.48
Rank 2.56b 1.38
Most/least 3.77c 1.45
Note. GMA scores range from 0 to 100; response distortion from 0 to 20; and SJT perceived difficulty and preference from 1 to 5. Neither the difficulty nor preference ratings were related to the Time 2 SJT (response format) condition. Completion times are reported in minutes. Where there are multiple rows for a variable, means with different superscripts are significantly different from each other (p < .05, one-tailed).
Table S4
Study 2 Integrity-Based Situational Judgment Test Correlations With General Mental Ability and the Specified Five-Factor Model Personality Traits for All Response Formats
                                       SJT response format
                              Rate            Rank            Most/least
                          Time 1  Time 2  Time 1  Time 2  Time 1  Time 2
GMA response format
  Standard                  .01     .06     .29*    .36*    .11     .21*
  Rate                      .06     .14     .36*    .46*    .12     .11
  Rank                      .11     .08     .33*    .35*    .06     .00
  Most/least                .08     .16     .04     .10     .04     .15
FFM rate format
  Agreeableness             .30*    .30*    .10     .10     .10     .02
  Conscientiousness         .00     .17     .25*    .12     .09     .13
  Emotional stability       .27*    .06     .15     .10     .12     .07
FFM rank format
  Agreeableness             .04     .14     .19     .25*    .17     .34*
  Conscientiousness         .19     .02     .11     .13     .01     .12
  Emotional stability       .02     .13     .26*    .15     .17     .10
FFM most/least format
  Agreeableness             .07     .27*    .08     .09     .12     .03
  Conscientiousness         .14     .39*    .17     .40*    .08     .09
  Emotional stability       .15     .05     .04     .29*    .07     .00
Response distortion         .30*    .27*    .17*    .28*    .16*    .13*
Note. Underlined correlations represent those that would be expected to be highest positive correlations relative to the others, if the results are indicative of a shared-common-response-method effect.
*p < .05 (one-tailed).
Table S5
Study 2 Sex-Based Subgroup Differences for Integrity-Based Situational Judgment Tests for All Response Formats
                 Male                     Female
              N     M      SD         N     M      SD        d
Time 1
Rate          51  56.84  17.48       114  60.42  15.78    −0.22
Rank          56  49.36  12.80       107  55.63  10.56    −0.55*
Most/least    65  82.83   7.48        99  83.19   6.06    −0.05
Time 2
Rate          55  56.42  18.73        93  57.12  18.06    −0.04
Rank          57  49.86  14.02       105  55.30  13.06    −0.41*
Most/least    61  83.21   7.33       121  84.27   7.44    −0.14
Note. Females are compared to males such that a positive d indicates that males scored higher than females.
*p < .05 (one-tailed).
Table S6
Study 2 Test–Retest, Alternate-Form, and Internal Consistency Reliabilities for the Integrity-Based Situational Judgment Test Scores
                              Time 1
               Test–retest and alternate-form        Coefficient α
                 Rate       Rank       Most/least
Time 2
  Rate           .69        .47 A1     .29 B1            .95
  Rank           .35 A2     .59        .67 C1            .81
  Most/least     .40 B2     .70 C2     .64               .76
Coefficient α    .94        .76        .69
Note. Retest interval = 5–9 days, M = 6.99, SD = 0.74; 80.3% of the sample had a retest interval of 7 days. Test–retest reliabilities are on the diagonal. The length of the retest interval was not related to the retest condition (i.e., the Time 2 SJT response format condition). Superscripts indicate alternate-form reliability pairs such that, for instance, A1 denotes the rank/rate and A2 the rate/rank alternate-form reliabilities.
Table S7
Differences in Study 1 and Study 2 Design and Methodological Protocol
Study 1 Study 2
Quasi-experimental design Experimental design
Between-subjects design Between- and within-subjects design
Operational field setting Lab setting
Job applicants College students
High-stakes testing Low-stakes testing
Large sample (n = 31,194) Relatively small sample (n = 492)
Internet-based protocol Paper-and-pencil assessment
Unproctored administration of measures Proctored administration of measures
Objectively recorded completion times Self-recorded completion times
Proprietary GMA and FFM measures Standardized GMA and FFM measures