Supplemental Materials
Comparative Evaluation of Three Situational Judgment Test Response Formats in Terms of Construct-Related Validity, Subgroup Differences, and Susceptibility to Response
Distortion
by W. Arthur Jr. et al., 2014, Journal of Applied Psychology
http://dx.doi.org/10.1037/a0035788
Study 2
Competing Explanations for Study 1’s Results
The gist of the shared-common-response-method explanation of Study 1’s (i.e., the field
study) results is that because the rate-SJT shared a similar response format with the personality
measure (i.e., a Likert-type response format), and the rank- and most/least-SJTs shared what
appears to be a similar response format with the GMA test (i.e., a multiple-choice-type response
format), the rate-SJT–FFM personality traits relationships and the rank-SJT–GMA relationship
are an artifact of the shared common response method instead of the posited differences in the
cognitive and information processing demands. Specifically, the posited shared-common-
response-method effect is a form of method bias in which at least part of the variance in the
relationship between two or more tests or measures can be attributed to the format similarities
between the measures (Podsakoff, MacKenzie, & Podsakoff, 2012). So, in the absence of a
design that controls for the response formats by holding them constant across all the measures,
the shared-common-response-method explanation cannot be fully discounted as an alternative
explanation. Therefore, given the increased sensitivity to threats associated with common-
method bias, a lab-based experiment (Study 2) that simulates the requisite conditions and
investigates their resultant effects was undertaken to address this concern. So, Study 2 crossed
the three response formats with the integrity-based SJT measure, and measures of GMA and the
specified FFM personality traits. Consequently, if the shared-common-response-method
explanation best accounts for the observed response format effects, then the highest positive
relationships should be obtained for the matched response formats (e.g., rank-SJT/rank-GMA,
rate-SJT/rate-GMA, and most/least-SJT/most/least-GMA) compared to the other (mismatched)
conditions. Specifically, the correlations for the matched response formats should be positive and
largest compared to the mismatched correlations, which ideally should all be zero. On the other hand, if
the differences-in-g-loading explanation best accounts for the results of Study 1, then said results
should be replicated such that the rank-SJT should display the strongest positive relationships
with GMA regardless of the GMA response format, and the rate-SJT should display the strongest
positive correlations with the specified FFM personality traits regardless of the personality
measure response format.
Response Distortion
Study 2 also presented an opportunity to investigate the comparative susceptibility of the
three SJT response formats to response distortion when they are used to measure a noncognitive
construct. Paulhus (2002) highlights the distinction between self-deception and impression
management as facets of social desirability responding. The focus of the present study is on
impression management or deliberate response distortion which pertains to individuals
consciously presenting themselves falsely to create a favorable impression.
Given the socially sensitive nature of the construct assessed in the present study, it is not
unreasonable to expect high levels of response distortion to threaten the efficacy and utility of an
integrity-based SJT measure. Consequently, an important question is whether the extent of this
threat varies as a function of the SJT response format. It is our proposition that the SJT response
format is a design feature that may influence the susceptibility of noncognitive SJTs to response
distortion. In some socially sensitive domains such as integrity testing and loss prevention, SJT
items may contain response options that are all considered undesirable behaviors to varying
degrees. In the context of a rate-SJT, test takers are able to rate all the responses uniformly high
or low. However, for the rank-SJT, and to a lesser extent the most/least-SJT, test takers must
make nuanced distinctions between undesirable response options. Specifically, for the rank-SJT,
test takers are forced to make distinctions between all response options. Consequently, it was
posited that the response formats differ in the degree of difficulty associated with responding in a
socially desirable manner such that because the rank- and most/least-SJTs engender making
comparative assessments between the response options, all of which may be undesirable to
varying degrees, it should be more difficult to engage in response distortion with these formats
than the rate response format.
Test Taker Reactions and Response Format Score Reliabilities
Study 2 also permitted the comparative evaluation of test takers’ reactions to the three
SJT response formats, along with the score reliabilities of the response formats. Increasing a
test’s complexity correspondingly increases its cognitive load, which is likely to exhibit an
inverse relationship with respondents’ reactions to the test. Specifically, it was posited that test
takers would react more favorably to the easier response formats. Prior work has shown that
different item formats can produce differences in test taker reactions (Shyamsunder & McCune,
2009). Furthermore, Bradshaw (1990) showed that the perceived difficulty of a placement test
was negatively related to individuals’ reactions to the test. Consequently, because of the
increased information processing demands associated with the rank response format, it was
anticipated that test takers will perceive this format as being more difficult.
Method
Participants. An initial sample of 505 students was recruited from the psychology
department human subjects pool of a large southwestern U.S. university. Complete data were
available for 492 and thus this represented the final study sample. The mean age of the sample
was 18.77 years (SD = 1.49), and 320 (65.0%) of the participants were female. Three hundred
and thirty (67.1%) self-reported their race as White. There were 18 African Americans, 86
Hispanics, and 35 Asian Americans. Five participants reported their race as “other,” 18 as multi-
racial, and one person did not report their race.
Measures.
SJT. The SJT items, response formats, scoring keys, and scoring methods were the same
as those used in Study 1. However, unlike Study 1, the SJT was administered in a paper-and-
pencil format (instead of via the internet) and under proctored conditions. Consequently,
although the SJT was still untimed, completion time was obtained via participant self-reports; the
process and procedure for obtaining these data are described in detail in the Design and
Procedure section.
GMA. The short form of the Raven’s Advanced Progressive Matrices (APM; Arthur &
Day, 1994; Arthur, Tubre, Paul, & Sanchez-Ku, 1999), which consists of 2 practice items and 12
test items, was used to operationalize GMA. Arthur et al. (1999) reported a 1-week test–retest
reliability of .76. Three different response formats for the APM that matched the SJT response
formats were developed for Study 2. Using a 5-point Likert scale (1 = very poor fit, 5 = very
good fit), the rate response format required test takers to rate how accurately each of the eight answer
options for each item fit the item pattern to correctly answer the item. For the rank response
format, test takers ranked the eight response options (1 = best fit, 8 = worst fit) on how accurately
they fit the item pattern to correctly answer the item. The most/least response format required
test takers to indicate which of the eight response options most accurately completed the item pattern
and which least accurately did so. On the basis of two rounds of pilot testing, the preceding three
response formats of the APM were administered with a 30-min time limit. However, the standard
APM, which was administered at Time 2, used the standard 15-min time limit.
The rate-, rank-, and most/least-APM scoring keys were developed for the present study
using a panel of seven upper-level industrial/organizational psychology PhD graduate students as
SMEs. The SMEs generated the keys via consensus. For the rate response format, the SMEs first
individually rated the response options for each test item on a 5-point scale (5 = very good fit)
prior to a consensus meeting. The response option with the best fit (i.e., the correct answer as per
the standard test scoring key) was rated a 5, and all other response options were assigned ratings
lower than this. Based on the initial level of SME agreement for each response option, the panel
discussed their ratings until they reached consensus. Thus, the SMEs’ consensus ratings served
as the answer key for the rate-APM. Test takers were awarded a point for each response rating
that matched the SME rating such that test takers could receive 0–8 points for each test item, and
test scores could range from 0 to 96 points (for the 12 items). However, for ease of interpretation
and comparative purposes, the scores were scaled to 100.
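The rate scoring procedure just described can be sketched as follows; the function and variable names are illustrative and are not from the original materials.

```python
def score_rate_item(ratings, key):
    """One point for each of the eight response-option ratings that
    exactly matches the SME consensus rating (0-8 points per item)."""
    return sum(1 for rating, keyed in zip(ratings, key) if rating == keyed)

def score_rate_test(responses, keys):
    """Sum the item scores across the 12 items (0-96 raw points) and
    rescale to the 0-100 metric used for reporting."""
    raw = sum(score_rate_item(r, k) for r, k in zip(responses, keys))
    return 100 * raw / 96
```

For example, a test taker whose ratings match the SME key on every option of every item would score 100, and one who matches on none would score 0.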
The SMEs’ consensus ratings for the rate response format were used to inform the
development of the rank scoring key. Specifically, the response option that was rated a 5 (i.e., the
correct answer as per the standard test scoring key) was assigned a rank of 1, and subsequent
options were ranked accordingly on the basis of their ratings. In the event of a tie in ratings, the
SMEs had to reach consensus on the appropriate rank order of the response options in question.
Like the rate-APM, the rank-APM used the same scoring method as its corresponding SJT. Thus,
test takers received a point for each response ranking that matched the SME ranking.
Consequently, test takers could receive 0–8 points for each item, and test scores could range
from 0 to 96. However, for ease of interpretation and comparative purposes, the scores were
scaled to 100.
For the most/least-APM response format, the highest and lowest ranked response options
from the rank-APM scoring key were designated as the most and least effective response
options, respectively, and the same scoring method as its corresponding SJT was used to score
test takers’ responses. Consequently, scores could range from −2 to 2 for each item, and the item
scores were summed across the 12 items such that test scores could range from −24 to 24. For
ease of interpretation and comparative purposes, the test scores were scaled to 100.
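The most/least scoring rule itself is not spelled out above (it is the same as the corresponding SJT's). One common rule consistent with the −2 to +2 per-item range awards a point for each keyed option correctly identified and deducts a point for each reversal; the sketch below implements that assumed rule, and the rescaling to 0–100 is likewise assumed to be linear.

```python
def score_most_least_item(most, least, keyed_most, keyed_least):
    """Assumed rule: +1 if 'most' is the keyed-most option, -1 if it is
    the keyed-least option; likewise for 'least'. Range: -2 to +2."""
    score = 0
    if most == keyed_most:
        score += 1
    elif most == keyed_least:
        score -= 1
    if least == keyed_least:
        score += 1
    elif least == keyed_most:
        score -= 1
    return score

def scale_to_100(total, lo=-24, hi=24):
    """Assumed linear rescaling of the summed score (-24 to 24 over
    12 items) onto a 0-100 metric."""
    return 100 * (total - lo) / (hi - lo)
```

Under this rule a fully correct item scores +2, a fully reversed item scores −2, and choosing two unkeyed options scores 0.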
Personality measure. A 50-item (10 items per dimension) FFM International Personality
Item Pool measure (Goldberg, 1999; Goldberg et al., 2006) was used to measure the personality
traits of interest. Thus, although the whole measure was administered, only the agreeableness,
conscientiousness, and emotional stability factors were used for the purposes of the present
study. Goldberg (1992, 1999) reported internal consistency reliability estimates of .82, .79,
and .86 for agreeableness, conscientiousness, and emotional stability scores, respectively.
Although the administration of the personality measures was not timed, participants were asked
to record their start and end times in their test booklets.
Three different response formats for the personality measure that matched the SJT and
APM response formats were used. The first was the standard Likert response format in which
participants rated the items on a 5-point scale (1 = very inaccurate, 5 = very accurate) in terms of
the extent to which each item statement was descriptive of them. After reverse scoring the
specified items, each dimension score was computed as the average of the responses to the items
that comprised the specified dimension. Dimension scores could therefore range from 1 to 5. The
internal consistency reliability estimates obtained for this response format were .80, .80, and .83
for agreeableness, conscientiousness, and emotional stability, respectively.
The rank response format used a frequency-based response format in which participants
estimated the relative frequency of how often each of the five response levels (ranging from very
inaccurate to very accurate) for each item was descriptive of their behavior (Edwards & Woehr,
2007). Thus, participants assigned a percentage to each response level accordingly, without ties,
such that the percentages summed to 100%. The resultant effect was that in the absence of ties,
participants had to rank order the response levels in terms of how they were descriptive of them.
Using the same scoring method reported by Edwards and Woehr (2007), the score for each
dimension could range from 1 to 5. The internal consistency reliability estimates obtained for the
rank response format were .77, .72, and .84 for agreeableness, conscientiousness, and emotional
stability, respectively.
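On one plausible reading of the Edwards and Woehr (2007) scoring method, an item's score is the percentage-weighted average of the five response levels, and a dimension score is the mean of its item scores; a minimal sketch under that assumption follows.

```python
def frequency_item_score(percentages, reverse=False):
    """Weighted average of the five response levels (1-5), weighted by
    the percentages (summing to 100) the respondent assigned to each
    level; reverse-keyed items are flipped on the 1-5 metric.
    Yields a score between 1 and 5."""
    score = sum(level * pct / 100 for level, pct in zip(range(1, 6), percentages))
    return 6 - score if reverse else score

def dimension_score(item_scores):
    """Dimension score = mean of the dimension's item scores (range 1-5)."""
    return sum(item_scores) / len(item_scores)
```

For instance, assigning 100% to the top level yields an item score of 5, and spreading the percentages evenly yields the scale midpoint of 3.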
For the most/least response format for the personality measure, participants indicated
which of the five response levels (ranging from very inaccurate to very accurate) was most
descriptive of them and which was least descriptive. The score for each item was computed as
the difference between the numerical value of the most and the least response. Consequently,
participants’ scores could range from −4 to 4 for each item. After reverse scoring the specified
items, the dimension score (i.e., agreeableness, conscientiousness, and emotional stability) was
computed as the average of the scores for the items that comprised that dimension. Therefore,
test scores could range from −4 to 4 for each dimension. For ease of interpretation and
comparative purposes, scores were scaled to a 1–5 point scale for each dimension to match the
score range of the rate and rank response formats. The internal consistency reliability estimates
obtained for the most/least response format were .74, .78, and .78 for agreeableness,
conscientiousness, and emotional stability, respectively.
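The item scoring and the rescaling from the −4-to-4 metric onto the 1-to-5 metric can be sketched as follows; the rescaling is assumed to be linear, which maps −4 → 1, 0 → 3, and 4 → 5.

```python
def most_least_item_score(most_level, least_level):
    """Item score = numerical value of the 'most' response level minus
    that of the 'least' response level; ranges from -4 to 4."""
    return most_level - least_level

def rescale_to_1_5(dimension_mean):
    """Assumed linear mapping of the -4..4 dimension mean onto the 1-5
    metric used by the rate and rank response formats."""
    return 1 + (dimension_mean + 4) / 2
```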
Response distortion. Response distortion was operationalized as scores on the 20-item
impression management scale of the Balanced Inventory of Desirable Responding, Version 6,
Form 40A (BIDR; Paulhus, 1991). Participants rated each item on a 7-point Likert scale (1 = not
true, 7 = very true), and after reverse scoring the specified items, the test taker’s score was
computed as the number of items rated as a 6 or 7 (Paulhus, 1991). Consequently, scores could
range from 0 to 20. Reported internal consistency reliability estimates range from .77 to .86
(Konstabel, Aavik, & Allik, 2006; Stober & Dette, 2002). An internal consistency reliability
of .72 was obtained for the present study.
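The dichotomous BIDR scoring described above reduces to a simple count; a sketch follows, in which the positions of the reverse-keyed items are supplied by the caller rather than reproduced here.

```python
def bidr_im_score(responses, reverse_keyed):
    """Paulhus (1991) dichotomous scoring: reverse-score the specified
    items on the 7-point scale (r -> 8 - r), then count the items rated
    6 or 7. For the 20-item IM scale, scores range from 0 to 20."""
    adjusted = [8 - r if rev else r for r, rev in zip(responses, reverse_keyed)]
    return sum(1 for r in adjusted if r >= 6)
```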
Test taker reactions. Three items, rated on a 5-point Likert scale (1 = strongly disagree, 5
= strongly agree), were created to measure the perceived difficulty of the SJT response format
that test takers completed at Time 2. The test taker’s score was the average of the three items,
and an internal consistency reliability estimate of .84 was obtained for the perceived difficulty
ratings.
A single item was developed to assess test takers’ preferences for each of the three SJT
response formats. Specifically, participants were presented with a single SJT sample item
depicting all three response formats. They were then instructed to rate (1–5) the extent to which
they would prefer to use each response format if they had to complete the SJT again. Thus,
unlike the difficulty ratings where participants rated only the SJT they completed at Time 2,
participants provided preference ratings for all three response formats.
Design and procedure. Study 2 used a two-part mixed factorial design, which is
illustrated in Table S1. Thus, participants were assessed at two time points (each lasting one and
a half hours). At Time 1, participants were randomly assigned to complete one of the three SJT
response formats and one of the three GMA/FFM response formats (i.e., the GMA and FFM
measure completed by the participant had the same response format). Therefore, the Time 1
design was a 3 (SJT: rate vs. rank vs. most/least) × 3 (GMA/FFM: rate vs. rank vs. most/least)
between-subjects design. After a 5–9 day interval (M = 6.99, SD = 0.74; 395 [80.3%] of the
sample had a 7-day retest interval), participants returned for the Time 2 session where they were
again randomly assigned to complete one of the three SJT response formats. Consequently, at
Time 2 approximately a third of the participants retested on the same version of the SJT that they
had completed at Time 1, while the remaining two thirds completed a different version. All
participants completed the standard APM, the BIDR, and the test taker reactions measure.
For both study sessions, participants were tested in groups of approximately 100. Each
participant received a test booklet that contained the measures for their condition for the
specified session. So, for the Time 1 session, the participants’ test booklets corresponded to one
of the nine conditions illustrated in Table S1. Because the APM was timed, it was always
administered first, and so all participants started the APM at the same time, and no one exceeded
the allotted time limit. A 30-min limit was used for the rate-, rank-, and most/least-APM at Time
1, and the standard 12-min limit was used for the standard APM at Time 2. After the
administration of the APM, the participants then completed the measures in their test booklets at
their own pace. A digital clock was displayed on an overhead projector screen, and participants
were instructed to record their start and end times in the specified section in their test booklets
for the SJT and personality measure. Concerning the presentation order, at Time 1, participants
completed the APM, then the SJT and FFM measure as listed. At Time 2, they again first
completed the APM, the BIDR next, and then the test taker reactions measure.
Results
Table S2 presents the descriptive statistics for the study variables that used the rate, rank,
and most/least response formats, and Table S3 presents the same statistics for those variables that
used their standard response formats. The results for the competitive tests for the differences-in-
g-loading versus shared-common-response-method explanations are presented in Table S4. For
these competitive tests, if the shared-common-response-method explanation best accounts for the
observed response format effects, then the highest positive correlations should be obtained for
the matched response formats (e.g., rank-SJT/rank-GMA, rate-SJT/rate-GMA, and most/least-
SJT/most/least-GMA) compared to the other (mismatched) conditions. Specifically, the matched
response formats (which are represented by the underlined correlations [the diagonal cells] in
Table S4) should be positive and largest compared to the off-diagonal correlations which ideally,
should all be zero.
The results reported in Table S4 did not indicate a pattern that is supportive of the shared-
common-response-method explanation. First, for the standard (multiple-choice) GMA, the
pattern of results replicated those for Study 1—the rank-SJT displayed the strongest relationship
with GMA scores, followed by the most/least-SJT, and then the rate-SJT. Second, in general, the
rank-SJT displayed the strongest relationships with GMA scores regardless of the GMA response
format; and much weaker relationships were obtained for the rate- and most/least-SJTs and
GMA scores, again, regardless of the response format.
A similar pattern of results was obtained in reference to the personality relationships as
well. Specifically, the results were not indicative of a pattern where the relationships for the
matched response formats (i.e., the underlined correlations in Table S4) were positive and largest
compared to the other (mismatched) conditions. However, it is also worth noting that the FFM
results for Study 2 did not unambiguously replicate those for Study 1 in that they were not all
statistically significant and their magnitudes were smaller than that obtained in Study 1.
Nevertheless, in terms of their patterns, the rate-SJT generally displayed higher relationships
with the standard FFM scores (average of Time 1 and Time 2 correlations = .30, .09, and .11 for
agreeableness, conscientiousness, and emotional stability, respectively) compared to the rank-
SJT (corresponding average of Time 1 and Time 2 correlations are .00, .19, and .13) and the
most/least-SJT (corresponding average of Time 1 and Time 2 correlations are .06, .02, and .10).
Thus, in summary, in terms of both the GMA and personality traits, the results were more
aligned with the differences-in-g-loading, and not the shared-common-response-method
explanation; however, they did not entirely eliminate the latter.
The completion time results for Study 1 were also replicated in Study 2. As the results in
Table S3 indicate, the rank-SJT had a completion time that was longer than the most/least-SJT,
which was in turn longer than the rate-SJT. It is also noteworthy that the completion times for the
FFM measure (i.e., rate mean = 3.73 min, SD = 1.12; rank mean = 22.23 min, SD = 9.86; and
most/least mean = 6.67 min, SD = 1.96) paralleled those for the SJT measure. This convergence of
results across these two measures suggests that at least in the context of noncognitive measures,
the differential cognitive load engendered by the different response formats may be construct
invariant; ranking is a more cognitively demanding task and accordingly takes longer to
complete.
Table S5 presents the sex-based subgroup differences for each SJT response format.
Consistent with the results for Study 1, these results indicate that women generally obtained
higher scores than men, with these differences being statistically significant with the rank-SJT at
both Time 1 and Time 2. The comparatively small number of non-White participants and the
restricted age of the sample did not permit a replicative investigation of race- and age-based
subgroup differences, respectively.
The correlations between the SJT and response distortion scores, which are presented in
Table S4, indicated a positive relationship such that individuals who were likely responding in a
socially desirable manner also had higher SJT scores. The largest correlation was obtained for
the rate-SJT (average of Time 1 and Time 2 correlation = .29), which was larger than that
obtained for the most/least-SJT (average of Time 1 and Time 2 correlation = .15). However, the
difference was not statistically significant (zr = 1.33, p > .05). And contrary to expectation, the
rank-SJT correlation was also larger (average of Time 1 and Time 2 correlation = .23) than that
for the most/least-SJT, but they were not significantly different (zr = 0.75, p > .05). Hence, it
would seem that in terms of the magnitude of the correlations, the rate-SJT was the most
susceptible to response distortion, and the most/least-SJT was the least susceptible.
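The reported z tests for differences between correlations (e.g., zr = 1.33) are consistent with Fisher's r-to-z comparison for correlations from independent samples. A minimal sketch of that test follows; the sample sizes passed in are placeholders, so this is not an attempt to reproduce the reported values.

```python
import math

def fisher_z(r):
    """Fisher r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def z_diff(r1, n1, r2, n2):
    """z statistic for the difference between two correlations obtained
    from independent samples of size n1 and n2."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se
```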
Concerning test taker reactions, the results presented in Table S3 indicate that the rank-
SJT was rated as the most difficult response format, with the rate- and most/least being of equal
difficulty. In addition, in terms of preferences, the most/least-SJT was the most preferred,
followed by the rate-SJT, and then the rank-SJT, which was the least preferred.
Table S6 presents the reliability estimates. As in Study 1, the SJT scores displayed high
levels of internal consistency with the rate-SJT having the highest reliability estimate (Time 1
= .94, Time 2 = .95), followed by the rank-SJT (Time 1 = .76, Time 2 = .81), and then the
most/least-SJT (Time 1 = .69, Time 2 = .76). The test–retest reliabilities also indicate that the
rate-SJT had higher levels of reliability (.69) than the most/least-SJT (.64), which was in turn
higher than the rank-SJT (.59). And finally, the rank-SJT and most/least-SJT displayed the
highest alternate-form reliabilities (.67 and .70), followed by the rate-SJT and rank-SJT (.47
and .35), and then the rate-SJT and most/least-SJT (.29 and .40).
Discussion
We acknowledge that the rate, rank, and most/least response formats for the GMA test,
and the rank and most/least for the personality measure are atypical. Nevertheless, they were
necessary to permit the comparative test of the differences-in-g-loading versus shared-common-
response-method explanations. That being said, the results of Study 2 are in accord with those of
Study 1 in terms of (a) the SJT completion times; (b) the sex-based subgroup differences; (c) the
internal consistency reliability estimates; and more importantly (d) the postulated relationships
with GMA and the specified FFM personality traits, a set of findings that were generally
replicated irrespective of the response format of the personality or GMA measure. In spite of the
noticeable differences in design, measures, and sample type (a summary of which is presented in
Table S7), the convergence between the results of the two studies that were undertaken to
investigate these issues was very high, lending further support to the robustness of the findings.
However, it is acknowledged that the FFM results for Study 2 did not unambiguously replicate
those for Study 1; nevertheless, the general pattern of results was the same.
Study 2 also permitted the investigation of additional issues that were not undertaken in
Study 1. These additional results indicated the rate-SJT was the most susceptible to response
distortion and the most/least-SJT the least. The rate- and rank-SJT correlations with response
distortion were quite similar in magnitude. Concerning test taker reactions, the rank-SJT
engendered the least favorable reactions, and the rate response format displayed comparatively
more favorable test taker reactions compared to the rank. This is probably because the rate
response format allows for ties between response options, such that test takers are not forced to
differentiate between similar response options (as may be the case with the rank, and to a lesser
extent, the most/least response formats); all response options can be assigned the same
effectiveness rating within a given item making it an easier cognitive task. In contrast, the rank
format forces test takers to make distinctions between response options that may be quite similar
(from their perspective).
Finally, the rate-SJT scores demonstrated the highest levels of internal consistency and
test–retest reliability, and again, the rank-SJT the least. However, the rank- and most/least-SJT
demonstrated the highest alternate-form reliability, and the rate-SJT displayed the lowest
correlation with the other two response formats. This lends additional credence to the
differences-in-g-loading explanation. Specifically, because the rank and most/least response
formats have similar levels of cognitive load, it would be expected that they would display the
highest inter-form correlation, as was obtained here.
References
Arthur, W., Jr., & Day, D. V. (1994). Development of a short form for the Raven Advanced
Progressive Matrices Test. Educational and Psychological Measurement, 54, 394–403.
Arthur, W., Jr., Tubre, T. C., Paul, D. S., & Sanchez-Ku, M. L. (1999). College-sample
psychometric and normative data on a short form of the Raven Advanced Progressive
Matrices Test. Journal of Psychoeducational Assessment, 17, 354–361.
Bradshaw, J. (1990). Test-takers’ reactions to a placement test. Language Testing, 7, 13–30.
Edwards, B. D., & Woehr, D. J. (2007). An examination and evaluation of frequency-based
personality measurement. Personality and Individual Differences, 43, 803–814.
Goldberg, L. R. (1992). The development of markers for the Big-Five factor structure.
Psychological Assessment, 4, 26–42.
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the
lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, &
F. Ostendorf (Eds.), Personality psychology in Europe (Vol. 7, pp. 7–28). Tilburg, the
Netherlands: Tilburg University Press.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., &
Gough, H. G. (2006). The international personality item pool and the future of public-
domain personality measures. Journal of Research in Personality, 40, 84–96.
Konstabel, K., Aavik, T., & Allik, J. (2006). Social desirability and consensual validity of
personality traits. European Journal of Personality, 20, 549–566.
Paulhus, D. L. (1991). Balanced Inventory of Desirable Responding (BIDR) reference manual
for Version 6. (Manual available from author at the Department of Psychology,
University of British Columbia, Vancouver, British Columbia V6T 1Y7, Canada.)
Paulhus, D. L. (2002). Socially desirable responding: The evolution of the construct. In H.
Braun, D. Jackson, & D. Wiley (Eds.), The role of constructs in psychological and
educational measurement (pp. 49–69). Mahwah, NJ: Erlbaum.
Podsakoff, P. M., MacKenzie, S. B., & Podsakoff, N. P. (2012). Sources of method bias in social
science research and recommendations on how to control it. Annual Review of
Psychology, 63, 539–569.
Shyamsunder, A., & McCune, E. A. (2009). Test-taker reactions to item formats used in online
selection assessments. Paper presented at the 24th Annual Conference of the Society for
Industrial and Organizational Psychology, Atlanta, GA.
Stober, J., & Dette, D. E. (2002). Comparing continuous and dichotomous scoring of the
Balanced Inventory of Desirable Responding. Journal of Personality Assessment, 78,
370–389.
Table S1
Study 2 Research Design and Participant Assignments
                                      SJT response format
                                   Rate      Rank      Most/least
TIME 1   GMA1/FFM response format
           Rate                     57        54          53
           Rank                     55        53          57
           Most/least               54        54          55
TIME 2   GMA2, RD, test taker      149       162         181
         reactions
Note. The numbers in the cells represent the number of participants in each condition. GMA1 = the short form of the Raven’s Advanced Progressive Matrices using either the rate, rank, or most/least response format; GMA2 = the short form of the Raven’s Advanced Progressive Matrices using the standard response format; FFM = personality measure (i.e., International Personality Item Pool) using either the rate, rank, or most/least response format; RD = response distortion measure (i.e., Balanced Inventory of Desirable Responding).
Table S2
Study 2 Descriptive Statistics for Variables That Used the Rate, Rank, and Most/Least Response Formats
                                        Response format
Variable                       Rate            Rank          Most/least
                             M      SD       M      SD       M      SD
SJT 1                      59.32  16.35    53.47  11.72    83.05   6.64
SJT 2                      56.86  18.25    53.38  13.61    83.92   7.40
GMA                        39.51   7.27    25.24  11.33    76.42   7.65
Agreeableness               4.11   0.53     3.62   0.46     4.14   0.57
Conscientiousness           3.69   0.63     3.52   0.43     3.77   0.66
Emotional stability         3.21   0.70     3.11   0.55     3.18   0.74
Note. SJT scores range from 0 to 100, GMA scores from 0 to 100, and FFM scores from 1 to 5. Because the absolute magnitudes of the means and standard deviations (but not the correlations) depend on the specific scoring algorithm used to score the tests, statistical tests for differences between the means of the different response formats are not presented.
Table S3
Study 2 Descriptive Statistics for Variables That Used Their Standard Response Formats
Variable M SD
GMA 68.31 19.52
Response distortion 6.85 3.48
SJT completion times (Time 1)
Rate 12.28a 3.47
Rank 14.75b 3.63
Most/least 13.29c 3.22
SJT completion times (Time 2)
Rate 8.84a 2.45
Rank 10.42b 2.98
Most/least 8.98a 3.08
FFM completion times
Rate 3.73a 1.12
Rank 22.23b 9.86
Most/least 6.67c 1.96
Perceived difficulty of SJT response format
Rate 2.08a 0.90
Rank 2.33b 0.81
Most/least 2.10a 0.87
Preference for SJT response format
Rate 3.33a 1.48
Rank 2.56b 1.38
Most/least 3.77c 1.45
Note. GMA scores range from 0 to 100; response distortion from 0 to 20; and SJT perceived difficulty and preference from 1 to 5. Neither the difficulty nor preference ratings were related to the Time 2 SJT (response format) condition. Completion times are reported in minutes. Where there are multiple rows for a variable, means with different superscripts are significantly different from each other (p < .05, one-tailed).
Table S4
Study 2 Integrity-Based Situational Judgment Test Correlations With General Mental Ability and the Specified Five-Factor Model Personality Traits for All Response Formats
                                       SJT response format
                              Rate            Rank            Most/least
                          Time 1  Time 2  Time 1  Time 2  Time 1  Time 2
GMA response format
  Standard                  .01     .06     .29*    .36*    .11     .21*
  Rate                      .06     .14     .36*    .46*    .12     .11
  Rank                      .11     .08     .33*    .35*    .06     .00
  Most/least                .08     .16     .04     .10     .04     .15
FFM rate format
  Agreeableness             .30*    .30*    .10     .10     .10     .02
  Conscientiousness         .00     .17     .25*    .12     .09     .13
  Emotional stability       .27*    .06     .15     .10     .12     .07
FFM rank format
  Agreeableness             .04     .14     .19     .25*    .17     .34*
  Conscientiousness         .19     .02     .11     .13     .01     .12
  Emotional stability       .02     .13     .26*    .15     .17     .10
FFM most/least format
  Agreeableness             .07     .27*    .08     .09     .12     .03
  Conscientiousness         .14     .39*    .17     .40*    .08     .09
  Emotional stability       .15     .05     .04     .29*    .07     .00
Response distortion         .30*    .27*    .17*    .28*    .16*    .13*
Note. Underlined correlations represent those that would be expected to be highest positive correlations relative to the others, if the results are indicative of a shared-common-response-method effect.
*p < .05 (one-tailed).
Table S5
Study 2 Sex-Based Subgroup Differences for Integrity-Based Situational Judgment Tests for All Response Formats
                 Male                     Female
              N     M      SD         N     M      SD        d
Time 1
Rate          51  56.84  17.48       114  60.42  15.78    −0.22
Rank          56  49.36  12.80       107  55.63  10.56    −0.55*
Most/least    65  82.83   7.48        99  83.19   6.06    −0.05
Time 2
Rate          55  56.42  18.73        93  57.12  18.06    −0.04
Rank          57  49.86  14.02       105  55.30  13.06    −0.41*
Most/least    61  83.21   7.33       121  84.27   7.44    −0.14
Note. Females are compared to males such that a positive d indicates that males scored higher than females.
*p < .05 (one-tailed).
Table S6
Study 2 Test–Retest, Alternate-Form, and Internal Consistency Reliabilities for the Integrity-Based Situational Judgment Test Scores
                              Time 1
               Test–retest and alternate-form        Coefficient α
                 Rate       Rank       Most/least
Time 2
  Rate           .69        .47 A1     .29 B1            .95
  Rank           .35 A2     .59        .67 C1            .81
  Most/least     .40 B2     .70 C2     .64               .76
Coefficient α    .94        .76        .69
Note. Retest interval = 5–9 days, M = 6.99, SD = 0.74; 80.3% of the sample had a retest interval of 7 days. Test–retest reliabilities are on the diagonal. The length of the retest interval was not related to the retest condition (i.e., the Time 2 SJT response format condition). Superscripts indicate alternate-form reliability pairs such that, for instance, A1 denotes the rank/rate and A2 the rate/rank alternate-form reliabilities.
Table S7
Differences in Study 1 and Study 2 Design and Methodological Protocol
Study 1 Study 2
Quasi-experimental design Experimental design
Between-subjects design Between- and within-subjects design
Operational field setting Lab setting
Job applicants College students
High-stakes testing Low-stakes testing
Large sample (n = 31,194) Relatively small sample (n = 492)
Internet-based protocol Paper-and-pencil assessment
Unproctored administration of measures Proctored administration of measures
Objectively recorded completion times Self-recorded completion times
Proprietary GMA and FFM measures Standardized GMA and FFM measures