PEOPLE PERFORMANCE
Stability of OPQ32 personality constructs across languages, cultures and countries
Dave Bartram, Research Director, SHL Group Ltd
MSU, October 2009
Copyright (c) SHL Group Ltd, 2009
Overview
• The challenge – comparing people from diverse groups
• Using norms – what they are and how they’re used
• The OPQ32 – measurement properties
• Review of studies looking at:
> Between country effects (19 countries)
> First-language and ethnicity effects within S Africa
• Conclusions
The Challenge
• Your client, an organization with staff across the world, wishes to evaluate talent using personality tools, e.g.:
> Selection
» Organization draws applicants from one set of countries (e.g. France, Germany, UK, Sweden and Australia) for expatriate assignments in some other set of countries (e.g. Brazil and China).
> Development
» Need to assess developmental needs globally.
> Succession management
» Need to audit top talent across the world and the portfolio of capabilities that yields.
All involve making comparisons between people from different countries and cultures using different languages
Such cross-group comparisons are challenging…
• Differences can arise from real cultural or other group-related differences, or from:
• Correctable biases:
> Translation issues – specific items do not ‘work’ in one language: refine the instrument.
> Sample bias (demographic mix) – the balance of demographics within samples may differ between countries: match or re-weight samples.
> Cultural or language-based response bias: use forced-choice item format or bias corrections.
• Non-correctable bias:
> Construct non-equivalence – the meaning of constructs changes between languages/cultures:
» Check invariance of scale relationships.
» Avoid culturally specific constructs.
NORMS 101
Why do we use norms?
• Norms provide a method for transforming ‘arbitrary’ raw scores into ‘standard’ scores:
> Standard scores have properties independent of the raw score scale they are based on.
> E.g.: Stens (Mean=5.5, SD=2), Stanines (Mean=5, SD=2), T-Scores (Mean=50, SD=10) and various percentile-based measures (e.g. Grades)
• Using different norms means the same raw scores get different standard scores or different raw scores may get the same standard scores.
> NB Group norming destroys the underlying raw score rank ordering of people
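The effect of the choice of norm can be sketched in a few lines. This is a minimal illustration, assuming a simple linear sten conversion (mean 5.5, SD 2, clipped to 1..10); the two norm groups and their means and SDs are hypothetical values, not taken from any OPQ32 norm table.

```python
def to_sten(raw, norm_mean, norm_sd):
    """Convert a raw score to a sten (mean 5.5, SD 2), clipped to the 1..10 range."""
    z = (raw - norm_mean) / norm_sd      # standardise against the chosen norm group
    sten = round(5.5 + 2.0 * z)          # rescale to the sten metric
    return max(1, min(10, sten))

# Hypothetical norm groups: (mean, SD) of raw scores on one scale.
NORM_A = (25.0, 4.0)
NORM_B = (31.0, 4.0)

# The same raw score receives different stens under different norms.
same_raw_norm_a = to_sten(30, *NORM_A)
same_raw_norm_b = to_sten(30, *NORM_B)

# Norming each person against their own group can reverse the raw-score order:
# the person with raw score 28 (group A) ends up above the person with 30 (group B).
lower_raw_group_a = to_sten(28, *NORM_A)
higher_raw_group_b = to_sten(30, *NORM_B)
```

With these illustrative numbers a raw score of 30 maps to sten 8 against norm A and sten 5 against norm B, which is the rank-order point made in the NB above.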
A norm group reflects a particular profile of four types of variable:
• Endogenous (biological characteristics such as gender, age, race)
• Exogenous (environmental characteristics such as educational level and type, job level and type, organization, industrial sector, labour market, language)
• Examination (paper and pencil vs computer) setting and ‘stakes’ (e.g. pre-screening, selection, development, research)
• Temporal (e.g. generation effects; when data were collected etc.)
Measurement properties of OPQ32
Introduction to the OPQ Model
• OPQ32 is an occupational model of personality that describes 32 dimensions of people’s preferred or typical styles of behaviour at work.
• Subsets of the 32 scales can be aggregated to provide measures of the Big 5 and of the Great 8 competency factors.
Forced-choice format [OPQ32i]
• Pros:
> Comparative judgement – overcomes problems of rating scales
> Shown to reduce response biases and faking
> Not subject to cultural and translation biases associated with the scale point definitions and usage.
• Cons:
> More cognitively demanding for the test taker (easy with pairs – more difficult with quads)
> CTT scoring yields ipsative data with peculiar psychometric properties
Percent raw score point change in score on Scale i when score changes one raw score point on Scale j
[Figure: percent (0 to 100) plotted against number of scales (2 to 40). Annotations: OPQ32 = 3.2%; < 5%.]
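The curve can be approximated with a one-line calculation. This sketch assumes that, because ipsative totals are fixed, a one-point gain on one scale is offset evenly across the remaining scales, giving 100/(k − 1) percent per scale; that formula is an assumption, although it does reproduce the 3.2% annotated for 32 scales.

```python
def percent_offset(n_scales):
    """Average percent of a raw score point lost on each other scale
    when one scale gains a point and the overall total is fixed (assumed model)."""
    return 100.0 / (n_scales - 1)

opq32_offset = percent_offset(32)

# Smallest number of scales for which the offset drops below 5%.
first_below_5 = next(k for k in range(2, 41) if percent_offset(k) < 5.0)
```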
OPQ32i OPQ32r
• While traditional scoring methods produce ipsative scales, we can use a different scoring approach to recover normative latent trait scores from OPQ32i using a multidimensional IRT model (Brown & Bartram, 2007; SHL, 2009).
• OPQ32r, launched in Sept 2009, uses forced-choice triplets and IRT scoring to produce normative scale scores.
• This presentation reports the results of analyses of OPQ32i normative IRT scale scores, where item data was available, along with the analyses on the ipsative scale data.
FC scoring as a set of pairwise choices
• For OPQ32i with CTT scoring, the total score across all scales is a fixed number (i.e. 4 for each quad, with a fixed instrument total of 416).
• But: if we consider a complete ranking of 4 statements, A, B, C and D, as 6 pairs, then we can score as follows:
> A>B = 1, A<B = 0
> A>C = 1, A<C = 0, etc.
• So the total score for a quad can vary between 0 and 6.
• The total score can now be between 0 and 624.
• It is no longer constrained…
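The recoding of a ranked quad into six binary outcomes can be sketched as follows. The function name and the example ranking are hypothetical; the only figures taken from the slide are the 416 instrument total (hence 104 quads) and the 0 to 624 range.

```python
from itertools import combinations

def code_ranking_as_pairs(ranking):
    """Recode a complete ranking (most preferred first) as binary pair outcomes.

    For each pair (X, Y) in alphabetical order, the outcome is 1 if X is
    ranked above Y and 0 otherwise.
    """
    position = {label: rank for rank, label in enumerate(ranking)}
    return {
        (first, second): 1 if position[first] < position[second] else 0
        for first, second in combinations(sorted(ranking), 2)
    }

# Hypothetical response: C ranked first, then A, then D, then B.
outcomes = code_ranking_as_pairs(["C", "A", "D", "B"])
quad_total = sum(outcomes.values())

# 416 points at 4 per quad implies 104 quads, so the pairwise total spans 0..624.
n_quads = 416 // 4
max_total = n_quads * 6
```

Each quad now contributes anywhere from 0 (ranking D, C, B, A) to 6 (ranking A, B, C, D), so the instrument total is no longer a constant.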
Likelihood of preferring item A to item B
           trait B
trait A    -3  -2.5    -2  -1.5    -1  -0.5     0   0.5     1   1.5     2   2.5     3
     -3           B     B     B     B     B     B     B     B     B     B     B     B
   -2.5     A           B     B     B     B     B     B     B     B     B     B     B
     -2     A     A           B     B     B     B     B     B     B     B     B     B
   -1.5     A     A     A           B     B     B     B     B     B     B     B     B
     -1     A     A     A     A           B     B     B     B     B     B     B     B
   -0.5     A     A     A     A     A           B     B     B     B     B     B     B
      0     A     A     A     A     A     A           B     B     B     B     B     B
    0.5     A     A     A     A     A     A     A           B     B     B     B     B
      1     A     A     A     A     A     A     A     A           B     B     B     B
    1.5     A     A     A     A     A     A     A     A     A           B     B     B
      2     A     A     A     A     A     A     A     A     A     A           B     B
    2.5     A     A     A     A     A     A     A     A     A     A     A           B
      3     A     A     A     A     A     A     A     A     A     A     A     A
Utilities of items are caused by underlying personality traits.
The higher a person’s standing on the trait, the higher the utility of the typical item (assuming that items are strong positive statements).
Two-dimensional IRT model for paired comparisons
[Figure: probability surface (0.0 to 1.0) plotted against theta q and theta r, each ranging from -3 to 3.]
$$P(y_{ij} = 1 \mid \theta_q, \theta_r) = \Phi(a_{ij} + b_i \theta_q - b_j \theta_r)$$
In this equation:
• bi and bj are parameters describing the strength of the relationship between underlying factors θq and θr and the paired comparison
• aij is the threshold for the paired comparison.
The latent traits are assumed to be normally distributed with unit variances and freely correlated.
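A paired-comparison response function of this kind can be evaluated numerically. The sketch below assumes a normal-ogive link and the sign convention a_ij + b_i·θq − b_j·θr; both are assumptions made for illustration, as are the parameter values, and none of this is SHL's production scoring code.

```python
import math

def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_prefer_i(theta_q, theta_r, a_ij, b_i, b_j):
    """Probability of preferring item i (trait q) to item j (trait r),
    under the assumed normal-ogive form."""
    return std_normal_cdf(a_ij + b_i * theta_q - b_j * theta_r)

# Hypothetical parameters: zero threshold, equal unit slopes.
p_equal_traits = prob_prefer_i(0.0, 0.0, a_ij=0.0, b_i=1.0, b_j=1.0)
p_q_higher = prob_prefer_i(1.0, -1.0, a_ij=0.0, b_i=1.0, b_j=1.0)
```

With equal slopes and a zero threshold, equal trait standings give an even chance, and the probability of choosing item i rises as θq exceeds θr.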
How IRT scoring works: summary
• Responses to all blocks are coded as paired comparisons
• Optimisation algorithm finds a combination of scores on the 32 OPQ traits at which the observed response pattern is most likely:
> Starting values are given to the 32 scores
> Joint likelihood of participant’s response to all paired comparisons is evaluated at starting values
> Each iterative step finds a combination of the 32 scores where the likelihood is improved
> The algorithm stops when no better combination of scores can be found: the final combination of 32 scores maximises the likelihood of the response given to all pairs
• The algorithm works not scale-by-scale, but on all scales and pairs simultaneously
• Scores estimated in this way are no longer ipsative; they have normative properties
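The iterative idea can be sketched on a toy problem with three traits instead of 32. Everything here is an illustrative assumption: the normal-ogive response function, the simple coordinate search (the slides do not name the optimiser actually used), the bounds of ±3, the item parameters and the response data.

```python
import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_likelihood(theta, pairs):
    """Joint log-likelihood of all paired comparisons at trait scores theta."""
    total = 0.0
    for q, r, a_ij, b_i, b_j, y in pairs:
        p = std_normal_cdf(a_ij + b_i * theta[q] - b_j * theta[r])
        p = min(max(p, 1e-12), 1.0 - 1e-12)   # guard against log(0)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def estimate_scores(pairs, n_traits, step=0.5, min_step=1e-3, bound=3.0):
    """Coordinate search: accept any single-score move that improves the joint
    likelihood; halve the step when no move helps; stop at min_step."""
    theta = [0.0] * n_traits                   # starting values
    best = log_likelihood(theta, pairs)
    while step >= min_step:
        improved = False
        for k in range(n_traits):
            for delta in (step, -step):
                trial = list(theta)
                trial[k] = min(max(trial[k] + delta, -bound), bound)
                ll = log_likelihood(trial, pairs)
                if ll > best + 1e-12:
                    theta, best, improved = trial, ll, True
        if not improved:
            step /= 2.0
    return theta, best

# Hypothetical responses: (trait of first item, trait of second item, 1 if first preferred).
responses = [(0, 1, 1), (0, 1, 1), (0, 1, 0),
             (0, 2, 1), (0, 2, 1), (0, 2, 1), (0, 2, 0),
             (1, 2, 1), (1, 2, 1), (1, 2, 0)]
pairs = [(q, r, 0.0, 1.0, 1.0, y) for q, r, y in responses]

theta, best = estimate_scores(pairs, n_traits=3)
```

Note that the likelihood is evaluated over all pairs at once, so a change in one trait score is judged by its effect on every comparison that involves that trait, not scale by scale.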
CONSTRUCT EQUIVALENCE OF OPQ32:
Between countries
Assessing construct equivalence
• OPQ32 is not a factor model, so we cannot compare the fit of a factor model to the 32 scales across samples.
• A strong test of construct equivalence comes from requiring invariance in scale variances and correlation matrices: the hypothesis being tested is that both matrices are samples from the same population (Bentler, 2005).
• Goodness of fit criteria:
> Comparative Fit Index (CFI; Bentler, 1990) should be greater than 0.95.
> Root mean square error of approximation (RMSEA) should be less than 0.08 for a reasonable fit and less than 0.05 for a good fit (Byrne, 2001).
Comparing countries
• 74,244 working adults from 19 countries
> 12 European countries, US, South Africa, Australia, China, Hong Kong, India, New Zealand.
> Country sample sizes ranged from 861 to 8,222 with an average of 3,713
• 14 language versions:
> 6 UK English countries; US English; 11 non-English samples of different European languages; 1 sample of ‘simplified’ Chinese.
• Item-level data were available for 11 of the European country samples, and normative IRT scores have been computed for those.
Construct equivalence: Summary
• Compared correlation matrices for each country against the UK English (UKE) matrix.
• Exceptionally good fit for all English and European languages
> Ipsative: median CFI = 0.982 (min 0.960), median RMSEA = 0.019 (max 0.028) – includes US, S Africa and Australia.
> IRT Normative: median CFI = 0.989 (min 0.982), median RMSEA = 0.024 (max 0.029) – Europe only.
• For the Chinese version (simplified Chinese) the test identified a slight misfit in the model
> CFI = 0.945, RMSEA = 0.033
> Constraints violated: correlations between
» Rule Following – Conventional: r = 0.67 (0.45 in the English version)
» Forward Thinking – Achieving: r = 0.36 (0.17 in the English version)
Normative and Ipsative scale results: Europe
                          IRT Normative [k=32]   Ipsative [k=31]
Language            N     CFI     RMSEA          CFI     RMSEA
English         3,978
Danish          8,274     .991    .022           .990    .014
Dutch           5,499     .988    .024           .989    .015
Belgium Dutch   2,109     .989    .024           .983    .019
Finnish         1,943     .985    .028           .979    .022
French          3,806     .988    .025           .981    .020
German          4,733     .982    .029           .981    .019
Italian         1,758     .988    .024           .968    .025
Norwegian       7,622     .989    .024           .989    .015
Portuguese      1,026     .990    .023           .960    .028
Swedish         9,044     .991    .021           .990    .014
Manager vs. Non-manager differences across countries
Very little between country variation
CONSTRUCT EQUIVALENCE OF OPQ32:
Within South Africa
South African data
• The study was carried out on 32,020 people to assess construct invariance as well as scale mean differences between ethnic and first-language groups on the OPQ32i.
• OPQ32i was administered in English. All candidates were proficient in English to at least Grade 12.
> The OPQ32i was scored both conventionally (as ipsative scale scores) and using the multidimensional IRT model to recover latent normative scores
• Comparison of the covariance structures of the samples was carried out using Structural Equation Modeling with EQS on both the ipsative [k=31] and the normative IRT latent trait scale scores [k=32].
> Both produce similar results (only normative score results are presented here).
South African data (2009)
• 32,020 candidates in various industry sectors
> 52.10% females and 47.90% males
> Mean age of 30.67 years (SD=8.23)
> 47.60% African, 13.50% Coloured, 9.60% Indian and 29.10% White
> 37.39% Grade 12, 16.31% Certificates, 30.99% degrees and 15.31% post graduate degrees
• First-language known for 25,094 of the candidates
> 25.90% Afrikaans, 27.10% English, 2.10% Venda, 2.20% Tsonga, 21.90% Nguni (Zulu, Xhosa, Swati & Ndebele) and 20.80% Sotho (North Sotho, South Sotho, Tswana)
Group comparisons
• First, compared correlation patterns for major ethnic groups.
> Different ethnic groups would generally have different languages as their first (native) language:
» English and Afrikaans for the White and Coloured groups,
» English for the Indian group,
» Native African languages for the African group.
• Second, groups were formed by first (native) language. Each was compared to the group whose first language was English (N=6,793).
Comparisons by ethnic group
Ethnic group        N    CFI     RMSEA
White           9,318
African        15,255    .972    .035
Coloured        4,308    .992    .020
Indian          3,083    .997    .012
Comparisons by first-language
Language            N    CFI     RMSEA
English         6,793
Afrikaans       6,494    .998    .011
Nguni           5,488    .978    .031
Sotho           5,232    .976    .032
Tsonga            555    .991    .021
Venda             532    .990    .022
Comparison of African language groups
• The two large African language groups (Nguni and Sotho) were tested for similarity of their correlation matrices.
> The fit indices in this case are remarkably good.
> Clearly, there are more similarities between personality constructs for the two native African language groups than between English and African language groups.
• The only two groups that show a similar level of fit are the English and the Afrikaans groups.
Language            N    CFI     RMSEA
Nguni           5,488
Sotho           5,232    .999    .006
South African data
• These analyses suggest that the relationships between scales can be considered comparable for the different ethnic and first-language groups in South Africa.
• This does not imply that average scale scores are invariant across groups.
Comparisons between 2006 & 2009 data
• 2006: 20,132 candidates in various industry sectors
> 56.75% females and 43.25% males
> Mean age of 30.67 years (SD=8.22)
> 45.22% African, 15.78% Coloured, 10.37% Indian and 28.64% White
> 41.60% Grade 12, 16.26% Certificates, 28.51% degrees and 13.64% post graduate degrees
• First-language known for 13,322 of the candidates
> 28.79% Afrikaans, 26.23% English, 1.54% Venda, 1.88% Tsonga, 22.26% Nguni (Zulu, Xhosa, Swati & Ndebele) and 19.30% Sotho (North Sotho, South Sotho, Tswana)
Ethnic differences: 2006 vs 2007-2009
African vs. white
[Figure: effect size (d), on an axis from -0.60 to 0.60, for each of the 32 OPQ32 scales: Persuasive, Controlling, Outspoken, Independent Minded, Outgoing, Affiliative, Socially Confident, Modest, Democratic, Caring, Data Rational, Evaluative, Behavioural, Conventional, Conceptual, Innovative, Variety Seeking, Adaptable, Forward Thinking, Detail Conscious, Conscientious, Rule Following, Relaxed, Worrying, Tough Minded, Optimistic, Trusting, Emotionally Controlled, Vigorous, Competitive, Achieving, Decisive. Series: 2006 (N=14868) and 2009 (N=24573).]
Small to moderate differences between ethnic groups
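The slides do not state how the effect sizes (d) were computed. A common choice is Cohen's d with a pooled standard deviation, sketched below under that assumption; the group means, SDs and sizes are hypothetical, and the small/moderate/large labels use Cohen's conventional 0.2/0.5/0.8 benchmarks rather than anything defined in this presentation.

```python
import math

def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Standardised mean difference using the pooled standard deviation."""
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / math.sqrt(pooled_var)

def describe_effect(d):
    """Label |d| using Cohen's conventional benchmarks (an assumption here)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "moderate"
    return "large"

# Hypothetical sten-scale summary statistics for two groups on one scale.
d_example = cohens_d(mean_1=5.9, sd_1=2.0, n_1=100, mean_2=5.5, sd_2=2.0, n_2=100)
label_example = describe_effect(0.35)
```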
First-language: 2006 vs 2007-2009
Nguni vs. Afrikaans
[Figure: effect size (d), on an axis from -0.60 to 0.60, for each of the 32 OPQ32 scales, from Persuasive to Decisive. Series: 2006 (N=6407) and 2009 (N=11982).]
Small to moderate differences between languages
South African Data: Conclusions
• Effect sizes are very similar over time (2006 vs 2007-2009)
• Overall, effect sizes were either small or moderate when first-language and ethnic groups were compared.
• Analyses of fit show good evidence of construct equivalence across ethnicity and first-language.
• Level of fit is greater in some cases than others as one would expect.
CONCLUSIONS
General conclusions from the OPQ data sets:
• Forced-choice format version of OPQ32 is very robust in terms of construct equivalence across countries and, for S Africa, between first-language and ethnic groups within country.
• Some differences occur between and within countries in terms of average scale scores.
> These effects are relatively small when compared with other demographics (e.g. gender, managerial position).
> Effect sizes are generally not of substantive significance in terms of individual profile interpretation.
To return to the original question…
Where there is construct equivalence does it make sense to aggregate across samples for norming purposes in order to preserve the underlying raw score difference effects?
The key question in choosing a norm reference group:
• ‘Does the norm group consist of the sort of persons with whom [the candidate] should be compared?’ (Cronbach 1990, p. 127).
• Cronbach made clear that this does not even entail comparing people to others from their own demographic group.
IF…
• There is no evidence of construct bias
• Translation is ok – no substantive DIF
• There is control over general response bias differences relating to culture or language
• Norm demographics are equivalent or equate-able
THEN…
> We can aggregate across languages or countries
> We can compare individual profiles using various different norms
General guidance (Bartram 2008)
• Aggregation of norms across countries/cultures/languages should not be automated.
• Correlation matrices should be checked for measurement invariance.
• Norm samples should be checked for comparability of demographics – re-weighting adjustments made if necessary
> Where demographics are associated with score differences, the ‘mix’ within samples should be weighted to ensure comparability across samples
• Mix of countries or cultures should be ‘reasonable’, with more caution exercised for mixing more divergent cultures
> Use country similarity (cluster analyses) as a guide for ‘reasonableness’
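The re-weighting step can be illustrated with simple post-stratification weights. This is only a sketch of one way to do it, assuming a single demographic variable and a known target mix; the group labels, scores and target proportions are hypothetical.

```python
from collections import Counter

def reweighted_mean(sample, target_mix):
    """Weighted mean score after re-weighting the sample to a target demographic mix.

    sample: list of (group, score) pairs.
    target_mix: dict mapping group -> target proportion (summing to 1).
    """
    counts = Counter(group for group, _ in sample)
    n = len(sample)
    # Weight = target proportion / observed proportion for the person's group.
    weights = {g: target_mix[g] / (counts[g] / n) for g in counts}
    total_weight = sum(weights[g] for g, _ in sample)
    return sum(weights[g] * score for g, score in sample) / total_weight

# Hypothetical sample with one third managers; target mix is half and half.
sample = [("manager", 20), ("manager", 22),
          ("non-manager", 16), ("non-manager", 14),
          ("non-manager", 15), ("non-manager", 15)]

unweighted = sum(score for _, score in sample) / len(sample)
weighted = reweighted_mean(sample, {"manager": 0.5, "non-manager": 0.5})
```

Here the under-represented managers are weighted up, moving the sample mean from 17 to 18; two country samples re-weighted to the same target mix can then be compared on a like-for-like basis.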
General guidance (continued)
• In combining across countries or cultures, samples should be weighted appropriately for the final mix – e.g. equal weights rather than relative to populations or sample sizes.
• Adherence to these guidelines involves the exercise of expert judgement and needs to be dealt with on a case by case basis.
• Expert judgement is also needed when aggregation is not possible and people from different countries are being compared
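Equal weighting of country samples can be sketched with the standard mixture formulas for a combined mean and SD. This assumes each country norm is summarised by a raw-score mean and SD; the country names and values are hypothetical, and a real aggregation would also involve the invariance and demographic checks described above.

```python
import math

def combine_norms(country_norms, weights=None):
    """Combine per-country (mean, SD) pairs into one norm mean and SD.

    By default every country gets the same weight, regardless of sample size.
    """
    names = list(country_norms)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    mean = sum(weights[c] * country_norms[c][0] for c in names)
    # Mixture variance: weighted mean of (variance + squared mean) minus squared overall mean.
    second_moment = sum(
        weights[c] * (country_norms[c][1] ** 2 + country_norms[c][0] ** 2)
        for c in names
    )
    return mean, math.sqrt(second_moment - mean ** 2)

# Hypothetical country norms: (raw-score mean, raw-score SD).
norms = {"country_a": (10.0, 2.0), "country_b": (14.0, 2.0)}
combined_mean, combined_sd = combine_norms(norms)
```

The combined SD (about 2.83) is larger than either country's SD of 2 because the between-country mean difference is now part of the spread of the aggregated norm.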
Summary
• Cross language, cross culture and cross country analyses of OPQ32i suggest that the effects on scale scores are small compared with other general demographics effects.
• For OPQ32i, consistency of construct equivalence across samples supports aggregation of ‘norms’ for national and multi-national assessment use.
• In general, construction of national or multi-national norms must be done with care to avoid aggregated data where there is poor construct equivalence or effects of systematic bias.
Thank you
Reference:
> Bartram, D. (2008). Global norms: Towards some guidelines for aggregating personality norms across countries. International Journal of Testing, 8(4), 315-333.
What is ‘culture’ and when does it matter?
• Culture is a set of exogenous variables relating to
> Shared values
> Shared cognitions
> Shared knowledge
> Shared standards or cultural norms
> Shared language.
• In practical terms, ‘culture’ matters when it is related to a group of people for whom within-group variability in terms of relevant constructs is relatively small compared to variability between them and other groups.
Why do we use norms?
• Norms provide a basis for comparing the scores of an individual with those of some well-defined reference group.
> Hence norms are useful in the interpretation of test scores.
• Users sometimes confuse norms and validity:
> Scoring in the top ten percent of graduate applicants on trait X is only good to know if trait X is positively correlated with criterion behaviour.