
PEOPLE PERFORMANCE

Stability of OPQ32 personality constructs across languages, cultures and countries

Dave Bartram, Research Director, SHL Group Ltd

MSU, October 2009

Copyright (c) SHL Group Ltd, 2009

Overview

• The challenge – comparing people from diverse groups

• Using norms – what they are and how they’re used

• The OPQ32 – measurement properties

• Review of studies looking at:

> Between-country effects (19 countries)

> First-language and ethnicity effects within S Africa

• Conclusions


The Challenge

• Your client, an organization with staff across the world, wishes to evaluate talent using personality tools, e.g.:

> Selection

» Organization draws applicants from one set of countries (e.g. France, Germany, UK, Sweden and Australia) for expatriate assignments in some other set of countries (e.g. Brazil and China).

> Development

» Need to assess developmental needs globally.

> Succession management

» Need to audit top talent across the world and the portfolio of capabilities that yields.


All involve making comparisons between people from different countries and cultures using different languages


Such cross-group comparisons are challenging…

• Differences can arise from real cultural or other group-related differences, or from bias:

• Correctable biases:

> Translation issues – specific items do not ‘work’ in one language: refine the instrument.

> Sample bias (demographic mix) – the balance of demographics within samples may differ between countries: match or re-weight samples.

> Cultural or language-based response bias: use forced-choice item format or bias corrections.

• Non-correctable bias:

> Construct non-equivalence – the meaning of constructs changes between languages/cultures:

» Check invariance of scale relationships.

» Avoid culturally specific constructs.



NORMS 101


Why do we use norms?

• Norms provide a method for transforming ‘arbitrary’ raw scores into ‘standard’ scores:

> Standard scores have properties independent of the raw score scale they are based on.

> E.g.: Stens (Mean=5.5, SD=2), Stanines (Mean=5, SD=2), T-Scores (Mean=50, SD=10) and various percentile-based measures (e.g. Grades)

• Using different norms means the same raw scores get different standard scores, or different raw scores may get the same standard scores.

> NB Group norming destroys the underlying raw score rank ordering of people
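The transformations listed above can be illustrated with a minimal Python sketch. The two norm groups (`NORM_A`, `NORM_B`) and all score values are hypothetical, and rounding stens and stanines half-up to whole numbers is an assumption made for illustration rather than a documented SHL procedure.

```python
import math

def z_score(raw, norm_mean, norm_sd):
    """Standardise a raw score against a norm group's mean and SD."""
    return (raw - norm_mean) / norm_sd

def _round_half_up(x):
    # Avoids Python's round-half-to-even behaviour.
    return math.floor(x + 0.5)

def sten(z):
    """Sten scale: mean 5.5, SD 2, clipped to 1..10."""
    return min(10, max(1, _round_half_up(5.5 + 2.0 * z)))

def stanine(z):
    """Stanine scale: mean 5, SD 2, clipped to 1..9."""
    return min(9, max(1, _round_half_up(5.0 + 2.0 * z)))

def t_score(z):
    """T-score scale: mean 50, SD 10."""
    return 50.0 + 10.0 * z

# Hypothetical norm groups given as (mean, SD) of raw scores.
NORM_A = (25.0, 5.0)
NORM_B = (32.0, 5.0)

# The same raw score of 30 gets different standard scores under each norm.
z_a = z_score(30, *NORM_A)
z_b = z_score(30, *NORM_B)
print(sten(z_a), sten(z_b), t_score(z_a), t_score(z_b))

# Group norming can reverse raw score order: raw 31 normed on B
# ends up below raw 30 normed on A.
print(z_score(31, *NORM_B) < z_score(30, *NORM_A))
```

The last comparison shows the point made in the NB above: once two people are scored against different norm groups, the person with the higher raw score can receive the lower standard score.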



A norm group reflects a particular profile of four types of variable:

• Endogenous (biological characteristics such as gender, age, race)

• Exogenous (environmental characteristics such as educational level and type, job level and type, organization, industrial sector, labour market, language)

• Examination (paper and pencil vs computer) setting and ‘stakes’ (e.g. pre-screening, selection, development, research)

• Temporal (e.g. generation effects; when data were collected etc.)



Measurement properties of OPQ32


Introduction to the OPQ Model

• OPQ32 is an occupational model of personality that describes 32 dimensions of people’s preferred or typical styles of behaviour at work.

• Subsets of the 32 scales can be aggregated to provide measures of the Big 5 and of the Great 8 competency factors.

Forced-choice format [OPQ32i]

• Pros:

> Comparative judgement – overcomes problems of rating scales

> Shown to reduce response biases and faking

> Not subject to cultural and translation biases associated with the scale point definitions and usage.

• Cons:

> More cognitively demanding for the test taker (easy with pairs – more difficult with quads)

> CTT scoring yields ipsative data with peculiar psychometric properties


Percent raw score point change in score on Scale i when score changes one raw score point on Scale j

[Figure: percent change (axis 0 to 100) plotted against number of scales (2 to 40). Annotations: OPQ32 = 3.2%; < 5%.]
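The plotted relationship can be approximated with a short sketch. It assumes that, with a fixed ipsative total, a one-point gain on one scale is offset equally across the remaining k − 1 scales, giving 100 / (k − 1) percent per scale. That formula is an assumption, although it reproduces the 3.2% annotation for 32 scales; the function name is hypothetical.

```python
def ipsative_offset_percent(n_scales):
    """Average percent of a raw score point lost on each other scale
    when one scale gains a point and the total is fixed (assumed model)."""
    if n_scales < 2:
        raise ValueError("need at least two scales")
    return 100.0 / (n_scales - 1)

for k in (2, 5, 10, 21, 32, 40):
    print(k, round(ipsative_offset_percent(k), 1))
```

Under this assumption the dependency between any two scales drops below 5% once an instrument has more than 21 scales.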



OPQ32i OPQ32r

• While traditional scoring methods produce ipsative scales, we can use a different scoring approach to recover normative latent trait scores from OPQ32i using a multidimensional IRT model (Brown & Bartram, 2007; SHL, 2009).

• OPQ32r, launched in Sept 2009, uses forced-choice triplets and IRT scoring to produce normative scale scores.

• This presentation reports the results of analyses of OPQ32i normative IRT scale scores, where item data was available, along with the analyses on the ipsative scale data.


FC scoring as a set of pairwise choices

• For OPQ32i with CTT scoring, the total score across all scales is a fixed number (i.e. 4 for each quad, with a fixed instrument total of 416).

• But: if we consider a complete ranking of 4 statements, A, B, C and D, as 6 pairs, then we can score as follows:

> A>B = 1, A<B = 0

> A>C = 1, A<C = 0 etc.

• So the total score for a quad can vary between 0 and 6.

• The total score can now be between 0 and 624.

• It is no longer constrained…
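The pairwise recoding can be sketched in Python as follows. The function name and the ranking-as-string representation are hypothetical; the number of quads (104) is derived from the totals quoted above (416 / 4).

```python
from itertools import combinations, permutations

def pairwise_codes(ranking):
    """Code a complete ranking (most preferred first) as binary pairs.

    For each pair (X, Y) in alphabetical order, the code is 1 if X is
    ranked above Y and 0 otherwise.
    """
    position = {statement: rank for rank, statement in enumerate(ranking)}
    return {
        (x, y): int(position[x] < position[y])
        for x, y in combinations(sorted(ranking), 2)
    }

# A quad of 4 statements yields 6 paired comparisons.
codes = pairwise_codes("BADC")
print(codes)

# Across all 24 possible rankings the quad total takes every value 0..6.
totals = sorted({sum(pairwise_codes(p).values()) for p in permutations("ABCD")})
print(totals)

# Instrument-level range implied by the totals quoted above.
N_QUADS = 416 // 4
print(N_QUADS * 6)
```

Because the quad total now depends on the ranking itself, the instrument total is free to vary, which is what removes the fixed-sum constraint.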


Likelihood of preferring item A to item B

         trait B
trait A    -3  -2.5    -2  -1.5    -1  -0.5     0   0.5     1   1.5     2   2.5     3
     -3     -     B     B     B     B     B     B     B     B     B     B     B     B
   -2.5     A     -     B     B     B     B     B     B     B     B     B     B     B
     -2     A     A     -     B     B     B     B     B     B     B     B     B     B
   -1.5     A     A     A     -     B     B     B     B     B     B     B     B     B
     -1     A     A     A     A     -     B     B     B     B     B     B     B     B
   -0.5     A     A     A     A     A     -     B     B     B     B     B     B     B
      0     A     A     A     A     A     A     -     B     B     B     B     B     B
    0.5     A     A     A     A     A     A     A     -     B     B     B     B     B
      1     A     A     A     A     A     A     A     A     -     B     B     B     B
    1.5     A     A     A     A     A     A     A     A     A     -     B     B     B
      2     A     A     A     A     A     A     A     A     A     A     -     B     B
    2.5     A     A     A     A     A     A     A     A     A     A     A     -     B
      3     A     A     A     A     A     A     A     A     A     A     A     A     -

Utilities of items are caused by underlying personality traits.

The higher a person’s standing on the trait, the higher the utility of the typical item (assuming that items are strong positive statements).


Two-dimensional IRT model for paired comparisons

[Figure: probability (axis 0.0 to 1.0) plotted as a surface over theta q and theta r, each ranging from -3 to 3]

$$P(y_{ij} = 1 \mid \theta_q, \theta_r) = \Phi\left(-a_{ij} + b_i\theta_q - b_j\theta_r\right)$$

In this equation:

• b_i and b_j are parameters describing the strength of the relationship between the underlying factors θ_q and θ_r and the paired comparison

• a_ij is the threshold for the paired comparison.

The latent traits are assumed to be normally distributed with unit variances and freely correlated.


How IRT scoring works: summary

• Responses to all blocks are coded as paired comparisons

• Optimisation algorithm finds a combination of scores on the 32 OPQ traits at which the observed response pattern is most likely:

> Starting values are given to the 32 scores

> Joint likelihood of participant’s responses to all paired comparisons is evaluated at starting values

> Each iterative step finds a combination of the 32 scores where the likelihood is improved

> The algorithm stops when no better combination of scores can be found: the final combination of 32 scores maximises the likelihood of the responses given to all pairs

• The algorithm works not scale-by-scale, but on all scales and pairs simultaneously

• Scores estimated in this way are no longer ipsative; they have normative properties
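The steps above can be sketched on a toy scale. This is not the SHL scoring algorithm: it assumes a normal-ogive response function with the threshold entering negatively, uses a simple coordinate pattern search in place of whatever optimiser is used in practice, and works with three traits and six invented paired comparisons (`PAIRS`). All names and parameter values are hypothetical.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pair_probability(theta_q, theta_r, a, b_i, b_j):
    """Assumed probability of preferring item i (trait q) to item j (trait r)."""
    return phi(-a + b_i * theta_q - b_j * theta_r)

def log_likelihood(theta, pairs):
    """Joint log-likelihood of all paired-comparison responses."""
    total = 0.0
    for q, r, a, b_i, b_j, y in pairs:
        p = pair_probability(theta[q], theta[r], a, b_i, b_j)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += math.log(p if y == 1 else 1.0 - p)
    return total

def estimate_traits(pairs, n_traits, step=1.0, tol=1e-4, max_iter=10_000):
    """Pattern search: keep any move that improves the joint likelihood,
    shrink the step when none does, stop when the step is below tol."""
    theta = [0.0] * n_traits                 # starting values
    best = log_likelihood(theta, pairs)      # likelihood at starting values
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for k in range(n_traits):
            for delta in (step, -step):
                trial = theta.copy()
                trial[k] += delta
                ll = log_likelihood(trial, pairs)
                if ll > best:
                    theta, best, improved = trial, ll, True
        if not improved:
            step /= 2.0
    return theta, best

# Hypothetical pairs: (trait q, trait r, threshold a, b_i, b_j, response y)
PAIRS = [
    (0, 1, 0.0, 1.0, 0.8, 1),
    (0, 1, 0.5, 0.9, 1.1, 0),
    (0, 2, -0.3, 1.2, 0.7, 1),
    (0, 2, 0.2, 0.8, 1.0, 1),
    (1, 2, 0.0, 1.0, 1.0, 0),
    (1, 2, -0.4, 0.7, 1.2, 1),
]

theta_hat, ll_hat = estimate_traits(PAIRS, n_traits=3)
print(theta_hat, ll_hat)
```

Every candidate move is judged on the likelihood of all pairs at once, so the estimate for one trait depends on the responses involving the others, which mirrors the "all scales and pairs simultaneously" point above.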



CONSTRUCT EQUIVALENCE OF OPQ32:

Between countries


Assessing construct equivalence

• OPQ32 is not a factor model, so we cannot compare the fit of a factor model to the 32 scales across samples.

• A strong test of construct equivalence comes from requiring invariance in scale variances and correlation matrices: the hypothesis being tested is that both matrices are samples from the same population (Bentler, 2005).

• Goodness of fit criteria:

> Comparative Fit Index (CFI; Bentler, 1990) should be greater than 0.95.

> Root mean square error of approximation (RMSEA) should be less than 0.08 for a reasonable fit and less than 0.05 for a good fit (Byrne, 2001).
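For readers unfamiliar with the two indices, the sketch below computes them from chi-square statistics using the standard single-sample textbook formulas. The chi-square values, degrees of freedom and sample size are hypothetical, and multi-group software such as EQS may apply adjustments not shown here.

```python
import math

def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """Comparative Fit Index from model and baseline chi-square values."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_baseline - df_baseline, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

def rmsea(chi2_model, df_model, n):
    """Root mean square error of approximation (single-sample form)."""
    return math.sqrt(max(chi2_model - df_model, 0.0) / (df_model * (n - 1)))

def describe_fit(cfi_value, rmsea_value):
    """Apply the cut-offs quoted above (CFI > 0.95; RMSEA < 0.08 / < 0.05)."""
    if cfi_value > 0.95 and rmsea_value < 0.05:
        return "good"
    if cfi_value > 0.95 and rmsea_value < 0.08:
        return "reasonable"
    return "poor"

# Hypothetical values for illustration only.
c = cfi(chi2_model=1500.0, df_model=528, chi2_baseline=60000.0, df_baseline=992)
r = rmsea(chi2_model=1500.0, df_model=528, n=5000)
print(round(c, 3), round(r, 3), describe_fit(c, r))
```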


Comparing countries

• 74,244 working adults from 19 countries

> 12 European countries, US, South Africa, Australia, China, Hong Kong, India, New Zealand.

> Country sample sizes ranged from 861 to 8,222 with an average of 3,713

• 14 language versions:

> 6 UK English countries; US English; 11 non-English samples of different European languages; 1 sample of ‘simplified’ Chinese.

• Item-level data were available for 11 of the European country data sets, and normative IRT scores have been computed for those.


Construct equivalence: Summary

• Compared correlation matrices for each country against UKE matrix.

• Exceptionally good fit for all English and European languages

> Ipsative: median CFI = 0.982 (min 0.960), median RMSEA = 0.019 (max 0.028) – includes US, S Africa and Australia.

> IRT Normative: median CFI = 0.989 (min 0.982), median RMSEA = 0.024 (max 0.029) – Europe only.

• For the Chinese version (simplified Chinese) the test identified a slight misfit in the model

> CFI=0.945, RMSEA=0.033

> Constraints violated: correlations between

» Rule Following – Conventional r=0.67 (0.45 English version)

» Forward Thinking – Achieving r=0.36 (0.17 English version)


Normative and Ipsative scale results: Europe

                          IRT Normative [k=32]   Ipsative [k=31]
Language          N       CFI      RMSEA         CFI      RMSEA
English           3,978   (reference)
Danish            8,274   .991     .022          .990     .014
Dutch             5,499   .988     .024          .989     .015
Belgium Dutch     2,109   .989     .024          .983     .019
Finnish           1,943   .985     .028          .979     .022
French            3,806   .988     .025          .981     .020
German            4,733   .982     .029          .981     .019
Italian           1,758   .988     .024          .968     .025
Norwegian         7,622   .989     .024          .989     .015
Portuguese        1,026   .990     .023          .960     .028
Swedish           9,044   .991     .021          .990     .014


Manager vs. Non-manager differences across countries

Very little between country variation



CONSTRUCT EQUIVALENCE OF OPQ32:

Within South Africa


South African data

• The study was carried out on 32,020 people to assess construct invariance as well as scale mean differences between ethnic and first-language groups on the OPQ32i.

• OPQ32i was administered in English. All candidates were proficient in English to at least Grade 12.

> The OPQ32i was scored both conventionally (as ipsative scale scores) and using the multidimensional IRT model to recover latent normative scores

• Comparison of the covariance structures of the samples was carried out using Structural Equation Modeling with EQS on both the ipsative [k=31] and the normative IRT latent trait scale scores [k=32].

> Both produce similar results (only normative score results presented here).


South African data (2009)

• 32,020 candidates in various industry sectors

> 52.10% females and 47.90% males

> Mean age of 30.67 years (SD=8.23)

> 47.60% African, 13.50% Coloured, 9.60% Indian and 29.10% White

> 37.39% Grade 12, 16.31% Certificates, 30.99% degrees and 15.31% post graduate degrees

• First-language known for 25,094 of the candidates

> 25.90% Afrikaans, 27.10% English, 2.10% Venda, 2.20% Tsonga, 21.90% Nguni (Zulu, Xhosa, Swati & Ndebele) and 20.80% Sotho (North Sotho, South Sotho, Tswana)


Group comparisons

• First, compared correlation patterns for major ethnic groups.

> Different ethnic groups would generally have different languages as their first (native) language:

» English and Afrikaans for the White and Coloured groups,

» English for the Indian group,

» Native African languages for the African group.

• Second, groups were formed by first (native) language. Each was compared to the group whose first language was English (N=6,793).


Comparisons by ethnic group

Ethnic group   N        CFI    RMSEA
White          9,318    (reference)
African        15,255   .972   .035
Coloured       4,308    .992   .020
Indian         3,083    .997   .012

Comparisons by first-language

Language    N       CFI    RMSEA
English     6,793   (reference)
Afrikaans   6,494   .998   .011
Nguni       5,488   .978   .031
Sotho       5,232   .976   .032
Tsonga      555     .991   .021
Venda       532     .990   .022

Comparison of African language groups

• The two large African language groups (Nguni and Sotho) were tested for similarity of their correlation matrices.

> The fit indices in this case are remarkably good.

> Clearly, there are more similarities between personality constructs for the two native African language groups than between English and African language groups.

• The only two groups that show a similar level of fit are the English and the Afrikaans groups.

Language   N       CFI    RMSEA
Nguni      5,488   (reference)
Sotho      5,232   .999   .006

South African data

• These analyses suggest that the relationships between scales can be considered comparable for the different ethnic and first-language groups in South Africa.

• This does not imply that average scale scores are invariant across groups.


Comparisons between 2006 & 2009 data

• 2006:

• 20,132 candidates in various industry sectors

> 56.75% females and 43.25% males

> Mean age of 30.67 years (SD=8.22)

> 45.22% African, 15.78% Coloured, 10.37% Indian and 28.64% White

> 41.60% Grade 12, 16.26% Certificates, 28.51% degrees and 13.64% post graduate degrees

• First-language known for 13,322 of the candidates

> 28.79% Afrikaans, 26.23% English, 1.54% Venda, 1.88% Tsonga, 22.26% Nguni (Zulu, Xhosa, Swati & Ndebele) and 19.30% Sotho (North Sotho, South Sotho, Tswana)


Ethnic differences: 2006 vs 2007-2009

[Figure: African vs. white. Effect size (d), axis from -0.60 to 0.60, for each of the 32 OPQ32 scales (Persuasive, Controlling, Outspoken, Independent Minded, Outgoing, Affiliative, Socially Confident, Modest, Democratic, Caring, Data Rational, Evaluative, Behavioural, Conventional, Conceptual, Innovative, Variety Seeking, Adaptable, Forward Thinking, Detail Conscious, Conscientious, Rule Following, Relaxed, Worrying, Tough Minded, Optimistic, Trusting, Emotionally Controlled, Vigorous, Competitive, Achieving, Decisive). Series: 2006 (N=14868) and 2009 (N=24573).]

Small to moderate differences between ethnic groups

First-language: 2006 vs 2007-2009

[Figure: Nguni vs. Afrikaans. Effect size (d), axis from -0.60 to 0.60, for each of the 32 OPQ32 scales (Persuasive, Controlling, Outspoken, Independent Minded, Outgoing, Affiliative, Socially Confident, Modest, Democratic, Caring, Data Rational, Evaluative, Behavioural, Conventional, Conceptual, Innovative, Variety Seeking, Adaptable, Forward Thinking, Detail Conscious, Conscientious, Rule Following, Relaxed, Worrying, Tough Minded, Optimistic, Trusting, Emotionally Controlled, Vigorous, Competitive, Achieving, Decisive). Series: 2006 (N=6407) and 2009 (N=11982).]

Small to moderate differences between languages

South African Data: Conclusions

• Effect sizes are very similar over time (2006 vs 2007-2009)

• Overall, effect sizes were either small or moderate when first-language and ethnic groups were compared.

• Analyses of fit show good evidence of construct equivalence across ethnicity and first-language.

• Level of fit is greater in some cases than others as one would expect.
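As a reminder of what the effect size (d) on the preceding charts measures, here is a minimal sketch of Cohen's d with a pooled standard deviation. Whether the original analyses used exactly this pooled form is an assumption; the labels follow Cohen's conventional benchmarks (0.2, 0.5, 0.8), and the sample scores are invented.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

def label(d):
    """Conventional size labels for |d| (Cohen's benchmarks)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "moderate"
    return "large"

# Invented scale scores for two groups.
group_1 = [12, 15, 14, 16, 13, 15, 17, 14]
group_2 = [13, 14, 13, 15, 12, 14, 16, 13]
d = cohens_d(group_1, group_2)
print(round(d, 2), label(d))
```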



CONCLUSIONS


General conclusions from the OPQ data sets:

• Forced-choice format version of OPQ32 is very robust in terms of construct equivalence across countries and, for S Africa, between first-language and ethnic groups within country.

• Some differences occur between and within countries in terms of average scale scores.

> These effects are relatively small when compared with other demographics (e.g. gender, managerial position).

> Effect sizes are generally not of substantive significance in terms of individual profile interpretation.



To return to the original question…

Where there is construct equivalence does it make sense to aggregate across samples for norming purposes in order to preserve the underlying raw score difference effects?


The key question in choosing a norm reference group:

• ‘Does the norm group consist of the sort of persons with whom [the candidate] should be compared?’ (Cronbach 1990, p. 127).

• Cronbach made clear that this does not even entail comparing people to others from their own demographic group.


IF…

• There is no evidence of construct bias

• Translation is ok – no substantive DIF

• There is control over general response bias differences relating to culture or language

• Norm demographics are equivalent or equatable

THEN…

> We can aggregate across languages or countries

> We can compare individual profiles using various different norms


General guidance (Bartram 2008)

• Aggregation of norms across countries/cultures/languages should not be automated.

• Correlation matrices should be checked for measurement invariance

• Norm samples should be checked for comparability of demographics – re-weighting adjustments made if necessary

> Where demographics are associated with score differences, the ‘mix’ within samples should be weighted to ensure comparability across samples

• Mix of countries or cultures should be ‘reasonable’, with more caution exercised for mixing more divergent cultures

> Use country similarity (cluster analyses) as a guide for ‘reasonableness’


General guidance (continued)

• In combining across countries or cultures, samples should be weighted appropriately for the final mix – e.g. equal weights rather than relative to populations or sample sizes.

• Adherence to these guidelines involves the exercise of expert judgement and needs to be dealt with on a case by case basis.

• Expert judgement is also needed when aggregation is not possible and people from different countries are being compared
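The effect of the weighting choice when combining country samples can be shown with a small sketch. It treats the aggregated norm as a weighted mixture of country distributions and computes the mixture mean and SD from each country's summary statistics. The country figures are hypothetical and the function is an illustration, not the procedure from Bartram (2008).

```python
import math

def pooled_norm(groups, weighting="equal"):
    """Mean and SD of a weighted mixture of groups.

    groups: list of (n, mean, sd) tuples, one per country sample.
    weighting: "equal" gives each country the same weight;
               "sample_size" weights each country by its n.
    """
    if weighting == "equal":
        weights = [1.0 / len(groups)] * len(groups)
    elif weighting == "sample_size":
        total_n = sum(n for n, _, _ in groups)
        weights = [n / total_n for n, _, _ in groups]
    else:
        raise ValueError("weighting must be 'equal' or 'sample_size'")

    mix_mean = sum(w * m for w, (_, m, _) in zip(weights, groups))
    second_moment = sum(w * (sd ** 2 + m ** 2) for w, (_, m, sd) in zip(weights, groups))
    return mix_mean, math.sqrt(second_moment - mix_mean ** 2)

# Hypothetical country samples: (n, mean raw score, SD)
COUNTRIES = [(8000, 20.0, 4.0), (1000, 24.0, 4.0)]

print(pooled_norm(COUNTRIES, "equal"))
print(pooled_norm(COUNTRIES, "sample_size"))
```

With sample-size weights the larger sample dominates the combined norm; equal weights give each country the same influence regardless of how many people happened to be tested there.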


Summary

• Cross language, cross culture and cross country analyses of OPQ32i suggest that the effects on scale scores are small compared with other general demographics effects.

• For OPQ32i, consistency of construct equivalence across samples supports aggregation of ‘norms’ for national and multi-national assessment use.

• In general, construction of national or multi-national norms must be done with care to avoid aggregating data where there is poor construct equivalence or effects of systematic bias.


Thank you

Reference:

> Bartram, D. (2008). Global norms: Towards some guidelines for aggregating personality norms across countries. International Journal of Testing, 8(4), 315-333.


What is ‘culture’ and when does it matter?

• Culture is a set of exogenous variables relating to

> Shared values

> Shared cognitions

> Shared knowledge

> Shared standards or cultural norms

> Shared language.

• In practical terms, ‘culture’ matters when it is related to a group of people for whom within-group variability in terms of relevant constructs is relatively small compared to variability between them and other groups.
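One common way to put a number on "within-group variability relative to between-group variability" is eta squared, the share of total score variance that lies between groups. Its use here is an illustrative assumption, not a method named in the presentation, and the scores are invented.

```python
from statistics import mean

def eta_squared(groups):
    """Proportion of total variance that lies between groups (0 to 1)."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    if ss_total == 0:
        raise ValueError("scores have no variance")
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total

# Overlapping groups: group membership explains little of the variance.
similar = [[10, 12, 14, 16], [11, 13, 15, 17]]
# Well-separated groups: group membership explains most of the variance.
distinct = [[1, 2, 3], [11, 12, 13]]

print(round(eta_squared(similar), 3), round(eta_squared(distinct), 3))
```

On this reading, ‘culture’ matters for a construct when the value is high, i.e. when knowing a person's group tells you a lot about their score.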


Why do we use norms?

• Norms provide a basis for comparing the scores of an individual with those of some well-defined reference group.

> Hence norms are useful in the interpretation of test scores.

• Users sometimes confuse norms and validity:

> Scoring in the top ten percent of graduate applicants on trait X is only good to know if trait X is positively correlated with criterion behaviour.

