Validating Assessment Centers
Kevin R. Murphy
Department of Psychology
Pennsylvania State University, USA
A Prototypic AC
• Groups of candidates participate in multiple exercises
• Each exercise designed to measure some set of behavioral dimensions or competencies
• Performance/behavior in exercises is evaluated by sets of assessors
• Information from multiple assessors is integrated to yield a range of scores
Common But Deficient Validation Strategies
• Criterion-related validity studies
– Correlate OAR with criterion measures
• e.g., OAR correlates .40 with performance measures, but written ability tests do considerably better (.50’s)
– There may be practical constraints on using tests, but psychometric purists are not concerned with the practical
Common But Deficient Validation Strategies
• Construct Validity Studies
– Convergent and Discriminant Validity assessments
• AC scores often show relatively strong exercise effects and relatively weak dimension/competency effects
• This is probably not the right model for assessing construct validity, but it is the one that has dominated much of the literature
Common But Deficient Validation Strategies
• Content validation
– Map competencies/behavioral descriptions onto the job
– If competencies measured by the AC show reasonable similarity to job competencies, content validity is established
• Track record for ACs is nearly perfect because job information is used to select competencies, but evidence that competencies are actually measured is often scant
Ask the Wrong Question, Get the Wrong Answer
• Too many studies ask “Are Assessment Centers Valid?”
• The question should be “Valid for What?”
– That is, validity is not determined by the measurement procedure or even by the data that arise from that procedure. Validity is determined by what you attempt to do with the data
Sources of Validity Information
• Validity for what?
– Determine the ways you will use the data coming out of an AC. ACs are not valid or invalid in general; they are valid for specific purposes
• Cast a wide net!
– Virtually everything you do that gives you insight into what the data coming out of an AC mean can be thought of as part of the validation process
Sources of Validity Information
• Raters
– Rater training, expertise, agreement
• Exercises
– What behaviors are elicited, what situational factors affect behaviors
• Dimensions
– Is there evidence to map from AC behavior to dimensions to job?
Sources of Validity Information
• Scores
– A wide range of assessments of the relationships among the different scores obtained in the AC process provides validity information
• Processes
– Evidence that the processes used in an AC tend to produce reliable and relevant data is part of the assessment of validity
Let’s Validate An Assessment Center!
• Design the AC
• Identify the data that come out of an AC
• Determine how you want to use that data
• Collect and evaluate information relevant to those uses
– Data from pilot tests
– Analysis of AC outcome data
– Evaluations of AC components and process
– Lit reviews, theory and experience
Design
• Job: Entry-level Human Resource Manager
• Competencies
– Active Listening
– Speaking
– Management of Personnel Resources
– Social Perceptiveness
• Being aware of others' reactions and understanding why they react as they do.
– Coordination
• Adjusting actions in relation to others' actions.
– Critical Thinking
– Reading Comprehension
– Judgment and Decision Making
– Negotiation
– Complex Problem Solving
Design
Populate the matrix – which competencies and what exercises?

               Exercise #1   Exercise #2   Exercise #3
Competency 1
Competency 2
Competency 3
Competency 4
Competency 5
Competency 6
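A minimal sketch of what a populated design matrix might look like in code. The exercise names and the assignment of competencies to exercises are hypothetical illustrations, not part of the original design; the point is that once the matrix exists as data, coverage checks become trivial.

```python
from collections import Counter

# Hypothetical competency-by-exercise design matrix (names are placeholders).
design = {
    "In-Basket":      ["Critical Thinking", "Judgment and Decision Making",
                       "Complex Problem Solving"],
    "Role Play":      ["Active Listening", "Social Perceptiveness", "Negotiation"],
    "Group Exercise": ["Speaking", "Coordination", "Negotiation",
                       "Judgment and Decision Making"],
}

# Coverage check: flag competencies measured by only one exercise, since
# single-exercise measurement confounds the competency with the exercise.
counts = Counter(c for comps in design.values() for c in comps)
single = [c for c, n in counts.items() if n < 2]
print("Competencies measured by only one exercise:", single)
```

A check like this makes the design decision explicit before any data are collected: every competency that appears in only one cell of the matrix will be impossible to separate from that exercise's situational demands.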
Assessors
• How many, what type, which exercises?
Exercise 1 Exercise 2 Exercise 3
Assessor 1
Assessor 2
Assessor 3
Assessor 4
Assessor 5
Assessment Data
• Individual behavior ratings?
– How will we set these up so that we can assess their accuracy or consistency?
• Individual competency ratings?
– How will we set these up so that we can assess their accuracy or consistency?
• Pooled ratings
– What level of analysis?
• OAR
Uses of Assessment Data
• Competency
– Is it important to differentiate strengths and weaknesses?
• Exercise
– Is the AC working as expected? (Exercise effects might or might not be confounds)
• OAR
– Do you care how people did overall?
• Other
– Process tracing for integration. Is it important how ratings change in this process?
Validation
• The key question in all validation efforts is whether the inferences you want to draw from the data can be supported or justified
– A question that often underlies this assessment involves determining whether the data are sufficiently credible to support any particular use
Approaches to Validation
• Assessment of the Design
– Competency mapping
• Do exercises engage the right competencies?
• Are competency demonstrations in the AC likely to generalize?
• Are these the right competencies?
– Can assessors discriminate competencies?
• Are the assessors any good?
– Do we know how good they are?
Approaches to Validation
• Assessment of the Data
– Inter-rater agreement
– Distributional assessments
– Reliability and generalizability analysis
– Internal structure
– External correlates
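To make the first item concrete, here is a minimal sketch of an inter-rater agreement check. The ratings are fabricated illustrative numbers (one OAR per candidate per assessor, on a 1-5 scale); any real analysis would also look at exact and within-one agreement rates, not just correlation.

```python
# Hypothetical OARs: one list per assessor, one entry per candidate.
ratings = {
    "assessor_1": [3.0, 4.5, 2.0, 4.0, 3.5, 1.5],
    "assessor_2": [3.5, 4.0, 2.5, 4.5, 3.0, 2.0],
}

def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(ratings["assessor_1"], ratings["assessor_2"])
print(f"Inter-rater correlation: {r:.2f}")
```

Note that a high correlation shows the assessors rank candidates similarly; it does not show they are calibrated to the same scale, which is why the distributional checks below are a separate item.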
Approaches to Validation
• Assessment of the Process
– Did assessors have opportunities to observe relevant behaviors?
– What is the quality of the behavioral information that was collected?
– How were behaviors translated into evaluations?
– How were observations and evaluations integrated?
Approaches to Validation
• Assessment of the Track Record
– Relevant theory and literature
– Relevant experience with similar ACs
• Outcomes with dissimilar ACs
Assessment of the Design: Competencies
• Competency Mapping (content validation)
– Do exercises elicit behaviors that illustrate the competency?
• Are we measuring the right competencies?
• Evidence that exercises reliably elicit the competencies
– Generalizability from AC to world of work
Assessment of the Design: Assessor Characteristics
• Training and expertise
• What do we know about their performance as assessors?
– One piece of evidence for validity might be information that will allow us to evaluate the performance or the likely performance of our assessors
Assessment of the Data
• Distributional assessments
– Does the distribution of scores make sense?
– Is the calibration of assessors reasonable given the population being assessed?
– Is the variability in scores reasonable?
Assessment of the Data
• Reliability and Generalizability analyses
– The distinction between reliability and validity is not as fundamental as most people believe
– Assessments of reliability are an important part of validation
– The natural structure of AC data fits nicely with generalizability theory
Assessment of the Data
• Generalizability
– AC data can be classified according to a number of factors – rater, ratee, competency, exercise
– ANOVA is the starting point for generalizability analysis – i.e., identifying the major sources of variability
• Complexity of the ANOVA design depends largely on whether the same assessors evaluate all competencies and exercises, or only some
Assessment of the Data
• Generalizability – an example
– Use ANOVA to examine the variability of scores as a function of:
• Candidates
• Dimensions (competencies)
• Exercises (potential source of irrelevant variance)
• Assessors
Assessment of the Data
Candidates   Overall differences in candidate performance
Dimensions   Does the pool of candidates show more strength in some competency areas than others?
Assessors    Are assessors calibrated?
C x D        Do candidates show different strengths and weaknesses?
C x A        Do assessors agree about candidates?
A x D        Do assessors agree about dimensions (competencies)?
C x D x A    Do assessors agree in their evaluations of the patterns of strength and weakness of different candidates?
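A minimal sketch of the ANOVA decomposition that starts such an analysis, in plain Python. The ratings are fabricated: a small fully crossed Candidate x Dimension x Assessor design with one rating per cell, and only the main-effect sums of squares are shown (interaction terms follow the same pattern from cell means).

```python
# ratings[c][d][a]: candidate c, dimension d, assessor a (illustrative 1-5 scale).
ratings = [
    [[3, 4], [2, 3]],   # candidate 0
    [[5, 5], [4, 4]],   # candidate 1
    [[2, 3], [2, 2]],   # candidate 2
]
C, D, A = len(ratings), len(ratings[0]), len(ratings[0][0])
N = C * D * A

flat = [ratings[c][d][a] for c in range(C) for d in range(D) for a in range(A)]
grand = sum(flat) / N

def main_effect_ss(means, n_per_level):
    """Sum of squares for a main effect, from its level means."""
    return n_per_level * sum((m - grand) ** 2 for m in means)

cand_means = [sum(ratings[c][d][a] for d in range(D) for a in range(A)) / (D * A)
              for c in range(C)]
dim_means  = [sum(ratings[c][d][a] for c in range(C) for a in range(A)) / (C * A)
              for d in range(D)]
ass_means  = [sum(ratings[c][d][a] for c in range(C) for d in range(D)) / (C * D)
              for a in range(A)]

ss_total = sum((x - grand) ** 2 for x in flat)
ss_c = main_effect_ss(cand_means, D * A)   # overall candidate differences
ss_d = main_effect_ss(dim_means, C * A)    # pool stronger on some dimensions?
ss_a = main_effect_ss(ass_means, C * D)    # assessor calibration

print(f"SS candidate: {ss_c:.2f}, SS dimension: {ss_d:.2f}, "
      f"SS assessor: {ss_a:.2f}, SS total: {ss_total:.2f}")
```

A large candidate component relative to the assessor component is what you hope to see; the agreement questions in the table correspond to the interaction terms, computed analogously from the two-way and three-way cell means.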
Assessment of the Data
• Internal Structure
– Early in the design phase, articulate your expectations regarding the relationships among competencies and dimensions
– This articulation becomes the foundation for subsequent assessments
• It is impossible to tell if the correlations among ratings of competencies are too high or too low unless you have some idea of the target you are shooting for
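A sketch of comparing observed competency intercorrelations against the design-phase targets. All numbers here are fabricated for illustration; in practice the expected values would come from your articulated expectations and the observed ones from the actual AC ratings.

```python
# Design-phase expectations about competency intercorrelations (hypothetical).
expected = {
    ("Active Listening", "Negotiation"): 0.5,
    ("Active Listening", "Critical Thinking"): 0.2,
    ("Negotiation", "Critical Thinking"): 0.3,
}
# Correlations computed from AC competency ratings (hypothetical).
observed = {
    ("Active Listening", "Negotiation"): 0.80,
    ("Active Listening", "Critical Thinking"): 0.70,
    ("Negotiation", "Critical Thinking"): 0.75,
}

# Flag pairs whose observed correlation departs sharply from the target.
# Uniformly high observed values, as here, suggest halo or an exercise effect.
for pair, target in expected.items():
    gap = observed[pair] - target
    if abs(gap) > 0.25:
        print(f"{pair}: expected {target:.2f}, observed {observed[pair]:.2f}")
```

The threshold of .25 is arbitrary and purely illustrative; the substantive point is that "too high" only has meaning relative to a stated target.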
Assessment of the Data
• Internal Structure
– Confirmatory factor analysis is much better than exploratory for making sense of the internal structure
– Exercise effects are not necessarily a bad thing. No matter how good assessors are, they cannot ignore overall performance levels
• Halo is not necessarily an error; it is part of the judgment process all assessors use
Assessment of the Data
• Confirmatory Factor Models
– Exercise only
• Does this model provide a reasonable fit?
– Competency
• Does this model provide a reasonable fit?
– Competency + exercise
• How much better is the fit when you include both sets of factors?
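One way to make this three-model comparison concrete is to write the competing measurement models in lavaan-style syntax. This is only a sketch: the indicator names (e.g., `r_listening_ex1` for the Active Listening rating from Exercise 1) are hypothetical placeholders, and model fit would be compared with chi-square difference tests or indices such as CFI and RMSEA.

```
# Model 1: exercise factors only
Ex1 =~ r_listening_ex1 + r_speaking_ex1 + r_negotiation_ex1
Ex2 =~ r_listening_ex2 + r_speaking_ex2 + r_negotiation_ex2

# Model 2: competency factors only
Listening   =~ r_listening_ex1 + r_listening_ex2
Speaking    =~ r_speaking_ex1 + r_speaking_ex2
Negotiation =~ r_negotiation_ex1 + r_negotiation_ex2

# Model 3: competency + exercise factors
# (both sets of loadings, typically with exercise factors
#  specified as uncorrelated with competency factors)
```

If Model 3 fits much better than Model 2 alone, exercise variance is real but not necessarily fatal; the question is how much competency variance remains once it is modeled.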
Assessment of the Data
• External Correlates
– The determination of external correlates depends strongly on:
• The constructs/competencies you are trying to measure
• The intended uses of the data
Assessment of the Data
• External Correlates– Alternate measures of competencies– Measures of the likely outcomes and correlates of
these competencies
Competencies
– Active Listening
– Speaking
– Management of Personnel Resources
– Social Perceptiveness
– Coordination
– Critical Thinking
– Reading Comprehension
– Judgment and Decision Making
– Negotiation
– Complex Problem Solving
Alternative Measures

Critical Thinking, Reading Comprehension   Standardized tests
Judgment and Decision Making               Supervisory ratings, Situational Judgment Tests
Possible Correlates
• Active Listening– Success in coaching assignments– Sought as mentor
• Speaking– Asked to serve as spokesman, public speaker
• Negotiation– Success in bargaining for scarce resources
Assessments of the Process
• Opportunities to observe
– Frequency with which target behaviors are recorded
• Quality of the information that is recorded
– Detail and consistency
• Influenced by format – e.g., narrative vs. checklist
Assessments of the Process
• Observations to evaluations
– How is this done?
– Consistent across assessors?
• Integration
– Clinical vs. statistical
• Statistical integration should always be present but should not necessarily trump consensus
• The process by which consensus moves away from the statistical summary should be transparent and documented
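A minimal sketch of documenting that movement. The candidate names, ratings, and the 0.5-point flagging threshold are all fabricated for illustration: the statistical summary here is just the mean of assessor OARs, and any consensus OAR that moves well away from it gets flagged for a documented rationale.

```python
# Hypothetical integration records: assessor OARs plus the consensus OAR.
candidates = {
    "cand_1": {"assessor_ratings": [3.0, 3.5, 4.0], "consensus_oar": 3.5},
    "cand_2": {"assessor_ratings": [2.0, 2.5, 2.0], "consensus_oar": 3.5},
}

for name, rec in candidates.items():
    stat = sum(rec["assessor_ratings"]) / len(rec["assessor_ratings"])
    shift = rec["consensus_oar"] - stat
    # Large moves away from the statistical summary should be documented.
    flag = "  <- document rationale" if abs(shift) > 0.5 else ""
    print(f"{name}: statistical {stat:.2f}, consensus {rec['consensus_oar']:.2f}, "
          f"shift {shift:+.2f}{flag}")
```

The point of the log is not to forbid consensus adjustments but to make them visible, so the process evidence described above can actually be evaluated.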
Assessment of the Track Record
• The history of similar ACs forms part of the relevant research record
• The history of dissimilar ACs is also relevant
The Purpose-Driven AC
• What are you trying to accomplish with this AC?
• Is there evidence this AC or ones like it have accomplished or will accomplish this thing?
– Suppose the AC is intended principally to serve as part of leadership development. Identifying this principal purpose helps to identify relevant criteria
Criteria
• Advancement
• Leader success
• Follower satisfaction
• Org success in dealing with turbulent environments
– The process of identifying criteria is largely one of thinking through what the people and the organization would be like if your AC worked
An AC Validation Report
• Think of validating an AC the same way a pilot does his or her pre-flight checklist
• The more you know about each of the items on the checklist, the more compelling the evidence that the AC is valid for its intended purpose
AC Validity Checklist
• Do you know (and how do you know) whether:
– The exercises elicit behaviors that are relevant to the competencies you are trying to measure
– These AC demonstrations of competency are likely to generalize
AC Validity Checklist
• Do you know (and how do you know) whether:
– Raters have the skill, training, expertise needed?
– Raters agree in their observations and evaluations
– Their resolutions of disagreements make sense
AC Validity Checklist
• Do you know (and how do you know) whether:
– Score distributions make sense
• Are there differences in scores received by candidates?
• Can you distinguish strengths from weaknesses?
AC Validity Checklist
• Do you know (and how do you know) whether:
– Analyses of Candidate x Dimension x Assessor yield sensible outcomes
• Assessor – are assessors calibrated?
• C x D – do candidates show patterns of strength and weakness?
• A x D – do assessors agree about dimensions?
• C x D x A – do assessors agree about evaluations of patterns of strengths and weaknesses?
AC Validity Checklist
• Do you know (and how do you know) whether:
– Factor structure makes sense, given what you are trying to measure
• Do you know anything about the relationships among competencies?
• Is this reflected in the sorts of factor models that fit?
AC Validity Checklist
• Do you know (and how do you know) whether:
– Competency scores are related to:
• Alternate measures of these competencies
• Likely outcomes and correlates of these competencies
AC Validity Checklist
• Do you know (and how do you know) whether:
– There are Competency x Treatment interactions
• Identifying individual strengths and weaknesses is most useful when different patterns will lead to different treatments (training programs, development opportunities) and when making the right treatment decision for each individual leads to better outcomes than treating everyone the same
AC Validity Checklist
• Do you know (and how do you know) whether:
– The process supports good measurement
• Do assessors have opportunities to observe relevant behaviors?
• Do they record the right sort of information?
• Is there a sensible process for getting from behavior observation to competency judgment?
AC Validity Checklist
• Do you know (and how do you know) whether:
– The integration process helps or hurts
• How is integration done?
• Is it the right method given the purpose of the AC?
• How much does the integration process change the outcomes?
AC Validity Checklist
• Do you know (and how do you know) whether:
– Other similar ACs have worked well
– Other dissimilar ACs have worked better, worse, etc.
AC Validity Checklist
• Don’t overdo the checklist metaphor
– A pilot will not take off unless everything on the list checks out
– Validation is not an all-or-none thing
• More evidence is better
• Broadly based evidence is better than lots of one kind
• The validation checklist can help you improve the AC
– Your goal is not to make an AC perfect, but to accumulate evidence