
Potholes and Pitfalls on the Road to Authentic Assessment

Sam Lohmann, Karen R. Diller, and Sue F. Phelps

portal: Libraries and the Academy, Vol. 19, No. 3 (2019), pp. 429–460. Copyright © 2019 by Johns Hopkins University Press, Baltimore, MD 21218.

abstract: This case study discusses an assessment project in which a rubric was used to evaluate information literacy (IL) skills as reflected in undergraduate students’ research papers. Subsequent analysis sought relationships between the students’ IL skills and their contact with the library through various channels. The project proved far longer and more complex than expected and yielded inconclusive results. We reflect on what went wrong and highlight lessons learned in the process. Special attention is paid to issues of project management and statistical analysis, which proved crucial stumbling blocks in the effort to conduct a meaningful authentic assessment.

Introduction

In 2013, a team of librarians at Washington State University (WSU) Vancouver began an authentic assessment project, a form of assessment that looks for evidence of knowledge and skills in the performance of meaningful real-world tasks. The project was intended to determine whether students’ contact with the library had an impact on their information literacy (IL) skills as reflected in their writing. What began as a seemingly simple one-semester pilot project grew into a multiyear study. Ultimately, due to a series of unforeseen issues, it became a long, time-consuming study with few conclusive or actionable results. In the end, the project was most interesting for what it taught the researchers about the complexity and importance of research design and project management. This article attempts to provide an account of these issues, which include norming and inter-rater reliability, planning, and iterative design.

To tell this story, we will depart somewhat from the conventions of a primary research article, which typically focuses on reporting and interpreting findings after concisely outlining the research methods. Instead, we will focus on our methods and process, concluding with the lessons we learned from this largely unsuccessful project.


Our statistical results are included in Appendix B, since they are ancillary to the article but may help the reader to understand the study. Though suggestive and generally encouraging, the assessment results are constrained by methodological issues. However, a detailed account of the research process itself not only offers an opportunity for reflection but also may save future researchers time and frustration.

Background

Authentic Assessment

Over the past decade, librarians at many institutions have sought new methods for assessing the impact and value of library programs, often moving away from traditional usage indicators such as circulation statistics toward quantitative and qualitative measures that attempt to assess the library’s impact on the mission and goals of its parent institution.1 The need to measure the impact of library instruction has become especially urgent. Because many libraries provide IL instruction, librarians have sought ways to assess the effectiveness of that teaching, usually by assessing students’ IL skills in some way. From their publication in 2000 until the adoption of the ACRL Framework for Information Literacy for Higher Education in 2016, the Association of College and Research Libraries (ACRL) Information Literacy Competency Standards for Higher Education (the Standards) were the most influential model for articulating and measuring IL learning outcomes in the United States.2 Starting from this common reference point, researchers have used a wide variety of methods to assess IL. These techniques can be broadly divided between indirect methods, which use an instrument such as a standardized test or survey to evaluate IL as an isolated skill set, and direct or authentic assessment methods, which seek evidence of students’ learning within the context of their regular coursework.

Kathleen Montgomery uses the term authentic assessment to describe practices that include “the holistic performance of meaningful, complex tasks in challenging environments that involve contextualized problems.”3 Such practices have given strong support to the use of rubrics for assessment of IL and other skills, as advocated by Montgomery, by the team of Karen Diller and Sue Phelps, and by Megan Oakleaf.4 Articles published after the completion of data collection for the present study have extended and affirmed the use of rubrics for authentic IL assessment.5

Assessment at WSU Vancouver Library

WSU Vancouver is an urban campus of Washington state’s only land-grant university, a large Research I institution that offers a full range of programs and engages in extensive research activity; the original campus is in Pullman, Washington. WSU Vancouver, across the Columbia River from Portland, Oregon, serves about 3,500 students annually, the majority of them transfer students. IL instruction is a major part of the library’s mission, and librarians frequently engage in classroom teaching. This teaching typically takes the form of one- or two-hour sessions tailored to the instructors’ requests, often combining a basic but essential procedural demonstration with more interactive, critical engagement in IL issues. In addition to one-shot bibliographic instruction


sessions, librarians teach a one-credit online IL elective, Accessing Information for Research, each semester. In addition to library instruction, reference interactions, content on the library website, and instruction from nonlibrary faculty may also contribute to students’ IL learning.

WSU Vancouver’s General Education Learning Goals include an IL goal based on the ACRL Standards. Part of the library’s mission is to support this goal. To determine whether the library did this effectively and how it might better support the goal, instruction librarians met and developed an assessment plan. After reviewing previous studies using both indirect and authentic assessment methods, the librarians determined that the most relevant, actionable results would be obtained through authentic IL assessment methods, specifically the use of a rubric to score student research papers. After securing university funds to help cover the anticipated work hours and receiving an Institutional Review Board waiver, we moved forward with the project.

Our chosen method had the advantage of being flexible, applicable to a wide range of skill levels and assignments, and directly tied to student coursework. If the results could be analyzed in relation to student library use data and library instruction records, they would have the potential to demonstrate the value of existing library interventions and to inform new plans. However, weaknesses became apparent as we moved forward. For instance, there was no way to account for IL learning that occurred outside the library as part of students’ regular coursework, employment, or extracurricular activities. In addition, data collection and analysis proved surprisingly time-consuming, and our methods took some unforeseen directions in the process.

Methods

Data Collection

Since this article’s main purpose is to recount and reflect on methodological issues that arose during our study, we will discuss the research methods chronologically and in detail, beginning with our preparations prior to data collection. Following an explanation of our two rounds of data collection and the complications that arose, we will describe our research questions and the data analysis methods by which we sought to answer them.

Preparation and Planning

In preparation, we substantially revised a rubric that had been used for a previous IL assessment project on campus.6 Because WSU’s stated undergraduate learning goal closely follows the language of the ACRL Standards, the latter were used to guide both the previous and the revised rubrics.7 The present assessment focused more narrowly on individual student research papers rather than on summative and reflective portfolios, so we chose to limit the rubric to the goals and outcomes that could be directly observed in a typical research paper, leaving out those that could only be assessed through observation and documentation of students’ research practices prior to writing.


Eight learning outcomes were identified, based on aspects of ACRL Standards One (“. . . determines the nature and extent of the information needed”), Four (“. . . uses information effectively to accomplish a specific purpose”), and Five (“. . . understands many of the economic, legal, and social issues surrounding the use of information and accesses and uses information ethically and legally”), and on corresponding university-wide undergraduate learning goals. These outcomes were described in the rubric in terms of three qualitative rankings, “Emerging,” “Developing,” and “Integrating,” each subdivided into two possible scores, resulting in a 6-point ordinal scale—that is, a scale that allowed for ranking of the data. Standards Two and Three were omitted because the researchers did not believe they could be objectively assessed using student writing. The full text of the rubric appears as Table 1.

We also created a survey (see Appendix A) to gather data on students’ contact with the library through various channels, as well as other potentially relevant information, such as the students’ major, gender, number of semesters completed, and transfer status (that is, whether the student transferred or began at WSU Vancouver as a first-year student). A staff member obtained participants’ enrollment records from WSU’s Office of Institutional Research and compared them with the bibliographic instruction records routinely kept by library staff. Once the number of bibliographic instruction sessions attended by each student had been recorded, the staff member removed all identifying information from the demographic surveys and data spreadsheets to ensure anonymity before providing them to the researchers.

Initial Sample

With this process in place, we selected a sample of courses for the first iteration of the assessment, in the fall of 2013. For this initial sample, we sought courses that included students at various stages in their college careers (both first-year and transfer students), involved a substantial writing assignment with a research component, and had high enrollment. Given these criteria and the size of our institution, we did not believe that a true random sample of courses could be achieved. Instead, we tried to make a representative selection. Three English course sections at the 100, 200, and 300 levels were chosen, along with a 400-level History course section. By arrangement with the instructors, we visited the four classes to solicit participation and distributed demographic surveys and consent forms to those who chose to participate. We gave participants a brief, intentionally vague description of the research project, explaining that we would look at their research papers and that their anonymity would be protected. We avoided references to IL or to the specific purpose of the study. Near the end of the semester, 46 usable papers were collected, anonymized, and prepared for scoring.

A team of four librarians participated in scoring the student papers using the rubric. In preparation, they held a norming session in which they discussed the rubric, scored a set of sample papers individually, compared the results, and further discussed points of ambiguity and disparity that arose. The goal was not to produce identical scores but to produce scores with a difference of no more than one point for each rubric facet and no more than eight points in the aggregate score that resulted from adding the eight facet scores together.


Table 1. Rubric for assessment of information literacy

Each facet is scored on a six-point scale: Emerging (1–2), Developing (3–4), Integrating (5–6).

1. Determines the extent and type of information needed*

a. Understands the multiple viewpoints relevant to need even if assignment does not state this as a requirement.
Emerging: Draws primarily from anecdotal or personal experience.
Developing: Limited viewpoints, but more than the student's own.
Integrating: Balanced. Good representation of viewpoints relevant to need.

b. Uses multiple source types (i.e., journal article, book, newspaper, etc.)
Emerging: Sources all from one source type.
Developing: Mix of two source types. Partially satisfies the information type of the assignment, with some variety of source material.
Integrating: Mix of more than two source types. Satisfies or exceeds the expected variety of information type of the assignment.

c. Extent of source materials is appropriate to assignment.
Emerging: Amount of source material is below requirements of assignment.
Developing: Amount of source material partially satisfies the requirements of the assignment.
Integrating: Amount of source material satisfies or exceeds the requirements of the assignment.

4. Assesses credibility and applicability of information sources.
Emerging: Accepts information without question. For example: quotes sources without comment or evaluation; sources are not timely for the topic; sources are inappropriate for project.
Developing: Articulates and/or applies basic evaluation criteria to information sources in a partial or limited way. For example: mix of sources which are timely and not timely; mix of authoritative and nonauthoritative sources; mentions one aspect of credibility but ignores others.
Integrating: Clearly articulates and applies evaluation criteria to the information. For example: all sources are timely for the topic; all sources are authoritative; mentions more than one evaluative criterion per source.

* Rubric item numbers follow the numbering of standards and sample outcomes in the Association of College and Research Libraries Information Literacy Competency Standards for Higher Education (2000).


Table 1, cont.

5. Uses information effectively to accomplish a purpose (successful completion of assignment).

a. Demonstrates understanding of the importance of putting sources into context and maintaining the contextual meaning of sources.
Emerging: Uses sources out of context. For example: distorts opposing viewpoints.
Developing: Demonstrates some understanding of how context is important when using sources to support arguments.
Integrating: Respects the context and integrity of sources of information. For example: integrates opposing viewpoints into broader contexts.

b. Successfully integrates own knowledge with the knowledge of others.
Emerging: Relies heavily on quotations. For example: quotations do not serve a purpose; quotations are not well used. For annotated bibliographies: sources are not relevant to topic, and student doesn't recognize this.
Developing: Uses more paraphrasing than quotations. For example: quotations or references serve a purpose but are not integrated into an overall argument or thesis, or are used for background information. For annotated bibliographies: majority of sources are relevant to topic.
Integrating: Integrates quotations and paraphrases appropriately to formulate an argument. For example: sources serve that purpose; quotations or paraphrases are effectively integrated to support student's thesis and/or as support for a specific point. For annotated bibliographies: all sources are relevant, and relevancy is noted in annotations.

c. Citing/documenting sources. FORMAT ONLY.
Emerging: Makes multiple errors when citing sources in text and in reference list.
Developing: Makes minimal errors when citing sources in text and in reference list.
Integrating: Makes no errors when citing sources in text and in reference list.

6. Accesses and uses information ethically and legally.

a. Recognizes plagiarism and need for documentation.
Emerging: Uses information without referencing the sources of that information.
Developing: Acknowledges the source of information most of the time. Somewhat clear as to which work is the student's and which is from a source.
Integrating: Always acknowledges the source of information.


Through the discussion process, the four raters agreed on sufficiently similar scores and developed a consensus as to how the rubric would be applied in practice.

To minimize the effect of individual raters, each paper was scored separately by two randomly assigned raters. In cases where their total scores for a given paper differed by more than eight points (that is, by more than one point per rubric facet on average), a third rater with no knowledge of the other scores would also evaluate the paper. Because the rubric scale was ordinal, subsequent analysis would use the median of the two or three rater scores. Using this method, an initial sample of 46 papers from fall 2013 was scored by four raters.

In addition to simply assessing the IL skills of our sample, we wanted to identify any significant relationship between IL and contact with the library through various channels. We also hoped to detect any significant differences related to the demographic variables collected in our survey. In the statistical analysis process, we sought significant relationships between resolved median IL rubric scores as dependent variables—that is, the variables we wanted to measure—and 12 independent variables, indicating library contact and demographic features, factors that might cause some change in the rubric scores. Tables 2, 3, and 4 summarize these variables and our approach to quantifying them, as well as central tendencies and distributions for our sample. Table 2 lists the nine rubric variables, including eight individual facets reflecting distinct skills, as well as an aggregate total score. The demographic factors listed in Table 3 include gender, transfer status, age, and number of semesters at WSU Vancouver. Table 4 lists the library contact variables, including an aggregate library contact score used to estimate overall exposure to library interventions.

Because the contact variables shown in Table 4 (C1–C5) are measured on a variety of scales, it was necessary to recode these measures before adding them together to produce a weighted aggregate library contact variable (C6). As shown in Table 5, variables for time in library, website use, library assistance, and attendance at a bibliographic instruction session (C1, C2, C3, and C5, respectively) were divided into categories for lower contact (one point per variable) and higher contact (two points per variable). In that way, each would have equivalent weight in the aggregate score. Because we assumed that an individual appointment with a librarian would have higher impact than other contact methods, the appointment variable (C4) received greater weight, with three points added in cases where an appointment had been made. By adding the results of these five transformed variables, an aggregate score between 5 and 11 was obtained as an estimate of overall library contact.8
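The recoding in Table 5 amounts to a simple lookup. The sketch below is ours, written in Python rather than the SPSS used for the actual analysis, and the function name and argument layout are invented for illustration.

```python
# Sketch of the Table 5 recoding. Inputs are the integer survey codes from
# Table 4 (C1-C3), a yes/no appointment flag (C4), and a session count (C5).
def aggregate_contact(c1_time, c2_website, c3_assistance, c4_appointment, c5_sessions):
    """Return the weighted aggregate library contact score (variable C6, range 5-11)."""
    score = 0
    score += 2 if c1_time >= 4 else 1        # C1: once a week or more = higher contact
    score += 2 if c2_website >= 4 else 1     # C2: once a week or more = higher contact
    score += 2 if c3_assistance >= 3 else 1  # C3: 3-5 times or more = higher contact
    score += 3 if c4_appointment else 1      # C4: appointment weighted more heavily
    score += 2 if c5_sessions >= 2 else 1    # C5: two or more sessions = higher contact
    return score

print(aggregate_contact(4, 3, 2, False, 1))  # -> 6
```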

First Complication: Expanding the Sample

Analysis of the initial sample suggested positive correlations between IL and library contact, but these correlations were not statistically significant (p > .05). We decided to extend the study and seek a larger, more representative sample of courses and students. Because it would be difficult to reach a large sample on our small campus, we elected to retain the initial sample, add a second sample using identical methods, and analyze the combined results.


Table 2. Rubric variables with medians from two rounds of scoring

Variable | Rubric item* | Resolved median in first round (N = 77), with interquartile range (IQR)† | Resolved median in second round (N = 77), with IQR
R1: Multiple viewpoints | 1a. Understands the multiple viewpoints relevant to need even if the assignment does not state this requirement. | 4.0 (IQR = 1.25)‡ | 3.0 (IQR = 1)
R2: Source types | 1b. Uses multiple source types (i.e., journal article, book, newspaper, etc.) | 4.5 (IQR = 1) | 4.0 (IQR = 1.5)
R3: Extent of sources | 1c. Extent of source materials is appropriate to assignment. | 5.0 (IQR = 1.5) | 4.0 (IQR = 1.5)
R4: Assessment of sources | 4. Assesses credibility and applicability of information resources. | 3.0 (IQR = 1.25) | 2.5 (IQR = 1)
R5: Context | 5a. Demonstrates understanding of the importance of context and maintaining the contextual meaning of sources. | 4.0 (IQR = 1.5) | 3.0 (IQR = 1)
R6: Integration | 5b. Successfully integrates own knowledge with the knowledge of others. | 3.5 (IQR = 1.5) | 3.0 (IQR = 1.25)
R7: Citation format | 5c. Citing/documenting sources. FORMAT ONLY. | 4.0 (IQR = 1.75) | 3.5 (IQR = 1.5)
R8: Documentation | 6. Recognizes plagiarism and need for documentation. | 4.5 (IQR = 1.75) | 3.5 (IQR = 4.5)
R9: Aggregate rubric score§ | | 31 (IQR = 8) | 28 (IQR = 10)

* Rubric item numbers follow the numbering of standards and sample outcomes in the Association of College and Research Libraries Information Literacy Competency Standards for Higher Education (2000).
† Interquartile range (IQR) refers to the range of the middle 50 percent between the lower quartile and the upper quartile of the sample.
‡ Variables R1–R8 are measured on a six-point ordinal rubric scale from 1 (“emerging”) to 6 (“integrating”).
§ Variable R9 represents the sum of variables R1–R8, yielding a score between 8 and 48.


Table 3. Demographic variables

Variable | Unit or scale of measurement | Result for our sample (N = 77)
D1: Gender | Nominal: male or female | 57% male, 43% female
D2: Transfer status (whether the student started at Washington State University (WSU) Vancouver as a first-year college student or as a transfer student) | Nominal: started as first-year student at WSU Vancouver or transferred with more than 30 credits from another institution | 16% started as first-year students; 86% started as transfer students
D3: Age | Ratio (number of years); divided into two age groups for analysis (under 25 or 25 and over) | Mean age is 25 (standard deviation = 8.17);* 68% under 25, 32% 25 and over
D4: Is this your first semester at WSU Vancouver? | Nominal: yes or no | 51% yes, 49% no
D5: How many semesters have you taken classes at WSU Vancouver? | Ratio: number of semesters | Mean number of semesters is 2.3 (standard deviation = 1.61)
D6: What department is your academic major in? | Nominal: choice of 22 departments | Ranked by frequency: History (16); Business (10); Social Sciences (8); Human Development (7); English (5); Biology, Computer Science, Psychology (4 each); Engineering, Environmental Science (3 each); Creative Media and Digital Culture, Education, Public Affairs, Mathematics, Neuroscience (2 each); Anthropology, Political Science, Sociology (1 each)

* Standard deviation is a measure of how tightly the numbers cluster around the mean.


Table 4. Library contact (C) variables

Variable | Survey question | Unit or scale of measurement | Result for our sample (N = 77)
C1: Time in library | How much time do you estimate you spend at the library? | 5-point ordinal scale: Never (1), Once or twice a semester (2), Once a month (3), Once a week (4), At least once a day (5) | Median = 4 (interquartile range [IQR] = 2)*
C2: Website use | How often do you use the Washington State University (WSU) Vancouver Library website to do research? | 5-point ordinal scale: Never (1), Once or twice a semester (2), Once a month (3), Once a week (4), At least once a day (5) | Median = 3 (IQR = 2)
C3: Library assistance | How many times have you asked for library assistance (talking to someone at the reference desk, e-mailing the library, etc.) while doing research? | 4-point ordinal scale: Never (1), Once or twice (2), 3–5 times (3), More than 5 times (4) | Median = 2 (IQR = 2)
C4: Appointment with librarian | Have you ever made an appointment to meet with a librarian one-on-one at WSU Vancouver Library? | Nominal: no/yes | 96% no, 4% yes
C5: Number of bibliographic instruction sessions attended | N/A (data from enrollment records) | Ratio (number of bibliographic instruction sessions attended ranged from 0 to 4) | Ranked by frequency: 1 (42%), 0 (36%), 2 (17%), 3 (4%), 4 (1%). Mean is 0.92 (standard deviation = 0.9)†
C6: Aggregate library contact score | N/A | Ordinal: weighted sum of values as shown in Table 5. Minimum possible score is 5; maximum is 11 | Median = 7 (IQR = 1)
C7: Communication method | If you ask for library assistance while doing research, your method of communication is (check all that apply): | Inclusive nominal scale with four categories (in person, phone, e-mail, or instant messaging [IM]), plus an open-ended “Other” category | Ranked by frequency: in person (46), no response (21), e-mail (14), phone (4), IM (4)

* Interquartile range (IQR) refers to the range of the middle 50 percent between the lower quartile and the upper quartile of the sample.
† Standard deviation is a measure of how tightly the numbers cluster around the mean.


To better approximate the makeup of the campus, we looked for courses that might include a greater proportion of transfer students. A second sample was gathered in fall 2014, consisting of 31 papers from three course sections: a 300-level Human Development course, a 300-level History course (required for transfer students in all majors), and a 400-level History course. (A fourth course section was initially included, but the instructor withdrew after deciding that no research assignment would be required.) Participants in both samples completed the same demographic survey, and the same rubric was used to score all papers. A new group of raters was convened to score the new papers, with the intention of combining the data from both groups into a total sample of 77 papers.

Further Complication: Re-Norming and Rescoring

At this point, human error complicated our plans. Due to changes in staffing, one person left and another person joined the rater group.

Table 5. Transformation of library contact (C) variables to calculate aggregate contact

Variable | Lower contact: definition of range (point value) | Higher contact: definition of range (point value) | Cumulative possible score for variable C6
C1: Time in library | Never, Once or twice a semester, or Once a month (1) | Once a week or At least once a day (2) | 2
C2: Website use | Never, Once or twice a semester, or Once a month (1) | Once a week or At least once a day (2) | 4
C3: Library assistance | Never, or Once or twice (1) | 3–5 times or More than 5 times (2) | 6
C4: Appointment with librarian | No (1) | Yes (3) | 9
C5: Number of bibliographic instruction sessions attended | Zero or one session (1) | Two, three, or four sessions (2) | 11


Perhaps because this seemed like a minor change (the new rater was also a member of the research team), we neglected to perform a second norming session before scoring the papers. Considerable time passed before we realized the significance of this oversight. It came to light only when we reviewed our preliminary results and noted a high level of disagreement between raters on a rubric facet that should have been among the easiest to agree on: variable R7, which dealt strictly with the formatting of citations and references. Once we realized that our norming had been inconsistent, we attempted to correct the problem by convening a new group of raters, holding a new norming session, and then repeating the scoring process for all 77 papers. Since three of the five raters had been involved in the previous rounds of rating, the papers were systematically assigned to avoid repetition for those three raters. That is, papers were distributed randomly but with the constraint that no one would rate a paper they had already scored in the previous sessions. We planned to base our statistical analysis on the new score results but retained the earlier results for comparison.

Final Complication: Reliability and Rater Disagreement

Surprisingly, the new scores resulted in lower rater agreement than the previous set of scores, even though the initial set was produced by two different groups of raters without consistent norming. In the new round of scoring, 25 papers (32 percent) received total scores that varied by more than eight points, necessitating a third rating by another reader to resolve the disagreement. This is above the maximum acceptable level of rater disagreement, 30 percent, recommended by Steven Stemler.9 In comparison, the initial round of scores resulted in only about half as many needing a third rating; that is, 13 papers (17 percent) received total scores from two raters that differed by more than eight points. While this level of disagreement might be acceptable under other circumstances, we cannot fully credit these scores because they were obtained without a consistent norming process. Thus, neither set of scores can be considered a reliable measure of IL. When considering these scores in relation to such factors as library contact and demographics, correlations cannot be convincing even when they are statistically significant. Under these conditions, the results of our statistical analysis (summarized in Appendix B) are suggestive at best.

It is unclear why we saw such high levels of disagreement even after norming. Comparison between pairs of raters did not indicate that any rater was at variance with the group, nor have we identified any major ambiguities in the rubric. Our best guess is that several raters, perhaps all, had difficulty in applying the rubric consistently to a wide variety of subject areas and course levels. Perhaps this problem could have been remedied through more extensive norming discussions, a more diverse sample of papers for norming, or use of a more sensitive reliability measure such as intraclass correlation, which describes the extent to which two or more raters agree when scoring the same set of things (discussed under Question 2, “How reliable and consistent were the rubric scores given by different raters?”).


Statistical Analysis Process

Because of these setbacks, the following account of our data analysis will focus on the process of analysis—including secondary research and decision-making prior to running the numbers. We will address the results of our analysis only in passing and only when they pertain to an understanding of the process. The primary value of our study lies in the methodological questions it raises, rather than in the results, many of which may not be valid for our sample population.

Although not all the goals of the investigation were clearly articulated when we initially planned our study, five research questions had emerged by the time we began analyzing the data. Two of the five dealt with the validity of the method itself, while the other three sought significant differences and evidence of impact in the resulting IL scores. We will address these questions one at a time to provide a sufficiently detailed context for the recommendations in the “Conclusion” section and the results presented in Appendix B. Table 6 lists the research questions and summarizes our statistical methods.10 Repeated scoring resulted in two sets of scores that could not be combined into a single meaningful data set. For each question, therefore, we applied statistical tests to both sets of scores and compared the results.

Resolving Discrepant Scores

To minimize the influence of individual raters, each student paper was read by two randomly assigned raters, and the median score was used for analysis purposes, as described later. To resolve discrepant scores and arrive at an accurate score for analysis, we adapted James Penny and Robert Johnson’s recommendation of the “parity method,” which was found to provide greater validity than other resolution methods in a study of rubric scoring.11 We used the median between two scores if they differed by eight points or less (or one point per rubric facet) and the median between three scores if the difference was greater than eight points, necessitating resolution by a third rater. In the analyses described later, we therefore used the resolved scores, derived from the medians of either two or three raters’ scores, as our IL rubric variables (R1–R9).
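As a concrete illustration of this parity rule, the minimal sketch below (ours, with hypothetical names; scores are on the 8–48 aggregate scale) resolves a pair of ratings and calls in a third only when the pair differs by more than eight points.

```python
import statistics

def resolve_scores(score_a, score_b, get_third_rating):
    """Median of two ratings, or of three when the pair differs by more than 8 points."""
    if abs(score_a - score_b) > 8:
        score_c = get_third_rating()  # blind third rating, per the study's procedure
        return statistics.median([score_a, score_b, score_c])
    return statistics.median([score_a, score_b])  # median of two equals their midpoint

print(resolve_scores(30, 26, None))        # -> 28.0 (no third rater needed)
print(resolve_scores(36, 22, lambda: 31))  # -> 31 (median of three)
```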

Question 1: Is the rubric scoring internally consistent in a way that suggests that all its parts contribute to IL as a single unifying construct?

This question involves two types of relationships among the rubric facets (variables R1–R8): internal consistency and significant difference. Consistency would indicate a shared construct (that is, IL) underlying all eight variables, each of which would then contribute to a meaningful aggregate score (variable R9). Significant difference would indicate that each variable measured a separate, independent skill or ability. If there was a lack of internal consistency, it would make sense to discard the aggregate score, but the scores for individual IL skills could still be analyzed separately. A statistical test called Cronbach’s alpha was used to test internal consistency among resolved median scores for variables R1–R8. This test also identifies variables that could be removed to improve consistency.


To test for significant differences between each possible pair of rubric variables, we conducted Wilcoxon signed-rank tests of significance, a standard nonparametric test—that is, one in which the data are not assumed to fit a normal distribution—used in repeated-measures studies.
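For readers who want to reproduce this kind of check, the sketch below runs both Question 1 tests on invented stand-in data (77 papers by 8 facets). It is ours, not the study's SPSS workflow, and the scores are random.

```python
import numpy as np
from scipy.stats import wilcoxon
from itertools import combinations

# Invented stand-in for the resolved medians of R1-R8 (77 papers x 8 facets).
rng = np.random.default_rng(0)
scores = rng.integers(1, 7, size=(77, 8)).astype(float)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of the total).
k = scores.shape[1]
alpha = k / (k - 1) * (1 - scores.var(axis=0, ddof=1).sum()
                       / scores.sum(axis=1).var(ddof=1))
print(f"Cronbach's alpha = {alpha:.2f}")

# Wilcoxon signed-rank test for every pair of facets (paired, nonparametric).
for i, j in combinations(range(k), 2):
    stat, p = wilcoxon(scores[:, i], scores[:, j])
    print(f"R{i + 1} vs R{j + 1}: W = {stat:.0f}, p = {p:.3f}")
```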

Question 2: How reliable and consistent were the rubric scores given by different raters?

Question 2 proved the most complex and required us to seek models outside the library literature. It pertains to two related issues, inter-rater reliability and methods of resolving rater disagreement.

Table 6. Research questions and statistical methods

Research question | Statistical methods
Question 1: Is the rubric scoring internally consistent in a way that suggests that all its parts contribute to information literacy as a single unifying construct? | Cronbach’s alpha and Wilcoxon signed-rank test*
Question 2: How reliable and consistent were the rubric scores given by different raters? | Intraclass correlation coefficient (ICC);† modified percent agreement
Question 3: Do correlations exist between information literacy and contact with the library? | Spearman’s rho and Kendall’s tau‡
Question 4: In what information literacy skills were the participants strongest and weakest? | Comparison of resolved medians
Question 5: Did any of the demographic factors have a significant relationship to information literacy? | Mann-Whitney U and Kruskal-Wallis H tests§ for differences between groups; Spearman’s rho and Kendall’s tau for correlations

* Cronbach’s alpha is a statistical test used to measure internal consistency. The Wilcoxon signed-rank test determines significant differences between each possible pair of rubric variables.
† Intraclass correlation describes the extent to which two or more raters agree when rating the same set of things.
‡ Spearman’s rho and Kendall’s tau indicate the closeness of the relationship between two variables.
§ The Mann-Whitney U test and Kruskal-Wallis H test are used to assess differences between groups.


In planning our study, we initially assumed that we could rely on a single method for both purposes and chose the “modified” or “broadened” percent agreement measure described by Stemler, which involves calculating the percentage of agreement between two raters.12 The modified approach measures agreement within an acceptable range (for example, within one point) rather than exact agreement. This measure provided a convenient means of resolving rater disagreement and organizing our norming process, but we eventually realized it was not appropriate as a measure of inter-rater reliability.

Initially, we measured agreement as follows: During norming, raters attempted to reach agreement within one point for each rubric facet, or within eight points for the total score. This process allowed us to identify rubric facets with especially low agreement, which became the focus of the raters’ norming discussion. Following norming, and once each paper in our sample had been scored by two raters, this measure was also used to identify instances of disagreement that required a third rater to provide an accurate median score.
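The “modified” measure reduces to counting score pairs that fall within the tolerance. A minimal sketch, with invented scores on a single rubric facet:

```python
import numpy as np

def modified_percent_agreement(rater_a, rater_b, tolerance=1):
    """Share of score pairs that agree within the tolerance (Stemler's broadened agreement)."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    return float(np.mean(np.abs(a - b) <= tolerance))

facet_a = [4, 3, 5, 2, 6, 4]  # rater A, one facet, six papers (invented)
facet_b = [4, 5, 4, 2, 4, 3]  # rater B, same facet and papers
print(modified_percent_agreement(facet_a, facet_b))  # 4 of 6 pairs within 1 point -> about 0.67
```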

Although this was a convenient, straightforward way to identify and resolve rater disagreement, the percent agreement method is not considered a meaningful measure of inter-rater reliability because it has the potential to overestimate the level of consensus.13 After our second round of rating was completed, we came to understand this problem and reviewed the literature on inter-rater reliability in search of an appropriate method.

Stemler identifies three types of inter-rater reliability estimates: the consensus, consistency, and measurement approaches.14 A measurement estimate is most appropriate in this scenario because the rubric scale is ordinal and because each paper was rated by more than one rater, with the intention of using the central tendency (median) as a final score for analysis purposes. Various methods are used for this type of estimate, depending on the research design. Following Kevin Hallgren’s discussion, we determined that the most appropriate method would be to calculate intraclass correlation (a measure of how much two or more raters agree when rating the same set of things), since in our study all subjects were rated by multiple, randomly assigned raters, and we were concerned with the magnitude of disagreement rather than with a simple test of exact agreement.15 (In Stemler’s terminology, we were interested in measurement rather than consensus or consistency.) Specifically, we needed a one-way mixed, absolute-agreement, average-measures intraclass correlation, for the following reasons: randomly assigned pairs rated each paper (indicating a one-way model); we were interested in absolute agreement rather than rank consistency because we intended to use the mean between two raters’ scores for each paper; all papers were rated by at least two raters (indicating the average-measures unit); and we did not seek to generalize to a larger population of raters (mixed effect).16 We applied this measure to ratings by randomly assigned pairs of raters across the entire sample. To evaluate our score resolution method, we also calculated intraclass correlation for the subsets of cases that required a third reader for resolution and for those that did not.


Domenic Cicchetti provides a widely cited standard for evaluating levels of reliability, whereby reliability is described as poor for intraclass correlation values below 0.40, fair for values between 0.40 and 0.59, good for values between 0.60 and 0.74, and excellent for values between 0.75 and 1.00.17
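One common formulation of a one-way, average-measures intraclass correlation is Shrout and Fleiss’s ICC(1,k), computed from a one-way analysis of variance. The sketch below is our illustration of that formula on invented rater pairs, not the study’s actual computation.

```python
import numpy as np

def icc_1k(ratings):
    """One-way, average-measures intraclass correlation: ICC(1,k) = (MSB - MSW) / MSB."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    row_means = ratings.mean(axis=1)
    msb = k * ((row_means - ratings.mean()) ** 2).sum() / (n - 1)      # between-papers mean square
    msw = ((ratings - row_means[:, None]) ** 2).sum() / (n * (k - 1))  # within-paper mean square
    return (msb - msw) / msb

pairs = [[4, 5], [2, 2], [6, 5], [3, 4], [1, 2], [5, 6]]  # invented rater pairs (papers x raters)
print(f"ICC(1,k) = {icc_1k(pairs):.2f}")  # judge against Cicchetti's bands, e.g., 0.75 or above is excellent
```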

As a measure of inter-rater reliability, intraclass correlation is more meaningful and more sensitive than agreement rate. While agreement rate only detects cases of concurrence within a specific predetermined range, intraclass correlation measures correlation between raters for a given variable across the entire set of scores. In other words, intraclass correlation can detect patterns of consistency between raters even if one rater tends to score higher than another. For this reason, higher rates of agreement do not always coincide with higher intraclass correlation scores.

We became aware of the intraclass correlation approach late in the project, after the second round of norming and scoring. Had we understood it earlier, we could have applied an appropriate inter-rater reliability test to a sample of papers, each evaluated by all raters, in the norming phase and perhaps avoided the need to test inter-rater reliability for our entire sample afterward. This would have saved time and allowed us to move into the rating process with greater confidence in the reliability of our ratings.

Question 3: Do correlations exist between IL and contact with the library?

For Question 3, we tested for significant correlations between rubric scores and library contact (variables R1–R9 and C1–C6 in Tables 2 and 4) using Spearman’s rho, a statistic that indicates the closeness of the relationship between two variables. Spearman’s rho is a standard nonparametric test appropriate for ordinal-level variables, in which the possible values can be ranked higher or lower than one another. Because the narrow range of rubric scores resulted in many tied ranks, we also used another nonparametric correlation test, Kendall’s tau. As this was the central question for our study, we attempted to be as thorough as possible.

We tested each form of library contact, as well as the aggregate contact score, against each rubric facet score and against the aggregate score representing an overall IL construct. One of the contact variables, the method of communication used to contact the library (C7), was not analyzed because it involved overlapping, nonexclusive groups and did not impact the degree of contact between the student and the library.
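The correlation pass itself is routine; the sketch below runs both tests on invented aggregate scores (R9 against C6) and is illustrative only.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(1)
r9 = rng.integers(8, 49, size=77)  # invented aggregate rubric scores (8-48)
c6 = rng.integers(5, 12, size=77)  # invented aggregate contact scores (5-11)

rho, p_rho = spearmanr(r9, c6)
tau, p_tau = kendalltau(r9, c6)    # tau copes better with the many tied ranks
print(f"Spearman's rho = {rho:.2f} (p = {p_rho:.3f}); Kendall's tau = {tau:.2f} (p = {p_tau:.3f})")
```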

Question 4: In what IL skills were the participants strongest and weakest?

Having established the significance of differences among rubric variables using Wilcoxon tests, as described under Question 1 (“Is the rubric scoring internally consistent in a way that suggests that all its parts contribute to information literacy as a single unifying construct?”), we could answer Question 4 by comparing and ranking the resolved median scores for each rubric facet (R1–R8 in Table 2).
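Using the second-round resolved medians reported in Table 2, the ranking is a one-line sort:

```python
# Second-round resolved medians from Table 2 (variables R1-R8).
medians = {"R1": 3.0, "R2": 4.0, "R3": 4.0, "R4": 2.5,
           "R5": 3.0, "R6": 3.0, "R7": 3.5, "R8": 3.5}
for facet, median in sorted(medians.items(), key=lambda kv: kv[1], reverse=True):
    print(facet, median)  # strongest facets first; R4 (assessing sources) ranks last
```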

Question 5: Did any of the demographic factors have a significant relationship to IL?


We conducted Mann-Whitney U tests to assess the differences in rubric scores (variables R1–R9) between groups based on demographic variables (D1–D4). This is a standard nonparametric test used to compare independent groups. Spearman’s rho and Kendall’s tau were used to test for significant correlations between rubric scores and either age (variable D3) or number of semesters at WSU (variable D5). For variable D6, academic major, a Kruskal-Wallis H test was used to assess differences among 18 majors. Kruskal-Wallis is a nonparametric test like Mann-Whitney but appropriate for comparing more than two groups.
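A sketch of the two group comparisons, on invented data (the real scores and demographics are not reproduced here):

```python
import numpy as np
from scipy.stats import mannwhitneyu, kruskal

rng = np.random.default_rng(2)
r9 = rng.integers(8, 49, size=77)                 # invented aggregate rubric scores
gender = rng.choice(["male", "female"], size=77)  # D1, nominal

u, p = mannwhitneyu(r9[gender == "male"], r9[gender == "female"])
print(f"Mann-Whitney U = {u:.0f}, p = {p:.3f}")

# Kruskal-Wallis H generalizes the comparison to more than two groups (e.g., D6, major).
major = rng.choice(["History", "Business", "English"], size=77)
h, p = kruskal(*[r9[major == m] for m in np.unique(major)])
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.3f}")
```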

Conclusion: Lessons Learned

Our statistical results (summarized in Appendix B) remain problematic. However, the long and frustrating process yielded valuable lessons of its own. Authentic IL assessment is a new field of study, and we believe that by providing an honest account of methodological issues in a failed assessment, we make a valuable contribution to the existing literature. No universal method of assessment fits the needs of every institution in every situation, so practitioners will benefit from sharing their mistakes as well as their successes and considering points of ambiguity or uncertainty alongside strong statistical findings. In addition to describing the pitfalls of our assessment experience, we would like to offer five general recommendations that may be of use in future assessment projects, in IL and other areas. Some may seem obvious, but our experience shows how complicated they can become. When undertaking authentic assessment, researchers should plan for norming, find models for data analysis, take a large sample, dedicate plenty of time to the project, and, above all, stay flexible.

Plan for Norming

As many researchers note, norming is a crucial part of the process for any assessment that involves a rubric or similar tool requiring individual raters to assess aspects of student work.18 In a norming session, all raters should be present, discuss the instrument and criteria together, and separately score the same sample materials. The goal is to reach a predetermined acceptable level of agreement on scores for each sample, even if this requires multiple rounds of discussion and scoring. Norming serves four important purposes: to establish usable consensus standards for the application of the rubric; to test the rubric and identify areas of vagueness or ambiguity; to train raters through discussion and trial and error; and to evaluate the group’s inter-rater reliability (in one or more of the distinct senses identified by Stemler).19


rubric and identify areas of vagueness or ambiguity; to train raters through discussion and trial and error; and to evaluate the group’s inter-rater reliability (in one or more of the distinct senses identified by Stemler).19 If we had performed norming in a consistent manner in the first round of scoring, we could have saved time and effort by avoiding the second round. If we had spent more time on norming and held raters to a higher standard of agreement, we might have reached acceptable levels of rater agreement in our final set of scores. In addition, if we had incorporated an appropriate measure of inter-rater reliability, we might have avoided the difficulty of attempting to assess inter-rater reliability from a final score group. We could have obtained a robust estimate of reliability using fully crossed samples from norming and from a segment of the research sample, rather than calculating from randomly assigned pairs for the entire sample.

Our norming process, as planned, was brief, used only two sample papers, and was based on the goal of rater agreement within eight points. Although we reached this goal, we had poor rater agreement and inter-rater reliability afterward. Instead, we should have planned norming with a larger selection of sample papers and allowed time for discussion and rescoring to resolve disagreements. In the process, we could have identified the most problematic rubric facets (those with the highest rate of disagreement) and spent additional time discussing them to arrive at a consistent approach. If necessary, once an approach was agreed upon, we could have revised the rubric for clarity or added examples as a reminder to raters. After achieving agreement within eight points on several papers, we should have calculated two-way mixed, absolute-agreement, average-measures intraclass correlation to ensure not only approximate agreement but also inter-rater reliability regarding measurement. A two-way approach would be used in this case because, in norming, we used a fully crossed design with all raters scoring all papers. A one-way approach was used for the actual sample because the papers were rated by randomly assigned pairs of raters.20
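A minimal sketch of this recommended check, using the Python library pingouin in place of SPSS (the package we actually used), might look as follows; the papers, raters, and scores are hypothetical.

```python
# A minimal sketch of the recommended reliability check, using the Python
# library pingouin in place of SPSS (our actual tool). The fully crossed
# norming sample below is hypothetical: three raters each score ten papers.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "paper": np.repeat(np.arange(10), 3),        # 10 sample papers
    "rater": np.tile(["A", "B", "C"], 10),       # 3 raters, fully crossed
    "score": rng.integers(8, 41, size=30),       # hypothetical aggregate scores
})

icc = pg.intraclass_corr(data=df, targets="paper", raters="rater", ratings="score")

# The ICC2k row is the average-measures, absolute-agreement coefficient;
# McGraw and Wong show it is computed identically under the two-way random
# and two-way mixed models, so it serves for the check described above.
print(icc.loc[icc["Type"] == "ICC2k", ["Type", "ICC", "CI95%"]])
```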

Plan for Data Analysis

It is imperative to review the literature when planning an authentic assessment, but even the most relevant studies may not translate directly into a sufficiently detailed model for new research. Two of us had conducted an earlier study using rubrics for authentic assessment of students’ research portfolios,21 and all three had participated in reviewing and discussing the literature while planning the present study, but we still ran into problems at the analysis stage. Our initial literature review concentrated on methods of approaching and gathering data on IL and library impact, rather than on methods of statistical analysis. Thus, our plans for analysis remained somewhat vague until after we had collected our first sample of papers and conducted our first round of scoring.

Perhaps this previous experience gave us an unwarranted confidence at the beginning of the project, so that we neglected to clearly connect our research questions to specific testable hypotheses. While it is impossible to anticipate every contingency, we could have saved a great deal of trouble by more carefully considering the statistical analyses in the library literature and by perusing statistical literature from other disciplines to identify specific methods appropriate to our questions before collecting our data. Because we had limited knowledge of statistical analysis and of SPSS, a software package used to perform such analysis, we had to arrive at a process through extensive trial and error. For example, we spent a long time considering Fleiss’s kappa and Cohen’s kappa as measures of inter-rater reliability before we found a clear explanation in the literature indicating that we should instead use a type of intraclass correlation since we were comparing ordinal rather than nominal variables.22 There were many false starts and hours of tedious recoding in SPSS.

We cannot overstate the importance of writing clearly defined research questions and connecting them to statistically measurable hypotheses; of including or consulting a researcher with statistical expertise; and of looking beyond the library literature to identify appropriate methodologies. Our search yielded several studies that match our general approach (rubric assessment of IL using authentic artifacts of coursework), such as those by Mark Emmons and Wanda Martin, Lorrie Knight, and Sue Samson.23 However, none provided an account of methodology that was both sufficiently detailed and appropriate for our purposes. A study by Megan Oakleaf explicitly focused on methodology, including norming practices and measures of inter-rater reliability. Its methods were not applicable to our study, however, because it treated its three-level rubric classifications as nominal rather than ordinal variables (that is, as unordered categories rather than values that can be logically ranked) and because it used a consensus approach to inter-rater reliability.24 The former precludes the use of an aggregate score to measure IL as a generalized construct, and the latter precludes the use of central tendency or adjudication to resolve rater disputes.25

In our initial literature review, we did not look at methodological literature outside of library and information science journals. If we had, we might have identified useful starting points such as Stemler and Hallgren on computing inter-rater reliability, and Penny and Johnson on methods of resolving rater disagreement.26 Had we discovered these early in the process, we might have attempted a rigorous norming process using an inter-rater reliability test for measurement such as intraclass correlation and saved ourselves the trouble of testing inter-rater reliability on our final set of scores. Although similar calculations are commonplace in the social sciences, the literature on library assessment rarely addresses statistical methods in such detail. Had we begun with preliminary statistical research, alongside our initial research on authentic assessment of IL, we might have started data collection with a clear, detailed plan to address each of our questions and thus have saved significant time.

Take a Large and Representative Sample

Our initial sample from three courses in fall 2013 did not yield statistically significant results. It was small and did not adequately represent the range of undergraduate research courses on our campus. Due to delays in data entry and analysis, we could not take a new sample until fall 2014, when we collected additional data from three other courses in an attempt not only to increase our sample size but also to diversify our sample in terms of course level, assignment requirements, and subject area. After combining samples, we could detect statistical significance, but in the process, we created two new problems: due to staffing changes, we needed a slightly different group of raters (and neglected to redo our norming), and we extended the duration of the project well beyond its intended scope. Had we gathered a larger sample in the first round, we might have repeated the project (or a more streamlined version of it) with a new sample a year or two later and ended up with meaningful longitudinal data (repeated observations taken from a large sample over time, useful for measuring changes).

There are three main reasons why it was difficult to achieve a larger sample. First, WSU Vancouver is a small campus, so we had a limited pool of students and courses from which to pull. In an earlier assessment project, library staff benefited from a systematic campus-wide capstone portfolio collection, a compilation of student work that offered a large, ready-made pool of data.27 For the present study, that option was not available and we needed to solicit participation individually from faculty and from students in their courses. This led to the second problem: since we needed to solicit faculty participation individually, we had to guess which courses would most likely include a substantial research assignment. This information was not readily available for all courses, so we relied on librarians’ personal knowledge of courses and relationships with faculty in their subject liaison areas. The third problem was that, for the courses we chose, not all students agreed to participate in the study. Initially, participation was higher than we expected for most courses, but in both semesters, we lost participants along the way for reasons that included changing assignment requirements, late or missing assignments, and students dropping courses.

To counteract these difficulties in the future, we can begin planning earlier and reach out to faculty in a more systematic manner. It may help to collect information on research assignments over an academic year before data collection and to contact all faculty in relevant disciplines, either through e-mail or in person at faculty meetings, to draw from the widest possible pool of courses with some research component. Given a large enough pool of participants, a truly random sample might be achieved. It might also help to offer some type of incentive to students for participation in the study.

Dedicate Plenty of Time

Authentic assessment is a time-consuming activity, especially when methods and processes are being established for the first time. It requires the participation of multiple team members over a period that may range from several weeks to several years. In our case, a study that we intended to complete in six months took more than four years, including the time needed for additional data collection, repeated scoring, and repeated data analysis described here. Although we realized that it would take time to develop the rubric, to work with faculty and with the Office of Institutional Research, and to anonymize student papers, demographic surveys, and enrollment data, we did not consider the time it would take to determine appropriate statistical methods and conduct statistical analysis, especially for a lead researcher who was unfamiliar with SPSS and had limited experience with this type of research. And, of course, we did not anticipate the need to collect additional data and eventually to re-norm and rescore with a new group of raters.

Critical reflection and discussion are necessary at each stage of an assessment project, not only during data collection and analysis but also before and after them. Reflection and discussion require time and planning but can also save time in the long run. In our case, a discussion of rubrics and standards led us to focus on those IL skills that could truly be assessed through student writing. Later, in our analysis, we had to examine and think about our statistical results, which led us to recognize and attempt to correct a major flaw in our scoring process, inconsistent norming. If we had built these critical exchanges into our process in a more organized way, we might have avoided the missteps that ultimately invalidated much of our data. Nevertheless, we culled some valuable information from our failed assessment by taking time to reflect on what worked, what did not, and what lessons might prove useful to others in the field.

Each of these stages extended the duration of the project, but the problem was exacerbated by the somewhat sporadic way in which we could devote time to it. Like many academic librarians, we needed to balance a wide variety of duties throughout the year and found it difficult to set aside large blocks of time for research. While there was perhaps no way to truly anticipate the time needed, it would have helped for each researcher to schedule portions of uninterrupted work time for each phase of the project, while retaining flexibility, of course.

Stay Flexible

The final takeaway for this project may seem trivial, but it is important, especially for a long-term project like this. From planning to data collection to analysis, flexibility is vital when dealing with the varied, contingent, and unpredictable factors of authentic assessment. While the study failed in many ways, it was successful as a learning experience for the researchers and, we hope, as a case study for future researchers. These successes were largely the result of open-ended, iterative planning and of flexibility in adapting to circumstances that did not always fit the plan.

For example, rather than trying to force our assessment of student work to address each of the ACRL Standards, we focused our rubric on those areas of the document that could reasonably be assessed by reading students’ writing without observing their research process firsthand. We carefully revised our rubric to make it specific and clear while still applicable to a wide variety of student work at all academic levels, potentially including annotated bibliographies as well as expository or argumentative essays.

Other examples of flexibility include our willingness to research, test, and adapt statistical methods appropriate to our research questions and sample (in some cases, methods not specifically described in the existing IL literature) and to gather additional data and redo norming, scoring, and analysis when necessary. Finally, when it became clear that our results were compromised, we chose to reframe our thinking about the project to learn from our mistakes, clarify and record the methodological issues involved, and identify key strategies that we and other researchers could use to avoid such problems in the future.

Sam Lohmann is a reference librarian at Washington State University Vancouver; he may be reached by e-mail at: [email protected].

Karen R. Diller is the library director at Washington State University Vancouver; she may be reached by e-mail at: [email protected].

Sue F. Phelps is the health sciences and outreach librarian at Washington State University; she may be reached by e-mail at: [email protected].

Appendix A

Survey of Participants

The following demographic and behavior survey was distributed to students. Identification numbers were arbitrarily assigned and used to identify participants without using their names or student ID numbers. For Question 1, regrettably, only two options are given for gender. In retrospect, we failed to make the survey inclusive and welcoming to students of all gender identities. Were we to redo this study in the future, we would include an open-ended third option for this question.

Demographic Survey

Please answer the following questions.

1. I am: [Male; Female]

2. When you first started at Washington State University (WSU) Vancouver, you: [started as a first-year (freshman) student; transferred in with more than 30 credits from another institution]

3. Age?

4. Is this your first semester at WSU Vancouver? [Yes; No]
   a. If no, how many semesters (NOT counting the current semester) have you taken classes at WSU Vancouver?

5. What department is your academic major in? [Twenty-one options including Undeclared]

6. How much time do you estimate you spend at the library? [At least once a day; Once a week; Once a month; Once or twice a semester; Never]

7. How often do you use the WSU Vancouver library website to do research? [At least once a day; Once a week; Once a month; Once or twice a semester; Never]

8. How many times have you asked for library assistance (talking to someone at the reference desk, e-mailing the library, etc.) while doing research? [More than 5 times; 3–5 times; Once or twice; Never]

9. If you ask for library assistance while doing research, your main method of communication is: [In person (ask the reference desk in the library); Phone; E-mail; Instant messaging (IM); Other (please specify)]

10. Have you ever made an appointment to meet with a librarian one-on-one at the WSU Vancouver Library? [Yes; No]

Appendix B

Results of Statistical Analysis

The following is a summary of the results of our statistical analysis, reported in the order of our research questions (see Table 6). It is important to note that the low level of agreement and inter-rater reliability, as discussed under Question 2 (“How reliable and consistent were the rubric scores given by different raters?”), casts doubt on the validity of the remaining questions, which all to some degree assume the validity of the rubric scores (variables R1–R9).

In addition, the fact that we rescored all papers resulted in two sets of scores that could not meaningfully be combined into a single data set. For all five questions, therefore, we analyzed both sets of scores and compared the results. In reporting our results, we rely primarily on the second set of results because they were generated after a consistent norming process and have similar inter-rater reliability to the first set based on intraclass correlation. We have only noted values from the first set when there is an important difference from the second. To save space, only significant (p < .05) statistics are reported here, with highly significant (p < .01) statistics marked as such. Complete data tables are available from the authors upon request.

Question 1: Is the rubric scoring internally consistent in a way that suggests that all its parts contribute to IL as a single unifying construct?

This question addresses two separate aspects of the instrument: internal consistency and distinctness. Cronbach’s alpha showed good internal consistency between the resolved medians for the eight rubric facets (variables R1–R8), alpha = 0.88 (and alpha = 0.83 for the first round of scores). The removal of any variable would result in a lower alpha score, suggesting that all variables contributed to the shared IL construct and should be retained when calculating the aggregate IL variable (R9).
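As an illustration of this check, the sketch below computes Cronbach's alpha from its standard formula and then recomputes it with each facet dropped in turn; the score matrix is randomly generated, not our data.

```python
# A worked sketch of the internal-consistency check described above:
# Cronbach's alpha = (k / (k - 1)) * (1 - sum of facet variances / variance
# of the total score), plus the drop-one-facet comparison. The 77 x 8 score
# matrix is randomly generated, not our data.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: array of shape (n_papers, n_facets)."""
    k = items.shape[1]
    facet_vars = items.var(axis=0, ddof=1).sum()   # sum of per-facet variances
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the total score
    return (k / (k - 1)) * (1 - facet_vars / total_var)

rng = np.random.default_rng(2)
scores = rng.integers(1, 6, size=(77, 8)).astype(float)

print(f"alpha (all facets): {cronbach_alpha(scores):.2f}")
for j in range(scores.shape[1]):
    # If dropping a facet raises alpha, that facet may not belong to the construct.
    print(f"alpha without R{j + 1}: {cronbach_alpha(np.delete(scores, j, axis=1)):.2f}")
```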

Paired Wilcoxon signed-rank tests found significant differences between 22 of the 23 pairs of variables where the resolved medians were not equal, as well as between two of the five pairs of variables with equal resolved medians (see Table 7). Similar results were found in the first round of scoring, with 23 out of 28 pairs of variables showing a significant difference. This supports our assumption that the eight rubric variables measure distinct, independent qualities of student work.
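The following sketch illustrates the pairwise procedure with scipy's Wilcoxon signed-rank test. Note that SPSS, which we used, reports a Z statistic for this test, whereas scipy reports the W statistic by default; the scores here are hypothetical.

```python
# A sketch of the pairwise distinctness check described above, using scipy's
# Wilcoxon signed-rank test. SPSS (our tool) reports a Z statistic for this
# test; scipy reports the W statistic by default. Scores are hypothetical.
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
scores = rng.integers(1, 6, size=(77, 8)).astype(float)  # 77 papers x facets R1-R8

for i, j in combinations(range(8), 2):
    if np.all(scores[:, i] == scores[:, j]):
        continue  # the test is undefined when every paired difference is zero
    w, p = stats.wilcoxon(scores[:, i], scores[:, j])
    flag = "highly significant" if p < .01 else "significant" if p < .05 else "n.s."
    print(f"R{i + 1} vs R{j + 1}: W = {w:.1f}, p = {p:.3f} ({flag})")
```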

Question 2: How reliable and consistent were the rubric scores given by different raters?

We used one-way, absolute-agreement, average-measures intraclass correlation to estimate inter-rater reliability of measurement.28 We also calculated the rate of agreement. Table 8 summarizes the results for both measures and for the first (not consistently normed) and second (normed) sets of scores. As the table shows, the first round of scores generally showed higher agreement between raters, but there was no clear difference in inter-rater reliability as measured by intraclass correlation.

In the second round of scoring, intraclass correlation was in the fair range as defined by Domenic Cicchetti29 (intraclass correlation = 0.51) for the aggregate rubric score (variable R9). Intraclass correlations for individual rubric facets (variables R1–R8) ranged from –0.23 (poor) for R1 (multiple viewpoints) to 0.71 (good) for R2 (multiple source types). When we separated out the group of cases (n = 52) that had rater agreement within eight points for R9 and therefore did not require a third rater for resolution, there was, predictably, a marked improvement, with intraclass correlation = 0.86 (excellent) for R9 and correlations ranging from fair to excellent for individual facets. In the group of cases (n = 25) that required a third reading, again unsurprisingly, the correlations for total score and most rubric facets were lower than for the other group, with intraclass correlation = –0.49 (poor) for R9.

A simpler but less sensitive measure, rate of agreement within one point (or within eight points for the aggregate score), was also used, since it was the basis for our method of resolving disparities in score. The rate of agreement between raters on the overall score (within a predetermined parameter of eight points) was 68 percent, slightly below our agreed-upon goal of at least 70 percent agreement. For the individual rubric facets, agreement within one point was lowest for R3 (extent of sources, 60 percent) and highest for R2 (multiple source types, 79 percent). As noted in the section on “Statistical Analysis Process,” rate of agreement and intraclass correlation measure distinct aspects of inter-rater reliability and agreement, and higher rates of agreement do not always coincide with higher intraclass correlation scores.
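Because rate of agreement within a tolerance is simple to compute, a few lines suffice; the sketch below, with invented scores, shows the calculation as applied to the aggregate score (agreement within eight points).

```python
# A sketch of the simpler measure described above: the rate of agreement
# between two raters within a tolerance (one point for a facet, eight points
# for the aggregate score). The paired scores are invented for illustration.
import numpy as np

def agreement_rate(a: np.ndarray, b: np.ndarray, tolerance: float) -> float:
    """Fraction of papers where the raters' scores differ by <= tolerance."""
    return float(np.mean(np.abs(a - b) <= tolerance))

rng = np.random.default_rng(4)
rater1 = rng.integers(8, 41, size=77).astype(float)          # aggregate (R9) scores
rater2 = np.clip(rater1 + rng.normal(0, 6, size=77), 8, 40)  # a noisy second rater

print(f"Agreement within eight points: {agreement_rate(rater1, rater2, 8):.0%}")
```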

Question 3: Do correlations exist between IL and contact with the library?

To test for relationships between rubric scores and library contact, nonparametric correlation tests were run between the individual and aggregate contact variables (C1–C6) and the individual and aggregate rubric variables (R1–R9). The results of these tests are summarized in Table 9.


Table 7. Difference in scores between pairs of rubric variables*

        R1       R2       R3       R4       R5       R6       R7       R8
R1       —     5.41†    5.44†    5.38†    4.04†    0.88     1.88     2.13‡
R2     5.41†     —      0.26     6.66†    6.14†    5.59†    3.44†    3.42†
R3     5.44†   0.26       —      6.89†    6.28†    5.73†    3.46†    3.69†
R4     5.38†   6.66†    6.89†      —      2.95†    4.28†    5.29†    5.20†
R5     4.04†   6.14†    6.28†    2.95†      —      2.70†    4.35†    4.48†
R6     0.88    5.59†    5.73†    4.28†    2.70†      —      2.69†    2.83†
R7     1.88    3.44†    3.46†    5.29†    4.35†    2.69†      —      0.19
R8     2.13‡   3.42†    3.69†    5.20†    4.48†    2.83†    0.19       —

All cell values are Z statistics.
*This table shows the results of paired Wilcoxon signed-rank tests, which determine significant differences between each pair of rubric variables in the second set of resolved median scores.
†Score is statistically significant, p < .05.
‡Score is highly statistically significant, p < .01.


Both Spearman’s rho and Kendall’s tau found a low, positive, highly significant correlation between aggregate library contact (variable C6) and aggregate rubric score (variable R9). This suggests that, overall, increased library contact had some positive impact on IL skills. Aggregate library contact (variable C6) also showed weak, positive correlations with all the individual rubric items (variables R1–R8). Of these, only R5 (understanding and using sources in context) showed highly significant correlations for both Spearman’s and Kendall’s tests. This result may indicate that various forms of library contact had a cumulative positive impact on students’ ability to consider and communicate the context of passages cited in their papers. Moderately significant correlations were found in at least one of the two tests for all but one of the remaining rubric variables.

The individual library contact interventions (variables C1–C5) were also tested for correlation with aggregate rubric score (variable R9) and with individual rubric facets (variables R1–R8). Variable C2 (frequency of website use) showed low, positive, highly significant correlations with R9 and with three individual rubric variables: R5 (understanding of context), R7 (citation format), and R8 (documentation of sources).
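A sketch of the calculation behind Table 9 appears below: Spearman's rho and Kendall's tau for every contact-by-rubric pair, printing only pairs significant in at least one test. The data are randomly generated, and the plain sums used for the aggregate variables stand in for our break-point method (see note 8).

```python
# A sketch of the calculations behind Table 9: Spearman's rho and Kendall's
# tau for every contact-by-rubric pair, printing only pairs significant in at
# least one test. Data are random, and the plain sums used for the aggregate
# variables stand in for our break-point method (see note 8).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 77
contact = rng.integers(0, 5, size=(n, 5)).astype(float)    # C1-C5
contact = np.column_stack([contact, contact.sum(axis=1)])  # C6 = aggregate contact
rubric = rng.integers(1, 6, size=(n, 8)).astype(float)     # R1-R8
rubric = np.column_stack([rubric, rubric.sum(axis=1)])     # R9 = aggregate score

for ci in range(contact.shape[1]):
    for ri in range(rubric.shape[1]):
        rho, p_rho = stats.spearmanr(contact[:, ci], rubric[:, ri])
        tau, p_tau = stats.kendalltau(contact[:, ci], rubric[:, ri])
        if min(p_rho, p_tau) < .05:
            print(f"C{ci + 1} x R{ri + 1}: rho = {rho:.2f} (p = {p_rho:.3f}), "
                  f"tau = {tau:.2f} (p = {p_tau:.3f})")
```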

Table 8. Intraclass correlation (ICC)* and rates of agreement for rubric scores

                      Second set of scores (normed)                      First set of scores
                                                                         (not consistently normed)
Variable   ICC        ICC, cases    ICC, cases not   Rate of             ICC        Rate of
           (N = 77)   requiring     requiring        agreement           (N = 77)   agreement
                      resolution    resolution       (N = 77)                       (N = 77)
                      (n = 25)      (n = 52)
R1         –0.23      –0.49          0.48            64%                 0.29       75%
R2          0.71       0.70          0.72            79%                 0.47       79%
R3          0.35       0.27          0.62            60%                 0.33       77%
R4          0.00      –0.23          0.31            65%                 0.13       71%
R5          0.05      –0.01          0.36            70%                 0.51       86%
R6          0.32       0.26          0.61            65%                 0.51       78%
R7          0.61       0.59          0.73            71%                 0.50       65%
R8          0.69       0.71          0.80            74%                 0.60       71%
R9          0.51       0.43          0.86            68%                 0.60       82%

*Intraclass correlation (ICC) measures correlation between raters for a given variable across the entire set of scores.


Table 9. Correlations between library contact (C) and rubric (R) scores*

Each cell gives Spearman’s rho / Kendall’s tau (df = 75).

       C1              C2               C3               C4               C5                C6
R1    0.16 / 0.13     0.28† / 0.23†     0.00 /  0.00     0.08 /  0.09    –0.02 / –0.001    0.27† / 0.22†
R2    0.21 / 0.23     0.22  / 0.18      0.01 /  0.01    –0.01 / –0.01    –0.03 / –0.03     0.14  / 0.11
R3    0.21 / 0.15     0.14  / 0.11      0.07 /  0.06    –0.12 / –0.11     0.11 /  0.10     0.27† / 0.20†
R4    0.18 / 0.15     0.26† / 0.22†    –0.02 / –0.02    –0.07 / –0.06     0.11 /  0.09     0.24† / 0.19†
R5    0.19 / 0.15     0.37‡ / 0.31‡    –0.08 / –0.06     0.12 /  0.11     0.25† / 0.22†    0.34‡ / 0.29‡
R6    0.03 / 0.02     0.28† / 0.22†    –0.21 / –0.16    –0.06 / –0.05     0.24† / 0.20†    0.23† / 0.18
R7    0.08 / 0.07     0.51‡ / 0.42‡    –0.19 / –0.16     0.01 /  0.01     0.09 /  0.08     0.23† / 0.18
R8    0.09 / 0.06     0.35‡ / 0.29‡    –0.11 / –0.08     0.10 /  0.08     0.06 /  0.05     0.26† / 0.21†
R9    0.15 / 0.10     0.39‡ / 0.30‡    –0.08 / –0.06     0.00 /  0.00     0.09 /  0.07     0.30‡ / 0.23‡

*Spearman’s rho and Kendall’s tau indicate the closeness of the relationship between two variables.
†Score is statistically significant, p < .05.
‡Score is highly statistically significant, p < .01.


Moderately significant positive correlations were found between C2 and R1 (multiple viewpoints), R4 (assessment of sources), and R6 (integration of sources). This suggests that frequent use of the library website, even in the absence of other forms of library contact, was associated with increased levels of IL skills in general and of certain specific abilities.

None of the other individual contact variables showed a highly significant correlation with the rubric variables, although there were moderately significant, low, positive correlations between the number of bibliographic instruction sessions attended (variable C5) and two of the rubric facets, R5 (understanding of context) and R6 (integration of sources). (These correlations were not found in the first set of scores.) In the case of variable C4 (appointment with a librarian), we cannot say whether a relationship exists, since only 3 of the 77 participants in our sample had made such an appointment.

The general patterns of relationship between library contact and rubric score are similar between the first and second sets of scores. Although this similarity does not compensate for the reliability issues that throw both sets of scores into question, it does reassure us that our assessment reflects a real underlying relationship between library contact and IL.

Question 4: In what IL skills were the participants strongest and weakest?

Comparing the median resolved scores for individual IL variables (R1–R8), we found that our sample scored highest on variety of source types and extent of sources (variables R2 and R3; median = 4 for both), followed by citation format and documentation of sources (R7 and R8; median = 3.5 for both), then by use of multiple viewpoints, understanding of context, and integration of sources (R1, R5, and R6; median = 3 for all three). Assessment of sources (R4) received the lowest scores (median = 2.5). The Wilcoxon tests, which we conducted as a test of independence of variables (see Question 1), appear to confirm the significance of these differences.

Scores from the first round of scoring showed the same general pattern. In both rounds, our sample tended to score highest on extent of sources and lowest on assessment of sources, with the remaining skills occurring in a similar order between them. While it is difficult to make a statistically meaningful comparison between the two sets, this similarity suggests general agreement between the two groups of raters. Medians and interquartile ranges (IQRs) for both sets of scores are shown in Table 2. Despite the methodological issues, this ranking of variables will provide a useful guideline for the library’s instructional efforts, since it helps identify areas where students struggle and may need additional or alternative outreach.

Question 5: Did any of the demographic factors have a significant relationship to IL?

We tested for significant differences in IL learning between groups based on demographic factors. Using the appropriate tests (summarized in Table 6), we found no correlations or significant differences in score between groups based on gender, age, number of semesters at Washington State University (WSU), or college major (variables D1, D4 and D5, and D6, respectively). The only significant differences found were related to transfer status.


For variable D2, transfer status, when a Mann-Whitney U test was used to compare the rubric scores of transfer students (n = 65) and non-transfer students (that is, students who started at WSU in their first year of college; n = 12), a highly significant difference was found for variable R3 (extent of sources) and moderately significant differences in variables R1 (multiple viewpoints), R2 (multiple source types), R5 (understanding of context), and R9 (aggregate rubric score). Students who started college at WSU had the higher median scores on all rubric facets. Statistical values appear in Table 10. These findings may indicate a need for additional or improved library outreach to transfer students, especially involving use of source materials.
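A sketch of this comparison, with hypothetical scores in place of our data, shows how the Mann-Whitney U statistic and the medians and IQRs of Table 10 can be produced together:

```python
# A sketch of the transfer-status comparison described above, producing the
# Mann-Whitney U statistic along with the medians and IQRs reported in
# Table 10. The two groups of scores are hypothetical, not our data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
non_transfer = rng.integers(2, 6, size=12).astype(float)  # e.g., facet R3 scores
transfer = rng.integers(1, 6, size=65).astype(float)

u, p = stats.mannwhitneyu(non_transfer, transfer)

def median_iqr(x: np.ndarray) -> str:
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return f"median = {med:g} (IQR = {q3 - q1:g})"

print(f"non-transfer (n = 12): {median_iqr(non_transfer)}")
print(f"transfer (n = 65):     {median_iqr(transfer)}")
print(f"Mann-Whitney U = {u:.1f}, p = {p:.3f}")
```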

As Table 10 shows, our findings on transfer status were similar but not identical for the first set of scores. In the first set, differences were also significant between transfer groups for variables R2 (multiple source types), R5 (understanding of context), and R9, but not for R1 (multiple viewpoints) or R3 (extent of sources). Non-transfer students received the higher median scores on all IL skills in both rounds of scoring, except in the case of R3 (extent of sources) for the first round of scores, where the medians were equal.

These findings raised a question about the relationship between contact with the library and the time a student has spent at WSU. Given significant relationships between aggregate library contact (variable C6) and most of the rubric variables (see Table 9 and Question 3, “Do correlations exist between IL and contact with the library?”), we wondered why the time spent at WSU (variables D4 and D5) had no effect on rubric scores. We believe that this is explained by the fact that aggregate library contact is based on five contact variables, only three of which might increase over time. The only individual contact variable that appeared to have a significant relationship to rubric variables was library website use, which has no clear relationship to a student’s time at WSU. The time spent in the library weekly is also unrelated to time at WSU. The relevant survey questions asked about the frequency of current contact, not about cumulative contact over time. Thus, while the time-related contact variables may have had some combined effect on rubric scores, it is not surprising that no difference could be found when groups were compared based on time at WSU. A student who frequently used the library space and the website might score high on library contact even in the first semester. While the results are understandable, the lack of a clear distinction between routine library contact and cumulative contact over time is a limitation of our study.



Table 10. Comparison of rubric scores for transfer and non-transfer students

First set of scores:

     Median, non-transfer (n = 12)   Median, transfer (n = 65)   Mann-Whitney U†
R1   4.5 (IQR = 1)*                  4 (IQR = 1.5)               324.0
R2   5 (IQR = 0.5)                   4.5 (IQR = 1.25)            218.0‡
R3   5 (IQR = 0.75)                  5 (IQR = 1.5)               263.0
R4   3.75 (IQR = 0.875)              3 (IQR = 1)                 324.0
R5   4.5 (IQR = 1.375)               3.5 (IQR = 1.5)             212.0‡
R6   4 (IQR = 1.375)                 3.5 (IQR = 1.5)             229.5‡
R7   4.25 (IQR = 2.25)               4 (IQR = 1.75)              283.5
R8   4.75 (IQR = 1.875)              4.5 (IQR = 1.5)             308.5
R9   36.75 (IQR = 8.75)              30 (IQR = 7)                188.0§

Second set of scores:

     Median, non-transfer (n = 12)   Median, transfer (n = 65)   Mann-Whitney U
R1   4 (IQR = 1)                     3 (IQR = 1.5)               213.0‡
R2   5 (IQR = 0.5)                   4 (IQR = 2)                 237.5‡
R3   5 (IQR = 0.875)                 4 (IQR = 1.5)               189.0§
R4   3 (IQR = 0.375)                 2.5 (IQR = 1)               280.5
R5   3.5 (IQR = 0.5)                 3 (IQR = 0.5)               246.5‡
R6   3.75 (IQR = 1.5)                3 (IQR = 1.5)               270.5
R7   4 (IQR = 1.875)                 3.5 (IQR = 2.75)            351.0
R8   4 (IQR = 0.875)                 3.5 (IQR = 1.5)             377.0
R9   31.75 (IQR = 7.125)             27 (IQR = 8)                231.0‡

*Interquartile range (IQR) refers to the range of the middle 50% between the lower quartile and the upper quartile of the sample.
†The Mann-Whitney U test is used to assess differences between groups.
‡Score is statistically significant, p < .05.
§Score is highly statistically significant, p < .01.
Rows R2, R5, and R9 (set in boldface in the original) are at least moderately significant (p < .05) for both sets of scores.


Notes

1. See Megan Oakleaf, The Value of Academic Libraries: A Comprehensive Research Review and Report (Chicago: American Library Association, 2010), http://www.ala.org/acrl/sites/ala.org.acrl/files/content/issues/value/val_report.pdf.

2. Association of College and Research Libraries (ACRL), “Information Literacy Competency Standards for Higher Education,” 2000, https://alair.ala.org/handle/11213/7668; ACRL, “Framework for Information Literacy for Higher Education,” 2016, http://www.ala.org/acrl/standards/ilframework.

3. Kathleen Montgomery, “Authentic Tasks and Rubrics: Going beyond Traditional Assessments in College Teaching,” College Teaching 50, 1 (2002): 35, https://doi.org/10.1080/87567550209595870.

4. Montgomery, “Authentic Tasks and Rubrics”; Karen R. Diller and Sue F. Phelps, “Learning Outcomes, Portfolios, and Rubrics, Oh My! Authentic Assessment of an Information Literacy Program,” portal: Libraries and the Academy 8, 1 (2008): 75–89, https://doi.org/10.1353/pla.2008.0000; Megan Oakleaf, “Staying on Track with Rubric Assessment: Five Institutions Investigate Information Literacy Learning,” Peer Review 13–14, 4–1 (2011): 18–21, https://www.aacu.org/publications-research/periodicals/staying-track-rubric-assessment.

5. Char Booth, M. Sara Lowe, Natalie Tagge, and Sean M. Stone, “Degrees of Impact: Analyzing the Effects of Progressive Librarian Course Collaborations on Student Performance,” College & Research Libraries 76, 5 (2015): 623–51, https://doi.org/10.5860/crl.76.5.623; Wendy Holliday, Betty Dance, Erin Davis, Britt Fagerheim, Anne Hedrich, Kacy Lundstrom, and Pamela Martin, “An Information Literacy Snapshot: Authentic Assessment across the Curriculum,” College & Research Libraries 76, 2 (2015): 170–87, https://doi.org/10.5860/crl.76.2.170.

6. See Diller and Phelps, “Learning Outcomes, Portfolios, and Rubrics, Oh My!”

7. Washington State University (WSU), “WSU Learning Goals,” https://ucore.wsu.edu/students/learning-goals/; ACRL, “Information Literacy Competency Standards for Higher Education.”

8. In retrospect, it might have been preferable to use a standardized score such as a z-score to calculate aggregate library contact, thus avoiding the use of arbitrary break points selected by the researchers. Such a method would lend equal weight to all contact variables.

9. Steven E. Stemler, “A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Interrater Reliability,” Practical Assessment, Research & Evaluation 9, 4 (2004): 1–19, https://pareonline.net/getvn.asp?v=9&n=4.

10. Many reference sources offer introductions to the statistical methods discussed in this section. Aside from the articles cited elsewhere in the text, we relied primarily on Laerd Statistics (https://statistics.laerd.com) and on Donncha Hanna and Martin Dempster, Psychology Statistics for Dummies (Hoboken, NJ: Wiley, 2012).

11. James A. Penny and Robert L. Johnson, “The Accuracy of Performance Task Scores after Resolution of Rater Disagreement: A Monte Carlo Study,” Assessing Writing 16, 4 (2011): 221–36, https://doi.org/10.1016/j.asw.2011.06.001.

12. Stemler, “A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Interrater Reliability,” 4.

13. Kevin A. Hallgren, “Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial,” Tutorials in Quantitative Methods for Psychology 8, 1 (2012): 25, https://doi.org/10.20982/tqmp.08.1.p023.

14. Stemler, “A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Interrater Reliability.”

15. Hallgren, “Computing Inter-Rater Reliability for Observational Data.”

16. See Kenneth O. McGraw and S. P. Wong, “Forming Inferences about Some Intraclass Correlation Coefficients,” Psychological Methods 1, 1 (1996): 30–46, http://dx.doi.org/10.1037/1082-989X.1.1.30.

This m

ss. is

peer

review

ed, c

opy e

dited

, and

acce

pted f

or pu

blica

tion,

porta

l 19.3

.

Page 32: Potholes and Pitfalls on the Road to Authentic Assessment · assessment plan. After reviewing previous studies using both indirect and authentic assessment methods, the librarians

Potholes and Pitfalls on the Road to Authentic Assessment460

17. Domenic V. Cicchetti, “Guidelines, Criteria, and Rules of Thumb for Evaluating Normed and Standardized Assessment Instruments in Psychology,” Psychological Assessment 6, 4 (1994): 284–90, http://dx.doi.org/10.1037/1040-3590.6.4.284.

18. See, for example, Jackie Belanger, Ning Zou, Jenny Rushing Mills, Claire Holmes, and Megan Oakleaf, “Project RAILS: Lessons Learned about Rubric Assessment of Information Literacy Skills,” portal: Libraries and the Academy 15, 4 (2015): 623–44, https://doi.org/10.1353/pla.2015.0050; Megan Oakleaf, “Using Rubrics to Assess Information Literacy: An Examination of Methodology and Interrater Reliability,” Journal of the American Society for Information Science and Technology 60, 5 (2009): 969–83, https://doi.org/10.1002/asi.21030; Oakleaf, The Value of Academic Libraries.

19. Stemler, “A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Interrater Reliability.”

20. See Hallgren, “Computing Inter-Rater Reliability for Observational Data”; McGraw and Wong, “Forming Inferences about Some Intraclass Correlation Coefficients.”

21. Diller and Phelps, “Learning Outcomes, Portfolios, and Rubrics, Oh My!”

22. Hallgren, “Computing Inter-Rater Reliability for Observational Data.”

23. Mark Emmons and Wanda Martin, “Engaging Conversation: Evaluating the Contribution of Library Instruction to the Quality of Student Research,” College & Research Libraries 63, 6 (2002): 545–60, https://doi.org/10.5860/crl.63.6.545; Lorrie A. Knight, “Using Rubrics to Assess Information Literacy,” Reference Services Review 34, 1 (2006): 43–55, https://doi.org/10.1108/00907320510631571; Sue Samson, “Information Literacy Learning Outcomes and Student Success,” Journal of Academic Librarianship 36, 3 (2010): 202–10, https://doi.org/10.1016/j.acalib.2010.03.002.

24. Oakleaf, “Using Rubrics to Assess Information Literacy.”

25. See Stemler, “A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Interrater Reliability.”

26. Stemler, “A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Interrater Reliability”; Hallgren, “Computing Inter-Rater Reliability for Observational Data”; Penny and Johnson, “The Accuracy of Performance Task Scores after Resolution of Rater Disagreement.”

27. Diller and Phelps, “Learning Outcomes, Portfolios, and Rubrics, Oh My!”

28. McGraw and Wong, “Forming Inferences about Some Intraclass Correlation Coefficients.”

29. Cicchetti, “Guidelines, Criteria, and Rules of Thumb for Evaluating Normed and Standardized Assessment Instruments in Psychology,” 256.
