Testing 05 Reliability. Errors & Reliability Errors in the test cause unreliability. The fewer the...


Page 1: Testing 05 Reliability. Errors & Reliability Errors in the test cause unreliability. The fewer the errors, the more reliable the test Sources of errors:

Testing 05

Reliability

Page 2:

Errors & Reliability

• Errors in the test cause unreliability.

• The fewer the errors, the more reliable the test

• Sources of errors:

• Obvious: poor health, fatigue, lack of interest

• Less obvious: facets discussed in Fig. 5.3

Page 3:

Reliability & Validity

• Reliability is a necessary condition for validity.

• Reliability & validity are complementary aspects of the measurement.

• Reliability: How much of the performance is due to measurement errors, or to factors other than the language ability we want to measure.

• Validity: How much of the performance is due to the language ability we want to measure.

Page 4:

Reliability Measurement

• Reliability measurement involves logical analysis and empirical research, i.e. identifying sources of error and estimating the magnitude of their effects on the scores.

Page 5:

Logical Analysis

• Example of identification of source of errors:

• Topic in an oral interview: business negotiation

• Source of error: if we want to measure the test taker’s ability on general topics.

• Indicator of the ability: if we want to measure the test taker’s ability in business English.

Page 6:

Empirical Research

• Procedures are usually complex.

• Three kinds of theories

• Classical true score theory (CTS)

• Generalizability theory (G-Theory)

• Item Response Theory (IRT)

Page 7:

Factors on Test Scores

• Characteristics of factors

• general vs. specific

• lasting vs. temporary

• systematic vs. unsystematic

Page 8:

Factors that affect language test scores

Page 9:

Variance & Standard Deviation

• $s$: standard deviation of the sample

• $\sigma$: standard deviation of the population

• $s^2$: variance of the sample

• $\sigma^2$: variance of the population

• $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

• where

• $X$: individual score

• $\bar{X}$: mean score

• $n$: number of students
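The sample formula above can be sketched in a few lines of Python. This is only an illustration: the function names and the score list are hypothetical, not part of the slides.

```python
import math

def sample_variance(scores):
    """Sample variance: squared deviations from the mean, summed and divided by n - 1."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / (n - 1)

def sample_sd(scores):
    """Sample standard deviation s: square root of the sample variance."""
    return math.sqrt(sample_variance(scores))

# Hypothetical scores of eight students (mean = 5).
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(scores))  # 32 / 7, about 4.57
print(sample_sd(scores))        # about 2.14
```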

Page 10:

Correlation Coefficient

• Covariance (COV): a measure of how two variables, X and Y, vary together.

• $\mathrm{COV}(X,Y) = \frac{1}{n-1}\sum (X_i - \bar{X})(Y_i - \bar{Y})$

• Correlation coefficient (Pearson product-moment correlation coefficient):

• $r_{xy} = \frac{\mathrm{COV}(X,Y)}{s_x s_y}$

• $r_{xy} = \frac{\frac{1}{n-1}\sum (X_i - \bar{X})(Y_i - \bar{Y})}{s_x s_y}$

Page 11:

Correlation Coefficient

• where

• $n$: number of test takers (pairs of scores)

• $X_i$: individual score on the first half

• $\bar{X}$: mean of the scores on the first half

• $Y_i$: individual score on the second half

• $\bar{Y}$: mean of the scores on the second half

• $s_x$: standard deviation of the first half

• $s_y$: standard deviation of the second half
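As a rough sketch of the manual calculation, the covariance and correlation formulas can be written out directly. The function names and the two score lists are hypothetical; a spreadsheet would give the same value.

```python
import math

def covariance(xs, ys):
    """Sample covariance: mean cross-product of deviations, using n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Pearson product-moment correlation: COV(X, Y) / (s_x * s_y)."""
    s_x = math.sqrt(covariance(xs, xs))  # COV(X, X) is the variance of X
    s_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (s_x * s_y)

# Hypothetical half-test scores for five test takers.
first_half = [1, 2, 3, 4, 5]
second_half = [2, 4, 5, 4, 5]
print(pearson_r(first_half, second_half))  # about 0.77
```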

Page 12:

Calculation of Correlation Coefficient

• Manually

• Manual + Excel

• Excel

Page 13:

Classical True Score Theory

• also referred to as the classical reliability theory because its major task is to estimate the reliability of the observed scores of a test. That is, it attempts to estimate the strength of the relationship between the observed score and the true score.

• sometimes referred to as the true score theory because its theoretical derivations are based on a mathematical model known as the true score model

Page 14:
Page 15:

Assumptions in CTS

• Assumption 1: The observed score consists of the true score and the error score, i.e. $x = x_t + x_e$

• Assumption 2: Error scores are unsystematic, random and uncorrelated with the true score, i.e. $s_x^2 = s_t^2 + s_e^2$

Page 16:

Parallel Test

• Two tests are parallel if

$\bar{x} = \bar{x}'$

$s_x^2 = s_{x'}^2$

$r_{xy} = r_{x'y}$

Page 17:

Correlation Between Parallel Tests

• If the observed scores on two parallel tests are highly correlated, the effects of the error scores are minimal.

• Reliability is the correlation between the observed scores of two parallel tests.

• The definition is the basis for all estimates of reliability within CTS theory.

• Condition: the observed scores on the two tests are experimentally independent.

Page 18:

Error Score Estimation and Measurement

• Relations between reliability, true score and error score:

• The higher the portion of the true score, the higher the correlation of the two parallel tests. (True scores are systematic)

• The higher the portion of the error score, the lower the correlation of the two parallel tests. (Error scores are random)

Page 19:

Error Score Estimation and Measurement

• $r_{xx'} = s_t^2 / s_x^2$

• $(s_t^2 + s_e^2) / s_x^2 = 1$

• $s_e^2 / s_x^2 = 1 - s_t^2 / s_x^2$

• $s_t^2 / s_x^2 = r_{xx'}$

• $s_e^2 / s_x^2 = 1 - r_{xx'}$

• $s_e^2 = s_x^2 (1 - r_{xx'})$

Page 20:

Approaches to Estimate Reliability

• Three approaches based on different sources of errors.

• Internal consistency: source of errors from within the test and scoring procedure

• Stability: How consistent test scores are over time.

• Equivalence: Scores on alternative forms of tests are equivalent.

Page 21:

Internal Consistency

• Dichotomous

– Split-half reliability estimates

– The Spearman-Brown split-half estimate

– The Guttman split-half estimate

– Kuder-Richardson reliability coefficients

• Non-dichotomous

– Coefficient alpha

– Rater consistency

Page 22:

Split-half Reliability Estimates

• Split the test into two halves which have equal means and variances (equivalence) and are independent of each other (independence).

• 1. Divide the test into the first and second halves.

• 2. Random halves

• 3. Odd-even method

Page 23:

Spearman-Brown Reliability Estimate

• $r_{xx'} = \frac{2 r_{hh'}}{1 + r_{hh'}}$

• where:

• $r_{hh'}$: correlation between the two halves of the test

• Procedure:

• 1. Divide the test into two equal halves

• 2. Calculate the correlation coefficient between the two halves

• 3. Calculate the Spearman-Brown reliability estimate
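The three-step procedure can be sketched as follows, assuming an odd-even split and a small, made-up matrix of dichotomous item scores (one row per test taker). All names here are illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two lists of scores (n - 1 cancels out)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def odd_even_halves(responses):
    """Step 1: score the odd-numbered and even-numbered items separately."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return odd, even

def spearman_brown(r_hh):
    """Step 3: step the half-test correlation up to full test length."""
    return 2 * r_hh / (1 + r_hh)

# Hypothetical data: 5 test takers, 6 items, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]
odd, even = odd_even_halves(responses)
r_hh = pearson_r(odd, even)          # Step 2
print(r_hh, spearman_brown(r_hh))
```

The stepped-up estimate is always at least as large as a positive half-test correlation, because each half is only half as long as the full test.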

Page 24:

Guttman Split-Half Estimate

• $r_{xx'} = 2\left(1 - \frac{s_{h1}^2 + s_{h2}^2}{s_x^2}\right)$

• where

• $s_{h1}^2$: variance of the first half

• $s_{h2}^2$: variance of the second half

• $s_x^2$: variance of the total scores
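A minimal sketch of the Guttman estimate, assuming the two half scores are already available for each test taker. The half scores below are hypothetical, and `statistics.variance` is the sample variance (n - 1).

```python
from statistics import variance

def guttman_split_half(half1, half2):
    """Guttman split-half estimate from the two half variances and the total variance."""
    totals = [a + b for a, b in zip(half1, half2)]
    return 2 * (1 - (variance(half1) + variance(half2)) / variance(totals))

# Hypothetical half scores for five test takers.
half1 = [3, 3, 2, 1, 0]
half2 = [2, 1, 1, 0, 1]
print(guttman_split_half(half1, half2))  # 0.625
```

No correlation between the halves is computed here, which is why this estimate does not rely on the halves having equal variances.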

Page 25:

Kuder-Richardson Formula 20

• $r_{xx'} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{s_x^2}\right)$

• where

• $k$: number of items on the test

• $p$: proportion of the correct answers, i.e. correct answers/total answers (difficulty)

• $q$: proportion of the incorrect answers, i.e. $1 - p$

• $s_x^2$: total test score variance
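KR-20 can be sketched directly from an item-response matrix. The data are made up, and one choice here is an assumption: the total score variance uses the sample formula (n - 1) given earlier in these slides, whereas some texts use the population variance, which changes the value slightly.

```python
from statistics import variance

def kr20(responses):
    """KR-20 for dichotomous items; responses holds one row of 0/1 scores per test taker."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # proportion correct on item j
        sum_pq += p * (1 - p)                     # q = 1 - p
    # Assumption: sample variance (n - 1) of the total scores.
    return (k / (k - 1)) * (1 - sum_pq / variance(totals))

# Hypothetical data: 5 test takers, 6 items.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]
print(kr20(responses))  # about 0.78
```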

Page 26:

Kuder-Richardson Formula 21

• $r_{xx'} = \frac{k s_x^2 - \bar{x}(k - \bar{x})}{(k - 1) s_x^2}$

• where

• $k$: number of items on the test

• $s_x^2$: total test score variance

• $\bar{x}$: mean score
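KR-21 needs only three summary values, so a sketch is very short. The figures used (6 items, mean 2.8, variance 3.2) are hypothetical.

```python
def kr21(k, mean, var):
    """KR-21 from the number of items, the mean score and the total score variance."""
    return (k * var - mean * (k - mean)) / ((k - 1) * var)

# Hypothetical test: 6 items, mean score 2.8, score variance 3.2.
print(kr21(6, 2.8, 3.2))  # about 0.64
```

Because KR-21 treats all items as equally difficult, it is convenient when item-level data are unavailable.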

Page 27:

Coefficient alpha

• $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_x^2}\right)$

• where

• $k$: number of items on the test

• $\sum s_i^2$: sum of the variances of the different parts of the test

• $s_x^2$: variance of the test scores
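A small sketch of coefficient alpha, assuming the test is supplied as a list of parts, each holding one score per test taker. The data and names are hypothetical.

```python
from statistics import variance

def coefficient_alpha(parts):
    """Coefficient alpha; parts is a list of k score lists, aligned by test taker."""
    k = len(parts)
    totals = [sum(scores) for scores in zip(*parts)]   # total score per test taker
    sum_part_var = sum(variance(part) for part in parts)
    return (k / (k - 1)) * (1 - sum_part_var / variance(totals))

# Hypothetical test with two parts taken by five test takers.
parts = [
    [3, 3, 2, 1, 0],
    [2, 1, 1, 0, 1],
]
print(coefficient_alpha(parts))  # 0.625
```

Unlike KR-20, the parts need not be scored 0/1, so the same function works for rated sections of a test.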

Page 28:

Comparison of Estimates: Assumptions

| Estimate | Assumption: Equivalence | Assumption: Independence | Effect if equivalence is violated | Effect if independence is violated |
|---|---|---|---|---|
| Spearman-Brown | + | + | underestimate | overestimate |
| Guttman | - | + | | overestimate |
| K-R | + | + | underestimate | overestimate |
| Coefficient α | - | - | - | - |

Page 29:

Summary: Estimate Procedure

• Spearman-Brown

– 1. split

– 2. variances of each half

– 3. correlation coefficient between the two halves

– 4. reliability coefficient

Page 30:

Summary: Estimate Procedure

• Guttman

– 1. split

– 2. variances of each half

– 3. variance of the whole test

– 4. reliability coefficient

Page 31:

Summary: Estimate Procedure

• K-R 20

• 1. number of questions

• 2. proportion of correct answers of each question

• 3. proportion of incorrect answers of each question

• 4. sum of the product of p and q

• 5. variance of the whole test

• 6. reliability coefficient

Page 32:

Summary: Estimate Procedure

• K-R 21

• 1. number of questions

• 2. mean of the test

• 3. variance of the test

• 4. reliability coefficient

Page 33:

Summary: Estimate Procedure

• Coefficient α

• 1. number of the parts of the test

• 2. mean of each part

• 3. variance of each part

• 4. sum of variances of all parts

• 5. mean of the test

• 6. variance of the test

• 7. reliability coefficient

Page 34:

Rater Consistency

• Intra-rater

• Inter-rater

Page 35:

Intra-rater Reliability

• Rate each paper twice. Condition: the two ratings must be independent of each other.

• Two ways of estimating:

• Spearman-Brown: Take each rating as a split half and compute the reliability coefficient.

Page 36:

Intra-rater Reliability

• Conditions: the two ratings must have similar means and variances to ensure the equivalence of the two ratings.

• Coefficient alpha: Take the two ratings as two parts of a test.

• $\alpha = \frac{k}{k-1}\left(1 - \frac{s_{x_1}^2 + s_{x_2}^2}{s_{x_1+x_2}^2}\right)$

Page 37:

Intra-rater Reliability

• where

• $k$: number of ratings

• $s_{x_1}^2$: variance of the first rating

• $s_{x_2}^2$: variance of the second rating

• $s_{x_1+x_2}^2$: variance of the summed ratings

• Since k=2, the formula can be reduced to the Guttman Reliability Coefficient Formula.
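The reduction can be written out in one line. Substituting k = 2 into the coefficient alpha formula for two ratings gives the same form as the Guttman split-half formula, with the two ratings playing the role of the two halves:

```latex
\[
\alpha
= \frac{2}{2-1}\left(1 - \frac{s_{x_1}^2 + s_{x_2}^2}{s_{x_1+x_2}^2}\right)
= 2\left(1 - \frac{s_{x_1}^2 + s_{x_2}^2}{s_{x_1+x_2}^2}\right)
\]
```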

Page 38:

Inter-rater Reliability

• If there are only two raters, use split-half estimates to obtain the reliability coefficient.

• Or the grade (rank) correlation coefficient:

• $r_{xx'} = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$

• where

• $D$: difference between the grades (ranks) given by the two ratings

Page 39:

Inter-rater Reliability

• n: number of the test takers

• See testing 05-2 sheet 5 for example

• Note: tied scores should share the same grade (rank).

• If there are more than two raters, use Coefficient alpha estimate
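The grade correlation for two raters can be sketched as below. Two points are assumptions rather than part of the slides: the note on shared grades is read as "tied scores receive the average of the ranks they occupy", and the marks are invented. The 6∑D² formula is exact only when there are no ties.

```python
def average_ranks(scores):
    """Rank from highest (1) to lowest; tied scores share the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of rank positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks

def rank_correlation(rater_a, rater_b):
    """Grade (rank) correlation: 1 - 6 * sum(D^2) / (n * (n^2 - 1))."""
    n = len(rater_a)
    ranks_a, ranks_b = average_ranks(rater_a), average_ranks(rater_b)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical marks given by two raters to the same five test takers.
rater_a = [90, 80, 70, 60, 50]
rater_b = [85, 90, 65, 70, 40]
print(rank_correlation(rater_a, rater_b))  # about 0.8
```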

Page 40:

Stability (test-retest reliability)

• Administer the test twice to a group of individuals and compute the correlation between the two sets of scores. The correlation can then be interpreted as an indicator of how stable the scores are over time.

• Learning effects and practice effects must be taken into account.

Page 41:

Equivalence (parallel forms reliability)

• Use alternative forms of a given test. Compute and compare the means and standard deviations for each of the two forms to determine their equivalence. The correlation between the two sets of scores can be interpreted as an indicator of the equivalence of the two tests or an estimate of the reliability of either one.

Page 42:

GENERALIZABILITY THEORY

Page 43:

GENERALIZABILITY THEORY

• Generalizability theory (G-theory) is a framework of factorial design and the analysis of variance. It constitutes a theory and set of procedures for specifying and estimating the relative effects of different factors on observed test scores, and thus provides a means for relating the uses or interpretations to the way test users specify and interpret different factors as either abilities or sources of error.

Page 44:

GENERALIZABILITY THEORY

• G-theory treats a given measure or score as a sample from a hypothetical universe of possible measures, i.e. on the basis of an individual's performance on a test we generalize to his performance in other contexts.

• Reliability = generalizability

• The way we define a given universe of measures will depend upon the universe of generalization

Page 45:

Application of G-theory

• Two stages:

– G-study

– D-study

Page 46:

G-study

• Consider the uses that will be made of the test scores, and investigate the sources of variance that are of concern or interest. On the basis of this generalizability study, the test developer obtains estimates of the relative sizes of the different sources of variance ('variance components').

Page 47:

D-study

• When the results of the G-study are satisfactory, the test developer administers the test under operational conditions, and uses G-theory procedures to estimate the magnitude of the variance components. These estimates provide information that can inform the interpretation and use of the test scores.

Page 48:

Significance of G-theory

• The application of G-Theory thus enables test developers and test users to specify the different sources of variance that are of concern for a given test use, to estimate the relative importance of these different sources simultaneously, and to employ these estimates in the interpretation and use of test scores.

Page 49:

Universes Of Generalization And Universe Of Measures

• Universe of generalization: a domain of uses or abilities (or both).

• Universe of possible measures: the types of test scores we would be willing to accept as indicators of the ability to be measured for the purpose intended.

Page 50:

Populations of Persons

• In addition to defining the universe of possible measures, we must define the group, or population of persons about whom we are going to make decisions or inferences.

Page 51:

Universe Score

• A universe score $x_p$ is thus defined as the mean of a person's scores on all measures from the universe of possible measures. The universe score is thus the G-theory analog of the CTS-theory true scores. The variance of a group of persons' scores on all measures would be equal to the universe score variance $s_p^2$, which is similar to CTS true score variance in the sense that it represents that proportion of observed score variance that remains constant across different individuals and different measurement facets and conditions.

Page 52:

Universe Score

• The universe score is different from the CTS true score, however, in that an individual is likely to have different universe scores for different universes of measures.

Page 53:

Generalizability Coefficients

• The G-theory analog of the CTS-theory reliability coefficient is the generalizability coefficient, which is defined as the proportion of observed score variance that is universe score variance:

• $\rho_{xx'}^2 = s_p^2 / s_x^2$

• where $s_p^2$ is universe score variance and $s_x^2$ is observed score variance, which includes both universe score and error variance.

Page 54:

Estimation

• Variance components: sources of variance

• persons (p), forms (f), raters (r)

• $s_x^2 = s_p^2 + s_f^2 + s_r^2 + s_{pf}^2 + s_{pr}^2 + s_{fr}^2 + s_{pfr}^2$

• Use ANOVA to compute the magnitude of each variance component.

• Analyse those that are significantly large.
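To make the ANOVA step concrete, here is a rough sketch for a simplified design. It is an assumption made for brevity: only persons and raters are crossed (forms are dropped), with one score per cell, so the components are p, r and the residual pr,e. The components are estimated from the mean squares in the usual way, and the ratings are invented.

```python
def variance_components(scores):
    """Estimate p, r and residual variance components; scores[p][r] is rater r's score for person p."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]

    # Sums of squares for a two-way crossed design without replication.
    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pr = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

    # Solve the expected mean squares; negative estimates are set to zero.
    return {
        "p": max((ms_p - ms_pr) / n_r, 0.0),
        "r": max((ms_r - ms_pr) / n_p, 0.0),
        "pr,e": ms_pr,
    }

def generalizability(components):
    """Proportion of observed score variance that is universe score (person) variance."""
    return components["p"] / sum(components.values())

# Hypothetical ratings: 3 persons, each scored by the same 2 raters.
ratings = [[4, 5], [2, 3], [6, 7]]
components = variance_components(ratings)
print(components, generalizability(components))
```

In this made-up data the second rater is consistently one point more lenient, so a small rater component appears while the residual is zero.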

Page 55:

Standard Error of Measurement (SEM)

• We need to know the extent to which the test score may vary (SEM).

• Formula of SEM estimation:

• $s_e = s_x\sqrt{1 - r_{xx'}}$

• From:

• $r_{xx'} = s_t^2 / s_x^2$ (1)

• $s_t^2 / s_x^2 + s_e^2 / s_x^2 = 1$ (2)

• $s_e^2 / s_x^2 = 1 - s_t^2 / s_x^2$ (3)

• $s_e^2 / s_x^2 = 1 - r_{xx'}$

• $s_e^2 = s_x^2 (1 - r_{xx'})$
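A short sketch of the SEM formula with made-up figures (standard deviation 10, reliability 0.91). The band of one SEM around an observed score is included only as an illustration of how the value might be read.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM: s_e = s_x * sqrt(1 - r_xx')."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: score standard deviation 10, reliability 0.91.
sem = standard_error_of_measurement(10, 0.91)
observed = 70
print(sem)                             # about 3.0
print(observed - sem, observed + sem)  # band of one SEM around the observed score
```

The lower the reliability, the wider this band becomes for the same spread of scores.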

Page 56:

Interpretation of Test Scores

• Difficulty

• Distinction

• Z score

Page 57:

Difficulty for Dichotomous Scoring

• $p = R/n$

• where:

• $p$: difficulty index

• $R$: number of right answers

• $n$: number of students

Page 58:

Difficulty for Dichotomous Scoring (Corrected)

• $C_p = \frac{kp - 1}{k - 1}$

• where

• $C_p$: corrected difficulty index

• $p$: uncorrected difficulty index

• $k$: number of choices
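Both difficulty indices can be sketched in a few lines. The figures (30 of 40 students correct on a four-choice item) are hypothetical.

```python
def difficulty(right_answers, n_students):
    """Uncorrected difficulty index: p = R / n."""
    return right_answers / n_students

def corrected_difficulty(p, n_choices):
    """Difficulty index corrected for guessing: Cp = (k * p - 1) / (k - 1)."""
    return (n_choices * p - 1) / (n_choices - 1)

# Hypothetical item: 30 of 40 students answer a four-choice item correctly.
p = difficulty(30, 40)
print(p)                           # 0.75
print(corrected_difficulty(p, 4))  # about 0.67
```

When p equals the chance level 1/k, the corrected index is 0, which is the point of the correction.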

Page 59:

Difficulty for Non-dichotomous Scoring

• p=mean/full score

• 30%--85%

Page 60:

Distinction

• Label the top 27% of the total as the high group and the lowest 27% of the total as the low group.

• $D = P_H - P_L$

• where

• $D$: distinction index

• $P_H$: rate of the correct answers in the high group

• $P_L$: rate of the correct answers in the low group
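The distinction index (often called the discrimination index) can be sketched as follows. Two details are assumptions, since the slides do not specify them: the group size is 27% of the test takers rounded to the nearest whole person, and ties in total score are broken by original order. The data are invented.

```python
def distinction_index(totals, item_correct, fraction=0.27):
    """D = P_H - P_L for one item, using the top and bottom groups by total score."""
    n = len(totals)
    group_size = max(1, round(n * fraction))  # assumption: round to nearest person
    order = sorted(range(n), key=lambda i: totals[i], reverse=True)
    high, low = order[:group_size], order[-group_size:]
    p_high = sum(item_correct[i] for i in high) / group_size
    p_low = sum(item_correct[i] for i in low) / group_size
    return p_high - p_low

# Hypothetical data: total scores of 10 test takers and their 0/1 result on one item.
totals = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
print(distinction_index(totals, item))  # about 0.67
```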

Page 61:

Z score

• A way of placing an individual score in the whole distribution of scores on a test; it expresses how many standard deviation units the score lies above or below the mean. Scores above the mean are positive; those below the mean are negative.

• An advantage of z scores is that they allow scores from different tests to be compared, where the mean and standard deviation differ, and where score points may not be equal.

• $Z = (X - \bar{X})/s$

Page 62:

T-score

• A transformation of a z score, equivalent to it but with the advantage of avoiding negative values, and hence often used for reporting purposes.

• T=10Z+50

Page 63:

Standardized Score

• A transformation of raw scores which provides a measure of relative standing in a group and allows comparison of raw scores from different distributions, e.g. from tests of different lengths. It does this by converting a raw score into a standard frame of reference which is expressed in terms of its relative position in the distribution of scores. The z score is the most commonly used standardized score.

• Standardized score = 100Z + 500
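The three score transformations can be sketched together. The raw scores are hypothetical, and `statistics.stdev` is the sample standard deviation (n - 1).

```python
from statistics import mean, stdev

def z_scores(raw_scores):
    """Z = (X - mean) / s for every raw score."""
    m, s = mean(raw_scores), stdev(raw_scores)
    return [(x - m) / s for x in raw_scores]

def t_score(z):
    """T = 10Z + 50."""
    return 10 * z + 50

def standardized_score(z):
    """Standardized score = 100Z + 500."""
    return 100 * z + 500

# Hypothetical raw scores: mean 50, standard deviation 10.
raw = [40, 50, 60]
for z in z_scores(raw):
    print(z, t_score(z), standardized_score(z))
```

Both transformations keep each test taker's relative position; they only change the scale on which it is reported.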