Practical Language Testing by Fulcher (2010)

Transcript of Practical Language Testing by Fulcher (2010)

Page 1: Practical Language Testing by Fulcher (2010)

Practical Language Testing Fulcher (2010)

Page 2: Practical Language Testing by Fulcher (2010)

Two paradigms in educational measurement and language testing

1) Norm-referenced testing: The meaning of a test score is derived from the position of an individual in relation to the group. Such a test discriminates between test takers and separates them out (i.e., distributes them) very effectively. Decision making with norm-referenced tests involves value judgments about the meaning of scores in terms of the intended effect of the test.

Page 3: Practical Language Testing by Fulcher (2010)

Two paradigms in educational measurement and language testing (Cont.)

2) Criterion-referenced testing: The aim is to make a decision about whether an individual test taker has achieved a pre-specified criterion, or standard, that is required for a particular decision context.

Page 4: Practical Language Testing by Fulcher (2010)

What is a standardized test?

A standardized test is a form of NRT that

1) requires all test takers to answer the same questions, or a selection of questions from a common bank of questions, in the same way;

2) is scored in a “standard” or consistent manner, which makes it possible to compare the relative performance of individual students or groups of students.

The term is primarily associated with large-scale tests administered to large populations of students.

Page 5: Practical Language Testing by Fulcher (2010)

Why testing is viewed as a ‘science’

The early scientific use of tests was initiated by the introduction of statistical analysis into testing during the First World War.

Greenwood (1919): “When you can measure what you are speaking about and express it in numbers, you know something about it, but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind” (p. 186)

Fulcher (2010): “tests, like scientific instruments, provide the means by which we can observe and measure consistencies in human ability”.

Page 6: Practical Language Testing by Fulcher (2010)

Why testing is viewed as a ‘science’ (Cont.)

Shohamy (2001): “Testing is perceived as a scientific discipline because it is experimental, statistical and uses numbers. It therefore enjoys the prestige granted to science and is viewed as objective, fair, true and trustworthy” (p. 21). These are key features of the “power of testing”.

Lipman (1922): Strong trait theory is untenable. In fact, most of the traits or constructs that we work with are extremely difficult to define, and if we are not able to define them, measurement is even more problematic.

Page 7: Practical Language Testing by Fulcher (2010)

The curve and score meaning

In NRT, the meaning of a score is directly related to its place in the curve of the distribution (a bell curve) from which it is drawn.

[Figure: normal curve with the horizontal axis marked from -3 SD to +3 SD]

Page 8: Practical Language Testing by Fulcher (2010)

Central tendency

Central tendency: The most typical behavior of the group.

Mode: The score that occurs most frequently. A distribution is bimodal if it has two peaks and trimodal if it has three.

Median: The point below which 50 percent of the scores fall and above which 50 percent fall.

Midpoint: The point halfway between the highest score and the lowest score on the test: (high + low) / 2.

Mean: The arithmetic average of the scores, i.e. the sum of all scores divided by the number of scores, $M = \sum X / N$. (The midpoint for NRT is the mean.)

Page 9: Practical Language Testing by Fulcher (2010)

Dispersion

Dispersion: How the individual performances vary from the central tendency.

Range: The number of points between the highest score and the lowest one, plus 1.

Standard deviation (SD): A sort of average of the differences of all scores from the mean (the square root of the sum of the squared deviation scores divided by N – 1).

Deviation score: The score obtained by subtracting the mean from each individual score ($X - M$). The mean of these deviation scores is always zero.

Page 10: Practical Language Testing by Fulcher (2010)

Dispersion (Cont.)

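SD formula:

$SD = \sqrt{\dfrac{\sum (X - M)^2}{N - 1}}$ for a sample

$SD = \sqrt{\dfrac{\sum (X - M)^2}{N}}$ for a population group

The SD is a better indicator than the range because it results from an averaging process, which lessens the effect of extreme scores not attributable to performance on the test.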

Variance: The squared value of SD

Page 11: Practical Language Testing by Fulcher (2010)

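Example

Score   Mean   X-M   (X-M)²
77      71      6     36
75      71      4     16
72      71      1      1
72      71      1      1
70      71     -1      1
65      71     -6     36
66      71     -5     25

Central tendency

Mode = 72

Median = 72

Midpoint = (77 + 65) / 2 = 71

Mean = (77 + 75 + 72 + 72 + 70 + 65 + 66) / 7 = 71

Dispersion

Range = 77 - 65 + 1 = 13

$SD = \sqrt{(36 + 16 + 1 + 1 + 1 + 36 + 25) / 7} = \sqrt{116 / 7} \approx 4.07$, rounded to 4 (dividing by N = 7; dividing by N – 1 = 6 gives about 4.40)

Variance = $s^2 = 4^2 = 16$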
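The figures in this example can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are illustrative and not from the source. Note that `pstdev` divides by N and `stdev` divides by N - 1, so the two SD values differ slightly.

```python
import statistics

# The seven scores from the example table.
scores = [77, 75, 72, 72, 70, 65, 66]

# Central tendency
mode = statistics.mode(scores)              # most frequent score
median = statistics.median(scores)          # middle score once sorted
midpoint = (max(scores) + min(scores)) / 2  # halfway between high and low
mean = statistics.mean(scores)              # sum of scores / N

# Dispersion
score_range = max(scores) - min(scores) + 1     # high - low + 1
sd_population = statistics.pstdev(scores)       # divides by N
sd_sample = statistics.stdev(scores)            # divides by N - 1
variance_population = statistics.pvariance(scores)

print("mode", mode, "median", median, "midpoint", midpoint, "mean", mean)
print("range", score_range)
print("SD (N)", round(sd_population, 2), "SD (N - 1)", round(sd_sample, 2))
print("variance (N)", round(variance_population, 2))
```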

Page 12: Practical Language Testing by Fulcher (2010)

Example (cont.) (with raw score)

In the normal curve, mean, mode, midpoint, and median are all the same.

Score 76: 50% + 34.13% = 84.13% (Percentile: The total percentage of students who scored equal to or below a given point in the normal distribution)

[Figure: normal curve with raw scores 60, 64, 68, 72, 76, 80 and 84 on the horizontal axis, i.e. mean = 72 and SD = 4]
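The percentile for a raw score of 76 can be reproduced from the cumulative normal distribution. The sketch below assumes a mean of 72 and an SD of 4, as implied by the score axis of the curve (60 to 84 in steps of 4), and uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Assumed from the score axis of the curve: mean = 72, SD = 4.
curve = NormalDist(mu=72, sigma=4)

# cdf(x) is the proportion of the distribution at or below x.
percentile_76 = curve.cdf(76) * 100

print(f"Percentile for a raw score of 76: {percentile_76:.2f}%")
```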

Page 13: Practical Language Testing by Fulcher (2010)

Standardized tests: a) z scores

A z score: The raw score expressed in standard deviation units.

Z score formula: $z = \dfrac{X - M}{SD}$

The mean of z scores is always zero.

The SD of z scores is 1.

$-3 \le z \le +3$

Example: $z = \dfrac{70 - 72}{4} = -0.5$, i.e. half an SD below the mean.

Page 14: Practical Language Testing by Fulcher (2010)

Standardized tests: a) z scores (Cont.)

Three problems of z scores:

1. They are relatively small, ranging from -3 to +3.

2. They can turn out to be negative as well as positive.

3. They turn out to include several decimal places.

Reporting scores in the form of z scores can be demotivating for the students.

To overcome these problems, z scores should be transformed to a standardized scale.

Page 15: Practical Language Testing by Fulcher (2010)

Standardized tests: b) T scores

Main formula of standardized scales (linear transformation of z scores): standardized score = (new SD × z) + new mean

T score formula: T = 10z + 50

Mean = 50 SD = 10 range = 20-80 (for -3 ≤ z ≤ +3)

Example: raw score = 70 z score = -0.5 T score = 10 × (-0.5) + 50 = 45

Page 16: Practical Language Testing by Fulcher (2010)

Standardized tests: c) CEEB scores

CEEB (College Entrance Examination Board) scores are the standardised scale used for tests such as the SAT, GRE and TOEFL.

CEEB formula: CEEB = 100z + 500

Mean = 500 SD = 100 range = 200-800 (for -3 ≤ z ≤ +3)

Example: raw score = 70 z score = -0.5

CEEB score = 100 × (-0.5) + 500 = 450
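The z, T and CEEB conversions are all linear transformations, so they are easy to sketch in Python. The function names below are illustrative and not from the source; the mean of 72 and SD of 4 come from the worked example.

```python
def z_score(raw, mean, sd):
    """Raw score expressed in standard deviation units."""
    return (raw - mean) / sd


def t_score(z):
    """T scale: mean 50, SD 10."""
    return 10 * z + 50


def ceeb_score(z):
    """CEEB scale: mean 500, SD 100."""
    return 100 * z + 500


z = z_score(70, mean=72, sd=4)
print(z, t_score(z), ceeb_score(z))  # -0.5 45.0 450.0
```

Because the T and CEEB scales only rescale z, they preserve the rank order of test takers while avoiding negative values and small decimals.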

Page 17: Practical Language Testing by Fulcher (2010)

Item analysis

Item facility (IF), also called item easiness, item difficulty or facility index: The statistic used to examine the proportion of students who correctly answer a given item.

IF formula: IF = N correct / N total

Item discrimination (ID): The degree to which an item separates the students who performed well from those who did poorly on the test as a whole.

ID formula: ID = IF upper – IF lower

      Range           Acceptable       Best
IF    0 ≤ IF ≤ 1      .3 ≤ IF ≤ .7     IF = .5
ID    -1 ≤ ID ≤ +1    .4 ≤ ID          ID = 1
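A minimal sketch of both item statistics follows. The response matrix is hypothetical (rows are test takers, columns are items, 1 = correct), and defining the upper and lower groups as the top and bottom thirds by total score is an assumption, since the slides do not say how the groups are formed.

```python
def item_facility(responses, item):
    """Proportion of test takers who answered the item correctly."""
    return sum(row[item] for row in responses) / len(responses)


def item_discrimination(responses, item, fraction=1 / 3):
    """IF of the upper group minus IF of the lower group.

    Groups are the top and bottom `fraction` of test takers ranked by
    total score (an assumed convention, not taken from the source).
    """
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:k], ranked[-k:]
    return item_facility(upper, item) - item_facility(lower, item)


# Hypothetical data: 6 test takers, 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]

for item in range(4):
    print(item,
          round(item_facility(responses, item), 2),
          round(item_discrimination(responses, item), 2))
```

In this made-up data the last item has an ideal facility of .5 but does not discriminate at all, which shows why both statistics are inspected together.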

Page 18: Practical Language Testing by Fulcher (2010)

Reliability

Reliability: Consistency of scores under different circumstances.

Reliability differs from scorability.

Reliability indicates the degree to which the observed score and true score match.

The observed score (X) is made up of the ‘true’ score of an individual’s ability on what the test measures (T), plus the error (E) that can come from a variety of sources: X = T + E.

Page 19: Practical Language Testing by Fulcher (2010)

Threats to reliability (Lado)

1. Variation in conditions of administration: Fluctuation of scores over time, in different places or under slightly different conditions (such as a different room, or with a different invigilator).

2. The quality of the test itself: Problems with sampling what language to test – as we can’t test everything in a single test. If a test consists of items that test very different things, reliability is also reduced. This is because in standardised tests any group of items from which responses are added together to create a single score are assumed to test the same ability, skill or knowledge. The technical term for this is item homogeneity.

3. Variability in scoring: If humans are scoring multiple-choice items they may become fatigued and make mistakes, or transfer marks inaccurately from scripts to computer records. However, there is more room for variation when humans are asked to make judgments.

Page 20: Practical Language Testing by Fulcher (2010)

Calculating reliability

The method we use to calculate reliability depends upon what kind of error we wish to focus on.

The notion of correlation is at the very center of the notion of reliability.

A reliability coefficient is calculated that ranges from 0 (randomness) to 1, and no test is ‘perfectly’ reliable. There is always error of measurement.

Page 21: Practical Language Testing by Fulcher (2010)

Calculating reliability

1. Variation in conditions of administration

The statistical technique of correlation used is Pearson Product Moment Correlation.

Assumptions: 1. Interval scale, 2. Independence: each pair of scores is independent from all other pairs, 3. Normally distributed, 4. Linearity

-1 ≤ r ≤ +1:

1. –1 : There is an inverse relationship between the scores

2. 0 : There is no relation between the two sets of scores

3. 1 : The scores are exactly the same on both administrations of the test.

The closer the result is to 1, the more test–retest reliability we have

Page 22: Practical Language Testing by Fulcher (2010)

Coefficient of determination

Statistical significance is a necessary precondition for a meaningful correlation but not sufficient in itself.

The coefficient of determination is simply the correlation coefficient squared ($r^2$), and represents the proportion of overlapping variance between two sets of scores (i.e., the extent to which, as the score on one test increases, it increases proportionally on the other test).

0 ≤ r ≤ .60: low ($r^2$ up to .36, about one third overlapping variance or less)

.60 ≤ r ≤ .80: moderate ($r^2$ from .36 to .64, about one third to two thirds overlapping variance)

.80 ≤ r ≤ 1.00: high ($r^2$ from .64 to 1.00, about two thirds to complete overlapping variance)
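Test-retest reliability and the coefficient of determination can be sketched as follows. The Pearson formula is the standard one; the second set of scores is hypothetical retest data invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return co_deviation / (spread_x * spread_y)


first_administration = [77, 75, 72, 72, 70, 65, 66]
second_administration = [75, 76, 70, 73, 69, 67, 64]  # hypothetical retest

r = pearson_r(first_administration, second_administration)
r_squared = r ** 2  # proportion of overlapping variance

print(f"r = {r:.2f}, r squared = {r_squared:.2f}")
```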

Page 23: Practical Language Testing by Fulcher (2010)

2. The quality of the test itself (internal consistency)

Reliability is addressed in terms of homogeneity of items (they must all be highly correlated).

Requirements:

1. Parallelism: Two tests should be parallel (with the same means, the same variances, and the same correlation with another well-established measure of that construct).

2. Independence: The response to any specific item must be independent of the response to any other item; put another way, the test taker should not get one item correct because they have got some other item correct. The technical term for this is the stochastic independence of items.

Statistics used: Split-half methods and methods based on item variance

Page 24: Practical Language Testing by Fulcher (2010)

Split-half method

Main procedure: Split the test into two equal halves, calculate the correlation between the two halves.

1. Spearman-Brown split-half reliability estimate: Since reliability is directly related to the length of a test, correct the correlation for length via the Spearman-Brown correction formula (parallelism and independence are required).

2. Guttman split-half reliability estimate (parallelism is not required but independence is required).
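The Spearman-Brown procedure can be sketched as below. Only the Spearman-Brown estimate is shown, not the Guttman one. Splitting the test into odd- and even-numbered items is an assumed (though common) way of forming the halves, the correction used is the standard 2r / (1 + r) for doubling test length, and the response matrix is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return co_deviation / (spread_x * spread_y)


def spearman_brown_split_half(responses):
    """Correlate the two half-test scores, then correct for full length."""
    first_half = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    second_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson_r(first_half, second_half)
    return 2 * r_half / (1 + r_half)


# Hypothetical data: 6 test takers, 6 dichotomously scored items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
]

reliability = spearman_brown_split_half(responses)
print(f"Split-half reliability (Spearman-Brown): {reliability:.2f}")
```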

Page 25: Practical Language Testing by Fulcher (2010)

Methods based on item variances

Estimates based on item variances (parallelism and independence are required)

1. Cronbach’s Coefficient alpha for dichotomously scored items (scored ‘right’ or ‘wrong’)

2. K-R20 /K-R21
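The slides name these coefficients without giving formulas. The sketch below uses the standard textbook forms, which is an assumption about what the course intended: alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores), with KR-20 as the special case for right/wrong items, where each item variance is p × (1 - p). Population variances are used throughout, and the response matrix is hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Alpha from item variances and the variance of total scores."""
    k = len(responses[0])
    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def kr20(responses):
    """KR-20: alpha for dichotomous items, item variance = p * (1 - p)."""
    n, k = len(responses), len(responses[0])
    p = [sum(row[i] for row in responses) / n for i in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(pi * (1 - pi) for pi in p) / total_variance)


# Hypothetical data: 6 test takers, 6 dichotomously scored items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
]

print(round(cronbach_alpha(responses), 3), round(kr20(responses), 3))
```

For right/wrong data the two functions return the same value, which is why alpha can be used for both dichotomous and partial-credit scoring.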

Page 26: Practical Language Testing by Fulcher (2010)

3. Variability in scoring (grading and marking)

Whichever rater is making the judgment should be a matter of indifference to the test taker.

Inter-rater reliability: Our concern is with variation between raters because some raters are more lenient than others, or some raters may rate some test takers higher than others (perhaps because they are familiar with the first language and are more sympathetic to errors).

Intra-rater reliability: Our concern is with variation within one rater over time.

Statistics: Cronbach’s alpha for partial credit judgments

Page 27: Practical Language Testing by Fulcher (2010)

Standard Error of Measurement (SEM)

One of the most important tools in standardised testing is the standard error of measurement.

While the reliability coefficient tells us how much error there might be in the measurement, it is the standard error of measurement that tells us what this might mean for a specific observed score. It is therefore more informative for interpreting the practical implications of reliability.

SEM formula: $SEM = SD\sqrt{1 - r}$, where SD is the standard deviation of the test scores and r is the reliability coefficient.

Confidence interval: SEM gives us a confidence interval around an observed test score, which tells us by how much the true score may be above or below the observed score that the test taker has actually got on our test.

Page 28: Practical Language Testing by Fulcher (2010)

Example

Example: SD = 4, r = .64, $SEM = 4\sqrt{1 - .64} = 4 \times .6 = 2.4$

Raw score = 74, SEM = 2.4

68% (between +1 SEM and –1 SEM): 71.6 ≤ true score ≤ 76.4

95% (between +2 SEM and –2 SEM): 69.2 ≤ true score ≤ 78.8

99% (between +3 SEM and –3 SEM): 66.8 ≤ true score ≤ 81.2

100% (between +4 SEM and –4 SEM): 64.4 ≤ true score ≤ 83.6
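The SEM and the bands around an observed score can be computed directly. This is a minimal sketch using the values of the worked example (SD = 4, r = .64, observed score = 74); the function names are illustrative and not from the source.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r)."""
    return sd * sqrt(1 - reliability)


def true_score_band(observed, sem, n_sem):
    """Interval of n_sem SEMs on either side of the observed score."""
    return observed - n_sem * sem, observed + n_sem * sem


sem = standard_error_of_measurement(sd=4, reliability=0.64)
print(f"SEM = {sem:.1f}")

for n_sem in (1, 2, 3):
    low, high = true_score_band(74, sem, n_sem)
    print(f"+/- {n_sem} SEM: {low:.1f} to {high:.1f}")
```

The lower the reliability, the wider each band becomes, which is what makes the SEM useful when a decision hinges on a cut score.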

Page 29: Practical Language Testing by Fulcher (2010)

Reliability and test length

In standardised tests with many items, each item provides a piece of information about the ability of the test taker; therefore, as we increase the number of items, the reliability will increase.

Formula for looking at the relationship between reliability and test length:

$A = \dfrac{r_{AA}(1 - r_{11})}{r_{11}(1 - r_{AA})}$

A: The proportion by which you would have to lengthen the test to get the desired reliability

rAA : The desired reliability

r11 : The reliability of the current test.

However, the best way to increase reliability is to produce better items not more items.
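The relationship between reliability and test length can be sketched with the Spearman-Brown prophecy formula, rearranged to give A from rAA and r11. The rearranged form A = rAA(1 - r11) / (r11(1 - rAA)) is the standard one and is assumed to be the formula intended here. The current reliability of .64 is borrowed from the SEM example, and the target of .80 is a hypothetical choice.

```python
def lengthening_factor(desired, current):
    """A: factor by which the test must be lengthened to reach `desired`."""
    return desired * (1 - current) / (current * (1 - desired))


def predicted_reliability(current, factor):
    """Spearman-Brown prophecy: reliability after lengthening by `factor`."""
    return factor * current / (1 + (factor - 1) * current)


a = lengthening_factor(desired=0.80, current=0.64)
print(f"A = {a:.2f}")  # the test would need to be A times its current length
print(f"check: {predicted_reliability(0.64, a):.2f}")
```

The second function is the inverse of the first, so feeding A back in should return the desired reliability. The calculation assumes that the added items are of the same quality as the existing ones.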

Page 30: Practical Language Testing by Fulcher (2010)

Relationships with other measures

One key part of standardised testing: The comparison of two measures of the same construct.

If two different measures were highly correlated, this provided evidence of validity. This aspect of external validity is criterion-related evidence, or evidence that shows one test is highly correlated with a criterion that is already known to be a valid measure of its construct (called evidence for convergent validity).

Measurement as understood in Classical Test Theory

Page 31: Practical Language Testing by Fulcher (2010)

Thanks for your attention