January 15-16, 2013 Hong Kong SAR, China


Page 1: January 15-16, 2013 Hong Kong SAR, China

January 15-16, 2013, Hong Kong SAR, China

Thanks to the organizers of the Assessment Conference, Hong Kong, for giving me the chance to give this presentation.

Page 2: January 15-16, 2013 Hong Kong SAR, China

by Prof. Zhang Quan, Ph.D.

College of Foreign Studies, Jiaxing University, Zhejiang, China

Rasch Model in China: Retrospect and Status Quo

Page 3: January 15-16, 2013 Hong Kong SAR, China

I. Rasch Model, 20 years ago

• As early as the 1980s, the ideas and concepts of the Rasch Model and IRT were first introduced into China by Prof. Gui Shichun, my Ph.D. supervisor. It was Prof. Gui who first conducted, with great success, the ten-year (1990-1999) Equating Project for the Matriculation English Test (MET) in China. MET was the most influential and competitive entrance examination for higher education, administered annually to over 3.3 million candidates at the time. The Equating Project won recognition from Charles Alderson and other foreign counterparts during the 1990s. Academically, those were the Good Old Days for Chinese testing experts and psychometricians. Then, for certain reasons, the equating practice was abruptly discontinued. As a result, in China today, the application of the Rasch Model or of IRT-based software such as BILOG, PARSCALE, Winsteps and others to real testing problems is confined to a small ‘band’ of people.

Page 4: January 15-16, 2013 Hong Kong SAR, China

• Rasch was used to do equating for MET (Matriculation English Test), the most influential and competitive entrance examination (about 20% of candidates can be enrolled), administered annually to approximately 3.3 million candidates from 1990 on, with the number increasing in the following years.

• Features of MET
• 1. Compulsory: all Chinese middle school students must take it if they plan to study in a college or a university.
• 2. High-stakes: passing or failing may decide the rest of one’s life.
• 3. Unified and at national level: one and the same test paper is used across the Chinese mainland.
• 4. Test format: mainly multiple-choice questions plus a small portion of writing.

I. Rasch Model, 20 years ago

Page 5: January 15-16, 2013 Hong Kong SAR, China

• Features of MET (continued)
• 5. Family-bound: passing MET and being admitted to university for higher learning is the chief concern and expectation that all parents in China have for their children.
• 6. Equating via anchor items was done annually from 1990 to 1999, so that test scores, after conversion, are comparable on the same scale across China. MET was the only large-scale test for which equating was conducted with real data and whose rescaled scores were used for recruitment.
• 7. Moderating of test items was based on item analysis.

I. Rasch Model, 20 years ago

Page 6: January 15-16, 2013 Hong Kong SAR, China

• One thing worth mentioning here is that equating via the Rasch Model was done very successfully in the Chinese situation, which is unique in a number of ways. (The presenter was one of the key members of the equating group headed by Prof. Gui from 1990 to 1999.)

• 1. Because of the uneven development of education and the large number of candidates taking the test, the population is actually heterogeneous, even though the candidates are all senior middle school graduates. It is difficult to set an unbiased test, let alone to equate two parallel test forms administered on different occasions.

II. Rasch Model and MET equating, 20 years ago

Page 7: January 15-16, 2013 Hong Kong SAR, China

• 2. Although the test papers were centrally set, there was as yet no way to score the papers centrally. The general practice was for each province to score its own papers and to work out its own norm for recruitment. University authorities were thus confronted with the problem of selecting candidates whose scores had been graded according to different criteria set by different provinces.

II. Rasch Model and MET equating, 20 years ago

Page 8: January 15-16, 2013 Hong Kong SAR, China

• 3. In China, there is no feasible way to protect test security immediately after a test’s administration. Nor is it possible to use common items in different forms, nor is it feasible to conduct any pre-test for future use.

• To find feasible solutions to these problems, we established an anchorage, i.e. three sampling bases (middle schools) to monitor the performance of the candidates. We designed an equivalent test form (35+65=100 items) and had it administered, three days before MET, to candidates who were going to take MET. The equivalent test form was used repeatedly for 10 years (1990-1999).

II. Rasch Model and MET equating, 20 years ago

Page 9: January 15-16, 2013 Hong Kong SAR, China

• In doing so, we could not only observe but also compare the performance of candidates taking MET in different years.

• Hypothesis:

• There will be no big change in the general mean M (English proficiency) of the population within one year’s time. If there is any change of means, it must be associated with a change in the difficulty level of the test forms across the two years.

II. Rasch Model and MET equating, 20 years ago

Page 10: January 15-16, 2013 Hong Kong SAR, China

• Then we came to realize that such a hypothesis is by no means perfect, for at least three reasons:

• First, the sample size. We were running the risk of test leakage. The sample must be big enough to be representative; however, the larger the sample, the greater the danger of test exposure;

• Next, the general level of the population is not likely to remain unchanged. Instead, it may fluctuate. Insignificant changes may accumulate into significant changes. (Gui Shichun: 1990)

II. Rasch Model and MET equating, 20 years ago

Page 11: January 15-16, 2013 Hong Kong SAR, China

• Finally, if there are any changes in the difficulty level of the test forms, it would not be acceptable simply to make linear adjustments to individual test scores to compensate for the difference between the test forms.

• It is on the basis of this hypothesis that the anchor-test-random-groups design was put forward and conducted.

II. Rasch Model and MET equating, 20 years ago

Page 12: January 15-16, 2013 Hong Kong SAR, China

2.1. Anchor-test-random-groups design

Test takers A: Test A + equivalent test (35 linking items + 65)
Test takers B: Test B + equivalent test (35 linking items + 65)

The equivalent test was taken externally, three days before MET was actually administered.

Page 13: January 15-16, 2013 Hong Kong SAR, China

2.2. Anchor-test-random-groups design: summarized (1)

• 1. Sampling
• 2. Administration
• 3. A chi-square test (Wright, 1979) of the 35 linking items was applied so as to delete the inappropriate items. In 1989, 28 items; in 1990-1991, 27 items.
• 4. Equating test forms
• The test results of 1988 (the year when MET was first administered across China) were used as the basal reference. With the anchor test and the Rasch Model (GITEST), all the following test forms were equated (calibrated and rescaled) (Wright, 1979).
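The rescaling in step 4 can be sketched in a few lines. This is only an illustration of common-item linking under the Rasch Model, assuming the linking items have already been calibrated once on the base-year scale and once in the later year; it is not GITEST's actual procedure, it omits the chi-square screening of step 3, and the function names and difficulty values are hypothetical.

```python
from statistics import mean

def linking_constant(base_anchor, new_anchor):
    """Shift that moves the new calibration onto the base-year scale.

    Both lists hold Rasch difficulties of the same linking items,
    in the same order (hypothetical inputs).
    """
    if len(base_anchor) != len(new_anchor):
        raise ValueError("anchor lists must cover the same linking items")
    return mean(base_anchor) - mean(new_anchor)

def rescale(difficulties, shift):
    """Apply the linking constant to difficulties from the new calibration."""
    return [d + shift for d in difficulties]

# Hypothetical difficulties of three linking items in two calibrations.
base_anchor = [-0.50, 0.10, 0.40]   # base-year scale
new_anchor = [-0.20, 0.40, 0.70]    # later-year calibration

shift = linking_constant(base_anchor, new_anchor)

# Hypothetical difficulties of items unique to the later form.
new_form = [0.90, -0.30, 0.25]
equated = rescale(new_form, shift)
```

Because the Rasch Model has a single difficulty parameter per item, a single additive constant is enough to put the later form on the base scale in this sketch.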

Page 14: January 15-16, 2013 Hong Kong SAR, China

2.2. Anchor-test-random-groups design: summarized (2)

• 5. Ability estimation
• In the case of the Rasch Model, ability estimation is straightforward. To obtain the maximum likelihood estimate of theta (θ), we used the Newton-Raphson procedure (Hambleton, 1985). The ability values were then converted into probabilities for those who know nothing about Rasch.

• As the model has the sample-free property, we could make use of the derived data to obtain adjusted scores for the population.
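A minimal sketch of the Newton-Raphson step for θ under the Rasch Model, assuming the item difficulties are already calibrated. The function names, starting value and tolerance are illustrative choices, not those used in the project.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct answer under the Rasch Model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses, difficulties, tol=1e-6, max_iter=50):
    """Maximum likelihood ability estimate by Newton-Raphson.

    responses: list of 0/1 item scores; difficulties: calibrated item
    difficulties in the same order.
    """
    raw = sum(responses)
    n = len(responses)
    if raw == 0 or raw == n:
        raise ValueError("no finite ML estimate for zero or perfect scores")
    theta = math.log(raw / (n - raw))          # starting value (assumption)
    for _ in range(max_iter):
        probs = [rasch_p(theta, b) for b in difficulties]
        gradient = raw - sum(probs)            # first derivative of log-likelihood
        information = sum(p * (1.0 - p) for p in probs)  # minus second derivative
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta

# Hypothetical five-item test.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
responses = [1, 1, 1, 0, 0]
theta_hat = estimate_theta(responses, difficulties)
```

At the solution the expected score equals the observed raw score, which is why only the raw score enters the gradient.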

Page 15: January 15-16, 2013 Hong Kong SAR, China

2.2. Anchor-test-random-groups design: summarized (3)

• Why the Rasch Model and not other models, such as two- or three-parameter models?
• 1. Feasible implementation
• Once the item parameters have been calibrated, the ability parameters can be easily estimated.
• A typical example: a candidate with a raw score of 60 correct answers out of 85 test items will be assigned one ability value regardless of which combination of 60 items was answered correctly. In the case of two- or three-parameter models, the procedures get complicated. The estimation is very much associated with the discrimination and the so-called ‘guessing’ parameter. Therefore, two or more candidates with a raw score of 60 correct answers out of 85 test items will be assigned different ability values because the combinations of 60 correct answers vary from person to person. Imagine: the number of combinations of items from 1 to 84 is astronomical!

• With those models it was impossible to use the sampled data to predict the population performance. Very often the iteration never converged, mainly because of two big problems: computer configuration problems and a data set too large to process within two weeks.
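The raw-score argument above can be checked numerically. The sketch below, with hypothetical item parameters, estimates ability for two response patterns that share the same raw score: once with all discriminations equal to 1 (the Rasch case) and once with varying discriminations (a two-parameter case). It also counts the response patterns that yield 60 correct out of 85.

```python
import math

def estimate_theta(responses, difficulties, discriminations,
                   max_iter=100, tol=1e-10):
    """Newton-Raphson ML ability estimate under a logistic model.

    With all discriminations equal to 1 this is the Rasch Model.
    """
    theta = 0.0
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-a * (theta - b)))
                 for a, b in zip(discriminations, difficulties)]
        gradient = sum(a * (u - p)
                       for a, u, p in zip(discriminations, responses, probs))
        information = sum(a * a * p * (1.0 - p)
                          for a, p in zip(discriminations, probs))
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta

# Hypothetical item parameters for a five-item test.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
rasch_a = [1.0] * 5
two_pl_a = [0.5, 0.8, 1.0, 1.3, 1.8]

# Two different patterns with the same raw score of 3.
pattern_x = [1, 1, 1, 0, 0]
pattern_y = [0, 0, 1, 1, 1]

rasch_x = estimate_theta(pattern_x, difficulties, rasch_a)
rasch_y = estimate_theta(pattern_y, difficulties, rasch_a)
two_pl_x = estimate_theta(pattern_x, difficulties, two_pl_a)
two_pl_y = estimate_theta(pattern_y, difficulties, two_pl_a)

# Number of distinct ways to get exactly 60 of 85 items right.
patterns_with_60_correct = math.comb(85, 60)
```

Under the Rasch parameters the two estimates coincide, while under the two-parameter values they differ, which is the practical point made on the slide.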

Page 16: January 15-16, 2013 Hong Kong SAR, China

2.2. Anchor-test-random-groups design: summarized (4)

• 2. Model-data fit

• With the Rasch Model, item and ability fit can be computed (Wright, 1982) and can demonstrate the degree of goodness-of-fit of the model.

• ... ...

Page 17: January 15-16, 2013 Hong Kong SAR, China

2.3. Item Difficulty of MET 1988-1992

              MET88          MET89          MET90          MET91          MET92
Phonetics     -0.860 (0.70)  -0.69 (0.48)   0.793 (0.31)   -0.186 (0.55)  0.992 (0.28)
Grammar       0.228 (0.44)   -0.372 (0.59)  0.471 (0.38)   0.500 (0.38)   0.801 (0.31)
BLK-Filling   -0.367 (0.59)  0.271 (0.43)   0.871 (0.30)   0.845 (0.30)   0.609 (0.35)
Reading       -0.330 (0.58)  -0.581 (0.64)  0.600 (0.35)   -0.179 (0.54)  -0.202 (0.55)
Means         -0.206 (0.55)  -0.180 (0.54)  0.657 (0.34)   0.361 (0.41)   0.523 (0.37)

For better illustration, the numbers in brackets are probabilities converted from difficulties. As shown in the table, there is no big difference between MET88 and MET89; however, MET90 turned out to be more difficult.
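The slide does not state the conversion rule. As an inference, the bracketed means are reproduced by the Rasch probability of a correct answer for a person of ability 0, which the sketch below applies to the Means row; the function name is hypothetical.

```python
import math

def difficulty_to_probability(d):
    """Rasch probability of a correct answer for a person of ability 0
    on an item of difficulty d (reference ability 0 is an assumption)."""
    return 1.0 / (1.0 + math.exp(d))

# Mean difficulties taken from the Means row of the table.
met_means = {"MET88": -0.206, "MET89": -0.180, "MET90": 0.657,
             "MET91": 0.361, "MET92": 0.523}

converted = {form: round(difficulty_to_probability(d), 2)
             for form, d in met_means.items()}
```

A higher difficulty gives a lower probability, so the drop from about 0.55 to 0.34 between MET88 and MET90 corresponds to the harder MET90 form.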

Page 18: January 15-16, 2013 Hong Kong SAR, China

2.4. Ability (θ) of MET 1988-1992

           MET88    MET89    MET90    MET91    MET92
Total N    136543   117085   128543   136047   133965
θ Means    40.0     44.4     53.7     50.0     54.2
%          47.0     52.2     63.2     58.8     63.8
SD         17.9     16.1     13.5     missing  15.2

The θ Means shown in the table above refer to the rescaled average ability parameters (e.g. 40.0) for the MC part only, whose full score is 85; the full test is 85 MC + 15 writing = 100.

Page 19: January 15-16, 2013 Hong Kong SAR, China

III. MET and Rasch Model: Status Quo

• MET remains the most influential and competitive entrance examination (about 20% of candidates can be enrolled), administered annually to approximately 3.3 million candidates from 1990 on, with the number increasing in the following years.

• Features of MET remain:
• 1. Compulsory: all Chinese middle school students must take it if they plan to study in a college or a university.
• 2. High-stakes: passing or failing may decide the rest of one’s life.
• 3. Unified and at national level: one and the same test paper is used across China.
• 4. Mainly multiple-choice questions plus a small portion of writing.
• 5. No equating is done annually. Statistically, test scores are not comparable.

Disbanded. Practice has reverted to traditional test item writing, scoring at provincial level, reporting in raw scores, no pre-test, no item analysis (Rasch or IRT) and no equating. There are problems with test item writing and moderating.

Each province or region uses its own test paper. No norm is established.

Page 20: January 15-16, 2013 Hong Kong SAR, China

• MET remains the most influential and competitive entrance examination for higher education in China. The number of candidates has been increasing. It reached 9.5 million in 2006, and the graph shows the numbers of candidates taking MET nationwide in recent years (2006-2010).

• The average number of candidates in Zhejiang Province, where our university is located, is 300,000 (360,000 in 2012). According to the latest official report, the numbers of candidates taking MET in Jiaxing in 2012 were as follows: 7688 in humanities, 1493 in arts and Chinese, 176 in sports, 12991 in science, 443 in arts and science, and 237 in sports and science.

• The number of students taking MET is decreasing annually.

III. MET and Rasch Model: Status Quo

Page 21: January 15-16, 2013 Hong Kong SAR, China

• MET features (continued)
• According to the most recent information, MET in China will be administered separately from the other entrance examinations and more than once a year, so that students have more chances to take it. From a professional point of view, such a practice needs equating.

• Updated IRT-based computer software

III. MET and Rasch Model: Status Quo

Page 22: January 15-16, 2013 Hong Kong SAR, China

IV. College English Test (CET)

• CET is another highly influential examination, administered twice a year to non-English-major students, numbering over 10 million in recent years.

• Features of CET
• 1. From compulsory to optional: not all non-English-major undergraduate students have to take it.
• 2. From high-stakes to not very high-stakes: passing or failing may make no difference to whether a student gets the diploma.
• 3. Unified and at national level: one and the same test paper is used across China.
• 4. Mainly multiple-choice questions plus a small portion of writing.
• 5. The test for which equating has been done annually in China since 1990 (with a team of qualified test item writers).

Page 23: January 15-16, 2013 Hong Kong SAR, China

The first Rasch-based computer software, developed by Prof. Gui in the 1990s. Test Paper Report by GITEST

Mean: the mean score of all examinees;
SD: the standard deviation for all examinees;
Varn.: the variance based on all examinees;
P+: probability of correct answers;
Pd value: difficulty parameter based on probability;
R11: reliability by Kuder-Richardson 20; this value should be over 0.9;
aVALUE: reliability parameter, also called α value, by Cronbach’s formula; this value should be over 0.8;
Rbis: discrimination index (biserial);
Skewness: score distribution value; 0 indicates a normal distribution; above 0 indicates positive skewness, showing the test items are more difficult; below 0 indicates negative skewness, showing the test items are easier;
Kurtosis: score distribution height; 0 indicates normal; above 0 shows “narrower”, i.e. a small range between the scores; below 0 indicates “flat”, i.e. a big range between scores;
Difficulty: VD (<0.1), D (0.1-0.3), I (0.3-0.7), E (0.7-0.9), VE (>0.9)
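Several of the report statistics can be computed directly from a 0/1 response matrix. The sketch below is not GITEST's code; it uses textbook formulas, takes the population variance for KR-20 (conventions differ), and assigns the shared band boundaries (0.1, 0.3, 0.7, 0.9) by an arbitrary rule. All function names and the sample matrix are hypothetical.

```python
from statistics import mean, pvariance

def p_plus(matrix):
    """Proportion of correct answers per item (rows = examinees)."""
    n = len(matrix)
    return [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]

def difficulty_band(p):
    """Map P+ to the report's bands; boundary handling is an assumption."""
    if p < 0.1:
        return "VD"
    if p < 0.3:
        return "D"
    if p < 0.7:
        return "I"
    if p <= 0.9:
        return "E"
    return "VE"

def kr20(matrix):
    """Kuder-Richardson 20 reliability using the population variance."""
    k = len(matrix[0])
    totals = [sum(row) for row in matrix]
    pq = sum(p * (1.0 - p) for p in p_plus(matrix))
    return k / (k - 1) * (1.0 - pq / pvariance(totals))

def skewness_and_kurtosis(scores):
    """Moment-based skewness and excess kurtosis (0 for a normal curve)."""
    m = mean(scores)
    n = len(scores)
    m2 = sum((x - m) ** 2 for x in scores) / n
    m3 = sum((x - m) ** 3 for x in scores) / n
    m4 = sum((x - m) ** 4 for x in scores) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Hypothetical responses of five examinees to four items.
matrix = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

bands = [difficulty_band(p) for p in p_plus(matrix)]
reliability = kr20(matrix)
skew, kurt = skewness_and_kurtosis([sum(row) for row in matrix])
```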

Page 24: January 15-16, 2013 Hong Kong SAR, China

[Figure: item difficulty curves; vertical axis from -8 to 6; legend: GITEST, BILOG, PARSCALE]

As shown in the figure above, the curves are very close. The BILOG and PARSCALE curves almost overlap. This is very much related to the number of cycles and the pre-determined convergence value set in the respective command files. BILOG converged after 6 cycles with a largest change of 0.005, while PARSCALE converged after 72 cycles with a largest change of 0.01. GITEST looks a little different because all of its parameters were left at their defaults. On the whole, there is no big difference in terms of test item difficulty calibration.

The three curves generated by GITEST, BILOG and PARSCALE, indicating item difficulties based on the same data

Page 25: January 15-16, 2013 Hong Kong SAR, China

2.2. Equating and its why

• In testing practice, equating is used to monitor any possible changes in item difficulty so as to adjust the ability estimates yielded by different groups of candidates taking two parallel tests on different occasions, as in the equating project of the Matriculation English Test (MET) in China launched in 1986, or the equating of the College English Test (candidates take two tests and may choose the higher of the two scores).

Page 26: January 15-16, 2013 Hong Kong SAR, China

[Diagram: three situations]
Same test takers, Test A and Test B: difficulty d of each test?
Test takers A and Test takers B, same test: ability θ of each group?
Test takers A with Test A and Test takers B with Test B, joined by linking items: difficulty / ability?

2.3. Equating and its concept

Page 27: January 15-16, 2013 Hong Kong SAR, China

The concept of ‘equating’ discussed here refers to the linking of test forms through common items, so that scores derived from tests administered separately to different test takers on different occasions will, after conversion, be comparable on the same scale.

(Hambleton & Swaminathan; Gui Shichun: 1985, et al.)

Equating defined

Page 28: January 15-16, 2013 Hong Kong SAR, China

Equating --- Item bank

• Equating makes an item-bank possible;

• An item-bank serves computerized testing.

• Item bank (calibrated) → Computerized testing (test items to be presented)

Page 29: January 15-16, 2013 Hong Kong SAR, China

• BILOG-W Command File

EQUATING OF PRETCO2002(20+100) LINKED WITH PRETCO2002 (20+100)
>COMMENTS
The data were collected from more than 1,000 PRETCO candidates of colleges within Guangdong Province.
The data are in the file PRETCO01.DAT of the BILOG directory;
The respondents' scores are estimated by the ML method and re-scaled to mean 0 and standard deviation 1 in the sample (RSC=2).
The item parameter estimates are saved AFTER re-scaling.

>GLOBAL NWGHT=0, FNAME='d:\BILOG\Examples\blgdat\PRETCO01.DAT',
        NPArm=1, SAVe;
>SAVe GRAPH='PRETCO01.PLT', PARM='PRETCO01.PAR', SCORE='PRETCO01.SCO';
>LENGTH NITems=220;
>INPUT FORms=2, NTOT=120, NALT=4, INOPT=1, NIDCH=12;
(12A1,1X,I1,120A1)
>FORm1 LENgth =120, ITEms = (1(1)120);
>FORm2 LENgth =120, ITEms = (1(1)20,121(1)220);
>TESt TNAMe= 'EQUATING', LINK=(1(0)20,0(0)200);
>CALIB TPRior, SPRior;
>SCORE MET=1, RSC=2;
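The two FORM lines imply a sparse layout: each candidate answers 120 items, the first 20 of which are common to both forms, out of 220 distinct items. The sketch below only illustrates that layout in Python; it says nothing about BILOG's internals, and the names NOT_PRESENTED, FORM_ITEMS and expand_record are hypothetical.

```python
NOT_PRESENTED = None   # marker for items a candidate never saw (assumption)
TOTAL_ITEMS = 220

# Item numbers per form, mirroring the FORm1 and FORm2 lines.
FORM_ITEMS = {
    1: list(range(1, 121)),                          # items 1-120
    2: list(range(1, 21)) + list(range(121, 221)),   # items 1-20 and 121-220
}

def expand_record(form, responses):
    """Place a 120-character 0/1 response string into a 220-item row."""
    items = FORM_ITEMS[form]
    if len(responses) != len(items):
        raise ValueError("response string length does not match the form")
    row = [NOT_PRESENTED] * TOTAL_ITEMS
    for item_no, ch in zip(items, responses):
        row[item_no - 1] = int(ch)
    return row

# Hypothetical Form 2 candidate: all linking items right, the rest wrong.
row = expand_record(2, "1" * 20 + "0" * 100)
```

Only the first 20 columns are filled for candidates on both forms, and these shared columns are what allow the two forms to be calibrated on one scale.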

Page 30: January 15-16, 2013 Hong Kong SAR, China

• BILOG-W Data File

GD2006070001 1 1010101001010101010101010010101010101010010101010101010100
GD2006070002 1 1010101001110101010101010010101010101010010101010101010100
GD2006070003 1 1010101001010101010101010010101010101010010101010101010111
GD2006070004 1 1010101001010101010101010010101010101010010101010101010100
GD2006070005 1 1010101001010101010101000010101010101010010101010101010100
GD2006070006 1 1010101001011111110101010010101010101010010101010101010100
GD2006070007 1 1010101001010101010101010010101010101010010101010101010100
… … … … … … … …
GD2006070001 1 1010101001010101011101010010101010101011111101010101010100
GD2006070001 1 1010101001010101010101010010101010101011110101010101010100
GD2006070001 1 1010101001010101010101010010101010101010010101010101010100
GD2006070001 1 1010101001010101010101010010101010101010010101010101010100
GD2006070001 1 1010101111010101010101011110101010101010010101010101010100
GD2006070001 1 1110101001010101010101010010101010101010010101010101010100
GD2006070001 1 1010101001010101010101010010101010101010010101010101010100
GD2007070001 2 1010101001010101010101010010101010101010010101010101010100
GD2007070002 2 1010101001010101010101010010101010101011110101010101010100
GD2007070003 2 1010101001010101010101010010101010101010010101010101010100
GD2007070004 2 1010101001010101010101011110101010111111111111010101010100
GD2007070005 2 1010101001010101010101011110101010101010010101010101010100
GD2007070006 2 1010101001010101010101010010101010101010010101010101010100
… … … … … … … …
GD2007070007 2 1010101001010101010101011110101010101011010101010101010100
GD2007070008 2 1010101001010101010101011110101010101010010101010101010111
GD2007070008 2 1111111111110101010101010010101010101010010101010101010100

Page 31: January 15-16, 2013 Hong Kong SAR, China

PARSCALE-W Command File
Command file: EQT8599.PSL

EQ8599 Equating: Simulated Data

>COMMENT:

This example illustrates the calibration and scoring of two parallel MET tests, MET85 and MET99, each containing 20 common items and 85 MET items, i.e. 20 linking items plus 85 items per test. The simulated data represent responses of 300 examinees drawn randomly from a population with a mean trait score of 0.0 and a standard deviation of 1.0.

All items are response data from multiple-choice questions with four alternatives. All items have varying difficulties and discriminating powers, saved in the file MET85-99.DAT. The scores, which are equated to be comparable on the same scale, are not printed but saved in the file METEQT8599.SCO. In addition, the item parameters, estimated by the maximum likelihood method (MLE) from the one-parameter model, are saved in the file METEQT8599.PAR.

>FILE DFNAME='MET8599.DAT', NFNAME= 'MET8599.NPR', SAV;
>SAVE PARM='MET8599.PAR', SCORE='METEQT8599.SCO';
>INPUT NIDW=10, NTOTAL=190, NTEST=1;
(10A1,190A1)
>TEST1 TNAME='EQ8599', ITEM=(1(1)190), NBLOCK=1, SLOPE;
>BLOCK NITEMS=190, NCAT=2, GPARM=0.0, GUESS=(2,FIX), CSLOPE, ORIGINAL=(0,1), MODIF=(1,2);
>CAL LOGISTIC, SCALE=1.7, NQPTS=30, CYCLE=30, CRIT=0.01, ITEMFIT=6;
>SCORE MLE;

Page 32: January 15-16, 2013 Hong Kong SAR, China

PARSCALE-W Data File

TESTX011101010100110000001011000119999999999999
TESTX020110101100011110111111111109999999999999
TESTX031000101100000101110000000019999999999999
TESTX101111110001011110111100111109999999999999
TESTX110101111111110110000000000119999999999999
TESTX120011000000111100001010101009999999999999
… … … … … …
TESTX181110010101010000000001011009999999999999
TESTX191001010100010000001000011009999999999999
TESTX200001100110001000000001000109999999999999
TESTX210011010101010010001000010009999999999999
TESTX221010111000111000110000000009999999999999
TEXTY080111101010101099999999999990010011111000
TEXTY091111010101010199999999999991110111111011
TEXTY101011010101011199999999999990110110100101
TEXTY111111010101010099999999999990011010100011
TEXTY121001101010101099999999999990010000000000
… … … … … …
TEXTY150101010101011099999999999991110010000000
TEXTY161101101010101099999999999990010010101001
TEXTY171100110101010199999999999990010010110001
TEXTY180101101010101199999999999991111010000010
TEXTY311100111010101199999999999990000010100010

Page 33: January 15-16, 2013 Hong Kong SAR, China

The numbers of candidates taking MET nationwide in recent years (2006-2010).

[Bar chart: candidates in millions (axis 0 to 12) for 1990-1999, 2006, 2007, 2008, 2009 and 2010]

Page 34: January 15-16, 2013 Hong Kong SAR, China

[Figure: item difficulty curves for PET1999 and PET2011; vertical axis from -4 to 4]

The two curves, generated by GITEST, indicate the item difficulties of PET1999 and PET2011; after equating, they are comparable on the same scale.

Page 35: January 15-16, 2013 Hong Kong SAR, China

The most recently updated BILOG and PARSCALE can process, in a single run, an unlimited number of test items for an unlimited number of test takers.

The data matrix is effectively unlimited. This makes CAT feasible.

Page 36: January 15-16, 2013 Hong Kong SAR, China

V. What we need in the present status quo

• Examinations on large scale in China today

• (1) Matriculation English Test (MET)

• (2) College English Test Band-4 and Band-6 (CET)

• (3) Test for English Majors (TEM)

• (4) Practical English Test for Colleges (PRETCO)

• (5) Public English Test System (PETS)

Page 37: January 15-16, 2013 Hong Kong SAR, China

VI. What we need in the future

• (1) Testing theory: Rasch model, IRT, ... ...

• (2) More workshops

• (3) More experienced experts

• (4) More textbooks on language testing

• (5) More PROMS conferences

• (6) More cooperation and exchange

Towards International Practice of Language Testing in China

Page 38: January 15-16, 2013 Hong Kong SAR, China

Prof. Zhang Quan Ph.D Dean, College of Foreign Studies, Jiaxing University,

Zhejiang Province, P.R.China email: [email protected]

Tel: 86-0573-83640029 Cell: 86-13902251564

Thank you for your attention

Questions