Evaluation of Tutoring Systems. Kurt VanLehn, PSLC Summer School 2006.
Evaluation of Tutoring Systems
Kurt VanLehn
PSLC Summer School 2006
Outline
– Formative evaluations: How to improve the tutor?
– Summative evaluations: How well does the tutor work?
– Parametric evaluations: Why does the tutor work?
User interface testing (usually the first formative eval)
When
– Early, before the rest of the tutor is built
Who
– Students
– Instructors
How
– Talk aloud with headset mikes and Camtasia
– Sit with the user and ask them about every frown
– Other?
Also great for finding good problems to assign
The Andes user interface
Pyrenees' GUI for the same problem
Andes' GUI for defining a velocity vector
Pyrenees' method
T: What would you like to define?
   a) displacement  b) velocity  c) acceleration
S: b
T: Velocity of…
   a) Hailstone  b) Cloud  c) Earth
S: a
T: Type?
   a) instantaneous  b) average
S: a
T: Time point?
   a) T0  b) T1
S: a
T: Orientation?
S: 270 deg
T: Name?
S: vi
Wizard of Oz (a formative evaluation)
Format
– Human in the next room watches a copy of the screen
– Human responds when the student presses the Hint button or makes an error
– Human must type very fast! Paste from stock answers?
User interface evaluation
– Does the human have enough information?
– Can the human intervene early enough?
Knowledge elicitation
– What tutoring tactics were used?
– What conditions determine when each tactic is used?
Snapshot critiques (a late formative evaluation)
Procedure
– ITS keeps a log file
– Afterwards, randomly select events in the log where the student got help
– Print the context leading up to the help message
– Expert tutors write their help on the paper
– How frequently does the expert's help match the ITS's?
– How frequently do two experts' help match?
Add to the ITS the help that the experts agree on
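The random-selection step above can be sketched in a few lines. The log format and field names below are illustrative assumptions, not the format of any particular ITS:

```python
import random

random.seed(1)  # reproducible sample for the critique packet

# Hypothetical log: (event_id, event_type) pairs; every third event is a help request.
log = [(i, "help" if i % 3 == 0 else "attempt") for i in range(1, 31)]

# Keep only events where the student got help, then sample a handful
# to print (with surrounding context) for the expert tutors.
help_events = [eid for eid, etype in log if etype == "help"]
sample = random.sample(help_events, k=5)
print(sorted(sample))
```

In a real study the sample size would be chosen so that each expert can critique the packet in one sitting.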
Outline
– Formative evaluations: How to improve the tutor?
– Summative evaluations: How well does the tutor work?  ← Next
– Parametric evaluations: Why does the tutor work?
Summative evaluations
Question: Is the tutor more effective than a control?
Typical design
– Experimental group uses the tutor
– Control group learns via the "traditional" method
– Pre- & post-tests
Data analysis
– Did the tutor group "do better" than the control?
Three feasible designs. One factor is forced to be equal. Two factors vary.

                    Training problems    Training duration    Post-test score
Like homework       Tutor = control      Tutor < control?     Tutor > control?
Like seatwork       Tutor > control?     Tutor = control      Tutor > control?
Mastery learning    Tutor < control?     Tutor < control?     Tutor = control
Control conditions
Typical control conditions
– Existing classroom instruction
– Textbook & exercise problems (feedback?)
– Another tutoring system
– Human tutoring
  » A null result does not "prove" computer tutor = human tutor
Define your control condition early
– It drives the design of the tutor
Assessments (tests)
– Your tests
– Instructor's normal tests
– Standardized tests

When to test
– Pre-test
– Immediate post-test
– Delayed post-test
  » Measures retention
– Learning (pre-test, training, post-test)
  » Measures acceleration of future learning (also called preparation for learning)
Example of acceleration of future learning (Min Chi & VanLehn, in prep.)
Design
– Training on probability, then physics
– During probability only,
  » Half the students taught an explicit strategy
  » Half not taught a strategy (normal instruction)
[Figure: scores on pre- and post-tests for probability training (ordinary transfer) and for physics training (preparation for learning)]
Content of post-tests
– Some problems from the pre-test
  » Determines if any learning occurred at all
– Some problems similar to the training problems
  » Measures near-transfer
– Some problems dissimilar to the training problems
  » Measures far-transfer
Use your cognitive task analysis!
Bad tests happen, so pilot, pilot, pilot!
Blatant mistakes (show up in the means)
– Too hard (floor effect)
– Too easy (ceiling effect)
– Too long (mental attrition)
Subtle mistakes (check the variance)
– Test doesn't cover some training content
– Test over-covers some training content
– Test is too sensitive to background knowledge
  » e.g., reading, basic math
Did the conditions differ?
My advice: always do ANCOVAs
– Condition is the independent variable
– Post-test score is the dependent variable
– Pre-test score is the covariate
Others' advice:
– Do ANOVAs on gains
– If pre-test scores are not significantly different, do ANOVAs on post-test scores
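A minimal sketch of the recommended ANCOVA, expressed as a linear model and fit with NumPy least squares. All scores are simulated (both groups gain, the tutor group gains more); the fitted condition coefficient is the post-test difference adjusted for pre-test:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40  # hypothetical students per condition

pretest = rng.normal(50, 10, 2 * n)
condition = np.repeat([0, 1], n)  # 0 = control, 1 = tutor
# Simulated post-tests: control gains 8 points, tutor gains 15, plus noise.
posttest = pretest + np.where(condition == 1, 15, 8) + rng.normal(0, 5, 2 * n)

# ANCOVA as a linear model: posttest ~ intercept + condition + pretest.
X = np.column_stack([np.ones(2 * n), condition, pretest])
beta, *_ = np.linalg.lstsq(X, posttest, rcond=None)
intercept, condition_effect, pretest_slope = beta
print(f"pre-test-adjusted condition effect: {condition_effect:.1f} points")
```

A statistics package would add the significance test on the condition coefficient; the point here is only the structure of the model.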
Effect sizes: Cohen's d
Should be based on post-test scores:
  [mean(experimental) - mean(control)] / standard_deviation(control)
Common but misleading usage:
  [mean(post-test) - mean(pre-test)] / standard_deviation(pre-test)
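The recommended between-groups formula is one line of code; the scores below are hypothetical, purely for illustration:

```python
from statistics import mean, stdev

def cohens_d(experimental, control):
    """Effect size from the slide: difference in means over the control group's SD."""
    return (mean(experimental) - mean(control)) / stdev(control)

# Hypothetical post-test scores.
tutor_scores = [72, 80, 85, 78, 90, 84]
control_scores = [65, 70, 75, 68, 72, 74]
print(round(cohens_d(tutor_scores, control_scores), 2))  # prints 2.87
```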
Error bars help visualize results
[Bar chart: number of help requests (0 to 50) for four conditions: Complete/Paraphrase, Complete/Self-explain, Incomplete/Paraphrase, Incomplete/Self-explain, with error bars]
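Error bars of this kind usually show the mean plus or minus the standard error; a minimal sketch computing both, using hypothetical help-request counts (not the data behind the chart above):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical help-request counts per condition.
help_requests = {
    "Complete/Paraphrase":     [42, 38, 45, 40],
    "Complete/Self-explain":   [30, 28, 33, 35],
    "Incomplete/Paraphrase":   [25, 22, 27, 24],
    "Incomplete/Self-explain": [15, 18, 12, 16],
}

for cond, xs in help_requests.items():
    se = stdev(xs) / sqrt(len(xs))  # standard error of the mean
    print(f"{cond}: {mean(xs):.1f} ± {se:.1f}")
```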
Scatter plots help visualize results
[Scatter plot: z-score on exam vs. GPA (1 to 4), one point per student, with fitted lines:
  Andes:    y = 0.9473x - 2.4138, R² = 0.2882
  Controls: y = 0.7956x - 2.5202, R² = 0.2048
 Schematic version: post-test score vs. pre-test score (or GPA), Andes line above Control line]
If the slopes were different, we would have an aptitude-treatment interaction (ATI)
Which students did the tutor help?
– Divide subjects into high/low pretest
– Plot gains
– Called "aptitude-treatment interaction" (ATI)
– Need more subjects
[Bar chart: gain for tutored vs. control students, split by low vs. high pretest]
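The median-split-and-compare-gains analysis can be sketched directly; all the student triples below are hypothetical:

```python
from statistics import mean, median

# Hypothetical (pretest, posttest, condition) triples.
students = [
    (30, 55, "tutored"), (35, 62, "tutored"), (70, 85, "tutored"), (75, 88, "tutored"),
    (32, 45, "control"), (38, 50, "control"), (72, 80, "control"), (78, 84, "control"),
]

cut = median(pre for pre, _, _ in students)  # median split on pre-test

gains = {}
for level in ("low", "high"):
    for cond in ("tutored", "control"):
        kept = [post - pre for pre, post, c in students
                if c == cond and ((pre < cut) == (level == "low"))]
        gains[(level, cond)] = mean(kept)
        print(f"{level} pretest, {cond}: mean gain = {gains[(level, cond)]:.1f}")
```

If the tutored-minus-control gap differs between the low and high halves, that is the ATI pattern; as the slide notes, detecting it reliably needs more subjects than a simple two-group comparison.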
Which topics did the tutor teach best?
– Divide test items (e.g., into deep vs. shallow knowledge)
– Plot gains
– Need more items
[Bar chart: gain for tutored vs. control, split by deep vs. shallow items]
Log file analyses
Did students use the tutor as expected?
– Using help too much (help abusers)
– Using help too little (help refusers)
– Copying a solution from someone else (exclude?)
Correlations with gain
– Errors corrected with or without help
– Proportion of bottom-out hints
– Time spent thinking before/after a hint
Learning curves for productions
– If not a smooth curve, is it really a single production?
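Learning curves for productions are often summarized with a power law, error_rate ≈ a * opportunity^(-b). A minimal sketch, assuming hypothetical error rates, that fits a and b by linear regression in log-log space (a rough but common shortcut); a production whose data fit the curve poorly may really be several distinct skills:

```python
from math import exp, log

# Hypothetical error rates for one production across practice opportunities.
opportunities = [1, 2, 3, 4, 5, 6]
error_rates = [0.60, 0.38, 0.28, 0.23, 0.20, 0.17]

# Fit error_rate ≈ a * opportunity**(-b): regress log(error) on log(opportunity).
xs = [log(o) for o in opportunities]
ys = [log(e) for e in error_rates]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
a, b = exp(my - slope * mx), -slope
print(f"error_rate ≈ {a:.2f} * opportunity^(-{b:.2f})")
```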
Practical issues
All experiments
– Human subjects institutional review board (IRB)
Lab experiments
– Recruiting subjects over a whole semester; knowledge varies
– Attrition: students quit before they finish
Field (classroom) experiments
– Access to classrooms and teachers
– Instructors' enthusiasm, tech-savviness, agreement with the pedagogy
– Ethics of requiring use/non-use of the tutor in high-stakes classes
– Their tests vs. your tests
Web experiments
– Ensuring random assignment despite attrition
Outline
– Formative evaluations: How to improve the tutor?
– Summative evaluations: How well does the tutor work?
– Parametric evaluations: Why does the tutor work?  ← Next
Parametric evaluations: Why does the tutor work?
Hypothesize sources of benefit, such as
– Explication of a hidden problem-solving skill
– Novel reification (GUI)
  » E.g., showing goals on the screen
– Novel sequence of exercises/topics
  » E.g., story problems first, then equations
– Immediate feedback & help
Plan an experiment or a sequence of experiments
– Don't try to do all 2^N combinations in one study
– Vary only 1 or 2 factors
Two types of parametric experiments
Removing a putative benefit from the tutor
– Two conditions:
  1. Tutor
  2. Tutor minus a benefit (e.g., immediate feedback)
Adding a putative benefit to the control
– Three conditions:
  1. Control
  2. Control plus a benefit (e.g., explication of a hidden skill)
  3. Tutor
In vivo experimentation
High internal validity required
– Helps us understand human learning
– All but a few factors are controlled
– Summative evaluations of tutoring usually vary too many
Often done in the context of tutoring systems
– Parametric
– Offline, but the tutoring system serves as the pre/post test
Evaluations of tutoring systems (summary)
Formative evaluations: How to improve the tutor?
– Pilot test the user interface alone
– Wizard of Oz
– Hybrids
Summative evaluations: How well does the tutor work?
– 2 conditions: with and without the tutor
– Many supplementary analyses are possible
Parametric evaluations: Why does the tutor work?
– Compare different versions of the tutor
– Try putative benefits of the tutor with the control