Emulating Human Essay Scoring With Machine Learning Methods
Darrell Laham, Tom Landauer, Peter Foltz
Cognitive Systems: Human Cognitive Models in System Design
June 30, 2003
• Marcia Derr, Ph.D.
• Scott Dooley
• Terry Drissell
• Dave Farnham
• Peter Foltz, Ph.D.
• Shawn Frederickson
• Brent Halsey
• Pat Hilton-Suiter
• Darrell Laham, Ph.D.
• Tom Landauer, Ph.D.
• Karen Lochbaum, Ph.D.
• Dian Martin
• Jeff Nock
• Jim Parker
• Randy Sparks, Ph.D.
• Lynn Streeter, Ph.D.
Taxonomy of essay assessment
• Writing Assessment Types
  – Composition (Language Arts)
    • Does the writer write well?
  – Exposition (Content Areas, e.g. history)
    • Does the writer understand the topic?
• Levels of Assessment
  – 1. Holistic Scoring
  – 2. Trait and Componential Scoring
  – 3. Annotation
  – 4. Situated Value Judgments
• Which levels are open to automated scoring?
[Figure: Taxonomy of essay assessment. Levels of Assessment (Level 1 to Level 4) shown for Language Arts (composition) and Content Areas (exposition). Diagram labels: Holistic Score, Trait Scores, Analytics, Annotations, Situated Value Judgments, Knowledge, Local Errors, Truth Values.]
• Intelligent Essay Assessor™ technologies
• Latent Semantic Analysis for scoring quality of content and providing tutorial feedback
• Style & Mechanics measures for scoring and validation of essay as appropriate for task
• Student essays written to directed prompts
• Constructed-response alternative to multiple-choice for domain knowledge assessment
• Directed essay questions or summaries
• Reliable, objective, consistent and immediate
• Used as second reader, formative evaluations, diagnostic tutorials, interactive textbooks
[Figure: Architecture of scoring systems. Diagram labels: Expert Scored Essays; Customized Reader; CONTENT (variance, VL, Confidence); STYLE (Coherence); MECHANICS; VALIDATION and/or PLAGIARISM; Char Count; Misspelled Words; % Content, % Style, % Mechanics; Overall Score.]
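The diagram suggests that the Overall Score is a percentage-weighted blend of the Content, Style, and Mechanics components. A minimal sketch of that combination follows. The function name `overall_score` and the default weights are hypothetical and are not taken from the slides.

```python
def overall_score(content, style, mechanics, weights=(0.7, 0.2, 0.1)):
    """Blend three component scores into one overall score.

    The default weights are illustrative placeholders, not the values
    used by any real scoring system. They are normalized so that they
    act as percentages even when they do not sum to 1.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    w_content, w_style, w_mechanics = (w / total for w in weights)
    return w_content * content + w_style * style + w_mechanics * mechanics


# Example with made-up component scores on a 0-5 scale.
print(overall_score(4.0, 3.0, 2.0))
```

Normalizing the weights means a "customized reader" could be expressed simply as a different weight triple.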
Latent Semantic Analysis
• LSA is both a psychological theory of knowledge representation and a computational modeling and application tool
• LSA learns the relationships between text documents and their constituent words (terms) when trained on large numbers of background texts (thousands to millions)
• Each term, document, or new combination of terms (new document) is represented as a point in a high dimensional “Semantic Space” (300-500 dimensions, not 2 or 3)
• LSA effectively measures semantic content against prescribed standards of quality based on human judgments
• Extensive and varied research shows LSA judgments of similarity agree well with human judgments
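The bullets above can be made concrete with a toy example. This is only a sketch under strong assumptions: a five-document corpus invented for illustration, raw counts with no weighting, and 2 dimensions instead of 300-500. It uses NumPy's `numpy.linalg.svd` and is not the presenters' implementation.

```python
import numpy as np

terms = ["doctor", "physician", "surgeon", "patient", "surgery",
         "lawyer", "attorney", "court", "trial"]
# Invented background corpus: "doctor" and "physician" never co-occur.
docs = [
    "doctor patient surgery",
    "physician patient surgery",
    "surgeon patient surgery",
    "lawyer court trial",
    "attorney court trial",
]

# Term-by-document count matrix (rows = terms, columns = documents).
A = np.array([[doc.split().count(t) for doc in docs] for t in terms],
             dtype=float)

# Truncated SVD: keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

term_vecs = U_k * s_k   # one k-dimensional point per term
doc_vecs = V_k          # one k-dimensional point per document


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def fold_in(term_counts):
    """Project a new document (a vector of term counts) into the space."""
    return term_counts @ U_k / s_k


def term(name):
    return term_vecs[terms.index(name)]


print(cosine(term("doctor"), term("physician")))  # close to 1.0
print(cosine(term("doctor"), term("lawyer")))     # close to 0.0
```

Even in this tiny space, "doctor" and "physician" end up at the same point because they share contexts, although no document contains both words.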
Meaning Based Representation
LSA is NOT simple co-occurrence
Over 99% of word pairs whose similarities are induced never appear together in a context (paragraph)
Synonyms are rarely seen in the same context
LSA is NOT simple keyword matching
LSA operates on the deep level (latent) meaning of words rather than the surface characteristics (exact matches)
            doctor   physician   surgeon   lawyer   attorney
doctor      1
physician   0.61     1
surgeon     0.64     0.65        1
lawyer      0.06     0.06        0.13      1
attorney    0.03     0.05        0.09      0.73     1
(1) The doctor operates on the patient.
(2) The physician is in surgery.
(3) He is the car doctor.

       (1)    (2)    (3)
(1)    1
(2)    0.86   1
(3)    0.49   0.35   1
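[Figure: Latent Semantic Analysis. Essays drawn as vectors in a two-dimensional view of the semantic space (X Dimension, Y Dimension). An unscored essay (Essay Score “?”) is compared by angle (Angle 1, Angle 2, Angle 3) with scored essays labeled Essay Score “A”, Essay Score “A”, and Essay Score “C”.]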
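One way to read the figure is that an unscored essay receives a score from the scored essays whose vectors make the smallest angle with it. The sketch below assumes a similarity-weighted average over the k nearest scored essays. That rule, the name `predict_score`, and the numeric grades (A = 4.0, C = 2.0) are assumptions for illustration, not details given in the slides.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_score(essay_vec, scored_essays, k=2):
    """Similarity-weighted mean score of the k most similar scored essays.

    scored_essays is a list of (vector, human_score) pairs.
    Returns None when no neighbor has positive similarity.
    """
    ranked = sorted(scored_essays,
                    key=lambda item: cosine(essay_vec, item[0]),
                    reverse=True)[:k]
    weights = [max(cosine(essay_vec, vec), 0.0) for vec, _ in ranked]
    if sum(weights) == 0:
        return None
    return sum(w * score for w, (_, score) in zip(weights, ranked)) / sum(weights)


# Hypothetical 2-D essay vectors with made-up grades (A = 4.0, C = 2.0).
scored = [((1.0, 0.1), 4.0), ((1.0, 0.2), 4.0), ((0.1, 1.0), 2.0)]
print(predict_score((1.0, 0.15), scored, k=2))
```

With k=2 the unscored essay sits between the two "A" essays, so the "C" essay does not influence its score.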
What features of LSA are most important?
• It is a fully automated model of memory
• Training data of same magnitude as human experience
• It begins with first-order local associations between a stimulus and other temporally contiguous stimuli
• Represents concepts and contexts (episodes) in same way
• Conjointly learns about concepts from their natural contexts and contexts from their constituent concepts
• No explicit hand coding of rules or features
• Induction stage for generalization
• High dimensional vector mathematics offer neurologically plausible computations
• Not claimed to be a comprehensive model
What features of LSA are ad hoc?
• Based on performance in applications, not requirements of cognitive models…
• Singular Value Decomposition (SVD) as induction mechanism
  – Many other candidate algorithms have emerged
  – SVD can solve (750K × 10M matrix for 300 dimensions on 8-node Beowulf in 20-30 hours)
• Emphasis on easily parsable symbol systems, e.g. text
  – Text is relatively easy to work with compared to visual data
  – Now applied to other symbol systems, e.g. genetic codes
• Text pre-processing specifics
  – Local log, global entropy weighting
• Similarity metrics (Cosine, Euclidean Distance, etc.)
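The "local log, global entropy" pre-processing step can be sketched as follows. This assumes the commonly used log-entropy formulation, where the local weight is log(tf + 1) and the global weight is 1 + Σ p·log(p) / log(n) over the n documents. The slides do not give the exact formula, so treat the details as an assumption.

```python
import math


def log_entropy_weight(counts):
    """Apply log-entropy weighting to a term-by-document count matrix.

    counts[i][j] is the raw frequency of term i in document j.
    Assumes at least two documents and that every term occurs somewhere.
    """
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        total = sum(row)
        # Global weight: 1 + sum_j p_ij * log(p_ij) / log(n_docs).
        # It is 0 for a term spread evenly over all documents and
        # 1 for a term concentrated in a single document.
        entropy = 0.0
        for tf in row:
            if tf > 0:
                p = tf / total
                entropy += p * math.log(p)
        global_weight = 1.0 + entropy / math.log(n_docs)
        # Local weight: log(tf + 1) dampens high raw counts.
        weighted.append([math.log(tf + 1.0) * global_weight for tf in row])
    return weighted


# Made-up counts: term 0 appears once everywhere, term 1 in one document.
for row in log_entropy_weight([[1, 1, 1, 1], [3, 0, 0, 0]]):
    print(row)
```

A term that appears equally often in every document carries no discriminating information and is weighted down to zero, while a term confined to one document keeps its full local weight.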
Performance assessment of system: reliability coefficients

                          Standardized Tests (N = 2263)   Classroom Tests (N = 1033)
Reader 1 to Reader 2      0.86                            0.75
IEA to Single Readers     0.85                            0.73
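The slides do not say how the reliability coefficient is computed. Assuming it is a Pearson correlation between two sets of scores for the same essays (for example Reader 1 against Reader 2, or IEA against a single reader), it could be calculated as in this sketch. The score lists are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for six essays from a human reader and a system.
reader_scores = [4, 3, 5, 2, 4, 1]
system_scores = [4, 3, 4, 2, 5, 2]
print(round(pearson(reader_scores, system_scores), 2))
```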
Performance assessment of system: reliability coefficients

                        All Essays   Standardized   Classroom
Reader 1 to Reader 2    .83          .86            .75
IEA-Single Readers      .81          .85            .73
IEA-Resolved Score      .85          .88            .78
[Figure: Performance assessment of system. Two scatter plots of human grade (vertical axis) against IEA-Score (horizontal axis), one on a 0-7 grade scale and one on a 0-8 grade scale.]
Performance assessment of system: correlation with IEA scores

Undergrad TA    0.69
Graduate TA     0.78
Professor       0.80
Performance assessment of system: reliability coefficient by number of training essays in comparison set

Training essays   Reliability coefficient
6                 0.53
25                0.69
50                0.72
100               0.75
200               0.74
400               0.75
Performance assessment of system: reliability with resolved human score

Content Score       0.83
Style Score         0.68
Mechanics Score     0.66
IEA Total Score     0.85
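[Figure: Performance assessment of system. Stacked bars for All Essays, Standardized, and Classroom, with segments for Mechanics, Style, and Content (legend order). Segment values in source order: All Essays 0.75, 0.13, 0.11; Standardized 0.69, 0.20, 0.11; Classroom 0.79, 0.10, 0.11.]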
• Focus is on quality of content as judged by people rather than on measures of surface features & keywords
• Uses background knowledge of domain in assessment in addition to previously scored essays
• Measures what students are saying rather than just how well they are saying it
• Does best when linked to course student learning materials – provides formative assessment of domain knowledge with tutorial feedback rather than just a simple overall score
• Requires fewer training essays (100 vs. 500)
• More difficult to ‘coach’ student in ways to receive artificially high score (e.g. “use semi-colons” or say “Thus and Therefore”)
• Models do NOT use any count variables (Word count, etc.)
Performance assessment of system
MAINFRAMES
Mainframes are primarily referred to large computers with rapid, advanced processing capabilities that can execute and perform tasks equivalent to many Personal Computers (PCs) machines networked together. It is characterized with high quantity Random Access Memory (RAM), very large secondary storage devices, and high-speed processors to cater for the needs of the computers under its service.
Consisting of advanced components, mainframes have the capability of running multiple large applications required by many and most enterprises and organizations. This is one of its advantages. Mainframes are also suitable to cater for those applications (programs) or files that are of very high demand by its users (clients). Examples of such organizations and enterprises using mainframes are online shopping websites such as Ebay, Amazon, and computing-giant Microsoft.
MAINFRAMES
Mainframes usually are referred those computers with fast, advanced processing capabilities that could perform by itself tasks that may require a lot of Personal Computers (PC) Machines. Usually mainframes would have lots of RAMs, very large secondary storage devices, and very fast processors to cater for the needs of those computers under its service.
Due to the advanced components mainframes have, these computers have the capability of running multiple large applications required by most enterprises, which is one of its advantage. Mainframes are also suitable to cater for those applications or files that are of very large demand by its users (clients). Examples of these include the large online shopping websites -i.e. : Ebay, Amazon, Microsoft, etc.
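The two essays above share most of their wording. Assuming they illustrate the PLAGIARISM check named in the architecture diagram (the transcript does not say so), one simple way to flag such pairs is to compare every pair of essays and report those above a similarity threshold. The sketch below uses plain word-count cosine similarity rather than LSA. The function names and the 0.8 threshold are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercased word frequencies, ignoring punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def flag_similar_pairs(essays, threshold=0.8):
    """Return (i, j, similarity) for every essay pair at or above threshold."""
    flagged = []
    for i in range(len(essays)):
        for j in range(i + 1, len(essays)):
            sim = cosine_similarity(essays[i], essays[j])
            if sim >= threshold:
                flagged.append((i, j, sim))
    return flagged


# Made-up miniature essays; only the first two overlap heavily.
essays = [
    "Mainframes have fast processors",
    "Mainframes have very fast processors",
    "The court heard the trial",
]
print(flag_similar_pairs(essays))
```

Word-count overlap catches near-verbatim copies like the pair above. Comparing essays in an LSA space instead would also catch paraphrases that swap in synonyms.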
Performance assessment of system