
Evaluation of essays using incremental training for Maximizing Human-Machine agreement

M.Tech Project Report

Submitted by

Dhananjay Ambekar
Roll No: 123059008

under the guidance of

Prof. D. B. Phatak

Department of Computer Science and Engineering
Indian Institute of Technology, Bombay

June 2015


Contents

1 Introduction to Topic
  1.1 Introduction
  1.2 Types of assessments in MOOCs
    1.2.1 Self Assessment
    1.2.2 Peer Assessment
    1.2.3 AI Grading (or ML Grading)

2 AES and Performance Measurement
  2.1 Tools
    2.1.1 Project Essay Grade (PEG)
    2.1.2 Intelligent Essay Assessor (IEA)
    2.1.3 Educational Testing Service (ETS-I)
    2.1.4 Bayesian Essay Test Scoring system (BETSY)
  2.2 Estimating accuracy against Human grading
    2.2.1 Kappa Coefficient (Cohen's kappa)
    2.2.2 Precision
  2.3 Remark

3 Process of Evaluation
  3.1 Training data (Rubrics)
  3.2 Feature extraction
    3.2.1 Preprocessing of input data
    3.2.2 n-grams
    3.2.3 Synthetic training examples
    3.2.4 Feature vector
  3.3 Training Algorithms
    3.3.1 Bayesian Independence Classifiers
    3.3.2 K-nearest Neighbour
    3.3.3 Gradient Boosting
  3.4 Subsequent Work
    3.4.1 Objective
    3.4.2 Setup
  3.5 Conclusion

4 Approach & Implementation
  4.1 Proposed System
  4.2 Approach
  4.3 Introduction to framework
  4.4 Data Needed for Application Development
  4.5 Functionality
    4.5.1 Fetch the required data
    4.5.2 Operate on data for Training and Evaluation

5 Conclusion
  5.1 Observations
  5.2 Conclusion
  5.3 Future Work


List of Figures

1.1 Process of automatic Evaluation of essays
3.1 POS Tree representation
3.2 Features in Essay
4.1 Flow of Incremental Experimentation
4.2 Retrieving required Data
5.1 Grading pattern
5.2 Automated Evaluation


List of Tables

2.1 Inter grader comparison
3.1 POS tags (sample)
5.1 Evaluation Details by all Techniques
5.2 Grading Distribution Human-Edx AI
5.3 Grading Distribution Human-Application


Abstract

Massive Open On-line Courses (MOOCs) receive a large number of submissions, which makes automatic evaluation an appropriate choice for assessments. This project aims to maximize Human-Machine agreement in the automatic evaluation of textual summaries or essays in MOOCs, approaching near-human accuracy by incorporating missing features (in specific cases, such as geographies). It also involves a study of techniques for automatic evaluation of textual summaries using machine learning and natural language processing, and makes use of the varied data of MOOCs for incremental training. We propose a method for improving the training and evaluation process to achieve greater Human-Machine agreement.


Chapter 1

Introduction to Topic

1.1 Introduction

MOOCs, or Massive Open On-line Courses, based on on-line educational resources are proving to be one of the most effective ways to offer access to quality education, especially for those residing in remote or disadvantaged areas [1]. By offering a course, it becomes obligatory to provide grading and evaluation services. Previous-generation MOOCs had question types limited to numeric, true/false, and multiple-choice answers. Later came more advanced types, such as inputting circuits or mathematical equations, and more courses started using question types with textual-summary or subjective essay answers [2]. With this new generation of questions in MOOCs came the need for new grading and evaluation strategies.

The prominent areas of importance for this project and research survey are the following topics/fields:

• Massive Open On-line Courses (MOOCs)

• Grading essays in MOOCs

• Automated essay grading

• Correlation between Human-Machine grading

The remaining chapters attempt to give a detailed description of the understanding gained from the survey and experiments.

1.2 Types of assessments in MOOCs

Broadly, there are three types of assessment methods [3] in MOOCs, which are used to assess and grade student submissions and provide a score. Following is an introduction to each of these methods [2].

1.2.1 Self Assessment

Self-assessment is a process whereby students grade their own assignments or tests based on a teacher's guidelines. Rubrics are often used to perform self-assessment; they help keep the process systematic.


1.2.2 Peer Assessment

In peer assessment, a student grades the submissions of other students after completing his own. A calibration process may or may not be applied to normalize the given grades.

1.2.3 AI Grading (or ML Grading)

As MOOCs have a large number of submissions, it is almost impossible for an instructor to manually grade student submissions [4]. So an alternative to the earlier-mentioned methods was proposed, known as AI grading.

AI grading uses machine learning techniques, with the assistance of language processing tools, to grade student submissions, particularly of the essay or subjective-text type. The idea behind this technique is that a model is trained with the first few submissions assessed and graded by the instructor (for training purposes), and AI grading is then performed on the remaining, larger portion of the data.

Figure 1.1: Process of automatic Evaluation of essays

This literature survey focuses on the AI grading type and its peripherals. A detailed study of grading methods, techniques for the evaluation of essays, and current research in this area are also key objectives of this survey.


Chapter 2

AES and Performance Measurement

In this chapter, a summary of the study of different techniques is presented. There are a number of tools available for essay grading, which use several techniques.

Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays [5]. Essays are considered one of the best sources for instructors to understand a student's learning of the subject. Their subjective nature gives human graders the freedom to assess each essay differently; hence any two graders might not grade the same set of essays with similar grades. Researchers feel that this leaves room for unfairness in the process, and that consistency is not maintained when multiple graders grade sets of the same type of essays [6]. Automated grading could therefore prove effective in maintaining consistency, and, if trained well, these models can grade as accurately as human graders, with high correlation between Human and Machine.

Following are some of the tools currently used for automatic essay grading. In the later section, a brief description of measures of correlation is presented.

2.1 Tools

2.1.1 Project Essay Grade (PEG)

PEG is one of the earliest implementations of automated techniques [6]. In PEG, an essay is graded on the basis of writing quality, taking no account of lexical content; hence no natural language processing techniques are used (recent versions may include such features). The process of evaluation is based on the concept of proxes. Proxes are variables of interest within an essay, such as:

1. Count of prepositions

2. Complexity of special words

It applies regression to the training data (graded by instructors) and then, based on this, grades any number of essays.

2.1.2 Intelligent Essay Assessor (IEA)

IEA is another tool, which uses statistical techniques without any language processing. This tool also makes no use of the ordering of words in an essay. It represents the textual data using Latent Semantic Analysis (LSA) of documents and their word content


in a large two-dimensional matrix. Each word being analysed represents a row of the matrix, while each column represents a sentence or other subdivision of the essay in which the word is present. The cells of the matrix contain the frequencies of those words. Relationships between the matrix and the content are established, and scoring is performed.

2.1.3 Educational Testing Service (ETS-I)

ETS was developed by Burstein and Kaplan of the Educational Testing Service. It uses a combination of statistical and language processing techniques to extract linguistic features from the essays. The process includes analysis of:

• Discourse structure

• Syntactic structure

• Vocabulary in domain of Essay

The training data is processed by a natural language processing tool, and a lexicon is created, which is used to create grammar rules. This tool requires manual processing of the training data, which consumes time and money, but graders still consider ETS worthwhile for auto-grading large numbers of essays.

2.1.4 Bayesian Essay Test Scoring system (BETSY)

The Windows-based BETSY system was mainly used to group essays on a scale of a few predefined points (e.g. extensive, essential, partial, unsatisfactory) [6]. Bernoulli Models (BM) are used to classify the text. BETSY can be downloaded and used free of cost.

Researchers have been using accuracy, correlation, and agreement for estimating the relation between human and machine grading. The same measures are planned for all the experiments to be conducted in this project.


2.2 Estimating accuracy against Human grading

Scoring of essays can typically be done after classifying them into several categories (e.g. extraordinary, good, average, below average). Categorization can be automated using NLP and machine learning techniques; once this step is done, human and machine grading performance is compared using statistical measures. Following is a description of the measures that will be used for comparing the performances of various methods in this project.

2.2.1 Kappa Coefficient (Cohen’s kappa)

This coefficient is a statistical measure of agreement between raters for categorical data. It was used by Jill Burstein in experiments where an estimate of agreement between machine evaluations and human graders for text essays had to be provided [7]. It is essentially a measure of how well the classifier performed compared to how well it would have performed simply by chance. In other words, a model will have a high kappa score if there is a big difference between the accuracy and the null error rate.

    k = ( P(a) − P(e) ) / ( 1 − P(e) )

P(a) is the relative observed agreement among raters.
P(e) is the hypothetical probability of chance agreement.
If the raters are in complete agreement, then k = 1.

In total, 10 samples were examined. For experimental convenience, samples were graded either Good or Poor. The table below shows the observations obtained after performing the grading task. The value 6 in the first cell indicates that there are 6 samples to which both the algorithm and the Human gave the same Good score.

                               edX AI-ML
Total 10 samples               Good    Poor
Human assessment    Good       6       0
                    Poor       2       2

Table 2.1: Inter grader comparison

Now, if we calculate Cohen's kappa coefficient:
P(a), the observed agreement, is (6 + 2)/10 = 0.8.
The Human said Good to 6 samples and Poor to 4 samples. The machine said Good to 8 samples and Poor to 2 samples.

Therefore, the probability that both of them would say Good at random is 0.6 × 0.8 = 0.48, and the probability that both would say Poor is 0.4 × 0.2 = 0.08.

P(e), the probability of random agreement, is 0.48 + 0.08 = 0.56.
The inter-grader agreement can then be estimated as
k = (0.8 − 0.56)/(1 − 0.56) ≈ 0.55
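This computation can be cross-checked with scikit-learn's cohen_kappa_score. A minimal sketch, encoding the 10 gradings of Table 2.1 (the 1 = Good, 0 = Poor labelling is an illustrative assumption):

import numpy as np
from sklearn.metrics import cohen_kappa_score

# The 10 samples of Table 2.1: 6 agree on Good, 2 agree on Poor,
# and 2 were graded Good by the machine but Poor by the Human.
human   = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
machine = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

print(cohen_kappa_score(human, machine))  # approximately 0.545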


2.2.2 Precision

This measure is useful here for comparison, as we have only two cases: the human and machine grading either agree or they do not. In this study of essay evaluation, Precision [8] is defined as follows:

    P = (No. of cases of agreement) / (No. of evaluations by Human + No. of evaluations by Machine)

2.3 Remark

For this project, the measures mentioned earlier are considered sufficient to analyze the results. There are other measures, such as recall and F-measure, used to estimate agreement in text classification tasks, which will be used if needed.

For implementation purposes, this project uses Open edX as the base for the experimentation and techniques, which are described in later chapters.


Chapter 3

Process of Evaluation

The process of evaluation starts with a training set of essays, which is used to grade a large number of submissions of the same type of essay [5]. This data set is known as the training set.

The evaluation program extracts features from each essay in the training data. Features are characteristic properties of the data. In the case of an essay, example features can be the number of words, a specific set of words, or the ratio of uppercase to lowercase words. Many systems use words and their frequencies to classify essays into different classes/categories.

After feature extraction, an algorithm is trained with the calculated parameters to score new essays. The score/grade is based on the graded essays provided for training. These essays are provided with rubrics for evaluation (this discussion holds for Open edX [9]). Feature extraction and the various algorithms are described below. The feature extraction phase was studied with the NLTK (Natural Language Toolkit) and EASE libraries for essay evaluation [10].

3.1 Training data (Rubrics)

In educational terminology, a scoring rubric means "a standard of performance for a defined population" [11]. Rubrics are a way of communicating the expectations of a problem/assignment that students will attempt. Instructors design rubrics around the problem statement so that a unified criterion can be imposed during evaluation; this leads to fair evaluation and enforces similar evaluation rules on all submissions made by students.

In MOOCs (in general, and Open edX [9] in particular), evaluation rubrics are defined by the instructor at the time of creating a problem. They consist of different criteria that try to capture the submission made by a student and help evaluators assess and grade it. These rubrics are then used by peer evaluators, or even by the submitting student for self-evaluation.

For essay evaluation, rubrics such as 'Ideas' and 'Content' can be given, which may help users grade essays from different perspectives. In an actual situation, when instructors grade essays manually, they may look for more criteria, such as the number of words or the grammatical correctness of the essay. So, while designing automatic evaluation for textual content, one should consider all the required features, weighted if needed. In the automatic evaluation of essays or text, it is expected and required that ample training examples be provided to train in accordance with each criterion.


<!-- XML for RUBRIC definition -->
<vertical display_name="cow-cattle">
  <openassessment url_name="b86314eeb424">
    <!-- Relevant part of the file -->
    <rubric>
      <prompt>Write a small essay on Cow.</prompt>
      <criterion feedback="optional">
        <name>Ideas</name>
        <prompt>Determine if there is a unifying theme or main idea.</prompt>
        <!-- Following options are defined for the above criterion -->
        <option points="0">
          <name>Poor</name>
          <explanation>
            Difficult for the reader to discern the main idea. Too brief.
          </explanation>
        </option>
        <option points="1">
          <name>Good</name>
          <explanation>
            Presents a unifying theme and stays focused on the topic.
          </explanation>
        </option>
      </criterion>
    </rubric>
  </openassessment>
</vertical>

XML file sample for rubric definition (specific to Open edX)


3.2 Feature extraction

3.2.1 Preprocessing of input data

This step involves the removal of non-ASCII characters, followed by processing punctuation such as periods, commas, and other marks by inserting a space character between any two consecutive symbols. In this step the essay is also truncated to some word limit; for example, essays with more than 2500 words will be truncated to that limit.
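A minimal sketch of such a preprocessing step in Python; the regular expression and the 2500-word limit mirror the description above, and the exact rules used by EASE may differ:

import re

def preprocess(essay, max_words=2500):
    # Remove non-ASCII characters.
    essay = essay.encode("ascii", errors="ignore").decode("ascii")
    # Insert a space between any two consecutive punctuation symbols.
    essay = re.sub(r"([.,;:!?])(?=[.,;:!?])", r"\1 ", essay)
    # Truncate the essay to the word limit.
    return " ".join(essay.split()[:max_words])

print(preprocess("Hello,, world!!... This is a tiny essay."))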

3.2.2 n-grams

An n-gram is a contiguous sequence of n items from a given sequence of text or speech [12]. Before extracting n-grams, a POS-tagged tree is created. An example POS (part of speech) tagged tree is given below.

Sentence: "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29."

Tokens: [Pierre / Vinken / , / 61 / years / old / , / will / join / the / ...continued]

A tree is generated using predefined POS tags [13]. A list of all the Penn Treebank tags is available at [14]. Here are a few examples:

Tag    Description
NNP    noun, proper, singular
VB     verb, base form
JJ     adjective or numeral

Table 3.1: POS tags (sample)

Figure 3.1: POS Tree representation

The implementation of such trees is usually stored in dictionaries [10]. These tagged parts of speech are used to generate n-grams (usually 2-, 3-, or 4-grams).
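A minimal sketch of POS tagging and n-gram extraction with NLTK, assuming the 'punkt' tokenizer and 'averaged_perceptron_tagger' models have been downloaded via nltk.download():

import nltk
from nltk.util import ngrams

sentence = ("Pierre Vinken, 61 years old, will join the board "
            "as a nonexecutive director Nov. 29.")

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # e.g. [('Pierre', 'NNP'), ('Vinken', 'NNP'), ...]

# Generate 2-grams and 3-grams over the POS tags.
tags = [tag for _, tag in tagged]
bigrams = list(ngrams(tags, 2))
trigrams = list(ngrams(tags, 3))
print(bigrams[:3], trigrams[:2])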


3.2.3 Synthetic training examples

Many essay evaluation processes also take account of synonyms of the generated tokens, so a specified number of synonyms can be added to the feature list to accommodate different writing styles. Synonyms can be generated using a library such as NLTK (for example).
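For example, a small sketch of synonym generation using NLTK's WordNet interface, assuming the 'wordnet' corpus has been downloaded (the helper function is hypothetical, for illustration only):

from nltk.corpus import wordnet

def synonyms(word, limit=5):
    # Collect up to `limit` distinct synonyms of `word` from WordNet synsets.
    found = []
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name != word and name not in found:
                found.append(name)
                if len(found) == limit:
                    return found
    return found

print(synonyms("grade"))  # e.g. ['class', 'course', ...]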

3.2.4 Feature vector

The following feature details are maintained for each essay in the data set. These are the specific features used by the EASE algorithm (referenced through the Open edX docs).

1. Spelling error features:

• Average number of spelling errors per essay

• Average number of spelling errors per character (across all essays)

2. Grammar error features:

• Counts the number of n-grams that are “good” (in the good n-gram list).

• Identifies the positions of n-grams that are “bad” (not in the good n-gram list).

3. Bag of words:

• Counts tokens that were found in the essay set.

• Generates two bags (matrices): one for the original text tokens and the other for tokens in the spell-checked and stemmed text.

3.3 Training Algorithms

There has been major work on the classification of textual data according to requirements. A document is classified into multiple classes and categories based on a set of predefined characteristics. Here, in essay grading, these specific characteristics may refer to the features discussed in the earlier section. The following subsections discuss the methods used for training in grading.

3.3.1 Bayesian Independence Classifiers

In [15], Shermis gives a very precise description of Bayesian independence classifiers. These probabilistic classifiers estimate the probability of a document being in a particular category, given that certain features are present in the document. Bayes' theorem is used to derive the probability of the document belonging to each defined category.


A Bayesian classifier defines the log probability of an essay being in some category as follows:

    log P(C|D) = log P(C) + Σ_i f_i,  where

    f_i = log( P(A_i|C) / P(A_i) )       if D contains feature A_i
    f_i = log( P(¬A_i|C) / P(¬A_i) )     if D does not contain feature A_i

P(C) is the prior probability that a document belongs to class C. Class C can be a class/category such as Good or Average, or any class defined earlier during the training process.
P(A_i|C) is the conditional probability that the document has feature A_i, given that it belongs to class C.
P(A_i) is the prior probability of any document containing feature A_i.
P(¬A_i|C) is the conditional probability that the document does not have feature A_i, given that its class is C.
P(¬A_i) is the prior probability that a document does not have feature A_i.
Based on Lewis' model [16], this classifier assigns 0 or 1 to documents for the current feature under examination.

3.3.2 K-nearest Neighbour

Another classifier, discussed in [17], uses k-nearest-neighbour classification to classify essays and score them. In k-nearest-neighbour classification, the k essays in the training collection that are most similar to the test essay are selected. A score is given to the essay based on the weighted average of the scores of the k selected essays. Similarity is calculated by querying the test essay against the training collection and ranking the scores; the top k most similar essays are chosen for further processing, i.e., weighted averaging.
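A minimal sketch of this scheme with toy data; TF-IDF vectors stand in for the essay representation here, which is an assumption for illustration rather than the exact representation used in [17]:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy training collection with instructor-assigned scores.
train_essays = ["The cow is a useful domestic animal.",
                "Cows give nutritious milk to people.",
                "The cow is an animal.",
                "Photosynthesis occurs in green plants."]
train_scores = np.array([5.0, 4.0, 2.0, 1.0])

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_essays)

def knn_score(essay, k=3):
    # Rank training essays by cosine similarity to the test essay and
    # return the similarity-weighted average score of the top k.
    sims = cosine_similarity(vectorizer.transform([essay]), X_train).ravel()
    top = np.argsort(sims)[::-1][:k]
    return np.average(train_scores[top], weights=sims[top] + 1e-9)

print(knn_score("The cow gives milk and is a domestic animal."))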

Figure 3.2: Features in Essay


3.3.3 Gradient Boosting

Gradient boosting is a technique for regression problems which produces a prediction model in the form of a group of weak prediction models, typically decision trees [18]. It builds the model in a stage-wise fashion, like other boosting methods, and generalizes them by allowing optimization of an arbitrary differentiable loss function [19].

For this project (and in Open edX), the scikit [10] implementation of the Gradient Boosting classifier was used. The default parameters of concern are listed below; a minimal usage sketch follows the list.

• n_estimators (value = 100): the number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting, so a large number usually results in better performance.

• learn_rate (value = 0.5): the learning rate shrinks the contribution of each tree. There is a trade-off between the learning rate and n_estimators.

• max_depth (value = 4): the maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance.

• min_samples_leaf (value = 3): the minimum number of samples required to be at a leaf node.

• max_features (value = auto): the number of features to consider when looking for the best split.
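A minimal usage sketch with scikit-learn and toy data; note that recent scikit-learn releases name the shrinkage parameter learning_rate rather than learn_rate:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in for the extracted feature matrix: 40 essays x 6 features,
# with integer grades 0-3. In the real pipeline, X comes from the feature
# vectors of Section 3.2 and y from instructor-assigned grades.
rng = np.random.RandomState(0)
X = rng.rand(40, 6)
y = rng.randint(0, 4, size=40)

clf = GradientBoostingClassifier(
    n_estimators=100,    # number of boosting stages
    learning_rate=0.5,   # shrinks the contribution of each tree
    max_depth=4,         # depth of each individual regression estimator
    min_samples_leaf=3,  # minimum number of samples at a leaf node
)
clf.fit(X, y)
print(clf.predict(X[:5]))  # grade the first five essays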

3.4 Subsequent Work

Simple experiments were performed in order to understand and measure the performance of the MOOC's AI/Machine grading functionality. Details are as follows.

3.4.1 Objective

The objective was to understand the process of automated evaluation and its components, which include natural language processing, machine learning, and evaluation rubrics.

3.4.2 Setup

This evaluation was performed in an Open edX demo course with 10 students. The number of students is small because the main focus of the experiment was to understand the process and measure the correlation in different scenarios.

• Open edX developer stack instance was used

• Open edX uses NLTK and EASE for auto-grading text data; these internally apply natural language processing and training algorithms for text classification (namely Gradient Boosting)

• A work-flow is generated from example-based assessments, and training is performed


• At this point, a submission can be auto-graded using the classifier trained in the previous step

A sample example-based assessment is shown below. In this file (following the Open edX specification), two sample answers have already been graded against the criteria defined earlier by the instructor of the course. It is recommended to provide an optimal number of training samples on the topic, covering the possible grading-criteria combinations, to achieve a well-trained classifier.


<!-- XML for training examples -->
<assessment name="example-based-assessment" algorithm_id="ease">
  <example>
    <answer>
      The Cow is very useful animal.
      It is domestic animal.
      It is found all over the world.
      Cows milk is very nutritious.
    </answer>
    <select criterion="Ideas" option="Good"/>
    <select criterion="Content" option="Good"/>
  </example>
  <example>
    <answer>
      The Cow is an animal.
    </answer>
    <select criterion="Ideas" option="Poor"/>
    <select criterion="Content" option="Poor"/>
  </example>
</assessment>
<!-- enclosing <openassessment> and <vertical> elements omitted -->

XML file sample for providing training examples (specific to Open edX)


3.5 Conclusion

MOOCs have a large number of students from all over the world. This makes the student responses varied and diverse in content as well as in the discourse of the response [20]. It is a fundamental requirement of any course that the assessment procedure should explore student learning in order to give useful feedback that helps the student improve their understanding of the subject.

• It is observed (through experiments) that, due to students being in different geographical/cultural areas, the features contained in responses differ, adding diversity to them [21]. (For example, for Indian students, in an essay written on the Cow, "People worship the Cow." is an obvious feature.)

• These features might not be captured during training if samples are picked randomly, because relative to the large number of students the frequency of these features might be very low, and the evaluation rubrics may or may not capture that content detail.

• In the training phase, another increment to (or iteration of) the training itself can be added to either learn or add such features.


Chapter 4

Approach & Implementation

This chapter explains the implementation details of the system developed for the project. This system/application is used to perform grading in an attempt to improve Correlation-B, as shown in the following figure.

Figure 4.1: Flow of Incremental Experimentation

4.1 Proposed System

An automated evaluation system to operate on geographically divided student data, which formulates a process for training as well as evaluation and then compares performances using standard measures. This requires an in-depth understanding of the work-flow APIs in Open edX AI grading and of the process of automated grading in Open edX [9].


4.2 Approach

As the project development was done with the Open edX framework in mind, the required system was built as a Django application so that it could be used/incorporated in Open edX. An application in which automated grading of student essays can be performed was developed, using python-scikit for computational purposes.

4.3 Introduction to framework

The Django framework is an MVT (Model-View-Template) based Python framework used to develop complex, database-driven web applications [22]. The structure of a typical Django web application is as follows:

• Model - a Python module that is used to design and represent database entities, as well as synchronize them throughout the application. Each class represents a database table for the application.

• Views - a Django application can have multiple views containing several methods. These methods in turn serve user requests (HttpRequests) for content or perform other functions operating on data. Every user request has a corresponding method in the application.

• Templates - they serve as a user interface for the Models and Views designed in an application.

4.4 Data Needed for Application Development

This data may not be used directly from the database but through APIs or XML object files, which are used in the (existing) process of evaluation. For this project's purposes, data is accessed through Django models created in the application.

Data falls into two main components:

1. Student data, which contains the profile information of all students registered for the course.

Table: AUTH_USER
    id, username, country, date_of_birth

Table: AUTH_USER_PROFILE
    id, user_id, name, language, location, gender, city

Table: STUDENT_COURSEENROLMENT
    id, user_id, course_id, created, mode

Table: USER_ID_MAP
    hash_id, id, username

2. Course-ware data, which contains course content as well as assessment/submission details.

Table: ESSAY_COURSEWARE_DATA
    id, module_type, module_id, student_id, state, grade, course_id

The tables shown above are derived from part of the Open edX schema. Only a limited number of tables, with the attributes relevant to this system, are shown.
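As an illustration of the Django-model access mentioned above, a minimal sketch of how one such table might be declared; the field types are assumptions for illustration, not the actual Open edX schema definitions:

from django.db import models

class EssayCoursewareData(models.Model):
    # Hypothetical field types mirroring the ESSAY_COURSEWARE_DATA columns above.
    module_type = models.CharField(max_length=32)
    module_id = models.CharField(max_length=255)
    student_id = models.IntegerField()
    state = models.TextField()               # serialized submission state
    grade = models.IntegerField(null=True)   # may be absent before evaluation
    course_id = models.CharField(max_length=255)

    class Meta:
        db_table = "essay_courseware_data"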

4.5 Functionality

The application has two major modules, implemented in Python/Django. A description of the functionality of each module follows.

4.5.1 Fetch the required data

Student submissions consisting of textual data are fetched from the database according to the requirement. This module also divides the submissions according to student properties (for example, in this case we have divided the data according to students' geographical locations). It writes the retrieved submissions into specific directories so that they can be fetched by the evaluation module for computational purposes.

4.5.2 Operate on data for Training and Evaluation

This evaluation part was designed to work in a pipelined manner, as it involves multiple stages such as feature extraction, training, testing, and performance measurement. (A sketch of the vectorize-train-predict pipeline follows this list.)

• The text vectorizer converts a collection of text documents to a matrix of token counts. The fit_transform function learns the vocabulary dictionary and returns a term-document matrix, which is later used to train the classifier.

• The Python view then retrieves the parameters required for the specified classifier (class sklearn.naive_bayes.MultinomialNB), which are defined in a class, and starts operating on the data sets generated above.


Figure 4.2: Retrieving required Data

• After the automated evaluation is performed, the results and performance measurements are written to the output directories. Currently they are written to text files for experimentation; ideally, the results would be stored back into the database.
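A minimal sketch of this pipeline with toy data (the real data sets come from the fetch module above): CountVectorizer's fit_transform learns the vocabulary and produces the term-document matrix, which then trains the MultinomialNB classifier named above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training submissions with instructor grades.
essays = ["The cow is a useful domestic animal and gives nutritious milk.",
          "The cow is worshipped and is found all over the world.",
          "The cow is an animal.",
          "Cow."]
grades = ["Good", "Good", "Poor", "Poor"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(essays)  # learn vocabulary, build term-document matrix

clf = MultinomialNB()
clf.fit(X, grades)

# Grade a new submission.
new = vectorizer.transform(["The cow gives milk and is a domestic animal."])
print(clf.predict(new))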


Chapter 5

Conclusion

The outlook of the project was to put forward a process/system for automated evaluation that would give better, nearer-to-human feedback or grades than current automated evaluation systems do.

The implemented application was tested with a data set for which human grades were available for the samples [23]. Essays were assessed both through the automated evaluation available in Open edX [9] and through the application developed under this project.

5.1 Observations

Observations and findings are as follows. Number of essays: 100.

The following table shows the grades given by the Human grader, the Open edX-AI grader, and the Application grader. Each row in the table shows a grade against the number of students who obtained that grade.

Grade    Application    Open edX-AI    Human
6        6              4              8
5        27             28             18
4        29             25             43
3        26             25             13
2        8              12             14
1        4              6              4
Total    100            100            100

Table 5.1: Evaluation Details by all Techniques

Table 5.2 shows the grading distribution of the Human rater against the Open edX AI evaluation for the 100 essays, and Table 5.3 shows the grading distribution of the Human rater against the Application evaluation for the same set of essays.


                    Human grade
Open edX-AI    6    5    4    3    2    1    Total
6              4    0    0    0    0    0    4
5              4    18   6    0    0    0    28
4              0    0    25   0    0    0    25
3              0    0    10   13   2    0    25
2              0    0    1    0    11   0    12
1              0    0    1    0    1    4    6
Total          8    18   43   13   14   4    100

Table 5.2: Grading Distribution Human-Edx AI

                    Human grade
Application    6    5    4    3    2    1    Total
6              6    0    0    0    0    0    6
5              1    18   7    0    1    0    27
4              0    0    29   0    0    0    29
3              1    0    6    13   6    0    26
2              0    0    0    0    8    0    8
1              0    0    0    0    0    4    4
Total          8    18   42   13   15   4    100

Table 5.3: Grading Distribution Human-Application

Figure 5.1: Grading pattern


5.2 Conclusion

It can be observed from the graph and the tables that the calculated correlation between Human and Application grading is higher than that of Open edX after we perform the evaluation through the specially designed process discussed in this project, in which we attempt to add the missing features to the training (in this case, the missing geographically variant features). These localized inputs contribute specifically to the evaluation and play a part in giving more transparent feedback to students.

• Agreement between Human Grader and Edx-AI Evaluation

– Observed proportionate agreement : 0.75 ± 0.04

– Cohen’s Kappa Correlation : 0.682

• Agreement between Human Grader and Application Evaluation

– Observed proportionate agreement : 0.78 ± 0.04

– Cohen’s Kappa Correlation : 0.717

The Cohen's kappa values obtained above (in the range 0.6 ± 0.1) are fair to good [24], in support of our conclusion.
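These figures can be reproduced from the confusion matrices; a short numpy check for Table 5.2, using the same arithmetic as Section 2.2.1:

import numpy as np

# Table 5.2: rows are Open edX-AI grades 6..1, columns are Human grades 6..1.
cm = np.array([[4,  0,  0,  0,  0, 0],
               [4, 18,  6,  0,  0, 0],
               [0,  0, 25,  0,  0, 0],
               [0,  0, 10, 13,  2, 0],
               [0,  0,  1,  0, 11, 0],
               [0,  0,  1,  0,  1, 4]])

n = cm.sum()
p_a = np.trace(cm) / n                          # observed agreement: 0.75
p_e = cm.sum(axis=1) @ cm.sum(axis=0) / n**2    # chance agreement
kappa = (p_a - p_e) / (1 - p_e)
print(p_a, round(kappa, 3))                     # 0.75 0.682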

Figure 5.2: Automated Evaluation

Due to the large number of students enrolled from all over the world, automated grading of textual responses becomes one of the obvious choices for instructors. However, MOOCs' automated grading of textual responses may not suffice to give transparent feedback to geographically distributed students, and such feedback plays a major role in their understanding of the subject [20].

It will help improve the performance of the system if we manage to add such characteristics during evaluation.


5.3 Future Work

• This system can be integrated into Open edX [9], alongside the existing AI-evaluation work-flow, as a Django application. The evaluation performed can be used to give more transparent as well as more personal feedback to students.

• We can identify cases of students where the Human Grader-Automated Evaluation agreement is very low, or where there is outright disagreement, and may perform another iteration of evaluation for these students to understand their learning of the topic.

• Currently the system adds to the existing evaluation methods, but it can also be used to understand how students belonging to a particular category (e.g. geography or culture [25]) have learned and gained knowledge of the subject compared to their peers.


References

[1] Thanasis Daradoumis, Roxana Bassi, Fatos Xhafa, and Santi Caballe. A review on massive e-learning (MOOC) design, delivery and assessment. In P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2013 Eighth International Conference on, pages 208-213. IEEE, 2013.

[2] P. Mitros, Vikas Paruchuri, John Rogosic, and Diana Huang. An integrated framework for the grading of freeform responses, 2013.

[3] edX. edX ORA 2 documentation, 2014. [https://media.readthedocs.org/pdf/edx-ora-2/latest/edx-ora-2.pdf].

[4] Wikipedia. Massive open online course — Wikipedia, the free encyclopedia, 2014. [Online; accessed 30-August-2014].

[5] Wikipedia. Automated essay scoring — Wikipedia, the free encyclopedia, 2014. [Online; accessed 30-August-2014].

[6] Salvatore Valenti, Francesca Neri, and Alessandro Cucchiarelli. An overview of current research on automated essay grading. Journal of Information Technology Education: Research, 2(1):319-330, 2003.

[7] Yigal Attali and Jill Burstein. Automated essay scoring with e-rater® v.2. The Journal of Technology, Learning and Assessment, 4(3), 2006.

[8] Wikipedia. Accuracy and precision — Wikipedia, the free encyclopedia, 2014. [Online; accessed 9-October-2014].

[9] Open edX. Open source platform that powers edX courses, 2015.

[10] NLTK. Natural Language Toolkit, 2014. Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O'Reilly Media Inc.

[11] Wikipedia. Rubric (academic) — Wikipedia, the free encyclopedia, 2014. [Online; accessed 9-October-2014].

[12] Wikipedia. N-gram — Wikipedia, the free encyclopedia, 2014. [Online; accessed 2-September-2014].

[13] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, June 1993.

[14] POS tagging, 2014. [http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html].

[15] M.D. Shermis and J.C. Burstein. Automated Essay Scoring: A Cross-disciplinary Perspective. Taylor & Francis, 2002.

[16] David D. Lewis. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '92, pages 37-50, New York, NY, USA, 1992. ACM.

[17] Leah S. Larkey. Automatic essay grading using text categorization techniques. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 90-95. ACM, 1998.

[18] Wikipedia. Gradient boosting — Wikipedia, the free encyclopedia, 2014. [Online; accessed 4-September-2014].

[19] Jerome H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378, 2002.

[20] Amy L. Baylor and Donn Ritchie. What factors facilitate teacher skill, teacher morale, and perceived student learning in technology-using classrooms? Computers & Education, 39(4):395-414, 2002.

[21] S. Ghosh and S.S. Fatima. Design of an automated essay grading (AEG) system in Indian context. In TENCON 2008 - 2008 IEEE Region 10 Conference, pages 1-6, Nov 2008.

[22] Django Project. Django framework for web applications, 2014. [https://www.djangoproject.com/].

[23] The Hewlett Foundation: Automated essay scoring (data set), 2014. [https://www.kaggle.com/c/asap-aes/data].

[24] Wikipedia. Cohen's kappa — Wikipedia, the free encyclopedia, 2015. [Online; accessed 24-June-2015].

[25] Eva Krugly-Smolska. Cultural influences in science education. International Journal of Science Education, 17(1):45-58, 1995.