Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete...

58
Intelligent Tutoring System using interaction logs for MOOCs M.Tech Project Report Submitted by Nilesh Birari Roll No : 123059015 under the guidance of Prof. D. B. Phatak Department of Computer Science and Engineering Indian Institute of Technology, Bombay July, 2015

Transcript of Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete...

Page 1: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Intelligent Tutoring System usinginteraction logs for MOOCs

MTech Project Report

Submitted by

Nilesh BirariRoll No 123059015

under the guidance of

Prof D B Phatak

Department of Computer Science and EngineeringIndian Institute of Technology Bombay

July 2015

i

ii

Acknowledgement

I would like to thank my guide Prof Deepak B Phatak for his consistent directionsand guidance throughout the project Because of his encouragement and giving us rightdirections we are able to do this project work efficiently and correctly I would also liketo thank Mr Nagesh Karmali for his continuous support throughout project work

Nilesh BirariMTechDepartment(CSE)IIT Bombay

iii

Abstract

A massive open online course(MOOCs) is an online course aimed at unlimited partici-pation and open access via the web In a MOOC each studentrsquos complete interactionwith the course materials is recorded as a clickstream which produces vast amountof data understanding of student learning using this data enables more intelligentinteractive engaging and effective education The analysis of data using various datamining techniques give more insight into different student learning behaviour that canlead to take better decision for educators and researcher for future offering of the courseAnalysis of student learning behaviour can also be used for personalized guidance foreach group of student

ITS aims to provide personalised instruction for each student so the student model isthe key component of ITS which store the student information(student knowledge stateand behaviour analysis of student) Diagnosis of the student knowledge state is the mostdifficult process due to uncertain and imprecise information about the student To man-age uncertainty in data Bayesian network has the most sound mathematical foundationsand also a natural way of representing uncertainty using probabilities The proposedwork builds Bayesian student model for concept wise understanding of student state ofknowledge

iv

Contents

1 Introduction 111 MOOCs 112 Challenges in MOOCs 113 Motivation 214 Intelligent Tutoring System for MOOCs 3

2 Literature Review 521 Review of Educational Data Mining 522 Adaptive and Intelligent Technologies 7

221 Adaptive Hypermedia 8222 ITS technologies 9223 Existing Systems 9

3 The Open edX frontend architecture 1131 Units(Weeks) sequences panels and modules 1132 Naming schemes 1233 Events 12

4 Open edx research data 1341 Student Info and Progress Data 1342 Course Content Data 14

421 Shared Fields 14422 Course Data 15423 Course Building Block Data 16424 Course Component Data 17

43 Tracking module id in Course Structure 18

5 Events in the Tracking Logs 2251 Common Fields 2252 Student Events 23

521 Navigational Events 23522 Video Interaction Events 24523 Problem Interaction Events 24524 Discussion Forums Events 25

53 New Findings in tracking logs 25

6 Tools and Technologies 2761 Primary requirements for development 2762 Python Bayes Network Toolbox 28

v

63 Jayes - Bayesian Networks for Java 2864 moocdb 29

7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32

731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36

8 Student modeling using Bayesian Network 3881 Student Modeling 38

811 Information in Student model 38812 Uncertainty-Based User Modeling 39

82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41

83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44

9 Conclusion and Future work 46

vi

List of Figures

11 Work flow of MOOCs 212 Components of ITS 4

21 The cycle of applying data mining in educational systems 6

31 Courseware structure 11

61 Bayesian Network implementation steps [1] 28

81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45

vii

List of Tables

21 Existing Systems 10

71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37

viii

Chapter 1

Introduction

11 MOOCs

A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources

MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field

12 Challenges in MOOCs

Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes

The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet

1

Figure 11 Work flow of MOOCs

the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course

13 Motivation

In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning

In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design

2

future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education

Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]

14 Intelligent Tutoring System for MOOCs

Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need

For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses

Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand

Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]

bull Domain Model

bull Student Model

bull Tutoring Model

bull Interface Model

3

Figure 12 Components of ITS

Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system

Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system

Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill

Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface

4

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 2: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

i

ii

Acknowledgement

I would like to thank my guide Prof Deepak B Phatak for his consistent directionsand guidance throughout the project Because of his encouragement and giving us rightdirections we are able to do this project work efficiently and correctly I would also liketo thank Mr Nagesh Karmali for his continuous support throughout project work

Nilesh BirariMTechDepartment(CSE)IIT Bombay

iii

Abstract

A massive open online course(MOOCs) is an online course aimed at unlimited partici-pation and open access via the web In a MOOC each studentrsquos complete interactionwith the course materials is recorded as a clickstream which produces vast amountof data understanding of student learning using this data enables more intelligentinteractive engaging and effective education The analysis of data using various datamining techniques give more insight into different student learning behaviour that canlead to take better decision for educators and researcher for future offering of the courseAnalysis of student learning behaviour can also be used for personalized guidance foreach group of student

ITS aims to provide personalised instruction for each student so the student model isthe key component of ITS which store the student information(student knowledge stateand behaviour analysis of student) Diagnosis of the student knowledge state is the mostdifficult process due to uncertain and imprecise information about the student To man-age uncertainty in data Bayesian network has the most sound mathematical foundationsand also a natural way of representing uncertainty using probabilities The proposedwork builds Bayesian student model for concept wise understanding of student state ofknowledge

iv

Contents

1 Introduction 111 MOOCs 112 Challenges in MOOCs 113 Motivation 214 Intelligent Tutoring System for MOOCs 3

2 Literature Review 521 Review of Educational Data Mining 522 Adaptive and Intelligent Technologies 7

221 Adaptive Hypermedia 8222 ITS technologies 9223 Existing Systems 9

3 The Open edX frontend architecture 1131 Units(Weeks) sequences panels and modules 1132 Naming schemes 1233 Events 12

4 Open edx research data 1341 Student Info and Progress Data 1342 Course Content Data 14

421 Shared Fields 14422 Course Data 15423 Course Building Block Data 16424 Course Component Data 17

43 Tracking module id in Course Structure 18

5 Events in the Tracking Logs 2251 Common Fields 2252 Student Events 23

521 Navigational Events 23522 Video Interaction Events 24523 Problem Interaction Events 24524 Discussion Forums Events 25

53 New Findings in tracking logs 25

6 Tools and Technologies 2761 Primary requirements for development 2762 Python Bayes Network Toolbox 28

v

63 Jayes - Bayesian Networks for Java 2864 moocdb 29

7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32

731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36

8 Student modeling using Bayesian Network 3881 Student Modeling 38

811 Information in Student model 38812 Uncertainty-Based User Modeling 39

82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41

83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44

9 Conclusion and Future work 46

vi

List of Figures

11 Work flow of MOOCs 212 Components of ITS 4

21 The cycle of applying data mining in educational systems 6

31 Courseware structure 11

61 Bayesian Network implementation steps [1] 28

81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45

vii

List of Tables

21 Existing Systems 10

71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37

viii

Chapter 1

Introduction

11 MOOCs

A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources

MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field

12 Challenges in MOOCs

Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes

The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet

1

Figure 11 Work flow of MOOCs

the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course

13 Motivation

In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning

In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design

2

future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education

Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]

14 Intelligent Tutoring System for MOOCs

Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need

For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses

Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand

Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]

bull Domain Model

bull Student Model

bull Tutoring Model

bull Interface Model

3

Figure 12 Components of ITS

Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system

Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system

Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill

Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface

4

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 3: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

ii

Acknowledgement

I would like to thank my guide Prof Deepak B Phatak for his consistent directionsand guidance throughout the project Because of his encouragement and giving us rightdirections we are able to do this project work efficiently and correctly I would also liketo thank Mr Nagesh Karmali for his continuous support throughout project work

Nilesh BirariMTechDepartment(CSE)IIT Bombay

iii

Abstract

A massive open online course(MOOCs) is an online course aimed at unlimited partici-pation and open access via the web In a MOOC each studentrsquos complete interactionwith the course materials is recorded as a clickstream which produces vast amountof data understanding of student learning using this data enables more intelligentinteractive engaging and effective education The analysis of data using various datamining techniques give more insight into different student learning behaviour that canlead to take better decision for educators and researcher for future offering of the courseAnalysis of student learning behaviour can also be used for personalized guidance foreach group of student

ITS aims to provide personalised instruction for each student so the student model isthe key component of ITS which store the student information(student knowledge stateand behaviour analysis of student) Diagnosis of the student knowledge state is the mostdifficult process due to uncertain and imprecise information about the student To man-age uncertainty in data Bayesian network has the most sound mathematical foundationsand also a natural way of representing uncertainty using probabilities The proposedwork builds Bayesian student model for concept wise understanding of student state ofknowledge

iv

Contents

1 Introduction 111 MOOCs 112 Challenges in MOOCs 113 Motivation 214 Intelligent Tutoring System for MOOCs 3

2 Literature Review 521 Review of Educational Data Mining 522 Adaptive and Intelligent Technologies 7

221 Adaptive Hypermedia 8222 ITS technologies 9223 Existing Systems 9

3 The Open edX frontend architecture 1131 Units(Weeks) sequences panels and modules 1132 Naming schemes 1233 Events 12

4 Open edx research data 1341 Student Info and Progress Data 1342 Course Content Data 14

421 Shared Fields 14422 Course Data 15423 Course Building Block Data 16424 Course Component Data 17

43 Tracking module id in Course Structure 18

5 Events in the Tracking Logs 2251 Common Fields 2252 Student Events 23

521 Navigational Events 23522 Video Interaction Events 24523 Problem Interaction Events 24524 Discussion Forums Events 25

53 New Findings in tracking logs 25

6 Tools and Technologies 2761 Primary requirements for development 2762 Python Bayes Network Toolbox 28

v

63 Jayes - Bayesian Networks for Java 2864 moocdb 29

7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32

731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36

8 Student modeling using Bayesian Network 3881 Student Modeling 38

811 Information in Student model 38812 Uncertainty-Based User Modeling 39

82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41

83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44

9 Conclusion and Future work 46

vi

List of Figures

11 Work flow of MOOCs 212 Components of ITS 4

21 The cycle of applying data mining in educational systems 6

31 Courseware structure 11

61 Bayesian Network implementation steps [1] 28

81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45

vii

List of Tables

21 Existing Systems 10

71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37

viii

Chapter 1

Introduction

11 MOOCs

A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources

MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field

12 Challenges in MOOCs

Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes

The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet

1

Figure 11 Work flow of MOOCs

the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course

13 Motivation

In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning

In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design

2

future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education

Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]

14 Intelligent Tutoring System for MOOCs

Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need

For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses

Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand

Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]

bull Domain Model

bull Student Model

bull Tutoring Model

bull Interface Model

3

Figure 12 Components of ITS

Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system

Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system

Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill

Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface

4

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 4: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Acknowledgement

I would like to thank my guide Prof Deepak B Phatak for his consistent directionsand guidance throughout the project Because of his encouragement and giving us rightdirections we are able to do this project work efficiently and correctly I would also liketo thank Mr Nagesh Karmali for his continuous support throughout project work

Nilesh BirariMTechDepartment(CSE)IIT Bombay

iii

Abstract

A massive open online course(MOOCs) is an online course aimed at unlimited partici-pation and open access via the web In a MOOC each studentrsquos complete interactionwith the course materials is recorded as a clickstream which produces vast amountof data understanding of student learning using this data enables more intelligentinteractive engaging and effective education The analysis of data using various datamining techniques give more insight into different student learning behaviour that canlead to take better decision for educators and researcher for future offering of the courseAnalysis of student learning behaviour can also be used for personalized guidance foreach group of student

ITS aims to provide personalised instruction for each student so the student model isthe key component of ITS which store the student information(student knowledge stateand behaviour analysis of student) Diagnosis of the student knowledge state is the mostdifficult process due to uncertain and imprecise information about the student To man-age uncertainty in data Bayesian network has the most sound mathematical foundationsand also a natural way of representing uncertainty using probabilities The proposedwork builds Bayesian student model for concept wise understanding of student state ofknowledge

iv

Contents

1 Introduction 111 MOOCs 112 Challenges in MOOCs 113 Motivation 214 Intelligent Tutoring System for MOOCs 3

2 Literature Review 521 Review of Educational Data Mining 522 Adaptive and Intelligent Technologies 7

221 Adaptive Hypermedia 8222 ITS technologies 9223 Existing Systems 9

3 The Open edX frontend architecture 1131 Units(Weeks) sequences panels and modules 1132 Naming schemes 1233 Events 12

4 Open edx research data 1341 Student Info and Progress Data 1342 Course Content Data 14

421 Shared Fields 14422 Course Data 15423 Course Building Block Data 16424 Course Component Data 17

43 Tracking module id in Course Structure 18

5 Events in the Tracking Logs 2251 Common Fields 2252 Student Events 23

521 Navigational Events 23522 Video Interaction Events 24523 Problem Interaction Events 24524 Discussion Forums Events 25

53 New Findings in tracking logs 25

6 Tools and Technologies 2761 Primary requirements for development 2762 Python Bayes Network Toolbox 28

v

63 Jayes - Bayesian Networks for Java 2864 moocdb 29

7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32

731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36

8 Student modeling using Bayesian Network 3881 Student Modeling 38

811 Information in Student model 38812 Uncertainty-Based User Modeling 39

82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41

83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44

9 Conclusion and Future work 46

vi

List of Figures

11 Work flow of MOOCs 212 Components of ITS 4

21 The cycle of applying data mining in educational systems 6

31 Courseware structure 11

61 Bayesian Network implementation steps [1] 28

81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45

vii

List of Tables

21 Existing Systems 10

71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37

viii

Chapter 1

Introduction

11 MOOCs

A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources

MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field

12 Challenges in MOOCs

Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes

The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet

1

Figure 11 Work flow of MOOCs

the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course

13 Motivation

In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning

In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design

2

future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education

Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]

14 Intelligent Tutoring System for MOOCs

Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need

For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses

Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand

Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]

bull Domain Model

bull Student Model

bull Tutoring Model

bull Interface Model

3

Figure 12 Components of ITS

Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system

Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system

Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill

Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface

4

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 5: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Abstract

A massive open online course(MOOCs) is an online course aimed at unlimited partici-pation and open access via the web In a MOOC each studentrsquos complete interactionwith the course materials is recorded as a clickstream which produces vast amountof data understanding of student learning using this data enables more intelligentinteractive engaging and effective education The analysis of data using various datamining techniques give more insight into different student learning behaviour that canlead to take better decision for educators and researcher for future offering of the courseAnalysis of student learning behaviour can also be used for personalized guidance foreach group of student

ITS aims to provide personalised instruction for each student so the student model isthe key component of ITS which store the student information(student knowledge stateand behaviour analysis of student) Diagnosis of the student knowledge state is the mostdifficult process due to uncertain and imprecise information about the student To man-age uncertainty in data Bayesian network has the most sound mathematical foundationsand also a natural way of representing uncertainty using probabilities The proposedwork builds Bayesian student model for concept wise understanding of student state ofknowledge

iv

Contents

1 Introduction 111 MOOCs 112 Challenges in MOOCs 113 Motivation 214 Intelligent Tutoring System for MOOCs 3

2 Literature Review 521 Review of Educational Data Mining 522 Adaptive and Intelligent Technologies 7

221 Adaptive Hypermedia 8222 ITS technologies 9223 Existing Systems 9

3 The Open edX frontend architecture 1131 Units(Weeks) sequences panels and modules 1132 Naming schemes 1233 Events 12

4 Open edx research data 1341 Student Info and Progress Data 1342 Course Content Data 14

421 Shared Fields 14422 Course Data 15423 Course Building Block Data 16424 Course Component Data 17

43 Tracking module id in Course Structure 18

5 Events in the Tracking Logs 2251 Common Fields 2252 Student Events 23

521 Navigational Events 23522 Video Interaction Events 24523 Problem Interaction Events 24524 Discussion Forums Events 25

53 New Findings in tracking logs 25

6 Tools and Technologies 2761 Primary requirements for development 2762 Python Bayes Network Toolbox 28

v

63 Jayes - Bayesian Networks for Java 2864 moocdb 29

7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32

731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36

8 Student modeling using Bayesian Network 3881 Student Modeling 38

811 Information in Student model 38812 Uncertainty-Based User Modeling 39

82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41

83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44

9 Conclusion and Future work 46

vi

List of Figures

11 Work flow of MOOCs 212 Components of ITS 4

21 The cycle of applying data mining in educational systems 6

31 Courseware structure 11

61 Bayesian Network implementation steps [1] 28

81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45

vii

List of Tables

21 Existing Systems 10

71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37

viii

Chapter 1

Introduction

11 MOOCs

A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources

MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field

12 Challenges in MOOCs

Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes

The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet

1

Figure 11 Work flow of MOOCs

the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course

13 Motivation

In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning

In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design

2

future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education

Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]

14 Intelligent Tutoring System for MOOCs

Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need

For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses

Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand

Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]

bull Domain Model

bull Student Model

bull Tutoring Model

bull Interface Model

3

Figure 12 Components of ITS

Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system

Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system

Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill

Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface

4

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 6: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Contents

1 Introduction 111 MOOCs 112 Challenges in MOOCs 113 Motivation 214 Intelligent Tutoring System for MOOCs 3

2 Literature Review 521 Review of Educational Data Mining 522 Adaptive and Intelligent Technologies 7

221 Adaptive Hypermedia 8222 ITS technologies 9223 Existing Systems 9

3 The Open edX frontend architecture 1131 Units(Weeks) sequences panels and modules 1132 Naming schemes 1233 Events 12

4 Open edx research data 1341 Student Info and Progress Data 1342 Course Content Data 14

421 Shared Fields 14422 Course Data 15423 Course Building Block Data 16424 Course Component Data 17

43 Tracking module id in Course Structure 18

5 Events in the Tracking Logs 2251 Common Fields 2252 Student Events 23

521 Navigational Events 23522 Video Interaction Events 24523 Problem Interaction Events 24524 Discussion Forums Events 25

53 New Findings in tracking logs 25

6 Tools and Technologies 2761 Primary requirements for development 2762 Python Bayes Network Toolbox 28

v

63 Jayes - Bayesian Networks for Java 2864 moocdb 29

7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32

731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36

8 Student modeling using Bayesian Network 3881 Student Modeling 38

811 Information in Student model 38812 Uncertainty-Based User Modeling 39

82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41

83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44

9 Conclusion and Future work 46

vi

List of Figures

11 Work flow of MOOCs 212 Components of ITS 4

21 The cycle of applying data mining in educational systems 6

31 Courseware structure 11

61 Bayesian Network implementation steps [1] 28

81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45

vii

List of Tables

21 Existing Systems 10

71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37

viii

Chapter 1

Introduction

11 MOOCs

A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources

MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field

12 Challenges in MOOCs

Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes

The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet

1

Figure 11 Work flow of MOOCs

the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course

13 Motivation

In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning

In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design

2

future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education

Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]

14 Intelligent Tutoring System for MOOCs

Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need

For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses

Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand

Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]

bull Domain Model

bull Student Model

bull Tutoring Model

bull Interface Model

3

Figure 12 Components of ITS

Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system

Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system

Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill

Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface

4

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 7: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

63 Jayes - Bayesian Networks for Java 2864 moocdb 29

7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32

731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36

8 Student modeling using Bayesian Network 3881 Student Modeling 38

811 Information in Student model 38812 Uncertainty-Based User Modeling 39

82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41

83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44

9 Conclusion and Future work 46

vi

List of Figures

11 Work flow of MOOCs 212 Components of ITS 4

21 The cycle of applying data mining in educational systems 6

31 Courseware structure 11

61 Bayesian Network implementation steps [1] 28

81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45

vii

List of Tables

21 Existing Systems 10

71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37

viii

Chapter 1

Introduction

11 MOOCs

A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources

MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field

12 Challenges in MOOCs

Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes

The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet

1

Figure 11 Work flow of MOOCs

the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course

13 Motivation

In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning

In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design

2

future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education

Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]

14 Intelligent Tutoring System for MOOCs

Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need

For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses

Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand

Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]

bull Domain Model

bull Student Model

bull Tutoring Model

bull Interface Model

3

Figure 12 Components of ITS

Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system

Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system

Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill

Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface

4

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 8: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

List of Figures

11 Work flow of MOOCs 212 Components of ITS 4

21 The cycle of applying data mining in educational systems 6

31 Courseware structure 11

61 Bayesian Network implementation steps [1] 28

81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45

vii

List of Tables

21 Existing Systems 10

71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37

viii

Chapter 1

Introduction

11 MOOCs

A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources

MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field

12 Challenges in MOOCs

Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes

The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet

1

Figure 11 Work flow of MOOCs

the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course

13 Motivation

In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning

In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design

2

future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education

Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]

14 Intelligent Tutoring System for MOOCs

Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need

For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses

Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand

Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]

bull Domain Model

bull Student Model

bull Tutoring Model

bull Interface Model

3

Figure 12 Components of ITS

Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system

Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system

Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill

Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface

4

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 9: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

List of Tables

21 Existing Systems 10

71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37

viii

Chapter 1

Introduction

11 MOOCs

A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources

MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field

12 Challenges in MOOCs

Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes

The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet

1

Figure 11 Work flow of MOOCs

the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course

13 Motivation

In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning

In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design

2

future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education

Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]

14 Intelligent Tutoring System for MOOCs

Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need

For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses

Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand

Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]

bull Domain Model

bull Student Model

bull Tutoring Model

bull Interface Model

3

Figure 12 Components of ITS

Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system

Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system

Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill

Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface

4

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 10: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Chapter 1

Introduction

11 MOOCs

A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources

MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field

12 Challenges in MOOCs

Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes

The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet

1

Figure 11 Work flow of MOOCs

the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course

13 Motivation

In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning

In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design

2

future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education

Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]

14 Intelligent Tutoring System for MOOCs

Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need

For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses

Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand

Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]

bull Domain Model

bull Student Model

bull Tutoring Model

bull Interface Model

3

Figure 12 Components of ITS

Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system

Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system

Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill

Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface

4

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 11: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Figure 11 Work flow of MOOCs

the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course

13 Motivation

In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning

In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design

2

future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education

Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]

14 Intelligent Tutoring System for MOOCs

Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need

For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses

Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand

Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]

bull Domain Model

bull Student Model

bull Tutoring Model

bull Interface Model

3

Figure 12 Components of ITS

Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system

Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system

Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill

Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface

4

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 12: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education

Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]

14 Intelligent Tutoring System for MOOCs

Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need

For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses

Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand

Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]

bull Domain Model

bull Student Model

bull Tutoring Model

bull Interface Model

3

Figure 12 Components of ITS

Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system

Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system

Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill

Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface

4

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 13: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Figure 12 Components of ITS

Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system

Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system

Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill

Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface

4

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 14: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Chapter 2

Literature Review

Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system

21 Review of Educational Data Mining

WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo

Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems

In [10] the authors categorize work in educational data mining into

bull Statistics and visualization

bull Web mining

ndash Clustering classification and outlier detection

ndash Association rule mining and sequential pattern mining

ndash Text mining

5

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 15: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Figure 21 The cycle of applying data mining in educational systems[5]

In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)

A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined

bull Prediction

ndash Classification

ndash Regression

ndash Density estimation

bull Clustering

bull Relationship mining

ndash Association rule mining

ndash Correlation mining

ndash Sequential pattern mining

ndash Causal data mining

bull Distillation of data for human judgment

bull Discovery with models

6

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 16: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]

There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults

Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving

Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations

Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour

22 Adaptive and Intelligent Technologies

As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user

7

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 17: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

221 Adaptive Hypermedia

In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student

AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author

Adaptive Hypermedia performs following three functions

bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not

bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support

bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation

2211 Adaptive Presentation

[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be

bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)

bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model

bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values

8

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 18: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

2212 Adaptive Navigation Support

[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed

222 ITS technologies

In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS

2221 Curriculum sequencing

Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)

2222 Problem solving support

Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support

Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution

Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help

Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience

223 Existing Systems

In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]

Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that

9

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 19: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

System Description Technologies usedAHA open adaptive hypermedia

architecturePlatform foradaptive hypermedia

adaptive presentation adaptivenavigation support

ELM-ART Intelligent interactive sys-tem to learning programingin lisp

adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support

InterBook Authoring and Deliver-ing adaptive electronictextbooks

adaptive sequencing adaptivepresentation adaptive navigationsupport

KBS Hyperbook textbook in hypertext doc-ument

adaptive sequencing adaptivenavigation support

NetCoach Authoring system that meetthe needs to create adaptivelearning course

curriculum sequencing adaptiveannotation of links

Metalink Authoring tools for adap-tive hyperbook

page sequencing adaptive presen-tation

SIETTE ITS for classification andidentification of differentvegetable species

Question sequencing

SQL-Tutor support learning SQL Intelligent solution analysis

Table 21 Existing Systems

there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS

10

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 20: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Chapter 3

The Open edX frontend architecture

31 Units(Weeks) sequences panels and modules

The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]

Figure 31 Courseware structure

UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules

11

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 21: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

32 Naming schemes

In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme

bull URLs Examples of course URLs corresponding to each level of the course hierarchy

ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014

courseware5773a6ab6319440aa5889ae38e578402

lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit

ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x

2T2014courseware5773a6ab6319440aa5889ae38e578402

ba24809f572846458e1c78e09411863a

lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit

ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence

bull Module URIs

Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70

These modules are displayed one course panel and panel can contain multiplemodules

33 Events

In events log we can distinguish the event between interaction event or navigational event

bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with

bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page

12

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 22: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Chapter 4

Open edx research data

For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]

41 Student Info and Progress Data

edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]

bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment

ndash auth user

ndash auth userprofile

ndash student courseenrollment

ndash user api usercoursetag

ndash user id map

ndash student languageproficiency

For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course

bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data

13

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 23: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course

42 Course Content Data

For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file

bull Shared Fields

bull Course Data

bull Course Building Block Data

bull Course Component Data

421 Shared Fields

Category children metadataThe these fields are present for all of the objects in thecourse structure file

bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories

ndash chapter The sections defined in the course outline

ndash course The dates settings and other metadata defined for the course as awhole

ndash discussion The content-specific discussion components defined for the course

ndash html The HTML components defined for the course

ndash problem The problem components defined for the course

ndash sequential The subsections defined in the course outline

ndash vertical The units defined in the course outline

ndash video The video components defined for the course

bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains

bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules

14

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 24: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

422 Course Data

In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values

Example of the course object defined for CS1011x

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

tabs [

name Courseware

type courseware

name Course Info

type course_info

name Textbooks

15

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 25: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

type textbooks

name Discussion

type discussion

name Wiki

type wiki

name Progress

type progress

name CS1011x Course Team

type static_tab

url_slug 84b41401f15b4a83a44cda2eb8a689fd

]

423 Course Building Block Data

Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain

bull chapter(section) contains list of sequentials(subsections)

bull sequential(subsections) contains list of verticals(units)

bull vertical(unit) list all components that they contain

The metadata field contains information about set of parameters defined by courseteam for each of them

Sample from CS1011X course

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

16

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 26: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

424 Course Component Data

Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course

i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e

category html

children []

metadata

display_name Selection Sort

17

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 27: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124

category problem

children []

metadata

display_name Q13

markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]

max_attempts 1

weight 20

i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5

category discussion

children []

metadata

discussion_category Week 2

discussion_id 6fd74752fae64748bfee05ea4f9ae6ca

discussion_target Structure of a Simple C++ Program

43 Tracking module id in Course Structure

To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen

18

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 28: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy

As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file

i4xIITBombayXCS1011xcourse2T2014

category course

children [

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb

i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08

i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444

i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8

i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86

i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41

i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db

i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8

i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1

i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64

i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74

]

metadata

course_image IITBx-CS101x-Course-Bannerjpg

days_early_for_beta 40

discussion_topics

General

id i4x-IITBombayX-CS101_1x-course-2014_T1

display_name Introduction to Computer Programming Part 1

end 2014-09-20T063000Z

graceperiod

mobile_available true

start 2014-07-29T063000Z

Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for

19

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 29: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)

i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402

category chapter

children [

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755

i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10

i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3

i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a

i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce

i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5

i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7

]

metadata

display_name Week 1 Procedures programs and computers

start 2014-07-29T063000Z

Do same procedure to find details of sequence identifier

i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314

category sequential

children [

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71

i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1

]

metadata

display_name Session 1 Preamble

start 2014-07-29T063000Z

Now can find a unit(vertical) in sequence using number of vertical in sequence panel

i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b

category vertical

children [

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

]

metadata

display_name Preamble

20

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 30: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course

video module is here

i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80

category video

children []

metadata

display_name Preamble

download_track true

download_video true

edx_video_id IITCS1012014-V005000

html5_sources [

httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4

]

sub UaB1KqHWX-k

track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt

youtube_id_0_75

youtube_id_1_0 UaB1KqHWX-k

youtube_id_1_25

youtube_id_1_5

similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge

Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs

21

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 31: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Chapter 5

Events in the Tracking Logs

In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files

There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]

51 Common Fields

bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action

bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section

bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field

bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section

bull page Field Field contains the URL of the page the user was visiting when the eventemitted

bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field

22

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 32: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format

52 Student Events

This section lists the events that are logged for interaction of students with LMS

bull Enrollment Events

bull Navigational Events

bull Video Interaction Events

bull Textbook Interaction Events

bull Problem Interaction Events

bull Library Interaction Events

bull Discussion Forums Events

bull Open Response Assessment Events

bull Third-Party Content Events

bull Testing Events for Content Experiments

bull Student Cohort Events

bull Open Response Assessment Events (Deprecated)

The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields

Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work

521 Navigational Events

This section includes descriptions of page close seq goto seq next seq prev events

bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself

bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence

ndash seq goto is emitted when a user jumps between panel in a sequence

ndash seq next is emitted when a user navigates to the next panel in a sequence

ndash seq prev is emitted when a user navigates to the previous panel in a sequence

event field contains information about edX module id for sequence and panel numberfor new and old

23

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 33: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

522 Video Interaction Events

Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp

bull hide transcriptedxvideotranscripthidden

bull load videoedxvideoloaded

bull pause videoedxvideopaused

bull play videoedxvideoplayed

bull seek videoedxvideopositionchanged

bull show transcriptedxvideotranscriptshown

bull speed change video

bull stop videoedxvideostopped

Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields

To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not

523 Problem Interaction Events

Following are the different problem interaction event types

bull problem check (Browser)

bull problem check (Server)

bull problem check fail

bull problem reset

bull problem rescore

bull problem rescore fail

bull problem save

bull problem show

bull reset problem

24

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 34: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

bull reset problem fail

bull show answer

bull save problem fail

bull save problem success

bull problem graded

Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX

show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details

524 Discussion Forums Events

Contains following event types

bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event

bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event

bull edxforumsearchedWhen a user search to a thread the server emits an event

bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event

MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files

53 New Findings in tracking logs

edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc

1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of

25

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 35: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware

db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1

In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself

2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples

bull event_typecoursesIITBombayXCS1011x2T2014

discussionforum6be6272165ce4faaa22c4beed2ab8320threads

53f7430d2b8b56be8700008f

This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumi4x-IITBombayX-CS101_1x-course-2014_T1threads

53ea151e86d32909610013e3

This event indicates browsing in main discussion page on course page

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumfc937988e5094bd38c368e38510f57fbinline

Inline some discussion

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f759442b8b5605e50000b8reply

Reply to some thread

bull event_typecoursesIITBombayXCS1011x2T2014discussion

comments53f766ee2b8b561d050000a2update

Upvote to comment

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumsearch

search in discussion forum

bull event_typecoursesIITBombayXCS1011x2T2014discussion

threads53f6d42f2b8b56b08000163aflagAbuse

bull event_typecoursesIITBombayXCS1011x2T2014discussion

forumusers2220followed

26

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 36: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Chapter 6

Tools and Technologies

Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used

61 Primary requirements for development

1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features

2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network

3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system

4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are

bull Create copy drop rename tables columns and indexes

bull Browse and drop database tables views etc

bull Import the text files into tables

bull export data to various formats csv xml pdf etc

27

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 37: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

bull it will support the InnoDB tables and foreign keys

62 Python Bayes Network Toolbox

A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt

The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java

63 Jayes - Bayesian Networks for Java

Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow

Figure 61 Bayesian Network implementation steps [1]

for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes

Used this tool for implementing bayesian network for student module

28

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 38: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

64 moocdb

The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]

Advantages of moocdb

bull The benefits of standardization

bull Concise data storage

bull ccess Control Levels for Anonymized Data

bull Savings in effort

bull Crowd source potential

bull A unified description for external experts

bull Sharing and reproducing the results

The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]

1 Observing

2 Submitting

3 Collaborating

4 Feedback

Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course

For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project

29

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 39: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Chapter 7

Experiments With Data

71 Pre-Processing and Data Cleaning

Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed

Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc

Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file

Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week

Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on

30

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 40: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

event type of the data

72 Feature Extraction

Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows

1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data

2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence

3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used

4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before

5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime

6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence

7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question

8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question

9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week

10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware

31

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 41: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz

73 Clustering Experiments and results analysis

As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)

There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors

In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range

The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student

731 Experiment 1

Clustering on the basis of resource utilisation time spend with success in quizCluster Features

bull Total time spend in week for watching videos

bull Total time spent in courseware in week

bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time

bull Marks obtained in quiz after using all these resources in the week

Number of Clusters 10

32

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 42: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Cluster Video watchtime

Total timespend

N0 of PDFused

Marks N0 ofstu-dents

C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177

Table 71 Clustering Experiment 1 results

Result Analysis

Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks

With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use

732 Experiment 2

Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10

33

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 43: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

ClusterVideo watchtime

Total timespend

N0 of PDFused

Prior Marks N0ofstu-dents

C1 301564748e+02 286328058e+03 248201439e-01

575282631e-01

575899281e+00278

C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01

614705882e+0034

C3 246416667e+02 101316667e+03 136363636e-01

804473304e-02

490909091e+00132

C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01

662400000e+00125

C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01

223529412e+0051

C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01

382456140e+0057

C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01

645502646e+00189

C8 134079412e+02 147884706e+03 123529412e-01

875490196e-01

690000000e+00340

C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01

635789474e+0095

C10 149159763e+02 829520710e+02 130177515e-01

354888701e-01

152071006e+00169

Table 72 Clustering Experiment 2 results

34

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 44: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Cluster Marks in firstattempt

Marks in sec-ond attempt

Total timespend afterfirst attempt

No of visitsto forum afterfirst attempt

N0 ofstu-dents

C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7

Table 73 Clustering Experiment 3 results

Result Analysis

See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge

733 Experiment 3

Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull Total time spend after the first attempt of the question

bull forum visited to find help after first attempt of quiz

Number of Clusters 8

Result Analysis

See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used

35

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 45: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Cluster Marks in firstattempt

Marks in sec-ond attempt

Time betweentwo attempts

N0 of stu-dents

C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7

Table 74 Clustering Experiment 4 results

slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks

734 Experiment 4

Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features

bull marks in first attempt of the quiz

bull marks in second attempt of the quiz

bull time between two attempts

Number of Clusters 6

Result Analysis

See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only

735 Experiment 5

Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features

bull Time spend in courseware

bull Number of time visited to forum

bull marks in quiz

Number of Clusters 4

36

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 46: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Cluster Time spend incourseware

visits to fo-rum

marks in quiz N0 of stu-dents

C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104

Table 75 Clustering Experiment 5 results

Result Analysis

See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1

spend less time and rear visits to forum but achived good marks

37

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 47: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Chapter 8

Student modeling using BayesianNetwork

81 Student Modeling

The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]

811 Information in Student model

Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]

bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information

bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model

ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects

ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of

38

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 48: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]

Figure 81 Overlay model

Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10

ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know

812 Uncertainty-Based User Modeling

The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]

Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing

39

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 49: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]

82 Bayesian Diagnostic Algorithm for Student Mod-

eling

In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge

821 Bayesian Network

A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi

represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas

P (X1 Xn) =nprod

i=1

P (XiPa (Xi))

To define Bayesian Network need to specify following things

bull The set of variables X1 X2 Xn

bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle

bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n

Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations

In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have

40

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 50: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

taken) [15]

We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory

822 Bayesian Student Modeling

Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process

Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below

8221 Nodes and relationship among nodes

bull To dene Bayesian student model the nodes considered are

1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate

2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student

3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network

bull Relationship among those nodes are

1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy

41

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 51: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined

3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts

Figure 82 Bayesian Network model

The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes

bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items

bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated

8222 Parameter specification

Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification

1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc

2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in

42

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 52: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node

bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by

P(Tj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by

P(A(Ti = 1iisinS Ti = 0i isinS

))=

sumiisinS αisumSi=1 αi

Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack

3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by

P(Bj

(Cij = 1iisinS Cij = 0i isinS

))=

sumiisinS wijsumnj

i=1 wij

4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty

83 Experiments and Results

As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes

In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student

43

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 53: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam

831 Nodes and relations

For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows

bull Introduction

bull Representation of numbers and string

bull Types and names of variables

bull Assignment statement and arithmetic and logical execution

bull Iterative solutions

bull Functionsparameter passing and recursive functions

bull Arrays and matrices

bull Sorting

bull Searching

bull Character string

bull Pointer and pointer in function

832 Parameter specification

bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node

bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence

bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty

44

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 54: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions

0

200

400

600

800

1000

Iterative solutions

Functionsparameter passing and recursive functions

Arrays and matrices

Character stringPointer and pointer in function

Num

ber

of

students

Concepts

Analysis of students understanding of concepts in CS1011x

Marksgt9090gt=Marksgt7070gt=Marksgt50

Markslt=50

Figure 83 Analysis of student understanding for concept in CS1011x

Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept

Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept

45

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 55: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

Chapter 9

Conclusion and Future work

In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course

Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student

The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course

Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work

In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values

46

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 56: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

References

[1] An introduction to bayesian networks with jayes 2015

[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]

[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996

[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013

[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007

[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009

[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006

[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]

[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010

[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010

[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008

[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011

47

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 57: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008

[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009

[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006

[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009

[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999

[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996

[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003

[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999

[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011

[23] edx research guide 2015

[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013

[25] Moocdb shared data model standard for the data emanating from moocs 2015

[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007

[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008

[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008

[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000

48

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work
Page 58: Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete interaction with the course materials is recorded as a clickstream, which produces

[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010

[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002

[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002

[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002

49

  • Introduction
    • MOOCs
    • Challenges in MOOCs
    • Motivation
    • Intelligent Tutoring System for MOOCs
      • Literature Review
        • Review of Educational Data Mining
        • Adaptive and Intelligent Technologies
          • Adaptive Hypermedia
          • ITS technologies
          • Existing Systems
              • The Open edX frontend architecture
                • Units(Weeks) sequences panels and modules
                • Naming schemes
                • Events
                  • Open edx research data
                    • Student Info and Progress Data
                    • Course Content Data
                      • Shared Fields
                      • Course Data
                      • Course Building Block Data
                      • Course Component Data
                        • Tracking module_id in Course Structure
                          • Events in the Tracking Logs
                            • Common Fields
                            • Student Events
                              • Navigational Events
                              • Video Interaction Events
                              • Problem Interaction Events
                              • Discussion Forums Events
                                • New Findings in tracking logs
                                  • Tools and Technologies
                                    • Primary requirements for development
                                    • Python Bayes Network Toolbox
                                    • Jayes - Bayesian Networks for Java
                                    • moocdb
                                      • Experiments With Data
                                        • Pre-Processing and Data Cleaning
                                        • Feature Extraction
                                        • Clustering Experiments and results analysis
                                          • Experiment 1
                                          • Experiment 2
                                          • Experiment 3
                                          • Experiment 4
                                          • Experiment 5
                                              • Student modeling using Bayesian Network
                                                • Student Modeling
                                                  • Information in Student model
                                                  • Uncertainty-Based User Modeling
                                                    • Bayesian Diagnostic Algorithm for Student Modeling
                                                      • Bayesian Network
                                                      • Bayesian Student Modeling
                                                        • Experiments and Results
                                                          • Nodes and relations
                                                          • Parameter specification
                                                              • Conclusion and Future work